Planning / Issues / Ideas
- Are we going to use Moose directly, or would it be better to use Any::Moose? Moose itself has a pretty heavy start-up time penalty; any module that uses Moose is pretty much a no-go for, say, CGI programming. (Though fine for web programming where processes are persistent - e.g. mod_perl.) Using Any::Moose at least for the core stuff (RDF::Trine, RDF::Query) would keep the module start-up speed fine, but allow them to integrate into Moose apps well.
- Web::ID uses Any::Moose.
- RDF::Generator::Void tried to use Any::Moose, but it got too hard to support, with a lot of incompatibilities, which caused problems both for packaging and coding.
- The Any::Moose author will be there, we can discuss at the hackathon.
- We should migrate away from Error.pm, and we may then use a Moose-dependent try/catch error thingie, but which?
- Consider naming the new low level modules something other than "RDF::Trine". Maybe plain old "Trine". This would give us more freedom to change the API because the old versions would still exist for backwards compatibility.
Ideas for a parser API.
- RDF::Trine::Role::Parser - a role with default implementations of parse_into_model, etc.
- RDF::Trine::Parser::RDFXML, Turtle, etc. - each of these provides a 'parse' method and consumes the parser role. Have a default namespace for new parsers/serializers, from which they are auto-loaded by Module::Pluggable.
- RDF::Trine::ParserCollection - does not actually do any parsing, but is a place where parsers register themselves, and provides parser_by_media_type, parser_by_filename, etc methods.
[kasei] I'd like a way for parsers to indicate their capabilities. Examples are: whether the parser returns triples, quads, or something else (e.g. variable bindings from srx files); whether the parser provides any common functionality beyond the 'parse triples/quads' level (e.g. being able to parse a single RDF term); whether the parser can populate a NamespaceMap.
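The parser role and capability ideas above could be sketched roughly as follows. This is only an illustration, assuming Moose is installed and Perl 5.14+ for the package-block syntax. The `My::Trine::*` names, the `result_type` capability method, and the toy line-based format are all made up for the example; the "model" is just an array reference.

```perl
use strict;
use warnings;

# Hypothetical role: the names follow the notes but are not an existing API.
package My::Trine::Role::Parser {
    use Moose::Role;

    # Each concrete parser must supply parse($string, $handler).
    requires 'parse';

    # Capability indicator (assumed name), e.g. 'triples', 'quads' or 'bindings'.
    requires 'result_type';

    # Default implementation shared by every parser that consumes the role.
    sub parse_into_model {
        my ($self, $string, $model) = @_;
        $self->parse($string, sub { push @$model, $_[0] });
        return $model;
    }
}

# Toy parser: one whitespace-separated "s p o" triple per line.
package My::Trine::Parser::Lines {
    use Moose;

    sub result_type { 'triples' }

    sub parse {
        my ($self, $string, $handler) = @_;
        for my $line (split /\n/, $string) {
            my @terms = split ' ', $line;
            $handler->(\@terms) if @terms == 3;
        }
        return;
    }

    with 'My::Trine::Role::Parser';
}

package main;

my $parser = My::Trine::Parser::Lines->new;
my $model  = $parser->parse_into_model("a p b\nc q d\n", []);
printf "%d %s parsed\n", scalar @$model, $parser->result_type;
```

A caller can then probe `$parser->does('My::Trine::Role::Parser')` or `$parser->result_type` before choosing how to consume the output.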
The serializer API could be similar to the parser API.
[kasei] Similar desire for indicating capabilities as above for Parsers. Examples are: what type of thing can be serialized (triple/quad/variable bindings); whether the serializer can be used on individual RDF terms; whether a NamespaceMap can be used for serialization; whether the serialization format can/will use a base URI.
[kasei] Thoughts on what roles might exist for stores:
- triplestore - the store can only store triples, so the model code should make it look like a quadstore that only has a default graph. The store implements get_triples, and the model code should provide get_quads functionality.
- quadstore - the store can store quads, so the model code should back-fill get_triples functionality
- queryplanner - the store implements query_plan() functionality that can provide efficient, store-specific query plans for RDF::Query
[kasei] Need to figure out how to use subtypes to indicate that a store must implement either the triplestore OR the quadstore role.
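One possible way to express the "either triplestore or quadstore" requirement is a union of two role type constraints on the model's store attribute. The sketch below assumes Moose and Perl 5.14+; the `My::*` package names and the array-based toy store are hypothetical, and only `get_triples`/`get_quads` come from the notes.

```perl
use strict;
use warnings;

package My::Store::Role::TripleStore {
    use Moose::Role;
    requires 'get_triples';
}

package My::Store::Role::QuadStore {
    use Moose::Role;
    requires 'get_quads';
}

# Toy store that only knows about triples.
package My::Store::Memory {
    use Moose;

    has triples => (is => 'ro', isa => 'ArrayRef', default => sub { [] });

    sub get_triples { @{ $_[0]->triples } }

    with 'My::Store::Role::TripleStore';
}

package My::Model {
    use Moose;
    use Moose::Util::TypeConstraints;

    # Named role types, so they can be combined into a union below.
    role_type 'TripleStore', { role => 'My::Store::Role::TripleStore' };
    role_type 'QuadStore',   { role => 'My::Store::Role::QuadStore' };

    # The store must consume at least one of the two roles.
    has store => (is => 'ro', isa => 'TripleStore|QuadStore', required => 1);

    # Back-fill get_quads for triple-only stores: every triple is
    # reported in the default graph (represented here as undef).
    sub get_quads {
        my ($self) = @_;
        my $store = $self->store;
        return $store->get_quads if $store->does('My::Store::Role::QuadStore');
        return map { [ @$_, undef ] } $store->get_triples;
    }
}

package main;

my $store = My::Store::Memory->new(triples => [ [qw(a p b)] ]);
my $model = My::Model->new(store => $store);
my @quads = $model->get_quads;
printf "%d quad(s), %d terms each\n", scalar @quads, scalar @{ $quads[0] };
```

Passing an object that consumes neither role fails the type check when the model is constructed, which is one answer to the question above, though not necessarily the best one.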
[kasei] Should the current functionality in RDF::Trine::Model become a role that can be applied to all stores? This would make it simpler for higher-level APIs (such as RDF::Query) to probe the capabilities of the stores, but that could also be handled by Moose delegation.
[kjetilk] NamespaceMap could be a Trait. Any other good choices for traits?
[kasei] To represent algebra operations, I've currently got a huge (wide) class hierarchy including things like RDF::Query::Algebra::BasicGraphPattern, ::GroupGraphPattern, ::Optional, ::Minus, ::Triple, ::Union, etc. Algebra objects from these classes are composed by the query parser into an algebra expression tree (e.g. a simple query might be represented as Project( Filter( (?x = ?y), BasicGraphPattern( Triple( :a :p ?x ), Triple( :b :q ?y ) ) ) ) ). Each class constructor takes a specific set of arguments representing its child operations, but there's a lot of similarity. For example, almost all of the children are from a very small set of types (they are either other algebra operations, ::Node objects, ::Expression objects, or integers).
Is there a way to represent all of this in a more concise manner than one class per algebra operator? One of the most useful parts of this class hierarchy is the ability to serialize an algebra expression as a SPARQL string (or a more concise representation) which often requires operation-specific code. Is there an extensible way to do this without a full class definition/implementation per operation?
More radical Store and Algebra ideas
KjetilK's ideas are more radical about the RDF::Trine::Store layer and the RDF::Query::Algebra layer and how they can interact. Let's discuss which parts of this are feasible, but first, let's think about why we might want to move to Moose:
Goal for the Stores
I wouldn't be sure about moving to Moose if it were just about the warm and fuzzy feelings it provides when programming. We need high performance too; it has to be a key focus. So, what I hope to achieve is that Moose affords a flexibility that enables us to take advantage of performance enhancements and practices of underlying implementations in a way that will yield improved performance for RDF::Trine and RDF::Query-based solutions.
Feel free to modify any part of this text.
The trouble with APIs/full implementations
Jena/ARQ or Sesame SAIL seem to only provide (for read) the equivalent of get_statements (based on a quick look into the APIs), thus the query engine implementation will break down any SPARQL query to individual statements, and the query evaluation happens almost entirely in the query engine (e.g. ARQ), which is then far from the store. This makes it hard to take advantage of stuff like a very low selectivity of a certain FILTER, and thus makes the store work much harder than it should.
Then you have SPARQL implementations like Virtuoso or 4store, where the query engine is much closer to the store; supposedly they are inseparable. The trouble with them is that they don't afford the programmer as much control as the APIs. Obviously, the programmer can parse the SPARQL result and regain that control, but that comes at the cost of serializing and parsing the result. Moreover, we trust RDF::Query to be a full SPARQL 1.1 implementation; we do not necessarily trust other implementations.
Another problem is that if people want to create extensions, like stRDF/stSPARQL, they can't do that without deep hacks, well below the API.
Currently, we have three levels: get_statements (query engine breaks the query down to statements), get_pattern (query engine breaks down to basic graph patterns) and finally, passing the full query right through to the underlying store (query engine only does the serialize/parse roundtrip). Perhaps we could do more.
- Constraint programming: In their ESWC 2012 paper, le Clément et al. showed how they used constraint programming to greatly improve the performance of some FILTER queries. This seems like a prime example of stuff we should cater for, as it is fairly narrow in scope, i.e. it should be possible to implement it for the queries where it is relevant, and retain a more conventional relational algebra based engine for the rest. The authors themselves call for a hybrid approach.
- Fulltext index: It makes sense to store literals in a fulltext index, for example Xapian.
- Selectivity estimation for group graph patterns: It is quite hard to compute the selectivity of parts of a (recursive) group graph pattern; it is hard enough to do it for a basic graph pattern. However, it would be great for research (and future performance tuning) if we could allow implementations to do it.
- stRDF/stSPARQL: These are extensions to RDF and SPARQL, but they are basically a datatype hack, so a stRDF graph remains an RDF graph. One could use straight SPARQL to query the graph too, but without spatial extensions, the performance is terrible. Thus, some mechanism to extend the query engine in such restricted areas would be really nice.
Etc. These are just some examples, feel free to add more.
So, the basic idea here is that a store is a composition of roles.
E.g. the basic functionality that a Store must currently have is an implementation of what are now get_statements, remove_statement and add_statement. I figure these could form a Store::Role::Core. Then stuff like count_statements, get_pattern and remove_statements could go into e.g. Store::Role::Patterns (perhaps a bad name, but the idea is hopefully clear). Then we need something that makes it clear that any new Store must implement the methods of Store::Role::Core, and may implement some or all of the methods in Store::Role::Patterns if it can do that better than the default implementation, which lives in Store::Role::Patterns itself. Furthermore, if somebody has a Xapian fulltext index running, they could implement a Role that overrides a default Store::Role::Fulltext role (which on its own only stores literals and does only substring matching, without indexing).
In a conversation at ESWC with Kostis Kyzirakos and Manolis Karpathiotakis, who did the stSPARQL implementation in Sesame, they said they needed to hack deep to do it. Perhaps we can accommodate this on the API level by allowing the implementation of a Role to be sufficient to do it? If so, we would have a more flexible solution than the Java folks.
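A minimal sketch of the Core/Patterns split, assuming Moose and Perl 5.14+. The `My::Store::*` names and the array-based storage are hypothetical; only the method names come from the notes. The Patterns role supplies a slow default count_statements, and a second store wraps it with a cheaper answer for one case. How the real overriding should work (class-local methods, method modifiers, or replacement roles) is still an open design question.

```perl
use strict;
use warnings;

package My::Store::Role::Core {
    use Moose::Role;
    # Every store must supply these itself.
    requires qw(get_statements add_statement remove_statement);
}

package My::Store::Role::Patterns {
    use Moose::Role;
    requires 'get_statements';

    # Default: count by fetching every match. Correct but slow.
    sub count_statements {
        my ($self, @pattern) = @_;
        my @matches = $self->get_statements(@pattern);
        return scalar @matches;
    }
}

package My::Store::Memory {
    use Moose;

    has statements => (is => 'ro', isa => 'ArrayRef', default => sub { [] });

    sub add_statement {
        my ($self, $st) = @_;
        push @{ $self->statements }, $st;
        return;
    }

    sub remove_statement {
        my ($self, $st) = @_;
        @{ $self->statements } = grep { "@$_" ne "@$st" } @{ $self->statements };
        return;
    }

    # A pattern is (s, p, o); undef acts as a wildcard.
    sub get_statements {
        my ($self, @pattern) = @_;
        return grep {
            my $st = $_;
            my @mismatch = grep { defined $pattern[$_] && $pattern[$_] ne $st->[$_] } 0 .. 2;
            !@mismatch;
        } @{ $self->statements };
    }

    with 'My::Store::Role::Core', 'My::Store::Role::Patterns';
}

# A store that can answer the all-wildcard count without scanning.
package My::Store::Tallied {
    use Moose;
    extends 'My::Store::Memory';

    around count_statements => sub {
        my ($orig, $self, @pattern) = @_;
        return scalar @{ $self->statements } unless grep { defined $_ } @pattern;
        return $self->$orig(@pattern);    # fall back to the role default
    };
}

package main;

my $store = My::Store::Tallied->new;
$store->add_statement([qw(a p b)]);
$store->add_statement([qw(a q c)]);
printf "%d total, %d with predicate p\n",
    $store->count_statements, $store->count_statements(undef, 'p');
```

If a store omitted one of the three core methods, composing My::Store::Role::Core would fail at class-definition time, which gives the "must implement" guarantee asked for above.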
The constraint programming stuff looks very interesting, and for stores that implement FILTERs that way, it would be nice if we could override the default behaviour of evaluating the FILTERs in the query engine, and rather do it in the store. Perhaps again, a Role that overrides the default implementation.
Then, we have stuff like the interaction of basic graph patterns and filters.
What we end up with is that a store is a class, but that class just looks like a composition of Roles, e.g. a basic Memory store would do just:
    package RDF::Trine::Store::Memory;
    use Moose;
    with 'RDF::Trine::Store::Role::Defaults';
    with 'RDF::Trine::Store::Role::Core::Memory';
    1;
where RDF::Trine::Store::Role::Defaults contains the default implementation of things, and RDF::Trine::Store::Role::Core::Memory contains the implementation of get_statements, remove_statement and add_statement that is specific to the Memory store.
In the case of an advanced store, which uses a lot of different stuff, you'd have
    package RDF::Trine::Store::MyAdvanced;
    use Moose;
    with 'RDF::Trine::Store::Role::Defaults';
    with 'RDF::Trine::Store::Role::Core::MyAdvanced';
    with 'RDF::Trine::Store::Role::Patterns::MyAdvanced';
    with 'RDF::Trine::Store::Role::Fulltext::Xapian';
    with 'RDF::Trine::Store::Role::Extension::StRDF';
    1;
This is the class that will finally be instantiated. Basically, this would enable users, i.e. programmers, to compose a store that fits their purpose better, and so take better advantage of different performance characteristics.
At the end of the day, we might end up with a Role hierarchy in Trine that resembles the current Algebra hierarchy in Query. Whether this is evil, I don't know, but I think we can figure out a nice design.
In the end, I hope this will both create an API that will remove many of the disadvantages of today, will give us a flexibility to enable new ways of creating triple stores, and give us the warm, fuzzy feelings of Moose but with high performance since we exploit features of underlying implementations.
Very loose threads
- How about implementing different query plan algorithms? Could the same Role composition idea be applied?
- Etag support on materialized iterators (for improved query caching)?
- RDF::ACL support in an RDF::Trine::Model
- Make query rewriting easy.