
Introduction to the Semantic Web for Perl Hackers

With the Moving to Moose hackathon coming up, we'll try to get everyone on the same page, as we would be happy to get input from the broader community.

History

The Semantic Web arose when three communities collided in the late 1990s: the Web community (most prominently timbl, who invented the Web), the library community (who were looking for a way to annotate anything), and the AI community. They realised that they had something in common, and that they should work together.

What is RDF?

RDF is a simple data model:

  • It has resources (in terms of programming, think "objects") and literals (think "non-reference scalars").
  • Resources may be identified with URIs.
    • Or strictly, with IRIs which are a generalised version of URIs allowing almost any Unicode character.
  • Literals may be qualified with a data type, or a language identifier (or neither).
    • Data types are resources, and are always identified with a URI.
    • Language identifiers are language tags based on ISO 639 codes ("en" for English, etc.).
  • Resources can be described using {property, value} pairs.
    • The property is a resource, and is always identified with a URI.
    • The value may be a literal, or another resource.
    • Properties are repeatable. (Not like a Perl hash where each key has just one value.)
    • The combination of a {property, value} pair, plus the subject resource being described, is often referred to as a "triple" or "statement". They are typically represented in {subject, property, value} order.
  • RDF data sets are sets of triples.
    • If you know about graph theory, then an RDF data set can also be thought of as a graph where the subjects and values are "nodes", and the properties are "edges".
      • As a result, a data set is usually called a "graph".

RDF has just enough complexity to be able to describe pretty much anything you like. (Like a three-legged chair; any fewer and it would fall over; any more would just be sugar.) Its pervasive use of URIs as identifiers allows RDF in different files, in different databases, and even RDF published by different organisations to refer to the same resources, forming a single global database.
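
To make the shape of the model concrete, here is a rough plain-Perl rendering of a tiny graph. The example.org subjects are made up for illustration (the foaf properties are real); note that the same property appears twice for the same subject:

use strict;
use warnings;

# A graph is just a set of {subject, property, value} triples.
my @graph = (
    [ 'http://example.org/alice', 'http://xmlns.com/foaf/0.1/name',  'Alice' ],
    [ 'http://example.org/alice', 'http://xmlns.com/foaf/0.1/knows', 'http://example.org/bob' ],
    # properties are repeatable: a second "knows" for the same subject
    [ 'http://example.org/alice', 'http://xmlns.com/foaf/0.1/knows', 'http://example.org/carol' ],
);

A real RDF library distinguishes literals from resources rather than treating everything as a string; RDF-Trine (see "Our Stuff" below) has a node class for each kind.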

Working with RDF

  • An RDF graph can be serialised in several different formats.
    • Key formats include:
      • RDF/XML - the oldest and most widely supported format, but generally considered hard to read and to write by hand.
      • Turtle - a more readable text-based format that has become popular. (Both formats appear in the sketch after this list.)
  • A query language called SPARQL lets you query an RDF graph much as you might query an SQL database (for an example, see "Our Stuff" below).
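
As a small sketch of moving between those formats with RDF-Trine (introduced under "Our Stuff" below), this reads a one-triple Turtle document into an in-memory model and writes it back out as RDF/XML; the base URI is made up:

use strict;
use warnings;
use RDF::Trine;

# Parse a one-triple Turtle document into a temporary in-memory model.
my $model  = RDF::Trine::Model->temporary_model;
my $turtle = '<http://dbpedia.org/resource/Perl> '
           . '<http://www.w3.org/2000/01/rdf-schema#label> "Perl"@en .';
RDF::Trine::Parser->new('turtle')
    ->parse_into_model('http://example.org/', $turtle, $model);

# Serialize the same graph back out, this time as RDF/XML.
print RDF::Trine::Serializer->new('rdfxml')
    ->serialize_model_to_string($model);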

Common Gotchas

  • Resources may be identified with URIs. You are probably already familiar with URIs identifying digital resources on the Web (documents, images, etc). RDF expands this, allowing URIs to identify things which are not digital resources at all - people, places, car parts, or abstract notions like "love".
  • While resources may be identified with URIs, they are not always. Resources without a URI are referred to as "blank nodes". There is nothing special about blank nodes, but resources without an identifier can be harder to work with in some cases.
  • Resources may be identified with multiple URIs:
    • If you see the same URI mentioned in two places, both mentions identify the same resource (URIs are a global namespace).
    • But if you see two different URIs, you cannot usually assume they refer to different resources.
    • Similarly, a blank node (see above) may correspond to a resource that also has a URI.
  • An RDF graph is a set of statements which are assumed to be true. You should not generally assume that if a statement is absent, the statement must be false (RDF takes an "open-world" view of data).

An Example Triple

Here is an example RDF triple, in Turtle:

<http://dbpedia.org/resource/Perl> <http://www.w3.org/2000/01/rdf-schema#label> "Perl"@en .
  • The subject is a resource (the abstract concept of the Perl programming language), and is identified using a URI.
  • The property is a resource (the abstract concept of something having a label), also identified using a URI.
  • The value is a literal, the string "Perl" in English.
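
With RDF-Trine (see "Our Stuff" below), the same triple can be built from node objects and added to a graph; a minimal sketch:

use strict;
use warnings;
use RDF::Trine qw(iri literal statement);

my $triple = statement(
    iri('http://dbpedia.org/resource/Perl'),            # subject (a resource)
    iri('http://www.w3.org/2000/01/rdf-schema#label'),  # property (a resource)
    literal('Perl', 'en'),                              # value (the literal "Perl"@en)
);

my $model = RDF::Trine::Model->temporary_model;
$model->add_statement($triple);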

Our Stuff

The key parts of the Perl RDF stack are:

  • RDF-Trine provides an object-oriented interface for working with resources, literals, statements and graphs. It provides serializers to write graphs to files in various formats, and parsers to read them in again. It provides an API for RDF data stores, with implementations of that API allowing RDF data to be stored in memory or in an SQL database.
    • Various modules with additional parsers, serializers, stores, etc for RDF-Trine are available on CPAN.
  • RDF-Query allows RDF data stores to be queried using SPARQL (see the sketch after this list).
    • kasei (its author) is deeply involved in the W3C SPARQL Working Group, and RDF-Query is one of the SPARQL 1.1 reference implementations.
  • Attean is set to replace both of the above with a new Moo-based framework.
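
Here is a minimal sketch of the two modules working together, reusing the example triple from earlier: load a graph with RDF-Trine, then run a SPARQL SELECT over it with RDF-Query.

use strict;
use warnings;
use RDF::Trine;
use RDF::Query;

my $model = RDF::Trine::Model->temporary_model;
RDF::Trine::Parser->new('turtle')->parse_into_model(
    'http://example.org/',
    '<http://dbpedia.org/resource/Perl> <http://www.w3.org/2000/01/rdf-schema#label> "Perl"@en .',
    $model,
);

# Ask for the label, much as you would SELECT from an SQL database.
my $query = RDF::Query->new(<<'END');
SELECT ?label WHERE {
    <http://dbpedia.org/resource/Perl>
        <http://www.w3.org/2000/01/rdf-schema#label> ?label
}
END

my $iterator = $query->execute($model);
while (my $row = $iterator->next) {
    print $row->{label}->as_string, "\n";   # prints "Perl"@en
}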

The query engine parses the query into a set of Algebra objects; those objects are fed into a query planner, which produces Plan objects; and each Plan object is then executed on top of an RDF::Trine::Model. The model is backed by RDF::Trine's storage classes, which implement RDBMS storage, memory storage (in different flavours), a Redis store, and so on.

For read and write operations, these stores actually have to implement only three methods: get_statements (which matches statements, with undef as a wildcard for any of the components), remove_statement (which removes a single statement) and add_statement (which, well, you get the idea). In addition, you might want to come up with better implementations of count_statements and size.

So, in conclusion, to support the full SPARQL query language a store needs just get_statements, remove_statement and add_statement, plus a little glue around the constructor. RDF::Query will do the rest. A sketch of such a minimal store follows.
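
As a rough sketch of what such a minimal store might look like, keeping its statements in a plain Perl array (the class name is made up, and for simplicity it ignores contexts/named graphs and variable-node wildcards):

package My::ArrayStore;
use strict;
use warnings;
use parent 'RDF::Trine::Store';
use RDF::Trine::Iterator::Graph;

sub new {
    my $class = shift;
    return bless { statements => [] }, $class;
}

sub add_statement {
    my ($self, $st) = @_;
    push @{ $self->{statements} }, $st;
    return;
}

sub remove_statement {
    my ($self, $st) = @_;
    @{ $self->{statements} } =
        grep { not $_->equal($st) } @{ $self->{statements} };
    return;
}

# Match (subject, predicate, object) patterns; undef acts as a wildcard.
sub get_statements {
    my ($self, @pattern) = @_;
    my @matches = grep {
        my @nodes = $_->nodes;
        my $ok = 1;
        for my $i (0 .. 2) {
            $ok = 0 if defined $pattern[$i]
                   and not $pattern[$i]->equal($nodes[$i]);
        }
        $ok;
    } @{ $self->{statements} };
    return RDF::Trine::Iterator::Graph->new(\@matches);
}

sub size { return scalar @{ $_[0]->{statements} } }

1;

Wrapped in a model with RDF::Trine::Model->new(My::ArrayStore->new), that should be enough for RDF::Query to run SPARQL over; a real store would also want the faster count_statements mentioned above.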

One key problem here is that if the underlying store has great optimizations of its own, you cannot necessarily exploit them without sacrificing some of the quality we get from a reference implementation. Today you can also implement a get_pattern method that can exploit some optimizations really well, but other than that, the only alternative is to pass a query all the way down to the underlying store unmodified, let it parse and execute it, and then parse the result you get back. kasei and KjetilK wrote a paper about that, in which Attean was put forward as a solution.

Hackathon: main directions of work

It seems we're working in two main directions. The first is to re-engineer the low-level API: the serializers, parsers, stores and so on. This could lead to a rather different way of programming the upper layers too; instead of initializing a serializer, a publishing class would just say "with 'serializerrole'", and so on (see the sketch below).
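
As a very rough sketch of what that role-based style might look like (every name here is hypothetical, not an existing API):

package SerializerRole;    # hypothetical role name
use Moo::Role;
use RDF::Trine::Serializer;

requires 'model';

# Consumers get a serialize method instead of wiring up a serializer by hand.
sub serialize {
    my ($self, $format) = @_;
    return RDF::Trine::Serializer->new($format || 'turtle')
        ->serialize_model_to_string($self->model);
}

package My::Publisher;     # a publishing class consuming the role
use Moo;
with 'SerializerRole';

has model => (is => 'ro', required => 1);

1;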

Then, a big discussion of stores and optimizations would be interesting. This is dealt with in MooseLowLevelAPI.

The other major direction we see for the hackathon is to look into the overlapping semantics of RDF and Perl/Moose. There is already some work here. This could also lead to a radical rethinking of how to program SemWeb stuff, which is badly needed. This is covered in RDFSemanticsToMoose.

Further reading