An Introduction to RDF, the Semantic Web and Linked Data that isn't very much longer than its title
This was originally a response to an email. It got longer than I'd planned, so I decided to recycle it.
This field (RDF, Semantic Web, Linked Data, call it what you will) starts with one or two little conceptual shifts that aren't particularly obvious to regular Web and/or data users. Unfortunately they're taken as read by most people writing about this stuff...
So, starting with the statement:
"my dog's name is Basil"
A traditional database table containing this info might look like:
[Table "Pets": columns ID and Name; one row, with Name "Basil"]
The actual name of my dog is just a simple string "Basil".
The ID of the critter in question will be unique in the local database, but for data like this to work on the Web, a truly global ID is needed for the critter. The Web uses URLs to locate pages. But URLs can also be considered identifiers for those pages. So why not use URLs to identify things other than pages: people, places, cars and...dogs. (conceptual shift).
So here's a URL to identify the hound in question:
http://dannyayers.com/pets/Basil
I own the domain "dannyayers.com" and so I can use the URLs there as I please (see WebArch).
Other information in the database is also local: the concepts of "Pets" and "Name". Things can have URLs, so why not concepts and relations? I could define my own identifiers for those concepts (as with the concept of my dog) but I happen to know there are existing definitions I can use:
For the local term "Pets" I'll use:
http://purl.org/stuff/pets/Pet
and for the local term "name" I'll use:
http://xmlns.com/foaf/0.1/name
I don't own those domains, but the definitions there are the same as those I'm using in the DB:
foaf:name - A name for some thing.
pets:Pet - The class of animals kept for pleasure rather than utility.
(slight cheating here - I set up http://purl.org/stuff/pets/ :)
These are examples of (RDF) vocabularies: collections of terms people have defined for particular domains. Lots of these are available, and it's straightforward to define your own. The terms have URLs, so they are globally reusable.
Now to glue these bits together, which is where the Resource Description Framework (RDF) comes in. It allows statements (known as triples) to be made in the form:
subject property object
Where subject is the thing being talked about, property is some characteristic of the thing and object is the value of the
characteristic. These are formal logical statements, comparable to the
relational logic of traditional databases or predicates of logical languages
(e.g. Prolog does it like property(subject, object) ).
In this example it's:
subject : http://dannyayers.com/pets/Basil
property : http://xmlns.com/foaf/0.1/name
object : "Basil"
There's a little more information available too: my dog was in the Pets table, which says he's a member of the class Pet.
Class membership is expressed in RDF as another triple. Here it would be:
subject : http://dannyayers.com/pets/Basil
property : http://www.w3.org/1999/02/22-rdf-syntax-ns#type
object : http://purl.org/stuff/pets/Pet
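To make the shape of this concrete, here's a minimal sketch in Python that holds the two triples above as plain tuples in a set. The variable names (BASIL, FOAF_NAME and so on) are just my own labels for illustration, and a real RDF library would distinguish URLs from literal strings, which this sketch does not.

```python
# A triple is (subject, property, object); a graph is simply a set of triples.
# Assumption: URLs and literals are both plain Python strings here.
BASIL = "http://dannyayers.com/pets/Basil"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PET = "http://purl.org/stuff/pets/Pet"

graph = {
    (BASIL, FOAF_NAME, "Basil"),  # the dog's name, a literal string
    (BASIL, RDF_TYPE, PET),       # class membership
}

# Print each statement in subject / property / object order.
for subject, prop, obj in sorted(graph):
    print(subject, prop, obj)
```

Because the graph is a set, stating the same triple twice changes nothing, which matches the idea that a triple is a logical statement rather than a row with its own identity.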
RDF (which is really a kind of entity-relationship data model) can be expressed in various formats, the most readable being Turtle. The triples above could look like this:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix pets: <http://purl.org/stuff/pets/> .
<http://dannyayers.com/pets/Basil> rdf:type pets:Pet .
<http://dannyayers.com/pets/Basil> foaf:name "Basil" .
(It can be written more concisely than this, but this way it's easier to see what's going on).
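The @prefix lines are only abbreviations: rdf:type stands for the prefix URL with "type" stuck on the end. A rough sketch of that expansion in Python follows. The function name expand and the PREFIXES dictionary are my own inventions for illustration, and this handles only the simple cases used in the example, not the full Turtle grammar.

```python
# Prefix table copied from the Turtle example above.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "pets": "http://purl.org/stuff/pets/",
}

def expand(term, prefixes=PREFIXES):
    """Expand a prefixed name such as 'foaf:name' to its full URL.

    Terms written as <...> are already full URLs, so only the angle
    brackets are stripped. Anything else is returned unchanged.
    """
    if term.startswith("<") and term.endswith(">"):
        return term[1:-1]
    prefix, sep, local = term.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return term

print(expand("rdf:type"))
print(expand("pets:Pet"))
print(expand("<http://dannyayers.com/pets/Basil>"))
```

Running the Turtle statements through something like this gives back exactly the full URLs listed in the subject / property / object breakdown earlier.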
Ok, time for another little conceptual shift.
When we publish documents on the Web we are making statements about things, in whatever language we speak. The same can apply to logical statements expressed in a formal language - like RDF. By publishing statements like this the Web can be used as a (very big) database. If I put the Turtle above in a file and publish that on the Web I'm effectively adding to the global database.
Because RDF is based around URLs, the component parts of statements can be treated as links. What's more, each statement can be considered a (typed) link :
subject --property--> object
- this is the idea behind Linked Data.
With information on the Web, if we see a link on a page we can click on it and "follow our nose" to related information. One huge advantage of expressing data as RDF on the Web is that, because entities and relationships are identified by URLs, an agent can do the same "follow your nose" thing from statements to discover more information.
HTML already does something very similar. If I had a page containing a link like:
<a href="http://dannyayers.com" rel="home">My Home Page</a>
it's expressing a relationship between the current page and the linked
page which could be interpreted as a triple:
<thispage> x:hasHomePage <http://dannyayers.com> .
- except the interpretation isn't really formally defined.
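As a sketch of that informal interpretation, here's some Python using the standard library's html.parser to turn a rel link into a triple. Everything about the mapping is an assumption on my part: x:hasHomePage isn't a real vocabulary term, so the property URL and the page URL below are example.org placeholders, and the class name LinkTriples is made up.

```python
from html.parser import HTMLParser

# Hypothetical mapping from rel values to property URLs (illustration only).
REL_TO_PROPERTY = {"home": "http://example.org/x/hasHomePage"}

class LinkTriples(HTMLParser):
    """Collect (page, property, target) triples from <a rel=...> links."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url  # stands in for <thispage>
        self.triples = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        # rel may hold several space-separated values, or be missing.
        for rel in (attributes.get("rel") or "").split():
            prop = REL_TO_PROPERTY.get(rel)
            if href and prop:
                self.triples.append((self.page_url, prop, href))

parser = LinkTriples("http://example.org/thispage")
parser.feed('<a href="http://dannyayers.com" rel="home">My Home Page</a>')
print(parser.triples)
```

The point is only that a typed link already has the subject, property and object parts. What's missing in plain HTML is an agreed, formal meaning for the rel value.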
(The Microformats initiative has conventions for such statements in HTML, RDFa is a way of expressing any RDF in HTML, and Microdata can be seen as a cut-down version of RDFa included in HTML5)
So data can be expressed "natively" on the Web. Mathematically speaking, the Web as a whole has a graph structure. Tim Berners-Lee in fact offered an alternative name for the World Wide Web - the "Giant Global Graph".
But, though this Giant Global Graph can be considered one big database, it can't exactly be queried like a SQL Server installation. However, a particular RDF file (like the bit of Turtle above) or a bunch of RDF files can be treated as a subset of the whole GGG and worked on in isolation.
Fairly early in the history of RDF, developers noticed that it wasn't convenient to work with such material directly, so they started building RDF stores (often known as triplestores) which allowed programmatic access to the data. After those had been around for a little while it became apparent that programmatic access was a clunky approach, and the SPARQL query language was born. SPARQL works over an RDF store very much like SQL works over a relational DB; in many ways the two are quite similar.
Here's a SPARQL query:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?x ?y WHERE {
?x foaf:name ?y .
}
If you look back to the Turtle example, the bit inside the { } here has a similar shape. SPARQL operates through a kind of pattern matching: if you ran this query over the data above, the structures would be matched and the variable ?x would get the value <http://dannyayers.com/pets/Basil> and ?y the value "Basil".
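To show what "a kind of pattern matching" means, here's a toy matcher in Python for a single triple pattern. It's nowhere near a SPARQL engine, just a sketch under simple assumptions: the graph is a set of tuples, variables are strings starting with "?", and the function name match is my own.

```python
BASIL = "http://dannyayers.com/pets/Basil"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
PET = "http://purl.org/stuff/pets/Pet"

graph = {
    (BASIL, FOAF_NAME, "Basil"),
    (BASIL, RDF_TYPE, PET),
}

def match(graph, pattern):
    """Yield one dict of variable bindings per triple that fits the pattern.

    Pattern terms starting with '?' are variables. Any other term must be
    equal to the corresponding part of the triple.
    """
    for triple in graph:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield bindings

# Same shape as the WHERE clause: ?x foaf:name ?y .
results = list(match(graph, ("?x", FOAF_NAME, "?y")))
print(results)
```

Only the name triple fits the pattern, so there is a single result binding ?x to the dog's URL and ?y to "Basil". The rdf:type triple is skipped because its property doesn't match.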
The example data above is only talking about one dog, but on the Web at large there could be millions of named things that would match that query. The triplestore contains just a little chunk of the (Semantic) Web. I think it's useful to think of triplestores as being data caches - they aren't the "original" data, just some bits of it pulled out for local analysis.
Another little conceptual shift is needed at this point. If we want to be able to work with little pieces of the world's data in this way, we need to take into consideration that there is other information out there.
Say we wanted to ask "is there a dog called Sasha?". Given the data above, a traditional database would reply "no" (and a SPARQL query on a triplestore would return no results). But it may well be that there is a dog somewhere called Sasha (there is, and she's currently biting my jumper because I haven't taken them out yet); we just don't know about it. So to reflect the way things are in reality, the "open world model" is used. That is to say, facts are either true or unknown, rather than the usual "closed world model" of databases and logic-based languages, where things are either true or false.
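The difference between the two models can be sketched in a few lines of Python. This is only an illustration with made-up function names, and it simplifies the question to "is anything called Sasha?" since the example data doesn't say which things are dogs.

```python
BASIL = "http://dannyayers.com/pets/Basil"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# The local chunk of data: all we happen to know about.
graph = {(BASIL, FOAF_NAME, "Basil")}

def has_name(graph, name):
    """True if some subject in the graph has the given foaf:name."""
    return any(p == FOAF_NAME and o == name for _, p, o in graph)

def closed_world_answer(graph, name):
    # Closed world: anything not stated is taken to be false.
    return "yes" if has_name(graph, name) else "no"

def open_world_answer(graph, name):
    # Open world: anything not stated is simply not known.
    return "yes" if has_name(graph, name) else "unknown"

print(closed_world_answer(graph, "Sasha"))
print(open_world_answer(graph, "Sasha"))
```

Both functions look at exactly the same data. The only thing that changes is how the absence of a statement is read.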
Using the Web as a huge database is an extremely powerful idea, but not without its problems. Unlike most databases, on the Web anyone can say what they want about anything, whether it's true or false. Someone might even state:
<http://dannyayers.com/pets/Basil> foaf:name "Sasha" .
RDF itself doesn't deal with such issues. As far as RDF is concerned the dog now has two names. There is no notion in the language of deleting statements, and changing a statement somewhere is conceptually the same as adding a new statement. But in practice a system will work with information that's as recent as possible from trustworthy sources (trust and provenance on the Web are big issues!).
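Here's a rough Python sketch of both halves of that: merging statements just accumulates them, and a system can choose which sources to believe. Recording a source URL alongside each statement is my own simplification for illustration, and both source URLs and the TRUSTED set are hypothetical.

```python
BASIL = "http://dannyayers.com/pets/Basil"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Each statement is (subject, property, object, source).
# The source URLs are hypothetical, for illustration only.
statements = [
    (BASIL, FOAF_NAME, "Basil", "http://dannyayers.com/"),
    (BASIL, FOAF_NAME, "Sasha", "http://example.org/someone-else"),
]

def names_of(subject, statements, trusted=None):
    """Return the sorted names stated for a subject.

    If 'trusted' is given, only statements from those sources count.
    """
    return sorted(
        obj
        for subj, prop, obj, source in statements
        if subj == subject
        and prop == FOAF_NAME
        and (trusted is None or source in trusted)
    )

TRUSTED = {"http://dannyayers.com/"}

print(names_of(BASIL, statements))           # everything anyone has said
print(names_of(BASIL, statements, TRUSTED))  # only what trusted sources said
```

Nothing is deleted in either case. The untrusted statement is still there; the application just decides not to act on it.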
One last question: "How do I explore links?" In principle the answer is "use HTTP"; in practice it depends on what tools people have built. One tool is SNORQL:
http://wiki.dbpedia.org/OnlineAccess#h28-5
If you click on one of the examples, it builds a corresponding SPARQL query. The links in the results can be followed.