I've probably read Getting Real half a dozen
times since the release of the free online version last year. The
agile process seems to fit quite nicely with RDF-based tools
(Semantic CrunchBase was the
most recent proof of concept for me). I'm currently writing a
DevX article about using RDF and
SPARQL in combination with Getting Real and wondered about
quantitative numbers for such an approach. As I usually don't
record hours for personal projects, I had to create a new one:
sillily named "dooit", a
to-do list manager.Many of you will know about the W3C relational to RDF mapping incubator activity. The group is planning to suggest forming a working group for drawing up a specification for relational to RDF mapping.
To this effect, I recently summarized the group discussions and some of our own experiences around the topic at http://esw.w3.org/topic/Rdb2RdfXG/ReqForMappingByOErling
I will here discuss this less formally and more in the light of our own experience. A working group goal statement must be neutral vis a vis the following points, even if any working group will unavoidably encounter these issues on the way. A blog post on the other hand can be more specific.
I gave a talk to the RDB2RDF XG this spring, withe these slides.
The main point is that people would really like to map on the fly if they only could. Making an RDF warehouse is not a value in itself but it is true that in some cases this cannot be avoided.
At first sight, one would think that a mapping specification could be neutral as regards whether one stores the mapped triples as triples or makes them on demand. There is almost no comparison between the complexity of doing non-trivial mappings on the fly versus mapping as ETL. Some of this complexity spills over into the requirements for a mapping language.
We expect to have a situation where one virtual triple can have many possible sources. The mapping is a union of mapped databases. Any integration scenario will have this feature. In such a situation, if we are joining using such triples, we end up with unions of all databases that could produce the triples in question. This is generally not desired. Therefore, in the on demand mapping case, there must be a lot of typpe inference logic that is not relevant in the ETL scenario.
To make the point clearer, suppose a query like "list the organizations whose representatives have published about x." Suppose that there are 3 databases mapped, all of which have a table of organizations, persons with affiliation to organizations, publications by these persons and finally a table of tags for the publications. Now, we want the laboratories that have published with articles with tag XX. It is a matter of common sense in this scenario that a publication will have the author and the author's affiliation in the same database. The RDB to RDF mapping does not however necessarily know this if all that it is told is that a table makes IRI's of publications by applying a certain pattern to the primary key of the publications table. To infer what needs to be inferred, the system must realize that IRI's from one mapping are disjoint from IRI's from another: A paper in database X will usually not have an author in database Y. The ID's in database Y, even if perchance equal to the ID's in X do not mean the same thing and there is no point joining across databases by them.
This entire question is a non-issue in the ETL scenario but is absolutely vital in the real time mappin. This is also something that must be stated, at least implicitly, in any mapping. If a mapping translates keys of one place to IRI's with one pattern and keys from another using another patterm, it must be inferrable from the patterns whether the sets of IRI's will be disjoint.
This is critical. Otherwise we will be joining everything to everything else and there will be orders of magnitude of penalty compared to hand crafted SQL over the same data sources.
SPARQL queries translate quite well to SQL when there is only one table that can produce a triple with a subject of a given class and when there are few columns that can map to a given predicate and when classes and predicates are literals in the query.
Virtuoso has some SQL extensions for dealing with breaking a wide table into a row per column. This facilitates dealing with predicates that are not known at query compile time. If the table in question is not managed by Virtuoso, Virtuoso's SQL virtualization/federation takes care of the matter. If a mapping system goes directly to third party SQL, no such tricks can be used.
The above example suggests that for supporting on the fly mapping without relying on owning the SQL underneath, some subsets of SPARQL may have to be defined. For example, one will probably have to require that all predicates be literals. The alternative is prohibitive run time cost and complexity.
But we must not lose the baby with the bathwater. Aside offering global identifiers, RDF's attractions include subclasses and subpredicates. In relational terms these translate to unions and do involve some added cost. A mapping system just has to have means of dealing with this cost and of recognizing cases where this cost is prohibitive. Some further work is likely to be required for defining well behaved subsets of SPARQL and mappings.
Whether to warehouse or not to? If one has hundreds of sources of which some are not even relational, some ETL would seem necessary. Kashiup Vipul gave a position paper at last years RDB to RDF mappping workshop in Cambridge, Mass. about a system of relational mapping and on-demand RDF-izers of diverse semistructured biomedical data, e.g.spreadsheets. The issue certainly exists and any mapping work will likely encounter integration scenarios where one part is fairly neatly mapped from relational stores and another comes from a less structured repository of ETL'ed physical triples.
Our take is that if something is a large or very large relational store, then map, else ETL. With Virtuoso, we can mix mapped and local triples but this is not a genmerally available feature of triple stores and standardization will likely have to wait until there are more implementations.
This was a quick summary, by no means comprehensive, on what an eventual RDB2RDF working group would come across. This is a sort of addendum to the requirements I outlined on the ESW wiki.
I have mentioned on a couple of prior occasions that basic graph operations ought to be integrated into the query language.
The history of database is by and large about moving from a specialized application towrds a generic platform. The introduction of the DBMS itself is the archetypal example. It is all about extracting the common features of applications and making these features of a platform instead.
It is now time to apply this principle to graph traversal.
The rationale is that graph operations are somewhat tedious to write in a parallelizable, latency tolerant manner. Writingt them as one would for memory based data structures is easier but totally unsclable as soon as there is any latency involve, i.e. disk reads or messages between cluster peers.
The ad hoc nature and very large volume of RDF data makes this a timely question. Up until now, the answer to this question has been to materialize any implied facts in RDF stores. If a was a part of b and b a part of c, the implied fact of a being a part of c would be inserted explicitly into the database as a pre-query step.
This is simple and often efficient but tends to have the downside that one makes a specialized warehouse for each new type of query. The activity becomes less ad hoc.
Also, this becomes next to impossible when the scale approaches web scale or if some of the data is liable to be on and off included into or excluded from the set being analyzed. This is why with Virtuoso we have tended to favor inference on demand (backward chaining) and mapping of relational data into RDF without copying.
The SQL world has taken steps towards dealing with recursion with the WITH - UNION construct which allows defining recursive views. The idea there is to define for example a tree walk as a union of the data of the starting node plus the recursive walk of the starting node's immediate children.
The main problem with this is that I do not very well see how a SQL optimizer could effectively rearrange queries involving joins between such recursive views. This model of recursion seems to lose SQL's non-procedural nature. One can no longer easily rerrange joins based on what data is given and what is to be retrieved. If the recursion is written from root to leaf, it is not obvious how to do this from leaf to root. At any rate, queries written in this way are so complex to write, let alone optimnize that I decided to take another approach.
Take a question like "list the parts of products of category C which have materials that are classified as toxic." Suppose that the product categories are a tree, the product parts are a tree and the materials classification is a tree taxonony where "toxic" has a multilevel substyructure.
Depending on the count of products and materials, the query can be evaluated as either going from products to parts to materials and then climbing up the materials tree to see if the material is toxic. Or one could do it in reverse, starting with the different toxic materials, looking up the parts containing these, going to the part tree to the product and up the product hierarchy to see if the produt is in the right category. One and the same query should be evaluable either way depending on what indices exist, what the cardinalities of the relations are and so forth. Regular cost based optimization.
Specially with RDF, there are many problems of this type. In regular SQL, it is a long standing cultural practice to flatten hierarchies but this is not so with RDF.
With Virtuoso, we see SPARQL as reducing to SQL. Any RDF oriented database engine or query optimization feature is accessed via SQL. Thus, if we address run time recursion in the Virtuoso query engine, this becomes ipso facto an SQL feature. Besides, we remember that SQL is a much more mature and expressive language than the current SPARQL recommendation.
We will here look at some simple social network queries. A later article will show how to do more general graph operations. We extend the SQL derived table construct, i.e. select in another select's from clause, with a transitive clause.
Consider the data:
create table knows (p1 int, p2 int, primary key (p1, p2)) alter index knows on knows partition (p1 int); create index knows2 on knows (p2, p1) partition (p2 int);
We represent a social network with the many to many relation knows. The persons are identified by integers.
insert into knows values (1, 2); insert into knows values (1, 3); insert into knows values (2, 4);
select * from (select transitive t_in (1) t_out (2) t_distinct p1, p2 from knows) k where k.p1 = 1;
We obtain the result:
| P1 | P2 |
|---|---|
| 1 | 3 |
| 1 | 2 |
| 1 | 4 |
The operation is reverseble:
select * from (select transitive t_in (1) t_out (2) t_distinct p1, p2 from knows) k where k.p2 = 4;
| P1 | P2 |
|---|---|
| 2 | 4 |
| 1 | 4 |
Since now we give P2, we traverse from P2 towards P1. The result set states that 4 is known by 2 and 2 is known by 1.
To see what would happen if x knowing y also meant y knowing x, one could write:
select * from (select transitive t_in (1) t_out (2) t_distinct p1, p2 from (select p1, p2 from knows union all select p2, p1 from knows) k2) k where k.p2 = 4;
| P1 | P2 |
|---|---|
| 2 | 4 |
| 1 | 4 |
| 3 | 4 |
Now, since we know that 1 and 4 are related, we can ask how they are related.
select * from (select transitive t_in (1) t_out (2) t_distinct p1, p2, t_step (1) as via, t_step ('step_no') as step, t_step ('path_id') as path from knows) k where p1 = 1 and p2 = 4; | P1 | P2 | VIA | STEP | PATH |
|---|---|---|---|---|
| 1 | 4 | 1 | 0 | 0 |
| 1 | 4 | 2 | 1 | 0 |
| 1 | 4 | 4 | 2 | 0 |
The two first columns are the ends of the path. The next column is the person that is a step on the path. The next one is the number of the step, counting from 0, so that the end of the path that corresponds to the end condition on the column designated as input, i.e. p1, has number 0. Since there can be multiple sollutions, the last column is a sequence number allowing distinguishing multiple alternative paths from each other.
For LinkedIn users, the friends ordered by distance and descending friend count query, which is at the basis of most LinkedIn searchh result views can be written as:
select p2, dist, (select count (*) from knows c where c.p1 = k.p2) from (select transitive t_in (1) t_out (2) t_distinct p1, p2, t_step ('step_no') as dist from knows) k where p1 = 1 order by dist, 3 desc; | P2 | DIST | aggregate |
|---|---|---|
| 2 | 1 | 1 |
| 3 | 1 | 0 |
| 4 | 2 | 0 |
The queries shown above work on Virtuoso 6. When running in cluster mode, several thousand graph traversal steps may be proceeding at the same time, meaning that all database access is parallelized and that the algorithm is internally latency tolerant. By default, all results are produced in a deterministic order, permitting predictable slicing of result sets.
Furthermore, for queries where both ends of a path arte given, the optimizer may decide to attack the path from both ends simultaneously. So, supposing that every member of a social network has an average of 30 contacts and we need to find a path between two users that are no more than 6 steps apart, we begin at both ends, expanding each up to 3 levels and we stop when we find the first intersection. Thus we reach 2 * 30^3 = 54000 nodes and not 30^6 nodes.
Writing a generic database driven graph traversal framework on the application side, say in Java over JDBC, would easily be over a thousand lines. This is much more work than can be justified just for a one off ad hoc query. Besides, the traversal order would in such a case not be optimizable by the DBMS.
In a future blog post I will show how this feature can be used for common graph tasks like critical path, itinerary planning, travelling salesman, the 8 queens chess problem etc. There are lots of switches for controlling different parameters of the traversal. This is just the beginning. I will also give examples of the use of this in SPARQL.
Many of you will know about the W3C relational to RDF mapping incubator activity. The group is planning to suggest forming a working group for drawing up a specification for relational to RDF mapping.
To this effect, I recently summarized the group discussions and some of our own experiences around the topic at http://esw.w3.org/topic/Rdb2RdfXG/ReqForMappingByOErling
I will here discuss this less formally and more in the light of our own experience. A working group goal statement must be neutral vis a vis the following points, even if any working group will unavoidably encounter these issues on the way. A blog post on the other hand can be more specific.
I gave a talk to the RDB2RDF XG this spring, withe these slides.
The main point is that people would really like to map on the fly if they only could. Making an RDF warehouse is not a value in itself but it is true that in some cases this cannot be avoided.
At first sight, one would think that a mapping specification could be neutral as regards whether one stores the mapped triples as triples or makes them on demand. There is almost no comparison between the complexity of doing non-trivial mappings on the fly versus mapping as ETL. Some of this complexity spills over into the requirements for a mapping language.
We expect to have a situation where one virtual triple can have many possible sources. The mapping is a union of mapped databases. Any integration scenario will have this feature. In such a situation, if we are joining using such triples, we end up with unions of all databases that could produce the triples in question. This is generally not desired. Therefore, in the on demand mapping case, there must be a lot of typpe inference logic that is not relevant in the ETL scenario.
To make the point clearer, suppose a query like "list the organizations whose representatives have published about x." Suppose that there are 3 databases mapped, all of which have a table of organizations, persons with affiliation to organizations, publications by these persons and finally a table of tags for the publications. Now, we want the laboratories that have published with articles with tag XX. It is a matter of common sense in this scenario that a publication will have the author and the author's affiliation in the same database. The RDB to RDF mapping does not however necessarily know this if all that it is told is that a table makes IRI's of publications by applying a certain pattern to the primary key of the publications table. To infer what needs to be inferred, the system must realize that IRI's from one mapping are disjoint from IRI's from another: A paper in database X will usually not have an author in database Y. The ID's in database Y, even if perchance equal to the ID's in X do not mean the same thing and there is no point joining across databases by them.
This entire question is a non-issue in the ETL scenario but is absolutely vital in the real time mappin. This is also something that must be stated, at least implicitly, in any mapping. If a mapping translates keys of one place to IRI's with one pattern and keys from another using another patterm, it must be inferrable from the patterns whether the sets of IRI's will be disjoint.
This is critical. Otherwise we will be joining everything to everything else and there will be orders of magnitude of penalty compared to hand crafted SQL over the same data sources.
SPARQL queries translate quite well to SQL when there is only one table that can produce a triple with a subject of a given class and when there are few columns that can map to a given predicate and when classes and predicates are literals in the query.
Virtuoso has some SQL extensions for dealing with breaking a wide table into a row per column. This facilitates dealing with predicates that are not known at query compile time. If the table in question is not managed by Virtuoso, Virtuoso's SQL virtualization/federation takes care of the matter. If a mapping system goes directly to third party SQL, no such tricks can be used.
The above example suggests that for supporting on the fly mapping without relying on owning the SQL underneath, some subsets of SPARQL may have to be defined. For example, one will probably have to require that all predicates be literals. The alternative is prohibitive run time cost and complexity.
But we must not lose the baby with the bathwater. Aside offering global identifiers, RDF's attractions include subclasses and subpredicates. In relational terms these translate to unions and do involve some added cost. A mapping system just has to have means of dealing with this cost and of recognizing cases where this cost is prohibitive. Some further work is likely to be required for defining well behaved subsets of SPARQL and mappings.
Whether to warehouse or not to? If one has hundreds of sources of which some are not even relational, some ETL would seem necessary. Kashiup Vipul gave a position paper at last years RDB to RDF mappping workshop in Cambridge, Mass. about a system of relational mapping and on-demand RDF-izers of diverse semistructured biomedical data, e.g.spreadsheets. The issue certainly exists and any mapping work will likely encounter integration scenarios where one part is fairly neatly mapped from relational stores and another comes from a less structured repository of ETL'ed physical triples.
Our take is that if something is a large or very large relational store, then map, else ETL. With Virtuoso, we can mix mapped and local triples but this is not a genmerally available feature of triple stores and standardization will likely have to wait until there are more implementations.
This was a quick summary, by no means comprehensive, on what an eventual RDB2RDF working group would come across. This is a sort of addendum to the requirements I outlined on the ESW wiki.
I have mentioned on a couple of prior occasions that basic graph operations ought to be integrated into the query language.
The history of database is by and large about moving from a specialized application towrds a generic platform. The introduction of the DBMS itself is the archetypal example. It is all about extracting the common features of applications and making these features of a platform instead.
It is now time to apply this principle to graph traversal.
The rationale is that graph operations are somewhat tedious to write in a parallelizable, latency tolerant manner. Writingt them as one would for memory based data structures is easier but totally unsclable as soon as there is any latency involve, i.e. disk reads or messages between cluster peers.
The ad hoc nature and very large volume of RDF data makes this a timely question. Up until now, the answer to this question has been to materialize any implied facts in RDF stores. If a was a part of b and b a part of c, the implied fact of a being a part of c would be inserted explicitly into the database as a pre-query step.
This is simple and often efficient but tends to have the downside that one makes a specialized warehouse for each new type of query. The activity becomes less ad hoc.
Also, this becomes next to impossible when the scale approaches web scale or if some of the data is liable to be on and off included into or excluded from the set being analyzed. This is why with Virtuoso we have tended to favor inference on demand (backward chaining) and mapping of relational data into RDF without copying.
The SQL world has taken steps towards dealing with recursion with the WITH - UNION construct which allows defining recursive views. The idea there is to define for example a tree walk as a union of the data of the starting node plus the recursive walk of the starting node's immediate children.
The main problem with this is that I do not very well see how a SQL optimizer could effectively rearrange queries involving joins between such recursive views. This model of recursion seems to lose SQL's non-procedural nature. One can no longer easily rerrange joins based on what data is given and what is to be retrieved. If the recursion is written from root to leaf, it is not obvious how to do this from leaf to root. At any rate, queries written in this way are so complex to write, let alone optimnize that I decided to take another approach.
Take a question like "list the parts of products of category C which have materials that are classified as toxic." Suppose that the product categories are a tree, the product parts are a tree and the materials classification is a tree taxonony where "toxic" has a multilevel substyructure.
Depending on the count of products and materials, the query can be evaluated as either going from products to parts to materials and then climbing up the materials tree to see if the material is toxic. Or one could do it in reverse, starting with the different toxic materials, looking up the parts containing these, going to the part tree to the product and up the product hierarchy to see if the produt is in the right category. One and the same query should be evaluable either way depending on what indices exist, what the cardinalities of the relations are and so forth. Regular cost based optimization.
Specially with RDF, there are many problems of this type. In regular SQL, it is a long standing cultural practice to flatten hierarchies but this is not so with RDF.
With Virtuoso, we see SPARQL as reducing to SQL. Any RDF oriented database engine or query optimization feature is accessed via SQL. Thus, if we address run time recursion in the Virtuoso query engine, this becomes ipso facto an SQL feature. Besides, we remember that SQL is a much more mature and expressive language than the current SPARQL recommendation.
We will here look at some simple social network queries. A later article will show how to do more general graph operations. We extend the SQL derived table construct, i.e. select in another select's from clause, with a transitive clause.
Consider the data:
create table knows (p1 int, p2 int, primary key (p1, p2)) alter index knows on knows partition (p1 int); create index knows2 on knows (p2, p1) partition (p2 int);
We represent a social network with the many to many relation knows. The persons are identified by integers.
insert into knows values (1, 2); insert into knows values (1, 3); insert into knows values (2, 4);
select * from (select transitive t_in (1) t_out (2) t_distinct p1, p2 from knows) k where k.p1 = 1;
We obtain the result:
| P1 | P2 |
|---|---|
| 1 | 3 |
| 1 | 2 |
| 1 | 4 |
The operation is reverseble:
select * from (select transitive t_in (1) t_out (2) t_distinct p1, p2 from knows) k where k.p2 = 4;
| P1 | P2 |
|---|---|
| 2 | 4 |
| 1 | 4 |
Since now we give P2, we traverse from P2 towards P1. The result set states that 4 is known by 2 and 2 is known by 1.
To see what would happen if x knowing y also meant y knowing x, one could write:
select * from
(select transitive t_in (1) t_out (2) t_distinct p1, p2
from (select p1, p2 from knows union all select p2, p1 from knows) k2) k where k.p2 = 4;
| P1 | P2 |
|---|---|
| 2 | 4 |
| 1 | 4 |
| 3 | 4 |
Now, since we know that 1 and 4 are related, we can ask how they are related.
select * from
(select
transitive t_in (1) t_out (2) t_distinct p1,
p2,
t_step (1) as via,
t_step ('step_no') as step,
t_step ('path_id') as path from knows) k
where p1 = 1 and p2 = 4;
| P1 | P2 | VIA | STEP | PATH |
|---|---|---|---|---|
| 1 | 4 | 1 | 0 | 0 |
| 1 | 4 | 2 | 1 | 0 |
| 1 | 4 | 4 | 2 | 0 |
The two first columns are the ends of the path. The next column is the person that is a step on the path. The next one is the number of the step, counting from 0, so that the end of the path that corresponds to the end condition on the column designated as input, i.e. p1, has number 0. Since there can be multiple sollutions, the last column is a sequence number allowing distinguishing multiple alternative paths from each other.
For LinkedIn users, the friends ordered by distance and descending friend count query, which is at the basis of most LinkedIn searchh result views can be written as:
select
p2,
dist,
(select count (*) from knows c where c.p1 = k.p2)
from (select transitive t_in (1) t_out (2) t_distinct p1,
p2,
t_step ('step_no') as dist from knows) k
where p1 = 1 order by dist, 3 desc;
| P2 | DIST | aggregate |
|---|---|---|
| 2 | 1 | 1 |
| 3 | 1 | 0 |
| 4 | 2 | 0 |
The queries shown above work on Virtuoso 6. When running in cluster mode, several thousand graph traversal steps may be proceeding at the same time, meaning that all database access is parallelized and that the algorithm is internally latency tolerant. By default, all results are produced in a deterministic order, permitting predictable slicing of result sets.
Furthermore, for queries where both ends of a path arte given, the optimizer may decide to attack the path from both ends simultaneously. So, supposing that every member of a social network has an average of 30 contacts and we need to find a path between two users that are no more than 6 steps apart, we begin at both ends, expanding each up to 3 levels and we stop when we find the first intersection. Thus we reach 2 * 30^3 = 54000 nodes and not 30^6 nodes.
Writing a generic database driven graph traversal framework on the application side, say in Java over JDBC, would easily be over a thousand lines. This is much more work than can be justified just for a one off ad hoc query. Besides, the traversal order would in such a case not be optimizable by the DBMS.
In a future blog post I will show how this feature can be used for common graph tasks like critical path, itinerary planning, travelling salesman, the 8 queens chess problem etc. There are lots of switches for controlling different parameters of the traversal. This is just the beginning. I will also give examples of the use of this in SPARQL.
psd posted a photo:
Phil and I grabbed a bite of lunch in the Tate Modern members' room. There's a nice view from the balcony!
A couple of weeks ago I attended the second Semantic Bar Camp which took place at the Orange research labs at Issy les Moulineaux, near Paris. This was a great opportunity to meet many of the French researchers in the Semantic Web space, to take part in the French debate, and to help convince interested parties of the reality of the technology.
Jean Rohmer of the large French defense group Thales played the role of the devil's advocate, arguing that the Semantic Web was just pie in the sky theory without practical applications. We delved into various aspects of the theory of the Semantic Web, and I underlined how the biological/evolutionary aspect of language, the Academie Francaise notwithstanding, was a key aspect in understanding the evolution of the web of data. But the best argument was a simple demonstration of the Beatnik Address Book, which showed how hyperdata could solve the serious problem of 2008: the growing number of closed social networks. At the next camp I hope we will be able to delve much more deeply into how to build real practical applications.
Many thanks to Karima Rafes for organizing this well attended bar camp ( pictures ). Stephane Lauriere from XWiki and who is on the Nepomuk Semantic Desktop project, also posted some photos. And I would like to recommend Alexandre Passant's blog to all french speaking readers.
The talk I gave is now available online with audio as "Building Secure, Open and Distributed Social Network Applications".
Following on the success last year, JavaOne 2008 has lined up three talks on the Semantic Web, a 200% increase. The program should be an excellent way for Java enthusiasts to get a feel for how the Semantic Web is getting used in real application making money for real start ups, how to develop such apps in Java, how to build open social networks that bridge the social networking data silos, and with the help of Dean Allemang cover some theoretical grounds from a practical perspective .
Here is the timetable of the sessions at JavaOne. Highlighted in green are the three semantic web sessions. Highlighted in gray are 4 of the 5 sessions on Google's Open Social API, which reveals the importance social networks are taking in development. I don't think though that that API solves the real problem of current social networks: The Data silo problem. Only Semantic Web technologies can do that.
Update Sept 2008: Two of the talks are now available online:
Below are the details of the sessions in tabular format. I believe they should complement each other very well.
| Session Title: | Developing Semantic Web Applications on the Java™ Platform |
| Session Time: | Thursday - 05/08/2008 - 1:30 PM-2:30 PM |
| Session ID: | PAN-5542 |
| Session Description: | The semantic web is nearing the point of widespread practical adoption: • The core specifications have stabilized. • Tools and frameworks implementing key features have been through several development cycles (for a listing see http://esw.w3.org/topic/SemanticWebTools). • An increasing number of major software companies have developed semantically enabled products or are actively researching the space. As companies start to translate theory into real Java™ technology-based applications, they are confronted with a host of practical software engineering issues: • What is the standard or recommended functional architecture of a semantic application? • How does that architecture relate to the semantic web standards? • Which of those standards are stable, and which can be expected to evolve in ways that would significantly affect prior applications? • What types of tools/frameworks exist that can be leveraged to help implement semantic applications on the Java platform? • How mature are the various categories of semantic web tools/frameworks? • Can API standardization be expected for certain tool/framework categories? • What best practices exist for the design and implementation of Java technology-based semantic applications? • What best practices exist for the deployment of Java technology-based semantic applications? • What future trends in Java platform support for semantic application development can be expected? This panel session gathers together semantics experts from the software industry to address these and other practical issues relating to the development of semantic applications on the Java platform. |
| Track: | Next Generation Web |
| Session Type: | Panel Session |
| Duration: | 60 minutes |
| Speaker(s): | Jans Aasman, Franz Inc; Dean Allemang, TopQuadrant Inc. ; Brian Sletten, Zepheira, LLC; Henry Story, Sun Microsystems, Inc.; Lew Tucker, Radar Networks |
| Session Title: | Beatnik: Building an Open Social Network Browser |
| Session Time: | Thursday - 05/08/2008 - 7:30 PM-8:20 PM |
| Session ID: | BOF-5911 |
| Session Description: | The recent growth of social networking sites is revealing the limits of the current ad hoc data architecture used by Web 2.0 sites. A typical example is that you cannot link to a person in a Facebook account from a LinkedIn account. What is needed to solve these problems is hyperdata, the ability to link data universally. Hyperdata is to data what hypertext is to text. Where hypertext enables text to link up to other text, hyperdata enables data to link up to other data globally. Where HTML enables open, distributed hypertext, the semantic web enables open, distributed hyperdata. Anybody can publish data that then becomes reachable by any tool crawling the web of relations. To illustrate the power of hyperdata, this session presents Beatnik, a social network browser and editor written entirely in the Java™ programming language that consumes any of the millions of available friend-of-a-friend (FOAF) files already published on the web and enables users to publish information about themselves and their own social network. It shows how you can drag and drop a FOAF URL onto Beatnik and start exploring a web of relations and find up-to-date information about where your friends live, who their friends are, and where people are currently located. With a click of a button, Beatnik will publish all your own relations to your web server in a nonintrusive way to make you part of the first globally available open social network. After a quick overview of the semantic web and FOAF, the presentation takes a detailed look at how the Beatnik client is built. This involves digging into one of the many Java technology-based semantic web frameworks, such as Sesame, and its APIs; a Java-platform-to-RDF mapper, such as so(m)mer or Elmo; and how this enables inferencing on the Java platform. On the server side, the presentation looks at how you can easily publish the contents of an LDAP database into any of the numerous RDF formats using JSR 311, the Java API for RESTful Web Services. It also covers the use of the Atom Publishing Protocol as a publication mechanism and discusses various security techniques for limiting the view of a personal graph of information by using OpenID and distributed-web-of-trust techniques. |
| Track: | Cool Stuff, Cool Desktop; Cool Stuff, Cool Next Gen Web; Open Source, Open Source Next Gen Web; Cool Stuff; Desktop; Next Generation Web; Open Source |
| Session Type: | Birds-of-a-Feather Session (BOF) |
| Duration: | 50 minutes |
| Speaker(s): | Tim Boudreau, Sun Microsystems, Inc.; Henry Story, Sun Microsystems, Inc. |
| Session Title: | Semantic Web for the Working Ontologist |
| Session Time: | Friday 05/09/2008 - 1:30 PM-2:30 PM |
| Session ID: | TS-5555 |
| Session Description: | This session presents the basics of practical semantic web deployment using standards-based tools on the Java™ platform. It covers the Resource Description Framework (RDF) as the fundamental mashup language of the web; SPARQL, the query language for RDF; and RDFS and OWL, which provide simple inferencing capabilities. In the distributed world of the web, information is moving from a hypertext paradigm to a hyperdata paradigm--the web today is not just a web of documents but also a web of data. But that data is available on the web and in the enterprise in a wide variety of forms: HTML, XML, RSS, spreadsheets, databases, and so on. RDF provides a uniform way to identify information in a distributed setting to form a web of data. The session demonstrates a Java technology-based platform (built on Eclipse) that uses RDF as an interlingua for merging information from multiple web sources. Java technology plays a key role in the success of the system in several ways. First, it uses the large variety of public domain semantic web software available on the Java platform as the basis of interoperability at the API level. Second, it uses the Eclipse framework as a visual editing environment for the ontologies. Finally, it uses the modularity of the Eclipse plug-in environment to enable a sort of plug-and-play architecture among semantic components. One of the basic ideas of the semantic web is that semantic models, or “ontologies,” can be used to describe how data fits together. In the context of the web of hyperdata, an ontology can describe how data in one source relates to data from another, or even which sources of data should be merged to answer a particular question or support a particular application. The idea is that, armed with these tools, a working ontologist can describe hyperdata applications without resorting to a general-purpose programming language. TopQuadrant has used these standards to construct a workbench for building semantic applications. Semantic mashups can be built by use of RDFS and OWL. TopQuadrant has also developed a visual flow editor for describing how distributed data can be merged in novel ways; it calls this editor SPARQLMotion, because it extends the standard query language SPARQL with intuitive information flow diagrams modeled in OWL. SPARQLMotion modules can be connected with a simple point-and-click interface to create novel arrangements. |
| Track: | Next Generation Web |
| Session Type: | Technical Session |
| Duration: | 60 minutes |
| Speaker(s): | Dean Allemang, TopQuadrant Inc. |
Following on my previous post RDFAuth: sketch of a buzzword compliant authentication protocol, Toby Inkster came up with a brilliantly simple scheme that builds very neatly on top of the Secure Sockets Layer of https. I describe the protocol shortly here, and will describe an implementation of it in my next post.
Simple global ( passwordless if using a device such as the Aladdin USB e-Token ) authentication around the web would be extremely valuable. I am currently crumbling under the number of sites asking me for authentication information, and for each site I need to remember a new id and password combination. I am not the only one with this problem as the data portability video demonstrates. OpenId solves the problem but the protocol consumes a lot of ssl connections. For hyperdata user agents this could be painfully slow. This is because they may need access to just a couple of resources per server as they jump from service to service.
As before we have a very simple scenario to consider. Romeo wants to find out where Juliette is. Juliette's hyperdata Address Book updates her location on a regular basis by PUTing information to a protected resource which she only wants her friends and their friends to have access to. Her server knows from her foaf:PersonalProfileDocument who her friends are. She identifies them via dereferenceable URLs, as I do, which themselves usually (the web is flexible) return more foaf:PersonalProfileDocuments describing them, and pointing to further such documents. In this way the list of people able to find out her location can be specified in a flexible and distributed manner. So let us imagine that Romeo is a friend of a friend of Juliette's and he wishes to talk to her. The following sequence diagram continues the story...
The stages of the diagram are listed below:
First Romeo's User Agent HTTP GETs Juliette's public foaf file located at http://juliette.net/. The server returns a representation ( in RDFa perhaps ) with the same semantics as the following N3:
@prefix : <#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix todo: <http://eg.org/todo#> .
@prefix openid: <http://eg.org/openid/todo#> .
<> a foaf:PersonalProfileDocument;
foaf:primaryTopic :juliette ;
openid:server <https://aol.com/openid/service>; # see The Openid Sequence Diagram .
:juliette a foaf:Person;
foaf:name "Juliette";
foaf:openid <>;
foaf:blog </blog>;
rdfs:seeAlso <https://juliette.net/protected/location>;
foaf:knows <http://bblfish.net/people/henry/card#me>,
<http://www.w3.org/People/Berners-Lee/card#i> .
<https://juliette.net/protected/location> a todo:LocationDocument .
Romeo's user agent receives this representation and decides to follow the https protected resource because it is a todo:LocationDocument.
In the communication in stage 2, Romeo's user agent also passes along his foaf id. This can be done either by:
Agent-Id header pointing to the foaf Id of the user. Like this:
Agent-Id: http://romeo.net/#romeo
This would be similar to the current From: header, but instead of requiring an email address, a direct name of the agent would be required. (An email address is only an indirect identifier of an agent).
X509v3 extensions:
...
X509v3 Subject Alternative Name:
URI:http://romeo.net/#romeo
I am not sure if it would be correct use of the X509 Alternative names field. So this would require more standardization work with the X509 community. But it shows a way where the two communities could meet. The advantage of having the id as part of the certificate is that this could add extra weight to the id, depending on the trust one gives the Certificate Authority that signed the Certificate.
If the Certificate is signed by a CA that Juliette trusts and the foaf id is part of the certificate, then she will trust that the owner of the User Agent is the entity named by that id. She can then jump straight to step 6 if she knows enough about Romeo that she trusts him.
Having Certificates signed by CA's is expensive though. The protocol described here will work just as well with self signed certificates, which are easy to generate.
<http://romeo.net/> . Romeo's foaf server returns a document containing a graph of relations similar to the graph described by the following N3:
@prefix : <#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix wot: <http://xmlns.com/wot/0.1/> .
@prefix wotodo: <http://eg.org/todo#> .
<> a foaf:PersonalProfileDocument;
foaf:primaryTopic :romeo .
:romeo a foaf:Person;
foaf:name "Romeo";
is wot:identity of [ a wotodo:X509Certificate;
wotodo:dsaWithSha1Sig """30:2c:02:14:78:69:1e:4f:7d:37:36:a5:8f:37:30:58:18:5a:
f6:10:e9:13:a4:ec:02:14:03:93:42:3b:c0:d4:33:63:ae:2f:
eb:8c:11:08:1c:aa:93:7d:71:01""" ;
] ;
foaf:knows <http://bblfish.net/people/henry/card#me> .
PREFIX wot: <http://xmlns.com/wot/0.1/>
PREFIX wotodo: <http://eg.org/todo#>
SELECT { ?sig }
WHERE {
[] a wotodo:X509Certificate;
wotodo:signature ?sig;
wot:identity <http://romeo.net/#romeo> .
}
Juliette's web server can discover the certificate signature and compare it with the one sent by Romeo's user agent. If the two are identical, then Juliette's server knows that the User Agent who has access to the private key of the certificate sent to it, and who claims to be the person identified by the URI http://romeo.net/#romeo, is in agreement as to the identity of the certificate with the person who has write access to the foaf file http://romeo.net/. So by proving that it has access to the private key of the certificate sent to the server, the User Agent has also proven that it is the person described by the foaf file.
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://bblfish.net/people/henry/card#me> foaf:knows <http://romeo.net/#romeo> .
As a result of the policy of allowing all friends of Juliette's friends to be able to read the location document, the server sends out a document containing relations such as the following:
@prefix contact: <http://www.w3.org/2000/10/swap/pim/contact#> .
@prefix : <http://juliette.org/#> .
:juliette
contact:location [
contact:address [ contact:city "Paris";
contact:country "France";
contact:street "1 Champs Elysees" ]
] .
To give everyone a chance to try out the So(m)mer Address Book, I have made it available via Java Web Start: just click on the picture to the right, and try it out.
The Address Book is currently demoware: it shows how one can build virally an open distributed social network client that solves the social network data silo problem (video). No need to have an account on every social networking site on which you have friends, and so maintain your data on each one. You can simply belong to one network and link to all your friends wherever they are. With one click of a button you can publish your social network to your own web server, using ftp, scp, WebDAV, or even Atom. You can then link to other people who have (or not in fact), a foaf file. By pressing the space bar when selecting a friend, the Address Book with then GET their file. So you can browse your social network.
To get going you can explore my social network by dragging my foaf file icon
onto the first pane of the application.
In BOF-5911 which I will be presenting on Thursday at 7:30pm I will be presenting the social networking problem, demonstrating how the So(m)mer Address Book solves it, and showing in detail how it is build, what the problems are, and what work remains. I will also discuss how this can be used to create global single sign on based on a network of trust.
An improved version of the presentation I gave is now available online with audio as Building Secure, Open and Distributed Social Network Applications
The upcoming semantic conference in San Jose, is getting going tomorrow, with an excellent list of speakers and subjects. Here are some highlights of the sessions relating to topics on which I blog regularly.
Many more interesting talks will make sure I will spend another packed week. The full program is available online.
My presentation is now available online with audio as part of the longer Building Secure, Open and Distributed Social Network Applications
I am in the Bay Area about to start my third week of conference/workshops with the combined themes of Java, identity, semantic web, and data portability.
The first week at JavaOne went very well. The Semantic Web Panel attracted way over 500 people by my guesstimate (no official figure yet), and Dean Allemang's talk "Semantic Web for the working Ontologist", that took place on the last day attracted well over three hundred attendees. My BOF, happened late at night at the same time as a big party, and only attracted 30 or so attendees. But on the whole JavaOne proved a great success.
Speaking to members of the liberty group at Sun, I discovered the existence of the Internet Identity Workshop in Mountain View, and decided this would be a good opportunity to learn more about this space. This was a very good use of my time, as it helped me get more familiar with many of the problems and technologies in this space. I put forward some of the ideas I had been discussing here relating the semantic web and distributed web of trust ideas using OpenId and foaf+ssl, which seemed to hold up quite well under the close scrutiny of the community. A few fun conversations with Eve Maler (aka xmlgrrl) on the relations between the semantic web and XML nicely spiced up the evenings :-)
That workshop was closely followed by a one day Data Sharing Summit, addressing issues raised by the Data Portability group, which I have been following relatively closely. This one day session was very helpful for my understanding of the types of problems that need solving. An ontology for what can be done with information in a foaf file would indeed be very helpful. This would have to allow one to specify in simple terms what relations could be republished or which ones should not be.
So next on the list is the Semantic Technology Conference in San Jose, which will bring all these threads together. For more on that see see my post on the Semantic Tech highlights.
The presentation I was giving is now available online with audio as Building Secure, Open and Distributed Social Network Applications
The World Bank, whose mission is to end world poverty by providing financial and technical assistance to developing countries around the world, is now offering an API to developers (our World Bank API profile).The bank considers the development of its API a component of this mission:
We are releasing this API because we believe this information can be mapped, visualized and mashed up in an unlimited number of ways that will help develop a better understanding of trends and patterns around key development issues.
The API offers programmable access to key World Bank data sources:
In total, the API offers 114 indicators. Jon Udell catalogs the list of indicators in his recent post “World Bank data now available through APIs” (there’s also an interesting thread of discussion about REST API design and how the World Bank API could be more RESTful from Stefan Tilkov and Jon Udell).
The API is REST based with output data available in XML and JSON formats. Authentication is via API key. The API’s documentation includes pages that describe the details of each of the current 11 API methods. An API tutorial that walks you through configuring and applying the API is also available. We have more details on the API in our directory.
The development of the World Bank API reflects the growing awareness that making information databases available to developers has benefits for organizations of all kinds, including organizations with global humanitarian missions, like the World Bank.
Serendipity brought me a copy of this article on Jock Gill’s vision of small-scale grass farming operations. He thinks they’ll be able to produce biomass fuel, in a sustainable and decentralized way, for local production of heat and power. We had a long talk about this, and related themes, which will appear in two upcoming episodes of my Innovators show. Meanwhile, this paragraph from the article keeps echoing in my head:
He said a high school student in Morrisville has completed a successful prototype of a green wood chip combustion unit that can produce 50,000 BTUs of heat per hour. Gill said the student is confident his system could also burn dry grass tablets.
Good old Yankee ingenuity, in other words, hasn’t yet run its course. As we reconfigure our energy systems, that latent talent will flourish again.

Youtube gives a way to insert a video in your pages. You can select a few options and the system gives you a piece of html code to insert in your Web page.
<object width="425" height="349">
<param name="movie" value="http://www.youtube.com/v/ZuNNhOEzJGA&hl=fr&fs=1&rel=0&color1=0x006699&color2=0x54abd6&border=1"></param>
<param name="allowFullScreen" value="true"></param>
<embed src="http://www.youtube.com/v/ZuNNhOEzJGA&hl=fr&fs=1&rel=0&color1=0x006699&color2=0x54abd6&border=1" type="application/x-shockwave-flash" allowfullscreen="true" width="425" height="349"></embed>
</object>
embed is an element which is part of HTML 5 Working Draft but not part of XHTML 1.0 or XHTML 1.1. The embed element in this example is a fallback of the object element. It says if the object element is not working, use the embed element. So I decided to just cut the embed element in the XHTML 1.1 page.
<object width="425" height="349">
<param name="movie" value="http://www.youtube.com/v/ZuNNhOEzJGA&hl=fr&fs=1&rel=0&color1=0x006699&color2=0x54abd6&border=1"></param>
<param name="allowFullScreen" value="true"></param>
</object>
The code stopped working. The video was not displayed at all in the page. It probably means that the object element has no effect at all and embed is always triggered. So I started to explore what was missing. First, the param element is an empty element, so there is no need for a closing element.
<object width="425" height="349">
<param name="movie" value="http://www.youtube.com/v/ZuNNhOEzJGA&hl=fr&fs=1&rel=0&color1=0x006699&color2=0x54abd6&border=1"/>
<param name="allowFullScreen" value="true">
</object>
Then I moved the information in the param element to the object element. And finally I added a textual information about the content of the video in case the video would not work properly.
<object width="425" height="349"
type="application/x-shockwave-flash"
data="http://www.youtube.com/v/ZuNNhOEzJGA&hl=fr&fs=1&rel=0&color1=0x006699&color2=0x54abd6&border=1">
<p>Interview of Philippe Le Hégaret about Video codec</p>
</object>
I finally tested it in Camino (Version 1.6.1Int-v2 (1.8.1.14 2008051211)), Opera (9.52, Révision 4916), Firefox (Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; fr; rv:1.9.0.1) Gecko/2008070206 Firefox/3.0.1), and Safari (Version 3.1.2 (5525.20.1)) and it worked well.
A couple of years ago, Joe Gregorio explained Why so many Python web frameworks? and showed how to create your own Web framework with a few lines of code. The most fundamental bricks are packaged in the standard python library.
There are always been many hypothesis about why the Web was successful, all of them can't be verified, because we can't restart the experiment. But retrospectively, it is interesting to look at what has been reused widely. libwww is a library including all modules to create a Web application, such as http, html, etc. It has been initially written by Tim Berners-Lee, and was freely available. People could just take the library and put an interface on top of it to make a browser, a Web server, an indexing bot.
One of the very difficult and technical part of a browser is the layout (or rendering) engine. It takes the Web content and displays it after having processed the style and scripting information. Some of these layout engines are open source such as WebKit, Gecko and KHTML. Others are sometimes sold to third parties for developing specific products. Web developers take them and create new browsers or use them in their applications. For sure it is a tad more complicated than just taking the libraries but it became "easier" to create a browser. It is known as BYOB.
Clojure has been discussed here before. It fits in the Lisp family with S-expressions, macros, and functions as values. Like most Lisps, it's dynamically typed and impure. But its target focus is concurrency so symbols are defined immutably by default; its standard library's collection structures are immutable and persistent; and its various mutable concepts are threadsafe except, of course, for the back doors implied by I/O and JVM library interoperability. See vars, refs, and agents.
What I wanted to highlight is position paper of sorts that Rich Hickey has posted on Clojure's Approach to Identity and State. An excerpt:
While some programs are merely large functions, e.g. compilers or theorem provers, many others are not - they are more like working models, and as such need to support what I'll refer to in this discussion as identity. By identity I mean a stable logical entity associated with a series of different values over time. Models need identity for the same reasons humans need identity - to represent the world.
...
In Clojure's model, value calculation is purely functional. Values never change. New values are functions of old, not mutations. But logical identity is well supported, via atomic references to values (Refs and Agents). Changes to references are controlled/coordinated by the system - i.e. cooperation is not optional and not manual.
Hickey also addresses the choice to not follow Erlang's model.
There are other ways to model identity and state, one of the more popular of which is the message-passing actor model, best exemplified by the quite impressive Erlang. ... It is important to understand that the actor model was designed to address the problems of distributed programs. And the problems of distributed programs are much harder ... Clojure may eventually support the actor model for distributed programming, paying the price only when distribution is required, but I think it is quite cumbersome for same-process programming. YMMV of course.
The essay is worth a read on a couple of levels of interest to LtU. At an abstract level, it's a good example of a well-articulated design justification. Agree or not, it's clear that Hickey gave thought to his decisions. Too many language designers fall into the trap of blindly inheriting semantics from a favorite language and end up putting new lipstick on the same pig. Any language designer would do well to write an essay or two like this before jumping into their venture.
At the specific level, the core approach is certainly worthy of discussion and alternative designs. Is mutable state really the best way to deal with "working models"? Are there things that the pure functional camp can do to make "working models" more convenient, e.g. do Clean's uniqueness types fit the bill at least for sequential programming, or are they too restrictive? Are there things that can make Erlang style actors less cumbersome to use especially when distribution isn't necessary?
But a full-fledged browser. One that behaves, however, as a platform to host applications best tied to cloud computing with built-in local persistence for offline computing. Sure, in its current form Chrome can’t compete with Silverlight or Flex/AIR for what Adobe calls “expressiveness,” meme-speak for rich graphics, animations, integrated video and other visual UI goodies.
Chrome may shut it off for good. It’s possible that various open source Chrome technologies could melt into Safari and Firefox. But –– whether as a stand-alone product or a progenitor of fast, powerful and expressive browsers –– Chrome signals to anybody but the diehard Microsoft constituents that the browser itself, not a proprietary plug-in or a separate runtime, is the future of RIAs. With its huge ecosystem, Microsoft can live with that. At least until its enterprise monopoly seriously erodes. But Adobe cannot.
In a world where the online pie is divided among the .NET army of Microsoft, the browser-gang of Apple+Mozilla+Google, and the lone Adobe, it’s not difficult to predict whose share will shrink into insignificance. If the exclusion of Flash from the iPhone wasn’t a wake-up call for Adobe, Chrome should certainly be one.
Cascading’s logical model abstracts away MapReduce into a convenient tuples, pipes, and taps model. Data is represented as “Tuples”, a named list of objects. For example, I can have a tuple (”url”, “stats”), where “url” is a Hadoop “Text” object and “stats” is my own “UrlStats” complex object, containing methods for getting “numberOfHits” and “averageTimeSpent”. Tuples are kept together in “streams”, and all tuples in a stream have the exact same fields.
An operation on a stream of tuples is called a “Pipe”. There are a few kinds of pipes, each encompassing a category of transformations on a tuple stream. For instance, the “Each” pipe will apply a custom function to each individual tuple. The “GroupBy” pipe will group tuples together by a set of fields, and the “Every” pipe will apply an “aggregator function” to all tuples in a group at once.
One of the most powerful features of Cascading is the ability to fork and merge pipes together.
Once you have constructed your operations into a “pipe assembly”, you then tell Cascading how to retrieve and persist the data using an abstraction called “Tap”. “Taps” know how to convert stored data into Tuples and vice versa, and have complete control over how and where the data is stored. Cascading has a lot of built-in taps - using SequenceFiles and Text formats via HDFS are two examples. If you want to store data in your own format, you can define your own Tap. We have done this here at Rapleaf and it has worked seamlessly.
Cory Doctorow asked Bruce Schneier to give him a hand designing wedding rings. Not an obvious combination until you realise these are crypto rings…
There are two great discussions going on over at both blogs. Cory has asked his crowd to help design a cipher for his crypto wedding rings. While Bruce simply said Contest: Cory Doctorow’s Cipher Wheel Rings.
The discussion on both posts is worth reading. A mixture of things popping up about the similarity between the three rings and the Enigma machine as well as comments about Jefferson’s Wheel Cipher.
Like most things Cory does (or says) there’s an element of the slightly bizarre. The prize, a not to be sniffed-at signed copy of Little Brother.
The full set of photos are on Cory’s Flickr account, tagged weddingring.
Comparisons with the Enigma machine, I suspect, are bogus. While there is a visual similarity with the Enigma’s wheels the Enigma’s cipher was implemented in the electronics within the machine. The letters on the rotors simply enabling the correct starting positions to be selected. The Enigma machines perform a substitution cipher, but with the additional complexity that the substitution pattern changes for each letter through the message. I don’t see a way to do that with these rings. There may be rotor ciphers that could be implemented - I don’t know.
Jefferson’s cipher is a much closer match, a fully manual system consisting of 26 wheels with the alphabet scrambled differently on each one. Similar to the Enigma machine, sender and receiver had to have the order of the wheels synchronised and each letter would use a different substitution scheme, though Jefferson’s not as thorough as the Enigma.
As the rings cannot be altered and the alphabet is in order on all three wheels, any attempt that results in one character of cipher text for each character of plain text will be a simple substitution cipher. While it may take several complex steps to arrive at the cipher character it will only take an attacker one step to go back.
So, if you’re thinking about this problem seriously there are some things you have to decide on first…
This is isn’t an unreasonable assumption (putting aside that the details have been published online). It’s not that long ago that messages were transferred in plain text relying only on the emperor’s seal - made in wax with a ring only he carried.
There are suggestions on the blogs of using most recent blog posts, first pages of known books and other items as keys to drive the cipher. This then involves taking the character from the key and the character from the plaintext and some form of mathematical computation (shifting rings up or down, finding the next dot above or below, that kind of thing) to arrive at the cipher text character.
Knowing Bruce’s views on secrecy and security, even suggesting it is pure heresy. Considering the ring to be secret may be part of this, or may not. Some of the ideas I’ve had fall outside being encryption and really fall into the realm of a ’secret encoding’. But hey, something has to be secret and if it can’t be the ring, or the key, the maybe it has to be the algorithm.
Then, of course, you have to decide what to do with the rings. Any Cryptographic algorithm fulfils one of four basic purposes:
These algorithms use the same key to encrypt and decrypt the text. They may use a single algorithm, like ROT13, or they may use a matched pair of algorithms, like many other substitution ciphers.
These algorithms use one key to encrypt and another to decrypt. The keys in this case are paired and are usually termed public and private keys. Typically you would use the recipients public key to encrypt and they would use their own private key to decrypt.
Used mostly for storing passwords (I can’t think of another use), these algorithms enable you to reliably convert plain text into a hash with little possibility of reversing the process. For passwords this means you store the hash of the password, then compare the hashed version of any sign-in attempt with the stored hash.
Signing means adding some kind of addendum to the message that confirms you wrote it. Again this is done using public/private key pairs. You use your private key to create a hashed version of the message which others can then verify using your public key.
As well as thinking about all of that good stuff it might be worth looking for clues in the design of the rings. Bruce must have had something in mind when designing the rings.
Here are the obvious things to notice:
Less obvious:
Yep, that’s all I spotted :-(
I’ll be chatting with a coupe of colleagues to see if we can put our heads together and also watching to see what the winner comes up with.
psd posted a photo:
Another conference, another attempt at getting a decent headshot of yours-truely.
Over the holidays I accidentally picked up a book by Marie-Louise von Franz, "The interpretation of fairy tales", and could not put it down until I reached the last page. I then ordered five other book of hers. If you have ever found fairy tales interesting but puzzling, you don't know how much you have missed. Seen from Marie-Louise's perspective, one of the closest students of Carl Gustav Jung, each of these are nuggets of deep knowledge of the human soul. Here is my translation of the first paragraph of my French translation of "The feminine in Fairy tales":
At the origin, and until approximately the XVIIth century, fairy tales were not so much meant for children as for the adult population. This situation was kept alive in the rural areas where until a relatively recently, story tellers would animate traditional vigils. Progressively though the development of the rational current and its refusal of the irrational, led these tales to be seen as just absurd old woman's stories, just good enough to amuse the children.
My guess is that a lot of this still holds true today, though a lot of her thoughts must have been integrated in one way or another by now. Once one starts learning to read fairy tales like this - though I keep being surprised at how deep her reading goes, and it will not be an easy task to get even close to it - one can start seeing how this can be applied to other arts such as film. I recently saw Jim Jarmush's "Dead Man" which is full of such interesting symbolism for example.