Ttracker

Web Page Traffic Analytics

Status

2009-08-02

  • Data-gathering working
      - collects the visited page URI, visitor's IP address, referer, user-agent and datetime
      - extracts any data embedded in the visited page: DC, eRDF, OpenID, microformats, RDFa (this happens only once, when the page gets its first hit)
  • Saves to a Talis Platform store (and to a custom Apache2 error log; note that you might need to refresh the log page, and there may be debugging junk in there)
  • Marker script tested on local and remote domains (including this Wiki's template /usr/lib/python2.5/site-packages/Trac-0.11.4-py2.5.egg/trac/templates/theme.html)
  • Confirmed operation in Firefox, Opera and IE
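
The data-gathering step above can be sketched roughly as follows. This is only an illustration, not the actual Ttracker code: the names Hit, seen_pages and record_hit are hypothetical, and the in-memory set stands in for whatever the real script uses to decide that a page is getting its first hit.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Hit:
    """Fields collected for one page hit (hypothetical container)."""
    uri: str
    remote_addr: str
    referer: str
    agent: str
    date: str


# Assumption: pages already seen are remembered in memory;
# a real deployment would need persistent storage.
seen_pages = set()


def record_hit(uri, remote_addr, referer, agent, now=None):
    """Return the hit record and whether this is the page's first hit."""
    now = now or datetime.now(timezone.utc)
    hit = Hit(uri, remote_addr, referer, agent,
              now.strftime("%Y-%m-%dT%H:%M:%SZ"))
    first_hit = uri not in seen_pages  # embedded data is extracted only then
    seen_pages.add(uri)
    return hit, first_hit
```

Only when first_hit is true would the embedded-data extraction need to run for that page.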

Sample of RDF generated for a page hit

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:ns0="http://www.w3.org/2006/http#"
  xmlns:ns1="http://purl.org/stuff/cgi#"
  xmlns:ns2="http://purl.org/stuff/hsh#"
  xmlns:dct="http://purl.org/dc/terms/">

  <rdf:Description rdf:nodeID="request">
    <rdf:type rdf:resource="http://www.w3.org/2006/http#Request"/>
    <ns0:requestURI rdf:resource="http://danny.ayers.name/test.html"/>
    <ns1:remoteAddr>79.9.5.104</ns1:remoteAddr>
    <ns2:referer rdf:resource="http://danny.ayers.name/test.html"/>
    <ns2:agent>Opera/9.64 (X11; Linux x86_64; U; en) Presto/2.1.1</ns2:agent>
    <dct:date>2009-08-02T07:47:36Z</dct:date>
  </rdf:Description>

</rdf:RDF>
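
A record shaped like the sample above could be produced along these lines. This is a minimal sketch using Python's standard xml.etree.ElementTree, not the serializer Ttracker itself uses; the function name hit_to_rdf and the readable prefixes (http, cgi, hsh) in place of ns0/ns1/ns2 are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
HTTP = "http://www.w3.org/2006/http#"
CGI = "http://purl.org/stuff/cgi#"
HSH = "http://purl.org/stuff/hsh#"
DCT = "http://purl.org/dc/terms/"

# Readable prefixes for the output (ElementTree reserves the ns0, ns1, ... form).
for prefix, ns in (("rdf", RDF), ("http", HTTP), ("cgi", CGI),
                   ("hsh", HSH), ("dct", DCT)):
    ET.register_namespace(prefix, ns)


def hit_to_rdf(uri, remote_addr, referer, agent, date):
    """Serialize one page hit as RDF/XML, mirroring the sample's properties."""
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(root, f"{{{RDF}}}Description",
                         {f"{{{RDF}}}nodeID": "request"})
    ET.SubElement(desc, f"{{{RDF}}}type",
                  {f"{{{RDF}}}resource": HTTP + "Request"})
    ET.SubElement(desc, f"{{{HTTP}}}requestURI", {f"{{{RDF}}}resource": uri})
    ET.SubElement(desc, f"{{{CGI}}}remoteAddr").text = remote_addr
    ET.SubElement(desc, f"{{{HSH}}}referer", {f"{{{RDF}}}resource": referer})
    ET.SubElement(desc, f"{{{HSH}}}agent").text = agent
    ET.SubElement(desc, f"{{{DCT}}}date").text = date
    return ET.tostring(root, encoding="unicode")
```

The resulting string is what would then be posted to the store.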

Sample of data extracted from a page

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:ns0="http://purl.org/dc/elements/1.1/">

  <rdf:Description rdf:about="http://hyperdata.org/ttracker/samples/page2.html">
    <ns0:format>text/html; charset=UTF-8</ns0:format>
    <ns0:title>Page Two</ns0:title>
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Document"/>
  </rdf:Description>

</rdf:RDF>

Sample Queries

SPARQL endpoint: http://api.talis.com/stores/danja-dev1/services/sparql

Info about requests:

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
prefix http: <http://www.w3.org/2006/http#> 
prefix ns1: <http://purl.org/stuff/cgi#> 
prefix dct: <http://purl.org/dc/terms/> 

select ?s ?p ?o
where {
   ?s a http:Request .
   ?s ?p ?o .
}

Info about pages:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?s ?p ?o
WHERE {
   ?s a foaf:Document ;
      ?p ?o .
}

Evidence that the most popular browsers are supported:

prefix hsh: <http://purl.org/stuff/hsh#>
prefix h: <http://www.w3.org/2006/http#>

select distinct ?agent
where {
   ?s h:requestURI <http://danny.ayers.name/test.html> .
   ?s hsh:agent ?agent .
}
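
Queries like these could be sent to the endpoint from a script. The sketch below only builds the request URL (the standard SPARQL Protocol query parameter on a GET) and reads user-agent values out of a response in the standard SPARQL JSON results format. The helper names are hypothetical, and it is an assumption that the store returns JSON in that format; no request is actually made here.

```python
import json
from urllib.parse import urlencode

ENDPOINT = "http://api.talis.com/stores/danja-dev1/services/sparql"


def sparql_get_url(query, endpoint=ENDPOINT):
    """Build a GET URL carrying the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


def agents_from_results(results_json):
    """Return the distinct ?agent values from SPARQL JSON results, sorted."""
    data = json.loads(results_json)
    bindings = data["results"]["bindings"]
    return sorted({b["agent"]["value"] for b in bindings if "agent" in b})
```

The URL from sparql_get_url could be fetched with any HTTP client, and the response body passed to agents_from_results to list the browsers seen.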

Next Steps

(write up between steps)

  • Figure out a better way of tracking clients than by IP address, ideally without using cookies
  • Grab more data
  • Simple SPARQL
  • Figure out a caching strategy for data before it is posted to the store
  • Simple reporting via SPARQL plus XML/XSLT and/or JSON/JavaScript

later...

  • blog
  • live deployment (on hyperdata.org for starters)
  • hook up to Piwik or other existing reporting widgets