Ttracker
Web Page Traffic Analytics
Status
2009-08-02
- Data-gathering working
- - collects visited page URI, visitor's IP address, referer, user-agent, datetime
- - extracts any data embedded in the visited page: dc, erdf, openid, microformats, rdfa (this happens once only, when the page gets its first hit)
- Saves to a Talis Platform store (as well as a custom Apache2 error log - note you might need to refresh the log page, and there may be debugging junk in there)
- Code commented; explanation started at TtrackerHowItWorks (see also Notes on Cross-Domain Ajax)
- CGI Environment Variables vocab created
- Marker script tested on local and remote domains (including this Wiki's template /usr/lib/python2.5/site-packages/Trac-0.11.4-py2.5.egg/trac/templates/theme.html )
- Confirmed operation for Firefox, Opera and IE
- demo, source (demo is in samples dir), SPARQL endpoint
- Key pages used during research are tagged ttracker on del.icio.us
Sample of RDF generated for a page hit
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ns0="http://www.w3.org/2006/http#"
xmlns:ns1="http://purl.org/stuff/cgi#"
xmlns:ns2="http://purl.org/stuff/hsh#"
xmlns:dct="http://purl.org/dc/terms/">
<rdf:Description rdf:nodeID="request">
<rdf:type rdf:resource="http://www.w3.org/2006/http#Request"/>
<ns0:requestURI rdf:resource="http://danny.ayers.name/test.html"/>
<ns1:remoteAddr>79.9.5.104</ns1:remoteAddr>
<ns2:referer rdf:resource="http://danny.ayers.name/test.html"/>
<ns2:agent>Opera/9.64 (X11; Linux x86_64; U; en) Presto/2.1.1</ns2:agent>
<dct:date>2009-08-02T07:47:36Z</dct:date>
</rdf:Description>
</rdf:RDF>
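As a rough illustration of how a record like the one above could be assembled, here is a minimal Python sketch using only the standard library. It is not Ttracker's own code: the function name hit_to_rdf and the prefix names http, cgi and hsh are assumptions made for this example; only the namespace URIs and property names are taken from the sample.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as they appear in the sample above.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
HTTP = "http://www.w3.org/2006/http#"
CGI = "http://purl.org/stuff/cgi#"
HSH = "http://purl.org/stuff/hsh#"
DCT = "http://purl.org/dc/terms/"


def hit_to_rdf(request_uri, remote_addr, referer, agent, date):
    """Serialize one page hit as RDF/XML (hypothetical helper for illustration)."""
    # Readable prefixes instead of the auto-generated ns0, ns1, ...
    for prefix, ns in (("rdf", RDF), ("http", HTTP), ("cgi", CGI),
                       ("hsh", HSH), ("dct", DCT)):
        ET.register_namespace(prefix, ns)

    root = ET.Element("{%s}RDF" % RDF)
    desc = ET.SubElement(root, "{%s}Description" % RDF,
                         {"{%s}nodeID" % RDF: "request"})
    ET.SubElement(desc, "{%s}type" % RDF,
                  {"{%s}resource" % RDF: HTTP + "Request"})
    ET.SubElement(desc, "{%s}requestURI" % HTTP,
                  {"{%s}resource" % RDF: request_uri})
    ET.SubElement(desc, "{%s}remoteAddr" % CGI).text = remote_addr
    ET.SubElement(desc, "{%s}referer" % HSH,
                  {"{%s}resource" % RDF: referer})
    ET.SubElement(desc, "{%s}agent" % HSH).text = agent
    ET.SubElement(desc, "{%s}date" % DCT).text = date
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(hit_to_rdf("http://danny.ayers.name/test.html", "79.9.5.104",
                     "http://danny.ayers.name/test.html",
                     "Opera/9.64 (X11; Linux x86_64; U; en) Presto/2.1.1",
                     "2009-08-02T07:47:36Z"))
```

The output carries the same triples as the sample, though prefixes and attribute order may differ from what the real tracker emits.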
Sample of data extracted from a page
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:ns0="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://hyperdata.org/ttracker/samples/page2.html">
<ns0:format>text/html; charset=UTF-8</ns0:format>
<ns0:title>Page Two</ns0:title>
<rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Document"/>
</rdf:Description>
</rdf:RDF>
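For a feel of what the extraction step involves, the sketch below pulls a title and a content type out of an HTML page with Python's standard html.parser. It is a simplified stand-in, not the extractor Ttracker uses: it assumes the format comes from a Content-Type meta element, and it ignores erdf, openid, microformats and rdfa entirely. PageMetadataParser and extract_metadata are names invented for this example.

```python
from html.parser import HTMLParser


class PageMetadataParser(HTMLParser):
    """Collect the <title> text and the Content-Type <meta> value of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.format = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("http-equiv") or "").lower() == "content-type":
            self.format = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def extract_metadata(html_text):
    """Return a dict with the page title and declared format (hypothetical helper)."""
    parser = PageMetadataParser()
    parser.feed(html_text)
    parser.close()
    return {"title": parser.title.strip(), "format": parser.format}
```

The two values map onto the dc:title and dc:format properties shown in the sample; turning them into RDF is a separate step.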
Sample Queries
SPARQL endpoint: http://api.talis.com/stores/danja-dev1/services/sparql
Info about requests:
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix http: <http://www.w3.org/2006/http#>
prefix ns1: <http://purl.org/stuff/cgi#>
prefix dct: <http://purl.org/dc/terms/>
select ?s ?p ?o
where {
?s a http:Request .
?s ?p ?o .
}
Info about pages:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?s ?p ?o
WHERE {
?s a foaf:Document ;
?p ?o .
}
Evidence that the most popular browsers are supported:
prefix hsh: <http://purl.org/stuff/hsh#>
prefix h: <http://www.w3.org/2006/http#>
select distinct ?agent
where {
?s h:requestURI <http://danny.ayers.name/test.html> .
?s hsh:agent ?agent .
}
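The queries above can be sent to the endpoint over HTTP. The standard SPARQL Protocol passes the query text in a query parameter of a GET request; the Python sketch below only builds that URL. The helper name sparql_get_url is an assumption for this example, and whether the store is reachable, and which result formats it returns, is not something this sketch can confirm.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint as listed in this page.
ENDPOINT = "http://api.talis.com/stores/danja-dev1/services/sparql"

# The browser-support query from above, selecting only the bound variable.
AGENTS_QUERY = """prefix hsh: <http://purl.org/stuff/hsh#>
prefix h: <http://www.w3.org/2006/http#>
select distinct ?agent
where {
?s h:requestURI <http://danny.ayers.name/test.html> .
?s hsh:agent ?agent .
}"""


def sparql_get_url(query, endpoint=ENDPOINT):
    """Build a SPARQL Protocol GET URL for the given query (hypothetical helper)."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    print(sparql_get_url(AGENTS_QUERY))
```

The resulting URL can be opened with urllib.request.urlopen or pasted into a browser; sending an Accept header for the desired result format is usually advisable.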
Next Steps
(write up between steps)
- Figure out a better way of tracking clients than by IP address, ideally without using cookies
- Grab more data
- Simple SPARQL
- Figure out a strategy for caching data before it is posted to the store
- Simple reporting via SPARQL plus XML/XSLT and/or JSON/Javascript
later...
- blog
- live deployment (on hyperdata.org for starters)
- hook up to Piwik or other existing reporting widgets