This demonstration shows when data.gov datasets were last updated, as reported in the HTTP responses of the servers that host them. How up-to-date are the datasets listed in data.gov? We start with a timeline histogram for a single moment in time.
On the afternoons of __ and __, we crawled data.gov's links to datasets and recorded the modification dates reported by the source agencies' HTTP servers. Of __ data.gov datasets, __ datasets returned modification dates on __ and __ returned modification dates on __. The list of data.gov datasets came from the 13 Sept 2010 version of data.gov's Dataset 92.
Number of Datasets Modified since __
Process used
The following diagram shows the process used to obtain the last-modified dates of the data.gov datasets.
3 - The data.gov details page is requested and scraped to obtain references to data sources. Redirects are followed, and the headers returned by each HTTP HEAD request are recorded in RDF using Turtle syntax and terminology from the Proof Markup Language ontology.
6 - This JavaScript is included within this page to query the LOGD SPARQL endpoint for dataset modification dates (using this query and these nightly-cached results). It calculates the histogram and uses Google Visualization's Annotated Timeline to render the results. A few notes are hard-coded to provide a frame of reference.
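The histogram calculation in step 6 can be sketched in a few lines. This is a Python illustration, not the JavaScript used on this page. It assumes the modification dates arrive as HTTP-date strings (the format of a Last-Modified header) and that the chart counts, for a given date, the datasets modified on or after it. The function names parse_last_modified, daily_counts, and modified_since are hypothetical, and the sample values are made up.

```python
from collections import Counter
from datetime import date
from email.utils import parsedate_to_datetime


def parse_last_modified(value):
    """Return the calendar date in an HTTP-date string, or None if it cannot be parsed."""
    try:
        return parsedate_to_datetime(value).date()
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None


def daily_counts(header_values):
    """Count datasets per reported modification day, skipping unparseable values."""
    days = (parse_last_modified(v) for v in header_values)
    return Counter(d for d in days if d is not None)


def modified_since(counts, cutoff):
    """Number of datasets whose reported modification day is on or after cutoff."""
    return sum(n for day, n in counts.items() if day >= cutoff)


# Made-up sample values, not results from the crawl.
sample = [
    "Mon, 13 Sep 2010 14:30:00 GMT",
    "Mon, 13 Sep 2010 09:00:00 GMT",
    "Sun, 15 Aug 2010 00:00:00 GMT",
    "",  # a server that sent no usable Last-Modified value
]
counts = daily_counts(sample)
for day in sorted(counts):
    print(day, counts[day], modified_since(counts, day))
print(modified_since(counts, date(2010, 9, 1)))
```

Keeping the per-day counts separate from the "modified since" total makes it possible to plot either the raw histogram or the cumulative curve from the same data.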
Use of provenance
The following diagram illustrates how the Proof Markup Language was used to record the HTTP headers obtained from the government agency servers. A portion of this data graph is selected by the SPARQL query listed above (the headers are excluded). The red links show the HTTP redirects and HTML anchor hrefs from the data.gov details page to the actual dataset URL on the hosting agency's domain. The green boxes represent URIs for the information obtained (HTTP headers), while the purple boxes associate the information received with its source and when it was received. For more discussion on how provenance is used in LOGD and csv2rdf4lod, see "A look at how csv2rdf4lod incorporates provenance into its tabular conversions."
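The redirect chain that the provenance graph records can be illustrated with a small crawler. The Python sketch below is not the crawler used for this demonstration: it covers only the redirect-following and header-recording part (not the scraping of the details page), and it stores each hop as a plain dictionary with a retrieval time rather than as RDF in Turtle with Proof Markup Language terms. The names http_head and crawl_head_chain, and the example.org URLs in the usage note, are hypothetical.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so every hop can be recorded."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # urllib then raises HTTPError for the 3xx response


def http_head(url):
    """Issue one HEAD request; return (status, headers with lower-cased names)."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=30) as response:
            return response.status, {k.lower(): v for k, v in response.headers.items()}
    except urllib.error.HTTPError as err:
        # 3xx (and 4xx/5xx) responses arrive here; their headers are still useful.
        return err.code, {k.lower(): v for k, v in err.headers.items()}


def crawl_head_chain(url, head=http_head, max_hops=10):
    """Follow redirects from url, recording what each server said and when."""
    hops = []
    for _ in range(max_hops):
        status, headers = head(url)
        hops.append({
            "url": url,
            "status": status,
            "headers": headers,
            "retrieved": datetime.now(timezone.utc).isoformat(),
        })
        location = headers.get("location")
        if status in REDIRECT_CODES and location:
            url = urljoin(url, location)  # Location may be a relative reference
        else:
            break
    return hops
```

Calling crawl_head_chain("http://example.org/some-dataset") would return one record per hop; the last record's "last-modified" header, when the server sends one, is the modification date of interest. Passing a different head function makes it possible to replay recorded responses without touching the network.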
An interesting pattern is emerging...
On another channel Tim observed that, "an interesting pattern is emerging..." I assume the pattern he is referring to is...
The question is, what does this *really* mean?
John
Interesting pattern 28 Sept vs 10 Oct
The sharp peak at the ends shows that 200-350 datasets are updated every couple of days. The large jump in the middle of August 2010 is not a temporal lag, because it appears in both crawls, which were taken 3 weeks apart.
I don't trust the 22 Sept crawl
It shows a big jump in Nov 2008 that does not appear in any of the other crawls up to 10 Oct 2010.