survey: dataset modification dates compared

This demonstration shows when datasets were last updated, as determined by their HTTP responses. How up-to-date are the listed datasets? We start with a timeline histogram for a single moment in time.
On the afternoons of __ and __, we crawled the links to datasets and recorded the modification dates reported by the source agencies' HTTP servers. Of __ datasets, __ returned modification dates on __ and __ returned modification dates on __. The list of datasets came from the 13 Sept 2010 version of their Dataset 92.

Number of Datasets Modified since __
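The counts behind a chart like this can be sketched in a few lines. This is only an illustration, not the code used for the demo: the function names and the sample dates are hypothetical, and the share is computed over all datasets, including those whose server reported no modification date.

```python
from datetime import date
from typing import Optional, Sequence


def count_modified_since(dates: Sequence[Optional[date]], cutoff: date) -> int:
    """Count datasets whose reported modification date is on or after cutoff.

    Entries of None stand for datasets whose server returned no date.
    """
    return sum(1 for d in dates if d is not None and d >= cutoff)


def share_modified_since(dates: Sequence[Optional[date]], cutoff: date) -> float:
    """Fraction of ALL datasets (dated or not) modified on or after cutoff."""
    if not dates:
        return 0.0
    return count_modified_since(dates, cutoff) / len(dates)


# Made-up sample, for illustration only (not crawl results).
sample = [date(2010, 10, 1), date(2010, 9, 1), None, date(2008, 11, 5)]
print(count_modified_since(sample, date(2010, 9, 1)))
print(share_modified_since(sample, date(2010, 9, 1)))
```

Evaluating the count for a series of cutoff dates gives one point per cutoff, which is the shape of data a "modified since" chart needs.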

Process used

The following diagram shows the process used to obtain the last-modified dates of the datasets.
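As a rough companion to the diagram, the core step (asking a server when a dataset was last modified) might look like the sketch below. It is an assumption-laden illustration rather than the crawler actually used: the function names are hypothetical, it relies only on the Python standard library, and it records nothing about provenance.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional
from urllib.request import Request, urlopen


def parse_last_modified(header_value: Optional[str]) -> Optional[datetime]:
    """Parse an HTTP Last-Modified header value into a UTC datetime.

    Returns None when the header is missing or cannot be parsed.
    """
    if not header_value:
        return None
    try:
        parsed = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
    if parsed.tzinfo is None:
        # Treat a date without a usable zone as UTC (an assumption).
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def fetch_last_modified(url: str, timeout: float = 10.0) -> Optional[datetime]:
    """Send a HEAD request and return the reported modification date, if any.

    urlopen follows HTTP redirects, so the header comes from the final URL.
    """
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return parse_last_modified(response.headers.get("Last-Modified"))
```

Calling `fetch_last_modified` on each dataset URL and storing the result alongside the crawl time would give the raw material for comparing crawls. Servers that omit the Last-Modified header simply yield None.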

Use of provenance

The following diagram illustrates how the Proof Markup Language was used to record the HTTP headers obtained from the government agency servers. A portion of this data graph is selected by the SPARQL query listed above (the headers are excluded). The red links show the HTTP redirects and HTML anchor hrefs from the details page to the actual dataset URL on the hosting agency's domain. The green boxes represent URIs for the information obtained (HTTP headers), while the purple boxes associate the information received with its source and when it was received. For more discussion on how provenance is used in LOGD and csv2rdf4lod, see "A look at how csv2rdf4lod incorporates provenance into its tabular conversions".


An interesting pattern is emerging...

On another channel, Tim observed that "an interesting pattern is emerging..." I assume the pattern he is referring to is:

  • A very large percentage of the 3,335 datasets continue to be modified, and
  • From 10-02 to 10-10 this percentage has increased from 70.1% to 78.8% of all datasets

The question is, what does this *really* mean?


Interesting pattern 28 Sept vs 10 Oct

The sharp peak at the ends shows that 200-350 datasets are updated every couple of days. The large jump in the middle of August 2010 is not a temporal lag, because it remains in both crawls, which are 3 weeks apart.

I don't trust the 22 Sept crawl

Because of the big jump in Nov 2008 that does not appear in any of the other crawls up to 10 Oct 2010.