International Open Government Data Search (IOGDS) aims to discover, document, and analyze data catalogs published by governments all over the world. We at the Tetherless World Constellation (TWC) are investigating these open government datasets using semantic web technologies. The datasets are published in a variety of formats, for example CSV and Excel. They are translated to the Resource Description Framework (RDF) by a variety of in-house toolkits that place them on the Linked Data cloud. So far we have accumulated 192 catalogs from 43 countries in about 24 different languages. This list keeps growing as governments around the world increasingly publish data for public consumption. These catalogs can be queried from a SPARQL endpoint here.
Distribution of Languages present in IOGDS : The different languages present in IOGDS are shown below. English is by far the dominant language, as much of the open government data comes from English-speaking countries. It is interesting to see that countries in Asia, Europe, and Africa are starting to publish catalogs in their own native languages. IOGDS is thus slowly becoming a one-stop shop for the world's open government data. A simple word cloud captures the diversity of languages in IOGDS.
We found a total of 24 languages represented across all catalogs in the IOGDS collection. English is by far the most prominent language (98 catalogs / 652,176 datasets), since most of the open government data published through late 2012 came from English-speaking countries. Other notable languages include French (19 catalogs / 528,153 datasets), Spanish (18 catalogs / 8,444 datasets), Italian (14 catalogs / 2,256 datasets), and German (12 catalogs / 1,584 datasets). The remaining 31 catalogs represent 19 further languages. This is captured by the bar chart below:
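As a quick arithmetic check on the catalog counts quoted above, the sketch below adds them up and turns them into percentage shares. The counts are taken from the text. The dictionary layout and the helper name `catalog_share` are illustrative only and are not part of any IOGDS tooling.

```python
# Catalog counts per language, as quoted in the text above.
catalogs_by_language = {
    "English": 98,
    "French": 19,
    "Spanish": 18,
    "Italian": 14,
    "German": 12,
    "Other (19 languages)": 31,
}

# Sum of all per-language catalog counts.
total_catalogs = sum(catalogs_by_language.values())


def catalog_share(language: str) -> float:
    """Return the percentage of all catalogs published in `language`."""
    return 100.0 * catalogs_by_language[language] / total_catalogs


if __name__ == "__main__":
    print(f"Total catalogs: {total_catalogs}")
    for language, count in catalogs_by_language.items():
        print(f"{language:22s} {count:4d}  {catalog_share(language):5.1f}%")
```

The per-language counts sum to 192, which agrees with the number of catalogs stated in the introduction, and English accounts for roughly half of them.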
Catalogs published by each country : The United States currently leads in publishing open data with 453,859 datasets, followed by France with 353,394 datasets, Canada with 179,131 datasets, the United Kingdom with 12,131 datasets, and Spain with 10,076 datasets.
"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay." - Sherlock Holmes (The Adventure of the Copper Beeches)
Although the method for publishing datasets has not yet been standardized, the metadata of the datasets carries a lot of information. Metadata is data about data. It has been very useful in giving us insight into what kind of data the different government catalogs hold: the datasets that governments most commonly publish, the agencies with the highest number of datasets, and the categories that are common across countries.
Why Metadata : Datasets crawled by IOGDS come in various file formats. Some can be crawled and downloaded, while others are not easily accessible by automated methods. Since data publication efforts are not standardized, it is difficult to understand what the datasets contain. Most datasets, however, have metadata, i.e. data that explains what the dataset is about through a short description or keywords. Even a rudimentary textual analysis of the metadata raises many questions and gives us an idea of the agencies that publish data and the topics that occur most frequently. Metadata analysis also helps us identify the areas that governments cover well and the areas that have very little coverage. We extend the analysis to see what kind of data each catalog publishes, so that a single large catalog does not swamp the overall picture.
Our analysis also reveals how different datasets refer to the same topic. For example, a dataset from the United States would be categorized as “Defense”, whereas a comparable dataset from the United Kingdom would be categorized as “Defence”. This clearly shows the need for a standardized way of publishing data across the world. One of our tasks moving forward is to answer queries on “what is out there”, “where is it located” and “who publishes it”. Since automatically reading the structured data itself is computationally expensive, it is the metadata that gives us a clue as to how these questions can be answered. We also analyze what different countries publish the most and how topics are distributed across geographic boundaries, which makes an interesting case for cross-country dataset comparison.
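The “Defense” / “Defence” mismatch described above can be illustrated with a small normalization step that maps spelling variants to one label before grouping. This is only a sketch under stated assumptions: the variant table `CANONICAL`, the helper names, and the sample records are hypothetical and do not describe how IOGDS itself reconciles categories.

```python
from collections import defaultdict

# Hypothetical table mapping spelling variants to one canonical label.
CANONICAL = {
    "defence": "defense",
}


def normalize_label(label: str) -> str:
    """Lowercase, collapse whitespace, and map known variants to one label."""
    key = " ".join(label.strip().lower().split())
    return CANONICAL.get(key, key)


def group_by_topic(records):
    """Group (country, category) pairs by normalized category.

    Returns a dict mapping each topic to a sorted list of countries.
    """
    groups = defaultdict(set)
    for country, category in records:
        groups[normalize_label(category)].add(country)
    return {topic: sorted(countries) for topic, countries in groups.items()}


# Made-up sample records, for illustration only.
sample = [
    ("United States", "Defense"),
    ("United Kingdom", "Defence"),
    ("United Kingdom", "Health "),
]
grouped = group_by_topic(sample)

if __name__ == "__main__":
    for topic, countries in grouped.items():
        print(topic, "->", ", ".join(countries))
```

A hand-maintained variant table like this only covers spellings someone has already noticed, which is one reason a shared publishing standard would be preferable to after-the-fact cleanup.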
What's out there in the catalogs? : IOGDS captures all the metadata present in the datasets and catalogs. To identify trends in the data we use the keywords and topic categories from these catalogs. Each catalog has a set of distinct keywords that we can use to understand the diversity of its content. This kind of keyword analysis also shows how one specific category can swamp an entire catalog. For example, we show how the keyword frequencies in Data.gov are affected when "geodata", i.e. geographic datasets (41,348 datasets), is included. This is shown in the word cloud below:
The top 150 keywords in the catalog Data.gov (with geodata)
Keyword analysis of Data.gov with geodata catalogs excluded (reducing the count to 4,826 datasets) gives us a more representative picture of the diversity of data, as depicted by the word cloud below.
Top keywords in Data.gov (non-geodata)
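The with-geodata and without-geodata comparison above amounts to counting keyword frequencies twice, once over all datasets and once with one category filtered out. The sketch below shows that idea on made-up records. The `category` and `keywords` field names and the sample data are assumptions for illustration, not the actual IOGDS metadata schema or the pipeline used to produce the word clouds.

```python
from collections import Counter


def top_keywords(datasets, n=150, exclude_categories=()):
    """Return the n most frequent keywords as (keyword, count) pairs.

    Datasets whose category is listed in `exclude_categories` are skipped.
    Each keyword is counted at most once per dataset.
    """
    excluded = {c.lower() for c in exclude_categories}
    counts = Counter()
    for ds in datasets:
        if ds.get("category", "").lower() in excluded:
            continue
        # A set removes repeated keywords within one dataset.
        counts.update({kw.strip().lower() for kw in ds.get("keywords", []) if kw.strip()})
    return counts.most_common(n)


# Hypothetical sample in which geodata records dominate the catalog.
sample = [
    {"category": "geodata", "keywords": ["boundary", "census", "tiger"]},
    {"category": "geodata", "keywords": ["boundary", "census"]},
    {"category": "geodata", "keywords": ["boundary", "roads"]},
    {"category": "health", "keywords": ["medicare", "hospital"]},
    {"category": "education", "keywords": ["schools", "census"]},
    {"category": "health", "keywords": ["hospital"]},
]

with_geo = top_keywords(sample, n=3)
without_geo = top_keywords(sample, n=3, exclude_categories=["geodata"])

if __name__ == "__main__":
    print("with geodata:   ", with_geo)
    print("without geodata:", without_geo)
```

In the sample, geographic keywords crowd out everything else until the geodata records are excluded, which mirrors the effect seen in the two Data.gov word clouds.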
Catalogs like New York City's nycopendata.socrata.com give an idea of the priorities of city governments and their stakeholders. For example, the word cloud below displays the top keywords from 859 datasets published by New York City [http://nycopendata.socrata.com]. Clearly, NYC has placed heavy emphasis on publishing datasets that contain location data and datasets relevant to education and community services.
The top keywords from New York City (Nycopendata.socrata.com)
The Tetherless World Constellation has also been analyzing datasets from New York State's Open Data Portal, data.ny.gov. The datasets are available in a variety of formats here. A preliminary analysis of the metadata is shown below:
Top keywords from all the datasets hosted at data.ny.gov
The top 1000 keywords from the datasets are shown below:
Top keywords from data.ny.gov
The different categories published by New York State are shown below:
Top Categories published by New York State
The word cloud below shows the top keywords from San Francisco's [http://data.sf.org] catalogs.
Top keywords from San Francisco (data.sf.org)
IOGDS has enabled us to study the publication tendencies of agencies within governments, as well as international organizations. We present some of them below:
Keywords from Data.Nasa.gov
The word cloud below displays the top keywords from the 1405 datasets published by the US government's Medicare program [http://data.medicare.gov]:
Top keywords from Data.medicare.gov
Keywords from other countries : As mentioned above, governments across the world are gradually gearing up to publish data. While we are extending our capabilities to handle multilingual text, we present some word clouds of the keyword distribution in different countries.
Keywords from United Kingdom's Catalog Data.gov.uk
Keywords from Canada's catalog Data.edmonton.ca
Keywords from Australia's data.gov.au
Categories in the datasets : In addition to keywords, the metadata in IOGDS also includes categories. While keywords describe individual datasets, categories are shared by a collection of datasets. The distribution of categories gives us insight into the top categories that governments across the world choose to publish. We present a category-wise analysis for each government's published data, as well as the categories common across governments.
All the top categories published by catalogs across the United States
All the top categories published by the United Kingdom
Analyzing dataset Categories: Analysis of the distribution of categories across countries gives us some insight into the priorities of governments as they publish their data. In the following table we use categories to summarize the publication similarities and differences between selected governments.
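A comparison of this kind can be approximated with plain set operations: the categories every government publishes, and the categories only one government publishes. The following is a minimal sketch. The function name and the category lists are invented for illustration and are not taken from the IOGDS results.

```python
def compare_categories(categories_by_government):
    """Return (shared, unique) for a dict of government -> category list.

    shared: sorted categories present for every government.
    unique: dict of government -> sorted categories no other government has.
    Labels are compared case-insensitively after trimming whitespace.
    """
    normalized = {
        gov: {c.strip().lower() for c in cats}
        for gov, cats in categories_by_government.items()
    }
    shared = set.intersection(*normalized.values()) if normalized else set()
    unique = {}
    for gov, cats in normalized.items():
        # Union of every other government's categories.
        others = set().union(*(c for g, c in normalized.items() if g != gov))
        unique[gov] = sorted(cats - others)
    return sorted(shared), unique


# Hypothetical category lists, for illustration only.
sample = {
    "United States": ["Health", "Education", "Energy"],
    "United Kingdom": ["Health", "Transport", "Education"],
    "Australia": ["Health", "Environment"],
}
shared, unique = compare_categories(sample)

if __name__ == "__main__":
    print("shared:", shared)
    for gov, cats in unique.items():
        print(f"only {gov}:", cats)
```

Exact string matching will miss spelling variants and near-synonyms, so in practice the labels would need some reconciliation before a comparison like this is meaningful.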
Metadata Design and Dataset search: TWC's IOGDS vocabulary is a reference implementation of the W3C Data Catalog Vocabulary (DCAT). Site-specific metadata scraped as CSV is converted to RDF by csv2rdf4lod, developed by TWC PhD student Tim Lebo. The metadata schema for IOGDS is available here. We also provide a faceted search for the cataloged datasets, which can be accessed here.
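To give a rough feel for the CSV-to-RDF step, the sketch below turns a few CSV metadata rows into N-Triples using terms from the W3C DCAT and Dublin Core vocabularies (`dcat:Dataset`, `dcat:keyword`, `dct:title`). It is not csv2rdf4lod and does not reproduce the IOGDS schema. The column names, the `http://example.org/` base URI, and the helper functions are assumptions made for the example.

```python
import csv
import io

DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def literal(text):
    """Format a plain string as an N-Triples literal with basic escaping."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def rows_to_ntriples(csv_text, base_uri):
    """Convert CSV rows (columns: id, title, keywords) to N-Triples lines.

    Assumes the `id` values are already safe to append to a URI and that
    keywords are separated by semicolons.
    """
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{base_uri}{row['id']}>"
        triples.append(f"{subject} <{RDF_TYPE}> <{DCAT}Dataset> .")
        triples.append(f"{subject} <{DCT}title> {literal(row['title'])} .")
        for kw in row["keywords"].split(";"):
            kw = kw.strip()
            if kw:
                triples.append(f"{subject} <{DCAT}keyword> {literal(kw)} .")
    return triples


# Hypothetical scraped metadata, for illustration only.
SAMPLE_CSV = """id,title,keywords
ds1,Hospital Compare,medicare; hospital
ds2,School Locations,education
"""
triples = rows_to_ntriples(SAMPLE_CSV, "http://example.org/dataset/")

if __name__ == "__main__":
    print("\n".join(triples))
```

Once the metadata is expressed as triples like these, it can be loaded into a triple store and queried with SPARQL alongside the other catalogs.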
More Analytics : Scroll down to see the word clouds for the catalogs of the different countries that are in the Linked Data cloud.
|Top keywords from datasets in the city of Chicago from Data.cityofchicago.org||128.35 KB |
|Top keywords from New York City ||181.59 KB |
|Top keywords from Canada's Data.gc.ca||45.13 KB |
|Top keywords from Kenya ||165 KB |
|Top keywords from Dublin(Ireland).png||731.99 KB |
|Top keywords from Civicapps.org(Portland,United States)||112.29 KB |
|Top keywords from Baltimore||183.93 KB |
|Top keywords from California||247.34 KB |
|Top keywords from Illinois||93.21 KB |
|Top keywords from Data.medicare.gov(USA)||133.56 KB |
|Top keywords from Data.Nasa.gov||158.21 KB |
|Top keywords from Data.nola.gov(NewOrleans,USA)||59.12 KB |
|Top keywords from Data.ok.gov(Oklahoma,USA)||148.02 KB |
|Top keywords from Data.oregon.gov(Oregon,USA)||185.35 KB |
|Top keywords from Data.seattle.gov(Seattle,USA)||294.43 KB |
|Top keywords from Data.wa.gov(Washington,USA)||33.2 KB |
|Top keywords from Datacatalog.cookcountyil.gov(CookCounty,USA)||217.38 KB |
|Top keywords from Datakc.org(KingCounty,Washington,USA)||106.47 KB |
|Top keywords from San Francisco(datasforg)||245.23 KB |
|Top keywords from Data.gov.uk(United Kingdom)||44.21 KB |
|Top keywords from Data.london.gov.uk(London,UK)||81.93 KB |
|Top keywords from Datagm.org.uk(GreaterManchester,UK)||116.12 KB |
|Top keywords from Opendata.warwickshire.gov_.uk(Warwickshire,UnitedKingdom)||28.45 KB |
|Top keywords from Data.edmonton.ca(Edmonton,Canada)||72.79 KB |
|Top keywords from Data.gc.ca(Canada)||45.13 KB |
|Top keywords from Regina.ca(Regina,Canada)||577.22 KB |
|Top keywords from Data.gov.au(Australia)||129.7 KB |
|Top keywords from Data.vic.gov.au(Victoria,Australia)||61.26 KB |
|Top keywords from Data.govt.nz(NewZealand)||93.83 KB |
|Top keywords from Statcentral.ie(Ireland)||246.45 KB |
|Top keywords from Opendatauganda.com(Uganda(NonGovt))||42.1 KB |
|Top keywords from Data.gov(United States).png||318.2 KB |
|Top 150 keywords from Data.gov||136.31 KB |
|Top keywords from Data.gov 150-250||145.07 KB |
|Categories present in United States Catalogs||665.59 KB |
|Categories present in United Kingdom Catalogs||633.31 KB |
|Data.gov(non geo).png||228.61 KB |
|languages.png||16.26 KB |
|Data.gov(no geodata).png||228.61 KB |
|Table 1.png||59.21 KB |
|TopUnigram.png||410.18 KB |
|Top Keyword Bigrams.png||212.63 KB |
|Top Categories.png||82.19 KB |
|Top_1000_unigrams.png||496.46 KB |
|No_Analysis_unigram.png||340.3 KB |
|No_Analysis.png||325.98 KB |
|data_ny_gov.png||143.56 KB |