Understanding LOGD Data

Level: LOGD 101
Description: This tutorial describes the structure of Linking Open Government Data (LOGD) data and how it is associated with Open Government Data (OGD).

What to Expect

By the end of this tutorial you should understand how tabular government data is mapped to its RDF-based LOGD representation, and you should be able to recognize the basic elements of LOGD data.

What You Need to Know

This tutorial assumes you are familiar with concepts found in the following resources:
  • Resource Description Framework (RDF) is a standard model for data interchange on the Web. See [1]
  • Comma-Separated Values (CSV) is a simple text format for a database table. See [2]
  • Terse RDF Triple Language (Turtle) is a syntax for serializing RDF. We use it throughout our tutorials to encode RDF data. See [3]

Open Government Data (OGD)

Open Government Data (OGD) refers to publicly available government data. Recently, many countries, such as the US (http://data.gov) and the UK (http://data.gov.uk), have launched central open government data portals. In this tutorial, we use "Dataset 1623 (OMH Claims Listed by State)", which is cataloged at http://data.gov, as an example.

OGD Metadata

Each dataset published at Data.gov has a web page showing its metadata (e.g. who published it, a summary of its content, where to download it, and where to find additional information for understanding it). For example, we can collect some metadata (see below) for "Dataset 1623" from its Data.gov URL http://www.data.gov/details/1623.
 Step 1. Go to http://www.data.gov/details/1623
Sample metadata about Dataset 1623 (source: http://www.data.gov/details/1623):
  • title: OMH Claims Listed by State
  • description: Total count of Claims received by Region, State and fiscal year.
  • agency: Department of Health and Human Services

OGD Raw Data Files

An important mission of OGD portals is to let citizens download raw data files. The web page for each Data.gov dataset has a dedicated "Download Information" section that lists the available raw data in various formats, including XML, CSV and XLS (Excel format). For example, Dataset 1623 has a raw data file in XLS format (which can be opened with Microsoft Excel or similar tools), downloadable at http://www.data.gov/download/1623/xls.
 Step 2. Download the file from http://www.data.gov/download/1623/xls
The following image shows a fragment of the raw data (accessed on Sep 17, 2010). It is easy to see that the raw data is essentially a table listing total OMHA claims received by region, state, and fiscal year.
[Figure: fragment of the raw data for Dataset 1623 (tutorial-Understanding-LOGD-Data-1623-full.png)]
As shown in the figure above, the raw data is not normalized for machine consumption:
  • The table header is not on the first row.
  • Repeated values in column 1 are left blank for visual brevity (e.g., "Mid-West" should apply to everything from "Connecticut" through "West Virginia").
  • Values in column 2 are referencing entities that are very commonly recognized -- and are already mentioned in a slew of other datasets in a variety of slightly different ways (e.g., "MD" and "24", instead of "Maryland").
  • Columns 3 through 7 contain integers (_not_ the characters "1", ",", "0", "2", and "9"), but the integers are written with commas as thousands separators (e.g., "1,029"), so they cannot be read as numbers without clean-up.
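
The clean-up implied by the list above can be sketched in a few lines of Python. This is only an illustration, not the converter used by the project: the sample rows, the header names other than "Region", the header position, and the helper names `parse_int` and `normalize_rows` are all assumptions made for the example. It covers the header offset, the blank region cells, and the comma-formatted numbers, but not the naming of states.

```python
def parse_int(text):
    """Turn a display string such as "1,019" into the integer 1019."""
    return int(text.replace(",", ""))


def normalize_rows(rows, header_index):
    """Yield one dict per data row, keyed by header name.

    rows         -- list of lists of strings, as read from the spreadsheet
    header_index -- position of the header row (assumed known; it is not row 0)
    """
    header = rows[header_index]
    last_region = ""
    for row in rows[header_index + 1:]:
        record = dict(zip(header, row))
        # Column 1 is blank when the region repeats, so carry the last value down.
        if record[header[0]]:
            last_region = record[header[0]]
        else:
            record[header[0]] = last_region
        # Columns 3 onward hold counts formatted with thousands separators.
        for name in header[2:]:
            record[name] = parse_int(record[name])
        yield record


# Illustrative rows modelled on the fragment discussed above (layout assumed).
sample = [
    ["OMH Claims Listed by State", "", "", ""],  # title line above the header
    ["Region", "State", "Fiscal Year 09", "Total"],
    ["Mid-Atlantic", "District of Columbia", "376", "1,019"],
    ["", "Maryland", "2,403", "11,011"],
]
records = list(normalize_rows(sample, header_index=1))
```

After this step every record carries its own region and its counts are real integers, which is roughly the state the enhancement conversion described later aims for.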

LOGD RDF Data

In this tutorial we focus on government data organized as tables. Although tabular data is easy for human readers to interpret, clean-up and format conversion are needed before machines can consume it. The TWC LOGD RDF data for a dataset is created by several automated (or semi-automated) conversion processes. Note that users need to assign a version identifier to the RDF data because the conversion is performed on a snapshot of the raw data taken at a certain time. The RDF data of LOGD is available either as downloadable "dump files" or by dereferencing an HTTP URI. For example, the zipped RDF dump file for Dataset 1623 (version 2010-Sept-17) is available at the URL given in Step 3.
 Step 3. Download http://logd.tw.rpi.edu/source/data-gov/file/1623/version/2010-Sept-17/conversion/data-gov-1623-2010-Sept-17
   * Rename the saved file from "data-gov-1623-2010-Sept-17" to "data-gov-1623-2010-Sept-17.tar.gz"
   * Run the Linux shell command "tar -zxf data-gov-1623-2010-Sept-17.tar.gz" to extract the file
Note: to extract the file on Windows, see http://www.gzip.org/.
The downloaded RDF data is encoded in Turtle syntax. The dump file concatenates the results of several conversion processes, so we explain only a few essential fragments of it.
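
For readers who do not have a tar command at hand, the extraction in Step 3 can also be done with Python's standard `tarfile` module. The sketch below assumes the downloaded file really is a gzip-compressed tar archive, as the rename in Step 3 suggests; the function name `extract_dump` is made up for this example.

```python
import tarfile


def extract_dump(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive and return the member names.

    Only use this on archives from a source you trust, since extraction
    writes whatever paths the archive contains.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names


# Example call (not executed here), using the file name from Step 3:
# extract_dump("data-gov-1623-2010-Sept-17.tar.gz", ".")
```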

Namespace Declaration

The namespace declarations make it possible to abbreviate URIs as QNames. Below is the content of lines 2, 13, 23, 24 and 25 of the dump file. Each line declares a prefix and its corresponding namespace.
...
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
...
@prefix ov: <http://open.vocab.org/terms/> .
...
@prefix raw: <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/vocab/raw/> .
@prefix e1: <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/vocab/enhancement/1/> .
@prefix ds1623: <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17/> .
...
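
To make the effect of these declarations concrete, here is a small Python sketch of how a QName is expanded back into a full URI. The prefix table is copied from the declarations above; the function name `expand` is an assumption for illustration, and a real Turtle parser handles many more cases (escapes, the default prefix, relative URIs).

```python
# Prefix table copied from the @prefix lines shown above.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ov": "http://open.vocab.org/terms/",
    "raw": "http://logd.tw.rpi.edu/source/data-gov/dataset/1623/vocab/raw/",
    "e1": "http://logd.tw.rpi.edu/source/data-gov/dataset/1623/vocab/enhancement/1/",
    "ds1623": "http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17/",
}


def expand(qname):
    """Replace the prefix of a QName with its namespace URI."""
    prefix, local = qname.split(":", 1)
    return PREFIXES[prefix] + local


region_uri = expand("raw:region")
record_uri = expand("ds1623:thing_8")
```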

Property Definitions

The property definitions (i) preserve the original text of each table header field name and (ii) add descriptions contributed during the data conversion process. Below is the content of lines 5589 to 5592 of the dump file. The "rdfs:label" triple declares a human-readable label for the property "raw:region", which expands to the URI "http://logd.tw.rpi.edu/source/data-gov/dataset/1623/vocab/raw/region".
raw:region ov:csvCol "1"^^xsd:integer ;
  ov:csvHeader "Region" ;
  rdfs:label "Region" ;
  rdfs:range rdfs:Literal .
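
As a rough sketch of how such a definition could be derived from a header cell, the Python function below builds the four lines shown above from a column number and a header string. The rule used for the property's local name (lower-case the header and join words with underscores) is an assumption inferred from names such as "raw:fiscal_year_06"; the actual converter may apply a different rule.

```python
def property_definition(column_number, header):
    """Return Turtle text describing the raw property for one column."""
    # Hypothetical naming rule: "Fiscal Year 06" -> "fiscal_year_06".
    local = "_".join(header.lower().split())
    return "\n".join([
        'raw:%s ov:csvCol "%d"^^xsd:integer ;' % (local, column_number),
        '  ov:csvHeader "%s" ;' % header,
        '  rdfs:label "%s" ;' % header,
        '  rdfs:range rdfs:Literal .',
    ])


region_definition = property_definition(1, "Region")
```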

RDF Data Generated from Raw Conversion

The table areas in the raw data are converted into RDF representation following simple rules:
  • each row (except the header row) is identified by an RDF resource with a unique HTTP URI
  • each column is associated with an RDF property with a unique HTTP URI
  • each cell (in a non-header row) is recorded as an RDF triple "(s p o)", where s is the row's URI, p is the column's URI and o is the value of the cell.
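
The three rules above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the TWC LOGD converter itself: the two namespace URIs are the ones used for Dataset 1623 in this tutorial, while the helper names, the header-to-name rule, and the shortened sample table are made up for the example.

```python
BASE = "http://logd.tw.rpi.edu/source/data-gov/dataset/1623/"
ROW_NS = BASE + "version/2010-Sept-17/"   # namespace for row resources
RAW_NS = BASE + "vocab/raw/"              # namespace for column properties


def local_name(header):
    """Hypothetical rule: lower-case the header, join words with underscores."""
    return "_".join(header.lower().split())


def rows_to_triples(header, rows, first_row_number):
    """Apply the rules: one subject per row, one property per column, one triple per cell."""
    triples = []
    for offset, row in enumerate(rows):
        subject = ROW_NS + "thing_%d" % (first_row_number + offset)
        for column, cell in zip(header, row):
            predicate = RAW_NS + local_name(column)
            triples.append((subject, predicate, cell))  # cell kept as a plain string
    return triples


# Shortened sample table (three of the columns, two of the rows).
header = ["Region", "State", "Total"]
rows = [
    ["Mid-Atlantic", "District of Columbia", "1,019"],
    ["", "Maryland", "11,011"],
]
triples = rows_to_triples(header, rows, first_row_number=8)
```

Note that the cell values are copied unchanged, including the empty region and the comma-formatted totals; nothing is cleaned at this stage.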

A "raw" conversion simply translates the raw data (in CSV format) into an RDF representation with minimal manual operation. Below is the content of lines 5624-5642 (generated by the raw conversion) of the RDF dump file.
  • "ds1623:thing_8" on the first line is the URI of the record. It is the QName of the HTTP URI "http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17/thing_8".
  • "http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17" on the first line is the URI of the version of the dataset.
  • The entire first line corresponds to one RDF triple meaning "the record 'ds1623:thing_8' is referenced by the version of the dataset". This triple is automatically added by the TWC LOGD converter.
  • The second line corresponds to one RDF triple meaning "the record 'ds1623:thing_8' is related to a region named 'Mid-Atlantic'". This triple is associated with the 1st column ("region") of the raw data and the corresponding cell ("Mid-Atlantic") in the 8th row and the 1st column. Note that the "region" of "ds1623:thing_9" is an empty string, because the corresponding cell was left blank in the raw data.
  • The numbers after "raw:total" are still encoded as strings with comma separators.
ds1623:thing_8 dcterms:isReferencedBy <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17> ;
 	raw:region "Mid-Atlantic" ;
 	raw:state "District of Columbia" ;
	raw:fiscal_year_06 "12" ;
	raw:fiscal_year_07 "289" ;
	raw:fiscal_year_08 "342" ;
	raw:fiscal_year_09 "376" ;
	raw:total "1,019" ;
	ov:csvRow "8"^^xsd:integer .
 
ds1623:thing_9 dcterms:isReferencedBy <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17> ;
	raw:region "" ; 
	raw:state "Maryland" ; 
	raw:fiscal_year_06 "1,029" ;
	raw:fiscal_year_07 "3,565" ;
	raw:fiscal_year_08 "4,014" ;
	raw:fiscal_year_09 "2,403" ;
	raw:total "11,011" ;
	ov:csvRow "9"^^xsd:integer .

RDF Data Generated from Enhancement Conversion

An "enhancement" conversion converts the raw data (in CSV format) into an RDF representation based on a manually created configuration file. Below is the content of lines 82-100 (generated by an enhancement conversion) of the RDF dump file. The RDF data encodes the first two records of the data table, corresponding to the 8th and 9th rows of the raw data (Excel file).
  • On the first line, the URI of the record is the same as in the raw conversion. This makes it possible to add enhanced descriptions incrementally to the existing ones.
  • On the second line, a new RDF property "e1:region" has been created in addition to "raw:region". Note that the ranges of the two RDF properties are different: "raw:region" takes literal strings, whereas the value of "e1:region" here is a resource.
  • On the second line, a new RDF resource "value_of_region:Mid-Atlantic" is promoted from the original literal string in the raw data. Because the named entity (a region in this case) now has a unique URI, users can later add more descriptions of it or links to it, e.g. a link to the Wikipedia article on the Mid-Atlantic states.
  • On the fourth line, the number "12" is now annotated with a datatype. This not only helps users to better understand the meaning of the data, but also lets triple stores apply aggregation functions (e.g. sum), which cannot be used on plain literals.
ds1623:thing_8 dcterms:isReferencedBy <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17> ;
	e1:region value_of_region:Mid-Atlantic ; 
	e1:state value_of_state:District_of_Columbia ;
	e1:fiscal_year_06 "12"^^xsd:integer ;
	e1:fiscal_year_07 "289"^^xsd:integer ;
	e1:fiscal_year_08 "342"^^xsd:integer ;
	e1:fiscal_year_09 "376"^^xsd:integer ;
	e1:total "1019"^^xsd:integer ;
	ov:csvRow "8"^^xsd:integer .

ds1623:thing_9 dcterms:isReferencedBy <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17> ;
	e1:region value_of_region:Mid-Atlantic ;
	e1:state value_of_state:Maryland ;
	e1:fiscal_year_06 "1029"^^xsd:integer ;
	e1:fiscal_year_07 "3565"^^xsd:integer ;
	e1:fiscal_year_08 "4014"^^xsd:integer ;
	e1:fiscal_year_09 "2403"^^xsd:integer ;
	e1:total "11011"^^xsd:integer ;
	ov:csvRow "9"^^xsd:integer .
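
The difference between the raw and enhanced records can be mimicked with a short Python sketch. It is not the project's converter and does not read the manually created configuration file; the function names and the dictionary layout are assumptions for illustration. It promotes region and state strings to QName-style resource names, fills the blank region from the previous row, and turns the comma-formatted totals into integers so they can be summed.

```python
def to_local_name(label):
    """Replace spaces with underscores, as in value_of_state:District_of_Columbia."""
    return label.strip().replace(" ", "_")


def enhance(raw_records):
    """Return enhanced copies of raw records (dicts of strings)."""
    enhanced = []
    last_region = ""
    for raw in raw_records:
        # A blank region means "same as the row above" in the spreadsheet.
        region = raw["region"] or last_region
        last_region = region
        enhanced.append({
            "region": "value_of_region:" + to_local_name(region),
            "state": "value_of_state:" + to_local_name(raw["state"]),
            "total": int(raw["total"].replace(",", "")),  # "1,019" -> 1019
        })
    return enhanced


# Values taken from the raw listing for thing_8 and thing_9.
raw_records = [
    {"region": "Mid-Atlantic", "state": "District of Columbia", "total": "1,019"},
    {"region": "", "state": "Maryland", "total": "11,011"},
]
enhanced = enhance(raw_records)

# Aggregation only works once the totals are numbers rather than strings.
grand_total = sum(record["total"] for record in enhanced)
```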
 