Mashing up LOGD data with SPARQL

Level: 
LOGD 101
Contributor: 
Description: 
This tutorial describes SPARQL queries that mash up LOGD data from different datasets. It contains examples showing how the SPARQL queries are connected with LOGD data at different levels of granularity.
Prerequisites: 

What to Expect

By the end of this tutorial you should understand how to use SPARQL queries to mash up LOGD datasets, and how a SPARQL query connects to LOGD data and metadata at different levels of granularity.

What You Need to Know

This tutorial assumes familiarity in the following areas:

The Basic Idea

We assume the required LOGD data are loaded into a triple store (an RDF data management system that works like a database but exposes a SPARQL query interface). As shown in the following figure, the dump file for a specific version of a LOGD dataset is first loaded into an RDF dataset named by that version's URI ("RDF Dataset 1" in the figure). By manually inspecting each dataset's metadata and sample data, users can identify how records from different RDF datasets can be linked through common objects that have the same value. This linking knowledge is then used to compose a SPARQL query over the two RDF datasets, as shown at the top of the triple store in the figure. In what follows, we provide real-world example SPARQL queries that mash up LOGD data.
[Figure: logd-design-mashup-sparql.png]

Section 1. Mashing up LOGD data using String Matching

Datasets and Samples

In this section, we will be using datasets 1356 and 353, obtained from Data.gov. The specific RDF conversions we will be looking at are version 2009-Dec-03 of dataset 1356 and version 1st-anniversary of dataset 353.
In producing these RDF versions of the datasets, one of the primary challenges was establishing mappings between property values: for instance, establishing that the entity "Alabama" referenced in dataset 1356 is equivalent to the "Alabama" in dataset 353. In our RDF conversion, US states in both datasets are referenced by a common set of string-based FIPS codes. Consider the example below, comparing RDF from both datasets:
Dataset Example
Dataset 1356 RDF (Turtle syntax)
<http://logd.tw.rpi.edu/source/data-gov/dataset/1356/version/2009-Dec-03/thing_2>
  <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/county_code> "000" ;
  <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/state_code> "01" ;
  <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/agi> "75020497";
  <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/state_abbrv> "AL" .
Properties Used:
  • "state_code": state's FIPS code
  • "county_code": county's code, "000" means the record is about the entire state, not a specific county of the state
  • "agi": state's adjusted gross income (AGI)
  • "state_abbrv": state name's abbreviation
Dataset 353 RDF (Turtle syntax)
 <http://logd.tw.rpi.edu/source/data-gov/dataset/353/version/1st-anniversary/thing_3>
   <http://logd.tw.rpi.edu/source/data-gov/dataset/353/vocab/raw/pub_fips> "01" ;
   <http://logd.tw.rpi.edu/source/data-gov/dataset/353/vocab/raw/popu_st> "4599030";
   <http://logd.tw.rpi.edu/source/data-gov/dataset/353/vocab/raw/stlaname> "ALABAMA PUBLIC LIBRARY SERVICE" .
Properties Used:
  • "pub_fips": state's FIPS code
  • "popu_st": state's population
  • "stlaname": state library agency's name
Here, the properties "state_code" (Dataset 1356) and "pub_fips" (Dataset 353) both point to a standard FIPS code representing "Alabama".
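The string-matching idea can be sketched outside the triple store. The following Python snippet is only an illustration, not part of the LOGD tooling: it copies the property values from the two samples above into plain dictionaries, and the helper name `join_on_fips` is hypothetical.

```python
# Toy records copied from the two samples above (one row per dataset).
records_1356 = [
    {"state_code": "01", "county_code": "000", "agi": "75020497", "state_abbrv": "AL"},
]
records_353 = [
    {"pub_fips": "01", "popu_st": "4599030", "stlaname": "ALABAMA PUBLIC LIBRARY SERVICE"},
]


def join_on_fips(rows_1356, rows_353):
    """Pair state-level rows of dataset 1356 with rows of dataset 353
    that carry the same FIPS code string."""
    # Assumption: dataset 353 has at most one row per FIPS code.
    population_by_fips = {row["pub_fips"]: row["popu_st"] for row in rows_353}
    joined = []
    for row in rows_1356:
        if row["county_code"] != "000":  # skip county-level rows
            continue
        fips = row["state_code"]
        if fips in population_by_fips:
            joined.append((row["state_abbrv"], row["agi"], population_by_fips[fips]))
    return joined


mashup = join_on_fips(records_1356, records_353)
print(mashup)
```

The join succeeds only because both datasets spell the code identically ("01"); a dataset that stored it as the integer 1 would not match.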

SPARQL Query

The two RDF dataset conversions mentioned earlier are presently loaded into the TWC LOGD triple store (http://logd.tw.rpi.edu/sparql). Information from both datasets can be retrieved by the following SPARQL query:
SELECT distinct ?state_abbv ?agi ?population 
WHERE {
 GRAPH  <http://logd.tw.rpi.edu/source/data-gov/dataset/353/version/1st-anniversary>{
  
   ?s1  <http://logd.tw.rpi.edu/source/data-gov/dataset/353/vocab/raw/popu_st> ?population.
   ?s1  <http://logd.tw.rpi.edu/source/data-gov/dataset/353/vocab/raw/pub_fips> ?state_fipscode .   
 }
 GRAPH <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/version/2009-Dec-03> {
   ?s2 <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/state_abbrv> ?state_abbv .
   ?s2 <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/county_code> "000" .
   ?s2 <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/agi> ?agi.
   ?s2 <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/state_code> ?state_fipscode .
 }
} 
order by ?state_fipscode
Here, the line  ?s2 <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/county_code> "000" . is used to constrain results to state-level information only (as opposed to both state and county level). Sending this query to the TWC LOGD triple store returns results of the form:
state_abbv agi population
"AL" "92162773" "4599030"
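For readers who want to send such a query from a program, here is a minimal Python sketch using only the standard library. It assumes the endpoint follows the standard SPARQL Protocol (a `query` parameter on a GET request) and can return the standard SPARQL JSON results format when asked via the `Accept` header; whether the TWC endpoint honors that header is an assumption. The function names are hypothetical, and the sample payload is hand-written to mirror the result row above rather than captured from the server.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://logd.tw.rpi.edu/sparql"  # endpoint named in the text


def build_request(query, endpoint=ENDPOINT):
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def rows_from_json(payload, variables):
    """Flatten a SPARQL JSON results document into tuples of plain values."""
    bindings = json.loads(payload)["results"]["bindings"]
    return [
        tuple(b[v]["value"] if v in b else None for v in variables)
        for b in bindings
    ]


def run_query(query, variables):
    """Send the query over the network (not called here: needs a live endpoint)."""
    with urllib.request.urlopen(build_request(query)) as response:
        return rows_from_json(response.read().decode("utf-8"), variables)


# Hand-written payload shaped like the sample result row above.
sample_payload = json.dumps({
    "head": {"vars": ["state_abbv", "agi", "population"]},
    "results": {"bindings": [{
        "state_abbv": {"type": "literal", "value": "AL"},
        "agi": {"type": "literal", "value": "92162773"},
        "population": {"type": "literal", "value": "4599030"},
    }]},
})
sample_rows = rows_from_json(sample_payload, ["state_abbv", "agi", "population"])
print(sample_rows)
```

Calling `run_query(query_text, ["state_abbv", "agi", "population"])` with the query above as `query_text` would perform the actual request.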

Important Points

  • Establishing property value mappings between datasets is a challenging problem, which makes it an important area for future research. Standardized representations of entities, such as FIPS codes for US States, can be used for this purpose.
  • Selective use of RDF properties in SPARQL queries can narrow the results obtained. For instance, dataset 1356 has many rows related to each state, but the constraint "county_code=000" helps us focus on state-level data, as opposed to county-level data.

Section 2. Mashing up LOGD data using Enhanced LOGD data

Datasets

In this section, we will be using datasets 1356 and 1623, obtained from Data.gov. The specific RDF conversions we will be looking at are version 2009-Dec-03 of dataset 1356 and version 2010-Sept-17 of dataset 1623.
Earlier, we mentioned that establishing property value mappings between datasets can be challenging, and proposed the use of standardized string-based representations for handling this (e.g. using FIPS codes to represent US states). However, doing this does not produce explicit semantic links between datasets. In addition, it forces datasets to adopt a common string-based representation for a given property. In practice, this is often not feasible: while one dataset may refer to US states by their full name, others may do so by 2-letter postal codes. Therefore, we attempt to provide RDF-based definitions for property values where possible. In the case of US states, two state representations are defined: one corresponding to a postal code, and another for the full name.
Property Example
Postal Code: RDF (Turtle syntax) for Dataset 1356 entry:
<http://logd.tw.rpi.edu/source/data-gov/dataset/1356/version/2009-Dec-03/thing_2>
  <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/state_code> "01" ;
  <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/agi> "75020497";
  <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/enhancement/1/state_abbrv> 
     <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/typed/state_abbreviation/AL> ;
  <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/state_abbrv> "AL" .
RDF (Turtle syntax) for state abbreviation entry:
<http://logd.tw.rpi.edu/source/data-gov/dataset/1356/typed/state_abbreviation/AL>
     owl:sameAs   <http://dbpedia.org/resource/Alabama>.
Properties Used:
  • "state_code": state's FIPS code
  • "agi": state's adjusted gross income (AGI)
  • "state_abbrv": state name's abbreviation
Full Name: RDF (Turtle syntax) for Dataset 1623 entry:
<http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17/thing_31>
  <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/vocab/raw/state> "Alabama" ;
  <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/vocab/enhancement/1/state> <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/value-of/state/Alabama> ;
  <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/vocab/enhancement/1/fiscal_year_07> 647 .
RDF (Turtle syntax) for state entry:
<http://logd.tw.rpi.edu/source/data-gov/dataset/1623/value-of/state/Alabama>
     owl:sameAs   <http://dbpedia.org/resource/Alabama>.
Properties Used:
  • "state": state's name
  • "fiscal_year_07": Medicare claims in fiscal year 2007
In both examples, the RDF-based representation for "Alabama" is asserted to be equivalent to the representation on DBpedia.
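To see why the shared DBpedia URI makes the two records joinable even though their strings differ ("AL" versus "Alabama"), consider the following Python sketch. It is an illustration only: the URIs and values are copied from the examples above, while the helper name `join_via_same_as` and the in-memory dictionaries are hypothetical stand-ins for what the triple store does.

```python
AL_1356 = "http://logd.tw.rpi.edu/source/data-gov/dataset/1356/typed/state_abbreviation/AL"
AL_1623 = "http://logd.tw.rpi.edu/source/data-gov/dataset/1623/value-of/state/Alabama"
AL_DBPEDIA = "http://dbpedia.org/resource/Alabama"

# owl:sameAs links from the two examples: local state URI -> shared URI.
same_as = {AL_1356: AL_DBPEDIA, AL_1623: AL_DBPEDIA}

# (local state URI, value) pairs taken from the sample entries.
agi_1356 = [(AL_1356, "75020497")]
claims_1623 = [(AL_1623, 647)]


def join_via_same_as(left, right, links):
    """Join two lists of (local URI, value) pairs on the URI that each
    local URI is declared owl:sameAs."""
    # Index the right-hand values by their shared URI.
    right_by_shared = {links[uri]: value for uri, value in right if uri in links}
    return [
        (links[uri], value, right_by_shared[links[uri]])
        for uri, value in left
        if uri in links and links[uri] in right_by_shared
    ]


linked = join_via_same_as(agi_1356, claims_1623, same_as)
print(linked)
```

Neither dataset had to change its own identifier for Alabama; only the owl:sameAs links were needed to line the records up.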

SPARQL Query

Below is a SPARQL query for retrieving data from both datasets. Among other things, this query returns URIs (?altStateURI) equivalent to the state representation used in each entry (?state_abbrv):
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT distinct ?state_abbrv ?agi ?claims ?altStateURI
WHERE {
 GRAPH  <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17>{  
  ?s1  <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/vocab/enhancement/1/fiscal_year_07> ?claims.
  ?s1  <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/vocab/enhancement/1/state> ?state1623.   
  ?state1623  owl:sameAs ?altStateURI .
 }
 GRAPH <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/version/2009-Dec-03> {
  ?s2 <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/state_abbrv> ?state_abbrv .
  ?s2 <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/county_code> "000" .
  ?s2 <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/agi> ?agi.
  ?s2 <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/enhancement/1/state_abbrv> ?state1356.
  ?state1356 owl:sameAs ?altStateURI .
 }
} 
order by ?state_abbrv
Below are sample results:
state_abbrv agi claims altStateURI
"AK" "17312636" "30"^^<http://www.w3.org/2001/XMLSchema#integer> <http://dbpedia.org/resource/Alaska>
"AL" "92162773" "647"^^<http://www.w3.org/2001/XMLSchema#integer> <http://dbpedia.org/resource/Alabama>

Important Points

  • The two datasets cannot be connected by simple string matching, since they use different string-based property values (one refers to US states by their full name, while the other uses postal codes). Therefore, our RDF conversion promotes the use of RDF resource-based property values, which can be linked to other parts of the Linked Data cloud (such as DBpedia). This enhancement saves developers the hassle of writing ad hoc code for linking datasets in applications.