Description:
This tutorial describes SPARQL queries that mash up LOGD data from different datasets. It contains examples showing how the SPARQL queries are connected with LOGD data at different levels of granularity.
What to Expect
By the end of this tutorial you should understand how to use SPARQL
queries to mash up LOGD datasets, and how a SPARQL query is connected
to LOGD data and metadata at different levels of granularity.
What You Need to Know
This tutorial assumes familiarity with the following areas:
The Basic Idea
We assume the required LOGD data are loaded into a triple store (an RDF
data management system that works like a database but provides a SPARQL
query interface).
As shown in the following figure, the content of the dump file for a
specific version of a LOGD dataset is first loaded into one RDF dataset
named by the version's URI (we use "RDF Dataset 1" in the figure).
By manually looking at the dataset's metadata and sample data, users can
identify how records from different RDF datasets can be linked by
common objects which have the same value.
The data linking knowledge is then used to compose a SPARQL query on the
two RDF datasets as shown on top of the triple store.
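The linking idea can be sketched outside of SPARQL as well. The following is a minimal Python illustration, not the LOGD implementation: it models a triple store as a dictionary of named graphs and pairs records whose chosen properties have the same object value. All graph names, URIs, and values are hypothetical placeholders.

```python
# Toy model of a triple store: each named graph maps to a list of
# (subject, predicate, object) triples. All URIs and values below are
# hypothetical placeholders, not actual LOGD data.
store = {
    "urn:example:graph1": [
        ("urn:example:g1/row1", "urn:example:g1/code", "A"),
        ("urn:example:g1/row1", "urn:example:g1/amount", "10"),
        ("urn:example:g1/row2", "urn:example:g1/code", "B"),
        ("urn:example:g1/row2", "urn:example:g1/amount", "20"),
    ],
    "urn:example:graph2": [
        ("urn:example:g2/row1", "urn:example:g2/key", "A"),
        ("urn:example:g2/row1", "urn:example:g2/label", "alpha"),
    ],
}


def records(graph):
    """Group a graph's triples into one {predicate: object} dict per subject."""
    rows = {}
    for s, p, o in graph:
        rows.setdefault(s, {})[p] = o
    return list(rows.values())


def mash_up(store, g1, p1, g2, p2):
    """Pair records from two graphs whose p1 and p2 objects have the same value."""
    # Index the second graph's records by the value of the linking property.
    right = {}
    for rec in records(store[g2]):
        if p2 in rec:
            right.setdefault(rec[p2], []).append(rec)
    # Look up each record of the first graph in that index.
    return [
        (left, match)
        for left in records(store[g1])
        for match in right.get(left.get(p1), [])
    ]
```

Calling mash_up(store, "urn:example:graph1", "urn:example:g1/code", "urn:example:graph2", "urn:example:g2/key") pairs the one record from each graph that carries the value "A". In a SPARQL query the same effect comes from reusing one variable in two GRAPH blocks.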
In what follows, we provide real-world example SPARQL queries that
mash up LOGD data.
Section 1. Mashing up LOGD data using String Matching
Datasets and Samples
In this section, we will be using datasets 1356 and 353, obtained from Data.gov.
The specific RDF conversions we will be looking at are:
In producing these RDF versions of the datasets, one of the primary
challenges involved establishing mappings between property values. For
instance, we had to establish that the entity "Alabama" referenced in
dataset 1356 is equivalent to the "Alabama" from dataset 353. In our
RDF conversion, US states in both datasets are referenced by a common
set of string-based FIPS codes. Consider the example below, comparing
RDF from both datasets:
Here, the properties "state_code" (Dataset 1356) and "pub_fips"
(Dataset 353) both point to a standard FIPS code representing "Alabama".
SPARQL Query
The two RDF dataset conversions mentioned earlier are presently loaded into the TWC LOGD triple store (http://logd.tw.rpi.edu/sparql). Information from both datasets can be retrieved with the following SPARQL query:
SELECT DISTINCT ?state_abbv ?agi ?population
WHERE {
  # Dataset 353: ?population plus the FIPS code used as the join key
  GRAPH <http://logd.tw.rpi.edu/source/data-gov/dataset/353/version/1st-anniversary> {
    ?s1 <http://logd.tw.rpi.edu/source/data-gov/dataset/353/vocab/raw/popu_st> ?population .
    ?s1 <http://logd.tw.rpi.edu/source/data-gov/dataset/353/vocab/raw/pub_fips> ?state_fipscode .
  }
  # Dataset 1356: state-level rows only, joined on the same ?state_fipscode
  GRAPH <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/version/2009-Dec-03> {
    ?s2 <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/state_abbrv> ?state_abbv .
    ?s2 <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/county_code> "000" .
    ?s2 <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/agi> ?agi .
    ?s2 <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/state_code> ?state_fipscode .
  }
}
ORDER BY ?state_fipscode
Here, the line ?s2 <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/county_code> "000" . constrains results to state-level information only (as opposed to both state and county level).
Sending this query to the TWC LOGD triple store returns results of the form:
state_abbv | agi | population
"AL" | "92162773" | "4599030"
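For readers who want to send a query from a program rather than a web form, here is a minimal Python sketch using only the standard library. It assumes the endpoint follows the standard SPARQL Protocol (query text passed in a "query" parameter of a GET request) and can return results in the SPARQL JSON results format; whether the TWC LOGD endpoint is reachable and honours the Accept header is an assumption. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://logd.tw.rpi.edu/sparql"  # endpoint named in this tutorial


def build_request(query, endpoint=ENDPOINT):
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(query, endpoint=ENDPOINT):
    """Send the query and return one {variable: value} dict per result row.

    This performs a network call and assumes the endpoint is available.
    """
    with urllib.request.urlopen(build_request(query, endpoint)) as resp:
        data = json.load(resp)
    # In the SPARQL JSON results format, each binding maps a variable name
    # to an object whose "value" field holds the lexical value.
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in data["results"]["bindings"]
    ]
```

Passing the query text above to run_query would, if the endpoint responds, yield rows keyed by state_abbv, agi, and population.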
Important Points
- Establishing property value mappings between datasets is a
challenging problem, which makes it an important area for future
research. Standardized representations of entities, such as FIPS codes
for US States, can be used for this purpose.
- Selective use of RDF properties in SPARQL queries can narrow the
results obtained. For instance, dataset 1356 has many rows related to a
state, but the constraint "county_code=000" helps us focus on
state-level data, as opposed to county-level data.
Section 2. Mashing up LOGD data using Enhanced LOGD data
Datasets
In this section, we will be using datasets 1356 and 1623, obtained from Data.gov.
The specific RDF conversions we will be looking at are:
Earlier, we mentioned that establishing property value mappings
between datasets can be challenging, and proposed the use of
standardized string-based representations for handling this (e.g. using
FIPS codes for representing US States). However, doing this doesn't
produce explicit semantic linking between datasets. In addition, it
forces datasets to adopt a common string-based representation for a
given property. In practice, this is often not feasible: while one
dataset may refer to US states by their full name, others may do so by
2-letter postal codes. Therefore, we attempt to provide RDF-based
definitions for property values where possible. In the case of
representing US States, two state representations are defined: one
corresponding to a postal code, and another for the full name.
In both examples, the RDF-based representation for "Alabama" is asserted to be equivalent to the representation on DBPedia.
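The effect of such equivalence assertions can be sketched in Python. This is an illustration under stated assumptions, not the LOGD conversion itself: the per-dataset URIs are hypothetical placeholders, and the DBpedia URI for Alabama is assumed to be the link target. Two resources are treated as linked when they are declared owl:sameAs the same URI.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical URIs: each dataset has its own resource for Alabama, and both
# are asserted to be owl:sameAs one (assumed) DBpedia resource.
triples_1623 = [
    ("urn:example:1623/row1", "urn:example:1623/state",
     "urn:example:1623/state/Alabama"),
    ("urn:example:1623/state/Alabama", OWL_SAME_AS,
     "http://dbpedia.org/resource/Alabama"),
]
triples_1356 = [
    ("urn:example:1356/row1", "urn:example:1356/state_abbrv",
     "urn:example:1356/state/AL"),
    ("urn:example:1356/state/AL", OWL_SAME_AS,
     "http://dbpedia.org/resource/Alabama"),
]


def same_as_targets(triples):
    """Map each local resource to the set of URIs it is declared owl:sameAs."""
    targets = {}
    for s, p, o in triples:
        if p == OWL_SAME_AS:
            targets.setdefault(s, set()).add(o)
    return targets


def linked_resources(triples_a, triples_b):
    """Return (resource_a, resource_b, shared_uri) for each shared sameAs target."""
    a, b = same_as_targets(triples_a), same_as_targets(triples_b)
    return sorted(
        (ra, rb, uri)
        for ra, uris_a in a.items()
        for rb, uris_b in b.items()
        for uri in uris_a & uris_b
    )
```

Here the full-name resource and the postal-code resource never match as strings, yet linked_resources(triples_1623, triples_1356) connects them through the shared target.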
SPARQL Query
Below is a SPARQL query for retrieving data from:
Among other things, this query returns URIs (?altStateURI) equivalent
to the state representation used in each entry (?state_abbrv):
# Declared explicitly so the query does not rely on endpoint-predefined prefixes
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?state_abbrv ?agi ?claims ?altStateURI
WHERE {
  # Dataset 1623: ?claims and the state resource, linked to an equivalent URI
  GRAPH <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17> {
    ?s1 <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/vocab/enhancement/1/fiscal_year_07> ?claims .
    ?s1 <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/vocab/enhancement/1/state> ?state1623 .
    ?state1623 owl:sameAs ?altStateURI .
  }
  # Dataset 1356: state-level rows whose state resource shares ?altStateURI
  GRAPH <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/version/2009-Dec-03> {
    ?s2 <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/state_abbrv> ?state_abbrv .
    ?s2 <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/county_code> "000" .
    ?s2 <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/raw/agi> ?agi .
    ?s2 <http://logd.tw.rpi.edu/source/data-gov/dataset/1356/vocab/enhancement/1/state_abbrv> ?state1356 .
    ?state1356 owl:sameAs ?altStateURI .
  }
}
ORDER BY ?state_abbrv
Below are sample results
Important Points
- The two datasets cannot be connected by simple string matching,
since variations on string-based property values may be used (e.g. one
dataset may refer to US States by their full name, while another may use
postal codes). Therefore, our RDF conversion promoted the use of RDF
resource-based property values, linkable to other sections of the Linked
Data Cloud (such as DBPedia). This data enhancement provided by our RDF
conversion can help developers by saving them the hassle of writing ad
hoc code for linking datasets in applications.