Description:
This tutorial describes the organization of RDF-based LOGD data published on this site and the corresponding metadata.
What to Expect
By the end of this tutorial you should understand the structure of
LOGD data: the source from which a dataset was obtained, the OGD
dataset from which an RDF-based LOGD dataset was converted, the
human-contributed registry metadata for LOGD datasets, and related
metadata. This tutorial complements the tutorial
Understanding LOGD Data.
What You Need to Know
This tutorial assumes you are familiar with concepts found in the following resources:
- Resource Description Framework (RDF) is a standard model for data interchange on the Web. See [1]
- Terse RDF Triple Language (Turtle) is a syntax language for serializing RDF. We use it throughout our tutorials to encode RDF data. See [2]
LOGD Data Organization Diagram
The LOGD data organization diagram (see figure below) provides an abstract model of the organization of LOGD data.
The "Levels of structural data granularity" dimension explains different levels of granularity of LOGD data.
- "source" refers to a data publisher that maintains a catalog of OGD datasets for download. An example source is Data.gov (http://data.gov).
- "dataset" refers to an OGD dataset. A dataset is typically
determined by the data publishers, for example, "Dataset 1623 (OMH
Claims Listed by State)" is a dataset entry in Data.gov catalog (see http://www.data.gov/details/1623).
- "table" refers to a data table (organized in tabular
structure) in OGD datasets. Although an OGD dataset often contains one
table, it may also contain multiple tables. In "Dataset 1623", there is
only one table. Note that the data in an OGD dataset may be stored in a
non-tabular structure, e.g. an XML tree. Those data structures are
outside the scope of this tutorial.
- "record" refers to a data row in a data table.
The "Data Publishing Stages" dimension explains the data publishing process of LOGD data.
- at the "dataset" stage, raw OGD data are available for download at
certain Web locations. Note that raw OGD data are subject to change by
the data publishers: users may download different versions of a dataset
from the same URL.
- at the "version" stage, snapshots of raw OGD data are created and
versioned. This stage archives the content of the OGD data at a certain
point in time and provides persistent access to the captured version of
the raw OGD data. Note that a dataset may contain multiple parts (e.g.
data tables), each of which is stored in a static file.
- at the "conversion layer" stage, conversion configurations are
used to convert the raw OGD data into the corresponding LOGD data. The
basic conversion configuration is "raw", which is automatically
generated with minimal manual input. A number of manually crafted
enhancement configurations can also be applied to generate monotonically incremental LOGD data.
LOGD Metadata by Example
All LOGD metadata are embedded in the corresponding LOGD data dumps.
In what follows, we use "Dataset 1623" to explain the actual metadata
used to describe and relate the concepts introduced in the previous
section. The examples below are extracted from the data dump for
Data.gov's Dataset 1623 (version 2010-Sept-17).
Step 1. Download http://logd.tw.rpi.edu/source/data-gov/file/1623/version/2010-Sept-17/conversion/data-gov-1623-2010-Sept-17
* rename the saved file from "data-gov-1623-2010-Sept-17" to "data-gov-1623-2010-Sept-17.tar.gz"
* run the Linux shell command "tar -zxf data-gov-1623-2010-Sept-17.tar.gz" to unpack the file
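For readers who prefer to script this step, the following is a minimal Python sketch of the same unpacking, using only the standard library. The function name extract_dump and the destination directory are assumptions made for illustration; the download itself is left out. Because the archive is opened explicitly in gzip mode, the rename to ".tar.gz" is not needed here.

```python
import tarfile
from pathlib import Path


def extract_dump(archive_path: str, dest_dir: str) -> list[str]:
    """Unpack a gzip-compressed tar archive and return its member names.

    Hypothetical helper: mirrors `tar -zxf <archive>` for the dump file.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" forces gzip + tar handling regardless of the file extension.
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest)
    return names
```

Calling extract_dump("data-gov-1623-2010-Sept-17", "dump") on the downloaded file would place the unpacked files under a local "dump" directory.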
Note that the dump file appends the LOGD data from all "conversion
layers" of the "version". In this file, readers may see some lines
(lines 1, 1388, and 5563) starting with "# BEGIN", each of which should
be considered a syntactic separator for the LOGD data produced by a
"conversion layer". This "append" operation is possible because the
"conversion layers" are monotonic increments. Please also note that the
"append" operation may keep redundant metadata.
# BEGIN: publish/data-gov-1623-2010-Sept-17.e1.ttl:
...
# BEGIN: publish/data-gov-1623-2010-Sept-17.e2.ttl:
...
# BEGIN: publish/data-gov-1623-2010-Sept-17.raw.ttl:
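The separator lines make it straightforward to split the dump back into one chunk per conversion layer. The following Python sketch shows one way to do that; the function name split_layers is hypothetical, and the sketch assumes only what the markers above show, namely that each layer starts with a line of the form "# BEGIN: <path>:".

```python
def split_layers(dump_text: str) -> dict[str, str]:
    """Map each '# BEGIN: <path>:' marker to the text that follows it."""
    layers: dict[str, list[str]] = {}
    current = None
    for line in dump_text.splitlines():
        if line.startswith("# BEGIN:"):
            # Keep the path between the marker and the trailing colon.
            current = line[len("# BEGIN:"):].strip().rstrip(":")
            layers[current] = []
        elif current is not None:
            layers[current].append(line)
        # Lines before the first marker (if any) are ignored.
    return {name: "\n".join(lines) for name, lines in layers.items()}
```

The keys of the returned dictionary are the paths from the markers, for example "publish/data-gov-1623-2010-Sept-17.raw.ttl".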
Source Metadata
The metadata for a source (lines 1104-1106) shows the URI, web page, and identifier of the data source.
- the URI of the source uniquely identifies the source and is created via the following rule
- <source_uri> ::= <base_uri>/source/<source_identifier>
- the source is related to its HTML web page via "foaf:isPrimaryTopicOf"
<http://logd.tw.rpi.edu/source/data-gov> a foaf:Agent ;
dcterms:identifier "data-gov" ;
foaf:isPrimaryTopicOf <http://logd.tw.rpi.edu/source_page/data-gov> .
Dataset Metadata
The metadata for a dataset (lines 1048-1056) includes both description and relations to other entities.
- its URI uniquely identifies itself and is created based on the following rule
- <dataset_uri> ::= <base_uri>/source/<source_identifier>/dataset/<dataset_identifier>
- it is related to its source using "dcterms:source" property
- it is related to its subsets using "void:subset". Note the
subsets include all versions of the dataset and the metadata of the
dataset.
- its web page is annotated using "foaf:isPrimaryTopicOf"
- the modification date of a version of the dataset is annotated
using "dcterms:modified". A dataset with multiple versions should have
multiple values for "dcterms:modified".
<http://logd.tw.rpi.edu/source/data-gov/dataset/1623> a data-gov_vocab:Dataset , conversion:Dataset , void:Dataset , conversion:UnVersionedDataset ;
conversion:base_uri "http://logd.tw.rpi.edu" ;
conversion:source_identifier "data-gov" ;
conversion:dataset_identifier "1623" ;
dcterms:source <http://logd.tw.rpi.edu/source/data-gov> ;
dcterms:identifier "data-gov 1623" ;
foaf:isPrimaryTopicOf <http://logd.tw.rpi.edu/source/data-gov/dataset_page/1623> ;
void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17> , <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/subset/meta> ;
dcterms:modified "2010-09-09T12:32:49.632-00:05"^^xsd:dateTime .
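The URI rules for sources and datasets can be expressed as two small functions. This is only an illustrative Python sketch; the function names are assumptions, while the parameters correspond to the "conversion:base_uri", "conversion:source_identifier", and "conversion:dataset_identifier" values in the example above.

```python
def source_uri(base_uri: str, source_identifier: str) -> str:
    """<source_uri> ::= <base_uri>/source/<source_identifier>"""
    return f"{base_uri}/source/{source_identifier}"


def dataset_uri(base_uri: str, source_identifier: str,
                dataset_identifier: str) -> str:
    """<dataset_uri> ::= <source_uri>/dataset/<dataset_identifier>"""
    return f"{source_uri(base_uri, source_identifier)}/dataset/{dataset_identifier}"
```

With the values from the example, dataset_uri("http://logd.tw.rpi.edu", "data-gov", "1623") reproduces the subject URI of the Turtle snippet.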
Besides the machine-produced metadata, additional informational
metadata about the dataset may be obtained from the data catalog portal.
In our example, Data.gov also publishes manually contributed
registry metadata (Dataset 92, Data.gov catalog), which includes
metadata for Dataset 1623 (also see the human-readable page of registry
metadata). The following sample data are extracted from
Dataset 92 (version 2010-Sep-20). This metadata can be easily integrated with the above metadata because both describe the same dataset URI.
<http://logd.tw.rpi.edu/source/data-gov/dataset/1623> dcterms:isReferencedBy <http://logd.tw.rpi.edu/source/data-gov/dataset/92/version/2010-Sep-20> ;
e1:id "1623" ;
dcterms:identifier "1623" ;
e1:url <http://www.data.gov/details/1623> ;
foaf:homepage <http://www.data.gov/details/1623> ;
e1:title "OMH Claims Listed by State" ;
dcterms:title "OMH Claims Listed by State" ;
e1:agency "Department of Health and Human Services" ;
...
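Because both sets of statements share the same subject URI, integrating them amounts to grouping triples by subject. The Python sketch below illustrates the idea on plain (subject, predicate, object) tuples; the function name merge_by_subject and the tuple representation are assumptions for illustration, not part of the LOGD tooling.

```python
from collections import defaultdict

Triple = tuple[str, str, str]


def merge_by_subject(*graphs: list[Triple]) -> dict[str, set[tuple[str, str]]]:
    """Group (predicate, object) pairs from several triple lists by subject."""
    merged: dict[str, set[tuple[str, str]]] = defaultdict(set)
    for graph in graphs:
        for subject, predicate, obj in graph:
            # Using a set drops statements repeated across the inputs.
            merged[subject].add((predicate, obj))
    return dict(merged)
```

For example, passing one list holding the converter's "dcterms:source" statement and another holding the registry's "dcterms:title" statement for Dataset 1623 yields a single entry keyed by the dataset URI that contains both.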
Version Metadata
The metadata for a version (lines 1058-1069) includes both description and relations to other entities.
- its URI uniquely identifies itself and is created based on the following rule
- <version_uri> ::=
<base_uri>/source/<source_identifier>/dataset/<dataset_identifier>/version/<version_identifier>
- it is uniquely typed using "conversion:VersionedDataset"
- it is related to its source using "dcterms:source" property
- it is related to its html web page using "foaf:isPrimaryTopicOf"
- it is related to its subsets using "void:subset". Note the
subsets include the conversion layers of the version. As the example
data dump is a concatenation of the LOGD data of each conversion layer,
there is only one value of "void:subset" on line 1069. However,
additional subsets are mentioned in other places (e.g. line 6382).
- its modification date is annotated using "dcterms:modified"
- its file dump is annotated using "void:dataDump".
<http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17> a data-gov_vocab:Dataset , conversion:Dataset , conversion:VersionedDataset , void:Dataset ;
conversion:base_uri "http://logd.tw.rpi.edu" ;
conversion:source_identifier "data-gov" ;
conversion:dataset_identifier "1623" ;
conversion:version_identifier "2010-Sept-17" ;
conversion:dataset_version "2010-Sept-17" ;
dcterms:source <http://logd.tw.rpi.edu/source/data-gov> ;
dcterms:identifier "data-gov 1623 2010-Sept-17" ;
dcterms:modified "2010-09-09T12:32:49.632-00:05"^^xsd:dateTime ;
foaf:isPrimaryTopicOf <http://logd.tw.rpi.edu/source/data-gov/dataset_page/1623/version/2010-Sept-17> ;
void:dataDump <http://logd.tw.rpi.edu/source/data-gov/file/1623/version/2010-Sept-17/conversion/data-gov-1623-2010-Sept-17> ;
void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17/conversion/enhancement/1> .
(also see lines 6371 and 6382)
<http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17> a data-gov_vocab:Dataset , conversion:Dataset , conversion:VersionedDataset , void:Dataset ;
void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17/conversion/raw> .
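The version URI rule can also be read in reverse to recover the identifiers from a URI. The following Python sketch does this with a regular expression. The names VERSION_URI and parse_version_uri are hypothetical, and the pattern assumes that <base_uri> is just a scheme and host with no path, as in the example.

```python
import re

# Assumption: base_uri has the form "http(s)://host" without a path.
VERSION_URI = re.compile(
    r"^(?P<base_uri>https?://[^/]+)"
    r"/source/(?P<source_identifier>[^/]+)"
    r"/dataset/(?P<dataset_identifier>[^/]+)"
    r"/version/(?P<version_identifier>[^/]+)$"
)


def parse_version_uri(uri: str) -> dict[str, str]:
    """Split a version URI into the parts named in the URI rule."""
    match = VERSION_URI.match(uri)
    if match is None:
        raise ValueError(f"not a version URI: {uri}")
    return match.groupdict()
```

Applied to the version URI in the example, the result contains "data-gov", "1623", and "2010-Sept-17" under the corresponding keys; a conversion layer URI is rejected because of the extra path segments.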
Conversion Layer Metadata
The metadata for a conversion layer of a version of a dataset
(lines 6384-6398) includes both description and relations to other
entities.
- its URI uniquely identifies itself and is created based on the following rule
- <conversion_uri> ::=
<base_uri>/source/<source_identifier>/dataset/<dataset_identifier>/version/<version_identifier>/conversion/<conversion_identifier>
- the conversion identifier is assigned according to the following rule
- "raw" for the raw conversion layer, i.e. the basic conversion with minimal human input
- "enhancement/1", "enhancement/2",... for the enhancement conversion layers
- it is uniquely typed by "conversion:LayerDataset"
- it is related to its source using "dcterms:source" property
- it is related to its subsets using "void:subset". Note the subsets include its sample dataset
- it is related to its used predicates using "conversion:uses_predicate"
- it is related to several URIs of its sample records using "void:exampleResource"
- its creation date is annotated using "dcterms:created"
- its modification date is annotated using "dcterms:modified", indicating the last execution of the conversion.
- its size (number of triples) is annotated using "conversion:num_triples"
- its homepage (html web page) is annotated using "foaf:isPrimaryTopicOf"
- its file dump is annotated via "void:dataDump".
<http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17/conversion/raw> a data-gov_vocab:Dataset , conversion:Dataset , conversion:LayerDataset , void:Dataset ;
dcterms:modified "2010-09-09T12:20:34.769-00:05"^^xsd:dateTime ;
conversion:base_uri "http://logd.tw.rpi.edu" ;
conversion:source_identifier "data-gov" ;
conversion:dataset_identifier "1623" ;
conversion:version_identifier "2010-Sept-17" ;
conversion:dataset_version "2010-Sept-17" ;
conversion:conversion_identifier "raw" ;
void:dataDump <http://logd.tw.rpi.edu/source/data-gov/file/1623/version/2010-Sept-17/conversion/data-gov-1623-2010-Sept-17.raw> ;
void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17/conversion/raw/subset/sample> ;
dcterms:created "2010-09-09T12:20:34.520-00:05"^^xsd:dateTime ;
dcterms:source <http://logd.tw.rpi.edu/source/data-gov> ;
dcterms:identifier "data-gov 1623 2010-Sept-17 raw" ;
conversion:uses_predicate raw:column_8 , raw:total , raw:fiscal_year_09 , raw:fiscal_year_08 , raw:fiscal_year_07 , raw:fiscal_year_06 , raw:state , raw:region ;
void:exampleResource ds1623:thing_8 , ds1623:thing_68 .
Also see the additional statistics (line 6469):
<http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17/conversion/raw> conversion:num_triples "787"^^xsd:integer .
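The identifier rule for conversion layers ("raw", "enhancement/1", "enhancement/2", ...) can be sketched in Python as follows. The function names and the use of None to denote the raw layer are assumptions for illustration. The example URIs in this section place the conversion segment under the version URI, and the sketch follows that pattern.

```python
from typing import Optional


def conversion_identifier(enhancement_layer: Optional[int] = None) -> str:
    """Return "raw" for the raw layer, else "enhancement/<n>" (n >= 1)."""
    if enhancement_layer is None:
        return "raw"
    if enhancement_layer < 1:
        raise ValueError("enhancement layers are numbered from 1")
    return f"enhancement/{enhancement_layer}"


def conversion_uri(version_uri: str,
                   enhancement_layer: Optional[int] = None) -> str:
    """Append the conversion segment to a version URI."""
    return f"{version_uri}/conversion/{conversion_identifier(enhancement_layer)}"
```

Given the version URI of the example, conversion_uri(version_uri) gives the URI of the raw layer, and conversion_uri(version_uri, 1) gives the URI of the first enhancement layer.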
Note that the properties used by the LOGD data produced by the
enhancement conversion layers can be linked to the properties used by
the LOGD data produced by the raw conversion layer. The following
example (lines 33-38) shows the connection (see the last line) via
"conversion:enhances".
e1:region ov:csvCol "1"^^xsd:integer ;
ov:csvHeader "Region" ;
conversion:enhancement_layer "1" ;
rdfs:label "Region" ;
rdfs:range rdfs:Resource ;
conversion:enhances raw:region .
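One use of these links is to trace an enhanced property back to the raw property it derives from. The Python sketch below follows "conversion:enhances" links held in a plain dictionary. The function name raw_property and the dictionary representation are assumptions; whether an enhancement layer can enhance a property of another enhancement layer (rather than of the raw layer directly) is not stated in this tutorial, so the loop is there only to cover that hypothetical case.

```python
def raw_property(prop: str, enhances: dict[str, str]) -> str:
    """Follow conversion:enhances links until no further link exists."""
    seen: set[str] = set()
    while prop in enhances:
        if prop in seen:
            # Guard against malformed, circular metadata.
            raise ValueError(f"cycle in conversion:enhances at {prop}")
        seen.add(prop)
        prop = enhances[prop]
    return prop
```

With enhances = {"e1:region": "raw:region"}, looking up "e1:region" returns "raw:region", and a property with no link is returned unchanged.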
Metadata for Special LOGD Datasets
The LOGD conversion process also produces two small subset datasets for convenience.
Meta Dataset
A "meta dataset" is a subset of a dataset, and it includes all
metadata of the dataset generated by the converter. The example below
(lines 1088-1093) shows the metadata about the "meta dataset".
- its URI uniquely identifies itself and is created based on the following rule
- <metadataset_uri> ::= <base_uri>/source/<source_identifier>/dataset/<dataset_identifier>/subset/meta
- it is uniquely typed using "conversion:MetaDataset"
- its file dump is annotated using "void:dataDump".
<http://logd.tw.rpi.edu/source/data-gov/dataset/1623/subset/meta> a data-gov_vocab:Dataset , conversion:Dataset , conversion:MetaDataset , void:Dataset ;
conversion:base_uri "http://logd.tw.rpi.edu" ;
conversion:source_identifier "data-gov" ;
conversion:dataset_identifier "1623" ;
void:vocabulary <http://rdfs.org/ns/void#> , <http://inference-web.org/2.0/pml-provenance.owl#> , <http://inference-web.org/2.0/pml-justification.owl#> , <http://xmlns.com/foaf/0.1/> , <http://purl.org/dc/terms/> , <http://purl.org/NET/scovo#> ;
void:dataDump <http://logd.tw.rpi.edu/source/data-gov/file/1623/version/2010-Sept-17/conversion/data-gov-1623-2010-Sept-17.void> .
Dataset Sample
A "dataset sample" is a subset of the LOGD data produced by a
conversion layer, and it includes 100 random records of the table of
the dataset. The example below (lines 6407-6414) shows the metadata about
the "dataset sample".
- its URI uniquely identifies itself and is created based on the following rule
- <datasetsample_uri> ::=
<base_uri>/source/<source_identifier>/dataset/<dataset_identifier>/version/<version_identifier>/conversion/<conversion_identifier>/subset/sample
- it is uniquely typed using "conversion:DatasetSample"
- its file dump is annotated using "void:dataDump".
<http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17/conversion/raw/subset/sample> a data-gov_vocab:Dataset , conversion:Dataset , conversion:DatasetSample , void:Dataset ;
conversion:base_uri "http://logd.tw.rpi.edu" ;
conversion:source_identifier "data-gov" ;
conversion:dataset_identifier "1623" ;
conversion:version_identifier "2010-Sept-17" ;
conversion:dataset_version "2010-Sept-17" ;
conversion:conversion_identifier "raw" ;
void:dataDump <http://logd.tw.rpi.edu/source/data-gov/file/1623/version/2010-Sept-17/conversion/data-gov-1623-2010-Sept-17.raw.sample> .
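The tutorial does not describe how the converter picks the 100 records, so the following Python sketch only illustrates the general idea of drawing a fixed-size random sample from a table. The function name sample_records, the seed parameter, and the behaviour for tables with fewer than 100 records are all assumptions.

```python
import random
from collections.abc import Sequence
from typing import Optional


def sample_records(records: Sequence, k: int = 100,
                   seed: Optional[int] = None) -> list:
    """Return up to k records chosen at random without replacement."""
    if len(records) <= k:
        # Assumed behaviour: small tables are returned whole.
        return list(records)
    rng = random.Random(seed)  # a fixed seed makes the sample repeatable
    return rng.sample(records, k)
```

Passing a seed is useful when the same sample should be reproduced across runs; leaving it as None gives a different sample each time.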
Where to Go Next
With an understanding of LOGD data, we suggest the following links for further reading and investigation.