Understanding LOGD Metadata

Level: 
LOGD 101
Contributor: 
Description: 
This tutorial describes the organization of RDF-based LOGD data published on this site and the corresponding metadata.
Prerequisites: 

What to Expect

By the end of this tutorial you should understand how LOGD data is structured: which source a dataset was obtained from, which OGD dataset an RDF-based LOGD dataset was converted from, what human-contributed registry metadata exists for LOGD datasets, and so on. This tutorial complements the tutorial Understanding LOGD Data.

What You Need to Know

This tutorial assumes you are familiar with concepts found in the following resources:
  • Resource Description Framework (RDF) is a standard model for data interchange on the Web. See [1]
  • Terse RDF Triple Language (Turtle) is a syntax for serializing RDF. We use it throughout our tutorials to encode RDF data. See [2]

LOGD Data Organization Diagram

The LOGD data organization diagram (see figure below) provides an abstract model of how LOGD data is organized. The "Levels of structural data granularity" dimension explains the different levels of granularity of LOGD data.
  • "source" refers to a data publisher that maintains a catalog of OGD datasets for download. An example source is Data.gov (http://data.gov).
  • "dataset" refers to an OGD dataset. What counts as a dataset is typically determined by the data publisher; for example, "Dataset 1623 (OMH Claims Listed by State)" is a dataset entry in the Data.gov catalog (see http://www.data.gov/details/1623).
  • "table" refers to a data table (organized in a tabular structure) in an OGD dataset. Although an OGD dataset often contains one table, it may also contain multiple tables. "Dataset 1623" contains only one table. Note that the data in an OGD dataset may be stored in a non-tabular structure, e.g. an XML tree. Such data structures are out of the scope of this tutorial.
  • "record" refers to a data row in a data table.
The "Data Publishing Stages" dimension explains the data publishing process of LOGD data.
  • at the "dataset" stage, raw OGD data are available for download at certain Web locations. Note that raw OGD data are subject to change by the data publishers: users may download different versions of a dataset from the same URL.
  • at the "version" stage, snapshots of raw OGD data are created and versioned. This stage archives the content of the OGD data at a certain point in time and provides persistent access to the captured version of the raw OGD data. Note that a dataset may contain multiple parts (e.g. data tables), each of which is stored in a static file.
  • at the "conversion layer" stage, conversion configurations are used to convert the raw OGD data into the corresponding LOGD data. The basic conversion configuration is "raw", which is automatically generated with minimal manual input. A number of manually crafted enhancement configurations may also be used to generate monotonically incremental LOGD data.
[Figure: LOGD data organization diagram (logd-design-metadata-dataset.png)]

LOGD Metadata by Example

All LOGD metadata are embedded in the corresponding LOGD data dumps. In what follows, we use "Dataset 1623" to explain the actual metadata used to describe and relate the concepts introduced in the previous section. The examples below are extracted from the data dump for Data.gov's Dataset 1623 (version 2010-Sept-17).
 Step 1. Download http://logd.tw.rpi.edu/source/data-gov/file/1623/version/2010-Sept-17/conversion/data-gov-1623-2010-Sept-17
  * rename the saved file from "data-gov-1623-2010-Sept-17" to "data-gov-1623-2010-Sept-17.tar.gz"
  * run the Linux shell command "tar -zxf data-gov-1623-2010-Sept-17.tar.gz" to unpack the file
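The download, rename and unpack steps above can also be scripted. The following is a minimal Python sketch using only the standard library. The function names `download_dump` and `unpack_dump` are illustrative assumptions, and the URL is the one given in Step 1, which may no longer resolve.

```python
import shutil
import tarfile
import urllib.request
from pathlib import Path

# URL copied from Step 1 of this tutorial; its availability is not guaranteed.
DUMP_URL = ("http://logd.tw.rpi.edu/source/data-gov/file/1623/version/"
            "2010-Sept-17/conversion/data-gov-1623-2010-Sept-17")


def download_dump(url, archive_path):
    """Save the dump directly under a *.tar.gz name (the 'rename' step)."""
    archive_path = Path(archive_path)
    with urllib.request.urlopen(url) as response, open(archive_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return archive_path


def unpack_dump(archive_path, dest_dir):
    """Equivalent of `tar -zxf`: extract a gzip-compressed tar archive."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names  # names of the files found inside the archive
```

A typical call would be `unpack_dump(download_dump(DUMP_URL, "data-gov-1623-2010-Sept-17.tar.gz"), ".")`; the returned list shows which files the archive contained.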
Note that the dump file concatenates the LOGD data from all "conversion layers" of the "version". In this file, readers will see some lines (lines 1, 1388 and 5563) starting with "# BEGIN"; each of these is a syntactic separator that marks the start of the LOGD data produced by one "conversion layer". This "append" operation is possible because the "conversion layers" are monotonic increments. Please also note that the "append" operation may leave redundant metadata in the file.
  # BEGIN: publish/data-gov-1623-2010-Sept-17.e1.ttl:
  ...
  # BEGIN: publish/data-gov-1623-2010-Sept-17.e2.ttl:
  ...
  # BEGIN: publish/data-gov-1623-2010-Sept-17.raw.ttl:
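Because each "# BEGIN" line marks where one conversion layer starts, the concatenated dump can be split back into its layers with a simple line scan. The sketch below assumes the marker format shown above ("# BEGIN: <file name>:"); the function name `split_layers` and the placeholder content lines are made up for illustration.

```python
import re

# Marker format as shown in this tutorial: "# BEGIN: <file name>:"
BEGIN_RE = re.compile(r"^\s*# BEGIN:\s*(?P<name>\S+?):?\s*$")


def split_layers(lines):
    """Group dump lines under the '# BEGIN' marker that precedes them."""
    layers = {}
    current = None
    for line in lines:
        match = BEGIN_RE.match(line)
        if match:
            current = match.group("name")
            layers[current] = []
        elif current is not None:
            layers[current].append(line)
    return layers


# Placeholder content lines; a real dump holds Turtle statements here.
sample_dump = [
    "# BEGIN: publish/data-gov-1623-2010-Sept-17.e1.ttl:",
    "<placeholder line from enhancement layer 1>",
    "# BEGIN: publish/data-gov-1623-2010-Sept-17.raw.ttl:",
    "<placeholder line from the raw layer>",
]
for name, body in split_layers(sample_dump).items():
    print(name, len(body))
```

For the real dump, pass the open file object instead of `sample_dump`. Lines that appear before the first marker are ignored by this sketch.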
 

Source Metadata

The metadata for a source (lines 1104-1106) shows the URI, web page and identifier of the data source.
  • the URI of the source uniquely identifies the source and is created via the following rule
<source_uri> ::= <base_uri>/source/<source_identifier>
  • the source is related to its HTML web page via "foaf:isPrimaryTopicOf"
<http://logd.tw.rpi.edu/source/data-gov> a foaf:Agent ;
	dcterms:identifier "data-gov" ;
	foaf:isPrimaryTopicOf <http://logd.tw.rpi.edu/source_page/data-gov> .

Dataset Metadata

The metadata for a dataset (lines 1048-1056) includes both description and relations to other entities.
  • its URI uniquely identifies itself and is created based on the following rule
<dataset_uri> ::= <base_uri>/source/<source_identifier>/dataset/<dataset_identifier>
  • it is related to its source using the "dcterms:source" property
  • it is related to its subsets using "void:subset". Note that the subsets include all versions of the dataset as well as the metadata of the dataset.
  • its web page is annotated using "foaf:isPrimaryTopicOf"
  • the modification date of a version of the dataset is annotated using "dcterms:modified". A dataset with multiple versions should have multiple values for "dcterms:modified".
 
<http://logd.tw.rpi.edu/source/data-gov/dataset/1623> a data-gov_vocab:Dataset , conversion:Dataset , void:Dataset , conversion:UnVersionedDataset ;
	conversion:base_uri "http://logd.tw.rpi.edu" ;
	conversion:source_identifier "data-gov" ;
	conversion:dataset_identifier "1623" ;
	dcterms:source <http://logd.tw.rpi.edu/source/data-gov> ;
	dcterms:identifier "data-gov 1623" ;
	foaf:isPrimaryTopicOf <http://logd.tw.rpi.edu/source/data-gov/dataset_page/1623> ;
	void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17> , <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/subset/meta> ;
	dcterms:modified "2010-09-09T12:32:49.632-00:05"^^xsd:dateTime .
Besides the machine-produced metadata, additional informational metadata about the dataset may be obtained from the data catalog portal. In our example, Data.gov also publishes manually contributed registry metadata (Dataset 92, the Data.gov catalog), which includes metadata for Dataset 1623 (also see the human-readable page of the registry metadata). The following sample data are extracted from Dataset 92 (version 2010-Sep-20). These metadata can be easily integrated with the metadata above because both describe the same dataset URI.
<http://logd.tw.rpi.edu/source/data-gov/dataset/1623> dcterms:isReferencedBy <http://logd.tw.rpi.edu/source/data-gov/dataset/92/version/2010-Sep-20> ;
	e1:id "1623" ;
	dcterms:identifier "1623" ;
	e1:url <http://www.data.gov/details/1623> ;
	foaf:homepage <http://www.data.gov/details/1623> ;
	e1:title "OMH Claims Listed by State" ;
	dcterms:title "OMH Claims Listed by State" ;
	e1:agency "Department of Health and Human Services" ;
       ...
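Because both sets of statements use the same subject URI, integrating them amounts to taking the union of their triples. The following minimal sketch models triples as plain Python tuples instead of using an RDF library. The predicates and values are copied from the two examples above, while the helper name `describe` is hypothetical.

```python
DATASET = "http://logd.tw.rpi.edu/source/data-gov/dataset/1623"

# A few triples from the converter-generated metadata (Dataset 1623 dump).
converter_metadata = {
    (DATASET, "conversion:dataset_identifier", "1623"),
    (DATASET, "dcterms:source", "http://logd.tw.rpi.edu/source/data-gov"),
}

# A few triples from the registry metadata (Dataset 92 dump).
registry_metadata = {
    (DATASET, "dcterms:title", "OMH Claims Listed by State"),
    (DATASET, "e1:agency", "Department of Health and Human Services"),
}

# An RDF graph is a set of triples, so integration is a set union.
merged = converter_metadata | registry_metadata


def describe(graph, subject):
    """Collect predicate -> sorted list of values for one subject."""
    description = {}
    for s, p, o in graph:
        if s == subject:
            description.setdefault(p, []).append(o)
    return {p: sorted(values) for p, values in description.items()}


description = describe(merged, DATASET)
print(description["dcterms:title"])  # ['OMH Claims Listed by State']
```

With a real RDF toolkit the same effect is obtained by loading both dumps into one graph; no URI mapping is needed because the subject is identical in both.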

Version Metadata

The metadata for a version (lines 1058-1069) includes both description and relations to other entities.
  • its URI uniquely identifies itself and is created based on the following rule
<version_uri> ::= <base_uri>/source/<source_identifier>/dataset/<dataset_identifier>/version/<version_identifier>
  • it is uniquely typed using "conversion:VersionedDataset"
  • it is related to its source using the "dcterms:source" property
  • it is related to its HTML web page using "foaf:isPrimaryTopicOf"
  • it is related to its subsets using "void:subset". Note that the subsets include the conversion layers of the version. Because the example data dump is a concatenation of the LOGD data of each conversion layer, there is only one value of "void:subset" on line 1069. However, additional subsets are mentioned in other places (e.g. line 6382).
  • its modification date is annotated using "dcterms:modified"
  • its file dump is annotated using "void:dataDump".
<http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17> a data-gov_vocab:Dataset , conversion:Dataset , conversion:VersionedDataset , void:Dataset ;
	conversion:base_uri "http://logd.tw.rpi.edu" ;
	conversion:source_identifier "data-gov" ;
	conversion:dataset_identifier "1623" ;
	conversion:version_identifier "2010-Sept-17" ;
	conversion:dataset_version "2010-Sept-17" ;
	dcterms:source <http://logd.tw.rpi.edu/source/data-gov> ;
	dcterms:identifier "data-gov 1623 2010-Sept-17" ;
	dcterms:modified "2010-09-09T12:32:49.632-00:05"^^xsd:dateTime ;
	foaf:isPrimaryTopicOf <http://logd.tw.rpi.edu/source/data-gov/dataset_page/1623/version/2010-Sept-17> ;
	void:dataDump <http://logd.tw.rpi.edu/source/data-gov/file/1623/version/2010-Sept-17/conversion/data-gov-1623-2010-Sept-17> ;
	void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17/conversion/enhancement/1> .
(also see lines 6371 and 6382)
<http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17> a data-gov_vocab:Dataset , conversion:Dataset , conversion:VersionedDataset , void:Dataset ;
	void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17/conversion/raw> .
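The URI rules for sources, datasets and versions nest inside one another, which a few small functions make explicit. This is only a sketch of the naming rules stated in this tutorial, not the converter's own code; the function names are assumptions.

```python
BASE_URI = "http://logd.tw.rpi.edu"


def source_uri(base_uri, source_id):
    # <source_uri> ::= <base_uri>/source/<source_identifier>
    return f"{base_uri}/source/{source_id}"


def dataset_uri(base_uri, source_id, dataset_id):
    # <dataset_uri> ::= <source_uri>/dataset/<dataset_identifier>
    return f"{source_uri(base_uri, source_id)}/dataset/{dataset_id}"


def version_uri(base_uri, source_id, dataset_id, version_id):
    # <version_uri> ::= <dataset_uri>/version/<version_identifier>
    return f"{dataset_uri(base_uri, source_id, dataset_id)}/version/{version_id}"


print(version_uri(BASE_URI, "data-gov", "1623", "2010-Sept-17"))
# http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17
```

The arguments correspond to the values of "conversion:base_uri", "conversion:source_identifier", "conversion:dataset_identifier" and "conversion:version_identifier" in the examples above.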

Conversion Layer Metadata

The metadata for a conversion layer of a version of a dataset (lines 6384-6398) includes both description and relations to other entities.
  • its URI uniquely identifies itself and is created based on the following rule
<conversion_uri> ::= <base_uri>/source/<source_identifier>/dataset/<dataset_identifier>/version/<version_identifier>/conversion/<conversion_identifier>
  • the conversion identifier is assigned according to the following rule
"raw" for the raw conversion layer, i.e. the basic conversion with minimal human input
"enhancement/1", "enhancement/2",... for the enhancement conversion layers
  • it is uniquely typed by "conversion:LayerDataset"
  • it is related to its source using the "dcterms:source" property
  • it is related to its subsets using "void:subset". Note that the subsets include its sample dataset.
  • it is related to its used predicates using "conversion:uses_predicate"
  • it is related to several URIs of its sample records using "void:exampleResource"
  • its creation date is annotated using "dcterms:created"
  • its modification date is annotated using "dcterms:modified", indicating the last execution of the conversion.
  • its size (number of triples) is annotated using "conversion:num_triples"
  • its homepage (html web page) is annotated using "foaf:isPrimaryTopicOf"
  • its file dump is annotated via "void:dataDump".
<http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17/conversion/raw> a data-gov_vocab:Dataset , conversion:Dataset , conversion:LayerDataset , void:Dataset ;
	dcterms:modified "2010-09-09T12:20:34.769-00:05"^^xsd:dateTime ;
	conversion:base_uri "http://logd.tw.rpi.edu" ;
	conversion:source_identifier "data-gov" ;
	conversion:dataset_identifier "1623" ;
	conversion:version_identifier "2010-Sept-17" ;
	conversion:dataset_version "2010-Sept-17" ;
	conversion:conversion_identifier "raw" ;
	void:dataDump <http://logd.tw.rpi.edu/source/data-gov/file/1623/version/2010-Sept-17/conversion/data-gov-1623-2010-Sept-17.raw> ;
	void:subset <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17/conversion/raw/subset/sample> ;
	dcterms:created "2010-09-09T12:20:34.520-00:05"^^xsd:dateTime ;
	dcterms:source <http://logd.tw.rpi.edu/source/data-gov> ;
	dcterms:identifier "data-gov 1623 2010-Sept-17 raw" ;
	conversion:uses_predicate raw:column_8 , raw:total , raw:fiscal_year_09 , raw:fiscal_year_08 , raw:fiscal_year_07 , raw:fiscal_year_06 , raw:state , raw:region ;
	void:exampleResource ds1623:thing_8 , ds1623:thing_68 .
Also see the additional statistics (line 6469):
  <http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17/conversion/raw> conversion:num_triples "787"^^xsd:integer .
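Going the other way, a conversion layer URI can be decomposed into the identifiers that the metadata records separately. The regular expression below is derived from the shape of the example URI above and from the "raw" / "enhancement/N" identifier rule; it is an illustrative assumption, and the function name `parse_layer_uri` is hypothetical.

```python
import re

# Pattern derived from the example URI in this tutorial, not from the
# converter's implementation.
LAYER_URI_RE = re.compile(
    r"^(?P<base_uri>https?://[^/]+)"
    r"/source/(?P<source_identifier>[^/]+)"
    r"/dataset/(?P<dataset_identifier>[^/]+)"
    r"/version/(?P<version_identifier>[^/]+)"
    r"/conversion/(?P<conversion_identifier>raw|enhancement/\d+)$"
)


def parse_layer_uri(uri):
    """Return the identifiers encoded in a conversion layer URI."""
    match = LAYER_URI_RE.match(uri)
    if match is None:
        raise ValueError(f"not a conversion layer URI: {uri}")
    return match.groupdict()


parts = parse_layer_uri(
    "http://logd.tw.rpi.edu/source/data-gov/dataset/1623"
    "/version/2010-Sept-17/conversion/raw"
)
print(parts["conversion_identifier"])  # raw
```

The dictionary keys deliberately mirror the local names of the "conversion:" properties in the example, so the parsed values can be compared directly with the published metadata.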

Notably, the properties used by the LOGD data produced by the enhancement conversion layers can be linked to the properties used by the LOGD data produced by the raw conversion layer. The following example (lines 33-38) shows this connection (see the last line) via "conversion:enhances".
e1:region ov:csvCol "1"^^xsd:integer ;
	ov:csvHeader "Region" ;
	conversion:enhancement_layer "1" ;
	rdfs:label "Region" ;
	rdfs:range rdfs:Resource ;
	conversion:enhances raw:region .
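To see how such links can be used, the sketch below builds a lookup from each enhancement-layer property to the property it enhances. Triples are modelled as plain tuples taken from the example above; the function name `enhanced_properties` is an assumption.

```python
# Triples copied from the e1:region example (literal datatypes omitted).
triples = [
    ("e1:region", "ov:csvHeader", "Region"),
    ("e1:region", "rdfs:label", "Region"),
    ("e1:region", "conversion:enhances", "raw:region"),
]


def enhanced_properties(triples):
    """Map each enhancement-layer property to the property it enhances."""
    return {s: o for s, p, o in triples if p == "conversion:enhances"}


links = enhanced_properties(triples)
print(links)  # {'e1:region': 'raw:region'}
```

Such a lookup lets a query written against a raw property be pointed at its enhanced counterpart, or the reverse.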

Metadata for Special LOGD Datasets

The LOGD conversion process also produces two small subset datasets for convenience.

Meta Dataset

A "meta dataset" is a subset of a dataset, and it includes all metadata of the dataset generated by the converter. The example below (lines 1088-1093) shows the metadata about the "meta dataset".
  • its URI uniquely identifies itself and is created based on the following rule
<metadataset_uri> ::= <base_uri>/source/<source_identifier>/dataset/<dataset_identifier>/subset/meta
  • it is uniquely typed using "conversion:MetaDataset"
  • its file dump is annotated using "void:dataDump".
<http://logd.tw.rpi.edu/source/data-gov/dataset/1623/subset/meta> a data-gov_vocab:Dataset , conversion:Dataset , conversion:MetaDataset , void:Dataset ;
	conversion:base_uri "http://logd.tw.rpi.edu" ;
	conversion:source_identifier "data-gov" ;
	conversion:dataset_identifier "1623" ;
	void:vocabulary <http://rdfs.org/ns/void#> , <http://inference-web.org/2.0/pml-provenance.owl#> , <http://inference-web.org/2.0/pml-justification.owl#> , <http://xmlns.com/foaf/0.1/> , <http://purl.org/dc/terms/> , <http://purl.org/NET/scovo#> ;
	void:dataDump <http://logd.tw.rpi.edu/source/data-gov/file/1623/version/2010-Sept-17/conversion/data-gov-1623-2010-Sept-17.void> .

Dataset Sample

A "dataset sample" is a subset of the LOGD data produced by a conversion layer, and it includes 100 random records of the table of the dataset. The example below (lines 6407-6414) shows the metadata about the "dataset sample".
  • its URI uniquely identifies itself and is created based on the following rule
<datasetsample_uri> ::= <base_uri>/source/<source_identifier>/dataset/<dataset_identifier>/version/<version_identifier>/conversion/<conversion_identifier>/subset/sample
  • it is uniquely typed using "conversion:DatasetSample"
  • its file dump is annotated using "void:dataDump".
<http://logd.tw.rpi.edu/source/data-gov/dataset/1623/version/2010-Sept-17/conversion/raw/subset/sample> a data-gov_vocab:Dataset , conversion:Dataset , conversion:DatasetSample , void:Dataset ;
	conversion:base_uri "http://logd.tw.rpi.edu" ;
	conversion:source_identifier "data-gov" ;
	conversion:dataset_identifier "1623" ;
	conversion:version_identifier "2010-Sept-17" ;
	conversion:dataset_version "2010-Sept-17" ;
	conversion:conversion_identifier "raw" ;
	void:dataDump <http://logd.tw.rpi.edu/source/data-gov/file/1623/version/2010-Sept-17/conversion/data-gov-1623-2010-Sept-17.raw.sample> .
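How the converter selects the sample records is not described in this tutorial. As an illustration only, the sketch below assumes uniform sampling without replacement and caps the sample at the table size, since a table may hold fewer than 100 records. The function name `sample_records` and the 250-record table are hypothetical.

```python
import random

SAMPLE_SIZE = 100  # sample size stated in this tutorial


def sample_records(record_uris, size=SAMPLE_SIZE, seed=None):
    """Pick up to `size` distinct records uniformly at random."""
    rng = random.Random(seed)
    records = list(record_uris)
    return rng.sample(records, min(size, len(records)))


# Hypothetical table of 250 records, named after the ds1623:thing_N pattern.
table = [f"ds1623:thing_{i}" for i in range(1, 251)]
sample = sample_records(table, seed=0)
print(len(sample))  # 100
```

Passing a `seed` makes the selection reproducible, which is convenient when the sample has to be regenerated.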

Where to Go Next

With this understanding of LOGD data, we suggest the following links for further reading and investigation.