LOD Data Quality Vocabulary: LODQ

Details

Namespace: http://logd.tw.rpi.edu/lodq#
Imports:
Version: 0.1
Created: 12 April 2011
Modified: 13 April 2011
Authors: olyerickson, alvarograves
Discussion:

Introduction

Insert more extensive intro to subject of linked data quality here. Include mention of Quality Indicators for Linked Data Datasets discussion, originated by Leigh Dodds in Jan 2010...

LODQ is a proposed vocabulary for describing the "quality" of datasets. It is inspired by the 15 metrics proposed by Glenn McDonald in his email to public-lod,[1] later augmented with three additional metrics from Dave Reynolds.[2]

LODQ is intended to be domain-extensible, enabling each of the quality metrics to be defined based on the requirements of a particular domain. Thus each LODQ metric is represented as a class which should be further defined by the domain. This is illustrated by the following figure:

Our objective is to enable assertions of the form...

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix lodq: <http://logd.tw.rpi.edu/lodq#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix olo: <http://purl.org/ontology/olo/core#> .
@prefix ex: <http://example.org/> .
 
ex:binaryMetric
    lodq:explicitValue [
        olo:slot [
            olo:index 1 ;
            olo:item "yes"
        ], [
            olo:index 2 ;
            olo:item "no"
        ] ;
        a olo:OrderedList
    ] ;
    dcterms:creator <http://alvaro.graves.cl> ;
    a lodq:Metric .
 
ex:myMetric
    lodq:maxValue "10"^^xsd:int ;
    lodq:minValue "1"^^xsd:int ;
    dcterms:creator <http://alvaro.graves.cl> ;
    a lodq:Metric .
 
<http://logd.tw.rpi.edu/source/data-gov/dataset/4383/version/2010-Oct-22/conversion/raw>
    a lodq:Dataset .
 
[]
    lodq:completeness <http://logd.tw.rpi.edu/source/data-gov/dataset/4383/version/2010-Oct-22/conversion/raw> ;
    lodq:usesMetric ex:myMetric ;
    lodq:value "6"^^xsd:int ;
    dcterms:creator <http://tw.rpi.edu/instances/JohnErickson> ;
    a lodq:Measure .
 
[]
    lodq:justifiedBy [
        dcterms:description "The license is http://...." ;
        a lodq:Justification
    ] ;
    lodq:licensed <http://logd.tw.rpi.edu/source/data-gov/dataset/4383/version/2010-Oct-22/conversion/raw> ;
    lodq:usesMetric ex:binaryMetric ;
    lodq:value "yes" ;
    dcterms:creator <http://tw.rpi.edu/instances/JohnErickson> ;
    a lodq:Measure .

Vocabulary Definitions

The following classes and properties are inspired by Glenn McDonald's and Dave Reynolds's descriptions of their metrics for describing data quality:

Classes

Class: lodq:Dataset
Subclass of:
Has Subclasses:
Has Properties:
Description: A URI-named dataset.
Class: lodq:Metric
Subclass of:
Has Subclasses: Defined by domain...
Description: A superclass for domain-specific metric definitions. Each metric is an instance of this class.
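
As a minimal sketch of the domain extension described for lodq:Metric, a domain could define its own subclass and instantiate it. The names ex:GovDataCompletenessMetric and ex:agencyCoverage are hypothetical and not part of LODQ; lodq:minValue and lodq:maxValue are used in the same way as in the example assertions in the Introduction.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix lodq: <http://logd.tw.rpi.edu/lodq#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <http://example.org/> .

# Hypothetical domain-specific metric class (not defined by LODQ)
ex:GovDataCompletenessMetric
    rdfs:subClassOf lodq:Metric ;
    rdfs:label "Completeness metric for government datasets" .

# Hypothetical metric instance, scored on a 1 to 10 scale
ex:agencyCoverage
    lodq:maxValue "10"^^xsd:int ;
    lodq:minValue "1"^^xsd:int ;
    a ex:GovDataCompletenessMetric .
```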

Properties

Property: lodq:qualityMeasure
Is of type: owl:DatatypeProperty
Range: rdfs:Literal
Has Subproperties: accuracy, intelligibility, referentialCorrespondence, completeness, boundedness, typing, modelingCorrectness, modelingGranularity, connectedness, isomorphism, currency, modelConsistency, attribution, history, internalConsistency, licensed, sustainable, authoritative
Description: A superproperty to all of the quality metric properties. May be extended by domain-specific metrics not considered here.
Property: lodq:accuracy
Subproperty of: lodq:qualityMeasure
Description: Are the individual nodes that refer to factual information factually and lexically correct? For example, is Chicago spelled "Chigaco", or does the dataset say its population is 2.7?
Property: lodq:intelligibility
Subproperty of: lodq:qualityMeasure
Description: Are there human-readable labels on things, so you can tell what a thing is when you're looking at it? Is there a model, so you can tell what questions you can ask? If a thing has multiple labels (or a set of owl:sameAs things have multiple labels), do you know which one (if any) is canonical?
Property: lodq:referentialCorrespondence
Subproperty of: lodq:qualityMeasure
Description: If a set of data points represents some set of real-world referents, is there one and only one point per referent? If you have 9,780 data points representing cities, but five of them are "Chicago", "Chicago, IL", "Metro Chicago", "Metropolitain Chicago, Illinois" and "Chicagoland", that's bad.
Property: lodq:completeness
Subproperty of: lodq:qualityMeasure
Description: Where you have data representing a clear finite set of referents, do you have them all? All the countries, all the states, all the NHL teams, etc? And if you have things related to these sets, are those projections complete? Populations of every country? Addresses of arenas of all the hockey teams?
Property: lodq:boundedness
Subproperty of: lodq:qualityMeasure
Description: Where you have data representing a clear finite set of referents, is it unpolluted by other things? E.g., can you get a list of current real countries, not mixed with former states or fictional empires or administrative subdivisions?
Property: lodq:typing
Subproperty of: lodq:qualityMeasure
Description: Do you really have properly typed nodes for things, or do you just have literals? The first president of the US was not "George Washington"^^xsd:string; it was a person whose name-renderings include "George Washington". Your ability to ask questions will be constrained or crippled if your data doesn't know the difference.
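
To illustrate the distinction drawn here, the following sketch contrasts a bare literal with a typed node. All ex: terms (ex:firstPresidentName, ex:firstPresident, ex:Person, ex:GeorgeWashington) are hypothetical and chosen only for illustration.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <http://example.org/> .

# Weakly typed: the president is only a string
ex:US ex:firstPresidentName "George Washington"^^xsd:string .

# Properly typed: the president is a node with a type and a label
ex:US ex:firstPresident ex:GeorgeWashington .
ex:GeorgeWashington
    a ex:Person ;
    rdfs:label "George Washington" .
```
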
Property: lodq:modelingCorrectness
Subproperty of: lodq:qualityMeasure
Description: Is the logical structure of the data properly represented? Graphs are relational databases without the crutch of "rows"; if you screw up the modeling, your queries will produce garbage.
Property: lodq:modelingGranularity
Subproperty of: lodq:qualityMeasure
Description: Did you capture enough of the data to actually make use of it? ":us :president :george_washington" isn't exactly wrong, but it's pretty limiting. Model presidencies, with their dates, and you've got much more powerful data.
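
A minimal sketch of the finer-grained modeling suggested here might look like the following. The ex: class and property names (ex:Presidency, ex:presidency, ex:officeHolder, ex:startDate, ex:endDate) are hypothetical, not taken from any published vocabulary.

```turtle
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <http://example.org/> .

# Coarse: a single direct link, with nowhere to attach dates
ex:us ex:president ex:george_washington .

# Finer-grained: the presidency is its own node and carries dates
ex:us ex:presidency ex:presidency_1 .
ex:presidency_1
    a ex:Presidency ;
    ex:officeHolder ex:george_washington ;
    ex:startDate "1789-04-30"^^xsd:date ;
    ex:endDate "1797-03-04"^^xsd:date .
```
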
Property: lodq:connectedness
Subproperty of: lodq:qualityMeasure
Description: If you're bringing together datasets that used to be separate, are the join points represented properly? Is the US from your country list the same as (or owl:sameAs) the US from your list of presidencies and the US from your list of world cities and their populations?
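
As a sketch of an explicit join point, the three "US" nodes mentioned here could be linked with owl:sameAs. The three namespaces below are hypothetical stand-ins for the formerly separate datasets.

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix countries: <http://example.org/countries/> .
@prefix presidencies: <http://example.org/presidencies/> .
@prefix cities: <http://example.org/cities/> .

# One referent, three source datasets, joined explicitly
countries:US owl:sameAs presidencies:US , cities:US .
```
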
Property: lodq:isomorphism
Subproperty of: lodq:qualityMeasure
Description: If you're bringing together datasets that used to be separate, are their models reconciled? Does an album contain songs, or does it contain tracks which are publications of recordings of songs, or something else? If each data point answers this question differently, even simple-seeming queries may be intractable.
Property: lodq:currency
Subproperty of: lodq:qualityMeasure
Description: Is the data up-to-date?
Property: lodq:modelConsistency
Subproperty of: lodq:qualityMeasure
Description: Whichever way you make modelling decisions, such as the direction of relations (from country to president, or from president to country), it is done consistently, so you don't have to ask many permutations of the same query. Note: formerly "Directionality".
Property: lodq:attribution
Subproperty of: lodq:qualityMeasure
Description: If your data comes from multiple sources, or in multiple batches, can you tell which came from where?
Property: lodq:history
Subproperty of: lodq:qualityMeasure
Description: If your data has been edited, can you tell how and by whom?
Property: lodq:internalConsistency
Subproperty of: lodq:qualityMeasure
Description: Do the populations of your counties add up to the populations of your states? Do the substitutes going into your soccer matches balance the substitutes going out?
Property: lodq:licensed
Subproperty of: lodq:qualityMeasure
Description: The license under which the data can be used is clearly defined, ideally in a machine checkable way.
Property: lodq:sustainable
Subproperty of: lodq:qualityMeasure
Description: There is some credible basis for believing the data will be maintained as current (e.g. backed by some appropriate organization or by a sufficiently large group of individuals, has been updated frequently in the past).
Property: lodq:authoritative
Subproperty of: lodq:qualityMeasure
Description: Is the provider of the data a credible authority on the subject? For example, in the UK, Companies House has the definitive information on registered UK companies, and no amount of crowdsourcing can change the fact that if a company is not registered with them then it is not registered.
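
The property definitions in this section could be declared in RDFS/OWL roughly as follows. This is only a sketch: it follows the type and range listed under lodq:qualityMeasure (written with the standard OWL spelling owl:DatatypeProperty), shows two of the eighteen subproperties, and adds one hypothetical domain-specific extension, ex:geocodingPrecision, which is not part of LODQ.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix lodq: <http://logd.tw.rpi.edu/lodq#> .
@prefix ex: <http://example.org/> .

# Superproperty, typed and ranged as listed in its definition
lodq:qualityMeasure
    a owl:DatatypeProperty ;
    rdfs:range rdfs:Literal .

# Two of the listed subproperties; the rest follow the same pattern
lodq:accuracy rdfs:subPropertyOf lodq:qualityMeasure .
lodq:completeness rdfs:subPropertyOf lodq:qualityMeasure .

# Hypothetical domain-specific extension (an assumption, not in LODQ)
ex:geocodingPrecision
    rdfs:subPropertyOf lodq:qualityMeasure ;
    rdfs:label "Geocoding precision" .
```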

References & Resources

  1. Glenn McDonald, 15 Ways to Think About Data Quality (Just for a Start) (8 Apr 2011 21:10:05 -0400)
  2. Dave Reynolds, Re: 15 Ways to Think About Data Quality (Just for a Start) (12 Apr 2011 09:21:36 +0100)
  3. The Pedantic Web Group: working with data publishers, tool builders, application developers and standards groups to create a more interoperable Web of Data...
  4. Semanticweb.com discussion on Quality Indicators for Linked Data Datasets (question added Jan 2010; Glenn McDonald's "15 Ways" added to the discussion 12 April 2011)