Featured

This tag indicates that a page/demo/etc. is featured.

URI Design Principles: Creating Persistent URIs for Government Linked Data

Version: 23 October 2013

Table of Contents

  1. Goals for Persistent Open Government Data URIs
  2. URI Design Overview
  3. Example Persistent URIs for Government Linked Data
  4. References & Resources

1. Goals for Persistent Open Government Data URIs

As an increasing number of governments and government agencies have begun publishing linked open government data, policies and best practices will emerge for Uniform Resource Identifier (URI) design[0] in open government data release. RPI TWC is working extensively with key players in the United States and international linked open government data initiatives to develop URI schemes that are useful today and in the future.

The URI scheme demonstrated by the TWC Instance Hub provides rich descriptive information about entities for humans examining open government data, while encouraging intuitive navigation and exploration of this data. The implementation of this scheme in the instance hub shows its utility for data publishing, and we hope to see government support and adoption of this scheme in order to further test its viability.

The principles recommended in this document should produce URIs with the following characteristics:
  • URIs that are easily re-hosted. This means that a generated URI should be easily transformed from one BASE URI-space (e.g. logd) to another, allowing easier adoption by government agencies.
    • For example: the pattern http://logd.tw.rpi.edu/id/epa-gov/XXXXXX can easily (based on syntax) be transformed to http://epa.gov/id/XXXXXX should the agency "EPA" adopt this scheme
  • Concise URIs with as little "cruft" as possible
  • URIs that span many domains including:
    • National identifiers (e.g. governmental agencies, states, zip codes)
    • State-level identifiers (e.g. counties, congressional districts)
    • Agency-level identifiers (e.g. EPA facilities)
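
The re-hosting example above can be sketched in a few lines of Python. This is only an illustration of the syntactic transformation, not code from the Instance Hub; the function name `rehost` and its parameters are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def rehost(uri: str, agency_token: str, new_host: str) -> str:
    """Move an agency-scoped URI from a shared hub to the agency's own host.

    Hypothetical helper: drops the agency token (e.g. "epa-gov") that
    follows "/id/" and swaps the host, leaving the rest of the path as-is.
    """
    parts = urlsplit(uri)
    prefix = f"/id/{agency_token}/"
    if not parts.path.startswith(prefix):
        raise ValueError(f"{uri!r} is not under {prefix!r}")
    new_path = "/id/" + parts.path[len(prefix):]
    return urlunsplit((parts.scheme, new_host, new_path, parts.query, parts.fragment))


print(rehost("http://logd.tw.rpi.edu/id/epa-gov/XXXXXX", "epa-gov", "epa.gov"))
# prints http://epa.gov/id/XXXXXX
```

Because the rewrite depends only on the URI's syntax, no lookup table of identifiers is needed to move a whole category to a new host.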

2. URI Design Overview

URI Template: ‘http://’ [base] / ‘id’ / ([category]/[token]*)+ /[token]+

For the case of the TWC RPI Instance Hub [V1, V2] the value of base will be: logd.tw.rpi.edu
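
As a rough illustration of the template, the following Python sketch joins a base and a sequence of category and token elements into a URI. The function name `build_uri` is hypothetical, and the checks are a simplification: the sketch does not try to tell categories from tokens.

```python
def build_uri(base: str, *elements: str) -> str:
    """Assemble 'http://' [base] / 'id' / element / element / ...

    Hypothetical helper. Each element is a category or a token; the
    caller is responsible for ending the sequence with a token.
    """
    if not elements:
        raise ValueError("at least one category or token is required")
    for element in elements:
        if not element or "/" in element:
            raise ValueError(f"invalid path element: {element!r}")
    return "http://" + base + "/id/" + "/".join(elements)


print(build_uri("logd.tw.rpi.edu", "us", "state", "Vermont"))
# prints http://logd.tw.rpi.edu/id/us/state/Vermont
```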

Notes

  • id
    • The id path is recommended to avoid "polluting" the top namespace [base] with identifiers.
    • id is preferred over e.g. "instance-hub" in order to have as short a token as possible. id doesn't add any semantics, but is a syntactic way to distinguish these URIs from others.
    • This token is consistent with data.gov.uk URI design recommendations
  • category
    • This element indicates the type of an entity
    • Categories allow us to describe an entity and also aid in navigation. If desired, in an implementation, category pages can serve to list all entities falling within their category
    • Category elements should be short, but don’t lose expressivity in favor of shortness
    • URIs ending in a category should not be the URIs for the concept itself.

Figure: A tree-style representation of URI “categories”, which present listings of the entities that are logically categorized beneath them.
  • token
    • An element which, when qualified by categories and other elements, identifies a unique element
    • All entity URIs should end with a token
    • A token can be followed by a category or another token, if they follow logically
    • Tokens are unique when fully qualified; although a given token might not be unique within an instance hub, it must be unique within the category or token that precedes it.
    • Consider the token “Office_of_Inspector_General”:
      • /id/us/fed/agency/Department_of_Justice/Office_of_Inspector_General
      • /id/us/fed/agency/General_Services_Adminstration/Office_of_Inspector_General
      • /id/us/fed/agency/Department_of_Defense/Office_of_Inspector_General
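
The uniqueness rule can be checked mechanically. The Python sketch below is illustrative only (the helper names are hypothetical): it shows that the final token repeats across paths while every fully qualified path stays unique.

```python
from collections import Counter

# Two of the paths listed above.
paths = [
    "/id/us/fed/agency/Department_of_Justice/Office_of_Inspector_General",
    "/id/us/fed/agency/Department_of_Defense/Office_of_Inspector_General",
]


def final_token(path: str) -> str:
    """Return the last path element, i.e. the entity's own token."""
    return path.rsplit("/", 1)[-1]


def duplicate_paths(all_paths):
    """Return fully qualified paths that occur more than once."""
    return [p for p, n in Counter(all_paths).items() if n > 1]


print(Counter(final_token(p) for p in paths))  # the same token appears twice
print(duplicate_paths(paths))                  # prints []
```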

3. Example Persistent URIs for Government Linked Data

Note: this section is still in review for consistency with the revised recommendations...

3.1. US Government Agencies

Owner: federal
Suggested: http://BASE/id/us/fed/agency/NAME/SUBNAME
Example: http://BASE/id/us/fed/agency/Commerce/National_Oceanic_and_Atmospheric_Administration
Example: http://BASE/id/us/fed/agency/Department_of_Health_and_Human_Services/Centers_for_Disease_Control

3.2. Canadian Federal Agencies [4]

Owner: federal
Suggested: http://BASE/id/ca/fed/agency/NAME/SUBNAME
Example:
  http://BASE/id/ca/fed/agency/Department_of_Agriculture_and_Agri-Food
  http://BASE/id/ca/fed/agency/Agriculture_and_Agri-Food_Canada
Example:
  http://BASE/id/ca/fed/agency/Department_of_Industry
  http://BASE/id/ca/fed/agency/Industry_Canada
Notes: Each example shows both the "Legal Title" and the "Applied Title".

3.3. US States and Territories

Owner: federal
Suggested: http://BASE/id/us/state/NAME
Example: http://logd.tw.rpi.edu/id/us/state/Vermont
Include: FIPS code, two-letter code, name, dbpedia/geonames/govtrack sameAs
Notes:
  • States and territories may be identified by FIPS 5-2 codes, two-letter abbreviations, or full names in published datasets
  • Not all states/territories have two-letter abbreviations
  • FIPS 5-2 has been withdrawn as a FIPS standard (2008)
  • The use of full names is probably the most stable approach

3.4. Canadian Provinces and Territories

Owner: federal
Suggested: http://BASE/id/ca/province/NAME
Example: http://logd.tw.rpi.edu/id/ca/province/Alberta
Include: Two-letter province codes, name, dbpedia/geonames/govtrack sameAs
Notes: Canadian provinces and territories have not been implemented in RPI's Instance Hub (Sep 2013)

3.5. US Counties

Owner: federal
Suggested: http://BASE/id/us/state/STATE/COUNTY
Example: http://BASE/id/us/state/Alaska/Bethel_Census_Area
Notes:
  • Just like US states, counties are identified by FIPS codes (FIPS 6-4), but these have been withdrawn (2008).
  • US county names seem stable, although two US states do not refer to them as "counties": Alaska (borough) and Louisiana (parish).
  • Hierarchy based on the state/territory URIs seems like the best design.
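
The state and county patterns can be sketched as follows. The helper names are hypothetical, and replacing spaces with underscores is an assumption inferred from the examples in this document (e.g. Bethel_Census_Area), not a stated rule.

```python
def to_token(name: str) -> str:
    """Turn a display name into a URI token.

    Assumption: runs of whitespace become a single underscore.
    """
    return "_".join(name.split())


def state_uri(base: str, state: str) -> str:
    return f"http://{base}/id/us/state/{to_token(state)}"


def county_uri(base: str, state: str, county: str) -> str:
    """County URIs hang off the URI of the state that contains them."""
    return f"{state_uri(base, state)}/{to_token(county)}"


print(county_uri("BASE", "Alaska", "Bethel Census Area"))
# prints http://BASE/id/us/state/Alaska/Bethel_Census_Area
```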

3.6. US Zip codes

Owner: USPS
Suggested: http://BASE/id/usps-com/zip/CODE
Example: http://BASE/id/usps-com/zip/09510
Include: code, link to state
Notes: The Census Bureau uses ZIP Code Tabulation Areas (ZCTA) based on ZIP codes.
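
One practical detail when minting ZIP code URIs: codes such as 09510 lose their leading zero if a dataset stores them as numbers. The sketch below, with a hypothetical helper name, treats the code as a five-character string before building the URI.

```python
def zip_uri(base: str, code) -> str:
    """Build a ZIP code URI, restoring any leading zeros lost to numeric storage."""
    text = str(code).strip().zfill(5)
    if len(text) != 5 or not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a five-digit ZIP code: {code!r}")
    return f"http://{base}/id/usps-com/zip/{text}"


print(zip_uri("BASE", 9510))     # prints http://BASE/id/usps-com/zip/09510
print(zip_uri("BASE", "09510"))  # prints http://BASE/id/usps-com/zip/09510
```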

3.7. US Congressional districts

Owner: state
Suggested: http://BASE/id/us/STATE/congressional-district/NUMBER
Example: http://BASE/id/us/ma/congressional-district/4
Include: link to state, dbpedia sameAs
Notes: STATE here can be a two-letter code because (at least for present-day districts and non-voting delegations) we only have data for places with two-letter codes: the 50 states, DC, AS, GU, MP, PR, and VI.

3.8. US Agencies: EPA Facilities

Owner: EPA
Suggested: http://BASE/id/epa-gov/facility/ID
Example: http://BASE/id/epa-gov/facility/110007995027
Include: link to state, link to facility detail report (e.g. http://iaspub.epa.gov/enviro/fii_query_detail.disp_program_facility?p_registry_id=110007995027)
Notes: EPA facility IDs are used in EPA Facilities Registry System (FRS) datasets (e.g. Dataset 1008)

4. References & Resources

[0] Uniform Resource Identifiers (URI): Generic Syntax
[1] legislation.gov.uk URIs
[2] Creating Linked Data - Part II: Defining URIs
[3] The Real Deal: data.gov.uk
[4] Treasury Board of Canada: Registry of Applied Titles
[5] See also: Phil Archer, Study on Persistent URIs, with identification of best practices and recommendations on the topic for the Member States and the European Commission (2012). Also available as PDF
[6] See also: Hans Overbeek and Linda van den Brink, Towards a national URI-Strategy for Linked Data of the Dutch public sector (2013)

lod-apps

Description: 
Software package, offering RESTful web services including: phpCsv2Rdf, phpJson2Rdf, and phpSparqlExplorer.

Graphite RDF Browser

Description: 
Quick and dirty RDF viewer.
By Christopher Gutteridge.

Graphite

Description: 
Quick and Dirty PHP Library for hacking with Linked Data. For playing with small graphs in a jQueryish way.
By Christopher Gutteridge.

sparql2kml

Description: 
Munge a sparql query directly onto a google map.
By Christopher Gutteridge.

graphitejs

Description: 
JavaScript version of the PHP library (work in progress).
By Christopher Gutteridge.

geo2kml

Description: 
Finds lat/long in rdf and shows it on a map.
By Christopher Gutteridge.

stuff2rdf

Description: 
Convert formats Christopher Gutteridge thinks of into any RDF formats Christopher Gutteridge thinks of.

Linking US and China's GDP data

Description: 
This demo links the GDP data published by the US and Chinese governments. It also adjusts the resulting GDP figures using currency exchange rate and population data.

1. Visualizing GDP of the US and China



source: http://www.bea.gov/national/index.htm#gdp US Bureau of Economic Analysis (BEA)
note: The real value of USD in each year may vary because it has not been adjusted for economic factors such as the inflation rate.


source: http://www.stats.gov.cn/tjsj/ndsj/laodong/2006/html/01-03.htm -- National Bureau of Statistics of China

2. Aligning GDP of the US and China by currency Exchange Rate



source: http://www.federalreserve.gov/releases/h10/hist/dat00_ch.txt US Federal Reserve Statistical Release
note: the exchange rate is a simple average of daily exchange rate over the calendar year.
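
The yearly averaging described in the note can be sketched in Python as follows. The sample rates are made-up illustrative values, not Federal Reserve data, and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def annual_average(daily_rates):
    """Simple (unweighted) average of daily rates, grouped by calendar year."""
    by_year = defaultdict(list)
    for day, rate in daily_rates:
        by_year[day.year].append(rate)
    return {year: mean(rates) for year, rates in sorted(by_year.items())}


# Made-up values for illustration only.
sample = [
    (date(2005, 1, 3), 8.0),
    (date(2005, 1, 4), 8.5),
    (date(2006, 1, 3), 8.0),
]
print(annual_average(sample))  # prints {2005: 8.25, 2006: 8.0}
```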


3. Adding population factor, GDP per capita



source: http://www.stats.gov.cn/tjsj/ndsj/laodong/2006/html/01-02.htm National Bureau of Statistics of China
Note: the population numbers are estimated based on census data.
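
The alignment steps amount to simple arithmetic, sketched below. The sketch assumes the exchange rate is quoted as units of local currency per US dollar; the function name and the numbers are illustrative, not taken from the datasets above.

```python
def gdp_per_capita_usd(gdp_local: float, local_per_usd: float, population: float) -> float:
    """Convert GDP to US dollars, then divide by population.

    Assumption: local_per_usd is local currency units per one US dollar.
    """
    if local_per_usd <= 0 or population <= 0:
        raise ValueError("exchange rate and population must be positive")
    gdp_usd = gdp_local / local_per_usd
    return gdp_usd / population


# Made-up values for illustration only.
print(gdp_per_capita_usd(gdp_local=800.0, local_per_usd=8.0, population=10.0))
# prints 10.0
```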


4. Adding inflation factor, using dollars adjusted to a 2005 base






US Global Foreign Aid

Description: 
This application presents historical foreign aid data (ranging from 1951 to 2008) from the United States Agency for International Development (USAID), the Department of Agriculture and the Department of State, mashed up with information from the New York Times API and CIA World Factbook.

Virtuoso Triple Store

Description: 
Triple store by OpenLink Software.
TWC developed shell scripts to wrap calls to virtuoso: pload and vdelete.

Trends in Smoking Prevalence, Tobacco Policy Coverage and Tobacco Prices (1991-2007)

Description: 
A PopSciGrid demo that illustrates how tobacco policy coverage rate, cigarette price/tax, and smoking rate have changed over time.

This demo is part of the Population Sciences Grid Project. For more information, please contact Professor Deborah McGuinness.

Please note: The National Cancer Institute’s (NCI) PopSciGrid Community Health Data Portal enables users to explore the relationship between cancer risk factors, health outcomes, and factors in the social and political environment. The maps, graphs, and tables on this website are not intended to convey information about causal relationships, and the NCI does not endorse any conclusions or inferences that individual users may draw from the maps, tables, or graphs produced by this website. Learn more about cancer.

Browser considerations: Firefox and IE 8 on Windows XP have difficulties displaying the following demonstrations. IE 7 on Windows Vista also has difficulties. Demonstrations can be successfully viewed using IE 8 on Windows 7, and Firefox and Safari on Mac. Flash is required for all browsers.

Using the demo: Select a year and a data series to show a heatmap that compares data for different states on a US map. The map may be blank for some combinations because the ImpacTeen datasets (other than the policy data) only cover 1991 to 2007. Select a state to show how the data series evolved from 1991 to 2007.

Smoking prevalence refers to the estimated percentage of adults in the US who currently smoke.

PopSciGrid Project Contributors

National Cancer Institute

  • Abdul R. Shaikh
  • Richard Moser
  • Bradford W. Hesse
  • Glen D. Morgan
  • Erik M. Augustson
  • Yvonne Hunt
  • Zaria Tatalovich
  • Gordon Willis
  • Kelly Blake
  • Paul Courtney (contractor)
  • Lila Finney (contractor)
  • Amy Sanders (contractor)

Rensselaer Polytechnic Institute

  • Deborah L. McGuinness
  • Li Ding
  • Tim Lebo
  • Jim McCusker

Northwestern University

  • Noshir Contractor
  • Yun Huang
  • Hugh Devlin
  • York Yao

csv2rdf4lod

Description: 
In its simplest form, csv2rdf4lod is a quick and easy way to produce an RDF encoding of data available in Comma-Separated-Values (CSV). In its advanced form, csv2rdf4lod is a custom reasoner tailored for some heavy-duty data integration.
See https://github.com/timrdf/csv2rdf4lod-automation/wiki for the source code, documentation, examples, and issues tracking.
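
To give a feel for what "an RDF encoding of CSV" means, here is a deliberately naive Python sketch that turns each CSV cell into one N-Triples statement. It is not csv2rdf4lod's algorithm or output format; the function name, the base URI, and the "/vocab/" naming scheme are assumptions made up for this illustration.

```python
import csv
import io
from urllib.parse import quote


def csv_to_ntriples(text: str, base: str, key_column: str):
    """Emit one triple per cell: <row subject> <column predicate> "cell value"."""
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"<{base}/{quote(row[key_column])}>"
        for column, value in row.items():
            predicate = f"<{base}/vocab/{quote(column)}>"
            # Escape characters that are special inside N-Triples literals.
            literal = (value.replace("\\", "\\\\")
                            .replace('"', '\\"')
                            .replace("\n", "\\n"))
            lines.append(f'{subject} {predicate} "{literal}" .')
    return lines


sample = "state,code\nVermont,VT\n"
for line in csv_to_ntriples(sample, "http://example.org", "state"):
    print(line)
```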

LOGD resources relating to csv2rdf4lod:

  • Please read the following tutorial to start using csv2rdf4lod.

  • The REST based conversion service?

  • The SPARQL endpoint http://logd.tw.rpi.edu/sparql hosts RDF data created by csv2rdf4lod

Yahoo! Pipes

Description: 
Pipes is a powerful composition tool to aggregate, manipulate, and mashup content from around the web.

RDFa

Description: 
RDFa provides a set of XHTML attributes to augment visual data with machine-readable hints

JQuery

Description: 
The Write Less, Do More, JavaScript Library

Javascript

Description: 
an object-oriented scripting language used to enable programmatic access to objects within both the client application and other applications.

PHP

Description: 
general-purpose scripting language that was originally designed for web development, to produce dynamic web pages

SIMILE Exhibit

Description: 
Exhibit lets you easily create web pages with advanced text search and filtering functionalities, with interactive maps, timelines, and other visualizations

Google Visualization API

Description: 
The Google Visualization API lets you access multiple sources of structured data that you can display, choosing from a large selection of visualizations.

SparqlProxy

Description: 
A TWC LOGD service that helps users proxy SPARQL endpoint access, cache SPARQL query results, and convert SPARQL query results into many different formats.

Source Code and Resources

SparqlProxy is a TWC LOGD service:

RESTful Service Interface Description

Parameter      Status        Description
service-uri    stable        URI of the SPARQL service.
query          stable        SPARQL query string.
query-uri      stable        URI of the SPARQL query. Use only one of "query-uri" and "query"; they are mutually exclusive.
output         stable        (optional) The output format. Default is xml. All values are listed below:
                               • xml => SPARQL/XML
                               • sparqljson => SPARQL/JSON
                               • exhibit => EXHIBIT/JSON
                               • gvds => GoogleViz/JSON
                               • csv => CSV
                               • html => HTML
                               • tablerow => Table(Row)
                               • tablecol => Table(Col)
                             To keep backward compatibility, both "sparqljson" and "sparql" can be used to output the SPARQL/JSON format.
callback       experimental  (optional) Callback function name. This parameter only applies to two output formats: exhibit and sparqljson.
tqx            experimental  (optional) For the Google Visualization API only, e.g. version:0.6;reqId:1;responseHandler:myQueryHandler
refresh-cache  experimental  (optional) SparqlProxy uses a cache by default. Users may opt out (avoid caching) by setting "refresh-cache=on" in the service request.
textoutput     experimental  (optional) Sets "text/plain" as the content type in the HTTP response header so users can view the result in a browser. To enable it, add "textoutput=yes" to the service request.
ui-option      experimental  (optional) SparqlProxy allows users to show the query behind a SPARQL query result. To enable it, add "ui-option=query" to the service request.

Example Usage

The following example uses the DBpedia SPARQL endpoint and the following SPARQL query (listing 10 triples, also published at http://logd.tw.rpi.edu/query/stat_ten_triples.sparql):
service-uri: http://dbpedia.org/sparql
query-uri: http://logd.tw.rpi.edu/query/stat_list_ten_triples.sparql
query: SELECT ?s ?p ?o WHERE {?s ?p ?o} limit 10
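
A request URL for this example could be assembled as in the Python sketch below. The parameter names come from the table above and the proxy address from the Discussion section; the helper name `proxy_url` is hypothetical, the sketch only builds the URL without sending it, and the service may no longer be reachable.

```python
from urllib.parse import urlencode

PROXY = "http://logd.tw.rpi.edu/ws/sparqlproxy.php"


def proxy_url(service_uri, query=None, query_uri=None, output="xml", refresh_cache=False):
    """Build a SparqlProxy request URL from the documented parameters."""
    # "query" and "query-uri" are mutually exclusive; exactly one is required.
    if (query is None) == (query_uri is None):
        raise ValueError("provide exactly one of query or query_uri")
    params = {"service-uri": service_uri}
    if query is not None:
        params["query"] = query
    else:
        params["query-uri"] = query_uri
    params["output"] = output
    if refresh_cache:
        params["refresh-cache"] = "on"  # bypass the proxy's cache
    return PROXY + "?" + urlencode(params)


print(proxy_url("http://dbpedia.org/sparql",
                query="SELECT ?s ?p ?o WHERE {?s ?p ?o} limit 10",
                output="sparqljson"))
```

`urlencode` percent-encodes the query string, so characters such as `?`, `{`, and spaces are safe to pass as a parameter value.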

Discussion

Q: Why use a cache in sparqlproxy?
A: The cache in the LOGD SPARQL endpoint exists for historical reasons. It is used specifically to reduce the load that repetitive queries place on the LOGD server (e.g. when loading dynamically generated web pages). The cache expires automatically, or you may use the "refresh-cache=on" option in a RESTful request to bypass it.
Q: Why two sparqlproxy instances?
A: We maintain two sparqlproxy instances on the LOGD server: http://logd.tw.rpi.edu/sparql is dedicated to the LOGD SPARQL endpoint, and http://logd.tw.rpi.edu/ws/sparqlproxy.php allows users to access other SPARQL endpoints. Users may download the sparqlproxy code and install an instance on their own computer, allowing external users to access their internal triplestores.

SPARQL

Description: 
A query language for RDF data. SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF

Clean Air Status and Trends - Ozone

Description: 
This demo shows a map of the U.S. depicting the Clean Air Status and Trends Network (CASTNET) Ozone monitoring stations and their average Ozone level. Clicking a station shows more detailed information and a link to a graphical representation of the time-series of measurements.

About


For this demo, we used RDF data derived from ozone and visibility readings provided by the EPA's Castnet project. This RDF combines raw data on Castnet readings (ozone and visibility) with corresponding geographic information on the sites of readings, which the raw Castnet data lacks. On the provided map of the US, every yellow dot represents a single Castnet site, and dot size corresponds to the average ozone reading for that site: larger dots represent larger averages. When a Castnet site is clicked, a small pop-up opens, displaying more information on that site, along with a link. This link takes the user to another page that displays in a timeline all the ozone and visibility data available for that site. This timeline uses the Google Visualization API, giving users added functionality in how they view the data in the timeline. To allow users to filter through Castnet readings, a faceted browsing interface (from the MIT Simile Exhibit API) is provided.

Interesting observations

This demo uses SPARQL to combine raw data on the Castnet Project (dataset 8) from Data.gov and geographic information on Castnet Stations (Dataset 10001) from the EPA website. This enables the demo to plot Castnet readings on a US map, allowing users to look for geographic relationships within the Castnet data. When a user clicks on a Castnet station, historical ozone values for that station can then be viewed through a timeline. Finally, the use of faceted browsing aids users in interacting with presented data, allowing for filtering based on different criteria.

Technology Highlights


  • This demo converts RDF for the Google Visualization API by using a SPARQL service and XSLT to put the data into JSON format.

    • SPARQL is used to grab data from two datasets.

  • This also uses MIT Simile Labs Exhibit to make the faceted browsing map.

    • A Web service is used to convert JSON from SPARQL into JSON that Exhibit can use.

  • The visualization used is the annotated timeline.
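
The JSON-to-JSON conversion step mentioned above can be sketched in Python. The input follows the standard SPARQL query results JSON layout ("head"/"vars" and "results"/"bindings"); the output is a simplified Exhibit-style {"items": [...]} structure. This is not the demo's actual web service: the function name and the sample station values are made up, and a real Exhibit feed would typically also carry fields such as a label for each item.

```python
def sparql_json_to_items(result: dict) -> dict:
    """Flatten SPARQL JSON bindings into a list of plain {variable: value} items."""
    variables = result["head"]["vars"]
    items = []
    for binding in result["results"]["bindings"]:
        # Unbound variables are simply absent from a binding, so skip them.
        item = {var: binding[var]["value"] for var in variables if var in binding}
        items.append(item)
    return {"items": items}


# Made-up sample result for illustration only.
sample = {
    "head": {"vars": ["station", "ozone"]},
    "results": {"bindings": [
        {"station": {"type": "literal", "value": "ABC123"},
         "ozone": {"type": "literal", "value": "41.5"}},
        {"station": {"type": "literal", "value": "XYZ789"}},
    ]},
}
print(sparql_json_to_items(sample))
```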

