TWC LOGD and the Semantic Web Challenge 2010

Congratulations! TWC LOGD won the second prize (CNY 5,000) in the Semantic Web Challenge 2010!

1. Does the TWC LOGD Portal meet the SWC2010 “Minimum Requirements”?

1.1 “End User Application”      

The TWC LOGD Portal is a comprehensive semantic web application dedicated to publishing Linked Data versions of open government data for end users and to the sharing of tools, services and expertise supporting an open government data ecosystem. The Portal serves a wide range of consumers and producers ranging from informed citizens and domain experts to developers of novel applications and web sites that will be enriched by government data. The TWC LOGD Portal remains an essential resource for government web site developers working to publish open data using Semantic Web technologies.

  • The TWC LOGD Portal landing page combines data from multiple sources in order to demonstrate the use of Semantic Web principles to create a dynamic micro-app that is useful in its own right. How the landing page works:
    • VoiD metadata from converted government datasets is stored in the TWC LOGD "Data" triple store
    • Metadata from demos and tutorials is published via RDFa-annotated XHTML pages and synchronized with the "LOD Cache"
    • Relevant news items are maintained in the Data-gov site
    • Several RSS feeds are published at the Data-Gov and LOGD sites
    • The content panels on the TWC LOGD Portal front page are based on live SPARQL queries across the site data using XSLT and the Google Ajax API
  • As of late September 2010 more than 8.5 billion RDF triples have been made accessible to end users through the TWC LOGD Portal, from 436 RDFized datasets published by 11 different data sources; the majority of these are from Data.gov.
  • The 40+ embedded, live mashups and visualizations featured on the TWC LOGD portal are themselves useful end-user applications, created to demonstrate replicable coding practices for consuming government-related data in the Web of Data.
  • Developer-users of the TWC LOGD Portal demos can find summaries describing the datasets, technologies and queries used; comprehensive tutorials clarify the methods.
  • The TWC LOGD Portal and its predecessor have seen more than 400K page visits from 134 countries and 4K cities since going live in 2009.
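
The "live SPARQL queries" step behind the landing page can be sketched as follows. This is a minimal illustration, not the Portal's actual code: the endpoint URL, the query and the sample response are assumptions made up for the example, and the Portal itself is described as using XSLT and the Google Ajax API rather than Python.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint URL, used only for illustration.
ENDPOINT = "http://example.org/sparql"

# Illustrative query: list datasets with their titles and triple counts,
# using the VoID and Dublin Core vocabularies.
QUERY = """
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?dataset ?title ?triples WHERE {
  ?dataset a void:Dataset ;
           dcterms:title ?title ;
           void:triples ?triples .
}
ORDER BY DESC(?triples)
LIMIT 5
"""


def build_request_url(endpoint, query):
    """Return a GET URL that passes the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


def parse_bindings(results_json):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    doc = json.loads(results_json)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # Unbound variables are simply absent from a binding.
        rows.append({v: binding[v]["value"] for v in variables if v in binding})
    return rows


# Made-up response in the SPARQL JSON results format (not real Portal data).
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["dataset", "title", "triples"]},
    "results": {"bindings": [
        {
            "dataset": {"type": "uri", "value": "http://example.org/dataset/1"},
            "title": {"type": "literal", "value": "Example dataset"},
            "triples": {"type": "literal", "value": "1200"},
        }
    ]},
})
```

A content panel would then render the rows returned by parse_bindings; fetching the URL from build_request_url over HTTP is left out so that the sketch runs without network access.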
 

1.2.a. “diverse ownership or control”        

  • Data published through the TWC LOGD Portal is aggregated from a variety of government and non-government sources, each with its own degree of authority and trustworthiness.
  • The owners/controllers of the Open Government Data include numerous U.S. government agencies. 
  • We also integrate data from other social entities (e.g. DBpedia, New York Times, Twitter, Google Search). 

1.2.b. “heterogeneous sources”       

  • Our demos are diverse mashups and visualizations that integrate data from multiple sources, including DBpedia, the New York Times API, and open government data produced by non-US sources. For example, the "CASTNET Ozone Map" mashes up data from Data.gov and epa.gov.
  • The TWC LOGD landing page is itself a mashup of data from multiple sources. As shown below, the content panels on the TWC LOGD front page are based on live SPARQL queries across the site data using XSLT and the Google Ajax API. The sources include metadata of the TWC LOGD datasets stored in the TWC LOGD Data triple store; metadata of demos and tutorials published via RDFa-annotated XHTML pages that are synchronized with the LOD Cache; relevant news items maintained on the Data-Gov website; and RSS feeds from the TWC LOGD Portal, Data-Gov’s SemDiff Service and the Google News site.
  • Our converter also RDFizes raw data in CSV, XML and fixed-width text formats.
  • See Section 3 for more details.

1.2.c. “real world data”        

  • The data we integrate provide good coverage of real-world data. They come from a variety of government agencies as well as from diverse non-government sources covering news, locations, people, events, etc.
  • Therefore, TWC LOGD has been included in the Sept 2010 version of the Linked Open Data Cloud.

1.3.a. “using Semantic Web technologies”          

The TWC LOGD Portal enables a Semantic Web-based ecosystem:   

                        

  • Data Conversion/Creation: Most government datasets are released in “raw” or unstructured formats. The TWC LOGD Portal converts these “raw datasets” to RDF using the TWC LOGD converter. The dataset versioning mechanism is used with a semantic diff (SemDiff) service to compute changes in the Data.gov dataset catalog. Semantic-capable content management systems, such as Semantic MediaWiki and Drupal with RDF modules, have been used to preserve user-generated metadata in normal content publishing activities using RDFa-annotated XHTML.
  • Data Query/Access: The converted datasets may be accessed through the TWC LOGD Portal in several ways. Each dataset has a summary web page that aggregates manually contributed metadata (e.g. title, description, agency) and automatically generated metadata (e.g. number of triples, links to data dumps). Dataset metadata can also be accessed by dereferencing URIs, following Linked Data principles. To support access via query, a publicly accessible SPARQL endpoint hosts a selection of datasets together with the automatically recorded metadata of the TWC LOGD datasets. SparqlProxy, a TWC-developed tool, enhances the SPARQL endpoint so that it returns results in a richer set of formats, such as JSON and HTML tables. LOD Cache, another TWC-developed tool, synchronizes RDF data published via RDFa-annotated web pages throughout the Portal.
  • Education and Community Portal:  RDFa on Drupal is used to publish both human- and machine-readable metadata embedded in TWC LOGD demos and tutorial pages; this approach hides the details of RDF creation from end users and enables a SPARQL-queryable site with novel features including the dynamically-generated list of demos on the TWC LOGD front page.
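
To illustrate what returning SPARQL results "in a richer set of formats" can involve, the sketch below renders a SPARQL JSON results document as an HTML table. It is an assumption-laden illustration of the general idea only; it is not SparqlProxy's implementation, and the function name and sample data are invented for the example.

```python
import json
from html import escape


def results_to_html_table(results_json):
    """Render a SPARQL JSON results document as a simple HTML table."""
    doc = json.loads(results_json)
    variables = doc["head"]["vars"]

    lines = ["<table>"]
    # Header row: one column per query variable.
    header = "".join(f"<th>{escape(v)}</th>" for v in variables)
    lines.append(f"  <tr>{header}</tr>")

    for binding in doc["results"]["bindings"]:
        cells = []
        for v in variables:
            # An unbound variable becomes an empty cell.
            value = binding.get(v, {}).get("value", "")
            cells.append(f"<td>{escape(value)}</td>")
        lines.append("  <tr>" + "".join(cells) + "</tr>")

    lines.append("</table>")
    return "\n".join(lines)


# Made-up sample results (not real Portal data).
SAMPLE_RESULTS = json.dumps({
    "head": {"vars": ["agency", "datasets"]},
    "results": {"bindings": [
        {"agency": {"type": "literal", "value": "Parks & Recreation"},
         "datasets": {"type": "literal", "value": "3"}},
        {"agency": {"type": "literal", "value": "Transport"}},
    ]},
})
```

Escaping every value before it is written into the table matters here because query results can contain arbitrary text, including characters such as "&" and "<".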

1.3.b. “data manipulated/processed in interesting ways”          

  • We leverage RDFa to preserve structured metadata in a content management system and then support a queryable Web using a LOD cache.
  • We also provide powerful data enhancement functions, e.g. promoting literal strings to DBpedia URIs, and enable cell-based conversion for multi-dimensional data tables.
  • Our mashups support analysis of multiple datasets across domains and time periods, revealing hidden facts (or stimulating hypotheses) that would be impossible to find within a single dataset. A team of graduate and undergraduate students has created over 40 different mashups and visualizations on the TWC LOGD Portal. These mashups are diverse: they integrate data from multiple sources, including DBpedia, the New York Times API, and open government data produced by non-US sources (see a-g); deploy data via web and mobile interfaces (see a); support interactive analysis for specific domains including health, policy and financial data (see e-g); consume integrated data using readily available Web-based services (see h, i); and provide data access tools (see j) and semantic data integration tools (see k).
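
The two enhancement functions mentioned above, cell-based conversion and promotion of literal strings to DBpedia URIs, can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the namespaces, property names and lookup table are hypothetical, and the output does not reproduce the vocabulary or behaviour of the actual TWC LOGD converter.

```python
import csv
import io

# Hypothetical namespaces, used only for illustration.
BASE = "http://example.org/dataset/1/"
VOCAB = "http://example.org/vocab/"

# Hand-written lookup used to promote known literals to DBpedia URIs.
PROMOTE = {
    "Texas": "http://dbpedia.org/resource/Texas",
    "Ohio": "http://dbpedia.org/resource/Ohio",
}


def term(value):
    """Return an N-Triples term: a URI if the literal is promotable, else a string."""
    if value in PROMOTE:
        return f"<{PROMOTE[value]}>"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def cell_based_triples(csv_text, key_column):
    """Emit one resource per table cell, described by row key, column and value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    triples = []
    for row_no, row in enumerate(reader, start=1):
        for column, value in row.items():
            if column == key_column:
                continue
            cell = f"<{BASE}cell_{row_no}_{column}>"
            triples.append(f"{cell} <{VOCAB}{key_column}> {term(row[key_column])} .")
            triples.append(f'{cell} <{VOCAB}column> "{column}" .')
            triples.append(f"{cell} <{VOCAB}value> {term(value)} .")
    return triples


# Made-up table: one row per state, one column per year.
SAMPLE_CSV = "state,2008,2009\nTexas,10,12\nVermont,3,4\n"
```

In this sketch "Texas" is promoted to a DBpedia URI because it appears in the lookup table, while "Vermont" stays a plain literal; a real converter would need a far more careful way of deciding when promotion is safe.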
                    

1.3.c. “central role in achieving things that alternative technologies cannot do as well”

  • Data Integration: Data integration is vital to the objectives of OGD. Linking and integrating data in novel ways help consumers uncover new patterns and correlations and create new knowledge. Linked Data principles and semantic web technologies make it easy to connect heterogeneous datasets without advance coordination or planning.    
  • Fast: The TWC LOGD Portal demonstrates how Linked Data principles and semantic web technologies may be applied to decrease development costs and increase the reuse of data models, links and visualization techniques. It advocates a bottom-up approach, encouraging developers to collaboratively model data, define terms, link terms and concepts to other heterogeneous datasets, and use generic visualization libraries and APIs to get useful applications running more quickly. For example, the “CASTNET Ozone Map” was created within two weeks and iteratively enhanced over the span of a month. Similarly, in September 2010 four unique demos were created to support a tobacco prevalence study in the NIH project PopSciGrid.

  • Low Cost: Developers don't need to be experts in semantic technologies or Linked Data principles to create semantically-enabled LOGD mashups. Undergraduate students in RPI’s Fall 2009 Web Science class created mashups using semantic technologies and datasets found on the TWC LOGD Portal. Given a two-hour introduction to basics like RDF and SPARQL and to patterns for using visualization tools like the Google Visualization API, each group created visualizations mashing up at least two converted datasets in less than two weeks. Similarly, in August 2010 the US Data.gov project hosted a Mash-a-thon workshop, organized in part by TWC, to engage government developers and data curators in hands-on learning using TWC LOGD tools and datasets. In just two days four teams successfully built LOGD-based mashups, demonstrating the low cost of knowledge transfer and the rapid learning process inherent in the best practices embodied by the TWC LOGD Portal.

2. What “Additional Desirable Features” does TWC LOGD demonstrate?

The TWC LOGD Portal extends Drupal 6 with semantic technologies to create a flexible, scalable collaboration environment with an attractive and functional Web interface. Portal content is dynamically presented using a combination of Drupal, XSLT, SPARQL and RDFa. SPARQL is also used to query external data and present results to users. Most Portal pages have “Like” (enabled by the Open Graph Protocol and RDFa) and “Rate” buttons, enabling end users to give feedback.

The scalability of the TWC LOGD infrastructure is demonstrated by the large and diverse datasets being actively converted and published on a daily basis. TWC LOGD mashups also exhibit how distributed services and data sources may be integrated using semantic technologies. Our open source code presents best practices for combining dynamic and static data in scalable real-time visualizations.

In addition to interactive demos of data mashups, the TWC LOGD Portal hosts multimedia documents including videos and publications to help stakeholders understand demos and tutorials.

The unique work represented by the TWC LOGD Portal has been rigorously evaluated in the US Government and discussed in the White House blog, commending its use of Data.gov datasets in innovative ways to generate practical applications and mashups. Details of TWC’s role and the impact of applying semantic web technologies can be found in Sections 3.1 and 3.2.

TWC is actively engaged with organizations inside and outside of government, and this enables us to receive and act on feedback concerning our data conversion, tools, and applications. Section 3.3 details this interaction. The diagrams above illustrate the TWC LOGD Portal's data workflow. Section 3.3 shows additional work on data and provenance that goes beyond pure information retrieval.

The TWC LOGD Portal uses several approaches to ensure the accuracy of results by improving the quality of converted data. Evolving heuristics based on statistical analysis are used to connect entities with the same meaning across large volumes of data, for example linking datasets based on US states.
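
One simple form such a heuristic could take is sketched below: a column is treated as holding US states only if a large enough share of its values match known state names or postal abbreviations, and matching values are then linked to a shared identifier. The threshold, the abbreviated state list and the URI pattern are all assumptions for illustration; the Portal's actual heuristics are not described here.

```python
# Small subset of US states, kept short for the example.
STATES = {
    "alabama": "AL",
    "alaska": "AK",
    "california": "CA",
    "new york": "NY",
    "ohio": "OH",
    "texas": "TX",
}
ABBREVIATIONS = set(STATES.values())

# Hypothetical URI pattern for the shared state identifiers.
STATE_URI = "http://example.org/id/us/state/{}"


def normalise(value):
    """Map a state name or postal abbreviation to its abbreviation, else None."""
    text = value.strip()
    if text.upper() in ABBREVIATIONS:
        return text.upper()
    return STATES.get(text.lower())


def state_match_ratio(values):
    """Return the fraction of non-empty values recognised as US states."""
    values = [v for v in values if v.strip()]
    if not values:
        return 0.0
    hits = sum(1 for v in values if normalise(v) is not None)
    return hits / len(values)


def link_column(values, threshold=0.8):
    """Link values to state URIs, but only if the column looks like states."""
    if state_match_ratio(values) < threshold:
        return {}
    links = {}
    for v in values:
        code = normalise(v)
        if code is not None:
            links[v] = STATE_URI.format(code)
    return links
```

Under this sketch, "Texas" in one dataset and "TX" in another resolve to the same URI, which is what allows the two datasets to be joined on state.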

The TWC LOGD Portal currently operates as a research-oriented, US-based site. It supports access from multiple devices; for example, the “White House Visitor” demo runs on the iPad and has been submitted to the iTunes Store for distribution. No alternate language versions of our content or links to translation services are currently available.

Attachment      Size
castnet.png     101.88 KB
workflow.png    106.22 KB
mashup.png      155.08 KB
portal.png      76.23 KB
Paper.pdf       1.53 MB
cbox.jpg        12.5 KB