What happened on 12-13 Dec 2010?

Contributor: 
Status: 
posted
Description: 
<p><em>Clarify what changes were made to LOGD (processes, triplestore(s), etc) over the weekend 10-12 Dec that broke applications on 13 Dec and what was done to solve the problem(s)</em></p>

MY QUESTION:

I'd like the team to explain with as much clarity as possible:
  1. What changes were made to our processes and triplestore(s) over the weekend 10-12 Dec that resulted in the breakage of applications on 13 Dec?
  2. What was done to solve the problem(s)?
  3. Were these permanent or temporary fixes?
  4. What recommendations do we have for moving forward?

BACKGROUND:

On Monday morning, 13 Dec, my queries for the "Elsevier App" http://logd.tw.rpi.edu/twc_us_government_dataset_search_application were coming back empty (an example query is available via that page). Investigating further, I found that most or all of the dataset data was missing from the LOGD dataset catalog listings table http://logd.tw.rpi.edu/datasets. Over the course of the morning basic LOGD results returned, but the dataset(s) critical to the SciVerse app were not reloaded until late in the evening, when Tim and I had a chance to chat in detail.

Tim notes

First part of the problem: I added a new version of dataset 92 (2010-Dec-11).
I spent some time pulling together all of the enhancements and used the "new" approach we discussed.
I wrote it up at http://logd.tw.rpi.edu/lab/faq/need_instruction_adding_creating_new_vers...
Two root causes:
  • dcterms was bound to the elements/1.1/ namespace instead of purl.org/dc/terms/
  • agency was promoted to a URI (to objectSameAs)

That night, the cron job
    1 1 * * * /work/data-gov/v2010/csv2rdf4lod/data/source/populate-to-endpoint.sh &> /dev/null
ran (it is scheduled for 01:01 every day), found the latest version of 92 (2010-Dec-11) and /work/data-gov/v2010/csv2rdf4lod/data/source/twc-rpi-edu/dataset-catalog, and put both into the Metadata named graph. That cron job also blows away the Dataset named graph and repopulates it with all publish/*.void.ttl.

Each time someone whined, I just reran /work/data-gov/v2010/csv2rdf4lod/data/source/populate-to-endpoint.sh as root and they stopped complaining. Li changed some SPARQL queries to handle the promoted agencies, but I am not sure if he got them all.
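The first root cause (a dcterms prefix bound to the elements/1.1/ namespace rather than purl.org/dc/terms/) is the kind of thing that could be caught before a version is published. A minimal sketch, assuming the void files are Turtle and declare the prefix on an @prefix line; the function name find_bad_dcterms is hypothetical and not part of csv2rdf4lod:

```shell
# Hypothetical helper, not part of csv2rdf4lod.
# Lists Turtle files under directory $1 that bind the dcterms prefix to the
# DC Elements 1.1 namespace instead of http://purl.org/dc/terms/.
# Usage (path is an example): find_bad_dcterms data-gov/92/version/2010-Dec-11/publish
find_bad_dcterms() {
  find "$1" -type f -name '*.ttl' -exec \
    grep -lE '@prefix[[:space:]]+dcterms:[[:space:]]+<http://purl\.org/dc/elements/1\.1/>' {} +
}
```

No output means no offending prefix declarations were found under the given directory. Files that use the wrong namespace without an @prefix line (for example, with full URIs) would not be caught by this check.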
  • To list all void files going into the Dataset named graph: cd /work/data-gov/v2010/csv2rdf4lod/data/source; cr-publish-void-to-endpoint.sh -n auto
  • [lebot@sam source]$ for a92 in `cat voids.txt | grep data-gov | grep "/92/" | awk '{print $3}'`; do echo $a92; grep "\"10\"" $a92 ; done
    data-gov/92/version/2010-Dec-11/publish/data-gov-92-2010-Dec-11.void.ttl
    e1:id "10" ;
    dcterms:identifier "10" ;
    data-gov/92/version/2010-Sep-20/publish/data-gov-92-2010-Sep-20.void.ttl
    data-gov/92/version/2010-Sep-13/publish/data-gov-92-2010-Sep-13.void.ttl
    data-gov/92/version/2010-Aug-23/publish/data-gov-92-2010-Aug-23.void.ttl
    data-gov/92/version/1st-anniversary/publish/data-gov-92-1st-anniversary.void.ttl
  • The "provenance pass" from the converter was producing data instead of JUST provenance.
    That output is stored in automatic/*e1.void.ttl, which makes it into publish/*.void.ttl, which in turn makes it into the logd:Dataset named graph.
    I've added a cr-var that defaults to NOT include the row/column provenance so that LOGD does not experience these problems.
    I am pushing the csv2rdf4lod shell scripts to LOGD now.
    I will rerun the bad version of 92 and repopulate-endpoint.sh
  • Q: "Why is PML going into VOID?"
    A: Some things called "void" are incorrectly named and should be "meta": void was our first kind of metadata, and there are now others (pml, dc, scovo, etc.).
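
The per-version check in the second bullet (which void files for dataset 92 contain the row-level literal "10") can be wrapped in a small function for reuse. This is only a sketch: the name check_voids_for_literal is hypothetical, and it assumes, as the awk '{print $3}' in that bullet does, that the file path is the third whitespace-separated column of voids.txt.

```shell
# Hypothetical helper, not part of csv2rdf4lod.
# $1: listing file whose third column is a void file path (e.g. voids.txt)
# $2: dataset identifier to filter on (e.g. 92)
# $3: literal to look for, without the surrounding quotes (e.g. 10)
check_voids_for_literal() {
  list="$1"; dataset="$2"; literal="$3"
  # Keep only data-gov lines whose path contains /<dataset>/ and emit the path.
  awk -v ds="/$dataset/" '/data-gov/ && index($0, ds) {print $3}' "$list" |
  while IFS= read -r void; do
    echo "$void"
    # Print matching lines; a void file without the literal prints nothing.
    grep -F "\"$literal\"" "$void" || true
  done
}
```

Called as check_voids_for_literal voids.txt 92 10 from the source directory, it prints each matching void file followed by any lines containing "10". A file name with nothing under it means that version is clean.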