Description:
Clarify what changes were made to LOGD (processes, triplestore(s), etc.) over the weekend of 10-12 Dec that broke applications on 13 Dec, and what was done to solve the problem(s).
MY QUESTION:
I'd like the team to explain with as much clarity as possible:
- What changes were made to our processes and triplestore(s) over the weekend 10-12 Dec that resulted in the breakage of applications on 13 Dec?
- What was done to solve the problem(s)?
- Were these permanent or temporary fixes?
- What recommendations do we have for moving forward?
BACKGROUND:
On Monday morning, 13 Dec, my queries for the "Elsevier App"
http://logd.tw.rpi.edu/twc_us_government_dataset_search_application were coming back empty (example query available via that page). Investigating further, I found that most or all of the dataset data was missing from the LOGD dataset catalog listings table
http://logd.tw.rpi.edu/datasets.
Over the course of the morning, basic LOGD results returned, but the dataset(s) critical to the SciVerse app were not reloaded until late in the evening, when Tim and I had a chance to chat in detail.
Tim notes:
First part of the problem: a new version of dataset 92 (2010-Dec-11) was added.
I spent some time pulling together all of the enhancements and using the "new" approach we discussed.
Wrote it up at
http://logd.tw.rpi.edu/lab/faq/need_instruction_adding_creating_new_vers...
(Two root causes: dcterms used the elements/1.1/ namespace instead of purl.org/dc/terms/,
AND
the agency was promoted to a URI via objectSameAs.)
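For illustration only, and not part of what was actually run: a quick way to spot the namespace mix-up is to count how often each Dublin Core namespace appears in the published file for the new version (the file path is the one listed further down in these notes).
# Illustrative check for the dcterms namespace mix-up; grep/echo only, nothing here modifies data.
cd /work/data-gov/v2010/csv2rdf4lod/data/source
f=data-gov/92/version/2010-Dec-11/publish/data-gov-92-2010-Dec-11.void.ttl
echo "dc/elements/1.1/ (wrong):    $(grep -c 'purl.org/dc/elements/1.1/' "$f")"
echo "dc/terms/        (intended): $(grep -c 'purl.org/dc/terms/' "$f")"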
That night, the cron job
1 1 * * * /work/data-gov/v2010/csv2rdf4lod/data/source/populate-to-endpoint.sh &> /dev/null
ran, found the latest version of 92 (2010-Dec-11) along with
/work/data-gov/v2010/csv2rdf4lod/data/source/twc-rpi-edu/dataset-catalog
and put both into the Metadata named graph.
That cron job also blows away the Dataset named graph and repopulates it with all publish/*.void.ttl files.
Each time someone whined, I just reran /work/data-gov/v2010/csv2rdf4lod/data/source/populate-to-endpoint.sh as root and they stopped complaining.
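(A minimal sketch of how to confirm the schedule and rerun the job by hand; it assumes the cron entry lives in root's crontab, and the log path is only an example.)
# Confirm the repopulation job is still scheduled (assumes it is in root's crontab).
sudo crontab -l | grep populate-to-endpoint.sh
# Rerun it manually, keeping the output for inspection instead of sending it to /dev/null.
sudo /work/data-gov/v2010/csv2rdf4lod/data/source/populate-to-endpoint.sh 2>&1 | tee /tmp/populate-to-endpoint.log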
Li changed some SPARQL queries to handle the promoted agencies, but I'm not sure he got them all.
- cd /work/data-gov/v2010/csv2rdf4lod/data/source; cr-publish-void-to-endpoint.sh -n auto will list all void files going into the Dataset named graph.
- This one-liner checks each version of dataset 92's published void file for the literal "10" (a commented version is sketched after the output below):
[lebot@sam source]$ for a92 in `cat voids.txt | grep data-gov | grep "/92/" | awk '{print $3}'`; do echo $a92; grep "\"10\"" $a92 ; done
data-gov/92/version/2010-Dec-11/publish/data-gov-92-2010-Dec-11.void.ttl
    e1:id "10" ;
    dcterms:identifier "10" ;
data-gov/92/version/2010-Sep-20/publish/data-gov-92-2010-Sep-20.void.ttl
data-gov/92/version/2010-Sep-13/publish/data-gov-92-2010-Sep-13.void.ttl
data-gov/92/version/2010-Aug-23/publish/data-gov-92-2010-Aug-23.void.ttl
data-gov/92/version/1st-anniversary/publish/data-gov-92-1st-anniversary.void.ttl
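(The same check written out with comments; it assumes, as the one-liner does, that voids.txt is the listing produced by cr-publish-void-to-endpoint.sh -n above and that the path to each void file is in its third column.)
# Commented form of the one-liner above.
for a92 in $(grep data-gov voids.txt | grep "/92/" | awk '{print $3}'); do
    echo "$a92"          # which version of dataset 92 this void file describes
    grep '"10"' "$a92"   # any triples carrying the literal "10", a sign of leaked row data
done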
- The "provenance pass" from the converter was producing data instead of JUST provenance.
That output is stored in the automatic/*e1.void.ttl, which makes it into publish/*.void.ttl, which makes it to logd:Dataset named graph.
I've added a cr-var that defaults to NOT include the row/column provenance so that LOGD does not experience these problems.
I am pushing the csv2rdf4lod shell scripts to LOGD now.
I will rerun the bad version of 92 and repopulate-endpoint.sh.
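(Also illustrative rather than something that was run: a repository-wide spot check for the same kind of leak, assuming the e1:id predicate seen in the dataset 92 output above is representative; other leaked predicates may exist.)
# List every publish void file that still contains the e1:id predicate;
# this is only a first-pass indicator of leaked row/column data.
cd /work/data-gov/v2010/csv2rdf4lod/data/source
find . -path "*/publish/*.void.ttl" -exec grep -l "e1:id" {} +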
- Q: "Why is PML going into VOID?"
A: Some things called "void" are incorrectly named and should be "meta"; void was our first kind of meta, and there are now others (PML, DC, SCOVO, etc.).