Getting started with csv2rdf4lod to create verbatim RDF conversions of tabular data

TWC LOGD Portal
Description:
In this tutorial, we will walk you through setting up your conversion environment, understanding some design decisions and their motivations, and converting a CSV into RDF.

What to expect


By the end of this tutorial you should be able to convert a CSV file to RDF after installing the csv2rdf4lod converter on your Unix-based system.
What you need to know


  • Familiarity with information modeling
  • How to use the Unix command line
  • The tutorials Understanding LOGD Data and Understanding LOGD Metadata, which describe the output of the conversion process covered here. It is a good idea to know what you will be producing before you produce it.
Installation


Let's install the conversion software. Use the following commands to download the csv2rdf4lod.tgz tarball into the directory where you wish to deploy it, extract it, and run its install script:
 curl -O http://www.rpi.edu/~lebot/csv2rdf4lod.tgz
 tar xzf csv2rdf4lod.tgz
 cd csv2rdf4lod
 ./install.sh
If you run the ls command in csv2rdf4lod/ (the directory you are now in), you should see three directories (bin/, data/, doc/), a README.txt file, and the source-me.sh script, as shown below. The install.sh script that you previously executed is no longer needed and has been moved into the bin/ directory.
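Your listing should look like this:
 # working directory: csv2rdf4lod/
 
 $ ls
   bin/  data/  doc/  README.txt  source-me.sh
Next, run the following command: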
 # working directory: csv2rdf4lod/
 
 $ source source-me.sh 
source-me.sh creates the environment variable CSV2RDF4LOD_HOME and adds a few paths to PATH and CLASSPATH. Make sure that you run this script before doing any work with the conversion software. source-me.sh needs to be sourced for _each_ new shell you want to use to run csv2rdf4lod, so consider sourcing it from your .login or similar environment setup script.
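For example, assuming a bash shell and an install location of ~/csv2rdf4lod (adjust both to your setup), you could append the source line to your shell startup file:
 # one-time setup; ~/.bashrc is an assumption -- use your shell's startup file
 
 $ echo 'source ~/csv2rdf4lod/source-me.sh' >> ~/.bashrc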

Running cr-vars.sh will show all of the environment variables used to control csv2rdf4lod automation. All variables are set automatically in csv2rdf4lod/source-me.sh (which is created by csv2rdf4lod/install.sh). It is a good idea to check the variables before running conversions. Documentation for each variable is in csv2rdf4lod/bin/setup.sh. Below is the output of running cr-vars.sh:
 # working directory: anywhere, as long as you ran source-me.sh
 
 $ cr-vars.sh
 CSV2RDF4LOD_HOME                                         /work/data-gov/v2010/csv2rdf4lod
 CSV2RDF4LOD_BASE_URI                                     http://logd.tw.rpi.edu
 CSV2RDF4LOD_BASE_URI_OVERRIDE                            (not required, $CSV2RDF4LOD_BASE_URI will be used.)
 CSV2RDF4LOD_CONVERT_ONLY_EXAMPLE_ROWS                    (will default to: false)
 --
 CSV2RDF4LOD_PUBLISH_DELAY_UNTIL_ENHANCED                 false
 CSV2RDF4LOD_PUBLISH_COMPRESS                             (will default to: false)
 CSV2RDF4LOD_PUBLISH_TTL                                  true
 CSV2RDF4LOD_PUBLISH_TTL_LAYERS                           true
 CSV2RDF4LOD_PUBLISH_NT                                   false
 CSV2RDF4LOD_PUBLISH_RDFXML                               false
 --
 CSV2RDF4LOD_PUBLISH_SUBSET_VOID                          true
 CSV2RDF4LOD_PUBLISH_SUBSET_SAMEAS                        true
 --
 CSV2RDF4LOD_PUBLISH_LOD_MATERIALIZATION                  false
 CSV2RDF4LOD_PUBLISH_LOD_MATERIALIZATION_WWW_ROOT         /var/www/html/logd.tw.rpi.edu
 CSV2RDF4LOD_PUBLISH_LOD_MATERIALIZATION_WRITE_FREQUENCY  (will default to: 1,000,000)
 CSV2RDF4LOD_PUBLISH_LOD_MATERIALIZATION_REPORT_FREQUENCY (will default to: 1,000)
 CSV2RDF4LOD_CONCURRENCY                                  2
 --
 CSV2RDF4LOD_PUBLISH_TDB                                  false
 CSV2RDF4LOD_PUBLISH_TDB_DIR                              (will default to: VVV/publish/tdb/)
 CSV2RDF4LOD_PUBLISH_TDB_INDIV                            false
 --
 CSV2RDF4LOD_PUBLISH_4STORE                               false
 CSV2RDF4LOD_PUBLISH_4STORE_KB                            (will default to: csv2rdf4lod -- leading to /var/lib/4store/csv2rdf4lod)
 --
 CSV2RDF4LOD_PUBLISH_VIRTUOSO                             false
 --
 see documentation for variables in:
 /work/data-gov/v2010/csv2rdf4lod/bin/setup.sh

Retrieve datasets for conversion


Now that you have successfully installed and configured the conversion software, let's test it on some data. Before we do that, however, let's go over some basics about the directory structure used by the converter. Move to the data-gov/ directory:
 $ cd csv2rdf4lod/data/source/data-gov/
csv2rdf4lod/data/source/data-gov/ acts as a repository for all datasets that you download from the source "data-gov". When you run the retrieval script dg-create-dataset-dir.sh (we'll talk about this shortly) to download the source files for a given dataset, a new directory is created for that dataset in csv2rdf4lod/data/source/data-gov/. The name of the new directory corresponds to the dataset identifier that "data-gov" assigns.

Here is an example using Dataset 1450. If dg-create-dataset-dir.sh had already been run to download the CSV file for Dataset 1450, there would be a directory 1450/ within csv2rdf4lod/data/source/data-gov/. Here is how you would check:
 $ cd csv2rdf4lod/data/source/data-gov/
 $ ls
   1450/
Keep in mind that dg-create-dataset-dir.sh needs to be run before data-gov/ is populated with any dataset directories. If you are following this tutorial, at this point csv2rdf4lod/data/source/data-gov/ should be empty (because we haven't done anything yet!).

Now let's take a look at how to download a dataset's source CSV files using the retrieval script dg-create-dataset-dir.sh. The script accepts a data-gov dataset identifier (e.g., 1450 for Dataset 1450) and retrieves that dataset's source CSV in preparation for conversion to raw RDF. All scripts begin with "cr-" or "dg-" so they are easy to list within the terminal: scripts starting with "cr-" can be used for datasets from sources other than data-gov, while scripts starting with "dg-" are specific to data-gov. To see what scripts are available, type the prefix and press tab to list the completions.

Let's see what "dg-" scripts are available:
 $ dg- (and press tab to show possible completions)
   dg-create-dataset-dir.sh  
   dg-get-mod-date.sh
If no completions are provided, make sure source-me.sh was run, as this sets the PATH to include the location of the dg- scripts.

Let's try using dg-create-dataset-dir.sh to download the CSV files for Dataset 1492. Run the script as follows:
 # working directory: csv2rdf4lod/data/source/data-gov
 
 $ dg-create-dataset-dir.sh 1492
It takes about a minute to complete and prints a stream of text while executing (which isn't important at this point). In short, the above command creates the directory 1492/, queries http://www.data.gov/raw/1492/ for the formats offered (XLS and CSV), downloads them into 1492/, and unzips them (if compressed).
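You can confirm that the new dataset directory now exists:
 # working directory: csv2rdf4lod/data/source/data-gov
 
 $ ls
   1492/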

Now go into 1492/version/2010-Jan-21 and list its contents (you should currently be in csv2rdf4lod/data/source/data-gov/):
 # working directory: csv2rdf4lod/data/source/data-gov
 
 $ cd 1492/version/2010-Jan-21
 $ ls
   convert-1492.sh  
   manual/
   source/
If all went well, you should see the convert-1492.sh script listed in 1492/version/2010-Jan-21. If no CSV file was downloaded into the source/ directory, however, the convert script will not have been created. There are two ways to recover, illustrated below:
1) If the file really is a CSV, just with the wrong extension, run cr-create-convert-sh.sh directly on the file (source/*.txt).
2) If it is NOT a CSV, convert it to a CSV by hand, save the result into manual/, then run cr-create-convert-sh.sh on manual/*.csv.
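For example (the filename here is hypothetical; substitute whatever file you actually have):
 # working directory: csv2rdf4lod/data/source/data-gov/1492/version/2010-Jan-21
 
 # case 1: a CSV that arrived with a .txt extension
 $ cr-create-convert-sh.sh source/subgrantees.txt
 
 # case 2: a hand-converted CSV placed into manual/
 $ cr-create-convert-sh.sh manual/subgrantees.csv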
The convert-1492.sh script is what allows us to convert the dataset's source CSV file to RDF and to make subsequent enhancements. Before we can make any enhancements, we need to generate raw RDF from Dataset 1492's CSV file.

Do this by running convert-1492.sh:
 # working directory: csv2rdf4lod/data/source/data-gov/1492/version/2010-Jan-21
 
 $ ./convert-1492.sh
Once the script has finished, list the contents of 1492/version/2010-Jan-21 (which you should still be in); you will see that new directories have been created.
 # working directory: csv2rdf4lod/data/source/data-gov/1492/version/2010-Jan-21
 
 $ ls
   convert-1492.sh 
   manual/
   automatic/
   publish/
   doc/
   source/
Of interest to us right now are the manual/, automatic/, and source/ directories. The source/ directory contains all of the source files (CSV, XLS) downloaded when we ran the dg-create-dataset-dir.sh script; we are going to leave this directory alone from this point on. In manual/, there should be a single file called data.gov.FEMAPublicAssistanceSubGrantee.csv.e1.params.ttl. We'll discuss this more in the next section when we make enhancements, but for now you just need to know that this is the parameters file used to shape the RDF generated by the convert script. RDF generated automatically from the CSVs in source/ is placed in the automatic/ directory.

To see the raw RDF that was automatically produced for Dataset 1492 after running convert-1492.sh, execute the following:
 # working directory: csv2rdf4lod/data/source/data-gov/1492/version/2010-Jan-21
 
 $ vi automatic/data.gov.FEMAPublicAssistanceSubGrantee.csv.raw.ttl
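For orientation, below is an abridged, illustrative sketch of how one row appears in the raw layer. It is reconstructed from the enhanced excerpt shown later in this tutorial, not copied from actual converter output, so the subject URI and the "raw:" prefix are assumptions:
 # illustrative Turtle; "raw:" stands in for the raw predicate namespace,
 # and the subject URI pattern is an assumption
 ds1492:thing_93774
         raw:disaster_number         "3406" ;
         raw:incident_type           "Severe Storm(s)" ;
         raw:state                   "California" ;
         raw:county                  "Amador" ;
         raw:applicant_name          "ALLEN" ;
         raw:federal_share_obligated "$75,000.00" ;
         ov:csvRow                   "93774"^^xsd:integer .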

Make an enhancement


As you saw previously, we used convert-1492.sh to generate RDF files (placed in automatic/) from Dataset 1492's source CSV (contained in source/). This same script is what we use whenever we want to generate an enhancement. Enhancements are made by editing the enhancement parameters file (manual/data.gov.FEMAPublicAssistanceSubGrantee.csv.e1.params.ttl).

If we edited manual/data.gov.FEMAPublicAssistanceSubGrantee.csv.e1.params.ttl to include an enhancement, re-running the convert script would produce an enhanced RDF file:
 # working directory: csv2rdf4lod/data/source/data-gov/1492/version/2010-Jan-21
 
 $ ./convert-1492.sh
 $ less automatic/data.gov.FEMAPublicAssistanceSubGrantee.csv.e1.ttl
Enhanced versions are named using *.e1.ttl, *.e2.ttl, and so on, in contrast with the original *.raw.ttl.
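Once an enhancement has been configured and the convert script re-run, automatic/ holds both layers; abridged to the Turtle files, the listing looks like this:
 # working directory: csv2rdf4lod/data/source/data-gov/1492/version/2010-Jan-21
 
 $ ls automatic/
   data.gov.FEMAPublicAssistanceSubGrantee.csv.e1.ttl
   data.gov.FEMAPublicAssistanceSubGrantee.csv.raw.ttl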

Note that, until some enhancement parameters are provided, the enhanced RDF files will resemble the raw RDF files, just with a different predicate namespace. We haven't made any changes to manual/data.gov.FEMAPublicAssistanceSubGrantee.csv.e1.params.ttl at this point, so re-running convert-1492.sh now would not produce any meaningfully new enhancement files.

Let's walk through an example of how we could enhance Dataset 1492. The directory csv2rdf4lod/data/source/data-gov/1492/version/2010-Jan-21/ will be the working directory from which enhancements should be run; we call it the "conversion cockpit". It is advised that you do not leave the cockpit when making enhancements, because all scripts are oriented to execute from this location.

First, open the params file:
 # working directory: csv2rdf4lod/data/source/data-gov/1492/version/2010-Jan-21
 
 $ vi manual/data.gov.FEMAPublicAssistanceSubGrantee.csv.e1.params.ttl

Find the following parameters in the file:
      conversion:enhance [
         ov:csvCol         1;
         ov:csvHeader     "Disaster Number";
         conversion:label "Disaster Number";
         conversion:range  todo:Literal;
      ];
      conversion:enhance [
         ov:csvCol         2;
         ov:csvHeader     "Declaration Date";
         conversion:label "Declaration Date";
         conversion:range  todo:Literal;
      ];

and edit them to match this:
      conversion:enhance [
         ov:csvCol         1;
         ov:csvHeader     "Disaster Number";
         conversion:label "Disaster Number";
         conversion:domain_name "Disaster Aid Obligation";
         conversion:range  rdfs:Resource;
         conversion:range_name "Disaster";
      ];
      conversion:enhance [
         ov:csvCol         2;
         ov:csvHeader     "Declaration Date";
         conversion:label "Declaration Date";
         conversion:range  xsd:date;
         conversion:date_pattern "MM/dd/yyyy";
         conversion:bundled_by [ ov:csvCol 1 ];
      ];
In brief, these edits promote the Disaster Number column from a literal to a resource (conversion:range rdfs:Resource), name the row and column classes (conversion:domain_name, conversion:range_name), type the Declaration Date column as an xsd:date parsed with the pattern MM/dd/yyyy, and attach that date to the disaster resource rather than to the row (conversion:bundled_by).

Next, re-run the conversion script and open the enhancement file (which should appear in automatic/ after the script runs):
 # working directory: csv2rdf4lod/data/source/data-gov/1492/version/2010-Jan-21
 
 $ ./convert-1492.sh
 $ vi automatic/data.gov.FEMAPublicAssistanceSubGrantee.csv.e1.ttl
Below are two entries from the enhancement file that demonstrate what the converter has accomplished. If you search the contents of the enhancement file, you should be able to find these two entries (or ones similar to them).
ds1492:disaster_Aid_Obligation_93774 
        dcterms:isReferencedBy     <http://logd.tw.rpi.edu/source/data-gov/dataset/1492/version/2010-Jan-21> ;
        a                          ds1492_vocab:Disaster_Aid_Obligation ;
        e1:disaster_number         <http://logd.tw.rpi.edu/source/data-gov/dataset/1492/typed/disaster/3406> ;
        e1:incident_type           "Severe Storm(s)" ;
        e1:state                   "California" ;
        e1:county                  "Amador" ;
        e1:applicant_name          "ALLEN" ;
        e1:education_applicant     "No" ;
        e1:number_of_projects      "1" ;
        e1:federal_share_obligated "$75,000.00" ;
        ov:csvRow                  "93774"^^xsd:integer .
                
        ...     
        
<http://logd.tw.rpi.edu/source/data-gov/dataset/1492/typed/disaster/1239> dcterms:identifier "1239" ;
        a                   ds1492_vocab:Disaster ;
        rdfs:label          "1239" ;
        e1:declaration_date "1998-08-26"^^xsd:date .
Looking at the enhanced version of the RDF, we see that the disaster was promoted to an rdfs:Resource and that disaster_Aid_Obligation_93774 now refers to its URI instead of a literal. Next, we see that the disaster's declaration date now describes the disaster instead of the obligation. Finally, we see that the declaration date is a properly typed xsd:date instead of an untyped literal.

The enhancement process boils down to these three basic steps; just rinse and repeat:

Step 1: Edit the parameters file to configure the enhancements you want to add
 # working directory: csv2rdf4lod/data/source/data-gov/1492/version/2010-Jan-21
 
 $ vi manual/data.gov.FEMAPublicAssistanceSubGrantee.csv.e1.params.ttl 
(An exhaustive, running list of enhancement parameters is maintained in the csv2rdf4lod documentation.)

Step 2: Run the convert script
 # working directory: csv2rdf4lod/data/source/data-gov/1492/version/2010-Jan-21
 
 $ ./convert-1492.sh
Step 3: View the enhancement file to verify that the appropriate enhancements have been configured correctly
 # working directory: csv2rdf4lod/data/source/data-gov/1492/version/2010-Jan-21
 
 $ vi automatic/data.gov.FEMAPublicAssistanceSubGrantee.csv.e1.ttl
You should now know how to install, configure, and run the conversion software, as well as the basic conversion process necessary to make enhancements.

Where to go from here
