Level:
LOGD Related Technologies
Description:
Instructions for installing, configuring, and managing Virtuoso SPARQL endpoints (community edition)
Overview
This tutorial documents how to install, configure, and manage Virtuoso Open Source Edition (VOSE) on 64-bit Linux servers. Some of its content comes from the
VOSE documentation, tailored specifically to the machine setup of TWC@RPI. It also introduces shell scripts, developed by the LOGD team at RPI, for common administrative tasks (e.g., probing the status of a VOSE SPARQL endpoint, or starting, stopping, and restarting a VOSE endpoint).
Installation
Packages and source code can be downloaded at
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSDownload. This tutorial uses the archived package available at
http://sf.net/projects/virtuoso/files. Checking out the source code from Virtuoso's CVS server is also possible; see
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSDownload for details.
After downloading the archived package (virtuoso-opensource-6.1.1.tar.gz), unpack it on the server where you want Virtuoso installed.
A detailed guide to compiling and installing Virtuoso is available online at
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VOSMake. The following is a step-by-step walk-through.
- Make sure you have all the Package Dependencies.
- Set the compiler flags according to the hardware processor and OS of your machine.
- Unpack the downloaded VOSE package and navigate to that folder.
- At the command prompt, enter
./autogen.sh
- This checks that the required packages are present and at the right versions.
- Enter
./configure
- Enter
make
- Enter
make install
If none of the steps above reports an error, the installation is complete.
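The build steps above can be collected into one shell function, sketched here under stated assumptions. The names build_virtuoso, run, and DRY_RUN are hypothetical helpers, not part of Virtuoso. The --prefix value is an assumption inferred from the /opt/virtuoso6 paths used later in this tutorial, and the CFLAGS value is only an assumed example for 64-bit Linux; take the actual flags from the VOSMake guide.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# build_virtuoso SRC_DIR [PREFIX]
# Runs the build steps in order and stops at the first failure.
build_virtuoso() {
    src_dir=$1
    prefix=${2:-/opt/virtuoso6}    # assumed install prefix
    (
        # Subshell: the caller's working directory and environment stay unchanged.
        cd "$src_dir" || exit 1
        CFLAGS="-O2 -m64"          # assumed example flags for 64-bit Linux
        export CFLAGS
        run ./autogen.sh &&
        run ./configure --prefix="$prefix" &&
        run make &&
        run make install
    )
}
```

Calling it with DRY_RUN=1 set only prints the four commands, which is a cheap way to check the sequence before starting a long build.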
Administrative Tasks
Manual Start-up
The Virtuoso server instance can be started by calling
/opt/virtuoso6/bin/virtuoso-t -f &
from the directory where virtuoso.ini is located. By default, virtuoso.ini is in
/opt/virtuoso6/var/lib/virtuoso/db
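Because the server must be started from the directory that holds virtuoso.ini, a small wrapper can guard against starting it from the wrong place. The following is a minimal sketch; the function name start_virtuoso is hypothetical, and the paths are the defaults quoted above.

```shell
# start_virtuoso [DB_DIR]
# Starts virtuoso-t from the directory containing virtuoso.ini.
start_virtuoso() {
    db_dir=${1:-/opt/virtuoso6/var/lib/virtuoso/db}
    if [ ! -f "$db_dir/virtuoso.ini" ]; then
        echo "no virtuoso.ini found in $db_dir" >&2
        return 1
    fi
    # Subshell: the caller's working directory is left unchanged.
    ( cd "$db_dir" && /opt/virtuoso6/bin/virtuoso-t -f & )
}
```

Running start_virtuoso with no argument uses the default database directory; passing another directory starts an instance with that directory's virtuoso.ini.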
Manual Shutdown
The Virtuoso server instance can be shut down using the following steps:
Alternatively, the following shell command can also shut down a running Virtuoso instance:
/opt/virtuoso6/bin/isql 1111 dba <password> -K
Start-up/Shutdown Scripts
We have written command-line scripts for 64-bit Linux (CentOS 5) that check, start, shut down, or restart the Virtuoso server instance and SPARQL endpoint with a single command.
- To check the status of the Virtuoso instance:
sudo /etc/init.d/virtuosod status
- To start the Virtuoso instance:
sudo /etc/init.d/virtuosod start
- To stop the Virtuoso instance:
sudo /etc/init.d/virtuosod stop
- To restart the Virtuoso instance:
sudo /etc/init.d/virtuosod restart
Please note that:
Loading Triples
We have written command-line utility scripts for loading triples in different formats into a named graph in the Virtuoso triple store. The scripts are located at
Google Code and are installed on LOGD at
/opt/virtuoso/scripts
Newer, forked versions of the scripts are available at
GitHub.
The formats supported are:
- RDF/XML
- Turtle
- N-Triples
- N-Quads
Follow these steps to load a data file (in any of the formats above) into a named graph:
- Change directory to where the scripts are located.
cd /opt/virtuoso/scripts
- Run the script vload with exactly three arguments:
- format: [rdf | ttl | nt | nq], corresponding to RDF/XML, Turtle, N-Triples, and N-Quads respectively.
- data_file: path to the raw data file.
- graph_uri: URI of the named graph into which the triples should be loaded.
sudo ./vload nt /path/to/data/file/data-1554.nt http://data-gov.tw.rpi.edu/vocab/Dataset_1554
- Wait until loading finishes. Depending on the size of the dataset, this may take from several seconds to several hours.
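When loading many files, the format argument can be derived from the file name instead of being typed each time. The helper below is a sketch, not part of the vload scripts: the name vload_format is hypothetical, and the mapping from file extension to format keyword is an assumption about how the data files are named.

```shell
# vload_format FILE
# Prints the vload format keyword for FILE, based on its extension.
# Returns 1 for extensions that are not recognized.
vload_format() {
    case "$1" in
        *.rdf) echo rdf ;;   # RDF/XML
        *.ttl) echo ttl ;;   # Turtle
        *.nt)  echo nt ;;    # N-Triples
        *.nq)  echo nq ;;    # N-Quads
        *)
            echo "unrecognized extension: $1" >&2
            return 1
            ;;
    esac
}
```

With this in place, a load command could take the form sudo ./vload "$(vload_format "$file")" "$file" "$graph_uri", where file and graph_uri are variables you set yourself.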
Deleting Named Graphs
A utility script, vdelete, deletes a specific named graph from the triple store. It is located at
/opt/virtuoso/scripts
It takes a single argument, the URI of the named graph to be deleted. For example, to delete all the triples in the named graph <http://data-gov.tw.rpi.edu/vocab/Dataset_1554>, use the following command:
sudo ./vdelete http://data-gov.tw.rpi.edu/vocab/Dataset_1554
Performance Tuning
Online documentation on tuning VOSE for better performance is available, for example at
http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtRDFPerformanceTuning and
http://plato.cs.rpi.edu:8890/doc/html/rdfperformancetuning.html.
In general, setting certain parameters in virtuoso.ini to appropriate values improves performance for both loading large datasets and evaluating queries. The following parameters in virtuoso.ini are worth reviewing:
- ServerThreads
- Maximum number of threads used by the server. It should be set close to the number of concurrent connections if heavy usage is expected. A value of 100 should work on most systems.
- O_DIRECT
- This may be useful if a large fraction of RAM is configured as database buffers. If this is on, the file system cache will not grow at the expense of the database process; for example, the OS is less likely to swap out memory that Virtuoso uses for its own database buffers.
- NumberOfBuffers
- This controls the amount of RAM used by Virtuoso to cache database files. It has a critical performance impact, so the value should be fairly high for large databases. Exceeding physical memory with this setting has a significant negative impact. For a database-only server, about 65% of available RAM could be configured for database buffers. Note that each buffer takes about 8700 bytes (see http://docs.openlinksw.com/virtuoso/dbadm.html for details about the size of each buffer).
- CompileProceduresOnStartup
- Setting this to 0 speeds up Virtuoso startup, because stored procedures are not loaded until the first time they are called.
- FDsPerFile
- Number of file descriptors per file to be obtained from the OS. This parameter only affects databases that use striping. Having multiple FDs per file means that that many I/O operations may be pending concurrently per file. This gives the OS more flexibility in scheduling the operations, potentially improving file I/O throughput.
- ResultSetMaxRows
- This setting limits the number of rows in a result set. Lowering it can help to mitigate denial-of-service (DoS) attacks.
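For orientation, the fragment below sketches how some of these parameters might appear in virtuoso.ini. The values for ServerThreads and NumberOfBuffers are the ones discussed in this tutorial; the ResultSetMaxRows value is purely illustrative. The section placement ([Parameters] and [SPARQL]) is an assumption based on stock virtuoso.ini files, so check it against your own file before editing. Parameters not shown here are left at whatever your file already contains.

```ini
[Parameters]
; threads available to the server
ServerThreads   = 100
; bypass the file system cache for database files
O_DIRECT        = 1
; database buffers, each roughly 8 KB
NumberOfBuffers = 2400000

[SPARQL]
; illustrative cap on rows per result set
ResultSetMaxRows = 10000
```

The server has to be restarted for changes in virtuoso.ini to take effect.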
In our experience, on a 64-bit Linux machine with 8 CPU cores (two quad-core processors) and 32 GB of memory, setting NumberOfBuffers to about 2,400,000 improves performance significantly. The figure comes from 32959832 * 0.6 / 8 ≈ 2,470,000, rounded down: about 60% of the machine's 32,959,832 KB of memory, at roughly 8 KB per buffer.
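The same arithmetic can be reproduced for any machine. The sketch below uses two hypothetical helper functions, number_of_buffers and mem_total_kb; the defaults of 60% and 8 KB per buffer mirror the calculation above, and reading MemTotal from /proc/meminfo assumes a Linux system.

```shell
# number_of_buffers MEM_KB [PERCENT] [BUFFER_KB]
# Prints how many buffers fit into PERCENT of MEM_KB (integer arithmetic).
number_of_buffers() {
    mem_kb=$1
    percent=${2:-60}      # share of RAM given to database buffers
    buffer_kb=${3:-8}     # approximate size of one buffer in KB
    echo $(( mem_kb * percent / 100 / buffer_kb ))
}

# mem_total_kb
# Prints the total memory in KB as reported by the Linux kernel.
mem_total_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}
```

For the machine described above, number_of_buffers 32959832 prints 2471987, which the tutorial rounds down to 2,400,000. On another Linux host, number_of_buffers "$(mem_total_kb)" gives a starting value to round down in the same way.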
See also
http://tw.rpi.edu/web/inside/endpoints/installing-virtuoso