GigaScience's Impact Factor

Even though the editors of GigaScience don't like Impact Factors (and I agree with them), the journal has received a very high one: 7.46. I'm quite happy, since we published a paper there last year, Enhanced reproducibility of SADI web service workflows with Galaxy and Docker.


Transforming CSV data to RDF with Grafter

Part of my work is to develop pipelines that transform already existing Open Data (usually CSVs in some data portal, like CKAN) into RDF and, hopefully, Linked Data. If I have to do the transformation myself, interactively, I normally use Google Refine with the RDF plugin. However, what I need now is a batch pipeline that I can plug into a bigger Java platform.

Therefore, I'm looking at Grafter. Even though I have never programmed in Clojure (or any other functional language whatsoever!), Grafter's approach seems very sensible and intuitive. Additionally, I have always wanted to use Tawny-OWL, so it will probably be easier if I learn a bit of Clojure with Grafter first. Coming from Java/Perl/Python, the functional approach felt a bit weird in the beginning, but it actually makes more sense when defining pipelines to process data.
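To give a flavour of that pipeline style, here is a minimal sketch in plain Clojure. It is not the actual Grafter API: the column layout, base URI and property are invented for illustration, and it needs org.clojure/data.csv as a Leiningen dependency. It simply reads a CSV file and emits one N-Triples line per row.

 (require '[clojure.data.csv :as csv]
          '[clojure.java.io :as io])

 (defn row->ntriple
   "Turn one CSV row [id label] into an N-Triples line (id is assumed URL-safe)."
   [[id label]]
   (str "<http://example.org/resource/" id "> "
        "<http://www.w3.org/2000/01/rdf-schema#label> "
        "\"" label "\" ."))

 (defn csv->rdf
   "Batch pipeline: read in-csv and write one triple per row to out-nt."
   [in-csv out-nt]
   (with-open [r (io/reader in-csv)
               w (io/writer out-nt)]
     (->> (csv/read-csv r)
          (drop 1)                         ; skip the header row
          (map row->ntriple)
          (run! #(.write w (str % "\n"))))))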

I have gone through the Grafter guide using Leiningen on Ubuntu 14.04. So far so good (I had to install Leiningen manually, though, since Ubuntu's Leiningen package was very outdated). In order to run the Grafter example in Eclipse (Mars), or any other Clojure program, one first needs to install the CounterClockWise plugin. Note that if you also want to use GitHub, like me, there is a bug that prevents the project from being properly cloned when you choose the New Project wizard: I cloned with the General Project wizard, copied the files from another Grafter project, and surprisingly it worked (trying to convert the project to Leiningen/Clojure didn't work!).

My progress converting data obtained from Gipuzkoa Irekia to RDF can be seen at GitHub. I'm also aiming at adding the Data Cube SPARQL constraints as Clojure tests, here.


OpenLinkedData Services

We have been awarded the contract to implement part of Open Data Euskadi as Linked Data: Contratación de los Servicios OpenLinkedData.

Three levels of reproducibility: Docker, Galaxy, Linked Data

[Originally posted at LinkedIn]

I have just stumbled upon this thread on why one should use Galaxy (https://www.biostars.org/p/50034/). One of the reasons posted is reproducibility, but Galaxy only solves one level of reproducibility, "functional reproducibility" (what I did with the data). There are at least two other levels, one "below" Galaxy and another one "above" Galaxy:

  • Below: the computational environment: operating system, library dependencies, binaries.
  • Above: the semantics: what the data means.

In order to be completely reproducible, one has to be reproducible at all three levels:

  1. Computational: Docker.
  2. Functional: Galaxy.
  3. Semantics: URIs, RDF, SPARQL, OWL.

And how to do it is described in our GigaScience paper, "Enhanced reproducibility of SADI Web Service Workflows with Galaxy and Docker" :-) (http://www.gigasciencejournal.com/content/4/1/59)

Just to emphasize and clarify, the three levels are:

  1. Computational: how I did it.
  2. Functional: what I did with the data.
  3. Semantic: what the data means.

TikiTalka talk on Linked Data

On Friday 12 February I gave a talk on Linked Data and the Semantic Web (Slides.com), as part of the TikiTalka event organised by VE Interactive Bilbao. The other talks were very interesting, and there was free beer and table football. What more could you ask for?

Linked Open Data training at IZFE

I have uploaded the slides and exercise results from the two-day Linked Open Data course I taught at IZFE.

SADI-Docker for Galaxy

About

SADI is a framework for defining Semantic Web Services that consume and produce RDF. Docker, on the other hand, is a container-based virtualisation environment for deploying applications easily, without configuring or installing dependencies. I have therefore created SADI-Docker, a Docker image containing all the programs and dependencies necessary to invoke SADI services; Galaxy tool files are also provided to execute those programs as regular Galaxy tools. As a result, SADI can be used within Galaxy with a minimal installation (only the Docker image and the Galaxy XML files, see below). Moreover, the SADI-Docker image can be used as a regular Docker image, running it as a standalone environment pre-configured to invoke SADI services.
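For instance, once the image has been pulled (see Installation below), you can open an interactive shell inside the container and invoke the SADI client and scripts from there (a minimal sketch, assuming you just want a bash session in the container):

 $ docker run -it mikeleganaaranguren/sadi:v6 /bin/bash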

Installation

Install Docker and configure it so that it can be run without sudo:

 $ sudo apt-get install docker.io
 $ sudo groupadd docker
 $ sudo gpasswd -a your_user docker
 $ sudo service docker.io restart

(You might need to log out and back in, and also I had to install apparmor).

Pull the SADI-Docker image to your Docker repository:

 $ docker pull mikeleganaaranguren/sadi:v6

Check that it has been successfully pulled:

 $ docker images
 REPOSITORY                 TAG   IMAGE ID       CREATED        VIRTUAL SIZE
 mikeleganaaranguren/sadi   v6    0bb03066587d   46 hours ago   580.3 MB

Download/clone the latest Galaxy version:

 $ git clone https://github.com/galaxyproject/galaxy.git

Download/clone this repository and copy the `tools/SADI-Docker` directory to the `tools` directory of your Galaxy installation. You can also install the Galaxy tools from within your Galaxy instance, as regular Galaxy tools, from the Galaxy Tool Shed. There are six Galaxy tools:

  • SADI-Docker-sadi_client: a SADI client for synchronous SADI services.
  • SADI-Docker-RDFSyntaxConverter: a tool to convert between different RDF syntaxes, including from RDF to TSV files.
  • SADI-Docker-mergeRDFgraphs: a tool to merge different RDF graphs into one.
  • SADI-Docker-SPARQLGalaxy: a tool to perform SPARQL queries against RDF files.
  • SADI-Docker-rapper: a tool to convert RDF files to different syntaxes.
  • SADI-Docker-tab2rdf: a tool to produce RDF files from TSV files.

Add the following section to `config/tool_conf.xml` to add the tools to Galaxy (first copy `tool_conf.xml.sample` to `tool_conf.xml`):

[Screenshot: the SADI-Docker tool section added to tool_conf.xml]
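(The screenshot is not reproduced here; the following is a minimal sketch of what such a section could look like. The section name matches the `Docker SADI services` label mentioned below, but the tool file names are assumptions, so adjust them to the actual XML files shipped in `tools/SADI-Docker`.)

 <section id="sadi-docker" name="Docker SADI services">
     <!-- One <tool> entry per SADI-Docker tool file copied into tools/SADI-Docker -->
     <tool file="SADI-Docker/SADI-Docker-sadi_client.xml"/>
     <tool file="SADI-Docker/SADI-Docker-RDFSyntaxConverter.xml"/>
     <tool file="SADI-Docker/SADI-Docker-mergeRDFgraphs.xml"/>
     <tool file="SADI-Docker/SADI-Docker-SPARQLGalaxy.xml"/>
     <tool file="SADI-Docker/SADI-Docker-rapper.xml"/>
     <tool file="SADI-Docker/SADI-Docker-tab2rdf.xml"/>
 </section>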

Change the Galaxy configuration so that it can run Docker images as if they were regular tools installed in your system. Add a destination, `docker_local`, to your configuration, and make it the default. Copy `config/job_conf.xml.sample_basic` to `config/job_conf.xml` and add these lines to `config/job_conf.xml` (change `docker_memory` if necessary):

[Screenshot: the docker_local destination added to job_conf.xml]
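(Again, the screenshot is not reproduced here; the following is a minimal sketch of the destination block, assuming the `local` runner from `job_conf.xml.sample_basic` and the Docker parameter names used in `job_conf.xml.sample_advanced`; the 4G memory limit is just an example value.)

 <destinations default="docker_local">
     <destination id="local" runner="local"/>
     <destination id="docker_local" runner="local">
         <!-- Run the tool's command line inside the Docker container it declares -->
         <param id="docker_enabled">true</param>
         <param id="docker_memory">4G</param>
     </destination>
 </destinations>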

(See `job_conf.xml.sample_advanced` for the many other options that control how Galaxy invokes Docker containers.)

Run Galaxy and the tools should appear under `Docker SADI services`:

[Screenshot: the SADI-Docker tools listed in the Galaxy tool panel under Docker SADI services]

Use case

In order to test the installation, you can run a pre-defined workflow. Upload the file workflow/UniProt_IDs.txt to your current Galaxy history. Then import the workflow into Galaxy (Workflows; Import or Upload Workflow; choose the file workflow/SADI-Docker_use_case.ga). You can also find the workflow in the Tool Shed. Finally, run the workflow, choosing the UniProt_IDs.txt dataset as the input of the first step.

The workflow answers the following question: given a set of UniProt proteins, which ones are related to PubMed abstracts containing the term "brain", and what are their KEGG entries? The workflow starts from a simple list of UniProt identifiers, and retrieves different datasets from a regular SADI service (to obtain KEGG entries) and a set of 3 OpenLifeData2SADI services (to obtain PubMed abstracts). The results are then merged and queried to obtain the KEGG entries of the proteins that are related to PubMed abstracts containing the term.

[Screenshot: the use-case workflow in the Galaxy workflow editor]

The SADI services used in the workflow are:

And the SPARQL query to obtain the result:

 PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
 PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
 PREFIX sadi: <http://sadiframework.org/ontologies/predicates.owl#>
 PREFIX lsrn: <http://purl.oclc.org/SADI/LSRN/>
 SELECT ?protein ?label ?KEGG
 WHERE {
     ?protein rdf:type lsrn:UniProt_Record .
     ?protein sadi:isEncodedBy ?KEGG .
     ?protein ?prot2hgnc ?hgnc .
     ?hgnc ?hgnc2omim ?omim .
     ?omim ?omim2pubmed ?pubmed .
     ?pubmed rdfs:label ?label .
     FILTER (regex(?label, 'brain'))
 }

Notes

This project is a continuation of SADI-Galaxy-Docker, with the inverse approach, hence the name: SADI-Galaxy-Docker was a complete Galaxy server, configured with SADI tools, within a Docker image; SADI-Docker is a Docker image with only SADI tools, and any Galaxy instance can invoke the image.

The tab2rdf tool included here is a "fork" of the original tab2rdf tool. This version adds an option for the user to define no base URI, i.e. every entity of the tab file gets its own full URI.

When using the SADI client on its own, the datatype of the input dataset must be edited to state that the input is an RDF file.

The Docker image can also be built locally, instead of pulling it, using the Dockerfile:

 FROM ubuntu:14.04
 MAINTAINER Mikel Egaña Aranguren <mikel.egana.aranguren@gmail.com>

 # Install the necessary dependencies with apt-get
 RUN apt-get update && apt-get install -y wget python python-setuptools raptor2-utils libraptor2-0

 # apt-get install python-rdflib is not working, so use easy_install instead
 RUN easy_install rdflib

 # SADI does not like OpenJDK, so install Java from http://www.duinsoft.nl/packages.php?t=en
 RUN wget http://www.duinsoft.nl/pkg/pool/all/update-sun-jre.bin
 RUN sh update-sun-jre.bin

 # Copy the SADI client and helper scripts and add them to the PATH
 RUN mkdir /sadi
 COPY sadi_client.jar /sadi/
 COPY RDFSyntaxConverter.jar /sadi/
 COPY __init__.py /sadi/
 COPY MergeRDFGraphs.py /sadi/
 COPY tab2rdf.py /sadi/
 COPY sparql.py /sadi/
 RUN chmod a+x /sadi/*
 ENV PATH $PATH:/sadi
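For example, from the directory containing the Dockerfile and the files it copies (the tag simply mirrors the image name used in the pull command above):

 $ docker build -t mikeleganaaranguren/sadi:v6 .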