Sunday, April 17, 2011

Using XSL to filter biomedical data

This post gives a short introduction to XSLT (Extensible Stylesheet Language Transformations), which we will use to transform an XML file into something else. That something else could be another XML file with a different structure, a CSV file or any other text content.

This is a very small tutorial intended to help you set up your environment and run a first transformation on biomedical data. I won't go into the details of the language or try to provide an exhaustive list of available operations; there is already plenty of good literature and reference material on this subject (see the last paragraph, "Further reading").

I used this technology some years ago to dynamically transform a stream of data into either an HTML page or a WAP page (for mobile phone browsers), depending on the client detected. Here we are going to use XSL to handle and transform an XML document resulting from a search in UniProt (a list of proteins).

Objective: let's say we want to transform the XML document downloaded from UniProt into a nicer CSV file with two columns: the protein accession and the gene name.

You will see that the corresponding stylesheet can easily be adapted to your needs or to any other biomedical data source (MeSH, DrugBank, etc.).

Requirements/Environment
In order to run XSL on your computer you need the following software installed and properly configured:
  • Java 1.6 or higher
  • Xalan-Java: download and install xalan-j_2_7_1-bin
  • once Xalan is installed, define the environment variable XALAN_HOME so that it points to the Xalan root folder (e.g. XALAN_HOME=C:\java\xalan-j_2_7_1)
  • update your Java classpath; it should at least include:

    CLASSPATH=%XALAN_HOME%\xalan.jar;%XALAN_HOME%\serializer.jar;%XALAN_HOME%\xml-apis.jar;%XALAN_HOME%\xercesImpl.jar

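A quick way to verify the classpath before going further is to ask Java whether it can load the Xalan command-line class. The sketch below is only an illustration: the class name XalanCheck and the method isOnClasspath are made up for this post; the only real name is org.apache.xalan.xslt.Process, the class used later to run the transformation.

```java
public class XalanCheck {

    /** Returns true if the named class can be found on the current classpath. */
    public static boolean isOnClasspath(String className) {
        try {
            // "false" = look the class up without running its static initializers
            Class.forName(className, false, XalanCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class that the Xalan command line relies on
        String xalanMain = "org.apache.xalan.xslt.Process";
        if (isOnClasspath(xalanMain)) {
            System.out.println(xalanMain + " found, Xalan is on the classpath");
        } else {
            System.out.println(xalanMain + " NOT found, check XALAN_HOME and CLASSPATH");
        }
    }
}
```

Compile it with javac XalanCheck.java and run it with java XalanCheck from the same command prompt in which you defined CLASSPATH.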
Input data
We will use the XML data produced by UniProtKB when searching for proteins related to the organism "Moloney murine leukemia virus". This is an arbitrary example chosen to demonstrate the technique on a small XML file; the same approach applies to other data.

Follow these steps to get the XML file:
  1. go to http://www.uniprot.org
  2. run a search using the query:
    organism:"Moloney murine leukemia virus"
  3. in the search result page click the Download link
  4. download the file "Complete Data in XML format" and save it using the filename uniprot.xml
So we have an XML file to play with. You may open it in a text editor to understand the structure of the XML content before writing the XSL script. Note that UniProt returned around 22 proteins, so you should find about 22 <entry> blocks in the XML file.
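If you prefer not to count the entry blocks by eye, a few lines of Java can do it. This is a minimal sketch under some assumptions: the class name EntryCounter is made up for this post, the file is assumed to be saved as uniprot.xml in the current folder, and the code counts every element whose local name is "entry", whatever its namespace and depth. It loads the whole document in memory, which is fine for a small file like this one.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class EntryCounter {

    /** Counts the elements named "entry" (in any namespace) in the given XML stream. */
    public static int countEntries(InputStream xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Needed so that elements can be looked up by namespace and local name
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().parse(xml);
            // "*" means "any namespace"
            return doc.getElementsByTagNameNS("*", "entry").getLength();
        } catch (Exception e) {
            throw new RuntimeException("Could not parse the XML input", e);
        }
    }

    public static void main(String[] args) throws Exception {
        InputStream in = new FileInputStream("uniprot.xml");
        try {
            System.out.println("Number of entries: " + countEntries(in));
        } finally {
            in.close();
        }
    }
}
```

The number printed should match the number of proteins shown on the UniProt result page at the time of your download.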

Write your XSL script
We will write a simple XSL script that outputs 2 columns of data: a column with the protein "Accession" and a column with the "Gene name". The columns are separated by a TAB character.

Open a text editor and create a file uniprot.xsl.

An XSL script is itself an XML file! So it should logically start with the XML declaration:

<?xml version="1.0" encoding="ISO-8859-1"?>

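Then the rest of the stylesheet goes inside the "stylesheet" element. UniProt puts all of its elements in an XML namespace, so we also declare that namespace here with the prefix "up"; without it, a path such as uniprot/entry matches nothing and only the header line is printed. Check the xmlns attribute on the root <uniprot> element of your file and use the same URI:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:up="http://uniprot.org/uniprot">
...
</xsl:stylesheet>

Since we want plain text rather than XML, we set the output method accordingly (otherwise the result starts with an XML declaration):

<xsl:output method="text"/>

The next line tells the XSL processor to match the root of the XML document:

<xsl:template match="/">

Then this prints the column headers of our CSV file (a TAB between the two headers and a line feed character at the end):

<!-- Print column headers -->
<xsl:text>Accession</xsl:text>
<xsl:text>&#x9;</xsl:text>
<xsl:text>Gene name</xsl:text>
<xsl:text>&#10;</xsl:text>

The following block is a bit more sophisticated. It is a loop over each "entry" element of the XML file. For each entry it prints the accession of the protein and the gene name (each select is an XML path relative to the entry), separated by a TAB and followed by a line feed. In XSLT 1.0, value-of prints only the first matching node, so an entry with several accessions or gene names contributes just the first one:

<xsl:for-each select="up:uniprot/up:entry">
<xsl:value-of select="up:accession"/>
<xsl:text>&#x9;</xsl:text>
<xsl:value-of select="up:gene/up:name"/>
<xsl:text>&#10;</xsl:text>
</xsl:for-each>

Finally, close the template with </xsl:template>.

For your convenience here is the complete XSL file content:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:up="http://uniprot.org/uniprot">
<!-- Produce plain text, not XML -->
<xsl:output method="text"/>
<xsl:template match="/">
<!-- Print column headers -->
<xsl:text>Accession</xsl:text>
<xsl:text>&#x9;</xsl:text>
<xsl:text>Gene name</xsl:text>
<xsl:text>&#10;</xsl:text>
<!-- One line per UniProt entry -->
<xsl:for-each select="up:uniprot/up:entry">
<xsl:value-of select="up:accession"/>
<xsl:text>&#x9;</xsl:text>
<xsl:value-of select="up:gene/up:name"/>
<xsl:text>&#10;</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>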


Run your XSL script
Once you have your input XML and your XSL written, you are ready to try it. The following command parses the input XML file and writes the data defined previously to a CSV file that you can open in Excel:

java org.apache.xalan.xslt.Process -IN uniprot.xml -XSL uniprot.xsl -OUT uniprot.csv
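The same transformation can also be launched from your own Java code through the standard javax.xml.transform API (JAXP) instead of the Xalan command line. The following is a minimal sketch, not part of the setup described above: the class name XslRunner and its two methods are made up for this post, and it uses whichever XSLT processor the JVM finds (Xalan if it is on the classpath, otherwise the one bundled with the JDK).

```java
import java.io.File;
import java.io.StringWriter;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XslRunner {

    /** Applies the stylesheet to the input document and writes the output to the given result. */
    public static void transform(Source xml, Source xsl, Result out) {
        try {
            // Compile the stylesheet, then run it against the input document
            TransformerFactory.newInstance().newTransformer(xsl).transform(xml, out);
        } catch (TransformerException e) {
            throw new RuntimeException("XSL transformation failed", e);
        }
    }

    /** Convenience variant that returns the output as a string. */
    public static String transformToString(Source xml, Source xsl) {
        StringWriter buffer = new StringWriter();
        transform(xml, xsl, new StreamResult(buffer));
        return buffer.toString();
    }

    public static void main(String[] args) {
        // Same file names as in the command line above
        transform(new StreamSource(new File("uniprot.xml")),
                  new StreamSource(new File("uniprot.xsl")),
                  new StreamResult(new File("uniprot.csv")));
    }
}
```

Running java XslRunner in the folder that contains uniprot.xml and uniprot.xsl should produce uniprot.csv in the same folder, just as the command line does. The transformToString variant is handy when you want to inspect the output of a small stylesheet without writing a file.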


Tip: remember that if you manipulate large XML files, your Java runtime may complain about the heap space. You can raise the maximum heap size by adding the -Xmx parameter to the java command line (for example -Xmx1024m); -Xms only sets the initial heap size.
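To check how much heap the JVM actually gets with your command-line options, you can print the limit reported by the runtime. This is just a small sketch; the class name HeapInfo is made up for this post, and the value it prints is the maximum heap, that is, the limit controlled by the -Xmx option.

```java
public class HeapInfo {

    /** Maximum heap the JVM will attempt to use, in megabytes. */
    public static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Maximum heap: " + maxHeapMegabytes() + " MB");
    }
}
```

For example, run java -Xmx1024m HeapInfo and compare the result with a run without the option; the exact figure reported can differ slightly from the value you pass.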

Further reading:

Saturday, April 16, 2011

Development of Drugs in Public-Private Partnerships environments

This week we organized a debate about the Development of Drugs in Public-Private Partnerships (PPP) environments. It prompted some passionate contributions from the participants.

Here are the slides we used to introduce the subject. We use orphan diseases as a potential field where PPPs could be fruitful, and we present the Grants4Targets project as an example of a PPP.

If you are looking for data for your market research on drug development, I suggest you have a look at the References slide (slide #18); it contains interesting material.