A simple Taverna tutorial – Part 2: Multiple sequence alignment
The aim of this tutorial is to generate a multiple sequence alignment from a Blast report. Multiple sequence alignments are important for the identification of conserved sequence regions across divergent sequences.
First you will need the following code to parse out the accession numbers from the report:
import java.util.regex.Matcher;
import java.util.regex.Pattern;public class BlastAccessionParser {
// Unique regular expression to parse out swissprot accession numbers from a Blast report
private final Pattern pattern = Pattern.compile(”[OPQ]\\w++”);
public String parseAccNo(String blastReport) {
Matcher m;
// Splits each line in the text by a carriage return//
String[] lines = blastReport.split(”\n”);
StringBuffer result = new StringBuffer();// Iterates through the lines of text. If a match is found then the next step proceeds.
// This iterates throught to the 32nd line of the Blast report to match the first 10 sequence hits.
for (int i = 0; i < 32; i++) {
m = pattern.matcher(lines[i]);// Splits the lines into separate elements
String[] elements = lines[i].split(”\\|”);
if (lines[i].startsWith(”sp”)) {
result.append(elements[1] + “\n”);
}
}return result.toString().trim();
}
}
This extracts the first 10 accession numbers from the report. However, this can be altered as the first 10 may not be appropriate for the multiple alignment. This is down to the user.
This needs to be saved as another .jws file and added into Taverna using the WSDL Scavenger.
Following this:
- Add the output of parseAccNo to a split_string_into_string_list_processor. This is because the inputs need to be separated before entering the uniprot service.
- Add another String Constant and edit the value to a newline ‘\n’.
- From the Available Processors list add a Biomart Service – Uniprot and add this to the workflow.
- Then you need to configure the uniprot service by right-clicking on uniprot in the Advanced Model Explorer.
- On the left hand side of the Biomart service box select Filters and click on external identifiers and tick the box ‘limit to proteins’ with UNIPROT (ID)s. See below:
- Then select Attributes and for Header information select Uniprot AC and Protein name and for Uniprot Sequences select Protein sequence. Also in the top right hand corner make sure ‘Unique results only’ is ticked. Then just close the box.
- This will produce 3 lists of outputs (uniprot.sptr_ac, uniprot.protein_name, uniprot_seq). These need to be organised into FASTA format.
- This can be achieved using a Beanshell script. This processor can be found in the Local Services list. Add this to the workflow.
- The beanshell needs to be configured. In the script tab paste in the following code:
StringBuffer seqFile = new StringBuffer();
seqFile.append(”>” + accNo + “|” + seqName + “\n” + sequence + “\n”);
String fasta = seqFile.toString();
Then select the Ports tab and add 3 inputs (accNo, seqName and sequence) and add 1 output (fasta). Then close the box.
- You can now add all the relevant outputs from uniprot to the inputs of the beanshell processor.
- IMPORTANT – In the Advanced Model Explorer window you need to click on the beanshell_scripting_host and select the metadata tab. You need to then click on ‘Create iteration strategy’. Then highlight cross product and change this to dot product. This is essential to make sure all the outputs from the uniprot service are correlated correctly (1st against 1st, 2nd against 2nd…)
- This processor will produce a list of separate FASTA files. These need to be merged into a single file for the ClustalW multiple sequence alignment program.
- This can be achieved using another local java widget in the Local Services list. This is the Merge_string_list_to_string under the ‘list’ options.
- The output of the beanshell processor ‘fasta’ needs to be connected to the ’stringlist’ input port of Merge_string.
- Also another String Constant needs to be added and connected to the seperator input port of Merge_string. For this processor the value needs to be edited and left blank.
- A ClustalW web service can be found here. Add this WSDL file the usual way with the WSDL Scavenger.
- For this service use the ‘analyzeSimple’ processor for the workflow. Connect the output of Merge_string to the ‘query’ input port for ClustalW. Then simply connect this to the final workflow output for your multiple sequence alignments.
The workflow should look like the one below:

Please note: At the time of writing this post the uniprot biomart service was returning results in the wrong order. This is why the uniprot outputs are connected to the wrong input ports for the beanshell processor in order to produce the desired results. I have raised this in the Taverna Users mailing list.
Hope this is useful, do not hesitate to ask any questions regarding this post.
Thanks
Kieren
A simple Taverna tutorial
Bioinformatics workflows are an extremely effective way for reducing the tedium of the traditional ‘cut & paste’ methods when using distributed services on the web. The Taverna Workbench is a fantastic tool for automating in-silico experiments. Not only is the automation of these large scale analysis pipelines great for time saving and eradicating human error, they enable provenance tracking of intermediate results and allow for method re-use by publishing the workflows you have designed. If you are a bioinformatician and you are not familiar with any kind of workflow technology then you need to catch up.
This is a simple tutorial on how to build a pipeline for the analysis of protein sequences. In this example I will demonstrate how to automate numerous BLASTp jobs on a large list of protein sequences.
First of all you need to add an input ‘proteins’. The BLAST program requires each protein to be in FASTA format. Therefore, in order to separate each sequence, a small piece of code is required:
public class RemoveFirstCharacter {
public String removeFirstCharacter(String fastaList) {
return fastaList.substring(1).toString();
}
}
This can be used as a web service by saving the file as .jws and saving it to your jakarta/webapps/axis directory. This can be accessed by copying and pasting the url of the WSDL file into Taverna by right clicking on the Available processors and selecting ‘add new WSDL scavenger’ or alternatively placing the code into a beanshell script. Connect the input to this processor. Then you need to follow these instructions:
- Connect the output of RemoveFirstCharacter to a ‘Split string into string list by regular expression’ processor (found in the Available Services window>Local Services >Local Java Widgets >text menu).
- A second input is required for this processor so you need to add another Local Service processor known as a String Constant to your workflow and edit the string value to ‘>’.
- This needs to be connected to the ‘Regex’ port.
- Each sequence must then be converted back into FASTA format by adding another processor ‘Concatenate_two_strings’.
- Connect the output of ‘Split string into string list by regular expression’ into the ’string2′ input port of concatenate and connect the same StringConstant (’>’) to the ’string1′ port which will split your proteins into separate objects. See below:
This can then be connected to a number of services as each protein will be sent consecutively to the next processor. In this example I will demonstrate using Blastp. The WSDL file for this Blast service can be found here.
- Select the searchSimple processor from the list once it appears in the Available Services window.
- Connect the output port of the concatenate processor to the ‘query’ input for the Blast service.
- This services requires 2 extra inputs to define the database and the program. Therefore, add another two String Constants to the workflow.
- These can be renamed, for example to ‘Database’ and ‘Program’. In this example edit the string value of the Database processor to ‘SWISS’ and the Program to ‘blastp’.
- Connect these processors to the relevant input ports for the searchSimple processor. See below:
In my next post I will explain by providing examples of how to extract specific information from the BLAST reports and how to store this data into a relational database.
Too hot to handle
Generating new research ideas requires a detailed knowledge and awareness of the subject area. Thankfully, due to the availability of text mining services, literature can be ‘intelligently’ mined for relevant articles. GoPubMed allows you to mine PubMed abstracts using specific search terms. Four options are available including an ontology-based literature search that annotates the PubMed abstracts with your keywords and then groups the articles into a hierachy based on the gene ontologies. An advanced semantic search allows you to search using actual GO terms such as ‘protein biosynthesis’ or ‘apoptosis’.
The HotTopic feature is a really useful system for retrieving statistical data based on the biomedical literature. This feature performs a bibliometric analysis graphically displaying the growth of literature chronologically. Key authors and journals are highlighted along with geographic information based on the publications.
The figures below represent the trends related to copy number variation in the human genome as this has generated widespread interest over the last few years. These variations have been largely overlooked in previous studies and are believed to be associated with a number of genetic diseases.
Fig.1. A graph displaying the growth of literature related to copy number variation per year over the last 20 years.
Fig.2. A pie chart diplaying the authors from countries active in publishing on “copy number variation”.
Literature mining: Let the literature find you
Keeping abreast of the wealth of scientific literature by todays standards is a daunting task. It is extremely time consuming yet crucial for a scientist to remain on top of their field. Hubmed is a really useful interface that accesses Pubmed that allows you to search for relevant articles based on keywords within the articles. Expanding the search terms produces the most common keywords within the first 500 abstracts. A boolean option allows you to create conditional relationships between these terms, such as “optic AND neuropathy” as opposed to these words appearing alone.
I have developed programs that mine keywords and conditional phrases from gene ontology records and OMIM documents. For example if you are searching for a gene involved in eye disease you may want to know if the candidate has previously been identified as causing some other eye-related disease or phenotype. Automating these processes allows you to scan numerous OMIM and gene ontology records for interesting terms. This can be developed further by creating conditions between the words and phrases. A scoring system can then be applied to allow you to compare numerous candidates of interest. A simple word count based on the occurrences of these key phrases would be a reasonable place to start.
If you find yourself requiring more advanced techniques refer to text mining.
Support Vector Machine Classification – Validating your training set
Using a support vector machine for multivariate analysis requires a specific training set of data. On the subject of my particular research, my aim is to classify mitochondrial and non-mitochondrial proteins. Various parameters can be used to identify whether a candidate is mitochondrial or not. These include localisation prediction software, co-expression data, epitope tagging, disease associations, etc as discussed in my earlier post.
The parameters you choose are crucial for the training as these determine how the SVM will classify candidates.
My training set consists of known mitochondrial proteins (1) and known non-mitochondrial proteins (-1). The entire training set consists of 1300 proteins (650 each).
This training set can then be tested for false positives and false negatives. This is in the form of a program that extracts a random 65 candidates from each dataset (10%) from the local database and tests this against the remaining candidates (1170 proteins).
The program can be requested to perform multiple runs (i.e x 100) allowing you to calculate the average for false positives and negatives (sensitivity and specificity).
This process is performed using SQL and JDBC. Once 100 random testsets have been generated these can be passed to a series of SQL statements embedded in a JDBC program to organise and correlate all the data in preparation for analysis using SVM software such as SVM_Light. SVM_Light requires a training file in order to build a model file. The testset can then be applied to the classification procedure using the model file to classify the candidates in the testset producing a prediction file.
An excellent tutorial on support vector machines can be found here based at www.bioinformaticsonline.co.uk.
Using JConsole to monitor workflow memory usage
Building workflows (using Taverna for example) can sometimes run into Java heap space problems. I have had problems in the past where the workflow would hang and the following error would be generated on the command line:
java.lang.OutOfMemoryError: Java heap space
Obviously if you are sending large jobs through a workflow with numerous processors this problem will occur. What I have found really useful is the java tool JConsole which allows you to monitor java heap space, memory usage, threads, etc.
I had a workflow that kept running into java heap space problems whereby the committed memory was exceeding 1000MB. I altered the run.me script in Taverna to allocate 2048M of memory for Java allowing the workflow to complete successfully.

Fig.1. JConsole monitoring Java heap space.
Courses in Bioinformatics
I completed the MRes in Bioinformatics at Newcastle University in 2005 and was offered my current position by my MRes project supervisor. I would certainly recommend this course for undergraduates who are interested in this field. However, I would like to advise applicants on a few things:
- Apply as early as possible as they have studentships available but they are limited.
- If you are a from a biology/life science background I highly recommend you take a short course in Java programming and/or work through some online tutorials. I can’t stress enough how beneficial this will be. Follow this link for one I recommend.
- If you are from a computing science/non-life science background I recommend you read through some basic texts on molecular biology before the course commences.
- For an overview of bioinformatics for people starting out in the field, this is a great place to start 2Can Bioinformatics at the EBI.
From that year there was about 22 of us on the course. I’m pretty sure more than 50% are now studying PhDs in bioinformatics or systems biology and the majority of the remaining (including myself) are in full time employment either in Industry or research.
Candidate gene identification for mitochondrial disease
Inherited mitochondrial disorders are known to be some of the most common forms of genetic disease affecting 1 in 8000 of the population. The majority of mitochondrial genes are nuclear-encoded and transported by means of protein import. Unique characteristics within the amino acid sequences allow the cell to recognise these proteins as mitochondrial.
E-Science for Molecular Biology
The large increase in scientific data requires distributed global collaborations enabled by the internet. Large scale computing resources are required for the integration of this data held across multiple sites around the world. In the future, powerful computing resources will be available for research institutes based on a new infrastructure known as the ‘grid’. This enables scientists to analyse very large datasets that were previously inexplorable. The grid involves utilisation of distributed computing, storage facilities and external networks.
Using workflows to generate data is very effective. The main problems lie with the reliability of distributed open software and the varying data formats that exist. This will inevitably improve in the future and e-Science based approaches will have a huge impact on the scientific community.
Development of an integrated resource for rapid candidate gene identification.
An integrated bioinformatics resource has been developed that rapidly facilitates the search for nuclear-mitochondrial genes allowing biologists to select the most suitable candidates for further research.
I am currently developing a fully automated workflow using the Taverna Workbench which was developed under the myGrid initiative. The workflow incorporates a number of web services including subcellular localisation prediction programs including Mitoprot, TargetP and Predotar. The resource also includes online and local database queries regarding gene ontologies and co-expression data from databanks including Swissprot, Ensembl, EMBL and Affymetrix. Finally, a support vector machine is used for candidate gene classification using SVM_Light. There is a really good tutorial on the basics of support vector machines here.
This is connected to a relational database that stores all the relevant results from each execution of the workflow using JDBC and SQL.
Fig.1. The Taverna workbench.
The workflow requires chromosomal coordinates as input and uses Biomart to retrieve the sequences within that region. Each sequence is then individually analysed through the workflow due to the implicit iteration functionality of the Taverna workbench.
The performance of the workflow and database has proved to be a dramatic improvement compared to the traditional bioinformatics ‘cut & paste’ approach providing biologists with more time for scientific analysis. Various candidate lists have been generated and are currently undergoing laboratory investigation.
Comments(3)
Comments(4)

Comments(1)
