A simple Taverna tutorial – Part 2: Multiple sequence alignment
The aim of this tutorial is to generate a multiple sequence alignment from a Blast report. Multiple sequence alignments are important for the identification of conserved sequence regions across divergent sequences.
First you will need the following code to parse out the accession numbers from the report:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BlastAccessionParser {
    // Regular expression for Swiss-Prot accession numbers in a Blast report
    private final Pattern pattern = Pattern.compile("[OPQ]\\w++");

    public String parseAccNo(String blastReport) {
        Matcher m;
        // Split the report into lines on the newline character
        String[] lines = blastReport.split("\n");
        StringBuffer result = new StringBuffer();
        // Scan up to the 32nd line of the Blast report, which covers the first 10 sequence hits.
        // Math.min guards against reports shorter than 32 lines.
        int limit = Math.min(32, lines.length);
        for (int i = 0; i < limit; i++) {
            // Matcher for the accession pattern; not used below, since hits are selected by the "sp" prefix
            m = pattern.matcher(lines[i]);
            // Split the line into its pipe-separated elements
            String[] elements = lines[i].split("\\|");
            // Hit lines start with "sp"; the accession number is the second element
            if (lines[i].startsWith("sp") && elements.length > 1) {
                result.append(elements[1] + "\n");
            }
        }
        return result.toString().trim();
    }
}
This extracts the accession numbers of the first 10 hits from the report. The first 10 may not be the most appropriate sequences for the multiple alignment, so the number of lines scanned (32 in the loop) can be changed to suit your own report.
The class needs to be saved as another .jws file and added into Taverna using the WSDL Scavenger.
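If you would rather choose the number of hits directly instead of counting report lines, the idea can be sketched as below. This is only an illustration under stated assumptions: the class name `AccessionLimitSketch`, the method `firstAccessions` and the sample report lines are made up for this example and are not part of the service above.

```java
import java.util.ArrayList;
import java.util.List;

public class AccessionLimitSketch {

    // Hypothetical helper: returns up to maxHits accession numbers taken from
    // lines that start with "sp|", in the order they appear in the report.
    public static List<String> firstAccessions(String blastReport, int maxHits) {
        List<String> accessions = new ArrayList<>();
        for (String line : blastReport.split("\n")) {
            if (accessions.size() >= maxHits) {
                break;
            }
            if (line.startsWith("sp|")) {
                // The accession number is the second pipe-separated element
                String[] elements = line.split("\\|");
                if (elements.length > 1) {
                    accessions.add(elements[1]);
                }
            }
        }
        return accessions;
    }

    public static void main(String[] args) {
        // Made-up report fragment, for illustration only
        String report = "Sequences producing significant alignments:\n"
                + "sp|P12345|EXAMPLE1_HUMAN Example protein 1\n"
                + "sp|Q67890|EXAMPLE2_MOUSE Example protein 2\n"
                + "sp|O11111|EXAMPLE3_RAT Example protein 3\n";
        // Newline-separated, like the output of parseAccNo
        System.out.println(String.join("\n", firstAccessions(report, 2)));
    }
}
```

Counting matching lines rather than report lines means the result does not depend on how long the report header is.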
Once the service has been added to Taverna:
- Connect the output of parseAccNo to a split_string_into_string_list processor. The accession numbers need to be separated into a list before they enter the uniprot service.
- Add another String Constant and set its value to a newline ‘\n’, so that the string is split at each line break.
- From the Available Processors list add a Biomart Service – Uniprot and add this to the workflow.
- Then you need to configure the uniprot service by right-clicking on uniprot in the Advanced Model Explorer.
- On the left-hand side of the Biomart service box, select Filters, click on external identifiers, and tick the box ‘limit to proteins’ with UNIPROT (ID)s. See below:
- Then select Attributes. For Header information select Uniprot AC and Protein name, and for Uniprot Sequences select Protein sequence. Also make sure ‘Unique results only’ in the top right-hand corner is ticked, then close the box.
- This will produce 3 lists of outputs (uniprot.sptr_ac, uniprot.protein_name, uniprot_seq). These need to be organised into FASTA format.
- This can be achieved using a Beanshell script. This processor can be found in the Local Services list. Add this to the workflow.
- The beanshell needs to be configured. In the script tab paste in the following code:
StringBuffer seqFile = new StringBuffer();
seqFile.append(">" + accNo + "|" + seqName + "\n" + sequence + "\n");
String fasta = seqFile.toString();
Then select the Ports tab and add 3 inputs (accNo, seqName and sequence) and add 1 output (fasta). Then close the box.
- You can now connect the relevant outputs from uniprot to the inputs of the beanshell processor.
- IMPORTANT – In the Advanced Model Explorer window, click on the beanshell_scripting_host and select the metadata tab. Click on ‘Create iteration strategy’, then highlight cross product and change it to dot product. This is essential to make sure the outputs from the uniprot service are paired up correctly (1st against 1st, 2nd against 2nd…) rather than combined in every possible way.
- This processor will produce a list of separate FASTA records. These need to be merged into a single file for the ClustalW multiple sequence alignment program.
- This can be achieved using another local java widget in the Local Services list. This is the Merge_string_list_to_string under the ‘list’ options.
- The output of the beanshell processor ‘fasta’ needs to be connected to the ‘stringlist’ input port of Merge_string.
- Another String Constant needs to be added and connected to the seperator input port of Merge_string. Edit the value of this constant and leave it blank.
- A ClustalW web service can be found here. Add this WSDL file the usual way with the WSDL Scavenger.
- For this service use the ‘analyzeSimple’ processor for the workflow. Connect the output of Merge_string to the ‘query’ input port for ClustalW. Then simply connect this to the final workflow output for your multiple sequence alignments.
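To make the data flow in the steps above a little more concrete, here is a rough standalone sketch of what the split, dot product and merge stages do to the data. It is not how Taverna implements them: the class `FastaDotProductSketch`, the method names and the sample names and sequences are assumptions made for illustration. Only the FASTA line format is taken from the Beanshell script.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FastaDotProductSketch {

    // Mirrors the Beanshell script: one FASTA record from one accession, name and sequence.
    public static String toFasta(String accNo, String seqName, String sequence) {
        return ">" + accNo + "|" + seqName + "\n" + sequence + "\n";
    }

    // Dot product: the i-th element of each list is combined with the i-th
    // element of the others, giving n records for three lists of length n.
    public static List<String> dotProduct(List<String> accNos, List<String> names, List<String> seqs) {
        if (accNos.size() != names.size() || names.size() != seqs.size()) {
            throw new IllegalArgumentException("Lists must have the same length");
        }
        List<String> records = new ArrayList<>();
        for (int i = 0; i < accNos.size(); i++) {
            records.add(toFasta(accNos.get(i), names.get(i), seqs.get(i)));
        }
        return records;
    }

    public static void main(String[] args) {
        // Splitting on "\n" plays the role of the split-string processor
        List<String> accNos = Arrays.asList("P12345\nQ67890".split("\n"));
        // Made-up names and sequences, for illustration only
        List<String> names = Arrays.asList("Example protein 1", "Example protein 2");
        List<String> seqs = Arrays.asList("MKTAYIAK", "MSDNELQK");

        // Joining with an empty separator plays the role of the merge step
        String merged = String.join("", dotProduct(accNos, names, seqs));
        System.out.print(merged);
    }
}
```

With a cross product, the same two-element lists would give 2 × 2 × 2 = 8 records, most of them pairing an accession number with the wrong name or sequence. Because each record already ends in a newline, a blank separator is enough for the merge.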
The workflow should look like the one below:

Please note: At the time of writing this post the uniprot biomart service was returning results in the wrong order. This is why the uniprot outputs are connected to the wrong input ports for the beanshell processor in order to produce the desired results. I have raised this in the Taverna Users mailing list.
Hope this is useful, do not hesitate to ask any questions regarding this post.
Thanks
Kieren
3 comments so far
Hi Kieran, these workflows look useful, can you upload them to http://www.myexperiment.org ? It’s exactly the kind of thing we’d like people to share. Cheers. Duncan
Hi Kieran, seeing some of the workflows you’re producing has restored my faith in Taverna.
However, I am still struggling to get to grips with some of the various technologies involved. Specifically, what do I do with the java code posted above? For code you used in the previous tutorial, you wrote: “This can be used as a web service by saving the file as .jws and saving it to your jakarta/webapps/axis directory”. Presumably, I should attempt to do the same with this.
I don’t have jakarta, but if I did, is it as simple as just dropping the jws file into the directory and then searching for it using the scavenger? What is special about jakarta that allows this to work? I have access to an apache server, would this do what is required?
Many Thanks, Gareth.
Hi Kearan,
My question is not related to the workflow per se. I saw in your “uniprottutorial.png” picture that you’re using Ubuntu Linux (I guess), and the window list shows that you have the Eclipse IDE and Taverna open at the same time. In my case, even opening one of those applications clogs the memory. Does your machine handle two big Java-based applications well? If yes, what are your hardware specifications?
thanks..