Candidate gene identification for mitochondrial disease

Inherited mitochondrial disorders are among the most common forms of genetic disease, affecting 1 in 8000 of the population. Most mitochondrial proteins are encoded by nuclear genes and delivered to the mitochondrion by protein import. Characteristic features of their amino acid sequences allow the cell to recognise these proteins as mitochondrial.

E-Science for Molecular Biology

The large increase in scientific data requires distributed global collaborations enabled by the internet. Integrating data held across multiple sites around the world requires large-scale computing resources. In the future, powerful computing resources will be available to research institutes through a new infrastructure known as the ‘grid’, which combines distributed computing, storage facilities and external networks. This will enable scientists to analyse very large datasets that were previously impractical to explore.

Workflows are a very effective way to generate data. The main problems lie with the reliability of distributed open software and the variety of data formats in use. Both should improve over time, and e-Science based approaches are likely to have a large impact on the scientific community.

Development of an integrated resource for rapid candidate gene identification

An integrated bioinformatics resource has been developed that speeds up the search for nuclear-encoded mitochondrial genes, allowing biologists to select the most suitable candidates for further research.
I am currently developing a fully automated workflow using the Taverna Workbench, which was developed under the myGrid initiative. The workflow incorporates a number of web services, including the subcellular localisation prediction programs Mitoprot, TargetP and Predotar. The resource also runs online and local database queries for gene ontologies and co-expression data against databanks including Swissprot, Ensembl, EMBL and Affymetrix. Finally, a support vector machine, implemented with SVM_Light, is used to classify the candidate genes. There is a really good tutorial on the basics of support vector machines here.
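To make the classification step a little more concrete, here is a minimal sketch of how per-gene prediction scores could be written in the sparse input format that SVM_Light reads (`<target> <index>:<value> ... # comment`). The function name, the choice and order of features, the scores and the gene label are all assumptions made for illustration; the post does not say which features the real workflow passes to the classifier.

```python
def to_svmlight_line(label, scores, comment=""):
    """Format one example as '<target> <index>:<value> ... # comment'."""
    if label not in (-1, 0, 1):
        raise ValueError("label must be +1, -1, or 0 (unlabelled)")
    target = "+1" if label == 1 else str(label)
    # Feature indices start at 1 and must appear in increasing order.
    features = " ".join(f"{i}:{v:g}" for i, v in enumerate(scores, start=1))
    line = f"{target} {features}"
    if comment:
        line += f" # {comment}"
    return line


# Hypothetical scores for one gene, in an assumed order:
# Mitoprot, TargetP, Predotar.
example = to_svmlight_line(1, [0.92, 0.81, 0.4], comment="GENE_A")
print(example)
```

Known mitochondrial genes would be written with a positive label, known non-mitochondrial genes with a negative one, and the unclassified candidates with a label of 0.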

The workflow is connected through JDBC to a relational database, which stores the relevant results from each execution using SQL.
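The idea of recording every run in a relational database can be sketched as follows. This is not the actual schema or the JDBC code used by the workflow: it uses Python's built-in `sqlite3` module as a stand-in, and the table name, column names, run identifier and scores are all hypothetical.

```python
import sqlite3


def store_results(conn, run_id, rows):
    """Insert (gene_id, tool, score) rows for one workflow run."""
    # Hypothetical schema: one row per gene, per tool, per run.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS prediction (
               run_id  TEXT NOT NULL,
               gene_id TEXT NOT NULL,
               tool    TEXT NOT NULL,
               score   REAL,
               PRIMARY KEY (run_id, gene_id, tool)
           )"""
    )
    # Parameterised statements keep values separate from the SQL text.
    conn.executemany(
        "INSERT INTO prediction (run_id, gene_id, tool, score) "
        "VALUES (?, ?, ?, ?)",
        [(run_id, gene_id, tool, score) for gene_id, tool, score in rows],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
store_results(conn, "run-001", [
    ("GENE_A", "Mitoprot", 0.92),
    ("GENE_A", "TargetP", 0.81),
    ("GENE_B", "Mitoprot", 0.12),
])

# Rank genes within a run by their mean score across tools.
ranked = conn.execute(
    "SELECT gene_id, AVG(score) AS mean_score FROM prediction "
    "WHERE run_id = ? GROUP BY gene_id ORDER BY mean_score DESC",
    ("run-001",),
).fetchall()
print(ranked)
```

Keying each row on the run identifier means repeated executions over the same region can be kept side by side and compared later.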

Fig.1. The Taverna workbench.

The workflow takes chromosomal coordinates as input and uses Biomart to retrieve the sequences within that region. Each sequence is then analysed individually, because the implicit iteration functionality of the Taverna workbench applies every step to each item in a list in turn.
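Outside Taverna, the same per-sequence pattern can be sketched in a few lines: split the retrieved FASTA text into records, then apply one analysis function to each record. The `analyse` function below is a toy stand-in that reports sequence length and counts basic residues near the N-terminus. It is not what Mitoprot, TargetP or Predotar compute, and the identifiers and sequences are made up.

```python
def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    identifier, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Keep only the first whitespace-delimited token of the header.
            identifier = (line[1:].split() or [""])[0]
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def analyse(sequence):
    """Toy per-sequence analysis (placeholder for the real predictors)."""
    n_term = sequence[:20]
    return {
        "length": len(sequence),
        # Arginine (R) and lysine (K) within the first 20 residues.
        "n_term_basic": sum(n_term.count(aa) for aa in "RK"),
    }


# Made-up records standing in for sequences retrieved for a region.
fasta = ">GENE_A demo\nMLRKSARLLG\nAAST\n>GENE_B demo\nMDEEL\n"

# Explicit version of what implicit iteration does: one call per sequence.
results = {gene_id: analyse(seq) for gene_id, seq in parse_fasta(fasta)}
print(results)
```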

The workflow and database have proved dramatically faster than the traditional bioinformatics ‘cut & paste’ approach, leaving biologists more time for scientific analysis. Various candidate lists have been generated and are currently undergoing laboratory investigation.

2 comments so far

  1. Truth Seeker on

    Wow. This is really good work and I hope it is fruitful in that approximately 1 child in 4000 in the United States will develop mitochondrial disease by the age of 10! I had never heard of the disease before.

    I wish you much success – the lives of many kids may possibly depend on you.

    Truth Seeker

  2. Hong Chang Bum on

    Great. Thank you for your posting.
    How can I find your workflow file…?

