Support Vector Machine Classification – Validating your training set
Using a support vector machine (SVM) for multivariate analysis requires a carefully chosen training set. In my own research, the aim is to classify proteins as mitochondrial or non-mitochondrial. Various parameters can be used to judge whether a candidate is mitochondrial, including localisation prediction software, co-expression data, epitope tagging and disease associations, as discussed in my earlier post.
The parameters you choose are crucial for the training as these determine how the SVM will classify candidates.
My training set consists of known mitochondrial proteins (labelled 1) and known non-mitochondrial proteins (labelled -1), 1300 proteins in total (650 of each).
This training set can then be tested for false positives and false negatives. A program extracts 65 random candidates (10%) from each of the two classes in the local database, giving a test set of 130 proteins, and tests these against the remaining candidates (1170 proteins), which are used for training.
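The hold-out step described above can be sketched in Java as follows. This is a minimal illustration, not the original program: the class name HoldOutSplit, the placeholder protein IDs and the fixed random seed are all assumptions, and the real version reads its candidates from the local database rather than from in-memory lists.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class HoldOutSplit {

    /** Holds the proteins held out for testing and those left for training. */
    public static final class Split {
        public final List<String> test;
        public final List<String> train;

        Split(List<String> test, List<String> train) {
            this.test = test;
            this.train = train;
        }
    }

    /** Shuffles a copy of one class and holds out the given fraction as test data. */
    public static Split split(List<String> proteins, double testFraction, Random rng) {
        List<String> shuffled = new ArrayList<>(proteins);
        Collections.shuffle(shuffled, rng);
        int testSize = (int) Math.round(shuffled.size() * testFraction);
        List<String> test = new ArrayList<>(shuffled.subList(0, testSize));
        List<String> train = new ArrayList<>(shuffled.subList(testSize, shuffled.size()));
        return new Split(test, train);
    }

    public static void main(String[] args) {
        // Placeholder IDs (hypothetical) stand in for the 650 proteins of each class.
        List<String> mito = new ArrayList<>();
        List<String> nonMito = new ArrayList<>();
        for (int i = 0; i < 650; i++) {
            mito.add("MITO_" + i);
            nonMito.add("NONMITO_" + i);
        }

        Random rng = new Random(42); // fixed seed only so the sketch is repeatable
        Split pos = split(mito, 0.10, rng);
        Split neg = split(nonMito, 0.10, rng);

        // 65 + 65 held out for testing, 585 + 585 left for training
        System.out.println("test set: " + (pos.test.size() + neg.test.size()));
        System.out.println("training set: " + (pos.train.size() + neg.train.size()));
    }
}
```

Splitting each class separately keeps the test set balanced at 65 mitochondrial and 65 non-mitochondrial proteins, which matches the 10% per dataset described above.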
The program can be asked to perform multiple runs (e.g. 100), allowing you to calculate the average numbers of false positives and false negatives, and from these the sensitivity and specificity.
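As a rough sketch of how those averages could be computed, the Java below uses the standard definitions: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The class name RunMetrics and the per-run counts are made up for illustration; they are not results from my data.

```java
public class RunMetrics {

    /** Fraction of known mitochondrial proteins that were predicted mitochondrial. */
    public static double sensitivity(int truePos, int falseNeg) {
        return (double) truePos / (truePos + falseNeg);
    }

    /** Fraction of known non-mitochondrial proteins that were predicted non-mitochondrial. */
    public static double specificity(int trueNeg, int falsePos) {
        return (double) trueNeg / (trueNeg + falsePos);
    }

    /** Arithmetic mean over all runs. */
    public static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        // Hypothetical false negative / false positive counts for three runs,
        // each run testing 65 positives and 65 negatives.
        int[] falseNegs = {5, 7, 4};
        int[] falsePoss = {6, 3, 8};
        int perClass = 65;

        double[] sens = new double[falseNegs.length];
        double[] spec = new double[falsePoss.length];
        for (int run = 0; run < falseNegs.length; run++) {
            sens[run] = sensitivity(perClass - falseNegs[run], falseNegs[run]);
            spec[run] = specificity(perClass - falsePoss[run], falsePoss[run]);
        }

        System.out.printf("mean sensitivity: %.3f%n", mean(sens));
        System.out.printf("mean specificity: %.3f%n", mean(spec));
    }
}
```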
This process is performed using SQL and JDBC. Once 100 random test sets have been generated, they are passed to a series of SQL statements embedded in a JDBC program, which organise and correlate all the data in preparation for analysis with SVM software such as SVM_Light. SVM_Light requires a training file in order to build a model file. The model file is then used to classify the candidates in the test set, producing a prediction file.
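SVM_Light reads training and test files with one example per line: the target label followed by feature:value pairs in increasing feature order. A minimal Java sketch of writing such a line is shown below. The class name SvmLightLine, the three feature values and the four-decimal formatting are assumptions for illustration, not the parameters or code from my pipeline.

```java
import java.util.Locale;

public class SvmLightLine {

    /** Formats one example as "<label> 1:<v1> 2:<v2> ...", using 1-based feature indices. */
    public static String format(int label, double[] features) {
        StringBuilder sb = new StringBuilder();
        sb.append(label > 0 ? "+1" : "-1");
        for (int i = 0; i < features.length; i++) {
            sb.append(' ')
              .append(i + 1)
              .append(':')
              .append(String.format(Locale.ROOT, "%.4f", features[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical scores, e.g. a localisation prediction, a co-expression
        // score and an epitope tagging result for a known mitochondrial protein.
        double[] features = {0.92, 0.35, 1.0};
        System.out.println(format(1, features)); // prints "+1 1:0.9200 2:0.3500 3:1.0000"
    }
}
```

With files in this format, svm_learn takes the training file and writes the model file, and svm_classify takes the test file and the model file and writes the prediction file. The sign of each value in the prediction file gives the predicted class.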
An excellent tutorial on support vector machines can be found at www.bioinformaticsonline.co.uk.