Single layer perceptron for protein sequence classification
I recently began exploring perceptron models, made some progress with my own UniProtKB dataset, and wanted to share the project here!
My hypothesis is that given some basic numerical features regarding an amino acid sequence, I can linearly classify it as belonging to Homo sapiens or not. But before I dive into the code, here is a little intro on what a perceptron model actually is.
In short, a single layer perceptron is a supervised artificial neural network with no hidden layers that can classify linearly separable data.
In this figure, we can see that the neuron, the fundamental unit that processes our input, applies a weight (w) to each of our input features (x), sums the weighted inputs, and passes the result through a simple step function to produce the output.
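To make the idea concrete, here is a minimal sketch of a single neuron: a weighted sum of the inputs followed by a step function. The function name perceptron_output, the bias term b, and all the numbers are my own illustrative assumptions, not values from the project.

```python
import numpy as np

def perceptron_output(x, w, b):
    """Weighted sum of the inputs followed by a step activation."""
    z = np.dot(w, x) + b          # weighted sum of the features plus a bias
    return 1 if z >= 0 else 0     # step function: output 1 or 0

# Hypothetical values, chosen only for illustration
x = np.array([0.5, 0.2, 0.1])     # three input features
w = np.array([0.4, -0.6, 0.9])    # one weight per feature
b = -0.1                          # bias term

print(perceptron_output(x, w, b))
```

Changing the weights or the bias moves the decision boundary, which is exactly what training does.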
In a multi-layer perceptron, there would additionally be one or more hidden layers between the input and output, with a more complex (non-linear) activation function than just a summation or step function, before the output is created.
The input data should preferably be standardized or normalized before the weights are applied and further processing is carried out. In the Python sklearn library, the initial weights are set internally (zeros by default, unless you pass your own starting values to fit). With each iteration, the weights are adjusted whenever the model's output disagrees with the true label, in steps scaled by a factor called the learning rate, until convergence is achieved. Convergence matters because it means further iterations no longer reduce the training error.
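The update loop described above can be sketched from scratch in a few lines. This is a simplified version of the classic perceptron learning rule, not sklearn's actual implementation. The function name train_perceptron, the zero starting weights, the learning rate of 1.0, and the toy AND dataset are all assumptions made for illustration.

```python
import numpy as np

def train_perceptron(X, y, learning_rate=1.0, max_iter=100):
    """Classic perceptron rule: nudge the weights after every mistake."""
    w = np.zeros(X.shape[1])      # starting weights (assumed to be zero here)
    b = 0.0                       # bias term
    for _ in range(max_iter):
        errors = 0
        for xi, target in zip(X, y):
            prediction = 1 if np.dot(w, xi) + b >= 0 else 0
            update = learning_rate * (target - prediction)
            if update != 0:       # only misclassified samples change the weights
                w += update * xi
                b += update
                errors += 1
        if errors == 0:           # a full pass with no mistakes: converged
            break
    return w, b

# Toy, linearly separable data: the logical AND of two inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w, b = train_perceptron(X, y)
predictions = [1 if np.dot(w, xi) + b >= 0 else 0 for xi in X]
print(predictions)
```

Because the AND data is linearly separable, the loop stops early once a full pass produces no mistakes.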
For my custom dataset, I fetched over 98000 records from UniProtKB that were related to reviewed molecular functions and kept only the length, mass, organism and Enzyme Commission (EC) number of each record. The dataset looked something like this:
Since there is a vast trove of data here, I decided to simply use the numerical features to categorize each record as belonging to Homo sapiens or not. To do that, I had to convert everything to numbers and normalize the data, so:
- I removed the unnecessary ‘Entry’ column (it cannot be used as a feature or category)
- Replaced the ‘Organism’ column with an integer representation [0 for Homo sapiens, 1 for any other]
- Separated the EC number into its 4 sub-domains (only the first occurrence was retained)
- Normalized each numerical column separately
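The four steps above can be sketched with pandas roughly as follows. This is not the exact code from my notebook. The rows are made up, the column names ‘Length’, ‘Mass’ and ‘EC number’ are assumptions about how the export is labelled, and min-max scaling is just one possible choice of normalization.

```python
import pandas as pd

# Toy rows standing in for the UniProtKB export; all values are made up
df = pd.DataFrame({
    "Entry": ["A0001", "A0002", "A0003"],
    "Organism": ["Homo sapiens (Human)", "Mus musculus (Mouse)", "Homo sapiens (Human)"],
    "Length": [350, 120, 780],
    "Mass": [39000, 13500, 86000],
    "EC number": ["3.4.21.4", "2.7.11.1; 2.7.10.2", "1.1.1.1"],
})

# 1. Drop the identifier column, which carries no predictive information
df = df.drop(columns=["Entry"])

# 2. Encode the organism: 0 for Homo sapiens, 1 for anything else
df["Organism"] = (~df["Organism"].str.startswith("Homo sapiens")).astype(int)

# 3. Keep only the first EC number and split it into its four levels
first_ec = df["EC number"].str.split(";").str[0].str.strip()
ec_parts = first_ec.str.split(".", expand=True)
# Non-numeric levels (such as "-") become NaN and are replaced by 0 here
ec_parts = ec_parts.apply(pd.to_numeric, errors="coerce").fillna(0)
ec_parts.columns = ["EC1", "EC2", "EC3", "EC4"]
df = pd.concat([df.drop(columns=["EC number"]), ec_parts], axis=1)

# 4. Min-max normalize each numerical feature column separately
feature_cols = ["Length", "Mass", "EC1", "EC2", "EC3", "EC4"]
col_min = df[feature_cols].min()
col_max = df[feature_cols].max()
span = (col_max - col_min).replace(0, 1)   # avoid dividing by zero
df[feature_cols] = (df[feature_cols] - col_min) / span

print(df)
```

After this, every feature lies between 0 and 1 and the ‘Organism’ column holds the class label.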
Following this, I prepared the train and test data using the sklearn function train_test_split.
The features were successfully split 70–30 between train and test, which I fed directly to the Perceptron class from the sklearn package.
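A 70–30 split with train_test_split looks roughly like this. The random arrays below are only a stand-in for the real feature matrix and labels, and the random_state value is an assumption added so the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the normalized features and 0/1 organism labels
rng = np.random.default_rng(0)
X = rng.random((100, 6))               # 100 records, 6 numerical features
y = rng.integers(0, 2, size=100)       # made-up class labels

# test_size=0.3 keeps about 70% of the rows for training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape, X_test.shape)
```

The rows are shuffled before splitting by default, so the order of the original records does not leak into the test set.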
The model was fit using 1000 iterations and a 0.001 learning rate, which is quite standard for linear Perceptron models. Then, using accuracy_score, I displayed the final test result. And lo and behold, 98.04%. Not bad for a first run! I also tried using less training and more testing data, and the results varied from 94% to 98%, which suggests to me that the hypothesis I began with, that a sequence can be classified as Homo sapiens or not given a few of its features, holds up!
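Here is a minimal sketch of the fit-and-score step on synthetic data. I am assuming that the 1000 iterations map to the max_iter parameter and the 0.001 learning rate maps to eta0. The data and the random_state values are made up for illustration, so the printed accuracy says nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, linearly separable stand-in for the real features and labels
rng = np.random.default_rng(0)
X = rng.random((500, 6))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# max_iter caps the number of passes over the data; eta0 is the learning rate
model = Perceptron(max_iter=1000, eta0=0.001, random_state=42)
model.fit(X_train, y_train)

# Score the model on the held-out test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2%}")
```

Re-running with a different test_size is an easy way to repeat the experiment with less training and more testing data.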
One disadvantage of using this simple linear model is that it cannot capture non-linear relations between our features. This means that we could potentially improve the performance of our prediction model by using a more complex neural network with hidden layers. But perhaps that is fodder for another article. See you then!
The complete code can be found on my GitHub here: https://github.com/AdiBad/image_analysis_DL/blob/main/single_layer_perceptron_subcel.ipynb
The dataset I used is available in the subfolder data above.