|
The Biological Language Modeling Group is an interdisciplinary group of scientists
that explores the analogy between human languages and protein sequences.
The central hypothesis is:
The collection of 'protein sequences' encoded by the genomes of different organisms
is a set of 'written' representations of different 'languages'
or 'dialects'. The ability of each protein to 'fold'
into a three-dimensional structure that can participate in the functional
networks of proteins in a cell corresponds to the 'meaning'
of a text (Figure 1).
Figure 1: Mapping between human language and protein sequences.
The mapping of a protein sequence to its structure and function is one of the most challenging problems faced by biologists.
The hypothesis provides a means to approach this challenge by applying computational tools developed for the analysis of human languages to protein sequences. This exploration of the analogy has many applications. For example, identifying 'word phrases' that a pathogen uses but its host does not may lead to the identification of drug targets, and 'rare words' seem to indicate the presence of initiators of protein folding. The unprecedented amount of genomic and proteomic data created by projects like the Human Genomic Initiative has created an opportunity to attack the sequence-structure-function mapping problem with data-driven methods.
The Biological Language Modeling Toolkit is a compilation of algorithms that have been adapted from language modeling to biological sequences. The tools listed below may be used for protein sequence analysis. They may also be accessed via the dropdown menu at the top of the page.
Sequence Sorting:
Statistical analysis of biological sequence data requires n-gram string matching and string searches. Searching for a substring in large text data, a problem in many areas of computer science, is handled efficiently with data structures such as suffix trees and suffix arrays. This tool accepts a large whole-genome file as input and generates the suffix array, Longest Common Prefix (LCP) array and rank array for the file. The output may optionally be used as input to the N-gram Extraction tool described below.
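The three arrays named above can be illustrated with a minimal Python sketch. This is not the toolkit's implementation: the function names are hypothetical, the suffix array is built by naive sorting (suitable only for short strings, not whole genomes), and the LCP array uses Kasai's algorithm.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order.

    Naive construction by sorting the suffixes directly; a real tool for
    whole-genome input would need a faster construction algorithm.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_rank_array(suffix_array):
    """rank[i] is the position of the suffix starting at i in the suffix array."""
    rank = [0] * len(suffix_array)
    for position, start in enumerate(suffix_array):
        rank[start] = position
    return rank


def build_lcp_array(text, suffix_array, rank):
    """Kasai's algorithm.

    lcp[k] is the length of the longest common prefix of the suffixes at
    suffix_array[k - 1] and suffix_array[k]; lcp[0] is 0 by convention.
    """
    n = len(text)
    lcp = [0] * n
    h = 0  # length of the current common prefix, carried between iterations
    for i in range(n):
        if rank[i] > 0:
            j = suffix_array[rank[i] - 1]  # suffix sorted just before suffix i
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


suffix_array = build_suffix_array("banana")
rank = build_rank_array(suffix_array)
lcp = build_lcp_array("banana", suffix_array, rank)
```

For the string "banana" this gives the suffix array [5, 3, 1, 0, 4, 2], the rank array [3, 2, 5, 1, 4, 0] and the LCP array [0, 1, 3, 0, 0, 2]. A run of adjacent suffix-array entries whose LCP values are at least n share the same n-gram, which is what makes these arrays useful for n-gram counting.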
N-gram Extraction:
General statistics about a genome file may be obtained using this tool. The tool generates the suffix array and Longest Common Prefix (LCP) array for a genome file (FASTA) and computes the following statistics as requested by the user.
- Total Number of Proteins - value.
- Average length of Proteins - value.
- Length of all proteins - <file>.prot - Gives the length of every protein in the
genome file, with or without the sequence header information.
- N-grams and their frequency of occurrence - <file>.<val>ngrams - Contains the n-grams with their
frequency counts, sorted either by count or by n-gram. The
<val> denotes the n-gram length.
- <file>.<val>ngramsTop - This file is generated only if the
user wishes to see just the most frequently occurring n-grams.
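The statistics listed above can be approximated with a short Python sketch. It is an illustration only: the function names are hypothetical, the input is a list of FASTA lines instead of a file, and the n-grams are counted by direct enumeration with a Counter, not through the suffix and LCP arrays the tool uses.

```python
from collections import Counter


def read_fasta(lines):
    """Parse FASTA-formatted lines into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def count_ngrams(sequences, n):
    """Count overlapping n-grams; an n-gram never spans two proteins."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def genome_statistics(lines, n, top=None):
    """Collect the per-genome statistics; top=None keeps every n-gram."""
    records = read_fasta(lines)
    lengths = [len(seq) for _, seq in records]
    counts = count_ngrams((seq for _, seq in records), n)
    return {
        "total_proteins": len(records),
        "average_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "lengths": lengths,
        # most_common sorts by descending count, like sorting by frequency
        "ngrams": counts.most_common(top),
    }


stats = genome_statistics([">p1", "MKVL", "MK", ">p2", "MKV"], n=2, top=2)
```

Passing a value for top mirrors the ngramsTop output, while leaving it as None returns every n-gram sorted by count.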
Protein N-gram Analysis:
This tool may be used to find how frequently the n-grams of a protein sequence occur in various organisms. The inputs to the tool are a protein sequence, a genome (in FASTA format) and the n-gram size. The output is a statistical table of the n-grams of the protein sequence and their frequency of occurrence in the genome.
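As a rough illustration of the table this tool produces, the following Python sketch looks up each n-gram of a query protein in a set of genome sequences. The function name is hypothetical and the genome is assumed to be already parsed into a list of protein strings.

```python
def protein_ngram_table(protein, genome_sequences, n):
    """Return (n-gram, frequency in genome) rows for the n-grams of a protein.

    Rows follow the order in which each distinct n-gram first appears in the
    query protein; n-grams absent from the genome get a frequency of 0.
    """
    genome_counts = {}
    for seq in genome_sequences:
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            genome_counts[gram] = genome_counts.get(gram, 0) + 1

    rows = []
    seen = set()
    for i in range(len(protein) - n + 1):
        gram = protein[i:i + n]
        if gram not in seen:
            seen.add(gram)
            rows.append((gram, genome_counts.get(gram, 0)))
    return rows


table = protein_ngram_table("MKVA", ["MKVLMK", "MKV"], n=2)
```

Running the same query against genomes of several organisms and placing the resulting columns side by side gives the cross-organism view described above.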
N-gram Comparison:
Comparison of n-gram occurrences in different organisms may be performed using the N-gram Comparison tool. The tool compares various statistical values, including frequencies and standard deviations of n-gram frequencies, for the given sequence datasets. It computes the expected n-gram frequencies from the unigram frequencies, then the difference between the expected and observed n-gram frequencies, and finally the standard deviation of these differences for each n-gram.
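One way to read this computation is sketched below in Python. The details are assumptions, not the tool's documented behaviour: frequencies are relative frequencies, the expected frequency of an n-gram is the product of the unigram frequencies of its residues (an independence model), and the standard deviation is the population standard deviation taken across datasets. All names are hypothetical.

```python
from collections import Counter
from math import prod
from statistics import pstdev


def relative_frequencies(sequences, n):
    """Relative frequency of every overlapping n-gram in the sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def compare_datasets(datasets, n):
    """datasets maps a dataset name to a list of sequences.

    For each n-gram seen in any dataset, report observed minus expected
    frequency per dataset and the standard deviation of those differences.
    """
    per_dataset = {}
    for name, seqs in datasets.items():
        seqs = list(seqs)
        per_dataset[name] = (
            relative_frequencies(seqs, 1),
            relative_frequencies(seqs, n),
        )

    all_grams = set()
    for _, observed in per_dataset.values():
        all_grams.update(observed)

    report = {}
    for gram in sorted(all_grams):
        differences = {}
        for name, (unigram, observed) in per_dataset.items():
            # Expected frequency if residues occurred independently
            expected = prod(unigram.get(aa, 0.0) for aa in gram)
            differences[name] = observed.get(gram, 0.0) - expected
        report[gram] = {
            "differences": differences,
            "std": pstdev(list(differences.values())),
        }
    return report


report = compare_datasets({"x": ["AAB"], "y": ["ABAB"]}, n=2)
```

An n-gram with a large positive difference occurs more often than its residue composition alone would predict, and a large standard deviation flags n-grams whose usage differs between the datasets.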
Yule's Measure:
Yule's Measure may be used to determine feature boundaries of protein sequences by using statistical measures of association between amino acids. The tool computes Yule's Q statistic for protein sequences. Yule's Q is a measure of association between two variables and always takes a value between -1 and 1. A positive value implies that the variables are positively associated, and a negative value implies that they are negatively associated. The tool accepts a protein sequence dataset and a training set (which may be a collection of homologous sequences) and computes a 20x20 Yule table from the training dataset. The values from the generated table are then applied to the input protein sequence.
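The Q statistic itself is standard: for a 2x2 contingency table with cells a, b, c, d it is Q = (ad - bc) / (ad + bc). How the tool builds the table for amino acids is not spelled out here, so the Python sketch below makes explicit assumptions: the association is measured between adjacent residues, Q is set to 0.0 when ad + bc is zero, and the table is applied by looking up each adjacent pair of the input sequence. All names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def yule_q(a, b, c, d):
    """Yule's Q for a 2x2 table; 0.0 when undefined (an assumption)."""
    denominator = a * d + b * c
    return (a * d - b * c) / denominator if denominator else 0.0


def yule_table(training_sequences):
    """Build a 20x20 table of Q values for adjacent residue pairs (x, y)."""
    pair = Counter()    # x immediately followed by y
    first = Counter()   # x in the first position of any adjacent pair
    second = Counter()  # y in the second position of any adjacent pair
    total = 0
    for seq in training_sequences:
        for x, y in zip(seq, seq[1:]):
            pair[x, y] += 1
            first[x] += 1
            second[y] += 1
            total += 1

    table = {}
    for x in AMINO_ACIDS:
        for y in AMINO_ACIDS:
            a = pair[x, y]           # x then y
            b = first[x] - a         # x then not y
            c = second[y] - a        # not x then y
            d = total - a - b - c    # not x then not y
            table[x, y] = yule_q(a, b, c, d)
    return table


def yule_profile(sequence, table):
    """Q value of each adjacent residue pair along the input sequence."""
    return [table.get((x, y), 0.0) for x, y in zip(sequence, sequence[1:])]


table = yule_table(["ACAC"])
profile = yule_profile("ACA", table)
```

Under these assumptions, a drop in the profile marks a pair of neighbouring residues that rarely occur together in the training set, which is one plausible signal of a boundary between features.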
Mutual Information:
Mutual Information is another tool that may be used to determine feature boundaries of a protein sequence. Mutual information Mxy
measures the interdependence between two variables x and y by
quantifying how much the probability distribution of one variable
changes when the other variable is known. This tool computes
mutual information for amino acids in protein sequences: the association
between the occurrences of amino acids is calculated within a window
of size 4. The mutual information output is given as a function of
amino acid sequence position.
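The estimation step can be sketched in Python using the standard definition, the sum over all (x, y) of p(x, y) * log2(p(x, y) / (p(x) * p(y))). The sketch covers only how mutual information is estimated from residue pairs that fall within a window of 4; it does not reproduce the tool's per-position output, and the function names and the pairing scheme are assumptions.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Mutual information, in bits, of the empirical distribution of (x, y) pairs."""
    pairs = list(pairs)
    total = len(pairs)
    if total == 0:
        return 0.0
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    # p(x,y) / (p(x) p(y)) simplifies to count * total / (count_x * count_y)
    return sum(
        (count / total) * log2(count * total / (px[x] * py[y]))
        for (x, y), count in joint.items()
    )


def window_pairs(sequences, window=4):
    """Yield (residue, later residue) for residues fewer than `window` apart."""
    for seq in sequences:
        for i in range(len(seq)):
            for j in range(i + 1, min(i + window, len(seq))):
                yield seq[i], seq[j]


mi = mutual_information(window_pairs(["MKVLAAGMKVL", "MKVLGGA"], window=4))
```

Two perfectly dependent binary variables give 1 bit and two independent ones give 0, so higher values indicate residues whose occurrences within the window are more strongly coupled.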
Regular Expression:
This tool can be used to measure the frequency of occurrence of a motif
over a family/class of protein sequences. The inputs are a motif pattern,
which is converted to a regular expression, and a sequence file containing
a collection of protein sequences separated from each other by line
breaks.
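The motif syntax accepted by the tool is not stated here. The Python sketch below assumes PROSITE-style patterns (for example N-{P}-[ST]-{P}) purely for illustration, converts them to a regular expression for the standard re module, and counts non-overlapping matches. The function names are hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern to Python regular expression syntax."""
    regex = pattern.rstrip(".")          # patterns may end with a period
    regex = regex.replace("-", "")       # element separator
    regex = regex.replace("x", ".")      # x matches any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {P}: anything but P
    regex = regex.replace("(", "{").replace(")", "}")   # x(2,4): repetition
    regex = regex.replace("<", "^").replace(">", "$")   # sequence ends
    return regex


def motif_frequency(pattern, sequences):
    """Count non-overlapping motif matches in each sequence."""
    compiled = re.compile(prosite_to_regex(pattern))
    per_sequence = [len(compiled.findall(seq)) for seq in sequences]
    return {
        "sequences_with_motif": sum(1 for count in per_sequence if count),
        "total_occurrences": sum(per_sequence),
        "per_sequence": per_sequence,
    }


# One sequence per line, as in the sequence file described above
sequence_file_text = "ANGSA\nNPSA\nNATANGTK"
result = motif_frequency("N-{P}-[ST]-{P}", sequence_file_text.splitlines())
```

Because findall skips past each match before searching again, overlapping occurrences are counted once; a lookahead-based pattern would be needed to count overlaps.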
Position Specific Property Conservation:
Two methods have been developed to examine the conservation of amino acid properties with respect to their positions in a sequence. The first uses a Gaussian distribution to model property conservation, while the second uses variance to identify conservation patterns.
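The variance-based idea can be illustrated with a small Python sketch. It rests on assumptions the description above does not confirm: the input is a set of equal-length aligned sequences, the property is Kyte-Doolittle hydropathy, and conservation at a position is read off the population variance of the property values in that column (low variance meaning conserved). The function name is hypothetical, and the Gaussian method is not covered.

```python
from statistics import pvariance

# Kyte-Doolittle hydropathy scale, used here only as an example property
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def property_variance_by_position(aligned_sequences, scale=KYTE_DOOLITTLE):
    """Variance of a property scale at each column of an alignment.

    Characters missing from the scale (gaps, unknown residues) are skipped;
    a column with no scored residues yields None.
    """
    if not aligned_sequences:
        return []
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("aligned sequences must all have the same length")

    variances = []
    for column in zip(*aligned_sequences):
        values = [scale[aa] for aa in column if aa in scale]
        variances.append(pvariance(values) if values else None)
    return variances


variances = property_variance_by_position(["AIL", "AVL", "ALK"])
```

In this example the first column (A, A, A) has zero variance, the second (I, V, L) has low variance because all three residues are hydrophobic even though they differ, and the third (L, L, K) has high variance. The middle column shows why conservation of a property can differ from conservation of residue identity.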
Property Visualization:
A tool to plot the values assigned to all 20 amino acids by one or more property scales, and to list the properties that are most correlated with a specified property.
Protein-Protein Interaction:
A new supervised classifier that combines multiple sources of biological evidence to predict protein-protein interactions, initially in yeast. A random forest algorithm is used to compute the similarity between proteins, and a k-nearest-neighbor algorithm is used for the classification.
|