Mechismo is a tool that lets you find potential mechanisms for how proteins interact with other molecules and, more importantly, for how changes might affect these interactions and, consequently, an entire biological system. We have constructed a database of roughly 50,000 protein-protein, protein-small-molecule and protein-nucleic acid interactions of known structure and several million interactions identified by other methods, all of which we use to identify or predict potential interactions within a set of proteins or genes that you provide.
The primary reference for Mechismo is:
Betts MJ, Lu Q, Jiang YY, Drusko A, Wichmann A, Utz M, Valtierra-Gutiérrez IA, Schlesner M, Jaeger N, Jones DT, Pfister S, Lichter P, Eils R, Siebert R, Bork B, Apic G, Gavin AC, Russell RB
Mechismo: predicting the mechanistic impact of mutations and modifications on molecular interactions
Nucl Acids Res, 43(2):e10, 2015 (PubMed).
Mechismo, or the data behind it, has been used in a number of studies:
Our group:
The primary data behind Mechismo consists of:
For our current list of supported organisms/proteomes (Human, Mouse, C. elegans, D. melanogaster, S. cerevisiae, E. coli, B. subtilis and M. pneumoniae) we have compared all sequences in UniProt SPROT/VARSPLIC to the sequences of known structure using BLAST/SIFTS to obtain a list of significant matches between the model organisms and sequences of known structure. We use these matches to identify known (i.e. identical or near identical) or predicted (i.e. lower sequence similarity) sites in contact with proteins, chemicals or nucleic acids.
Each site in known or predicted contact with another molecule is assigned a confidence that we have derived from an analysis of the entire dataset, based on the expected False Positive Rate (FPR) as a function of percentage sequence identity between the sequence of interest and the particular template of known structure. As sequence similarity decreases, so does the confidence in the prediction.
For protein-protein interactions, we collated all interactions and grouped them according to UniRef classifications (UniRef100, UniRef90, UniRef50). This allows, for example, interactions within Mouse to be used as evidence of an interaction between Human proteins, though how the interactions are used depends on your stringency settings. Via the standard or advanced interface you can restrict the analysis to high-confidence interactions (e.g. if studying a cancer genome) or allow for weaker evidence (e.g. if exploring Yeast or Bacterial interactions) as suits your particular case.
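The idea of transferring interaction evidence between organisms through shared sequence clusters can be illustrated with a minimal Python sketch. This is not Mechismo's implementation: the protein names, the cluster identifiers and the functions `cluster_pairs` and `transfer_interactions` are all hypothetical, and the sketch assumes that each protein maps to exactly one cluster at whichever UniRef level is chosen.

```python
from itertools import combinations

def cluster_pairs(known_interactions, cluster_of):
    """Convert known interacting protein pairs into pairs of cluster IDs."""
    return {frozenset((cluster_of[a], cluster_of[b])) for a, b in known_interactions}

def transfer_interactions(queries, known_interactions, cluster_of):
    """Return query pairs whose clusters match those of a known interacting pair."""
    evidence = cluster_pairs(known_interactions, cluster_of)
    return [
        (a, b)
        for a, b in combinations(sorted(queries), 2)
        if frozenset((cluster_of[a], cluster_of[b])) in evidence
    ]

# Hypothetical data: two Mouse proteins with a known interaction, and three
# Human query proteins, two of which share clusters with the Mouse proteins.
cluster_of = {
    "mouse_A": "c1", "human_A": "c1",
    "mouse_B": "c2", "human_B": "c2",
    "human_C": "c3",
}
known = [("mouse_A", "mouse_B")]

inferred = transfer_interactions(["human_A", "human_B", "human_C"], known, cluster_of)
print(inferred)
```

In this sketch, a tighter clustering level (fewer proteins per cluster) would yield fewer, more reliable transferred interactions, which is one plausible way to think about a stringency setting.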
Additionally, for protein-protein interactions, we have established a metric for assessing whether any provided mutation or modification will increase or decrease the affinity at the interface. To do this we used an updated matrix of amino-acid interaction preferences that says how favorable or unfavorable a pair of interacting amino acids is. For example, if a hydrophobic residue (e.g. Leucine) in a hydrophobic pocket is mutated to a charged residue (e.g. Aspartate) then this matrix will suggest a negative (i.e. detrimental) score for the change.
For chemicals we have performed an automated classification via chemoinformatics tools, which we then manually processed to assign chemicals into about 50 groups. These groups provide specific information for chemicals or chemical types that are abundant in known structures (e.g. ATP-like, Zn++), and general properties (e.g. Organic) for chemicals that are unique.
For changes at interfaces we use empirical pair potentials which score the likelihood
of amino acid side chains interacting across an interface. Essentially these are log odds values with positive
values indicating favored pairings and negative values unfavorable pairings. When considering a change of
an amino acid, we sum the scores for the original amino acid and subtract these from the sum for the changed
amino acid. Thus positive scores indicate a change that is expected to improve the interface (i.e. to increase
affinity), and negative scores indicate a change expected to worsen it.
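The scoring rule described above (sum for the changed amino acid minus sum for the original) can be sketched in Python as follows. The potential values in `TOY_POTENTIALS` are invented for illustration and are not the Mechismo matrix; the function names and the representation of interface partners as a list of one-letter codes are also assumptions.

```python
# Hypothetical log-odds values for illustration only (NOT the Mechismo matrix).
# Keys are alphabetically sorted residue pairs so lookups are order-independent.
TOY_POTENTIALS = {
    ("L", "L"): 0.5,
    ("I", "L"): 0.4,
    ("L", "V"): 0.3,
    ("D", "L"): -0.6,
    ("D", "I"): -0.5,
    ("D", "V"): -0.4,
}

def pair_score(res_a, res_b, potentials):
    """Look up the potential for an unordered pair of residues."""
    return potentials[tuple(sorted((res_a, res_b)))]

def change_score(original, changed, partners, potentials):
    """Sum of pair scores for the changed residue minus that for the original."""
    before = sum(pair_score(original, p, potentials) for p in partners)
    after = sum(pair_score(changed, p, potentials) for p in partners)
    return after - before

# A Leucine contacting three hydrophobic residues across the interface,
# mutated to Aspartate: the score is negative, i.e. predicted to worsen
# the interface under these toy values.
partners = ["L", "I", "V"]
delta = change_score("L", "D", partners, TOY_POTENTIALS)
print(f"L -> D score: {delta:.2f}")
```

Reversing the change (Aspartate to Leucine) gives the same magnitude with the opposite sign, which follows directly from the subtraction.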
The matrix of these values for protein-protein interactions can be seen here. These are an updated version of those first described in
the paper describing the InterPReTS method. The values for protein-chemical and protein-DNA/RNA interactions
can be seen here.
It has long been known that the lower the sequence identity between a sequence
and the template on which it is modelled, the less likely the predictions are to be correct.
To put specific numbers on this we performed an assessment of how accurate predictions are by studying
known structures. Specifically, we calculated false positive rates for predicting known sites using
templates at various levels of identity. Below is the table used to define the confidence levels shown
in the software.
FPR       conf.    Prot-chem   Prot-DNA/RNA   Prot-prot
<0.500    low      <34         <41            <29
>=0.500   medium   >=34        >=41           >=29
>=0.200   medium   >=47        >=51           >=34
>=0.100   high     >=64        >=56           >=37
>=0.050   high     >=81        >=86           >=40
>=0.010   high     >=90        >=90           >=60
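As a rough illustration of how the table above might be applied, the following Python sketch maps a percentage sequence identity to a confidence label for each interaction type. It uses only the lowest identity cutoff at which each label first appears in the table; the dictionary `CUTOFFS`, the type keys and the function `confidence` are hypothetical names, and this is not necessarily how the software applies the table.

```python
# (medium_cutoff, high_cutoff) in percent sequence identity, taken from the
# first "medium" row and first "high" row of the table for each column.
CUTOFFS = {
    "prot-chem": (34, 64),
    "prot-dna/rna": (41, 56),
    "prot-prot": (29, 37),
}

def confidence(interaction_type, percent_identity):
    """Return 'low', 'medium' or 'high' for a template at the given identity."""
    medium_cutoff, high_cutoff = CUTOFFS[interaction_type]
    if percent_identity >= high_cutoff:
        return "high"
    if percent_identity >= medium_cutoff:
        return "medium"
    return "low"

for kind, identity in [("prot-chem", 33), ("prot-prot", 30), ("prot-dna/rna", 56)]:
    print(kind, identity, confidence(kind, identity))
```

Note that the same identity can give different labels for different interaction types: 35% identity is "medium" for protein-protein but "low" for protein-DNA/RNA under these cutoffs.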
Matthew Betts: matthew.betts@bioquant.uni-heidelberg.de
Rob Russell: robert.russell@bioquant.uni-heidelberg.de