Analyzing protein function on a genomic scale:
the importance of gold-standard positives and negatives for network
prediction
Ronald Jansen and Mark Gerstein
(2004) Curr Opin Microbiol 7: 535-45.
Abstract
The concept of ‘protein function’ is rather ‘fuzzy’ because it is often
based on whimsical terms or contradictory nomenclature. This currently
presents a challenge for functional genomics because precise
definitions are essential for most computational approaches. Addressing
this challenge, the notion of networks between biological entities
(including molecular and genetic interaction networks as well as
transcriptional regulatory relationships) potentially provides a
unifying language suitable for the systematic description of protein
function. Predicting the edges in protein networks requires reference
sets of examples with known outcome (that is, ‘gold standards’). Such
reference sets should ideally include positive examples (as is now
widely appreciated) but also, equally importantly, negative ones.
Moreover, it is necessary to consider the expected relative occurrence
of positives and negatives because this affects the misclassification
rates of experiments and computational predictions. For instance, a
reason why genome-wide, experimental protein–protein interaction
networks have high inaccuracies is that the prior probability of
finding interactions (positives) rather than non-interacting protein
pairs (negatives) in unbiased screens is very small. These problems can
be addressed by constructing well-defined sets of non-interacting
proteins from subcellular localization data, which allows computing the
probability of interactions based on evidence from multiple datasets.
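
To illustrate the argument about prior probabilities, the following is a minimal Python sketch, not the authors' own calculation. It shows how the positive predictive value (the fraction of reported interactions that are real) collapses when positives are rare, and how evidence from several datasets could be combined through likelihood ratios. The function names, the numerical values, and the assumption that the datasets are conditionally independent are all illustrative choices that do not come from the paper.

```python
from math import prod


def positive_predictive_value(sensitivity, specificity, prior):
    """Fraction of reported positives that are true positives (Bayes' rule)."""
    true_pos = sensitivity * prior
    false_pos = (1.0 - specificity) * (1.0 - prior)
    return true_pos / (true_pos + false_pos)


def posterior_probability(prior, likelihood_ratios):
    """Posterior probability from prior odds times a product of likelihood ratios.

    Assumption (not from the paper): the evidence sources are conditionally
    independent, so their likelihood ratios can simply be multiplied.
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * prod(likelihood_ratios)
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical screen: 90% sensitivity, 99% specificity.
sensitivity, specificity = 0.9, 0.99

# Same screen, two different priors: a balanced set versus a genome-wide
# setting in which only about 1 in 1000 protein pairs interacts (made-up value).
for prior in (0.5, 0.001):
    ppv = positive_predictive_value(sensitivity, specificity, prior)
    print(f"prior={prior}: PPV={ppv:.3f}")

# A single positive result corresponds to a likelihood ratio of
# sensitivity / (1 - specificity); a second, hypothetical dataset adds LR = 20.
single_lr = sensitivity / (1.0 - specificity)
combined = posterior_probability(0.001, [single_lr, 20.0])
print(f"posterior with two datasets: {combined:.3f}")
```

With these made-up numbers, the same screen that looks reliable on a balanced set reports mostly false positives at the low prior, and only the combination of two supporting datasets lifts the posterior probability above one half.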
Download the paper here.
Supplementary information
A small presentation (related to figures 5 and 6 in the paper) on the
relationship between sensitivity, specificity and positive predictive
value in genome-scale datasets: [HTML] [GIF images] [PowerPoint]
Spreadsheet with a model calculation on the construction of "negative"
protein-protein interactions from subcellular localization data
(related to tables 1a and 1b): [Excel]
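
The idea behind building "negative" pairs from localization data can be sketched in a few lines of Python. This is a simplified illustration, not a reproduction of the spreadsheet: it assumes each protein is annotated with exactly one compartment, and the protein names, compartment labels, and the choice of which compartment pairs count as separated are all hypothetical.

```python
from itertools import combinations


def negative_pairs(localization, separated_compartments):
    """Return protein pairs whose compartments are treated as non-contacting.

    localization: dict mapping protein name -> compartment (one per protein,
        a simplifying assumption).
    separated_compartments: set of frozensets, each holding two compartments
        assumed to make a physical interaction unlikely.
    """
    negatives = set()
    for p, q in combinations(sorted(localization), 2):
        compartments = frozenset((localization[p], localization[q]))
        if compartments in separated_compartments:
            negatives.add((p, q))
    return negatives


# Hypothetical annotations and compartment pairs.
localization = {
    "A": "nucleus",
    "B": "mitochondrion",
    "C": "nucleus",
    "D": "plasma membrane",
}
separated = {
    frozenset(("nucleus", "mitochondrion")),
    frozenset(("nucleus", "plasma membrane")),
}

print(sorted(negative_pairs(localization, separated)))
```

Pairs in the same compartment, and pairs in compartments not listed as separated, are left out of the negative set rather than being labeled as interacting.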
If you have any questions or comments, please contact Ronald Jansen.
Last update: 01/20/2005