Robert
Hoffmann* and Alfonso Valencia
National Center of Biotechnology, CNB-CSIC,
Cantoblanco Madrid M-28049. *Corresponding author: contact
By exploring time series data from MEDLINE abstracts,
we have made the observation that only a few genes have been quoted
with increasing frequency over the past 25 years. This is probably
the result of selective pressure by the scientific community. Over
the years this selection has produced an extreme power law distribution
of the information available for individual genes. Interestingly,
those genes that are successfully selected are not necessarily the
most important genes to the cell. To stress the implication of this
finding we show that there is no correlation between a gene's impact
in the scientific literature and its centrality in protein interaction
networks.
In the last 25 years, a tremendous effort by the
biomedical research community has led to the more than 10 million
publications that are available at the PubMed (MEDLINE) database.
In this study we focus on a heretofore undervalued property of this
outstanding repository: Data from PubMed is time-resolved, since
for every article the date of its publication is available. This
attribute allows for studying the evolution of scientific theories,
terms and even gene names within the scientific community. In a
first step we computed annual quotation frequencies for individual
genes by tracing their names, symbols and synonyms in abstracts
since 1975 [1]. In total, time series were generated for 180,000
genes from human, mouse, Drosophila, yeast, zebrafish and E. coli.
Figure 1a shows the distribution
of 250 of the most quoted human genes over the last 26 years. New
gene discoveries are seen at different points in time but subsequent
reference to a gene after its first description is clearly not random
and diverse patterns, or "life cycles", can be distinguished.
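The counting step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the input format (a list of `(year, text)` pairs), the `synonyms` dictionary and the function name `annual_gene_frequencies` are all assumptions made for the example, and simple case-insensitive matching of whole names stands in for whatever name recognition was actually used.

```python
import re
from collections import Counter, defaultdict


def annual_gene_frequencies(abstracts, synonyms):
    """Count, per gene and year, the abstracts that mention any of its names.

    abstracts: iterable of (year, text) pairs (hypothetical input format).
    synonyms:  dict mapping a gene identifier to its names, symbols and synonyms.
    """
    # One case-insensitive pattern per gene; a name only matches when it is
    # not embedded in a longer alphanumeric token.
    patterns = {
        gene: re.compile(
            r"(?<![A-Za-z0-9])(?:"
            + "|".join(re.escape(name) for name in names)
            + r")(?![A-Za-z0-9])",
            re.IGNORECASE,
        )
        for gene, names in synonyms.items()
        if names
    }
    counts = defaultdict(Counter)
    for year, text in abstracts:
        for gene, pattern in patterns.items():
            if pattern.search(text):
                counts[gene][year] += 1  # each abstract counts once per gene
    return counts
```

Each abstract contributes at most one count per gene and year, so the result is a time series of article counts rather than of raw name occurrences.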
Life cycles of successful genes
In Figure 1b characteristic life
cycles of four genes are shown. These correspond to typical patterns
found in 4532 genes that have appeared in the literature for at
least 15 years. The glycolipid transporter GM2A, for example, is
representative of the most frequent pattern, one that is shared
by about 4200 genes. These exhibit a rather dull life cycle and
have never attracted enough interest to become very important. Interleukin
3 (IL3) represents genes that have survived significant ups and
downs in the collective scientific interest, but never boomed. The
tumour suppressor gene p53, on the other hand, corresponds to a
minor group that shows an exceptional increase of interest and consequently
of occurrence over time. These observations demonstrate how gene
names have to overcome the selective mechanism of the scientific
community to stand out from the rest [2]. The
interest of the community in a specific gene and thus its scientific
impact depend on a gene's molecular role but also on the social
needs within the scientific community, illustrated by the exceptional
interest in genes like CD4 and p53 which are involved in HIV infection
and tumour development.
What we know about individual genes
The number of articles that mention a gene in a certain
time period represents a rough estimation of the information available
for this gene. When comparing the information currently available
across all genes, it matters whether a gene has been known for 20
years or for just 2 years. We therefore compare genes within fixed
time windows: 8,176 genes that were present in the literature for
10 years and 2,130 genes that were present for 20 years.
We find that the distribution of information over these genes decays
as a power law function (Fig. 1c); a few genes
such as CD4 and p53 are most frequently quoted and attract most
attention, while for the rest comparably little has been published.
Power law distributions are the hallmark of systems in the critical
state [3]. The scientific community as a small
world network is known to share important characteristics with these
dynamical systems [4,5], where all members are
permanently interacting and influencing each other (see
Box 1). Given a certain complexity, the flow of information
within the scientific community can no longer be understood in terms
of the behaviour of individuals; small changes can have domino effects
out of proportion to their cause, leading, for instance, to the
outstanding success of CD4. In other words, trends also exist in
the scientific community.
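The exponent of such a power law can be estimated from the number of articles per gene. The sketch below fits a least-squares line to the empirical distribution on log-log axes and returns the negated slope. The fitting procedure used for Fig. 1c is not stated in the text, so this choice, like the function name `power_law_exponent`, is an assumption for illustration.

```python
import math
from collections import Counter


def power_law_exponent(article_counts):
    """Estimate the exponent of P(n) ~ n**(-exponent) from articles per gene.

    article_counts: iterable with one article count n per gene.
    Fits a least-squares line to log P(n) against log n.
    """
    histogram = Counter(n for n in article_counts if n > 0)
    if len(histogram) < 2:
        raise ValueError("need at least two distinct article counts")
    total = sum(histogram.values())
    xs = [math.log(n) for n in histogram]
    ys = [math.log(genes / total) for genes in histogram.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # the slope is negative for a decaying distribution
```

A plain log-log regression is sensitive to the sparsely populated tail of the distribution. Logarithmic binning or a maximum-likelihood estimate would be more robust alternatives.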
The degree of protein interactions
These considerations raise the question of whether
genes that are frequently discussed in research abstracts are also
more important to cellular functions. Or is this extreme distribution
of our scientific attention rather a reflection of the priorities
within our society, as is suggested by CD4, the most frequently quoted
protein (Fig. 1b)? CD4 is involved in HIV infection
and of clear importance to our society, though from a purely biological
point of view it has a role similar to other cell receptors. To
address this question, we employ the socially unbiased views from
high throughput experiments as a reference. In the past two years
large-scale methods have been introduced to generate global interaction
networks of proteins; yeast two-hybrid (Y2H) and mass spectrometry
of purified complexes (TAP, HMS-PCI) aim to detect physical interactions.
Based on yeast two-hybrid data, Barabasi and colleagues discovered
the scale-free topology of these interaction networks, in which a few
genes have many interactions but most genes have only a few.
A clear correlation was shown between the number of a gene's interactions
and its importance to the maintenance of cellular function [6].
Furthermore, the analysis of genomic data revealed
a strong selective constraint on highly interacting genes [7].
For these reasons, the degree of interaction is seen as a significant
indicator of a gene's importance to the cell.
Therefore, we assessed this experimental measure of importance for
those genes that are most frequently quoted in MEDLINE and thus most important
to the scientific community. Surprisingly, we find that there is
no correlation between the degree of connection [8]
and the frequency observed in MEDLINE for 380 yeast genes (Fig.
2). At present, no large-scale interaction data are available
for humans; however, we expect that this lack of correlation
will be even more striking, since medical and sociological factors
play an even stronger role in human research. We believe that the
demonstrated discrepancy originates in the complex way information
spreads within a small world network such as the scientific community;
a phenomenon that only becomes clear when considering the evolution
of information over time.
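A comparison of this kind can be sketched with a rank correlation between each gene's degree of interaction and its frequency in the literature. The statistic behind Fig. 2 is not named in the text; Spearman's rank correlation is used here as an assumption because both quantities are heavily skewed. The function names are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def spearman(degrees, frequencies):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(degrees), average_ranks(frequencies))
```

A coefficient close to zero would correspond to the lack of correlation reported for the 380 yeast genes, whereas values near 1 or -1 would indicate a monotonic relationship.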
Please cite this article as
Hoffmann, R., Valencia, A. Life cycles of successful
genes. Trends Genet. 19(2), 79-81. (2003).
Figure 1

Fig. 1. Life cycles of genes in MEDLINE abstracts
from the past 26 years. (a) 250 human genes clustered along
the horizontal axis according to their pattern of occurrence in
the literature. Red areas indicate a peak in a gene's life cycle,
black represents periods where a gene is not mentioned. (b)
Characteristic life cycles of the genes CD4, p53, IL2, IL3 and GM2A.
Annual frequencies of all genes were standardized to the year
2000 to account for the constant overall increase of articles per
year. (c) Distribution of what we know about genes. Red circles,
genes known for 10 years; blue triangles, genes known for 20 years.
The x-axis represents the number (n) of articles per gene (approximating
the information known per gene). The probability P(n) of finding
n articles about a given gene is plotted on the y-axis and decays
as a power law, $P(n) \sim n^{-\gamma}$,
appearing as a straight line on a log–log plot, where $-\gamma$
is the slope of the line. The exponents for periods of accumulation
of different length, $\gamma_{10\,\text{years}} = 1.6$ and
$\gamma_{20\,\text{years}} = 1.9$,
reveal that the extreme distribution remains although more knowledge
has been accumulated during the longer period.
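The standardization mentioned for panel (b) can be illustrated with a short sketch. The caption does not give the formula, so the rescaling below is an assumption: each annual count is multiplied by the ratio of the total number of articles in the reference year to the total in the year in question. The function and parameter names are hypothetical.

```python
def standardize_to_reference_year(gene_counts, total_articles, reference_year=2000):
    """Rescale a gene's annual counts to the article volume of a reference year.

    gene_counts:    dict mapping year -> number of articles mentioning the gene.
    total_articles: dict mapping year -> total number of articles in that year.
    """
    reference_total = total_articles[reference_year]
    return {
        year: count * reference_total / total_articles[year]
        for year, count in gene_counts.items()
    }
```

With this rescaling, a gene mentioned in a constant fraction of all articles yields a flat curve, so that rises and falls reflect changes in relative interest rather than the overall growth of the literature.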
Figure 2

Fig. 2. Lack of correlation between the degree
of interaction of proteins and their frequency in the scientific
literature. To ensure a high level of accuracy, protein interactions
were only included if confirmed independently by at least two experimental
methods: yeast two hybrid (Y2H), tandem affinity purification (TAP)
and/or high-throughput mass spectrometric protein complex identification
(HMS-PCI). Data from Mering et al. [8]. The length
of time that genes are known to the scientific community (different
colours) influences the correlation positively.
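The filtering rule described in this legend can be sketched as follows: an interaction counts only if at least two different experimental methods report it, and a protein's degree is the number of such confirmed partners. The input format (triples of two protein names and a method label), the exclusion of self-interactions and the function name are assumptions for the example.

```python
from collections import defaultdict


def confirmed_degrees(observations, min_methods=2):
    """Degree of each protein, counting only independently confirmed interactions.

    observations: iterable of (protein_a, protein_b, method) triples, for
                  example ("YFG1", "YFG2", "Y2H") (hypothetical identifiers).
    """
    methods_by_pair = defaultdict(set)
    for protein_a, protein_b, method in observations:
        if protein_a != protein_b:  # assumption: self-interactions are ignored
            # frozenset makes the pair independent of the order of its members
            methods_by_pair[frozenset((protein_a, protein_b))].add(method)
    degree = defaultdict(int)
    for pair, methods in methods_by_pair.items():
        if len(methods) >= min_methods:
            for protein in pair:
                degree[protein] += 1
    return dict(degree)
```

Repeated reports of the same pair by the same method do not count as independent confirmation, because methods are collected in a set.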
Box 1
Small-world network
The small-world effect describes the finding that
any two people in the world, chosen at random, are connected to
one another by typically six intermediate acquaintances. Social
networks of such topology allow for the rapid spreading of news,
rumours, jokes or fashions. This also explains why diseases, transmitted
from person to person, can result in global epidemics.
Critical state
Such small-world networks meet the fundamental properties
of complex systems, where the collective behaviour of a large number
of interacting agents is not a simple combination of the behaviour
of individuals. In physics, large dynamical systems have attracted
great interest because of their tendency to organize into a poised
state far out of equilibrium [9]. This critical
state, sometimes called "the edge of chaos", separates a frozen
inactive state from a hot disordered state.
Domino effects
An important characteristic of systems in the critical
state, studied on sand-pile models, is that a little perturbation,
i.e. the addition of a single grain of sand, can lead to anything
from an insignificant shift to an avalanche of unpredictable size.
This unpredictability also applies to the dimension of epidemics,
earthquakes or, as is the case here, the success of a gene within
the scientific community. In molecular biology power law properties
have been discovered only recently in protein interaction networks
and metabolic pathways [6].
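The effect of a single grain can be illustrated with a small sand-pile simulation. The box does not specify a model; the sketch below assumes the standard two-dimensional rule in which a cell holding four or more grains topples and passes one grain to each of its four neighbours, with grains falling off the edge of a square grid. The function name is hypothetical.

```python
def add_grain(grid, row, col, threshold=4):
    """Drop one grain on a square grid, relax it, and return the avalanche size.

    grid: list of equally long lists of grain counts, modified in place.
    The avalanche size is the number of topplings caused by the grain.
    """
    size = len(grid)
    grid[row][col] += 1
    unstable = [(row, col)]
    topplings = 0
    while unstable:
        r, c = unstable.pop()
        if grid[r][c] < threshold:
            continue  # already relaxed by an earlier toppling
        grid[r][c] -= threshold
        topplings += 1
        if grid[r][c] >= threshold:
            unstable.append((r, c))
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < size and 0 <= nc < size:  # grains leave at the border
                grid[nr][nc] += 1
                if grid[nr][nc] >= threshold:
                    unstable.append((nr, nc))
    return topplings
```

On an almost empty grid a grain causes no toppling at all, whereas the same grain dropped on a grid close to the threshold everywhere can trigger a chain of topplings across the whole pile.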
References
1. Jenssen, T. K. et al. (2001) A literature network
of human genes for high-throughput analysis of gene expression.
Nature Genet. 28, 21-28
2. Dawkins, R. (1989) The Selfish Gene (2nd edn), Oxford University
Press
3. Bak, P. et al. (1988) Self-organized criticality.
Phys. Rev. A 38, 364-374
4. Milgram, S. (1967) The small world problem. Psychology
Today 2, 60-67
5. Newman, M. E. J. (2001) The structure of scientific
collaboration networks. Proc. Natl. Acad. Sci. USA 98, 404-409
6. Jeong, H. et al. (2001) Lethality and centrality
in protein networks. Nature 411, 41-42
7. Fraser, H. B. et al. (2002) Evolutionary rate
in the protein interaction network. Science 296, 750-752
8. Mering, C. et al. (2002) Comparative assessment
of large-scale data sets of protein-protein interactions. Nature
417, 399-403
9. Bak, P. (1996) How Nature Works: The Science of Self-Organized
Criticality, Copernicus
Acknowledgements
We thank Ugo Bastolla for helpful discussion. This
work was supported in part by the ORIEL and TEMBLOR EC projects.
Copyright
Copyright © 2003 Elsevier Ltd. All rights reserved