
Life cycles of successful genes

Trends Genet. 19(2), 79-81. (2003)
Copyright © 2003 Elsevier Ltd. All rights reserved

 

Robert Hoffmann* and Alfonso Valencia
National Center of Biotechnology, CNB-CSIC, Cantoblanco Madrid M-28049. *Corresponding author.

 

By exploring time series data from MEDLINE abstracts, we observe that only a few genes have been quoted with increasing frequency over the past 25 years. This is probably the result of selective pressure exerted by the scientific community. Over the years, this selection has produced an extreme power law distribution of the information available for individual genes. Interestingly, the genes that are successfully selected are not necessarily the most important genes to the cell. To stress the implication of this finding, we show that there is no correlation between a gene's impact in the scientific literature and its centrality in protein interaction networks.

In the last 25 years, a tremendous effort by the biomedical research community has produced the more than 10 million publications available in the PubMed (MEDLINE) database. In this study we focus on a hitherto undervalued property of this outstanding repository: data from PubMed are time-resolved, because the date of publication is available for every article. This attribute makes it possible to study the evolution of scientific theories, terms and even gene names within the scientific community. As a first step, we computed annual quotation frequencies for individual genes by tracing their names, symbols and synonyms in abstracts since 1975 [1]. In total, time series for 180,000 genes from human, mouse, Drosophila, yeast, zebrafish and E. coli were generated. Figure 1a shows the distribution of 250 of the most quoted human genes over the last 26 years. New gene discoveries are seen at different points in time, but subsequent reference to a gene after its first description is clearly not random, and diverse patterns, or "life cycles", can be distinguished.
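
The counting step can be illustrated with a minimal sketch. It assumes abstracts are available as (year, text) pairs and that a gene is represented by a list of its names, symbols and synonyms; the function name, the input format and the toy abstracts are hypothetical and are not taken from the study.

```python
import re
from collections import Counter


def annual_frequencies(abstracts, synonyms):
    """Count, per year, the abstracts that mention any synonym of one gene.

    abstracts: iterable of (year, text) pairs (assumed input format).
    synonyms:  names, symbols and synonyms of a single gene.
    """
    # Case-insensitive match that must not be embedded in a longer token.
    alternatives = "|".join(re.escape(s) for s in synonyms)
    pattern = re.compile(
        r"(?<![A-Za-z0-9])(?:" + alternatives + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    counts = Counter()
    for year, text in abstracts:
        if pattern.search(text):
            counts[year] += 1  # each abstract is counted at most once
    return dict(counts)


# Invented toy abstracts, for illustration only.
toy_abstracts = [
    (1990, "Mutations in p53 are common in human tumours."),
    (1990, "TP53 status predicts response to therapy."),
    (1991, "The role of p53 in apoptosis."),
    (1991, "CD4 is the primary receptor for HIV."),
]
p53_series = annual_frequencies(toy_abstracts, ["p53", "TP53"])
```

Simple string matching of this kind does not resolve ambiguous symbols, so a real pipeline would need additional disambiguation.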

Life cycles of successful genes

In Figure 1b characteristic life cycles of four genes are shown. These correspond to typical patterns found in 4532 genes that have appeared in the literature for at least 15 years. The glycolipid transporter GM2A, for example, is representative of the most frequent pattern, one that is shared by about 4200 genes. These exhibit a rather dull life cycle and have never attracted enough interest to become very important. Interleukin 3 (IL3) represents genes that have survived significant ups and downs in the collective scientific interest, but never boomed. The tumour suppressor gene p53, on the other hand, corresponds to a minor group that shows an exceptional increase of interest and consequently of occurrence over time. These observations demonstrate how gene names have to overcome the selective mechanism of the scientific community to stand out from the rest [2]. The interest of the community in a specific gene, and thus the gene's scientific impact, depends on its molecular role but also on the social needs within the scientific community, as illustrated by the exceptional interest in genes such as CD4 and p53, which are involved in HIV infection and tumour development, respectively.

What we know about individual genes

The number of articles that mention a gene in a certain time period gives a rough estimate of the information available for this gene. When taking a broader look at the present information about all genes, it matters whether a gene has been known for 20 years or for just 2 years. Taking this into account, we compare 8,176 genes that were present in the literature for 10 years and 2,130 genes that were present for 20 years.
We find that the distribution of information over these genes decays as a power law function (Fig. 1c); a few genes such as CD4 and p53 are most frequently quoted and attract most attention, while for the rest comparatively little has been published. Power law distributions are the hallmark of systems in the critical state [3]. The scientific community as a small-world network is known to share important characteristics with these dynamical systems [4,5], where all members are permanently interacting and influencing each other (see Box 1). Given a certain complexity, the flow of information within the scientific community can no longer be understood in terms of the behaviour of individuals; small changes can have domino effects out of proportion to their cause, leading, for instance, to the outstanding success of CD4. In other words, trends also exist in the scientific community.
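
As a rough illustration of how such a decay can be quantified, the sketch below fits a straight line to a log-log plot of the probability of finding n articles per gene and reports the negated slope as the exponent (called gamma here, an assumed name). The least-squares fit, the function name and the toy histogram are assumptions made for illustration; the fitting procedure actually used for Fig. 1c is not stated in the text.

```python
import math


def power_law_exponent(histogram):
    """Estimate gamma in P(n) ~ n**(-gamma) from a log-log least-squares line.

    histogram: dict mapping n (articles per gene) to the number of genes
               with exactly n articles (assumed input format).
    """
    total = sum(histogram.values())
    xs = [math.log(n) for n in histogram]
    ys = [math.log(count / total) for count in histogram.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope  # the line falls with slope -gamma


# Synthetic histogram constructed so that counts fall off exactly as n**-2.
toy_histogram = {1: 6400, 2: 1600, 4: 400, 8: 100}
gamma = power_law_exponent(toy_histogram)
```

A line fit on log-log axes is only a first approximation; for noisy real data, binning choices and the sparse tail can shift the estimate noticeably.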

The degree of protein interactions

These considerations raise the question of whether genes that are frequently discussed in research abstracts are also more important to cellular functions. Or is this extreme distribution of our scientific attention rather a reflection of the priorities within our society, as is suggested by CD4, the most frequently quoted protein (Fig. 1b)? CD4 is involved in HIV infection and is of clear importance to our society, although from a purely biological point of view it has a role similar to that of other cell receptors. To address this question, we use the socially unbiased view provided by high-throughput experiments as a reference. In the past two years, large-scale methods have been introduced to generate global interaction networks of proteins; yeast two-hybrid (Y2H) and mass spectrometry of purified complexes (TAP, HMS-PCI) aim to detect physical interactions. Based on yeast two-hybrid data, Barabasi and colleagues discovered the scale-free topology of these interaction networks, in which a few genes have many interactions but most genes have only a few. A clear correlation between the number of a gene's interactions and its importance to the maintenance of cellular function was shown [6]. Furthermore, the analysis of genomic data revealed a strong selective constraint on highly interacting genes [7]. For these reasons, the degree of interaction is seen as a significant indicator of a gene's importance to the cell.
Therefore, we assessed this experimental measure of importance for those genes that are most frequent in MEDLINE and thus most important to the scientific community. Surprisingly, we find no correlation between the degree of connection [8] and the frequency observed in MEDLINE for 380 yeast genes (Fig. 2). At the moment, no large-scale interaction data are available for humans; however, we expect this lack of correlation to be even more striking there, since medical and sociological factors play an even stronger role in human research. We believe that the demonstrated discrepancy originates in the complex way information spreads within a small-world network such as the scientific community, a phenomenon that only becomes clear when considering the evolution of information over time.
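
One way to test for such a relationship is a rank correlation between interaction degree and literature frequency, sketched below in plain Python. The choice of Spearman's coefficient, the function names and the toy numbers are assumptions for illustration; the text does not state which correlation measure was used.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes both inputs vary (non-zero variance)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented toy values: interaction degree and article count for four genes.
toy_degree = [1, 2, 3, 4]
toy_articles = [20, 40, 10, 30]
rho = spearman(toy_degree, toy_articles)
```

A coefficient near zero, as in this toy example, would indicate that highly connected genes are not systematically the most quoted ones.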

 

Please cite this article as

Hoffmann, R., Valencia, A. Life cycles of successful genes. Trends Genet. 19(2), 79-81. (2003).

 

Figure 1

Fig. 1. Life cycles of genes in MEDLINE abstracts from the past 26 years. (a) 250 human genes clustered along the horizontal axis according to their pattern of occurrence in the literature. Red areas indicate a peak in a gene's life cycle, black represents periods where a gene is not mentioned. (b) Characteristic life cycles of the genes CD4, p53, IL2, IL3 and GM2A. Annual frequencies of all genes were standardized to the year 2000 to account for the constant overall increase of articles per year. (c) Distribution of what we know about genes. Red circles, genes known for 10 years; blue triangles, genes known for 20 years. The x-axis represents the number (n) of articles per gene (approximating the information known per gene). The probability P(n) of finding n articles about a given gene is plotted on the y-axis and decays as a power law, $P(n) \sim n^{-\gamma}$, appearing as a straight line on a log–log plot, where $-\gamma$ is the slope of the line. The exponents for periods of accumulation of different length, $\gamma = 1.6$ for 10 years and $\gamma = 1.9$ for 20 years, reveal that the extreme distribution remains although more knowledge has been accumulated during the longer period.
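
The standardization mentioned for panel (b) is not specified further. One plausible reading, sketched here purely as an assumption, is to rescale each year's count for a gene by the ratio of the total number of articles in the reference year to the total in that year. The function name and all numbers below are hypothetical.

```python
def standardize_to_reference_year(gene_counts, total_articles, reference_year=2000):
    """Rescale annual gene counts to the article volume of a reference year.

    gene_counts:    dict year -> number of abstracts mentioning the gene.
    total_articles: dict year -> total number of articles in that year.
    This is one assumed interpretation of "standardized to the year 2000".
    """
    reference_total = total_articles[reference_year]
    return {
        year: count * reference_total / total_articles[year]
        for year, count in gene_counts.items()
    }


# Invented toy numbers, for illustration only.
toy_gene_counts = {1990: 10, 2000: 40}
toy_totals = {1990: 250, 2000: 500}
standardized = standardize_to_reference_year(toy_gene_counts, toy_totals)
```

Under this reading, a gene whose raw count grows only as fast as the literature as a whole shows a flat standardized curve.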

 

Figure 2

 

 

Fig. 2. Lack of correlation between the degree of interaction of proteins and their frequency in the scientific literature. To ensure a high level of accuracy, protein interactions were only included if confirmed independently by at least two experimental methods: yeast two hybrid (Y2H), tandem affinity purification (TAP) and/or high-throughput mass spectrometric protein complex identification (HMS-PCI). Data from Mering et al. [8]. The length of time that genes are known to the scientific community (different colours) influences the correlation positively.
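
The filtering rule described in this caption can be sketched as follows: an interaction counts only if at least two different methods report it, and a protein's degree is the number of such confirmed partners. The input format, function name and protein labels are hypothetical; this is an illustration, not the procedure used to build the figure.

```python
from collections import defaultdict


def confirmed_degrees(observations, min_methods=2):
    """Degree per protein, counting only interactions seen by >= min_methods methods.

    observations: iterable of (protein_a, protein_b, method) triples
                  (assumed input format); pairs are treated as unordered.
    """
    methods_by_pair = defaultdict(set)
    for protein_a, protein_b, method in observations:
        if protein_a != protein_b:  # ignore self-interactions
            methods_by_pair[frozenset((protein_a, protein_b))].add(method)

    degree = defaultdict(int)
    for pair, methods in methods_by_pair.items():
        if len(methods) >= min_methods:
            for protein in pair:
                degree[protein] += 1
    return dict(degree)


# Placeholder proteins A, B and C with invented observations.
toy_observations = [
    ("A", "B", "Y2H"),
    ("B", "A", "TAP"),      # same pair, second method: confirmed
    ("A", "C", "Y2H"),      # only one method: discarded
    ("B", "C", "TAP"),
    ("B", "C", "HMS-PCI"),  # confirmed
]
toy_degrees = confirmed_degrees(toy_observations)
```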


Box 1

Small-world network

The small-world effect describes the finding that any two people in the world, chosen at random, are connected to one another by typically six intermediate acquaintances. Social networks of such topology allow for the rapid spreading of news, rumours, jokes or fashions. This also explains why diseases, transmitted from person to person, can result in global epidemics.

Critical state

Such small-world networks meet the fundamental properties of complex systems, where the collective behaviour of a large number of interacting agents is not a simple combination of the behaviour of individuals. In physics, large dynamical systems have attracted great interest because of their tendency to organize into a poised state far out of equilibrium [9]. This critical state, sometimes called "the edge of chaos", separates a frozen inactive state from a hot disordered state.

Domino effects

An important characteristic of systems in the critical state, studied on sand-pile models, is that a little perturbation, i.e. the addition of a single grain of sand, can lead to anything from an insignificant shift to an avalanche of unpredictable size. This unpredictability also applies to the dimension of epidemics, earthquakes or, as is the case here, the success of a gene within the scientific community. In molecular biology power law properties have been discovered only recently in protein interaction networks and metabolic pathways [6].
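
The sand-pile behaviour described above can be reproduced with a few lines of code. The sketch below follows the usual formulation of the model on a square grid, where a cell holding four or more grains topples and passes one grain to each neighbour, and grains pushed over the edge are lost. The threshold of four, the grid sizes and the function name are illustrative assumptions.

```python
def add_grain(grid, row, col, threshold=4):
    """Drop one grain on grid[row][col], relax the pile, return the avalanche size.

    The avalanche size is the number of topplings triggered by the grain.
    The grid (list of lists of ints) is modified in place.
    """
    grid[row][col] += 1
    unstable = [(row, col)]
    topplings = 0
    while unstable:
        r, c = unstable.pop()
        if grid[r][c] < threshold:
            continue  # already relaxed by an earlier toppling
        grid[r][c] -= threshold
        topplings += 1
        if grid[r][c] >= threshold:
            unstable.append((r, c))
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            # Grains that would leave the grid fall off the table.
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]):
                grid[nr][nc] += 1
                if grid[nr][nc] >= threshold:
                    unstable.append((nr, nc))
    return topplings


# The same single grain has very different effects depending on the pile's state.
flat_pile = [[0] * 3 for _ in range(3)]
quiet = add_grain(flat_pile, 1, 1)

loaded_pile = [[3] * 3 for _ in range(3)]
avalanche = add_grain(loaded_pile, 1, 1)
```

On the flat pile the grain simply stays where it lands, whereas on the loaded pile it sets off a chain of topplings across the grid.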

 

References

1. Jenssen, T. K. et al. (2001) A literature network of human genes for high-throughput analysis of gene expression. Nature Genet. 28, 21-28

2. Dawkins, R. The Selfish Gene (Oxford University Press, Oxford, ed. 2, 1989)

3. Bak, P. et al. (1988) Self-organized criticality. Phys. Rev. A 38, 364-374

4. Milgram, S. (1967) The small world problem. Psychology Today 2, 60-67

5. Newman, M. E. J. (2001) The structure of scientific collaboration networks. Proc. Natl. Acad. Sci. USA 98, 404-409

6. Jeong, H. et al. (2001) Lethality and centrality in protein networks. Nature 411, 41-42

7. Fraser, H. B. et al. (2002) Evolutionary rate in the protein interaction network. Science 296, 750-752

8. Mering, C. et al. (2002) Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417, 399-403

9. Bak, P. How Nature Works: The Science of Self-Organized Criticality (Copernicus, New York, 1996)

 

Acknowledgements

We thank Ugo Bastolla for helpful discussion. This work was supported in part by the ORIEL and TEMBLOR EC projects.

 


 
