Non‐coding RNAs: the architects of eukaryotic complexity

John S Mattick

Author Affiliations

John S Mattick*,1

1 ARC Special Research Centre for Functional and Applied Genomics, Institute for Molecular Bioscience, University of Queensland, Brisbane, 4072, Australia
*Corresponding author. Tel: +61 7 3365 4446; Fax: +61 7 3365 8813; E-mail: j.mattick{at}


Around 98% of all transcriptional output in humans is non‐coding RNA. RNA‐mediated gene regulation is widespread in higher eukaryotes and complex genetic phenomena like RNA interference, co‐suppression, transgene silencing, imprinting, methylation, and possibly position‐effect variegation and transvection, all involve intersecting pathways based on or connected to RNA signaling. I suggest that the central dogma is incomplete, and that intronic and other non‐coding RNAs have evolved to comprise a second tier of gene expression in eukaryotes, which enables the integration and networking of complex suites of gene activity. Although proteins are the fundamental effectors of cellular function, the basis of eukaryotic complexity and phenotypic variation may lie primarily in a control architecture composed of a highly parallel system of trans‐acting RNAs that relay state information required for the coordination and modulation of gene expression, via chromatin remodeling, RNA–DNA, RNA–RNA and RNA–protein interactions. This system has interesting and perhaps informative analogies with small world networks and dataflow computing.

The genome sequencing projects have revealed an unexpected problem in our understanding of the molecular basis of developmental complexity in the higher organisms: complex organisms have fewer protein‐coding genes than anticipated. The fruitfly Drosophila melanogaster and the nematode Caenorhabditis elegans appear to have only about twice as many protein‐coding genes (∼12 000–14 000) as microorganisms such as Saccharomyces cerevisiae (∼6200) and Pseudomonas aeruginosa (∼5500) (Rubin et al., 2000; Stover et al., 2000). Humans appear to have only twice as many again (∼30 000) (International Human Genome Sequencing Consortium, 2001; Venter et al., 2001), although there is some debate about this (Wright et al., 2001; see also below). While the repertoire of protein isoforms expressed in the higher organisms is greatly increased by alternative splicing (Graveley, 2001), the other striking feature of the evolution of the higher organisms, which has been largely overlooked to date, is the huge increase in the amount of non‐protein‐coding RNA, which in humans accounts for ∼98% of all genomic output (see below).

Have we missed something fundamental? Are these RNAs functional, and if so might they represent an important development in the genetic operating system of the higher organisms, as opposed to the mainly protein‐based systems of microbes?

Phenotypic diversity in eukaryotes

The proteomes of the higher organisms are relatively stable. Humans and mice share 99% of their protein‐coding genes (J.C. Venter, personal communication), and differentiation in these and other complex eukaryotes appears to be achieved primarily by modular re‐use and multitasking of different subsets of the proteome (Pawson, 1995; Duboule and Wilkins, 1998). Moreover, of the ∼3 000 000 sequence differences per haploid genome between individual humans, only ∼10 000 (0.3%) occur in protein‐coding sequences, mostly as silent (third base) changes (Venter et al., 2001).

Thus, phenotypic variation between both individuals and species may be based largely on differences in non‐protein‐coding sequences and be mainly a matter of variation in gene expression, i.e. due to the control architecture of the system. This further implies that, although protein variation will also contribute, the primary source of complex traits and of quantitative trait variation is embedded in this control architecture. If so, this has significant implications for understanding the basis of differentiation and development and the regulatory networks that underlie neural function, disease susceptibility and cancer.

While the control architecture is assumed to be primarily located in cis‐acting gene promoters and enhancers, which are subject to combinatorial inputs from transcription factors modulated by signaling pathways, this may be only part of the answer. This view ignores the possible role(s) of non‐coding RNAs, which represent the vast majority of genomic output in higher organisms. The failure to recognize the possible significance of these RNAs is based on the central dogma, as determined from bacterial molecular genetics, that genes are synonymous with proteins, and that RNAs are just temporary reflections of this information. This view is reinforced by the prevailing biochemical perspective that proteins comprise the primary regulators of cell and organismal biology, which is essentially the case in prokaryotes (although non‐coding RNAs are occasionally used), but may not be true for higher eukaryotes.

Genomic output in the higher eukaryotes

Non‐protein‐coding RNA transcription in the eukaryotes falls into two classes: introns and other non‐protein‐coding RNAs. In humans, introns account for ∼95% of the pre‐mRNA transcripts of protein coding genes, and are generally of high sequence complexity. As far as can be judged from ad hoc reports and from hybridization kinetic analysis of the relative complexity of heterogeneous nuclear (hn) RNA versus mRNA, other non‐coding RNAs represent half to three quarters of all transcription from the genomes of the higher organisms (Davidson et al., 1977; Mattick and Gagen, 2001; Shabalina et al., 2001). These RNAs include a plethora of antisense transcripts and ‘intergenic’ transcripts (Ashe et al., 1997; Askew and Xu, 1999; Eddy, 1999; Erdmann et al., 2001; Mattick and Gagen, 2001) and may include many of the estimated 65 000–75 000 transcriptional units in the human genome (Wright et al., 2001). If we assume that approximately two thirds of all transcripts do not contain protein‐coding sequences, the real number of ‘genes’ (defined as those that produce separate primary transcripts and are separately regulated) in mammalian genomes may in fact be of the order of 100 000. Where these non‐coding RNAs have been examined, they are developmentally regulated and have genetic effects. A good example is the bithorax‐abdominal A/B complex of Drosophila which spans ∼200 kb and expresses seven major transcripts that cover almost the entire region. Only three of these contain protein‐coding sequences, but all are spatially and temporally regulated and the interruption or deletion of the DNA that encodes them has known phenotypic consequences (Lipshitz et al., 1987; Sanchez‐Herrero and Akam, 1989).
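The estimate of about 100 000 genes can be checked with a line of arithmetic. The sketch below is only an illustration and makes two assumptions that are not stated explicitly at this point in the text: that the protein‐coding gene count is the ∼30 000 figure quoted earlier, and that each protein‐coding gene counts as a single transcription unit. The function name is hypothetical.

```python
from fractions import Fraction


def total_transcription_units(protein_coding: int, noncoding_fraction: Fraction) -> Fraction:
    """Total number of units implied when a given fraction of them is non-coding."""
    if not 0 <= noncoding_fraction < 1:
        raise ValueError("noncoding_fraction must be in [0, 1)")
    # protein_coding = total * (1 - noncoding_fraction), solved for total
    return protein_coding / (1 - noncoding_fraction)


# Assumed inputs: ~30 000 protein-coding genes, two thirds of units non-coding.
estimate = total_transcription_units(30_000, Fraction(2, 3))
print(int(estimate))
```

Under these assumptions the total comes to 90 000, which is of the order of 100 000 as stated. The result is sensitive to the assumed non‐coding fraction: one half would give 60 000 and three quarters would give 120 000.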

Potential trans‐acting mediators of cellular networking and regulation

If these non‐protein‐coding RNAs are functional, their most obvious role would be in networking, i.e. the production of parallel trans‐acting signals that allow activity at one locus to be connected with others in real time. This further implies that suites of gene activity and other levels of systems control may be directly coordinated and integrated in a programmed manner via efference RNA signals (eRNAs) (Figure 1) and that this may be fundamental to the operation of the system. These eRNAs could act as a cellular memory of recent transcription events (Mattick, 1994), as a kind of soft wiring (Herbert and Rich, 1999a). At face value this would represent an enormous increase in network connectivity and functionality over the situation where system activity is solely regulated through protein‐based feedback loops that relay metabolic and environmental state information (Mattick, 1994; Mattick and Gagen, 2001). Moreover, if a system utilizing an RNA communication network has evolved, it would not be surprising if many loci had evolved solely to express RNA.

Figure 1.

Comparison of the prokaryotic and proposed eukaryotic genetic operating systems. The left panel shows the central dogma in which genes code, via mRNA, for proteins, which carry out the catalytic, structural, signal transduction and regulatory functions of the cell. The right panel shows the proposed operating system in eukaryotes wherein genes may express two levels of information: mRNA for proteins, and eRNAs that carry out concomitant networking and other functions within the organism. Thus there are three types of genes in eukaryotes: those that encode only protein (which are rare), those that encode only eRNA, and those that encode both.

The origin and evolution of eukaryotic nuclear introns

When nuclear introns were first discovered they were assumed to be non‐functional and were postulated to be remnants of the prebiotic assembly of genes from exonic cassettes of protein‐coding information (Gilbert et al., 1986). However, it is now clear that modern nuclear introns invaded eukaryotic genes late in evolution, after the separation of transcription and translation (Mattick, 1994; Cho and Doolittle, 1997; Logsdon, 1998; Wolf et al., 2000). The fragmentation of protein‐coding genes by introns may have conferred an advantage by facilitating the modular shuffling of eukaryotic protein domains in evolutionary time and in real time via alternative splicing, but this is not necessarily the prime reason for their dominance. Alternative splicing signals are usually short and located near intron–exon boundaries (Lopez, 1998), and cannot account for the vast tracts of intronic sequences that populate most protein‐coding genes in the higher organisms.

Nuclear introns are clearly derived from self‐splicing group II introns of prokaryotes, which have the same splicing mechanism and which have expanded in eukaryotes by retrotransposition and other mutational, recombinational and insertional processes (Lambowitz and Belfort, 1993; Jacquier, 1996; Tarrio et al., 1998; Cousineau et al., 2000; Eickbush, 2000). The evolution of the spliceosome by the devolution of cis‐acting catalytic RNAs into trans‐acting general factors (spliceosomal RNAs) and the recruitment of accessory proteins would have reduced the internal sequence constraints on these introns, and allowed them considerable freedom to drift, expand and evolve. Any sequences that acquired a useful function, for example as trans‐acting signals capable of transmitting other information in parallel with their associated protein coding sequences, would have had a certain selective value and formed the genesis of a networking system in eukaryotic cells (Mattick, 1994). This does not imply that all introns will have evolved function, as each will be evolving largely independently, but rather that an increasing number may well have done so. In the pufferfish Fugu rubripes, for example, which has a highly compact genome (Elgar, 1996), about three‐quarters of the introns are very small, and probably represent vestigial remnants of past insertions, whereas the remainder are considerably larger and probably contain functional information.

Intronic RNA and other non‐protein‐coding RNAs now constitute the majority of genomic output in complex eukaryotes. Moreover, after accounting for variable amounts of repetitive DNA, there is a good correlation between intron density and developmental complexity (Mattick and Gagen, 2001). Introns and other noncoding RNAs have high sequence complexity and, in some cases, show interesting patterns of conservation across large evolutionary distances. Conservation is often found in large blocks that are indicative of selective constraints (Jareborg et al., 1999; Mattick and Gagen, 2001; Shabalina et al., 2001). The fact that most introns are less conserved than their associated protein‐coding exons does not mean that they lack function, but rather that they are subject to less severe constraints.

Evidence that introns and other non‐coding RNAs have function

Examples of intronic and other non‐protein‐coding RNAs that contain functional information are increasingly coming to light (Askew and Xu, 1999; Eddy, 1999) (see also below). One interesting subclass of these are small nucleolar RNAs (snoRNAs), which are produced from intronic RNAs derived from genes encoding ribosomal proteins and nucleolar proteins, as well as from other genes whose exons no longer have any protein coding capacity (Maxwell and Fournier, 1995; Tycowski et al., 1996; Filipowicz, 2000). These introns are processed through pathways involving endonucleolytic cleavage by double‐stranded RNase III‐related enzymes, exonucleolytic trimming and possibly RNA‐mediated cleavage, which occur in large complexes called exosomes (Allmang et al., 1999; van Hoof and Parker, 1999).

Other interesting examples of non‐coding RNAs with functional activity are the small temporal RNAs lin‐4 and let‐7, which control developmental timing in C. elegans via RNA–RNA interactions that affect the translation and stability of other transcripts (Moss, 2000). let‐7 is conserved among vertebrates and invertebrates (Pasquinelli et al., 2000). These small RNAs are derived from larger precursors and are around 22 nucleotides in length, similar to the size of RNAs produced by RNA interference (RNAi)‐mediated RNA processing (see below). Indeed, it has been shown recently that the production of these RNAs is dependent on homologs of the Dicer and RDE‐1 families of proteins that are also involved in RNAi (Grishok et al., 2001). It is quite conceivable that such pathways are involved in the downstream processing of a wide range of intronic and other non‐coding RNAs, whose products may number in the tens or hundreds of thousands and which may act as guide RNAs to regulate many different processes. There are many other examples of non‐coding RNAs that have a role during development in both animals and plants, including Xist and roX1/roX2 which are involved in dosage compensation, as well as H19, Pgc, NTT, bic, BORG, BC200, his‐1, Bsr, hsr‐omega, ENOD40, CR20, among many others (Nakamura et al., 1996; Teramoto et al., 1996; Liu et al., 1997; Tam et al., 1997; Takeda et al., 1998; Eddy, 1999; Komine et al., 1999; Erdmann et al., 2001). Some of these RNAs are alternatively spliced or have alternative polyadenylation sites and are probably derived from genes that have lost their protein coding capacity.

It seems safe to predict that the vast majority of non‐coding RNAs have not yet been catalogued (see e.g. Ashe et al., 1997), as most genomic screens have been intrinsically biased against their discovery (Eddy, 1999). It is only recently that some attempts to do this more systematically have been initiated (Olivas et al., 1997; Hüttenhofer et al., 2001). In addition, it is likely that single‐base mutations in non‐coding RNAs will be hard to detect phenotypically. As is the case for promoters, such sequences may be somewhat more flexible than are protein‐coding sequences, especially if the affected RNAs are part of a scale‐free network that is resistant to damage (Albert et al., 2000). On the other hand, it is relatively easy to find mutations in genomic sequences encoding non‐coding RNAs by insertional and deletional mutagenesis, as is the case for the Drosophila bithorax locus referred to above.

Complex genetic phenomena involving RNA

A central role for RNA signaling and RNA metabolism in eukaryotic biology is becoming more obvious. There are a number of poorly understood genetic phenomena in higher eukaryotes, including RNAi, co‐suppression, transgene silencing, position effect variegation, imprinting, DNA methylation, X‐chromosome dosage compensation and transvection, all of which share common features (Judd, 1995; Fire, 1999; Jones et al., 2000; Kelley and Kuroda, 2000; Mette et al., 2000; Morel et al., 2000; Sleutels et al., 2000; Wassenegger, 2000; Sharp, 2001). Without going into detail, RNA signals have been shown to be central to, or at least implicated in, all of these phenomena, which involve RNA–RNA and RNA–DNA interactions as well as chromatin remodeling (see Mattick and Gagen, 2001; Sharp, 2001; and references therein). RNAi and post‐transcriptional gene silencing in animals and plants are mediated by 21–22 nucleotide RNAs generated by RNase III cleavage from longer double‐stranded RNAs (Hammond et al., 2001; Sharp, 2001), a length similar to that required for RNA‐directed DNA methylation (Wassenegger, 2000), and which is probably close to the optimal minimum required for stable base‐pairing and sequence‐specific interactions within complex genomes. While some of these pathways may be utilized in defense against viruses and transposon mobilization (Baulcombe, 2001), it is also clear that they are an integral part of normal cell and developmental biology (see Grishok et al., 2001).
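The suggestion that 21–22 nucleotides is close to the minimum needed for sequence‐specific recognition in a complex genome can be illustrated with a rough back‐of‐the‐envelope model. The sketch below is not a calculation from this article. It assumes a random genome with equal base frequencies, a haploid human genome of about 3 × 10^9 bp (a figure supplied here as an assumption), and matches counted on both strands. The function names are hypothetical.

```python
def expected_chance_matches(length: int, genome_size: float, strands: int = 2) -> float:
    """Expected random occurrences of one specific sequence of `length` bases."""
    # There are 4**length possible sequences; each position on each strand
    # matches a given one with probability 1 / 4**length.
    return strands * genome_size / 4 ** length


def shortest_specific_length(genome_size: float, strands: int = 2) -> int:
    """Smallest length whose expected number of chance matches is below one."""
    length = 1
    while expected_chance_matches(length, genome_size, strands) >= 1:
        length += 1
    return length


HUMAN_GENOME_BP = 3e9  # assumed haploid genome size, not a figure from the article

print(shortest_specific_length(HUMAN_GENOME_BP))
for n in (21, 22):
    print(n, expected_chance_matches(n, HUMAN_GENOME_BP))
```

In this simplified model a sequence of 17 bases is already expected to occur by chance less than once, and at 21–22 bases the expected number of chance matches falls well below 0.01. Real genomes are repetitive and not random, and functional pairing must tolerate some mismatches, so the extra few bases are plausibly the margin that makes recognition reliable.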

Large families of proteins are involved in RNA metabolism and signaling

It has also become obvious that there are many large gene families which encode proteins involved in RNA metabolism, some of which have come to light by the genetic analysis of RNAi, and which also affect co‐suppression and transgene silencing. Apart from RNaseD‐type 3′‐5′ exonucleases and double‐stranded RNase IIIs, of which there are many homologs in metazoan genomes, these include: the Dicer family of proteins that contain similar domains (RNase type III domains and dsRNA‐binding domains) together with an RNA helicase domain and a PAZ domain; adenosine deaminases that act on dsRNAs (ADARs); RNA‐dependent RNA polymerases; RNA helicases and DExH/D box proteins; the RDE‐1 (Argonaute/piwi/zwille) family of proteins found in plants, fungi, invertebrates and mammals (which also contain a PAZ domain), with at least 20 homologs in C. elegans; and others identified in genetic screens but yet to be defined biochemically (Cerutti et al., 2000; Fagard et al., 2000; Baulcombe, 2001; Grishok et al., 2001; Schwer, 2001).

Other families of RNA‐binding proteins include those with one or more RRM (RNA recognition motif) domains, KH domains and RG domains, among others (Perez‐Canadillas and Varani, 2001), and it seems likely that RNA‐binding proteins of one sort or another constitute the largest group of proteins in the genomes of the higher eukaryotes. In addition many proteins that are considered to be ‘transcription factors’, such as Y‐box (cold shock) proteins, winged‐helix‐turn‐helix proteins, and zinc finger proteins such as Sp1 and WT1, appear to bind RNA or RNA–DNA hybrids, and may well be recognizing not DNA per se but higher order structures involving RNA, as well as associating in complexes with other proteins such as DNA methyltransferase, histone H5 and hnRNP K (Shi and Berg, 1995; Ladomery, 1997; Herbert and Rich, 1999b; Fierro‐Monti and Mathews, 2000; Shnyreva et al., 2000).

RNA regulates chromatin architecture

There is also good evidence that RNA regulates chromatin architecture. DNA methylation is RNA‐directed, at least in plants and probably in all higher eukaryotes (Wassenegger, 2000). The phenomenon of transvection, or allelic cross‐talk, which has been largely described in Drosophila but which also occurs in other higher eukaryotes (Wu and Morris, 1999), has been implicated in genomic imprinting and X chromosome inactivation and almost certainly involves trans‐acting RNA signals (see Mattick and Gagen, 2001). Transvection, co‐suppression and transgene silencing have all been shown to involve Polycomb‐group proteins (Birchler et al., 2000), which are involved in chromatin remodeling via histone deacetylation (van der Vlag and Otte, 1999; Gebuhr et al., 2000), leading to the suggestion that trans‐acting RNAs may direct the gene‐specific binding of Polycomb complexes (Sharp, 2001).

Importantly, it has recently been shown that a conserved domain called a chromodomain, which occurs in Polycomb‐group proteins, as well as in other proteins involved in chromatin remodeling such as the HP1 and CHD families (Jones et al., 2000) and the histone acetyltransferase MOF, is an RNA‐binding module (Akhtar et al., 2000). The chromodomain controls sequence and target specificity (Jones et al., 2000) and different Polycomb‐group protein complexes function at different genomic sites (Strutt and Paro, 1997). Chromodomain‐containing proteins are also involved in position effect variegation (Kennison, 1995). In addition, a non‐coding RNA has been shown to act as a transcriptional co‐activator for steroid receptors (Lanz et al., 1999), whose action also requires chromatin remodeling and the recruitment of histone acetyltransferases (Zhang and Lazar, 2000). Thus chromatin structure and hence gene expression in higher eukaryotes appears to be controlled not just by protein factors but also by trans‐acting RNA signals.

RNA networks have parallels with other complex information processing systems

Taken together, these observations suggest that a complex network of RNA signaling with a sophisticated infrastructure operates in higher eukaryotes, which enables direct gene–gene communication and the integration and regulation of gene activity at many different levels, including chromatin structure, DNA methylation, transcription, RNA splicing and processing, RNA translation, RNA stability, and RNA signaling in other pathways (Figure 2). This is reminiscent of network control in other information processing systems, such as computers and the brain, where control codes (which are mainly internally sourced) are used to integrate and multitask complex patterns of activity (Mattick and Gagen, 2001). Such systems require multiple inputs and outputs, which in neurobiology are referred to as ‘efference’ signals (Bridgeman, 1995), and it has been suggested that trans‐acting RNAs may play a central role in regulating gene expression in the brain (see Smalheiser et al., 2001).

Figure 2.

A more detailed schematic of the proposed role of eRNAs in eukaryotic system networking and control. Genes, packaged in chromatin, express primary transcripts which are then (alternatively) spliced to yield an mRNA and/or n introns, which may be further processed to form multiple smaller species, such as let‐7. Some noncoding RNA genes may yield functional RNAs from both introns and exons (nRNA). These RNAs may then act as signaling or guide molecules to integrate activity at this locus with that of related parts of the network, via effects on chromatin structure, transcription, splicing, other levels of RNA processing, mRNA translation, mRNA stability and other levels of RNA‐mediated signal transduction within the cell. The evidence indicates that many if not most of these interactions will be homology (primary sequence) dependent, and involve RNA–DNA, RNA–RNA and RNA–protein interactions, but others may involve secondary or tertiary RNA structures and RNA‐mediated catalysis. This scheme is not comprehensive, but is intended to give a sense of the complexity and potential of such networks for programmed control and system integration of complex suites of gene activity in differentiation and development.

A more detailed account of the evidence for this hypothesis and its relationship to information processing in other domains is given in Mattick and Gagen (2001). Such a system has interesting and perhaps instructive analogies with small world networks and dataflow computing. Experimental approaches to testing this hypothesis will include examination of the effects of ectopic production of introns and other non‐coding RNAs on gene expression patterns and phenotypic indices, aided by bioinformatic analysis to identify conserved sequences in RNA and DNA that may act as transmitters or receivers in the network, as most of these RNA‐dependent effects would appear to be homology‐dependent. Comparison of the human, mouse and other mammalian genomes shows a surprisingly large degree of sequence homology outside of protein‐coding regions (V. Bonazzi, personal communication; Mayor et al., 2000). If the hypothesis is correct, understanding the biology of higher organisms will require not only an understanding of the proteome, which is the focus of so much research at present, but also the identification of all non‐coding RNAs, their expression patterns, processing, and signaling pathways. It also suggests that, far from being evolutionary junk, introns and other non‐coding RNAs form the primary control architecture that underpins eukaryotic differentiation and development.


I would like to thank Michael Gagen (Physics Department, University of Queensland) for many stimulating and informative discussions on the intersection between genetics and information processing systems. I would also like to thank Philip LoCascio (Oak Ridge National Laboratory, TN) for pointing out the similarities between this hypothesis and dataflow computing. Apologies are extended to authors whose work was not cited directly due to space limitations.

