Reporter gene assays have demonstrated both transcription‐associated mutagenesis (TAM) and transcription‐coupled repair, but the net impact of transcription on mutation rate remains unclear, especially at the genomic scale. Using comparative genomics of related species as well as mutation accumulation lines, we show in yeast that the rate of point mutation in a gene increases with the expression level of the gene. Transcription induces mutagenesis on both DNA strands, indicating simultaneous actions of several TAM mechanisms. A significant positive correlation is also detected between the human germline mutation rate and expression level. These results indicate that transcription is overall mutagenic.
Mutations are permanent changes in the genomic sequence. They arise from replication error, transposition and DNA damage; the last might be spontaneous or induced by chemicals or radiation. Mutation rate varies in a genome beyond what is expected from the difference in the natural mutability of the four bases []. For instance, mutations are more frequent in later‐replicating chromosomal regions than earlier‐replicating regions during a cell cycle [, ], and nucleosome binding reduces the mutability of cytosines (C) []. Transcription is yet another determinant of the mutation rate [[5–8]]; its exclusive impact on the transcribed regions of a genome makes it particularly important. The first indication of transcription‐associated mutagenesis (TAM) arrived in the early 1970s [, ]. By the mid 1990s, reporter gene assays in bacteria had convincingly demonstrated that transcription induces the mutagenesis of the non‐transcribed (also known as the plus, coding, or non‐template) strand of a transcribed region []. The likely mechanism is that, during transcription, the non‐transcribed strain is exposed as a single‐stranded (ss) DNA, which is susceptible to chemical damage []. The exposure time can be quite long when the RNA polymerase is stalled or when an extended segment of the nascent mRNA is bound to the transcribed strand []. Reporter gene assays have also demonstrated TAM in the baker's yeast Saccharomyces cerevisiae, but the focus has been on short insertion/deletion (indel) mutations, apparently because indel mutations are induced much more than point mutations in those assays [[12–15]]. More mechanisms of TAM have been suggested, including transcription‐associated supercoiling and collision between replication forks and the transcription machinery []. Recently, it was reported in human cell lines that, under ultraviolet light treatment, the transcribed strand is also subject to TAM []. The probable mechanism is that photolesions in the transcribed strand cause persistent RNA polymerase stalling, which increases spontaneous deamination of C's in the photolesions, resulting in increased C→T mutations in the transcribed strand []. In contrast to TAM, there are also numerous studies in bacteria and eukaryotes that showed transcription‐coupled repair (TCR) of lesions in the transcribed strand [, ]. The likely mechanism is that bulky lesions in the transcribed strand cause RNA polymerase to stall, which recruits proteins for DNA repair [, ]. Thus, transcription apparently can induce both mutagenesis and repair. Because all of the above studies were on the basis of reporter gene assays, which typically quantify only a subset of mutations, were somewhat artificial (e.g., reporter genes are usually transcribed using non‐native promoters or from non‐native genomic locations), and might not represent the entire genome, it remains unclear whether transcription has a net effect of increasing or decreasing the mutation rate.
Answering the above question requires a genomic study. Two approaches are conceivable. First, one might compare the genome sequences of mutation accumulation (MA) lines with those of their ancestors. Second, one might compare the genome sequences of closely related species. In both approaches, it is important to examine only genomic regions that are effectively neutral such that the observed DNA changes reflect the mutation rate. Here we use both of these approaches to investigate the relationship between gene expression level and mutation rate in yeast, because reporter‐based TAM and TCR are best studied in yeast among eukaryotes. We will then use the second approach to address the same question in humans.
Mutation rates in yeast MA lines
We first examined the impact of gene expression level on mutation rate by analysing a recently published data set of point mutations in MA lines of S. cerevisiae []. To boost the extremely low rate of natural point mutation in yeast [], the authors deleted the uracil‐DNA glycosylase gene UNG1, which is required for repair of uracil in DNA formed by spontaneous C deamination. Because the effective population size was maintained at 1,000–2,000 cells during the MA [], mutations in important genes might still be subject to selection. Hence, we removed the genic regions (i.e., from the start to stop codons in gene sequences) of essential genes and non‐essential genes, of which the deletion strains have fitness <0.99, and analysed the remaining regions of the genome hereby referred to as the analysed genome (see Methods). A total of 190 confirmed point mutations are present in the analysed genome. Because the MA was conducted in the rich medium YPD (yeast extract‐peptone‐glucose) [], we used expression levels of the wild‐type yeast in YPD [] as proxies for the expression levels of the strain lacking UNG1, under the assumption that deleting UNG1 does not substantially alter the transcriptome. The mRNA sequencing (RNA‐Seq) data [] we analysed provide the number of sequencing reads mapped to each nucleotide position in the genome. This number is used as the expression level of each nucleotide site. The mean expression level of the 190 mutated sites is 50.08, significantly greater than the random expectation of 24.99 (P=0.0303, randomization test; Fig 1). We estimated this random expectation by picking 10,000 sets of 190 random nucleotide sites from the analysed genome. To control the different mutabilities of the four nucleotides [, ], we required that the 190 randomly picked sites contain the same numbers of G:C nucleotides and A:T nucleotides as observed at the mutated sites. Thus, the MA experiment demonstrates that mutations tend to occur at highly expressed sites.
Mutation rates from interspecific comparison
Because the small number of mutations in the MA lines makes it difficult to control potential confounding factors, we turned to between‐species comparisons of genome sequences. Specifically, we compared S. cerevisiae with its sister species S. paradoxus. The genomic divergence between the two species is ∼11% at genic and ∼23% at intergenic regions []. Estimating mutation rates of different genes by interspecific comparison requires the use of effectively neutral genic regions. Because both synonymous and non‐synonymous sites might be subject to natural selection that generally intensifies with the expression level [[21–25]], they cannot be used to estimate the mutation rates for genes of different expression levels. Instead, we estimate the mutation rate of a gene from the substitution rate of its intron sequence, which is likely to be neutral at most sites after the removal of the two ends (see Methods). The expression level of an intron is assumed to be the same as that of its mother gene estimated by RNA‐Seq []. We observed a positive correlation between the expression level of a gene and the mutation rate estimated from its intron nucleotide substitution rate (Spearman's rank correlation coefficient ρ=0.402, P<0.00001; Table 1; supplementary Fig S1 online). The gradual increase of the mutation rate with the expression level can be clearly seen in binned data when all introns are grouped into five equal‐size bins on the basis of expression levels (Fig 2A). It has been reported that yeast introns contain regulatory sequences that promote high transcription [, ], suggesting that not all sites in introns are neutral. Because introns of highly expressed genes are likely to be subject to stronger purifying selection than those of lowly expressed genes, the mutation rate has been underestimated more for highly expressed genes than for lowly expressed genes. In other words, our result in Fig 2A is conservative.
Several more factors are known to affect the mutation rate, including G+C content [], CpG dinucleotide frequency [], replication timing [, ] and nucleosome binding [, ]. But, the partial correlation between gene expression level and intron substitution rate remains highly significant even when the above four factors are simultaneously controlled (ρ=0.349, P<0.0001; Table 1). We conducted a multiple regression analysis to assess the relative contributions of gene expression level and the other four factors in explaining the total variance in mutation rate (Table 1). This multiple regression model explains ∼18% of the total variance in intron mutation rate. Gene expression level is the best predictor, explaining at least ∼89% of the variance explained by the model; no other factor has a significant independent contribution to the variance explained by the model after the Bonferroni correction for multiple testing (Table 1).
The RNA‐Seq expression level of a gene measures the concentration of its mRNA, which is determined jointly by the mRNA synthesis rate and mRNA degradation rate. Consistent with the hypothesis that mRNA synthesis induces mutagenesis, we found the mutation rate to be significantly correlated with the mRNA synthesis rate (ρ=0.248, P=0.0055), but not the mRNA degradation rate (ρ=−0.034, P=0.71) (see Methods). The fact that the mutation rate correlates better with the mRNA concentration than mRNA synthesis rate is probably due to the generally much more accurate estimation of mRNA concentrations than synthesis rates.
Patterns of transcription‐associated mutations
There are 12 possible types of single‐nucleotide mutation among the four bases. Four of them are generally known to be the most frequent: C→T, A→G, G→T, and A→T []. C naturally deaminates to U, which simulates T and results in a C→T transition after two rounds of DNA replication. C might also be 5‐methylated, especially when followed by a G in CpG, and 5‐methylcytosine frequently deaminates to T. C→T transitions and G→A transitions on the complementary DNA strand are typically the most common mutations in a genome. Nucleotide A might deaminate to hypoxanthine (H), which pairs better with C than T and results in an A→G transition (and an T→C transition on the complementary strand) after two rounds of replication. Another frequent mutation is due to the oxidation of G to 8‐hydroxyguanine, which can mispair with A and results in G→T after two rounds of DNA replication. G→T, as well as A→T, might also occur through depurination followed by the insertion of an A opposite the abasic site in replication.
Using S. mikatae as an outgroup of S. cerevisiae and S. paradoxus [], we inferred all single‐nucleotide substitutions that occurred in the S. cerevisiae lineage since its separation from S. paradoxus. We then calculated the difference between the frequency of each mutation type in introns and that in untranscribed regions of the genome. Here, the frequency of mutation X→Y is the number of X→Y mutations divided by the number of X in the inferred common ancestor of S. cerevisiae and S. paradoxus. On the non‐transcribed strand, the above mentioned four common mutation types (C→T, A→G, G→T and A→T) all show significantly higher frequencies in introns than untranscribed regions (Fig 2B). On the transcribed strand, only C→T is more frequent in introns than untranscribed regions (Fig 2B). These patterns are consistent with the simultaneous actions of two TAM mechanisms: (i) generally increased mutagenesis on the non‐transcribed strand due to the susceptibility of ssDNA to damage and (2) increased C→T mutagenesis on the transcribed strand associated with photolesion‐associated RNA polymerase stalling. The C→T mutation rate appears higher on the transcribed strand than on the non‐transcribed strand, although their difference is not significant (P=0.18, two‐tail Fisher's exact test). Mutational patterns of MA lines (supplementary Fig S2 online) are generally consistent with those from the interspecific comparison, although a rigorous statistical analysis is hampered by the small sample size of the MA data.
Mutation rates in the human germline
It is of significant interest to examine if highly expressed genes also have elevated mutation rates in multicellular eukaryotes. In multicellulars, one needs to distinguish germline mutations from somatic mutations, because only the former are inherited to future generations. We used alignments of human and macaque genome sequences [] to estimate the substitution rates in introns (see Methods) and used human RNA‐Seq data previously assembled [] to estimate gene expression levels in 12 tissues, including the testis that contains the germline. To gauge the rate of transcription‐associated mutations in each gene, we estimated the difference between the mutation rate of all introns of the gene and that of its flanking untranscribed regions (see Methods), because mutation rate varies among chromosomal regions []. We found this difference to be positively correlated with the expression level in each of the 12 tissues, after simultaneously controlling for GC content, CpG frequency, replication timing and nucleosome occupancy (Fig 3). Among the 12 tissues, 9 showed significant correlations (P<0.05) and testis exhibits the highest correlation. The generally positive correlation in non‐testis tissues might be due to gene expression similarities between these tissues and the germline. Using multiple linear regression (see Methods), we estimated that doubling the expression level of an averagely expressed gene increases the mutation rate difference between its introns and flanking untranscribed regions by 15%. We also validated the above results by using different flanking regions (supplementary Fig S3 online). Because synonymous mutations behave nearly neutrally in species with low effective population sizes such as humans, we also gauged human mutation rates at synonymous sites, and positive correlation between testis expression level and mutation rate is again observed (supplementary Fig S4 online). We further confirmed our results using a more recently published RNA‐Seq data [] that include the expression levels of both humans and macaques (supplementary Fig S5 online). We did not examine the mutational pattern in the human germline, because it has been extensively studied [, , ].
In this study, we provided several lines of genomic evidence for elevated mutation rates of highly expressed genes in budding yeast and the human germline. We showed in yeast that the increased mutation rate likely results from mutagenesis associated with mRNA synthesis. Because transcription is known to induce both mutagenesis (TAM) and DNA repair (TCR), our finding suggests that, at the genomic scale, the effect of TAM overwhelms that of TCR in yeast as well as the human germline. As mentioned, TAM has several distinct molecular mechanisms that can lead to different mutational patterns. Our observations in yeast suggest that more than one TAM mechanism is at work. Specifically, the increased rates of the four most frequent types of mutation in the non‐transcribed strand is likely caused by the susceptibility of this naked ssDNA to damage [], whereas the increased rate of C→T mutation in the transcribed strand is probably caused by the mutagenesis during the extended RNA polymerase stalling []. One mechanism of TAM involves the collision between a DNA replication fork and RNA polymerase in the S phase of a cell cycle []. However, we found no correlation between the yeast intron mutation rate and expression level in the S phase [] (ρ=−0.036, P=0.71), suggesting that the collision is not the primary cause of the observed genomic pattern of TAM. It has been suggested that TAM might be minimized by internal folding of the non‐transcribed DNA strand that results in short segments of double‐stranded DNA []. Following a previous study [], we calculated the transcription‐driven mutability index (TDMI) of each yeast intron, which is defined by the percentage of unpaired nucleotides in the non‐transcribed strand of the intron (see Methods). But, TDMI and intron substitution rate are not significantly correlated, with (ρ=0.035, P=0.70) or without the control of expression level (ρ=−0.055, P=0.55; supplementary Fig S6 online), suggesting that internal folding of the non‐transcribed DNA strand is not a significant factor modulating TAM in yeast.
Several reporter assays in yeast showed that transcription dramatically induces indel mutations [[12–15]]. But, we found no correlation between the frequencies of indels in introns and the expression levels of the introns in the comparison between S. cerevisiae and S. paradoxus (supplementary Fig S7 online), suggesting that the reporter assay results might be gene specific or condition specific. Alternatively, the effect revealed by the reporter assays is widespread, but its contribution to the genome‐wide variation of indel mutation rate is too small to detect here. In the human–macaque comparison, the patterns vary among indels of different sizes (supplementary Fig S8 online).
We discovered a significantly higher germline mutation rate in strongly expressed human genes than weakly expressed ones. This significant correlation, however, disappears when the four confounding factors (GC content, CpG frequency, replication timing and nucleosome binding) are not controlled, suggesting the importance of controlling these variables and explaining why the correlation was not previously observed []. Nonetheless, the impact of gene expression on mutation rate, after the controls, is much weaker in human (ρ=0.083) than in yeast (ρ=0.349). This difference has several potential causes. First, the gene expression levels in the human germline might not be well approximated by those in the testis of a particular age represented by the human individual studied, which reduces the correlation with the mutation rate. Second, the correlation between the mRNA synthesis rate and mRNA concentration might be lower in human than in yeast if human has more post‐transcriptional modifications than yeast. For example, microRNAs, which are short non‐coding RNAs that bind to target mRNAs to cause their degradation and silencing [], regulate ∼30% of protein‐coding genes in mammals [], but no microRNA exists in S. cerevisiae []. Third, it is possible that the mechanisms or the relative importance of TAM and TCR differ between yeast and the human germline. Regarding this possibility, it is important to discuss the somatic mutation rates and patterns recently revealed from human cancers. Several cancer genomic studies reported lower somatic mutation rates in more highly expressed genes and a greater reduction in C→T mutation on the transcribed strand than the non‐transcribed strand [[41–43]]. These results are explainable by the action of TCR. Compared with the germline, cancer cells have overall much higher mutation rates [], which implies the presence of abundant lesions on the transcribed strand that could be repaired through TCR. In other words, the impact of TCR is expected to be greater to cancer cells than to the germline because of the higher abundance of lesions in the former than the latter. By contrast, owing to the generally high background mutation rate in cancer cells, several mutations brought about by TAM become relatively insignificant. Thus, the relative impact of TAM and TCR might be reversed in cancer cells, compared with that in the germline. The mutational patterns observed in cancer cells are consistent with this interpretation [[41–43]].
Our finding of elevated mutation rates in highly expressed genes in yeast and human germline has several important implications. First, because highly expressed genes tend to have more important roles than lowly expressed genes [] and because most mutations are deleterious, our finding suggests that (i) spontaneous mutations are more harmful than is now appreciated on the basis of the assumption of homogenous mutation rates and that (ii) the actual mutation load is likely greater than the present estimate. Second, the fact that transcription is overall a mutagen implies the possibility of natural selection for reduced transcription from an unnecessarily high level due simply to the benefit of reducing deleterious mutations. When a mutation rate modifier is linked to the gene whose mutation rate is being modified, the selective advantage of the modifier is Δμ if it reduces the deleterious mutation rate of the gene by Δμ per gene per generation []. In yeast, the average point mutation rate is 3 × 10−10 per nucleotide site per generation [] and the coding region of an average gene has 1350 nucleotides []. If 80% of mutations in the coding region of a gene are deleterious, the deleterious mutation rate is 3.24 × 10−7 per gene per generation. If a modifier reduces the mutation rate of a gene by 20% by reducing its transcription level, its advantage is s=Δμ=0.648 × 10−7, which is slightly greater than what natural selection can detect in yeast (0.5/effective population size=0.5 × 10−7) []. Of course, reducing transcription should also save energy and materials []. Even when the protein concentration needs to be maintained, mutation rate might still be lowered by simultaneously reducing the transcription rate and increasing the translation rate. However, this strategy increases the gene expression noise [], which typically imposes a fitness cost to the cell []. The current transcription/translation rate ratio might reflect a tradeoff between these antagonistic factors. Third, in yeast, the expression level of a protein has been shown to be the most important determinant of the evolutionary rate of the protein sequence, with highly expressed proteins having low amino‐acid substitution rates []. This phenomenon has been explained by several non‐mutually exclusive hypotheses: protein misfolding avoidance [, ], protein misinteraction avoidance [] and protein functional importance [, ]. The fact that highly expressed genes experience more mutations than lowly expressed ones suggests that the strong purifying selection acting on highly expressed proteins far exceeds what is reflected from their low amino‐acid substitution rates.
Contrary to our findings in yeast and the human germline, Martincorena et al [] recently reported lower mutation rates in more strongly expressed genes in Escherichia coli. Because of the absence of introns in E. coli, these authors calculated synonymous nucleotide polymorphism (θS) of each gene using the genome sequences of 34 strains. Because synonymous mutations in E. coli are subject to natural selection, they converted θS to θS’ such that θS’ is uncorrelated with codon usage bias, two indicators of selection, and two known determinants of the mutation rate. Intriguingly, θS’ is negatively correlated with the expression level, which led the authors to suggest an evolutionary risk management strategy that has suppressed the mutation rate of highly expression genes because of their relative importance to the organism []. But, how accurately θS’ reflects mutation rate is unclear. We propose that a small‐population MA experiment be conducted in E. coli to estimate the correlation between expression level and mutation rate. Furthermore, a negative correlation between expression level and mutation rate is also evident in human cancer cells and hence need not be a result of evolutionary risk management.
Data sets. Details about the yeast MA lines, essential genes, fitness values of single‐gene deletion strains, mRNA synthesis and degradation rates, yeast and primate intron sequences, yeast and human nucleosome occupancy data, yeast and human DNA replication timing data and yeast and primate transcriptome data are provided in supplementary information online. As described in supplementary information online, the TDMI of each yeast intron was estimated following a previous study [].
Nucleotide substitution rates and patterns. The nucleotide substitution rate for a pair of orthologous intron sequences was estimated using the baseml module of PAML [] under Kimura's two‐parameter model [], because the sequence divergence is moderate here. In the case of humans, introns from the same gene are concatenated before analysis. To control regional variation in mutation rate in the human genome, we calculated the substitution rate in the introns of a gene minus that in a flanking intergenic region, which is a 5‐kb segment starting from 5 kb (or 10 kb in supplementary Fig S3 online) downstream of the 3′ end of the 3′ untranslated region (UTR) of the gene. We removed the gene from our analysis if this 5‐kb segment overlaps with the neighbouring gene, which is defined by a segment encompassing from 5 kb upstream of the 5′ end of 5UTR to 5 kb downstream of the 3′ end of 3UTR. Indel substitution rates were similarly estimated, except that we did not correct for potential multiple hits
To study mutational patterns in yeast, we used three‐way alignments (S. cerevisiae, S. paradoxus and S. mikatae) in UCSC Genome Browser [] to infer the intron sequences of the common ancestor of S. cerevisiae and S. paradoxus by parsimony, followed by tabulating each mutation type. Ambiguous sites by the parsimony principle were not further considered. Multiple regression analysis was conducted to evaluate the relative contributions of various factors to mutation rates, as described in supplementary information online.
Conflict of Interest
The authors declare that they have no conflict of interest.
We thank X. Chen for providing the mutation data from the yeast MA lines and A. Eyre‐Walker and two anonymous reviewers for valuable comments. This work was supported in part by a research grant from the US National Institutes of Health to J.Z.
Author Contributions: J.Z. conceived and designed the research; C.P. and W.Q. performed the research and analysed the data; C.P. and J.Z. wrote the manuscript.
- Copyright © 2012 European Molecular Biology Organization