An essential issue derived from the sequencing of the human and other genomes is the identification of gene regulatory elements. Using in vivo footprinting and expression analysis, here we show that mouse and human CpG island promoters at homologous genes have a completely different organization in terms of size and binding of transcription factors. Despite these species‐specific differences, a unifying picture emerges from the precise confinement of protein–DNA interactions between the 5′ boundary of the CpG islands and the transcription initiation site. This finding allows direct localization of promoters on genomic sequences and reveals a very high rate of variation and evolutionary divergence of mammalian regulatory regions. Our results also show that CpG island promoters associated with tissue‐specific genes, such as the human α‐globin, are bound by ubiquitous factors that allow a constitutive low level of expression in many cell types.
Mammalian promoters fall into two categories in terms of base composition and DNA methylation. In human and mouse, approximately half of them have a G+C content and a methylation pattern that are indistinguishable from bulk DNA and are invariably associated with tissue‐specific genes. The other half are associated with CpG islands, which are regions devoid of methylation and have a G+C content higher than the genome average. They remain non‐methylated throughout development regardless of the expression of the gene, with the exception of CpG islands in the inactive X chromosome and those at imprinted genes. All housekeeping genes and many tissue‐specific genes fall within this group (Antequera and Bird, 1993). The relationship between CpG islands and gene promoters has long been known (Bird, 1987; Gardiner and Frommer, 1987) and has been exploited to identify and validate the presence of genes within the recently completed sequence of the human genome (Lander et al., 2001; Venter et al., 2001).
An interesting feature of CpG islands is that, while maintaining the distinctive G+C composition and lack of methylation common to all of them, each one is unique in terms of sequence and position relative to the transcription initiation site (see below). This uniqueness is consistent with the proposed origin of the CpG islands in the vertebrate genome as regions under a mutational pressure different from the rest of the genome because of their dual role as promoters and DNA replication origins (Delgado et al., 1998; Antequera and Bird, 1999). As a consequence, comparison of CpG island sequences, even when associated with homologous genes, discloses no significant homology (Broderick et al., 1987; Zhao et al., 1998). This, together with the short, dispersed and degenerate nature of the regulatory elements that make up promoters, explains the difficulty involved in their identification by sequence comparison (Prestridge, 1995; Duret and Bucher, 1997; Hardison et al., 1997; Tautz, 2000).
In this article, we describe the organization of CpG island promoters at homologous mouse and human genes, as determined by in vivo footprinting. Our results reveal a disparate pattern of protein–DNA interactions at regulatory regions of genes whose genomic organization, sequence and expression pattern are highly conserved between the two organisms.
The promoter of the mouse adenine phosphoribosyltransferase gene (APRT) lies within a CpG island and has been characterized by transient transfection experiments (Dush et al., 1988) and by in vivo footprinting analysis (Macleod et al., 1994). The promoter extends 100 bp upstream from the major transcription initiation site up to the 5′ boundary of the unmethylated CpG‐rich region (Figure 1A). The human APRT gene has a very similar genomic organization and its coding sequence is 82% identical to that of the mouse homolog (Broderick et al., 1987). In contrast, the CpG island at the human APRT gene stretches ∼600 bp upstream from the transcription initiation point (Figure 1A). To analyze the organization of the human promoter relative to the CpG island, we mapped the boundary of the unmethylated region by bisulfite analysis and footprinted in vivo a 1 kb region spanning the 5′ end of the CpG island. The results shown in Figure 1 define a region of multiple protein–DNA interactions encompassing the 600 bp CpG‐rich region. No interactions were detected along the 350 bp further upstream footprinted with oligonucleotide sets 7 and 8 (Figure 1). Given the differences in the pattern of protein–DNA interactions between the mouse and human APRT upstream sequences, we asked whether the entire 600 bp region was required for transcription. We cloned this region and two subfragments of 112 and 430 bp (Figure 2A) immediately upstream from a minimal promoter driving the chloramphenicol acetyltransferase (CAT) gene, and the constructs were transiently transfected into HeLa cells. Figure 2B shows that both subfragments can activate transcription independently but that there is a strong synergistic effect when they are combined, suggesting that most or all of the elements along the 600 bp region contribute to the transcriptional activation of the human APRT gene.
The localization of proximal regulatory sequences of the human and mouse APRT gene between the 5′ boundary of the CpG island and the transcription initiation site opened the possibility of predicting the localization of the cis‐regulatory regions of CpG island genes. To test this possibility, we footprinted the 5′ regions of two pairs of mouse and human homologous genes with differently positioned CpG islands relative to the transcription initiation site: the telomerase RNA gene and the adenosine deaminase (ADA) gene (Figure 3A). Figure 3B shows that, as in the case of the APRT gene, the pattern of protein–DNA interactions is strikingly different between the two organisms. The regions of protein–DNA interaction in the telomerase RNA genes exactly coincided with the minimal promoters defined previously by transient transfection experiments (Zhao et al., 1998) and, together with the APRT expression analysis (Figure 2), suggest that the regions defined by in vivo footprinting are required for the expression of these genes. A search of the TRANSFAC database with the sites containing reactive guanines showed that some of them corresponded to Sp1 and other well characterized transcriptional activators (see Supplementary data).
Approximately 50% of all CpG islands are associated with tissue‐specific genes (Antequera and Bird, 1993). To test whether the same promoter organization also applies to them, we decided to study the human α‐globin gene, which is associated with a non‐methylated CpG island in all tissues regardless of its expression (Bird et al., 1987). We footprinted a region 1.2 kb long encompassing the transcription initiation site in human K562 leukemia cells that actively express the α‐globin gene. Our results revealed extensive protein–DNA interactions across the entire 500 bp CpG‐rich region immediately upstream from the transcription initiation site (Figure 4), suggesting a similar organization of CpG island promoters at housekeeping and tissue‐specific genes.
An interesting point emerging from these results is whether there might be any promoter occupancy in cells where the α‐globin gene is not transcribed. We addressed this question by footprinting analysis in HeLa cells, in which transcription of the α‐globin gene is undetectable by northern blotting (Figure 5A). Surprisingly, except for a few minor differences, the pattern of protein–DNA interaction was identical to that found in K562 cells (Figure 4 and Supplementary data). This suggested that most of the factors bound in both cell types are probably ubiquitous. In view of these results, we reassessed the expression level by RT–PCR and found that the gene is transcribed at a low rate in HeLa cells (Figure 5B). To monitor how general this situation was in non‐erythroid cells, we also tested the expression of the α‐globin gene in the human embryonic kidney cell line 293, whose CpG island is also non‐methylated (data not shown). No mRNA was detected by northern analysis (Figure 5A) but a low level of expression comparable to that found in HeLa cells was detected by RT–PCR. In both cases, this level was approximately four orders of magnitude lower than in K562 cells (Figure 5B). In contrast, transcription of the β‐globin gene, which is not associated with a CpG island, was detected by RT–PCR only in K562 cells (Figure 5B).
The large differences in the level of α‐globin expression between erythroid and non‐erythroid cells and the similarity in the pattern of protein–DNA interactions raised the question of how relevant for transcription the factors bound to the promoter of the gene were. We addressed this question by looking for a cell line where the α‐globin CpG island was methylated. We have previously shown that this CpG island often becomes methylated in non‐erythroid cultured cells (Antequera et al., 1990). We tested the level of methylation in several human cell lines (data not shown) and found that this CpG island was methylated at virtually all CpGs in the lymphoblastoid cell line RPMI 8420 (Figure 4A). In vivo footprinting showed that methylation effectively prevented protein–DNA interactions since no differences in guanine sensitivity to dimethylsulfate (DMS) between naked DNA and nuclei were detected along the entire footprinted region (Figure 4B). No transcription was detected in these cells even by RT–PCR (Figure 5B), suggesting that the binding of factors to the gene upstream region is at least required for the low expression of the α‐globin gene.
The extensive binding of factors to the α‐globin promoter in non‐erythroid cells can explain the surprising ‘promiscuous expression’ of the human α‐globin gene when transfected into non‐erythroid cells (Whitelaw et al., 1989). That study failed to detect any sequence responsible for the erythroid‐specific expression within or close to the gene. Our findings are compatible with those observations and suggest that regulatory elements distant from the gene could be responsible for the enhancement in erythroid cells of the low constitutive transcription driven by ubiquitous factors in many other cell types. Our results do not preclude, however, the possibility that the subtle differences in protein–DNA interactions between K562 and HeLa cells, or the binding of other undetected factors, could be responsible for the differences in transcriptional activity.
The large majority of CpG islands are associated with the 5′ region of genes (Bird, 1987; Gardiner and Frommer, 1987), and previous analyses of several CpG island promoters have shown multiple binding sites for transcription factors (Stapleton et al., 1993; Tommasi and Pfeifer, 1997, 1999). However, the organization of regulatory elements within the frame of the CpG islands has not been addressed systematically. Our results indicate that, in some cases, the unmethylated and the CpG‐rich regions do not overlap exactly, suggesting that both features could have been generated by different mechanisms. The short unmethylated and CpG‐depleted regions at the 5′ boundary of some islands (Figures 1A, 3A and 4A) do not fulfill the definition of CpG islands in terms of G+C content and CpG frequency (Bird, 1986; Gardiner and Frommer, 1987) and we have not detected any factors bound to them. Our finding of the precise circumscription of the protein–DNA interactions between the 5′ boundary of the CpG‐rich region and the transcription initiation site will contribute to improving algorithms for identifying elements of transcriptional control in genomic sequences by reducing the ‘space’ of sequences to be analyzed. For example, phylogenetic footprinting of the mouse and human regulatory regions of skeletal muscle‐specific genes has shown that limitation of this ‘space’ dramatically reduces the noise generated by false predictions (Wasserman et al., 2000; Ohler and Niemann, 2001).
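The compositional definition of a CpG island referred to above can be made concrete with a short calculation. The following Python sketch computes the G+C content and the observed/expected CpG ratio of a sequence and checks them against commonly quoted thresholds (length of at least 200 bp, G+C content of at least 50% and observed/expected CpG of at least 0.6). The function names, the default thresholds and the example sequences are illustrative assumptions and are not taken from the analysis reported here.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * length) / (#C * #G)."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    # "CG" cannot overlap with itself, so str.count gives every CpG dinucleotide.
    cpg_count = seq.count("CG")
    return cpg_count * len(seq) / (c_count * g_count)


def meets_island_criteria(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """True if seq satisfies the assumed length, G+C and CpG-frequency thresholds."""
    return (
        len(seq) >= min_len
        and gc_content(seq) >= min_gc
        and cpg_obs_exp(seq) >= min_obs_exp
    )


if __name__ == "__main__":
    # Hypothetical sequences, for illustration only.
    cpg_rich = "CG" * 100
    at_rich = "AT" * 100
    print(gc_content(cpg_rich), cpg_obs_exp(cpg_rich), meets_island_criteria(cpg_rich))
    print(gc_content(at_rich), cpg_obs_exp(at_rich), meets_island_criteria(at_rich))
```

Applied in a sliding window along a genomic sequence, such a test would flag a short unmethylated but CpG‐depleted stretch as failing the compositional criteria even though it lies inside the unmethylated domain.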
Our results also provide direct evidence for the divergent evolution of cis‐regulatory regions in mammals. Previous comparative analysis of CpG island sequences between human and mouse APRT and telomerase RNA genes showed no significant homology between them (Broderick et al., 1987; Zhao et al., 1998). Comparison of the human and mouse ADA upstream gene regions revealed the presence of three conserved motifs (Ingolia et al., 1986) of which only two are bound by factors in vivo (see Supplementary data). The number and position of the remaining occupied sites are specific for each species.
Similar results to those for the α‐globin gene expression (Figure 5) have been described for the trkA proto‐oncogene, a tissue‐specific gene associated with a CpG island encoding the receptor for the nerve growth factor (Sacristán et al., 1999). While northern analysis detected transcription in brain and testis only, more sensitive in situ hybridization and RNase protection analysis revealed widespread expression of trkA in many non‐neuronal tissues (Lomen‐Hoerth and Shooter, 1995). These observations, together with the human α‐globin expression in non‐erythroid tissues, suggest that this scenario could apply to other tissue‐specific CpG island genes.
Multiple examples illustrate the notion that promoter divergence is a major source of the genetic variability that underlies evolution (Wang et al., 1999; Carroll, 2000). It has recently been reported that promoters can evolve in different species of Drosophila by stabilizing selection without any loss of specificity (Ludwig et al., 2000). One prediction of that study was that such a pattern of variation would be a common theme in cis‐regulatory evolution (Ludwig et al., 2000). Our data show that this prediction is correct to the extreme of generating highly divergent promoters at genes that encode the same protein in mammals.
DNA methylation analysis.
DNA was denatured and treated with sodium bisulfite and hydroquinone as previously described by Clark et al. (1994). Transformed DNA was purified through a Wizard DNA Clean‐Up System column (Promega) and 100 ng were amplified by two rounds of 35 cycles of PCR using nested oligonucleotides under the following conditions. First 10 cycles: 2 min at 94°C, 1 min at the annealing temperature (specific for each set of primers) and 3 min of extension at 68°C using the Expand High Fidelity PCR System (Roche). For the remaining 25 cycles, 15 s were used for denaturation and annealing and the time for extension was increased by 20 s per cycle. The resulting amplified fragments were electrophoresed, recovered from the gel and cloned directly in the pGEM‐T vector (Promega) for sequencing.
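Reading methylation from the sequenced clones follows from the chemistry of the treatment: bisulfite converts unmethylated cytosine to uracil, which is amplified as thymine, whereas 5‐methylcytosine is left as cytosine. The Python sketch below illustrates that comparison for CpG positions. It assumes a clone read that is already aligned, without gaps, to the top strand of the untreated reference; the function names and the example sequences are hypothetical.

```python
def call_cpg_methylation(reference, clone):
    """Classify each CpG cytosine of the reference from a bisulfite clone read.

    Returns a dict mapping the 0-based position of the cytosine to
    'methylated' (C retained), 'unmethylated' (C read as T) or
    'ambiguous' (any other base in the clone).
    """
    reference = reference.upper()
    clone = clone.upper()
    if len(reference) != len(clone):
        raise ValueError("reference and clone must be aligned to the same length")
    calls = {}
    for i in range(len(reference) - 1):
        if reference[i:i + 2] == "CG":
            base = clone[i]
            if base == "C":
                calls[i] = "methylated"
            elif base == "T":
                calls[i] = "unmethylated"
            else:
                calls[i] = "ambiguous"
    return calls


def methylated_fraction(calls):
    """Fraction of unambiguous CpG calls that are methylated (None if there are none)."""
    methylated = sum(1 for state in calls.values() if state == "methylated")
    unmethylated = sum(1 for state in calls.values() if state == "unmethylated")
    informative = methylated + unmethylated
    if informative == 0:
        return None
    return methylated / informative


if __name__ == "__main__":
    # Hypothetical 8-nt example: CpGs at positions 1 and 5 of the reference.
    reference = "ACGTTCGA"
    clone = "ACGTTTGA"
    calls = call_cpg_methylation(reference, clone)
    print(calls)
    print(methylated_fraction(calls))
```

Repeating the call over several independent clones gives a per‐CpG methylation profile from which the boundary of an unmethylated region can be read.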
In vivo footprinting.
Human K562, HeLa and 293 cells were grown in α‐MEM medium (BioWhittaker) with 10% fetal calf serum, 1% L‐glutamine and 1% penicillin–streptomycin. The RPMI 8420 cell line was grown in RPMI 1640 medium (BioWhittaker). Exponential cultures were treated with 0.1% DMS (Sigma) for 1–5 min at room temperature. The methylation reaction was stopped by the addition of 1% bovine serum albumin and 100 mM β‐mercaptoethanol made up in phosphate‐buffered saline. Cells were collected and total DNA was purified and cleaved with piperidine following the protocol described by Mueller et al. (1992). Amplification of the resulting fragments by LM–PCR was carried out as described in the same reference. The same treatment was applied to a sample of naked DNA to be used as control. Labelled fragment ladders were electrophoresed in 6% acrylamide sequencing gels and visualized by autoradiography.
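Footprints are read by comparing, guanine by guanine, the band intensity in the lane from DMS‐treated cells with that in the naked DNA control: a weaker band indicates protection and a stronger band indicates hypersensitivity. The Python sketch below shows one way such a comparison might be scored if band intensities were quantified. The normalization to the lane total, the twofold threshold, the function name and the example values are all assumptions made for illustration; the gels in this work were visualized by autoradiography and no such scoring is described.

```python
def classify_guanines(in_vivo, naked, threshold=2.0):
    """Compare band intensities at each guanine between two lanes.

    in_vivo, naked: dicts mapping guanine position -> band intensity.
    Each lane is normalized to its own total (an assumed loading correction),
    then the in vivo/naked ratio is compared with an assumed fold threshold.
    Returns a dict mapping position -> 'protected', 'hypersensitive',
    'unchanged' or 'undetermined' (a band with zero intensity in either lane).
    """
    if in_vivo.keys() != naked.keys():
        raise ValueError("both lanes must cover the same guanine positions")
    total_in_vivo = sum(in_vivo.values())
    total_naked = sum(naked.values())
    if total_in_vivo <= 0 or total_naked <= 0:
        raise ValueError("lane totals must be positive")

    calls = {}
    for position in sorted(in_vivo):
        signal = in_vivo[position] / total_in_vivo
        control = naked[position] / total_naked
        if signal == 0 or control == 0:
            calls[position] = "undetermined"
            continue
        ratio = signal / control
        if ratio <= 1 / threshold:
            calls[position] = "protected"
        elif ratio >= threshold:
            calls[position] = "hypersensitive"
        else:
            calls[position] = "unchanged"
    return calls


if __name__ == "__main__":
    # Hypothetical intensities at four guanines.
    naked = {1: 100, 2: 100, 3: 100, 4: 100}
    in_vivo = {1: 100, 2: 20, 3: 100, 4: 100}
    print(classify_guanines(in_vivo, naked))
```

Normalizing to the lane total is a crude choice: one very strong band shifts every other ratio in the lane, so in practice a set of reference bands outside the footprinted region may be a better denominator.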
Cell transfection and expression analysis.
The three fragments of 580, 430 and 112 bp from the human APRT gene were cloned into the SalI site of the pBLCAT2 plasmid (Luckow and Schütz, 1987) immediately upstream from a minimal thymidine kinase promoter driving the CAT gene. Constructs were sequenced before transfection. Exponentially growing HeLa cells were transfected by the calcium phosphate method as described by Sambrook et al. (1989). Cells were cotransfected with a plasmid expressing the LacZ gene driven by the CMV promoter as a transfection control. CAT activity was measured using a CAT‐ELISA kit (Roche).
The sequences and localization of all the oligonucleotides used in this work (>100 oligonucleotides) are available upon request.
Supplementary Figure 1
Supplementary Figure 2
We thank Dionisio Martín Zanca for advice on the cell transfection analysis and helpful criticism of the manuscript. We are also grateful to Mauro Giacca and Gulnara Abdurashidova for advice on the footprinting analysis. M.C. was a recipient of a postgraduate fellowship from the Ministerio de Educación y Cultura. This work was funded by the Spanish CICYT and the European Union FEDER programme.
- Copyright © 2001 European Molecular Biology Organization