C2H2 zinc‐finger proteins (ZFPs) constitute the largest family of nucleic acid binding factors in higher eukaryotes. In silico analysis identified a total of 326 putative ZFP genes in the Drosophila genome, corresponding to ∼2.3% of the annotated genes. Approximately 29% of the Drosophila ZFPs are evolutionarily conserved in humans and/or Caenorhabditis elegans. In addition, ∼28% of the ZFPs contain an N‐terminal zinc‐finger‐associated C4DM domain (ZAD) consisting of ∼75 amino acid residues. The ZAD is restricted to ZFPs of dipteran and closely related insects. The evolutionary restriction, an expansion of ZAD‐containing ZFP genes in the Drosophila genome and their clustering at a few chromosomal sites are features reminiscent of vertebrate KRAB‐ZFPs. ZADs are likely to represent protein–protein interaction domains. We propose that ZAD‐containing ZFP genes participate in transcriptional regulation either directly or through site‐specific modification and/or regulation of chromatin.
C2H2 zinc‐finger (ZF) motifs, which represent the most abundant nucleic acid binding motif in higher eukaryotes (Rubin et al., 2000; Lander et al., 2001; Venter et al., 2001), are found in RNA‐binding proteins (Joho et al., 1990), transcription factors (Rosenberg et al., 1986; Stanojevic et al., 1989) and chromatin components (Reuter et al., 1990). Lineage‐specific subgroups of ZF proteins (ZFPs) can be found in the genomes of Saccharomyces cerevisiae (Böhm et al., 1997), Arabidopsis thaliana (Riechmann et al., 2000), Caenorhabditis elegans (Chervitz et al., 1998), Drosophila melanogaster (Rubin et al., 2000) and Homo sapiens (Lander et al., 2001; Venter et al., 2001) and are especially expanded in the higher eukaryotic species. In humans, this expansion includes ZFPs that contain the evolutionarily conserved BTB/POZ domains or SCAN and KRAB domains (reviewed by Collins et al., 2001), which are restricted to vertebrates (Lander et al., 2001). No corresponding expansion of ZFPs has been observed in the C. elegans genome. In Drosophila, as in humans, ZFPs were found to be associated with BTB/POZ domains and with a recently identified, but uncharacterized, C4DM domain (Lander et al., 2001; Lespinet et al., 2002). Here, we report a detailed in silico analysis of ZFPs in the Drosophila genome, showing that the C4DM domain is an N‐terminal protein structure that is almost exclusively found in association with ZFPs. This ZF‐associated C4DM domain (ZAD) characterizes the single largest subfamily of mostly clustered Drosophila ZFP genes and appears to be restricted to dipteran and closely related insect genomes.
Results and discussion
Characterization of ZFPs in the Drosophila genome
We identified a total of 326 C2H2 ZFP genes in the genome of Drosophila. This estimate differs from the previously published numbers of, for example, 352 (Rubin et al., 2000) or 234 (Venter et al., 2001). We propose that our estimate provides the most accurate assessment yet, as we did not rely solely on in silico methods but also performed a manual inspection of all identified ZF motifs (see Methods). Of all putative Drosophila ZFPs, 94 (∼29%) are conserved in humans and/or C. elegans, an assignment based on the arrangement and sequence of the ZFs as well as sequence similarities outside the ZF domains of the proteins. The remaining 232 ZFPs appear to be Drosophila‐specific or restricted to the insect lineage. The identified Drosophila ZFP genes and their chromosomal distribution are summarized in Table 1 (see also Supplementary data available at EMBO reports Online).
In order to place the 232 lineage‐specific ZFPs into subgroups, we probed for associated protein motifs. We found 13 ZFPs (∼4%) containing a BTB/POZ domain, a combination that has also been observed in human ZFPs (Lander et al., 2001). In 91 ZFPs (∼28%; Table 1), we identified an N‐terminal domain of >70 amino acids. This domain, which defines the single largest subfamily of Drosophila ZFPs, has recently been described as a C4DM domain (Lander et al., 2001; Lespinet et al., 2002). For simplicity and to emphasize its association with ZFs, we refer to this motif as ZAD. In all but two cases, the domain is ZFP‐associated. The coding sequence of one of the two ZADs that are not associated with a ZFP coding region is found immediately upstream of a ZAD‐ZFP‐encoding gene (CG4639) and was not previously annotated. Thus, it is possible that this ZAD is included in an as‐yet‐unidentified splice variant of CG4639. The second ZFP‐unrelated ZAD, encoded by CG11371, is highly diverged and is part of a protein lacking any other significant protein domain or motif.
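The proportions quoted for the ZFP subgroups follow directly from the reported counts. As a quick arithmetic check (the variable names are ours, chosen only for illustration):

```python
# Counts taken from the text; names are illustrative only.
total_zfps = 326
counts = {"conserved": 94, "BTB/POZ": 13, "ZAD": 91}

# Percentage of all ZFPs in each category, rounded to the nearest integer.
percent = {name: round(100 * n / total_zfps) for name, n in counts.items()}

# ZFPs that are not conserved in humans and/or C. elegans.
lineage_specific = total_zfps - counts["conserved"]

print(percent)
print(lineage_specific)
```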
At present, mutant alleles have been identified for only three ZAD‐containing ZFP genes. These are deformed wing/zeste‐white5 (dwg/zw5; Fahmy and Fahmy, 1959), grauzone (Schupbach and Wieschaus, 1989) and Serendipity δ (Payre et al., 1990). The functional characterization of these genes, as well as the results of biochemical studies, suggests that ZAD‐containing ZFPs are involved in transcriptional control. dwg/zw5 encodes a site‐specific DNA‐binding ZFP that promotes the formation of insulator complexes (Gaszner et al., 1999), whereas grauzone and Serendipity δ encode transcription factors implicated in the activation of the genes cortex (Chen et al., 2000; Harms et al., 2000) and bicoid (Payre et al., 1994), respectively. One additional ZAD‐containing ZFP, termed DIP1, contains only a single ZF motif and can associate with the NFκB homologue Dorsal (Bhaskar et al., 2000).
ZAD‐encoding sequences were found in the available ESTs of the dipterans Drosophila spp., Anopheles gambiae and Aedes aegypti, the hymenopteran Apis mellifera and the lepidopteran Bombyx mori (see Supplementary data). In contrast, not a single ZAD‐encoding EST has been identified among over 7 million vertebrate samples (see Supplementary data) or in non‐insect invertebrates such as C. elegans. These observations suggest that the ZAD is restricted to insects and emerged during their evolution.
Classification of the ZAD
ZADs vary in length between 71 and 97 amino acid residues. A multiple sequence alignment of a representative subset of 32 ZADs (Figure 1; for a complete alignment of the Drosophila ZADs, see Supplementary data) shows that the domain consists of four conserved sequence blocks (blocks 1–4), which are linked by three variable regions (r1–r3) of different lengths (Figure 1). The most striking feature of ZADs is the occurrence of two invariant cysteine pairs in blocks 1 and 4, suggesting that they may coordinate the binding of a zinc ion to stabilize a distinct fold of the domain.
Secondary structure analysis predicts that the variable regions 1–3, which contain preferentially small and polar amino acid residues (Figure 1), represent turns or unstructured spacers, whereas the conserved blocks 1–4 form β1β2α1β3α2‐folds (with strong predictions except for β2; see Supplementary data), which are likely to represent the core of the ZAD structure (Figure 1). Within each of the blocks 1–4, most conserved amino acid residues are hydrophobic; the few exceptions include a highly conserved arginine residue (position 4; Figure 1) located between the cysteines of block 1. The importance of this conserved arginine residue, and of the domain itself, is supported by the finding that a point mutation that results in an arginine‐to‐glycine replacement in the dwg/zw5 protein causes a lethal phenotype (Gaszner et al., 1999). Furthermore, a point mutation in Serendipity δ that results in a tyrosine replacement of the second invariant cysteine of block 1 also causes a lethal phenotype (Crozatier et al., 1992). These observations suggest that the core structure of the ZAD carries an essential function, at least in the case of Serendipity δ and dwg/zw5. Mutational analysis combined with biochemical studies showed that the ZAD‐like domain of Serendipity δ functions as a protein–protein interaction domain (Payre et al., 1997), a function that has been proposed for the ZAD of the dwg/zw5 protein as well (Gaszner et al., 1999). The experimental data therefore support the proposal that ZADs represent or contain protein–protein interaction surfaces that, with the possible exception of two out of 93 cases, are combined with arrays of putative DNA‐binding ZFs.
Intron‐based classification and clustering of ZAD‐bearing ZFP genes
ZAD‐containing ZFPs are not randomly distributed throughout the genome. The X chromosome, both arms of the second chromosome and the left arm of the third chromosome each contain between 10 and 14 ZAD‐containing ZFP genes, whereas the right arm of the third chromosome contains 44 family members (Table 2). Furthermore, nearly half of the ZAD‐containing ZFP genes (41 of 91; see Supplementary data) are found in gene clusters (see below).
Based on the intron structure of the primary transcripts, ZADs can be divided into two large subsets. A total of 38 ZADs are encoded by a single exon (subset 1), whereas the open reading frames of 53 ZADs are split by an intron located in a conserved position in block 3 between the β‐strand and the α‐helix (subset 2). We could further place 45 ZAD‐coding sequences into 10 sequence‐related subgroups (Table 2). Eight of these are distributed in a chromosome‐specific manner, and each subgroup consists of either subset 1 or subset 2 ZADs. Another interesting finding is that members of most subgroups represent clustered genes (27 of 45; Table 2) and that their sequence similarity includes not only the ZADs but also the associated array of ZFs.
A comparative tree (see Supplementary data) containing all 91 ZADs of Drosophila and 71 newly identified ZAD‐containing ZFPs encoded by the A. gambiae genome shows that the members of the 10 subgroups occupy neighbouring positions in the tree and that in most cases the ZADs of the two species are located on distinct branches. In only a few instances are direct neighbours in the tree derived from the two species. This indicates that the majority of the ZADs of both species underwent species‐specific expansions. In Drosophila, these findings suggest that (i) the duplication events occurred after the intron‐containing ZADs had separated from those lacking the intron and (ii) the expansion and clustering of the ZAD‐containing ZFPs involved multiple local duplication events of the ancestral founder genes.
To examine whether both individual and clustered ZAD‐containing ZFP genes are transcribed, we searched for ESTs corresponding to the individual transcripts (Table 2). We found 466 ESTs corresponding to 71 ZAD‐coding sequences, implying that the majority of ZAD‐containing ZFP genes are transcribed. The remaining 20 ZAD sequences, for which no ESTs could be identified, may be expressed at very low levels and/or in only a few cells, or may represent non‐functional pseudogenes.
Enrichment of lineage‐specific ZAD‐containing ZFPs and their clustering at distinct chromosomal locations suggest a recent expansion of this ZFP subfamily. An analogous lineage‐specific expansion of transcription factors has been observed for nuclear hormone receptors in the C. elegans genome (Ruvkun and Hobert, 1998; Sluder and Maina, 2001) and KRAB‐containing ZFPs in humans (Lander et al., 2001). The finding that most ZAD‐containing ZFPs are expressed suggests that the expansion was accompanied either by the stabilization of partially redundant functions of the newly generated transcription units or by their adoption of novel functions that were subsequently maintained. Alternatively, the expansion has occurred only very recently in the evolutionary history of Drosophila. If so, most members of the sequence‐related subgroups may still carry largely redundant functions, explaining why the majority of the Drosophila ZAD‐containing ZFPs have escaped functional detection by mutagenesis screens (e.g. Nüsslein‐Volhard and Wieschaus, 1980; Spradling et al., 1999; Peter et al., 2002). This explanation would also be consistent with the finding that most ZAD‐coding sequences of A. gambiae show only modest sequence similarity with their Drosophila counterparts (see Supplementary data). Expanded ZAD‐containing ZFPs could therefore provide an important source for the emergence of novel protein–protein and/or protein–DNA interactions that contribute to a species‐specific regulatory diversity in the control of transcription and/or chromatin structure and function. Since the ZAD and the analogous KRAB domain participate in a lineage‐specific expansion of ZFPs in insect and vertebrate genomes, respectively, the results described here may constitute an example of convergent evolution at the level of transcriptional regulation, the significance of which remains to be addressed experimentally.
Methods
Identification of C2H2 ZFPs and ZFP‐associated protein motifs in the Drosophila genome.
In order to identify C2H2 ZFPs in the Drosophila proteome (GadFly release 2), we used the Pfam domain PF00096 (Bateman et al., 2002) and the Pfam search tool. As a threshold, we assigned a minimal score of 0.0. The identified ZF motifs were subsequently inspected manually to eliminate false positives. This was done by checking for overlaps with other protein motifs in Pfam or SMART (Letunic et al., 2002); putative C2H2 motifs that overlapped other, more significant hits to protein domains or motifs were eliminated. The identified ZFPs were analysed with Pfam and SMART to find additional domains.
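The filtering logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used, which relied on manual inspection: the `Hit` record, the function names and the toy hits are hypothetical, and "more significant" is approximated here by a higher score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one domain hit on a protein."""
    protein: str   # protein identifier
    domain: str    # domain label, e.g. "zf-C2H2" for PF00096
    start: int     # first residue of the hit (1-based, inclusive)
    end: int       # last residue of the hit (1-based, inclusive)
    score: float   # hit score; assumed here that higher means more significant


def overlaps(a: Hit, b: Hit) -> bool:
    """True if both hits are on the same protein and share at least one residue."""
    return a.protein == b.protein and a.start <= b.end and b.start <= a.end


def filter_zf_hits(hits, zf_domain="zf-C2H2", min_score=0.0):
    """Keep ZF hits scoring at least min_score that are not overlapped
    by a higher-scoring hit to a different domain."""
    kept = []
    for hit in hits:
        if hit.domain != zf_domain or hit.score < min_score:
            continue
        masked = any(
            other.domain != zf_domain
            and other.score > hit.score
            and overlaps(hit, other)
            for other in hits
        )
        if not masked:
            kept.append(hit)
    return kept


# Toy data, invented purely for illustration.
demo_hits = [
    Hit("protA", "zf-C2H2", 10, 32, 25.0),
    Hit("protA", "zf-C2H2", 40, 62, -3.0),      # below the 0.0 threshold
    Hit("protB", "zf-C2H2", 5, 27, 4.0),        # overlapped by a stronger hit
    Hit("protB", "other_domain", 1, 60, 50.0),
]
demo_kept = filter_zf_hits(demo_hits)
print([(h.protein, h.start, h.end) for h in demo_kept])
```

An automated filter of this kind can only pre-screen candidates; borderline motifs would still require the manual inspection described in the text.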
Profile construction and searches with the ZAD.
An initial ClustalW 1.81 (Thompson et al., 1994) alignment of the identified ZADs was used to construct a profile hidden Markov model (HMM) using the HMMER package 2.1.1 (Eddy, 1998). We performed a search against the genomic regions of the identified ZFPs using the Wise package 2.2.0 (Birney et al., 1996). The genomic structure of the identified ZAD‐containing ZFPs was determined (if possible) using the Gene2EST package (Gemünd et al., 2001) in combination with BLAST 2.2.2 (Altschul et al., 1997). The verified ZAD protein sequences were aligned using ClustalW 1.81. This alignment was used to construct an enhanced profile HMM, with which we performed searches against the publicly available EST database (NCBI dbEST, downloaded May 2002) and the set of all annotated fly proteins (GadFly release 2). All searches against nucleotide databases were performed using the Wise 2.2.0 package; searches against protein databases were performed using the HMMER 2.1.1 package.
Classification of ZADs into subgroups and tree construction.
To subgroup the ZADs, we calculated a distance matrix with PROT‐DIST of the Phylip 3.5c package (Felsenstein, 1993) from a multiple sequence alignment. The resulting distance matrix was used to construct a tree using the neighbour‐joining algorithm provided by Neighbor (Phylip). Sequence‐related subgroups were defined by two criteria: (i) all members of a subgroup form a distinct branch of the tree that contains no non‐members; and (ii) the average distance between all members plus the standard deviation is smaller than the average distance to all non‐members (for subgroups containing only two members, the maximal distance between them was arbitrarily set to 1.4).
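The distance‐based criterion (ii) can be expressed compactly in code. The following Python sketch is one possible reading of it, not the original implementation: the function names and the toy distances are hypothetical, the population standard deviation is assumed, and for two‐member subgroups the 1.4 cutoff is assumed to replace the standard‐deviation test. Criterion (i), the branch test on the tree, is not covered.

```python
from itertools import combinations
from statistics import mean, pstdev


def symmetric(pairs):
    """Expand {(a, b): d} into a nested dict with dist[a][b] == dist[b][a]."""
    dist = {}
    for (a, b), d in pairs.items():
        dist.setdefault(a, {})[b] = d
        dist.setdefault(b, {})[a] = d
    return dist


def passes_distance_criterion(dist, members, all_ids, pair_cutoff=1.4):
    """Check criterion (ii) for a candidate subgroup.

    Assumptions: population standard deviation; for exactly two members
    only the pair cutoff is applied.
    """
    members = list(members)
    if len(members) < 2:
        raise ValueError("a subgroup needs at least two members")
    within = [dist[a][b] for a, b in combinations(members, 2)]
    if len(members) == 2:
        return within[0] <= pair_cutoff
    non_members = [x for x in all_ids if x not in members]
    if not non_members:
        raise ValueError("at least one non-member is required")
    between = [dist[m][n] for m in members for n in non_members]
    return mean(within) + pstdev(within) < mean(between)


# Toy distances, invented purely for illustration.
toy = symmetric({
    ("a", "b"): 0.20, ("a", "c"): 0.30, ("b", "c"): 0.25,
    ("a", "d"): 1.50, ("b", "d"): 1.60, ("c", "d"): 1.60,
    ("a", "e"): 1.80, ("b", "e"): 1.70, ("c", "e"): 1.90,
    ("d", "e"): 1.00,
})
ids = ["a", "b", "c", "d", "e"]
print(passes_distance_criterion(toy, ["a", "b", "c"], ids))
print(passes_distance_criterion(toy, ["a", "b", "d"], ids))
```

In the toy matrix, the tight group a, b, c passes, whereas a group that mixes in the distant sequence d fails because its internal spread exceeds its separation from the remaining sequences.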
We used the ZAD‐HMM in conjunction with Wise 2.2.0 (as described above) to identify ZAD or ZAD‐like motifs in the genomic sequences of A. gambiae ZFPs extracted from EnsEMBL 8.1b.1 (Hubbard et al., 2002). The identified A. gambiae ZADs and the Drosophila ZADs were aligned and a tree was constructed as described above.
Secondary structure prediction.
Secondary structure prediction was carried out with ALB (Ptitsyn and Finkelstein, 1989). A consensus prediction was calculated from the prediction of all ZADs in all alignment positions which have <10% gaps. The secondary structure prediction was verified using PHD (Rost, 1996). Since the predictions of the two programs did not differ significantly, we show the result obtained by ALB.
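A consensus over alignment columns with <10% gaps could be computed along the following lines. This Python sketch is illustrative only: the text does not state how the consensus was derived, so a simple majority vote per column is assumed, and the function name, the state labels (H, E) and the toy input are hypothetical.

```python
from collections import Counter


def consensus_structure(alignment, predictions, max_gap_fraction=0.10):
    """Majority-vote consensus of per-sequence secondary structure predictions.

    alignment:   aligned sequences of equal length, with '-' for gaps
    predictions: one state string per sequence, aligned to the same columns
    Columns whose gap fraction is not below max_gap_fraction are reported as '-'.
    Ties are resolved in favour of the state encountered first.
    """
    n_seqs = len(alignment)
    consensus = []
    for col in range(len(alignment[0])):
        gaps = sum(1 for seq in alignment if seq[col] == "-")
        if gaps / n_seqs >= max_gap_fraction:
            consensus.append("-")
            continue
        # Count predicted states only for sequences with a residue in this column.
        states = Counter(
            pred[col]
            for seq, pred in zip(alignment, predictions)
            if seq[col] != "-"
        )
        consensus.append(states.most_common(1)[0][0])
    return "".join(consensus)


# Toy input, invented purely for illustration.
toy_alignment = ["MKC-", "MKCA", "MRCA", "MKCA"]
toy_predictions = ["EEH-", "EEHH", "EHHH", "EEHH"]
toy_consensus = consensus_structure(toy_alignment, toy_predictions)
print(toy_consensus)
```

With only four toy sequences, a single gap already exceeds the 10% limit, so the last column is masked; with the full set of ZADs the same threshold tolerates a few gapped sequences per column.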
Supplementary data are available at EMBO reports Online.
We thank our colleagues for help and critical discussions. This work was supported by the German Human Genome Project (grant 01 KW 9632/9; to H.J.). H.‐R.C. thanks the Boehringer Ingelheim Fonds for a predoctoral fellowship.
Copyright © 2002 European Molecular Biology Organization