In the genetic code, the UGA codon has a dual function as it encodes selenocysteine (Sec) and serves as a stop signal. However, only the translation terminator function is used in gene annotation programs, resulting in misannotation of selenoprotein genes. Here, we applied two independent bioinformatics approaches to characterize a selenoprotein set in prokaryotic genomes. One method searched for selenoprotein genes by identifying RNA stem–loop structures, selenocysteine insertion sequence elements; the second approach identified Sec/Cys pairs in homologous sequences. These analyses identified all or almost all selenoproteins in completely sequenced bacterial and archaeal genomes and provided a view on the distribution and composition of prokaryotic selenoproteomes. In addition, lineage‐specific and core selenoproteins were detected, which provided insights into the mechanisms of selenoprotein evolution. Characterization of selenoproteomes allows interpretation of other UGA codons in completed genomes of prokaryotes as terminators, addressing the UGA dual‐function problem.
In the genetic code, the codon UGA differs from other codons in that it has a dual function: although it most often terminates protein synthesis, it also encodes the 21st amino acid in protein, selenocysteine (Sec) (reviewed in Low & Berry, 1996; Bock, 2000; Hatfield & Gladyshev, 2002). Available gene annotation programs interpret UGA as a stop codon, resulting in misannotation or completely missing selenoprotein genes. This problem is particularly serious for prokaryotic genomes as they currently account for >90% of completely sequenced genomes. Yet, neither which prokaryotes contain selenoproteins nor the number of selenoprotein genes in any of the prokaryotic genomes is known. Recent identification of the 22nd amino acid, pyrrolysine, which is encoded by UAG codon in several methanogenic organisms (Srinivasan et al, 2002), suggests that misannotation of genes due to dual functions of termination signals is not limited to Sec.
To insert Sec at UGA codons, selenoprotein genes evolved an RNA stem–loop structure, designated Sec insertion sequence (SECIS) element. SECIS elements are present immediately downstream of Sec UGA codons in bacteria (Zinoni et al, 1990; Huttenhofer et al, 1996; Liu et al, 1998) and in 3′ untranslated regions (UTRs) in archaea (Wilting et al, 1997) and eukaryotes (Berry et al, 1991; Walczak et al, 1996). The bacterial, archaeal and eukaryotic SECIS elements have no similarities to each other with regard to sequence and structure. The eukaryotic SECIS element consensus has been well characterized, which allowed identification of these structures in sequence databases (Kryukov et al, 1999; Lescure et al, 1999; Castellano et al, 2001; Martin‐Romero et al, 2001), including mammalian genomes (Kryukov et al, 2003). However, conservation of bacterial SECIS elements has been insufficient for their computational description, which precluded identification of bacterial selenoprotein genes by searching for the stem–loop structure.
Conversely, SECIS elements in known selenoprotein genes in archaea exhibited significant conservation (Wilting et al, 1997), suggesting that searches for these stem–loop structures might be useful in identifying archaeal selenoprotein genes. Previous homology screens and manual analyses of selected genomic regions identified seven selenoprotein genes in Methanococcus jannaschii (Wilting et al, 1997) and, subsequently, Sec‐specific translation elongation factors were identified in Methanococcus and Methanopyrus species (Rother et al, 2000, 2001a, 2001b).
In the present work, we applied independent bioinformatics approaches to identify entire selenoprotein sets in completed bacterial and archaeal genomes.
Results And Discussion
SECISearch‐based identification of selenoproteins
Selenoprotein homologues of known archaeal selenoprotein genes (Wilting et al, 1997) were compiled and their SECIS elements were extracted and structurally aligned (Fig 1). This procedure revealed conserved SECIS regions and resulted in an archaeal SECIS element consensus (Fig 2, right structure). The consensus was very similar to that previously reported (Wilting et al, 1997), although the primary sequence conservation was limited to the unpaired GAA_A region. We identified conserved structural features of these structures, as shown in Fig 2, and developed a computer program, archaeal SECISearch, to recognize an archaeal consensus SECIS element in sequence databases. This three‐step program searched for primary sequence, secondary structure and free energy criteria of the predicted stem–loop structures. Additional ‘fine structural criteria’ were also applied that required the presence of a conserved GAA_A bulge in the predicted structure and removed candidate sequences with predicted Y‐shaped structures. A total of 14 completely sequenced archaeal genomes (all that were available at the time of the searches) were analysed with this program (Table 1). Subsequently, sequences flanking the predicted SECIS elements were analysed for the occurrence of open reading frames (ORFs).
Interestingly, only M. jannaschii and Methanopyrus kandleri had selenoprotein genes (Table 1). Consistent with this finding, known Sec insertion machinery genes, selenophosphate synthetase (SPS) and a Sec‐specific elongation factor SelB, were detected in M. jannaschii and M. kandleri, but not in other archaeal genomes. The SECIS‐based analysis revealed eight selenoprotein genes in M. jannaschii and seven in M. kandleri genomes, with only one false positive and no false negatives in 14 analysed genomes. Of these 15 selenoprotein genes, 13 contained SECIS elements in the 3′‐UTRs, and one SECIS element in each organism was found in the 5′‐UTR, the location not observed in eukaryotic and bacterial selenoprotein genes. Although incorrectly annotated in genome databases, 14 archaeal selenoproteins were homologues of previously identified selenoproteins (Wilting et al, 1997). The 15th archaeal selenoprotein was a new selenoprotein with distant homology to HesB protein (Figs 2, 3). This ORF was entirely missing in the M. jannaschii genome annotation.
SECIS‐independent identification of selenoproteins
We also applied a SECIS‐independent method for identification of selenoprotein genes that searched for Sec/Cys pairs in homologous sequences. It took advantage of the fact that all known prokaryotic selenoproteins except one (selenoprotein A) had homologues in NCBI non‐redundant and/or microbial databases that inserted Cys in the place of Sec. This method was applied to both bacterial and archaeal genomes that could contain selenoprotein genes as follows (Fig 4). A database of all available prokaryotic genomes that coded for selenoprotein genes was developed, which was composed of 12 completely and 33 incompletely sequenced bacterial genomes and two completely sequenced archaeal genomes (M. jannaschii and M. kandleri) that possessed components of the Sec insertion machinery (archaeal and bacterial searches were combined in this experiment because sets of known selenoproteins in these organisms showed significant overlap). We also constructed a prokaryotic protein database, which included 227,930 protein sequences from all available annotated prokaryotic genomes. This protein database was searched against the selenoprotein genome database with tblastn to identify local alignments, in which TGA in a genomic sequence corresponded to cysteine in a protein query and the corresponding Sec/Cys flanking sequences showed significant homology. The identified TGA‐containing sequences were subjected to computational filters (tests for the presence of upstream in‐frame stop signals preceding translation initiation signals and superior blastx and rpsblast hits in different frames than the TGA codon; see Methods for details) and clustered, pseudogenes were removed, and the resulting candidate selenoprotein genes were screened for homologues against NCBI microbial and NR databases. To decrease the possibility that some of the hits were due to sequence errors that replaced TGC or TGT codons with TGA, an ORF containing an in‐frame TGA was considered a selenoprotein gene if at least two different sequences in two distant genomes were found that conserved this TGA and in addition at least two corresponding Cys‐containing homologues were detected. Protein families in which all homologues conserve Sec would be missed by this approach, but such families appear to be extremely rare (currently, only one family (the selenoprotein A family) conserves Sec in 100% of sequences).
Analysis of the archaeal subset of identified sequences revealed the same set of seven and eight archaeal selenoprotein genes in the M. kandleri and M. jannaschii genomes (Table 2), with no other archaeal selenoprotein candidates or false‐positive sequences. Thus, both SECIS‐based and Sec/Cys homology approaches, being independent methods, were efficient in identifying selenoprotein genes in archaeal genomes. Analysis of the bacterial hits revealed ten previously known and five new selenoprotein families (Fig 4, Table 2 and supplementary Figs S1–S5 online). In addition, candidate selenoproteins, each of which was found only in one bacterial genome, were identified (Fig 4, Table 2 and supplementary Figs S6–S13 online). Potential bacterial SECIS structures were identified downstream of TGA in known, new and candidate selenoprotein genes (supplementary Fig S14 online). Only one previously known selenoprotein, selenoprotein A of the glycine reductase complex, was not detected, because no Cys‐containing homologues were found for this protein. Thus, our ability to complete the analyses of genomic sequences with an excellent true‐positive rate suggested that we identified all or almost all selenoprotein genes in completely sequenced prokaryotic genomes.
Our data allowed a view on entire selenoproteomes in completed bacterial and archaeal genomes. Although selenoproteins were present in all three major domains of life, only ∼20% of completed bacterial and ∼14% of completed archaeal genomes had them. In addition, the number of selenoproteins in a genome varied from one to more than ten, with Carboxydothermus hydrogenoformans, Eubacterium acidaminophilum, Geobacter metallireducens, Geobacter sulfurreducens and M. jannaschii having the largest numbers of selenoproteins among partially and completely sequenced genomes. Analysis of the composition of selenoproteomes revealed that most selenoproteins were redox proteins, which used Sec either to coordinate a redox‐active metal (molybdenum, nickel or tungsten) or in Sec: thiol redox catalysis. Interestingly, in contrast to mammals, where new selenoproteins identified through SECIS elements represent a significant fraction of selenoproteomes (Kryukov et al, 2003), the data suggested that in many prokaryotes, entire selenoprotein sets were already known before our study. For example, seven out of eight selenoproteins in M. jannaschii (Wilting et al, 1997) and two out of two selenoproteins in Haemophilus influenzae (Wilting et al, 1998) were previously identified.
The presence of selenoproteins that were found in a small number of genomes (mostly homologues of thiol‐dependent oxidoreductases) (Table 2) contrasted with the broad occurrence of their Cys homologues. This observation suggested that these selenoproteins evolved recently, probably from Cys‐containing proteins. Conversely, formate dehydrogenases, which were present in most genomes with functional Sec insertion systems (supplementary Fig S1 online), appeared to be of ancient origin. In addition, bacterial formate dehydrogenase genes often clustered with Sec insertion machinery genes (data not shown), suggesting correlated expression. Thus, both a lineage‐specific expansion (recent origin) and the presence of core selenoproteins (ancient origin) contribute to the composition of selenoproteomes. Our data suggest that as new genomic sequences become available, additional examples of lineage‐specific selenoproteins are likely to emerge.
In conclusion, we identified genes encoding selenocysteine‐containing proteins in completely sequenced genomes of archaea and bacteria. This essentially solved the UGA dual‐function gene prediction problem in prokaryotes, as other in‐frame UGA codons may now be assigned a terminator function. Further systematic characterization of selenoprotein functions should reveal a full set of biological processes that are dependent on the trace element selenium.
Databases and resources. Completely and incompletely sequenced genomes of archaea and bacteria were obtained from the NCBI data repository (ftp://ftp.ncbi.nlm.nih.gov/genomes) (Wheeler et al, 2003) and TIGR, and blast programs were obtained from NCBI (http://www.ncbi.nlm.nih.gov/BLAST) (Altschul et al, 1990). The searches were performed on a Prairiefire 256‐processor Beowulf cluster supercomputer at the Research Computing Facility of the University of Nebraska‐Lincoln.
SECIS‐based identification of selenoprotein genes in archaeal genomes. Archaeal SECISearch contains three modules. The first module is based on the PatScan program (http://www‐unix.mcs.anl.gov/compbio/PatScan/HTML/patscan.html) (Dsouza et al, 1997) and searches for RNA structures that match the archaeal SECIS element primary sequence and the secondary structure consensus. The second module, based on the RNAfold program from Vienna RNA package (http://www.tbi.univie.ac.at/~ivo/RNA) (Hofacker et al, 1994), predicts secondary structure and calculates free energy for the entire putative SECIS element. The predicted bulge‐forming GAA_A nucleotides are constrained to be unpaired. Predicted RNA structures, the calculated free energies of which are above the −16 kcal/mol threshold determined from the analysis of known archaeal SECIS elements, are excluded from further analysis. The third module of archaeal SECISearch filters out structures that are either Y‐shaped or do not have a pronounced GAA_A bulge (nucleotide pairs exactly ‘above’ and at most one nucleotide ‘below’ the GAA_A bulge are required to be present in the SECIS element stem).
Identification of selenoprotein genes by searching for Sec/Cys pars in homologous sequences (SECIS‐independent method) in prokaryotic genomes. Identification of Sec/Cys pairs in homologous sequences: The protein database was constructed by automated extraction of bacterial and archaeal protein sequences from the NCBI genome data repository. This database had 227,930 unique ORFs, which at the time of analysis represented nearly all predicted protein sequences in completely sequenced prokaryotic genomes. To identify prokaryotic genomes that encode Sec‐containing proteins, the NCBI microbial genome database was searched for homologues of known protein components of Sec insertion machinery (Sec synthase, SPS and Sec‐specific elongation factor). We identified 12 completely and 33 partly sequenced bacterial and two completely sequenced archaeal genomes, in which at least one Sec insertion machinery gene was detected. These 47 genomes were extracted and combined to generate a selenoprotein genome database.
The protein database was searched against a selenoprotein genome database (47 genomes, ∼200 Mb) with the NCBI‐tblastn program. In this search, the threshold for extending hits was set to 11 and the expectation value was cut off to 10. The resulting data set was analysed for the presence of local alignments, in which cysteine in a protein query from a protein database corresponded to TGA in a translated nucleotide sequence from the selenoprotein genome database. If at least one such alignment with an expectation value below 10−3 or at least two alignments with expectation values below 1 were identified for a particular TGA codon, the corresponding TGA‐containing sequence was chosen for further analysis. We identified 9,315 such local alignments.
Analysis of TGA‐flanking sequences: In each of the TGA‐containing sequences from 9,315 alignments, a region upstream of the TGA was analysed for the presence of in‐frame start and stop codons. If a stop codon (TGA, TAA or TAG) occurred closer to the TGA codon than an appropriate start codon (ATG or GTG), such sequences were discarded. For each remaining sequence, a 1 kb TGA‐flanking region (500 nt–TGA–500 nt) was searched against the protein database with NCBI‐blastx program. If the best hit that covered the TGA codon with at least a seven‐nucleotide overlap was in a different frame from the TGA, the corresponding sequence was filtered out. The 1 kb TGA‐flanking regions (500 nt–TGA–500 nt) were then translated in all three possible ORFs and searched using rpsblast against an NCBI collection of known conserved domains (ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd). If the best hit that covered the TGA codon with at least six amino‐acid residue overlap was not in the same frame as the TGA codon, the sequence was removed from further analysis. In addition, if additional stop codons were found within the predicted conserved domain that covered TGA in the correct frame, such a sequence was considered a pseudogene and was also discarded. A total of 6,682 local alignments remained after application of these filters.
Clustering: The set of 2,071 unique proteins that corresponded to the remaining 6,682 TGA‐containing sequences was extracted. These protein sequences were compared pairwise by the NCBI‐bl2seq program. If two proteins produced a local alignment that had an expectation value below 10−5 and was at least 20 amino‐acid residues long, they were assigned to the same cluster. Clusters containing proteins that were initially aligned with the same TGA‐containing region were joined into larger clusters. This analysis resulted in 244 clusters.
Analyses of cysteine conservation and adjacent genes: Cysteines that corresponded to the TGA codons in the local alignments were analysed for conservation in other homologous proteins. If a cysteine was not conserved, the hit was discarded. In addition, ORFs that contained TGA‐flanking regions were analysed for domain conservation in homologues. The 1 kb TGA‐flanking regions were searched with the NCBI‐blastx program against the NCBI microbial and NR protein databases. If a TGA codon was located on the edge of the homology region (on either side of the sequence), the candidate hit was removed from further analyses. The remaining sequences were manually analysed for the occurrence of homologous selenoproteins and corresponding Cys‐containing proteins in the NCBI microbial and NR databases and for the presence of potential SECIS elements immediately downstream of the TGA codon using mfold.
Supplementary information is available at EMBO reports online (http://www.nature.com/embor/journal/v5/n5/7400126‐s1.pdf).
We thank the Research Computing Facility of the University of Nebraska‐Lincoln for the use of Prairiefire 256‐processor Beowulf cluster supercomputer. This work was supported by NIH grant GM61603 (to V.N.G.).
- Copyright © 2004 European Molecular Biology Organization