A statistical analysis of 9000 flanking sequence tags characterizing transferred DNA (T‐DNA) transformants in Arabidopsis sheds new light on T‐DNA insertion by illegitimate recombination. T‐DNA integration is favoured in plant DNA regions with an A‐T‐rich content. The formation of a short DNA duplex between the host DNA and the left end of the T‐DNA sets the frame for the recombination. The sequence immediately downstream of the plant A‐T‐rich region is the master element for setting up the DNA duplex, and deletions into the left end of the integrated T‐DNA depend on the location of a complementary sequence on the T‐DNA. Recombination at the right end of the T‐DNA with the host DNA involves another DNA duplex, 2–3 base pairs long, that preferentially includes a G close to the right end of the T‐DNA.
Transferred DNA (T‐DNA) from Agrobacterium tumefaciens Ti plasmids is a widely used tool for genetic engineering and plant insertional mutagenesis (Galbiati et al., 2000). T‐DNA is transferred into plant cell nuclei as a single‐stranded molecule attached at the 5′ end to the protein VirD2 and coated with virulence protein E2. T‐DNA integrates into the genome (Gelvin, 2000) by illegitimate recombination (Gheysen et al., 1991; Mayerhofer et al., 1991; Zupan et al., 2000) via a largely unknown mechanism. Characterization of a limited number of T‐DNA insertions into genes showed an apparently even distribution along Arabidopsis thaliana chromosomes with no preferential integration into a specific gene structure (Azpiroz‐Leehan and Feldmann, 1997). We have produced flanking sequence tags (FSTs) for >18 000 A. thaliana T‐DNA transformants (Balzergue et al., 2001). This has allowed an in‐depth analysis of the sequence specificity of T‐DNA insertion sites (IS) and brings new insights into the integration process.
Distribution of IS in the host genome
The FST distribution in the genome of A. thaliana is even throughout the five chromosomes. As an example, the FST distribution along chromosome 3 is given in Figure 1. FSTs are progressively less frequently observed towards the centromere, as shown in Figure 1 with the predicted genes [The Arabidopsis Genome Initiative (AGI), 2000]. About 40% of the integrations are in genes, i.e. in regions defined by the AGI‐predicted genes plus 200 base pairs (bp) on each side of them and covering 54% of the genome. There is no apparent category of genes more or less prone to T‐DNA insertion. We observed 121 FSTs per Mb in the 200 bp upstream of the start codon and 77 FSTs per Mb in the intergenic regions and 3′ UTR. In genes, FSTs are more frequently found in introns than in exons, with 43 and 33 FSTs, respectively, per Mb.
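The densities quoted above are simple ratios, and the arithmetic can be sketched as follows. This is a minimal Python illustration, not the procedure used for the analysis: the function names and the example count are hypothetical, and only the 40% and 54% figures are taken from the text.

```python
def fsts_per_mb(n_fsts: int, region_length_bp: int) -> float:
    """Density of FSTs in a region, expressed per megabase (1 Mb = 1e6 bp)."""
    if region_length_bp <= 0:
        raise ValueError("region length must be positive")
    return n_fsts / (region_length_bp / 1_000_000)


def relative_density(fraction_of_insertions: float, fraction_of_genome: float) -> float:
    """Share of insertions in a region divided by the share of the genome it covers.

    A value below 1 means the region receives fewer insertions than
    expected from its size alone.
    """
    if fraction_of_genome <= 0:
        raise ValueError("genome fraction must be positive")
    return fraction_of_insertions / fraction_of_genome


# Hypothetical count, chosen only to show the unit conversion.
print(fsts_per_mb(242, 2_000_000))  # 121.0

# Figures from the text: about 40% of integrations fall in gene regions
# that cover 54% of the genome.
print(round(relative_density(0.40, 0.54), 2))  # 0.74
```

A ratio of about 0.74 is consistent with gene regions, taken as a whole, receiving somewhat fewer insertions than their share of the genome would predict.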
T‐DNA after integration
The sites of VirD2‐mediated cleavage of the T‐DNA have been determined both in vivo (Dürrenberger et al., 1989) and in vitro (Pansegrau et al., 1993). FSTs from the T‐DNA left border (LB), at the 3′ end of the transferred single‐stranded T‐DNA, show that in 24% of the cases the T‐DNA is integrated with a full‐length LB (canonical insertion) (Figure 2A). The number of inserted T‐DNAs with an LB truncated by 1–23 bases is relatively even and longer deletions are rare. The right border (RB) of the T‐DNA (Figure 2A) is also sometimes fully conserved after integration (19%). The RB is frequently truncated between the second and fifth bases from the canonical IS (36%).
Microsimilarity between the host genome and T‐DNA borders
We further characterized the sequences upstream and downstream of 4430 IS. These two regions are defined with respect to the integrated T‐DNA: the region upstream of the IS is the region that would be sequenced with a primer identical to a sequence in the LB, and the region downstream of the IS is the region that would be sequenced with a primer designed from the RB. In a region close to the IS, the nucleotide composition is different from the regional composition (Figure 3). We postulated that it might reflect the previously proposed role of microsimilarities (often termed microhomologies) between host DNA and T‐DNA sequences in the integration process (Tinland, 1996). This hypothesis, based on a small number of IS, can be tested with the larger set of IS sequences available: FSTs showing different lengths of deletion in the integrated T‐DNA (Figure 2A) might exhibit modified microsimilarity when compared to canonical IS. Consistently, the most frequently observed sequence downstream of the plant IS (Table 1, Figure 2B) is related to the sequence at the end of the integrated T‐DNA [indicated by arrows (a)–(d) below the T‐DNA sequence in the abscissa in Figure 2A]. For instance, when the integrated T‐DNA ends at position (c) (Figure 2A), the over‐represented nucleotides in the plant IS indicate that the IS consensus sequence is CCCAAC (Table 1; Figure 2B downstream), whereas it is AAAAG when the integrated T‐DNA ends at position (d). There is, in all cases, an imperfect but striking similarity between the complement of the 3′ sequence of integrated T‐DNA and the consensus at the plant IS (Figure 2A and B). Thus, for canonical insertions [case (a) in Figure 2A and B], the consensus sequence 5′‐(C/T)(A/C)(G/A)GGA‐3′ in the plant genome has some similarity with the complement of the sequence 3′‐GTCCT‐5′ (i.e. CAGGA) at the end of the T‐DNA. The occurrence observed for one nucleotide may be as high as 84.8% at the plant IS (Table 1).
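The comparison described above rests on two elementary operations: taking the complement of the T‐DNA end, and tabulating which nucleotide is over‐represented at each position of the plant IS. The following Python sketch illustrates both. It is an illustration only, not the software used for Table 1; the function names are hypothetical and the four plant sequences are invented toy data.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(seq: str) -> str:
    """Base-by-base complement, without reversing the string.

    A strand written 3'->5' therefore yields its partner written 5'->3'.
    """
    return "".join(COMPLEMENT[base] for base in seq.upper())


def position_frequencies(sequences):
    """Fraction of each nucleotide at each position of equal-length sequences."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must all have the same length")
    table = []
    for i in range(length):
        counts = Counter(s[i].upper() for s in sequences)
        table.append({base: n / len(sequences) for base, n in counts.items()})
    return table


def consensus(sequences) -> str:
    """Most frequent nucleotide at each position."""
    return "".join(max(col, key=col.get) for col in position_frequencies(sequences))


# End of the T-DNA left border as written in the text (3'-GTCCT-5').
print(complement("GTCCT"))  # CAGGA, read 5'->3'

# Toy plant sequences downstream of the IS (invented for illustration).
toy_sites = ["CAGGA", "CAGGT", "TAGGA", "CCGGA"]
print(consensus(toy_sites))                     # CAGGA
print(position_frequencies(toy_sites)[0]["C"])  # 0.75
```

With real data, the per‐position fractions returned by `position_frequencies` would be compared with the regional nucleotide composition to decide which positions are over‐represented.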
Nevertheless, the microcomplementarities observed between the T‐DNA and plant DNA are not perfect, since, in many cases, two nucleotides may be over‐represented at the same position. Indeed, as illustrated for the canonical insertion (Figure 2C), this may be explained by the alignment of two sequences (5′‐CAGGAN‐3′ and 5′‐TCAGGA‐3′), complementing the same sequence in the T‐DNA LB (3′‐GTCCT‐5′), but shifted by one nucleotide. This strongly suggests a frequent deletion of one nucleotide downstream of the IS as a consequence of the integration. Inspection of individual sequences downstream of the canonical IS indicates that there is a clear shift due to the presence of one nucleotide before the microsimilarity in 53% of IS, either a T (35%), an A (11%), a C (6%) or a G (<1%). The relative nucleotide representation of the first five positions upstream of the IS has been re‐computed, taking into consideration, when necessary, the shift of one nucleotide between the apparent and actual IS. Corrected values are 71, 59, 39, 45 and 40% for C, A, G, G and A positions, respectively. Therefore, the consensus sequence of the microsimilarity, at the plant IS, for canonical integrations is clearly 5′‐CAGGA‐3′. Ungapped alignments of each IS corresponding to a canonical T‐DNA insertion with the 5′‐CAGGA‐3′ sequence show that a majority of IS (63%) exhibit an identity of at least 50% with this sequence (Figure 2D). The same alignments with sequences randomly taken from the genome provide only 19% of sequences with this score. Thus, collectively, our data show that, whatever the position of the cut in the T‐DNA, there is a microsimilarity between the integrated T‐DNA border and the plant IS.
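The ungapped comparison with 5′‐CAGGA‐3′ can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure behind Figure 2D: the function names and the toy sequences are hypothetical, and allowing the motif to start one nucleotide after the apparent IS is an assumption made here to reflect the one‐nucleotide shift discussed above.

```python
def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences (no gaps)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_identity(downstream: str, motif: str = "CAGGA", max_shift: int = 1) -> float:
    """Best ungapped identity with the motif starting 0..max_shift nt after the IS."""
    scores = []
    for shift in range(max_shift + 1):
        window = downstream[shift:shift + len(motif)].upper()
        if len(window) == len(motif):
            scores.append(identity(window, motif))
    if not scores:
        raise ValueError("downstream sequence is shorter than the motif")
    return max(scores)


def fraction_matching(sites, threshold: float = 0.5, **kwargs) -> float:
    """Share of sites whose best identity reaches the threshold."""
    hits = sum(best_identity(site, **kwargs) >= threshold for site in sites)
    return hits / len(sites)


# Toy sequences downstream of the IS (invented for illustration).
toy_sites = ["CAGGAT", "TCAGGA", "TTTTTT", "GAGCAA"]
print([best_identity(s) for s in toy_sites])  # [1.0, 1.0, 0.0, 0.6]
print(fraction_matching(toy_sites))           # 0.75
```

For a five‐nucleotide motif, an identity of at least 50% corresponds to at least three matching positions. Running the same function on windows sampled at random from the genome would give the background rate against which the observed fraction is judged.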
Interestingly, the plant genome sequence upstream of the T‐DNA IS shows an over‐representation of the nucleotide T (Table 1). It is particularly striking when FSTs correspond to canonical insertions at the LB. In this case, upstream of the plant IS there is a highly significant occurrence of T at five contiguous positions that cannot be due to sequence matches between the T‐DNA and the plant genome. If, as discussed above, the shift of one nucleotide introduced in the apparent IS by a deletion of one nucleotide is taken into consideration, the T representation becomes 38, 40, 45 and 73% at the −5 to −1 position upstream of IS, respectively. Thus, the T‐rich region, which frequently ends with a T, is located immediately upstream of the microsimilarity region. Half of the five‐nucleotide sequences immediately upstream of the IS contain at least three Ts.
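Counting Ts in the five nucleotides upstream of the IS is straightforward, as the following Python sketch shows. It is an illustration only: the function names, the indexing convention (a 0‐based index pointing at the first base downstream of the IS) and the toy sequences are assumptions made for the example.

```python
def upstream_window(sequence: str, is_index: int, size: int = 5) -> str:
    """Return the `size` nucleotides immediately upstream of the insertion site.

    `is_index` is assumed to be the 0-based index of the first base
    downstream of the IS.
    """
    if is_index < size or is_index > len(sequence):
        raise ValueError("not enough sequence upstream of the insertion site")
    return sequence[is_index - size:is_index].upper()


def t_count(window: str) -> int:
    """Number of Ts in a window."""
    return window.upper().count("T")


def fraction_t_rich(windows, min_t: int = 3) -> float:
    """Share of windows that contain at least `min_t` Ts."""
    return sum(t_count(w) >= min_t for w in windows) / len(windows)


# Toy host sequence (invented): T-rich stretch followed by CAGGA.
print(upstream_window("GGGTTTATCAGGA", 8))  # TTTAT

# Toy windows (invented for illustration).
toy_windows = ["TTTAT", "ATTGT", "GCATC", "TTTTT"]
print([t_count(w) for w in toy_windows])  # [4, 3, 1, 5]
print(fraction_t_rich(toy_windows))       # 0.75
```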
Lastly, we searched for microsimilarities between the RB of the single‐stranded T‐DNA and the complement of the plant DNA. We observed a significant over‐representation of Gs, in the plant DNA, at position −2 from the IS, whether the IS was characterized by FSTs from the RB of canonical T‐DNA insertions [(e) in Figure 2A and B] or by FSTs from T‐DNAs nicked between the second and the third bases [(f) in Figure 2A and B]. These results show that in both cases the nucleotide G is significantly over‐represented, in plant sequences, at 2 bp from the IS, and that it is preceded by an over‐representation of A in the case where the T‐DNA end is canonical, tttagcacaCT, and of T when the T‐DNA end is tttagcaCA.
We have generated and analysed a large set of integrated T‐DNAs and their respective pre‐IS. We confirmed and further characterized the involvement of a microsimilarity between the T‐DNA LB sequence and host DNA IS previously observed in a limited number of IS (Tinland et al., 1995) and proposed to be the docking force for integration (Tinland, 1996). Interestingly, we found that even in non‐canonical insertions a microsimilarity with the host DNA is also present. Most of the microsimilarities involved the first 25 nucleotides of the LB, but some are observed up to the sequence of the oligonucleotide used to prime the FST sequencing. As a consequence of the small length of the microsimilarity and the large region of the T‐DNA in which it may be found, the T‐DNA may potentially integrate anywhere in the plant genome. However, we showed an over‐representation of Ts at five positions immediately upstream of the IS, indicating that T‐DNA integration is influenced by the nucleotide composition at the pre‐IS independently of any microsimilarity. The preference for a T‐rich context for T‐DNA integration may explain favoured integrations in the gene region upstream of the start codon, as well as the higher density of FSTs found in introns than in exons. Our data have also increased our knowledge of the recombination reaction at the 5′ end of the T‐DNA. Previous results (Tinland et al., 1995) indicated that an identity existed between at least the nucleotide linked to VirD2 and the last nucleotide of the plant IS. We extended these observations and demonstrated that the identity between the 5′ end of the T‐DNA and the plant pre‐IS often involved not only the last nucleotide but also a G located immediately downstream. Our data statistically support the model for T‐DNA integration previously proposed by Tinland (1996).
Taking all our new data into consideration, we propose a model that mainly differs from Tinland's model by the preference for T‐DNA integration in the vicinity of a T‐rich region (Figure 4). The following six steps are involved. (1) The integration process is often initiated by the 3′ end (LB) of the T‐DNA invading a poly T‐rich site of the host DNA. (2) Upstream of the 3′ end of the single‐stranded T‐DNA, a more or less perfect duplex with the top strand of the host DNA is formed. Our findings are the first proof of a link between the location of the microsimilarity in the T‐DNA and the cut in the LB of the integrated T‐DNA. (3) After degradation of the 3′ end portion of the T‐DNA downstream of the duplex, the ligation between the digested bottom strand of the host DNA and the 3′ end of the T‐DNA is performed by host enzymes. We assume that the 1 bp deletion frequently observed in the plant DNA is a consequence of this digestion‐ligation step. (4) A nick in the upper host DNA strand is created downstream of the duplex and used to initiate the synthesis of the complementary strand of the invading T‐DNA. The imperfect matches in the duplex are detected and repaired by host enzymes, using the invading T‐DNA sequence as a template. (5) The right end of the T‐DNA is ligated to the bottom strand of the host DNA. The pairing frequently involves a G and another nucleotide upstream of it. (6) The top strand of the host DNA is degraded between the two microsimilarities and a ligation with the synthesized complementary T‐DNA is made. This may result in a deletion of variable length in the host DNA. Out of 180 transformants for which we have FSTs from both sides of the integrated T‐DNA, 88 have apparent deletions shorter than 150 bases.
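When FSTs are available from both sides of an integrated T‐DNA, the apparent deletion in the host DNA is the number of host bases missing between the two junctions. The Python sketch below illustrates this calculation. It is not the procedure used for the analysis: the function names, the 1‐based coordinate convention and the toy coordinates are assumptions, and only the counts 88 and 180 are taken from the text.

```python
def apparent_deletion(last_base_before: int, first_base_after: int) -> int:
    """Host bases missing between the two T-DNA junctions.

    Both arguments are assumed to be 1-based coordinates on the same
    chromosome: the last host base retained on one side of the T-DNA and
    the first host base retained on the other side. Adjacent coordinates
    mean that no host base was lost.
    """
    if first_base_after <= last_base_before:
        # Overlapping junctions are not handled by this sketch.
        raise ValueError("junction coordinates overlap")
    return first_base_after - last_base_before - 1


def fraction_shorter_than(deletions, limit: int = 150) -> float:
    """Share of apparent deletions strictly shorter than `limit` bases."""
    return sum(d < limit for d in deletions) / len(deletions)


# Toy junction coordinates (invented for illustration).
toy_pairs = [(1000, 1001), (5000, 5031), (7000, 7400)]
toy_deletions = [apparent_deletion(a, b) for a, b in toy_pairs]
print(toy_deletions)                         # [0, 30, 399]
print(fraction_shorter_than(toy_deletions))  # 2 of 3 toy deletions

# Counts from the text: 88 of 180 transformants.
print(round(100 * 88 / 180, 1))  # 48.9
```

The counts reported above therefore correspond to about 49% of the transformants with FSTs from both sides.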
Some of the results presented here differ from previously published data, for which there are two possible explanations. First, we used for the first time a set of FSTs not biased by an analysis of a particular set of mutants. Second, we used a data set large enough to be statistically representative of the integration process.
The insertion of T‐DNA into the genome may have recruited some of the cellular processes involved in illegitimate recombination (Tzfira and Citovsky, 2002). Our data indicate that the primary docking force of the T‐DNA towards the plant IS may be a poly T‐rich stretch in the plant DNA. An over‐representation of T‐rich sequences has been observed in other microsimilarity‐mediated recombinations (Kohli et al., 1999). An A‐T‐rich sequence is a region with both a low DNA duplex stability (Breslauer et al., 1986) and a strong bending (Bolshoy et al., 1991). Bending rather than sequence itself has been shown to favour retroviral integration in vitro (Müller and Varmus, 1994) and P transposable element integration in the Drosophila genome (Liao et al., 2000). Recognition of a bent DNA region might, therefore, be a common feature in the integration of foreign DNA into eukaryotic genomes.
The collection of T‐DNA transformant lines has been generated with the A. thaliana ecotype Wassilewskija at the Institut National de la Recherche Agronomique (INRA Versailles), using the A. tumefaciens strain C58C1 (pMP90; Bechtold et al., 1993). The protocol used for obtaining FSTs is described in a previous paper (Balzergue et al., 2001), and details on FST sequences are available through the FLAGdb/FST database (Samson et al., 2002; http://genoplante‐info.infobiogen.fr). The FST set analysed contained 8919 non‐redundant sequences unequivocally mapped at only one locus of the plant genome. This FST set did not contain FSTs indicative of complex IS such as tandem insertions of two T‐DNAs. Only FSTs not flanked by filler DNA were used. About 20% of FSTs contain stretches of bases downstream of the end of the inserted T‐DNA that do not match plant, plasmid or any known sequences. In most cases, this filler DNA is shorter than 50 bases and its nucleotide composition is the same as the average in the A. thaliana genome. The presence of filler DNA at sites of non‐homologous recombination is thought to be a consequence of DNA break repair. BLAST programs (Altschul et al., 1997) were used to align FSTs with the five chromosomes of A. thaliana. The pseudo‐molecules were downloaded from the TIGR site (http://www.tigr.org), with the associated coding sequence annotations (ID 68170, 51595, 68173, 68164 and 68172). A gene is considered to be interrupted by a T‐DNA when an FST starts in the genomic region that includes the predicted coding sequence (CDS) for this gene and 200 bp on each side.
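The rule used to decide whether a gene is interrupted can be expressed as a simple interval test. The Python sketch below is an illustration under stated assumptions, not the code used for the analysis: the `Gene` record, the function names, the inclusive 1‐based coordinates and the toy genes are all hypothetical, and only the 200 bp flank comes from the text.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    """Hypothetical record for a predicted gene on one chromosome."""
    name: str
    cds_start: int  # first base of the predicted CDS (1-based, inclusive)
    cds_end: int    # last base of the predicted CDS (1-based, inclusive)


FLANK_BP = 200  # flank added on each side of the predicted CDS


def interrupts(fst_start: int, gene: Gene, flank: int = FLANK_BP) -> bool:
    """True if the FST starts within the CDS extended by `flank` bp on each side."""
    return gene.cds_start - flank <= fst_start <= gene.cds_end + flank


def interrupted_genes(fst_start: int, genes: List[Gene]) -> List[str]:
    """Names of all genes whose extended region contains the FST start."""
    return [g.name for g in genes if interrupts(fst_start, g)]


# Toy annotations (invented for illustration).
toy_genes = [Gene("geneA", 1000, 2500), Gene("geneB", 2800, 4000)]
print(interrupted_genes(800, toy_genes))   # ['geneA']
print(interrupted_genes(2650, toy_genes))  # ['geneA', 'geneB']
print(interrupted_genes(700, toy_genes))   # []
```

As the second toy case shows, closely spaced genes can have overlapping extended regions, so a single FST may be assigned to more than one gene under this rule.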
The authors are grateful to Mickael Hodges and Ian Small for reading the manuscript. This work was supported by the French Génoplante Program.
Copyright © 2002 European Molecular Biology Organization