Transposable elements and human cancer: A causal relationship?

Somatic retrotransposition in human cancer revealed by whole-genome and exome sequencing

The observed counts were taken from the TE detection pipeline, and the expected counts were computed as the ratio of unique insertions seen in matched normal vs.

The significance of the difference between the observed and expected counts of unique L1 insertions was evaluated using Fisher's exact test.
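As an illustration of the test named above, the following is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table, written with the Python standard library only. The table layout (observed versus expected counts in rows, normal versus tumor in columns), the function name, and the example counts are assumptions made for illustration; they are not taken from the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    # With all margins fixed, the top-left cell is hypergeometric.
    lo = max(0, col1 - row2)
    hi = min(col1, row1)

    def weight(k):
        # Unnormalised probability of a table whose top-left cell equals k.
        return comb(row1, k) * comb(row2, col1 - k)

    observed = weight(a)
    # Sum the weights of all tables at least as unlikely as the observed one.
    extreme = sum(weight(k) for k in range(lo, hi + 1) if weight(k) <= observed)
    return extreme / comb(n, col1)


# Hypothetical counts of unique L1 insertions (illustrative only):
# row 1 = observed (normal, tumor), row 2 = expected (normal, tumor).
p_value = fisher_exact_two_sided(10, 30, 20, 20)
print(f"two-sided p = {p_value:.4g}")
```

Integer binomial coefficients are compared directly, which avoids floating-point ties when deciding which tables count as "at least as extreme".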

Counts of TE insertions for matched normal and primary tumor tissue samples were characterized based on their frequencies from the 1000 Genomes Project (1KGP; Sudmant et al.). The distributions of TE insertion counts across the three frequency bins were compared for matched normal and cancer samples for the different tissue types analyzed here, and the significance of the differences between these distributions was evaluated using the Kolmogorov-Smirnov test.

RefSeq genes (Pruitt et al.).

Primary Tumor Tissue Samples

RNA-seq data were used to evaluate the differences in TE expression levels between matched normal and primary tumor tissue samples as described in the Materials and Methods.

The observed differences in gene expression levels between normal and tumor tissue were compared to differences in TE expression levels for breast invasive carcinoma, head and neck squamous cell carcinoma, and lung adenocarcinoma.

There are no significant differences observed for the distributions of gene expression levels between matched normal and primary tumor tissue pairs for any of the three cancer types analyzed here (Figure 2). Similarly, when all three families of potentially active TEs (Alu, L1, and SVA) are considered together, there is no significant difference seen for the overall levels of expression between matched normal and tumor tissue.

However, when full-length, potentially active L1 sequences are considered alone, we observe statistically significant increases in L1 expression levels for all three cancer types.

Gene expression levels for matched normal vs. Expression levels are shown as distributions of log10-transformed read counts, and normal versus tumor comparisons are shown for breast invasive carcinoma (green), head and neck squamous cell carcinoma (red), and lung adenocarcinoma (blue).

For each tissue type, the significance levels of the differences in L1 expression between normal and cancer pairs are indicated with P-values from the Kolmogorov-Smirnov test.
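The two-sample Kolmogorov-Smirnov comparison referred to above can be sketched as follows. This is a minimal standard-library illustration, not the authors' analysis code: the function names and the example read counts are hypothetical, and the p-value uses the plain asymptotic Kolmogorov series, which is only a rough approximation for small samples.

```python
import math
from bisect import bisect_right


def ks_statistic(x, y):
    """Maximum absolute difference between the two empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for v in xs + ys:
        fx = bisect_right(xs, v) / len(xs)  # ECDF of x evaluated at v
        fy = bisect_right(ys, v) / len(ys)  # ECDF of y evaluated at v
        d = max(d, abs(fx - fy))
    return d


def ks_pvalue_asymptotic(d, n, m, terms=100):
    """Asymptotic two-sided p-value (approximation; assumes large n and m)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The series converges slowly here, and the true value is ~1.
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * s))


# Hypothetical L1 read counts for normal and tumor samples (illustrative only).
normal = [12, 15, 20, 22, 30, 31, 35, 40]
tumor = [28, 33, 45, 50, 61, 70, 72, 90]
d = ks_statistic(normal, tumor)
print(d, ks_pvalue_asymptotic(d, len(normal), len(tumor)))
```

Because the statistic depends only on the rank order of the pooled values, applying a log10 transform to both samples leaves it unchanged.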

The methods that we used to characterize TE expression levels include several analytical controls intended to ensure that only genuine TE-initiated transcripts, from members of potentially active families, are measured.

Nevertheless, the lack of a difference between normal and tumor expression levels observed when all three active TE families were considered together could reflect technical difficulties with identifying bona fide TE transcripts that are initiated from element promoters as opposed to TE sequences that are passively expressed as part of longer genic transcripts. This is particularly true for Alu elements, many of which are found in the introns of human genes and transcribed as read-through transcripts initiated from RNA Pol II gene promoters (Deininger). Our confidence in the ability to measure L1-initiated transcripts is higher owing to the focus on previously identified full-length, intact elements that are located in intergenic regions.

In any case, the up-regulation of L1s in cancer that we observed has potential implications for increased TE insertional activity for all three families, since L1-encoded proteins are responsible for the cis retrotransposition of L1s as well as the trans activation of Alu and SVA elements (Batzer and Deininger; Hancks and Kazazian). We analyzed the same pairs of matched normal and primary tumor tissues to evaluate whether the observed increase in L1 expression corresponds to increased transpositional activity of human TEs.

This technological advance is exemplified by the recent Phase 3 release of the 1KGP, which includes a complete genome-wide census of polymorphic TE insertion sites for individuals across 26 human populations (Sudmant et al.).

We analyzed whole-genome DNA-seq data using computational methods for TE insertion detection (see Materials and Methods) in order to compare TE insertional activity between matched normal and primary tumor tissue samples. When all three families of active human TEs are considered together, we observed a total of TE insertions across the nine individuals analyzed for normal and cancer tissue pairs, of which are unique insertions found in only one individual and one tissue type.

These results are consistent with a potential role for L1 transpositional activity in tumorigenesis for the cancer types analyzed here, as has been previously suggested for several different cancers (Morse et al.).

TE insertional activity in matched normal vs. The number of TE insertions was measured for normal and primary tumor tissue pairs for breast invasive carcinoma, head and neck squamous cell carcinoma, and lung adenocarcinoma via analysis of whole-genome DNA-seq data as described in the Materials and Methods.

(A) The total number of predicted TE insertions, pooled for all nine individuals over the three cancer types analyzed here, is shown for normal vs. Venn diagrams show the numbers of unique versus shared TE insertions for the two tissue types. (B) Comparison of the observed versus expected numbers of unique L1 insertions for normal vs.

(C) Comparison of the population frequencies of observed TE insertions in matched normal vs. (D-F) The same comparisons of TE insertion population frequencies are shown individually for each cancer type analyzed here.

TE insertion population frequencies are color coded as shown in the key. P-values show the significance of the differences for observed distributions based on Fisher's exact test (B) and the Kolmogorov-Smirnov test (C-F).

Given the relatively high level of L1 insertional activity in the tumor tissue samples analyzed here, we tested whether tumor-specific L1 insertions are found at lower frequencies among the presumably healthy donors from the 1KGP compared to L1 insertions found in matched normal tissue. The idea was to evaluate whether the tumor-specific L1 insertions represent mutations that are private, and thereby more likely to be deleterious or disease-causing.

The strongest effect is seen for head and neck squamous cell carcinoma. The pattern of a significant excess of private L1 insertions in tumor compared to normal tissue, observed for all three cancer types studied here, provides further evidence in support of a possible role for L1 activity in tumorigenesis. It should be noted that TE insertions present at low copy numbers may not be detectable using next-generation sequence analysis, whereas such insertions may be uncovered using more sensitive PCR-based approaches.

False negatives of this kind will be more prevalent at low levels of sequence coverage. Sequence-based predictions can also yield false-positive TE insertion calls.

In an effort to deal with this issue, we have only used high-confidence calls produced by two independent programs, MELT and Mobster, that we have recently shown to be the most reliable for the detection of human TE insertions (Rishishwar et al.).
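One way to combine calls from two independent programs is to keep only insertions that both report at roughly the same site. The sketch below assumes each caller's output has already been parsed into simple records; the `Call` type, the `consensus_calls` function, and the 100 bp matching window are hypothetical choices for illustration and do not reflect the MELT or Mobster output formats or the authors' actual criteria.

```python
from typing import NamedTuple


class Call(NamedTuple):
    chrom: str
    pos: int      # approximate insertion position
    family: str   # e.g. "Alu", "L1", or "SVA"


def consensus_calls(calls_a, calls_b, window=100):
    """Return calls from calls_a supported by a nearby same-family call in calls_b."""
    # Index the second caller's positions by (chromosome, family).
    by_key = {}
    for c in calls_b:
        by_key.setdefault((c.chrom, c.family), []).append(c.pos)

    kept = []
    for c in calls_a:
        positions = by_key.get((c.chrom, c.family), [])
        # A match needs the same chromosome and family within the window.
        if any(abs(c.pos - p) <= window for p in positions):
            kept.append(c)
    return kept


# Hypothetical parsed calls (illustrative only).
caller_one = [Call("chr1", 1000, "L1"), Call("chr2", 500, "Alu")]
caller_two = [Call("chr1", 1050, "L1"), Call("chr2", 5000, "Alu")]
print(consensus_calls(caller_one, caller_two))
```

A positional window is used rather than exact matching because insertion breakpoints from short-read callers are typically approximate.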

One other potential problem with the sequence-based analysis relates to the base pair resolution with which TE insertions can be called via computational analysis of next-generation sequence data. Currently, the most accurate programs for calling TE insertions from next-generation sequence data do not yet allow insertions to be located at single base pair resolution. It is possible that this approximation can lead to multiple TE insertion events being collapsed into a single event.

Subsequent experimental confirmation of individual TE insertion calls of interest e.

Potentially Tumorigenic TE Insertions

Having established a potential role for transpositional activity in tumorigenesis using the genome-wide approaches described above, we wanted to search for specific examples where individual TE insertions could be implicated as possible cancer driver mutations.

To do so, we performed an integrated analysis of TE insertion, gene expression, and chromatin data (see Materials and Methods) in an effort to identify the cancer-specific TE insertions that are most likely to play a causal role in tumorigenesis. We considered TE insertions that are co-located with either exons or regulatory elements of previously characterized tumor suppressor genes to have the highest likelihood of being functionally relevant.

Only events with at least 10 read-pairs, including at least two in each direction, supporting the insertion were maintained.
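The read-pair support criterion stated above (at least 10 supporting read-pairs, with at least two in each direction) can be expressed as a small filter. This is an illustrative sketch; the function and parameter names are hypothetical, and only the two thresholds come from the text.

```python
def passes_support_filter(forward_pairs, reverse_pairs,
                          min_total=10, min_per_direction=2):
    """Return True if an insertion event has enough supporting read-pairs."""
    total = forward_pairs + reverse_pairs
    return (total >= min_total
            and forward_pairs >= min_per_direction
            and reverse_pairs >= min_per_direction)


# Hypothetical (forward, reverse) support counts for three candidate events.
candidates = [(8, 2), (9, 1), (4, 4)]
kept = [c for c in candidates if passes_support_filter(*c)]
print(kept)
```

Requiring support from both directions guards against events that are backed by reads on only one side of the putative insertion site.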

Events consistent with microsatellite instability or ancient retrotransposons were filtered out.

TranspoSeq-Exome

We modified TranspoSeq to search for novel junctions between retrotransposons and unique genomic sequence using split reads.

Instead of aligning all discordant read-pairs to the database of consensus retrotransposon sequences, TranspoSeq-Exome first parses out all clipped reads identified by BWA and aligns the clipped sequence to the database of retrotransposons.
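To illustrate the first step described above, the following sketch pulls soft-clipped sequence out of an aligned read using its CIGAR string, as defined in the SAM format. It is not TranspoSeq-Exome code: the function name and the minimum clip length are assumptions, and the subsequent alignment of the clipped sequence to a retrotransposon database is not shown.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clipped_sequences(cigar, seq, min_len=10):
    """Return soft-clipped segments of a read as (side, sequence) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard-clipped bases (H) are absent from SEQ, so ignore those operations.
    ops = [(n, op) for n, op in ops if op != "H"]

    clips = []
    if ops and ops[0][1] == "S" and ops[0][0] >= min_len:
        clips.append(("left", seq[:ops[0][0]]))
    if len(ops) > 1 and ops[-1][1] == "S" and ops[-1][0] >= min_len:
        clips.append(("right", seq[-ops[-1][0]:]))
    return clips


# Hypothetical read: 12 soft-clipped bases followed by 20 aligned bases.
read_seq = "TTTTTTTTTTTT" + "ACGTACGTACGTACGTACGT"
print(clipped_sequences("12S20M", read_seq))
```

The `min_len` threshold is a hypothetical safeguard: very short clips carry too little sequence to be placed reliably against a consensus element.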

Additionally, the exact base-pair location of a clip can be misidentified by BWA. Primers, designed using Primer3 (Rozen and Skaletsky), and target information are listed in Supplemental Table 1. See Supplemental Figure 9 and accompanying text for further information regarding validation experiments.

Statistical analysis

Correlations with other genomic features

Data for replication timing and chromatin conformation were collected from Chen et al.

Values were then converted to reads per kilobase per million (RPKM) by the formula: Element characteristics were assessed using L1Base (Penzkofer).

Correlation with gene expression

To assess overall gene expression changes across all tumor types: We compared gene expression in the sample in which the insertion is present to the distribution of RSEM across all other samples investigated.
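For the RPKM conversion mentioned above, the widely used textbook definition can be sketched as follows. This is an assumption: the authors' own expression is not shown here and may differ in detail, and the function and parameter names are hypothetical.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase per million mapped reads (standard definition)."""
    if length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length_bp and total_mapped_reads must be positive")
    # Equivalent to read_count / (length_bp / 1e3) / (total_mapped_reads / 1e6).
    return read_count * 1e9 / (length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2 kb feature, 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```

The normalization divides out both feature length and sequencing depth, so values are comparable across features and across libraries of different sizes.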

We used a two-tailed Wilcoxon-Mann-Whitney test in R to test the hypothesis that a gene with a retrotransposon insertion is transcribed at a significantly lower level in samples with this insertion. To assess individual expression changes: