Human endogenous retroviruses (HERVs) are remnants of ancient retroviral infections that became fixed in the germ line DNA millions of years ago. The fact that humoral and cellular immune responses against HERV-encoded proteins have been identified in cancer patients suggests that these antigens might be used in cancer immunotherapy or diagnosis. We analyzed the digital expression patterns of the HERV-K (HML-2), -W, -H and -E families in normal and cancerous tissues. Thirty-one proviral members of the HERV-K family and one representative each for the other HERV families were used as probes to search human EST data. Matching of HERV proviruses to ESTs was HERV family-specific and the expression profiles of the HERV families distinct. The HERV-K family was expressed in normal tissues such as muscle, skin and brain, as well as in germ cell tumors and other cancerous tissues. HERV-H was the only family expressed in cancers of the intestine, bone marrow, bladder and cervix, and was more highly expressed than the other families in cancers of the stomach, colon and prostate. In contrast, HERV-W was predominantly expressed in normal placenta. Expression patterns were confirmed by MPSS (massively parallel signature sequencing) data where available. For the HERV-K family, we mapped most ESTs to their corresponding proviruses and assessed the coding capacities of the matched proviruses. This study shows that HERV families are more widely expressed than originally thought and that some members of the HERV-K and -H families could encode targets for cancer immunotherapy.
This article was published in Cancer Immunity, a Cancer Research Institute journal that ceased publication in 2013 and is now provided online in association with Cancer Immunology Research.
Human endogenous retroviruses originate from ancient retroviral infections that became fixed in the germ line DNA between <1 and 40 million years ago and represent approximately 1% of the human genome (1, 2). So far, about twenty HERV families, named according to the transfer RNA used to prime reverse transcription, have been identified (3). Most HERV families are present in multiple copies in the genome and all those examined to date are defective due to the accumulation of mutations affecting their coding potential (4). However, some families still contain open reading frames (ORFs) for one or more of the retroviral genes. HERVs have simple genomes containing gag, pol and env genes encoding retroviral polyproteins, flanked by two long terminal repeats (LTRs). The 5' LTR contains enhancer and promoter sequences that provide signals recognized by the cellular machinery for transcription initiation and the 3' LTR provides a polyadenylation signal. Typically, two polyadenylated transcripts are produced: a full-length transcript, used for the synthesis of new viral genomes and as mRNA for the gag and pol genes, and a spliced transcript, used as mRNA for the env gene. A HERV provirus could therefore be considered as a single retroviral gene containing different translational units. Translation of the different ORFs present on the full-length transcript is regulated and involves some nonsense and frameshift suppression events to produce about 20-fold more Gag than Pol protein products. These polyproteins are then cleaved by the viral protease (5).
The HERV-K (HML-2) family is present as 30-50 proviral copies in the human genome (6) and is the most conserved HERV family, since some members of this family have intact open reading frames for the gag, pol or env retroviral genes (7). Moreover, members of the HERV-W and -H families have been shown to encode intact Env proteins (8, 9). HERVs are transcriptionally silent in most normal human tissues (10). However, HERVs have been found to be expressed in some normal tissues, such as placenta, and under pathological conditions (8, 11, 12). For instance, the HERV-K family has been reported to be expressed in teratocarcinoma and breast cancer cell lines (13, 14), HERV-E in prostate carcinoma (15), HERV-H in leukemia cell lines (16) and HERV-W in normal placenta and brain tissues from multiple sclerosis or schizophrenic patients (8, 17). There is no evidence, however, that HERVs are directly implicated in carcinogenesis.
On the other hand, antibody responses against HERV-K proteins have been observed in patients with germ cell tumors (18, 19). In addition, antibodies reactive against cDNA clones encoding HERV-K Gag or Env were identified in the sera of testicular, melanoma and prostate cancer patients using the SEREX (serological analysis of recombinant cDNA expression libraries) methodology, demonstrating that a humoral response was mounted against these proteins (SEREX sequence IDs 1630, 1631, 92, 289 and 2312). Schiavetti et al. recently identified a short HERV-K ORF encoding an antigen that elicited a CTL response in melanoma patients (20). The fact that HERV-encoded proteins have been found to be able to elicit humoral and/or T-cell mediated responses in cancer patients suggests that these proteins could be a source of antigens for use in cancer immunotherapy or diagnosis.
In order to identify additional HERV antigen candidates in different malignancies, the expression patterns of each HERV family need to be analyzed and the coding capacity of differentially expressed transcripts assessed. In this study, we analyzed digital expression patterns of four HERV families, HERV-K, -H, -W and -E, for which proviruses containing at least one full-length viral open reading frame have been described. When available, we compared expression profiles obtained by EST analysis with those based on MPSS of a set of 31 normal tissues and 3 cancer cell lines. MPSS is a technology that allows the generation of millions of signature tags proximal to the 3' end of transcripts, sufficient to cover cellular transcripts up to 10-fold (21, 22). Moreover, since most HERV-K proviruses have recently been identified, we could assign most ESTs to their corresponding proviruses and assess their expression levels and coding capacities.
Mapping HERV proviral sequences to the genome
We mapped HERV proviruses onto the human genome based on previously published analyses (23, 24, 25, 26) and using the lalign (27) and bl2seq (28) pairwise alignment tools with known family members. Only proviruses that were not extensively degenerate or truncated were selected for the analysis of their expression patterns. Table 1 shows the list of analyzed proviruses and their mapping to human genomic sequences and Ensembl chromosomes (based on the NCBI 33 assembly of the human genome).
EST-based expression profiles of HERV families
In order to analyze the expression patterns of the four least degenerate HERV families, two query sets containing proviral sequences representative of each HERV family were created. Since only a few proviral sequences have been described for the HERV-H, -W and -E families, one representative proviral member for each of these families was included in the query sets. The first query set contained these sequences as well as thirty-one HERV-K proviral sequences, thus covering most of the HERV-K family. For comparison purposes, the second query set contained HERV-K108 as the representative of the HERV-K family. This HERV-K provirus was selected because it contains large retroviral ORFs. LTR sequences were removed from proviral sequences prior to inclusion in the query sets in order to avoid matching solo LTRs lacking adjacent retroviral sequences.
These query sets were then used to search the human EST data using Megablast (29). Blast results were sorted according to matched ESTs and only HERV proviruses that matched ESTs with the best alignment score were kept. ESTs matching Alu sequences inserted in the proviruses HERV-K 22q11, 19p13_11B, 6p21 and 9q34_3 were excluded from the analysis. Moreover, in order to confirm that ESTs stemmed from the matched HERV-K provirus, ESTs were searched against the human genome (NCBI build 33) using BLAT (BLAST-like alignment tool) and the chromosome coordinates were compared with those obtained for the HERV-K proviruses present in the query set. A majority of ESTs (95/143) matched the corresponding HERV-K provirus unambiguously, 6 ESTs matched more than one HERV-K provirus in the query set with identical BLAT scores, while 42 ESTs matched as yet uncharacterized HERV-K provirus loci with a better BLAT score (see Supplementary Table 4). The total number of matched ESTs from non-normalized libraries is listed in Table 1 for each provirus and each HERV family.
The HERV-K C3_NT005863 provirus, corresponding to the transcriptionally active HERV-K(II) described by Sugimoto et al. (30) and the HERV-K 22q11 provirus, a provirus encoding a Gag protein identified by Y. Obata using the SEREX methodology in a prostate cancer patient, are the most expressed members of the HERV-K family in our analysis. As expected from the sequence divergence between the four HERV families, EST matches were always HERV family-specific (i.e. ESTs were matched only by members of the same HERV family). Since there was no competition for the matching of the HERV-W, -H and -E proviruses to their family-specific ESTs, we assumed that the expression of the member selected represented the expression pattern of the entire family.
Information about the tissue of origin and its status (cancerous or normal) was then retrieved for each matched EST based on the classification of the corresponding cDNA library using a controlled hierarchical ontology (eVOC) (31). Based on this information, we sorted the ESTs into 32 different tissue categories (see Supplementary Table 1). The list of normal and cancerous tissues in which HERV families are expressed is shown in Table 2. Only ESTs from non-normalized cDNA libraries were taken into account. For each HERV family, tissues are sorted from the highest to the lowest relative expression, based on the ratio of the number of matched ESTs to the total number of ESTs sequenced for the tissue.
These results suggest that HERV families have distinct expression profiles and are expressed in both normal and cancerous tissues. The overall level of expression is similar for the HERV-K, -H and -W families, but is lower for the HERV-E family. Although their level of expression is relatively low in most tissues, as suggested by the low number of ESTs identified, certain HERV families are more highly expressed in some tissues. For example, the HERV-W family is highly expressed in normal placenta (78 ESTs), consistent with the fusogenic function recently described for its envelope protein in human placenta (32, 33). The HERV-H and -K families are highly expressed in testicular cancers but are also expressed in different normal and cancerous tissues. The HERV-H family is particularly well represented in cancerous tissues, such as colon, stomach and prostate.
HERV-K expression in testicular cancer (see Table 2B) was confirmed by the fact that about 70% (41/58) of matched ESTs from subtracted libraries and 47% (7/15) from normalized libraries were generated from germ cell tumors. Moreover, a majority of these ESTs matched the HERV-K 22q11 (22 ESTs, of which 18 were from subtracted cDNA libraries) or HERV-K101 (16 ESTs, of which 14 were from subtracted cDNA libraries) proviruses (see Supplementary Table 2). Preferential HERV-W expression in normal placenta was also confirmed since this tissue represented 96% (45/47) and 60% (3/5) of matched ESTs from normalized and subtracted libraries, respectively. Fewer ESTs from these normalized or subtracted cDNA libraries were identified for the HERV-H (13 ESTs) and -E (22 ESTs) families. Nevertheless, expression patterns in these cDNA libraries were similar to the ones observed in non-normalized libraries.
Assessment of the homogeneity of HERV-K provirus expression patterns and HERV family-specific matching to ESTs
In order to determine the level of similarity between the expression profiles of the 31 HERV-K proviruses and to confirm that matching to ESTs was family-specific, all HERV provirus sequences were aligned to all ESTs matched by the HERV-K family. Normalized alignment scores and binary distances were computed as described in Materials and Methods and proviruses were clustered based on these binary distances. As shown in Figure 1, expression patterns were relatively similar, though not identical, for the different proviral members of the HERV-K family. Most of the variability in the HERV-K provirus expression patterns was indeed attributable to truncated HERV-K proviruses that failed to match some ESTs (HERV-K proviruses on the left side of Figure 1). In addition, this analysis provides experimental evidence that matching to EST sequences is actually HERV family-specific, since proviruses from the HERV-W, -H and -E families failed to match HERV-K family-specific ESTs.
Comparison of the expression levels of the different HERV families in normal and cancerous tissues
As shown in Table 2, the numbers of ESTs present in normal and cancerous tissues for the HERV families analyzed were obtained for two query sets containing one representative for each of the HERV-H, -W and -E families and either thirty-one members of the HERV-K family or the HERV-K108 provirus (representative of the HERV-K family). In order to compare the expression levels of the different HERV families in each tissue, these observed frequencies were used to build two contingency tables. For each HERV family, expected frequencies and chi square values were then calculated for each tissue as described in Materials and Methods. The results for the two query sets are shown in Table 3 (A and B, respectively). Normal tissues were sorted according to their contribution to the total chi square value, while cancerous tissues were listed in the same order as normal tissues for comparison purposes. Moreover, tissues for which HERV expression was observed in only one tissue state (normal/cancer) were grouped in the separate "Others Norm." (adipose, nerve, fetal liver-spleen, eye and bone) or "Others Canc." (stomach, lymph node, intestine, bone marrow, bladder and cervix) categories, respectively.
Differential expression of a HERV family in a tissue was inferred by comparing the observed and expected frequencies of the different HERV families. A HERV family was considered differentially expressed in a tissue if its observed frequency was at least two-fold higher than its expected frequency, with observed frequencies for the other HERV families being equal to or lower than expected. Other parameters were taken into account, such as the contribution of the tissue to the total chi square value and, for the HERV-K family, the correlation between the two query sets.
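The comparison of observed and expected frequencies can be illustrated with a short sketch. It assumes the standard Pearson definition of an expected frequency (row total times column total divided by the grand total), which may differ in detail from the procedure actually used; the function names and the EST counts below are hypothetical and serve only as an illustration.

```python
def expected_frequencies(observed):
    """observed: dict tissue -> dict family -> EST count (one contingency table)."""
    families = sorted({fam for row in observed.values() for fam in row})
    row_totals = {t: sum(row.get(f, 0) for f in families) for t, row in observed.items()}
    col_totals = {f: sum(row.get(f, 0) for row in observed.values()) for f in families}
    grand_total = sum(row_totals.values())
    # Assumption: Pearson expected count = row total * column total / grand total.
    return {t: {f: row_totals[t] * col_totals[f] / grand_total for f in families}
            for t in observed}


def chi_square_contributions(observed, expected):
    """Contribution of each tissue to the total chi square value."""
    return {t: sum((observed[t].get(f, 0) - e) ** 2 / e
                   for f, e in expected[t].items() if e > 0)
            for t in observed}


def differentially_expressed(observed, expected, tissue, family, fold=2.0):
    """Two-fold rule: observed >= fold * expected for the family, and every
    other family observed at or below its expected frequency."""
    obs = observed[tissue].get(family, 0)
    if obs == 0 or obs < fold * expected[tissue][family]:
        return False
    return all(observed[tissue].get(f, 0) <= e
               for f, e in expected[tissue].items() if f != family)


# Hypothetical counts, not taken from Table 2 or Table 3.
toy_observed = {
    "muscle":   {"K": 8, "H": 0,  "W": 0},
    "placenta": {"K": 2, "H": 4,  "W": 30},
    "colon":    {"K": 2, "H": 16, "W": 2},
}
toy_expected = expected_frequencies(toy_observed)
toy_chi = chi_square_contributions(toy_observed, toy_expected)
for tissue in toy_observed:
    flagged = [f for f in toy_expected[tissue]
               if differentially_expressed(toy_observed, toy_expected, tissue, f)]
    print(tissue, round(toy_chi[tissue], 2), flagged)
```

In this toy table the placenta row is not flagged even though one family dominates it, because a large row total raises the expected frequency; this is why the contribution of each tissue to the total chi square value is reported alongside the two-fold rule.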
As shown in Table 3A, the HERV-K family is the only HERV family expressed in normal muscle and is more highly expressed in normal and cancerous brain and skin tissues than the other HERV families. Expression of the HERV-K family in these tissues is confirmed by the results obtained with the HERV-K108 provirus as the only representative of the HERV-K family (Table 3B). The HERV-K family is also expressed in other cancerous tissues, such as testis, uterus and head and neck, as well as in normal pancreas. However, the observed frequency in testicular cancers is just above the expected frequency (9/7) and therefore does not meet the two-fold criterion set above. It is interesting to note that most of the HERV-K ESTs identified in the "Others Canc." category (7/8) were isolated from stomach cancer cDNA libraries. Half of these ESTs stem from either the HERV-K101 (3 ESTs) or the K108 (1 EST) provirus. Taking the size of the cDNA libraries into account and comparing the frequencies observed in normal and cancerous tissues for the HERV-K family, we observed higher HERV-K expression levels in the following normal tissues: brain, muscle, skin and pancreas. On the other hand, the family is more highly expressed in testis, breast, and head and neck cancerous tissues than in the respective normal tissues. Therefore, the HERV-K family is expressed in both normal and cancerous tissues. In contrast, the HERV-H family is preferentially expressed in cancerous tissues such as prostate and colon. It is also over-represented in testicular cancers (11/7), though not reaching the two-fold threshold between observed and expected frequencies. On the other hand, the over-representation of observed HERV-H ESTs in the "Others Canc." category (28/13) indicates that the HERV-H family is specifically expressed in a set of cancerous tissues comprising bone marrow (5 ESTs), small intestine (4), bladder (3) and cervix (3) and is more highly expressed in stomach cancers (11) than the HERV-K family (7).
The HERV-W family is predominantly expressed in normal placenta, but is also expressed in cancerous placenta. Finally, the HERV-E family is the least expressed of the HERV families analyzed. This family seems to be expressed in normal and cancerous breast tissues, normal kidney and cancerous endocrine glands but the number of ESTs identified is relatively low (<5). However, the 5 ESTs observed in the "Others Norm." category were isolated from normal eye tissue.
HERV-K proviruses with atypical tissue expression
Approximately two-fold more ESTs were matched when thirty-one members of the HERV-K family were used instead of HERV-K108 alone in the query set (Table 2), indicating that some HERV-K proviruses matched additional ESTs. Consequently, when all HERV-K proviruses were present in the query set, additional normal and cancerous tissues showing HERV-K expression were identified. Table 4 shows HERV-K proviruses that are expressed in tissues where other members of the family are not usually expressed. For instance, C3_NT005863 is the only HERV-K provirus expressed in normal pancreas and is the most expressed provirus in fetal liver and spleen. The HERV-K 22q11 provirus accounts for all matched ESTs originating from normal uterus and normal or malignant prostate.
HERV expression analysis by MPSS
Massively parallel signature sequencing is a technology that allows the generation of millions of signature tags proximal to the 3' end of transcripts (21). Unlike other methods of sequence-based transcriptome analysis, such as ESTs and serial analysis of gene expression (SAGE), the MPSS technology can obtain, in a single experiment, up to a 10-fold clone coverage of the transcripts present in a human cell. MPSS data have been generated for 31 normal tissues and 3 cancer cell lines. Therefore, we were interested in comparing the HERV expression data based on EST counts with that based on MPSS data. We first predicted potential MPSS tags in HERV proviral sequences by extracting 13 nt sequences adjacent to the DpnII site proximal to the polyadenylation site present in the 3' LTR of HERV proviruses. After removing tags that also matched cellular genes, we checked if those sequences were present in the MPSS data and counted the number of times each tag was identified in each tissue. Three MPSS tags meeting these criteria were identified. The first one is shared by 17 HERV-K proviruses including provirus C3_NT005863, the second is specific for HERV-K 22q11, while the third is specific for the HERV-W provirus. The number of instances each MPSS tag was detected in the different tissues, reflecting the expression levels of the corresponding transcripts, is shown in Table 5. Because a modified protocol has been used to generate the MPSS data for the three cancer cell lines (Table 5B), the normal breast cell line and a normal placenta tissue sample, the number of MPSS tags in these tissues cannot be compared with those present in the normal tissues (Table 5A).
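The tag prediction step can be sketched as follows. This is only an illustration under stated assumptions: the DpnII recognition sequence is GATC, the polyadenylation position is supplied by the caller as a 0-based index, the tag is taken as the 13 nt immediately 3' of the site, and a site is skipped when fewer than 13 nt separate it from the polyadenylation position. The function name and the test sequences are hypothetical.

```python
def predict_mpss_tag(sequence, polya_pos, site="GATC", tag_len=13):
    """Return the tag_len nucleotides following the DpnII site closest to
    (and upstream of) polya_pos, or None if no usable site exists."""
    upstream = sequence.upper()[:polya_pos]
    search_end = len(upstream)
    while True:
        # rfind returns the 3'-most occurrence lying entirely before search_end.
        idx = upstream.rfind(site, 0, search_end)
        if idx == -1:
            return None
        tag_start = idx + len(site)
        if tag_start + tag_len <= len(upstream):
            return upstream[tag_start:tag_start + tag_len]
        # Assumption: a site too close to the poly(A) position is skipped
        # and the next site upstream is tried instead.
        search_end = idx + len(site) - 1


# Hypothetical 3' end: two GATC sites, the second too close to the end.
toy_ltr = "ACGT" + "GATC" + "AAACCCGGGTTTA" + "CC" + "GATC" + "TTG"
print(predict_mpss_tag(toy_ltr, len(toy_ltr)))
```

Predicted tags would then be filtered against cellular transcripts and counted per tissue in the MPSS data, as described in the text.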
Expression of the HERV-K 22q11 provirus in normal brain and prostate tissues (see Supplementary Table 2) is confirmed by the MPSS data. Moreover, the MPSS analysis reveals expression of this provirus in normal tissues (Table 5A) that were not identified by the EST expression analysis, such as placenta, testis, kidney and other tissues, as well as in a melanoma cell line (Table 5B). In contrast, the expression pattern of the MPSS tag shared by 17 HERV-K proviruses is similar but not identical to that obtained from the EST analysis. For instance, there was a correlation between MPSS and EST data for expression of these HERV-K proviruses in normal brain and breast tissues, but no expression could be detected in normal pancreas using MPSS while expression of the C3_NT005863 provirus in this tissue had been observed in the EST analysis (Supplementary Table 2). HERV-W expression in normal placenta and breast tissues is, however, confirmed by the MPSS analysis (Table 5B).
Analysis of ESTs matching retroviral open reading frames
Because of the accumulation of mutations in HERV sequences, most retroviral ORFs are fragmented or shortened compared to their original ancestor. In order to determine whether ESTs preferentially matched conserved retroviral ORFs, we analyzed the proportion of ESTs matching open reading frame sequences longer than 450 nucleotides for each HERV family studied. We found that a majority of ESTs from non-normalized cDNA libraries matched such HERV ORFs. The proportion of ESTs matching ORFs was 66% (63/95) for HERV-K, 67% (66/98) for HERV-W, 75% (27/36) for HERV-E and 41% (74/157) for HERV-H (Supplementary Table 3). Therefore, with the exception of HERV-H, a majority of the ESTs matched ORFs potentially encoding at least 150 aa.
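A minimal sketch of how open reading frames longer than 450 nucleotides could be located and compared with EST alignment coordinates is given below. The text does not specify how ORFs were delimited, so the sketch assumes stop-to-stop open frames on the forward strand only; the function names, coordinates and test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def long_open_frames(sequence, min_len=450):
    """Return (start, end) pairs, end exclusive, of stop-free stretches of at
    least min_len nucleotides in the three forward reading frames."""
    seq = sequence.upper()
    frames = []
    for frame in range(3):
        start = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if i - start >= min_len:
                    frames.append((start, i))
                start = i + 3
        # Stretch running to the last complete codon of the sequence.
        end = frame + ((len(seq) - frame) // 3) * 3
        if end - start >= min_len:
            frames.append((start, end))
    return frames


def overlaps_orf(aln_start, aln_end, orfs):
    """True if the half-open alignment interval overlaps any ORF interval."""
    return any(aln_start < orf_end and aln_end > orf_start
               for orf_start, orf_end in orfs)


# Hypothetical sequence: 151 sense codons followed by a stop codon.
toy_provirus = "ATG" + "GCA" * 150 + "TAA" + "GCAGCA"
toy_orfs = long_open_frames(toy_provirus)
print(toy_orfs)
```

With a 450 nt threshold, each reported frame corresponds to a potential product of at least 150 aa, matching the cutoff stated in the text.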
Among the ESTs matching HERV ORFs, all HERV-W specific ESTs (66/66) and a majority of HERV-H ESTs (58%, 43/74) matched a full-length envelope ORF encoded by the respective provirus. Moreover, 50% (7/14) of ESTs matching HERV-K C3_NT005863 and 37.5% (6/16) matching the 22q11 proviral ORFs matched regions encoding potentially functional full-length Gag proteins. The remaining HERV-K C3_NT005863 and 22q11 ESTs matched partial pol ORFs. For the other HERV-K proviruses in the query set, most ESTs matched either pol or env partial ORFs.
The fact that HERV expression has been reported in multiple cancer tissues and that HERV-K endogenous retroviral proteins have been identified using the SEREX methodology suggests that these proteins could be useful antigens for diagnostic purposes or cancer immunotherapy. We therefore studied digital expression patterns of the four least degenerate HERV families using members of each family to search human ESTs. As expected from the sequence differences between the HERV families at the DNA level, we observed that ESTs derived from different HERV families could easily be distinguished: ESTs matching members of the HERV-K family did not match proviral sequences from the HERV-W, -H or -E families. Moreover, although individual HERV-K members often did not match all family-specific ESTs, their expression patterns were similar and clustered together (Figure 1). Since our query set contained most of the proviral members of the HERV-K family, we also attempted to match each EST to its respective provirus. In order to be able to compare expression levels between or within HERV families, we compared EST counts from non-normalized, non-subtracted cDNA libraries.
Total EST counts for the HERV families analyzed in this study were relatively low, suggesting that HERVs are expressed at low levels. This is consistent with the fact that transcription of most HERV proviruses is usually shut down, possibly by methylation (10, 34). However, this observation might also be due to limitations in the depth of cDNA sequencing. Firstly, the number of cDNA libraries and EST sequences generated vary greatly from tissue to tissue. For example, about 80 times more EST sequences have been generated from normal brain than from normal uterus. Secondly, poorly expressed genes may be under-represented in relatively small cDNA libraries. The digital expression patterns reported here are consistent with the patterns already reported for the respective HERV families. For example, the HERV-K and -H families are known to be expressed in testicular and germ cell tumors. However, using digital expression profiling, expression in additional tissues was found. The HERV-K family is more highly expressed in normal tissues than in tumors. It is the only HERV family expressed in normal muscle and is overexpressed in normal brain, skin and pancreas, as well as in cancers of the brain, head and neck, and uterus, compared to the other HERV families. Moreover, HERV-K101 and K108 are expressed in stomach cancers but not in normal stomach tissues. Our analysis suggests that the HERV-K C3_NT005863 (HERV-KII) is expressed in normal pancreas and confirms the expression of the HERV-K 22q11 provirus in normal and malignant prostate tissues, contrasting with the rest of the HERV-K family. The level of expression of HERV-K 22q11 is however higher in normal prostate tissues than in prostate cancers. On the other hand, the HERV-H family is preferentially expressed in cancer tissues (105 ESTs from cancerous tissues versus 46 ESTs from normal tissues) and has a higher level of expression in stomach, colon, prostate and testis tumors than other families. 
The observed versus expected frequency for HERV-H in testicular cancers, however, was less than two-fold (11/7). In addition, this family seems to be specifically expressed in a set of cancer tissues such as small intestine, bone marrow, bladder and cervix. The HERV-W family is mostly expressed in placenta (8) where its envelope protein (syncytin) has been reported to play a role in human placenta morphogenesis by mediating cytotrophoblast fusion (33). Finally, the HERV-E family had a relatively low digital expression.
Tissue expression patterns of HERV-K 22q11 (expressed in brain and prostate tissues) and HERV-W (predominantly expressed in placenta) were confirmed by the MPSS data (22) in a set of 31 different normal tissue samples and 3 cancer cell lines. The higher sensitivity of the MPSS analysis relative to other digital expression methods allowed us to identify additional tissues in which the HERV-K 22q11 provirus is expressed, such as normal kidney, heart and thymus. On the other hand, the MPSS analysis was limited by the fact that MPSS data were available to us for only a few tissue samples, and thus expression detected by ESTs could not always be confirmed by MPSS.
We observed that a majority of ESTs matched relatively long and sometimes full-length ORFs of the HERV-K, -W and -E families. The active transcription of HERV ORFs and the detection of antibody responses against the HERV-K Gag or Env proteins in about 60% of germ cell tumor patients (18, 35) suggest that these proteins could be potential targets in cancer immunotherapy. However, since peptides encoded by these long HERV ORFs might not be produced in vivo, it is important to assess their tissue expression levels experimentally.
Because only one provirus each was used to represent the HERV-W, -H or -E families, we cannot be sure whether ESTs matching the corresponding family were transcribed from the provirus used in the query set or another member of the family. However, since the HERV-H family was found to be expressed in some cancerous tissues (small intestine, bone marrow, bladder and cervix) but not in the corresponding normal tissues, members of this family should be further investigated for their expression patterns and immunological properties.
For the HERV-K family, we were able to match a majority of ESTs to their respective proviruses. Our digital expression analysis showed that C3_NT005863 and 22q11 are the most highly expressed HERV-K members. The fact that numerous ESTs matched full-length gag ORFs encoded by these two proviruses and that an antibody response against the HERV-K 22q11 Gag protein has been observed in prostate cancer patients (36) suggests that these proteins are good antigen candidates. Moreover, since HERV-K 22q11 is a member of the HERV-K(Old) subfamily (25), it still has a 96 bp insertion in the gag ORF, deleted in more recent HERV-K proviruses. It would be interesting to study the immunogenicity of the peptide encoded by this insertion since this subfamily is present at a lower copy number in the human genome than the rest of the HERV-K family. An antibody response against such an antigen might be more specific for the tissues in which HERV-K 22q11 is expressed, prostate in particular. However, the fact that the HERV-K 22q11 provirus is more highly expressed in normal prostate than in tumors and is also expressed in normal brain might raise some safety issues for an immunotherapy approach. Additionally, we identified two HERV-K stomach cancer target candidates as a few ESTs from this tissue matched pol or full-length env ORFs encoded by the K101 (3 ESTs) or K108 (1 EST) proviruses, respectively. Therefore, these proteins and HERV-K proviruses should be further investigated in vivo to determine their level of expression in cancer and normal tissues as well as their immunological properties.
In summary, EST and MPSS data indicate that HERV-derived RNAs are more widely expressed than originally thought. The HERV-K101 and K108, expressed in stomach cancers, as well as the HERV-H family that seems to be specifically expressed in some cancers might be good candidates for cancer immunotherapy or diagnosis. Their precise tissue distribution will have to be verified using more sensitive and quantitative methods, and the presence of HERV-specific proteins in these tissues investigated histologically. It is clear that the HERV families are expressed differentially across tissues. Whether this reflects different biological roles, if any, remains an open question.
Materials and methods
Mapping of HERV proviral sequences
Information about previously described HERV-K, -W, -H and -E proviruses was retrieved from the literature (8, 23, 24, 26, 37, 38, 39, 40) and the proviruses mapped to human genomic sequences and Ensembl chromosomes (NCBI 33 assembly). Some proviral sequences have been submitted individually to GenBank/EMBL, but in most cases only large bacterial artificial chromosome (BAC) or contig sequences containing the proviral HERV sequence were available. In order to determine the start and stop coordinates of the provirus on the genomic sequence and analyze the proviral structure, pairwise alignment tools such as lalign (27) or bl2seq (28) were used to align known LTRs, gag, pol or env sequences for a known family member. Coordinates, as well as complete proviral sequences, were then retrieved from the GenBank sequence repository or Ensembl web site. For the HERV-K family, the sequences of the K101 and K108 proviruses described by Barbulescu et al. (23) were used as probes in this analysis.
Alternatively, HERV-K proviral sequences were obtained using our own strategy to identify new HERV-K proviruses in the human genome. A query set (probes) composed of 6 HERV-K proviruses (K101, K102, K104, K108 and K113) described by Barbulescu et al. (23, 37), HERV-K10 (41) and two HERV-K18 alleles (42), was used to search the NCBI NT genomic contig database using Megablast (29). An alignment of at least 1000 base pairs was used as seed and the alignment was extended taking all other overlapping alignments into account. Information about the matched NT contig and the sequence was then extracted. This new strategy allowed us to also map slightly divergent HERV-K sequences. About 80 HERV-K insertion sites were found in the human genome using this approach, most of them containing only partial, truncated proviral sequences. Among the 22 predicted full-length HERV-K insertions, three potentially new proviral sequences, named after the chromosome and NCBI NT contig on which they were identified, were selected for further analysis (C3_NT005863.11, C3_NT022411.11 and C11_NT009151.11). The first two proviruses, C3_NT005863 and C3_NT022411, correspond to the HERV-K(II) and HERV-K(I) proviral sequences described by Sugimoto et al. (30). The third proviral sequence, C11_NT009151, is identical to the HERV-K provirus 11q23 identified by Hughes et al. (24).
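The seed-and-extend step can be pictured as an interval-merging problem. The sketch below is an assumption-laden illustration rather than the procedure actually used: alignments are reduced to half-open (start, end) coordinates on a single contig, touching intervals are treated as overlapping, and a merged region is kept only if at least one of its alignments reaches the 1000 bp seed length. The function name and coordinates are hypothetical.

```python
def extend_seeds(alignments, min_seed=1000):
    """alignments: list of (start, end) on one contig, end exclusive.
    Returns merged regions that contain at least one seed-length alignment."""
    merged = []  # entries are (start, end, contains_seed)
    for start, end in sorted(alignments):
        is_seed = (end - start) >= min_seed
        if merged and start <= merged[-1][1]:
            prev_start, prev_end, prev_seed = merged[-1]
            # Extend the current region with the overlapping alignment.
            merged[-1] = (prev_start, max(prev_end, end), prev_seed or is_seed)
        else:
            merged.append((start, end, is_seed))
    return [(s, e) for s, e, has_seed in merged if has_seed]


# Hypothetical coordinates: one seeded cluster and one cluster without a seed.
toy_hits = [(100, 1300), (1200, 1500), (1450, 1600), (5000, 5400), (5300, 5600)]
print(extend_seeds(toy_hits))
```

Regions assembled only from short alignments are discarded in this sketch, which is one way to separate candidate proviral insertions from scattered fragments.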
Digital expression analyses
Query sets of proviral sequences without LTRs (see supplemental data) were assembled as probes to search human ESTs using Megablast (29). Filters were inactivated (-FF) and mismatches penalized (-q -9). Only alignments with an E value lower than $10^{-40}$ were selected. The first query set was composed of 31 HERV-K proviruses (see Table 1) and one representative each of the other families: HERV-W (GenBank Accession No. AC000064 30950-37500), -H (GenBank Accession No. AJ289709) and -E (GenBank Accession No. M10976). For comparison purposes, the same analysis was carried out with the HERV-K108 provirus as a representative of the HERV-K family.
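The E-value cutoff can be applied to search output with a few lines of code. The sketch below assumes the hits were written in the standard 12-column tab-delimited BLAST format (query, subject, percent identity, alignment length, mismatches, gap openings, query start/end, subject start/end, E value, bit score); the output format actually used in the study is not stated, and `filter_hits` is a hypothetical helper.

```python
def filter_hits(lines, max_evalue=1e-40):
    """Keep hits whose E value is strictly lower than max_evalue.

    lines: iterable of tab-delimited hit lines (assumed 12-column layout).
    Returns a list of (query_id, subject_id, evalue, bit_score) tuples.
    """
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        evalue = float(fields[10])
        if evalue < max_evalue:
            kept.append((fields[0], fields[1], evalue, float(fields[11])))
    return kept


# Illustrative lines only; identifiers and values are made up.
example = [
    "K108\tEST1\t99.0\t500\t5\t0\t1\t500\t1\t500\t1e-120\t900",
    "K108\tEST2\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-20\t150",
]
print(filter_hits(example))
```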
Hits identified by Megablast were first clustered by EST and sorted by alignment score, and the library name and identifier were then retrieved for each EST. The controlled hierarchical vocabulary (eVOC) for development stages, anatomical sites and pathology types was used to classify cDNA libraries and describe the tissues from which they were generated (31). After selecting only the best matching provirus for each EST, tissue information was retrieved based on the library identifier. The first output of this analysis was a tab-delimited file containing a list of ESTs, best matching HERV proviruses, alignment information, cDNA library names and tissue descriptions. The data in the file were then rearranged by HERV family and sorted by HERV provirus, and ESTs from standard cDNA libraries were separated from those derived from normalized or subtracted ones (see Supplementary Table 3). Only EST counts from non-normalized, non-subtracted cDNA libraries were used to compare the expression levels of the different HERV families in normal and cancer tissues.
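Selecting the best matching provirus for each EST amounts to a group-by-maximum over the hit list. The following sketch assumes each hit is an (EST identifier, provirus name, alignment score) tuple and resolves ties by keeping the first hit encountered; the tie-breaking rule is an assumption, since the text does not specify one.

```python
def best_hit_per_est(hits):
    """Return {est_id: (provirus, score)} with the highest-scoring provirus.

    hits: iterable of (est_id, provirus, score) tuples.
    On equal scores the first hit seen is kept (an assumption).
    """
    best = {}
    for est_id, provirus, score in hits:
        if est_id not in best or score > best[est_id][1]:
            best[est_id] = (provirus, score)
    return best


# Illustrative hits only; identifiers and scores are made up.
hits = [
    ("EST1", "K108", 950.0),
    ("EST1", "K101", 900.0),
    ("EST2", "HERV-H", 400.0),
]
print(best_hit_per_est(hits))
```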
The relative expression of one HERV family in a tissue is characterized by the ratio of the number of ESTs matching members of the HERV family to the total number of ESTs sequenced in the respective tissue. HERV expression patterns were obtained by sorting tissues according to the relative expression of the HERV family in each tissue. The total number of ESTs present in a tissue was compiled by counting all ESTs from non-normalized, non-subtracted cDNA libraries.
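As a worked illustration of this ratio, the sketch below computes the relative expression of one HERV family per tissue and ranks tissues by it. The counts are invented for the example and the function name `relative_expression` is hypothetical; only the definition of the ratio comes from the text.

```python
def relative_expression(herv_counts, total_counts):
    """Rank tissues by (ESTs matching the HERV family) / (all ESTs in tissue).

    herv_counts:  {tissue: number of ESTs matching the family}
    total_counts: {tissue: total ESTs from non-normalized,
                   non-subtracted libraries for that tissue}
    Tissues with no sequenced ESTs are skipped to avoid division by zero.
    """
    ratios = {
        tissue: herv_counts.get(tissue, 0) / total
        for tissue, total in total_counts.items()
        if total > 0
    }
    # Highest relative expression first.
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Invented counts, for illustration only.
herv_w = {"placenta": 30, "brain": 5}
totals = {"placenta": 100000, "brain": 50000, "liver": 20000}
for tissue, ratio in relative_expression(herv_w, totals):
    print(tissue, ratio)
```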
Validation of HERV-K provirus loci matched by ESTs
To validate that each EST stemmed from the matched HERV-K provirus in the query set, HERV-K ESTs were searched against the human genome (NCBI build 33) using BLAT (43) and the locus coordinates were compared to those of the characterized HERV-K provirus loci present in the query set (see Supplementary Table 4). In addition, a tag was added in Supplementary Table 3 for each HERV-K EST: Y, validated EST mapping to the HERV-K provirus; N, the EST matches a different locus from an uncharacterized HERV-K provirus with a higher BLAT score.
Clustering of HERV proviral expression
A comparison of all EST sequences matched by the HERV-K family versus all proviral sequences present in the query set was performed using the Smith-Waterman (SW) algorithm (44) with a standard DNA similarity matrix and gap opening/extension penalties set to -16/-4. The SW scores $S$ were normalized to obtain bit scores (45) using the formula $S_{\mathrm{bit}} = (\lambda S - \ln K)/\ln 2$, where the parameter values for the scoring system used above are $\lambda = 0.15812780$ and $K = 0.05411805$. These values were estimated through simulation. The binary distance between any pair of proviral sequences was calculated according to the formula $d = 1 - n_{11}/(n_{11} + n_{10} + n_{01})$, where $n_{11}$ is the number of EST sequences with a bit score equal to or greater than 36 for both proviral sequences, and $n_{10}$ and $n_{01}$ are the numbers of EST sequences with a bit score equal to or greater than 36 matching only one of the two proviral sequences. The threshold of 36 corresponds to a P-value of 0.01 (46). This distance measure ranges from 0 to 1. Finally, proviral sequences were clustered using average-linkage agglomerative hierarchical clustering based on the binary distances.
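The score normalization and the binary distance can be written compactly. The sketch below uses the parameter values and threshold quoted in the text; the data layout (one dictionary of bit scores per provirus, keyed by EST identifier) and the convention of returning a distance of 1 when neither provirus has any EST above threshold are assumptions made for the example.

```python
import math

LAMBDA = 0.15812780   # scoring-system parameter quoted in the text
K = 0.05411805        # scoring-system parameter quoted in the text
THRESHOLD = 36.0      # bit-score cutoff quoted in the text


def bit_score(raw_score):
    """Convert a raw Smith-Waterman score into a bit score."""
    return (LAMBDA * raw_score - math.log(K)) / math.log(2)


def binary_distance(scores_a, scores_b, threshold=THRESHOLD):
    """Distance d = 1 - n11 / (n11 + n10 + n01) between two proviruses.

    scores_a, scores_b: {est_id: bit score against provirus A / B}.
    """
    hits_a = {est for est, s in scores_a.items() if s >= threshold}
    hits_b = {est for est, s in scores_b.items() if s >= threshold}
    n11 = len(hits_a & hits_b)
    n_any = len(hits_a | hits_b)  # equals n11 + n10 + n01
    if n_any == 0:
        return 1.0  # assumption: no shared evidence means maximal distance
    return 1.0 - n11 / n_any


# Invented bit scores, for illustration only.
provirus_a = {"e1": 40.0, "e2": 50.0, "e3": 10.0}
provirus_b = {"e1": 36.0, "e2": 20.0, "e4": 60.0}
print(binary_distance(provirus_a, provirus_b))
```

This distance is one minus the Jaccard index of the two EST sets. A matrix of such pairwise distances could then be passed to any average-linkage routine, for example `scipy.cluster.hierarchy.linkage` with `method="average"`; the software actually used for clustering is not named in the text.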
Comparison of the expression levels in normal and cancerous tissues of the HERV families
The observed frequencies for each HERV family in normal and cancerous tissues were used to build contingency tables, and the expected cell frequencies, $E$, were calculated using the formula $E = (RT \times CT)/N$, where $RT$ is the row total, $CT$ the column total and $N$ the grand total. Chi square values were then computed for each tissue using the formula $\chi^2 = \sum (\mathrm{obs} - \mathrm{exp})^2/\mathrm{exp}$, where obs is the observed frequency of the HERV family in the tissue and exp its expected frequency. A HERV family was considered to be differentially expressed in a tissue if its observed frequency was at least two-fold higher than its expected frequency in the tissue and the observed frequencies for the other HERV families were equal to or lower than their expected frequencies. The contribution of the tissue to the total chi square value was also taken into account.
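The expected frequencies, per-tissue chi square values and the two-fold criterion can be sketched as below. The table layout (families as rows, tissues as columns), the function names and the counts are assumptions made for illustration; the sketch also assumes every row and column total is non-zero.

```python
def expected_frequencies(observed):
    """E = (row total x column total) / grand total for every cell.

    observed: {family: {tissue: count}} with the same tissues in each row.
    """
    families = list(observed)
    tissues = list(observed[families[0]])
    row_totals = {f: sum(observed[f].values()) for f in families}
    col_totals = {t: sum(observed[f][t] for f in families) for t in tissues}
    n = sum(row_totals.values())
    return {f: {t: row_totals[f] * col_totals[t] / n for t in tissues}
            for f in families}


def chi_square_by_tissue(observed, expected):
    """Sum of (obs - exp)^2 / exp over all families, for each tissue."""
    tissues = list(next(iter(observed.values())))
    return {t: sum((observed[f][t] - expected[f][t]) ** 2 / expected[f][t]
                   for f in observed)
            for t in tissues}


def differentially_expressed(observed, expected, family, tissue, fold=2.0):
    """Apply the two-fold criterion described in the text."""
    if observed[family][tissue] < fold * expected[family][tissue]:
        return False
    return all(observed[f][tissue] <= expected[f][tissue]
               for f in observed if f != family)


# Invented counts, for illustration only.
obs = {"K": {"A": 10, "B": 30}, "H": {"A": 5, "B": 35}, "W": {"A": 25, "B": 5}}
exp = expected_frequencies(obs)
print(chi_square_by_tissue(obs, exp))
print(differentially_expressed(obs, exp, "W", "A"))
```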
The relative expression levels of a HERV family in a normal tissue versus the corresponding cancerous tissue were assessed by comparing the observed frequencies in each tissue and the total number of ESTs sequenced for the respective tissue.
Evaluation of the coding capacity
Proviral sequences without LTRs were analyzed for the presence of open reading frames using the NCBI ORF finder web site (47). The coordinates of ORFs longer than 450 nt were retrieved for each provirus. The number of ESTs matching the ORFs was then compiled by comparing these coordinates to those of the matching ESTs on the provirus (Supplementary Table 3). The presence of putative conserved domains characteristic of Gag, Pol or Env proteins was also evaluated using the NCBI Conserved Domain search feature available on the NCBI ORF finder web site.
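The coordinate comparison can be illustrated with a simple interval-overlap count. The text does not state how much overlap was required for an EST to count as matching an ORF, so the sketch below assumes that any overlap of at least one nucleotide qualifies; the function name and coordinates are hypothetical.

```python
def count_ests_per_orf(orfs, est_alignments, min_orf_len=450):
    """Count ESTs whose alignment on the provirus overlaps each ORF.

    orfs:           list of (start, end) ORF coordinates on the provirus.
    est_alignments: list of (start, end) EST alignment coordinates.
    Coordinates are 1-based and inclusive. Only ORFs longer than
    min_orf_len nucleotides are considered.
    """
    counts = {}
    for orf_start, orf_end in orfs:
        if orf_end - orf_start + 1 <= min_orf_len:
            continue  # ORF too short
        counts[(orf_start, orf_end)] = sum(
            1 for est_start, est_end in est_alignments
            if est_start <= orf_end and est_end >= orf_start
        )
    return counts


# Invented coordinates, for illustration only.
orfs = [(100, 2100), (3000, 3300)]
ests = [(50, 400), (2000, 2500), (2200, 2900)]
print(count_ests_per_orf(orfs, ests))
```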
The authors would like to acknowledge the support provided to this project by Dr. Lloyd Old. We thank Prof. Ernest Feytmans and Dr. Pierre Farmer for advice on statistical analysis and Dr. Andrew Simpson for critical reading of the manuscript. This work was supported by the European Cancer Immunome Program (QLGT-CT1999-01211), the Cancer Research Institute (New York) and the Ludwig Institute for Cancer Research.
- Received November 5, 2003.
- Accepted January 27, 2004.
- Copyright © 2004 by C. Victor Jongeneel