Despite the high prevalence of colon cancer in the world and the great interest in targeted anti-cancer therapy, only a few tumor-specific gene products have been identified that could serve as targets for the immunological treatment of colorectal cancers. The aim of our study was therefore to identify frequently expressed colon cancer-specific antigens. We performed a large-scale analysis of genes expressed in normal colon and colon cancer tissues isolated from colorectal cancer patients using massively parallel signature sequencing (MPSS). Candidates were additionally subjected to experimental evaluation by semi-quantitative RT-PCR on a cohort of colorectal cancer patients. From a pool of more than 6000 genes identified unambiguously in the analysis, we found 2124 genes that were selectively expressed in colon cancer tissue and 147 genes that were differentially expressed to a significant degree between normal and cancer cells. Differential expression of many genes was confirmed by RT-PCR on a cohort of patients. Despite the fact that deregulated genes were involved in many different cellular pathways, we found that genes expressed in the extracellular space were significantly over-represented in colorectal cancer. Strikingly, we identified a transcript from a chromosome X-linked member of the human endogenous retrovirus (HERV) H family that was frequently and selectively expressed in colon cancer but not in normal tissues. Our data suggest that this sequence should be considered as a target of immunological interventions against colorectal cancer.
This article was published in Cancer Immunity, a Cancer Research Institute journal that ceased publication in 2013 and is now provided online in association with Cancer Immunology Research.
According to a recent survey, human colorectal carcinoma (CC) is the third most frequent cancer worldwide, affecting predominantly people above 50 years old. The genetic pathway by which CC appears and develops has been well characterized (1). This process is accompanied by the deregulated expression of a number of genes, of which many have no immediate role in tumorigenesis. Identification of frequently deregulated gene products may nevertheless lead to the definition of a group of markers useful for the molecular characterization and staging of CC. In addition, induced expression of gene products in CC may give rise to tumor-associated antigens and elicit immune responses. Immune responses appear to be therapeutically beneficial as the clinical outcome of patients with CC was recently demonstrated to correlate with the type, density and localization of immune cells (2). Also, antibody-based therapies targeting CC-associated antigens (e.g. cetuximab against EGFR), alone or in combination with chemotherapy, have been shown to induce clinical responses in patients with CC (3). Unlike other types of cancers in which many gene products resulting from deregulated expression have been identified, only few have been found in CC. For example, we and others have previously shown that known tumor-specific gene products, in particular those belonging to the class of cancer/testis antigens, are only rarely expressed in CC (4, 5). These results prompted us to search for gene products whose expression would be frequently deregulated in CC.
The recently developed MPSS (massively parallel signature sequencing) method enables the simultaneous sequencing of over 10^6 gene tags located in the proximity of the 3' end of transcribed genes (6). This technique offers several advantages over previously described ones: (i) the large number of sequenced tags saturates the screen of the approximately 30,000 different genes predicted to be present in the genome; (ii) MPSS does not require a priori knowledge of the population of genes expressed in a given sample; and (iii) the relative abundance of each transcript in a given sample is more precisely determined because of the large dynamic range of the tag distribution, from zero to many thousands of tags per million (tpm).
Using MPSS, we have identified many genes that appear to be differentially expressed in normal colon (NC) and CC. We tested a subset of these candidates by semi-quantitative RT-PCR and confirmed, for most of them, their differential expression. Our analysis of CC samples obtained from more than 25 patients also uncovered the frequent and specific expression of a sequence derived from an X-linked human endogenous retrovirus. Other genes were also found to be frequently over- or under-expressed in CC samples. Altogether, our analysis identified several candidates that could serve as targets of spontaneous or induced immune responses in CC patients.
MPSS analysis of normal colon and colon cancer tissue
Massively parallel signature sequencing of normal colon (NC) mucosa and primary colon cancer (CC) resulted in the identification of 10832 and 14219 tags, respectively. Of these, 5843 of the NC sample and 7267 of the CC sample mapped to annotated genes. The others mapped to genes encoded by mitochondrial DNA, non-coding reverse DNA strands, genomic and non-genomic sequences and contaminants. Based on the tags mapping to annotated genes, we found 1429 gene clusters that were expressed only in NC, 2818 only in CC and 4205 in both. Among them, several matched to more than one gene or to a single gene found on multiple chromosomes. The former could occur in the case of genes belonging to conserved families while the latter is probably due to mis-annotations of the genome. After discarding these tags, a total of 6364 genes remained that were unambiguously identified by MPSS tags (4240 in NC and 5284 in CC). Of these, 1080 were expressed specifically in NC, 2124 were expressed specifically in CC and 3160 were expressed in both NC and CC. A complete list of these genes is provided in Supplementary Table 1.
The tag distribution was markedly skewed towards small counts (Figure 1A) as approximately 50% of the genes in NC and CC had fewer than 10 tpm. Similar results were obtained with other normal and neoplastic pairs of tissues, such as normal breast (NB) and breast cancer (BC) and normal melanocytes (NM) and metastatic melanoma (MM), which have been previously analyzed by MPSS [(7) and unpublished data] (Figure 1B). Interestingly, the average number of tags with counts ranging from 1-999 tpm was significantly higher in the cancer samples than in the normal samples, while the number of genes with >1000 tpm did not significantly differ. Inversely, the number of tags absent in normal tissues was significantly higher than in cancer tissues, suggesting that increased diversity of gene expression might be a hallmark of cancer cells. Since limited gene diversity is characteristic of differentiated tissues, such an increase may also indicate that tumor cells undergo dedifferentiation. Alternatively, the increased diversity of tags detected in the tumor samples might also reflect a cancer-associated differentiation of fibroblasts surrounding the tumor (desmoplastic reaction) (8).
The degree of gene expression variability in the MPSS analysis of NC and CC samples was first assessed by comparing the tpm values of 5 bona fide housekeeping genes defined in (9) (β-actin, ubiquitin C, cyclophilin A, β-glucuronidase and GAPDH) and of the amyloid precursor protein (APP), previously described to be expressed at nearly identical levels in NC and CC (10). As shown in Table 1, the tpm values corresponding to these genes were found to be equivalent in NC and CC. We next extended our analysis to assess the overall degree of tpm variability between pairs of tags derived from NC and CC samples in order to define more accurately the threshold above which a difference in tpm may become significant. To achieve this, we exploited the results of a previous MPSS analysis of 32 normal human tissues (11). We selected the 580 unambiguously identified genes for which tags had been detected in all 32 tissues and defined them as commonly expressed genes. Subsequently, we monitored the tpm variability of those genes in the NC and CC samples. We found that the tpm of approximately 95% of these commonly expressed genes varied by less than a factor of 20 between the NC and CC samples (Figure 1C). Only 25 commonly expressed genes (4.31%) showed tpm differences greater than 20-fold and were similarly distributed along either axis of the scatter plot. Based on these results, we defined our cut-off value at 20. Using this value and excluding genes absent from NC or CC, we found that 61 and 86 genes were significantly overexpressed in NC and CC, respectively (reducing the cut-off value to 10 resulted in 145 and 212 genes potentially overexpressed in NC and CC, respectively). These numbers are surprisingly small considering that approximately 50% of the 6364 genes identified by MPSS appeared to be expressed selectively in NC or CC.
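The derivation of the cut-off described above can be illustrated with a short Python sketch. It is not the authors' implementation: the function names and the toy tpm pairs are hypothetical, and counting a fold difference exactly equal to the cut-off as "within" the cut-off is an assumption, since the text does not specify how ties were handled.

```python
def fold_difference(nc_tpm, cc_tpm):
    """Symmetric fold difference (always >= 1) between two non-zero tpm values."""
    high, low = max(nc_tpm, cc_tpm), min(nc_tpm, cc_tpm)
    return high / low


def fraction_within(pairs, cutoff):
    """Fraction of (NC, CC) tpm pairs whose fold difference does not exceed cutoff."""
    folds = [fold_difference(nc, cc) for nc, cc in pairs]
    return sum(f <= cutoff for f in folds) / len(folds)


def classify(nc_tpm, cc_tpm, cutoff=20):
    """Label a gene detected in both samples (tpm > 0) relative to the cut-off."""
    if cc_tpm / nc_tpm > cutoff:
        return "overexpressed in CC"
    if nc_tpm / cc_tpm > cutoff:
        return "overexpressed in NC"
    return "not significant"


# Hypothetical (NC, CC) tpm pairs standing in for commonly expressed genes.
toy_pairs = [(10, 12), (100, 50), (5, 200), (30, 30)]
within_20 = fraction_within(toy_pairs, cutoff=20)
```

With real data, one would raise the cut-off until `fraction_within` reaches the desired coverage (95% in the analysis above) and then apply `classify` only to genes detected in both samples.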
Identification of gene ontology (GO) terms associated with NC and CC
Among the many differentially expressed genes, we sought to identify groups of genes that would allow us to discriminate between NC and CC. To do this, we chose to use the GO-slim vocabulary. Three GO-slim terms were significantly over-represented in NC (P <0.01). They were defined by the terms "catalytic activity", "metabolic process" and "transferase activity" and comprised 42, 28 and 23 genes, respectively (Table 2 and data not shown). Interestingly, 15 of the 23 genes identified by the GO term "transferase activity" were kinases (n = 8) or transferases (n = 7). Two different GO-slim terms were also significantly over-represented in CC: "extracellular space" (P = 0.01) and "nucleus" (P = 0.005). Genes identified by the first GO-slim term comprised primarily genes involved in matrix remodeling, invasion and motility (Table 2). These included members of the TGF-β family, matrix metalloproteases, tenascin, and chemokines. These results suggest that some biological characteristics of the genes uncovered by MPSS are discriminative between NC and CC.
Validation of differentially expressed genes
Based on the MPSS databases of normal and neoplastic colon, breast and melanocyte, we performed Boolean searches to identify differentially expressed genes. The results of these queries were then ranked in decreasing order, with the gene displaying maximal tpm differences at the top of the list. Only genes with tpm differences >20-fold between NC and CC were considered. For each of the selected categories, we chose the top 3-6 candidates and validated their expression profile by semi-quantitative RT-PCR.
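The Boolean searches and ranking described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the record type `GeneTpm`, the function names and the toy entries (`geneA` to `geneF`) are hypothetical, and the use of an inclusive cut-off (a ratio of 20 or more) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneTpm:
    """Hypothetical record holding the tpm of one gene in four samples."""
    gene: str
    nc: float          # normal colon
    cc: float          # colon cancer
    nb: float = 0.0    # normal breast
    nm: float = 0.0    # normal melanocytes


def cc_specific(genes):
    """Tags detected in CC but not in NC, NB or NM, ranked by CC tpm."""
    hits = [g for g in genes if g.cc > 0 and g.nc == 0 and g.nb == 0 and g.nm == 0]
    return sorted(hits, key=lambda g: g.cc, reverse=True)


def cc_overexpressed(genes, cutoff=20.0):
    """Genes present in NC, absent from NB and NM, ranked by CC/NC tpm ratio."""
    hits = [
        g for g in genes
        if g.nc > 0 and g.nb == 0 and g.nm == 0 and g.cc / g.nc >= cutoff
    ]
    return sorted(hits, key=lambda g: g.cc / g.nc, reverse=True)


# Hypothetical entries; names and values are for illustration only.
toy_db = [
    GeneTpm("geneA", nc=0, cc=150),
    GeneTpm("geneB", nc=0, cc=900),
    GeneTpm("geneC", nc=2, cc=100),
    GeneTpm("geneD", nc=5, cc=50),
    GeneTpm("geneE", nc=0, cc=400, nb=3),
    GeneTpm("geneF", nc=1, cc=25),
]
specific_names = [g.gene for g in cc_specific(toy_db)]
overexpressed_names = [g.gene for g in cc_overexpressed(toy_db)]
```

Each category examined in the following sections corresponds to a different combination of such conditions, with the top-ranked candidates taken forward to RT-PCR.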
Genes specifically expressed in CC
These genes were identified based on the detection of MPSS tags in samples from CC but not in samples from NC, NB and NM. Among the 130 unambiguously identified genes in this category, we performed experimental validation on the 5 candidates showing the highest number of tags in CC (Table 3 and Figure 2A). PCR was first performed on five-fold dilutions of the cDNAs from individual NC and CC samples of the 4 patients included in our MPSS analysis. As shown in Figure 2A, regenerating islet-derived 1α (REG1A) was found to be specifically expressed in CC samples from 2 of the 4 patients; no differential expression was detectable in the remaining 2 patients. Analysis of an additional 15 primary tumors revealed REG1A expression in 13 samples and in 2 of 3 liver metastases (Table 4). REG1A was overexpressed in 5 of 6 tumor samples for which matched NC tissue was available. While these results confirmed a previous report on the overexpression of REG1A in CC (12), the complete absence of tags from the NC samples could not be confirmed experimentally. Expression of renal dipeptide peptidase 1 (DPEP1) was clearly detectable in CC samples but only weakly visible in the NC samples. Similar results were found in primary CC samples of 15 additional patients and 4 metastases (Table 4). Again, this result confirmed previously reported overexpression of DPEP1 in CC (13). Most striking was the expression of the endogenous retroviral element HERV-H located on chromosome Xp22, which was exclusively detected in CC. This retroviral sequence has been reported by others to be specifically expressed in CC lesions (14). No signal was detectable in any of the 4 tumor-matched NC samples. Moreover, we confirmed for each sample that no PCR amplification was detectable in the absence of reverse transcription (data not shown). Further analyses of this transcript are described below.
Finally, axin-2 (AXN2) and formyl peptide receptor 1 (FPR1) were clearly overexpressed in the CC samples of all 4 patients. Analyses of additional patients confirmed these findings (Table 4). It is interesting to note that in patient 846 (whose primary and related metastatic lesions were available) the expression levels of all genes were similar in the primary and metastatic lesions. Altogether, the genes belonging to this category were frequently overexpressed in CC but their expression was not restricted solely to CC, with the exception of HERV-H.
Genes overexpressed in CC
Next, we identified genes that were overexpressed in CC by calculating the tpm ratio between CC and NC. Genes with 0 tpm in NC, as well as those present in NB or NM, were excluded. We performed semi-quantitative RT-PCR on the only 3 unambiguously identified genes overexpressed by a factor of 20 or more in CC and absent from NB and NM (Table 3). As shown in Figure 2B, bone morphogenetic protein (BMP7), a member of the TGF-β superfamily, and membrane-anchored FGF receptor substrate 3 (FRS3) were clearly overexpressed in CC. BMP7 was also detected in 12/14 samples of primary CC and 4/4 of liver metastases, while its expression remained mostly undetectable in 6 tumor-matched NC samples (Table 4). Expression of lymphocyte-specific protein 1 (LSP1) was only marginally overexpressed in patients 892 and 942 and equally expressed in patients 888 and 954.
Genes down-regulated in CC
We searched among colon-specific genes (i.e. absent from NB and NM) for those that demonstrated the greatest reduction in tpm between NC and CC (Table 3). As shown in Figure 2C, expression of carbonic anhydrase 1 (CA1) and 4 (CA4), HCA520, UGT2B17 and zymogen granule protein 16 (ZG16) was down-regulated in most CC samples, confirming the results predicted by MPSS. HCA520 was reported to be expressed in several normal tissues but not in tumor cell lines of different histological origins, except for an ovarian cancer cell line (15). ZG16 was also found to be down-regulated or even absent in over 80% of hepatocellular carcinomas (16), while CA1 was reported to be down-regulated in a large proportion of CC (17, 18) and decreased expression of CA4 in renal cell carcinomas was associated with poor patients' prognosis (19). Considering that tumor cells may undergo dedifferentiation, decreased expression of some of these genes that are typically expressed in terminally differentiated cells may be expected.
Genes overexpressed in multiple cancers
Finally, we searched for genes that were not only overexpressed in CC but also in BC and MM. Criteria for the selection of such candidate genes were that their tpm values be >0 in normal tissues NC, NB and NM, the tpm ratio between NC and CC >20 and the ratio between normal and neoplastic breast and melanocyte be >1. Among the 13 genes fulfilling these criteria, the 7 genes with highest tpm ratios between NC and CC are shown in Table 3, together with their tpm ratios between normal and neoplastic breast and melanocyte. Within this category, KIAA1533, cancer susceptibility candidate 3 (CASC3/MLN51) and dolichol-phosphate mannosyltransferase (DPM1) were most clearly overexpressed in CC, heat shock protein 105 (HSPH1) was moderately overexpressed while splicing factor R/S-rich 2 interacting protein (SFRS2IP) and UBX domain-containing protein 4 (UBXD4) were not (Figure 2D). Again, with the exception of the latter two, all genes predicted to be overexpressed in CC were confirmed. CASC3 has been reported to be overexpressed in BC (and in gastric cancers) (20) and HSPH1 in a variety of tumors, including breast. It is noteworthy that HSPH1, also known as NY-CO-25, is a target of autologous antibodies in colorectal cancer patients (21). We were unable to perform PCR amplifications of IER2 in any condition tested (data not shown).
These results demonstrate the validity of MPSS in detecting differentially expressed genes. However, it should be noted that a gene with 0 tpm in a given sample analyzed by MPSS does not automatically imply that this particular gene is not expressed in that sample (see Discussion).
Expression pattern of HERV-H gag transcript
Among the genes tested above, we focused our attention on HERV-H located on chromosome Xp22. Human endogenous retroviral elements constitute up to 8% of our genome (22). Several HERV sequences were identified in our MPSS analysis (data not shown). However, HERV-H Xp22 was the only one for which a significant tag count difference between NC and CC was observed. The vast majority of the HERVs are defective, owing to the accumulation of multiple mutations and/or deletions. It has been previously reported that other members of the HERVs, in particular HERV-K, are expressed in some melanomas (23). HERVs are composed of a single open-reading frame containing, from 5' to 3', gag, pol and env, under the control of the 5' long terminal repeat (LTR) (Figure 3). Because of our interest in translated gene products and considering the high frequency of mutations leading to premature stops in the HERV transcripts, we selected pairs of oligonucleotides from the gag region situated downstream from the 5' LTR. Several putative open-reading frames (ORFs) were predicted for gag (Figure 3A). The DNA sequence of each of the predicted ORFs was translated (Figure 3B and data not shown) and the amino acid sequences were used to perform protein BLAST analyses. The highest degree of homology between any of the predicted Gag ORFs and the Gag sequence of other HERVs was found for the Gag denoted GAG:(14400) in Figure 3A (Figure 3C). No homology between GAG:(14396) and any Gag protein was found, while GAG:(14399) and GAG:(14419) did not contain any initiation codons (data not shown). We performed RT-PCR on primary CC samples from 25 patients. We also analyzed 6 CC metastases to the liver and 1 sample of tubulo-villous colon adenoma. As shown in Figure 4, HERV-H gag transcripts were detected in 60% of the primary CC samples, in 7 of 8 metastases and in the pre-cancerous adenoma.
In contrast, no expression of HERV-H gag transcripts was found in any of the following normal tissues: colon, liver, spleen, stomach, kidney, lung, skin, endometrium, ovary, prostate, peripheral blood monocytes and thyroid. Finally, weak expression (thin band detectable only at the lowest cDNA dilution) was found in bladder. The amplified cDNA from the gag region was sequenced and an amber mutation was found 280 bp downstream from the initiation codon. The same mutation was found in all samples analyzed (n = 7) and corresponded to the genomic sequence of chromosome X available in public databases. It is therefore unlikely to be a driving mutation that contributes to the development of CC. Taken together, our results indicate that HERV-H encodes a truncated protein of 93 amino acids that could serve as a target for anti-tumor therapy.
MPSS has been developed as a method to sequence and identify very large numbers of transcripts simultaneously. The main advantages of this technique are the unbiased identification of genes expressed in a given sample, the high number of sequenced tags ensuring complete coverage of the transcriptome and the non-saturable detection of abundant transcripts. Nevertheless, some expressed genes remain undetectable (11). Several reasons account for this, including long sequence lengths between the 3' end of the coding sequence and the polyA sequence, repetitive or highly homologous sequences, gene polymorphisms and genome mis-annotations. In the current study, the presence of false negative tags is the most relevant issue. Theoretically, a tag with 0 tpm should indicate that this particular transcript is absent from the sample. For example, a large fraction of genes was found to have 0 tpm in NC and >0 in CC. However, our validation by semi-quantitative RT-PCR did not confirm these results, as 0 tpm rarely correlated with the complete absence of detectable transcript. One reason for this is the MPSS analysis method itself, which only scores genes that have at least 1 tpm in each of at least two independent sequencing runs. Tags derived from very rare transcripts which have been found in only one run or have <0.5 counts per million sequenced tags (i.e. fewer than 2 tags among the 4 × 10^6 sequenced tags) will receive the value 0, even though they have been detected at least once. A second reason is sample heterogeneity. Because we were primarily interested in identifying frequently deregulated transcripts, we have pooled the RNA from 4 individuals who may not share identical expression profiles. Moreover, the tissue was not microdissected and most likely contained other cell types, such as endothelial cells from blood vessels, stromal and immune cells.
Thus, while we could confirm the frequent deregulation of genes identified by our MPSS analysis in a cohort of patients, rare RNA species expressed in the tumor cells of only 1 patient might have been diluted to levels lower than 0.5 tpm. Altogether, we conclude that the 0 tpm values should be evaluated with caution. Similar conclusions were reached by Stolovitzky and colleagues (24).
Aside from the problem of the false-negative tag counts, the question of the threshold defining over- or under-expression arose. The significance of differential gene expression in two biologically distinct samples analyzed by MPSS was previously established by statistical methods (7). However, because the tpm values of commonly expressed genes varied greatly between the NC and CC samples, we set the threshold of significance based on the following calculation: We determined the tpm variation encompassing at least 95% of the 580 commonly expressed genes between NC and CC. The value that was obtained was 20 (a 40-fold difference in tpm was required to reach 99% inclusion). This threshold may appear unusually high, as 5- to 10-fold differences in gene expression are frequently reported in comparisons between normal and cancerous tissues. It should be noted that variations between a range of so-called housekeeping genes have been experimentally tested by quantitative RT-PCR and found to vary greatly, in some cases up to 100-fold (9, 25). We would therefore conclude that the significance of a variation in gene expression that is lower than at least an order of magnitude should be considered with great caution.
Among the genes identified by MPSS that displayed significant differential expression in NC and CC, we selected the top candidates of each category defined by a given Boolean search and confirmed the results by semi-quantitative RT-PCR. In most cases, differential expression revealed by MPSS could be confirmed, not only in the samples used for the MPSS analysis but also in those of a larger cohort of CC patients. Genes such as REG1A, DPEP1, BMP7 and AXN2 had been previously found to be overexpressed in CC (13, 26-28) and other cancers, such as melanomas (29), ovarian (30) and breast cancer (31), while CA1, CA4 and ZG16 were reported to be down-regulated (Table 5). We also identified several new genes that were differentially expressed in CC, including FRS3, KIAA1533, HCA520 and DPM1. These gene products operate in seemingly distinct cellular pathways, suggesting that deregulated gene expression affects multiple pathways. It is nevertheless noteworthy that gene products active in the extracellular space appear to distinguish CC from NC.
Most interestingly, we uncovered the selective expression of HERV-H, an endogenous retrovirus. Human endogenous retroviral sequences are estimated to represent between 1-8% of the human genome (22, 32). Despite the fact that most of them are defective, their promoters, the 5' LTRs, remain functional and can drive not only the transcription of retroviral genes but also that of neighboring genes (33, 34). Translocation of the FGFR1 kinase downstream of a HERV 5' LTR resulted in the aberrant transcription of that gene in atypical stem-cell myeloproliferative disorder (35) and the 5' LTR of HERV-H located on chromosome 17 was recently shown to act as alternative promoter of the GSDML gene in the human colon cancer cell line HCT-116 (36). Finally, immunosuppressive properties of certain HERV sequences have also been documented (37).
The HERV-H family is the largest of the HERVs, with sequences present on almost every chromosome including chromosome X. Our study identified a transcript from the HERV-H located on chromosome Xp22 in the majority of primary and metastatic CC samples analyzed, as well as in adenoma. In contrast, no detectable expression was found in any normal tissue tested, except for bladder. Expression of that HERV was also detected in approximately 25% of non-small cell lung carcinomas but not in melanoma (data not shown). Moreover, it was reported to be expressed in approximately 40% of gastric and 17% of pancreatic cancers (38). This pattern of expression, i.e. lack of expression in most normal tissues, is reminiscent of a category of genes, the so-called cancer/testis genes, whose expression is restricted to testis and tumors (39). Similar to HERV-H Xp22, the majority of cancer/testis genes are also located on chromosome X. However, unlike HERV-H Xp22, which is only infrequently expressed in normal testis (data not shown), cancer/testis genes are commonly expressed in normal testis. Sequence analyses indicated that the expression of HERV-H Xp22 would generate a protein of 93 amino acids. Based on its frequent expression in CC, its limited expression in normal tissue and the predicted expression of a truncated protein, HERV-H Xp22 should be considered as a therapeutic target for the active immunotherapy of human colon cancer.
Materials and methods
Samples from CC patients were obtained after informed consent. The study protocol was approved by the Ludwig Institute for Cancer Research ethical review committee, as well as by the medical and ethical committees of the University Hospital (Lausanne, Switzerland). All patients were operated according to standard procedures and had not undergone any preoperative treatment. Detailed information about the patients included in this study is given in Table 6.
Tissue samples and handling
Tissue samples from normal colonic mucosa (NC), CC, normal liver and CC liver metastases were obtained at the time of surgical resection. The normal tissue was collected at distant sites from the tumor. Tissue fragments were isolated with the help of a qualified pathologist, cut into small fragments and snap-frozen in liquid nitrogen. Samples were stored at -80˚C until mRNA extraction. In parallel, tissue samples were also embedded in paraffin and used for pathological analysis and tumor staging (40).
Isolation of mRNA
Frozen fragments were processed using a Qiagen RNeasy Mini Kit (Qiagen, Hilden, Germany). In brief, frozen material was weighed and mechanically dissociated by Polytron (Kinematica AG, Newark, NY, USA) in RLT buffer (1 ml/20-30 mg tissue) following the manufacturer's instructions. The samples were then treated with 30 U DNase I (Qiagen) to remove genomic DNA. The quality and quantity of the RNA were assessed using the Agilent Bioanalyzer chip. Purified RNA was stored at -80˚C until use.
For the MPSS analysis, RNA was extracted from paired NC and CC tissues of 4 patients (LAU888, LAU892, LAU942 and LAU954). The patients were representative of the local population of CC patients and provided sufficient material for the study. The NC RNA samples of the 4 patients were pooled, as were the CC RNA samples. A total of 130 µg RNA of each of the NC and CC pools was sent to Lynx Therapeutics (Hayward, CA) for MPSS analysis.
Sample processing and gene annotation
The samples were processed and analyzed following the "Megaclone signature" procedure described previously (6, 41). Briefly, mRNA was isolated and reverse transcribed and the cDNA was digested with the restriction enzyme DpnII. The cDNA fragment adjacent to the polyA proximal DpnII restriction site was cloned. The resulting library of templates was amplified and annealed to microbeads. The microbeads were then loaded into flow cells and the signature sequences of these templates (or tags) were determined by a series of enzymatic sequencing cycles. Approximately 4 × 10^6 sequences were analyzed over 4 independent sequencing runs. Only tags that were detected in at least 2 independent sequencing experiments were scored. For each tag, the highest value obtained over the different sequencing runs was selected. Scoring for each tag was then calculated as follows and expressed as tags per million (tpm): tpm (tag A) = (sum of tag A in run N × 10^6) / (sum of all tags in run N).
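The scoring rule described above can be illustrated with the following Python sketch. It is a minimal reading of the text rather than the procedure used by the sequencing provider: the function names, the representation of a run as a dictionary of raw tag counts and the toy tag sequences are all hypothetical.

```python
from collections import Counter


def tpm_per_run(run_counts):
    """Convert the raw tag counts of one sequencing run into tags per million."""
    total = sum(run_counts.values())
    return {tag: count * 1_000_000 / total for tag, count in run_counts.items()}


def score_tags(runs, min_runs=2):
    """Score tags seen in at least min_runs runs with their highest per-run tpm."""
    per_run = [tpm_per_run(run) for run in runs if sum(run.values()) > 0]
    # Count in how many runs each tag was detected.
    runs_detected = Counter(
        tag for run in per_run for tag, value in run.items() if value > 0
    )
    scores = {}
    for tag, n_runs in runs_detected.items():
        if n_runs >= min_runs:
            scores[tag] = max(run.get(tag, 0.0) for run in per_run)
    return scores


# Hypothetical raw counts for three runs; the tag sequences are placeholders.
toy_runs = [
    {"GATCAAAA": 50, "GATCCCCC": 950},
    {"GATCAAAA": 100, "GATCCCCC": 900},
    {"GATCGGGG": 1, "GATCCCCC": 999},
]
scores = score_tags(toy_runs)
```

In this toy example the tag seen in a single run receives no score, which mirrors how a rare transcript can end up with 0 tpm even though it was sequenced at least once.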
Tag sequences were assigned to known transcripts and thence to genes, using the two stage procedure described previously (42, 43) and the NCBI36 assembly of the human genome. Counts from tags that mapped to more than one transcript were discarded, unless those transcripts came from the same gene as a result of alternative splicing, in which case the counts were pooled.
MPSS data mining
A database containing all genes identified by MPSS in the NC and CC samples was assembled. This database was integrated into a larger MPSS database, which also included genes expressed in normal breast epithelium (NB) and breast cancer (BC), normal melanocyte (NM) and melanoma (MM). This larger database was used throughout this work and allowed us to compare the expression pattern of genes in NC and CC with their expression in other normal and tumor tissues of different histological origin. The assembled database was curated so as to contain only tags identifying genes unambiguously. This curated database was subjected to multiparametric Boolean searches. The list of genes identified by individual queries was then organized in such a way that the genes with the highest bias were ranked first.
To identify commonly expressed genes, a database containing unique tags identified by MPSS analyses of 32 normal tissues (44) was compared with the MPSS database described above. Even though the MPSS data from the normal and neoplastic tissues were not directly comparable with those of the 32 normal tissues because of differences in sequencing procedures, we nevertheless considered tags present in both databases to be commonly expressed.
Identification of gene ontology terms associated with genes overexpressed in CC
The representation of GO terms in genes expressed in colon was investigated using the GO-slim ontologies obtained from the EBI (45). GO terms were assigned to genes based on the associated RefSeq annotation (46) and mapped onto GO-slim using the goaslim.map file (dated 21-AUG-2007). To test for significant GO-slim terms for genes specific to NC or CC, we defined a set of colon-specific genes (see Supplementary Table 1) and assigned GO-slim terms as described. This served as the reference list for the number of genes having a given GO-slim term. We then assembled subsets of this gene list, based on expression criteria (e.g. 0 tpm in NC and at least 20 tpm in CC), and determined the number of genes having a given GO-slim term. The number of genes associated with each GO-slim term in the subset was compared to the same GO term in the reference list, and any difference tested for significance using a Fisher exact test in the R package. Only differences with P <0.05 were considered significant.
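The comparison of a gene subset with the reference list can be sketched as follows. The analysis above used the Fisher exact test in R; the pure-Python function below is only an illustrative stand-in that computes the one-sided (over-representation) P value from the hypergeometric distribution. The choice of a one-sided test, the function name and the toy counts are assumptions and are not taken from the study.

```python
from math import comb


def over_representation_p(subset_with_term, subset_size,
                          reference_with_term, reference_size):
    """One-sided Fisher exact P value for over-representation of a GO-slim term.

    Probability of observing at least subset_with_term annotated genes when
    subset_size genes are drawn without replacement from a reference list of
    reference_size genes, reference_with_term of which carry the term.
    """
    upper = min(subset_size, reference_with_term)
    total = comb(reference_size, subset_size)
    tail = sum(
        comb(reference_with_term, i)
        * comb(reference_size - reference_with_term, subset_size - i)
        for i in range(subset_with_term, upper + 1)
    )
    return tail / total


# Toy counts: 5 of 10 reference genes carry the term, and all 5 genes
# of the subset carry it.
p_toy = over_representation_p(5, 5, 5, 10)
```

A term would be reported as over-represented in a subset when the resulting P value falls below the chosen significance level.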
The differential expression of selected genes identified by MPSS was experimentally validated by RT-PCR, first on the material isolated from the 4 patients whose RNA had been subjected to MPSS analysis and then on the material isolated from a larger cohort of colon cancer patients. The sequences of primer pairs used to amplify each gene and the amplification conditions are listed in Table 7. Primer pairs were designed in such a way that they matched coding sequences separated by at least one intron (except for the intron-less HERV-H sequence). The housekeeping gene β-actin was used as calibrator. All oligonucleotides were purified by HPLC. Semi-quantitative PCR was performed on serial 5-fold dilutions of cDNA. Complementary DNA was produced using 100 U M-MLV reverse transcriptase per µg of RNA, following the manufacturer's protocol. However, for the analysis of the intron-less HERV-H sequences, an additional DNase treatment was performed before reverse transcription so as to ensure absence of genomic DNA contamination. DNA sequencing of the HERV-H gag region was performed on amplified cDNA.
This work was supported in part by the Ludwig Institute for Cancer Research, the Cancer Research Institute, the NCCR, a research instrument of the Swiss National Foundation, the Swiss National Science Foundation and the Hans-Altschüler Foundation.
- Received February 29, 2008.
- Accepted May 21, 2008.
- Copyright © 2008 by Frédéric Lévy