An algorithm to discover gene signatures with predictive potential
© Hallett et al; licensee BioMed Central Ltd. 2010
Received: 25 March 2010
Accepted: 2 September 2010
Published: 2 September 2010
The advent of global gene expression profiling has generated unprecedented insight into our molecular understanding of cancer, including breast cancer. For example, human breast cancer patients display significant diversity in terms of their survival, recurrence, metastasis as well as response to treatment. These patient outcomes can be predicted by the transcriptional programs of their individual breast tumors. Predictive gene signatures allow us to correctly classify human breast tumors into various risk groups as well as to more accurately target therapy to ensure more durable cancer treatment.
Here we present a novel algorithm to generate gene signatures with predictive potential. The method first classifies the expression intensity of each gene, as determined by global gene expression profiling, as low, average or high. The matrix containing the classified data for each gene is then used to score the expression of each gene based on its individual ability to predict the patient characteristic of interest. Finally, all examined genes are ranked based on their predictive ability and the most highly ranked genes are included in the master gene signature, which is then ready for use as a predictor. This method was used to accurately predict the survival outcomes in a cohort of human breast cancer patients.
We confirmed the capacity of our algorithm to generate gene signatures with bona fide predictive ability. The simplicity of our algorithm will enable biological researchers to quickly generate valuable gene signatures without specialized software or extensive bioinformatics training.
Clinicians are commonly faced with two important decisions when treating cancer patients: whether or not adjuvant chemotherapy is required, and which treatment is most appropriate. Traditionally, several histopathological characteristics of the tumor are taken into consideration when deciding on the best treatment. However, it has been reported that 70-80% of breast cancer patients do not benefit from the use of chemotherapy, but are still exposed to the deleterious side effects of these drugs. Therefore, additional prediction methods are needed to improve the quality of life for breast cancer patients. One of these methods relies on gene expression profiling-based predictors, which can be used to further inform the decision-making process and increase a clinician's ability to successfully treat cancer patients. Recently, researchers developed a 70-gene signature that can correctly separate patients into good- and poor-prognosis groups, and identified patients who can be spared unnecessary chemotherapy [2, 4]. However, constructing such a signature requires the use of various clustering and classification algorithms, which in turn require specialized software and bioinformatics training. Consequently, there is a need for strategies to generate predictive gene signatures that are amenable to the software and skill sets available to the cancer biologist.
Typically gene expression based predictors are "trained" on a cohort of samples whose gene expression profiles are known, and for which at least one biological characteristic has been measured. After the "training" of a predictor it must be validated on a set of samples, which were not used to initially "train" the algorithm. Predictors should in turn be able to accurately forecast the biological characteristic of samples of interest.
For our purposes we used a data set consisting of whole tumor gene expression profiles derived from 295 primary human breast tumors, as well as clinical data relating to the patients' survival and occurrence of metastasis. We then coarse-grained the expression data into high, average and low expression levels, and ranked genes based on the extent of their expression in patients who either survived or succumbed to breast cancer. In this fashion we were able to find genes whose transcripts generally had high and low expression in patients who succumbed and survived, respectively, and vice versa. By combining the top ranked candidates from a 144-patient training dataset we were able to create a 20-gene signature which performed well on a 151-patient validation dataset.
Our analyses establish an effective method to obtain gene expression based predictors that clearly separate human breast cancer patients into distinguishable prognosis groups with statistically significant differences in survival.
Microarray and clinical data
The microarray data used for our analyses was obtained from the Stanford microarray repository (downloaded from http://microarray-pubs.stanford.edu/wound_NKI/explore.html, henceforth called NKI dataset). A matrix containing clinical data for the patients that provided samples for the microarray profiles used in the present study was downloaded from the same location. This data consists of the gene expression profiles of primary breast tumors biopsied from 295 human breast cancer patients. All patients had either stage I or stage II breast cancer, and were younger than 53 years old. The prevalence of lymph-node positive and lymph-node negative disease was 49% and 51%, respectively. We combined these data into one matrix containing indices for survival, metastasis, and the gene expression profiles for each patient. We used 12 year overall survival as the clinical endpoint for this study.
Organization of data
We blindly divided the patients into two groups consisting of similar numbers of patients, one for algorithm training (144 patients) and the other for algorithm validation (151 patients).
Defining levels of gene expression
If the expression of a gene in a given patient's tumor was greater than the upper limit of the 95% confidence interval for the expression of the same gene across all patient tumors, then the gene's expression was scored high for that patient's tumor.
If the expression of a gene in a given patient's tumor was less than the lower limit of the 95% confidence interval for the expression of the same gene across all patient tumors, then the gene's expression was scored low for that patient's tumor.
If the expression of a gene in a given patient's tumor was within the 95% confidence interval for the expression of the gene across all patient tumors, then the gene's expression was scored average for that patient's tumor. These steps were completed for every gene across every patient tumor.
This new matrix consisting of clinical patient data, as well as the gene expression score for each gene, represented by either high, average or low, was then used to rank the genes based on their predictive capacity.
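The three-level classification described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not give the formula for the 95% confidence interval, so the sketch assumes the interval of the mean under a normal approximation (mean ± 1.96 × standard error). The function name `discretize_gene` and the example values are hypothetical.

```python
import math
import statistics

def discretize_gene(values, z=1.96):
    """Label each patient's expression of one gene as 'high', 'average' or 'low'.

    Assumption: the 95% confidence interval is that of the mean across all
    patient tumors, mean +/- z * (sample SD / sqrt(n)). The source does not
    spell out the formula, so this choice is illustrative only.
    """
    n = len(values)
    mean = statistics.fmean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(n)
    lower, upper = mean - half_width, mean + half_width
    labels = []
    for v in values:
        if v > upper:
            labels.append("high")
        elif v < lower:
            labels.append("low")
        else:
            labels.append("average")
    return labels

# Hypothetical expression values for one gene in five tumors.
expression = [-1.0, -0.1, 0.0, 0.1, 1.0]
print(discretize_gene(expression))
```

Applying the function to every gene (every row of the expression matrix) yields the matrix of high, average and low calls used in the ranking step.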
Ranking the predictive capacity of each gene
We ranked each gene in the training set according to its expression in the tumor of patients who either survived or died from breast cancer. We expected genes whose expression was associated with poor prognosis to be generally highly expressed in patients who died and to be expressed at low levels in patients who survived. Conversely, we expected genes whose expression was associated with good prognosis to generally be highly expressed in patients who survived and to be expressed at low levels in those patients who succumbed. Therefore, the ranking of the genes was performed as follows for genes predictive of poor or good prognosis.
Genes predictive of poor prognosis
A predictive score was computed for each gene across all patients, and was initially set at 0.
The score for each gene was increased by 1 when the patient had both high gene expression and died, or had both low gene expression and survived.
The score was decreased by 1 when the patient had both low gene expression and died, or had both high gene expression and survived.
Average gene expression levels did not lead to any changes in the predictive score.
Genes predictive of good prognosis
A predictive score was computed for each gene across all patients, and was initially set at 0.
The score was increased by 1 when the patient had both high gene expression and survived, or had both low gene expression and died.
The score was decreased by 1 when the patient had both low gene expression and survived, or had both high gene expression and died.
Average gene expression levels did not lead to any changes in the predictive score.
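The two scoring rules above can be expressed compactly in Python. The sketch below is illustrative rather than the Excel template used in this study; the function names, gene labels and example data are hypothetical. Note that, under the rules as stated, the good-prognosis score of a gene is exactly the negative of its poor-prognosis score.

```python
def poor_prognosis_score(labels, died):
    """Score one gene as a poor-prognosis marker across all patients.

    labels: 'high', 'average' or 'low' for each patient.
    died:   True if the patient died, False if the patient survived.
    """
    score = 0
    for label, dead in zip(labels, died):
        if label == "average":
            continue  # average expression never changes the score
        # Concordant cases: high expression + died, or low expression + survived.
        concordant = (label == "high") == dead
        score += 1 if concordant else -1
    return score

def good_prognosis_score(labels, died):
    """Mirror rule: high + survived or low + died adds 1, the reverse subtracts 1."""
    return -poor_prognosis_score(labels, died)

def top_genes(scores_by_gene, k):
    """Return the k gene names with the highest scores."""
    return sorted(scores_by_gene, key=scores_by_gene.get, reverse=True)[:k]

# Hypothetical calls for two genes in four patients.
calls = {
    "GENE_A": ["high", "low", "average", "high"],
    "GENE_B": ["low", "high", "high", "low"],
}
died = [True, False, True, False]
poor_scores = {g: poor_prognosis_score(c, died) for g, c in calls.items()}
print(poor_scores, top_genes(poor_scores, 1))
```

Ranking genes by `poor_prognosis_score` and by `good_prognosis_score` gives the two lists from which the top candidates are drawn.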
We then combined the top ranked genes from both the poor-prognosis and good-prognosis gene lists to generate a predictor gene signature. We completed all of the steps described above using Microsoft Excel™ 2007. Template file available upon request.
Measuring the predictive ability of the gene signature
In order to separate the training data set into good-prognosis and poor-prognosis groups we summed the expression of the poor-prognosis genes (poor-prognosis gene score) and of the good-prognosis genes (good-prognosis gene score) for each patient in our training set. To give each patient a single overall-prognosis score we subtracted the good-prognosis gene score from the poor-prognosis gene score, and ranked the patients according to this new total. Patients with the highest expression of poor-prognosis genes and the lowest expression of good-prognosis genes therefore received the highest scores, and patients with the lowest expression of poor-prognosis genes and the highest expression of good-prognosis genes received the lowest scores. In this fashion, high scores were indicative of poor prognosis and low scores were indicative of good prognosis. In order to determine an optimal cut-off score which would yield prognosis predictions with the highest possible specificity and sensitivity, we used receiver operating characteristic (ROC) curves. This generated a list of possible cut-off scores, as well as each score's associated specificity and sensitivity. We next summed the specificity and sensitivity for each cut-off score and used the cut-off which yielded the highest total. For the random control, we generated a 20-gene signature populated with genes chosen by a random number generator (http://www.random.org).
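As a rough illustration of the scoring and cut-off selection just described, the following Python sketch computes an overall-prognosis score and picks the cut-off that maximizes sensitivity plus specificity. Several details are assumptions: the sketch sums raw expression values, treats a score at or above the cut-off as a poor-prognosis call, and tries each observed score as a candidate cut-off (GraphPad Prism may enumerate cut-offs differently). All names and data are hypothetical.

```python
def overall_score(expression, poor_genes, good_genes):
    """Poor-prognosis gene score minus good-prognosis gene score for one patient.

    expression: dict mapping gene name -> expression value for this patient.
    """
    poor = sum(expression[g] for g in poor_genes)
    good = sum(expression[g] for g in good_genes)
    return poor - good

def best_cutoff(scores, died):
    """Return (cutoff, sensitivity, specificity) maximizing their sum.

    Assumption: a patient is called poor prognosis when score >= cutoff.
    Both outcome groups must contain at least one patient.
    """
    n_died = sum(died)
    n_survived = len(died) - n_died
    best = None
    for cutoff in sorted(set(scores)):
        true_pos = sum(1 for s, d in zip(scores, died) if d and s >= cutoff)
        true_neg = sum(1 for s, d in zip(scores, died) if not d and s < cutoff)
        sens = true_pos / n_died
        spec = true_neg / n_survived
        if best is None or sens + spec > best[1] + best[2]:
            best = (cutoff, sens, spec)
    return best

# Hypothetical overall-prognosis scores and outcomes for five patients.
scores = [-2.0, -1.0, 0.5, 1.5, 3.0]
died = [False, False, True, False, True]
print(best_cutoff(scores, died))
```

Maximizing the sum of sensitivity and specificity is equivalent to maximizing Youden's J statistic (sensitivity + specificity - 1).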
Analysis of survival differences between good-prognosis and poor-prognosis groups
Unless otherwise indicated, GraphPad Prism 5™ software was used to complete survival analysis, linear regression, and comparison of survival means, as well as all associated statistical tests, and ROC analysis, to measure the predictive ability of the prognosis gene signature in both the training and validation data sets. Additional details available as supplementary methods.
Comparison of models
We calculated the predictive accuracy (Cases correctly predicted Vs All cases), specificity (Cases of correctly predicted good overall survival Vs Cases of actual good overall survival), and positive predictive value (PPV) (Cases correctly predicted of poor survival Vs All cases predicted poor survival) for our 20-gene signature, the Aurora kinase A, and 70-gene signature models. Patients were divided into good and poor survival groups based on Aurora kinase A expression, where the average expression of Aurora kinase A for all patients was used as the cut-off separating the two groups. The 70-gene signature classification for the patients was included in the original clinical data file.
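The three comparison measures defined above reduce to simple ratios of counts. The Python sketch below shows one way to compute them, together with a mean-based split such as the one used for Aurora kinase A. It assumes that expression above the cohort mean is called poor prognosis, a direction the text does not state explicitly; the function names and example data are hypothetical.

```python
def classify_by_mean(values):
    """Split patients at the cohort mean.

    Assumption: expression above the mean is called poor prognosis (True).
    """
    mean = sum(values) / len(values)
    return [v > mean for v in values]

def compare_metrics(predicted_poor, actual_poor):
    """Return accuracy, specificity and positive predictive value (PPV).

    predicted_poor / actual_poor: booleans per patient, True = poor survival.
    """
    pairs = list(zip(predicted_poor, actual_poor))
    tp = sum(1 for p, a in pairs if p and a)          # poor predicted, poor observed
    fp = sum(1 for p, a in pairs if p and not a)      # poor predicted, good observed
    tn = sum(1 for p, a in pairs if not p and not a)  # good predicted, good observed
    return {
        "accuracy": (tp + tn) / len(pairs),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }

# Hypothetical predictions and outcomes for six patients.
predicted = [True, True, False, False, True, False]
actual = [True, False, False, False, True, True]
print(compare_metrics(predicted, actual))
```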
Gene names were uploaded to the gene ontology website http://www.geneontology.org, and the biological processes associated with the human form of the gene were recorded.
Generation and validation of a gene signature that predicts human breast cancer patient survival
Table: Genes comprising the 20-gene signature (includes a 95% confidence interval column).
Analysis of the 20-gene signature
Table 2: Predictive ability of the Aurora kinase A, 20-gene signature, and 70-gene signature models. Columns: Aurora kinase A (NKI dataset), 20-gene (151-patient validation set), 70-gene (NKI dataset). Rows include positive predictive value.
Gene ontology of the 20-gene signature
Protein folding/Response to virus
Double stranded break DNA repair/Meiosis I/Spermatogonial Development/Oocyte maturation/Pachytene (cell cycle)/Meiotic recombination/Transcription from RNA Pol II promoter
Interspecies interaction between organisms/Intronless viral mRNA export from nucleus/mRNA export from nucleus/mRNA processing/Transport
DNA repair/DNA replication/Double stranded break DNA repair/UV protection/Phosphoinositide mediated signaling
DNA replication/DNA dependent DNA replication
Cytokine mediated signaling pathway/Endocytosis/Hippocampus development/Layer formation in the cerebral cortex/Lipid metabolic process/Positive regulation of kinase activity/Proteolysis/Signal transduction
Anaphase-promoting complex-dependent proteasomal ubiquitin dependent protein catabolic process/Interspecies interaction between organisms/Negative regulation of ubiquitin ligase activity involved in mitotic cell entry/Positive regulation of ubiquitin ligase activity involved in mitotic cell entry/Proteolysis involved in cellular protein catabolic process
Branching morphogenesis of a tube/Chromatin remodelling/Epithelial cell differentiation/Prostate gland development/Glucose homeostasis/Hormone metabolic process/Lung development/Multicellular organismal development/Negative regulation of survival gene product/Negative regulation of transcription from RNA pol II promoter/Neuron fate specification
Angiogenesis/Apoptosis/Cell Migration/Induction of apoptosis by extracellular signals/Integrin mediated signaling pathway/Lamellipodium assembly/Positive regulation of cell adhesion/Positive regulation of PI3 kinase activity/Regulation of GTPase activity/Regulation of Rho protein signal transduction/Small GTPase mediated signal transduction/Vesicle fusion
Regulation of translation/Translation initiation
Apoptosis/Cell cycle arrest/Induction of apoptosis/Response to stress
Androgen metabolic process/Antral ovarian follicle growth/Epithelial cell development/Epithelial cell proliferation involved in mammary gland duct elongation/Estrogen receptor signaling pathway/Male gonad development/Mammary gland alveolus development/Mammary gland branching involved in pregnancy/Neuroprotection/Osteoblast development
Golgi organisation/Golgi vesicle transport/Protein amino acid phosphorylation/Retrograde transport, vesicle recycling within golgi
We sought to generate an algorithm with the following properties: (i) simple implementation with straightforward methodology, and (ii) high predictive accuracy. Our aim was to enable biologists who are not bioinformatics experts to develop valuable and biologically useful gene expression-based prediction models. Importantly, we completed all steps of our algorithm using Microsoft Excel™ 2007, and will share the template files used for these analyses with interested researchers. This software is widely (if not universally) accessible to and used by the biological research community, suggesting that implementation of this technique will not be hampered by lack of software or training. As mentioned previously, most other feature selection techniques require sophisticated clustering and classification algorithms, which in turn require specialized software and training in its use.
To confirm that our algorithm produced a prediction model with predictive power comparable to that of other feature selection techniques, we compared it with an Aurora kinase A expression model as well as the 70-gene signature MammaPrint™ model. The Aurora kinase A model was previously shown to have predictive accuracy comparable to that of several feature selection techniques at predicting breast cancer patient survival, and can be used to make comparisons between feature selection techniques. Additionally, the 70-gene signature has previously been tested on the NKI dataset, which allowed us to make model comparisons on the same patients. The 70-gene signature is also used clinically and thus represents a "gold standard" against which to compare the predictive accuracy of gene signatures that predict breast cancer patient outcome. We observed that our model had a slightly higher overall predictive accuracy than either the Aurora kinase A expression model or the 70-gene signature, and all three models had comparable specificities and positive predictive values (Table 2). Importantly, these observations demonstrate that our algorithm produces prediction models with accuracy comparable to that of other feature selection techniques while being generally more accessible and usable for biological research scientists. To this end, we have begun using our algorithm to generate gene expression-based prediction models of breast cancer cell sensitivity to commonly used anti-cancer therapies.
Here, we present an algorithm to generate gene signatures with predictive potential. It is noteworthy that our algorithm was developed using Microsoft Excel™ and tested using GraphPad Prism 5™, commonly available software that should make the method easy to adopt. Importantly, the signature developed using our method had predictive accuracy comparable to that of both the Aurora kinase A expression model and the 70-gene MammaPrint™ model [2, 8]. Our methods represent a novel and broadly applicable technique to generate predictive gene signatures that we anticipate will prove useful to the molecular biological research community.
Conflict of interests
The authors declare that they have no competing interests.
Survival analysis was completed using GraphPad Prism 5™ software's "survival" option. Time to endpoint or time to study censorship was entered as the independent variable (x-axis column), and death or survival (denoted 1 = death, 0 = survival) was entered in the y-axis column. Independent y-axis columns were used for each group (good or poor prognosis). Statistical analysis (log-rank test) was completed using the GraphPad analyze tab.
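For readers without access to GraphPad Prism, the comparison of two survival curves can be approximated with a standard two-group log-rank (Mantel-Cox) statistic. The Python sketch below is an illustration only: it is not the code used in this study, Prism's exact computation may differ, and the function name and example data are hypothetical. Events follow the coding above (1 = death, 0 = censored).

```python
import math

def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi_square, p_value), 1 degree of freedom."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    observed_a = expected_a = variance = 0.0
    # Visit each distinct time at which at least one death occurred.
    for t in sorted({u for u, e, _ in data if e == 1}):
        at_risk = sum(1 for u, _, _ in data if u >= t)
        at_risk_a = sum(1 for u, _, g in data if u >= t and g == 0)
        deaths = sum(1 for u, e, _ in data if u == t and e == 1)
        deaths_a = sum(1 for u, e, g in data if u == t and e == 1 and g == 0)
        frac_a = at_risk_a / at_risk
        observed_a += deaths_a
        expected_a += deaths * frac_a
        if at_risk > 1:
            # Hypergeometric variance of the number of deaths in group A.
            variance += deaths * frac_a * (1 - frac_a) * (at_risk - deaths) / (at_risk - 1)
    if variance == 0:
        raise ValueError("log-rank statistic is undefined for these data")
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value

# Hypothetical follow-up times (years) and events for two small groups.
good_times, good_events = [6, 9, 12, 12], [0, 1, 0, 0]
poor_times, poor_events = [2, 3, 5, 8], [1, 1, 1, 0]
print(logrank_test(good_times, good_events, poor_times, poor_events))
```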
Linear regression was completed using GraphPad Prism 5™ software's "XY" option. The survival score was plotted as the independent variable (x-axis column), whereas survival time or time to death was plotted in the y-axis column. Statistical analysis to confirm correlation was completed using the GraphPad analyze tool.
Survival time mean
Survival time mean comparison was completed using GraphPad Prism 5™ software's "column" option. The survival times or times to death for the good and poor prognosis groups were plotted in independent columns. A t-test was used to compare the means between the groups, and was completed using the GraphPad analyze tool.
The ROC analysis to determine the optimal cut-off score was completed using GraphPad Prism 5™ software's "column" option. The survival scores for the good and poor outcome groups were plotted in independent columns. The ROC analysis tool (accessed through the GraphPad analyze tool) was used to determine the sensitivity and specificity of each possible cut-off score. The cut-off score yielding the highest sum of specificity and sensitivity was then used to divide the patients into good and poor outcome groups.
This work was generously supported by a grant from the Canadian Stem Cell Network.
1. Hayes DF, Trock B, Harris AL: Assessing the clinical impact of prognostic factors: when is "statistically significant" clinically useful? Breast Cancer Res Treat. 1998, 52 (1-3): 305-19. 10.1023/A:1006197805041.
2. van de Vijver MJ, et al: A gene-expression signature as a predictor of survival in breast cancer. N Engl J Med. 2002, 347 (25): 1999-2009. 10.1056/NEJMoa021967.
3. Potti A, et al: Genomic signatures to guide the use of chemotherapeutics. Nat Med. 2006, 12 (11): 1294-300. 10.1038/nm1491.
4. van 't Veer LJ, et al: Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002, 415 (6871): 530-6. 10.1038/415530a.
5. Simon R, et al: Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst. 2003, 95 (1): 14-8. 10.1093/jnci/95.1.14.
6. Zou KH, O'Malley AJ, Mauri L: Receiver-operating characteristic analysis for evaluating diagnostic tests and predictive models. Circulation. 2007, 115 (5): 654-7. 10.1161/CIRCULATIONAHA.105.594929.
7. Peto R, Peto J: Asymptotically Efficient Rank Invariant Test Procedures. 1972, Blackwell Publishing, 135.
8. Haibe-Kains B, et al: A comparative study of survival models for breast cancer prognostication based on microarray data: does a single gene beat them all? Bioinformatics. 2008, 24 (19): 2200-8. 10.1093/bioinformatics/btn374.
9. Sotiriou C, Pusztai L: Gene-expression signatures in breast cancer. N Engl J Med. 2009, 360 (8): 790-800. 10.1056/NEJMra0801289.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.