An algorithm to discover gene signatures with predictive potential

Background The advent of global gene expression profiling has generated unprecedented insight into our molecular understanding of cancer, including breast cancer. For example, human breast cancer patients display significant diversity in terms of their survival, recurrence, metastasis as well as response to treatment. These patient outcomes can be predicted by the transcriptional programs of their individual breast tumors. Predictive gene signatures allow us to correctly classify human breast tumors into various risk groups as well as to more accurately target therapy to ensure more durable cancer treatment. Results Here we present a novel algorithm to generate gene signatures with predictive potential. The method first classifies the expression intensity for each gene, as determined by global gene expression profiling, as low, average or high. The matrix containing the classified data for each gene is then used to score the expression of each gene based on its individual ability to predict the patient characteristic of interest. Finally, all examined genes are ranked based on their predictive ability and the most highly ranked genes are included in the master gene signature, which is then ready for use as a predictor. This method was used to accurately predict the survival outcomes in a cohort of human breast cancer patients. Conclusions We confirmed the capacity of our algorithm to generate gene signatures with bona fide predictive ability. The simplicity of our algorithm will enable biological researchers to quickly generate valuable gene signatures without specialized software or extensive bioinformatics training.


Introduction
Clinicians are commonly faced with two important decisions when treating cancer patients: whether or not adjuvant chemotherapy is required, and which treatment is most appropriate. Traditionally, several histopathological characteristics of the tumor are taken into consideration when deciding on the best treatment [1]. However, it has been reported that 70-80% of breast cancer patients do not benefit from the use of chemotherapy, but are still exposed to the deleterious side effects of these drugs [2]. Therefore, additional prediction methods are needed to improve the quality of life for breast cancer patients. One of these methods relies on gene expression profiling-based predictors, which can be used to further inform the decision-making process and increase a clinician's ability to successfully treat cancer patients [3]. Recently, researchers developed a 70-gene signature that can correctly separate patients into good- and poor-prognosis groups, and identified patients who can be spared unnecessary chemotherapy [2,4]. However, constructing such a signature requires the use of various clustering and classification algorithms, which in turn require specialized software and bioinformatics training. Consequently, there is a need for strategies to generate predictive gene signatures that are amenable to the software and skill sets available to the cancer biologist.
Typically, gene expression-based predictors are "trained" on a cohort of samples whose gene expression profiles are known, and for which at least one biological characteristic has been measured [5]. After a predictor has been "trained", it must be validated on a set of samples that were not used to "train" the algorithm. Predictors should in turn be able to accurately forecast the biological characteristic of samples of interest.
For our purposes we used a data set consisting of whole tumor gene expression profiles derived from 295 primary human breast tumors, as well as clinical data relating to the patients' survival and occurrence of metastasis [2]. We then coarse-grained the expression data into high, average and low expression levels, and ranked genes based on the extent of their expression in patients who either survived or succumbed to breast cancer. In this fashion we were able to find genes whose transcripts generally had high and low expression in patients who succumbed and survived, respectively, and vice versa. By combining the top-ranked candidates from a 144-patient training dataset we were able to create a 20-gene signature which performed well on a 151-patient validation dataset.
Our analyses establish an effective method to obtain gene expression based predictors that clearly separate human breast cancer patients into distinguishable prognosis groups with statistically significant differences in survival.

Microarray and clinical data
The microarray data used for our analyses were obtained from the Stanford microarray repository (downloaded from http://microarray-pubs.stanford.edu/wound_NKI/explore.html, henceforth called the NKI dataset). A matrix containing clinical data for the patients who provided samples for the microarray profiles used in the present study was downloaded from the same location. The dataset consists of the gene expression profiles of primary breast tumors biopsied from 295 human breast cancer patients. All patients had either stage I or stage II breast cancer, and were younger than 53 years old. The prevalence of lymph-node positive and lymph-node negative disease was 49% and 51%, respectively. We combined these data into one matrix containing indices for survival, metastasis, and the gene expression profiles for each patient. We used 12-year overall survival as the clinical endpoint for this study.

Organization of data
We blindly divided the patients into two groups consisting of similar numbers of patients, one for algorithm training (144 patients) and the other for algorithm validation (151 patients).

Defining levels of gene expression
In order to rank the predictive ability of a gene, we first needed to assess its expression in each given patient tumor relative to its expression in the tumors of all patients. To this end we first calculated the 95% confidence interval for expression of each gene. The level of expression for each gene was then defined as the following: i) If the expression of a gene in a given patient's tumor was greater than the upper limit of the 95% confidence interval for the expression of the same gene across all patient tumors, then the gene's expression was scored high for that patient's tumor. ii) If the expression of a gene in a given patient's tumor was less than the lower limit of the 95% confidence interval for the expression of the same gene across all patient tumors, then the gene's expression was scored low for that patient's tumor. iii) If the expression of a gene in a given patient's tumor was within the 95% confidence interval for the expression of the gene across all patient tumors, then the gene's expression was scored average for that patient's tumor. These steps were completed for every gene across every patient tumor.
This new matrix consisting of clinical patient data, as well as the gene expression score for each gene, represented by either high, average or low, was then used to rank the genes based on their predictive capacity.
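The three-level classification described above can be sketched in a few lines of Python. This is a minimal illustration, not the Excel template used in the study: the function name `classify_gene` is hypothetical, and the sketch assumes that the 95% confidence interval is a normal-approximation interval around the mean expression of the gene across all tumors (mean ± 1.96 standard errors), since the exact interval formula is not stated in the text.

```python
import math
import statistics


def classify_gene(values, z=1.96):
    """Label each tumor's expression of one gene as 'low', 'average' or 'high'.

    Assumption: the 95% confidence interval is mean +/- z * standard error,
    computed across all patient tumors for this gene.
    """
    mean = statistics.mean(values)
    half_width = z * statistics.stdev(values) / math.sqrt(len(values))
    lower, upper = mean - half_width, mean + half_width

    labels = []
    for value in values:
        if value > upper:
            labels.append("high")      # above the upper limit of the interval
        elif value < lower:
            labels.append("low")       # below the lower limit of the interval
        else:
            labels.append("average")   # within the interval
    return labels


# Toy expression values for one gene across five tumors (illustrative only).
expression = [-0.9, -0.1, 0.0, 0.1, 0.9]
labels = classify_gene(expression)
print(labels)
```

Repeating this call for every gene yields the matrix of high, average and low scores used in the ranking step.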

Ranking the predictive capacity of each gene
We ranked each gene in the training set according to its expression in the tumor of patients who either survived or died from breast cancer. We expected genes whose expression was associated with poor prognosis to be generally highly expressed in patients who died and to be expressed at low levels in patients who survived. Conversely, we expected genes whose expression was associated with good prognosis to generally be highly expressed in patients who survived and to be expressed at low levels in those patients who succumbed. Therefore, the ranking of the genes was performed as follows for genes predictive of poor or good prognosis.
Genes predictive of poor prognosis
i) A predictive score was computed for each gene across all patients, and was initially set at 0.
ii) The score was then updated for each patient as follows:
1. The score was increased by 1 when the patient had both high gene expression and died, or had both low gene expression and survived.
2. The score was decreased by 1 when the patient had both low gene expression and died, or had both high gene expression and survived.
3. Average gene expression levels did not lead to any changes in the predictive score.
Genes predictive of good prognosis
i) A predictive score was computed for each gene across all patients, and was initially set at 0.
ii) The score was then updated for each patient as follows:
1. The score was increased by 1 when the patient had both high gene expression and survived, or had both low gene expression and died.
2. The score was decreased by 1 when the patient had both low gene expression and survived, or had both high gene expression and died.
3. Average gene expression levels did not lead to any changes in the predictive score.
We then combined the top-ranked genes from both the poor-prognosis and good-prognosis gene lists to generate a predictor gene signature. We completed all of the steps described above using Microsoft Excel™ 2007. A template file is available upon request.
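The scoring rules above can be expressed compactly in code. The following Python sketch is an illustration under stated assumptions rather than a transcription of the Excel template: the function names and the toy gene names (`GENE_A`, `GENE_B`, `GENE_C`) are hypothetical. Note that, as the rules are written, the good-prognosis score of a gene is exactly the negative of its poor-prognosis score, so a single ranking suffices: the highest-scoring genes form the poor-prognosis list and the lowest-scoring genes form the good-prognosis list.

```python
def poor_prognosis_score(labels, died):
    """Score one gene as a poor-prognosis marker across all patients.

    labels: 'low' / 'average' / 'high' per patient; died: True/False per patient.
    """
    score = 0
    for label, dead in zip(labels, died):
        if label == "average":
            continue  # average expression never changes the score
        # +1 for high & died or low & survived; -1 for the opposite pairings
        score += 1 if (label == "high") == dead else -1
    return score


def good_prognosis_score(labels, died):
    """Mirror image of the poor-prognosis score (same rules, signs flipped)."""
    return -poor_prognosis_score(labels, died)


def build_signature(gene_labels, died, n_per_list=10):
    """Return (poor_genes, good_genes): the top-ranked genes of each list.

    Assumes there are at least 2 * n_per_list genes, so the lists do not overlap.
    """
    scores = {g: poor_prognosis_score(l, died) for g, l in gene_labels.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_per_list], ranked[::-1][:n_per_list]


# Toy example with four patients (two died, two survived).
died = [True, True, False, False]
gene_labels = {
    "GENE_A": ["high", "high", "low", "low"],
    "GENE_B": ["low", "low", "high", "high"],
    "GENE_C": ["average", "high", "average", "high"],
}
poor, good = build_signature(gene_labels, died, n_per_list=1)
print(poor, good)
```

With `n_per_list=10`, the same call would produce a 20-gene signature of the kind described in this study.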

Measuring the predictive ability of the gene signature
In order to separate the training data set into good-prognosis and poor-prognosis groups, we summed the expression of the poor-prognosis genes (poor-prognosis gene score) and of the good-prognosis genes (good-prognosis gene score) for each patient in our training set. To give each patient a single overall-prognosis score we subtracted the good-prognosis gene score from the poor-prognosis gene score, and ranked the patients according to this new total. This led patients with the highest and lowest expression of poor-prognosis and good-prognosis genes, respectively, to receive the highest scores, and patients with the lowest and highest expression of poor-prognosis and good-prognosis genes, respectively, to receive the lowest scores. In this fashion, high scores were indicative of poor prognosis and low scores were indicative of good prognosis. In order to determine an optimal cut-off score which would yield prognosis predictions with the highest possible specificity and sensitivity, we used receiver operating characteristic (ROC) curves [6]. This generated a list of possible cut-off scores, as well as each score's associated specificity and sensitivity. We next summed the specificity and sensitivity for each cut-off score and used the cut-off which yielded the highest total. For the random control sample, we generated a 20-gene signature populated with genes chosen at random using a random number generator (http://www.random.org).
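The patient-level score and the cut-off selection can be sketched as follows. This is a minimal Python illustration, not the GraphPad Prism procedure used in the study: the function names and toy gene names are hypothetical, patients with a score at or above the cut-off are assumed to be called poor prognosis, and both outcome classes are assumed to be present. Maximizing the sum of sensitivity and specificity is equivalent to maximizing Youden's J statistic (sensitivity + specificity - 1).

```python
def prognosis_scores(expression, poor_genes, good_genes):
    """Per-patient score: summed poor-gene expression minus summed good-gene expression.

    expression: dict mapping gene name -> list of per-patient expression values.
    """
    n_patients = len(next(iter(expression.values())))
    return [
        sum(expression[g][i] for g in poor_genes)
        - sum(expression[g][i] for g in good_genes)
        for i in range(n_patients)
    ]


def best_cutoff(scores, died):
    """Return (cutoff, sensitivity, specificity) maximizing sensitivity + specificity."""
    n_died = sum(died)
    n_survived = len(died) - n_died
    best = None
    for cutoff in sorted(set(scores)):
        predicted_poor = [s >= cutoff for s in scores]  # assumed decision rule
        true_pos = sum(p and d for p, d in zip(predicted_poor, died))
        true_neg = sum(not p and not d for p, d in zip(predicted_poor, died))
        sensitivity = true_pos / n_died
        specificity = true_neg / n_survived
        if best is None or sensitivity + specificity > best[1] + best[2]:
            best = (cutoff, sensitivity, specificity)
    return best


# Toy data: one poor-prognosis gene and one good-prognosis gene, four patients.
expression = {
    "POOR_1": [2.0, 1.0, 0.5, -1.0],
    "GOOD_1": [-1.0, 0.0, 0.5, 1.5],
}
died = [True, True, False, False]
scores = prognosis_scores(expression, ["POOR_1"], ["GOOD_1"])
cutoff, sensitivity, specificity = best_cutoff(scores, died)
print(scores, cutoff, sensitivity, specificity)
```

When several cut-offs tie for the highest total, this sketch keeps the lowest one; how ties were handled in the original analysis is not stated.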

Analysis of survival differences between good-prognosis and poor-prognosis groups
Unless otherwise indicated, GraphPad Prism 5™ software was used to complete survival analysis, linear regression, comparison of survival means and ROC analysis, as well as all associated statistical tests, to measure the predictive ability of the prognosis gene signature in both the training and validation data sets. Additional details are available as supplementary methods.

Comparison of models
We calculated the predictive accuracy (cases correctly predicted vs. all cases), specificity (cases of correctly predicted good overall survival vs. cases of actual good overall survival), and positive predictive value (PPV) (cases correctly predicted as poor survival vs. all cases predicted as poor survival) for our 20-gene signature, the Aurora kinase A, and the 70-gene signature models. Patients were divided into good and poor survival groups based on Aurora kinase A expression, where the average expression of Aurora kinase A for all patients was used as the cut-off separating the two groups. The 70-gene signature classification for the patients was included in the original clinical data file.
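The three comparison metrics follow directly from the counts of correct and incorrect predictions. The Python sketch below is illustrative only: the function name `compare_metrics` and the toy values are hypothetical, and the single-gene split assumes that above-average Aurora kinase A expression is assigned to the poor survival group (the text specifies the cut-off but not the direction).

```python
import statistics


def compare_metrics(predicted_poor, actual_poor):
    """Accuracy, specificity and PPV, with 'poor survival' as the positive class."""
    pairs = list(zip(predicted_poor, actual_poor))
    true_pos = sum(p and a for p, a in pairs)
    true_neg = sum(not p and not a for p, a in pairs)
    false_pos = sum(p and not a for p, a in pairs)
    return {
        # cases correctly predicted vs. all cases
        "accuracy": (true_pos + true_neg) / len(pairs),
        # correctly predicted good survival vs. actual good survival
        "specificity": true_neg / (true_neg + false_pos),
        # correctly predicted poor survival vs. all predicted poor survival
        "ppv": true_pos / (true_pos + false_pos),
    }


# Toy single-gene model: split patients at the mean expression of the gene.
aurka_expression = [0.8, -0.2, 0.5, -0.6]
cutoff = statistics.mean(aurka_expression)
predicted_poor = [value > cutoff for value in aurka_expression]  # assumed direction
actual_poor = [True, False, False, False]

metrics = compare_metrics(predicted_poor, actual_poor)
print(metrics)
```

The same function can be applied to the predictions of any of the three models, provided each is reduced to a good/poor call per patient.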

Gene ontology
Gene names were uploaded to the Gene Ontology website (http://www.geneontology.org), and the biological processes associated with the human form of each gene were recorded.

Generation and validation of a gene signature that predicts human breast cancer patient survival
To establish a gene signature that could accurately predict the survival outcome of human breast cancer patients we used a 295-patient database containing both clinical data relating to patient survival and occurrence of metastases, as well as the patients' individual tumor gene expression profiles. We divided this database into training and validation groups, containing 144 and 151 patients, respectively. We then identified genes whose expression levels correlated with patient survival as described in Methods. The 10 most highly ranked genes predictive of poor prognosis and the 10 most highly ranked genes predictive of good prognosis established a 20-gene expression-based predictor (Table 1).
To learn whether this gene signature could accurately predict survival of the patients from which it was created, we used our 20-gene signature to rank all 144 patients within the training set and divided them into a poor-prognosis group and a good-prognosis group (Fig. 1A). We also compared the overall survival between the two groups (Fig. 1B, log-rank test [7], p < 0.0001), fitted a linear regression to examine the correlation between prognosis score and time to death or censoring (Fig. 1C, F-test, significant negative correlation, p < 0.0001), and compared mean survival time (or time to censoring) between the two groups (Fig. 1D, Mann-Whitney test, p < 0.0001).
In total, our results demonstrated the capacity of our gene signature to properly segregate human breast cancer patients into good- and poor-prognosis groups.
To validate our signature in patients whose data had not been used to generate the signature, we divided the 151-patient validation group into poor-prognosis and good-prognosis groups (Fig. 2A). Again, our signature correctly separated patients based on survival (Fig. 2B, log-rank test, p < 0.0001), correlated prognosis score with survival time (Fig. 2C, F-test, significant negative correlation, p = 0.034), and predicted mean survival time (Fig. 2D, Mann-Whitney test, p = 0.0056). To rule out the possibility that our signature's significance was a result of chance, we randomly generated a different 20-gene signature. As expected, the random 20-gene signature did not separate patients into groups with differences in survival (Fig. 2E).

Analysis of the 20-gene signature
To ensure that our algorithm produced predictors with comparable predictive power to other forms of feature selection we compared the 20-gene signature to a previously published Aurora kinase A expression model, as well as the FDA-approved 70-gene signature (MammaPrint™) [2,8]. The 70-gene MammaPrint signature was originally tested on the NKI dataset, the same dataset we used for the development of our 20-gene signature. We also included the Aurora kinase A expression model, as this model was shown to predict breast cancer patient outcome with similar accuracy to many other feature selection techniques [8]. Our 20-gene signature had a slightly higher predictive accuracy (0.67 vs. 0.64 vs. 0.61 for the 20-gene signature, Aurora kinase A, and 70-gene signature models, respectively), and roughly comparable specificity and positive predictive value to the Aurora kinase A expression and 70-gene signature models (Table 2). Importantly, these comparisons indicate that our algorithm produces classifiers with predictive power at least similar to that of classifiers produced by other feature selection techniques.
Since gene signatures are readily measurable cell characteristics which serve to indicate biological processes, we mapped the gene ontology of each gene-member of our 20-gene signature to learn whether our signature was linked to a particular biological process (Table 3). We found that genes linked to poor-prognosis were generally involved in processes such as mitosis, transcription, as well as DNA replication and DNA repair, whereas genes linked to good-prognosis were generally involved in processes such as cell differentiation and induction of apoptosis. These observations are consistent with the histological observations that patients with highly proliferative and poorly differentiated tumors generally have poorer survival outcomes than those with well differentiated and non-proliferative tumors.

Discussion
We sought to generate an algorithm with the following properties: (i) simple implementation with straightforward methodology, and (ii) high predictive accuracy. Our aim was to enable biologists who are not bioinformatics experts to develop valuable and biologically useful gene expression-based prediction models. Importantly, we completed all steps of our algorithm using Microsoft Excel™ 2007, and will share the template files used for these analyses with interested researchers. This software is widely (if not universally) accessible to and used by the biological research community, suggesting that implementation of this technique will not be hampered by lack of software or training. As mentioned previously, most other feature selection techniques require the use of sophisticated clustering and classification algorithms, which in turn require specialized software and software-specific training.
To confirm that our algorithm produced a prediction model with comparable predictive power to other techniques in feature selection we compared its predictive power with that of an Aurora kinase A expression model as well as the 70-gene signature MammaPrint™ model. The Aurora kinase A model was previously shown to have comparable predictive accuracy to several feature selection techniques at predicting breast cancer patient survival, and can be used to make comparisons between feature selection techniques [8]. Additionally, the 70-gene signature has previously been tested on the NKI dataset, which allowed us to make model comparisons on the same patients. The 70-gene signature is also used clinically and thus represents a "gold standard" against which to compare the predictive accuracy of gene signatures which predict breast cancer patient outcome [9]. We observed that our model had a slightly higher overall predictive accuracy than either the Aurora kinase A expression model or the 70-gene signature, and all three models had comparable specificities and positive predictive values (Table 2). Importantly, these observations demonstrate that our algorithm produces prediction models with comparable accuracy to other feature selection techniques while having generally better accessibility and usability for biological research scientists.
To this end, we have begun using our algorithm to generate gene expression-based prediction models of breast cancer cell sensitivity to commonly used anti-cancer therapies.

Conclusion
Here, we present an algorithm to generate gene signatures with predictive potential. It is noteworthy that our algorithm was developed using Microsoft Excel™ and tested using GraphPad Prism 5™, both commonly available software packages, which should significantly broaden its use. Importantly, the signature developed using our method had predictive accuracy comparable to that of both the Aurora kinase A expression and 70-gene MammaPrint™ models [2,8]. Our methods represent a novel and broadly applicable technique to generate predictive gene signatures that we anticipate will prove useful to the molecular biological research community.