- Open Access
Nomogram integrating gene expression signatures with clinicopathological features to predict survival in operable NSCLC: a pooled analysis of 2164 patients
Journal of Experimental & Clinical Cancer Researchvolume 36, Article number: 4 (2017)
The current tumor-node-metastasis (TNM) staging system is insufficient to predict outcome of patients with operable Non-Small Cell Lung Cancer (NSCLC) owing to its phenotypic and genomic heterogeneity. Integrating genomic signatures with clinicopathological factors may provide more detailed evaluation of prognosis.
All 2164 clinically annotated NSCLC samples (1326 in the training set and 838 in the validation set) with corresponding microarray data from 17 cohorts were pooled to develop and validate a clinicopathologic-genomic nomogram based on Cox regression model. Two computational methods were applied to these samples to capture expression pattern of genomic signatures representing biological statuses. Model performance was measured by the concordance index (C-index) and calibration plot. Risk group stratification was proposed for the nomogram.
Multivariable analysis of the training set identified independent factors including age, TNM stage, combined prognostic classifier, non-overlapping signature, and the ratio of neutrophil to plasma cells. The C-index of the nomogram for predicting survival was statistically superior to that of the TNM stage (training set, 0.686 vs 0.627, respectively; P < .001; validation set, 0.689 vs 0.638, respectively; P < .001). The calibration plots showed that the predicted 1-, 3- and 5-year survival probabilities agreed well with the actual observations. Stratifying patients into three risk groups detected significant differences among survival curves.
These findings offer preliminary evidence that genomic data provide independent and complementary prognostic information and incorporation of this information can refine prognosis in NSCLC. Prospective studies are required to further explore the value of this composite model for prognostic stratification and tailored therapeutic strategies.
Lung cancer is a major cause of morbidity and mortality worldwide, with non-small cell lung cancer (NSCLC) accounting for around 4 in 5 new diagnoses . Treatment decisions and prognosis for patients with NSCLC are largely driven by the assessment of the tumor-node-metastasis (TNM) staging system . Currently, stage IA patients do not receive postoperative chemotherapy due to no survival advantage and its potential therapeutic toxicity [2, 3]. With respect to stage IB patients, disagreement prevails regarding which patients might be appropriate candidates for adjuvant therapy [3–5]. Although radical resection and adjuvant treatment extended cancer-specific survival of patients with stage II or IIIA disease, long-term prognosis continues to be jeopardized by the high risk of subsequent recurrence and drug toxicity [3, 5, 6]. Therefore, it remains a largely unmet need to reduce 30–50% rate of recurrence  to improve survival of these patients. Moreover, the identification of patients at high risk of recurrence and death, who are most likely to benefit from aggressive systemic therapy, is absolutely critical.
Rational application of adjuvant therapy requires accurate prognosis prediction for patients with resectable NSCLC. Numerous clinicopathological schemes have been proposed to optimize the staging system by introducing other prognostic variables, such as age, gender, and tumor lymphocytic infiltration [8–10]. Nonetheless, such models may have hit the ceiling and they still lack genomic information, which can add prognostic value to clinical models in lung cancer . Recently, several gene expression signatures have been used to predict clinical outcomes in human cancer [12–19]. These genomic signatures may reflect specific biological/microenvironmental features of this underlying heterogeneous disease that can determine the tumor phenotype. However, these signatures often focus on one or several distinct aspects of tumor heterogeneity and few studies have attempted to demonstrate the power of combining all related clinicopathological factors with biological/microenvironmental features, which may render more precise information concerning risk assessment .
However, a simple, user-friendly and reliable nomogram incorporating genomic information with the traditional risk factors, widely validated in different cohorts, is needed to refine the prognosis of postoperative patients with stage I-IIIA NSCLC. Here, We aimed the following: (1) to build a composite clinicopathologic-genomic nomogram by systematically analyzing factors with potential prognostic value in a pooled cohort of 1326 postoperative NSCLC patients; (2) to assess the added value of genomic information compared with standard TNM staging, clinicopathological model and genomic model; and (3) to externally validate this nomogram by another independent pooled cohort. Finally, we constructed a composite model which can robustly identify patients at high risk of death and it was shown to outperform those based on clinicopathological variables or gene signatures alone.
Patients and samples
To identify gene expression data arrayed using Affymetrix Human Genome U133A or U133A plus 2.0 with clinically annotated data, we systematically searched Gene Expression Omnibus (GEO), The Cancer Genome Atlas (TCGA), ArrayExpress, caArray and related literature with the terms “lung cancer”, “NSCLC”, “lung adenocarcinoma”, “lung squamous cell carcinoma”, “survival”, “relapse”, “recurrence”, “prognostic” and “prognosis”. For some datasets whose clinical data were not with their gene expression profiles, we either searched the supplements or contacted one or more of the investigators to get the missing information. Raw microarray data and corresponding clinical data of these datasets were retrieved and manually organized when available. Only patients diagnosed with stage I-IIIA NSCLC, and with clinicopathological and survival information available, were included. We excluded patients who had follow-up time or survival time less than 1 month. In addition, patients with any missing or insufficient data on age, gender or histology were also excluded from subsequent processing. All of those studies previously were approved by their respective institutional review boards. Our study followed the Reporting Recommendations for Tumor Marker Prognostic Studies (REMARK) criteria as listed in their guidelines  and guidelines for prognostic factor studies in NSCLC .
Where raw CEL files from Affymetrix Human Genome U133A/Plus2 were available, data were normalized and annotated using an MAS5 algorithm and corresponding annotation files from R Bioconductor to obtain summarized values for each probeset, and otherwise we used pre-processed data as provided by the contributors. For each sample in all data sets, measurements without a gene annotation were excluded and multiple probesets corresponding a single gene were summarized into a gene symbol by taking the most variable probeset measured by interquartile range (IQR).
Identification of genomic signatures with potential to predict prognosis
We queried literature database to identify gene expression signatures in studies where the prognostic value in human cancer or lung cancer were reported. The probesets or genes of those signatures were re-annotated using SOURCE web tool (http://source-search.princeton.edu/cgi-bin/source/sourceBatchSearch) to deal with the retired gene symbols and their differences in tested platforms.
The preprocessed gene data were classified with gene expression signatures identified above using the Nearest Template Prediction (NTP) method as implemented in the Gene Pattern software (Broad Institute of Harvard and MIT, Boston, MA) . NTP required only a list of pre-specified template signature genes and a dataset to be tested, without requiring corresponding training dataset, to capture the presence or absence of gene expression patterns in each sample. Briefly, a template of representative expression pattern of the signature genes was defined based on published gene signature from their respective study. Proximity of the signature genes expression pattern of the sample to the template was evaluated by calculating cosine distance. Significance of the proximity was assessed by a nominal p-value estimated based on a null distribution for the distance generated by randomly resampling the same number of signature genes from the sample genes 1000 times. False discovery rate (FDR) was used to correct the p-values for multiple hypothesis testing. The prediction analysis was performed separately for each dataset. A prediction of presence or absence of related signature was determined based on prediction FDR <0.05, and the rest of the samples with an intermediate expression level of the correlated genes in the signature were classified as uncertainty. Concordance among these predictions was evaluated using unsupervised clustering according to Cramer’s V coefficient of the paired prediction overlap as previously described . Cramer’s V statistic values range from 0 to 1, values between 0.36 and 0.49 indicating substantially correlated and values higher than 0.5 strongly correlated.
Enumeration of hematopoietic cells subsets from gene expression profiles
To quantify the relative abundances of 22 tumor-associated Leukocyte (TAL) subsets, we employed Cell type Identification By Estimating Relative Subsets Of known RNA Transcripts (CIBERSORT) method (500 iterations) and the LM22 gene signature which allowed for highly sensitive and specific discrimination of hematopoietic cells and were well-designed and validated on gene expression profiles from Affymetrix Human Genome U133A/Plus2 . Subsequently, we aggregated 22 leukocyte subsets into 11 immune populations for clarity. The proportions of immune cells were predicted in each dataset separately.
Development, comparison, and validation of prognostic models
The samples were separated into training/validation sets based on cohorts for identifying and evaluating the predictors. Overall survival (OS) was calculated from the date of diagnosis or surgery to the date of death or last follow-up. Patients who were alive at 5 years were administratively censored with OS as 5 years. Continuous variables were expressed as median (IQR) or median (range), and group comparison was performed by the t-test or the Wilcoxon rank sum test. Categorical variables were expressed as percentages, and group comparison was performed by Pearson’s χ2 test or the Fisher’s exact test. Median follow-up was calculated using the reverse Kaplan-Meier method . Clinical variables previously shown of prognostic value (age, gender, histologic subtype and TNM stage) and all genomic factors identified with P values less than 0.05 in the univariate log-rank test were separately entered into a clinical Cox model and a genomic Cox model to compare the prognostic efficacy of clinical and genomic strategies. Clinicopathological and genomic factors identified above were then introduced into the multivariable analyses via the backward stepwise Cox proportional hazard model, from which coefficients were used to develop the composite nomogram. The whole population was divided into three risk groups (high, intermediate, low) according to the tertiles of the total scores given by the established nomogram in the training set. The discrimination and calibration of the nomogram for 1-, 3-, and 5-year were measured by the concordance index (C-index) and by calibration plot comparing the expected and observed survival probabilities respectively. For internal validation, the bootstrapping technique was used to adjust for over-fitting and over-optimistic model performance. An optimism-corrected C-index using 1000 bootstrap samples created with replacement was reported. The bootstrap resampling procedure was detailed in Additional file 1. Finally, for external validation, the total scores of each patient of the independent validation set were calculated according to the proposed nomogram to verify its generalization. Three risk groups were determined by the tertiles defined in the training set, and the respective Kaplan-Meier survival curves were delineated. All statistical tests performed were two-sided and the P values less than 0.05 were considered as statistical significance. Data analyses were implemented using the GenePattern software (http://genepattern.broadinstitute.org/gp/), the CIBERSORT tool (http://cibersort.stanford.edu/) and the R statistical package (http://www.r-project.org/).
Identification of eligible samples and gene signatures
Predefined prognosis analysis was constrained to patients with early stages operable lung cancer. Of the 3398 patients from 21 previously established cohorts retrieved by the initial systematic search, 1234 were excluded because of duplicates, normal lung samples, or missing clinical information; the remaining 2164 clinically annotated NSCLC samples pertaining to 17 cohorts were considered eligible for inclusion in this study (Fig. 1a). All of the eligible patients were then divided into training/validation sets in terms of cohorts to develop and externally validate the composite model. Ultimately, a total of 1326 samples from ten cohorts were assigned to a training set to develop the model, and 838 samples from other seven cohorts to an independent validation set. Information about the datasets entering for our analysis and relevant clinicopathological characteristics of the patients included in the training set and validation set were shown in Additional file 2 and Table 1.
In the training set of 1326 patients with stage I-IIIA NSCLC, the majority were male (801 [60.4%]) and smoker (907 [68.4%]), with a median age of 64 years (IQR 57–70). Most patients had stage I disease (879 [66.3%]) and adenocarcinoma (880 [66.4%]) was the predominant histology. During a median follow-up times of 73.4 months (range 1–256), there were 569 deaths within the first 5 years, corresponding to 42.9% of the enrolled patients. Unlike the patients in the training set, patients in the validation set were older (median age was 65 years [IQR 58–71]; P = .003) and had a lower proportion of adenocarcinoma (420 [50.1%]) and deaths within the first 5 years (274 [32.7%]; P < .001). Median follow-up time for patients in the validation set was 61.9 months (range 1–252; P < .001).
Gene expression signatures might reflect diverse pathways, biological processes or other functions related to the heterogeneous tumor biology and microenvironment . As different signatures may be active in different individual tumors, expression analyses of gene clusters could prove more revealing than single gene analyses . Accordingly, all 27 gene expression signatures with potential prognostic value in lung cancer were identified from literature review, with the genes being clearly described in their respective studies [12–19, 28–42] (Table 2). Based on its initial purpose, we then classified these gene signatures into three major functional groups: (1) prognosis related signatures: six gene sets for predicting prognosis of lung cancer [28–33]; (2) biology/microenvironment related signatures: 14 gene clusters reflecting tumor biology or microenvironment state [12–17, 34–41]; (3) pathway related signatures: seven previously reported gene expression signatures to probe the status of oncogenic pathways [18, 19, 42]. Additionally, an association between cancer survival and abundance of diverse TAL subsets had been recently proposed by applying a computational approach known as CIBERSORT, which can use gene expression data to enumerate and characterize infiltrating immune cells in bulk tumor . For example, tumor-associated neutrophil and plasma cell signatures exhibited opposite prognostic associations in lung cancer. We reasoned that these signatures would allow the description of tumor heterogeneity .
Genomic landscape of early stage NSCLC
Among the 27 signatures evaluated, all of them were able to confidently classify patients (FDR <0.05) into their predicted presence or absence subclass. Table 2 and Fig. 1b summarizes the prediction result obtained for each of the 1326 patients. The cell cycle signature  was the most prevalent prediction in the training set (88.6%; 1214 of 1326), whereas the Src signature  was identified in only 3.5% (46 of 1326) of tumor samples. Interestingly, 290 of 1326 (21.9%) patients concomitantly harbored 5 or more presence/absence signatures. We then sought to evaluate the concordance of these 27 signatures using Cramer’s V coefficient. Signature unsupervised clustering based on these coefficients indicated a substantial association among the three predefined groups of signatures (Fig. 1b) as follows: (1) signatures reflecting biological and microenvironmental characteristics related to cell cycle, chromosomal instability, proliferation, homologous recombination defect, wound healing, acidosis, and metabolism, such as cell cycle signature , CIN70 signature , malignancy-risk signature , HRD signature , wound signature , lactic acidosis response signature , and metabolism proteome signature ; (2) signatures predicting lung cancer prognosis, such as combined prognostic classifier , non-overlapping signature , poor survival signature  and recurrence signature ; (3) signatures indicating patterns of specific pathway deregulation, e.g., TGF-β signature , RAS signature  and E2F3 signature . The observation that some of them did not have a strong correlation implied that these signatures could capture complementary biological features essential in prognosis prediction.
The CIBERSORT method was applied to samples in the training set to quantify the relative proportions of 11 TAL subsets. As shown in Fig. 1c, CIBERSORT revealed striking consistency in relative immune cell fractions within and across clinical subgroups of lung cancer, with plasma cells, monocytes and macrophages, CD4 T cells, and mast cells being the most common immune cell subsets with mean fractions of 0.354, 0.249, 0.128 and 0.064 respectively. Patient samples were divided into 3 groups (low, medium, and high) according to the tertiles values of neutrophil-to-plasma cells fractions inferred in the training set, which were then applied to the validation set.
Independent prognostic factors for survival in the training set
We then sought to test which factors were statistically significant for lung cancer-related death. The results of the univariate log-rank test are displayed in Additional file 3. Younger age (P < .001) and female gender (P = .012) were associated with favorable prognosis. Compared with the stage IA group, other groups (stage IB, stage II, and stage IIIA) were associated with poor prognosis, with a hazard ratio (HR) of 1.55 (95% CI 1.22–1.97; P < .001), 2.39 (95% CI 1.86–3.07; P < .001), and 4.21 (95% CI 3.22–5.50; P < .001), respectively. Oddly, there was no significant survival difference between adenocarcinoma and squamous cell carcinoma (P = .312). With respect to genomic variables, most of them were identified as significantly associated with a less-favorable prognosis, whereas the presence of HRD signature  (P < .001), lactic acidosis response signature  (P < .001), angiogenesis signature  (P < .001) and angiogenic signature  (P < .001) were associated with an improved survival. Of note, in concordance with the results previously reported , we observed a strong association between inferior survival and a higher ration of neutrophil to plasma cells (P < .001) in NSCLC.
To assess the hypothesis that survival prediction models integrating gene signatures can better identify patients at high risk of death than those based on clinicopathological variables or genomic information alone, we developed a clinicopathological model, a genomic model and a composite model based on the training set using all significant factors in the univariable analysis (Table 3). Clinicopathological multivariable analyse demonstrated that age (P < .001), gender (P = .010) and TNM stage (P < .001) were independent risk factors for overall survival. By contrast, combined prognostic classifier  (P < .001), non-overlapping signature  (P = .001) and the ratio of neutrophil to plasma cells  (P = .012) were identified as independent predictors in the genomic model. We next sought to test the independent predictive value of the three genomic factors when confronted with known clinicopathological variables in the training set. Intriguingly, these genomic factors were significantly associated with overall survival in most of the subgroups analyzed, regardless of the age, gender, smoking history, grade, histology, tumor stage or lymph nodal status (Fig. 2a). Moreover, in patients categorized by the TNM stage, these genomic variables were also able to classify them into good and poor survival outcome especially in stage IA or stage IB. Furthermore, 2 out of 3 genomic factors were also prognostically significant within subsets of patients harboring either wild-type or activating mutation of EGFR, ALK, KRAS, or TP53 in the overall pooled datasets (see Additional file 4). Importantly, when age, gender, TNM stage, combined prognostic classifier , non-overlapping signature  and the ratio of neutrophil to plasma cells  were entered as covariates in a joint Cox model, combined prognostic classifier  (P < .001), non-overlapping signature  (P = .036) and the ratio of neutrophil to plasma cells  (P = .004) remained statistically significant in addition to age (P < .001) and TNM stage (P < .001) in the training set.
Construction, comparison and validation of the composite nomogram
To develop a composite prognostic model, we built a nomogram that integrated the significant factors identified above to predict survival of patients with operable NSCLC (Fig. 2b). The nomogram demonstrated that the TNM stage had the largest contribution to prognosis, followed by the combined prognostic classifier  and non-overlapping signature . Age and the ratio of neutrophil to plasma cells  showed a moderate effect on survival rate. Each category within these variables was assigned a point on the top scale based on the coefficients from Cox regression. By summing all of the assigned points for the five variables and drawing a vertical line between Total Points and survival probability axis, we were easily able to obtain the estimated probability of 1-, 3- and 5-year survival. The risk score cutoff values were selected to group patients in terms of total points in the training set into roughly equal tertiles, accurately divided patients into low, intermediate, and high risk (Fig. 2c). Samples were divided into 3 subgroups according to the tertiles of the total number of points in the training set: patients at low risk (<7.39 points), intermediate risk (7.39–12.16 points), and high risk (>12.16 points) of death.
To assess the added value of genomic information, we sought to compare the performance of the proposed nomogram with standard TNM staging, clinicopathological model and genomic model by applying them to the training set and validation set. Expectedly, the composite nomogram had the highest C-index compared with single prognostic variables, clinical model and genomic model in both the training and validation set (Fig. 2d). This suggested that the joint clinicopathologic-genomic model could attain superior prognostic performance than either clinicopathological or genomic information alone.
Prognostic performance of the composite nomogram was further validated using bootstrap resampling and an independent validation set of 899 surgically resected NSCLC. In the training set, the C-index of this joint model (0.686, 95% CI 0.662–0.710) for predicting OS was statistically superior to that of TNM staging system (0.627, 95% CI 0.604–0.650; P < .001). Likewise, the C-index continued to be statistically higher for the nomogram (0.689, 95% CI 0.654–0.724) in the validation set than for the TNM stage (0.638, 95% CI 0.605–0.671; P < .001). The calibration plots showed that the predicted 1-, 3- and 5-year survival probabilities of the nomogram agreed well with the actual observations in both sets (Fig. 2e, f). Applying cutoff value from the training set can stratify patients in the validation set into three distinct risk subgroups with statistically significant difference in overall survival curves (Fig. 3a). Moreover, in patients categorized by major clinicopathological features, the survival rates predicted by the nomogram continued to illustrate significant distinctions between the Kaplan-Meier curves (Fig. 3b-f).
The genetic heterogeneity both between and within tumors poses significant challenges to predicting patient clinical outcome. Although the TNM staging system indicates the level of the malignant potential and disease progression and is a strong prognostic factor in NSCLC , but it has some shortcomings. Lung cancer patients with the same stage of disease can have remarkably different overall outcome following curative resection . Moreover, the staging system strongly depends on the tumor size, the degree of lymph node involvement, and the extent of distant metastases, and it neither takes into account other clinicopathological risk factors, such as age, gender, or grade, nor does it considers the biological heterogeneity of NSCLC patients . To address this, we pooled the largest clinically annotated NSCLC gene expression profiling datasets to date and separated it into training/validation sets to develop and validate a composite clinicopathologic-genomic nomogram for estimation of the risk of death in patients with early stage operable NSCLC. The proposed nomogram outperformed the prognostic potential of models based on clinicopathologic variables or gene signatures alone and could robustly stratify patients into three different prognosis groups with significantly different median survivals.
Each eligible cohort contained samples from a great majority of the resectable lung cancer patients who were diagnosed within a specific geographical region and time period and did not receive adjuvant therapy (see Additional file 2). The wide spatial and temporal distribution and relatively large sample size in this pooled cohorts enhanced its clinical and genomic representativeness and generalizability for NSCLC patients. However, it would be increasingly difficult to conduct a similar research project, since adjuvant therapies which would affect patient outcomes were recommended for most of the patients.
Despite evidence that clinicopathological factors (e.g. age, gender, histology) are prognostic relevant [8–10], the advent of high-throughput genomic profiling tools have enabled additional systematic evaluation of lung cancer genomic heterogeneity to aid prognosis [26, 27]. The comprehensiveness of such data provides an opportunity to dissect this heterogeneous and elusive disease entity into more homogeneous subgroups that can detect genomic phenotypes representing potential candidates for prognostic and predictive biomarkers, in ways that clinical risk factors cannot [26, 27]. Recent microarray studies have used gene expression signatures that reflect pathway activation, biological status or outcome prediction to predict clinical prognosis in human cancer and lung cancer [12–19, 28–42]. For instance, the malignancy-risk gene signature associated with both cancer risk and progression is composed of numerous cell cycle and DNA replication-related genes . Aggressive tumors resulted in the development of an invasiveness gene signature that is correlated significantly with both overall survival and tumor recurrence . Similarly, primary tumors harbored the metastases-related gene expression pattern are associated with poor outcome . Furthermore, homologous recombination repair deficiency may contribute to cancer initiation and carcinogenesis . Individuals with the HRD gene signature have significantly longer survival times compared with those without . Differences in survival were found between lung cancer patients with and without DNA alterations in genes encoding the metabolism proteome . Additionally, multiple oncogenic pathway statuses can stratify patients with different outcomes . As for aspects of the tumor microenvironment, patients who have lung tumors that exhibit a gene expression pattern similar to that of serum-induced fibroblasts have a poor clinical outcome . Hypoxia signature and lactic acidosis response signature are two strong prognostic factors for overall survival in multiple cancers [13, 17, 38]. Moreover, high expression of angiogenesis-related genes is associated with good prognosis in multiple cancer types [40, 41]. Finally, the prognostic potential of a leukocyte gene signature has been recently revealed by a computational approach known as CIBERSORT . These genomic signatures which can help to refine the molecular classification and prognosis in human cancer include genes that exhibit coherently similar expression pattern, closely reflect specific biological processes, and reciprocally compensate each other in capturing the same patterning of biological dysregulation [22, 26, 27].
Indeed, the result of univariate analysis confirmed that most of the signatures were associated with survival prognosis, as stated in their respective publications (see Additional file 3). The observation that some of the patients concurrently harbored more than one poor-outcome signatures (Fig. 1b) indicated that their tumors were genetically unstable and compatible with an aggressive disease. This may partially explain wide nonoverlapping among the prognosis-related signatures identified in various lung cancer microarray studies [28–32]. Another explanation was these signatures might reflect similar biological behavior. Importantly, unsupervised clustering analysis of these signatures identified 3 subgroups: one subgroup defining biological and microenvironmental features, another subgroup predicting lung cancer prognosis, and a third indicating oncogenic pathway status (Fig. 1b). Despite the strong association between some signatures within these subgroups, the predicted result varied considerably owing to interpatient heterogeneity in the biological status of the samples. Multiple, different biological statuses were regularly present in the individual tumor within predicted outcome-related categories (Fig. 1b). For example, samples predicted as good outcome in prognosis-related signatures [28–33] exhibited a gene expression pattern similar to HRD signature  and angiogenesis signature . It was in high concordance with previous reports correlating HRD signature  and angiogenesis signature  with good outcome in lung cancers. In contrast, an increased invasive , proliferative [34, 35], metastatic  and hypoxic [13, 38] response was present in samples predicted as poor outcome in prognosis-related signatures [28–33]. To sum up, these genomic signatures reflecting multiple distinct biological aspects of tumor heterogeneity may dissect the complexity and interplay between cancer cells and stromal cells, capture complementary biological states, and thus help to refine prognosis in early stage NSCLC. Additionally, identification of prognostically distinct subgroup of patients with specifically featured candidates using genomic profiling-based signatures may provide an opportunity for selection of therapeutic strategies tailored to the individual patterns of the biological state (e.g. oncogenic pathway activation, hypoxia, wound, and metabolism) .
However, our primary goal was to unravel the prognostic power of these reported genomic signatures coupled with clinicopathological variables in a large cohort of early stage NSCLC. Multivariable analysis in the training set identified age, TNM stage, combined prognostic classifier, non-overlapping signature, and the ratio of neutrophil to plasma cells as independent prognostic factors (Table 3), which were highly consistent with studies concerning risk factors in lung cancer [8, 9, 28, 29, 43]. The combined prognostic classifier was enriched for genes reflecting epidermal differentiation, signaling, cell cycle and growth, transcription, translation and metabolism ; contrastingly, non-overlapping signature contained genes involved in cell movement, cell death, cell cycle, and signaling processes , supporting that it captures complementary prognostic information in NSCLC. Subsequent subgroup analysis confirmed the independent prognostic value when confronted with known clinicopathological variables (Fig. 2a). Accordingly, we cautiously decided to use the five significant prognostic factors (age, TNM stage, combined prognostic classifier , non-overlapping signature , and the ratio of neutrophil to plasma cells  to build the composite nomogram (Fig. 2b). For validation of this nomogram to prevent overfitting and to verify generalization, the C-index and calibration plots were used. The C-index of the nomogram demonstrated a superior prognostic value over the sixth AJCC TNM classification in the training set and the validation set (Fig. 2c). The calibration plots showed optimal agreement between the expected and observed survival probabilities in both sets (Fig. 2e, f), insuring the reliability and repeatability of the proposed nomogram. In addition, three risk groups were defined on the basis of total points by using tertiles in the training set (Fig. 2d). Using these tertiles, we stratified patients from diverse geographical and ethnic origins into three distinct risk subgroups (Fig. 3). A prognostic model for NSCLC can be considered clinically useful if it is either 1) more effective than standard prognostic factors in identifying high risk subgroup of stage I patients who might benefit from adjuvant chemotherapy or 2) identifies low risk stage II patients who may spare needless drug toxicity . The proposed nomogram can identify those patients that are most likely to be misclassified by clinical models, especially for patients with stage IB disease to whom the role of chemotherapy currently remains controversial (Fig. 3c). Of note, the nomogram could also be used in the cohort of GSE29013 which was based on analysis of FFPE specimens , indicating its clinical application in a broad and practical tissue source (Fig. 3f). Different treatment and follow-up strategies may be appropriate for each of the three categories. For example, more intense adjuvant therapy or new treatment regimen could be given to patients in the high risk group. Similarly, in the intermediate risk group, intensive follow-up or standard adjuvant therapy of these patients was clearly essential. Low risk patients were likely cured by surgery and might not be candidates for systemic therapy. Thus complications deriving from potentially unnecessary adjuvant therapy and life-threatening progression owing to insufficient treatment could both be avoided.
Different accrual times and technologies differences may introduce potential confounding effects. Therefore, Affymetrix Human Genome U133A and Affymetrix Human Genome U133 Plus2 are included only, because they have 22,277 probesets in common to measure the same gene with the same specificity, sensitivity, and dynamic range and because the sample size is sufficient to detect an important difference between the proposed nomogram and TNM staging system. Furthermore, this study did not overcome limitations related to patient comorbidities, treatment heterogeneity, retrospective nature with potential for error or bias, and unbalancement of clinicopathological features between the training set and the validation set. Also, we noted that prognosis of patients with NSCLC was better in the validation set than in the training set. A reason for this might be that the two sets had significant geographical and clinicopathologic differences in distribution. Especially, the training set included more stage IIIA patients (11.8% vs. 6.2%; Table 1). Another relevant issue was that some prognostic factors (e.g., gender , histology , and smoking status ) were not included in the model owing to not reaching statistical significance in the training set or insufficient data. Still, the forthcoming 8th edition of the TNM classification of lung cancer  may perform better than either the current 7th edition or the 6th edition used in our study. Direct comparison of the different versions of the TNM staging is ideal for accuracy comparison. Unfortunately, we don’t have much data for direct comparison in our study. But, we do find some references about the accuracy comparison. A slight increase in the value for R2, which is an estimate of the percent variance explained by the Cox proportional hazards regression models, was observed for both the newly edition when compared to the previous one [45, 46]. However, in a Norwegian cancer registry series contained 1885 patients from 2001 to 2005, the c-index was 0.68 for both the current 7th edition and the previous edition, indicating no difference in their predictive accuracy . These factors were important for survival for patients with early stage lung cancer and their incorporation may further improve the prognostic stratification of patients. Another potential shortcoming was that the panel of signatures representing biological status was not exhaustive and a more sensitive, specific and efficient prediction method to capture the presence of absence of these signatures was needed. Currently, patients with advanced NSCLC are generally treated with systemic therapy or a symptom-based palliative approach. Therapy for these patients should be guided by the mutation status of the tumor whenever possible. Therefore, this nomogram is not suitable for late stage patients.
The improved performance of this combinatorial scheme accentuated the importance of integrating all aspects of the tumor biology and microenvironment into prognostic stratification. Pending further prospective clinical validation, these results provide preliminary evidence that the genomic information could be combined with clinical data to help refine patient prognosis and should be considered as an independent and complementary approach to the current clinicopathologic prognostic model. Ultimately, further studies on prospective data collection and inclusion of some other factors are required to validate and optimize this model.
Cell type identification by estimating relative subsets of known RNA transcripts
False discovery rate
Large cell carcinoma
Large cell neuroendocrine carcinoma
Monocytes and macrophages
Natural killer cells
Non-small cell lung cancer
Nearest template prediction
Squamous cell carcinoma
Stewart B, Wild CP. World cancer report 2014. Lyon: International Agency for Research on Cancer, World Health Organization; 2014.
NCCN Clinical practice guidelines in oncology: Non-Small Cell Lung Cancer version 4.2016. [http://www.nccn.org/professionals/physician_gls/pdf/nscl.pdf]. Accessed 27 Feb 2016.
Pignon JP, Tribodet H, Scagliotti GV, Douillard JY, Shepherd FA, Stephens RJ, Dunant A, Torri V, Rosell R, Seymour L, et al. Lung adjuvant cisplatin evaluation: a pooled analysis by the LACE Collaborative Group. J Clin Oncol. 2008;26:3552–9.
Strauss GM, Herndon 2nd JE, Maddaus MA, Johnstone DW, Johnson EA, Harpole DH, Gillenwater HH, Watson DM, Sugarbaker DJ, Schilsky RL, et al. Adjuvant paclitaxel plus carboplatin compared with observation in stage IB non-small-cell lung cancer: CALGB 9633 with the Cancer and Leukemia Group B, Radiation Therapy Oncology Group, and North Central Cancer Treatment Group Study Groups. J Clin Oncol. 2008;26:5043–51.
Butts CA, Ding K, Seymour L, Twumasi-Ankrah P, Graham B, Gandara D, Johnson DH, Kesler KA, Green M, Vincent M, et al. Randomized phase III trial of vinorelbine plus cisplatin compared with observation in completely resected stage IB and II non-small-cell lung cancer: updated survival analysis of JBR-10. J Clin Oncol. 2010;28:29–34.
Douillard J-Y, Rosell R, De Lena M, Carpagnano F, Ramlau R, Gonzáles-Larriba JL, Grodzki T, Pereira JR, Le Groumellec A, Lorusso V, et al. Adjuvant vinorelbine plus cisplatin versus observation in patients with completely resected stage IB–IIIA non-small-cell lung cancer (Adjuvant Navelbine International Trialist Association [ANITA]): a randomised controlled trial. Lancet Oncol. 2006;7:719–27.
Consonni D, Pierobon M, Gail MH, Rubagotti M, Rotunno M, Goldstein A, Goldin L, Lubin J, Wacholder S, Caporaso NE, et al. Lung cancer prognosis before and after recurrence in a population-based setting. J Natl Cancer Inst. 2015;107:djv059.
Liang W, Zhang L, Jiang G, Wang Q, Liu L, Liu D, Wang Z, Zhu Z, Deng Q, Xiong X, et al. Development and validation of a nomogram for predicting survival in patients with resected non-small-cell lung cancer. J Clin Oncol. 2015;33:861–9.
Birim O, Kappetein AP, Waleboer M, Puvimanasinghe JP, Eijkemans MJ, Steyerberg EW, Versteegh MI, Bogers AJ. Long-term survival after non-small cell lung cancer surgery: development and validation of a prognostic model with a preoperative and postoperative mode. J Thorac Cardiovasc Surg. 2006;132:491–8.
Brambilla E, Le Teuff G, Marguet S, Lantuejoul S, Dunant A, Graziano S, Pirker R, Douillard JY, Le Chevalier T, Filipits M, et al. Prognostic Effect of Tumor Lymphocytic Infiltration in Resectable Non-Small-Cell Lung Cancer. J Clin Oncol. 2016;34:1223–30.
Subramanian J, Simon R. Gene expression-based prognostic signatures in lung cancer: ready for clinical use? J Natl Cancer Inst. 2010;102:464–74.
Carter SL, Eklund AC, Kohane IS, Harris LN, Szallasi Z. A signature of chromosomal instability inferred from gene expression profiles predicts clinical outcome in multiple human cancers. Nat Genet. 2006;38:1043–8.
Chi JT, Wang Z, Nuyten DS, Rodriguez EH, Schaner ME, Salim A, Wang Y, Kristensen GB, Helland A, Borresen-Dale AL, et al. Gene expression programs in response to hypoxia: cell type specificity and prognostic significance in human cancers. PLoS Med. 2006;3:e47.
Chang HY, Sneddon JB, Alizadeh AA, Sood R, West RB, Montgomery K, Chi JT, van de Rijn M, Botstein D, Brown PO. Gene expression signature of fibroblast serum response predicts human cancer progression: similarities between tumors and wounds. PLoS Biol. 2004;2:E7.
Ramaswamy S, Ross KN, Lander ES, Golub TR. A molecular signature of metastasis in primary solid tumors. Nat Genet. 2003;33:49–54.
Peng G, Chun-Jen Lin C, Mo W, Dai H, Park YY, Kim SM, Peng Y, Mo Q, Siwko S, Hu R, et al. Genome-wide transcriptome profiling of homologous recombination DNA repair. Nat Commun. 2014;5:3361.
Chen JL, Lucas JE, Schroeder T, Mori S, Wu J, Nevins J, Dewhirst M, West M, Chi JT. The genomic analysis of lactic acidosis and acidosis response in human cancers. PLoS Genet. 2008;4:e1000293.
Saal LH, Johansson P, Holm K, Gruvberger-Saal SK, She QB, Maurer M, Koujak S, Ferrando AA, Malmstrom P, Memeo L, et al. Poor prognosis in carcinoma is associated with a gene expression signature of aberrant PTEN tumor suppressor pathway activity. Proc Natl Acad Sci U S A. 2007;104:7564–9.
Coulouarn C, Factor VM, Thorgeirsson SS. Transforming growth factor-beta gene expression signature in mouse hepatocytes predicts clinical outcome in human cancer. Hepatology. 2008;47:2059–67.
Pittman J, Huang E, Dressman H, Horng CF, Cheng SH, Tsou MH, Chen CM, Bild A, Iversen ES, Huang AT, et al. Integrated modeling of clinical and gene expression information for personalized prediction of disease outcomes. Proc Natl Acad Sci U S A. 2004;101:8431–6.
McShane LM, Altman DG, Sauerbrei W, Taube SE, Gion M, Clark GM, Statistics Subcommittee of the NCIEWGoCD. Reporting recommendations for tumor marker prognostic studies (REMARK). J Natl Cancer Inst. 2005;97:1180–4.
Hoshida Y. Nearest template prediction: a single-sample-based flexible class prediction with confidence assessment. PLoS One. 2010;5:e15543.
Fan C, Oh DS, Wessels L, Weigelt B, Nuyten DS, Nobel AB, van’t Veer LJ, Perou CM. Concordance among gene-expression-based predictors for breast cancer. N Engl J Med. 2006;355:560–9.
Newman AM, Liu CL, Green MR, Gentles AJ, Feng W, Xu Y, Hoang CD, Diehn M, Alizadeh AA. Robust enumeration of cell subsets from tissue expression profiles. Nat Methods. 2015;12:453–7.
Schemper M, Smith TL. A note on quantifying follow-up in studies of failure time. Control Clin Trials. 1996;17:343–6.
Nevins JR, Potti A. Mining gene expression profiles: expression signatures as cancer phenotypes. Nat Rev Genet. 2007;8:601–9.
Rhodes DR, Chinnaiyan AM. Integrative analysis of the cancer transcriptome. Nat Genet. 2005;37(Suppl):S31–7.
Raponi M, Zhang Y, Yu J, Chen G, Lee G, Taylor JM, Macdonald J, Thomas D, Moskaluk C, Wang Y, Beer DG. Gene expression signatures for predicting prognosis of squamous cell and adenocarcinomas of the lung. Cancer Res. 2006;66:7466–72.
Sun Z, Wigle DA, Yang P. Non-overlapping and non-cell-type-specific gene expression signatures predict lung cancer survival. J Clin Oncol. 2008;26:877–83.
Xie Y, Xiao G, Coombes KR, Behrens C, Solis LM, Raso G, Girard L, Erickson HS, Roth J, Heymach JV, et al. Robust gene expression signature from formalin-fixed paraffin-embedded samples predicts prognosis of non-small-cell lung cancer patients. Clin Cancer Res. 2011;17:5705–14.
Lu Y, Lemon W, Liu PY, Yi Y, Morrison C, Yang P, Sun Z, Szoke J, Gerald WL, Watson M, et al. A Gene Expression Signature Predicts Survival of Patients with Stage I Non-Small Cell Lung Cancer. PLoS Med. 2006;3:e467.
Roepman P, Jassem J, Smit EF, Muley T, Niklinski J, van de Velde T, Witteveen AT, Rzyman W, Floore A, Burgers S, et al. An immune response enriched 72-gene prognostic profile for early-stage non-small-cell lung cancer. Clin Cancer Res. 2009;15:284–90.
Mitra R, Lee J, Jo J, Milani M, McClintick JN, Edenberg HJ, Kesler KA, Rieger KM, Badve S, Cummings OW, et al. Prediction of postoperative recurrence-free survival in non-small cell lung cancer by using an internationally validated gene expression model. Clin Cancer Res. 2011;17:2934–46.
Whitfield ML, Sherlock G, Saldanha AJ, Murray JI, Ball CA, Alexander KE, Matese JC, Perou CM, Hurt MM, Brown PO, Botstein D. Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol Biol Cell. 2002;13:1977–2000.
Chen DT, Hsu YL, Fulp WJ, Coppola D, Haura EB, Yeatman TJ, Cress WD. Prognostic and predictive value of a malignancy-risk gene signature in early-stage non-small cell lung cancer. J Natl Cancer Inst. 2011;103:1859–70.
Liu R, Wang X, Chen GY, Dalerba P, Gurney A, Hoey T, Sherlock G, Lewicki J, Shedden K, Clarke MF. The prognostic role of a gene signature from tumorigenic breast-cancer cells. N Engl J Med. 2007;356:217–26.
Li L, Wei Y, To C, Zhu CQ, Tong J, Pham NA, Taylor P, Ignatchenko V, Ignatchenko A, Zhang W, et al. Integrated omic analysis of lung cancer reveals metabolism proteome signatures with prognostic impact. Nat Commun. 2014;5:5469.
Winter SC, Buffa FM, Silva P, Miller C, Valentine HR, Turley H, Shah KA, Cox GJ, Corbridge RJ, Homer JJ, et al. Relation of a hypoxia metagene derived from head and neck cancer to prognosis of multiple cancers. Cancer Res. 2007;67:3441–9.
Navab R, Strumpf D, Bandarchi B, Zhu CQ, Pintilie M, Ramnarine VR, Ibrahimov E, Radulovich N, Leung L, Barczyk M, et al. Prognostic gene-expression signature of carcinoma-associated fibroblasts in non-small cell lung cancer. Proc Natl Acad Sci U S A. 2011;108:7160–5.
Masiero M, Simoes FC, Han HD, Snell C, Peterkin T, Bridges E, Mangala LS, Wu SY, Pradeep S, Li D, et al. A core human primary tumor angiogenesis signature identifies the endothelial orphan receptor ELTD1 as a key regulator of angiogenesis. Cancer Cell. 2013;24:229–41.
Hu J, Bianchi F, Ferguson M, Cesario A, Margaritora S, Granone P, Goldstraw P, Tetlow M, Ratcliffe C, Nicholson AG, et al. Gene expression signature for angiogenic and nonangiogenic non-small-cell lung cancer. Oncogene. 2005;24:1212–9.
Bild AH, Yao G, Chang JT, Wang Q, Potti A, Chasse D, Joshi MB, Harpole D, Lancaster JM, Berchuck A, et al. Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature. 2006;439:353–7.
Gentles AJ, Newman AM, Liu CL, Bratman SV, Feng W, Kim D, Nair VS, Xu Y, Khuong A, Hoang CD, et al. The prognostic landscape of genes and infiltrating immune cells across human cancers. Nat Med. 2015;21:938–45.
Kratz JR, He J, Van Den Eeden SK, Zhu ZH, Gao W, Pham PT, Mulvihill MS, Ziaei F, Zhang H, Su B, et al. A practical molecular assay to predict survival in resected non-squamous, non-small-cell lung cancer: development and international validation studies. Lancet. 2012;379:823–32.
Goldstraw P, Chansky K, Crowley J, Rami-Porta R, Asamura H, Eberhardt WE, Nicholson AG, Groome P, Mitchell A, Bolejack V, et al. The IASLC Lung Cancer Staging Project: Proposals for Revision of the TNM Stage Groupings in the Forthcoming (Eighth) Edition of the TNM Classification for Lung Cancer. J Thorac Oncol. 2016;11:39–51.
Goldstraw P, Crowley J, Chansky K, Giroux DJ, Groome PA, Rami-Porta R, Postmus PE, Rusch V, Sobin L, International Association for the Study of Lung Cancer International Staging C, Participating I. The IASLC Lung Cancer Staging Project: proposals for the revision of the TNM stage groupings in the forthcoming (seventh) edition of the TNM Classification of malignant tumours. J Thorac Oncol. 2007;2:706–14.
Strand TE, Rostad H, Wentzel-Larsen T, von Plessen C. A population-based evaluation of the seventh edition of the TNM system for lung cancer. Eur Respir J. 2010;36:401–7.
We thank Chunrong Huang for useful comments, suggestions and discussions.
This work was financially supported by grants from the National Natural Science foundation of China (81570008, Yanbin Zhou), the Natural Science Foundation of Guangdong Province of China (2014A030313052, Yanbin Zhou), the Science and Technology Program of Guangzhou, China (2014J4100132, Yanbin Zhou) and Medical Scientific Research Foundation of Guangdong Province of China (A2015311, Lixia Huang).
Availability of data and materials
The datasets supporting the conclusions of this article are available in the in public databases (Gene Expression Omnibus (GEO), The Cancer Genome Atlas (TCGA), or ArrayExpress) with the accession numbers: GSE8465, GSE30219, GSE37745, GSE8894, GSE14814, E-MTAB-923, GSE19188, GSE10245, GSE31908, GSE6253, GSE31210, GSE50081, GSE4573, TCGA/GSE68793, GSE3141, E-MTAB-2435/-1727, GSE29013. All of those studies previously were approved by their respective institutional review boards.
Conception and design: JW, LZ, LH, YZ. Development of methodology: JW, LZ, YZ. Acquisition of data (searched databases, acquired and managed data, and provided facilities, etc.): JW, LZ, LH, JG, SL, BL, JF. Analysis and interpretation of data (e.g., statistical, biostatistics and computational analysis): LZ, JW, LH, JG, YZ. Writing, review, and/or revision of the manuscript: JW, LZ, LH, YZ. Study supervision: YZ. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Ethics approval and consent to participate
Bootstrap resampling procedure for internal validation (TIF 1594 kb)
Characteristics of cohorts included in the study. (DOCX 37 kb)
Univariable associations of clinicopathologic and genomic factors with risk of 5-year overall mortality in the training set. (XLSX 12 kb)
Prognostic value of the three genomic factors within mutational subsets of NSCLC in the overall pooled dateses, training set and validation set. (XLSX 12 kb)