Identification and validation of potential prognostic lncRNA biomarkers for predicting survival in patients with multiple myeloma

Background Dysregulated long non-coding RNAs (lncRNAs) have been found to have oncogenic and/or tumor suppressive roles in the development and progression of cancer, implying their potentials as novel independent biomarkers for cancer diagnosis and prognosis. However, the prognostic significance of expression profile-based lncRNA signature for outcome prediction in patients with multiple myeloma (MM) has not yet been investigated. Methods LncRNA expression profiles of a large cohort of patients with MM were obtained and analyzed by repurposing the publically available microarray data. An lncRNA-focus risk score model was developed from the training dataset, and then validated in the testing and another two independent external datasets. The time-dependent receiver operating characteristic (ROC) curve was used to evaluate the prognostic performance for survival prediction. The biological function of prognostic lncRNAs was predicted using bioinformatics analysis. Results Four lncRNAs were identified to be significantly associated with overall survival (OS) of patients with MM in the training dataset, and were combined to develop a four-lncRNA prognostic signature to stratify patients into high-risk and low-risk groups. Patients of training dataset in the high-risk group exhibited shorter OS than those in the low-risk group (HR = 2.718, 95 % CI = 1.937-3.815, p <0.001). The similar prognostic values of four-lncRNA signature were observed in the testing dataset, entire GSE24080 dataset and another two independent external datasets. Multivariate Cox regression and stratified analysis showed that the prognostic power of four-lncRNA signature was independent of clinical features, including serum beta 2-microglobulin (Sβ2M), serum albumin (ALB) and lactate dehydrogenase (LDH). ROC analysis also demonstrated the better performance for predicting 3-year OS. Functional enrichment analysis suggested that these four lncRNAs may be involved in known genetic and epigenetic events linked to MM. Conclusions Our results demonstrated potential application of lncRNAs as novel independent biomarkers for diagnosis and prognosis in MM. These lncRNA biomarkers may contribute to the understanding of underlying molecular basis of MM. Electronic supplementary material The online version of this article (doi:10.1186/s13046-015-0219-5) contains supplementary material, which is available to authorized users.


Background
Multiple myeloma (MM) is an incurable cancer of plasma cells caused by abnormal accumulation of monoclonal plasma cells in bone marrow [1]. MM is one of the most common blood cancers and is characterized by wide clinical and pathophysiologic heterogeneities leading to fatal outcome. The survival periods of patients with MM varied significantly, ranging from a few weeks to more than 10 years, and the five-year survival rate is nearly 40 % [2]. Identifying patients who are at the high risk may help optimize the choice of personalized treatment and improve clinical outcomes.
It is well known that the vast majority (>90 %) of the human genome sequence can be actively transcribed, while less than 2 % of transcripts serve as mRNA to encode protein [3,4]. A substantial fraction of transcripts is non-coding RNA (ncRNA) with no or limited protein coding capacity, including short ncRNA and long ncRNA. Long non-coding RNA (lncRNA), constituting an important class of ncRNA, are mRNA-like transcripts which are transcribed by RNA polymerase II and are longer than 200 nucleotides in length [5,6]. Accumulating evidence indicates that lncRNA function as important regulators involved in diverse aspects of gene regulation at transcriptional, posttranscriptional and epigenetic levels [5,7], and participate in a variety of biological processes [8,9]. The aberrant lncRNA expression has also been observed in many complex human diseases, especially in cancers [10][11][12]. Similar to mRNA and miRNA, these dysregulated lncRNAs can play oncogenic and/or tumor suppressive roles in the development and progression of cancer. Some well-characterized lncRNAs, such as MALAT1, HOTAIR and SRA, were found to be highly up-regulated in lung cancer, breast cancer, hepatocellular cancer and so on [13][14][15], while MEG3, GAS5 and LincRNA-p21 have been shown to be tumor suppressors [16][17][18]. These cancer-associated lncRNAs displayed aberrant expression patterns in tissue-or cancer-type specific manner [19,20], suggesting their potentials as novel independent biomarkers for cancer diagnosis and prognosis. Several expression-based lncRNA signatures have been established in glioblastoma multiforme [21], breast cancer [22], oesophageal squamous cell carcinoma [23], colorectal cancer [24] and lung cancer [25]. For multiple myeloma, recent studies have also found that lncRNAs MALAT1 and MEG3 are overexpressed in patients with MM compared to healthy individuals by real-time quantitative reverse transcription polymerase chain reactions (RT-PCR) analysis [26,27]. However, the prognostic significance of lncRNA signature in patients with MM remains unknown.
In the present study, by integrating lncRNA expression profiles and matched clinical information in a large cohort of patients with MM, we identified four prognostic lncRNA biomarkers associated with overall survival of patients with MM and established a four-lncRNA-focus prognostic risk model that can effectively predict clinical survival. The significant prognostic power of four-lncRNA-focus prognostic risk model was further validated in testing dataset and another two independent external patient datasets.

Microarray analysis and lncRNA re-annotation
The probe ID-centric gene expression data was normalized using the MAS5 algorithm and log2 transformed. To obtain lncRNA expression profiles of patients with MM, the microarray probes were re-annotated as previously described [31]. Briefly, the probes (probe sets) from Affymetrix HG-U133_Plus_2.0 array and Affymetrix HG-U133A array were re-mapped to the human genome (GRCh38) using SeqMap tool [32]. Then those probes (probe sets) that were uniquely mapped to the human genome with no mismatch were retained for further analysis. By matching the chromosomal position of retaining probes (probe sets) to the chromosomal position of lncRNA from the GENCODE project (http:// www.gencodegenes.org, release 22) [33], we obtained 3215 probes (probe sets) covering 2330 lncRNAs for Affymetrix HG-U133_Plus_2.0 array and 855 probes (probe sets) covering 663 lncRNAs for Affymetrix HG-U133A array, respectively. The expression data of multiple probes (probe sets) mapping to the same lncRNA were integrated by using the arithmetic mean to represent the expression level of single lncRNA.

Identification of potential prognostic lncRNA biomarkers associated with OS in patients with MM
A univariate Cox regression analysis was carried out to evaluate the association between expression levels of lncRNAs and patients' OS. Those lncRNAs whose expression levels were significantly associated with patients' OS were fitted in a multivariate Cox regression analysis in the training dataset by using OS as the dependent variable and other clinical information as the covariables. We kept those lncRNAs with p value <0.01 to develop a risk score model for predicting OS in patients with MM. The lncRNA-based risk score model was defined as the linear combination of the expression values of the prognostic lncRNAs and the multivariable Cox regression coefficient as the weight. The patients with MM in each dataset were classified into high-risk group and low-risk group according to the median risk score derived from the training dataset.

Statistical analysis
Differences in patients' OS between high-risk group and low-risk group were demonstrated using the Kaplan-Meier survival curves, and the statistical significance was obtained using the two-sided log-rank test. Univariate and multivariate analyses with Cox proportional hazards regression were carried out with OS as the dependent variable and lncRNA risk score and other individual clinical features as explanatory variables in each dataset. Hazard ratios (HR) and 95 % confidence intervals (CI) were calculated. The timedependent receiver operating characteristic (ROC) curve was used to evaluate the prognostic performance for survival prediction of the lncRNA risk score and the area under the ROC curves (AUC) value were calculated. All the analysis was performed using the R/Bio-Conductor (version 3.1.1).

Functional enrichment analysis
The Pearson correlation coefficient was utilized to evaluate co-expression relationship between lncRNA and mRNA. The functional enrichment analysis of co-expressed mRNAs was performed to predict biological function of lncRNA using the DAVID Bioinformatics Tool (http:// david.abcc.ncifcrf.gov/, version 6.7), which is widely used bioinformatics resources [34,35]. The enriched results was reported limited to Gene Ontology (GO) terms in the "Biological Process"(GOTERM-BP-FAT) and Kyoto encyclopedia of genes and genomes (KEGG) pathway categories using the functional annotation clustering and functional annotation chart options. The GO terms and KEGG pathways with p value of <0.05 and Enrichment score > 2 was considered as significantly enriched function annotations.

Results
Identification of prognostic lncRNA biomarkers associated with patients' OS from the training dataset The 559 patients with MM from GSE24080 were randomly split into the training dataset (n = 280) and the testing dataset (n = 279). We first conducted a univariate Cox regression analysis for expression data of each lncRNA with OS as a dependent variable to measure the relationship between lncRNA expression and patients' OS. A total of 59 lncRNAs, whose expression levels were significantly associated with patients' OS (p < 0.01), were identified and ranked according to their univariate z scores (Fig. 1). Of these, the high expression levels of 40 lncRNAs with negative z scores were associated with longer OS, and high expression levels of the remaining 19 lncRNAs with positive z scores were associated with shorter OS. In order to evaluate whether these lncRNAs have independently predictive power to predict patients' OS when considering the mutual effect among 59 lncRNAs and clinical features, a multivariate regression analysis was further performed on the expression levels of 59 lncRNAs with OS as a dependent variable and other individual clinical features as explanatory variables in the training dataset. When considering the mutual effect among 59 lncRNAs and clinical features, only 4 of 59 lncRNAs (RP4-803 J11.2, RP1-43E13.2, RP11-553 L6.5, ZFY-AS1) showed predictive power and were able to independently predict patients' OS at a statistically significant level of 0.01 (Fig. 1). The detailed information of these four lncRNAs was summarized in Table 1. To build a predictive model that should be independent of other factors (such as clinical features or other lncRNAs), we only used these four lncRNAs to construct risk score model.

Construction and validation of lncRNA-focus risk score model for predicting OS in the training dataset
To construct a predictive model, these four lncRNAs were fitted in a multivariate Cox regression model with OS as a dependent variable to measure relative contributions for survival prediction. Then a lncRNA-focus risk score model for OS prediction was developed by integrating the expression data of these four lncRNA and corresponding coefficient derived from above multivariate regression analysis, as follows: The risk score of each patient in the training dataset was calculated according to the lncRNA-focus risk score model. Then 280 patients of training dataset were assigned to a high-risk group (n = 140) or a lowrisk group (n = 140) using the median risk score as the cutoff point. The result of Kaplan-Meier analysis showed significant differences in patients' OS between high-risk group and low-risk group (log-rank test p < 0.001, Fig. 2a). Patients in the high-risk group had significantly shorter OS (mean 58.7 months) than those in the low-risk group (mean 104.6 months). The univariate Cox regression analysis also demonstrated that the risk scores derived from the four-lncRNA signature was significantly correlated with patients' OS with risk scores as a continuous variable (p < 0.001, HR = 2.718, 95 % CI = 1.937-3.815) ( Table 2). The expression of lncRNAs RP4-803 J11.2 and RP1-43E13.2 tended to be up-regulated and the remaining two lncRNAs (ZFY-AS1 and RP11-553 L6.5) were downregulated for patients in high-risk group (Fig. 2b).
Performance evaluation of lncRNA-focus risk score model for survival prediction in the testing and entire GSE24080 datasets To evaluate the prognostic power of lncRNA-focus risk score model for patients' OS prediction, this risk score model and cutoff point derived from the training dataset was applied to patients with MM in the testing dataset and the entire GSE24080 dataset. The 279 patients of the testing dataset were classified into either high-risk group (n = 119) or low-risk group (n = 160). Kaplan-Meier curves for the two groups within the testing dataset is shown in Fig. 3a, demonstrating a significant difference in OS between high-risk group and low-risk group (log-rank test p = 0.054). Patients in the high-risk group exhibited poorer OS (mean 68.5 months) than those in the low-risk group (mean 78.3 months). The significant association between risk score and OS has also been observed in the testing dataset with risk scores as a continuous variable in the univariate Cox regression analysis (p = 0.014, HR = 1.579, 95 % CI = 1.099-2.270) ( Table 2). The distribution of risk score, survival status and lncRNA expression in the testing dataset of 279 patients is shown in Fig. 3b. Patients in the  high-risk group tended to express risky lncRNAs (RP4-803 J11.2 and RP1-43E13.2) at higher level than those in the low-risk group, whereas patients in the low-risk group tended to express protective lncRNAs (ZFY-AS1 and RP11-553 L6.5) at higher level than those in the high-risk group. In consistent with the finding in the training dataset and testing dataset, Kaplan-Meier and univariate Cox regression analysis showed that this lncRNA-focus risk score model was able to separate 559 patients in the entire GSE24080 dataset into two groups with significantly different OS (mean 63.3 months versus 100.1 months, HR = 2.099, 95 % CI = 1.638-2.688; p < 0.001, log-rank test) (Fig. 3c). The distribution of risk score, survival status and lncRNA expression also yielded similar results (Fig. 3d).
Further validation of lncRNA-focus risk score model for survival prediction in another two independent external patient datasets To further examine the robustness and practical application of the four-lncRNA risk score model, we validated the prognostic power of the risk score model using lncRNA expression values and survival information of patients with MM in another two independent external datasets (GSE57317 and GSE9782). As shown in Fig. 4a, the lncRNA-focus risk score model could effectively predict OS in patients with MM from GSE57317 (log-rank test p = 0.053). All 55 patients in the GSE57317 dataset were divided into the high-risk group (n = 26) and the low-risk group (n = 29) with significant different OS according to the same risk score cutoff point derived from the training dataset (mean 31.3 months versus 37.5 months, HR = 2.64, 95 % CI = 1.013-6.879, p = 0.047). Another external patient dataset (GSE9782) was based on the Affymetrix U133A array platform. After probe re-annotating, we found that only 3 lncRNA (RP1-43E13.2, RP11-553 L6.5, ZFY-AS1) of four prognostic lncRNAs from the training dataset were covered on the Affymetrix U133A array. So, the risk score model only based on these three lncRNAs without re-estimating parameters was used to predict OS for GSE9782 dataset, which perhaps reduce the predictive power. The median risk score cutoff point obtained from GSE9782 dataset classified 264 patients into the high-risk group (n = 132) and the low-risk group (n = 132). The Kaplan-Meier curves for the high-risk group and the low-risk group in the independent external GSE9782 dataset are shown in Fig. 4b Fig. 4c and d).   Table 2) (There was no available clinical features in GSE57317 dataset). We also found that higher levels of serum beta 2-microglobulin (Sβ2M), serum albumin (ALB) and lactate dehydrogenase (LDH) were significant in the multivariate analysis. However, the estimation of hazard ratios of lncRNA-focus risk score  Table 2), suggesting that lncRNA-focus risk score model may be more powerful prognostic factor than established laboratory prognostic parameters. Next, data stratification analysis was then performed according to these three significant clinical features. All patients of GSE24080 were stratified into patient group with higher Sβ2M level (≥3.5 mg/L) or patient group with lower Sβ2M level (<3.5 mg/L). All 239 patients with higher Sβ2M level were divided into the high-risk group (n = 119) with shorter OS or the low-risk group (n = 120) with longer OS (mean 44.9 vs. 93.4 months, log-rank test p < 0.001) (Fig. 5a). For the patient with lower Sβ2M level, patients with low-risk scores (n = 180) also had longer OS (mean 97.1 months) than those with high-risk scores (n = 140) (mean 75.2 months), although the p value of 0.068 was slightly above the 0.05 significance level (Fig. 5b). Another clinical feature, ALB, stratified the entire GSE24080 patients into two subgroups with higher (≥35 g/L) or lower (<35 g/L) levels of ALB. The lncRNA-focus risk score model could effectively classify patients into high-risk group and low-risk group with significantly different OS for both two subgroups with higher or lower levels of ALB (mean 66.4 vs. 100.8 months, log-rank test p < 0.001 for 482 patients with higher level of  (Fig. 5c and d).
Significant differences for OS between high-risk group and low-risk group also were observed for stratified subgroup by LDH level (mean 51.5 vs. 67.9 months, log-rank test p = 0.037 for 168 patients with LDH > 190 U/L, and mean 69.1 vs. 104.4 months, log-rank test p < 0.001 for 391 patients with LDH ≤ 190 U/L) ( Fig. 5e and f). Taken together, the results of multivariate Cox regression analyses and stratification analysis suggested that the predictive power of lncRNA-focus risk score is independent of other clinical features for OS of patients with MM.

Performance comparison by time-dependent ROC curve analysis
We performed the time-dependent ROC curve analysis to compare sensitivity and specificity for survival prediction between lncRNA-focus risk score model and an established UAMS 17-gene prognostic model [36] in the GSE24080 dataset, GSE57317 dataset and GSE9782. The AUC value was obtained from ROC analysis and compared between these two predictive models. In the GSE24080 dataset and GSE9782 dataset, the lncRNAfocus risk score model achieved AUC value of 0.682 and 0.595, which is higher than those (AUC = 0.666 and 0.572) derived from UAMS 17-gene prognostic model (Fig. 6), indicating that the predictive ability of lncRNAfocus risk score model was better than UAMS 17-gene prognostic model in GSE24080 and GSE9782 datasets. However, in the GSE57317 dataset, established UAMS 17-gene prognostic model had a higher AUC value than our lncRNA-focus risk score model (0.937 vs. 0.656, Fig. 6).

Functional prediction of prognostic lncRNA biomarkers
To explore the functional implication of four prognostic lncRNA biomarkers in MM tumorigenesis and development, we performed bioinformatics analysis to predict lncRNA functions. We first calculated the Pearson correlation coefficient between lncRNA and mRNA by examining the paired lncRNA and mRNA expression profiles of 559 patients with MM in the GSE24080 dataset. The top 1 % mRNA was selected as co-expressed mRNAs with prognostic lncRNA biomarkers. A total of 789 mRNAs were positively or negatively correlated with at least one of the four prognostic lncRNAs (see Additional file 2). Functional enrichment analysis showed that these co-expressed mRNAs with prognostic lncRNAs were significantly enriched in 104 GO terms and 9 KEGG pathways (p < 0.05 and Fold Enrichment > 2) (see Additional file 3), which are mainly involved in six functional clusters including cell cycle, chromatin modification, DNA replication, microtubule-based process, DNA repair and RNA processing (Table 3). We further examined whether there were any important genes of interest identified from integrative analysis of lncRNA-mRNA. We found that 135 of 789 co-expressed mRNAs (corresponding to 62 genes) with four prognostic lncRNAs are known cancer genes recorded in NCG database (http://ncg.kcl. ac.uk/index.php, version 4.0) [37] (see Additional file 4), which is a manually curated cancer gene repository. Especially, gene NRAS has been verified experimentally to be associated with MM [38].

Discussion
During the past years, great progress in our understanding of the initiation and progression of multiple myeloma has been witnessed [39,40]. However, the clinical outcome of patients with MM still remains highly heterogeneous. Traditional laboratory parameters Sβ2M and serum albumin, referred to the International Staging System (ISS), have been used as an objective staging system [41]. Subsequent cytogenetic studies found that cytogenetic abnormalities, such as 13q14 deletion and t(4;14) translocation, also can provide valuable prognostic information [42,43]. However, both ISS and cytogenetic abnormalities demonstrated limited ability for therapeutic risk stratification. With the development of high-throughput techniques, expression profiles-based  [44,45].
Several multigene-expression signatures, including UAMS 17-gene [36], IFM 15-gene [2] and EMC 92-gene [46] models, have been developed to predict survival in patients with MM. More recently, dysregulation of lncRNA expression were observed in the newly diagnosed patients with MM, indicating their potentials as biomarkers for diagnosis and prognosis in MM [27]. However, the prognostic significance of expression profile-based lncRNA signature for outcome prediction in patients with MM has not yet been investigated.
In this study, we have investigated the lncRNA expression profiles of a large cohort of patients with MM by repurposing the publically available microarray data. Through integrative analysis of lncRNA expression data with clinical features, 59 lncRNAs were found to be significantly associated with patients' OS in MM. After considering interrelation among 59 lncRNAs and clinical features using multivariate analysis, we identified four prognostic lncRNAs that were able to independently predict patients' OS. Two (RP4-803 J11.2 and RP1-43E13.2) of four lncRNAs are located chromosome 1 that has been proven to be key players in MM progression [36], and their expression correlated with shortened survival. Then an lncRNA-focus risk model was developed by incorporating expression patterns of four prognostic lncRNAs and their relative contributions in the multivariate analysis in the training dataset. By applying this four-lncRNA-based risk model to the patients of training dataset, a better risk stratification for patients' outcome was observed between survival curves of patients with high-risk or low-risk scores. Patients in the high-risk group had significant shorter OS than those in the low-risk group. Further validation of the four-lncRNA-based risk model constructed in the training dataset showed similar prognostic power in the testing dataset and another two independent external patient datasets. These analyses suggested that the prognostic value of the four-lncRNA-based risk model is robust and reliable for survival prediction in MM.
We next performed multivariate analysis to test whether the prognostic power of the four-lncRNA-based risk model for survival prediction is independent of known prognostic variables and other clinical features. The estimations of HR for OS were 2.066, 1.726, 1.905 and 1.909 in the training, testing, entire GSE24080 and GSE9782 datasets, respectively. Also, some known prognostic variables, including Sβ2M, serum albumin and LDH revealed significant correlation with patients' OS. So we carried out stratification analysis for Sβ2M, ALB and LDH to further evaluate the independence of the four-lncRNA-based risk model for survival prediction. The results of stratification analysis suggested that the four-lncRNA-based risk model was able to effectively classify patients into high-risk group and low-risk group with significantly different OS for both two subgroups stratified by three clinical prognostic variables. These results of multivariate analysis, taken together with stratification analysis, demonstrated that the four-lncRNA-based risk model was an independent prognostic factor for survival prediction in MM.
Although substantial computational evidence for the prognostic significance of the lncRNA signature in MM has been revealed, the underlying mechanisms of these four prognostic lncRNAs in the development of MM were still unclear. So we performed an integrative analysis of lncRNA-mRNA by utilizing the matched lncRNA and mRNA expression profiles to infer functional implication of these four prognostic lncRNAs. The functional enrichment analysis of mRNA co-expressed with lncRNAs revealed that the biological functions annotated to the four prognostic lncRNAs mainly involve cell cycle, chromatin modification, DNA replication, microtubule-based process, DNA repair and RNA processing. These functions are all of essential significance contributing to the initiation and progression of MM [39]. Although bioinformatics analysis indicated that these four prognostic lncRNAs may play significant role in the initiation and progression of MM through associations with known genetic and epigenetic events linked to MM, further experimental validation of these four prognostic lncRNAs is necessary for understanding their functional roles in MM.