Skip to main content

Gene expression subtraction of non-cancerous lung from smokers and non-smokers with adenocarcinoma, as a predictor for smokers developing lung cancer



Lung cancer is the commonest cause of cancer death in developed countries. Adenocarcinoma is becoming the most common form of lung cancer. Cigarette smoking is the main risk factor for lung cancer. Long-term cigarettes smoking may be characterized by genetic alteration and diffuse injury of the airways surface, named field cancerization, while cancer in non-smokers is usually clonally derived. Detecting specific genes expression changes in non-cancerous lung in smokers with adenocarcinoma may give us instrument for predicting smokers who are going to develop this malignancy.


We described the gene expression in non-cancerous lungs from 21 smoker patients with lung adenocarcinoma and compare it to gene expression in non-cancerous lung tissue from 10 non-smokers with primary lung adenocarcinoma.


Total RNA was isolated from peripheral non-cancerous lung tissue. The cDNA was hybridized to the U133A GeneChip array. Hierarchical clustering analysis on genes obtained from smokers and non-smokers, after subtracting were exported to the Ingenuity Pathway Analysis software for further analysis.


The genes subtraction resulted in disclosure of 36 genes with high score. They were subsequently mapped and sorted based on location, cellular components, and biochemical activity. The gene functional analysis disclosed 20 genes, which are involved in cancer process (P = 7.05E-5 to 2.92E-2).


Detected genes may serve as a predictor for smokers who may be at high risk of developing lung cancer. In addition, since these genes originating from non-cancerous lung, which is the major area of the lungs, a sample from an induced sputum may represent it.


Lung cancer is the commonest cause of cancer death in developed countries and throughout the world. Worldwide, the estimated number of new lung cancer cases in 2002 is 1.2 million (12.3% of all new cancer cases). Over 90% of these new cases will die as a result of the disease [1]. The death rate for lung cancer exceeds the combined total for breast, prostate and colon cancer in developed countries [1]. Cigarette smoking is the main risk factor for lung cancer, accounting for about 90% of cases in men and 70–85% of cases in women [2]. Genetic risk factors contribute to an individual's susceptibility to lung cancer, which is illustrated by the fact that about 16% of long-term smokers will develop lung cancer [3]. Unfortunately, no effective clinical tests are available for early detection of lung cancer. Smoking cessation programs are critical, but ex-smokers continue to have a higher risk to develop lung cancer. Even more than 40 years after cessation compared with never-smokers [4]. These days ex-smokers comprise almost 50% of all new lung cancer cases in developed countries, indicating a strong need for a search for new means of early diagnosis in lung cancer and chemoprevention in this high-risk group (4). Nevertheless, an important issue for these approaches is the appropriate selection of an optimal high-risk population.

Long-term carcinogenic exposure (cigarettes smoke) is characterized by genetic alteration and diffuse injury of the airways surface [5]. These changes in epithelium can give rise to cancer at multiple points, for that reason named field cancerization. Genetic changes detected in premalignant lesions or malignant in one region of the field, translate into an increased risk of cancer development throughout the entire field. Therefore, it can be concluded that cigarettes smoking may induce field cancerization whereas cancer in non-smokers is clonally derived, i.e. by single cell transformed is the ancestor of all cells that compose the neoplasm [68].

The consequence of this is that the non-cancerous tissue in non-smoker patient with primary lung cancer should not show any alteration compared to normal, whereas in smokers due to field cancerization morphological and genetic alteration may possibly be observed. Probably, this genetic alteration in smokers could serve for early detection of lung cancer.

Identification of individuals at greatest risk of developing lung cancer could enhance the efficacy of intervention modalities, thereby greatly reducing mortality from this disease. One strategy for identifying these people is to establish molecular markers that reflect the severity of their cancerization field.

Microarray can simultaneously measure the expression of thousands of mRNAs. They are used in many biological fields and in different species. This high-throughput technique can be used to predict the function of unknown genes, in medical diagnostics, in biomarker discovery, to infer networks from the regulatory interactions between genes, and to investigate the mechanisms by which a drug, disease, mutation and environmental condition affects gene expression and cell function. Large datasets are produced, particularly from whole-genome arrays, and public databases hold substantial quantities of gene-expression information [9, 10].

Adenocarcinoma (AC) is becoming the most common form of lung cancer. It has increased relative to other histological types of lung cancer over the last several decades [1114].

These observations lead us to investigate the gene expression in non-involved lungs from patients with adenocarcinoma and compare it to gene expression in non-involved lung tissue from non-smokers. We used the U133A GeneChip array (Affymetrix, Santa Clara, CA), GeneSpring analysis software and the Ingenuity Pathway Analysis software.


Patient Lung Tissue Samples

We obtained lungs tissue from 32 patients with adenocarcinoma of lung who underwent pulmonary lobectomy (Table 1). All were pre-operatively diagnosed as having lung adenocarcinoma by bronchoscopy or fine needle aspiration. Tumors were carefully staged preoperatively, and clinically believed to be N2 node negative by, computerized tomography, positron emission tomography, or more frequently mediastinoscopy. None of the patients received preoperative anti-cancer therapy. Surgery included mediastinal lymph node sampling. The histological classification was made according to the standard WHO criteria. The affected lobes were removed; all non-involved lung tissue samples were obtained in accordance with Institutional Review Board guidelines. The tissues were carefully inspected, and declared histologically normal (never-smokers) or non-cancerous (smokers).

Table 1 Profile of Lung adenocarcinoma patients

RNA isolation and microarray hybridization

RNA isolation and microarray hybridization. Total RNA was isolated from lung cancer tumor tissue and adjacent non cancerous lung tissue using RNeasy midi kit (Qiagen sciences Maryland USA). cDNA was synthesized from 8 ug total RNA using the SuperScript double stranded cDNA synthesis kit (Invitrogen, Carlsbad CA, USA), and T7-oligo(dT)24 primer, according to the manufacturer's instructions. Biotynilated cRNA was synthesized using the "ENZO bioarray high yield RNA transcript labeling kit". (Enzo life sciences Inc., Farmingdale NY USA). The fragmented probe was hybridized to the Affymetrix Human genome U133A Genechip, according to the manufacturer's protocol (Affymetrix inc. Santa Clara, CA USA). The microarrays were scanned using a confocal scanner (Affymetrix Inc.). 18S ribosomal RNA was used as an endogenous control. A single weighted mean expression level for each gene was derived by using Microarray Suite 5.0 software (Affymetrix).

GeneSpring hierarchical clustering analysis

Expression data were normalized using GeneSpring 7.0 software (Silicon Genetics inc. Redwood CA USA). Gene expression data were normalized in 2 steps. (1) Per chip normalization: and (2) Per gene normalization. We compared gene expression originating from normal lung to non-cancerous lung. More than three fold in expression were considered as significant and used for further analysis.


The seven genes (FYN, MYCN, BRCA1, CD44, Cyclin B2, CDK5RAP3 and RAGE), were further analyzed using quantitative RT-PCR. Cancerous and non cancerous lung tissues from eight randomly chosen patients were studied. 1st strand cDNA was synthesized from 2 ug total RNA by random priming, using the Superscript II cDNA synthesis kit (Invitrogen, Carlsbad CA, USA) according to the manufacturer's instructions. RT PCR reactions were carried out using Taq-Man "Assay on demand" gene expression primers and probe sets; results were analysed by GeneAmp 5700 SDS software (Applied Biosystems, Branchburg, NJ USA). The assays used were: FYN- PPH00147A_ml, MYCN- PPH10927A_ml, BRCA1- PPH00322B_ml, CD44- PPH00114A_ml, RAGE – Hs0015395_m1, Cyclin B2 – Hs00270424_m1 and CDK5RAP3 – Hs01003183_g1. 18S ribosomal RNA was used as an endogenous control. (Hs99999901_s1). The RT-PCR results were analyzed using the comparative threshold cycle (CT) method. The "relative Quantity" (RQ) of each gene was calculated as follows: RQ = 2 Δ Δ C T MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xH8viVGI8Gi=hEeeu0xXdbba9frFj0xb9qqpG0dXdb9aspeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaeGOmaiZaaWbaaSqabeaacqGHsislcqqHuoarcqqHuoarcqWGdbWqdaWgaaadbaGaeeivaqfabeaaaaaaaa@3318@ . (15)

Ingenuity Pathway analysis

A network pathway is initiated by the gene with the highest specificity of connections, and is propagated according to the descent of the specificity. Individual significant pathways identified by a statistical likelihood calculation (P < 0.0001) were merged to represent the biological processes. The Ingenuity Pathways Knowledge Base (KB) is the largest curetted database of previously published findings on mammalian biology from the public literature (Ingenuity Systems). Reports on individual studies of genes in human, mouse or rat were first identified from peer-reviewed publications, and findings were then encoded into ontology by content and modeling experts. Manual extraction and curation probably results in more specific and comprehensive interactions, with far fewer false-positives than automated alternatives (for example, natural language processing and high-throughput screening).

Identification of significant pathways in biological processes: The following steps were used: (1) Genes identified as significant from the experimental data sets were overlaid onto the interactome. Focus genes were identified as the subset having direct interaction(s) with other genes in the database. (2) The specificity of connections for each focus gene was calculated by the percentage of its connections to other significant genes. The initiation and growth of pathways proceeded from genes with the highest specificity of connections. Each pathway had a maximum of 35 genes. (3) Pathways of highly interconnected genes were identified by statistical likelihood using the following equation:

Score = log -⁡ 10 ( 1 i = 0 f 1 C ( G , i ) C ( N G , s i ) C ( N , s ) ) MathType@MTEF@5@5@+=feaagaart1ev2aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacPC6xNi=xI8qiVKYPFjYdHaVhbbf9v8qqaqFr0xc9vqFj0dXdbba91qpepeI8k8fiI+fsY=rqGqVepae9pg0db9vqaiVgFr0xfr=xfr=xc9adbaqaaeGaciGaaiaabeqaaeqabiWaaaGcbaGaee4uamLaee4yamMaee4Ba8MaeeOCaiNaeeyzauMaeyypa0JaeyOeI0IagiiBaWMaei4Ba8Maei4zaC2aaSbaaSqaaiabigdaXiabicdaWaqabaGcdaqadaqaaiabigdaXiabgkHiTmaaqahajuaGbaWaaSaaaeaacqWGdbWqcqGGOaakcqWGhbWrcqGGSaalcqWGPbqAcqGGPaqkcqWGdbWqcqGGOaakcqWGobGtcqGHsislcqWGhbWrcqGGSaalcqWGZbWCcqGHsislcqWGPbqAcqGGPaqkaeaacqWGdbWqcqGGOaakcqWGobGtcqGGSaalcqWGZbWCcqGGPaqkaaaaleaacqWGPbqAcqGH9aqpcqaIWaamaeaacqWGMbGzcqGHsislcqaIXaqma0GaeyyeIuoaaOGaayjkaiaawMcaaaaa@5E7D@

Where N is the number of genes in the genomic network, of which G are focus genes, for a pathway of s genes, f of which are focus genes. C(n,k) is the binomial coefficient. (4) Pathways with a score greater than 4 (P < 0.0001) were combined to form a composite network representing the underlying biology of the process.

Network and gene ontology analysis

The genes, which were most differentially expressed between smokers and non – smokers, were used for network and gene ontology analysis. These genes were exported to the Ingenuity Pathway Analysis (IPA) software (Ingenuity Systems, Mountain View, CA). Of the 36 genes, 35 were mapped and sorted based on location, cellular components, and reported or suggested biochemical, biologic, and molecular functions using the software The identified genes also were mapped to genetic networks available in the Ingenuity database and then ranked by score. The score is the probability that a collection of genes equal to or greater than the number in a network could be achieved by chance alone. A score of 3 indicate that there is 1/1000 chance that the focus genes are in a network due to random chance. Therefore, score of 3 or higher have a 99.9% confidence level of not being generated by random chance alone. It is possible for a pathway to be both up-regulated and down-regulated, perhaps because of a block in the pathway where genes above and below the block respond differently. Visualizing the results on pathways will assist the identification of genes that are missing from a pathway.


The clinical profile of the patients is shown in Table 1. This includes patients' pulmonary function data, which showed that 12 smokers had chronic obstructive pulmonary disease (COPD)(FEV1 < 75%), whereas none of the non-smokers own this limitation. GeneSpring hierarchical clustering was applied to gene expression profiles. We subtracted highly expressed genes of non-smokers from smokers this disclosed 36 genes that were expressed differently between smoker and non-smokers. The Ingenuity analysis and sorting of the 36 genes according theirs expression (bolded and italic over-expressed) or suppressed (normal) while forming networks (Table 2). The first network includes very high scored and focus genes, which are involved in cell death, cancer and inflammatory processes. The following networks with lower score (but still high) and focus genes, included genes that are involved in cellular assembly and organization, cellular function and maintenance. The illustrated network (Fig. 1) (serve as a paradigm) demonstrates the intensity of genes over-expressed (red) or suppressed (green) in smokers compared to non-smokers, and their genes network formation. The network includes genes that are cancer promoters and tumor suppressors such as FYN HSPA8, YWHAZ, LEPR and TGFBR2, IRF2, EP3000 respectively. The suppression of the genes EGR1 and STAT1 correlates with active tumor formation. Many networks include fewer genes in spite of being scored high. This is probably due to lack of knowledge.

Table 2 Genes in networks
Figure 1
figure 1

Gene network is from IPA analysis. Gene interaction is shown, red colored up-regulated, green colored suppressed. Color intensity is relative to expression. Genes along canonical pathways that were not changed at transcript level are not colored. The arrow pointed toward the direction of positive regulation.

The gene ontology analysis using the IPA tool showed that many functions were identified as high-level functions (Table 3). Fischer's exact test was used to calculate a p-value determining the probability that each biological function and/or disease assigned to that data set is due to chance alone. Of these, twenty genes were revealed as involved in cancer process (P = 7.05E-5 to 2.92E-2).

Table 3 Functional Analysis*


Tumorigenesis is a multistep process characterized by a myriad of genetic and epigenetic alterations. The evolution of human cells is driven by the aberrant function of genes that positively and negatively regulate various aspects of the cancer phenotype, including altered responses to mitogenic and cytostatic signals, resistance to programmed cell death, immortalization, neo-angiogenesis, and invasion and metastasis. Therefore identification of individuals at greatest risk of developing lung cancer at the initial steps of cancer development could enhance the efficacy of intervention modalities, thereby greatly reducing mortality from this disease.

We compared gene expression profiles of non-malignant lung tissue from patients with AC, smokers and non- smokers, matched for clinical staging all were at stage 1 besides 1 from each group at stage 2 and histologically had well differentiated adenocarcinoma. One can argue why not using matched smokers as control to those who developed already lung cancer? Both have the same risk factor to develop lung cancer but this doesn't preclude future lung cancer in the control group. Therefore they weren't useful for detection of markers for developing lung cancer in smokers.

Thirty six genes were differently expressed in smokers non-cancerous lung tissue compared to same tissue of non-smokers. Many of these genes are involved in the cancer process as described above. Smoking is the major risk factor for developing lung cancer but also it is a major risk factor for COPD, which is a diffuse pulmonary inflammatory and obstructive lung disease. Furthermore, it is an established fact that airflow obstruction, measured by simple spirometry, is an independent risk factor for the development of lung cancer. Some studies [15, 16] have reported an increased risk of four to six times that of patients without airflow obstruction. The exact mechanism of the impact of COPD on the development of lung cancer is unknown [17]. A better genes isolation by addition of control groups from smokers only or COPD patients lung may be hampered by the fact that they may contain cells in precancerous or even cancerous changes, and various degree of inflammation. We also initially examined 4 patients with metastatic colon adenocarcinoma. Since their gene expression was different from lung adenocarcinoma as expected, it was stopped due to lack of relevance.

This study was based on the observation that in smokers, even the non-cancerous tissue demonstrates widespread morphological changes and tumors may arise within the genetically abnormal cells [18]. In contrast, adenocarcinoma in nonsmokers more likely arises in a field of relatively normal cells, as might be the case with prior infection, such as seen in "scar carcinomas."

As mentioned smoking induce changes of variable magnitude to the airways epithelia. Some of these subjects develop COPD, some develop lung cancer, and some both. RNA gene expression might be changed by all the described above. In addition, the presence of tumor itself may induce some inflammatory changes i.e. genes changes. The variety of genes obtained represent specific changes occurring in lung of smokers who develop AC of lung. It should be emphasized that the so called field cancerization may not always develop lung cancer and therefore these changes are not a one way road. Since the inflammatory and cancer development are a dynamic processes the genes in the network may change their state of expression, to be precise their activity. A biological progression model can be postulated in which field development plays a central role and its network gene expression change reciprocally. Consequently, monitoring of field may have profound implications for cancer detection and prevention. In view of the fact that detection of lung cancer in early state provide a better prognosis it's One of the major objective to early detect lung cancer is sampling of the suspected material. Usually the most convenient materials are those originated from blood or sputum because of simple to obtain. The value of genes expression changes and their location in biological networks, may serve as a warning for smokers and ex-smokers for the higher risk of developing lung cancer, i.e. early detection of lung cancer. Since these changes are as described field distributed the sampling could be regional rather than focal. Therefore, sputum sample cells may represent cells within the "field" and altered with genetic precancerous changes.


We described 36 genes that may represent the responsible genes for enhanced risk for developing lung cancer in smokers. The relatively small sample size may contribute to reduced power of the study results. Movement towards the development of consortia to pool findings across studies and increase sample size and power will address some of this issue.


  1. Jemal A, Murray T, Ward E, Samuels A, Tiwari RC, Ghafoor A, Feuer EJ, Thun MJ: Cancer statistics. CA Cancer J Clin. 2005, 55 (4): 259-2005 Jul–Aug;

    Article  Google Scholar 

  2. Shopland DR: Tobacco use and its contribution to early cancer mortality with a special emphasis on cigarette smoking. Environ Health Perspect. 1995, 103: 131-142. 10.2307/3432300.

    Article  Google Scholar 

  3. Alberg AJ, Samet JM: Epidemiology of lung cancer. Chest. 2003, 123: 21S-49S. 10.1378/chest.123.1_suppl.21S.

    Article  Google Scholar 

  4. Peto R, Darby S, Deo H, Silcocks P, Whitley E, Doll R: Smoking, smoking cessation and lung cancer in the UK since 1950: combination of national statistics with two case-control studies. Br Med J. 2000, 321: 323-9. 10.1136/bmj.321.7257.323.

    Article  CAS  Google Scholar 

  5. Romeo MS, Sokolova IA, Morrison LE, Zeng C, Baron AE, Hirsch FR, Miller YE, Franklin WA, Varella-Garcia M: Chromosomal abnormalities in non-small cell lung carcinomas and in bronchial epithelia of high-risk smokers detected by multi-target interphase fluorescence in situ hybridization. J Mol Diagn. 2003, 5 (2): 103-12.

    Article  CAS  Google Scholar 

  6. Slaughter DP, Southwick HW, Smejkal W: "Field cancerization" in oral stratified squamous epithelium. Cancer. 1953, 6: 963-968. 10.1002/1097-0142(195309)6:5<963::AID-CNCR2820060515>3.0.CO;2-Q.

    Article  CAS  Google Scholar 

  7. Dakubo GD, Jakupciak JP, Birch-Machin MA, Parr RL: Clinical implications and utility of field cancerization. Cancer Cell Int. 2007, 7: 2-13. 10.1186/1475-2867-7-2.

    Article  Google Scholar 

  8. Yih-Leong C, Chen-Tu W, Shu-Chen L, Chin-Fu H, Yuh-Shan J, Yung-Chie L: Clonality and Prognostic Implications of p53 and Epidermal Growth Factor Receptor Somatic Aberrations in Multiple Primary Lung Cancers. Clinical Cancer Research. 2007, 13: 52-58. 10.1158/1078-0432.CCR-06-1743.

    Article  Google Scholar 

  9. Gollub J, Ball CA, Sherlock G: The Stanford Microarray Database: a user's guide. Methods Mol Biol. 2006, 338: 191-208.

    CAS  Google Scholar 

  10. Parkinson H, Kapushesky M, Shojatalab M, Abeygunawardena N, Coulson R, Farne A: ArrayExpress–a public database of microarray experiments and gene expression profiles. Nucleic Acids Res. 2007, 35 (Database issue): D747-D750. 10.1093/nar/gkl995. 2006 Nov 28.

    Article  CAS  Google Scholar 

  11. Powell CA, Spira A, Derti A, DeLisi C, Liu G, Borczuk A, Busch S, Sahasrabudhe S, Chen Y, Sugarbaker D, Bueno R, Richards WG, Brody JS: Gene expression in lung adenocarcinomas of smokers and nonsmokers. Am J Respir Cell Mol Biol. 2003, 29: 157-162. 10.1165/rcmb.2002-0183RC.

    Article  CAS  Google Scholar 

  12. Little AG, Gay EG, Gaspar LE, Stewart AK: National survey of non-small cell lung cancer in the United States: Epidemiology, pathology and patterns of care. Lung Cancer. 2007, 57 (3): 253-60. 10.1016/j.lungcan.2007.03.012.

    Article  Google Scholar 

  13. Ebbert JO, Yang P, Vachon CM, Vierkant RA, Cerhan JR, Folsom AR, Sellers TA: Lung cancer risk reduction after smoking cessation: observations from a prospective cohort of women. J Clin Oncol. 2003, 21: 921-926. 10.1200/JCO.2003.05.085.

    Article  CAS  Google Scholar 

  14. Belinsky SA, Palmisano WA, Gilliland FD, Crooks LA, Divine KK, Winters SA, et al: Aberrant promoter methylation in bronchial epithelium and sputum from current and former smokers. Cancer Res. 2002, 62: 2370-2377.

    CAS  Google Scholar 

  15. Nomura A, Stemmermann GN, Chyou PH, Marcus EB, Buist AS: Prospective study of pulmonary function and lung cancer. Am Rev Respir Dis. 1991, 144: 307-311.

    Article  CAS  Google Scholar 

  16. Kurishima K, Satoh H, Ishikawa H, Yamashita YT, Homma T, Ohtsuka M, Sekizawa K: Lung cancer patients with chronic obstructive pulmonary disease. Oncol Rep. 2001, 8: 63-65.

    CAS  Google Scholar 

  17. Purdue MP, Gold L, Järvholm B, Alavanja MCR, Ward MH, Vermeulen R: Impaired lung function and lung cancer incidence in a cohort of Swedish Construction Workers. Thorax. 2007, 62: 51-56. 10.1136/thx.2006.064196.

    Article  Google Scholar 

  18. Powell CA, Klares S, O'Connor G, Brody JS: Loss of heterozygosity in epithelial cells obtained by bronchial brushing: clinical utility in lung cancer. Clin Cancer Res. 1999, 5: 2025-2034.

    CAS  Google Scholar 

Download references

Author information

Authors and Affiliations


Corresponding author

Correspondence to David Stav.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

DS plan the study made all coordination and was involved in the laboratory processing. IB carried out all the surgical procedure and staging and handling the samples. JS responsible for tissue diagnosis and preparing samples for RNA testing. All authors read and approved the final version of manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Stav, D., Bar, I. & Sandbank, J. Gene expression subtraction of non-cancerous lung from smokers and non-smokers with adenocarcinoma, as a predictor for smokers developing lung cancer. J Exp Clin Cancer Res 27, 45 (2008).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: