Non-Hodgkin lymphoma response evaluation with MRI texture classification

Background To show magnetic resonance imaging (MRI) texture appearance change in non-Hodgkin lymphoma (NHL) during treatment with response controlled by quantitative volume analysis. Methods A total of 19 patients having NHL with an evaluable lymphoma lesion were scanned at three imaging timepoints with 1.5T device during clinical treatment evaluation. Texture characteristics of images were analyzed and classified with MaZda application and statistical tests. Results NHL tissue MRI texture imaged before treatment and under chemotherapy was classified within several subgroups, showing best discrimination with 96% correct classification in non-linear discriminant analysis of T2-weighted images. Texture parameters of MRI data were successfully tested with statistical tests to assess the impact of the separability of the parameters in evaluating chemotherapy response in lymphoma tissue. Conclusion Texture characteristics of MRI data were classified successfully; this proved texture analysis to be potential quantitative means of representing lymphoma tissue changes during chemotherapy response monitoring.

cal significance and which correlate with pathophysiology detected by other methods, i.e. clinical examination, other imaging modalities and pathological-anatomical diagnosis, and secondly to provide this new information on the properties of tissues to be used alone or in combination with other clinical information allowing more reliable detection of disease and sophisticated tissue classification as a clinical diagnostic and follow-up tool.
Precise and earlier diagnostics and monitoring treatment response are significant both for the individual patient's prognosis and on a larger scale in developing treatment procedures, especially in malignant diseases. Within the research on solid tumors extensive and widely used Response Evaluation Criteria in Solid Tumors (RECIST) Guidelines may be followed to obtain intra-and inter center comparable results. RECIST defines measurability of tumor lesions and specifies methods of measurements with different techniques [1]. According to the RECIST criteria measure of tumor response from radiological images is done by measuring lesions one-dimensionally, furthermore the World Health Organization (WHO) criteria use two dimensional analysis and several research groups volumetric three-dimensional analysis [2].
Staging of non-Hodgkin's lymphomas (NHL) is the key element of treatment planning for this heterogeneous group of malignancies. A variety of diagnostic tools, including biopsies, computed tomography (CT), magnetic resonance imaging (MRI), 18 F-fluorodeoxyglucose positron emission tomography (FDG-PET) or molecular markers are used in pre-treatment staging [3]. Enhancement with contrast media could also help the evaluation in using different imaging modalities. The same tools are applied to evaluate the response to different types of treatment. Novel techniques such as hybrid positron emission tomography -computed tomography (PET-CT) imaging and new PET tracers like 18 F-fluoro-thymidine ( 18 F-FLT) may increase the sensitivity of response assessment [4]. Reports aiming international standardization of clinical response criteria for NHL have been published [5,6], and these criteria are in wide clinical use. A combination of cyclophosphamide, doxorubicin, vincristine and prednisone (CHOP) remains the mainstay of therapy. The addition of a chimeric-anti-CD20 immunoglobulin G1 monoclonal antibody, rituximab (Mabthera ® ), has resulted in a dramatic improvement in the outcome of the most common NHL, diffuse large B-cell lymphoma, but has also been shown to effective in other type of B-cell lymphomas [7][8][9].
In this paper we report the ability of TA to detect changes in NHL solid tissue masses during chemotherapy. The change in texture appearance is controlled by quantitative volumetric analysis. We classify statistical, autoregressive (AR-) model and wavelet texture parameters representing pre-treatment and two under chemotherapy stages of tumors with four analyses: raw data analysis (RDA), principal component analysis (PCA), linear (LDA) and nonlinear discriminant analysis (NDA). The final objective is to show that these texture parameters of MRI data can be successfully tested with Wilcoxon paired test and Repeatability and Reproducibility (R&R) test for assess the impact of the parameters usability in evaluating chemotherapy response in lymphoma tissue.

Methods
Tumor Response Evaluation (TRE) is a wide prospective clinical project ongoing at our university hospital on cancer patients, where tumor response to treatment is evaluated and followed up using simultaneously CT, MRI and PET imaging methods. Clinical responses for these lymphoma patients were assessed according to the guidelines of the international working group response criteria. In this texture analysis study, as a part of extensive project, the focus was on quantitative imaging methods and only the response in predefined solid NHL masses was evaluated. The ethics committee of the hospital approved the study and participants provided written informed consent. Primary inclusion criteria were NHL patients with at least one bulky lesion (over 3 centimeters) coming for curative aimed treatment. Exclusion criteria were central nervous disease, congestive heart failure New York Heart Association Classification (NYHA) III-IV, serious psychiatric disease, HIV infection and pregnancy.

Patients
MRI images of nineteen NHL patients participating in the TRE project were selected for the first part of this study. One of these patients was excluded due to the smaller amount of image data from the second part analyses. There were 14 male and 5 female patients aged 34-75. These patients had untreated or relapsed histologically diagnosed high/intermediate (N = 8, 42%) or low-grade (N = 11, 58%) NHL with an evaluable lymphoma lesion either in the abdominal area (N = 16) or in the clavicular and axillary lymph node area (N = 3). The treatment given was chemotherapy alone or combined with humanized antibody, rituximab (Mabthera ® ). Therapy regimens were CHOP (N = 5), R-CHOP (rituximab and CHOP) (N = 8), and CVP (cyclophosphamide, vincristine and pred-nisone) (N = 1), CHOP-like CNOP (cyclophosphamide, mitoxantrone, vincristine and prednisone) (N = 1), ChlP (chlorambucil and prednisone) (N = 1), starting with CHOP and changing to R-CHOP (N = 2), starting with R-CHOP and changing to R-CVP (N = 1). Chemotherapy regimens were selected according to patients' clinical status. Chemotherapy courses were repeated every three weeks, and 4 to 9 courses were given according to clinical response. Two patients received 4 cycles, four patients 6 cycles, one patient 7 cycles, and 11 received 8 cycles, and one 9 cycles.

MR imaging schedule
MR imaging in clinical practice as well as in this study was carried out at staging phase before any treatment (examination 1, E1), after the first chemotherapy cycle (examination 2, E2), and after the fourth chemotherapy cycle (examination 3, E3). In addition patients were followed up by using MRI six months and 6-61 months after the completion of therapy. The time frame of the study is presented in Figure 1.

MR image acquisition
Imaging was performed on a 1.5 T MRI device (GE Signa, Wisconsin, USA).
One contrast enhanced sequence acquired from the first and second imaging timepoint were included for volume analysis of lymphoma masses. The sequence used was axial T2-weighted fast spin echo (FSE) fat saturation (FAT SAT) sequence (TR 620 ms, TE 10 ms), with intravenous contrast agent gadolinium chelate (gadobenate dimeglu-mine, 0.2 mg/ml, 10 ml), slice thickness ranged from 5 mm to 12 mm.
One or two T1-and T2-weighted axial image serquences from the first three imaging timepoints of every patient were taken for texture analysis. The T1-weighted series comprised T1-weighted spin echo (SE) and T1-weighted SE FAT SAT sequences (TR 320-700 ms, TE 10 ms), the T2-weighted sequences were FSE FAT SAT (TR 3 320-10 909 ms, TE 96 ms). Repetition time TR varied between and within patients. Slice thickness varied between patients according to clinical status from 5 mm to 12 mm; most patients had two different slice thickness series, the general combination was 5 mm and 8 mm series. Pixel size varied from 1.33 mm*1.33 mm to 1.80 mm*1.80 mm, and a 256*256 matrix was used.

Texture analysis with MaZda
Texture parameter calculation was the first stage of the texture analyses. Stand-alone DICOM viewer application was used to select three to five slices from every image series for analysis. Region of interest (ROI) setting and texture analysis were carried out with MaZda software (MaZda 3.20, The Technical University of Lodz, Institute of Electronics) [33,34]. The lymphoma masses were manually selected and set as ROIs ( Figure 2). Texture features calculated were based on histogram, gradient, run-length matrix, co-occurrence matrix, autoregressive model and wavelet-derived parameters [34]. Image grey level intensity normalization computation separately for each ROI was performed with method limiting image intensities in the range [μ-3σ, μ+3σ], where μ is the mean grey level value and σ the standard deviation. This method has been shown to enhance differences between two classes when comparing image intensity normalization methods in texture classification [35].
Fisher coefficient (Fisher) and classification error probability (POE) combined with average correlation coefficients (ACC) provided by MaZda were used to identify the most significant texture features to discriminate and classify the three evaluation stages of lymphoma tissue. Ten texture features were chosen by both methods (Fisher, POE+ACC). This feature selection was performed separately for the T1-and T2-weighted image sets. In these subgroups feature selection was run for the following imaging stages: combination of all imaging timepoints (E1, E2, and E3), and all combinations of the two aforementioned. Slice thickness was not taken into account.

Volumetric analysis
The volumetry of the solid lymphoma masses was evaluated between diagnostic stage (E1) and after the first treatment (E2). The masses were selected for evaluation before chemotherapy. The same masses were followed after the Time frame of the study Figure 1 Time frame of the study. E1-E5 refers to the MRI examination timepoints 1-5, respectively. first treatment. Volumetric analysis based on MRI images was performed with semiautomatic segmentation software Anatomatic™ [36] with region growing method. [37].

Clinical parameters analyses
The patients' subjective views on their clinical symptoms was observed between two stages: at the diagnosis and after the first treatment. The subjective views were set in two groups: symptoms unchanged or relieved.

Tissue classification
B11 application (version 3.4) of MaZda software package was used for texture data analysis and classification. Analyses were run between all combinations of imaging stages separately for T1-and T2-weighted images. Analyses were performed for combination of parameters selected automatically with Fisher and POE+ACC methods for 1) the specific imaging timepoint pair in question and 2) for all imaging stages in particular image type (T1-, T2weighted). Feature standardization was used in B11, the mean value being subtracted from each feature and the result divided by the standard deviation. Raw data analysis (RDA), principal component analysis (PCA), and linear (LDA) and nonlinear discriminant analysis (NDA) were run for each subset of images and chosen texture fea-ture groups. B11 default neural network parameters were used. Nearest-neighbor (1-NN) classification was performed for the raw data, the most expressive features resulting from PCA and the most discriminating features resulting from LDA. Nonlinear discriminant analysis carried out the classification of the features by artificial neural network (ANN). These classification procedures were run by B11 automatically.

Statistical analyses
Statistical analyses were run for the texture features MaZda's automatic methods (Fisher and POE+ACC) had shown to give best discrimination between imaging timepoints. The T1-and T2-weighted image texture parameters were tested separately. Texture parameters for 18 patients were included in the test, one patient participating in MaZda texture parameter calculation was excluded because of smaller amount of image data than other patients leading to reduced textural data.
In analyzing and seeking the best parameters for classification, it is vital to ensure low overall variation in the treatment process and to ascertain how this variation can be focused onto different components in the whole process. In the present study the repeatability and reproducibility (R&R) method was applied. The design of the study was experimental, the aim being to estimate different sources of variation in the lymphoma texture at the three different timepoints (examinations 1, 2, and 3) and repeating the same measurements three times. Because the distributions were skewed, the range method was used.
According to the standard Gage R&R terminology timepoints stand for operators, patients for parts and repeated measurements for trials. In statistical terms the following variance components were estimated: repeatability (difference across measurements), reproducibility (difference across timepoints) and variability (difference across patients). Repeatability describes intrapatient variation, i.e., how a given measurer repeats the same planning process. Reproducibility describes interpatient variation, i.e., how different measurements at the timepoints follow the same planning process and variability describes interpatient variation, i.e. how well the same physician can repeat the planning process for different kinds of patients. The total error -also known as the combined R&R effectincludes repeatability and reproducibility, and only patient-to-patient variation is excluded. In industrial applications the combined R&R should not exceed 10% of the total variation, but in certain situations a total error up to 30% may be acceptable. The present statistical analyses were performed by Statistica/W (Version 5.1, 98 edition, Statsoft. Inc, Tulsa, OK, USA).
Textural data from T1-and T2-weighted fat saturation image series were analysed separately and both groups divided into two subgroups according to slice thickness: 5-7 mm and 8-12 mm. Differences between imaging timepoints were analysed by Wilcoxon Signed Ranks.
Mann-Whitney test was used to test rank parameters grouped by grade of malignity and subjective change of symptoms. These analyses were performed by SPSS for Windows, version 14.0.2.

Volumetric analysis
The median volume of the lymphoma masses before treatment (E1) was 429 cm 3 , ranging from 72 cm 3 to 2144 cm 3 . The median volume of the masses calculated from the second imaging timepoint (E2) was 190 cm3, ranging from 30 cm 3 to1622 cm 3 . After the first treatment cycle, the lymphoma mass volume had decreased in all patients. The median decline in volume was 32%, ranging from 3% to 76%. The results of this volumetric analysis have been published earlier in more detail [37]. The volumetry results of the first and second imaging are given in cm 3 , and the volume change is calculated in percentages in Table 1.

Clinical parameters analyses
According to the patient's subjective estimates clinical symptoms between first and second imaging timepoint were unchanged in eight patients and relieved in 11 patients. Grades of malignancy and subjective view on symptoms are presented in Table 1 with volumetry results. Results of the volumetric analysis of first (E1) and second imaging stages (E2). Volumes are given in cm 3 , and the volume change calculated in percentages.
Texture features were selected with Fisher and POE+ACC methods in MaZda from 300 original parameters calculated for each of the four subgroups in both image data classes T1-and T2-weighted.
We found that the most significant features varied clearly between imaging stages. The whole of 74 TA features ranked first to tenth significant feature in tested subgroups. There were three histogram parameters, 55 cooccurrence parameters, nine run-length parameters, four absolute gradient parameters and three autoregressive model parameters. No wavelet parameters were placed in the top group.
Data analyses RDA, PCA, LDA and NDA show texture changes between imaging points. The analyses did not perform well the task of discriminating all three imaging timepoints (E1, E2, E3) at same time. Slightly better classification was achieved between the first and second examinations, and between the second and third examinations. The method was successful in classifying the textural data achieved from the pre-treatment and third imaging timepoints, the best discrimination was obtained within T2-weighted leading to NDA classification error of 4%, and within T1-weighted NDA 5% error. Classification of different examination stages lead to same level results in T1-and T2-weighted images. The overall classification results are presented in Table 2 and Table 3.

Texture data: Statistical analyses
The values of 73 features obtained with MaZda feature selection methods were tested with Wilcoxon paired test for groups obtained from imaging timepoints a) E1 and E2, b) E2 and E3, c) E1 and E3. T1-and T2-weighted fat saturation image series data were set as their own groups and further into two subgroups according to slice thickness: 5-7 mm and 8-12 mm.
R&R test parameter repeatability was used to describe the variation in texture features between image slices within imaging sequence, and parameter reproducibility to describe the variation between examination stages. This test was performed separately for T1-and T2-weighted images in all three combinations of two imaging points. Differences in slice thickness were not taken into account. Reproducibility values were expected to be quite large because the aim was that the treatment given between imaging stages would take effect and be shown in image texture. In contrast, repeatability values (i.e. differences between images taken at the same timepoint) were expected to be zero. There is no exact expected ratio for reproducibility and patient-to-patient variation in such studies and thus no exact value for percentage of reproducibility, so that the difference between different imaging stages was significant.
The texture parameters giving the best discrimination within T1-weighted image groups in two imaging stage comparison are given in Table 4, Table 5 and Table 6; and respectively for T2-weighted image groups in Table 7, Table 8 and Table 9. Reproducibility percentage and Repeatability percentage of the total are given for all parameters. Wilcoxon paired test p-values are given for all parameters for separate groups regarding slice thickness (groups 5-7 mm and 8-12 mm). Imaging timepoint (E1, E2, E3) combinations for classification analyses. Feature selection methods given in rows. Misclassification percentage (mis%) given for raw data analysis (RDA), principal component analysis (PCA), linear discriminant analysis (LDA) and non-linear discriminant analysis (NDA) in columns. "Combination E1, E2, E3" in feature selection methods refers to features, which have proved to give best discrimination in all imaging timepoints analyses with Fisher and POE+ACC methods, combination of two imaging timepoints refers respectively to features from the analyses in question. Imaging timepoint (E1, E2, E3) combinations for classification analyses. Feature selection methods given in rows. Misclassification percentage (mis%) given for raw data analysis (RDA), principal component analysis (PCA), linear discriminant analysis (LDA) and non-linear discriminant analysis (NDA) in columns. "Combination E1, E2, E3" in feature selection methods refers to features, which have proved to give best discrimination in all imaging timepoints analyses with Fisher and POE+ACC methods, combination of two imaging timepoints refers respectively to features from the analyses in question. R&R inverted ratio and the small difference between values are associated with poor results in Wilcoxon test with certain exceptions. Comparisons between first and third imaging points achieved significant Wilcoxon test p-values most consistently: within T2-weighted images in both slice thickness groups, and within T1-weighted images in the group of thinner slices. Features ranked in T1weighted image data were tested in T2-weighted image data and vice versa. These tests with ranked features transposed with T1-and T2-weighted image groups lead to statistically relevant p-values in thinner T1-weighted images and all images in T2-weighted group. In the analyses of first and second imaging timepoints thin slices in general achieved poorer separation than thick slices. Between the second and third imaging sessions Wilcoxon test gave an unsatisfactory result in T1-weighted group. This trend can be seen in the B11 classification results in the framework of T1-weighted images, while the T2-weighted image analyses in B11 show better classification between second and third than first and second imaging points. The best overall discrimination between imaging timepoints in T1-weighted images was given by the run-length matrix parameters describing grey level non-uniformity, runlength non-uniformity, short-run emphasis and fraction of image in runs in one or more directions calculated (horizontal, vertical, 45 degrees and 135 degrees). In the framework of T2-weighted image analyses best the performers were absolute gradient mean and grey level nonuniformity There were some scattering in well acquitted parameters between sub analyses.
Mann-Whitney test was performed for all texture features ranked 1-5 in any classification sub-analysis separately in T1-and T2-weighted images and further subgroups according to slice thickness to analyze differences between stage of malignity (low vs. high/intermediate) and between subjective change of symptoms (unchanged vs. relieved). These analyses did not yield any relevant and consequential additional information on the relation of texture features to grouping parameters. Texture parameters are given in rows. In the columns R&R repeatability and reproducibility of total, and Wilcoxon test for fat saturation series grouped with image slice thickness less than 8 mm, and 8 mm or thicker.

Discussion
The goals of this study were show that a) MRI texture analysis can be used in NHL chemotherapy response evaluation b) statistical tests Wilcoxon paired test and R&R can be used to evaluate the separability of texture parameters used to describe textural changes in NHL.
Limitations of our study may be the non-standardized MRI sequence protocols within intra and inter patient images and the use of different slice thickness due to imaging in clinical practice, where patient's clinical stage and the size of the tumor were taken into account when setting imaging parameters. However, multicenter studies on MRI TA have shown transferability of TA parameters achieved from MRI images obtained at different MRI centers with own acquisition parameters [16,38].
To achieve new clinical relevant information by means of texture analysis, the texture changes should come out at the same or earlier timepoint as other quantitative measures of tumor response, for example decrease in tumor volume. The RECIST and WHO criteria for evaluating tumor response in one-or two-dimensional (diameter and product) tumor size is equivalent to a 65% decrease in tumor volume [1]. In this study we calculated tumor size decrease in a short time period: before and after the first cycle of chemotherapy. There are no commonly used criteria for early response assessment using volumetric analysis for use as early in the therapy course as our volumetric evaluation was performed. Considering this, we can use the volumetric results as indicative of early imaging based evaluation of response, not to meet response, and also accept tumor volume decrease percentages smaller than 65% as consequential decrease in tumor size. However, in lymphomas, final clinical response evaluation should include other clinical tests according to [5,6].
Wilcoxon test showed encouraging values in the analyses of E1 and E3, including transferability of feature sets between T1-and T2-weighted images. This confirms our recent results with smaller patient data MaZda texture analysis of combination of T1-and T2-weighted images in single analysis [32].
Our study show that the statistical and autoregressive model texture parameters of MRI data can be successfully tested one by one with Wilcoxon paired test and Gage Repeatability and Reproducibility test to assess the impact of parameter separability in evaluating chemotherapy response in lymphoma tissue. Our results strengthen the applicability of Fisher and POE+ACC methods used in MaZda for automatic feature selection, and also confirm the suitability of the raw parameters in statistical tests. This indicates that raw parameters may be used in analyses other than LDA, NDA and PCA tests to acquire classification.
We have shown that texture parameters change during tumor response to chemotherapy. Comparing initial imaging to the second imaging timepoint, just after the first chemotherapy cycle, there were not such clear changes as at the third imaging timepoint, after four cycles of chemotherapy. The difference in texture appearance between staging and the third imaging timepoint was distinct and emerged from the results of other combinations in both T1-weighted and T2-weighted image types. There might have been better separation in texture features between diagnostic and first evaluation stage if standardized imaging sequence had been used. Our non-standardized MRI sequence may lead too heterogeneous TA features to exactly describe subtle changes in lymphoma tissue in extremely early stages of therapy response evaluation. We still cannot state the importance of subtle textural changes in early response assessment in comparison to volumetric changes in the same time intervals. Further, as controls for examined NHL masses no normal lymph nodes neither NHL masses after treatment were analyzed, Texture parameters are given in rows. In the columns R&R repeatability and reproducibility of total, and Wilcoxon test for fat saturation series grouped with image slice thickness less than 8 mm, and 8 mm or thicker.
since their small size leading to not exact differentiation from surrounding soft tissue structures in MR images.
The response evaluation of lymphomas under treatment using radiological imaging methods is connected strongly with tumor dimensions, instead when using positron emission tomography, tumor lesion activity of tracer uptake is measured. Both methods have certain advantages and disadvantages; major disadvantages related to sensitivity to differentiate residual masses and inflammatory processes from active disease. Functional responses for nocicepti stimuli and antivascular therapy have been detected in recent MRI TA studies [18,31]. In this context changes in textural appearance in MRI during the treatment process probably reflect chemotherapy induced changes in cellular proliferation.
In treatment with a curative orientation it is essential to get early an estimate of response to determine further treatment. MRI texture analysis may provide new insight to be used alone or in combination with other tools in diagnostics and response monitoring of non-Hodgkin lymphomas.

Conclusion
In conclusion NHL tissue MRI texture imaged before treatment and during chemotherapy can be correctly classified.
Our results show promise for texture analysis as a possible new quantitative means for evaluating NHL response. Statistical and autoregressive model texture parameters of MRI data can be successfully tested with Wilcoxon paired test and Gage Repeatability and Reproducibility test to assess the impact of the parameters separability in evaluating chemotherapy response in lymphoma tissue.

Competing interests
The authors declare that they have no competing interests. Texture parameters are given in rows. In the columns R&R repeatability and reproducibility of total, and Wilcoxon test for fat saturation series grouped with image slice thickness less than 8 mm, and 8 mm or thicker.