4.4.5. Feature importance analysis for colorectal cancer survival prediction
Beyond improving prediction performance, it is also valuable to identify the factors that matter most for predicting survival time in colorectal cancer. In this study, semi-random feature selection is used to generate good individual learners in the proposed ensemble regression method. Because the individual learners are regression trees, we use a regression tree to evaluate the importance of the cancer prognostic features for the semi-random feature selection. The importance of a feature is calculated as the (normalized) total reduction in squared error contributed by that feature during construction of the regression tree, as introduced in Table 1 of Section 3. Here, the practical significance of the important features is analyzed in detail. The 10 most important of the 44 features are shown in Fig. 6, ranked in order of importance.
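The importance measure described above can be illustrated with a minimal sketch. It is not the implementation used in this study: the greedy tree-growing routine, the function names (`sse`, `best_split`, `feature_importance`), the depth limit, and the toy data are all assumptions made for illustration. The sketch grows a small regression tree, credits each split's reduction in squared error to the feature that was split on, and normalizes the totals so that they sum to one.

```python
from typing import List, Optional, Tuple


def sse(y: List[float]) -> float:
    """Sum of squared errors of y around its mean (0.0 for an empty list)."""
    if not y:
        return 0.0
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)


def best_split(X: List[List[float]], y: List[float]) -> Optional[Tuple[float, int, float]]:
    """Return (gain, feature index, threshold) of the split with the largest
    squared-error reduction, or None if no feature can be split."""
    parent = sse(y)
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            gain = parent - sse(left) - sse(right)
            if best is None or gain > best[0]:
                best = (gain, j, t)
    return best


def _grow(X, y, importance, depth, max_depth):
    """Recursively split a node, adding each split's gain to its feature's total."""
    if depth >= max_depth or len(y) < 2:
        return
    split = best_split(X, y)
    if split is None or split[0] <= 1e-12:
        return
    gain, j, t = split
    importance[j] += gain
    for keep_left in (True, False):
        part = [(r, v) for r, v in zip(X, y) if (r[j] <= t) == keep_left]
        _grow([r for r, _ in part], [v for _, v in part], importance, depth + 1, max_depth)


def feature_importance(X: List[List[float]], y: List[float], max_depth: int = 3) -> List[float]:
    """Normalized total squared-error reduction per feature."""
    importance = [0.0] * len(X[0])
    _grow(X, y, importance, 0, max_depth)
    total = sum(importance)
    return [v / total for v in importance] if total > 0 else importance


# Synthetic toy data (not SEER data): column 0 mimics age at diagnosis,
# column 1 is an uninformative code; y is survival time in months.
toy_X = [[40, 1], [45, 2], [50, 1], [70, 2], [75, 1], [80, 2]]
toy_y = [58, 55, 57, 20, 18, 22]
print(feature_importance(toy_X, toy_y))
```

On this toy data the age-like column receives nearly all of the importance, because the first split on it removes most of the squared error. A ranking such as the one in Fig. 6 would correspond to sorting the returned values in descending order.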
Among the features used in the model, year of birth and age at diagnosis are the most important. This result indicates that aging and poor survival are closely related in colorectal cancer. Positive lymph nodes examined, number of nodes examined, and CS extension also appear among the top five important variables. Positive lymph nodes examined records the exact number of regional lymph nodes that the pathologist examined and found to contain metastases; number of nodes examined records the total number of regional lymph nodes that were removed and examined by the pathologist; and CS extension codes the farthest extension of the tumor away from the primary site. Colorectal cancer has high rates of recurrence and metastasis, and metastasis and micrometastasis are the main reasons that it is hard to cure after radical resection. These three variables therefore have a great impact on survival time in colorectal cancer.
The feature ranked 6th is month of diagnosis. A study on the survival patterns of breast cancer and colorectal cancer, using datasets of 32,807 breast cancer patients and 12,950 colorectal cancer patients, verified that month of diagnosis is a vital prognostic factor in both cancers. The results of that study indicate that patients diagnosed in July and August have a higher risk of death, and that the risk is lower for those diagnosed in March and November. Our study confirms this previous work on the important influence of the diagnostic month on patient outcomes, which might seem counter-intuitive but counts a great deal. In particular, our correlation analysis indicates that the patient identification number is an important attribute in the SEER colorectal data for cancer survival prediction, as this number has a significant correlation with several other attributes, such as SEER registry, age at diagnosis, positive lymph nodes examined, and diagnostic date. Among the features shown in Fig. 6, age at diagnosis, grade, tumor size, and primary site are also listed among the top ten important features in breast cancer, female genital cancer, male genital cancer, and urinary cancer. This indicates that they are shared important variables across these cancers, with different degrees of importance, and the results accord with previous findings on the influence of prognostic factors in various cancers [18,51].
Fig. 6. Top 10 important features of SEER colorectal data in survival prediction obtained by a regression tree.
Precise prediction of survivability is critical in cancer prognosis. This paper proposes a two-stage model for advanced-stage cancer survival prediction: the first stage predicts whether a patient can survive more than five years, and the second stage predicts the precise survival time in months of the patients who cannot survive for five years. The first stage adopts a tree ensemble classification method that takes imbalanced data into account. For the second stage, a tree-based selective ensemble regression method called SRRT-SEM is proposed for cancer survival time prediction, which selects base learners from a pool by trading off performance against diversity. The trees in the pool are generated with the semi-random feature selection approach, which employs a priori knowledge of the features, and with the preselection strategy using MPEI, a newly proposed measurement of regression performance. Experiments performed on the advanced-stage colorectal cancer dataset from the publicly available SEER data indicate that the proposed model with SRRT-SEM not only has the lowest prediction error but also consistently shows the best generalization performance. The proposed method shows very promising performance and can potentially be extended to medical prognosis to assist medical doctors in treatment decision-making, thereby increasing patient satisfaction, saving medical resources, and reducing the cost of medical care.
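The control flow of the two-stage model summarized above can be sketched as follows. This is only an illustration under stated assumptions: the function `two_stage_predict`, the stand-in rules `toy_classifier` and `toy_regressor`, and the clamping of the regression output to the range of 0 to 60 months are hypothetical and do not reproduce the tree ensemble classifier or SRRT-SEM.

```python
from typing import Callable, Dict, Optional

FIVE_YEARS_IN_MONTHS = 60.0

Features = Dict[str, float]


def two_stage_predict(
    features: Features,
    survives_five_years: Callable[[Features], bool],
    months_regressor: Callable[[Features], float],
) -> Optional[float]:
    """Stage 1 decides whether the patient is predicted to survive more than
    five years; stage 2 estimates survival time in months only for patients
    predicted not to.

    Returns None when stage 1 predicts survival beyond five years, otherwise
    the stage-2 estimate in months.
    """
    if survives_five_years(features):
        return None
    months = months_regressor(features)
    # Assumption of this sketch: keep the estimate inside the interval that
    # stage 2 is meant to cover (0 to 60 months).
    return min(max(months, 0.0), FIVE_YEARS_IN_MONTHS)


# Hypothetical stand-ins for the trained stage-1 and stage-2 models.
def toy_classifier(f: Features) -> bool:
    return f["positive_nodes"] == 0 and f["age"] < 50


def toy_regressor(f: Features) -> float:
    return 70.0 - 0.5 * f["age"] - 2.0 * f["positive_nodes"]


print(two_stage_predict({"age": 45, "positive_nodes": 0}, toy_classifier, toy_regressor))
print(two_stage_predict({"age": 72, "positive_nodes": 5}, toy_classifier, toy_regressor))
```

Separating the two stages means the regressor only has to model patients with survival times below five years, which is the group for which a precise estimate in months is sought.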