thesis

Introduction
Random forest is one of the most successful ensemble methods, showing performance on the level of boosting and support vector machines. The method is fast, robust to noise, resists overfitting, and offers possibilities for interpreting and visualizing its output. We will study options to increase the strength of individual trees in the forest or to reduce the correlation between them. Using several attribute evaluation measures instead of just one gives promising results. In addition, replacing ordinary voting with voting weighted by the margin achieved on the most similar instances can provide statistically significant improvements across multiple data sets.
Machine learning (ML) is becoming increasingly important, and with the rapid growth in the volume and quality of medical data it has become a key technology. However, because healthcare data is complex, incomplete, and multi-dimensional, early and accurate detection of diseases remains a challenge. Data preprocessing is an essential step in machine learning. Its primary purpose is to supply the learning algorithm with processed data that improves prediction accuracy. This dissertation summarizes commonly used data preprocessing steps, selected on the basis of usage, popularity, and the literature. The selected preprocessing methods are then applied to the original data, and a classifier uses the result for prediction.
Data mining faces the challenge of finding systematic knowledge in large data streams to support managerial decision making. Although research in operations research, direct marketing, and machine learning focuses on the analysis and design of data mining algorithms, the interaction between data mining and the preceding phase of data preprocessing has not been investigated in detail. This work considers the influence of different preprocessing techniques (attribute scaling, sampling, coding of categorical attributes, and coding of continuous attributes) on the performance of decision trees, neural networks, and support vector machines.
Problem statement
We use machine learning to predict breast cancer cases from patient treatment history and health data. We will use the Wisconsin breast cancer data set. Among women, breast cancer is a leading cause of death. Breast cancer risk prediction can inform screening and preventive measures.
Recent studies found that adding inputs to the widely used Gail model can improve its ability to predict breast cancer risk. However, these models use simple statistical architectures, and the additional inputs come from costly and invasive procedures.
In contrast, we need to develop a machine learning model that uses readily available personal health data to predict breast cancer risk over a five-year horizon. Two variants are needed: a model that uses only the Gail model inputs, and a model that uses the Gail model inputs together with other personal health data related to breast cancer risk.
The fundamental goals of cancer prediction are distinct from those of cancer detection and diagnosis. Cancer prediction and prognosis are concerned with three predictive tasks: 1) prediction of cancer susceptibility (i.e., risk assessment), 2) prediction of cancer recurrence, and 3) prediction of cancer survivability. In the first case, one is trying to predict the likelihood of developing a specific type of cancer before it occurs. In the second case, one is trying to predict the likelihood of redeveloping cancer after the apparent resolution of the disease.
In the third case, one is trying to predict an outcome (life expectancy, survivability, progression, tumour-drug sensitivity) after the disease has been diagnosed. In the last two cases, the success of the prognostic prediction depends in part on the success or quality of the diagnosis. However, a disease prognosis can only come after a medical diagnosis, and a prognostic prediction must take into account more than a simple diagnosis.
A multifactorial analysis of variance over several performance metrics and method parameterizations provides empirical evidence that data preprocessing has a significant impact on prediction accuracy, with certain schemes proving inferior to competing approaches. It is also found that: (1) the selected methods prove almost as sensitive to different data representations as to method parameterization, which indicates the potential for improving performance through effective preprocessing; (2) the influence of a preprocessing scheme varies by method, so different "best practice" settings are needed to obtain the best results from a particular method; (3) the sensitivity of an algorithm to preprocessing is therefore an important criterion for method evaluation and selection, and in predictive data mining it needs to be considered together with the traditional metrics of predictive power and computational efficiency.
To maximize prediction accuracy in data mining, machine learning research mainly focuses on enhancing competing classifiers and on tuning algorithm parameters effectively. These improvements are usually tested in extensive benchmark experiments that use already preprocessed data sets to evaluate the impact on prediction accuracy and computational efficiency.
Although feature selection, resampling, and the discretization of continuous attributes have been studied in detail, few publications investigate how the encoding of categorical attributes and scaling affect prediction. More critically, the interactions between these preprocessing choices and their joint effect on prediction accuracy have not been analysed in detail, especially in the medical field.

3.1. Preprocessing methods
This dissertation considers the three main standard preprocessing steps of NLP: stemming, punctuation removal, and stop word removal. In stemming, we obtain the stem of each word in the data set, which is the part of the word to which affixes can be attached.
Stemming algorithms are language-specific and differ in performance and accuracy. A wide range of methods can be used, such as affix removal stemming, n-gram stemming, and table lookup stemming. Another important preprocessing step in NLP is removing punctuation. Punctuation is used to divide text into sentences, paragraphs, and phrases, and because it occurs so often in text it affects the results of any text processing method, especially methods that rely on the frequency of words and phrases.
Before any NLP processing, the most commonly used terms, known as stop words, are removed. Stop words are a group of frequently used words that carry little information on their own, such as articles, determiners, and prepositions. By eliminating these words from the text, we can focus on the important words.
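The three steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop word list and the suffix list are tiny, hypothetical stand-ins, and the stemmer is a naive affix-removal rule rather than any particular published algorithm.

```python
import string

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and", "is", "are", "was", "were"}

# Hypothetical suffix list for a naive affix-removal stemmer (checked in order).
SUFFIXES = ("ing", "ed", "es", "s")


def remove_punctuation(text):
    """Replace every ASCII punctuation character with a space."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping a stem of at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, remove punctuation, remove stop words, then stem each token."""
    tokens = remove_punctuation(text.lower()).split()
    return [stem(t) for t in remove_stop_words(tokens)]
```

The order matters here: punctuation is removed before tokenizing so that "detected," and "detected" count as the same word, and stop words are removed before stemming so that the stop word list can contain ordinary surface forms.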

Significance of using Random Forest?
Whether you have a regression task or a classification task, a random forest is a suitable model. It can handle binary, categorical, and numerical features. Very little preprocessing is required: the data does not need to be rescaled or transformed.
Random forests are parallelizable, which means the training process can be split across multiple processors or machines to shorten computation time. Boosted models, in contrast, are sequential and take longer to compute. In Python, for example, scikit-learn's random forest estimators accept the parameter n_jobs=-1, where -1 means that all available processors are used. Random forests also cope well with high dimensionality.
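Training parallelizes because every tree is fitted independently on its own bootstrap sample. The following is a minimal sketch of that idea using only the standard library; it is not how any particular library implements random forests. One-split decision stumps stand in for full trees, and the function names are hypothetical. Threads are used only to show that the jobs are independent; in pure Python they do not speed up CPU-bound work, which is why real implementations use compiled code or separate processes.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fit_stump(rows, labels, feature):
    """Find the threshold on one feature that gives the fewest training errors."""
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
        right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, feature, threshold, left_label, right_label)
    return best[1:]


def train_one_tree(job):
    """Fit one stump on a bootstrap sample, using one randomly chosen feature."""
    rows, labels, seed = job
    rng = random.Random(seed)
    idx = [rng.randrange(len(rows)) for _ in rows]   # sample with replacement
    feature = rng.randrange(len(rows[0]))            # random feature choice
    return fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature)


def train_forest(rows, labels, n_trees=25, max_workers=None):
    """Train all stumps as independent jobs; no job depends on another's result."""
    jobs = [(rows, labels, seed) for seed in range(n_trees)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(train_one_tree, jobs))


def predict(forest, row):
    """Ordinary (unweighted) majority vote over all stumps."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]
```

Because no job reads another job's output, the list of jobs could equally be handed to separate processes or machines. A boosted model cannot be split this way, since each new tree is fitted to the errors of the trees before it.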
Each tree trains quickly because it works with only a subset of the features, so we can easily use hundreds of features. Prediction is significantly faster than training because the trained forest can be saved and reused. Random forests handle outliers by essentially binning them. They are also indifferent to nonlinear features.
Random forests have methods for balancing error in data sets with unbalanced class populations. A random forest tries to minimize the overall error rate, so when the data set is unbalanced, the larger class gets a low error rate and the smaller class gets a higher error rate. Each individual decision tree has high variance but low bias. Because we average over all the trees in the random forest, we average out the variance as well, so the result is a model with low bias and moderate variance.
As with any algorithm, there are advantages and disadvantages to using a random forest for classification and regression. The random forest algorithm is not biased, because there are multiple trees and each tree is trained on a subset of the data.
The random forest algorithm relies on the power of "the crowd", so the overall bias of the algorithm is reduced. The algorithm is also very stable. Even if a new data point is introduced into the data set, the overall algorithm is not affected much, because the new data may affect one tree but is very unlikely to affect all of them.
The random forest algorithm works well when the data contains both categorical and numerical features. It can also work well when the data has missing values or has not been scaled (although we have scaled the features in this article only for demonstration purposes).

Drawbacks
Model interpretability: random forest models are not easy to interpret; they behave like black boxes. For large data sets, the trees can take up a lot of memory. A random forest can overfit, so the hyperparameters should be tuned. Random forests have been observed to overfit on some data sets with noisy classification or regression tasks. The algorithm is more complex than a single decision tree and is computationally expensive. Because of this complexity, random forests require more training time than other comparable algorithms.

Materials and methods
Data
The models were trained and evaluated on the PLCO data set. This data set was generated as part of a randomized, controlled, prospective study to determine the effectiveness of different screenings for prostate, lung, colorectal, and ovarian cancer. Participants who enrolled in the study filled out a baseline questionnaire detailing their previous and current health status. All processing of this data set was done in Python (version 3.6.7).
We initially downloaded the data for all women in the PLCO data set. This consisted of 78,215 women aged 50-78. We chose to exclude women who met any of the following conditions:
1. Were missing data on whether they had been diagnosed with breast cancer and on the timing of the diagnosis
2. Were diagnosed with breast cancer before the baseline questionnaire
3. Did not self-identify as white, black, or Hispanic
4. Identified as Hispanic but had no information about their place of birth
5. Were missing data for any of the 13 selected predictors
We excluded women who had been diagnosed with breast cancer before the baseline questionnaire because the Breast Cancer Risk Assessment Tool (BCRAT), an implementation of the Gail model, is not valid for women with a personal history of breast cancer.
BCRAT is also not valid for women who have received chest radiotherapy, who carry BRCA1 or BRCA2 gene mutations, who have lobular carcinoma in situ or ductal carcinoma in situ, or who have other rare syndromes that predispose to breast cancer, such as Li-Fraumeni syndrome. Since the PLCO data set contains no data on these conditions, we assume that they do not apply to any women in the data set. Since only the white, black, and Hispanic race/ethnicity categories in PLCO match the BCRAT implementation we used, we excluded subjects based on self-identified race/ethnicity.
We did not include subjects who identified as Hispanic but had no data on their place of birth, because the BCRAT implementation uses different breast cancer incidence rates for US-born and foreign-born Hispanic women. After removing subjects based on these conditions, the number of women was reduced to 64,739.
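The five exclusion rules can be expressed as a single record filter. The sketch below is an illustration only: the field names (bc_diagnosed, days_to_diagnosis, race, us_born) are hypothetical and are not the PLCO variable names, and the reading of rule 1 (status unknown, or diagnosed with unknown timing) is an assumption.

```python
# Hypothetical field names; the PLCO data set uses its own variable names.
ALLOWED_RACES = {"white", "black", "hispanic"}


def is_eligible(woman, predictors):
    """Return True if a record survives all five exclusion rules."""
    diagnosed = woman.get("bc_diagnosed")      # True, False, or None if unknown
    days = woman.get("days_to_diagnosis")      # days after baseline; None if unknown
    # 1. Diagnosis status, and its timing when diagnosed, must be known.
    if diagnosed is None or (diagnosed and days is None):
        return False
    # 2. No breast cancer diagnosis before the baseline questionnaire.
    if diagnosed and days < 0:
        return False
    # 3. Self-identified race/ethnicity must be white, black, or Hispanic.
    if woman.get("race") not in ALLOWED_RACES:
        return False
    # 4. Hispanic women need a known place of birth.
    if woman["race"] == "hispanic" and woman.get("us_born") is None:
        return False
    # 5. Every selected predictor must be present.
    return all(woman.get(name) is not None for name in predictors)


def apply_exclusions(women, predictors):
    """Keep only the eligible records."""
    return [w for w in women if is_eligible(w, predictors)]
```

Writing each rule as its own early return keeps the filter easy to audit against the numbered list, and makes it simple to count how many records each rule removes.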
We trained a set of machine learning models that were given five of the usual seven inputs to BCRAT. These five inputs (age, age at menarche, age at first live birth, number of first-degree relatives with breast cancer, and race/ethnicity) are the only traditional BCRAT inputs available in the PLCO data set. We compared these machine learning models with BCRAT given the same five inputs.
The models with a broader set of predictors took the five BCRAT inputs and eight additional factors as input. These additional predictors were selected based on their availability in the PLCO data set and their association with breast cancer risk. They are menopausal age, an indicator of current hormone use, number of years of hormone use, BMI, pack-years of smoking, number of years of birth control use, number of live births, and an indicator of personal cancer history.
To facilitate the training and testing of the models, we made limited modifications to the predictor variables. First, we assigned numeric values to the categorical variables. The PLCO data set records age at menarche, age at first live birth, age at menopause, years of hormone use, and years of birth control use as categorical variables. For example, the age at menarche variable is coded as 1 for younger than 10 years, 2 for 10-11 years, 3 for 12-13 years, 4 for 14-15 years, and 5 for 16 years or older. For values of a categorical variable that represent a maximum age or number of years or less (for example, 10 years old or younger), we set the variable to that maximum (for example, 10 years old).
For values that represent a range strictly below a maximum (for example, younger than 10 years old), we set the variable equal to the upper limit of the range (for example, 10 years old). Similarly, for values representing a minimum age or number of years or above (for example, 16 years old or older), we set the variable to that minimum (for example, 16 years old). For values that cover a closed range (for example, 12-13 years old), we set the variable to the midpoint of the range (for example, 12.5 years old).
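As a worked example of these rules, the sketch below converts the five age-at-menarche codes into numeric values. The function name and the way category bounds are represented (None for an open end) are hypothetical choices made for illustration; only the codes and the three rules come from the description above.

```python
def category_to_value(lower, upper):
    """Map one age category to a single number.

    lower is None  -> open-ended low category, use the upper limit
    upper is None  -> open-ended high category, use the lower limit
    otherwise      -> closed range, use the midpoint
    """
    if lower is None:
        return float(upper)
    if upper is None:
        return float(lower)
    return (lower + upper) / 2


# Age-at-menarche codes as described in the text: code -> (lower, upper) in years.
MENARCHE_BOUNDS = {
    1: (None, 10),   # younger than 10
    2: (10, 11),
    3: (12, 13),
    4: (14, 15),
    5: (16, None),   # 16 or older
}

MENARCHE_VALUE = {code: category_to_value(*bounds) for code, bounds in MENARCHE_BOUNDS.items()}
```

With these bounds, the codes 1 to 5 map to 10, 10.5, 12.5, 14.5, and 16 years respectively.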
After modifying the categorical variables, we made some adjustments to the age at first live birth and race/ethnicity variables before they were entered into the models. For the BCRAT model, we set the age at first live birth of women who had never given birth to 98, as instructed by the implementation of BCRAT that we used, the "BCRA" software package (version 2.1) in R (version 3.4.3), and we provided different race/ethnicity category values for foreign-born and US-born Hispanic women. For the machine learning models, we set the age at first live birth of women who had never given birth to their current age, and we used two indicators to represent race/ethnicity, one for white women and one for black women. Each woman is classified as only one race/ethnicity (white, black, or Hispanic), so in addition to the white and black indicators we do not need a Hispanic indicator: for a Hispanic woman, the white and black indicators are both 0. For the machine learning models, we did not distinguish between Hispanic women born in the United States and Hispanic women born abroad.
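The two encodings can be contrasted in a short sketch. The record layout and function names below are hypothetical, and a missing age at first live birth (None) is assumed to mark a woman who has never given birth; only the value 98 and the two-indicator scheme come from the text.

```python
# Value the text says was used for women with no live births in the BCRAT input.
BCRAT_NO_BIRTH_AGE = 98


def age_first_birth_for_bcrat(woman):
    """BCRAT input: use 98 when the woman has never given birth."""
    age = woman["age_first_birth"]
    return BCRAT_NO_BIRTH_AGE if age is None else age


def encode_for_ml(woman):
    """Machine learning input: current age for no births, two race indicators."""
    age = woman["age_first_birth"]
    return {
        "age_first_birth": woman["age"] if age is None else age,
        # Hispanic is the reference category: both indicators are 0.
        "is_white": int(woman["race"] == "white"),
        "is_black": int(woman["race"] == "black"),
    }
```

Dropping the third indicator loses no information, because the three categories are mutually exclusive and exhaustive in the filtered data: a record with both indicators at 0 can only be Hispanic.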
