Caret train

library(dplyr)
library(recipes)

# Drop the customer ID; it carries no predictive information.
# (The raw data object name Telco_custumer is assumed here.)
Telco_custumer_2 <- Telco_custumer %>% dplyr::select(-"customerID")

# Pre-processing recipe for the churn outcome.
rec <- recipe(Churn ~ ., data = Telco_custumer_2) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>% # convert nominal data into one or more numeric dummy variables
  step_knnimpute(all_predictors()) %>%           # impute missing data using nearest neighbors
  step_zv(all_predictors()) %>%                  # remove zero-variance (constant) variables
  step_corr(all_predictors())                    # remove variables that have large absolute correlations with others
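Below is a minimal sketch of the surrounding steps: partitioning the data and estimating the recipe. The 70/30 split, the seed, and the names train_data, valid_data, train_baked, and valid_baked are illustrative assumptions, not the post's exact code.

library(caret)

# Partition into training and validation sets (split ratio assumed).
set.seed(2019)
in_train   <- createDataPartition(Telco_custumer_2$Churn, p = 0.7, list = FALSE)
train_data <- Telco_custumer_2[in_train, ]
valid_data <- Telco_custumer_2[-in_train, ]

# Estimate the recipe on the training data only, then apply it to both splits.
rec_prepped <- prep(rec, training = train_data)
train_baked <- bake(rec_prepped, new_data = train_data)
valid_baked <- bake(rec_prepped, new_data = valid_data)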


In order to perform a fair comparison of the candidate classifiers:

- I will use the same training/test split.
- The learning problem (as an example) is the binary classification problem of predicting customer churn.
- Data pre-processing has been performed using the recipe method; it returns a clean data set and the partition of the data into training and validation sets.
- A common setting/trainControl will be created, which is shared by all considered classifiers (this allows us to fit the models under the same conditions).
- For tuning the hyperparameters, the method adaptive_cv will be used (this also allows us to reduce the tuning time).

I use the following candidate classifiers:

- Linear discriminant analysis (method: lda)
- eXtreme Gradient Boosting (method: xgbDART)

This list can be expanded with further classifiers by using the add_model function from the modelgrid package; a configuration sketch follows the results below. The DataExplorer package has been used to explore the data, and the caret and modelgrid packages are used to train and to evaluate the candidate models (for a very accessible introduction to caret and modelgrid, please have a look here and here).

The AUC/Accuracy/Kappa on the training set:

The AUC/Accuracy/Kappa on the validation set:

On the training data, the Random Forest classifier (method: rf) tends to perform well, but on the validation data the eXtreme Gradient Boosting classifier (method: xgbDART) performs better and gives better results with respect to the AUC and to the model accuracy as well.
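To make the shared setup concrete, here is a hedged sketch of the trainControl and model grid configuration. The adaptive-resampling settings and model names are illustrative assumptions, and the share_settings/add_model calls follow the modelgrid documentation rather than the post's exact code.

library(caret)
library(modelgrid)
library(dplyr)

# One trainControl shared by all candidates; adaptive resampling drops
# poorly performing tuning candidates early and so reduces tuning time.
shared_ctrl <- trainControl(
  method          = "adaptive_cv",
  number          = 10,
  adaptive        = list(min = 5, alpha = 0.05, method = "gls", complete = TRUE),
  search          = "random",          # adaptive resampling is paired with random search here
  classProbs      = TRUE,              # class probabilities are needed for the AUC
  summaryFunction = twoClassSummary
)

# Model grid: define the common settings once, then one add_model() per classifier.
mg <- model_grid() %>%
  share_settings(
    y         = train_baked$Churn,                  # outcome from the baked training data (assumed names)
    x         = dplyr::select(train_baked, -Churn), # predictors after the recipe has been applied
    metric    = "ROC",
    trControl = shared_ctrl
  ) %>%
  add_model(model_name = "LDA",     method = "lda") %>%
  add_model(model_name = "xgbDART", method = "xgbDART")

# Fit every model in the grid under identical conditions.
mg_trained <- train(mg)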


"It's tough to make predictions, especially about the future" (Yogi Berra), but I think the way to get there shouldn't be. To that end I have built a new Shiny application, BMuCaret (Best Model using Caret), to fit and evaluate multiple classifiers and to select the best one, i.e. the one that achieves the best performance on a given data set. With this application I evaluate five classifiers arising from five families (discriminant analysis, neural networks, support vector machines, Generalized Linear Models, random forests) and identify the classifier with the highest AUC. The area under the ROC curve (not the curve itself) has been considered the key metric to measure model performance; a scoring sketch on the validation set follows below.
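As an illustrative sketch of how such a validation-set score can be computed with caret and pROC: the object names mg_trained and valid_baked, and the positive class label "Yes", are assumptions carried over from the sketches above, not BMuCaret's internals.

library(caret)
library(pROC)
library(dplyr)

# Pick one fitted model from the trained grid (slot name per modelgrid).
fit <- mg_trained$model_fits$xgbDART

# Class probabilities on the held-out data.
probs <- predict(fit, newdata = dplyr::select(valid_baked, -Churn), type = "prob")

# AUC: area under the ROC curve, with "Yes" assumed as the positive class.
roc_obj <- pROC::roc(response = valid_baked$Churn, predictor = probs[["Yes"]])
pROC::auc(roc_obj)

# Accuracy and Kappa via the confusion matrix.
preds <- predict(fit, newdata = dplyr::select(valid_baked, -Churn))
confusionMatrix(preds, valid_baked$Churn)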








