As the contributors of the R package state on their GitHub page:
“The forester package is an AutoML tool in R that wraps up all machine learning processes into a single train() function, which includes:”
rendering a brief data check report,
preprocessing initial dataset enough for models to be trained,
training 5 tree-based models with default parameters, random search and Bayesian optimization,
evaluating them and providing a ranked list.
In this post, I introduce the forester package through a case study, using a simulated example and a real one.
graph TD;
A[data check]-->B;
A-->C;
B-->D;
C-->D;
1 Useful links
One of the contributors, Hubert Ruczynski, wrote a detailed tutorial, published on Mar 1, 2023.
2 An example
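First of all, forester::check_data is a convenient function for taking a quick look at the data before analysis. By default it prints a “CHECK DATA REPORT” that runs through a list of data quality checks and also flags strongly correlated pairs of numeric columns using Spearman rank correlation. The report below is for the iris dataset, with Species as the target.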
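A report of this kind can be requested with a single call. The snippet below is a minimal sketch, assuming that forester is installed and that check_data takes the data frame and the name of the target column as its first two arguments; the variable names target and report are illustrative only.

```r
# Minimal sketch: request the data check report for the built-in iris data.
# Assumption: forester is installed and check_data() accepts the data frame
# and the target column name, in that order.
target <- "Species"  # illustrative variable name

if (requireNamespace("forester", quietly = TRUE)) {
  # Prints the CHECK DATA REPORT to the console.
  report <- forester::check_data(iris, target)
} else {
  message("forester is not installed; skipping the data check.")
}
```

The requireNamespace guard only keeps the snippet from failing on a machine without the package; in an interactive session the call on its own is enough.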
-------------------- CHECK DATA REPORT --------------------
The dataset has 150 observations and 5 columns, which names are:
Sepal.Length; Sepal.Width; Petal.Length; Petal.Width; Species;
With the target value described by a column Species.
✔ No static columns.
✔ No duplicate columns.
✔ No target values are missing.
✔ No predictor values are missing.
✔ No issues with dimensionality.
✖ Strongly correlated, by Spearman rank, pairs of numerical values are:
Sepal.Length - Petal.Length: 0.87;
Sepal.Length - Petal.Width: 0.82;
Petal.Length - Petal.Width: 0.96;
✖ These observations migth be outliers due to their numerical columns values:
16 ;
✖ Multilabel classification is not supported yet.
✔ Columns names suggest that none of them are IDs.
✔ Columns data suggest that none of them are IDs.
-------------------- CHECK DATA REPORT END --------------------
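The correlation part of the report can be approximated with base R alone. The sketch below is not forester's implementation: it computes Spearman rank correlations between the numeric columns of iris and lists the pairs whose absolute correlation reaches a cutoff. The cutoff of 0.7 and the helper name strong_pairs are assumptions chosen for illustration, so the values it prints need not match the report exactly.

```r
# Hypothetical helper (not part of forester): list strongly correlated
# pairs of numeric columns by Spearman rank correlation.
strong_pairs <- function(data, threshold = 0.7) {
  # Keep only the numeric columns.
  numeric_cols <- data[, vapply(data, is.numeric, logical(1)), drop = FALSE]

  # Spearman rank correlation matrix.
  rho <- cor(numeric_cols, method = "spearman")

  # Look at the upper triangle only, so each pair is reported once.
  idx <- which(upper.tri(rho) & abs(rho) >= threshold, arr.ind = TRUE)

  data.frame(
    var1 = rownames(rho)[idx[, "row"]],
    var2 = colnames(rho)[idx[, "col"]],
    rho  = round(rho[idx], 2),
    stringsAsFactors = FALSE
  )
}

pairs_found <- strong_pairs(iris)
print(pairs_found)
```

Running the check by hand like this is a quick way to see which predictors carry overlapping information before any model is trained.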