A tutorial for the forester R package

tutorial

package

Author

Jihong Zhang

Published

June 28, 2023

R Code

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE, include = FALSE)
options(knitr.kable.NA = '')
library('tidyverse')
library('kableExtra')  # provides kbl() and kable_styling() used in mykbl()
library('forester')
# Color palettes used for plots
mycolors = c("#4682B4", "#B4464B", "#B4AF46", 
             "#1B9E77", "#D95F02", "#7570B3",
             "#E7298A", "#66A61E", "#B4F60A")
softcolors = c("#B4464B", "#F3DCD4", "#ECC9C7", 
               "#D9E3DA", "#D1CFC0", "#C2C2B4")
# Helper: render a striped, condensed table with values rounded to 2 digits
mykbl <- function(x, ...){
  kbl(x, digits = 2, ...) |> kable_styling(bootstrap_options = c("striped", "condensed"))
}

As the package's contributors say on their GitHub page:

“The forester package is an AutoML tool in R that wraps up all machine learning processes into a single train() function, which includes:”

rendering a brief data check report,
preprocessing initial dataset enough for models to be trained,
training 5 tree-based models with default parameters, random search and Bayesian optimization,
evaluating them and providing a ranked list.
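
To make the single-function workflow concrete, here is a minimal sketch of what a call might look like. It assumes that train() takes a data frame followed by the name of the target column, mirroring the check_data() call used later in this post, and that a two-class target is acceptable (the data check report shown later notes that multiclass targets are not supported). The object name iris_binary is mine, and the call is skipped when forester is not installed.

```r
# Two-class subset of iris; droplevels() removes the unused "setosa" level.
iris_binary <- droplevels(subset(iris, Species != "setosa"))

if (requireNamespace("forester", quietly = TRUE)) {
  # Assumed call shape: a data frame plus the name of the target column.
  train_output <- forester::train(iris_binary, "Species")
  # List the components returned by train() before digging into any of them.
  print(names(train_output))
} else {
  message("forester is not installed; skipping the train() call.")
}
```

Training several models with tuning can take a while even on small data, so it may be worth trying this on a small subset first.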

In this post, I introduce the forester package through two case studies: a simulated example and a real-data example.

graph TD;
    A[data check]-->B;
    A-->C;
    B-->D;
    C-->D;

1 Useful links

One of the contributors, Hubert Ruczynski, wrote a detailed tutorial, published on March 1, 2023.

2 An example

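First, forester::check_data() is a convenient way to get a quick overview of a dataset before analysis. By default it prints a “CHECK DATA REPORT” that runs through a list of data-quality checks (static and duplicate columns, missing values, dimensionality, possible outliers, ID-like columns) and flags strongly correlated pairs of numeric columns using Spearman rank correlation.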

R Code

check_dat_res <- forester::check_data(iris, 'Species')

 -------------------- CHECK DATA REPORT -------------------- 
 
The dataset has 150 observations and 5 columns, which names are: 
Sepal.Length; Sepal.Width; Petal.Length; Petal.Width; Species; 

With the target value described by a column Species.

✔ No static columns. 

✔ No duplicate columns.

✔ No target values are missing. 

✔ No predictor values are missing. 

✔ No issues with dimensionality. 

✖ Strongly correlated, by Spearman rank, pairs of numerical values are: 
 
 Sepal.Length - Petal.Length: 0.87;
 Sepal.Length - Petal.Width: 0.82;
 Petal.Length - Petal.Width: 0.96;

✖ These observations migth be outliers due to their numerical columns values: 
 16 ;

✖ Multilabel classification is not supported yet. 

✔ Columns names suggest that none of them are IDs. 

✔ Columns data suggest that none of them are IDs. 

 -------------------- CHECK DATA REPORT END --------------------
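
The correlation part of the report can be approximated in base R. The sketch below is not forester's internal code; it only illustrates the idea of flagging pairs of numeric columns whose Spearman rank correlation exceeds a cut-off. The threshold of 0.7 and the object names (num_cols, rho, strong_pairs) are my own assumptions, and the coefficients it prints may not match the report exactly, depending on how the package computes them.

```r
# Base-R sketch of a Spearman correlation check (not forester's internal code).
num_cols <- iris[, sapply(iris, is.numeric)]
rho <- cor(num_cols, method = "spearman")

threshold <- 0.7  # assumed cut-off, chosen only for illustration

# Look only at the upper triangle so that each pair is reported once.
idx <- which(upper.tri(rho) & abs(rho) > threshold, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(rho)[idx[, "row"]],
  var2 = colnames(rho)[idx[, "col"]],
  rho  = round(rho[idx], 2)
)
print(strong_pairs)
```

On iris this flags the same three pairs as the report: Sepal.Length with Petal.Length, Sepal.Length with Petal.Width, and Petal.Length with Petal.Width.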

R Code

str(check_dat_res$str)

 chr [1:38] " -------------------- **CHECK DATA REPORT** -------------------- " ...
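
Because the report is stored as a character vector, the failed checks can be pulled out with ordinary string matching. The sketch below uses a hypothetical stand-in vector, report_lines, built from a few lines of the printed report above, so it runs without forester. The actual strings in check_dat_res$str may carry extra markup, so the pattern may need adjusting.

```r
# Hypothetical stand-in for check_dat_res$str: a few lines copied from the printed report.
report_lines <- c(
  "✔ No static columns.",
  "✖ Strongly correlated, by Spearman rank, pairs of numerical values are:",
  "✔ No target values are missing.",
  "✖ Multilabel classification is not supported yet."
)

# Keep only the lines that contain the failure marker.
flagged <- report_lines[grepl("✖", report_lines, fixed = TRUE)]
print(flagged)
```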