15 Random Forest

This module will teach you how to run a full machine learning workflow with a random forest model using the tidymodels package in R.

Graphical output of a river network along with associated dams created using nhdplusTools along with data from the National Inventory of Dams.

Authors: Ed Stowe (writing, editing)
Last update: 2025-10-30
Acknowledgements: Case study includes code that was adapted from demonstration analyses by Julia Silge on random forest and partial dependence plots, used under Creative Commons Attribution-ShareAlike 4.0 International License

15.1 Learning objectives

  • Understand how random forest models work and when they are useful
  • Carry out a full machine learning workflow
  • Learn how to tune and assess a random forest model and determine which variables are important

15.2 Background

15.2.1 What is machine learning?

Machine learning involves using algorithms that learn to recognize patterns in data and then make predictions in new scenarios.

15.2.2 When is it most useful?

In general, ML is most useful when prediction is more important than explanation and when there is an adequate amount of data. ML models work better with more data, typically requiring at least a few hundred observations.

15.2.3 Can I do machine learning?

Yes. Machine learning (ML) has a reputation for complexity and technical difficulty. But many machine learning algorithms are relatively simple, have been in use for decades, and are now fairly accessible via R packages like tidymodels.

15.2.4 What programs can I use for ML?

There are lots of tools/programs for running ML algorithms, but one very accessible way is to use the tidymodels packages in R, which provides an integrated framework for using dozens of different ML algorithms. Different algorithms might make sense for different applications. Here we use random forest, an algorithm that’s been popular for decades because of its flexibility, performance, relative ease, and suitability for tabular data (i.e., spreadsheets).

15.2.5 How does random forest work?

Random forest is an extension of a technique called decision trees. Decision trees, like other supervised machine learning algorithms, are designed to predict outcomes based on input data. There are two main types of outcomes that decision trees might predict:

  • Classification: predicting categorical outcomes (e.g., present/absent, male/female, etc.)
  • Regression: predicting numeric outcomes (e.g., abundance of fish, number of species, etc.)

Input data are used to identify break points, called decision nodes, that help predict what the outcome might be. For example, if we were to predict whether a given passenger survived the sinking of the Titanic, the first decision node (i.e., the most informative) would be whether the person was male or female. Other input variables, like the age of the passenger, would likely make up other important nodes in the tree, and the final prediction (i.e., did the person survive or not?) is denoted by the leaf node.

Decision trees are versatile and interpretable, but they often learn the data that they are trained on too well, which is called being “overfit.” Overfitted models tend to do poorly when used in prediction, because they are not general enough.

Random forest (and some similar algorithms) helps overcome the problem of overfitting by leveraging the predictions of many distinct trees. It does this by:

  1. Creating many different decision trees, each trained on a bootstrap sample of the data (called bagging).
  2. Considering only a small random subset of the predictor variables at each split within each tree.
  3. Generating independent predictions from each tree.
  4. Combining the predictions of the different trees by selecting the prediction that the most trees agree on.

Using random data and input variables for each tree has been shown to increase the overall accuracy and generalizability of these models compared to single decision trees.
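To make this concrete, here is a minimal sketch (using the built-in iris data and the ranger package, which we also use later; this example is not part of the case study) showing that each tree in the forest makes its own prediction before those predictions are combined by majority vote:

library(ranger)

# Fit a small random forest: 5 trees, each split considers 2 randomly chosen
# predictors, and each tree is trained on a bootstrap sample of the data (bagging)
set.seed(1)
rf_demo <- ranger(Species ~ ., data = iris, num.trees = 5, mtry = 2)

# predict.all = TRUE returns every individual tree's prediction;
# the forest's final prediction is the class that most trees agree on
per_tree <- predict(rf_demo, data = iris, predict.all = TRUE)
head(per_tree$predictions)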

15.3 Case study

Here we present a case study using machine learning to predict whether or not bluegill will be present at different sites in the Upper Mississippi River based on habitat covariates. Bluegill are often the focus of restoration efforts there, so better understanding what controls bluegill presence can help determine what features restoration should focus on. This analysis includes steps that are fundamental to all machine learning workflows and are also relevant to many other model forms, including:

  • Splitting the data: partitioning data so that models can be trained and then tested on independent data
  • Tuning and training models: using the training process to examine model fit and determine which model specifications (i.e., hyperparameters) yield the best predictions
  • Assessing model performance
  • Predicting independent data: comparing model predictions to the test dataset to see if the model performs well

15.3.1 Add packages

This case study will use the tidymodels package. We will also load the following packages:

  • tidyverse for plotting and data handling
  • doParallel for tuning models in parallel
  • vip for calculating variable importance
  • DALEXtra for making partial dependence plots

These packages should be installed if you have not used them before (e.g., install.packages("vip")).
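Loading the packages used throughout the rest of this case study looks like the following:

library(tidymodels)   # modeling framework (recipes, parsnip, tune, yardstick, etc.)
library(tidyverse)    # plotting and data handling
library(doParallel)   # tuning models in parallel
library(vip)          # variable importance
library(DALEXtra)     # partial dependence plots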

First, we can import the bluegill occupancy dataset and look at the data structure.

blgl <- read_csv("data/bluegill.csv")
## Rows: 1422 Columns: 11
## ── Column specification ──────────────────────────────────────────────
## Delimiter: ","
## chr  (1): pool
## dbl (10): occ, utm_e, utm_n, period, secchi, temp, depth, cond, current, do
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(blgl)
## spc_tbl_ [1,422 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ occ    : num [1:1422] 0 1 1 1 0 1 1 0 0 0 ...
##  $ utm_e  : num [1:1422] 734608 734108 734608 735008 734108 ...
##  $ utm_n  : num [1:1422] 4648501 4648551 4648501 4650751 4648601 ...
##  $ pool   : chr [1:1422] "13" "13" "13" "13" ...
##  $ period : num [1:1422] 2 3 3 2 2 3 3 1 1 1 ...
##  $ secchi : num [1:1422] 35 60 48 46 NA 29 42 42 32 NA ...
##  $ temp   : num [1:1422] 22.7 20.7 16.6 23.8 NA 18.9 15.8 27.8 25.4 NA ...
##  $ depth  : num [1:1422] 0.5 0.6 0.9 1.2 NA 0.9 1 0.8 0.8 NA ...
##  $ cond   : num [1:1422] 404 464 321 376 NA 459 312 367 492 NA ...
##  $ current: num [1:1422] 0 0 0 0.08 NA 0.06 0.06 0 0 NA ...
##  $ do     : num [1:1422] 11.8 4.4 10.5 8 NA 7.4 9.9 8.2 5.3 NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   occ = col_double(),
##   ..   utm_e = col_double(),
##   ..   utm_n = col_double(),
##   ..   pool = col_character(),
##   ..   period = col_double(),
##   ..   secchi = col_double(),
##   ..   temp = col_double(),
##   ..   depth = col_double(),
##   ..   cond = col_double(),
##   ..   current = col_double(),
##   ..   do = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

We see that all but one of these variables are numeric. However, for our classification algorithm, we need to convert the presence/absence column (occ for occupancy) to a factor. We also have two other categorical variables that should be factors: pool, for the number of the navigation pool, and period, for the season (with 1, 2, and 3 being approximately spring, summer, and fall, respectively).

blgl <- blgl %>%
  mutate(across(.cols = c(occ, pool, period), as.factor))

We may also want to explore the data a bit. Here we look to see if there are any patterns of bluegill occupancy across space, and based on depth.

filter(blgl, depth < 4) %>%
  ggplot() +
  geom_point(aes(utm_e/1000, utm_n/1000, color = depth, shape = occ)) +
  facet_wrap(~pool, scales = "free") +
  scale_color_viridis_c() +
  theme_minimal() +
  labs(x = "UTM E (thousands of m)",
       y = "UTM N (thousands of m)",
       shape = "Presence",
       color = "Depth (m)")

Clearly these data are complex, and modeling should help us make sense of them.

15.3.2 Data partitioning

Splitting the data is very important so that we have some independent data on which to judge our model's performance. Here we split 75% of the data into the dataset that will be used to train the model, and the remainder into the dataset for testing/assessing our model. 75% is the tidymodels default splitting proportion.

set.seed(123)
blgl_split <- initial_split(blgl, strata = occ, prop = .75)
blgl_train <- training(blgl_split)
blgl_test <- testing(blgl_split)

15.3.3 Pre-processing with recipes

Build a recipe for data preprocessing.

First, we must tell the recipe() what our model is going to be (using a formula here) and what our training data is. The recipe is where we can do lots of helpful data manipulations, such as normalizing the data (step_normalize()), replacing categorical variables with dummy variables (step_dummy()), adding interactions (step_interact()), and numerous other options. Random forest models need fewer data transformations than many algorithms, so we'll use just one powerful transformation: step_impute_bag(). This uses a machine learning algorithm to impute missing values, since we can't have any missing predictor data when running this or many other machine learning algorithms. Another option would be to remove any rows with missing data (using step_naomit()), but that would remove a lot of data.

blgl_rec <- recipe(occ ~ ., data = blgl_train) %>%
  step_impute_bag(all_predictors())

Although not strictly necessary, we can use the functions prep() and then juice() to see what the data look like after applying the transformations. The juiced dataframe shows us what our data will look like when fed into the ML algorithm.

blgl_prep <- prep(blgl_rec)
juiced <- juice(blgl_prep)  

15.3.4 Model specifications with parsnip

We will specify our random forest model. This is where we tell the model whether to try different combinations of the hyperparameters (i.e., whether to tune the model). In this case, we're going to set the number of trees in our model to 1,000, but we will tune two other hyperparameters. The mtry hyperparameter controls how many predictor variables are randomly sampled at each split, and the min_n hyperparameter controls tree complexity: it indicates the minimum number of data points needed in a node for that node to be split, so higher values of min_n will force trees to be less complex. Tuning both of these parameters will help us find a more predictive model. For the model specification, we also indicate that we want to run our random forest using the version of the algorithm implemented by the ranger package, and that we want to do classification as opposed to regression.

tune_spec <- 
    rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
    set_engine("ranger") %>% 
    set_mode("classification")

15.3.5 Workflow

Next we set up our workflow. This is simply where we indicate what recipe and model we will use.

tune_wf <- workflow() %>%
  add_recipe(blgl_rec) %>%
  add_model(tune_spec)

15.3.6 Tune hyperparameters

Now we're ready to tune the hyperparameters. We need to create a set of cross-validation resamples to use for tuning. This means we'll divide our training data into 10 slices, then repeatedly fit models with different hyperparameters using 9 of the slices and test them on the remaining slice. This is how we'll figure out which hyperparameters are best.

set.seed(234)
blgl_folds <- vfold_cv(blgl_train, v = 10)

We will use the tune_grid() function to tune our model across a grid of hyperparameters (using grid = 20 will automatically pick 20 grid points, but these can also be chosen by the user).

We will use parallel processing to make this go faster, but note that this could take about 1-2 minutes, depending on your computer.

doParallel::registerDoParallel()

set.seed(345)
tune_res <- tune_grid(
  tune_wf,
  resamples = blgl_folds,
  grid = 20
)

tune_res
## # Tuning results
## # 10-fold cross-validation 
## # A tibble: 10 × 4
##    splits            id     .metrics          .notes          
##    <list>            <chr>  <list>            <list>          
##  1 <split [959/107]> Fold01 <tibble [60 × 6]> <tibble [0 × 3]>
##  2 <split [959/107]> Fold02 <tibble [60 × 6]> <tibble [0 × 3]>
##  3 <split [959/107]> Fold03 <tibble [60 × 6]> <tibble [0 × 3]>
##  4 <split [959/107]> Fold04 <tibble [60 × 6]> <tibble [0 × 3]>
##  5 <split [959/107]> Fold05 <tibble [60 × 6]> <tibble [0 × 3]>
##  6 <split [959/107]> Fold06 <tibble [60 × 6]> <tibble [0 × 3]>
##  7 <split [960/106]> Fold07 <tibble [60 × 6]> <tibble [0 × 3]>
##  8 <split [960/106]> Fold08 <tibble [60 × 6]> <tibble [0 × 3]>
##  9 <split [960/106]> Fold09 <tibble [60 × 6]> <tibble [0 × 3]>
## 10 <split [960/106]> Fold10 <tibble [60 × 6]> <tibble [0 × 3]>

We can plot the performance of our models (using area under the ROC curve, or "AUC") to see which values of our hyperparameters are best. AUC is a score that summarizes the balance between the sensitivity and specificity of a model, and thus is a good measure of model generalizability.

It appears that min_n values below 30 and mtry values between 4 and 6 are best.

tune_res %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  dplyr::select(mean, min_n, mtry) %>%
  pivot_longer(min_n:mtry,
    values_to = "value",
    names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "AUC")

This grid did not include every combination of min_n and mtry, but we can get an idea of what is going on. If we wanted, we could test a narrower, more targeted range of hyperparameters using the grid_regular() function, as sketched below.
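A follow-up grid might look something like this sketch; the exact ranges are a judgment call based on the plot above.

# Define a regular grid over a narrower range of the two hyperparameters
rf_grid <- grid_regular(
  mtry(range = c(4, 6)),
  min_n(range = c(5, 30)),
  levels = 5
)

# Re-tune using the targeted grid instead of 20 automatically chosen grid points
set.seed(456)
tune_res_targeted <- tune_grid(
  tune_wf,
  resamples = blgl_folds,
  grid = rf_grid
)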

15.3.7 Choosing the best model

Now that we've tuned models with several different hyperparameters, we can select the best model with select_best() and then update our original model specification to create our final model specification.

best_auc <- select_best(tune_res, metric = "roc_auc")

final_rf <- finalize_model(
  tune_spec,
  best_auc
)

final_rf
## Random Forest Model Specification (classification)
## 
## Main Arguments:
##   mtry = 6
##   trees = 1000
##   min_n = 8
## 
## Computational engine: ranger

This shows that our final model has an mtry value of 6 and a min_n value of 8.

To ultimately see how good our model is, we want to evaluate predictions from the best model against the test dataset.

To do this, we make a final workflow with our original recipe and our best model final_rf. We can fit it one last time with the function last_fit(), which, when provided with our initial data split, fits our best model on the entire training set and evaluates it on the testing set.

final_wf <- workflow() %>%
  add_recipe(blgl_rec) %>%
  add_model(final_rf)

final_res <- final_wf %>%
  last_fit(blgl_split)

final_res %>%
  collect_metrics()
## # A tibble: 3 × 4
##   .metric     .estimator .estimate .config             
##   <chr>       <chr>          <dbl> <chr>               
## 1 accuracy    binary        0.879  Preprocessor1_Model1
## 2 roc_auc     binary        0.916  Preprocessor1_Model1
## 3 brier_class binary        0.0925 Preprocessor1_Model1

If we collect the metrics from our final model fit, we see it has an AUC > 0.91, where values over 0.9 are generally considered to be very good.

This suggests that we have a very predictive model for presence of bluegill and one that is not overfit.
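We can also plot the ROC curve from the test-set predictions. This is a quick sketch that assumes the predicted-probability column for the first factor level is named .pred_0, which is the default tidymodels naming for an occ factor with levels 0 and 1.

# Plot the ROC curve for the test-set predictions
final_res %>%
  collect_predictions() %>%
  roc_curve(occ, .pred_0) %>%
  autoplot()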

Let's bind our testing results back to the original test set and make one more map. Where in each pool are sites more or less likely to be predicted incorrectly?

final_res %>%
  collect_predictions() %>%
  mutate(correct = case_when(
    occ == .pred_class ~ "Correct",
    TRUE ~ "Incorrect"
  )) %>%
  bind_cols(blgl_test) %>%
  ggplot(aes(utm_e, utm_n, color = correct)) +
  geom_point(size = 0.5, alpha = 0.5) +
  labs(color = NULL) +
  scale_color_manual(values = c("gray80", "darkred"))+
  facet_wrap(~pool, scales = "free")+
  theme_void()

15.3.8 Examine variable importance and partial dependence

We still don’t know what environmental factors matter to bluegill, but there are a couple of ways we can assess this.

First, we can see which variables are most important by calculating variable importance scores with the vip package.

final_rf %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(occ ~ ., data = juice(blgl_prep)) %>%
  vip(geom = "point")

This indicates that depth and current are far and away the most important factors determining the presence of bluegill.

We can also look at how these variables influence the probability of occurrence of bluegill using partial dependence plots, implemented here with the DALEXtra package.

To make these plots, we first extract the fitted workflow from the final model fit object.

Then we create an explainer object using the explain_tidymodels() function. We need to provide this function with the fitted workflow, the data used in the model (minus the response variable occ), and the response variable coerced to an integer.

final_fitted <- final_res$.workflow[[1]]

blgl_explainer <- explain_tidymodels(
  final_fitted,
  data = dplyr::select(blgl_train, -occ),
  y = as.integer(blgl_train$occ),
  verbose = FALSE
)

Once we have the explainer object, we can calculate the dependence of occupancy on depth. Here we're including pool as a grouping variable to see if the effect of depth varies among pools.

pdp_depth <- model_profile(
  blgl_explainer,
  variables = "depth",
  N = NULL,
  groups = "pool"
)

Now we can plot the partial dependence. We could simply run plot(pdp_depth), but we can also plot it manually to have more control over the plot. To do this, we extract the relevant portion of the pdp_depth object and then plot with ggplot. Note that we filter this dataset to depths < 5 m, as there are only a couple of points in the dataset greater than this depth.

pdp_df <- as_tibble(pdp_depth$agr_profiles) %>%
  filter(`_x_` < 5) %>%
  mutate(`_label_` = str_remove(`_label_`, "workflow_"))

ggplot(pdp_df, aes(`_x_`, `_yhat_`, color = `_label_`)) +
  geom_line(lwd = 1.2, alpha = 0.8) +
  labs(
    x = "Depth",
    y = "Predicted occurrence probability",
    color = "Pool",
    title = "Partial dependence plot for Bluegill occupancy"
  )

We can see that, in general, increases in depth are associated with an increased probability of bluegill presence, which holds across pools, although the pools appear to have different overall probabilities of bluegill occurrence. There is also a dip in probability at depths slightly below 1 m, but further investigation would be needed to see if this is biologically meaningful.

15.4 Summary

  • Machine learning models can excel at prediction when adequate data are available, suggesting great potential value for USACE applications
  • The tidymodels package provides a comprehensive framework for using numerous machine learning algorithms and carrying out full analyses
  • Random forest is a powerful, accessible algorithm that can make highly accurate predictions based on tabular data