library(tidyverse)
library(tidymodels)
library(palmerpenguins)
library(MASS) #for stepAIC
Suggested Answers: ae-14
Note - if you don’t have the MASS package, you can install it before running library
by writing the following code in the console: install.packages(“MASS”)
Packages
Warm Up
linear_reg() |>
set_engine("lm") |>
fit(body_mass_g ~ island *flipper_length_mm, data = penguins) |>
tidy()
# A tibble: 6 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -5464. 431. -12.7 2.51e-30
2 islandDream 3551. 969. 3.66 2.89e- 4
3 islandTorgersen 3218. 1680. 1.92 5.62e- 2
4 flipper_length_mm 48.5 2.05 23.7 1.66e-73
5 islandDream:flipper_length_mm -19.4 4.94 -3.93 1.03e- 4
6 islandTorgersen:flipper_length_mm -17.4 8.73 -1.99 4.69e- 2
penguins <- penguins |>
mutate(island = factor(island, levels = c("Dream" , "Biscoe", "Torgersen")))
linear_reg() |>
set_engine("lm") |>
fit(body_mass_g ~ island * flipper_length_mm, data = penguins) |>
tidy()
# A tibble: 6 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -1913. 868. -2.20 2.82e- 2
2 islandBiscoe -3551. 969. -3.66 2.89e- 4
3 islandTorgersen -333. 1841. -0.181 8.57e- 1
4 flipper_length_mm 29.1 4.49 6.49 3.12e-10
5 islandBiscoe:flipper_length_mm 19.4 4.94 3.93 1.03e- 4
6 islandTorgersen:flipper_length_mm 1.99 9.60 0.208 8.36e- 1
R-squared demonstration
model_1 <- linear_reg() |>
set_engine("lm") |>
fit(body_mass_g ~ island, data = penguins)
set.seed(33)
penguins <- penguins |>
mutate(random_numbers = rnorm(nrow(penguins), 1, 10))
model_2 <- linear_reg() |>
set_engine("lm") |>
fit(body_mass_g ~ island * random_numbers, data = penguins)
glance(model_1)$r.squared
[1] 0.3935772
glance(model_2)$r.squared
[1] 0.3952714
– Did r-squared increase or decrease from model 1 to model 2? Why?
Increased. it will always increase as we add more terms.
Building a model
Scenario: We have fit many models over the last 3 classes that model body mass of penguins as a response. Any of the variables in the data set could be used as predictors (explanatory variables). We want our model to only include the most useful predictors.
Adjusted r-squared
Run names(penguins)
to recall the number of variables in our data set. Next, fit two competing models that model body mass as the response variable. Make an argument for the “better fitting model” using adjusted r-squared below.
Hint: In the same pipe used to fit the model, use glance |> pull(adj.r.squared)
to get the appropriate information
Backward elimination
Backward elimination starts with the the model that includes all potential predictor variables. Variables are eliminated one-at-a-time from the model until we cannot improve the model any further.
Procedure:
Start with a full model that has all predictors we consider and compute the AIC or adjusted-r-squared.
Next fit every possible model with 1 less predictor.
Compare AIC or adjusted-r-squared scores to select the best model with 1 less predictor if they show improvement over full model.
Repeat steps 2 and 3 until you score the model with no predictors.
Compare AIC or adjusted-r-squared among all tested models to select the best model.
Example
To better understand the idea of backwards selection, we are going to manually walk through one iteration of the procedure above. Our “full” model will be defined as an additive model with flipper length, bill length and island on our response body mass. With this information, perform backwards selection and report which variable will not be in the final backward elimination model. Use AIC as your criteria.
Hint: In the same pipe used to fit the model, instead of using tidy()
, use glance |> pull(AIC)
to get the appropriate information.
linear_reg() |>
set_engine("lm") |>
fit(body_mass_g ~ flipper_length_mm + bill_length_mm + island, data = penguins) |>
glance() |>
pull(AIC)
[1] 5035.615
## Choose the model above
linear_reg() |>
set_engine("lm") |>
fit(body_mass_g ~ bill_length_mm + island, data = penguins) |>
glance() |>
pull(AIC)
[1] 5216.441
Forward Selection
Procedure:
Start with a model that has no predictors.
Next fit every possible model with 1 additional predictor and score each model.
Compare AIC scores to select the best model with 1 additional predictor vs first model.
Repeat steps 2 and 3 until you can no longer improve the model.
Perform 1 step of forward selection using the same variables mentioned above. What variable will be in the final forward selection model?
Hint: We can fit a model with no predictors using ~ 1
in place of any explanatory variable.
linear_reg() |>
set_engine("lm") |>
fit(body_mass_g ~ 1 , data = penguins) |>
glance() |>
pull(AIC)
[1] 5547.496
### island, flipper length, bill length
linear_reg() |>
set_engine("lm") |>
fit(body_mass_g ~ flipper_length_mm, data = penguins) |>
glance() |>
pull(AIC)
[1] 5062.855
linear_reg() |>
set_engine("lm") |>
fit(body_mass_g ~ flipper_length_mm + island, data = penguins) |>
glance() |>
pull(AIC)
[1] 5044.513
linear_reg() |>
set_engine("lm") |>
fit(body_mass_g ~ flipper_length_mm * island * bill_length_mm, data = penguins) |>
glance() |>
pull(AIC)
[1] 4997.089
Notes about model selection
– what a “meaningful” difference? You should set it
– backward elimination and forward selection sometimes arrive at different final models
So which do I choose?
Forward stepwise selection is used when the number of variables under consideration is very large
Starting with the full model (backward selection) has the advantage of considering the effects of all variables simultaneously.
Performing stepwise in R
Before we show how to do this in R…..
Note! Please read!
It is a bad idea to simply let R alone choose your best fitting model for you. Good practices include:
– Talking with the subject expert about variables that must or should not be in your model
– Align variables to your research question. This includes:
— keeping the variables you are interested in…. and the variables you want to account for…
If Time (Not Main Focus of Lesson)
People use stepwise selection when their goal is to pick the overall “best” model.
There are many functions that will allow you to perform stepwise selection in R. Today, we are going to introduce ourselves to stepAIC
. Currently, performing stepwise selection is very difficult with tidymodels, so we are going to write our models in base R using the lm
function. Let’s do this below.
lm1 <- lm(body_mass_g ~ flipper_length_mm*island, data = penguins)
In stepAIC
, the arguments are:
– object: your model name – scope: your “full model” if doing forward selection – direction: “forward”, or “backward”.
Backward selection
Let’s assume that our full model is an additive model between flipper length and island. Is this the best model? Is a SLR model better?
lm1 <- lm(body_mass_g ~ flipper_length_mm+island, data = penguins)
stepAIC(lm1, direction = "backward")
Start: AIC=4071.96
body_mass_g ~ flipper_length_mm + island
Df Sum of Sq RSS AIC
<none> 49512346 4072.0
- island 2 3342450 52854796 4090.3
- flipper_length_mm 1 83480840 132993186 4407.9
Call:
lm(formula = body_mass_g ~ flipper_length_mm + island, data = penguins)
Coefficients:
(Intercept) flipper_length_mm islandBiscoe islandTorgersen
-4887.17 44.54 262.18 77.05
Write out how to interpret the output below:
The code suggests what the AIC would be if the variable listed in the first column was removed. The
Now, fit the interaction model and perform backward selection using stepAIC
.
lm1 <- lm(body_mass_g ~ flipper_length_mm*island, data = penguins)
Write out how to interpret the output below:
Based on the output, we should select the interaction model.
Forward Selection
When performing forward selection, we need to be conscious about the scope (full model). Fit a model with just island as the explanatory variable. Define the scope as an additive model between island, bill length, and flipper length.
lm3 <- lm(body_mass_g ~ island, data = penguins)
stepAIC(lm3, scope = ~bill_length_mm + island + flipper_length_mm, direction = "forward")
Start: AIC=4407.88
body_mass_g ~ island
Df Sum of Sq RSS AIC
+ flipper_length_mm 1 83480840 49512346 4072.0
+ bill_length_mm 1 51139275 81853911 4243.9
<none> 132993186 4407.9
Step: AIC=4071.96
body_mass_g ~ island + flipper_length_mm
Df Sum of Sq RSS AIC
+ bill_length_mm 1 1552910 47959436 4063.1
<none> 49512346 4072.0
Step: AIC=4063.06
body_mass_g ~ island + flipper_length_mm + bill_length_mm
Call:
lm(formula = body_mass_g ~ island + flipper_length_mm + bill_length_mm,
data = penguins)
Coefficients:
(Intercept) islandBiscoe islandTorgersen flipper_length_mm
-4602.85 336.64 162.38 38.86
bill_length_mm
18.40
An important note
The calculations of AIC are well beyond the scope of this course. However, it should be noted that stepAIC
makes this calculation slightly differently than how AIC is calculated and pulled form our tidy model with glance and pull. Thus, it is critical that you (for whatever reason) do not cross compare AIC values when trying to make a model selection decision.
Additional Practice
Researcher 1 wants to model flipper length by sex of the penguin. A second researcher wants to add island to the model as well, arguing their model will be better. Researcher 1 does not believe so. Who is right? Justify your answer below.
Suggestion: For practice, use adjusted-r-squared
linear_reg() |>
set_engine("lm") |>
fit(body_mass_g ~ sex, data = penguins) |>
glance() |>
pull(adj.r.squared)
[1] 0.1781385
linear_reg() |>
set_engine("lm") |>
fit(body_mass_g ~ sex + island, data = penguins) |>
glance() |>
pull(adj.r.squared)
[1] 0.5608203
Researcher 2 is right. Adding island to the model vastly improves the adjusted r squared, suggesting that this model has a better overall fit.