Suggested Answers: ae-14

Note - if you don’t have the MASS package, you can install it before running library by writing the following code in the console: install.packages(“MASS”)

Packages

Warm Up

linear_reg() |>
  set_engine("lm") |>
  fit(body_mass_g ~ island *flipper_length_mm, data = penguins) |>
  tidy()
# A tibble: 6 × 5
  term                              estimate std.error statistic  p.value
  <chr>                                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                        -5464.     431.      -12.7  2.51e-30
2 islandDream                         3551.     969.        3.66 2.89e- 4
3 islandTorgersen                     3218.    1680.        1.92 5.62e- 2
4 flipper_length_mm                     48.5      2.05     23.7  1.66e-73
5 islandDream:flipper_length_mm        -19.4      4.94     -3.93 1.03e- 4
6 islandTorgersen:flipper_length_mm    -17.4      8.73     -1.99 4.69e- 2
penguins <- penguins |>
  mutate(island = factor(island, levels = c("Dream" , "Biscoe", "Torgersen")))

linear_reg() |>
  set_engine("lm") |>
  fit(body_mass_g ~ island * flipper_length_mm, data = penguins) |>
  tidy()
# A tibble: 6 × 5
  term                              estimate std.error statistic  p.value
  <chr>                                <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                       -1913.      868.      -2.20  2.82e- 2
2 islandBiscoe                      -3551.      969.      -3.66  2.89e- 4
3 islandTorgersen                    -333.     1841.      -0.181 8.57e- 1
4 flipper_length_mm                    29.1       4.49     6.49  3.12e-10
5 islandBiscoe:flipper_length_mm       19.4       4.94     3.93  1.03e- 4
6 islandTorgersen:flipper_length_mm     1.99      9.60     0.208 8.36e- 1

R-squared demonstration

model_1 <- linear_reg() |>
  set_engine("lm") |>
  fit(body_mass_g ~ island, data = penguins) 

set.seed(33)
penguins <- penguins |>
  mutate(random_numbers = rnorm(nrow(penguins), 1, 10))

model_2 <- linear_reg() |>
  set_engine("lm") |>
  fit(body_mass_g ~ island * random_numbers, data = penguins) 


glance(model_1)$r.squared
[1] 0.3935772
glance(model_2)$r.squared
[1] 0.3952714

– Did r-squared increase or decrease from model 1 to model 2? Why?

Increased. it will always increase as we add more terms.

Building a model

Scenario: We have fit many models over the last 3 classes that model body mass of penguins as a response. Any of the variables in the data set could be used as predictors (explanatory variables). We want our model to only include the most useful predictors.

Adjusted r-squared

Run names(penguins) to recall the number of variables in our data set. Next, fit two competing models that model body mass as the response variable. Make an argument for the “better fitting model” using adjusted r-squared below.

Hint: In the same pipe used to fit the model, use glance |> pull(adj.r.squared) to get the appropriate information

linear_reg() |>
  set_engine("lm") |>
  fit(body_mass_g ~ island*year + flipper_length_mm, data = penguins) |>
  glance() |>
  pull(adj.r.squared)
[1] 0.7840809
linear_reg() |>
  set_engine("lm") |>
  fit(body_mass_g ~ island*bill_length_mm, data = penguins) |>
  glance() |>
  pull(adj.r.squared)
[1] 0.7577928

Backward elimination

Backward elimination starts with the the model that includes all potential predictor variables. Variables are eliminated one-at-a-time from the model until we cannot improve the model any further.

Procedure:

  • Start with a full model that has all predictors we consider and compute the AIC or adjusted-r-squared.

  • Next fit every possible model with 1 less predictor.

  • Compare AIC or adjusted-r-squared scores to select the best model with 1 less predictor if they show improvement over full model.

  • Repeat steps 2 and 3 until you score the model with no predictors.

  • Compare AIC or adjusted-r-squared among all tested models to select the best model.

Example

To better understand the idea of backwards selection, we are going to manually walk through one iteration of the procedure above. Our “full” model will be defined as an additive model with flipper length, bill length and island on our response body mass. With this information, perform backwards selection and report which variable will not be in the final backward elimination model. Use AIC as your criteria.

Hint: In the same pipe used to fit the model, instead of using tidy(), use glance |> pull(AIC) to get the appropriate information.

linear_reg() |>
  set_engine("lm") |>
  fit(body_mass_g ~ flipper_length_mm + bill_length_mm + island, data = penguins) |>
  glance() |>
  pull(AIC)
[1] 5035.615
## Choose the model above 


linear_reg() |>
  set_engine("lm") |>
  fit(body_mass_g ~  bill_length_mm + island, data = penguins) |>
  glance() |>
  pull(AIC)
[1] 5216.441

Forward Selection

Procedure:

  • Start with a model that has no predictors.

  • Next fit every possible model with 1 additional predictor and score each model.

  • Compare AIC scores to select the best model with 1 additional predictor vs first model.

  • Repeat steps 2 and 3 until you can no longer improve the model.

Perform 1 step of forward selection using the same variables mentioned above. What variable will be in the final forward selection model?

Hint: We can fit a model with no predictors using ~ 1 in place of any explanatory variable.

linear_reg() |>
  set_engine("lm") |>
  fit(body_mass_g ~ 1 , data = penguins) |>
  glance() |>
  pull(AIC)
[1] 5547.496
### island, flipper length, bill length 

linear_reg() |>
  set_engine("lm") |>
  fit(body_mass_g ~ flipper_length_mm, data = penguins) |>
  glance() |>
  pull(AIC)
[1] 5062.855
linear_reg() |>
  set_engine("lm") |>
  fit(body_mass_g ~ flipper_length_mm + island, data = penguins) |>
  glance() |>
  pull(AIC)
[1] 5044.513
linear_reg() |>
  set_engine("lm") |>
  fit(body_mass_g ~ flipper_length_mm * island * bill_length_mm, data = penguins) |>
  glance() |>
  pull(AIC)
[1] 4997.089

Notes about model selection

– what a “meaningful” difference? You should set it

– backward elimination and forward selection sometimes arrive at different final models

So which do I choose?

Forward stepwise selection is used when the number of variables under consideration is very large

Starting with the full model (backward selection) has the advantage of considering the effects of all variables simultaneously.

Performing stepwise in R

Before we show how to do this in R…..

Note! Please read!

It is a bad idea to simply let R alone choose your best fitting model for you. Good practices include:

– Talking with the subject expert about variables that must or should not be in your model

– Align variables to your research question. This includes:

— keeping the variables you are interested in…. and the variables you want to account for…

If Time (Not Main Focus of Lesson)

People use stepwise selection when their goal is to pick the overall “best” model.

There are many functions that will allow you to perform stepwise selection in R. Today, we are going to introduce ourselves to stepAIC. Currently, performing stepwise selection is very difficult with tidymodels, so we are going to write our models in base R using the lm function. Let’s do this below.

lm1 <- lm(body_mass_g ~ flipper_length_mm*island, data = penguins)

In stepAIC, the arguments are:

– object: your model name – scope: your “full model” if doing forward selection – direction: “forward”, or “backward”.

Backward selection

Let’s assume that our full model is an additive model between flipper length and island. Is this the best model? Is a SLR model better?

lm1 <- lm(body_mass_g ~ flipper_length_mm+island, data = penguins)

stepAIC(lm1,  direction = "backward")
Start:  AIC=4071.96
body_mass_g ~ flipper_length_mm + island

                    Df Sum of Sq       RSS    AIC
<none>                            49512346 4072.0
- island             2   3342450  52854796 4090.3
- flipper_length_mm  1  83480840 132993186 4407.9

Call:
lm(formula = body_mass_g ~ flipper_length_mm + island, data = penguins)

Coefficients:
      (Intercept)  flipper_length_mm       islandBiscoe    islandTorgersen  
         -4887.17              44.54             262.18              77.05  

Write out how to interpret the output below:

The code suggests what the AIC would be if the variable listed in the first column was removed. The represents the “full” model, or when no variables are removed.

Now, fit the interaction model and perform backward selection using stepAIC.

lm1 <- lm(body_mass_g ~ flipper_length_mm*island, data = penguins)

Write out how to interpret the output below:

Based on the output, we should select the interaction model.

Forward Selection

When performing forward selection, we need to be conscious about the scope (full model). Fit a model with just island as the explanatory variable. Define the scope as an additive model between island, bill length, and flipper length.

lm3 <- lm(body_mass_g ~ island, data = penguins)

stepAIC(lm3,  scope = ~bill_length_mm + island + flipper_length_mm, direction = "forward")
Start:  AIC=4407.88
body_mass_g ~ island

                    Df Sum of Sq       RSS    AIC
+ flipper_length_mm  1  83480840  49512346 4072.0
+ bill_length_mm     1  51139275  81853911 4243.9
<none>                           132993186 4407.9

Step:  AIC=4071.96
body_mass_g ~ island + flipper_length_mm

                 Df Sum of Sq      RSS    AIC
+ bill_length_mm  1   1552910 47959436 4063.1
<none>                        49512346 4072.0

Step:  AIC=4063.06
body_mass_g ~ island + flipper_length_mm + bill_length_mm

Call:
lm(formula = body_mass_g ~ island + flipper_length_mm + bill_length_mm, 
    data = penguins)

Coefficients:
      (Intercept)       islandBiscoe    islandTorgersen  flipper_length_mm  
         -4602.85             336.64             162.38              38.86  
   bill_length_mm  
            18.40  

An important note

The calculations of AIC are well beyond the scope of this course. However, it should be noted that stepAIC makes this calculation slightly differently than how AIC is calculated and pulled form our tidy model with glance and pull. Thus, it is critical that you (for whatever reason) do not cross compare AIC values when trying to make a model selection decision.

Additional Practice

Researcher 1 wants to model flipper length by sex of the penguin. A second researcher wants to add island to the model as well, arguing their model will be better. Researcher 1 does not believe so. Who is right? Justify your answer below.

Suggestion: For practice, use adjusted-r-squared

linear_reg() |>
  set_engine("lm") |>
  fit(body_mass_g ~ sex, data = penguins) |>
  glance() |>
  pull(adj.r.squared)
[1] 0.1781385
linear_reg() |>
  set_engine("lm") |>
  fit(body_mass_g ~ sex + island, data = penguins) |>
  glance() |>
  pull(adj.r.squared)
[1] 0.5608203

Researcher 2 is right. Adding island to the model vastly improves the adjusted r squared, suggesting that this model has a better overall fit.