Suggested Answers: Prediction


By the end of today, you will…

We will again be working with the email data set. Please re-familiarize yourself with these data below:

email <- read_csv("data/email.csv") |>
  mutate(spam = factor(spam),
         image = factor(image))
Rows: 3890 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (2): winner, number
dbl  (18): spam, to_multiple, from, cc, sent_email, image, attach, dollar, i...
dttm  (1): time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(email)
Rows: 3,890
Columns: 21
$ spam         <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ to_multiple  <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ from         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ cc           <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 2, 1, 0, 2, 0, …
$ sent_email   <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, …
$ time         <dttm> 2012-01-01 06:16:41, 2012-01-01 07:03:59, 2012-01-01 16:…
$ image        <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ attach       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ dollar       <dbl> 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 5, 0, 0, …
$ winner       <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no…
$ inherit      <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ password     <dbl> 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
$ num_char     <dbl> 11.370, 10.504, 7.773, 13.256, 1.231, 1.091, 4.837, 7.421…
$ line_breaks  <dbl> 202, 202, 192, 255, 29, 25, 193, 237, 69, 68, 25, 79, 191…
$ format       <dbl> 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, …
$ re_subj      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, …
$ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
$ urgent_subj  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ exclaim_mess <dbl> 0, 1, 6, 48, 1, 1, 1, 18, 1, 0, 2, 1, 0, 10, 4, 10, 20, 0…
$ number       <chr> "big", "small", "small", "small", "none", "none", "big", …

Testing vs Training

Before we decide on our model, let’s build a testing and training data set using the following code.

– Go through line by line and comment what the code is doing…

Once complete, take a glimpse at each new data set.

set.seed(0310) # set the random seed so the split is reproducible (consistent across runs)

email_split <- initial_split(email, prop = 0.80) # reserve 80% of rows for training, 20% for testing

train_data <- training(email_split) # pull out the training set
test_data <- testing(email_split)   # pull out the testing set

glimpse(test_data)
Rows: 778
Columns: 21
$ spam         <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ to_multiple  <dbl> 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, …
$ from         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ cc           <dbl> 0, 0, 0, 0, 2, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, …
$ sent_email   <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, …
$ time         <dttm> 2012-01-01 16:00:32, 2012-01-01 17:55:06, 2012-01-01 18:…
$ image        <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
$ attach       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
$ dollar       <dbl> 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 9, 0, 0, 0, 0, 0, …
$ winner       <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no…
$ inherit      <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ viagra       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ password     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ num_char     <dbl> 7.773, 4.837, 2.643, 0.200, 4.549, 10.614, 45.842, 2.317,…
$ line_breaks  <dbl> 192, 193, 68, 10, 67, 118, 881, 30, 51, 24, 411, 54, 81, …
$ format       <dbl> 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ re_subj      <dbl> 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, …
$ exclaim_subj <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, …
$ urgent_subj  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ exclaim_mess <dbl> 6, 1, 0, 0, 1, 1, 5, 0, 3, 0, 49, 0, 4, 1, 8, 4, 16, 0, 3…
$ number       <chr> "small", "big", "small", "small", "small", "small", "big"…

Now that our testing and training data sets have been built, let’s practice picking a model!

First, we are going to decide on a model for whether or not an email is spam.

The variables we’ll use for the first model will be:

  • spam: 1 if the email is spam, 0 otherwise
  • exclaim_mess: The number of exclamation points in the email message
  • winner: Whether or not the word “winner” appears in the email

The variables we’ll use for the second model will be:

  • spam: 1 if the email is spam, 0 otherwise
  • exclaim_mess: The number of exclamation points in the email message
  • image: Whether or not the email had an image attached

Now, fit each of these two additive generalized linear models below. Use AIC to make an argument for which model we should use to analyze if an email is spam or not.

email_fit <- logistic_reg() |>
  set_engine("glm") |> # logistic_reg() with the glm engine fits a binomial family by default
  fit(spam ~ exclaim_mess + winner, data = train_data)

glance(email_fit)$AIC  
[1] 1806.426
model2 <- logistic_reg() |>
  set_engine("glm") |>
  fit(spam ~ exclaim_mess + image, data = train_data)

glance(model2)$AIC
[1] 1848.544

Name your chosen model email_fit.

Prediction

Now, let’s evaluate our model on the test data using the code below. Comment on what the code is doing…

Note: We are not predicting the probability of success for a single response (type = "response" - see ae-15); we are calculating the probabilities of both success and failure for every row of the testing data (type = "prob").

email_pred <- predict(email_fit, test_data, type = "prob") |>  # predicted probabilities for each class
  bind_cols(test_data |> select(spam))                         # attach the true spam labels

Name the above tibble email_pred

How can we plot this?

Make a receiver operating characteristic (ROC) curve (plot the true positive rate vs. the false positive rate).

email_pred |>
  roc_curve(
    truth = spam, 
    .pred_1, 
    event_level = "second" #which level is a success?
  ) |>
  autoplot()

Note: Any small movement to the right indicates a point where a decision was made incorrectly (a false positive).

What is this ROC curve telling us?

We have more true positives than false positives. We are doing a better job than just predicting spam by flipping a coin (50-50).

How did R create this graph?

R swept the classification cutoff across all possible values between 0 and 1 and, at each cutoff, plotted the true positive rate against the false positive rate.
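To see how sweeping the cutoff traces out the curve, here is a minimal base-R sketch using made-up toy data (not the email data): each cutoff produces one (FPR, TPR) point on the ROC curve.

```r
# Toy illustration: an ROC curve comes from sweeping the cutoff
# over all values and recording (FPR, TPR) at each one.
probs <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1) # hypothetical predicted P(spam)
truth <- c(1, 1, 0, 1, 0, 1, 0, 0)                 # hypothetical labels, 1 = spam

for (cutoff in c(0.25, 0.5, 0.75)) {
  pred <- as.numeric(probs >= cutoff)                   # classify at this cutoff
  tpr <- sum(pred == 1 & truth == 1) / sum(truth == 1)  # sensitivity
  fpr <- sum(pred == 1 & truth == 0) / sum(truth == 0)  # 1 - specificity
  cat("cutoff:", cutoff, " TPR:", tpr, " FPR:", fpr, "\n")
}
```

Lowering the cutoff moves us up and to the right along the curve: we catch more spam (higher TPR) but also flag more legitimate email (higher FPR).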

Area under the curve

We can calculate the area under the curve using the following code:

email_pred |>
  roc_auc(
    truth = spam, 
    .pred_1, 
    event_level = "second" #which level is a success
  ) 
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.650

What is the AUC?

0.650

There are two things we can do with this number…

– Is this number > 0.5?

– How does this number compare to another AUC calculation?
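The AUC itself is just the area under that curve, so a model whose curve hugs the upper-left corner scores closer to 1 and a coin-flip model scores 0.5. A quick sketch, using a handful of made-up (FPR, TPR) points and the trapezoid rule:

```r
# Toy sketch: approximate AUC with the trapezoid rule on a few
# hypothetical (FPR, TPR) points; the diagonal line would give 0.5.
fpr <- c(0, 0.25, 0.5, 1)
tpr <- c(0, 0.75, 1, 1)

auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc # well above 0.5, so better than random guessing
```

In practice we let yardstick’s roc_auc() do this for us, as in the code above.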

Your Turn!

– Fit a competing model

– Generate an ROC plot

– Calculate the AUC and compare it to the model above

Hint: You can either rewrite code from above, or copy, paste, and edit to save time during class

model_class <- logistic_reg() |>
  set_engine("glm") |>
  fit(spam ~ viagra + exclaim_mess, data = train_data)

glance(model_class)$AIC
[1] 1836.059
email_pred2 <- predict(model_class, test_data, type = "prob") |>  
  bind_cols(test_data |> select(spam))  


email_pred2 |>
  roc_curve(
    truth = spam, 
    .pred_1, 
    event_level = "second" #which level is a success?
  ) |>
  autoplot()

email_pred2 |>
  roc_auc(
    truth = spam, 
    .pred_1, 
    event_level = "second" #which level is a success
  ) 
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.657

Questions for next time?

In general, how does the cut point affect sensitivity (the true positive rate)? False negatives?

We will explore this in R next time!

Only if time

Linear Regression

What if we don’t have a testing data set?

These are the data our model was trained on. Not optimal for assessing performance, but it is something.

Even if we don’t have a test data set, we could still create a new column of predictions like before:

Context of Penguins data set

# predict based on new data

myPredictiveModel <- linear_reg() |>
  set_engine("lm") |>
  fit(body_mass_g ~ flipper_length_mm, data = penguins)

predict_peng <- penguins |>
  mutate(myPrediction = predict(myPredictiveModel, penguins)$.pred) # add fitted values as a new column

From here we can plot \(\hat{y}\) vs \(y\):

predict_peng |>
  ggplot(aes(x = body_mass_g, y = myPrediction)) +
  geom_point() +
  labs(x = "True Body Mass", y = "Predicted Body Mass", title = "Predicted vs True Body Mass") +
  geom_abline(slope = 1, intercept = 0, color = "steelblue")
Warning: Removed 2 rows containing missing values (`geom_point()`).

Assumptions of Linear Regression

Alternatively, we could create a residual plot. Residual plots can be used to assess whether a linear model is appropriate.

A common assumption of linear regression models is that the error term, \(\epsilon\), has constant variance everywhere.

  • If the linear model is appropriate, the residuals should scatter randomly around zero with roughly constant spread.

  • A patterned or nonconstant residual spread may indicate that the model is missing predictors or interactions.
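Here is a minimal simulated example (made-up data, not the penguins) contrasting residuals with constant spread against residuals whose spread grows with the predictor, which violates the constant-variance assumption:

```r
# Simulate two regressions: one satisfying constant variance, one not.
set.seed(1)
x <- runif(200, 0, 10)

y_const <- 2 + 3 * x + rnorm(200, sd = 1)       # spread does not depend on x
y_fan   <- 2 + 3 * x + rnorm(200, sd = 0.5 * x) # spread grows with x ("fanning")

par(mfrow = c(1, 2))
plot(x, resid(lm(y_const ~ x)), main = "Constant variance", ylab = "Residual")
plot(x, resid(lm(y_fan ~ x)),   main = "Fanning residuals", ylab = "Residual")
```

In the second panel the residuals fan out as x increases, the classic sign that the constant-variance assumption is violated.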

Residuals

Create a new column residuals in predict_peng and save your data frame as predict_peng_2

The Plot

predict_peng_2 <- predict_peng |>
  mutate(residuals = body_mass_g - myPrediction) # residual = observed - predicted


predict_peng_2 |>
  ggplot(aes(x = myPrediction, y = residuals)) + 
  geom_point() +
  geom_hline(yintercept = 0) +
  labs(x = "Predicted body_mass", y = "Residual")
Warning: Removed 2 rows containing missing values (`geom_point()`).