Inference Overview

library(tidyverse)

Warning: package 'ggplot2' was built under R version 4.2.2

Warning: package 'tidyr' was built under R version 4.2.2

Warning: package 'readr' was built under R version 4.2.2

Warning: package 'purrr' was built under R version 4.2.2

library(tidymodels)

Warning: package 'broom' was built under R version 4.2.2

Warning: package 'dials' was built under R version 4.2.2

Warning: package 'parsnip' was built under R version 4.2.2

Warning: package 'recipes' was built under R version 4.2.2

rc <- read_csv("data/roller_coasters.csv")

rc <- na.omit(rc) #remove all NA values from data set

In this application exercise, we will be looking at a roller coaster and amusement park database by Duane Marden.¹ This database records multiple features of roller coasters. For the purpose of this activity, we will work with a random sample of 157 roller coasters.

Variable	Description
`age_group`	1: Older (Built between 1900-1979) 2: Recent (1980-1999) 3: Newest (2000-current)
`coaster`	Name of the roller coaster
`park`	Name of the park where the roller coaster is located
`city`	City where the roller coaster is located
`state`	State where the roller coaster is located
`type`	Material of track (Steel or Wooden)
`year_opened`	Year when roller coaster opened
`top_speed`	Maximum speed of roller coaster (mph)
`max_height`	Highest point of roller coaster (ft)
`drop`	Length of largest gap between high and low points of roller coaster (ft)
`length`	Length of roller coaster track (ft)
`duration`	Time length of roller coaster ride (sec)
`inversions`	Whether or not roller coaster flips passengers at any point (Yes or No)
`num_of_inversions`	Number of times roller coaster flips passengers

Practice with data manipulation

Demo: Create a new variable called opened_recently that has a response of yes if the roller coaster was opened after 1970, and no if the roller coaster was built during or before 1970. Save this new factor variable in the rc data set. Additionally, make type a factor variable.

Read each line of code below as a sentence to ensure you understand what each line of code is doing.

rc <- rc |>
  mutate(
    opened_recently = if_else(year_opened > 1970, "yes", "no"), 
    opened_recently = as.factor(opened_recently),
    type = as.factor(type)
  )

Confidence Intervals

We are interested in investigating if roller coasters after 1970 are faster than those opened before. For this question, we want to estimate the difference.

What is our explanatory variable?

What is our response variable?

x – opened_recently

y – top_speed

Demo: Use bootstrapping to estimate the difference between the true speeds of roller coasters before and after 1970.

set.seed(12345)

boot_df <- rc |>
  specify(explanatory  = opened_recently , response = top_speed) |>
  generate(reps = 1000, type = "bootstrap") |>
  calculate(stat = "diff in means", order = c("yes", "no"))

Your turn: Create a 99% confidence interval and interpret it in the context of the data and the research question.

boot_df |>
  summarize(
    lower = quantile(stat, 0.005),
    upper = quantile(stat, 0.995)
  )

# A tibble: 1 × 2
  lower upper
  <dbl> <dbl>
1  8.28  24.4

ORDER MATTERS (yes - no)

Interpretation: We are 99% confident that the true mean speed for roller coasters opened after 1970 (\(\mu_1\)) is (9.15, 23) mph LARGER than the true mean speed for roller coasters opened before 1970 (\(\mu_2\))

Now, calculate your confidence interval using theoretical measures

speed <- rc |>
  group_by(opened_recently) |>
  summarize(mean_speed = mean(top_speed, na.rm = T)) |>
  pull(mean_speed)

#remind us of sample sizes
rc |>
  group_by(opened_recently) |>
  summarize(groups = n())

# A tibble: 2 × 2
  opened_recently groups
  <fct>            <int>
1 no                  10
2 yes                105

obs_stat <- speed[2] - speed[1]

n1 <- 16

n2 <- 141

s1 <- rc |> 
  filter(opened_recently == "yes") |>
  summarise(sd_s = sd(top_speed, na.rm = T)) |>
  pull()

s2 <- rc |> 
  filter(opened_recently == "no") |>
  summarise(sd_s = sd(top_speed, na.rm = T)) |>
  pull()

obs_stat + qt(0.995, df = 15) * (sqrt(((s1^2)/n1) + ((s2^2)/n2)))

[1] 25.8743

obs_stat - qt(0.995, df = 15) * (sqrt(((s1^2)/n1) + ((s2^2)/n2)))

[1] 4.478077

Hypothesis Test

Do steel roller coasters have more inversions than wooden roller coasters? Conduct a hypothesis test using simulation techniques.

Ho: The true proportion of inversions on steel roller coasters is equal to the true proportion of inversions on a wooden roller coaster.

Ha: The true proportion of inversions on steel roller coasters is less than the true proportion of inversions on a wooden roller coaster.

– Now calculate your statistic below

stat_df <- rc |>
  group_by(inversions, type) |>
  summarize(props = n()) |>
  pull(props)

`summarise()` has grouped output by 'inversions'. You can override using the
`.groups` argument.

steel_stat <- stat_df[3]/(stat_df[1] + stat_df[2])
wood_stat <- stat_df[4]/(stat_df[3] + stat_df[4])

obs_stat <- steel_stat - wood_stat

– Create your null distribution, explain how this distribution is being created

permute_df <- rc |>
  specify(response = inversions , explanatory = type , success = "Y") |>
  hypothesise(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in props", order = c("Steel", "Wooden"))

– Calculate your p-value below

permute_df |>
  get_p_value(obs_stat = obs_stat, direction = "right")

Warning: Please be cautious in reporting a p-value of 0. This result is an
approximation based on the number of `reps` chosen in the `generate()` step. See
`?get_p_value()` for more information.

# A tibble: 1 × 1
  p_value
    <dbl>
1       0

Now, calculate your p-value using theoretical measures

p_pool <- (stat_df[3] + stat_df[4]) / nrow(rc) 

se <- sqrt(p_pool*(1-p_pool)*((1/79) + (1/36)))

zscore <- obs_stat / se

pnorm(zscore, lower.tail = FALSE)

[1] 9.080354e-14

Fitting a Model: Prediction

We want to investigate the relationship between how fast a roller coaster goes, and how long a roller coaster lasts. Specifically, we are interested in how well the duration of a roller coaster explains how fast it is.

Your turn: Based on this research question, which two variables should we use from our data set? Which is our response?

Add Response

Now let’s practice predicting!

Demo: Fit the appropriate linear model below. Report the tidy output.

speed_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(top_speed ~ duration, data = rc) 

tidy(speed_fit)

# A tibble: 2 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)  52.7       4.36       12.1  4.52e-22
2 duration      0.0513    0.0327      1.57 1.20e- 1

Write estimated model in proper notation below:

\(\hat{speed} = 52.7 + 0.0513*duration\)

Demo: Use your model to estimate the top speed of a roller coaster if their duration is 155 minutes

duration_155 <- tibble(duration = 155)

predict(speed_fit, new_data = duration_155)

# A tibble: 1 × 1
  .pred
  <dbl>
1  60.6

Interpret

Now, interpret the slope coefficient.

For a 1 minute increase in duration, we estimate on average a 0.0513 mph increase in speed.

We extended this idea to:

Multiple Linear Regression

Logistic Regression

and compared models using R-squared (if we had the same number of coefficients; adjusted-r-squared, and AIC)

Footnotes

Roller Coaster DataBase, UltimateRollerCoaster.com, ESTEEM: Enhancing Statistics Teachers Education with E-Modules.↩︎