Warning: package 'broom' was built under R version 4.2.2
Warning: package 'dials' was built under R version 4.2.2
Warning: package 'parsnip' was built under R version 4.2.2
Warning: package 'recipes' was built under R version 4.2.2
rc<-read_csv("data/roller_coasters.csv")rc<-na.omit(rc)#remove all NA values from data set
In this application exercise, we will be looking at a roller coaster and amusement park database by Duane Marden.1 This database records multiple features of roller coasters. For the purpose of this activity, we will work with a random sample of 157 roller coasters.
Variable
Description
age_group
1: Older (Built between 1900-1979)
2: Recent (1980-1999)
3: Newest (2000-current)
coaster
Name of the roller coaster
park
Name of the park where the roller coaster is located
city
City where the roller coaster is located
state
State where the roller coaster is located
type
Material of track (Steel or Wooden)
year_opened
Year when roller coaster opened
top_speed
Maximum speed of roller coaster (mph)
max_height
Highest point of roller coaster (ft)
drop
Length of largest gap between high and low points of roller coaster (ft)
length
Length of roller coaster track (ft)
duration
Time length of roller coaster ride (sec)
inversions
Whether or not roller coaster flips passengers at any point (Yes or No)
num_of_inversions
Number of times roller coaster flips passengers
Practice with data manipulation
Demo: Create a new variable called opened_recently that has a response of yes if the roller coaster was opened after 1970, and no if the roller coaster was built during or before 1970. Save this new factor variable in the rc data set. Additionally, make type a factor variable.
Read each line of code below as a sentence to ensure you understand what each line of code is doing.
rc<-rc|>mutate( opened_recently =if_else(year_opened>1970, "yes", "no"), opened_recently =as.factor(opened_recently), type =as.factor(type))
Confidence Intervals
We are interested in investigating if roller coasters after 1970 are faster than those opened before. For this question, we want to estimate the difference.
What is our explanatory variable?
What is our response variable?
x – opened_recently
y – top_speed
Demo: Use bootstrapping to estimate the difference between the true speeds of roller coasters before and after 1970.
set.seed(12345)boot_df<-rc|>specify(explanatory =opened_recently , response =top_speed)|>generate(reps =1000, type ="bootstrap")|>calculate(stat ="diff in means", order =c("yes", "no"))
Your turn: Create a 99% confidence interval and interpret it in the context of the data and the research question.
Interpretation: We are 99% confident that the true mean speed for roller coasters opened after 1970 (\(\mu_1\)) is (9.15, 23) mph LARGER than the true mean speed for roller coasters opened before 1970 (\(\mu_2\))
Now, calculate your confidence interval using theoretical measures
speed<-rc|>group_by(opened_recently)|>summarize(mean_speed =mean(top_speed, na.rm =T))|>pull(mean_speed)#remind us of sample sizesrc|>group_by(opened_recently)|>summarize(groups =n())
# A tibble: 2 × 2
opened_recently groups
<fct> <int>
1 no 10
2 yes 105
– Create your null distribution, explain how this distribution is being created
permute_df<-rc|>specify(response =inversions , explanatory =type , success ="Y")|>hypothesise(null ="independence")|>generate(reps =1000, type ="permute")|>calculate(stat ="diff in props", order =c("Steel", "Wooden"))
– Calculate your p-value below
permute_df|>get_p_value(obs_stat =obs_stat, direction ="right")
Warning: Please be cautious in reporting a p-value of 0. This result is an
approximation based on the number of `reps` chosen in the `generate()` step. See
`?get_p_value()` for more information.
# A tibble: 1 × 1
p_value
<dbl>
1 0
Now, calculate your p-value using theoretical measures
We want to investigate the relationship between how fast a roller coaster goes, and how long a roller coaster lasts. Specifically, we are interested in how well the duration of a roller coaster explains how fast it is.
Your turn: Based on this research question, which two variables should we use from our data set? Which is our response?
Add Response
Now let’s practice predicting!
Demo: Fit the appropriate linear model below. Report the tidy output.
speed_fit<-linear_reg()|>set_engine("lm")|>fit(top_speed~duration, data =rc)tidy(speed_fit)