Lecture 13
Dr. Elijah Meyer
Duke University
STA 199 - Spring 2023
February 24th, 2023
– Clone `ae-12`
– Homework 2 due tonight (11:59)
– Lab 4 due Tuesday (11:59)
— 1 submission. Attach everyone to it
– DataFest: a data analysis competition where teams of up to five students attack a large, complex, and surprise dataset over a weekend
– DataFest is a great opportunity to gain experience that employers are looking for
– Each team will give a brief presentation of their findings, judged by a panel of faculty and professionals from a variety of fields.
– Team Submission
– Attach ALL team members to submission on Gradescope
– Communicate!
– “It was my responsibility to turn the lab in and I forgot….”
– Discuss correlation
– Finish categorical single predictor
– Model with multiple predictors
Below is a scatterplot from `ae-11`. Alone or with a partner, discuss how R chose to fit this line over any other.
– Proper notation:
— Population correlation: \(\rho\)
— Sample correlation: \(r\)
Correlation measures the strength and direction of a linear relationship
It is bounded between -1 and 1
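For reference, the usual formula for the sample correlation between paired observations \((x_i, y_i)\), \(i = 1, \dots, n\), is sketched below. The symbols \(\bar{x}\) and \(\bar{y}\) (the sample means) are notation assumed here, not taken from the slides.

```latex
\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r \le 1
\]
```

Values near -1 or 1 indicate a strong linear relationship, and the sign gives its direction.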
– Play against yourself
– Be better at correlation than your friends
https://www.rossmanchance.com/applets/2021/guesscorrelation/GuessCorrelation.html
Can find this with the `cor` function (base R) or the `correlate` function (corrr package) in R
https://www.tidyverse.org/blog/2020/12/corrr-0-4-3/
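A minimal sketch of both functions, assuming the built-in `mtcars` data as a stand-in (it is not the data from the application exercises). The `corrr` call runs only if that package happens to be installed.

```r
# Built-in mtcars data, used here only for illustration
x <- mtcars$wt   # car weight (1000 lbs)
y <- mtcars$mpg  # miles per gallon

# Base R: sample correlation r between two numeric vectors
r <- cor(x, y)

# The same value computed directly from the definition of r
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

print(c(cor = r, manual = r_manual))

# corrr::correlate() takes a data frame and returns a tidy correlation table;
# it is run only if the corrr package is installed
if (requireNamespace("corrr", quietly = TRUE)) {
  print(corrr::correlate(mtcars[, c("mpg", "wt", "hp")]))
}
```

Heavier cars tend to get fewer miles per gallon, so the correlation here is negative.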
Multiple regression estimates the relationship between a quantitative response variable and two or more explanatory variables
It is motivated by scenarios where many variables may be simultaneously connected to an output
In words….
The relationship between x and y does not change based on the values of z (additive)
The relationship between x and y DOES change based on the values of z (interaction)
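The two cases can be sketched in R with `lm`. This example assumes the built-in `mtcars` data, with `mpg` playing the role of y, `wt` the role of x, and transmission type `am` the role of z. These variable choices are illustrative only.

```r
# Illustration with built-in mtcars: y = mpg, x = wt, z = am
cars <- mtcars
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

# Additive model: one common slope for wt, different intercepts by am
fit_add <- lm(mpg ~ wt + am, data = cars)

# Interaction model: the slope for wt is allowed to differ by am
fit_int <- lm(mpg ~ wt * am, data = cars)

print(coef(fit_add))
print(coef(fit_int))
```

The interaction fit has one extra coefficient, `wt:ammanual`, which is the difference in the `wt` slope between the two transmission types. In the additive fit that difference is forced to be zero.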
The principle of parsimony for a statistical model states that: a simpler model with fewer parameters is favored over more complex models with more parameters, provided the models fit the data similarly well
KEEP IT SIMPLE (when you can)
Many different ways
– Initial visual evidence
– R-squared & Adjusted R-squared
– R-squared: a statistical measure in a regression model that gives the proportion of variance in the response variable that can be explained by the explanatory variable(s).
– R-squared never decreases when a predictor is added, so including more variables always gives an equal or larger R-squared value
Takeaway: Adjusted R-squared adds a penalty for “unimportant” predictors (x’s)
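A small sketch of this behavior, assuming the built-in `mtcars` data and a made-up `noise` column of random numbers standing in for an "unimportant" predictor. The seed and variable names are arbitrary choices for illustration.

```r
set.seed(199)  # arbitrary seed so the noise column is reproducible

cars <- mtcars
cars$noise <- rnorm(nrow(cars))  # pure noise, unrelated to mpg by construction

fit_small <- lm(mpg ~ wt, data = cars)
fit_big   <- lm(mpg ~ wt + noise, data = cars)

r2 <- c(small = summary(fit_small)$r.squared,
        big   = summary(fit_big)$r.squared)
adj_r2 <- c(small = summary(fit_small)$adj.r.squared,
            big   = summary(fit_big)$adj.r.squared)

# Adjusted R-squared by hand for the bigger model:
# 1 - (1 - R^2) * (n - 1) / (n - p - 1), where p = number of predictors
n <- nrow(cars)
p <- 2
adj_manual <- 1 - (1 - r2["big"]) * (n - 1) / (n - p - 1)

print(rbind(r2, adj_r2))
```

R-squared for the bigger model cannot be lower than for the smaller one, even though `noise` carries no real information. Adjusted R-squared rises only if the added predictor improves the fit enough to outweigh the penalty, so it is the fairer number for comparing models of different sizes.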