Lecture 16
Dr. Elijah Meyer
Duke University
STA 199 - Spring 2023
March 8th, 2023
– Clone ae-15
– HW 3 due Friday (3-10)
– Project Proposal due Friday (3-10)
– The goal of the statistics experience assignments is to help you engage with the statistics and data science communities outside of the classroom
– No GitHub repo for this assignment
Experience statistics outside of the classroom
– Attend a talk or conference
– Talk with a statistician/data scientist (myself and TAs do not count)
– Listen to a podcast / watch a video
– Participate in a data science competition or challenge
– Read a book on statistics/data science
– TidyTuesday challenges
– Coding out loud project
– DataFest: a data analysis competition where teams of up to five students attack a large, complex, surprise dataset over a weekend
– DataFest is a great opportunity to gain experience that employers are looking for
– Each team will give a brief presentation of its findings, judged by a panel of faculty and professionals from a variety of fields.
– Summarize your experience
– Guidelines are in the instructions
– What is the difference between R-squared and Adjusted R-squared?
— How is each defined?
— When is each appropriate to use?
– How is each defined?
R-squared: The proportion of variability in our response that is explained by our model
Adjusted R-squared: A measure of overall model fit that penalizes the number of predictors in the model
— When is each appropriate to use?
R-squared: when comparing models that have the same number of variables
Adjusted R-squared: when comparing models that have a different number of variables
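To make the comparison concrete, here is a minimal Python sketch of both quantities. The slides do not spell out a formula for adjusted R-squared, so this uses the standard one, 1 - (1 - R²)(n - 1)/(n - k - 1), where n is the number of observations and k the number of predictors. The data and fitted values are made up purely for illustration.

```python
def r_squared(y, y_hat):
    """Proportion of variability in y explained by the fitted values y_hat."""
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 1 - ss_resid / ss_total


def adjusted_r_squared(r2, n, k):
    """Standard adjustment: penalize R-squared for using k predictors on n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical response values and fitted values (not from the lecture)
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.1, 1.9, 3.2, 3.9, 4.9]

r2 = r_squared(y, y_hat)
adj_one_predictor = adjusted_r_squared(r2, n=len(y), k=1)
adj_two_predictors = adjusted_r_squared(r2, n=len(y), k=2)

print(r2, adj_one_predictor, adj_two_predictors)
```

For the same R-squared, the adjusted value drops as the number of predictors grows, which is why it is the fairer yardstick when models differ in size.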
– The What, Why, and How of Logistic Regression
Similar to linear regression…. but
Modeling tool when our response is categorical
– This type of model is called a generalized linear model
– Bernoulli Distribution
2 outcomes: Success (p) or Failure (1-p)
\(y_i \sim \text{Bern}(p)\)
We can use our explanatory variable(s) to model p
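As a rough illustration of a Bernoulli response, the sketch below simulates 0/1 outcomes with a fixed success probability. The function name, the value p = 0.3, and the seed are all arbitrary choices for this example, not part of the lecture.

```python
import random


def simulate_bernoulli(p, n, seed=199):
    """Draw n Bernoulli(p) outcomes: 1 = success, 0 = failure."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [1 if rng.random() < p else 0 for _ in range(n)]


# Hypothetical success probability, chosen only for illustration
draws = simulate_bernoulli(p=0.3, n=10_000)
proportion = sum(draws) / len(draws)

print(proportion)  # should land close to 0.3
```

Here p is the same for every observation. In logistic regression, p is instead allowed to change with the explanatory variable(s).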
– 1: Define a linear model
– 2: Define a link function
\(\eta_i = \beta_0 + \beta_1 X_i + \dots\)
Note: We use \(p_i\) for estimated probabilities
– Perform a transformation on our response variable so it has the appropriate range of values
– Or… take values between negative and positive infinity and map them to probabilities
– A logit link function transforms the probabilities of the levels of a categorical response variable to a continuous scale that is unbounded
– Note: log refers to the natural log
Takes a probability in [0, 1] and maps it to log-odds (\(-\infty\) to \(\infty\))
This isn’t exactly what we need, though…
But it will help us get to our goal
The logit link function is defined as follows:
\(\text{logit}(p) = \widehat{\beta}_0 + \widehat{\beta}_1 X_1 + \dots\)
logit(p) is also known as the log-odds
\(\text{logit}(p) = \log\left(\frac{p}{1-p}\right)\)
\(\log\left(\frac{p}{1-p}\right) = \widehat{\beta}_0 + \widehat{\beta}_1 X_1 + \dots\)
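A small Python sketch of the logit (log-odds) transformation may help show how probabilities spread out onto an unbounded scale. The function name and the probabilities tried are choices made for this example only.

```python
import math


def logit(p):
    """Map a probability strictly between 0 and 1 to its log-odds (natural log)."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1 - p))


# Probabilities below 0.5 give negative log-odds, above 0.5 positive
for p in (0.1, 0.5, 0.9):
    print(p, logit(p))
```

Note that p = 0.5 maps to log-odds of 0, and the values for 0.1 and 0.9 are mirror images of each other.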
– Recall, the goal is to take values between -\(\infty\) and \(\infty\) and map them to probabilities. We need the opposite of the link function… or the inverse
– How do we take the inverse of a natural log?
\(\text{logit}(p) = \widehat{\beta}_0 + \widehat{\beta}_1 X_1 + \dots\)
\[\log\left(\frac{p}{1-p}\right) = \widehat{\beta}_0 + \widehat{\beta}_1 X_1 + \dots\]
Let’s take the inverse of the logit function
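The algebra can be sketched as follows, writing \(\eta\) as shorthand (introduced here for brevity) for the linear predictor \(\widehat{\beta}_0 + \widehat{\beta}_1 X_1 + \dots\). Exponentiating undoes the natural log, and the rest is solving for p:

\[
\begin{aligned}
\log\left(\frac{p}{1-p}\right) &= \eta \\
\frac{p}{1-p} &= e^{\eta} \\
p &= e^{\eta}(1 - p) \\
p\,(1 + e^{\eta}) &= e^{\eta} \\
p &= \frac{e^{\eta}}{1 + e^{\eta}}
\end{aligned}
\]

The last line is the inverse logit: it takes any value of \(\eta\) between \(-\infty\) and \(\infty\) and returns a probability between 0 and 1.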
Example Figure:
– We cannot model these data using the tools we currently have
– We can overcome some of the shortcomings of linear regression by fitting a generalized linear model
– We can model binary data using an inverse logit function to model probabilities of success
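As a minimal sketch of that last point, the Python below turns a linear predictor into a probability of success with the inverse logit, \(e^{\eta}/(1 + e^{\eta})\). The coefficients `beta_0` and `beta_1` are hypothetical values picked for illustration. They are not estimates from any dataset in this course.

```python
import math


def inverse_logit(eta):
    """Map a value on the log-odds scale to a probability in (0, 1)."""
    # Two algebraically equivalent forms, chosen to avoid overflow in exp()
    if eta >= 0:
        return 1 / (1 + math.exp(-eta))
    return math.exp(eta) / (1 + math.exp(eta))


# Hypothetical coefficients, for illustration only
beta_0, beta_1 = -3.0, 0.5


def predicted_probability(x):
    """Linear predictor first, then the inverse logit."""
    eta = beta_0 + beta_1 * x
    return inverse_logit(eta)


for x in (0, 6, 12):
    print(x, predicted_probability(x))
```

With these made-up coefficients the linear predictor is 0 at x = 6, so the predicted probability there is exactly 0.5. Smaller x gives probabilities below 0.5 and larger x gives probabilities above it.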