Go to the sta199-s23-2 organization on GitHub. Click on the repo with the prefix hw-01. It contains the starter documents you need to complete the homework assignment.
Clone the repo and start a new project in RStudio. See the Lab 0 instructions for details on cloning a repo and starting a new R project.
As we’ve discussed in lecture, your plots should include an informative title, axes should be labeled, and careful consideration should be given to aesthetic choices.
Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete this homework and other assignments in this course. There will be periodic reminders in this assignment to knit, commit, and push your changes to GitHub. You should have at least 3 commits with meaningful commit messages by the end of the assignment.
For the following two exercises you will work with data on houses that were sold in the Duke Forest neighborhood of Durham, NC in November 2020. The duke_forest dataset comes from the openintro package. You can see a list of the variables on the package website or by running ?duke_forest in your console.
Suppose you’re helping some family friends who are looking to buy a house in Duke Forest. As they browse Zillow listings, they realize some houses have garages and others don’t, and they wonder: Does having a garage make a difference?
Luckily, you can help them answer this question with data visualization!
garage (with levels "Garage" and "No garage").
mutate() the duke_forest data frame to add a new variable called garage which takes the value "Garage" if the text string "Garage" is detected in the parking variable and takes the text string "No garage" if not.
duke_forest |>
  mutate(garage = if_else(str_detect(parking, "Garage"), "Garage", "No garage"))
garage and use different colors for the two facets.
Now is a good time to render, commit, and push. Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.
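To see how the hint's str_detect() and if_else() combination behaves before you apply it to duke_forest, here is a minimal sketch on a made-up data frame. The toy tibble and its parking strings are assumptions for illustration only; they are not values taken from the real dataset.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data: these parking strings are invented for illustration
toy <- tibble(parking = c("Garage - Attached", "Carport", "0 spaces", "Garage"))

# Label each row by whether the text "Garage" appears in parking
toy_labeled <- toy |>
  mutate(garage = if_else(str_detect(parking, "Garage"), "Garage", "No garage"))

toy_labeled

# Tally how many rows fall under each label
count(toy_labeled, garage)
```

Note that str_detect() is case sensitive by default, so a value such as "garage" in lowercase would be labeled "No garage" by this pattern.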
It’s expected that within any given market larger houses will be priced higher. It’s also expected that the age of the house will have an effect on the price. However, in some markets new houses might be more expensive, while in others new construction might mean “no character” and hence be less expensive. So your family friends ask: “In Duke Forest, do houses that are bigger and more expensive tend to be newer ones than those that are smaller and cheaper?”
Once again, data visualization skills to the rescue!
price and area, conditioning for year_built.
geom_smooth() with the argument se = FALSE to add a smooth curve fit to the data and color the points by year_built.
Now is a good time to render, commit, and push.
Make sure that you commit and push ALL changed documents and your Git pane is completely empty before proceeding.
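If you have not used geom_smooth() with se = FALSE before, the sketch below shows the general layering pattern on a different dataset. It uses mpg, which ships with ggplot2, as a stand-in; the choice of displ, hwy, and cty is an assumption made purely for illustration and is not part of the homework.

```r
library(ggplot2)

# mpg ships with ggplot2 and stands in for the homework data here
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = cty)) +  # color only the points by a numeric variable
  geom_smooth(se = FALSE) +       # smooth curve without the confidence band
  labs(
    title = "Highway mileage vs. engine displacement",
    x = "Engine displacement (liters)",
    y = "Highway miles per gallon",
    color = "City mpg"
  )

p
```

Mapping color inside geom_point() rather than in the top-level aes() keeps the color on the points only, so the smooth curve is drawn once for all the data.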
The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.
Source: cdc.gov/brfss
In the following exercises we will work with data from the 2020 BRFSS survey. The data originally come from here, though we will work with a random sample of responses and a small number of variables from the data provided. These have already been sampled for you and the dataset you’ll use can be found in the data folder of your repo. It’s called brfss.csv.
brfss <- read_csv("data/brfss.csv")
brfss dataset? What does each row represent?
brfss dataset? Indicate the type of each variable.
Now is a good time to render, commit, and push.
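A few functions are handy for inspecting a data frame once it has been read in. The sketch below is a minimal example on a made-up, three-row CSV passed inline with I(), which assumes readr 2.0 or later; the column names id, state, and hours are hypothetical and do not come from brfss.csv.

```r
library(readr)
library(dplyr)

# Hypothetical three-row CSV supplied inline via I(); not the real brfss.csv
toy <- read_csv(
  I("id,state,hours\n1,NC,7\n2,VA,6\n3,SC,8\n"),
  show_col_types = FALSE
)

nrow(toy)     # number of rows (observations)
ncol(toy)     # number of columns (variables)
glimpse(toy)  # one line per column with its storage type and first values
```

The storage types that glimpse() reports (such as dbl or chr) are a starting point only; deciding whether a variable is categorical or numerical still calls for your own judgment about what it measures.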
Do people who smoke more tend to have worse health conditions?
smoke_freq) and general health (general_health). Decide on which variable to represent with bars and which variable to fill the color of the bars by.
fct_relevel function to reorder the levels of the variables.
general_health. Here we first convert general_health to a factor (how R stores categorical data) and then order the levels from Excellent to Poor.
brfss |>
  mutate(
    general_health = as.factor(general_health),
    general_health = fct_relevel(general_health, "Excellent", "Very good", "Good", "Fair", "Poor")
  )
Now is a good time to render, commit, and push.
How are sleep and general health associated?
sleep and general_health.
Now is a good time to render, commit, and push.
x and y aesthetics in a scatterplot, you get a straight ___ line. (Choose between “vertical”, “horizontal”, or “diagonal”.)
ggplot(data = mpg, mapping = aes(x = drv, fill = class)) +
  geom_bar() +
  scale_fill_viridis_d()
?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?
Render, commit, and push one last time.
Make sure that you commit and push all changed documents and your Git pane is completely empty before proceeding.