Application exercise
Below, we are going to practice joins using the following fake data sets about coffee.

coffee1 <- tibble(
  Month = c("July", "July", "August", "August", "September"),
  Coffee_Shop = c("Starbucks", "Starbucks", "ThePerk", "ThePerk", "Starbucks"),
  Drinks_Sold = c(3, 2, 6, 5, 1)
)

coffee2 <- tibble(
  month = c("June", "July", "August"),
  Special = c("Free Drink", "Half-Off", "Free Drink")
)

Below, left_join coffee2 to coffee1. Comment on how these two data sets were joined together. Hint: You may need to use the by argument in the left_join function.

coffee1 |>
  left_join(coffee2 , by = c("Month" = "month"))
# A tibble: 5 × 4
  Month     Coffee_Shop Drinks_Sold Special   
  <chr>     <chr>             <dbl> <chr>     
1 July      Starbucks             3 Half-Off  
2 July      Starbucks             2 Half-Off  
3 August    ThePerk               6 Free Drink
4 August    ThePerk               5 Free Drink
5 September Starbucks             1 <NA>      
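To see why the September row ends up with NA, it can help to list the key values that have no partner in the other data set. The sketch below uses dplyr's anti_join for this. The result names unmatched_in_coffee1 and unmatched_in_coffee2 are introduced here for illustration only, and the data sets are redefined so the block runs on its own.

```r
library(dplyr)
library(tibble)

# Same fake coffee data as in the exercise
coffee1 <- tibble(
  Month = c("July", "July", "August", "August", "September"),
  Coffee_Shop = c("Starbucks", "Starbucks", "ThePerk", "ThePerk", "Starbucks"),
  Drinks_Sold = c(3, 2, 6, 5, 1)
)

coffee2 <- tibble(
  month = c("June", "July", "August"),
  Special = c("Free Drink", "Half-Off", "Free Drink")
)

# Rows of coffee1 whose Month has no match in coffee2
unmatched_in_coffee1 <- coffee1 |>
  anti_join(coffee2, by = c("Month" = "month"))

# Rows of coffee2 whose month has no match in coffee1
unmatched_in_coffee2 <- coffee2 |>
  anti_join(coffee1, by = c("month" = "Month"))

unmatched_in_coffee1
unmatched_in_coffee2
```

The only unmatched coffee1 row is September, which is the row that received NA for Special. The June special is the only unmatched coffee2 row, so it never shows up in the left join.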

The same join, written a different way:

We used the by argument in left_join because the key column is named differently in the two data sets (Month in coffee1, month in coffee2). Run the code below and compare it to the output above. Is it the same or different?

coffee2 <- coffee2 |>
  rename("Month" = month)

coffee1 |>
  left_join(coffee2)
Joining, by = "Month"
Now, full_join and right_join the two data sets together. Comment on the results.

coffee1 |>
  left_join(coffee2)
Joining, by = "Month"
coffee1 |>
  right_join(coffee2)
Joining, by = "Month"
coffee1 |>
  full_join(coffee2)
Joining, by = "Month"
left_join - coffee2 (y) is joined to coffee1 (x) by Month. Wherever a Month in coffee1 has a match in coffee2, the other columns of coffee2 are added; where there is no match, those columns are filled with NA. Only the rows of coffee1 are kept.

right_join - the opposite of left_join. Keeps all rows of y (coffee2) and adds the columns of x (coffee1).

full_join - keeps all rows of both x and y.
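The difference between the three joins shows up in which months survive. The following is a minimal sketch on the same coffee data, redefined so the block runs on its own. The names left, right, and full are introduced here for illustration, and the by argument is spelled out so the original column names can be used without renaming.

```r
library(dplyr)
library(tibble)

coffee1 <- tibble(
  Month = c("July", "July", "August", "August", "September"),
  Coffee_Shop = c("Starbucks", "Starbucks", "ThePerk", "ThePerk", "Starbucks"),
  Drinks_Sold = c(3, 2, 6, 5, 1)
)

coffee2 <- tibble(
  month = c("June", "July", "August"),
  Special = c("Free Drink", "Half-Off", "Free Drink")
)

keys <- c("Month" = "month")

left  <- left_join(coffee1, coffee2, by = keys)   # all rows of coffee1
right <- right_join(coffee1, coffee2, by = keys)  # all rows of coffee2
full  <- full_join(coffee1, coffee2, by = keys)   # all rows of both

# Which months appear in each result
unique(left$Month)
unique(right$Month)
unique(full$Month)

# Number of rows in each result
c(left = nrow(left), right = nrow(right), full = nrow(full))
```

In this data, September appears only where coffee1 rows are kept (left and full), and June appears only where coffee2 rows are kept (right and full).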

Summary Statistics

In this short activity, we will use the Orange data set built into R. Please run ?Orange to learn more.

Below, please complete the following:

  • Calculate the mean circumference of each tree.

  • Create a new variable called old to indicate whether the tree was over 1000 days old at the time of the measurement (age in Orange is recorded in days). Use the value Yes if age is over 1000, and No if it is not. Hint: One way to answer this involves using if_else.

Orange |>
  group_by(Tree) |>
  summarize(mean_cir = mean(circumference))
# A tibble: 5 × 2
  Tree  mean_cir
  <ord>    <dbl>
1 3         94  
2 1         99.6
3 5        111. 
4 2        135. 
5 4        139. 
Orange |>
  mutate(old = if_else(age > 1000, "Yes", "No"))
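One way to check the new variable is to count how many measurements fall into each category. The sketch below assumes the column is named old, as the task asks; the names flagged and old_counts are introduced here for illustration. Orange is converted to a tibble first, which is a precaution of this sketch rather than a requirement of the exercise.

```r
library(dplyr)
library(tibble)

# Flag each measurement by whether the tree was over 1000 days old
flagged <- as_tibble(Orange) |>
  mutate(old = if_else(age > 1000, "Yes", "No"))

# Count measurements in each category, overall and per tree
old_counts <- flagged |>
  count(old)

old_counts

flagged |>
  count(Tree, old)
```

Storing the result in a new column leaves circumference untouched, so the per-tree means from the first task can still be computed from the same data.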
