Exam-Review - SUGGESTED ANSWERS

Application exercise

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   0.3.5
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.1     ✔ stringr 1.4.1
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Below, we are going to practice joins using the following fake data sets about coffee.

coffee1 <- tibble(
  Month = c("July" , "July", "August", "August" , "September"),
  Coffee_Shop = c("Starbucks" , "Starbucks", "ThePerk" , "ThePerk", "Starbucks"),
  Drinks_Sold = c(3,2,6,5,1)
)

coffee2 <- tibble(
  month = c("June", "July", "August"),
  Special = c("Free Drink", "Half-Off", "Free Drink")
  )

Below, left_join coffee2 to coffee1. Comment on how these two data sets were joined together. Hint: You may need to use the by argument in the left_join function.

coffee1 |>
  left_join(coffee2 , by = c("Month" = "month"))
# A tibble: 5 × 4
  Month     Coffee_Shop Drinks_Sold Special   
  <chr>     <chr>             <dbl> <chr>     
1 July      Starbucks             3 Half-Off  
2 July      Starbucks             2 Half-Off  
3 August    ThePerk               6 Free Drink
4 August    ThePerk               5 Free Drink
5 September Starbucks             1 <NA>      

Same thing different way:

We used the by argument in the left_join function because the key columns had different names in the two data sets (Month in coffee1, month in coffee2). Run the code below and compare it to the output above. Same? Different?

coffee2 <- coffee2 |>
  rename("Month" = month)

coffee1 |>
  left_join(coffee2)
Joining, by = "Month"
# A tibble: 5 × 4
  Month     Coffee_Shop Drinks_Sold Special   
  <chr>     <chr>             <dbl> <chr>     
1 July      Starbucks             3 Half-Off  
2 July      Starbucks             2 Half-Off  
3 August    ThePerk               6 Free Drink
4 August    ThePerk               5 Free Drink
5 September Starbucks             1 <NA>      
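
One way to answer the "Same? Different?" question without comparing the printouts by eye is to build both results and test them with identical(). The sketch below is self-contained and assumes only the two data sets defined above; the names coffee2_renamed, joined_by_arg, and joined_by_name are introduced here for illustration and are not part of the exercise.

```r
library(dplyr)
library(tibble)

coffee1 <- tibble(
  Month = c("July", "July", "August", "August", "September"),
  Coffee_Shop = c("Starbucks", "Starbucks", "ThePerk", "ThePerk", "Starbucks"),
  Drinks_Sold = c(3, 2, 6, 5, 1)
)

# Original version of coffee2, with the lowercase key column
coffee2 <- tibble(
  month = c("June", "July", "August"),
  Special = c("Free Drink", "Half-Off", "Free Drink")
)

# Approach 1: map the differently named key columns with `by`
joined_by_arg <- coffee1 |>
  left_join(coffee2, by = c("Month" = "month"))

# Approach 2: rename the key first, then join on the shared name
coffee2_renamed <- coffee2 |>
  rename(Month = month)

joined_by_name <- coffee1 |>
  left_join(coffee2_renamed, by = "Month")

# Returns TRUE when both approaches produce the same table
identical(joined_by_arg, joined_by_name)
```

Passing by = "Month" explicitly also suppresses the Joining, by = "Month" message that appears when the key is left for dplyr to guess.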

Now, full_join and right_join the two data sets together. Comment on the results.

coffee1 |>
  left_join(coffee2)
Joining, by = "Month"
# A tibble: 5 × 4
  Month     Coffee_Shop Drinks_Sold Special   
  <chr>     <chr>             <dbl> <chr>     
1 July      Starbucks             3 Half-Off  
2 July      Starbucks             2 Half-Off  
3 August    ThePerk               6 Free Drink
4 August    ThePerk               5 Free Drink
5 September Starbucks             1 <NA>      
coffee1 |>
  full_join(coffee2)
Joining, by = "Month"
# A tibble: 6 × 4
  Month     Coffee_Shop Drinks_Sold Special   
  <chr>     <chr>             <dbl> <chr>     
1 July      Starbucks             3 Half-Off  
2 July      Starbucks             2 Half-Off  
3 August    ThePerk               6 Free Drink
4 August    ThePerk               5 Free Drink
5 September Starbucks             1 <NA>      
6 June      <NA>                 NA Free Drink
coffee1 |>
  right_join(coffee2)
Joining, by = "Month"
# A tibble: 5 × 4
  Month  Coffee_Shop Drinks_Sold Special   
  <chr>  <chr>             <dbl> <chr>     
1 July   Starbucks             3 Half-Off  
2 July   Starbucks             2 Half-Off  
3 August ThePerk               6 Free Drink
4 August ThePerk               5 Free Drink
5 June   <NA>                 NA Free Drink

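left_join - coffee2 (y) gets joined to coffee1 (x) by Month. Wherever a Month in coffee1 has a match in coffee2, the columns of coffee2 are added to that row; rows without a match get NA (September). Only the rows of coffee1 are kept, so June is dropped.

right_join - the opposite of left_join. Keeps all rows of y (coffee2) and adds the columns of x (coffee1). September is dropped, and June gets NA in the coffee1 columns.

full_join - keeps all rows of both x and y, so September and June both appear, with NA wherever there is no match.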
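
The differences between the three joins come down to which Month values survive. As a rough check, the sketch below counts the rows each join returns and lists the months that have no partner in the other data set. It assumes coffee2 has already been renamed to use Month; the object name join_rows is made up for this example.

```r
library(dplyr)
library(tibble)

coffee1 <- tibble(
  Month = c("July", "July", "August", "August", "September"),
  Coffee_Shop = c("Starbucks", "Starbucks", "ThePerk", "ThePerk", "Starbucks"),
  Drinks_Sold = c(3, 2, 6, 5, 1)
)

coffee2 <- tibble(
  Month = c("June", "July", "August"),
  Special = c("Free Drink", "Half-Off", "Free Drink")
)

# Number of rows returned by each type of join
join_rows <- c(
  left  = nrow(left_join(coffee1, coffee2, by = "Month")),
  right = nrow(right_join(coffee1, coffee2, by = "Month")),
  full  = nrow(full_join(coffee1, coffee2, by = "Month"))
)
join_rows

# Months in coffee1 with no match in coffee2 (kept by left_join and full_join)
setdiff(coffee1$Month, coffee2$Month)

# Months in coffee2 with no match in coffee1 (kept by right_join and full_join)
setdiff(coffee2$Month, coffee1$Month)
```

The counts should line up with the tibble dimensions printed above: the full join has one more row than the other two because it keeps the unmatched month from each side.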

Summary Statistics

In this short activity, we will use the Orange data set built into R. Please run ?Orange to learn more.

Below, please complete the following:

  • Calculate the mean circumference of each tree.

  • Create a new variable called old to indicate whether the tree was over 1000 days old when it was measured (age in Orange is recorded in days). Use the value Yes if age is over 1000, and No if it is not. Hint: One way to answer this uses if_else.

Orange |>
  group_by(Tree) |>
  summarize(mean_cir = mean(circumference))
# A tibble: 5 × 2
  Tree  mean_cir
  <ord>    <dbl>
1 3         94  
2 1         99.6
3 5        111. 
4 2        135. 
5 4        139. 
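
The rows above are not in the order 1 to 5 because Tree is an ordered factor, and the grouped summary follows the factor's level order rather than the numeric label. A small sketch, assuming you would rather see the table sorted by tree number; the names tree_means and tree_num are introduced only for this example.

```r
library(dplyr)

# Tree is an ordered factor; its level order drives the row order above
levels(Orange$Tree)

tree_means <- Orange |>
  group_by(Tree) |>
  summarize(mean_cir = mean(circumference)) |>
  # convert the factor label (not the level index) to a number for sorting
  mutate(tree_num = as.integer(as.character(Tree))) |>
  arrange(tree_num)

tree_means
```

Going through as.character() first matters here: calling as.integer() directly on a factor returns the position of each level, not the number printed in the label.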
Orange |>
  mutate(old = if_else(age > 1000, "Yes", "No"))
   Tree  age circumference old
1     1  118            30  No
2     1  484            58  No
3     1  664            87  No
4     1 1004           115 Yes
5     1 1231           120 Yes
6     1 1372           142 Yes
7     1 1582           145 Yes
8     2  118            33  No
9     2  484            69  No
10    2  664           111  No
11    2 1004           156 Yes
12    2 1231           172 Yes
13    2 1372           203 Yes
14    2 1582           203 Yes
15    3  118            30  No
16    3  484            51  No
17    3  664            75  No
18    3 1004           108 Yes
19    3 1231           115 Yes
20    3 1372           139 Yes
21    3 1582           140 Yes
22    4  118            32  No
23    4  484            62  No
24    4  664           112  No
25    4 1004           167 Yes
26    4 1231           179 Yes
27    4 1372           209 Yes
28    4 1582           214 Yes
29    5  118            30  No
30    5  484            49  No
31    5  664            81  No
32    5 1004           125 Yes
33    5 1231           142 Yes
34    5 1372           174 Yes
35    5 1582           177 Yes
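
A quick way to sanity-check an if_else result is to tabulate it. The sketch below stores the Yes/No indicator in a column named old, as the exercise asks, and counts how many measurements land in each category overall and per tree. The object names orange_old and old_by_tree are assumptions made for this example.

```r
library(dplyr)

# Add the indicator as a new column, leaving circumference untouched
orange_old <- Orange |>
  mutate(old = if_else(age > 1000, "Yes", "No"))

# Overall count of measurements in each category
table(orange_old$old)

# Count per tree and category
old_by_tree <- orange_old |>
  group_by(Tree, old) |>
  summarize(n = n(), .groups = "drop")

old_by_tree
```

Every tree was measured at the same seven ages, so each tree should show the same split between No and Yes.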