AE 07: Pivoting StatSci Majors - SUGGESTED ANSWERS

Application exercise

Important

Go to the course GitHub organization and locate the repo titled exam1-review-s23-YOUR_GITHUB_USERNAME to get started.

This AE is due Monday, Feb 13 at 11:59pm.

Packages

library(tidyverse)

Goal

Our ultimate goal in this application exercise is to make the following data visualization.

Line plot of numbers of Statistical Science majors over the years (2011 - 2021). Degree types represented are BS, BS2, AB, AB2. There is an increasing trend in BS degrees and somewhat steady trend in AB degrees.

Your turn (3 minutes): Take a close look at the plot and describe what it shows in 2-3 sentences.

Add your response here.

Data

The data come from the Office of the University Registrar. They make the data available as a table that you can download as a PDF, but I’ve exported the data to a CSV file for you. Let’s load that in.

statsci <- read_csv("data/statsci.csv")

Rows: 4 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): degree
dbl (12): id, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

And let’s take a look at the data.

glimpse(statsci)

Rows: 4
Columns: 13
$ id     <dbl> 1, 2, 3, 4
$ degree <chr> "Statistical Science (AB2)", "Statistical Science (AB)", "Stati…
$ `2011` <dbl> NA, 2, 2, 5
$ `2012` <dbl> 1, 2, 6, 9
$ `2013` <dbl> NA, 4, 1, 4
$ `2014` <dbl> NA, 1, NA, 13
$ `2015` <dbl> 4, 3, 5, 10
$ `2016` <dbl> 4, 6, 6, 17
$ `2017` <dbl> 1, 3, 6, 24
$ `2018` <dbl> NA, 4, 8, 21
$ `2019` <dbl> NA, 4, 8, 26
$ `2020` <dbl> 1, 1, 17, 27
$ `2021` <dbl> 2, NA, 16, 35

The dataset has 4 rows and 13 columns. The first column, id, is a row identifier, and the second column, degree, is the degree type. There are 4 possible degrees: BS (Bachelor of Science), BS2 (Bachelor of Science, 2nd major), AB (Bachelor of Arts), AB2 (Bachelor of Arts, 2nd major). The remaining 11 columns show the number of students graduating with that major in a given academic year from 2011 to 2021.

Your turn (4 minutes): Take a look at the plot we aim to make and sketch / think about the data frame we need to make the plot. Determine what each row and each column of the data frame should be. Hint: We need data to be in columns to map to aesthetic elements of the plot.

Add your response here.

Pivoting

Demo: Pivot the statsci data frame longer such that each row represents a degree type / year combination and year and number of graduates for that year are columns in the data frame.

Explain why the code below accomplishes the task above.

statsci |>
  pivot_longer(
    cols = !c(degree,id),
    names_to = "year",
    values_to = "n"
  )

# A tibble: 44 × 4
      id degree                    year      n
   <dbl> <chr>                     <chr> <dbl>
 1     1 Statistical Science (AB2) 2011     NA
 2     1 Statistical Science (AB2) 2012      1
 3     1 Statistical Science (AB2) 2013     NA
 4     1 Statistical Science (AB2) 2014     NA
 5     1 Statistical Science (AB2) 2015      4
 6     1 Statistical Science (AB2) 2016      4
 7     1 Statistical Science (AB2) 2017      1
 8     1 Statistical Science (AB2) 2018     NA
 9     1 Statistical Science (AB2) 2019     NA
10     1 Statistical Science (AB2) 2020      1
# … with 34 more rows
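
To see the reshaping on a smaller scale, here is a minimal sketch using a hypothetical two-row, two-year toy data frame (the name toy and its values are made up for illustration and are not part of statsci). Each of the 2 rows is stacked once per year column, giving 2 x 2 = 4 rows.

```r
library(tidyr)
library(tibble)

# Hypothetical toy data: two degree types, two year columns
toy <- tibble(
  id = c(1, 2),
  degree = c("Statistical Science (AB)", "Statistical Science (BS)"),
  `2011` = c(2, 5),
  `2012` = c(2, 9)
)

toy_long <- toy |>
  pivot_longer(
    cols = !c(id, degree), # every column except id and degree
    names_to = "year",     # old column names are stored here
    values_to = "n"        # old cell values are stored here
  )

dim(toy_long)   # 4 rows (2 degrees x 2 years) and 4 columns
toy_long$year   # a character vector, because it came from column names
```

The same arithmetic explains the 44 rows above: 4 degree types times 11 year columns.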

Question: What is the type of the year variable? Why? What should it be?

It’s a categorical variable (stored as character) since the information came from the column names of the original data frame, and R cannot know that these character strings represent years. The variable type should be quantitative (numeric).

Demo: Start over with pivoting, and this time also make sure year is a numerical variable in the resulting data frame. How does this code differ from above?

statsci |>
  pivot_longer(
    cols = -c(degree,id),
    names_to = "year",
    names_transform = as.numeric,
    values_to = "n"
  )

# A tibble: 44 × 4
      id degree                     year     n
   <dbl> <chr>                     <dbl> <dbl>
 1     1 Statistical Science (AB2)  2011    NA
 2     1 Statistical Science (AB2)  2012     1
 3     1 Statistical Science (AB2)  2013    NA
 4     1 Statistical Science (AB2)  2014    NA
 5     1 Statistical Science (AB2)  2015     4
 6     1 Statistical Science (AB2)  2016     4
 7     1 Statistical Science (AB2)  2017     1
 8     1 Statistical Science (AB2)  2018    NA
 9     1 Statistical Science (AB2)  2019    NA
10     1 Statistical Science (AB2)  2020     1
# … with 34 more rows
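
One way to think about names_transform is as a shortcut for converting the new column right after pivoting. The sketch below compares the two routes on a hypothetical toy data frame (toy is made up for illustration); both should give a numeric year column.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data, not the real statsci data
toy <- tibble(degree = c("AB", "BS"), `2011` = c(2, 5), `2012` = c(2, 9))

# Option 1: convert the names while pivoting
during <- toy |>
  pivot_longer(
    cols = !degree,
    names_to = "year",
    names_transform = as.numeric,
    values_to = "n"
  )

# Option 2: pivot first, then convert in a separate mutate() step
after <- toy |>
  pivot_longer(cols = !degree, names_to = "year", values_to = "n") |>
  mutate(year = as.numeric(year))

class(during$year)
class(after$year)
```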

Question: What does an NA mean in this context? Hint: The data come from the university registrar, and they have records on every single graduate, so there shouldn’t be anything “unknown” to them about who graduated when.

An NA means that no students graduated with that degree type in that year, so NAs should actually be 0s.

Start Here ——————————————————————

Demo: Add on to your pipeline that you started with pivoting and convert NAs in n to 0s.

statsci |>
  pivot_longer(
    cols = !c(id,degree),
    names_to = "year",
    names_transform = as.numeric,
    values_to = "n"
  ) |>
  mutate(n = if_else(is.na(n), 0, n))

# A tibble: 44 × 4
      id degree                     year     n
   <dbl> <chr>                     <dbl> <dbl>
 1     1 Statistical Science (AB2)  2011     0
 2     1 Statistical Science (AB2)  2012     1
 3     1 Statistical Science (AB2)  2013     0
 4     1 Statistical Science (AB2)  2014     0
 5     1 Statistical Science (AB2)  2015     4
 6     1 Statistical Science (AB2)  2016     4
 7     1 Statistical Science (AB2)  2017     1
 8     1 Statistical Science (AB2)  2018     0
 9     1 Statistical Science (AB2)  2019     0
10     1 Statistical Science (AB2)  2020     1
# … with 34 more rows
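
The if_else() call is one of several ways to do this replacement. As a small sketch on a made-up vector (the values below are illustrative only), tidyr's replace_na() and dplyr's coalesce() should give the same result.

```r
library(dplyr)
library(tidyr)

# Made-up counts with missing values
n <- c(NA, 1, NA, 4)

via_if_else    <- if_else(is.na(n), 0, n) # the approach used in the pipeline
via_replace_na <- replace_na(n, 0)        # replace each NA with 0
via_coalesce   <- coalesce(n, 0)          # first non-missing value of n and 0

via_if_else
```

Any of the three could be used inside mutate(); which one reads best is a matter of taste.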

Demo: In our plot the degree types are BS, BS2, AB, and AB2. This information is in our dataset, in the degree column, but this column also has additional characters we don’t need. Create a new column called degree_type with levels BS, BS2, AB, and AB2 (in this order) based on degree. The code below accomplishes this for you. Comment / practice reading each line.

new.data <- statsci |>
  pivot_longer(
    cols = !c(id, degree), # pivot every column except id and degree
    names_to = "year", # old column names go into a column called year
    names_transform = as.numeric, # convert year from character to numeric
    values_to = "n" # old cell values go into a column called n
  ) |>
  mutate(n = if_else(is.na(n), 0, n)) |> # replace NAs with 0s
  separate(degree, sep = " \\(", into = c("major", "degree_type")) |> # split degree at " (" into major and degree_type
  mutate(
    degree_type = str_remove(degree_type, "\\)"), # remove the closing parenthesis
    degree_type = fct_relevel(degree_type, "BS", "BS2", "AB", "AB2") # reorder factor levels to match the plot
  )
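
To check your reading of the string-cleaning steps, it can help to run them on just a couple of values. The sketch below uses a hypothetical two-row data frame (toy is made up for illustration) and shows what each column looks like afterwards.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(forcats)
library(tibble)

# Hypothetical toy data with the same text pattern as the degree column
toy <- tibble(degree = c("Statistical Science (AB2)", "Statistical Science (BS)"))

cleaned <- toy |>
  # split at " (" so that major has no trailing space
  separate(degree, sep = " \\(", into = c("major", "degree_type")) |>
  mutate(
    degree_type = str_remove(degree_type, "\\)"),       # drop the closing ")"
    degree_type = fct_relevel(degree_type, "BS", "AB2") # BS first, then AB2
  )

cleaned$major
levels(cleaned$degree_type)
```

Without fct_relevel(), the levels would be in alphabetical order, which would put AB2 before BS in the legend.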

Your turn (5 minutes): Now we start making our plot, but let’s not get too fancy right away. Create the following plot, which will serve as the “first draft” on the way to our Goal. Do this by adding on to your pipeline from earlier.

statsci |>
  pivot_longer(
    cols = !c(id,degree),
    names_to = "year",
    names_transform = as.numeric,
    values_to = "n"
  ) |>
  mutate(n = if_else(is.na(n), 0, n)) |>
  separate(degree, sep = " \\(", into = c("major", "degree_type")) |>
  mutate(
    degree_type = str_remove(degree_type, "\\)"),
    degree_type = fct_relevel(degree_type, "BS", "BS2", "AB", "AB2")
    ) |> 
  ggplot(
    aes(x = year, y = n, color = degree_type)
  ) + 
  geom_point() +
  geom_line()

Your turn (4 minutes): What aspects of the plot need to be updated to go from the draft you created above to the Goal plot at the beginning of this application exercise?

- x-axis
- line colors
- labels
- theme
- legend

Demo: Update x-axis scale such that the years displayed go from 2011 to 2021 in increments of 2 years. Do this by adding on to your pipeline from earlier.

statsci |>
  pivot_longer(
    cols = !c(id,degree),
    names_to = "year",
    names_transform = as.numeric,
    values_to = "n"
  ) |>
  mutate(n = if_else(is.na(n), 0, n)) |>
  separate(degree, sep = " \\(", into = c("major", "degree_type")) |>
  mutate(
    degree_type = str_remove(degree_type, "\\)"),
    degree_type = fct_relevel(degree_type, "BS", "BS2", "AB", "AB2")
    ) |>
  ggplot(aes(x = year, y = n, color = degree_type)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = seq(2011,2021,2))
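
The breaks argument just needs a vector of the values to label. As a quick check, you can run seq() on its own to see which years it produces (the object name breaks is only for illustration).

```r
# Years from 2011 to 2021 in steps of 2
breaks <- seq(from = 2011, to = 2021, by = 2)

breaks
# 2011 2013 2015 2017 2019 2021
```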

Demo: Update line colors using the following level / color assignments. Once again, do this by adding on to your pipeline from earlier.
- “BS” = “cadetblue4”
- “BS2” = “cadetblue3”
- “AB” = “lightgoldenrod4”
- “AB2” = “lightgoldenrod3”

statsci |>
  pivot_longer(
    cols = !c(id,degree),
    names_to = "year",
    names_transform = as.numeric,
    values_to = "n"
  ) |>
  mutate(n = if_else(is.na(n), 0, n)) |>
  separate(degree, sep = " \\(", into = c("major", "degree_type")) |>
  mutate(
    degree_type = str_remove(degree_type, "\\)"),
    degree_type = fct_relevel(degree_type, "BS", "BS2", "AB", "AB2")
    ) |>
  ggplot(aes(x = year, y = n, color = degree_type)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = seq(2011, 2021, 2)) +
  scale_color_manual(
    values = c("BS" = "cadetblue4", 
               "BS2" = "cadetblue3",
               "AB" = "lightgoldenrod4",
               "AB2" = "lightgoldenrod3")
  )
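
Because the colors are given as a named vector, each color is matched to a level of degree_type by name rather than by position. The names on the right-hand side must be colors R knows about; a small sketch of how you might check this with the base R function colors() (the object name degree_colors is made up for illustration):

```r
# Named vector: names are factor levels, values are R color names
degree_colors <- c(
  "BS"  = "cadetblue4",
  "BS2" = "cadetblue3",
  "AB"  = "lightgoldenrod4",
  "AB2" = "lightgoldenrod3"
)

# TRUE for each value that is a built-in R color name
degree_colors %in% colors()
```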

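Your turn (4 minutes): Update the plot labels (x, y, color, title, subtitle, and caption) and use theme_minimal(). Once again, do this by adding on to your pipeline from earlier.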

statsci |>
  pivot_longer(
    cols = !c(id,degree),
    names_to = "year",
    names_transform = as.numeric,
    values_to = "n"
  ) |>
  mutate(n = if_else(is.na(n), 0, n)) |>
  separate(degree, sep = " \\(", into = c("major", "degree_type")) |>
  mutate(
    degree_type = str_remove(degree_type, "\\)"),
    degree_type = fct_relevel(degree_type, "BS", "BS2", "AB", "AB2")
    ) |>
  ggplot(aes(x = year, y = n, color = degree_type)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = seq(2011, 2021, 2)) +
  scale_color_manual(
    values = c("BS" = "cadetblue4",
               "BS2" = "cadetblue3",
               "AB" = "lightgoldenrod4",
               "AB2" = "lightgoldenrod3")) +
  labs(
    x = "Graduation year",
    y = "Number of majors graduating",
    color = "Degree type",
    title = "Statistical Science majors over the years",
    subtitle = "Academic years 2011 - 2021",
    caption = "Source: Office of the University Registrar\nhttps://registrar.duke.edu/registration/enrollment-statistics"
  ) +
  theme_minimal()

Demo: Finally, adding to the pipeline you’ve developed so far, move the legend into the plot, make its background white, and its border gray. Set fig-width: 7 and fig-height: 5 for your plot in the chunk options. The code below does this for you. Practice reading the code below as a sentence.

statsci |>
  pivot_longer(
    cols = !c(id,degree),
    names_to = "year",
    names_transform = as.numeric,
    values_to = "n"
  ) |>
  mutate(n = if_else(is.na(n), 0, n)) |>
  separate(degree, sep = " \\(", into = c("major", "degree_type")) |>
  mutate(
    degree_type = str_remove(degree_type, "\\)"),
    degree_type = fct_relevel(degree_type, "BS", "BS2", "AB", "AB2")
    ) |>
  ggplot(aes(x = year, y = n, color = degree_type)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = seq(2011, 2021, 2)) +
  scale_color_manual(
    values = c("BS" = "cadetblue4",
               "BS2" = "cadetblue3",
               "AB" = "lightgoldenrod4",
               "AB2" = "lightgoldenrod3")) +
  labs( # add plot labels with labs()
    x = "Graduation year",
    y = "Number of majors graduating",
    color = "Degree type",
    title = "Statistical Science majors over the years",
    subtitle = "Academic years 2011 - 2021",
    caption = "Source: Office of the University Registrar\nhttps://registrar.duke.edu/registration/enrollment-statistics"
  ) +
  theme_minimal() + # apply the minimal theme
  theme( # adjust individual theme elements, here the legend
    legend.position = c(0.2, 0.8), # place the legend inside the plot, toward the top left
    legend.background = element_rect( # draw the legend box with a white fill and gray border
      fill = "white", color = "gray"
    )
  )
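
A note on versions: in newer releases of ggplot2 (3.5.0 and later, to the best of our knowledge), passing a numeric vector to legend.position is deprecated in favor of legend.position = "inside" together with legend.position.inside. The sketch below is one possible way to write a theme that works with either version; the object name inside_legend is made up for illustration, and the version cutoff is an assumption you may want to confirm against the ggplot2 release notes.

```r
library(ggplot2)

# Pick the legend placement syntax based on the installed ggplot2 version
# (assumption: the "inside" syntax is available from version 3.5.0)
inside_legend <- if (packageVersion("ggplot2") >= "3.5.0") {
  theme(
    legend.position = "inside",
    legend.position.inside = c(0.2, 0.8)
  )
} else {
  theme(legend.position = c(0.2, 0.8))
}
```

You could then add inside_legend to the plot with + in place of the legend.position line, keeping legend.background in its own theme() call.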