AE 05: Joining fisheries - Suggested Answers

Application exercise
Important

Go to the course GitHub organization and locate the repo titled ae-05-23-YOUR_GITHUB_USERNAME to get started.

This AE is due Saturday, Feb 4 at 11:59pm.

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   0.3.5
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.1     ✔ stringr 1.4.1
✔ readr   2.1.3     ✔ forcats 0.5.2
Warning: package 'ggplot2' was built under R version 4.2.2
Warning: package 'tidyr' was built under R version 4.2.2
Warning: package 'readr' was built under R version 4.2.2
Warning: package 'purrr' was built under R version 4.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor
fisheries <- read_csv("data/fisheries.csv")
Rows: 82 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): country
dbl (3): capture, aquaculture, total

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
continents <- read_csv("data/continents.csv")
Rows: 245 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, continent

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

read_csv

Before we get started, I want to introduce read_csv.

read_csv reads comma delimited files. The first argument to read_csv() is the most important: it’s the path to the file to read. This function uses the first line of the data for the column names, which is a very common convention. The data might not have column names. You can use col_names = FALSE to tell read_csv() not to treat the first row as headings, and instead label them: read_csv(path, col_names = FALSE).

Another option that commonly needs tweaking is na: this specifies the value (or values) that are used to represent missing values in your file: read_csv(path, na = "NA")

Working with multiple data frames

Often instead of being provided the data you need for your analysis in a single data frame, you will need to bring information from multiple datasets together into a data frame yourself. These datasets will be linked to each other via a column (usually an identifier, something that links the two datasets together) that you can use to join them together.

There are many possible types of joins. All have the format something_join(x, y).

x <- tibble(
  value = c(1, 2, 3),
  xcol = c("x1", "x2", "x3")
  )

y <- tibble(
  value = c(1, 2, 4),
  ycol = c("y1", "y2", "y4")
  )

x
# A tibble: 3 × 2
  value xcol 
  <dbl> <chr>
1     1 x1   
2     2 x2   
3     3 x3   
y
# A tibble: 3 × 2
  value ycol 
  <dbl> <chr>
1     1 y1   
2     2 y2   
3     4 y4   

We will demonstrate each of the joins on these small, toy datasets.

Note: These functions below know to join x and y by value because each dataset has value as a column. See for yourself!

[1] "value" "xcol" 
[1] "value" "ycol" 

inner_join() join all rows in x where there are matching values of y

Joining, by = "value"
# A tibble: 2 × 3
  value xcol  ycol 
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   

left_join() include all rows in x

left_join(x , y)
Joining, by = "value"
# A tibble: 3 × 3
  value xcol  ycol 
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 

right_join() includes all rows of y

right_join(y, x) 
Joining, by = "value"
# A tibble: 3 × 3
  value ycol  xcol 
  <dbl> <chr> <chr>
1     1 y1    x1   
2     2 y2    x2   
3     3 <NA>  x3   

full_join() includes rows for both x and y

full_join(x, y)
Joining, by = "value"
# A tibble: 4 × 3
  value xcol  ycol 
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 
4     4 <NA>  y4   

semi_join()returns all rows from x that match y

semi_join(x, y)
Joining, by = "value"
# A tibble: 2 × 2
  value xcol 
  <dbl> <chr>
1     1 x1   
2     2 x2   

anti_join() return all rows from x that do not match with y

anti_join(x, y)
Joining, by = "value"
# A tibble: 1 × 2
  value xcol 
  <dbl> <chr>
1     3 x3   

Global aquaculture production

The Fisheries and Aquaculture Department of the Food and Agriculture Organization of the United Nations collects data on fisheries production of countries.

Goal: Our goal is to create a visualization of the mean share of aquaculture by continent.

Let’s start by looking at the fisheries data frame.

glimpse(fisheries)
Rows: 82
Columns: 4
$ country     <chr> "Angola", "Argentina", "Australia", "Bangladesh", "Brazil"…
$ capture     <dbl> 486490, 755226, 174629, 1674770, 705000, 629950, 233190, 8…
$ aquaculture <dbl> 655, 3673, 96847, 2203554, 581230, 172500, 2315, 200765, 9…
$ total       <dbl> 487145, 758899, 271476, 3878324, 1286230, 802450, 235505, …

We have the countries, but our goal is to make a visualization by continent. Let’s take a look at the continents data frame.

glimpse(continents)
Rows: 245
Columns: 2
$ country   <chr> "Afghanistan", "Åland Islands", "Albania", "Algeria", "Ameri…
$ continent <chr> "Asia", "Europe", "Europe", "Africa", "Oceania", "Europe", "…
  • Your turn (2 minutes):
    • Which variable(s) will we use to join the fisheries and continents data frames?

country

-   We want to keep all rows and columns from `fisheries` and add a column for corresponding continents. Which join function should we use?

full_join

full_left

  • Demo: Join the two data frames and name assign the joined data frame back to joined_fish.
joined_fish <- fisheries |> 
  left_join(continents)
Joining, by = "country"
  • Demo: Take a look at the updated joined_fish data frame. There are some countries that were not in continents. In the first code chunk, identify which countries these are (they will have NA values for continent). In the second code chunk, manually update the continent information for these countries using the case_when function. Finally, check that these updates have been made as intended and no countries are left without continent information.
joined_fish |>
  filter(is.na(continent))

Comment through the following code below. Specifically…

– How is case_when being used? – Why quotes? – What does the ~ in words?
– What is TRUE doing?

joined_fish <- joined_fish |> 
  mutate(
    continent = case_when(
    country == "Democratic Republic of the Congo" ~ "Africa",
    country == "Hong Kong" ~ "Asia",
    country == "Myanmar" ~ "Asia", 
    TRUE ~ continent
    )
  )
  • Demo: Add a new column to the joined_fish data frame called aq_prop. We will calculate it as aquaculture / total. Save the resulting frame as joined_fish.
joined_fish <- joined_fish |>
  mutate(aq_prop = aquaculture / total)
  • Your turn (5 minutes): Now expand your calculations to also calculate the mean, minimum and maximum aquaculture proportion for continents in the fisheries data. Note that the functions for calculating minimum and maximum in R are min() and max() respectively.
joined_fish |>
  group_by(continent) |>
  summarize(
    min_aq_prop = min(aq_prop),
    max_aq_prop = max(aq_prop),
    mean_aq_prop = mean(aq_prop)
  )
# A tibble: 5 × 4
  continent min_aq_prop max_aq_prop mean_aq_prop
  <chr>           <dbl>       <dbl>        <dbl>
1 Africa        0             0.803       0.0943
2 Americas      0             0.529       0.192 
3 Asia          0             0.782       0.367 
4 Europe        0.00682       0.618       0.165 
5 Oceania       0.0197        0.357       0.150 

———————– Solution to above answer below ————————-

  • Demo: Using your code above, create a new data frame called fisheries_summary that calculates minimum, mean, and maximum aquaculture proportion for each continent in the fisheries data.
fisheries_summary <- joined_fish |>
  group_by(continent) |>
  summarize(
    min_aq_prop = min(aq_prop),
    max_aq_prop = max(aq_prop),
    mean_aq_prop = mean(aq_prop)
  )
  • Demo: Then, determine which continent has the largest value of max_ap. Take the fisheries_summary data frame and order the results in descending order of mean aquaculture proportion.
fisheries_summary |>
  arrange(desc(mean_aq_prop))
# A tibble: 5 × 4
  continent min_aq_prop max_aq_prop mean_aq_prop
  <chr>           <dbl>       <dbl>        <dbl>
1 Asia          0             0.782       0.367 
2 Americas      0             0.529       0.192 
3 Europe        0.00682       0.618       0.165 
4 Oceania       0.0197        0.357       0.150 
5 Africa        0             0.803       0.0943