AE 03: Wrangling flights - Suggested Answers

Application exercise
Important

This AE is due Saturday, Jan 28 at 11:59pm.

To demonstrate data wrangling we will use flights, a tibble in the nycflights13 R package. It includes characteristics of all flights departing from New York City (JFK, LGA, EWR) in 2013.

Note: As we go through the AE, practicing thinking in steps, and reading your code as sentences

The data frame has over 336,000 observations (rows), 336776 observations to be exact, so we will not view the entire data frame. Instead we’ll use the commands below to help us explore the data.

glimpse(flights)
Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
names(flights)
 [1] "year"           "month"          "day"            "dep_time"      
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"      
[17] "hour"           "minute"         "time_hour"     
head(flights)
# A tibble: 6 × 19
   year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
  <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
1  2013     1     1      517         515       2     830     819      11 UA     
2  2013     1     1      533         529       4     850     830      20 UA     
3  2013     1     1      542         540       2     923     850      33 AA     
4  2013     1     1      544         545      -1    1004    1022     -18 B6     
5  2013     1     1      554         600      -6     812     837     -25 DL     
6  2013     1     1      554         558      -4     740     728      12 UA     
# … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, and abbreviated variable names ¹​sched_dep_time,
#   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

The head() function returns “A tibble: 6 x 19” and then the first six rows of the flights data.

Tibble vs. data frame

A tibble is an opinionated version of the R data frame. In other words, all tibbles are data frames, but not all data frames are tibbles!

There are many differences between a tibble and a data frame. The main one is…

  1. When you print a tibble, the first ten rows and all of the columns that fit on the screen will display, along with the type of each column.

Let’s look at the differences in the output when we type flights (tibble) in the console versus typing cars (data frame) in the console.

Data wrangling with dplyr

dplyr is the primary package in the tidyverse for data wrangling. Click here for the dplyr reference page. Click here for the dplyr cheatsheet.

Quick summary of key dplyr functions1:

Rows:

  • filter():chooses rows based on column values.
  • slice(): chooses rows based on location.
  • arrange(): changes the order of the rows
  • sample_n(): take a random subset of the rows

Columns:

  • select(): changes whether or not a column is included.
  • rename(): changes the name of columns.
  • mutate(): changes the values of columns and creates new columns.

Groups of rows:

  • summarise(): collapses a group into a single row.
  • count(): count unique values of one or more variables.
  • group_by(): perform calculations separately for each value of a variable

The pipe

Before working with more data wrangling functions, let’s formally introduce the pipe. The pipe, |>, is an operator (a tool) for passing information from one process to another. We will use |> mainly in data pipelines to pass the output of the previous line of code as the first input of the next line of code.

When reading code “in English”, say “and then” whenever you see a pipe.

Activities

select()

  • Demo: Make a data frame that only contains the variables dep_delay and arr_delay.
flights |>
  select(dep_delay, arr_delay)
# A tibble: 336,776 × 2
   dep_delay arr_delay
       <dbl>     <dbl>
 1         2        11
 2         4        20
 3         2        33
 4        -1       -18
 5        -6       -25
 6        -4        12
 7        -5        19
 8        -3       -14
 9        -3        -8
10        -2         8
# … with 336,766 more rows
  • Demo: Make a data frame that keeps every variable except dep_delay. Call the new data frame new.data
new.data <- flights |> 
  select(-dep_delay)
  • In the console, type 1:10 and hit enter. What happened?

  • Demo: Make a data frame that includes all variables between year through dep_delay (inclusive). These are all variables that provide information about the departure of each flight.

flights |> 
  select(dep_delay:year)
# A tibble: 336,776 × 6
   dep_delay sched_dep_time dep_time   day month  year
       <dbl>          <int>    <int> <int> <int> <int>
 1         2            515      517     1     1  2013
 2         4            529      533     1     1  2013
 3         2            540      542     1     1  2013
 4        -1            545      544     1     1  2013
 5        -6            600      554     1     1  2013
 6        -4            558      554     1     1  2013
 7        -5            600      555     1     1  2013
 8        -3            600      557     1     1  2013
 9        -3            600      557     1     1  2013
10        -2            600      558     1     1  2013
# … with 336,766 more rows
  • Demo: Use the select helper contains() to make a data frame that includes the variables associated with the arrival, i.e., contains the string "arr_" in the name. Reminder: Thinking about code as sentences can help make nesting functions more intuitive.
flights |>
  select(contains("arr_"))
# A tibble: 336,776 × 3
   arr_time sched_arr_time arr_delay
      <int>          <int>     <dbl>
 1      830            819        11
 2      850            830        20
 3      923            850        33
 4     1004           1022       -18
 5      812            837       -25
 6      740            728        12
 7      913            854        19
 8      709            723       -14
 9      838            846        -8
10      753            745         8
# … with 336,766 more rows
  • Why is arr_ in quotes?

slice()

  • Demo: Display the first five rows of the flights data frame.
flights |>
  slice(1:5)
# A tibble: 5 × 19
   year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
  <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
1  2013     1     1      517         515       2     830     819      11 UA     
2  2013     1     1      533         529       4     850     830      20 UA     
3  2013     1     1      542         540       2     923     850      33 AA     
4  2013     1     1      544         545      -1    1004    1022     -18 B6     
5  2013     1     1      554         600      -6     812     837     -25 DL     
# … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, and abbreviated variable names ¹​sched_dep_time,
#   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
  • Demo: Display the last two rows of the flights data frame. Hint: n() produces the number of the last row in the data set.
flights |>
  slice((n()-1):n())
# A tibble: 2 × 19
   year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
  <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
1  2013     9    30       NA        1159      NA      NA    1344      NA MQ     
2  2013     9    30       NA         840      NA      NA    1020      NA MQ     
# … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, and abbreviated variable names ¹​sched_dep_time,
#   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
flights |>
  slice_tail(n = 2)
# A tibble: 2 × 19
   year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
  <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
1  2013     9    30       NA        1159      NA      NA    1344      NA MQ     
2  2013     9    30       NA         840      NA      NA    1020      NA MQ     
# … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, and abbreviated variable names ¹​sched_dep_time,
#   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

arrange()

  • Demo: Let’s arrange the data by departure delay, so the flights with the shortest departure delays will be at the top of the data frame.
flights |>
  arrange(dep_delay)
# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013    12     7     2040       2123     -43      40    2352      48 B6     
 2  2013     2     3     2022       2055     -33    2240    2338     -58 DL     
 3  2013    11    10     1408       1440     -32    1549    1559     -10 EV     
 4  2013     1    11     1900       1930     -30    2233    2243     -10 DL     
 5  2013     1    29     1703       1730     -27    1947    1957     -10 F9     
 6  2013     8     9      729        755     -26    1002     955       7 MQ     
 7  2013    10    23     1907       1932     -25    2143    2143       0 EV     
 8  2013     3    30     2030       2055     -25    2213    2250     -37 MQ     
 9  2013     3     2     1431       1455     -24    1601    1631     -30 9E     
10  2013     5     5      934        958     -24    1225    1309     -44 B6     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
  • Demo: Now let’s arrange the data by descending departure delay, so the flights with the longest departure delays will be at the top. Hint, run ?desc in the console.
flights |>
  arrange(desc(dep_delay))
# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     9      641        900    1301    1242    1530    1272 HA     
 2  2013     6    15     1432       1935    1137    1607    2120    1127 MQ     
 3  2013     1    10     1121       1635    1126    1239    1810    1109 MQ     
 4  2013     9    20     1139       1845    1014    1457    2210    1007 AA     
 5  2013     7    22      845       1600    1005    1044    1815     989 MQ     
 6  2013     4    10     1100       1900     960    1342    2211     931 DL     
 7  2013     3    17     2321        810     911     135    1020     915 DL     
 8  2013     6    27      959       1900     899    1236    2226     850 DL     
 9  2013     7    22     2257        759     898     121    1026     895 DL     
10  2013    12     5      756       1700     896    1058    2020     878 AA     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
  • Your turn (5 minutes): Create a data frame that only includes the plane tail number (tailnum), carrier (carrier), and departure delay for the flight with the longest departure delay. What is the plane tail number (tailnum) for this flight?
flights |>
  select(tailnum, carrier, dep_delay) |>
  arrange(desc(dep_delay)) |>
  slice(1)
# A tibble: 1 × 3
  tailnum carrier dep_delay
  <chr>   <chr>       <dbl>
1 N384HA  HA           1301

filter()

  • Demo: Filter the data frame by selecting the rows where the destination airport is RDU. Comment the code below.
flights |> 
  filter(dest == "RDU")
# A tibble: 8,163 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      800        810     -10     949     955      -6 MQ     
 2  2013     1     1      832        840      -8    1006    1030     -24 MQ     
 3  2013     1     1      851        851       0    1032    1036      -4 EV     
 4  2013     1     1      917        920      -3    1052    1108     -16 B6     
 5  2013     1     1     1024       1030      -6    1204    1215     -11 MQ     
 6  2013     1     1     1127       1129      -2    1303    1309      -6 EV     
 7  2013     1     1     1157       1205      -8    1342    1345      -3 MQ     
 8  2013     1     1     1240       1235       5    1415    1415       0 MQ     
 9  2013     1     1     1317       1325      -8    1454    1505     -11 MQ     
10  2013     1     1     1449       1450      -1    1651    1640      11 MQ     
# … with 8,153 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Now, run the following code with one equals sign instead of two. Does it still work?

(=) is a Assignment operator while (==) is a Equal to operator. (=) is used for assigning the values from right to left while (==) is used for showing equality between values.

Footnotes

  1. From dplyr vignette↩︎