AE-09 Introduction to Probability: Suggested Answers

Application exercise
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   0.3.5
✔ tibble  3.1.8     ✔ dplyr   1.0.9
✔ tidyr   1.2.1     ✔ stringr 1.4.1
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

In today’s activity, we will be working with sleep data. Specifically, we will look at the relationship between myopia and sleeping conditions. Nearsightedness (myopia) is a common vision condition in which near objects appear clear, but objects farther away look blurry.

sleep_data <- tibble( 
  Slept_With = c("Darkness" , "Nightlight", "Full Light", "Total"),
  No_Myopia = c(155,153,34,342),
  Myopia = c(15,72,36,123),
  High_Myopia = c(2,7,5,14),
  Total = c(172,232,75,479))

Question 1

– Give two examples of an event from the data set sleep_data.

A = Sleep with darkness

B = Have myopia

Question 2

– What is the sample space for how an infant slept?

{Darkness, Nightlight, Full Light}

Question 3

Let’s define our event as follows: A = child slept with a nightlight as an infant (“Nightlight”)

– What is the probability that an infant slept with a nightlight?

232/479

– What is the probability of \(A^c\)?

(172+75)/479
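The two answers above can be checked in R. The following is a minimal sketch that rebuilds the sleep_data tibble from the activity and applies the complement rule, \(P(A^c) = 1 - P(A)\). The variable names n_total, n_night, p_a, and p_a_c are illustrative and not part of the original exercise.

```r
library(tibble)
library(dplyr)

# Same table as in the activity; the "Total" row and column hold the margins
sleep_data <- tibble(
  Slept_With = c("Darkness", "Nightlight", "Full Light", "Total"),
  No_Myopia = c(155, 153, 34, 342),
  Myopia = c(15, 72, 36, 123),
  High_Myopia = c(2, 7, 5, 14),
  Total = c(172, 232, 75, 479)
)

# Grand total and the number of infants who slept with a nightlight
n_total <- sleep_data |> filter(Slept_With == "Total") |> pull(Total)
n_night <- sleep_data |> filter(Slept_With == "Nightlight") |> pull(Total)

p_a <- n_night / n_total   # P(A) = 232/479
p_a_c <- 1 - p_a           # complement rule

# The complement agrees with adding up the other two sleeping conditions
all.equal(p_a_c, (172 + 75) / 479)
```

Both routes to \(P(A^c)\) agree because the three sleeping conditions are mutually exclusive and together cover every infant in the table.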

Return of the Penguins

They are back! Please run the code below to re-familiarize yourself with these data. Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.

data(penguins)

glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Question 4

Now let’s make a table looking at the relationship between species and island. Below, comment on what this code is doing.

penguins |>
  count(species, island)
# A tibble: 5 × 3
  species   island        n
  <fct>     <fct>     <int>
1 Adelie    Biscoe       44
2 Adelie    Dream        56
3 Adelie    Torgersen    52
4 Chinstrap Dream        68
5 Gentoo    Biscoe      124

It counts the number of penguins in each observed combination of species and island.
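One way to see what count() is doing is to write the same computation with group_by() and summarise(). This is a sketch, not the course's suggested answer, and it assumes the penguins data come from the palmerpenguins package (the handout does not show which package is loaded). The name counts_long is illustrative.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data frame

# Group by both variables, then count the rows in each group
counts_long <- penguins |>
  group_by(species, island) |>
  summarise(n = n(), .groups = "drop")

counts_long
```

Combinations that never occur, such as Gentoo penguins on Dream, do not appear as rows in this long format.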

To make the contingency table, we will use the tidyr function pivot_wider(). It will take the data frame produced by count(), which is currently in a “long” format, and reshape it to be in a “wide” format. We will also use the kable() function in the knitr package to neatly format our new table.

Use pivot_wider() to create a contingency table. Hint: Use ?pivot_wider to find information on the argument values_fill. When creating the contingency table, include 0s if there are any missing values.

penguins |>
  count(species, island) |>
  pivot_wider(
    names_from = species,
    values_from = n,
    values_fill = 0
  )
# A tibble: 3 × 4
  island    Adelie Chinstrap Gentoo
  <fct>      <int>     <int>  <int>
1 Biscoe        44         0    124
2 Dream         56        68      0
3 Torgersen     52         0      0
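As a cross-check on the table above, the same counts can be produced with base R's table(), and addmargins() appends the row and column totals that the probability questions rely on. This is an alternative sketch rather than the approach used in the exercise. It assumes the palmerpenguins package provides penguins, and the names tab and tab_totals are illustrative.

```r
library(palmerpenguins)  # assumption: source of the penguins data frame

# Cross-tabulate island (rows) by species (columns)
tab <- table(island = penguins$island, species = penguins$species)

# Add a "Sum" row and a "Sum" column holding the marginal totals
tab_totals <- addmargins(tab)

tab_totals
```

The "Sum" margins give the denominators and numerators for single-event probabilities, for example the Biscoe row total over the grand total.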

Question 5

For each of the following exercises:

Calculate the probability using the contingency table above.

Then write code to check your answer using the penguins data frame and dplyr functions.

Hint: Think about creating a new prob column that contains the calculation.

– What is the probability that a random penguin is from the Biscoe island?

Together

penguins |>
  count(island) |>
  mutate(prob = n / nrow(penguins)) |>
  filter(island == "Biscoe")
# A tibble: 1 × 3
  island     n  prob
  <fct>  <int> <dbl>
1 Biscoe   168 0.488
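From the contingency table, the hand calculation is (44 + 0 + 124) / 344 = 168 / 344. A shorter check, sketched below, uses the fact that the mean of a logical vector is the proportion of TRUE values. It assumes the palmerpenguins package provides penguins; p_biscoe is an illustrative name.

```r
library(palmerpenguins)  # assumption: source of the penguins data frame

# Proportion of rows whose island is Biscoe
p_biscoe <- mean(penguins$island == "Biscoe")

# Compare with the hand calculation from the contingency table
all.equal(p_biscoe, 168 / 344)
```

This shortcut works here because island has no missing values; with missing values, mean() would return NA unless na.rm = TRUE is set.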

Together

– What is the probability that a random penguin is not from the Biscoe island?

Let’s introduce pull(). The function pull() extracts a single column from a data frame and returns it as a vector. This makes it useful in combination with the pipe operator and dplyr’s verbs.

penguins |>
  count(island) |>
  filter(island != "Biscoe") |>
  pull(n) |>
  sum() / nrow(penguins)
[1] 0.5116279

These probabilities can be calculated in more than one way. Let’s do it again.

penguins |>
  count(island) |>
  filter(island != "Biscoe") |>
  select(n) |>
  colSums() / nrow(penguins)
        n 
0.5116279 
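A third route is the complement rule: "not from Biscoe" is the complement of "from Biscoe", so its probability is one minus the Biscoe probability. The sketch below assumes the palmerpenguins package provides penguins; the variable names are illustrative.

```r
library(palmerpenguins)  # assumption: source of the penguins data frame

# P(Biscoe), then its complement
p_biscoe <- mean(penguins$island == "Biscoe")
p_not_biscoe <- 1 - p_biscoe

# Matches adding the Dream and Torgersen counts: (124 + 52) / 344
all.equal(p_not_biscoe, (124 + 52) / 344)
```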

– What is the probability that a random penguin is of the Chinstrap species?

penguins |>
  count(species) |>
  mutate(prob = n / nrow(penguins)) |>
  filter(species == "Chinstrap")
# A tibble: 1 × 3
  species       n  prob
  <fct>     <int> <dbl>
1 Chinstrap    68 0.198

– What is the probability that a random penguin is not of the Chinstrap species?

penguins |>
  count(species) |>
  filter(species != "Chinstrap") |>
  pull(n) |>
  sum() / nrow(penguins)
[1] 0.8023256

Extension Questions

– What is the probability that a random penguin is on the Biscoe island and is of the Adelie species?

– What is the probability that a random penguin is on the Biscoe island or is of the Adelie species?

How are these questions different from above?

These questions consider more than one event

First, define your events below:

A - penguin is on Biscoe island

B - penguin is of the Adelie species

Write out these questions in proper notation.

P(A and B)

P(A or B)

Perform the calculations in your console by typing the values out.
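One possible way to type the values out, reading the counts from the contingency table: 44 Adelie penguins live on Biscoe, Biscoe has 44 + 0 + 124 = 168 penguins, Adelie has 44 + 56 + 52 = 152 penguins, and there are 344 penguins in total. The "or" calculation below uses the addition rule, P(A or B) = P(A) + P(B) - P(A and B), which is offered here as a sketch rather than as the course's stated answer. The variable names are illustrative.

```r
# P(A and B): Adelie penguins on Biscoe, out of all penguins
p_a_and_b <- 44 / 344

# P(A or B): add the two marginal counts, then subtract the overlap
# so that the 44 Adelie penguins on Biscoe are not counted twice
p_a_or_b <- (168 + 152 - 44) / 344

p_a_and_b  # approximately 0.128
p_a_or_b   # approximately 0.802
```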

Next, confirm your calculations in R by writing code.

Remind yourself of your contingency table here:

penguins |>
  count(species, island) |>
  pivot_wider(names_from = species,
              values_from = n,
              values_fill = 0) |>
  kable()
island      Adelie   Chinstrap   Gentoo
Biscoe          44           0      124
Dream           56          68        0
Torgersen       52           0        0
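The two extension probabilities can be confirmed from the raw data with logical conditions inside summarise(): & encodes "and", | encodes "or", and mean() turns each logical column into a proportion. This is a minimal sketch, assuming the palmerpenguins package provides penguins; the names joint_probs, p_a_and_b, and p_a_or_b are illustrative.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data frame

joint_probs <- penguins |>
  summarise(
    # A and B: on Biscoe AND of the Adelie species
    p_a_and_b = mean(island == "Biscoe" & species == "Adelie"),
    # A or B: on Biscoe OR of the Adelie species (or both)
    p_a_or_b = mean(island == "Biscoe" | species == "Adelie")
  )

joint_probs
```

From the table, these should equal 44 / 344 and (168 + 152 - 44) / 344 respectively.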

Stopped Here

More definitions

Population: the entire group you want to learn about. Often, it’s useful to think of the population as the “truth”.

Sample: the subset of the population that you actually observe and from which you draw inferences about the population.
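To make the distinction concrete, the sketch below treats the 344 penguins as if they were the whole population (an assumption made purely for illustration, since the penguins data are themselves a sample) and draws a random sample of 50 from it. It assumes the palmerpenguins package provides penguins; the seed, the sample size, and the variable names are arbitrary choices.

```r
library(dplyr)
library(palmerpenguins)  # assumption: source of the penguins data frame

set.seed(1)  # arbitrary seed so the sample is reproducible

# "Truth": the proportion of Biscoe penguins in the pretend population
p_population <- mean(penguins$island == "Biscoe")

# A random sample of 50 penguins, and the same proportion estimated from it
penguin_sample <- penguins |> slice_sample(n = 50)
p_sample <- mean(penguin_sample$island == "Biscoe")

c(population = p_population, sample = p_sample)
```

The sample proportion will usually be close to, but not exactly equal to, the population proportion, and it changes from sample to sample.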