STA199 Final Project

Proposal

library(tidyverse)

Data 1

Introduction and data

  • Identify the source of the data.

    • The data comes from the Department of Housing and Urban Development’s Office of Policy Development and Research (HUD’s PD&R). (https://www.huduser.gov/portal/datasets/fmr.html)
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • Data was first collected in 1983 and has been collected every year since then. HUD used to collect the data through its own surveys of local housing markets, but housing authorities now conduct surveys on its behalf. HUD also accepts telephone or mail surveys.
  • Write a brief description of the observations.

    • The data is organized by year, then by state and county, and can be viewed for a whole state or for a single county. The variables are year, state, county, zip code, the Fair Market Rent (FMR) for efficiency, one-bedroom, two-bedroom, three-bedroom, and four-bedroom units, and FMR percentile.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    • How has the cost of rent changed post-COVID in North Carolina? 
  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    • The COVID-19 pandemic delivered a devastating blow to the economy, causing increased unemployment and inflation throughout the U.S. The cost of living has risen in many states, and to address this issue in North Carolina, we must first understand its extent. Using North Carolina’s data from 2019, 2020, 2021, and 2022, we will compare year-to-year changes in rent to test our hypothesis that rent increased post-COVID and is making the cost of living unattainable.
  • Identify the types of variables in your research question. Categorical? Quantitative?

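    • Categorical: year (2019, 2020, 2021, 2022; it could also be treated as a discrete quantitative variable) and state (North Carolina). Quantitative: efficiency, one-bedroom, two-bedroom, three-bedroom, and four-bedroom rents, and FMR percentile.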

Literature

  • Find one published credible article on the topic you are interested in researching.

  • Provide a one paragraph summary about the article.

    • This article gives an overview of rent changes in the U.S. and then discusses the counties and zip codes with the highest rent increases. It looks more closely at Arizona and San Francisco, finding that Maricopa and Pinal counties in Arizona had the highest rent increases from 2020 to 2023 and that San Francisco has large rent differences by zip code.
  • In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.

    • Their visualizations give an example of what we are aiming for, but our research question takes a more in-depth look at the changes in North Carolina and at which year had the largest increase.

Glimpse of data

FMR_data_2020 <- read_csv("data/FY20_4050_FMRs_rev.csv")
Rows: 4766 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): fips2010, metro_code, areaname, cousub, countyname, county_town_na...
dbl (13): fmr_0, fmr_1, fmr_2, fmr_3, fmr_4, state, county, pop2017, acs_201...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
FMR_data_2023 <- read_csv("data/FY23_FMRs_revised.csv")
Rows: 4764 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): fips, hud_area_name, hud_area_code, countyname, county_town_name, s...
dbl (8): State, metro, pop2020, fmr_0, fmr_1, fmr_2, fmr_3, fmr_4

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(FMR_data_2020)
Rows: 4,766
Columns: 20
$ fips2010         <chr> "0100199999", "0100399999", "0100599999", "0100799999…
$ fmr_0            <dbl> 583, 744, 477, 804, 804, 474, 474, 468, 486, 462, 594…
$ fmr_1            <dbl> 702, 749, 481, 861, 861, 484, 521, 531, 593, 465, 627…
$ fmr_2            <dbl> 830, 916, 633, 986, 986, 612, 612, 700, 683, 612, 714…
$ fmr_3            <dbl> 1047, 1251, 789, 1291, 1291, 809, 782, 910, 887, 826,…
$ fmr_4            <dbl> 1425, 1566, 925, 1425, 1425, 878, 993, 1072, 977, 829…
$ state            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ metro_code       <chr> "METRO33860M33860", "METRO19300M19300", "NCNTY01005N0…
$ areaname         <chr> "Montgomery, AL MSA", "Daphne-Fairhope-Foley, AL MSA"…
$ county           <dbl> 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29…
$ cousub           <chr> "99999", "99999", "99999", "99999", "99999", "99999",…
$ countyname       <chr> "Autauga County", "Baldwin County", "Barbour County",…
$ county_town_name <chr> "Autauga County", "Baldwin County", "Barbour County",…
$ pop2017          <dbl> 55035, 203360, 26200, 22580, 57665, 10480, 20125, 115…
$ acs_2019_2       <dbl> 825, 888, 666, 873, 873, 628, 628, 676, 691, 628, 628…
$ state_alpha      <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",…
$ fmr_type         <dbl> 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 4…
$ metro            <dbl> 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,…
$ fmr_pct_chg      <dbl> 0.00606061, 0.03153150, -0.04955000, 0.12943872, 0.12…
$ fmr_dollar_chg   <dbl> 5, 28, -33, 113, 113, -16, -16, 24, -8, -16, 86, -45,…
glimpse(FMR_data_2023)
Rows: 4,764
Columns: 14
$ fips             <chr> "0100199999", "0100399999", "0100599999", "0100799999…
$ hud_area_name    <chr> "Montgomery, AL MSA", "Daphne-Fairhope-Foley, AL MSA"…
$ hud_area_code    <chr> "METRO33860M33860", "METRO19300M19300", "NCNTY01005N0…
$ countyname       <chr> "Autauga County", "Baldwin County", "Barbour County",…
$ county_town_name <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ State            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ state_alpha      <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",…
$ metro            <dbl> 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,…
$ pop2020          <dbl> 55639, 218289, 25026, 22374, 57755, 10173, 19726, 114…
$ fmr_0            <dbl> 716, 924, 558, 866, 866, 620, 576, 605, 688, 574, 617…
$ fmr_1            <dbl> 817, 928, 562, 942, 942, 631, 664, 609, 692, 584, 634…
$ fmr_2            <dbl> 977, 1206, 740, 1075, 1075, 820, 761, 802, 911, 759, …
$ fmr_3            <dbl> 1241, 1534, 941, 1376, 1376, 999, 926, 1088, 1109, 95…
$ fmr_4            <dbl> 1595, 1971, 994, 1494, 1494, 1229, 1022, 1178, 1224, …
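
As a rough sketch of how the rent comparison could be set up with the two files loaded above (fiscal years 2020 and 2023), the code below joins the files on their FIPS code and computes the change in two-bedroom FMR. The helper name `fmr_2_change` and the toy data frames are hypothetical; the toy rows are the first three Alabama counties copied from the glimpse output, and the real analysis would pass `FMR_data_2020`, `FMR_data_2023`, and `"NC"` instead. Joining on `fips2010` and `fips` assumes those codes identify the same areas in both files.

```r
library(dplyr)
library(tibble)

# Toy rows copied from the glimpse() output (first three Alabama counties).
# These stand in for FMR_data_2020 and FMR_data_2023.
toy_2020 <- tibble(
  fips2010    = c("0100199999", "0100399999", "0100599999"),
  countyname  = c("Autauga County", "Baldwin County", "Barbour County"),
  state_alpha = c("AL", "AL", "AL"),
  fmr_2       = c(830, 916, 633)
)
toy_2023 <- tibble(
  fips        = c("0100199999", "0100399999", "0100599999"),
  state_alpha = c("AL", "AL", "AL"),
  fmr_2       = c(977, 1206, 740)
)

# Hypothetical helper: change in two-bedroom FMR between two fiscal years
# for one state. Assumes fips2010 (old file) and fips (new file) match.
fmr_2_change <- function(fmr_old, fmr_new, state_abbr) {
  old <- fmr_old %>%
    filter(state_alpha == state_abbr) %>%
    select(fips = fips2010, countyname, fmr_2_old = fmr_2)
  new <- fmr_new %>%
    filter(state_alpha == state_abbr) %>%
    select(fips, fmr_2_new = fmr_2)
  # Keep only areas present in both years, then compute the changes
  inner_join(old, new, by = "fips") %>%
    mutate(
      dollar_change = fmr_2_new - fmr_2_old,
      pct_change    = 100 * dollar_change / fmr_2_old
    )
}

al_change <- fmr_2_change(toy_2020, toy_2023, "AL")
al_change
```

The same helper could be adapted to the other bedroom sizes by swapping `fmr_2` for `fmr_0`, `fmr_1`, `fmr_3`, or `fmr_4`.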

Data 2

Introduction and data

The United States Census Bureau publishes information on voter turnout and registration from the Current Population Survey (CPS) for 1994 to 2020, focusing on November. The dataset is the CPS Voting Supplement, from which one can create charts based on the variables provided by the CPS.

The data is collected through interviews, conducted by phone and/or in person, in which an individual living at the residential address answers questions about race, ethnicity, marital status, age, sex, etc. Some interviews follow up with residents over a longer period.

The Census Bureau allows one to select which type of data to view. The data we are observing is voter turnout by race and ethnicity for each U.S. state. The rows cover the 50 states and the District of Columbia, broken down by five racial groups (White, Black, Asian, Hawaiian/Pacific Islander, and American Indian/Alaskan Native) and by other mixed racial groups (White-Black, White-Asian, AI-Asian, etc.). The columns record responses to “Did you vote,” in which a respondent is classified as No response, Refused, Don’t know, Not in Universe, Yes, or No. Each observation gives, for one racial category in one state, the population count for each response.

Research question

Based on the data presented, what can be concluded about voter turnout in the 2024 election per racial group? How do these outcomes differ based on presidential candidates and the political atmosphere? Does the data show racial inequities per group in different states?

The U.S. is rooted in diversity and in democracy practiced by Americans. During the 2016 election, there was high voter turnout due to presidential candidates speaking out against marginalized communities. In the 2020 election, voter turnout increased by over six million. Over the years, racial inequities have been highlighted through voter suppression, disenfranchisement, restrictions on felon voting rights, voter ID laws, etc. These obstacles prevent Americans from marginalized communities from participating in civic engagement and voting. In addition, the Census continues to show an increasing number of citizens from diverse backgrounds in various states, which can alter the outcomes of elections. This data is important to analyze in order to anticipate possible outcomes of the 2024 elections and to ensure that states with higher levels of inequality assist marginalized communities.

The research will predict voter turnout for the 2024 election by racial group and by state. These predicted turnouts will allow the researchers to see which states have higher levels of racial inequity so that these states can strategize to improve equity. The team hypothesizes that voter turnout will increase due to the current political atmosphere. The team also predicts that marginalized groups are likely to vote at higher rates than usual.

The data consists of categorical variables (states, racial group, responses to “did you vote”) and quantitative outcomes (population per response, state, and racial category). 

Literature

https://www.brennancenter.org/our-work/analysis-opinion/large-racial-turnout-gap-persisted-2020-election

The “Large Racial Turnout Gap Persisted in 2020 Election” article discusses voter turnout by racial group in the 2020 election. The article states that Black, Asian, and Hispanic voters “surpassed their previous turnout records.” In addition, the research found gaps between white voters and voters of other racial groups. It also shows how turnout among racial groups has changed depending on the candidates, such as in the 2008 and 2012 elections, when Obama was a candidate and turnout among Black voters was the highest since 1996.

This research is highly relevant to understanding which states have seen shifts in voter turnout by racial group. It is important to understand not only how voter turnout is changing, but also how political candidates are shifting historical turnout patterns for marginalized racial groups.

Glimpse of data

# Note: we slightly modified the csv files to remove metadata / source information
voter_turnout_2016 <- read_csv("data/VoterTurnout2016.csv")
Rows: 1378 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Demographics- race of respondent (PTDTRACE)
dbl (7): Total, No response, Refused, Don't Know, Not in Universe, Yes, No

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
voter_turnout_2020 <- read_csv("data/VoterTurnout2020.csv")
Rows: 1378 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Demographics- race of respondent (PTDTRACE)
dbl (7): Total, No Response, Refused, Don't Know, Not in Universe, Yes, No

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(voter_turnout_2016)
Rows: 1,378
Columns: 8
$ `Demographics- race of respondent (PTDTRACE)` <chr> "-> Total", "-> Total ->…
$ Total                                         <dbl> 254540226, 3833260, 2720…
$ `No response`                                 <dbl> 24984995, 462829, 332643…
$ Refused                                       <dbl> 3839459, 126805, 102348,…
$ `Don't Know`                                  <dbl> 3836910, 54002, 31105, 2…
$ `Not in Universe`                             <dbl> 30487052, 182051, 115485…
$ Yes                                           <dbl> 137527445, 2097374, 1464…
$ No                                            <dbl> 53864365, 910199, 674179…
glimpse(voter_turnout_2020)
Rows: 1,378
Columns: 8
$ `Demographics- race of respondent (PTDTRACE)` <chr> "-> Total", "-> Total ->…
$ Total                                         <dbl> 261084762, 3885288, 2734…
$ `No Response`                                 <dbl> 29185400, 711377, 455735…
$ Refused                                       <dbl> 3816110, 55945, 35392, 2…
$ `Don't Know`                                  <dbl> 3404002, 42418, 16634, 1…
$ `Not in Universe`                             <dbl> 29495233, 169522, 115385…
$ Yes                                           <dbl> 154575196, 2249413, 1649…
$ No                                            <dbl> 40608821, 656613, 462042…
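
As a rough sketch of how turnout could be compared across the two years, the code below computes a turnout rate for each row and stacks the two years together. Defining turnout as Yes / (Yes + No) is our assumption, and other denominators (for example, including non-responses) are possible. The helper name `turnout_rate` and the toy data frames are hypothetical; the toy rows are the first (“-> Total”) row of each file copied from the glimpse output. The helper selects only the columns it needs, which also sidesteps the `No response` / `No Response` naming difference between the two files.

```r
library(dplyr)
library(tibble)

# Toy rows copied from the glimpse() output (the "-> Total" row of each file).
# These stand in for voter_turnout_2016 and voter_turnout_2020.
toy_2016 <- tibble(
  `Demographics- race of respondent (PTDTRACE)` = "-> Total",
  Yes = 137527445,
  No  = 53864365
)
toy_2020 <- tibble(
  `Demographics- race of respondent (PTDTRACE)` = "-> Total",
  Yes = 154575196,
  No  = 40608821
)

# Hypothetical helper: keep the group label and the Yes/No counts, and
# compute turnout as Yes / (Yes + No) (an assumed definition).
turnout_rate <- function(df, survey_year) {
  df %>%
    transmute(
      year    = survey_year,
      group   = `Demographics- race of respondent (PTDTRACE)`,
      yes     = Yes,
      no      = No,
      turnout = Yes / (Yes + No)
    )
}

# One row per group per year, ready for plotting or comparison
turnout_both <- bind_rows(
  turnout_rate(toy_2016, 2016),
  turnout_rate(toy_2020, 2020)
)
turnout_both
```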

Data 3

Introduction and data

  • Identify the source of the data.

    • The data comes from NYC Slice, a dataset gathered by Liam Quigley, an independent New York City reporter.
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • Liam Quigley himself over an 8-year period gathered data on 464 pizza slices throughout New York City.
  • Write a brief description of the observations.

    • The variables include date, price, location, and style of pizza for 464 slices across New York City. At first glance, it appears that price varied over the years (it is unclear whether there was a consistent increase in price), and pepperoni pizzas generally seem to be more expensive.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    • Have pizza slices increased in price over time? How has the style of pizza impacted its price?
  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    • Data on prices of pizza slices can reveal trends in the prices of everyday goods for everyday citizens and show how changes in the U.S. economy have impacted the cost of living. My hypothesis is that, due to inflation, the price of pizza has clearly increased over the past 8 years.
  • Identify the types of variables in your research question. Categorical? Quantitative?

    • Date, price, and style of pizza. The style of pizza is categorical, while the date and price are quantitative.

Literature

  • Find one published credible article on the topic you are interested in researching.

    • https://www.nytimes.com/2022/08/08/nyregion/inflation-nyc.html
  • Provide a one paragraph summary about the article.

    • This New York Times article provides anecdotes from everyday New Yorkers who struggle with rising food costs due to inflation. Stories of five different New Yorkers across multiple boroughs provide context for the difficulties caused by rising inflation. The article also describes how recent global events such as the pandemic and the war in Ukraine have impacted the prices of food products.
  • In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.

    • The New York Times article presents great anecdotal evidence of rising food prices, but this dataset would contribute a more objective perspective on a common New York City food staple: pizza.

Glimpse of data

nyc_slice <- read_csv("data/nyc_slice_rawdata.csv")
Rows: 464 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Link to IG Post, Name, Date, Date Expanded (times in EST), Price, S...
dbl (4): location_lat, location_lng, Year, Price as number

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(nyc_slice)
Rows: 464
Columns: 11
$ `Link to IG Post`              <chr> "https://www.instagram.com/p/CjszJ-fOP5…
$ Name                           <chr> "Angelo’s Pizza", "Ozone Pizzeria", "Pi…
$ location_lat                   <dbl> 40.62325, 40.68089, 40.60001, 40.71334,…
$ location_lng                   <dbl> -73.93792, -73.84263, -73.99946, -73.82…
$ Date                           <chr> "2022-1014", "2022-1008", "2022-1003", …
$ `Date Expanded (times in EST)` <chr> "Oct 14th 2022, 5:57:51 pm", "Oct 8th 2…
$ Year                           <dbl> 2022, 2022, 2022, 2022, 2022, 2022, 202…
$ `Price as number`              <dbl> 3.00, 3.00, 2.75, 3.25, 1.00, 3.50, 3.0…
$ Price                          <chr> "$3.00", "$3.00", "$2.75", "$3.25", "$1…
$ Style                          <chr> "Plain", "Plain", "Plain", "Plain", "Pl…
$ Notes                          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
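
As a rough sketch of a first step toward both research questions, the code below parses the `Date` column (which appears in the form `2022-1014`, with no separator between month and day) and summarises the mean price by year and style. The toy data frame is hypothetical and holds only the first three rows copied from the glimpse output; the real analysis would start from `nyc_slice`. The date format string assumes every value follows the same year-monthday pattern as the rows shown.

```r
library(dplyr)
library(tibble)

# Toy rows copied from the glimpse() output; these stand in for nyc_slice.
toy_slices <- tibble(
  Date              = c("2022-1014", "2022-1008", "2022-1003"),
  Year              = c(2022, 2022, 2022),
  `Price as number` = c(3.00, 3.00, 2.75),
  Style             = c("Plain", "Plain", "Plain")
)

# Parse dates such as "2022-1014" (assumed pattern: year, dash, month, day)
# and give the analysis columns shorter names.
slices_clean <- toy_slices %>%
  mutate(date = as.Date(Date, format = "%Y-%m%d")) %>%
  rename(price = `Price as number`, year = Year, style = Style)

# Mean price and slice count for each year and style
price_summary <- slices_clean %>%
  group_by(year, style) %>%
  summarise(n_slices = n(), mean_price = mean(price), .groups = "drop")

price_summary
```

With the full dataset, one possible next step would be to plot `price` against `date` or fit a simple linear model of price on date and style to check whether the apparent increase holds up.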