library(tidyverse)
library(readr)
Project title
Proposal
Data 1
This data came from the Substance Abuse and Mental Health Data Archive.
It was originally collected from individual states through the National Survey on Drug Use and Health (NSDUH), with data collection spanning 2002 to 2018. It was intended to provide information about substance abuse nationwide to better inform public health.
This data shows substance use across different age groups and states in the U.S. The substances covered are alcohol, tobacco (including cigarettes), marijuana, and cocaine. Each observation is one state in one year and includes the population of three age groups (12-17, 18-25, and 26+), as well as the totals and rates of each type of substance use in each age group.
Introduction and data
Identify the source of the data.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
Write a brief description of the observations.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
A description of the research topic along with a concise statement of your hypotheses on this topic.
Identify the types of variables in your research question. Categorical? Quantitative?
Literature
Find one published credible article on the topic you are interested in researching.
Provide a one paragraph summary about the article.
The prevalence of substance abuse disorders has been widely considered a public health crisis, particularly with the rise of opioid overdose deaths in 2016. In the United States, the majority of the population will use some substance, and among users, nicotine, alcohol, and cannabis are the most common drugs of choice. While cigarette smoking has declined significantly in the past few decades, nicotine and cannabis vaping had risen more than 2-fold by 2017. Alcohol consumption dipped briefly and then increased substantially over the past decade, which the authors suggest may be partly due to the increased accessibility and time available during the COVID-19 pandemic. Cannabis use has increased, especially with the rise of legalization across US states. After cannabis, opioids and then heroin were the most common illicit substances, both of which are linked to high rates of overdose. Experts are shifting toward understanding substance abuse disorder as characterized by neurobiological and social risk factors.
In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.
Glimpse of data
drugs <- read_csv("data/drugs (1).csv")
Rows: 867 Columns: 53
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): State
dbl (52): Year, Population.12-17, Population.18-25, Population.26+, Totals.A...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(drugs)
Rows: 867
Columns: 53
$ State <chr> "Alabama", "Alaska…
$ Year <dbl> 2002, 2002, 2002, …
$ `Population.12-17` <dbl> 380805, 69400, 485…
$ `Population.18-25` <dbl> 499453, 62791, 602…
$ `Population.26+` <dbl> 2812905, 368460, 3…
$ `Totals.Alcohol.Use Disorder Past Year.12-17` <dbl> 18, 4, 36, 14, 173…
$ `Totals.Alcohol.Use Disorder Past Year.18-25` <dbl> 68, 12, 117, 53, 5…
$ `Totals.Alcohol.Use Disorder Past Year.26+` <dbl> 138, 27, 258, 101,…
$ `Rates.Alcohol.Use Disorder Past Year.12-17` <dbl> 0.048336, 0.061479…
$ `Rates.Alcohol.Use Disorder Past Year.18-25` <dbl> 0.136490, 0.187891…
$ `Rates.Alcohol.Use Disorder Past Year.26+` <dbl> 0.049068, 0.073677…
$ `Totals.Alcohol.Use Past Month.12-17` <dbl> 57, 11, 91, 39, 48…
$ `Totals.Alcohol.Use Past Month.18-25` <dbl> 254, 38, 352, 162,…
$ `Totals.Alcohol.Use Past Month.26+` <dbl> 1048, 206, 1774, 6…
$ `Rates.Alcohol.Use Past Month.12-17` <dbl> 0.150033, 0.158988…
$ `Rates.Alcohol.Use Past Month.18-25` <dbl> 0.509551, 0.598311…
$ `Rates.Alcohol.Use Past Month.26+` <dbl> 0.372703, 0.559151…
$ `Totals.Tobacco.Cigarette Past Month.12-17` <dbl> 52, 9, 62, 37, 235…
$ `Totals.Tobacco.Cigarette Past Month.18-25` <dbl> 196, 28, 234, 154,…
$ `Totals.Tobacco.Cigarette Past Month.26+` <dbl> 728, 92, 919, 539,…
$ `Rates.Tobacco.Cigarette Past Month.12-17` <dbl> 0.136906, 0.132517…
$ `Rates.Tobacco.Cigarette Past Month.18-25` <dbl> 0.392404, 0.439749…
$ `Rates.Tobacco.Cigarette Past Month.26+` <dbl> 0.258844, 0.249578…
$ `Totals.Illicit Drugs.Cocaine Used Past Year.12-17` <dbl> 6, 2, 16, 4, 53, 1…
$ `Totals.Illicit Drugs.Cocaine Used Past Year.18-25` <dbl> 27, 5, 51, 18, 259…
$ `Totals.Illicit Drugs.Cocaine Used Past Year.26+` <dbl> 49, 5, 86, 26, 410…
$ `Rates.Illicit Drugs.Cocaine Used Past Year.12-17` <dbl> 0.016556, 0.024400…
$ `Rates.Illicit Drugs.Cocaine Used Past Year.18-25` <dbl> 0.054892, 0.083680…
$ `Rates.Illicit Drugs.Cocaine Used Past Year.26+` <dbl> 0.017513, 0.013838…
$ `Totals.Marijuana.New Users.12-17` <dbl> 20, 4, 25, 13, 158…
$ `Totals.Marijuana.New Users.18-25` <dbl> 18, 2, 18, 10, 126…
$ `Totals.Marijuana.New Users.26+` <dbl> 2, 0, 3, 1, 17, 3,…
$ `Rates.Marijuana.New Users.12-17` <dbl> 0.059732, 0.077736…
$ `Rates.Marijuana.New Users.18-25` <dbl> 0.062325, 0.084250…
$ `Rates.Marijuana.New Users.26+` <dbl> 0.000914, 0.001625…
$ `Totals.Marijuana.Used Past Month.12-17` <dbl> 24, 8, 38, 19, 241…
$ `Totals.Marijuana.Used Past Month.18-25` <dbl> 62, 15, 91, 50, 63…
$ `Totals.Marijuana.Used Past Month.26+` <dbl> 73, 26, 122, 57, 9…
$ `Rates.Marijuana.Used Past Month.12-17` <dbl> 0.063662, 0.110781…
$ `Rates.Marijuana.Used Past Month.18-25` <dbl> 0.124672, 0.239907…
$ `Rates.Marijuana.Used Past Month.26+` <dbl> 0.025967, 0.071362…
$ `Totals.Marijuana.Used Past Year.12-17` <dbl> 49, 13, 82, 37, 44…
$ `Totals.Marijuana.Used Past Year.18-25` <dbl> 119, 24, 166, 87, …
$ `Totals.Marijuana.Used Past Year.26+` <dbl> 141, 46, 215, 104,…
$ `Rates.Marijuana.Used Past Year.12-17` <dbl> 0.127535, 0.188730…
$ `Rates.Marijuana.Used Past Year.18-25` <dbl> 0.237880, 0.389026…
$ `Rates.Marijuana.Used Past Year.26+` <dbl> 0.050275, 0.124566…
$ `Totals.Tobacco.Use Past Month.12-17` <dbl> 63, 11, 73, 46, 29…
$ `Totals.Tobacco.Use Past Month.18-25` <dbl> 226, 30, 240, 169,…
$ `Totals.Tobacco.Use Past Month.26+` <dbl> 930, 112, 1032, 66…
$ `Rates.Tobacco.Use Past Month.12-17` <dbl> 0.166578, 0.163918…
$ `Rates.Tobacco.Use Past Month.18-25` <dbl> 0.451976, 0.484270…
$ `Rates.Tobacco.Use Past Month.26+` <dbl> 0.330659, 0.304220…
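Because each substance, measure, and age group has its own column, comparing rates across age groups is easier after reshaping to a long format. The following is a minimal sketch, not the project's final analysis: `drugs_excerpt` and `alcohol_long` are illustrative names, and the toy tibble contains only the Alabama and Alaska values for 2002 that are visible in the glimpse output above. In practice the same `pivot_longer()` call would be applied to `drugs`.

```r
library(tidyverse)

# Toy excerpt: first two rows visible in the glimpse() output above
# (Alabama and Alaska, 2002). Replace with `drugs` in the real analysis.
drugs_excerpt <- tibble(
  State = c("Alabama", "Alaska"),
  Year = c(2002, 2002),
  `Rates.Alcohol.Use Past Month.12-17` = c(0.150033, 0.158988),
  `Rates.Alcohol.Use Past Month.18-25` = c(0.509551, 0.598311),
  `Rates.Alcohol.Use Past Month.26+` = c(0.372703, 0.559151)
)

# Reshape to one row per state, year, and age group.
# names_prefix is a regular expression, so the dots are escaped.
alcohol_long <- drugs_excerpt |>
  pivot_longer(
    cols = starts_with("Rates.Alcohol.Use Past Month"),
    names_to = "age_group",
    names_prefix = "Rates\\.Alcohol\\.Use Past Month\\.",
    values_to = "rate"
  )

alcohol_long
```

The long format makes `age_group` a categorical variable and `rate` a quantitative one, which fits naturally with grouped summaries and faceted plots.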
Data 2
Introduction and data
The data comes from the Washington Post. It was published with the intent to bolster the evidence base on police killings for the Black Lives Matter movement.
The data was originally collected by manually combing through local news reports and combining information from law enforcement websites, social media, and other databases (including Fatal Encounters and the “Killed by Police” project). Data collection started in 2015, spurred by a slew of fatal shootings, and the information was updated in 2022.
The observations include details about police-involved killings in the United States. The variables include race, age, gender, armed vs. not armed status, location, and whether the person killed had a mental illness. The observations focus primarily on key descriptions of the person killed, but they also include some details about the police involved (including the presence or absence of a police body camera and the threat level of the person as perceived by police).
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
A description of the research topic along with a concise statement of your hypotheses on this topic.
Identify the types of variables in your research question. Categorical? Quantitative?
Literature
Find one published credible article on the topic you are interested in researching.
Provide a one paragraph summary about the article.
As police brutality and violence have come to the national forefront, data has indicated that the burden of fatal police shootings falls disproportionately on BIPOC in terms of mortality and years of life lost (YLL). Data was sourced from the Washington Post repository on fatal police shootings between 2015 and 2020, which depends on curated news reports and thus may exclude necessary data such as gender and minority status. During this time interval, 5367 fatalities were recorded, 4740 included sufficient racial data for analysis, and 4653 included both sufficient racial and age data for YLL calculation. Contrary to popular hypotheses, while there was a small decline in deaths of white victims, there was no significant trend in death rates among all other race/ethnic groups (i.e., rates were stable across the 5-year interval). In order, mean deaths pqpm were as follows: highest among Native Americans (1.74), then Blacks (1.49), Hispanics (0.74), Whites (0.57), and Asians (0.25). The authors call for the treatment of police violence as a public health crisis and suggest police demilitarization as a potential intervention.
In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.
Glimpse of data
police_shootings <- read_csv("data/police_shootings (1).csv")
Rows: 6569 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): Person.Name, Person.Gender, Person.Race, Incident.Location.City, I...
dbl (4): Person.Age, Incident.Date.Month, Incident.Date.Day, Incident.Date....
lgl (2): Factors.Mental-Illness, Shooting.Body-Camera
date (1): Incident.Date.Full
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(police_shootings)
Rows: 6,569
Columns: 16
$ Person.Name <chr> "Tim Elliot", "Lewis Lee Lembke", "John Paul …
$ Person.Age <dbl> 53, 47, 23, 32, 39, 18, 22, 35, 34, 47, 25, 3…
$ Person.Gender <chr> "Male", "Male", "Male", "Male", "Male", "Male…
$ Person.Race <chr> "Asian", "White", "Hispanic", "White", "Hispa…
$ Incident.Date.Month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ Incident.Date.Day <dbl> 2, 2, 3, 4, 4, 4, 5, 6, 6, 6, 6, 7, 7, 7, 7, …
$ Incident.Date.Year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 201…
$ Incident.Date.Full <date> 2015-01-02, 2015-01-02, 2015-01-03, 2015-01-…
$ Incident.Location.City <chr> "Shelton", "Aloha", "Wichita", "San Francisco…
$ Incident.Location.State <chr> "WA", "OR", "KS", "CA", "CO", "OK", "AZ", "KS…
$ Factors.Armed <chr> "gun", "gun", "unarmed", "toy weapon", "nail …
$ `Factors.Mental-Illness` <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
$ `Factors.Threat-Level` <chr> "attack", "attack", "other", "attack", "attac…
$ Factors.Fleeing <chr> "Not fleeing", "Not fleeing", "Not fleeing", …
$ Shooting.Manner <chr> "shot", "shot", "shot and Tasered", "shot", "…
$ `Shooting.Body-Camera` <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRU…
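As a first look at the categorical variables described above, the number and share of recorded deaths by race can be tabulated. This is a hedged sketch: `shootings_excerpt` and `race_counts` are illustrative names, and the toy tibble holds only the first four values visible in the glimpse output, so the resulting shares say nothing about the full data. In practice the same pipeline would be run on `police_shootings`.

```r
library(tidyverse)

# Toy excerpt: first four rows visible in the glimpse() output above.
# Replace with `police_shootings` in the real analysis.
shootings_excerpt <- tibble(
  Person.Age = c(53, 47, 23, 32),
  Person.Race = c("Asian", "White", "Hispanic", "White"),
  Factors.Armed = c("gun", "gun", "unarmed", "toy weapon")
)

# Count deaths per race and convert the counts to shares of the total.
race_counts <- shootings_excerpt |>
  count(Person.Race, name = "n_killed") |>
  mutate(share = n_killed / sum(n_killed))

race_counts
```

Raw counts and shares are not population-adjusted. Comparing groups the way the cited article does would require dividing by each group's population, which this data set does not include.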
Data 3
Introduction and data
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
A description of the research topic along with a concise statement of your hypotheses on this topic.
Identify the types of variables in your research question. Categorical? Quantitative?
Literature
Find one published credible article on the topic you are interested in researching.
Freund, Caroline and Oliver, Sarah, The Origins of the Superrich: The Billionaire Characteristics Database (February 1, 2016). Peterson Institute for International Economics Working Paper No. 16-1, Available at SSRN: https://ssrn.com/abstract=2731353 or http://dx.doi.org/10.2139/ssrn.2731353
Provide a one paragraph summary about the article.
This article provides an overview of the new billionaires dataset as a starting point for analyzing changes in extreme wealth across advanced countries. The data largely comes from the Forbes World’s Billionaires list from 1996-2015. Thus far, the authors have identified three significant trends: 1) wealth is growing faster in emerging markets; 2) wealth is increasingly self-made; 3) there are large regional differences in wealth trends. In the US, for example, extreme wealth is more likely to be the result of financial and investing endeavors, while European wealth continues to be, for the most part, inherited. The authors also argue that the rising total net worth of billionaires is driven less by existing billionaires acquiring more wealth and more by the increase in the total number of billionaires worldwide. Studying billionaires and the accumulation of wealth is important for understanding global wealth distributions and income inequality, and it allows experts to suggest means of potential redistribution and benevolent problem-solving targets.
In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.
Glimpse of data
billionaires <- read_csv("data/billionaires (1).csv")
Rows: 2614 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): name, company.name, company.relationship, company.sector, company....
dbl (6): rank, year, company.founded, demographics.age, location.gdp, wealt...
lgl (3): wealth.how.from emerging, wealth.how.was founder, wealth.how.was p...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(billionaires)
Rows: 2,614
Columns: 22
$ name <chr> "Bill Gates", "Bill Gates", "Bill Gates", "…
$ rank <dbl> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5…
$ year <dbl> 1996, 2001, 2014, 1996, 2001, 2014, 1996, 2…
$ company.founded <dbl> 1975, 1975, 1975, 1962, 1962, 1990, 1896, 1…
$ company.name <chr> "Microsoft", "Microsoft", "Microsoft", "Ber…
$ company.relationship <chr> "founder", "founder", "founder", "founder",…
$ company.sector <chr> "Software", "Software", "Software", "Financ…
$ company.type <chr> "new", "new", "new", "new", "new", "privati…
$ demographics.age <dbl> 40, 45, 58, 65, 70, 74, 0, 48, 77, 68, 56, …
$ demographics.gender <chr> "male", "male", "male", "male", "male", "ma…
$ location.citizenship <chr> "United States", "United States", "United S…
$ `location.country code` <chr> "USA", "USA", "USA", "USA", "USA", "MEX", "…
$ location.gdp <dbl> 8.10e+12, 1.06e+13, 0.00e+00, 8.10e+12, 1.0…
$ location.region <chr> "North America", "North America", "North Am…
$ wealth.type <chr> "founder non-finance", "founder non-finance…
$ `wealth.worth in billions` <dbl> 18.5, 58.7, 76.0, 15.0, 32.3, 72.0, 13.1, 3…
$ wealth.how.category <chr> "New Sectors", "New Sectors", "New Sectors"…
$ `wealth.how.from emerging` <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
$ wealth.how.industry <chr> "Technology-Computer", "Technology-Computer…
$ wealth.how.inherited <chr> "not inherited", "not inherited", "not inhe…
$ `wealth.how.was founder` <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
$ `wealth.how.was political` <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
# colnames(billionaires)
# unique(billionaires[,6]) |> print(n = 74)
billionaires |>
  count(year)
# A tibble: 3 × 2
year n
<dbl> <int>
1 1996 423
2 2001 538
3 2014 1653
#select(`wealth.how.was political`)# |>
#unique()
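The glimpse output above shows a 0 in both `demographics.age` and `location.gdp`, which is implausible as a real value. Assuming (this is not confirmed by the data documentation) that 0 is a placeholder for missing data, those values could be recoded to `NA` before computing summaries. The sketch below uses a toy tibble, `billionaires_excerpt`, built from the first three rows visible in the glimpse output. Both it and `billionaires_clean` are illustrative names.

```r
library(tidyverse)

# Toy excerpt: first three rows visible in the glimpse() output above.
# Replace with `billionaires` in the real analysis.
billionaires_excerpt <- tibble(
  name = rep("Bill Gates", 3),
  year = c(1996, 2001, 2014),
  demographics.age = c(40, 45, 58),
  location.gdp = c(8.10e12, 1.06e13, 0)
)

# ASSUMPTION: a value of 0 means "not recorded", so recode it to NA.
billionaires_clean <- billionaires_excerpt |>
  mutate(
    demographics.age = na_if(demographics.age, 0),
    location.gdp = na_if(location.gdp, 0)
  )

billionaires_clean
```

Recoding first keeps placeholder zeros from pulling down means or distorting plots. The assumption should be checked against the data source before it is relied on.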