An Exploration of Health Through Air Quality Measures and Substance Abuse

Proposal

library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.2.2
Warning: package 'tidyr' was built under R version 4.2.2
Warning: package 'readr' was built under R version 4.2.2
Warning: package 'purrr' was built under R version 4.2.2

Data 1

Introduction and data

  • Identify the source of the data.

    • The data is from the Centers for Disease Control and Prevention (CDC), and it is titled “Air Quality Measures on the National Environmental Health Tracking Network.”
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The data was originally collected by the Environmental Health Tracking Network. It provides air pollution values collected through the Air Quality System (AQS) from 4,000 monitoring stations across the United States in the years 2001-2011.
  • Write a brief description of the observations.

    • The data set provides information about the amount of air pollution in urban areas in states and certain counties across the United States. Each observation is one recording of air pollution at a monitoring station, identified by its state and county.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    • How did air pollution levels from 2008 to 2011 differ between the West, Southwest, Midwest, Southeast, and Northeast regions of the mainland United States, as defined by the National Geographic Society?
  • A description of the research topic along with a concise statement of your hypotheses on this topic.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    • The air quality variable in our research question is quantitative, while the region variable in our research question is categorical.
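
The region variable does not exist in the raw data, so it would have to be derived from StateName. A minimal sketch of how that could look, assuming a hand-built lookup table: `region_lookup` and `air_toy` are hypothetical names, the lookup covers only five states for illustration, and the values are made up. The real data also appears to hold more than one measure (see MeasureName), so a single measure would need to be selected before averaging.

```r
library(dplyr)

# Hypothetical, partial state-to-region lookup. The full table would need every
# mainland state assigned according to the National Geographic Society regions.
region_lookup <- tibble::tribble(
  ~StateName,   ~region,
  "California", "West",
  "Texas",      "Southwest",
  "Ohio",       "Midwest",
  "Alabama",    "Southeast",
  "New York",   "Northeast"
)

# Toy rows that reuse column names from air_quality (values are made up).
air_toy <- tibble::tibble(
  StateName  = c("California", "California", "Texas", "Ohio", "Alabama", "New York"),
  ReportYear = c(2008, 2009, 2010, 2011, 2007, 2008),
  Value      = c(40, 30, 20, 10, 99, 5)
)

# Keep the years in the research question, attach regions, then summarise.
region_summary <- air_toy |>
  filter(ReportYear >= 2008, ReportYear <= 2011) |>
  inner_join(region_lookup, by = "StateName") |>
  group_by(region) |>
  summarise(mean_value = mean(Value), n_obs = n(), .groups = "drop")

print(region_summary)
```

With the real data, `air_toy` would be replaced by `air_quality` after filtering to one measure.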

Literature

  • Find one published credible article on the topic you are interested in researching.

    • Air Pollution Index Systems in the United States and Canada

    • https://www.tandfonline.com/doi/abs/10.1080/00022470.1976.10470272

  • Provide a one paragraph summary about the article.

    • This study surveyed all available air pollution indices through literature review and conversations with air pollution control agencies in the US and found that 35 out of 55 metropolitan air pollution control agencies used some form of daily air pollution index. The study developed a system to classify the indices into 14 different types based on four criteria. It was found that no two indices were exactly the same. The survey results and agency comments were used to identify characteristics and criteria for a uniform air pollution index.
  • In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.

    • This study focused on building an air pollution index across metropolitan areas in the US, which we can contrast with our findings in metropolitan areas as well as in more rural regions of the US. In addition, we can analyze potential factors that lead to certain indices being higher than those of their region.

Glimpse of data

air_quality <- read_csv("data/Air_Quality_Measures_on_the_National_Environmental_Health_Tracking_Network (1).csv")
Rows: 218635 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): MeasureName, MeasureType, StratificationLevel, StateName, CountyNam...
dbl (5): MeasureId, StateFips, CountyFips, ReportYear, MonitorOnly
num (1): Value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(air_quality)
Rows: 218,635
Columns: 14
$ MeasureId           <dbl> 83, 83, 83, 83, 83, 83, 83, 83, 83, 83, 83, 83, 83…
$ MeasureName         <chr> "Number of days with maximum 8-hour average ozone …
$ MeasureType         <chr> "Counts", "Counts", "Counts", "Counts", "Counts", …
$ StratificationLevel <chr> "State x County", "State x County", "State x Count…
$ StateFips           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 4, 4, 4, 4, 4, 4, 4,…
$ StateName           <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alaba…
$ CountyFips          <dbl> 1027, 1051, 1073, 1079, 1089, 1097, 1101, 1117, 11…
$ CountyName          <chr> "Clay", "Elmore", "Jefferson", "Lawrence", "Madiso…
$ ReportYear          <dbl> 1999, 1999, 1999, 1999, 1999, 1999, 1999, 1999, 19…
$ Value               <dbl> 33, 5, 39, 28, 31, 32, 15, 45, 3, 0, 1, 5, 10, 85,…
$ Unit                <chr> "No Units", "No Units", "No Units", "No Units", "N…
$ UnitName            <chr> "No Units", "No Units", "No Units", "No Units", "N…
$ DataOrigin          <chr> "Monitor Only", "Monitor Only", "Monitor Only", "M…
$ MonitorOnly         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

Data 2

Introduction and data

  • Identify the source of the data.

    • The population data is from the World Census and the data about death risk factors was acquired from the World Health Organization.
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The World Census estimates population based on census, survey, and administrative information. The World Health Organization death risk factors data was reported by member states from Civil Registration and Vital Statistics, which records births, deaths, and causes of death.
  • Write a brief description of the observations.

    • Each row represents one country in one year, and each column represents a cause of death, such as nutrition, alcohol/drug abuse, and air pollution, among others.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    • What health risk factors (like diet, health, physical activity, etc.) have the greatest association with substance abuse deaths among the most populated countries in recent years?
  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    • The research topic is how different health factors are related to substance abuse. We hypothesize that more substance abuse deaths are associated with more deaths from nutrition, smoking, and vitamin deficiencies, representing a pattern of poor choices and habits.
  • Identify the types of variables in your research question. Categorical? Quantitative?

    • The variables in this question are mostly quantitative, since the numbers of deaths from substance abuse and from other factors are all continuous variables. The response variable will be deaths from substance abuse, while the predictor variables could include deaths caused by other factors.
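
One way to screen candidate predictors is to correlate each risk-factor column with the response. A minimal sketch, assuming the response is defined as the sum of the `Alcohol use` and `Drug use` columns (our assumption, not something the data source prescribes): `risk_toy` is a hypothetical stand-in for risk_factors with made-up values.

```r
library(dplyr)

# Toy rows that reuse column names from risk_factors (values are made up).
risk_toy <- tibble::tibble(
  Entity                  = c("A", "B", "C", "D"),
  `Alcohol use`           = c(10, 20, 30, 40),
  `Drug use`              = c(1, 2, 3, 4),
  Smoking                 = c(100, 210, 290, 405),
  `Low physical activity` = c(50, 40, 30, 20)
)

# Assumed definition of the response: alcohol deaths plus drug deaths.
with_response <- risk_toy |>
  mutate(substance_deaths = `Alcohol use` + `Drug use`)

# Pearson correlation of each candidate predictor with the response.
predictor_cors <- with_response |>
  summarise(across(c(Smoking, `Low physical activity`),
                   ~ cor(.x, substance_deaths)))

print(predictor_cors)
```

Raw death counts scale with population, so correlations on counts may mostly reflect country size; per-capita rates are likely a better basis for comparison.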

Literature

  • Find one published credible article on the topic you are interested in researching.

    • Global Burden Of Disease Studies: Implications For Mental And Substance Use Disorders

    • https://www.healthaffairs.org/doi/full/10.1377/hlthaff.2016.0082

  • Provide a one paragraph summary about the article.

    • This paper explores a number of mental and substance use disorders along with their prevalence and likelihood to contribute to excessive mortality given a population. The contributions to mortality are calculated using comparative risk assessment for risk-factor analysis detailed in the study. The study concluded that in most countries, mental health as a policy area doesn’t have the priority that is commensurate with the extent of its burden and the potential to reduce that burden.
  • In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.

    • Our research question will identify relationships between substance abuse problems and other risk factors such as physical exercise, nutrition, and chronic diseases. This study will serve as a baseline against which we can observe which countries have higher or lower mortality rates as a result of their substance abuse prevalence.

Glimpse of data

risk_factors <- read_csv("data/number-of-deaths-by-risk-factor.csv")
Rows: 6468 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Entity, Code
dbl (30): Year, Unsafe water source, Unsafe sanitation, No access to handwas...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
population <- read_csv("data/population.csv")
Rows: 13584 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): CountryCode
dbl (2): Year, total_population

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(risk_factors)
Rows: 6,468
Columns: 32
$ Entity                                     <chr> "Afghanistan", "Afghanistan…
$ Code                                       <chr> "AFG", "AFG", "AFG", "AFG",…
$ Year                                       <dbl> 1990, 1991, 1992, 1993, 199…
$ `Unsafe water source`                      <dbl> 7554.050, 7359.677, 7650.43…
$ `Unsafe sanitation`                        <dbl> 5887.748, 5732.770, 5954.80…
$ `No access to handwashing facility`        <dbl> 5412.315, 5287.891, 5506.65…
$ `Household air pollution from solid fuels` <dbl> 22388.50, 22128.76, 22873.7…
$ `Non-exclusive breastfeeding`              <dbl> 3221.139, 3150.560, 3331.34…
$ `Discontinued breastfeeding`               <dbl> 156.09755, 151.53985, 156.6…
$ `Child wasting`                            <dbl> 22778.85, 22292.69, 23102.2…
$ `Child stunting`                           <dbl> 10408.439, 10271.976, 10618…
$ `Low birth weight for gestation`           <dbl> 12168.56, 12360.64, 13459.5…
$ `Secondhand smoke`                         <dbl> 4234.808, 4219.597, 4371.90…
$ `Alcohol use`                              <dbl> 356.5293, 320.5985, 293.257…
$ `Drug use`                                 <dbl> 208.3254, 217.7697, 247.833…
$ `Diet low in fruits`                       <dbl> 8538.964, 8642.847, 8961.52…
$ `Diet low in vegetables`                   <dbl> 7678.718, 7789.773, 8083.23…
$ `Unsafe sex`                               <dbl> 387.1676, 394.4483, 422.453…
$ `Low physical activity`                    <dbl> 4221.303, 4252.630, 4347.33…
$ `High fasting plasma glucose`              <dbl> 21610.07, 21824.94, 22418.7…
$ `High total cholesterol`                   <dbl> 9505.532, NA, NA, NA, NA, 1…
$ `High body-mass index`                     <dbl> 7701.581, 7747.775, 7991.01…
$ `High systolic blood pressure`             <dbl> 28183.98, 28435.40, 29173.6…
$ Smoking                                    <dbl> 6393.667, 6429.253, 6561.05…
$ `Iron deficiency`                          <dbl> 726.4313, 739.2458, 873.485…
$ `Vitamin A deficiency`                     <dbl> 9344.132, 9330.182, 9769.84…
$ `Low bone mineral density`                 <dbl> 374.8441, 379.8542, 388.130…
$ `Air pollution`                            <dbl> 26598.01, 26379.53, 27263.1…
$ `Outdoor air pollution`                    <dbl> 4383.83, 4426.36, 4568.91, …
$ `Diet high in sodium`                      <dbl> 2737.198, 2741.185, 2798.56…
$ `Diet low in whole grains`                 <dbl> 11381.38, 11487.83, 11866.2…
$ `Diet low in nuts and seeds`               <dbl> 7299.867, 7386.764, 7640.62…
glimpse(population)
Rows: 13,584
Columns: 3
$ CountryCode      <chr> "ARB", "CSS", "CEB", "EAS", "EAP", "EMU", "ECS", "ECA…
$ Year             <dbl> 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960,…
$ total_population <dbl> 92495902, 4190810, 91401583, 1042475394, 896492991, 2…
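
The two tables share a country code and a year, so they can be joined on those keys to express deaths per capita and to pick out the most populated countries. A minimal sketch under stated assumptions: `risk_toy` and `pop_toy` are hypothetical stand-ins with made-up values and invented country names, and codes such as ARB in the population file appear to be regional aggregates rather than countries, so they should not be treated as countries.

```r
library(dplyr)

# Toy stand-ins that reuse column names from risk_factors and population.
risk_toy <- tibble::tibble(
  Entity     = c("Aland", "Aland", "Borduria", "Borduria", "World"),
  Code       = c("ALD", "ALD", "BRD", "BRD", NA),
  Year       = c(2016, 2017, 2016, 2017, 2017),
  `Drug use` = c(50, 60, 5, 6, 1000)
)
pop_toy <- tibble::tibble(
  CountryCode      = c("ALD", "ALD", "BRD", "BRD", "ARB"),
  Year             = c(2016, 2017, 2016, 2017, 2017),
  total_population = c(1e6, 2e6, 1e5, 2e5, 4e8)
)

# Drop rows without a code, join on code and year, and compute a rate.
# The inner join also discards population rows (here "ARB") with no match.
joined <- risk_toy |>
  filter(!is.na(Code)) |>
  inner_join(pop_toy, by = c("Code" = "CountryCode", "Year" = "Year")) |>
  mutate(drug_deaths_per_100k = `Drug use` / total_population * 1e5)

# Most populated entity in the latest year, among rows that survived the join.
most_populated <- joined |>
  filter(Year == max(Year)) |>
  slice_max(total_population, n = 1)

print(joined)
print(most_populated)
```

Selecting the largest populations after the join, rather than directly from the population file, keeps aggregate rows from being picked as "countries".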

Data 3

Introduction and data

  • Identify the source of the data.

    • The data is from the 2018 National Survey on Drug Use and Health (NSDUH).
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • This dataset was put together for the CORGIS Dataset Project by Austin Cory Bart, Ryan Whitcomb, Joung Min Choi, and Bo Guan. The data was collected from the Substance Abuse and Mental Health Data Archive (SAMHDA).
  • Write a brief description of the observations.

    • This dataset is about substance abuse (cigarettes, marijuana, cocaine, alcohol, tobacco) among different age groups and states, separated by year. Each row represents one state in one year, with columns indicating the rates of use, total number of people who use, rates of use disorder, and total number of people who suffer from use disorder in the past year and in the past month for each drug by people in various age groups.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    • Which state(s) in the US has/have the worst teenage substance abuse problem, and how has the situation there changed from 2002-2018?
  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    • The primary research topic is teenage substance abuse, and which state suffers most from this problem. Our hypothesis is that states with the most large cities, such as California and Texas, would have the worst teenage substance abuse problem.
  • Identify the types of variables in your research question. Categorical? Quantitative?

    • The two main categorical variables in this research question are State and Year. Most other variables such as total use and use rate for various drugs, as well as population, are quantitative variables.
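
To compare states on teenage use, the rate columns for the 12-17 age group can be reshaped into long form and summarised per state and year. A minimal sketch, assuming that the mean of the available 12-17 rates is an acceptable (if crude) composite of "teenage substance abuse": `drugs_toy` is a hypothetical stand-in for drugs with made-up values.

```r
library(dplyr)
library(tidyr)

# Toy rows that reuse column names from drugs (values are made up).
drugs_toy <- tibble::tibble(
  State = c("Alabama", "Alaska", "Alabama", "Alaska"),
  Year  = c(2002, 2002, 2018, 2018),
  `Rates.Alcohol.Use Past Month.12-17`    = c(0.15, 0.16, 0.09, 0.10),
  `Rates.Marijuana.Used Past Month.12-17` = c(0.06, 0.11, 0.05, 0.12),
  `Rates.Alcohol.Use Past Month.18-25`    = c(0.51, 0.60, 0.50, 0.58)
)

# Keep only rate columns for ages 12-17 and reshape to one row per measure.
teen_long <- drugs_toy |>
  select(State, Year, starts_with("Rates.") & ends_with(".12-17")) |>
  pivot_longer(cols = -c(State, Year), names_to = "measure", values_to = "rate")

# Assumed composite: average the teen rates, then rank states within each year.
teen_rank <- teen_long |>
  group_by(State, Year) |>
  summarise(mean_teen_rate = mean(rate), .groups = "drop") |>
  arrange(Year, desc(mean_teen_rate))

print(teen_rank)
```

Using rates rather than totals keeps large states such as California and Texas from ranking highest on population alone.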

Literature

  • Find one published credible article on the topic you are interested in researching.

    • Teenage Substance Abuse: Impact on The Family System and Parents’ Coping Strategies

    • https://www.researchgate.net/profile/Prudence-Mafa/publication/348350895_SOCIAL_SCIENCES_HUMANITIES_Teenage_Substance_Abuse_Impact_on_The_Family_System_and_Parents’_Coping_Strategies/links/5ff96151a6fdccdcb83ef1e9/SOCIAL-SCIENCES-HUMANITIES-Teenage-Substance-Abuse-Impact-on-The-Family-System-and-Parents-Coping-Strategies.pdf

  • Provide a one paragraph summary about the article.

    • The study examines the effects of teen substance addiction on families and the coping mechanisms employed by parents. The data were analyzed thematically after nine parents of teenagers in outpatient therapy were interviewed. The discovery of substance usage produces chaos in the household and has an impact on the marriage. The majority of parents try to find a solution on their own, but they frequently lack the required assistance. According to the study, the entire family should be included in efforts to intervene against substance misuse, not just the addicted person.
  • In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.

    • This study can inform our discussion of how the problem was combated from 2002 to 2018, whether efforts to include the family in addressing substance abuse have been used, and whether they have been effective. It differs from our research question in that it follows nine families specifically rather than drawing on a wider survey.

Glimpse of data

drugs <- read_csv("data/drugs.csv")
Rows: 867 Columns: 53
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): State
dbl (52): Year, Population.12-17, Population.18-25, Population.26+, Totals.A...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(drugs)
Rows: 867
Columns: 53
$ State                                               <chr> "Alabama", "Alaska…
$ Year                                                <dbl> 2002, 2002, 2002, …
$ `Population.12-17`                                  <dbl> 380805, 69400, 485…
$ `Population.18-25`                                  <dbl> 499453, 62791, 602…
$ `Population.26+`                                    <dbl> 2812905, 368460, 3…
$ `Totals.Alcohol.Use Disorder Past Year.12-17`       <dbl> 18, 4, 36, 14, 173…
$ `Totals.Alcohol.Use Disorder Past Year.18-25`       <dbl> 68, 12, 117, 53, 5…
$ `Totals.Alcohol.Use Disorder Past Year.26+`         <dbl> 138, 27, 258, 101,…
$ `Rates.Alcohol.Use Disorder Past Year.12-17`        <dbl> 0.048336, 0.061479…
$ `Rates.Alcohol.Use Disorder Past Year.18-25`        <dbl> 0.136490, 0.187891…
$ `Rates.Alcohol.Use Disorder Past Year.26+`          <dbl> 0.049068, 0.073677…
$ `Totals.Alcohol.Use Past Month.12-17`               <dbl> 57, 11, 91, 39, 48…
$ `Totals.Alcohol.Use Past Month.18-25`               <dbl> 254, 38, 352, 162,…
$ `Totals.Alcohol.Use Past Month.26+`                 <dbl> 1048, 206, 1774, 6…
$ `Rates.Alcohol.Use Past Month.12-17`                <dbl> 0.150033, 0.158988…
$ `Rates.Alcohol.Use Past Month.18-25`                <dbl> 0.509551, 0.598311…
$ `Rates.Alcohol.Use Past Month.26+`                  <dbl> 0.372703, 0.559151…
$ `Totals.Tobacco.Cigarette Past Month.12-17`         <dbl> 52, 9, 62, 37, 235…
$ `Totals.Tobacco.Cigarette Past Month.18-25`         <dbl> 196, 28, 234, 154,…
$ `Totals.Tobacco.Cigarette Past Month.26+`           <dbl> 728, 92, 919, 539,…
$ `Rates.Tobacco.Cigarette Past Month.12-17`          <dbl> 0.136906, 0.132517…
$ `Rates.Tobacco.Cigarette Past Month.18-25`          <dbl> 0.392404, 0.439749…
$ `Rates.Tobacco.Cigarette Past Month.26+`            <dbl> 0.258844, 0.249578…
$ `Totals.Illicit Drugs.Cocaine Used Past Year.12-17` <dbl> 6, 2, 16, 4, 53, 1…
$ `Totals.Illicit Drugs.Cocaine Used Past Year.18-25` <dbl> 27, 5, 51, 18, 259…
$ `Totals.Illicit Drugs.Cocaine Used Past Year.26+`   <dbl> 49, 5, 86, 26, 410…
$ `Rates.Illicit Drugs.Cocaine Used Past Year.12-17`  <dbl> 0.016556, 0.024400…
$ `Rates.Illicit Drugs.Cocaine Used Past Year.18-25`  <dbl> 0.054892, 0.083680…
$ `Rates.Illicit Drugs.Cocaine Used Past Year.26+`    <dbl> 0.017513, 0.013838…
$ `Totals.Marijuana.New Users.12-17`                  <dbl> 20, 4, 25, 13, 158…
$ `Totals.Marijuana.New Users.18-25`                  <dbl> 18, 2, 18, 10, 126…
$ `Totals.Marijuana.New Users.26+`                    <dbl> 2, 0, 3, 1, 17, 3,…
$ `Rates.Marijuana.New Users.12-17`                   <dbl> 0.059732, 0.077736…
$ `Rates.Marijuana.New Users.18-25`                   <dbl> 0.062325, 0.084250…
$ `Rates.Marijuana.New Users.26+`                     <dbl> 0.000914, 0.001625…
$ `Totals.Marijuana.Used Past Month.12-17`            <dbl> 24, 8, 38, 19, 241…
$ `Totals.Marijuana.Used Past Month.18-25`            <dbl> 62, 15, 91, 50, 63…
$ `Totals.Marijuana.Used Past Month.26+`              <dbl> 73, 26, 122, 57, 9…
$ `Rates.Marijuana.Used Past Month.12-17`             <dbl> 0.063662, 0.110781…
$ `Rates.Marijuana.Used Past Month.18-25`             <dbl> 0.124672, 0.239907…
$ `Rates.Marijuana.Used Past Month.26+`               <dbl> 0.025967, 0.071362…
$ `Totals.Marijuana.Used Past Year.12-17`             <dbl> 49, 13, 82, 37, 44…
$ `Totals.Marijuana.Used Past Year.18-25`             <dbl> 119, 24, 166, 87, …
$ `Totals.Marijuana.Used Past Year.26+`               <dbl> 141, 46, 215, 104,…
$ `Rates.Marijuana.Used Past Year.12-17`              <dbl> 0.127535, 0.188730…
$ `Rates.Marijuana.Used Past Year.18-25`              <dbl> 0.237880, 0.389026…
$ `Rates.Marijuana.Used Past Year.26+`                <dbl> 0.050275, 0.124566…
$ `Totals.Tobacco.Use Past Month.12-17`               <dbl> 63, 11, 73, 46, 29…
$ `Totals.Tobacco.Use Past Month.18-25`               <dbl> 226, 30, 240, 169,…
$ `Totals.Tobacco.Use Past Month.26+`                 <dbl> 930, 112, 1032, 66…
$ `Rates.Tobacco.Use Past Month.12-17`                <dbl> 0.166578, 0.163918…
$ `Rates.Tobacco.Use Past Month.18-25`                <dbl> 0.451976, 0.484270…
$ `Rates.Tobacco.Use Past Month.26+`                  <dbl> 0.330659, 0.304220…