Heart Disease Analysis

Proposal

library(tidyverse)

Data 1

Heart Disease

Introduction and data

  • Identify the source of the data.

    The data set was found on Kaggle.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    The data, however, was originally collected by the University of California, Irvine, as a synthesis of data from the Hungarian Institute of Cardiology and the Cleveland Clinic Foundation.

  • Write a brief description of the observations.

    1. id: (Unique id for each patient)

    2. age: (Age of the patient in years)

    3. dataset: (place of study)

    4. sex: (Male/Female)

    5. cp: chest pain type ([typical angina, atypical angina, non-anginal, asymptomatic])

    6. trestbps: resting blood pressure (in mm Hg on admission to the hospital)

    7. chol: (serum cholesterol in mg/dl)

    8. fbs: whether fasting blood sugar > 120 mg/dl (True/False)

    9. restecg: (resting electrocardiographic results)
      -- Values: [normal, stt abnormality, lv hypertrophy]

    10. thalch: maximum heart rate achieved

    11. exang: exercise-induced angina (True/ False)

    12. oldpeak: ST depression induced by exercise relative to rest

    13. slope: the slope of the peak exercise ST segment

    14. ca: number of major vessels (0-3) colored by fluoroscopy

    15. thal: [normal; fixed defect; reversible defect]

    16. num: the predicted attribute; 0 means no heart disease, and 1-4 indicate the stage.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    How do heart disease presence and magnitude vary with age, sex, cholesterol, blood sugar, and blood pressure? Furthermore, which variables are the best predictors of whether or not an individual will get heart disease?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    Heart disease affects millions of people across the world. It comes in many forms, such as coronary artery disease, heart failure, and heart valve disorders. I hypothesize that the older an individual is, and the higher their cholesterol, blood sugar, and blood pressure, the more likely they are to develop heart disease of greater magnitude. I also believe that age, sex, cholesterol, blood sugar, blood pressure, and the number of major vessels are the best predictors of whether or not an individual will develop heart disease. Because the response variable is a factor with five levels (0-4), I plan on combining levels 1-4 into level 1, which will indicate the presence of heart disease. I also believe that accuracy is the only metric needed to measure success, as there seems to be a roughly even number of people with and without heart disease, and therefore I do not think that the model will be biased.
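    The recoding and the class-balance check described above could be sketched roughly as follows. This is only an illustration: it uses the first ten `id` and `num` values visible in the glimpse output rather than the full file, and the names `heart_sample`, `has_disease`, and `class_balance` are hypothetical.

```r
library(tidyverse)

# First ten id/num values copied from the glimpse output (illustrative subset only;
# the real analysis would start from the full heart_disease data frame).
heart_sample <- tibble(
  id  = 1:10,
  num = c(0, 2, 1, 0, 0, 0, 3, 0, 2, 1)
)

# Collapse stages 1-4 into a single "disease present" level (1); 0 stays 0.
heart_binary <- heart_sample %>%
  mutate(has_disease = if_else(num > 0, 1L, 0L))

# Tabulate the two classes to see whether they are roughly balanced
# before relying on accuracy as the only metric.
class_balance <- heart_binary %>%
  count(has_disease) %>%
  mutate(prop = n / sum(n))

class_balance
```

    Running the same `count()` on the full data set would show whether the "roughly even" assumption behind using accuracy alone actually holds.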

  • Identify the types of variables in your research question. Categorical? Quantitative?

    1. age: quantitative

    2. dataset: categorical

    3. sex: categorical

    4. cp: categorical

    5. trestbps: quantitative

    6. chol: quantitative

    7. fbs: categorical

    8. restecg: categorical

    9. thalch: quantitative

    10. exang: categorical

    11. oldpeak: quantitative

    12. slope: categorical

    13. ca: categorical

    14. thal: categorical

    15. num: categorical

Literature

  • Find one published credible article on the topic you are interested in researching.

    “Prevalence and Trends of Coronary Heart Disease in the United States, 2011 to 2018” Yi-Ting Hana Lee, MPH; Jing Fang, MD, MS; Linda Schieb, MSPH; et al.

    https://jamanetwork.com/journals/jamacardiology/fullarticle/2787707

  • Provide a one paragraph summary about the article.

    This article provided a broad overview of recent trends in coronary heart disease (CHD) in the United States from 2011-2018. This study relied on self-reported data from patients. Overall there was a minor decrease in the prevalence of CHD in the United States among all demographics. Between 2011 and 2018, the rate dropped from 6.2% to 6.0%. There were plenty of geographic and demographic differences and trends that are worth noting. First, there were some declines in adults over 65, college graduates, and residents of Utah. However, the prevalence of CHD rose in areas like Oregon and in adults 18-44.

  • In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.

    The data we have focuses more on the health aspects of heart disease, while the study looks at broad trends in demographics and geography. Being able to break down and group the data by patient information like race and age is helpful; our dataset has some of these demographic identifiers, but more of them could help our analysis.

  • Glimpse of data

heart_disease <- read_csv("data/heart_disease_uci.csv")
Rows: 920 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): sex, dataset, cp, restecg, slope, thal
dbl (8): id, age, trestbps, chol, thalch, oldpeak, ca, num
lgl (2): fbs, exang

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(heart_disease)
Rows: 920
Columns: 16
$ id       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ age      <dbl> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 5…
$ sex      <chr> "Male", "Male", "Male", "Male", "Female", "Male", "Female", "…
$ dataset  <chr> "Cleveland", "Cleveland", "Cleveland", "Cleveland", "Clevelan…
$ cp       <chr> "typical angina", "asymptomatic", "asymptomatic", "non-angina…
$ trestbps <dbl> 145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 1…
$ chol     <dbl> 233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 2…
$ fbs      <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ restecg  <chr> "lv hypertrophy", "lv hypertrophy", "lv hypertrophy", "normal…
$ thalch   <dbl> 150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 1…
$ exang    <lgl> FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, T…
$ oldpeak  <dbl> 2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0…
$ slope    <chr> "downsloping", "flat", "flat", "downsloping", "upsloping", "u…
$ ca       <dbl> 0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ thal     <chr> "fixed defect", "normal", "reversable defect", "normal", "nor…
$ num      <dbl> 0, 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0…

Data 2

Billionaires

Introduction and data

  • Identify the source of the data.

    The source of the data is the CORGIS Dataset Project.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    The data is based on the Forbes World’s Billionaires lists from 1996-2014, with additional variables added by researchers at the Peterson Institute for International Economics.

  • Write a brief description of the observations.

    Variable                  Description
    name                      Name of person
    rank                      Rank of richest in world
    year                      Year data collected
    company.founded           Year company founded
    company.name              Name of company
    company.relationship      Person’s role at company
    company.sector            Business sector of company
    company.type              Type of business
    demographics.age          Current age of person
    demographics.gender       Gender of person
    location.citizenship      Country of citizenship of person
    location.country.code     Country code of citizenship
    location.gdp              GDP of country of citizenship
    location.region           Region of living
    wealth.type               Type of billionaire
    wealth.worth              Net worth in billions
    wealth.how.category       Where money came from
    wealth.how.from.emerging  Whether wealth from emerging markets
    wealth.how.industry       Industry of profit
    wealth.how.inherited      Way/if money inherited
    wealth.how.was.founder    Whether person founded company
    wealth.how.was.political  Whether money came from politics

Observations with missing values in any variable used in an analysis will be filtered out before that analysis.
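One possible way to do this filtering is sketched below. It rests on an assumption that is not confirmed anywhere in the data documentation: the glimpse output shows zeros in `demographics.age` and `location.gdp`, and the sketch treats those zeros as placeholders for missing values. The sample rows are the first four values of each column from the glimpse output; `billionaires_sample` and `billionaires_clean` are hypothetical names.

```r
library(tidyverse)

# First four values of each column copied from the glimpse output (illustrative subset).
billionaires_sample <- tibble(
  demographics.age = c(40, 45, 58, 65),
  location.gdp     = c(8.10e12, 1.06e13, 0, 8.10e12)
)

billionaires_clean <- billionaires_sample %>%
  # ASSUMPTION: a value of 0 stands for "not recorded" in these two columns.
  mutate(across(c(demographics.age, location.gdp), ~ na_if(.x, 0))) %>%
  # Drop rows that are missing any variable used in the analysis.
  drop_na(demographics.age, location.gdp)

billionaires_clean
```

Listing the columns explicitly in `drop_na()` keeps rows that are only missing variables the analysis does not use.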

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    1. Does the GDP of a billionaire’s country of citizenship affect how likely they are to have founded the company that gave them their wealth? Does this relationship change across age? Gender? Which demographic is the best predictor of whether they founded the company?

    The response variable is “wealth.how.was.founder”. The time range will be the entire time the data was collected.

    2. What is the relationship between the net worth of a billionaire and the sector of the company that gave them their wealth? Does this relationship change across age? Whether the money was inherited? Which demographic is the best predictor of the sector their company is in?

    I will verify the hypothesized relationship by determining which of the listed demographics is the best predictor using R-squared, as this will diminish statistical bias.
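    For the first question, the response is a TRUE/FALSE variable, so one option (a suggestion, not a method stated in this proposal) would be a logistic regression. The sketch below assumes the column names shown in the glimpse output, including the backticked `wealth.how.was founder`; the function name `fit_founder_model` and the rescaled `gdp_trillions` are hypothetical.

```r
library(tidyverse)

# Sketch: logistic regression of founder status on GDP, age, and gender.
# `data` is expected to have the columns shown in the glimpse of billionaires.
fit_founder_model <- function(data) {
  model_data <- data %>%
    transmute(
      was_founder   = as.integer(`wealth.how.was founder`), # TRUE/FALSE -> 1/0
      gdp_trillions = location.gdp / 1e12,                  # rescale for readable coefficients
      age           = demographics.age,
      gender        = demographics.gender
    ) %>%
    drop_na()

  glm(was_founder ~ gdp_trillions + age + gender,
      data = model_data, family = binomial)
}
```

    It could then be called as `summary(fit_founder_model(billionaires))` once the data has been read in and cleaned.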

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    The research topic is billionaires and the relationships between different parts of their lives and businesses. The hypothesis for question 1 is that billionaires from countries with higher GDP are more likely to have founded the company that gave them their wealth, and that this holds across age and gender. The hypothesis for question 2 is that billionaires with higher net worth are more likely to be in technology-related fields. This relationship changes with age because older billionaires are more likely to be in finance, since the technology sector was smaller in the past. The relationship also changes if the money was inherited, because inherited money more often comes from finance or oil. Age is the best predictor of the sector.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    Categorical: whether billionaire founded company (boolean), gender (string), whether the money was inherited (boolean), sector of company (string)

    Quantitative: location gdp (float), age (integer), wealth (float)

Literature

  • Find one published credible article on the topic you are interested in researching.

    “Russia’s Billionaires” Daniel Treisman, 2016. https://www.aeaweb.org/articles?id=10.1257/aer.p20161068

  • Provide a one paragraph summary about the article.

    This paper analyzes the rise and longevity of billionaires from Russia during the past 20 years. The data comes from Forbes research efforts. Since the early 2000s, the number of Russian billionaires has increased significantly. Many Russian billionaires made money from oil, banking, gas, and metal; however, Treisman found that the sectors Russian billionaires make money from have become more diverse. Now some billionaires have come from industries like real estate, chemicals, and information technology.

  • In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.

    Our research question closely parallels the work from Treisman’s paper; the data from the paper has many overlapping characteristics and variables. As a team, we can learn from the methods and analysis from the paper to inform our own analysis.

Glimpse of data

billionaires <- read_csv("data/billionaires.csv")
Rows: 2614 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): name, company.name, company.relationship, company.sector, company....
dbl  (6): rank, year, company.founded, demographics.age, location.gdp, wealt...
lgl  (3): wealth.how.from emerging, wealth.how.was founder, wealth.how.was p...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(billionaires)
Rows: 2,614
Columns: 22
$ name                       <chr> "Bill Gates", "Bill Gates", "Bill Gates", "…
$ rank                       <dbl> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5…
$ year                       <dbl> 1996, 2001, 2014, 1996, 2001, 2014, 1996, 2…
$ company.founded            <dbl> 1975, 1975, 1975, 1962, 1962, 1990, 1896, 1…
$ company.name               <chr> "Microsoft", "Microsoft", "Microsoft", "Ber…
$ company.relationship       <chr> "founder", "founder", "founder", "founder",…
$ company.sector             <chr> "Software", "Software", "Software", "Financ…
$ company.type               <chr> "new", "new", "new", "new", "new", "privati…
$ demographics.age           <dbl> 40, 45, 58, 65, 70, 74, 0, 48, 77, 68, 56, …
$ demographics.gender        <chr> "male", "male", "male", "male", "male", "ma…
$ location.citizenship       <chr> "United States", "United States", "United S…
$ `location.country code`    <chr> "USA", "USA", "USA", "USA", "USA", "MEX", "…
$ location.gdp               <dbl> 8.10e+12, 1.06e+13, 0.00e+00, 8.10e+12, 1.0…
$ location.region            <chr> "North America", "North America", "North Am…
$ wealth.type                <chr> "founder non-finance", "founder non-finance…
$ `wealth.worth in billions` <dbl> 18.5, 58.7, 76.0, 15.0, 32.3, 72.0, 13.1, 3…
$ wealth.how.category        <chr> "New Sectors", "New Sectors", "New Sectors"…
$ `wealth.how.from emerging` <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
$ wealth.how.industry        <chr> "Technology-Computer", "Technology-Computer…
$ wealth.how.inherited       <chr> "not inherited", "not inherited", "not inhe…
$ `wealth.how.was founder`   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
$ `wealth.how.was political` <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…

Data 3

Smokers

Introduction and data

  • Identify the source of the data.

The source of the data is the Centers for Disease Control and Prevention.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The data comes from the CDC’s Behavioral Risk Factor Surveillance System annual survey data from 1995-2010. The percentages in the data are weighted by population characteristics.

  • Write a brief description of the observations.

    Variable         Description
    year             Year data collected
    state            State data is from
    smoke_everyday   Percentage smokes everyday
    smoke_some_days  Percentage smokes some days
    former_smoker    Percentage former smoker
    never_smoked     Percentage never smoked
    location_1       Location coordinates

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

How does smoking frequency vary by location? How has overall smoking frequency changed over time? How has smoking frequency changed in different regions of the United States over time?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

The research topic is how smoking frequency has changed over time, particularly by state and region. The hypothesis for question 1 is that frequent smoking is more common in urban, developed areas and also more common in very rural areas. For this hypothesis, the classifications of urban and rural will be taken from the Bureau of the Census. An urban area is defined as an area with a population of 50,000 or more and a rural area is defined as a place with fewer than 2,500 inhabitants. These areas have already been coded by the Bureau of the Census. The hypothesis for question 2 is that overall smoking frequency has decreased over time due to increased awareness about health risks; additionally, it is hypothesized that there is now a higher percentage of former smokers across the United States. The hypothesis for question 3 is that smoking frequency has decreased in lower population regions but is more constant in higher population regions. Population will be measured by population density criteria from the Bureau of the Census with lower population being defined as 35 people per square mile and higher population defined as 1,600 people per square mile or more.
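A first look at the second hypothesis (change over time) could be sketched as a per-year summary. This is only an illustration: it uses the first nine values of `Year`, `Smoke everyday`, and `Former smoker` from the glimpse output, renames them to the snake_case names used in the variable descriptions, and takes a simple unweighted mean across rows, which is an assumption about how to aggregate. The names `cdc_sample` and `yearly_trend` are hypothetical.

```r
library(tidyverse)

# First nine values of each column copied from the glimpse output (illustrative subset).
cdc_sample <- tibble(
  Year             = c(1996, 2005, 2005, 2002, 2003, 2000, 2002, 1996, 2006),
  `Smoke everyday` = c(9.4, 5.3, 7.9, 7.0, 26.3, 15.3, 15.0, 24.9, 17.7),
  `Former smoker`  = c(16.0, 12.8, 16.9, 12.1, 14.3, 28.3, 26.0, 23.7, 20.1)
)

yearly_trend <- cdc_sample %>%
  # Match the snake_case names used in the variable descriptions.
  rename(
    year           = Year,
    smoke_everyday = `Smoke everyday`,
    former_smoker  = `Former smoker`
  ) %>%
  group_by(year) %>%
  # Unweighted mean across the rows available for each year.
  summarise(
    mean_smoke_everyday = mean(smoke_everyday),
    mean_former_smoker  = mean(former_smoker),
    .groups = "drop"
  )

yearly_trend
```

On the full data, plotting `mean_smoke_everyday` and `mean_former_smoker` against `year` would give a quick visual check of both parts of the hypothesis.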

  • Identify the types of variables in your research question. Categorical? Quantitative?
  1. year: quantitative, double
  2. state: categorical, character
  3. smoke_everyday: quantitative, double
  4. smoke_some_days: quantitative, double
  5. former_smoker: quantitative, double
  6. never_smoked: quantitative, double
  7. location_1: categorical, character

Literature

  • Find one published credible article on the topic you are interested in researching.

Prevalence and Factors Associated with Current Cigarette Smoking among Ethiopian University Students: A Systematic Review and Meta-Analysis

  • Provide a one paragraph summary about the article.

This systematic review and meta-analysis aims to determine the prevalence of, and factors associated with, smoking among Ethiopian university students. There were 13 studies included, and the average prevalence of cigarette smoking was 12.55%. Peer pressure, other tobacco products, and alcohol were often associated with cigarette smoking. The authors also recommended actions to prevent smoking among Ethiopian university students, including promoting anti-smoking campaigns and increasing health education.

  • In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.

This paper did not look at data longitudinally, and it also functions as a literature review; however, the paper does take a narrow look into one section of smokers: Ethiopian university students. Our data is much broader, but we can still mirror the analysis process.

Glimpse of data

cdc_smoker <- read_csv("data/smoking.csv")
Rows: 876 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): State, Location 1
dbl (5): Year, Smoke everyday, Smoke some days, Former smoker, Never smoked

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(cdc_smoker)
Rows: 876
Columns: 7
$ Year              <dbl> 1996, 2005, 2005, 2002, 2003, 2000, 2002, 1996, 2006…
$ State             <chr> "Puerto Rico", "Virgin Islands", "Puerto Rico", "Vir…
$ `Smoke everyday`  <dbl> 9.4, 5.3, 7.9, 7.0, 26.3, 15.3, 15.0, 24.9, 17.7, 17…
$ `Smoke some days` <dbl> 5.1, 2.8, 5.2, 2.4, 7.8, 5.4, 6.2, 3.7, 5.7, 5.6, 4.…
$ `Former smoker`   <dbl> 16.0, 12.8, 16.9, 12.1, 14.3, 28.3, 26.0, 23.7, 20.1…
$ `Never smoked`    <dbl> 69.5, 79.1, 70.0, 78.5, 51.7, 50.9, 52.8, 47.6, 56.4…
$ `Location 1`      <chr> "Puerto Rico", "Virgin Islands", "Puerto Rico", "Vir…