library(tidyverse)
Heart Disease Analysis
Proposal
Data 1
Heart Disease
Introduction and data
Identify the source of the data.
The data set was found on Kaggle.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
However, the data were originally compiled by the University of California, Irvine, as a synthesis of data from the Hungarian Institute of Cardiology and the Cleveland Clinic Foundation.
Write a brief description of the observations.
id: (Unique id for each patient)
age: (Age of the patient in years)
dataset: (place of study)
sex: (Male/Female)
cp: chest pain type ([typical angina, atypical angina, non-anginal, asymptomatic])
trestbps: resting blood pressure (in mm Hg on admission to the hospital)
chol: (serum cholesterol in mg/dl)
fbs: (whether fasting blood sugar > 120 mg/dl; True/False)
restecg: (resting electrocardiographic results)
-- Values: [normal, stt abnormality, lv hypertrophy]
thalach: maximum heart rate achieved (this column is named thalch in the data file)
exang: exercise-induced angina (True/ False)
oldpeak: ST depression induced by exercise relative to rest
slope: the slope of the peak exercise ST segment
ca: number of major vessels (0-3) colored by fluoroscopy
thal: [normal; fixed defect; reversible defect]
num: the predicted attribute; 0 means no heart disease, and 1-4 indicate the stage.
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
How do the presence and magnitude of heart disease vary with age, sex, cholesterol, blood sugar, and blood pressure? Furthermore, which variables are the best predictors of whether or not an individual will get heart disease?
A description of the research topic along with a concise statement of your hypotheses on this topic.
Heart disease affects hundreds of thousands of people across the world. It takes many forms, such as coronary artery disease, heart failure, and heart valve disorders. I hypothesize that the older an individual is, and the higher their cholesterol, blood sugar, and blood pressure, the more likely they are to develop heart disease of large magnitude. I also believe that age, sex, cholesterol, blood sugar, blood pressure, and the number of major vessels are the best predictors of whether or not an individual will develop heart disease. Because the response variable is a factor with five levels (0-4), I plan to combine levels 1-4 into a single level, 1, which will indicate the presence of heart disease. I also believe that accuracy is the only metric needed to measure success: there seems to be a roughly even number of people with and without heart disease, so I do not expect class imbalance to bias the model.
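The recoding and the accuracy metric described above can be sketched as follows. This is a minimal illustration, not the final analysis: it uses only the first ten `num` values visible in the glimpse of the data, and the names `heart_sample`, `disease`, and `accuracy()` are hypothetical choices made for this example.

```r
library(dplyr)

# First ten `num` values as shown in the glimpse of the data
# (0 = no heart disease, 1-4 = stage). Not the full data set.
heart_sample <- tibble(num = c(0, 2, 1, 0, 0, 0, 3, 0, 2, 1))

# Collapse stages 1-4 into a single "present" level.
heart_binary <- heart_sample |>
  mutate(disease = factor(if_else(num > 0, "present", "absent"),
                          levels = c("absent", "present")))

# Class balance: accuracy is easiest to interpret when these shares are similar.
class_balance <- heart_binary |>
  count(disease) |>
  mutate(prop = n / sum(n))

# Accuracy = share of predictions that match the observed class.
# `predicted` is a hypothetical stand-in for model output.
accuracy <- function(observed, predicted) {
  mean(as.character(observed) == as.character(predicted))
}

# Baseline to beat: always predict the most common class.
majority_class <- class_balance$disease[which.max(class_balance$n)]
baseline_accuracy <- accuracy(heart_binary$disease, majority_class)
```

Checking `class_balance` on the full 920 rows would confirm or refute the assumption of a roughly even split before relying on accuracy alone.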
Identify the types of variables in your research question. Categorical? Quantitative?
age: quantitative
dataset: categorical
sex: categorical
cp: categorical
trestbps: quantitative
chol: quantitative
fbs: categorical
restecg: categorical
thalach: quantitative
exang: categorical
oldpeak: quantitative
slope: categorical
ca: categorical
thal: categorical
num: categorical
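According to the column specification printed by `read_csv()`, `ca` and `num` are imported as numbers (dbl) even though they are classified as categorical above. A possible way to make the stored types agree with this classification is sketched below; the five rows are copied from the glimpse of the data, and the object names are hypothetical.

```r
library(dplyr)

# First five rows of selected columns, copied from the glimpse of the data.
heart_sample <- tibble(
  sex = c("Male", "Male", "Male", "Male", "Female"),
  age = c(63, 67, 67, 37, 41),
  fbs = c(TRUE, FALSE, FALSE, FALSE, FALSE),
  ca  = c(0, 3, 2, 0, 0),
  num = c(0, 2, 1, 0, 0)
)

# Columns classified as categorical become factors;
# quantitative columns such as age stay numeric.
heart_typed <- heart_sample |>
  mutate(across(c(sex, fbs, ca, num), as.factor))
```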
Literature
Find one published credible article on the topic you are interested in researching.
“Prevalence and Trends of Coronary Heart Disease in the United States, 2011 to 2018” Yi-Ting Hana Lee, MPH; Jing Fang, MD, MS; Linda Schieb, MSPH; et al.
https://jamanetwork.com/journals/jamacardiology/fullarticle/2787707
Provide a one paragraph summary about the article.
This article provided a broad overview of recent trends in coronary heart disease (CHD) in the United States from 2011-2018. This study relied on self-reported data from patients. Overall there was a minor decrease in the prevalence of CHD in the United States among all demographics. Between 2011 and 2018, the rate dropped from 6.2% to 6.0%. There were plenty of geographic and demographic differences and trends that are worth noting. First, there were some declines in adults over 65, college graduates, and residents of Utah. However, the prevalence of CHD rose in areas like Oregon and in adults 18-44.
In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.
The data we have focuses more on the health aspects of heart disease, while the study looks at broad trends in demographics and geography. Being able to break down and group the data by patient information like race and age is helpful; our dataset has some of these demographic identifiers, but more of them could help our analysis.
Glimpse of data
heart_disease <- read_csv("data/heart_disease_uci.csv")
Rows: 920 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): sex, dataset, cp, restecg, slope, thal
dbl (8): id, age, trestbps, chol, thalch, oldpeak, ca, num
lgl (2): fbs, exang
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(heart_disease)
Rows: 920
Columns: 16
$ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18…
$ age <dbl> 63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 5…
$ sex <chr> "Male", "Male", "Male", "Male", "Female", "Male", "Female", "…
$ dataset <chr> "Cleveland", "Cleveland", "Cleveland", "Cleveland", "Clevelan…
$ cp <chr> "typical angina", "asymptomatic", "asymptomatic", "non-angina…
$ trestbps <dbl> 145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 1…
$ chol <dbl> 233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 2…
$ fbs <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,…
$ restecg <chr> "lv hypertrophy", "lv hypertrophy", "lv hypertrophy", "normal…
$ thalch <dbl> 150, 108, 129, 187, 172, 178, 160, 163, 147, 155, 148, 153, 1…
$ exang <lgl> FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, T…
$ oldpeak <dbl> 2.3, 1.5, 2.6, 3.5, 1.4, 0.8, 3.6, 0.6, 1.4, 3.1, 0.4, 1.3, 0…
$ slope <chr> "downsloping", "flat", "flat", "downsloping", "upsloping", "u…
$ ca <dbl> 0, 3, 2, 0, 0, 0, 2, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ thal <chr> "fixed defect", "normal", "reversable defect", "normal", "nor…
$ num <dbl> 0, 2, 1, 0, 0, 0, 3, 0, 2, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0…
Data 2
Billionaires
Introduction and data
Identify the source of the data.
The source of the data is the CORGIS Dataset Project.
State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The data are based on the Forbes World’s Billionaires lists from 1996 to 2014; researchers at the Peterson Institute for International Economics added further variables.
Write a brief description of the observations.
name: Name of person
rank: Rank of richest in world
year: Year data collected
company.founded: Year company founded
company.name: Name of company
company.relationship: Person’s role at company
company.sector: Business sector of company
company.type: Type of business
demographics.age: Current age of person
demographics.gender: Gender of person
location.citizenship: Country of citizenship of person
location.country.code: Country code of citizenship
location.gdp: GDP of country of citizenship
location.region: Region of living
wealth.type: Type of billionaire
wealth.worth: Net worth in billions
wealth.how.category: Where money came from
wealth.how.from.emerging: Whether wealth from emerging markets
wealth.how.industry: Industry of profit
wealth.how.inherited: Way/if money inherited
wealth.how.was.founder: Whether person founded company
wealth.how.was.political: Whether money came from politics
Any null values will be filtered out if being used in analysis.
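One way the filtering could look is sketched below. The glimpse of the data shows a `location.gdp` of 0 and a `demographics.age` of 0, which suggests that missing values may be stored as zeros rather than as `NA`; that reading is an assumption, not something the data source confirms here. The four rows are copied from the glimpse, and the object names are hypothetical.

```r
library(dplyr)

# First four rows of three columns, copied from the glimpse of the data.
billionaires_sample <- tibble(
  year             = c(1996, 2001, 2014, 1996),
  demographics.age = c(40, 45, 58, 65),
  location.gdp     = c(8.10e12, 1.06e13, 0, 8.10e12)
)

# Assumption: a value of 0 for age or GDP is a placeholder for "unknown".
# Recode those zeros to NA, then keep rows complete in the analysis columns.
billionaires_complete <- billionaires_sample |>
  mutate(across(c(demographics.age, location.gdp), ~ na_if(.x, 0))) |>
  filter(!is.na(demographics.age), !is.na(location.gdp))
```

Filtering only on the columns used in a given model keeps more rows than dropping every row with any missing value.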
Research question
A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
- Does the GDP of a billionaire’s country of citizenship affect how likely they are to have founded the company that gave them their wealth? Does this relationship change across age or gender? Which demographic is the best predictor of whether they founded the company?
The response variable is “wealth.how.was.founder”. The time range will be the entire time the data was collected.
- What is the relationship between the net worth of a billionaire and the sector of the company that gave them their wealth? Does this relationship change across age, or depending on whether the money was inherited? Which demographic is the best predictor of the sector their company is in?
I will verify the hypothesized relationship by determining which of the listed demographics is the best predictor using R-squared, as this will diminish statistical bias.
A description of the research topic along with a concise statement of your hypotheses on this topic.
The research topic is billionaires and the relationships between different aspects of their lives and businesses. The hypothesis for question 1 is that billionaires from countries with higher GDP are more likely to have founded the company that gave them their wealth, and that this holds across age and gender. The hypothesis for question 2 is that billionaires with higher net worth are more likely to be in technology-related fields. This relationship changes with age, because older billionaires are more likely to be in finance, since the technology sector was smaller in earlier decades. The relationship also changes if the money was inherited, because inherited money more often comes from finance or oil. Age is the best predictor of the sector.
Identify the types of variables in your research question. Categorical? Quantitative?
Categorical: whether billionaire founded company (boolean), gender (string), whether the money was inherited (boolean), sector of company (string)
Quantitative: location gdp (float), age (integer), wealth (float)
Literature
Find one published credible article on the topic you are interested in researching.
“Russia’s Billionaires” Daniel Treisman, 2016. https://www.aeaweb.org/articles?id=10.1257/aer.p20161068
Provide a one paragraph summary about the article.
This paper analyzes the rise and longevity of billionaires from Russia during the past 20 years. The data come from Forbes research efforts. Since the early 2000s, the number of Russian billionaires has increased significantly. Many Russian billionaires made their money from oil, banking, gas, and metal; however, Treisman found that the sectors from which Russian billionaires make their money have become more diverse. Some billionaires now come from industries like real estate, chemicals, and information technology.
In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.
Our research question closely parallels the work from Treisman’s paper; the data from the paper has many overlapping characteristics and variables. As a team, we can learn from the methods and analysis from the paper to inform our own analysis.
Glimpse of data
billionaires <- read_csv("data/billionaires.csv")
Rows: 2614 Columns: 22
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): name, company.name, company.relationship, company.sector, company....
dbl (6): rank, year, company.founded, demographics.age, location.gdp, wealt...
lgl (3): wealth.how.from emerging, wealth.how.was founder, wealth.how.was p...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(billionaires)
Rows: 2,614
Columns: 22
$ name <chr> "Bill Gates", "Bill Gates", "Bill Gates", "…
$ rank <dbl> 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5…
$ year <dbl> 1996, 2001, 2014, 1996, 2001, 2014, 1996, 2…
$ company.founded <dbl> 1975, 1975, 1975, 1962, 1962, 1990, 1896, 1…
$ company.name <chr> "Microsoft", "Microsoft", "Microsoft", "Ber…
$ company.relationship <chr> "founder", "founder", "founder", "founder",…
$ company.sector <chr> "Software", "Software", "Software", "Financ…
$ company.type <chr> "new", "new", "new", "new", "new", "privati…
$ demographics.age <dbl> 40, 45, 58, 65, 70, 74, 0, 48, 77, 68, 56, …
$ demographics.gender <chr> "male", "male", "male", "male", "male", "ma…
$ location.citizenship <chr> "United States", "United States", "United S…
$ `location.country code` <chr> "USA", "USA", "USA", "USA", "USA", "MEX", "…
$ location.gdp <dbl> 8.10e+12, 1.06e+13, 0.00e+00, 8.10e+12, 1.0…
$ location.region <chr> "North America", "North America", "North Am…
$ wealth.type <chr> "founder non-finance", "founder non-finance…
$ `wealth.worth in billions` <dbl> 18.5, 58.7, 76.0, 15.0, 32.3, 72.0, 13.1, 3…
$ wealth.how.category <chr> "New Sectors", "New Sectors", "New Sectors"…
$ `wealth.how.from emerging` <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
$ wealth.how.industry <chr> "Technology-Computer", "Technology-Computer…
$ wealth.how.inherited <chr> "not inherited", "not inherited", "not inhe…
$ `wealth.how.was founder` <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
$ `wealth.how.was political` <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
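Several column names in the file contain spaces (for example `wealth.how.was founder`), so they must be wrapped in backticks and do not match the dotted names used in the variable description. A small sketch of one way to normalize them follows; the three rows are copied from the glimpse above, and the object names are hypothetical.

```r
library(dplyr)

# First three rows of two columns, copied from the glimpse of the data.
billionaires_sample <- tibble(
  `wealth.worth in billions` = c(18.5, 58.7, 76.0),
  `wealth.how.was founder`   = c(TRUE, TRUE, TRUE)
)

# Replace spaces in column names with dots so the names
# can be used without backticks.
billionaires_renamed <- billionaires_sample |>
  rename_with(~ gsub(" ", ".", .x, fixed = TRUE))
```

After this step the response variable can be referred to as `wealth.how.was.founder`, as in the research question.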
Data 3
Smokers
Introduction and data
- Identify the source of the data.
The source of the data is the Centers for Disease Control and Prevention.
- State when and how it was originally collected (by the original data curator, not necessarily how you found the data).
The data come from the CDC’s Behavioral Risk Factor Surveillance System annual data for 1995-2010. The values are percentages weighted by population characteristics.
Write a brief description of the observations.
year: Year data collected
state: State data is from
smoke_everyday: Percentage smokes everyday
smoke_some_days: Percentage smokes some days
former_smoker: Percentage former smoker
never_smoked: Percentage never smoked
location_1: Location coordinates
Research question
- A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)
How does smoking frequency vary by location? How has overall smoking frequency changed over time? How has smoking frequency changed in different regions of the United States over time?
- A description of the research topic along with a concise statement of your hypotheses on this topic.
The research topic is how smoking frequency has changed over time, particularly across states and regions. The hypothesis for question 1 is that frequent smoking is more common both in urban, developed areas and in very rural areas. For this hypothesis, the classifications of urban and rural will be taken from the Bureau of the Census: an urban area is defined as an area with a population of 50,000 or more, and a rural area as a place with fewer than 2,500 inhabitants. These areas have already been coded by the Bureau of the Census. The hypothesis for question 2 is that overall smoking frequency has decreased over time due to increased awareness of health risks; additionally, it is hypothesized that there is now a higher percentage of former smokers across the United States. The hypothesis for question 3 is that smoking frequency has decreased in lower-population regions but has remained more constant in higher-population regions. Population will be measured using population density criteria from the Bureau of the Census, with lower population defined as 35 people per square mile and higher population defined as 1,600 people per square mile or more.
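As a rough illustration of how the second hypothesis might be examined, the sketch below averages the share of current smokers by year. It rests on several assumptions: only the first nine rows shown in the glimpse of the data are used, "current smoker" is defined here as everyday plus some-days smokers, and the yearly figure is a simple unweighted mean across rows rather than a population-weighted estimate. The object names are hypothetical.

```r
library(dplyr)

# First nine rows of three columns, copied from the glimpse of the data.
smoke_sample <- tibble(
  Year              = c(1996, 2005, 2005, 2002, 2003, 2000, 2002, 1996, 2006),
  `Smoke everyday`  = c(9.4, 5.3, 7.9, 7.0, 26.3, 15.3, 15.0, 24.9, 17.7),
  `Smoke some days` = c(5.1, 2.8, 5.2, 2.4, 7.8, 5.4, 6.2, 3.7, 5.7)
)

# Current smokers = everyday + some-days smokers (an assumed definition);
# take the unweighted mean across rows within each year.
yearly_trend <- smoke_sample |>
  mutate(current_smoker = `Smoke everyday` + `Smoke some days`) |>
  group_by(Year) |>
  summarise(mean_current = mean(current_smoker), n_rows = n()) |>
  arrange(Year)
```

With the full data, plotting `mean_current` against `Year` would show whether the overall trend is downward.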
- Identify the types of variables in your research question. Categorical? Quantitative?
- year: quantitative, double
- state: categorical, character
- smoke_everyday: quantitative, double
- smoke_some_days: quantitative, double
- former_smoker: quantitative, double
- never_smoked: quantitative, double
- location_1: categorical, character
Literature
- Find one published credible article on the topic you are interested in researching.
Prevalence and Factors Associated with Current Cigarette Smoking among Ethiopian University Students: A Systematic Review and Meta-Analysis
- Provide a one paragraph summary about the article.
This systematic review and meta-analysis aims to determine the prevalence of, and factors associated with, cigarette smoking among Ethiopian university students. Thirteen studies were included, and the average prevalence of cigarette smoking was 12.55%. Peer pressure, use of other tobacco products, and alcohol were often associated with cigarette smoking. The authors also recommended actions to prevent smoking among Ethiopian university students, including promoting anti-smoking campaigns and increasing health education.
- In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.
This paper did not look at data longitudinally, and it also functions as a literature review; however, it takes a narrow look at one group of smokers: Ethiopian university students. Our data are much broader, but we can still mirror the analysis process.
Glimpse of data
cdc_smoker <- read_csv("data/smoking.csv")
Rows: 876 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): State, Location 1
dbl (5): Year, Smoke everyday, Smoke some days, Former smoker, Never smoked
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(cdc_smoker)
Rows: 876
Columns: 7
$ Year <dbl> 1996, 2005, 2005, 2002, 2003, 2000, 2002, 1996, 2006…
$ State <chr> "Puerto Rico", "Virgin Islands", "Puerto Rico", "Vir…
$ `Smoke everyday` <dbl> 9.4, 5.3, 7.9, 7.0, 26.3, 15.3, 15.0, 24.9, 17.7, 17…
$ `Smoke some days` <dbl> 5.1, 2.8, 5.2, 2.4, 7.8, 5.4, 6.2, 3.7, 5.7, 5.6, 4.…
$ `Former smoker` <dbl> 16.0, 12.8, 16.9, 12.1, 14.3, 28.3, 26.0, 23.7, 20.1…
$ `Never smoked` <dbl> 69.5, 79.1, 70.0, 78.5, 51.7, 50.9, 52.8, 47.6, 56.4…
$ `Location 1` <chr> "Puerto Rico", "Virgin Islands", "Puerto Rico", "Vir…
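The column names in the file (`Smoke everyday`, `Location 1`, and so on) differ from the snake_case names used in the variable description. One possible way to bring them in line is sketched below; the two rows are copied from the glimpse above, and the object names are hypothetical.

```r
library(dplyr)

# Column names exactly as printed in the glimpse; two rows copied from it.
smoke_sample <- tibble(
  Year              = c(1996, 2005),
  State             = c("Puerto Rico", "Virgin Islands"),
  `Smoke everyday`  = c(9.4, 5.3),
  `Smoke some days` = c(5.1, 2.8),
  `Former smoker`   = c(16.0, 12.8),
  `Never smoked`    = c(69.5, 79.1),
  `Location 1`      = c("Puerto Rico", "Virgin Islands")
)

# Lower-case every name and replace spaces with underscores.
cdc_smoker_clean <- smoke_sample |>
  rename_with(~ tolower(gsub(" ", "_", .x, fixed = TRUE)))
```

The result has the columns `year`, `state`, `smoke_everyday`, `smoke_some_days`, `former_smoker`, `never_smoked`, and `location_1`.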