Stat Wizards Project Proposal

Proposal

Author

Stat Wizards: Amelia, Sophia, Weston, and Elliot

Published

March 10, 2023

library(tidyverse)

Data 1

Introduction and data

  • Identify the source of the data.

The source of the data is ai-jobs.net, a website for posting tech jobs and analyzing trends in job data.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The site collects salary information anonymously from professionals all over the world in the AI/ML/Data Science space and makes it publicly available. The vast majority of the data was collected in 2023, 2022, and 2021.

  • Write a brief description of the observations.

Each observation records one person’s job information, including work year, employment type, job title, salary, employee residence, company location, etc.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

How does your job within the field of tech (data scientist, machine learning engineer, data analyst), along with whether or not you work in the US, affect your salary?

Another possible question:

What is the probability that someone is a data scientist or a machine learning engineer based on their salary and whether or not they work in the US?
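One way this second question could be approached is with logistic regression. The following is only a minimal sketch on simulated data: the data frame `toy`, the columns `salary_k`, `us_based`, and `is_mle`, and the simulated relationship are all hypothetical and are not taken from salaries.csv.

```r
# Simulated stand-in for the salary data; every value here is made up.
set.seed(1)
n <- 200
toy <- data.frame(
  salary_k = round(runif(n, 50, 250)),                 # salary in thousands of USD
  us_based = sample(c(TRUE, FALSE), n, replace = TRUE)  # hypothetical US indicator
)

# Hypothetical relationship: higher salary makes "machine learning engineer" more likely.
# In this simulation us_based has no true effect.
toy$is_mle <- rbinom(n, 1, plogis(-2 + 0.015 * toy$salary_k))

# Logistic regression: probability of being a machine learning engineer (1)
# rather than a data scientist (0), given salary and US status.
fit <- glm(is_mle ~ salary_k + us_based, data = toy, family = binomial)

# Predicted probability for one hypothetical worker.
new_worker <- data.frame(salary_k = 150, us_based = TRUE)
pred <- predict(fit, newdata = new_worker, type = "response")
pred
```

With the real data, the response would presumably be built by filtering job_title to the two roles of interest before fitting.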

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

This research topic aims to understand how much various tech jobs pay on average. It also intends to identify whether working for a US-based company affects salary. Our hypothesis is that US-based workers will have higher mean salaries, and that data science/analyst jobs will have lower salaries on average than machine learning jobs.

  • Identify the types of variables in your research question. Categorical? Quantitative?

We would most likely use the salary_in_usd variable as our response variable, which is quantitative, along with categorical explanatory variables including job_title and employee_residence.
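As a rough illustration of how these variables might be combined, the sketch below derives a US/non-US indicator from employee_residence and compares mean salary_in_usd across job titles. The column names match salaries.csv, but `toy_salaries`, its values, and the `us_based` column are made up for the example.

```r
library(dplyr)

# Toy rows using the same column names as salaries.csv; the values are invented.
toy_salaries <- tibble(
  job_title = c("Data Scientist", "Data Scientist", "Data Scientist",
                "Machine Learning Engineer", "Machine Learning Engineer",
                "Data Analyst", "Data Analyst"),
  employee_residence = c("US", "US", "DE", "US", "IN", "US", "GB"),
  salary_in_usd = c(150000, 130000, 90000, 170000, 60000, 110000, 70000)
)

# Collapse residence into a two-level categorical variable, then
# summarize the quantitative response within each job title / location group.
salary_summary <- toy_salaries %>%
  mutate(us_based = if_else(employee_residence == "US", "US", "Non-US")) %>%
  group_by(job_title, us_based) %>%
  summarize(mean_salary = mean(salary_in_usd), n = n(), .groups = "drop")

salary_summary
```

The same grouped summary could later be replaced by a linear model with job title and US status as explanatory variables.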

Literature

  • Find one published credible article on the topic you are interested in researching.

    https://www.projectpro.io/article/machine-learning-engineer-vs-data-scientist/534

  • Provide a one paragraph summary about the article.

    “According to Payscale, the salary of Data Scientists lie between the range of $85K and $134K. On the other hand, machine learning engineers earn somewhere between $93K and $149K (https://www.payscale.com/research/US/Job=Machine_Learning_Engineer/Salary). These figures are purely survey-based and may vary from place to place, company to company.”

    The article says that machine learning engineers tend to have higher average salaries because they tend to be software engineers who also have data science knowledge. However, it also says that salaries vary greatly from place to place.

  • In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.

    Our research question builds on the article by also including whether or not the worker lives in the US.

Glimpse of data

salaries <- read_csv("data/salaries.csv")

glimpse(salaries)
Rows: 3,046
Columns: 11
$ work_year          <dbl> 2023, 2023, 2023, 2023, 2023, 2023, 2023, 2023, 202…
$ experience_level   <chr> "MI", "SE", "SE", "SE", "SE", "SE", "SE", "SE", "SE…
$ employment_type    <chr> "FT", "FT", "FT", "FT", "FT", "FT", "FT", "FT", "FT…
$ job_title          <chr> "Financial Data Analyst", "Data Scientist", "Data S…
$ salary             <dbl> 130000, 205000, 140000, 297300, 198200, 141288, 941…
$ salary_currency    <chr> "USD", "USD", "USD", "USD", "USD", "USD", "USD", "U…
$ salary_in_usd      <dbl> 130000, 205000, 140000, 297300, 198200, 141288, 941…
$ employee_residence <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US…
$ remote_ratio       <dbl> 100, 0, 0, 100, 100, 0, 0, 100, 100, 0, 0, 100, 100…
$ company_location   <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US…
$ company_size       <chr> "L", "M", "M", "M", "M", "M", "M", "M", "M", "M", "…

Data 2

Introduction and data

  • Identify the source of the data.

    • This data set is from the World Bank Databank. 
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The data was originally collected from different national censuses; however, it was compiled by the World Bank.
  • Write a brief description of the observations.

    • This data set covers five years, one per decade: 1960, 1970, 1980, 1990, and 2000. It can be filtered by country as well as by gender. We chose to look at all countries for which data is provided and at both male and female immigrants.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    • How have the demographics and size of the U.S. immigrant population changed over these five decades?
  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    • With this research question, we intend to investigate the growth in the number of immigrants to the U.S. and where they are coming from. Do the countries sending the largest numbers of migrants to the U.S. stay the same over time, or do they change? Are there trends across regions or continents? We will look at when, and whether, there are major shifts in the countries of origin or spikes in the number of immigrants.

    • We expect that the number of people migrating to the U.S. will increase overall across countries, and that the main countries of origin will become increasingly non-European.

  • Identify the types of variables in your research question. Categorical? Quantitative?

    • In this data set, there are both categorical variables (country of origin, gender, destination country, etc.) and quantitative variables (the number of migrants to the U.S. in each year).

Literature

  • Find one published credible article on the topic you are interested in researching.

  • Provide a one paragraph summary about the article.

    • This article summarizes data collected by the Pew Research Center on foreign-born U.S. population trends. From 1960 to 2000 there was an upward trend in the size of the foreign-born U.S. population (from 9.7 million to 31.1 million people). Since 1970 there has been rapid growth: 4.7% of the U.S. population was foreign-born in 1970, and as of 2013 that number was up to 13.1%. In addition to this general increase, the demographics of the people migrating to the U.S. have changed, which is what the data set we are looking at breaks down. A significant factor in this shift is the passage of the Immigration and Nationality Act of 1965, which eliminated national-origin quotas and allowed immigrants from non-European countries to begin immigrating. For example, the top country of origin in 1960 and 1970 was Italy, while from 1980 to 2013 it was Mexico. The article then further investigates which states members of the foreign-born population live in, the languages they speak, their ages, and the amount of time they have spent living in the U.S.
  • In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.

    • Our research builds on the trends this article examines, but the article also focuses on domestic variables such as state of residence, language, and age, while we will focus mainly on trends in who is migrating. We intend to investigate further the trends this article touches on regarding the overall growth in immigration and the resulting change in the demographics of the immigrant population.

Glimpse of data

migration <- read.csv("data/migrationdata.csv")

glimpse(migration)
Rows: 237
Columns: 11
$ Country.Origin.Name      <chr> "Afghanistan", "Albania", "Algeria", "America…
$ Country.Origin.Code      <chr> "AFG", "ALB", "DZA", "ASM", "AND", "AGO", "AI…
$ Migration.by.Gender.Name <chr> "Total", "Total", "Total", "Total", "Total", …
$ Migration.by.Gender.Code <chr> "TOT", "TOT", "TOT", "TOT", "TOT", "TOT", "TO…
$ Country.Dest.Name        <chr> "United States", "United States", "United Sta…
$ Country.Dest.Code        <chr> "USA", "USA", "USA", "USA", "USA", "USA", "US…
$ X1960..1960.             <chr> "293", "10047", "322", "2519", "440", "141", …
$ X1970..1970.             <chr> "207", "10865", "533", "11528", "3", "316", "…
$ X1980..1980.             <chr> "4832", "7963", "4843", "9604", "1548", "1347…
$ X1990..1990.             <chr> "30146", "5844", "5577", "1870", "57", "3194"…
$ X2000..2000.             <chr> "44893", "40555", "11717", "16339", "5", "841…
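In the glimpse above, the five year columns are stored as character and laid out one column per decade, so they would likely need reshaping before any trend analysis. The following is a minimal sketch of one way to do that. The first two rows reuse values shown in the glimpse; the third row ("Exampleland") and the ".." missing-value marker are assumptions added for illustration, not something confirmed about migrationdata.csv.

```r
library(dplyr)
library(tidyr)

# Two rows copied from the glimpse output, plus one hypothetical row
# with an assumed ".." placeholder for a missing count.
toy_migration <- tibble(
  Country.Origin.Name = c("Afghanistan", "Albania", "Exampleland"),
  X1960..1960. = c("293", "10047", ".."),
  X1970..1970. = c("207", "10865", "15")
)

migration_long <- toy_migration %>%
  # One row per country and year instead of one column per year.
  pivot_longer(
    cols = starts_with("X"),
    names_to = "year",
    values_to = "migrants"
  ) %>%
  mutate(
    year = as.integer(substr(year, 2, 5)),        # "X1960..1960." becomes 1960
    migrants = as.numeric(na_if(migrants, ".."))  # assumed marker becomes NA
  )

migration_long
```

In this long format, year and migrants are both numeric, which makes it easier to plot counts over time by country of origin.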

Data 3

Introduction and data

  • Identify the source of the data.

This data is from FiveThirtyEight, a data science platform created by Nate Silver.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The data is updated daily based on soccer club match schedules. Because this is a collection of the latest matches, it dates back to 2019.

  • Write a brief description of the observations.

Each observation in the dataset is one game. The dataset contains the details of each match (date, teams playing, etc.), along with FiveThirtyEight’s predictions for goals scored, the percentage chance each team wins, and the actual score if the game has already been completed.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

How accurate is FiveThirtyEight’s model with regard to its expected goals stat? What is the average margin by which its predictions are off?

Different direction with same dataset: What are the most successful soccer clubs of the last 3 years by win percentage?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

We are investigating the efficacy of soccer predictions, as FiveThirtyEight is widely regarded as the most comprehensive predictive model for sports games. We want to understand how reliable FiveThirtyEight’s predictions are. Sports are difficult to make predictions for, so we’d predict that there is a fair deal of error in these predictions.

  • Identify the types of variables in your research question. Categorical? Quantitative?

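Largely quantitative: the projected scores (proj_score1, proj_score2) and actual scores (score1, score2) are quantitative, while the team and league names are categorical.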
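As a rough sketch of how the average prediction error could be measured, the example below computes the mean absolute error between projected and actual scores. It assumes that proj_score1 and proj_score2 are the predictions of interest and that unplayed matches have missing scores; `toy_matches` reuses the first three matches from the dataset's opening rows plus one hypothetical unplayed match.

```r
library(dplyr)

# First three matches from the dataset, plus one made-up unplayed match (NA scores).
toy_matches <- tibble(
  proj_score1 = c(1.75, 2.58, 1.22, 1.50),
  proj_score2 = c(0.84, 0.62, 1.89, 1.10),
  score1 = c(1L, 3L, 0L, NA),
  score2 = c(0L, 0L, 4L, NA)
)

prediction_error <- toy_matches %>%
  # Keep only completed matches, since unplayed ones have no actual score.
  filter(!is.na(score1), !is.na(score2)) %>%
  summarize(
    mae_team1 = mean(abs(score1 - proj_score1)),
    mae_team2 = mean(abs(score2 - proj_score2)),
    mae_overall = mean(c(abs(score1 - proj_score1), abs(score2 - proj_score2)))
  )

prediction_error
```

Mean absolute error is only one possible choice here; a squared-error measure would penalize large misses more heavily.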

Literature

  • Find one published credible article on the topic you are interested in researching.

https://fivethirtyeight.com/methodology/how-our-club-soccer-predictions-work/

  • Provide a one paragraph summary about the article.

This article discusses how FiveThirtyEight predicts matches and derives its expected goals and power index ratings. Based on previous match results, FiveThirtyEight updates what it calls an SPI, or Soccer Power Index, for each team, which captures that team's relative strength. Because goals do not always accurately reflect the performance of teams in an individual game, FiveThirtyEight bases its SPI on a stat called adjusted goals, which adjusts for overall team performance. With SPI, FiveThirtyEight predicts the percentage chance a match ends in a win for either team or in a draw. It also predicts expected goals for each match. Although this part is not relevant to our research, it also forecasts the odds of teams winning the league they compete in based on their record and SPI strength.

  • In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.

    The article describes how their model is derived. With our research we intend to analyze how good their model is, something that the article does not discuss.

Glimpse of data

soccer <- read.csv("data/spi_matches_latest.csv")

glimpse(soccer)
Rows: 11,410
Columns: 23
$ season      <int> 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019…
$ date        <chr> "2019-03-01", "2019-03-01", "2019-03-01", "2019-03-01", "2…
$ league_id   <int> 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979, 1979…
$ league      <chr> "Chinese Super League", "Chinese Super League", "Chinese S…
$ team1       <chr> "Shandong Luneng", "Guangzhou Evergrande", "Shanghai Green…
$ team2       <chr> "Guizhou Renhe", "Tianjin Quanujian", "Shanghai SIPG", "Be…
$ spi1        <dbl> 48.22, 65.59, 39.81, 32.25, 38.24, 31.99, 35.18, 51.59, 65…
$ spi2        <dbl> 37.83, 39.99, 60.08, 54.82, 40.45, 38.75, 45.83, 36.92, 36…
$ prob1       <dbl> 0.5755, 0.7832, 0.2387, 0.2276, 0.4403, 0.3966, 0.3400, 0.…
$ prob2       <dbl> 0.1740, 0.0673, 0.5203, 0.5226, 0.2932, 0.3252, 0.3715, 0.…
$ probtie     <dbl> 0.2505, 0.1495, 0.2410, 0.2498, 0.2665, 0.2783, 0.2885, 0.…
$ proj_score1 <dbl> 1.75, 2.58, 1.22, 1.10, 1.57, 1.41, 1.22, 1.92, 2.74, 1.81…
$ proj_score2 <dbl> 0.84, 0.62, 1.89, 1.79, 1.24, 1.25, 1.29, 0.78, 0.58, 0.78…
$ importance1 <dbl> 45.9, 77.1, 25.6, 35.8, 26.2, 40.5, 33.1, 51.4, 78.7, 46.3…
$ importance2 <dbl> 22.1, 28.8, 63.4, 58.9, 21.3, 24.6, 41.0, 25.9, 24.3, 32.3…
$ score1      <int> 1, 3, 0, 0, 2, 3, 1, 3, 1, 2, 0, 3, 3, 1, 2, 1, 4, 0, 0, 1…
$ score2      <int> 0, 0, 4, 1, 2, 1, 1, 2, 0, 2, 4, 3, 2, 0, 1, 2, 3, 1, 1, 2…
$ xg1         <dbl> 1.39, 0.49, 0.57, 1.12, 2.77, 1.33, 0.84, 3.28, 1.74, 1.64…
$ xg2         <dbl> 0.26, 0.45, 2.76, 0.97, 3.17, 0.65, 1.65, 0.62, 0.31, 1.20…
$ nsxg1       <dbl> 2.05, 1.05, 0.80, 1.51, 1.05, 0.88, 1.29, 1.51, 2.16, 2.52…
$ nsxg2       <dbl> 0.54, 0.75, 1.50, 0.94, 2.08, 1.72, 1.98, 0.41, 0.89, 0.53…
$ adj_score1  <dbl> 1.05, 3.15, 0.00, 0.00, 2.10, 2.61, 1.05, 2.61, 1.05, 2.10…
$ adj_score2  <dbl> 0.00, 0.00, 3.26, 1.05, 2.10, 1.05, 1.05, 2.10, 0.00, 2.10…