Spotify, Sports, and Soup

Proposal

library(tidyverse)

Data 1

Introduction and data

  • Identify the source of the data.

    • This data set comes from Spotify via the spotifyr package.
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The spotifyr package was authored by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff to make it easier to get data or general metadata on songs from Spotify.
  • Write a brief description of the observations.

    • This dataset contains 32,833 rows of songs drawn from playlists in 6 main music categories (EDM, Latin, Pop, R&B, Rap, and Rock). For each song, it includes general information such as the song's album, artist, and release date, as well as more interesting statistics on danceability, energy, and popularity.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    • Which combination of variables would be the best predictor of genre?

    • How does danceability affect the popularity of songs in the Latin genre? How about in the Rap genre? For which genre does danceability matter most in determining popularity?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    • Our hypothesis is that different genres will ultimately require different variables to predict, but that as a single general predictor, the combination of instrumentalness and danceability will best distinguish genre.
    • For the second question, we hypothesize that danceability contributes more to popularity for Latin music than for Rap music. We also hypothesize that, across all genres, danceability contributes the most to popularity for Latin music.
  • Identify the types of variables in your research question. Categorical? Quantitative?

    • The categorical variables give general information about each song, such as the album name, artist name, and genre. The quantitative variables measure features of the song, such as danceability, energy, loudness, and popularity.
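
As a rough sketch of how the second research question could be approached, the code below fits a separate simple linear regression of popularity on danceability within each genre and compares the slopes. The column names (playlist_genre, danceability, track_popularity) come from the glimpse of spotify_songs; the toy_songs values and the slope_by_genre() helper are made up for illustration and are not results from the real data.

```r
library(dplyr)

# Toy stand-in for spotify_songs: values are invented for illustration only.
toy_songs <- tibble(
  playlist_genre   = rep(c("latin", "rap"), each = 4),
  danceability     = c(0.2, 0.4, 0.6, 0.8, 0.2, 0.4, 0.6, 0.8),
  track_popularity = c(20, 40, 60, 80, 50, 55, 60, 65)
)

# Hypothetical helper: fit track_popularity ~ danceability within each genre
# and return one row per genre with the estimated slope, largest first.
slope_by_genre <- function(songs) {
  songs %>%
    group_by(playlist_genre) %>%
    group_modify(function(genre_rows, key) {
      fit <- lm(track_popularity ~ danceability, data = genre_rows)
      tibble(
        n_songs = nrow(genre_rows),
        slope   = coef(fit)[["danceability"]]
      )
    }) %>%
    ungroup() %>%
    arrange(desc(slope))
}

genre_slopes <- slope_by_genre(toy_songs)
print(genre_slopes)
```

Passing spotify_songs instead of toy_songs should give one slope per genre. A single model with a danceability-by-genre interaction would be an alternative way to test whether the slopes differ.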

Literature

  • Find one published credible article on the topic you are interested in researching.

    • https://www.analyticssteps.com/blogs/how-spotify-using-big-data
  • Provide a one paragraph summary about the article.

    • This article provides insight into how Spotify uses its data to provide a better experience for its users. The main ways Spotify uses data are to develop personalized content, digitize the user’s taste, enhance marketing through targeted ads, continuously update its system, and create Spotify Wrapped. Spotify has not only claimed a spot in an incredibly competitive industry but has also beaten impressive competitors such as Apple and Amazon, due to its reliance on data to provide users with music and content they will enjoy.
  • In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.

    • Some of our research questions focus on how different factors affect popularity. This question is critical to consider because if we can adequately predict popularity, then, similar to Spotify, we may be able to predict which songs people will like as soon as they’re released.

Glimpse of data

spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
Rows: 32833 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): track_id, track_name, track_artist, track_album_id, track_album_na...
dbl (13): track_popularity, danceability, energy, key, loudness, mode, speec...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tuesdata <- tidytuesdayR::tt_load('2020-01-21') 
--- Compiling #TidyTuesday Information for 2020-01-21 ----
--- There is 1 file available ---
--- Starting Download ---

    Downloading file 1 of 1: `spotify_songs.csv`
--- Download complete ---
spotify_songs <- tuesdata$spotify_songs

glimpse(spotify_songs)
Rows: 32,833
Columns: 23
$ track_id                 <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa…
$ track_name               <chr> "I Don't Care (with Justin Bieber) - Loud Lux…
$ track_artist             <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th…
$ track_popularity         <dbl> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6…
$ track_album_id           <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6…
$ track_album_name         <chr> "I Don't Care (with Justin Bieber) [Loud Luxu…
$ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20…
$ playlist_name            <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R…
$ playlist_id              <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf…
$ playlist_genre           <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po…
$ playlist_subgenre        <chr> "dance pop", "dance pop", "dance pop", "dance…
$ danceability             <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4…
$ energy                   <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8…
$ key                      <dbl> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,…
$ loudness                 <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38…
$ mode                     <dbl> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, …
$ speechiness              <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127…
$ acousticness             <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, …
$ instrumentalness         <dbl> 0.00e+00, 4.21e-03, 2.33e-05, 9.43e-06, 0.00e…
$ liveness                 <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143…
$ valence                  <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1…
$ tempo                    <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1…
$ duration_ms              <dbl> 194754, 162600, 176616, 169093, 189052, 16304…

Data 2

Introduction and data

  • Identify the source of the data.

    • The data comes from the Tidy Tuesday collegiate sports budgets dataset (2022-03-29).
  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

    • The data was originally collected through the Equity in Athletics Data Analysis (EADA) database. The topic is also covered in this NPR article: https://www.npr.org/2021/10/27/1049530975/ncaa-spends-more-on-mens-sports-report-reveals
  • Write a brief description of the observations.

    • The dataset looks at collegiate sports programs and their teams, including data for both their men’s and women’s teams. The observations are broken down by participation numbers for men’s and women’s teams in each sport in the program, the division of the program (DI vs. DII vs. DIII), and the expenditures of the women’s vs. men’s programs.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

    • Are there expenditure inequities in college sports across divisions (I vs. II vs. III) and gender? Have these gender expenditure inequities been improving with time?
  • A description of the research topic along with a concise statement of your hypotheses on this topic.

    • We hypothesize that there are expenditure differences between divisions, because what differentiates divisions largely relates to the resources of the athletic department and school size (and therefore money). Looking specifically at Division I sports programs, we expect expenditure differences between men’s and women’s teams, but we expect these gaps to have narrowed over the years as a result of societal changes.
  • Identify the types of variables in your research question. Categorical? Quantitative?

    • Expenditures by gender (men’s and women’s expenditures are recorded in separate columns, exp_men and exp_women) = quantitative

    • Classification (division) = categorical

    • Total expenditure (total_exp_menwomen, for men and women combined) = quantitative

    • Year = quantitative

    • Sport = categorical
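
One possible way to turn these variables into a comparison is to compute spending per participant for men and women within each year and division, then look at the gap. The sketch below assumes the column names shown in the glimpse of sports (year, classification_name, exp_men, exp_women, sum_partic_men, sum_partic_women); the toy_sports values and the spending_gap() helper are made up for illustration and are not results from the real data.

```r
library(dplyr)

# Toy stand-in for sports: values are invented for illustration only.
toy_sports <- tibble(
  year                = c(2015, 2015, 2016, 2016),
  classification_name = rep("NCAA Division I-FCS", 4),
  exp_men             = c(400000, NA, 500000, 300000),
  exp_women           = c(NA, 225000, 450000, 270000),
  sum_partic_men      = c(40, 0, 50, 30),
  sum_partic_women    = c(0, 30, 50, 30)
)

# Hypothetical helper: total expenditures and participants per year and
# division, then spending per participant and the men-minus-women gap.
spending_gap <- function(sports) {
  sports %>%
    group_by(year, classification_name) %>%
    summarise(
      total_exp_men   = sum(exp_men, na.rm = TRUE),
      total_exp_women = sum(exp_women, na.rm = TRUE),
      total_men       = sum(sum_partic_men, na.rm = TRUE),
      total_women     = sum(sum_partic_women, na.rm = TRUE),
      .groups = "drop"
    ) %>%
    mutate(
      per_man   = total_exp_men / total_men,
      per_woman = total_exp_women / total_women,
      gap       = per_man - per_woman
    )
}

gap_table <- spending_gap(toy_sports)
print(gap_table)
```

Dividing by participants is a design choice: raw totals would mostly reflect roster sizes (football alone has very large rosters), so a per-participant figure may be a fairer basis for comparing divisions and years.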

Literature

  • Find one published credible article on the topic you are interested in researching.

    • https://www.npr.org/2021/10/27/1049530975/ncaa-spends-more-on-mens-sports-report-reveals
  • Provide a one paragraph summary about the article.

    • This article looks at the average amount of money the NCAA spends on men vs. women. An NCAA gender equity report found that, on average, the NCAA spends more money on male athletes than on female athletes. For example, looking at Division I and national championship participants (excluding basketball) in 2018-2019, spending on female participants was on average $1,700 less than on male participants. The report also found that this gap is greater in the six single-gender sports.
  • In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.

    • Our research question will also analyze gender equity in college sports, but we will additionally analyze how the gap has changed over time and how it differs between divisions. We will also look at equity differences between sports.

Glimpse of data

sports <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-03-29/sports.csv')
Rows: 132327 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): institution_name, city_txt, state_cd, zip_text, classification_nam...
dbl (20): year, unitid, classification_code, ef_male_count, ef_female_count,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(sports)
Rows: 132,327
Columns: 28
$ year                 <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2…
$ unitid               <dbl> 100654, 100654, 100654, 100654, 100654, 100654, 1…
$ institution_name     <chr> "Alabama A & M University", "Alabama A & M Univer…
$ city_txt             <chr> "Normal", "Normal", "Normal", "Normal", "Normal",…
$ state_cd             <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "…
$ zip_text             <chr> "35762", "35762", "35762", "35762", "35762", "357…
$ classification_code  <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1…
$ classification_name  <chr> "NCAA Division I-FCS", "NCAA Division I-FCS", "NC…
$ classification_other <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ ef_male_count        <dbl> 1923, 1923, 1923, 1923, 1923, 1923, 1923, 1923, 1…
$ ef_female_count      <dbl> 2300, 2300, 2300, 2300, 2300, 2300, 2300, 2300, 2…
$ ef_total_count       <dbl> 4223, 4223, 4223, 4223, 4223, 4223, 4223, 4223, 4…
$ sector_cd            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ sector_name          <chr> "Public, 4-year or above", "Public, 4-year or abo…
$ sportscode           <dbl> 1, 2, 3, 7, 8, 15, 16, 22, 26, 33, 1, 2, 3, 8, 12…
$ partic_men           <dbl> 31, 19, 61, 99, 9, NA, NA, 7, NA, NA, 32, 13, NA,…
$ partic_women         <dbl> NA, 16, 46, NA, NA, 21, 25, 10, 16, 9, NA, 20, 68…
$ partic_coed_men      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ partic_coed_women    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ sum_partic_men       <dbl> 31, 19, 61, 99, 9, 0, 0, 7, 0, 0, 32, 13, 0, 10, …
$ sum_partic_women     <dbl> 0, 16, 46, 0, 0, 21, 25, 10, 16, 9, 0, 20, 68, 7,…
$ rev_men              <dbl> 345592, 1211095, 183333, 2808949, 78270, NA, NA, …
$ rev_women            <dbl> NA, 748833, 315574, NA, NA, 410717, 298164, 13114…
$ total_rev_menwomen   <dbl> 345592, 1959928, 498907, 2808949, 78270, 410717, …
$ exp_men              <dbl> 397818, 817868, 246949, 3059353, 83913, NA, NA, 9…
$ exp_women            <dbl> NA, 742460, 251184, NA, NA, 432648, 340259, 11388…
$ total_exp_menwomen   <dbl> 397818, 1560328, 498133, 3059353, 83913, 432648, …
$ sports               <chr> "Baseball", "Basketball", "All Track Combined", "…

Data 3

Introduction and data

  • Identify the source of the data.

This data comes from The Ramen Rater’s Big List.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

This data was created and collected by the founder of The Ramen Rater, Hans Lienesch. He has been eating and reviewing different brands and flavors of ramen since 2002. He has reviewed about 4,300 types of ramen and recorded their flavors, brands, and ratings. The version of the dataset we use, from the 2019-06-04 Tidy Tuesday release, contains 3,180 of these reviews.

  • Write a brief description of the observations.

For each of his reviews, he includes the review number, brand of ramen, variety, style, country, and number of stars. There are many varieties offered by the same brand. He has rated the flavors within each brand to create a comprehensive rating system.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

Do ramen ratings differ by style: are cups, packs, boxes, or bowls rated higher overall? Which brand of ramen has the highest rating? Do certain countries produce ramen with higher ratings?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

This research topic analyzes ramen preferences based on origin, style, and brand. We hypothesize that ramen brands that originate from Japan will have higher ratings, because ramen is a traditional Japanese dish. Additionally, we hypothesize that bowls will have higher ratings, because their packaging is more consistent with a traditional serving style.

  • Identify the types of variables in your research question. Categorical? Quantitative?

There are multiple categorical variables: brand, variety, style, and country. The quantitative variables are review number and stars, although review number serves as an identifier for each review rather than a measurement.
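
A first pass at the research question could be to summarise stars within each level of a categorical variable. The sketch below assumes the column names shown in the glimpse of ramen_ratings (style, country, stars); the toy_ramen values and the mean_stars_by() helper are made up for illustration and are not results from the real data.

```r
library(dplyr)

# Toy stand-in for ramen_ratings: values are invented for illustration only.
toy_ramen <- tibble(
  style   = c("Cup", "Cup", "Pack", "Pack", "Bowl"),
  country = c("Japan", "Thailand", "Japan", "USA", "Japan"),
  stars   = c(3.5, 4.5, 2, NA, 5)
)

# Hypothetical helper: drop reviews with a missing rating, then report the
# number of reviews and the mean stars for each level of one grouping column.
mean_stars_by <- function(ramen, group_col) {
  ramen %>%
    filter(!is.na(stars)) %>%
    group_by(across(all_of(group_col))) %>%
    summarise(
      n_reviews  = n(),
      mean_stars = mean(stars),
      .groups = "drop"
    ) %>%
    arrange(desc(mean_stars))
}

by_style <- mean_stars_by(toy_ramen, "style")
print(by_style)
```

The same helper can be called with "country" or "brand". Keeping n_reviews next to the mean matters because a group with only one or two reviews can top the ranking by chance, so a minimum review count may be worth applying before comparing groups.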

Literature

  • Find one published credible article on the topic you are interested in researching.

https://towardsdatascience.com/exploring-the-world-of-ramen-through-text-analytics-1131280c4c6b

  • Provide a one paragraph summary about the article.

This article analyzes the Big List to discover the most popular ramen themes around the world. Specifically, it looks for the use of specific words within the names of popular ramen brands. Because the list includes ramen data from 38 countries, it is informative about which words appear most frequently in ramen names. The article’s author found unigrams such as “noodles”, “noodle”, “flavour”, “chicken”, and “cup”, computed the frequency of these words within the dataset, and then checked whether they made sense in product titles. Some frequently used words had to be modified for clarity, such as “tom” to “tom yum”, a kind of soup. The author then organized these themes by country and found that certain countries produce certain themes of ramen. In the USA, for example, “chicken”, “soup”, and “spicy” appeared frequently. These distributions were then visualized on a map. The article shows the many ways this unique dataset can be modified and examined, and the information extracted from the data helped inform the author’s ramen purchases.

  • In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.

Our research question analyzes which country has the highest-rated ramen. The article looked into popular types of ramen by country, while ours will analyze which country produces the best ramen.

Glimpse of data

ramen_ratings <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-06-04/ramen_ratings.csv")
Rows: 3180 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): brand, variety, style, country
dbl (2): review_number, stars

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(ramen_ratings)
Rows: 3,180
Columns: 6
$ review_number <dbl> 3180, 3179, 3178, 3177, 3176, 3175, 3174, 3173, 3172, 31…
$ brand         <chr> "Yum Yum", "Nagatanien", "Acecook", "Maison de Coree", "…
$ variety       <chr> "Tem Tem Tom Yum Moo Deng", "tom Yum Kung Rice Vermicell…
$ style         <chr> "Cup", "Pack", "Cup", "Cup", "Tray", "Cup", "Pack", "Pac…
$ country       <chr> "Thailand", "Japan", "Japan", "France", "Japan", "Japan"…
$ stars         <dbl> 3.75, 2.00, 2.50, 3.75, 5.00, 3.50, 3.75, 5.00, 3.50, 4.…