Project R-S2dio - College Basketball

Proposal

library(tidyverse)

Data 1

Introduction and data

  • Identify the source of the data.

Our data comes from the 2018 Central Park Squirrel Census and is published on NYC Open Data.

https://data.cityofnewyork.us/Environment/2018-Central-Park-Squirrel-Census-Squirrel-Data/vfnx-vebw

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The data was collected in October 2018 by hundreds of volunteers and key NYC entities who tallied the squirrels in Central Park, New York on behalf of The Squirrel Census (the original data curator).

  • Write a brief description of the observations.

The dataset contains 3,023 squirrel sightings. Each observation records a unique squirrel ID, the date spotted, location, age, primary and secondary (highlight) fur color, elevation (ground plane or above ground), the squirrel's activities when spotted (running, chasing, climbing, foraging, eating, or other), sounds, tail movements, and interactions between squirrels and with humans.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

Research question: What is the relationship between squirrel location, squirrel climbing, and whether the squirrel runs from human spotters?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

Our hypothesis is that squirrels spotted on the ground plane will run from human spotters more often than squirrels above ground, because a spotter at the same elevation constitutes a greater threat. We further hypothesize that climbing squirrels will run from human spotters more often than both squirrels above ground and squirrels on the ground plane, because such squirrels may be climbing to escape the human.
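One way to check this hypothesis is to compare the share of squirrels that ran from spotters within each combination of location and climbing. The sketch below is only illustrative: `toy` is a made-up table that borrows the column names `Location`, `Climbing`, and `Runs from` from the census data, and `runs_from_rate()` is a hypothetical helper, not part of any package.

```r
library(dplyr)

# Made-up rows that reuse the census column names; the values are NOT real data.
toy <- tibble(
  Location = c("Ground Plane", "Ground Plane", "Above Ground", "Above Ground", NA),
  Climbing = c(FALSE, TRUE, TRUE, FALSE, FALSE),
  `Runs from` = c(TRUE, TRUE, FALSE, FALSE, FALSE)
)

# Hypothetical helper: share of sightings where the squirrel ran from the
# spotter, for each Location x Climbing combination.
runs_from_rate <- function(data) {
  data |>
    filter(!is.na(Location)) |>          # drop sightings with no recorded location
    group_by(Location, Climbing) |>
    summarize(
      n = n(),
      prop_runs_from = mean(`Runs from`),  # mean of a logical = proportion TRUE
      .groups = "drop"
    )
}

res <- runs_from_rate(toy)
res
```

Applied to the real data, the same call would be `runs_from_rate(Squirrels)`, assuming the data frame has been read in as shown in the glimpse section.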

  • Identify the types of variables in your research question. Categorical? Quantitative?

All three variables are categorical: location (ground plane or above ground), climbing (true/false), and runs from (true/false).

Literature

  • Find one published credible article on the topic you are interested in researching.

https://www.jstor.org/stable/3800817?seq=8

  • Provide a one paragraph summary about the article.

The article “Census Methods for Eastern Gray Squirrels” by Stephen H. Bouffard and Dale Hein examines the efficacy of various census methods for estimating squirrel populations in populated areas. The study found that visually counting squirrels can be an effective way to estimate populations, although it may leave squirrels in less visible locations undetected. The study also analyzes how various factors, such as weather conditions and food availability, affect squirrel visibility.

  • In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.

There is limited research on the variables that affect squirrel visibility. Our research question builds on this study’s work on the factors that influence squirrels’ propensity to venture into visible areas, focusing specifically on whether the presence of human spotters influences squirrel visibility. If the presence of human spotters is positively associated with squirrels staying hidden or retreating, human presence may be an additional factor affecting squirrel visibility that ought to be considered in future squirrel censuses.

Glimpse of data

Squirrels <- read_csv("data/2018_Central_Park_Squirrel_Census_-_Squirrel_Data.csv")
Rows: 3023 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): Unique Squirrel ID, Hectare, Shift, Age, Primary Fur Color, Highli...
dbl  (4): X, Y, Date, Hectare Squirrel Number
lgl (13): Running, Chasing, Climbing, Eating, Foraging, Kuks, Quaas, Moans, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(Squirrels)
Rows: 3,023
Columns: 31
$ X                                            <dbl> -73.95613, -73.96886, -73…
$ Y                                            <dbl> 40.79408, 40.78378, 40.77…
$ `Unique Squirrel ID`                         <chr> "37F-PM-1014-03", "21B-AM…
$ Hectare                                      <chr> "37F", "21B", "11B", "32E…
$ Shift                                        <chr> "PM", "AM", "PM", "PM", "…
$ Date                                         <dbl> 10142018, 10192018, 10142…
$ `Hectare Squirrel Number`                    <dbl> 3, 4, 8, 14, 5, 3, 2, 2, …
$ Age                                          <chr> NA, NA, NA, "Adult", "Adu…
$ `Primary Fur Color`                          <chr> NA, NA, "Gray", "Gray", "…
$ `Highlight Fur Color`                        <chr> NA, NA, NA, NA, "Cinnamon…
$ `Combination of Primary and Highlight Color` <chr> "+", "+", "Gray+", "Gray+…
$ `Color notes`                                <chr> NA, NA, NA, "Nothing sele…
$ Location                                     <chr> NA, NA, "Above Ground", N…
$ `Above Ground Sighter Measurement`           <chr> NA, NA, "10", NA, NA, NA,…
$ `Specific Location`                          <chr> NA, NA, NA, NA, "on tree …
$ Running                                      <lgl> FALSE, FALSE, FALSE, FALS…
$ Chasing                                      <lgl> FALSE, FALSE, TRUE, FALSE…
$ Climbing                                     <lgl> FALSE, FALSE, FALSE, FALS…
$ Eating                                       <lgl> FALSE, FALSE, FALSE, TRUE…
$ Foraging                                     <lgl> FALSE, FALSE, FALSE, TRUE…
$ `Other Activities`                           <chr> NA, NA, NA, NA, NA, NA, N…
$ Kuks                                         <lgl> FALSE, FALSE, FALSE, FALS…
$ Quaas                                        <lgl> FALSE, FALSE, FALSE, FALS…
$ Moans                                        <lgl> FALSE, FALSE, FALSE, FALS…
$ `Tail flags`                                 <lgl> FALSE, FALSE, FALSE, FALS…
$ `Tail twitches`                              <lgl> FALSE, FALSE, FALSE, FALS…
$ Approaches                                   <lgl> FALSE, FALSE, FALSE, FALS…
$ Indifferent                                  <lgl> FALSE, FALSE, FALSE, FALS…
$ `Runs from`                                  <lgl> FALSE, FALSE, FALSE, TRUE…
$ `Other Interactions`                         <chr> NA, NA, NA, NA, NA, NA, N…
$ `Lat/Long`                                   <chr> "POINT (-73.9561344937861…
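
The glimpse shows that `Date` is read as a number (for example 10142018) rather than as a date. A minimal sketch of one way to convert it is shown below, assuming the digits follow a month-day-year layout, which is consistent with the October 2018 collection period; `date_raw` and `date_parsed` are illustrative names.

```r
# Two example values copied from the glimpse output above.
date_raw <- c(10142018, 10192018)

# Pad to 8 digits so a single-digit month would keep its leading zero,
# then parse as month-day-year (assumed layout: MMDDYYYY).
date_parsed <- as.Date(sprintf("%08d", as.integer(date_raw)), format = "%m%d%Y")
date_parsed
```

On the full data this could be applied inside `mutate()`, for example to compare sightings across census days.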

Data 2

Introduction and data

  • Identify the source of the data.

This dataset covers the Division I college basketball seasons from 2013 to 2019. We found it on Kaggle, an open data-sharing site.

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The data was scraped from http://barttorvik.com/trank.php#. It was cleaned in 2021, and the COVID seasons were excluded from the dataset.

  • Write a brief description of the observations.

Each observation is one Division I basketball team in one season. The variables identify the team, such as the university it represents and the conference it belongs to, and record the number of games played, the number of games won, a power rating, and other team statistics for that season.

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

Research question: Which Division I basketball conference had the highest success rate across the 2013-2019 seasons (measured by number of wins, offensive rebound rate, total points, etc.)? Which factors other than wins (offensive and defensive efficiency, 2-point and 3-point field goal success, steal rate, turnover rate, etc.) best predict regular season success (win percentage) from 2013 to 2019? And do the same factors predict postseason success better or worse?
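
As a rough sketch of the first part of this question, conference-level win percentage can be computed by pooling games and wins within each conference. The table `toy_cbb` below is made up and only borrows the column names `CONF`, `G`, and `W` from the dataset; pooling across teams and seasons is one possible definition of "success rate", not the only one.

```r
library(dplyr)

# Made-up rows that reuse the cbb column names; the values are NOT real data.
toy_cbb <- tibble(
  CONF = c("ACC", "ACC", "B10", "B10"),
  G    = c(40, 30, 40, 30),
  W    = c(30, 15, 20, 15)
)

# Pool games and wins per conference, then rank by pooled win percentage.
conf_win_pct <- toy_cbb |>
  group_by(CONF) |>
  summarize(
    games   = sum(G),
    wins    = sum(W),
    win_pct = wins / games
  ) |>
  arrange(desc(win_pct))

conf_win_pct
```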

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

Our research topic explores the factors that may predict the success of Division I college basketball teams from 2013 to 2019. We hypothesize that offensive and defensive efficiency will be the best predictors of regular season success (win percentage), and that they will predict postseason success just as well.
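
One possible way to test this hypothesis is a linear regression of win percentage on the two efficiency ratings. The sketch below uses simulated data only: the column names `ADJOE` and `ADJDE` come from the dataset, while `win_pct`, the simulated coefficients, and the noise level are assumptions chosen for illustration. It also assumes that a lower `ADJDE` (fewer points allowed) indicates a better defense.

```r
set.seed(1)
n <- 50

# Simulated ratings; NOT real data.
toy_cbb <- data.frame(
  ADJOE = rnorm(n, mean = 105, sd = 7),
  ADJDE = rnorm(n, mean = 100, sd = 6)
)

# Simulated win percentage: rises with offensive rating and falls with
# defensive rating (assumption: lower ADJDE means a better defense).
toy_cbb$win_pct <- 0.5 +
  0.02 * (toy_cbb$ADJOE - 105) -
  0.02 * (toy_cbb$ADJDE - 100) +
  rnorm(n, sd = 0.05)

# Linear model with both efficiency ratings as predictors.
fit <- lm(win_pct ~ ADJOE + ADJDE, data = toy_cbb)

coef(fit)               # estimated intercept and slopes
summary(fit)$r.squared  # share of variance in win_pct explained
```

With the real data, `win_pct` would first be computed as `W / G`, and the fitted R-squared could be compared against models that use other predictors.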

  • Identify the types of variables in your research question. Categorical? Quantitative?

Conference (categorical)

Wins (quantitative)

Losses (quantitative)

Adjusted offensive rating (quantitative)

Adjusted defensive rating (quantitative)

3pt success rate (allowed and committed) (quantitative)

2pt success rate (allowed and committed) (quantitative)

Field goal success rate (quantitative)

Effective field goal percentage (allowed and committed) (quantitative)

Rebound and turnover rates (allowed and committed) (quantitative)

How to calculate success

Postseason: how far the team advanced in March Madness (not in tournament, Round of 64, Round of 32, Sweet 16, Elite 8, Final 4, runner-up, champion) (categorical)

Regular season: tournament seed (categorical, but easily converted to quantitative) and win percentage (quantitative)
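
These two success measures could be coded roughly as follows. This is a sketch under assumptions: only the labels "2ND" and "Champions" are visible in the glimpse of `POSTSEASON`, so the other round labels in `round_levels` are assumed, as is the choice to treat a missing value as "Not in tournament". The rows in `toy_cbb` are made up.

```r
# Assumed round labels, ordered from earliest exit to winning the title.
# Only "2ND" and "Champions" are confirmed by the glimpse output.
round_levels <- c("Not in tournament", "R68", "R64", "R32",
                  "S16", "E8", "F4", "2ND", "Champions")

# Made-up rows that reuse the cbb column names; the values are NOT real data.
toy_cbb <- data.frame(
  G = c(40, 38, 31),
  W = c(33, 31, 12),
  POSTSEASON = c("2ND", "Champions", NA)
)

# Regular season success: win percentage.
toy_cbb$win_pct <- toy_cbb$W / toy_cbb$G

# Postseason success: ordered factor, treating NA as not in the tournament.
postseason_filled <- ifelse(is.na(toy_cbb$POSTSEASON),
                            "Not in tournament", toy_cbb$POSTSEASON)
toy_cbb$postseason_round <- factor(postseason_filled,
                                   levels = round_levels, ordered = TRUE)

toy_cbb
```

An ordered factor keeps the variable categorical while still allowing comparisons such as "advanced further than".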

Literature

  • Find one published credible article on the topic you are interested in researching.

  • Provide a one paragraph summary about the article.

    • This article examines the success of conferences since the NCAA tournament adopted a 64-team format (now 68). The ACC leads statistically and is the dominant postseason conference; postseason play is argued to be the best basis for comparison, since teams from the same conference rarely play each other in the tournament. The ACC also has a winning record against 29 of the 32 conferences, is the only conference with a double-digit number of championships (10 titles since 1985), and has the highest win percentage. (*This article is also reputable because it uses data directly from the NCAA and was written and researched by the NCAA.)
  • In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.

    • Our research question asks why this pattern exists. Our data should show the same pattern, but it should also suggest predictors for it: why do some teams achieve more success than others, and which variables predict that success?

Glimpse of data

cbb <- read_csv("data/cbb.csv")
Rows: 2455 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): TEAM, CONF, POSTSEASON
dbl (21): G, W, ADJOE, ADJDE, BARTHAG, EFG_O, EFG_D, TOR, TORD, ORB, DRB, FT...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(cbb)
Rows: 2,455
Columns: 24
$ TEAM       <chr> "North Carolina", "Wisconsin", "Michigan", "Texas Tech", "G…
$ CONF       <chr> "ACC", "B10", "B10", "B12", "WCC", "SEC", "B10", "ACC", "AC…
$ G          <dbl> 40, 40, 40, 38, 39, 40, 38, 39, 38, 39, 40, 40, 40, 40, 36,…
$ W          <dbl> 33, 36, 33, 31, 37, 29, 30, 35, 35, 33, 35, 36, 32, 35, 27,…
$ ADJOE      <dbl> 123.3, 129.1, 114.4, 115.2, 117.8, 117.2, 121.5, 125.2, 123…
$ ADJDE      <dbl> 94.9, 93.6, 90.4, 85.2, 86.3, 96.2, 93.7, 90.6, 89.9, 91.5,…
$ BARTHAG    <dbl> 0.9531, 0.9758, 0.9375, 0.9696, 0.9728, 0.9062, 0.9522, 0.9…
$ EFG_O      <dbl> 52.6, 54.8, 53.9, 53.5, 56.6, 49.9, 54.6, 56.6, 55.2, 51.7,…
$ EFG_D      <dbl> 48.1, 47.7, 47.7, 43.0, 41.1, 46.0, 48.0, 46.5, 44.7, 48.1,…
$ TOR        <dbl> 15.4, 12.4, 14.0, 17.7, 16.2, 18.1, 14.6, 16.3, 14.7, 16.2,…
$ TORD       <dbl> 18.2, 15.8, 19.5, 22.8, 17.1, 16.1, 18.7, 18.6, 17.5, 18.6,…
$ ORB        <dbl> 40.7, 32.1, 25.5, 27.4, 30.0, 42.0, 32.5, 35.8, 30.4, 41.3,…
$ DRB        <dbl> 30.0, 23.7, 24.9, 28.7, 26.2, 29.7, 29.4, 30.2, 25.4, 25.0,…
$ FTR        <dbl> 32.3, 36.2, 30.7, 32.9, 39.0, 51.8, 28.4, 39.8, 29.1, 34.3,…
$ FTRD       <dbl> 30.4, 22.4, 30.0, 36.6, 26.9, 36.8, 22.7, 23.9, 26.3, 31.6,…
$ `2P_O`     <dbl> 53.9, 54.8, 54.7, 52.8, 56.3, 50.0, 53.4, 55.9, 52.5, 51.0,…
$ `2P_D`     <dbl> 44.6, 44.7, 46.8, 41.9, 40.0, 44.9, 47.6, 46.3, 45.7, 46.3,…
$ `3P_O`     <dbl> 32.7, 36.5, 35.2, 36.5, 38.2, 33.2, 37.9, 38.7, 39.5, 35.5,…
$ `3P_D`     <dbl> 36.2, 37.5, 33.2, 29.7, 29.0, 32.2, 32.6, 31.4, 28.9, 33.9,…
$ ADJ_T      <dbl> 71.7, 59.3, 65.9, 67.5, 71.5, 65.9, 64.8, 66.4, 60.7, 72.8,…
$ WAB        <dbl> 8.6, 11.3, 6.9, 7.0, 7.7, 3.9, 6.2, 10.7, 11.1, 8.4, 8.9, 1…
$ POSTSEASON <chr> "2ND", "2ND", "2ND", "2ND", "2ND", "2ND", "2ND", "Champions…
$ SEED       <dbl> 1, 1, 3, 3, 1, 8, 4, 1, 1, 1, 2, 1, 7, 1, 4, 3, 6, 1, 2, 9,…
$ YEAR       <dbl> 2016, 2015, 2018, 2019, 2017, 2014, 2013, 2015, 2019, 2017,…

Data 3

Introduction and data

  • Identify the source of the data.

This dataset was found on data.world.

https://data.world/publicsafety/people-killed-by-police-in-us

  • State when and how it was originally collected (by the original data curator, not necessarily how you found the data).

The data was compiled by the Guardian, using verified crowdsourcing, to record deaths caused by police in 2016, with the purpose of monitoring demographics and telling the stories of how people died. The data changes as new information comes in; this copy was downloaded on March 10, 2023.

  • Write a brief description of the observations.

Each observation is one person killed by police and records the victim's name, age, gender, and ethnicity, whether they were armed, where they were killed, and which agency killed them (e.g., the police department the officer was based in).

Research question

  • A well formulated research question. (You may include more than one research question if you want to receive feedback on different ideas for your project. However, one per data set is required.)

Which groups, in terms of gender, armed status, race, and location, are more likely to be killed by police?

  • A description of the research topic along with a concise statement of your hypotheses on this topic.

This topic examines which groups were more likely to be victims of deadly police violence in 2016. We will examine gender, race, and location by state in relation to deadly police force. We hope this helps us identify which areas of the country and which biases to focus on first when addressing the problem of deadly police force. We hypothesize that males, younger people, Black people, and people in the most populous states will most often be victims of deadly police force.
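
A first pass at this hypothesis could tabulate how many deaths fall into each group and what share of the total each group represents. The sketch below uses a made-up table, `toy_counted`, that borrows the column names `gender`, `raceethnicity`, and `state` from the dataset; it shows raw counts and shares only, not rates adjusted for population.

```r
library(dplyr)

# Made-up rows that reuse the dataset's column names; the values are NOT real data.
toy_counted <- tibble(
  gender        = c("Male", "Male", "Female", "Male"),
  raceethnicity = c("White", "Black", "White", "Black"),
  state         = c("CA", "NC", "CA", "LA")
)

# Count deaths per state, most frequent first, with each state's share of the total.
by_state <- toy_counted |>
  count(state, sort = TRUE) |>
  mutate(share = n / sum(n))

# Same tabulation by gender.
by_gender <- toy_counted |>
  count(gender) |>
  mutate(share = n / sum(n))

by_state
by_gender
```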

  • Identify the types of variables in your research question. Categorical? Quantitative?

Gender (categorical)

Age (quantitative)

Race (categorical)

City (categorical)

State (categorical)

Armed, and if so, how (categorical)

Literature

  • Find one published credible article on the topic you are interested in researching.

https://spssi.onlinelibrary.wiley.com/doi/full/10.1111/josi.12246

  • Provide a one paragraph summary about the article.

This article examines deaths caused by police, as well as the deaths of on-duty and off-duty police officers who were mistaken for civilians. It details how the statistics differ by race: more white people are killed when armed than when unarmed, while more Black people are killed when unarmed than when armed. In 2015 and 2016, Black people also made up a larger percentage of the unarmed people killed. The article then reports that Black off-duty officers are more likely than white officers to be killed by police violence. Finally, it examines the psychology behind this effect, including misinformation, weapon perception, and shooter bias.

  • In 1-2 sentences, explain how your research question builds on / is different than the article you have cited.

The article addresses race and armed status, which are part of our question; we will build on it by also examining specific locations and gender.

Glimpse of data

the_counted_2016 <- read_csv("data/the-counted-2016.csv")
Rows: 1035 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): name, age, gender, raceethnicity, month, streetaddress, city, stat...
dbl  (3): uid, day, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(the_counted_2016)
Rows: 1,035
Columns: 14
$ uid                  <dbl> 20161, 20162, 20163, 20164, 20165, 20166, 20167, …
$ name                 <chr> "Joshua Sisson", "Germonta Wallace", "Sean O'Brie…
$ age                  <chr> "30", "30", "37", "22", "27", "54", "29", "52", "…
$ gender               <chr> "Male", "Male", "Male", "Male", "Male", "Male", "…
$ raceethnicity        <chr> "White", "Black", "White", "Black", "Black", "Whi…
$ month                <chr> "January", "January", "January", "January", "Janu…
$ day                  <dbl> 1, 3, 2, 4, 4, 5, 5, 5, 6, 5, 5, 7, 7, 9, 10, 8, …
$ year                 <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2…
$ streetaddress        <chr> "4200 6th Ave", "2600 Watson Dr", "100 Washington…
$ city                 <chr> "San Diego", "Charlotte", "Livingston", "Oklahoma…
$ state                <chr> "CA", "NC", "MT", "OK", "LA", "PA", "WA", "CA", "…
$ classification       <chr> "Gunshot", "Gunshot", "Gunshot", "Gunshot", "Guns…
$ lawenforcementagency <chr> "San Diego Police Department", "Charlotte-Mecklen…
$ armed                <chr> "Knife", "Firearm", "Knife", "Firearm", "Unknown"…
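
The glimpse shows that `age` is stored as text (`<chr>`) even though the proposal treats it as quantitative. A minimal sketch of a conversion is given below. It assumes the column contains some non-numeric entries, which is a guess; "Unknown" here is a hypothetical example of such an entry, and `age_raw` and `age_num` are illustrative names.

```r
# Example values; "Unknown" is a hypothetical non-numeric entry.
age_raw <- c("30", "37", "Unknown")

# Convert to numeric; entries that are not numbers become NA.
# suppressWarnings() hides the coercion warning that as.numeric() emits.
age_num <- suppressWarnings(as.numeric(age_raw))

age_num
sum(is.na(age_num))  # how many ages could not be converted
```

Checking which original values became `NA` before dropping them would show whether any ages are recoverable.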