London Marathon Winners Since 1981

Proposal

library(tidyverse)

london_marathon <- read.csv("data/london_marathon.csv")
winners <- read.csv("data/winners.csv")

Dataset

Our dataset comes from the LondonMarathon R package, made by Nicola Rennie. The data was originally sourced from Wikipedia, as of November 1, 2022. It was also featured in the TidyTuesday project on April 25, 2023.

The dataset is made up of two csv files, winners.csv and london_marathon.csv, which can be joined by the variable year. winners.csv contains 163 observations and 5 variables. london_marathon.csv contains 42 observations (one per year from 1981-2022) and 8 variables. The variables in the london_marathon dataset include the year of the marathon, the number of applicants, how many people were accepted, the number of starters, the number of finishers, how much money was raised, and the official charity for the race. The winners dataset includes variables such as the division (men, women, wheelchair men, and wheelchair women), the year of the race, the winner’s nationality, and the winner’s time.
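As a minimal sketch of the join described above, the following uses two small made-up data frames in place of the real files. The column names (Year, Category, Time, Starters, Finishers) and all values are assumptions for illustration only and should be checked against the actual csv files.

```r
library(dplyr)

# Toy stand-ins for winners.csv and london_marathon.csv.
# Column names and values are invented for illustration, not real results.
winners_toy <- data.frame(
  Year = c(2021, 2021, 2022, 2022),
  Category = c("Men", "Women", "Men", "Women"),
  Time = c("02:05:00", "02:20:00", "02:06:00", "02:21:00")
)

marathon_toy <- data.frame(
  Year = c(2021, 2022),
  Starters = c(100, 200),
  Finishers = c(90, 180)
)

# Keep every winner row and attach the race-level counts for that year.
joined <- left_join(winners_toy, marathon_toy, by = "Year")
print(joined)
```

A left join from the winners table keeps one row per winner, so the yearly race-level values are repeated for each category within a year.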

We chose this dataset because we have avid runners on our team who would be very interested in analyzing a dataset like this. We also found the source of the data interesting, as it comes from an existing R package that was itself sourced from Wikipedia. Since the dataset has two .csv files, we thought we could take the opportunity to further our analysis and make unique visualization choices by joining them, which can be done with the year variable.

Above all, we liked the breadth and depth of the variables and the types of variables that we would have access to. The numerical values encompass not only the race results but also how much money was raised for charity and the acceptance rate for the races. While we acknowledge that the number of observations is on the lower side, we believe that the versatility of having both categorical and numerical variables will allow us to analyze our data in multiple ways. Finally, we believe that this dataset offers an opportunity to incorporate valuable external data to supplement our analysis.

Questions

The first question we want to answer is whether the strength of winning performances across the four race categories is correlated from year to year, and whether it is related to external factors such as weather. We want to explore this question because it could provide valuable insights into whether elite performances across different race categories are clustered in some manner. We could then hypothesize why there might or might not be correlation across these groups, and include external data such as the average temperature on race day.
The second question we want to answer is how the acceptance rate and completion rate vary across race categories. We want to explore this question because it provides insight into the participation dynamics and runner preparedness for the marathon. Moreover, it would help event managers plan where to allocate staff and resources for the race. From these observations we could then look into why there are discrepancies between the categories and what could rectify some of those discrepancies.

Analysis plan

We would first use the year and race category variables to create a variable (rel_wintime) that represents the winning time relative to the average winning time within each race category. This relative performance variable would be the percentage by which a given winning time was faster or slower than the average winning time. To clarify, the average would be taken over all years of racing, separately for each race category. We would plot rel_wintime over time for each of the four categories as a time-series plot, using geom_line() and grouping by race category to create four lines, one per category. We could then merge in weather data for each of the race years, and use this to understand any effect race-day weather could have on the relative performance of all the categories. We could merge the data using a join function, joining by date. To add weather to our plot, we would place a separate line plot of temperature over time below the main plot described above, similar to how we were instructed to do for the Napoleon’s March plot in HW 1.
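The rel_wintime calculation could be sketched roughly as follows. This is an illustration on made-up data, assuming the winning times are stored as "hh:mm:ss" strings and that the columns are named Year, Category, and Time. The helper to_seconds() is hypothetical and not part of any package.

```r
library(dplyr)

# Hypothetical helper: convert "hh:mm:ss" strings to seconds.
to_seconds <- function(x) {
  vapply(
    strsplit(x, ":", fixed = TRUE),
    function(p) sum(as.numeric(p) * c(3600, 60, 1)),
    numeric(1)
  )
}

# Toy data with invented times (not real results); column names are assumed.
winners_toy <- data.frame(
  Year = c(2020, 2021, 2022, 2020, 2021, 2022),
  Category = rep(c("Men", "Women"), each = 3),
  Time = c("02:00:00", "02:10:00", "02:20:00",
           "02:30:00", "02:30:00", "02:30:00")
)

rel <- winners_toy %>%
  mutate(time_sec = to_seconds(Time)) %>%
  group_by(Category) %>%
  # Percent difference from the category's all-years average winning time;
  # negative values mean a faster-than-average win.
  mutate(rel_wintime = 100 * (time_sec - mean(time_sec)) / mean(time_sec)) %>%
  ungroup()

print(rel)
```

The resulting table could then be passed to ggplot() with geom_line(), mapping Year to x, rel_wintime to y, and Category to colour to get one line per category.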
The variables involved for question two are race category, applicants, accepted, starters, and finishers. The sum of the accepted column will be divided by the sum of the applicants column to get the proportion of applicants who were accepted. Likewise, the sum of the finishers column will be divided by the sum of the starters column to get the proportion of starters who finished, for each category. The category variable can stay as is for the purposes of this question. We will experiment with different types of charts for this data; our leading options right now are a line graph using geom_line() or a bar chart using geom_bar() (grouped by category). Each line graph would have two lines, one for the acceptance percentage and one for the finish percentage, across all race categories.
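A minimal sketch of these two proportions is shown below, taking the acceptance rate as accepted divided by applicants and the finish rate as finishers divided by starters. The data frame and its column names (Applicants, Accepted, Starters, Finishers) are assumptions, and the values are invented for illustration.

```r
library(dplyr)

# Toy stand-in for london_marathon.csv; values are invented, not real counts.
marathon_toy <- data.frame(
  Year = c(2020, 2021),
  Applicants = c(1000, 2000),
  Accepted = c(500, 500),
  Starters = c(400, 450),
  Finishers = c(380, 432)
)

# Overall proportions: sums over all years, as described in the plan.
rates <- marathon_toy %>%
  summarise(
    accept_rate = sum(Accepted) / sum(Applicants),
    finish_rate = sum(Finishers) / sum(Starters)
  )

# Per-year proportions, which could feed a line graph over time.
yearly_rates <- marathon_toy %>%
  mutate(
    accept_rate = Accepted / Applicants,
    finish_rate = Finishers / Starters
  )

print(rates)
print(yearly_rates)
```

If counts are available per race category, adding a group_by() on that column before summarise() would give one row of rates per category.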