London Marathon Winners Since 1981

STA/ISS 313 - Project 1

Author

team_ryan

Abstract

Many factors influence the performance of athletes in long-distance races such as marathons. Understanding the variables behind race outcomes can benefit athletes, coaches, and race organizers. This project uses data from the London Marathon, sourced from the April 25, 2023 Tidy Tuesday project. We analyze how winning times varied from 1981-2020 in the men's and women's running races and the men's and women's wheelchair races, specifically taking temperature into account. We also analyze how the London Marathon acceptance rate has varied from year to year as well as by decade. Our findings show that winning times have either remained the same or decreased over time, with weather having a relatively small effect. Acceptance rate has been below the average for all years except 1981 and from 1988-2008. The highest acceptance rate occurred from 1991-2000 and the lowest from 2011-2020. These results can help racers better understand the performance needed to remain competitive and give race organizers more insight into how acceptance rate has varied over time.


library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(hms)

Attaching package: 'hms'

The following object is masked from 'package:lubridate':

    hms
library(gridExtra)

Attaching package: 'gridExtra'

The following object is masked from 'package:dplyr':

    combine

Introduction

The data for our project come from Tidy Tuesday’s London Marathon page. They originate from Nicola Rennie’s LondonMarathon R package and comprise two datasets: winners.csv and london_marathon.csv. The winners dataset contains information on the winners of the marathon, the category of race (men, women, wheelchair men, wheelchair women), the winner’s nationality, and the winning time. The london_marathon dataset contains the date of the race, the number of applicants, the number accepted, the number of people who started the race, the number of people who finished, how much was raised for charity, and what the official charity for the race was. The marathon data were scraped from Wikipedia using the rvest package and then cleaned with dplyr to produce working datasets.

Question 1: Performance Across Race Categories vs. Weather

Introduction

The first question we want to answer is how winning performances in the four race categories have changed over time and how much they vary from year to year. Specifically, have winning times decreased over time, and are certain types of races more variable from year to year than others? To answer this we need the category of race and the winning time, both of which come from the winners dataset. We will also add temperature data to our london_marathon data frame for each year. These temperature values are from NOAA historical weather data and represent the temperature at 9 AM on each race day.

We want to explore this question because technology and training techniques have improved tremendously over the years, and we want to see the potential effects of these factors on winning times. As a second part to our question, we compare temperature data to winning times, since we speculate that race-day temperature could play a role in finishing times. If there is a noticeable correlation between temperature and winning times, does it affect all four race categories the same, or some more than others?

Approach

To answer question 1, the main plot we are using is a scatter plot whose points are connected by dotted lines, built with geom_point() and geom_line(), with four different colors for the four race categories. The points and lines show the changes in winning times over our time period and the variability from year to year. We specifically did not use a solid line plot because we only have data points for individual years, and we do not want to lead viewers to believe that there is data in between the years. The dotted line avoids this while still conveying trends in the data.
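
A minimal sketch of this point-plus-dotted-line layout is shown here. The data frame `toy_winners` and its values are made up for illustration and are not the real marathon results; the axis labels are likewise assumptions.

```r
library(ggplot2)

# Hypothetical toy data: relative win times for two categories over four years.
toy_winners <- data.frame(
  Year = rep(c(1983, 1984, 1985, 1986), times = 2),
  Category = rep(c("Men Running", "Wheelchair Men"), each = 4),
  rel_wintime = c(0.03, 0.02, 0.02, 0.01, 0.85, 0.60, 0.45, 0.30)
)

# Points mark the observed years; the dotted line only hints at the trend
# between them instead of implying data exists between races.
p_trend <- ggplot(toy_winners, aes(x = Year, y = rel_wintime, color = Category)) +
  geom_point() +
  geom_line(linetype = "dotted") +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(x = "Year", y = "Relative winning time", color = "Race category")

p_trend
```

Mapping `Category` to `color` gives each race its own line and point color, which matches the four-color design described above.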

Our secondary plot for answering this question is a sideways bar chart using geom_col(). This chart adds the effect of a potential explanatory variable, temperature, that we thought was reasonable. Because the bars extend in a positive or negative direction from zero, the chart makes it easy to see whether relative win time and relative temperature are above or below average, and so to compare them. Had we simply added temperature to the scatter and line plot, it would be much more difficult to compare above-average or below-average winning times and temperatures to each other.
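
A rough sketch of such a sideways bar chart follows. The `toy_rel` data frame, its values, and the side-by-side (`"dodge"`) bar placement are assumptions made for illustration; the project's actual figure may arrange the bars differently.

```r
library(ggplot2)

# Hypothetical toy data: one row per year and measure, each expressed
# relative to its own average (0 = average).
toy_rel <- data.frame(
  Year = rep(c(2016, 2017, 2018), times = 2),
  measure = rep(c("Relative win time", "Relative temperature"), each = 3),
  value = c(-0.01, 0.00, 0.02, -0.10, 0.02, 0.22)
)

# geom_col() draws each bar from zero, so above- and below-average values
# point in opposite directions; coord_flip() turns the chart sideways.
p_bars <- ggplot(toy_rel, aes(x = factor(Year), y = value, fill = measure)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(x = "Year", y = "Relative to average", fill = NULL)

p_bars
```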

Analysis

# Load the two Tidy Tuesday datasets.
london_marathon <- read.csv("data/london_marathon.csv")
winners <- read.csv("data/winners.csv")

# Add the 9 AM race-day temperature for each race, in row order.
london_marathon <- london_marathon |>
  mutate(start_temp_f = c(52, 50, 47, 51, 44, 49, 52, 56, 48, 52, 43, 51, 54, 44, 53, 61, 53, 51, 44, 44, 45, 45, 55, 
46, 46, 50, 59, 45, 53, 56, 57, 48, 47, 53, 49, 45, 52, 62, 51, 50, 54, 59))

# Parse winning times, then express each one relative to its category's
# mean winning time (0 = average, negative = faster than average).
winners <- winners |>
  mutate(Time = hms::parse_hms(Time)) |>
  group_by(Category) |>
  mutate(
    average_win_time = as.numeric(mean(Time)),
    rel_wintime = as.numeric(Time) / average_win_time - 1
  )

# Relabel the running categories so they read clearly in plot legends.
winners <- winners |>
  mutate(Category = ifelse(Category == "Men", "Men Running", 
                           ifelse(Category == "Women", "Women Running", Category)))

# Express each race-day temperature relative to the mean temperature.
london_marathon <- london_marathon |>
  mutate(rel_temp = start_temp_f / mean(start_temp_f) - 1)
Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
3.5.0.
ℹ Please use the `legend.position.inside` argument of `theme()` instead.

This figure is a line graph with points at each year titled “Relative Winning Time Over Time by Race Category”. Relative winning time is defined as the winning time for that year divided by the category’s average winning time from 1981-2020, minus 1, so 0% corresponds to an average time. It has lines for Men’s Running, Women’s Running, Wheelchair Men and Wheelchair Women. Times for both Men’s Running and Women’s Running have remained relatively constant, with relative win time around 0%. Wheelchair Men and Wheelchair Women started above 80% and have trended down since then, with both remaining below 0% since 2003.

This figure is a line graph with points at each year titled “Relative Winning Time Over Time by Race Category”. Relative winning time is defined as the winning time for that year divided by the category’s average winning time from 1981-2020, minus 1, so 0% corresponds to an average time. It has lines for Men’s Running, Women’s Running, Wheelchair Men and Wheelchair Women. This is a zoomed-in version of the previous plot, including only relative win times under 20%. Zooming in makes the trends for Men’s and Women’s Running easier to see: relative win times move from about +5% at the beginning of the time period to nearly 5% below the average win time.

This figure is a column graph titled “Relative Running Win Times Plotted Against Relative Temperature” that displays relative win times for Men’s and Women’s Running along with the relative average temperature by year. The plot shows that 2018 had the highest relative temperature and 1991 had the lowest.

This figure is a column graph titled “Relative Wheelchair Win Times Plotted Against Relative Temperature” that displays relative win times for Men’s and Women’s Wheelchair along with the relative average temperature by year. The plot shows that 2018 had the highest relative temperature and 1991 had the lowest. We are visualizing the relative win time via a sideways bar chart rather than a line chart which was used in the first two plots.
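
As a worked example of the relative measures shown in these figures, the sketch below applies the same "value divided by mean, minus 1" calculation to three made-up winning times. The vector `toy_times_sec` is hypothetical and does not come from the winners dataset.

```r
# Hypothetical winning times in seconds (2:10:00, 2:06:00, 2:04:00).
toy_times_sec <- c(7800, 7560, 7440)

# Relative win time: 0 means exactly average, negative means faster than average.
toy_rel_wintime <- toy_times_sec / mean(toy_times_sec) - 1

round(toy_rel_wintime, 4)
# 0.0263 -0.0053 -0.0211
```

Because each value is measured against the mean of the same vector, the relative values always sum to zero, which is why the bars in the column graphs fall on both sides of 0%.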

Discussion

From the first plot, we see that Wheelchair Men and Wheelchair Women started with the largest relative win times (85% and 95%, respectively). Relative win time for both wheelchair divisions then decreased substantially, with both values going below 0% in 1993 (indicating that the winning time in 1993 was faster than the average winning time for the whole time period). A likely reason for the drastic drop is the change in wheelchair technology: the wheelchairs used in the early races resembled medical wheelchairs, which are not designed for speed. In more recent years, racing wheelchairs have improved significantly and are much more aerodynamic. Wheelchair racers can also take advantage of downhill slopes more easily than runners can, because the chair itself rolls and carries its speed for longer.

We also included a zoomed-in version of the plot, which drops any observations more than 20% above the average winning time so that the men’s and women’s running races are easier to see. For Men’s and Women’s Running, there is little deviation from the average winning time, with times mostly staying within ±5% of the average. Despite this small deviation, we do see a slight downward trend in winning times, with the most recent marathons generally below 0% in relative win time and the oldest marathons above 0%.
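
The zoom described here amounts to a simple filter on relative win time. The following is a sketch on made-up data (`toy_winners` and its values are hypothetical) and not necessarily the exact code behind the figure.

```r
library(dplyr)

# Hypothetical toy data: an early year with a large wheelchair value and a recent year.
toy_winners <- data.frame(
  Year = c(1983, 1983, 1983, 2019, 2019, 2019),
  Category = rep(c("Men Running", "Women Running", "Wheelchair Men"), times = 2),
  rel_wintime = c(0.03, 0.04, 0.85, -0.03, -0.02, -0.12)
)

# Keep only observations at or below 20% above the category average.
zoomed <- toy_winners |>
  filter(rel_wintime <= 0.20)

zoomed
```

An alternative would be to keep every row and restrict the visible range with `coord_cartesian(ylim = ...)`, which zooms the axis without discarding data.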

The second type of plot we added was a sideways, stacked bar graph that shows race times relative to the average by year, with relative temperature plotted over them. We did this to see whether there was any relationship between race performance and temperature. There is no apparent relationship between the two: for both the wheelchair and running races, above- or below-average temperatures do not line up with above- or below-average winning times, despite our original expectation that an “optimal” temperature range would enhance performance.
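
One way to put a number on this visual impression would be a per-category correlation between relative temperature and relative win time. The sketch below uses made-up data frames (`toy_temp`, `toy_times`) purely to show the mechanics; it is not a result from our analysis, and the column names simply mirror those created above.

```r
library(dplyr)

# Hypothetical relative temperatures, one row per year.
toy_temp <- data.frame(
  Year = 2015:2019,
  rel_temp = c(-0.03, -0.11, 0.03, 0.23, 0.01)
)

# Hypothetical relative win times, one row per year and category.
toy_times <- data.frame(
  Year = rep(2015:2019, times = 2),
  Category = rep(c("Men Running", "Women Running"), each = 5),
  rel_wintime = c(-0.01, -0.03, -0.02, -0.02, -0.04,
                  -0.02, -0.01, -0.04, -0.03, -0.03)
)

# Join temperature onto the winners by year, then compute Pearson's r per category.
temp_cor <- toy_times |>
  inner_join(toy_temp, by = "Year") |>
  group_by(Category) |>
  summarise(r = cor(rel_temp, rel_wintime), .groups = "drop")

temp_cor
```

Values of `r` near zero would be consistent with the lack of a visible relationship, although with only one winner per category per year any such estimate would be noisy.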