Team Two Project Proposal: Tackling Tornado Trends

Anika Mandavilli, Anika Desai, Nate Xiao

library(tidyverse)  # attaches dplyr, ggplot2, readr, and the other core tidyverse packages

Dataset

This dataset is from TidyTuesday and is a compilation of characteristics of US tornadoes, including their date and time, state, magnitude, and more.

The data were sourced from NOAA’s National Weather Service Storm Prediction Center Severe Weather Maps, Graphics, and Data Page, which contains records of tornadoes, hail, and damaging wind dating back to 1950.

tornadoes <- read_csv("data/1950-2022_actual_tornadoes.csv")

Rows: 68701 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (4): mo, dy, st, stf
dbl  (23): om, yr, tz, stn, mag, inj, fat, loss, closs, slat, slon, elat, el...
date  (1): date
time  (1): time

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(tornadoes)

Rows: 68,701
Columns: 29
$ om    <dbl> 192, 193, 195, 196, 197, 194, 198, 199, 200, 201, 4, 5, 6, 7, 1,…
$ yr    <dbl> 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950…
$ mo    <chr> "10", "10", "11", "11", "11", "11", "12", "12", "12", "12", "01"…
$ dy    <chr> "01", "09", "20", "20", "20", "04", "02", "02", "02", "02", "13"…
$ date  <date> 1950-10-01, 1950-10-09, 1950-11-20, 1950-11-20, 1950-11-20, 195…
$ time  <time> 21:00:00, 02:15:00, 02:20:00, 04:00:00, 07:30:00, 17:00:00, 15:…
$ tz    <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
$ st    <chr> "OK", "NC", "KY", "KY", "MS", "PA", "IL", "IL", "AR", "IL", "AR"…
$ stf   <chr> "40", "37", "21", "21", "28", "42", "17", "17", "5", "17", "5", …
$ stn   <dbl> 23, 9, 1, 2, 14, 5, 7, 8, 12, 9, 1, 2, 3, 1, 1, 2, 1, 2, 3, 4, 5…
$ mag   <dbl> 1, 3, 2, 1, 1, 3, 2, 3, 3, 1, 3, 2, 2, 2, 3, 3, 1, 2, 3, 2, 2, 2…
$ inj   <dbl> 0, 3, 0, 0, 3, 1, 3, 25, 0, 0, 1, 5, 0, 2, 3, 3, 1, 0, 12, 5, 6,…
$ fat   <dbl> 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1…
$ loss  <dbl> 4, 5, 5, 5, 4, 5, 4, 6, 1, 4, 3, 5, 5, 0, 6, 5, 4, 4, 4, 5, 5, 4…
$ closs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ slat  <dbl> 36.73, 34.17, 37.37, 38.20, 32.42, 40.20, 38.97, 38.75, 36.12, 3…
$ slon  <dbl> -102.52, -78.60, -87.20, -84.50, -89.13, -76.12, -90.05, -89.67,…
$ elat  <dbl> 36.88, 0.00, 0.00, 0.00, 0.00, 40.40, 39.07, 38.90, 36.18, 38.22…
$ elon  <dbl> -102.30, 0.00, 0.00, 0.00, 0.00, -75.93, -89.72, -89.38, -91.72,…
$ len   <dbl> 15.8, 2.0, 0.1, 0.1, 2.0, 15.9, 18.8, 18.0, 7.8, 9.6, 0.6, 2.3, …
$ wid   <dbl> 10, 880, 10, 10, 37, 100, 50, 200, 10, 50, 17, 300, 100, 133, 15…
$ ns    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1…
$ sn    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1…
$ sg    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ f1    <dbl> 25, 47, 177, 209, 101, 71, 119, 119, 65, 157, 113, 93, 91, 47, 0…
$ f2    <dbl> 0, 0, 0, 0, 0, 11, 117, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3…
$ f3    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ f4    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ fc    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

From the glimpse() output, we can see that the dataset has 29 variables and 68,701 observations. The variables span several types: character, double, date, and time.

We chose this dataset because it has a mix of numerical and categorical variables, including some that can easily be converted to the type we need. We are also interested in the healthcare industry and wanted to explore the human impact of tornadoes, measured by injuries (inj) and fatalities (fat), on the populations of different states.

The findings of our analysis could be useful in healthcare and beyond. They could reveal relationships between tornado magnitude (mag) and the number of injuries or fatalities across areas of the US. Policymakers could likewise examine which aspects of tornadoes contribute most to property damage. This could eventually inform policy: further research could investigate why certain states are hit harder than others by tornadoes of the same magnitude, and how policies could be updated to address the potential causes and weaknesses in vulnerable states.

First Research Question and Analysis Plan

The first question we want to answer is: How do the intensity, size, and distance traveled of tornadoes vary by the location in which they occur?

Our plan for answering the first question is to create a series of visualizations that use the following variables: st (state), mag (magnitude on the F scale), slat (starting latitude), slon (starting longitude), elat (ending latitude), elon (ending longitude), len (length in miles), and wid (width in yards).

Variable Transformations

We can derive any new variables we need from these with mutate(). We can create a new categorical variable, region, by grouping the values of st into regions, so that each tornado is categorized by the region it occurred in as well as the state it originated in. We can then use the numeric slat, slon, elat, and elon variables to compute the horizontal and vertical distances traveled by each tornado. The size of the tornado is given by the numeric len and wid variables, while its intensity is given by mag.
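As a minimal sketch of these transformations, the code below works on a few rows typed in by hand from the glimpse() output above. The state-to-region lookup is partial and only illustrative (it follows the US Census Bureau regions for the states shown), the column names horiz_dist and vert_dist are hypothetical, and treating end coordinates of 0 as "not recorded" is our assumption rather than something documented in this proposal.

```r
library(tidyverse)

# Hand-typed sample: rows 1, 2, 6, and 7 of the glimpse() output above
sample_tornadoes <- tibble(
  st   = c("OK", "NC", "PA", "IL"),
  slat = c(36.73, 34.17, 40.20, 38.97),
  slon = c(-102.52, -78.60, -76.12, -90.05),
  elat = c(36.88, 0.00, 40.40, 39.07),
  elon = c(-102.30, 0.00, -75.93, -89.72)
)

# Hypothetical, partial lookup; a full version would cover every state
south     <- c("OK", "NC", "KY", "MS", "AR")
northeast <- c("PA")
midwest   <- c("IL")

with_regions <- sample_tornadoes |>
  mutate(
    region = case_when(
      st %in% south     ~ "South",
      st %in% northeast ~ "Northeast",
      st %in% midwest   ~ "Midwest",
      TRUE              ~ "Other"
    ),
    # Assumption: an end coordinate of exactly 0 means it was not recorded
    elat = if_else(elat == 0, NA_real_, elat),
    elon = if_else(elon == 0, NA_real_, elon),
    # Distances in degrees; positive values mean northward / eastward travel
    vert_dist  = elat - slat,
    horiz_dist = elon - slon
  )

with_regions
```

The distances here are left in degrees of latitude and longitude; converting them to miles would be a further step that this sketch does not attempt.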

Visualizations

With these variables, we can create two scatter plots: one relating the horizontal and vertical distances traveled by tornadoes, and one relating their length and width. To see how both relationships vary with magnitude and by region, we can facet each plot by region and map the size of the points to magnitude.
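The two plots described above could be sketched roughly as follows. The data frame is a hand-typed sample (rows 1, 6, 7, and 8 of the glimpse() output), and the region, horiz_dist, and vert_dist columns are hypothetical derived variables filled in by hand here, so this illustrates the plot structure only and says nothing about the real patterns in the data.

```r
library(tidyverse)

# Hand-typed sample; region and the two distance columns are hypothetical
# derived variables (distances in degrees, end minus start)
plot_data <- tibble(
  region     = c("South", "Northeast", "Midwest", "Midwest"),
  mag        = c(1, 3, 2, 3),
  len        = c(15.8, 15.9, 18.8, 18.0),
  wid        = c(10, 100, 50, 200),
  horiz_dist = c(0.22, 0.19, 0.33, 0.29),
  vert_dist  = c(0.15, 0.20, 0.10, 0.15)
)

# Plot 1: horizontal vs. vertical distance, sized by magnitude, one panel per region
distance_plot <- ggplot(plot_data, aes(x = horiz_dist, y = vert_dist, size = mag)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~region) +
  labs(
    x = "Horizontal distance (degrees longitude)",
    y = "Vertical distance (degrees latitude)",
    size = "Magnitude (F scale)"
  )

# Plot 2: length vs. width, sized by magnitude, one panel per region
size_plot <- ggplot(plot_data, aes(x = len, y = wid, size = mag)) +
  geom_point(alpha = 0.6) +
  facet_wrap(~region) +
  labs(
    x = "Length (miles)",
    y = "Width (yards)",
    size = "Magnitude (F scale)"
  )

distance_plot
size_plot
```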

Second Research Question and Analysis Plan

The second question we want to answer is: How does the damage caused by tornadoes, in terms of injuries, fatalities, and loss, vary by their magnitude and the season in which they occur?

Our plan to answer the second question is to first create a new variable, season, with the mutate() function and then create a series of visualizations using inj (injuries), fat (fatalities), loss, and mag (magnitude).

Variable Transformations

We intend to create a categorical variable, season, by assigning each tornado to Spring, Summer, Fall, or Winter based on the month (mo) in which it occurred. This can be done with a simple case_when() statement inside mutate().
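One way the season variable might be built is sketched below. Because mo is read in as a character column ("01", "10", and so on), the sketch first converts it to an integer. The month-to-season boundaries (meteorological seasons, with Spring as March through May) are an assumption, as are the helper column mo_num and the hand-typed sample months.

```r
library(tidyverse)

# Hand-typed illustrative months, stored as character like the mo column
sample_months <- tibble(mo = c("10", "11", "12", "01", "04", "07"))

with_season <- sample_months |>
  mutate(
    mo_num = as.integer(mo),  # "01" becomes 1, "10" becomes 10
    season = case_when(
      mo_num %in% 3:5         ~ "Spring",
      mo_num %in% 6:8         ~ "Summer",
      mo_num %in% 9:11        ~ "Fall",
      mo_num %in% c(12, 1, 2) ~ "Winter"
    )
  )

with_season
```

Listing the Winter months explicitly, rather than using a catch-all branch, means an unexpected or missing month ends up as NA instead of being silently labelled Winter.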

Visualizations

Then, we can create a series of scatter plots that put magnitude on the x-axis and one of inj (injuries), fat (fatalities), or loss on the y-axis to investigate the relationship between magnitude and the damage a tornado might cause. We can also color the points by season to examine whether there are any seasonal trends. If we find trends or seasonal variation in any of those plots, we can further investigate that variable (inj, fat, or loss) and its relationship with mag (magnitude) by faceting the scatter plot by season to compare the trend across seasons. We can also fit a regression model to predict the damage (injuries, fatalities, or loss) a tornado might cause in a given season.
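A rough sketch of one such plot and model follows, using inj as the damage variable. The data are the first ten rows of the glimpse() output typed in by hand, with season filled in manually under the assumption that October and November count as Fall and December as Winter. The proposal does not say which kind of regression will be used, so the linear model fitted with lm() is only one possible choice, and with ten rows the fitted coefficients are not meaningful.

```r
library(tidyverse)

# First ten rows of mag and inj from the glimpse() output; season is
# assigned by hand from mo (assumption: Oct-Nov = Fall, Dec = Winter)
damage_data <- tibble(
  mag    = c(1, 3, 2, 1, 1, 3, 2, 3, 3, 1),
  inj    = c(0, 3, 0, 0, 3, 1, 3, 25, 0, 0),
  season = c(rep("Fall", 6), rep("Winter", 4))
)

# Scatter plot of injuries against magnitude, colored by season;
# a little horizontal jitter separates points that share a magnitude
damage_plot <- ggplot(damage_data, aes(x = mag, y = inj, color = season)) +
  geom_jitter(width = 0.1, height = 0) +
  labs(x = "Magnitude (F scale)", y = "Injuries", color = "Season")

# One possible model: linear regression of injuries on magnitude and season
damage_model <- lm(inj ~ mag + season, data = damage_data)

damage_plot
summary(damage_model)
```

Swapping inj for fat or loss in the plot and the model formula would give the corresponding analysis for the other damage variables; adding facet_wrap(~season) to the plot would give the faceted version.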

References

Source of the data: https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-05-16/readme.md#data-dictionary
Source for the NOAA webpage: https://www.spc.noaa.gov/wcm/#data