library(tidyverse)
library(dplyr)
Team Two Project Proposal: Tackling Tornado Trends
Anika Mandavilli, Anika Desai, Nate Xiao
Dataset
This dataset is from TidyTuesday and is a compilation of characteristics of US tornadoes, including their date and time, state, magnitude, and more.
The data was sourced from the NOAA’s National Weather Service Storm Prediction Center Severe Weather Maps, Graphics, and Data Page, which contains data for tornadoes, hail, and damaging wind dated from 1950.
tornadoes <- read_csv("data/1950-2022_actual_tornadoes.csv")
Rows: 68701 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): mo, dy, st, stf
dbl (23): om, yr, tz, stn, mag, inj, fat, loss, closs, slat, slon, elat, el...
date (1): date
time (1): time
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(tornadoes)
Rows: 68,701
Columns: 29
$ om <dbl> 192, 193, 195, 196, 197, 194, 198, 199, 200, 201, 4, 5, 6, 7, 1,…
$ yr <dbl> 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950, 1950…
$ mo <chr> "10", "10", "11", "11", "11", "11", "12", "12", "12", "12", "01"…
$ dy <chr> "01", "09", "20", "20", "20", "04", "02", "02", "02", "02", "13"…
$ date <date> 1950-10-01, 1950-10-09, 1950-11-20, 1950-11-20, 1950-11-20, 195…
$ time <time> 21:00:00, 02:15:00, 02:20:00, 04:00:00, 07:30:00, 17:00:00, 15:…
$ tz <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
$ st <chr> "OK", "NC", "KY", "KY", "MS", "PA", "IL", "IL", "AR", "IL", "AR"…
$ stf <chr> "40", "37", "21", "21", "28", "42", "17", "17", "5", "17", "5", …
$ stn <dbl> 23, 9, 1, 2, 14, 5, 7, 8, 12, 9, 1, 2, 3, 1, 1, 2, 1, 2, 3, 4, 5…
$ mag <dbl> 1, 3, 2, 1, 1, 3, 2, 3, 3, 1, 3, 2, 2, 2, 3, 3, 1, 2, 3, 2, 2, 2…
$ inj <dbl> 0, 3, 0, 0, 3, 1, 3, 25, 0, 0, 1, 5, 0, 2, 3, 3, 1, 0, 12, 5, 6,…
$ fat <dbl> 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1…
$ loss <dbl> 4, 5, 5, 5, 4, 5, 4, 6, 1, 4, 3, 5, 5, 0, 6, 5, 4, 4, 4, 5, 5, 4…
$ closs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ slat <dbl> 36.73, 34.17, 37.37, 38.20, 32.42, 40.20, 38.97, 38.75, 36.12, 3…
$ slon <dbl> -102.52, -78.60, -87.20, -84.50, -89.13, -76.12, -90.05, -89.67,…
$ elat <dbl> 36.88, 0.00, 0.00, 0.00, 0.00, 40.40, 39.07, 38.90, 36.18, 38.22…
$ elon <dbl> -102.30, 0.00, 0.00, 0.00, 0.00, -75.93, -89.72, -89.38, -91.72,…
$ len <dbl> 15.8, 2.0, 0.1, 0.1, 2.0, 15.9, 18.8, 18.0, 7.8, 9.6, 0.6, 2.3, …
$ wid <dbl> 10, 880, 10, 10, 37, 100, 50, 200, 10, 50, 17, 300, 100, 133, 15…
$ ns <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1…
$ sn <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1…
$ sg <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ f1 <dbl> 25, 47, 177, 209, 101, 71, 119, 119, 65, 157, 113, 93, 91, 47, 0…
$ f2 <dbl> 0, 0, 0, 0, 0, 11, 117, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3…
$ f3 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ f4 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ fc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
From the glimpse() output, we can see that the dataset has 29 variables and 68,701 observations. The variables span several types: character, double, date, and time.
We chose this dataset because it has a mix of numerical and categorical variables, including variables that can easily be converted into the types we prefer. We are also interested in the healthcare industry and wanted to explore the impacts of tornadoes, including injuries (inj) and fatalities (fat), on the populations of different states. The findings of our analysis could be extended to the healthcare industry and beyond. They could be used to find relationships between tornado magnitude and the number of injuries (inj) or fatalities (fat) across areas within the US. Policymakers could likewise examine which aspects of tornadoes contribute most to property damage. This could eventually inform policy-making around the issue: further research could investigate why certain states are hit harder than others by tornadoes of the same magnitude (mag), and how policies could be updated to address the potential causes and weaknesses in vulnerable states.
First Research Question and Analysis Plan
The first question we want to answer is: How do the intensity, size, and distance traveled of tornadoes vary by the location in which they occur?
Our plan for answering the first question is to create a series of visualizations that use the st (state), mag (magnitude on the F scale), slat (starting latitude), slon (starting longitude), elat (ending latitude), elon (ending longitude), len (length in miles), and wid (width in yards) variables.
Variable Transformations
We can use these variables to create any new variables that are needed. We can create a new categorical region variable by grouping the values of the st variable into regions. In this way, tornadoes are categorized by the region they occurred in, in addition to the state they originated in. We can then use the numeric slat, slon, elat, and elon variables to create variables for the horizontal and vertical distances traveled by each tornado. The size of the tornado is given by the numeric len and wid variables, while its intensity is given by the mag variable.
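The transformations described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the toy data frame copies four rows from the glimpse output, the region grouping follows US Census Bureau regions and covers only the states in the toy data, the column names vert_dist and horiz_dist are hypothetical, distances are left in degrees, and an ending coordinate of 0.00 is assumed to mark a missing endpoint rather than a real location.

```r
library(dplyr)

# Four rows copied from the glimpse output above (rows 1, 2, 3, and 6)
toy <- data.frame(
  st   = c("OK", "NC", "KY", "PA"),
  slat = c(36.73, 34.17, 37.37, 40.20),
  slon = c(-102.52, -78.60, -87.20, -76.12),
  elat = c(36.88, 0, 0, 40.40),
  elon = c(-102.30, 0, 0, -75.93)
)

toy_regions <- toy %>%
  mutate(
    # Assumption: regions follow the US Census Bureau grouping;
    # only the states present in the toy data are listed here
    region = case_when(
      st %in% c("OK", "NC", "KY") ~ "South",
      st %in% c("PA") ~ "Northeast"
    ),
    # Assumption: an ending coordinate of 0 means the endpoint was not recorded
    elat = na_if(elat, 0),
    elon = na_if(elon, 0),
    # Hypothetical names; differences are in degrees, not miles
    vert_dist  = elat - slat,
    horiz_dist = elon - slon
  )

toy_regions
```

Leaving the distances as NA when the endpoint is missing keeps those tornadoes out of the distance plots without dropping them from the rest of the analysis.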
Visualizations
We can then create two scatter plots: one showing how the horizontal and vertical distances traveled by tornadoes relate to each other, and one showing how the length and width of tornadoes relate to each other. To see how both relationships relate to magnitude and vary by region, we will facet each plot by region and map the size of the points to magnitude.
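One of the two plots, length against width, might look something like the sketch below. The toy data frame reuses four rows from the glimpse output, and the region column is filled in by hand as an assumption, since the real variable has not been created yet. The alpha value and axis labels are illustrative choices.

```r
library(ggplot2)

# Rows 1, 2, 3, and 6 from the glimpse output; region is assigned by hand here
toy <- data.frame(
  region = c("South", "South", "South", "Northeast"),
  len    = c(15.8, 2.0, 0.1, 15.9),
  wid    = c(10, 880, 10, 100),
  mag    = c(1, 3, 2, 3)
)

# Scatter plot of length vs. width, point size mapped to magnitude,
# with one panel per region
p <- ggplot(toy, aes(x = len, y = wid, size = mag)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~region) +
  labs(
    x = "Length (miles)",
    y = "Width (yards)",
    size = "Magnitude (F scale)"
  )

print(p)
```

The distance plot would follow the same pattern, with the horizontal and vertical distance variables on the axes instead of len and wid.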
Second Research Question and Analysis Plan
The second question we want to answer is: How does the degree of damage, in terms of injuries, fatalities, and loss, vary by tornado magnitude and the season in which tornadoes occur?
Our plan to answer the second question is to first create a new variable, season, with the mutate() function and then create a series of visualizations using inj (injuries), fat (fatalities), loss, and mag (magnitude).
Variable Transformations
We intend to create a categorical variable, season, by categorizing each data point as Spring, Summer, Fall, or Winter according to the month in which the tornado occurred. This can be done with a simple case_when() statement inside the mutate() function.
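A minimal sketch of that step is shown below. It assumes meteorological seasons (March to May as Spring, and so on), which is our choice for illustration rather than something the dataset defines. Because the glimpse output shows mo stored as a character column, the sketch first converts it to an integer in a hypothetical helper column, month_num.

```r
library(dplyr)

# Toy months in the same character format as the mo column ("01" to "12")
toy <- data.frame(
  mo = c("10", "11", "12", "01", "04", "07"),
  stringsAsFactors = FALSE
)

toy_seasons <- toy %>%
  mutate(
    # mo is a character column, so convert it before comparing
    month_num = as.integer(mo),
    # Assumption: meteorological seasons; unmatched months stay NA
    season = case_when(
      month_num %in% 3:5         ~ "Spring",
      month_num %in% 6:8         ~ "Summer",
      month_num %in% 9:11        ~ "Fall",
      month_num %in% c(12, 1, 2) ~ "Winter"
    )
  )

toy_seasons
```

Listing the winter months explicitly, rather than using a catch-all branch, means a missing or malformed month ends up as NA instead of being silently labeled Winter.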
Visualizations
Then, we can create a series of scatter plots that put magnitude on the x-axis and one of inj (injuries), fat (fatalities), or loss on the y-axis to investigate the relationship between magnitude and the damage a tornado might cause. We can also color the points by season to examine whether there are any seasonal trends. If we find trends or seasonal variation in any of those plots, we can further investigate a specific variable (inj, fat, or loss) and its relationship with mag (magnitude), faceting the scatter plot by season to compare the trend across seasons. We can also fit a regression model to predict the damage (injuries, fatalities, or loss) a tornado might cause in a given season.
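As a rough sketch of the regression step, the example below fits an ordinary linear model of injuries on magnitude and season. The model form is an assumption made for illustration; the proposal does not commit to a specific kind of regression, and a count model may turn out to suit injuries and fatalities better. The toy data uses the first ten rows of mag and inj from the glimpse output, with season filled in by hand from their months (October and November as Fall, December as Winter).

```r
# First ten rows of mag and inj from the glimpse output;
# season is assigned by hand from the mo values of those rows
toy <- data.frame(
  mag    = c(1, 3, 2, 1, 1, 3, 2, 3, 3, 1),
  inj    = c(0, 3, 0, 0, 3, 1, 3, 25, 0, 0),
  season = c(rep("Fall", 6), rep("Winter", 4))
)

# Assumption: a simple additive linear model, chosen only for illustration
fit <- lm(inj ~ mag + season, data = toy)

# Predicted injuries for a hypothetical magnitude 3 tornado in winter
new_tornado <- data.frame(mag = 3, season = "Winter")
predicted_inj <- predict(fit, newdata = new_tornado)

summary(fit)
predicted_inj
```

Ten rows are far too few to draw conclusions from; the point is only to show the shape of the call that would be run on the full dataset.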
References
Source of the data: https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-05-16/readme.md#data-dictionary
Source for the NOAA webpage: https://www.spc.noaa.gov/wcm/#data