UFO Sightings in the US

Proposal

Dataset

ufo_sightings <- readr::read_csv(here::here("data/ufo_sightings.csv"))
places <- readr::read_csv(here::here("data/places.csv"))
day_parts_map <- readr::read_csv(here::here("data/day_parts_map.csv"))
military_bases <- readr::read_csv(here::here("data/military-bases.csv"))

This dataset is sourced from TidyTuesday’s June 20, 2023 challenge. The data originates from the National UFO Reporting Center and was cleaned and compiled with data from sunrise-sunset.org by Jon Harmon.

Three .csv files make up this dataset. Our primary dataframe is ufo_sightings.csv, where each observation is a reported UFO sighting. The file places.csv contains additional geographic information, and day_parts_map.csv contains the times of astronomical events (such as the dawn and dusk phases) for different locations on a given date.

The dimensions for each dataframe are as follows.

Dimensions	`ufo_sightings.csv`	`places.csv`	`day_parts_map.csv`
Observations	96429	14417	26409
Variables	12	10	12

We plan to left join ufo_sightings with places by city and state, and to omit day_parts_map from our analysis. The resulting combined dataframe has 96429 observations and 20 variables.
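As a minimal sketch of the planned join, assuming both tables share city and state columns as described above. The toy rows and the names ufo_toy, places_toy, and ufo_joined are invented for illustration and are not taken from the dataset.

```r
library(dplyr)

# Toy stand-ins for the real tables; all values are made up for illustration.
ufo_toy <- tibble::tibble(
  city  = c("Phoenix", "Phoenix", "Ithaca"),
  state = c("AZ", "AZ", "NY"),
  shape = c("light", "triangle", "disk")
)
places_toy <- tibble::tibble(
  city       = c("Phoenix", "Ithaca"),
  state      = c("AZ", "NY"),
  latitude   = c(33.45, 42.44),
  longitude  = c(-112.07, -76.50),
  population = c(1600000, 32000)
)

# Keep every sighting and attach the matching place information.
ufo_joined <- left_join(ufo_toy, places_toy, by = c("city", "state"))

# With unique city/state pairs in places_toy, the row count is unchanged.
nrow(ufo_joined) == nrow(ufo_toy)
```

Comparing row counts before and after the join is a cheap check: if a city/state pair appeared more than once in the places table, the joined result would have more rows than the sightings table.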

This dataset was chosen because of our interest in seeing the trends in UFO sightings and understanding how these may vary by location and change over time. It is also a dataset with a lot of variability, which will be interesting to plot and explore visually.

Questions

Which state in the US has the most UFO sightings and in that state which cities have the most sightings?
What are the trends and distributions that can be observed about UFO sighting events in the US in terms of time of day, day of week, month, year, shape, and duration?

Analysis plan

As stated in our research questions, we will focus only on data from the US. This is due to the small number of observations from other countries and will be accomplished using dplyr::filter() on country_code.

Our analysis will also involve variables representing time, which are at present all combined into a single column, reported_date_time. It should be possible to split this column into separate columns for year, month, day, and time using dplyr::mutate() together with lubridate::ymd_hms(), lubridate::year(), lubridate::month(), lubridate::day(), etc.
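The two preparation steps above could look roughly like the sketch below. The toy timestamps and the "US"/"CA" country codes are assumptions for illustration, as are the new column names; if readr has already parsed reported_date_time as a date-time, the ymd_hms() step may be unnecessary.

```r
library(dplyr)
library(lubridate)

# Toy data; timestamps and country codes are illustrative assumptions.
ufo_toy <- tibble::tibble(
  reported_date_time = c("2020-03-14 21:30:00", "2019-07-04 22:05:00",
                         "2020-10-31 19:45:00"),
  country_code = c("US", "US", "CA")
)

ufo_us <- ufo_toy |>
  # Keep only sightings reported in the US
  filter(country_code == "US") |>
  # Parse the timestamp, then pull out the components we plan to analyze
  mutate(
    reported_date_time = ymd_hms(reported_date_time),
    year    = year(reported_date_time),
    month   = month(reported_date_time, label = TRUE),
    day     = day(reported_date_time),
    weekday = wday(reported_date_time, label = TRUE),
    hour    = hour(reported_date_time)
  )
```

Using label = TRUE for month and weekday returns ordered factors, which keeps axis labels in calendar order when plotted.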

Question 1

To capture the geographic distribution of sightings, we plan on generating maps. To create the map of the US and of a specific US state, we intend to reference the following chapter: https://ggplot2-book.org/maps. We may use ggplot2::map_data(), which draws on the maps package, to get the boundaries between states/countries. To turn this data into an actual map, we’ll use ggplot2::geom_polygon().
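A minimal sketch of the base map, assuming the maps package is installed so that map_data("state") can find the state outlines. The object names us_states and base_map are ours. Note that this particular outline set covers the contiguous US only, so Alaska and Hawaii would need separate handling.

```r
library(ggplot2)

# State outlines from the maps package (must be installed for map_data()).
us_states <- map_data("state")

# Each state is drawn as one or more polygons, identified by `group`.
base_map <- ggplot(us_states, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "gray95", color = "gray40") +
  coord_quickmap() +   # approximate projection so states are not stretched
  theme_void()

base_map
```

The region column of us_states holds lowercase full state names, which matters later when joining sighting counts to the polygons.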

The first visualization aims to show the distribution of sightings by state in the most recent year of the dataset, to help determine which states tend to have the most and fewest sightings. We will use dplyr::group_by() and dplyr::mutate() to compute a count variable for each state. This count will be the fill aesthetic for our ggplot. We may add labels denoting the exact count using something like ggplot2::geom_sf_label() or ggplot2::geom_sf_text().
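One way the per-state count could be computed is sketched below on toy data. Here dplyr::count() is used as a shorthand that returns one row per state; the group_by()/mutate() route named above would instead add the count as a column on every sighting. The assumption that state holds two-letter abbreviations, and the names state_lookup and n_sightings, are ours.

```r
library(dplyr)

# Toy sightings; values are invented for illustration.
ufo_toy <- tibble::tibble(
  state = c("CA", "CA", "CA", "WA", "WA", "NY"),
  year  = c(2022, 2022, 2021, 2022, 2022, 2022)
)

# Lookup from two-letter abbreviations to lowercase full names;
# state.abb and state.name ship with base R.
state_lookup <- tibble::tibble(state = state.abb, region = tolower(state.name))

state_counts <- ufo_toy |>
  filter(year == max(year)) |>              # most recent year only
  count(state, name = "n_sightings") |>     # one row per state
  left_join(state_lookup, by = "state") |>  # add the full state name
  arrange(desc(n_sightings))
```

The region column is in the lowercase form that ggplot2::map_data("state") uses, so state_counts could be joined to the polygon data and n_sightings mapped to fill.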

The second visualization will focus on the single state with the most sightings and show the distribution of sightings by city. City locations will be plotted using their latitude and longitude. Depending on the distribution and number of observations, we may try overlaying points from our dataset on this map to indicate frequency. Otherwise, we will take a similar approach to the first visualization and compute a count per city, perhaps mapping an aesthetic such as alpha to that count. It may also be interesting to color or size cities by their population. If the city-level approach does not work, an alternative may be to bring in another dataset to determine which county each UFO sighting belongs to and plot by county instead. As in the first part of the question, we plan to draw this visualization on a map.
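The city-level approach could be sketched as follows, using toy data with rough coordinates. In this sketch point size is mapped to the sighting count; alpha or population could be swapped in as described above. All object and column names other than city, state, latitude, and longitude are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Toy joined data; coordinates are rough and for illustration only.
ufo_toy <- tibble::tibble(
  state = c("CA", "CA", "CA", "CA", "WA"),
  city  = c("Los Angeles", "Los Angeles", "San Diego", "Fresno", "Seattle"),
  latitude  = c(34.05, 34.05, 32.72, 36.74, 47.61),
  longitude = c(-118.24, -118.24, -117.16, -119.79, -122.33)
)

# Identify the state with the most sightings.
top_state <- ufo_toy |>
  count(state, sort = TRUE) |>
  slice(1) |>
  pull(state)

# Count sightings per city within that state, keeping the coordinates.
city_counts <- ufo_toy |>
  filter(state == top_state) |>
  count(city, latitude, longitude, name = "n_sightings")

# One point per city, sized by the number of sightings.
city_plot <- ggplot(city_counts, aes(x = longitude, y = latitude)) +
  geom_point(aes(size = n_sightings), alpha = 0.6) +
  coord_quickmap() +
  labs(size = "Sightings")
```

In the full analysis these points would be layered over the state outline rather than drawn on an empty panel.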

Question 2

Although there are many variables we could examine, we are most interested in the time of day of sightings (day_part) and the shape of the UFOs sighted.

Naturally, to capture the trends in UFO sightings we intend to create a time series visualization. The x-axis will represent time and the y-axis will represent the count of sightings. The count will be calculated using a similar process to the one described for question 1, using dplyr::mutate() and dplyr::group_by().

The first visualization will focus on the time of day (day_part). At present, there are 9 distinct categories for this variable, all corresponding to astronomical definitions, so we believe faceting would be too much. Instead, we intend to map day_part to the color aesthetic. If the plot still appears too crowded, we may focus solely on a subset of times of day (e.g., variations of dusk and night). Alternatively, we may group all of dawn (astronomical dawn, nautical dawn, civil dawn) together and all of dusk (astronomical dusk, nautical dusk, civil dusk) together, and facet or color by this smaller set of categories instead. As for the scale of the x-axis, we would like to explore plotting by year and by month. Plotting by year may show whether the number of sightings, or interest in reporting them, has changed over time or since the pandemic. Plotting by month may show seasonality in UFO observations, perhaps due to temperature or when people are on vacation.
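The grouping of dawn and dusk categories could be done with a pattern match, as in this sketch. The toy rows are invented, the "afternoon" label is an assumption about how the remaining categories are spelled, and day_part_grouped is a hypothetical column name.

```r
library(dplyr)
library(stringr)

# Toy data; day_part labels follow the categories named in the text.
ufo_toy <- tibble::tibble(
  year = c(2019, 2019, 2020, 2020, 2020),
  day_part = c("astronomical dusk", "night", "civil dawn",
               "nautical dusk", "afternoon")
)

day_part_counts <- ufo_toy |>
  mutate(
    # Collapse the three dawn and three dusk variants; keep the rest as-is.
    day_part_grouped = case_when(
      str_detect(day_part, "dawn") ~ "dawn",
      str_detect(day_part, "dusk") ~ "dusk",
      TRUE ~ day_part
    )
  ) |>
  # Yearly count per grouped category, ready for a line plot
  count(year, day_part_grouped, name = "n_sightings")
```

The result has one row per year and grouped category, so it could be plotted with year on the x-axis, n_sightings on the y-axis, and day_part_grouped as the color aesthetic.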

An interesting alternative for that plot would put month on the x-axis and plot each year (or a representative year for each decade) in light gray, with one or two specific years highlighted in a different color. One highlighted year could be 2020, to see whether COVID-19 lockdowns influenced the number of reported sightings. With month on the axis, it may also be interesting to highlight a typical pre-pandemic year to look for seasonal variability in UFO sightings, perhaps due to temperature, holidays, or even a major film’s release.
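A rough sketch of the highlighted-year idea, using simulated monthly counts rather than real data. The years, the Poisson rate, the colors, and all object names are arbitrary choices for illustration.

```r
library(dplyr)
library(ggplot2)

# Simulated monthly counts for three years; numbers are not real data.
set.seed(1)
monthly_counts <- expand.grid(year = 2018:2020, month = 1:12) |>
  mutate(
    n_sightings = rpois(n(), lambda = 40),
    highlight   = year == 2020
  )

highlight_plot <- ggplot(monthly_counts,
                         aes(x = month, y = n_sightings, group = year)) +
  # All years in light gray as background context
  geom_line(color = "gray80") +
  # The highlighted year drawn on top in a contrasting color
  geom_line(data = filter(monthly_counts, highlight), color = "firebrick") +
  scale_x_continuous(breaks = 1:12, labels = month.abb) +
  labs(x = "Month", y = "Reported sightings")
```

Drawing the highlighted year as a second layer keeps it above the gray lines; a second highlighted year would be one more geom_line() layer with its own color.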

The second visualization will focus on the shape and duration of UFO sightings. Rather than focusing on how the distribution of these two variables has changed over time, it may be interesting enough to focus on their distribution across all US observations. One possible idea is a lollipop plot by UFO shape, with the length of each lollipop determined by the average duration of sightings for that shape. Another option is a bubble plot, a type of ggplot we have not looked at this year.

It is important to note that plotting an average obscures the variability in the dataset and also hides the number of observations in each category. Since there are roughly 25 categories of shape and grouping shapes together is very subjective (as are the descriptions of these sightings), showing variability would be difficult and overcrowded. A typical solution such as a boxplot or violin plot would not be legible with this many categories. Thus, the goal of this plot is not to capture variability. As for the number of observations, an annotation could perhaps be used to denote this value. If possible, it would also be intriguing to use a diverging version of the lollipop plot, in which a median/mean duration is calculated and each lollipop grows up if it is above that measure of center and down if it is below.
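The diverging lollipop idea could be sketched as below. The toy shapes and durations are invented, duration_seconds is an assumed column name, and the measure of center is taken here to be the median of the per-shape mean durations, which is only one of the options mentioned above.

```r
library(dplyr)
library(ggplot2)

# Toy data; shapes and durations (in seconds) are invented for illustration.
ufo_toy <- tibble::tibble(
  shape = c("light", "light", "triangle", "triangle", "disk", "disk"),
  duration_seconds = c(60, 120, 300, 500, 30, 50)
)

shape_summary <- ufo_toy |>
  group_by(shape) |>
  summarize(
    mean_duration = mean(duration_seconds, na.rm = TRUE),
    n_sightings   = n(),
    .groups = "drop"
  ) |>
  mutate(
    center    = median(mean_duration),        # assumed measure of center
    shape     = reorder(shape, mean_duration), # sort shapes by mean duration
    direction = if_else(mean_duration >= center, "above", "below")
  )

lollipop <- ggplot(shape_summary, aes(x = shape, y = mean_duration)) +
  # Stem from the measure of center to each shape's mean duration
  geom_segment(aes(xend = shape, y = center, yend = mean_duration,
                   color = direction)) +
  geom_point(aes(color = direction), size = 3) +
  # Annotate each lollipop with its number of observations
  geom_text(aes(label = paste0("n = ", n_sightings)), hjust = -0.3, size = 3) +
  geom_hline(aes(yintercept = center), linetype = "dashed") +
  coord_flip() +
  labs(x = "Shape", y = "Mean duration (seconds)")
```

Starting each stem at the center rather than at zero is what makes the plot diverge, and the text labels carry the observation counts that the averages would otherwise hide.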