library(tidyverse)
UFO Sightings Visualized
Understanding factors influencing UFO sightings with data visualization
Dataset
# Read the three tables that make up the UFO Sightings dataset
ufo_sightings <- read.csv("data/ufo_sightings.csv")
places <- read.csv("data/places.csv")
day_parts_map <- read.csv("data/day_parts_map.csv")
glimpse(ufo_sightings)
Rows: 96,429
Columns: 12
$ reported_date_time <chr> "2022-08-29T06:03:00Z", "2022-08-20T01:51:00Z",…
$ reported_date_time_utc <chr> "2022-08-29T06:03:00Z", "2022-08-20T01:51:00Z",…
$ posted_date <chr> "2022-09-09", "2022-10-08", "2022-09-09", "2022…
$ city <chr> "Pinehurst", "Rapid City", "Cleveland", "Bloomi…
$ state <chr> "NC", "MI", "OH", "IN", "CA", "OK", "VA", "CT",…
$ country_code <chr> "US", "US", "US", "US", "US", "US", "US", "US",…
$ shape <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ reported_duration <chr> "15 mins\u0085", "1 minute", "2 hours", "30 sec…
$ duration_seconds <dbl> 900, 60, 172800, 30, 180, 600, 20, 300, 120, 18…
$ summary <chr> "Saw multi color object above horizon.", "An ob…
$ has_images <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
$ day_part <chr> "night", "nautical dusk", "night", "afternoon",…
glimpse(places)
Rows: 14,417
Columns: 10
$ city <chr> "Pinehurst", "Rapid City", "Cleveland", "Blooming…
$ alternate_city_names <chr> "Pajnkherst,bynhwrst,pynhwrst karwlynay shmaly,П…
$ state <chr> "NC", "MI", "OH", "IN", "CA", "OK", "VA", "CT", "…
$ country <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA", …
$ country_code <chr> "US", "US", "US", "US", "US", "US", "US", "US", "…
$ latitude <dbl> 35.19543, 44.83445, 41.49950, 39.16533, 33.66946,…
$ longitude <dbl> -79.46948, -85.28256, -81.69541, -86.52639, -117.…
$ timezone <chr> "America/New_York", "America/Detroit", "America/N…
$ population <int> 15752, 1352, 388072, 84067, 256927, 60451, 24729,…
$ elevation_m <int> 160, 192, 199, 235, 17, 382, 89, 11, NA, 1156, 16…
glimpse(day_parts_map)
Rows: 26,409
Columns: 12
$ rounded_lat <int> 40, 40, 40, 40, 30, 40, 40, 40, -30, 40, 4…
$ rounded_long <int> -80, -90, -80, -90, -120, -100, -80, -70, …
$ rounded_date <chr> "2022-08-28", "2022-08-21", "2022-08-14", …
$ astronomical_twilight_begin <chr> "09:07:43", "09:38:33", "08:48:54", "09:19…
$ nautical_twilight_begin <chr> "09:42:49", "10:14:53", "09:26:40", "09:58…
$ civil_twilight_begin <chr> "10:16:18", "10:49:12", "10:01:56", "10:34…
$ sunrise <chr> "10:42:53", "11:16:15", "10:29:32", "11:02…
$ solar_noon <chr> "17:21:10", "18:03:05", "17:24:39", "18:05…
$ sunset <chr> "23:59:27", "00:49:55", "00:19:45", "01:08…
$ civil_twilight_end <chr> "00:26:02", "01:16:58", "00:47:21", "01:36…
$ nautical_twilight_end <chr> "00:59:31", "01:51:17", "01:22:37", "02:13…
$ astronomical_twilight_end <chr> "01:34:37", "02:27:37", "02:00:24", "02:52…
We are using the UFO Sightings dataset, which includes ufo_sightings, places, and day_parts_map. The dimensions for these three data tables are as follows: ufo_sightings has 96,429 observations (in rows) and 12 variables (in columns), places has 14,417 observations and 10 variables, and day_parts_map has 26,409 observations and 12 variables. The data come from the National UFO Reporting Center and also include data from sunrise-sunset.org by Jon Harmon. We will be examining reported UFO sightings in the United States.
UFO sightings are more common than we originally thought, which sparked our interest. Using R to visualize this large dataset could help people understand where, when, and how these sightings are being reported. It could also shed light on potential factors that influence the number and types of UFO sightings. Additionally, we chose this dataset over a few other candidates because of its large number of observations and its variety of data, including numerical, categorical, and geographical data (longitude and latitude, for example). The large sample size will help make our data visualizations more reliable, and the variety of data will allow us to explore the relationships between variables in a more compelling way.
Questions
Is there a specific geographical region that reports a particularly high number of UFO sightings? If so, what factors might play a role in increasing the UFO sightings in that region? Potential factors may include elevation and population (or population density).
How do astronomical conditions affect the quantity of UFO sightings? Is light pollution related to the number of UFO sightings in some way? Does the quantity of reported UFO sightings differ based on whether it is during twilight, sunrise, or sunset?
Analysis plan
To answer our first question, we plan to focus on making maps. We would like to use shapefiles to show the density of sightings by location alongside elevation and population data. To do that, we plan to layer data on a shapefile: filling the background (by city, county, or state regions, depending on what makes the most sense) according to elevation and population density, and then adding the location of each sighting as a point on the map. This lets us include plenty of data in each visualization while keeping it from becoming too cluttered. We intend to restrict our analysis to sightings within the United States, which still allows us to retain 91.4797% of the data in ufo_sightings. It is easy to join ufo_sightings to places based on city, state, and country_code, and places contains latitude and longitude data. places also provides elevation and population information, and we can use it to join to census data on population density and to US shapefiles using the tidycensus and tigris packages, respectively. We do not intend to collapse the rows of the dataset into individual UFO sighting events, as that would impose our opinion on the data: two sightings in nearby areas at similar times might still not be sightings of the same object (for example, if a meteor shower was visible from both locations at the time).
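The filter-and-join step described above can be sketched as follows. This is a minimal illustration, not our final pipeline: the toy tables reuse a few values from the glimpse() output, while the "Exampleville" row and the names ufo_toy, places_toy, us_share, and unmatched are made up for the example.

```r
library(dplyr)

# Toy stand-in for ufo_sightings; "Exampleville" is a hypothetical non-US row
ufo_toy <- tibble(
  city = c("Pinehurst", "Rapid City", "Cleveland", "Cleveland", "Exampleville"),
  state = c("NC", "MI", "OH", "OH", "ON"),
  country_code = c("US", "US", "US", "US", "CA")
)

# Toy stand-in for places; the Exampleville values are invented
places_toy <- tibble(
  city = c("Pinehurst", "Rapid City", "Cleveland", "Exampleville"),
  state = c("NC", "MI", "OH", "ON"),
  country_code = c("US", "US", "US", "CA"),
  latitude = c(35.19543, 44.83445, 41.49950, 45.0),
  longitude = c(-79.46948, -85.28256, -81.69541, -75.0),
  population = c(15752L, 1352L, 388072L, 1000L),
  elevation_m = c(160L, 192L, 199L, 100L)
)

# Share of sightings that survive the US-only restriction
us_share <- mean(ufo_toy$country_code == "US")

# Keep US sightings and attach coordinates, population, and elevation
us_sightings <- ufo_toy %>%
  filter(country_code == "US") %>%
  left_join(places_toy, by = c("city", "state", "country_code"))

# Sightings with no matching place would show up as missing coordinates
unmatched <- sum(is.na(us_sightings$latitude))
```

Using left_join() keeps every sighting even when no place matches, so unmatched rows can be counted and inspected rather than silently dropped.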
Our peer reviewers pointed out a potential issue with how the data were collected: some places report their first UFO sightings earlier than others. We will explore this in our data analysis and convert counts to some measure of rate if it proves to be a problem. However, these data were all collected by a central source (not by individual local governments), so our baseline expectation is that some locations report earlier sightings only because UFOs were seen there first; people in all locations have had the same period of time in which to report UFO sightings near them. We also acknowledge the reviewers' concern about population density, which is precisely why we hope to incorporate that information into our plots for question 1.
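One possible reading of "some measure of rate" is sightings per year since a location's first report. The sketch below illustrates that idea on toy data; the timestamps are invented, and the names toy_sightings, last_report, and sighting_rates are hypothetical. A per-capita rate would be an equally reasonable alternative.

```r
library(dplyr)
library(lubridate)

# Toy sightings; states and timestamps are illustrative, not real counts
toy_sightings <- tibble(
  state = c("OH", "OH", "OH", "MI", "MI"),
  reported_date_time_utc = c(
    "2002-08-29T06:03:00Z", "2012-08-29T06:03:00Z", "2022-08-29T06:03:00Z",
    "2020-08-29T06:03:00Z", "2022-08-29T06:03:00Z"
  )
) %>%
  mutate(reported = ymd_hms(reported_date_time_utc))

# End of the observation window: the latest report anywhere in the data
last_report <- max(toy_sightings$reported)

sighting_rates <- toy_sightings %>%
  group_by(state) %>%
  summarise(
    n_sightings = n(),
    first_report = min(reported),
    .groups = "drop"
  ) %>%
  mutate(
    # Years between a state's first report and the end of the window;
    # a state whose only report is the latest one would get 0 years here
    years_observed = time_length(interval(first_report, last_report), "years"),
    sightings_per_year = n_sightings / years_observed
  )
```

In this toy example the state with the longer reporting history has more sightings in total but a lower rate per year, which is the kind of distortion a rate is meant to expose.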
To answer our second question, we will look at parts of the day and, if we can find it, data on light pollution in the US (once again, we are restricting our analysis to sightings in the US). We will focus first on parts of the day, with the hypothesis that sightings are most likely to happen in times of low light (i.e., dawn and dusk). From ufo_sightings, we have dates, times, and locations; after joining to places, we also have latitude and longitude. Using these, we can join to day_parts_map, matching on latitude, longitude, and date, and then determine which time period each sighting falls into (using ranges of time derived from the various twilight variables in day_parts_map; various functions in the lubridate package will be useful here). With these data, we can create a bar chart of the counts of sightings by time period. We can break this down into astronomical, nautical, and civil twilight, as well as the periods from sunrise to sunset and from the end of astronomical twilight to its beginning (perhaps cutting these periods into smaller intervals, depending on what trends the data reveal).
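The join and classification described above might look roughly like this. The sketch rests on two assumptions that the proposal does not confirm: that rounded_lat and rounded_long are coordinates rounded to the nearest 10 degrees, and that rounded_date is the UTC timestamp rounded to the nearest week starting on Sunday. Both are consistent with the first rows of the glimpse() output, which supply the toy values here. The helpers to_secs and in_window, the column day_part_sketch, and the coarse four-way split are ours, chosen for illustration.

```r
library(dplyr)
library(lubridate)

# Two sightings from the glimpse() output, with coordinates from places
sightings <- tibble(
  reported_date_time_utc = c("2022-08-29T06:03:00Z", "2022-08-20T01:51:00Z"),
  latitude = c(35.19543, 44.83445),
  longitude = c(-79.46948, -85.28256)
)

# Matching rows of day_parts_map (a subset of its columns; times are UTC)
day_parts <- tibble(
  rounded_lat = c(40L, 40L),
  rounded_long = c(-80L, -90L),
  rounded_date = c("2022-08-28", "2022-08-21"),
  astronomical_twilight_begin = c("09:07:43", "09:38:33"),
  sunrise = c("10:42:53", "11:16:15"),
  sunset = c("23:59:27", "00:49:55"),
  astronomical_twilight_end = c("01:34:37", "02:27:37")
)

# "HH:MM:SS" -> seconds since midnight
to_secs <- function(x) period_to_seconds(lubridate::hms(x))

# TRUE when t lies in [start, end], allowing windows that wrap past midnight UTC
in_window <- function(t, start, end) {
  if_else(start <= end, t >= start & t <= end, t >= start | t <= end)
}

classified <- sightings %>%
  mutate(
    when = ymd_hms(reported_date_time_utc, tz = "UTC"),
    rounded_lat = as.integer(round(latitude, -1)),   # assumed rounding rule
    rounded_long = as.integer(round(longitude, -1)), # assumed rounding rule
    rounded_date = as.character(
      as_date(round_date(when, "week", week_start = 7)) # assumed rounding rule
    ),
    t = hour(when) * 3600 + minute(when) * 60 + second(when)
  ) %>%
  left_join(day_parts, by = c("rounded_lat", "rounded_long", "rounded_date")) %>%
  mutate(
    day_part_sketch = case_when(
      is.na(sunrise) ~ NA_character_, # no matching row in the map
      in_window(t, to_secs(sunrise), to_secs(sunset)) ~ "daylight",
      in_window(t, to_secs(astronomical_twilight_begin), to_secs(sunrise)) ~
        "morning twilight",
      in_window(t, to_secs(sunset), to_secs(astronomical_twilight_end)) ~
        "evening twilight",
      TRUE ~ "night"
    )
  )
```

Because the times in day_parts_map are in UTC, sunset can fall after midnight, which is why in_window() treats a window whose start is later than its end as wrapping around. ufo_sightings already carries a day_part column, so a classification like this one can be checked against it before we build the bar chart.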
If we can find good data on light pollution (joined through our various location variables), we would make a visualization exploring how light pollution affects sightings in various locations (perhaps returning to geospatial data, perhaps simply a scatterplot of sightings vs. light pollution). Our peer reviewers asked how we would handle changes in light pollution over time, which is difficult to answer before we have access to these data. However, there are two possible outcomes. One is that we have time-specific data on light pollution in various locations, in which case we can focus more on scatterplots or bar charts (which would give us a better measure of sightings as a function of light pollution, perhaps grouped into intervals). The other is that we do not have time-specific data, in which case we would assume (and hope) that light pollution does not vary much within a location. For example, we expect that the relative light pollution in New York City is always substantially higher than in rural Montana, even if absolute levels deviate somewhat from day to day. In this case, the data may lend themselves more to a visualization based on geospatial data.
If we cannot find light pollution data, another logical way to explore astronomical conditions would be to look specifically at time-of-day data, showing the average number of sightings by time of day in a line plot, perhaps differentiated by location or season. We could also incorporate the data on duration (contained in duration_seconds), potentially revealing whether sightings in broad daylight tend to last for longer periods of time (or any other such trends). In all of this, we intend to glean information on how atmospheric conditions, namely visibility, affect reports of UFO sightings.
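As a rough sketch of the duration comparison, the code below summarises duration_seconds by day_part and builds a simple plot. The four toy rows pair the first values of the two columns in the glimpse() output; the names toy, duration_by_part, and p are hypothetical, and the choice of the median and a log scale is our assumption rather than a settled part of the plan.

```r
library(dplyr)
library(ggplot2)

# First four sightings from the glimpse() output
toy <- tibble(
  day_part = c("night", "nautical dusk", "night", "afternoon"),
  duration_seconds = c(900, 60, 172800, 30)
)

# The median is less sensitive than the mean to extreme durations
duration_by_part <- toy %>%
  group_by(day_part) %>%
  summarise(
    n_sightings = n(),
    median_duration = median(duration_seconds),
    .groups = "drop"
  )

# Durations span several orders of magnitude, so a log scale keeps them readable
p <- ggplot(duration_by_part, aes(x = day_part, y = median_duration)) +
  geom_point(size = 3) +
  scale_y_log10() +
  labs(
    x = "Part of day",
    y = "Median duration (seconds, log scale)",
    title = "Sighting duration by part of day (toy data)"
  )
```

The 172800-second value appears alongside a reported_duration of "2 hours" in the glimpse() output, which suggests the parsed durations deserve a sanity check before we draw conclusions from them.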
Another idea we are considering is finding a dataset on astronomical events (like meteor showers, lunar phases, and planetary alignments) and creating some form of time series visualization to see whether sightings are correlated with these events. However, we are unlikely to explore this avenue unless we find nothing of note in the other avenues we intend to explore. We do acknowledge our peer reviewers’ concern about finding these external datasets. We have included information throughout on how we will incorporate them (mostly based on geographic data), but finding them (beyond the aforementioned Census data) remains a concern for us as well. We will continue to search the internet in the hopes of encountering the data we are looking for.