Squirrlytics: Central Park Squirrels

Proposal

library(tidyverse) # also attaches ggplot2, dplyr, tidyr, and readr
library(knitr)

squirrels <- read_csv("data/squirrels.csv")

Dataset

This dataset, the Central Park Squirrel Census, describes squirrels counted during a 2018 census of New York City’s Central Park. The data were originally collected by this Census and later used for TidyTuesday. The census recorded each squirrel sighting along with a variety of attributes, including location, fur color, age, and the activities the squirrel was engaged in. More information about this Census is listed here.

This dataset has 3023 observations (one per squirrel sighting) and 31 variables. Some variables are numerical, such as longitude, latitude, and date of sighting, and the rest are categorical. The categorical variables include fur color, age category, above/below ground location, activities (running, chasing, climbing, etc.), and interactions with other squirrels and with humans. Many of the variables are stored as logical (TRUE/FALSE) values; during data wrangling, we will convert them to categorical variables.
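The logical-to-categorical conversion described above could look roughly like the sketch below. It runs on a small made-up tibble (`toy`) rather than the real data, and the "No"/"Yes" labels are our own assumption, not something defined by the Census.

```r
library(tidyverse)

# Toy stand-in for a few columns of the squirrel data (values are made up)
toy <- tibble(
  Shift   = c("AM", "PM", "PM"),
  Running = c(TRUE, FALSE, TRUE),
  Eating  = c(FALSE, FALSE, TRUE)
)

# Recode every logical column as a two-level factor; other columns are untouched
toy_cat <- toy %>%
  mutate(across(
    where(is.logical),
    ~ factor(.x, levels = c(FALSE, TRUE), labels = c("No", "Yes"))
  ))
```

Using `across(where(is.logical), ...)` avoids listing each behavior column by name, which helps given how many TRUE/FALSE columns the dataset has.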

We chose this dataset because we love squirrels! Looking through the TidyTuesday datasets, this one immediately caught our eye because of its fun subject matter. We explored the Squirrel Census website and were intrigued by the complexity of the topic. At Duke, we have a lot of crazy campus squirrels, so we are also curious to see whether Central Park squirrels have habits similar to those of Duke squirrels. We also thought that we could create cool geospatial visualizations, since the data includes longitude and latitude information within Central Park.

Below, we have displayed the 31 variables in the dataset.

names(squirrels)
 [1] "X"                                         
 [2] "Y"                                         
 [3] "Unique Squirrel ID"                        
 [4] "Hectare"                                   
 [5] "Shift"                                     
 [6] "Date"                                      
 [7] "Hectare Squirrel Number"                   
 [8] "Age"                                       
 [9] "Primary Fur Color"                         
[10] "Highlight Fur Color"                       
[11] "Combination of Primary and Highlight Color"
[12] "Color notes"                               
[13] "Location"                                  
[14] "Above Ground Sighter Measurement"          
[15] "Specific Location"                         
[16] "Running"                                   
[17] "Chasing"                                   
[18] "Climbing"                                  
[19] "Eating"                                    
[20] "Foraging"                                  
[21] "Other Activities"                          
[22] "Kuks"                                      
[23] "Quaas"                                     
[24] "Moans"                                     
[25] "Tail flags"                                
[26] "Tail twitches"                             
[27] "Approaches"                                
[28] "Indifferent"                               
[29] "Runs from"                                 
[30] "Other Interactions"                        
[31] "Lat/Long"                                  

Questions

The two questions we are seeking to answer are:

  1. Where do different colors and ages of squirrels reside in Central Park?

  2. What activities do squirrels partake in based on time of day, and how does this vary with human interactions?

Analysis Plan

In order to answer the first question, we plan to use five variables:

  • X (longitude) (numerical)
  • Y (latitude) (numerical)
  • Age (categorical)
  • Primary Fur Color (categorical)
  • Location (categorical)

To create the latitude and longitude variables, we will mutate the Lat/Long variable, splitting it into two separate numerical variables. We plan to create a scatterplot in the shape of Central Park. We will plot longitude on the x-axis and latitude on the y-axis to show where the squirrels are located in Central Park. We will create two different visuals, one showing the age of the squirrels and one showing their fur color. We could also facet, or use color or shape aesthetic mappings, by whether the squirrel was located above ground or below ground. If it is feasible, we also plan to overlay the plot on an image or map of Central Park to portray the squirrels’ locations within the park more accurately.
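A minimal sketch of this step is shown below, using made-up rows rather than the real data. It assumes the `Lat/Long` column stores text of the form `POINT (longitude latitude)`; that format, the toy coordinates, and the new column names `longitude` and `latitude` are assumptions for illustration.

```r
library(tidyverse)

# Toy rows; the "POINT (lon lat)" text format is an assumption about `Lat/Long`
toy <- tibble(
  `Lat/Long` = c("POINT (-73.9561 40.7941)", "POINT (-73.9689 40.7838)"),
  Age        = c("Adult", "Juvenile")
)

# Split the text into two numeric columns (convert = TRUE parses the numbers)
coords <- toy %>%
  tidyr::extract(
    `Lat/Long`,
    into    = c("longitude", "latitude"),
    regex   = "POINT \\((-?[0-9.]+) (-?[0-9.]+)\\)",
    convert = TRUE
  )

# Longitude on the x-axis, latitude on the y-axis, colored by age
p <- ggplot(coords, aes(x = longitude, y = latitude, color = Age)) +
  geom_point(alpha = 0.6) +
  coord_quickmap() + # approximate map projection so the park keeps its shape
  labs(x = "Longitude", y = "Latitude", color = "Age")
```

Swapping `color = Age` for the fur color column, or adding `facet_wrap(~ Location)`, would give the other views described in the plan.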

In order to answer the second question, we plan to use 14 variables:

  • Shift (categorical)
  • Running (categorical)
  • Chasing (categorical)
  • Climbing (categorical)
  • Eating (categorical)
  • Foraging (categorical)
  • Kuks (categorical)
  • Quaas (categorical)
  • Moans (categorical)
  • Tail flags (categorical)
  • Tail twitches (categorical)
  • Approaches (categorical)
  • Indifferent (categorical)
  • Runs from (categorical)

To answer the second question, we plan to use Shift together with the categorical variables we will create from the logical variables listed above. We want to explore the activities that squirrels partake in, so we will group the related variables into the following categories: activities (running, chasing, climbing, eating, foraging), noises (kuks, quaas, moans), tail movements (tail flags, tail twitches), and interactions with humans (approaches, indifferent, runs from).
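One way this grouping might be done is to pivot the behavior columns into long form and label each with its category. The sketch below uses a made-up tibble with one column from each group; the names `behavior`, `observed`, and `category` are our own choices for illustration.

```r
library(tidyverse)

# Toy data with one column from each of the four groups (values are made up)
toy <- tibble(
  Shift        = c("AM", "PM", "PM"),
  Running      = c(TRUE, FALSE, TRUE),
  Kuks         = c(FALSE, TRUE, FALSE),
  `Tail flags` = c(FALSE, FALSE, TRUE),
  Approaches   = c(TRUE, FALSE, FALSE)
)

# One row per (sighting, behavior) pair, keeping only behaviors that occurred
behavior_long <- toy %>%
  pivot_longer(-Shift, names_to = "behavior", values_to = "observed") %>%
  filter(observed) %>%
  mutate(category = case_when(
    behavior %in% c("Running", "Chasing", "Climbing", "Eating", "Foraging") ~ "activities",
    behavior %in% c("Kuks", "Quaas", "Moans") ~ "noises",
    behavior %in% c("Tail flags", "Tail twitches") ~ "tail movements",
    behavior %in% c("Approaches", "Indifferent", "Runs from") ~ "interactions with humans"
  ))

# Frequency of each behavior by time of day, ready for a bar chart
behavior_counts <- behavior_long %>%
  count(Shift, category, behavior)
```

The long format keeps the category label next to each behavior, so later plots can fill, facet, or filter by `category` without further reshaping.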

First, we could use a bar chart, stacked bar chart, or pie chart (by changing the coordinate system) to display the frequency of different squirrel activities by time of day, AM or PM. For this, we would plot the activities on the x-axis and fill the bars by time of day. Another option would be to visualize the levels within each of the four categories (activities, noises, tail movements, and interactions with humans), either by picking one or two categories to focus on or by displaying them all with faceting. For this option, we could use filled bar charts to display the frequencies of the different levels within each category. A third option would be to analyze only interactions with humans, as we are also interested in this variable on its own. We could also consider a lollipop chart, a plot from lecture, as an alternative visualization method. Finally, we could use a heatmap (geom_tile()) to compare how often pairs of activities occur together for a single squirrel. This would require data wrangling to compute a numeric variable to map to the fill color, but it would be a unique visualization.
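The numeric variable for the heatmap could be a co-occurrence count: for each pair of activities, the number of sightings in which both were TRUE. The sketch below is one possible approach on made-up data, not a settled design; `co`, `co_long`, `activity_a`, `activity_b`, and `n` are names we introduce here.

```r
library(tidyverse)

# Toy activity flags for four sightings (values are made up)
toy <- tibble(
  Running  = c(TRUE, FALSE, TRUE, TRUE),
  Eating   = c(FALSE, TRUE, TRUE, FALSE),
  Foraging = c(FALSE, TRUE, TRUE, TRUE)
)

# Convert to a 0/1 matrix; crossprod() then counts, for every pair of columns,
# the sightings where both activities were TRUE
m  <- as.matrix(toy) * 1
co <- crossprod(m)

# Reshape the square matrix to long form for plotting
co_long <- as_tibble(co, rownames = "activity_a") %>%
  pivot_longer(-activity_a, names_to = "activity_b", values_to = "n")

# Heatmap of co-occurrence counts
p <- ggplot(co_long, aes(x = activity_a, y = activity_b, fill = n)) +
  geom_tile() +
  geom_text(aes(label = n), color = "white") +
  labs(x = NULL, y = NULL, fill = "Sightings")
```

Note that the diagonal holds each activity's total count rather than a true pairing, so it may be worth dropping or greying out those tiles in the final plot.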

As of now, we do not plan to merge in any external data, but that could change once we begin wrangling the data and making our visualizations.