DATA 73000 Introduction to Data Visualization

The Question

Using the NYC OpenData dataset 311 Service Requests from 2010 to Present, I wanted to find out whether the frequency or type of complaints made by New York City residents changed significantly when the city shut down due to the COVID-19 pandemic and all “non-essential” workers went into quarantine. My hypothesis was that the stress of quarantine, and of the pandemic in general, would significantly affect the number and type of complaints that people took the time to report to 311, and that people would tend to complain more overall. Much has already been done to understand the pandemic’s impact on mental health, but I believed this analysis might offer another window into how people coped with the pressures and psychological stress of quarantine, one that could be useful to city government officials and mental health professionals in planning future support.

The Data

I downloaded the 311 Service Requests from 2010 to Present data from the NYC OpenData website, covering the two years before NYC public schools closed on March 16, 2020 through March 15, 2022. This gave me a two-year pre-pandemic baseline and an equally sized two-year period after the start of the pandemic for comparison. I used March 16, 2020 as the center date because that is when NYC public schools shut down and when many offices, like my own, sent people home to work remotely, so it marks the beginning of the quarantine period. Because of the size of the dataset and the time limits of the project, I kept only records where the “Borough” was entered as “Staten Island” (since that is where I live). I may expand the analysis to the other boroughs in the future, since differences between the boroughs would be an interesting question in their own right.

I used R to clean the data first by doing the following:

selecting only the columns I thought I would need,
reformatting date columns to date format,
converting most character columns to title case to remove inconsistencies where some data was entered in all uppercase,
recoding duplicate “Complaint Type” categories to be consistent,
creating a binary “Pandemic” variable to separate the data into before and after categories.
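
The cleaning steps above can be sketched on a tiny made-up table. This is only an illustration in base R (so it runs without extra packages), not the project script, which uses the tidyverse. The column names `created_date` and `complaint_type` and the three rows are assumptions invented for the example.

```r
# Toy stand-in for a few raw 311 rows; the values are invented for illustration.
raw <- data.frame(
  created_date   = c("03/15/2020 11:59:00 PM", "03/16/2020 12:00:00 AM",
                     "07/04/2021 09:30:00 PM"),
  complaint_type = c("NOISE - RESIDENTIAL", "Noise - Residential",
                     "Homeless Encampment"),
  stringsAsFactors = FALSE
)

# Midpoint of the study window: the day NYC public schools closed.
cutoff <- as.Date("2020-03-16")

# Parse the timestamp text and keep only the calendar date.
raw$created_date <- as.Date(raw$created_date, format = "%m/%d/%Y %I:%M:%S %p")

# Title-case the text so "NOISE - RESIDENTIAL" and "Noise - Residential" match.
raw$complaint_type <- gsub("\\b([a-z])", "\\U\\1",
                           tolower(raw$complaint_type), perl = TRUE)

# Recode one duplicate category name to a single consistent label.
raw$complaint_type <- sub("^Homeless Encampment$", "Encampment",
                          raw$complaint_type)

# Binary flag: "before" for dates earlier than the cutoff, "after" otherwise.
raw$pandemic <- factor(ifelse(raw$created_date < cutoff, "before", "after"),
                       levels = c("before", "after"))

print(raw)
```

After these steps the two differently cased noise rows share one complaint type, and only the March 15 record is flagged as "before".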

There were an unwieldy 223 unique complaint types in the data, so I first kept only complaint types with a total count of at least 200 over the four-year period, which reduced the number of complaint types to 109. I then kept only complaint types where the total number of complaints differed by more than ten percent (in absolute value) between the two years before the start of quarantine and the two years after. I also excluded any complaint type with no complaints at all in one of the two periods, since categories specifically related to the pandemic (like “Vaccine Mandate Non-Compliance”) have no baseline. That reduced the total to 81 complaint types. I then realized that some new complaint types that aren’t specifically about the pandemic might also have resulted from the stress, so I filtered a second data set for complaint types that appeared only after the beginning of quarantine. There were 9 new categories:

Covid-19 Non-Essential Construction – 271 complaints
Dead Animal – 512 complaints
Dumpster Complaint – 214 complaints
Illegal Dumping – 612 complaints
Noncompliance With Phased Reopening – 3716 complaints
Obstruction – 381 complaints
Residential Disposal Complaint – 368 complaints
Storm – 476 complaints
Vaccine Mandate Non-Compliance – 751 complaints

The first, fifth, and ninth items on the list are new because they are pandemic related, but the others are not directly related to the pandemic. I would like to visualize these in a future iteration of the project.
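
The two-stage filter described before the list (a minimum total count, then a minimum percent change, with brand-new types set aside) can be sketched as follows. This is a simplified base R illustration on invented counts; the type names "Type A" through "Type E" and their numbers are assumptions, and `NA` stands for a period with no complaints, as it would after reshaping the real counts.

```r
# Invented before/after totals per complaint type (NA = no complaints at all).
counts <- data.frame(
  complaint_type = c("Type A", "Type B", "Type C", "Type D", "Type E"),
  before = c(500, 1000, 90, NA, 400),
  after  = c(800, 1050, 80, 300, 300),
  stringsAsFactors = FALSE
)

# Stage 1: keep types with at least 200 complaints over the whole window.
counts$total <- rowSums(counts[, c("before", "after")], na.rm = TRUE)
frequent <- counts[counts$total >= 200, ]

# Percent change relative to the pre-quarantine baseline.
frequent$percent_diff <-
  (frequent$after - frequent$before) / frequent$before * 100

# Stage 2: keep types with a baseline whose change exceeds 10% either way.
changed <- frequent[!is.na(frequent$percent_diff) &
                      abs(frequent$percent_diff) > 10, ]

# Separate set: types that only appear after the start of quarantine.
new_types <- frequent[is.na(frequent$before), ]

print(changed[, c("complaint_type", "percent_diff")])
print(new_types$complaint_type)
```

Here "Type C" is dropped for being too rare, "Type B" for changing by only 5%, and "Type D" lands in the new-types set because it has no baseline to compare against.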

Finally, I visualized the remaining categories as time series line charts to view each one individually, and chose the ones with the most visually and categorically interesting results for my final project.

The Results

I wasn’t sure if I would find anything at all, but the analysis showed some differences that, in hindsight, I should have expected, along with some surprising changes in the frequency (both increases and decreases) of specific request types. Some of these fell into broader categories, such as three different types of noise complaints and two types of complaints about homelessness, and I visualized them together in groups where it seemed appropriate. Three particularly interesting results were visualized individually: Indoor Air Quality, Request for Large Bulky Item Collection, and Drug Activity. I chose to visualize all of them as time series area charts rather than line charts, since the solid color areas showed the difference in call volume more clearly than lines did. I also included a vertical reference line at the midpoint to highlight the start of quarantine, and horizontal reference lines at the before and after quarantine averages to highlight the change in call volume from one period to the next. I chose a relatively soft and muted color palette but did not use color to indicate categories, since it wasn’t always clear which complaint types should go together or if they fit into a category at all.
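
The values behind those reference lines are simple to compute. As a rough sketch (the charting tool used for the project is not shown here, and the daily counts below are invented), the vertical line sits at the midpoint date and each horizontal line sits at the mean call volume of its period:

```r
# Invented daily call counts for one complaint type around the midpoint.
daily <- data.frame(
  date  = seq(as.Date("2020-03-09"), as.Date("2020-03-22"), by = "day"),
  calls = c(2, 3, 1, 2, 4, 3, 6, 6, 7, 5, 8, 6, 7, 10)
)

# Vertical reference line: the start of quarantine.
midpoint <- as.Date("2020-03-16")

# Label each day by period, then average the calls within each period.
period <- ifelse(daily$date < midpoint, "before", "after")
ref <- tapply(daily$calls, period, mean)

# Horizontal reference lines: one average per period.
cat("Average before:", ref[["before"]], "\n")
cat("Average after: ", ref[["after"]], "\n")
```

Each average would be drawn only across its own half of the chart, so the step between the two lines shows the change in call volume at a glance.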

I purposely chose not to include any explanatory text on the visualizations themselves about what you should conclude from them. While it is easy to make assumptions about why we see some of these differences, there may be other reasons that we did not consider. However, I will explain some of my thoughts about what they might mean here.

There were massive spikes in Indoor Air Quality complaints starting almost a month after quarantine began, and even larger spikes from mid-December through mid-April of 2021. One possible explanation is that they came from “essential” workers who were required to show up to their work locations and were concerned about the health risks of inadequate ventilation. It’s also possible that the increase hit its peak in winter because doors and windows are more likely to be closed to keep heat inside. A more detailed analysis of the data, especially the Descriptor and Location Type variables, may provide more insight into the reasons behind these peaks and the specific complaints being made.

One finding that was surprising to me, but probably shouldn’t have been, was the massive increase in requests for sanitation to pick up large bulky items. There are many reasons this could have happened. It could be due to the large number of people who began home improvement or decluttering projects once they were suddenly forced to stay home for weeks on end. It could also reflect people being forcibly displaced from their homes, people voluntarily relocating out of the city, or a combination of all of these and other reasons. The reasons might be hard to parse out of this dataset, however, since the Descriptor and Location Type variables in each case are “Request Large Bulky Item Collection” and “Sidewalk” respectively.

Complaints about homeless encampments were almost nonexistent for about a year and then jumped to more than twice the pre-pandemic level. At the same time, calls about homeless people needing assistance increased dramatically over the entire period. The total number of complaints is relatively small, but that’s not surprising considering that Staten Island makes up less than 6% of the city’s population. Staten Island’s homeless population often gets overlooked because of our relatively small population, but as this 2017 article from the NYT and this 2018 article from the Urban Institute make clear, Staten Island’s homeless problem is not much better or worse than it is in most other parts of the city.

Finally, complaints about noise (residential, street and vehicular) all rose sharply at the beginning of quarantine and have remained above pre-pandemic rates since, with the same seasonal fluctuations but at higher levels. Again, there could be many reasons for this, and a closer look at the Descriptor and Location Type variables may provide some insight.
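
A closer look at the Descriptor variable could start with a simple before/after cross-tabulation for one complaint type. The sketch below uses base R on invented rows; the descriptor values are made-up examples, not values taken from the dataset.

```r
# Invented noise complaints: one row per request.
requests <- data.frame(
  descriptor = c("Loud Music/Party", "Banging/Pounding", "Loud Music/Party",
                 "Loud Talking", "Loud Music/Party", "Banging/Pounding"),
  pandemic   = c("before", "before", "after", "after", "after", "after"),
  stringsAsFactors = FALSE
)
requests$pandemic <- factor(requests$pandemic, levels = c("before", "after"))

# Count requests per descriptor in each period.
by_descriptor <- as.data.frame.matrix(
  table(requests$descriptor, requests$pandemic)
)

# Change in volume, largest increases first.
by_descriptor$change <- by_descriptor$after - by_descriptor$before
by_descriptor <- by_descriptor[order(-by_descriptor$change), ]

print(by_descriptor)
```

The same table built on Location Type instead of Descriptor would show where the additional complaints came from.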

The Future

In addition to visualizing the categories that only showed up in the data post-quarantine and expanding the data set to the other four boroughs of New York City, I would like to do a more thorough statistical analysis. The goal would be to find statistically significant differences between the pre- and post-quarantine means that may not show up so obviously in a visual representation, and to confirm that the differences that seem obvious visually are in fact statistically significant.
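
One possible starting point for that comparison, sketched on invented weekly counts, is a Welch two-sample t-test alongside a rank-based Wilcoxon test. This is only an assumption about how the analysis might begin: both tests treat the weeks as independent, which real 311 time series with seasonality and autocorrelation would likely violate.

```r
# Invented weekly call counts for one complaint type, eight weeks per period.
weekly <- data.frame(
  period = rep(c("before", "after"), each = 8),
  calls  = c(12, 15, 11, 14, 13, 16, 12, 15,
             20, 24, 19, 23, 22, 25, 21, 26)
)

# Welch t-test: compares the two period means without assuming equal variance.
welch <- t.test(calls ~ period, data = weekly)

# Wilcoxon rank-sum test: a check that does not assume normality.
# exact = FALSE because the toy data contain ties.
ranksum <- wilcox.test(calls ~ period, data = weekly, exact = FALSE)

cat("Welch p-value:   ", welch$p.value, "\n")
cat("Wilcoxon p-value:", ranksum$p.value, "\n")
```

A model that accounts for seasonality, such as comparing the same calendar weeks across years, would be a more defensible next step than either test alone.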

The Code

library(tidyverse)
library(psych)

df_raw <- read_csv("../Data/SI_Complaints_2yr_Pre-pandemic_through_today.csv",
               col_types = cols(`Unique Key` = col_integer(),
                                `Created Date` = col_character(),
                                `Closed Date` = col_character(),
                                `Incident Zip` = col_character(),
                                `Vehicle Type` = col_character()))

# Clean the raw export: drop unused columns, parse dates, normalise text case,
# merge duplicate complaint types, and flag each record as before/after shutdown.
df <- df_raw %>%
  select(-c(`Park Facility Name`:`Bridge Highway Segment`),
         -c(`Street Name`:`Address Type`),
         -c(Location,`X Coordinate (State Plane)`,`Y Coordinate (State Plane)`,
            Landmark, BBL)) %>%
  mutate(across(contains("Date"), ~ as.Date(.x, format = "%m/%d/%Y %I:%M:%S %p")),
         across(where(is.character) & !Agency & !`Facility Type` & !contains("Date"),
                str_to_title),
         City = Borough,
         `Complaint Type` = str_replace(`Complaint Type`, "Homeless Encampment", "Encampment"),
         `Complaint Type` = str_replace(`Complaint Type`, "Animal-Abuse", "Animal Abuse"),
         `Complaint Type` = str_replace(`Complaint Type`, "Dirty Conditionss|Dirty Condition$", "Dirty Conditions"),
         `Complaint Type` = str_replace(`Complaint Type`, "Derelict Vehicles|Derelict Vehicle|Abandoned Vehicless|Abandoned Vehicles", "Abandoned Vehicle"),
         `Complaint Type` = str_replace(`Complaint Type`, "Electronics Waste Appointment", "Electronics Waste"),
         `Complaint Type` = str_replace(`Complaint Type`, "For Hire Vehicle Complaint","For Hire Vehicle Report"),
         `Complaint Type` = str_replace(`Complaint Type`, "Litter Basket / Request", "Litter Basket Request"),
         `Complaint Type` = str_replace(`Complaint Type`, "Missed Collection \\(All Materials\\)", "Missed Collection"),
         `Complaint Type` = str_replace(`Complaint Type`, "Snow Or Ice|Snow Removal", "Snow"),
         # This expression was garbled in the source; the cutoff date and the
         # "before"/"after" labels are reconstructed from the write-up
         # (March 16, 2020) and from the column names used further down.
         `Pandemic` = factor(case_when(`Created Date` < as.Date("03-16-2020", format = "%m-%d-%Y") ~ "before",
                                       TRUE ~ "after"),
                             levels = c("before", "after"))) %>%
  # Keep records up to and including March 15, 2022.
  filter(`Created Date` < as.Date("03-16-2022", format = "%m-%d-%Y"))

# describe(df)
# length(unique(df$`Complaint Type`))

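# Total number of complaints per complaint type.
counts <- df %>%
  group_by(`Complaint Type`) %>%
  summarise(count = n())

# Complaint types with at least 200 complaints over the four years.
filter <- counts %>%
  filter(count >= 200)

df_reduced <- df %>%
  filter(`Complaint Type` %in% filter$`Complaint Type`)

# Percent change from before to after; types missing from either period get
# an NA percent_diff and are dropped along with changes of 10% or less.
filter2 <- df_reduced %>%
  group_by(`Complaint Type`, Pandemic) %>%
  summarise(count = n()) %>%
  pivot_wider(., values_from = count, names_from = Pandemic) %>%
  mutate(percent_diff = (after-before)/before*100) %>%
  filter(!is.na(percent_diff) & abs(percent_diff) > 10)

df_cleaned <- df_reduced %>%
  filter(`Complaint Type` %in% filter2$`Complaint Type`)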

write_delim(df_cleaned, "cleaned_311_data.txt", delim = "|", na = "")
saveRDS(df_cleaned, "cleaned_311_data.RDS")

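# Complaint types with no complaints before the start of quarantine.
# The source data frame (df_reduced) is inferred; the assignment was garbled.
filter_post_pandemic <- df_reduced %>%
  group_by(`Complaint Type`, Pandemic) %>%
  summarise(count = n()) %>%
  pivot_wider(., values_from = count, names_from = Pandemic) %>%
  mutate(percent_diff = (after-before)/before*100) %>%
  filter(is.na(before))

df_post_pandemic <- df_reduced %>%
  filter(`Complaint Type` %in% filter_post_pandemic$`Complaint Type`)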

write_delim(df_post_pandemic, "cleaned_post_pandemic_311_data.txt", delim = "|", na = "")
saveRDS(df_post_pandemic, "cleaned_post_pandemic_311_data.RDS")

Acknowledgements

I owe a big debt of gratitude to Steve Wood for his help in figuring out how to create the horizontal reference line for the averages before and after the start of quarantine.

Project 1 – Evidence of the COVID-19 Pandemic in NYC 311 Complaint Data