Our Data

Datasets of Interest

We obtained most of our data from “Neighborhood Data for Social Change” (NDSC), a public resource from USC’s Price Center for Social Innovation that strives to uplift and empower neighborhoods in Los Angeles and Coachella Valley. The Price Center is supported by Price Philanthropies, a private foundation that funds youth programs and initiatives in mental health, real estate, and social justice. The center has also partnered with organizations such as the Los Angeles Housing Department and United Neighbors in Defense Against Displacement to compile certain datasets.

Although NDSC hosts data about a variety of topics, including housing, education, and transportation, we focused on the Water Quality and Air Quality in Los Angeles datasets, with our variables of interest being the “Drinking Water Contamination Index” (in percentiles) and “PM2.5 Concentrations” (in micrograms/cubic meter), respectively.

The water quality data consists of 2494 instances and 5 columns: two of the columns contain GEOIDs that identify the measured census tract, and the other columns are the year of compilation, the total population in the tract, and the drinking water index. The year of compilation for all instances was 2017; however, the data was collected between 2005 and 2013, and the year of measurement is not listed in the dataset. The air quality data consists of 4987 instances and the same 5 columns as the water quality data, except that PM2.5 concentration replaces the drinking water index. The compilation years for air quality are 2014 and 2017, with the former representing data collected between 2009 and 2011, and the latter data collected between 2012 and 2014. We also used a supplemental, non-environmental dataset from NDSC, “Race and Ethnicity”, which tabulates racial and ethnic identities from the U.S. Census Bureau’s American Community Survey from 2010 to 2019. The racial and ethnic categories are American Indian, Asian, Black, Hispanic/Latino, Native Hawaiian, other, two or more races, and white.

In addition to air and water quality data from NDSC, we also obtained environmental data from “Urban Tree Canopy and Environmental Justice”. This dataset was published by Jarlath O'Neil-Dunne, the director of the University of Vermont's Spatial Analysis Lab, and includes land cover data from Baltimore, Los Angeles, New York City, Philadelphia, Raleigh, and Washington D.C. For the purposes of this project, we only analyzed the Los Angeles data. The dataset contains four variables related to land cover, including the percentage of tree canopy from high-resolution land cover and the percentage of tree canopy from the USGS's National Land Cover Database (2001). It also contains demographic variables from the United States Census, including population density, racial make-up, and a community's median income. Data was collected at the census tract level for a total of 840 measurements.

The NDSC compiled air and water quality data from the CalEPA's CalEnviroScreen 2.0 and 3.0 campaigns.

The data in the NDSC datasets was originally aggregated for CalEnviroScreen 2.0 and 3.0, an environmental justice initiative by CalEPA’s Office of Environmental Health Hazard Assessment that aims to empower communities that experience severe pollution. Part of that analysis is the PM2.5 concentration dataset, which was collected using satellite data and the air sampling instruments of the air monitoring network run by CalEPA’s California Air Resources Board (CARB).

The Drinking Water Contamination Index was calculated from the California Department of Public Health’s Water Quality Monitoring database on a yearly time grid, with the concentrations of the 13 most prevalent contaminants (ranging from arsenic to nitrate) included in the calculation. For each contaminant, a tract's total contaminant level was compared to that of other census tracts, and a percentile was assigned. The percentiles of the 13 contaminants were then summed to yield the contamination index, allowing for comparison between tracts.
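The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CalEPA's actual procedure: the percentile definition (share of tracts with a strictly lower level), the function names, and the example values are all hypothetical. The published index is reported in percentiles, so the summed score may be ranked once more afterwards; that step is omitted here.

```python
from bisect import bisect_left


def percentile_ranks(values):
    """Return each value's percentile (0-100).

    Assumption: a percentile is the share of tracts with a strictly
    lower level. The official method may define it differently.
    """
    ordered = sorted(values)
    n = len(values)
    return [100.0 * bisect_left(ordered, v) / n for v in values]


def contamination_index(levels_by_contaminant):
    """Sum the per-contaminant percentiles for each tract.

    levels_by_contaminant maps a contaminant name to a list of levels,
    one per tract, with tracts in the same order for every contaminant.
    """
    n_tracts = len(next(iter(levels_by_contaminant.values())))
    index = [0.0] * n_tracts
    for levels in levels_by_contaminant.values():
        for i, pct in enumerate(percentile_ranks(levels)):
            index[i] += pct
    return index


# Hypothetical levels for three tracts and two contaminants (not real data).
example_levels = {
    "arsenic": [1.0, 3.0, 2.0],
    "nitrate": [10.0, 30.0, 20.0],
}
example_index = contamination_index(example_levels)
```

In this toy example the second tract has the highest level of both contaminants, so it receives the highest summed score, while the first tract, lowest on both, scores zero.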

How the data was aggregated and what data was collected impose some limitations on our analysis.

The CalEnviroScreen 3.0 survey collected data at the census tract level, with census tracts being defined as small geographic subdivisions of a county with approximately 4,000 inhabitants. According to the United States Census Bureau, the main factor involved in creating census tracts is balancing each tract’s population. As a result, census tract boundaries can be arbitrary and may not parallel actual neighborhoods, which can obscure possible patterns associated with specific communities. Furthermore, the air and water quality datasets only include environmental measurements, with little contextual background for any disparities they reveal. Without information about demographics, socioeconomics, or history, we cannot construct a thorough narrative that accounts for how power dynamics affect environmental justice and lived experiences. To address this limitation, we supplemented environmental data with socioeconomic data, such as the NDSC’s “Race and Ethnicity” dataset.
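Because every dataset identifies tracts by GEOID, supplementing the environmental data with socioeconomic data amounts to a join on that key. The following is a minimal sketch of such a join, not the code used for this project; the field names (`geoid`, `pm25`, `pct_hispanic_latino`) and all values are hypothetical. GEOIDs are kept as strings because California tract GEOIDs begin with the state code 06, and the leading zero would be lost if they were parsed as integers.

```python
def join_on_geoid(environment_rows, demographic_rows):
    """Attach demographic fields to each environmental row by tract GEOID.

    Returns (joined_rows, unmatched_geoids) so that tracts missing from
    the demographic table are reported rather than silently dropped.
    """
    demo_by_geoid = {row["geoid"]: row for row in demographic_rows}
    joined, unmatched = [], []
    for row in environment_rows:
        demo = demo_by_geoid.get(row["geoid"])
        if demo is None:
            unmatched.append(row["geoid"])
        else:
            # Environmental fields win if both tables share a field name.
            joined.append({**demo, **row})
    return joined, unmatched


# Hypothetical rows (illustrative GEOIDs and values, not real data).
air_rows = [
    {"geoid": "06037000001", "pm25": 12.1},
    {"geoid": "06037000002", "pm25": 10.4},
]
race_rows = [
    {"geoid": "06037000001", "pct_hispanic_latino": 61.0},
]
joined_rows, unmatched_geoids = join_on_geoid(air_rows, race_rows)
```

Reporting the unmatched tracts separately makes it easier to notice when the two tables cover different sets of tracts or different boundary vintages.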

For air quality, the dataset only measured particulate matter with diameters of less than 2.5 micrometers (PM2.5). This measurement is not as comprehensive as the EPA’s Air Quality Index (AQI), which also includes ozone (photochemical smog), larger particulate matter of up to 10 micrometers in diameter (PM10), and other pollutants. For instance, the AQI accounts for CO, a toxic gas emitted from wood burning; SO2, a pollutant originating from coal-burning power plants that contributes to acid rain; and NO2, a main contributor to smog formation. As a result, we cannot comprehensively summarize an area’s overall air quality based on PM2.5 alone. Nevertheless, PM2.5 is still a major factor in the AQI calculation and is especially dangerous, as it can enter the respiratory tract more easily than PM10. Therefore, our dataset still provides information about potential air toxicity that can be corroborated with other atmospheric datasets in future analyses.

For water quality, it is important to note that census tracts “sometimes contain multiple drinking water systems and groundwater resources.” Therefore, where the tract boundaries are drawn may introduce discrepancies in our analysis: for example, data may be aggregated in such a way that it averages out disparities within a tract. However, our dataset still portrays a generally accurate picture of the average water quality conditions within each tract; we just need to be cognizant that certain correlations may emerge or disappear depending on each tract’s geographical boundaries.

We analyze PM2.5 levels and drinking water contaminants to better understand environmental inequities.

PM2.5 concentrations reveal how air quality differs between regions of Los Angeles, allowing for spatial analysis of how socioeconomic makeup or other factors affect air quality. Underprivileged communities are typically more at risk for pollution, and the dataset can indicate which characteristics (a neighborhood’s poverty level, average education, and amount of stable housing; the percentage of members belonging to marginalized groups; and any other social or economic factors) correlate most strongly with air pollution levels. Furthermore, given that particulate matter is emitted in large quantities by wildfires, the dataset can also show which areas within Los Angeles are most impacted by fires, an issue that is becoming more pressing due to the effect of climate change on California’s fire season. As particulate matter is toxic to human health, the dataset may illuminate which neighborhoods are most severely affected and whether underprivileged communities are more likely to face lasting economic damage and health consequences.
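One simple way to quantify how strongly a tract-level characteristic tracks PM2.5 is a rank correlation. The sketch below computes Spearman's rho with the standard library only; it is an illustration of the kind of analysis described above, not the method used in this project, and the variable names and values (`pm25`, `poverty_rate`) are hypothetical. A correlation of this kind does not by itself establish a causal link.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical tract-level values (not real data).
pm25 = [9.8, 11.2, 12.5, 12.1]
poverty_rate = [8.0, 15.0, 30.0, 22.0]
rho = spearman(pm25, poverty_rate)
```

A rank-based measure is used here because it does not assume a linear relationship and is less sensitive to a few extreme tracts than a plain Pearson correlation.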

Similarly, the dataset regarding water quality in Los Angeles can reveal information about which neighborhoods are affected by drinking water contaminants and how water pollution affects the residents’ lived experiences. About 95% of California residents receive their drinking water from public sources; however, this water can become contaminated with chemicals and bacteria. At times, this contamination can be due to natural causes, such as chemicals originating from rocks and soil; other times, however, the sources can be anthropogenic. These include waste from factories and sewage treatment plants, as well as agricultural runoff. Since the dataset contains information about 13 different pollutants, we can determine which pollutants are most prevalent in which areas and explore correlations between water pollution levels, potential sources (e.g. factories), and socioeconomic factors. Ultimately, we hope to use these correlations to investigate how social factors affect pollution levels and to illustrate the humanistic side of environmental issues.
