Miguel Arceo: Controlling the Noise within the Numbers

As I reflect on the second month of my project, there is one word that continues to come up in my mind: noise. I have plenty of literature guiding my work and a great working relationship with my mentor, but the noise continues to sound loudly through the numbers. I am not speaking about noise you hear with your ears, but rather the random variation within the numbers that I cannot yet explain. Even though more data is generally better in survey research, it often brings more variation as the number of extraneous variables increases.

Miguel Arceo, Political Science major, David B. Ford Undergraduate Research Award

For context, I am working with data on Hispanic voters from every edition of the Cumulative Election Study since 2012. I combined these respondents and their answers into a single dataset, along with variables from outside sources. With just over 21,000 observations, the noise starts to get louder. The dataset includes a few demographic variables I can use as controls, but they were not enough to silence the noise.
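Pooling several survey editions into one dataset can be sketched as stacking per-year extracts while keeping each respondent tagged with their survey cycle. This is only an illustration with made-up stand-in rows, not the actual CES files or variable names:

```python
import pandas as pd

# Hypothetical per-year extracts; in practice each CES edition is its own
# file with its own variable coding that must be harmonized first.
years = [2012, 2016, 2020]
frames = []
for year in years:
    # df = pd.read_csv(f"ces_{year}_hispanic.csv")  # hypothetical path
    df = pd.DataFrame({"party_id_7pt": [3, 5], "year": year})  # stand-in rows
    frames.append(df)

# Stack the editions into one pooled dataset; ignore_index gives a clean
# running row index across all years.
pooled = pd.concat(frames, ignore_index=True)
print(len(pooled))
```

Keeping the `year` column in the pooled frame is what later makes it possible to control for the survey cycle in the models.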

As a result, my last few days in the data have revolved around refining my models and reducing the variation within them. My mentor has worked with me and provided plenty of guidance on where to look. Together, we concluded that the year in which respondents took the CES survey and the local political context could be inflating the variation in my models. The year variable was easy to include since it was already in the dataset, but accounting for the local political context was a monstrous task.
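One common way to include survey year as a control is to treat it as a categorical variable, so each election cycle gets its own intercept shift rather than a single linear trend. A minimal sketch with made-up data (the column names are illustrative, not the actual CES variables):

```python
import pandas as pd

# Hypothetical mini-version of the pooled data.
df = pd.DataFrame({
    "party_id_7pt": [3, 4, 5, 2, 6, 4],
    "generation":   [1, 2, 3, 1, 2, 3],
    "year":         [2012, 2012, 2016, 2016, 2020, 2020],
})

# One-hot (dummy) encode year; drop_first avoids perfect collinearity
# by using the earliest cycle (2012 here) as the reference category.
X = pd.get_dummies(df[["generation", "year"]], columns=["year"], drop_first=True)
print(X.columns.tolist())
```

The resulting design matrix can then be fed to whatever regression routine the models use; in R the equivalent is simply wrapping the year variable in `factor()` inside the model formula.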

Thankfully, another fellow at the LeRoy Collins Institute had worked on a similar project and had a dataset that accounted for the county political climate. He generously shared it with me: a list of every county in the United States with its FIPS code, the Democratic presidential candidate's vote total, and the Republican candidate's vote total. While some refining still needs to be done, he saved me at least two days of data collection.
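Attaching that county-level context to each respondent amounts to a join on the FIPS code. A sketch with invented numbers (a real merge would also need a year key, since county vote totals differ by election):

```python
import pandas as pd

# Hypothetical survey rows keyed by county FIPS code. FIPS codes are kept
# as strings so leading zeros are not silently dropped.
respondents = pd.DataFrame({
    "resp_id": [1, 2, 3],
    "fips": ["12073", "12073", "48029"],
})

# Hypothetical county returns: FIPS plus two-party presidential vote totals.
county_votes = pd.DataFrame({
    "fips": ["12073", "48029"],
    "dem_votes": [110_000, 460_000],
    "rep_votes": [70_000, 310_000],
})

# Left join keeps every respondent even if a county is missing from the
# vote file; then compute the Democratic share of the two-party vote.
merged = respondents.merge(county_votes, on="fips", how="left")
merged["dem_share"] = merged["dem_votes"] / (merged["dem_votes"] + merged["rep_votes"])
print(merged[["resp_id", "dem_share"]])
```

The derived vote share then serves as a single, comparable measure of local political context across counties of very different sizes.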

My R Studio workspace

I really see model refinement as the last major hurdle in my research. My models have shown great promise, and if I can reduce the variance in them, I can publish some strong findings. Even without the models, other statistical tests hint that the data support my hypothesis. I ran an ANOVA on the average 7-point party identification scale by generation and found the differences to be statistically significant. Even a simple correlation coefficient shows that Hispanic voters become more conservative as generation increases.
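The pair of tests described above can be sketched with `scipy.stats` on made-up scores; the groupings and values here are illustrative only, not the project's actual data:

```python
from scipy import stats

# Made-up 7-point party ID scores (1 = strong Democrat, 7 = strong
# Republican) grouped by immigrant generation.
first_gen  = [2, 3, 2, 3, 4]
second_gen = [3, 4, 3, 4, 5]
third_gen  = [4, 5, 5, 6, 5]

# One-way ANOVA: do mean party IDs differ across generations?
f_stat, p_value = stats.f_oneway(first_gen, second_gen, third_gen)

# Pearson correlation between generation (1/2/3) and party ID,
# pooling all respondents into one pair of vectors.
gens = [1] * 5 + [2] * 5 + [3] * 5
pids = first_gen + second_gen + third_gen
r, r_p = stats.pearsonr(gens, pids)
print(f"F={f_stat:.2f}, p={p_value:.4f}, r={r:.2f}")
```

A significant F statistic says the generation means differ somewhere; the positive correlation adds the direction, which is what the hypothesis is about. In R the same pair would be `aov()` and `cor.test()`.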

At this point, it sounds strange to write it out, but I do genuinely feel enlightened. I find myself excited to open R Studio, load in my data, and type away at code for a few hours. This niche idea within election science can explain many things about the results of the last three presidential elections. So long as I stay on this track, I can present critical findings at the President’s Showcase in October.

Nevertheless, I continue my journey.
