Position: Ph.D. Candidate

Current Institution: Princeton University

Sifting Through Massive Text Corpora to Detect and Characterize Historical Events

Significant events are characterized by interactions between entities (e.g., countries, organizations, individuals) that deviate from typical interaction patterns. Investigators, such as historians, commonly read large quantities of text to construct an accurate picture of an event: who was involved, what happened, and when and where it took place. For example, the US National Archives has collected about two million diplomatic messages sent between 1973 and 1978; historians are interested in exploring the content of these messages. Unfortunately, this corpus is cluttered with diplomatic “business-as-usual” communications such as arrangements for visiting officials, recovery of lost or stolen passports, and requests for lists of attendees at international meetings and conferences. But hidden in the corpus are indications of important diplomatic events, such as the fall of Saigon. These events, and the documents that portray them, are of primary interest to historians. My goal is to develop and apply a scalable method to help historians and political scientists sift through such document collections to find potentially important events and the primary sources describing them.

The principal goal of my research is to develop probabilistic models for understanding influences on human behavior; in this work, I develop a model for detecting and characterizing influential events in large collections of communications. Specifically, I have developed a structured topic model to distinguish between topics that describe “business-as-usual” and time-localized topics that deviate from these patterns. This approach successfully captures critical events and identifies documents of interest when applied to the US State Department diplomatic messages from the 1970s. The model identifies important time intervals and relevant documents for real-world events such as the Indonesian invasion of East Timor at the end of 1975, the evacuation of Saigon and South Vietnam prior to the end of the Vietnam War, the Sinai Interim Agreement, the Apollo 17 lunar gifts to all nations, Operation Entebbe, and the death of Mao Tse-tung, among others. I have released source code for this method, which includes both an implementation of the machine learning algorithm to infer model parameters and tools to visualize and explore the model results. My longer-term research goals include developing machine learning approaches to study how human behavior and decisions are influenced by events and interactions in many domains.

Allison Chaney is a PhD candidate in the Computer Science department at Princeton University and is advised by Professor David Blei. Her primary research interest is to develop statistical machine learning methods for real-world human-centered applications; specifically, she develops Bayesian latent variable models to estimate human behavior and identify external factors that influence it. In addition to deriving and implementing scalable inference algorithms for these models, she builds visualization tools to assist domain experts in interpreting and exploring the model results.

Allison received a BA in Computer Science and a BS in Engineering from Swarthmore College in 2008, and has worked for Pixar Animation Studios and the Yorba Foundation for open-source software. She has also completed research internships with eBay/Hunch and Microsoft Research. This fall, she will begin postdoctoral research with Professors Barbara Engelhardt and Brandon Stewart to study how machine learning algorithms influence human behavior in the context of recommendation systems and how to account for these biases when training future algorithms. In 2014, Allison served as Program Chair for the Women in Machine Learning (WiML) Workshop; she is now a member of the WiML board and also engages in various academic mentoring efforts.