Position: Ph.D. Candidate

Current Institution: University of California, Berkeley

Resource-efficient and high-performance big data systems via a coding-theoretic approach

The data revolution is strongly influencing almost all walks of human endeavor today. This paradigm shift is enabled by the so-called big data systems which make it possible to store and analyze massive amounts of data. The foundation of any big data system is a large-scale, distributed, data storage system that typically comprises thousands of interconnected servers. At such massive scales of operation, failures and operational glitches are the norm rather than the exception, making it imperative to store data redundantly. Most systems today use a strategy termed replication, storing multiple copies of the data on different servers. However, the amount of data to be stored is growing at an unprecedented rate, far surpassing Moore’s law. As a result, replication is quickly becoming economically infeasible. Coding theory offers a compelling alternative, that is storage space optimal, using erasure codes. For this reason, many storage systems are starting to deploy coding instead of replication. While traditional codes are optimal with respect to storage space utilization, the initial adopters of this technology have shown that these codes severely burden other system resources such as I/O, network bandwidth and CPU. Furthermore, coding is being introduced into big data systems as an afterthought leading to a fundamental disconnect in the design of these systems and the capabilities that coding offers. My research aims at addressing these challenges both by constructing new codes as well as by designing systems that can leverage the full potential of codes.

My approach towards this goal is two fold, spanning the areas of both theory and systems. In the first part of my talk, I will present new classes of codes that we have constructed for providing fault tolerance while addressing specific challenges that arise in big data systems. These codes significantly improve on traditional codes in terms of various critical system resources such as I/O, network bandwidth and CPU, while also being storage space optimal. An implementation of one of our new codes on the Facebook’s data warehouse cluster in production has shown significant benefits over the state-of-the-art. This new code is also being incorporated into the Apache Hadoop Distributed File System which is the most widely used file system in big data systems today. In the second part of my talk, I will present our work on identifying new ways in which coding can be employed to drive performance improvements in big data systems. Currently, big data systems employ coding as a resource-efficient alternative to replication for achieving fault tolerance. However, coding offers a broad range of properties and trade-offs that are fundamentally different from replication. These properties can be exploited to achieve performance improvements, beyond fault tolerance, in ways that are not possible under replication. As a step in this direction, we have recently designed and built a caching system for data-intensive clusters that exploits properties of coding to provide significant improvements in load balancing and read/write latencies. Overall, I envision coding-theoretic principles to enrich big data systems in a multitude of ways, and I plan to enable this through both fundamental theoretical as well as systems research.

Rashmi K. Vinayak is a PhD candidate in the Electrical Engineering and Computer Science department at UC Berkeley. Her research interests lie in the theoretical and system challenges that arise in storage and analysis of big data. Rashmi’s dissertation research focuses on achieving significantly better performance and resource efficiency in big data systems using coding-theoretic principles, specifically involving designing new codes for distributed storage systems as well as building systems that employ these new coding techniques in novel ways. Rashmi’s research has been recognized by the IEEE Data Storage Best Paper and Best Student Paper Awards for 2011 and 2012, and the Eli Jury Award 2016 which is the best dissertation award from the UC Berkeley EECS department presented for outstanding achievement in the area of Systems, Communications, Control, or Signal Processing. Rashmi is also a recipient of the Facebook Fellowship 2012-13, Microsoft Research PhD Fellowship 2013-15, and the Google Anita Borg Memorial Scholarship 2015-16.