Position: Ph.D. Candidate

Current Institution: University of Illinois at Urbana-Champaign

Information theory and machine learning techniques for emerging genomic data

The completion of the Human Genome Project in 2003 opened a new era for scientists. Advanced high-throughput sequencing technologies now give us access to large amounts of genomic data, which we can use to answer key biological questions, such as which factors contribute to the development of cancer. However, the size of these datasets and the rapid advance of sequencing technology make processing and storing genomic data challenging. Moreover, analyzing the datasets can be both computationally and theoretically difficult, because statistical methods have not yet been developed for newly emerging types of data. In this work, I address some of these problems using tools from information theory and machine learning.

First, I focus on the data processing and storage aspect of metagenomics, the study of microbial communities in environmental samples and human organs. In particular, I introduce MetaCRAM, the first software suite specialized for metagenomic sequencing data processing and compression, and demonstrate that MetaCRAM compresses data to 2-13 percent of the original file size.

Second, I analyze a biological dataset assaying the propensity of DNA sequences to form a four-stranded structure called “G-quadruplex” (GQ). GQ structures have been proposed to regulate diverse key biological processes, including transcription, replication, and translation. I present the main factors that lead to GQ formation, and propose highly accurate linear regression and Gaussian process regression models to predict the likelihood that a DNA sequence folds into a GQ.

Minji Kim is a Ph.D. Candidate in the Electrical and Computer Engineering department at the University of Illinois at Urbana-Champaign, advised by Professor Olgica Milenkovic and Professor Jun Song. She received her BS in Electrical Engineering and Mathematics (Honors with Distinction) from the University of California, San Diego. Her research interests are in bioinformatics and computational biology, specifically in processing and analyzing genomic data using tools from information theory and machine learning. She is a recipient of the NSF Graduate Research Fellowship and the Gordon Scholarship, a finalist for the Qualcomm Innovation Fellowship, and a member of Tau Beta Pi.