Thursday, April 12, 2007
4:15 pm
B17 Upson Hall

Computer Science
Colloquium
Spring 2007

Alice Zheng
Carnegie Mellon University
 

Statistical Failure Diagnosis in Software and Systems


As software and systems become increasingly complex, the task of debugging also becomes increasingly difficult.  Manual diagnosis can require sifting through millions of lines of code and output logs.  In addition, large systems contain many components, each complex on its own, and often interacting in unexpected ways.

I present a case study illustrating how statistical machine learning algorithms, combined with appropriate system instrumentation, can aid in failure diagnosis.  I propose a statistical software debugging framework that collects information from past successful and failed runs via fine-grained instrumentation of the program, then analyzes this information to locate suspicious program predicates.  I discuss the algorithmic challenges of the approach and demonstrate a bi-clustering algorithm that is effective at simultaneously clustering failed runs and selecting useful predicates.  Using this approach, a programmer needed only 20 minutes to find a long-standing bug in a real-world program whose code he had never seen before.
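To make the idea of locating suspicious predicates concrete, here is a minimal sketch in Python.  It is not the bi-clustering algorithm presented in the talk; it substitutes a much simpler score, namely how much more often a run fails when a predicate is observed true than runs fail overall.  The function name, the data layout, and the example predicates are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def rank_predicates(runs):
    """Rank predicates by how strongly they are associated with failure.

    runs: list of (true_predicates, failed) pairs, where true_predicates
    is the set of predicate names observed true in that run and failed
    is a bool. This layout is an assumption made for this sketch.
    """
    if not runs:
        return []
    true_in_any = Counter()     # runs in which the predicate was true
    true_in_failed = Counter()  # failed runs in which it was true
    for predicates, failed in runs:
        for p in predicates:
            true_in_any[p] += 1
            if failed:
                true_in_failed[p] += 1
    # Overall failure rate, used as the baseline for comparison.
    baseline = sum(1 for _, failed in runs if failed) / len(runs)
    # Score = failure rate among runs where p was true, minus the baseline.
    scores = {p: true_in_failed[p] / true_in_any[p] - baseline
              for p in true_in_any}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical instrumentation reports from four runs.
runs = [
    ({"ptr == NULL", "len > 0"}, True),
    ({"ptr == NULL"}, True),
    ({"len > 0"}, False),
    ({"len > 0", "i < n"}, False),
]
ranking = rank_predicates(runs)
for name, score in ranking:
    print(f"{score:+.2f}  {name}")
```

A single-predicate score like this one treats all failed runs alike.  Grouping failed runs by likely cause while selecting predicates, as the bi-clustering approach in the abstract does, addresses the case where one program contains several distinct bugs.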

This is joint work with Ben Liblit (University of Wisconsin-Madison), Michael Jordan (UC Berkeley), and Alex Aiken and Mayur Naik (Stanford).

Bio: Alice Zheng received her Ph.D. from UC Berkeley in 2005 and is currently a postdoctoral fellow at Carnegie Mellon University.  Her interests lie in applied machine learning, particularly its application to computer systems, software, and networks.  Her current projects include statistical software debugging, performance diagnosis of distributed file systems, efficient Internet traffic measurement, and modeling of social networks.