As software and systems grow more complex, debugging becomes
correspondingly difficult. Manual diagnosis can require sifting through
millions of lines of code and output logs. In addition, large systems
contain many components, each complex in its own right, that often
interact in unexpected ways.
I
present a case study illustrating how statistical machine learning
algorithms, along with appropriate system instrumentation, can aid in
failure diagnosis. I propose a statistical software debugging framework
that collects information from past successes and failures via
fine-grained instrumentation of the program and then analyzes this
information to locate suspicious program predicates. I discuss the
algorithmic challenges of the approach, and demonstrate a bi-clustering
algorithm that is effective at simultaneously clustering failed runs and
selecting useful predicates. Using this approach, a programmer needed
only 20 minutes to find a long-standing bug in a real-world program he
had never seen before.
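
The abstract above only sketches the approach. As a rough, hypothetical
illustration of what "locating suspicious program predicates" can mean,
the snippet below ranks instrumented predicates by how much observing a
predicate as true raises the probability of failure, in the spirit of
the cooperative bug isolation line of work this project builds on. The
Run record and score_predicates function are illustrative names, not
the talk's actual algorithm, and the bi-clustering step is not shown.

    # Hypothetical sketch: rank predicates by Increase(P) = Failure(P) - Context(P),
    # i.e., how much more often runs fail when P is observed true than when
    # P is merely reached. An illustration, not the algorithm from the talk.
    from dataclasses import dataclass, field

    @dataclass
    class Run:
        failed: bool                                 # did this run fail?
        observed: set = field(default_factory=set)   # predicates reached
        true: set = field(default_factory=set)       # predicates seen true

    def score_predicates(runs):
        preds = set().union(*(r.observed for r in runs))
        scores = {}
        for p in preds:
            obs = [r for r in runs if p in r.observed]  # runs that reached P
            tru = [r for r in obs if p in r.true]       # runs where P was true
            if not tru:
                continue
            failure = sum(r.failed for r in tru) / len(tru)  # P(fail | P true)
            context = sum(r.failed for r in obs) / len(obs)  # P(fail | P reached)
            scores[p] = failure - context                    # suspiciousness
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    # Toy usage: "x<0" is true only in failing runs, so it ranks first.
    runs = [
        Run(failed=True,  observed={"x<0", "y==0"}, true={"x<0"}),
        Run(failed=True,  observed={"x<0", "y==0"}, true={"x<0", "y==0"}),
        Run(failed=False, observed={"x<0", "y==0"}, true={"y==0"}),
        Run(failed=False, observed={"x<0"},         true=set()),
    ]
    print(score_predicates(runs))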
This is joint work with Ben Liblit (U. Wisconsin, Madison), Michael
Jordan (U.C. Berkeley), and Alex Aiken and Mayur Naik (Stanford).
Bio:
Alice Zheng received her Ph.D. from UC Berkeley in 2005 and is currently
a postdoctoral fellow at Carnegie Mellon University. Her interests lie
in applied machine learning, in particular its application to computer systems,
software, and networks. Current projects include statistical software
debugging, performance diagnosis of distributed file systems, efficient
internet traffic measurements, and modeling social networks.