Art Munson @ Cornell University

May 2010: I just finished my Ph.D. in Cornell's computer science department. Please contact me if you would like a copy of my dissertation for research purposes.

My research focus is applied machine learning and data mining, with broad interest in application areas. In no particular order, I am interested in natural language processing problems, dimensionality reduction techniques, anomaly detection (esp. in the domain of security), information-theoretic approaches to learning, and the intersection of machine learning and systems research (e.g., can we learn how to monitor or even tune a complex system so that it dynamically adapts to changes in running conditions?). I am also intrigued by the idea of programs "talking" to each other and the opportunities and requirements for such a computing environment.

Since my arrival in 2003 I have worked on a variety of projects including:

detecting novel attacks in network traffic using cluster ensembles
optimizing unwieldy performance metrics in natural language processing classification problems (e.g. noun phrase coreference resolution) using ensemble selection
finding the fine-grained topic segments of opinion letters using lexical similarity measures and frame-of-reference transitions
limiting overfitting in ensemble selection through model library pruning and cross-validated models
[Ongoing work] data mining bird observational data to find interesting trends in bird abundance (more specifics below)

My advisor is Rich Caruana.

Contact information (email and phone are best choices):

Art Munson
Department of Computer Science
Cornell University
Ithaca, NY 14853-7501
607-255-5521
607-255-4428 (FAX)
mmunson @ cs.cornell.edu

Publications

Professional Activities

Current Work

NEW! I'm conducting a survey on the difficulty and importance of various modeling steps.

Currently most of my time is spent in a collaboration with Cornell's Lab of Ornithology. The Lab of O, as we call them, is building a large data warehouse of bird observation data as part of their Avian Knowledge Network project. This data is collected from across North America and spans a number of years. There are several interesting aspects of this data, not the least of which is that many observations are collected by volunteers: people who just plain like birds. The project's challenges include many missing values, noise, and the requirement to ultimately build understandable models. And of course, more data is collected yearly, so the solutions we find need to scale.

We have successfully a) built bagged tree models for the winter feeding habits of almost 100 bird species across the continental United States, b) analyzed the models to determine which features (a.k.a. predictor variables if you speak statistician) are most important to a model's predictions, and c) isolated and plotted the effects of the most important features on the probability of seeing particular birds. You can find a paper from KDD 2006 that describes this work in my publications list. The analysis results are publicly available through the Avian Knowledge Network. (Warning: the Lab of O people are constantly tweaking, revising, and updating this website, so the link might not work. You can probably find it from the AKN home page under Exploratory Analysis.)
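For readers unfamiliar with the three steps (a) to (c), here is a minimal sketch on synthetic data. It is not the code used in the project: it bags depth-1 trees (stumps) rather than full trees, it assumes permutation importance as the importance measure (the KDD 2006 paper may use something else), and the data, function names, and feature layout are all made up for illustration.

```python
import random

def fit_stump(X, y):
    """Fit a depth-1 regression tree: the (feature, threshold) split with least squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            err = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or err < best[0]:
                best = (err, j, t, ml, mr)
    if best is None:  # no valid split (all rows identical): predict the mean everywhere
        mean = sum(y) / len(y)
        return (0, float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, row):
    j, t, ml, mr = stump
    return ml if row[j] <= t else mr

def fit_bagged(X, y, n_models, rng):
    """(a) Bagging: fit each stump on a bootstrap resample of the training rows."""
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return models

def predict_bagged(models, row):
    return sum(predict_stump(m, row) for m in models) / len(models)

def mse(models, X, y):
    return sum((predict_bagged(models, r) - yi) ** 2 for r, yi in zip(X, y)) / len(X)

def permutation_importance(models, X, y, j, rng):
    """(b) Increase in error after shuffling column j (an assumed importance measure)."""
    col = [row[j] for row in X]
    rng.shuffle(col)
    X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
    return mse(models, X_perm, y) - mse(models, X, y)

def partial_dependence(models, X, j, value):
    """(c) Average prediction when feature j is forced to `value` for every row."""
    return sum(predict_bagged(models, row[:j] + [value] + row[j + 1:]) for row in X) / len(X)

# Synthetic stand-in for presence/absence data: only feature 0 matters.
rng = random.Random(0)
X = [[rng.random(), rng.random(), rng.random()] for _ in range(200)]
y = [1.0 if row[0] > 0.5 else 0.0 for row in X]

models = fit_bagged(X, y, n_models=25, rng=rng)
importances = [permutation_importance(models, X, y, j, rng) for j in range(3)]
pd_low = partial_dependence(models, X, 0, 0.1)
pd_high = partial_dependence(models, X, 0, 0.9)
print("importances:", importances)
print("effect of feature 0 at 0.1 vs 0.9:", pd_low, pd_high)
```

Plotting `partial_dependence` over a grid of values for one feature gives the kind of effect curve described in (c).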

My current focus is finding a way to reduce the number of features while maintaining the performance level achievable using the full set (currently at 500 features and counting). The motivation is that there are too many closely correlated (or loosely correlated but related) features. For example, we have more than 30 features that describe human population, taken from the 2000 US census. What we would really like is to find one (or a few) constructed feature(s) that captures all the information in those 30 human population features that is needed to make predictions about bird abundance. That would make studying the effects of important features much easier. In some sense, we are searching for the latent factors that really matter for making the predictions. The interesting wrinkle is the natural grouping of features into related clusters. One of our goals is to preserve this grouping (i.e. discovered factors correspond to a single group) to improve the understandability of the factors.
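To make the grouping idea concrete, here is a small sketch of one standard baseline: take the first principal component of each feature group separately, so every constructed factor belongs to exactly one group. This is only an illustration under assumptions, not the method being developed; the group names ("population", "climate"), the synthetic data, and the helper functions are all hypothetical.

```python
import math
import random

def standardize(col):
    """Center a column and scale it to unit variance (constant columns become zeros)."""
    mean = sum(col) / len(col)
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
    return [(v - mean) / sd if sd > 0 else 0.0 for v in col]

def pearson(a, b):
    za, zb = standardize(a), standardize(b)
    return sum(x * y for x, y in zip(za, zb)) / len(a)

def first_component(columns, rng, n_iter=200):
    """Leading principal component of one feature group, found by power iteration.

    Returns (weights, scores): one weight per feature and one score per row.
    The sign of the component is arbitrary.
    """
    Z = [standardize(c) for c in columns]
    k, n = len(Z), len(Z[0])
    corr = [[sum(a * b for a, b in zip(Z[i], Z[j])) / n for j in range(k)]
            for i in range(k)]
    w = [rng.gauss(0, 1) for _ in range(k)]  # random start vector
    for _ in range(n_iter):
        w_next = [sum(corr[i][j] * w[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(v * v for v in w_next))
        if norm == 0:  # degenerate group (e.g. all columns constant)
            break
        w = [v / norm for v in w_next]
    scores = [sum(w[i] * Z[i][r] for i in range(k)) for r in range(n)]
    return w, scores

# Synthetic data: each hypothetical group is driven by one hidden factor plus noise.
rng = random.Random(0)
n = 300
pop = [rng.gauss(0, 1) for _ in range(n)]   # hidden "population" factor
clim = [rng.gauss(0, 1) for _ in range(n)]  # hidden "climate" factor
groups = {
    "population": [[(p + rng.gauss(0, 0.1)) * s for p in pop] for s in (1.0, 2.0, 0.5)],
    "climate": [[(c + rng.gauss(0, 0.1)) * s for c in clim] for s in (1.0, 3.0)],
}

# One constructed feature per group, so each factor maps back to a single group.
factors = {name: first_component(cols, rng)[1] for name, cols in groups.items()}
for name, hidden in (("population", pop), ("climate", clim)):
    print(name, "factor vs. hidden factor: |r| =", abs(pearson(factors[name], hidden)))
```

Note that plain per-group PCA is unsupervised: it keeps the direction of greatest variance within a group, which is not necessarily the information needed to predict bird abundance. That gap is part of what makes the problem interesting.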

Things I Wish I Found Sooner

[2007.02.19] Google Scholar can give you bibliography information (enable it in Scholar Preferences)
[2007.02.09] CiteULike: social networking meets academic paper reading (notes, papers to read, citation databases)
[2007.02.05] Tips for Technical Writing

Resources for Technical Paper Reviewing

[2007.11.07] Twelve Tips for Reviewers
[2007.02.19] How to Review a Technical Paper

Links about Publishing Research and Access to Published Work

[2007.09.10] Clarity and Academic Prestige (see also The Dr. Fox Lecture)
[2007.03.12] Are acceptance rates too low?
[2007.03.01] Open Access Petition story from the BBC
[2007.03.01] Commentary on current research access and book review

Computer Science Education: Things to Consider

[2009.04.09] Are microlectures insane or brilliant?
[2008.04.14] The Future of Computing: Logic or Biology (Leslie Lamport)

Fun Artificial Intelligence

[2008.04.14] 20 Questions

Fun Computer Science

[2008.10.16] The Planarity Game (warning: very addictive)

Web Standards: HTML vs XHTML

[2008.06.16] Official best practices for XHTML 1.0 media type
[2008.06.16] Details about how to set the XHTML media type to work with current browsers.
[2008.06.16] Two posts arguing that the vast majority of the time you should prefer HTML 4.01 over XHTML 1.0.

(Hmmm, I guess this page does not follow these best practices...)

Miscellaneous

[2008.09.09] Structured procrastination: make procrastination work for you!
[2007.11.07] 5 Ways of Breaking the Procrastination Habit
[2007.09.28] Majority-Judgement: wouldn't you rather grade political candidates than vote for them?

Words to live by

I am not very diligent about updating this page; please check the last modified date below to gauge how reliable the information is.