Scalable Decision Tree Construction

NSF Award IIS-0121175


Principal Investigator

Johannes E. Gehrke

Department of Computer Science

Cornell University

4105B Upson Hall

Ithaca, NY 14853

Phone: 607-255-1045

Fax: 607-255-4428

johannes@cs.cornell.edu

http://www.cs.cornell.edu/johannes

 

Keywords

Data mining, classification, decision tree, classification tree, regression tree, data mining with constraints, privacy-preserving data mining.

 

Project Summary

This project spans two areas: decision trees and privacy-preserving data mining.

 

Decision trees are one of the most widely used data mining models. Our research addresses three topics: (1) bias in split selection, (2) scalable regression tree construction, and (3) novel types of decision trees. We have developed the first provable method for eliminating bias in decision tree construction and the first truly scalable regression tree construction algorithm. Source code of our algorithms can be found at http://himalaya-tools.sourceforge.net.
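To illustrate the split-selection step that topics (1) and (2) concern, the following sketch (our own illustrative code, not the project's algorithm) scores every candidate binary split on a numeric attribute by its Gini impurity reduction. Greedy impurity-based criteria of this kind are known to exhibit selection bias, which is the problem addressed in (1).

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Best binary split on one numeric attribute: scan every candidate
    cut point and return (threshold, impurity_reduction)."""
    parent = gini(labels)
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_thr, best_gain = None, 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no valid cut point between equal attribute values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        if gain > best_gain:
            best_thr = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_gain = gain
    return best_thr, best_gain
```

For example, on values [2, 4, 6, 8] with labels ["a", "a", "b", "b"] the scan selects threshold 5.0, which separates the classes perfectly and achieves the full impurity reduction of 0.5.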

 

In addition, we have recently started work on privacy-preserving data mining. One approach to managing large data collections is to fully centralize and integrate the data and host it on huge server farms; this is the prevalent approach today. This massive centralization raises concerns about the privacy of personal information, concerns that are further exacerbated now that the Internet makes it easy for new data to be automatically collected and added to databases. These concerns naturally extend to the analytic tools applied to the data. Data mining, with its promise to efficiently discover valuable, non-obvious information from large databases, is particularly vulnerable to misuse. In this project, we ask: Can we develop accurate data mining models while preserving the privacy of individual records (or individual databases)? Recent work includes methods for privacy-preserving mining of association rules, novel definitions of privacy, and methods for eliminating privacy breaches in privacy-preserving data mining.
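One well-known building block in this space is randomized response: each record is perturbed locally before it ever reaches the miner, and only aggregate statistics are reconstructed. The sketch below is an illustrative simplification of that idea, not the specific scheme from our papers; it perturbs a single sensitive bit and shows how the miner inverts the known randomization.

```python
import random

def randomize(bit, p_keep):
    """Report the true bit with probability p_keep; otherwise flip it.
    The miner never sees the original value."""
    return bit if random.random() < p_keep else 1 - bit

def estimate_true_fraction(reported, p_keep):
    """Invert the randomization: if f is the true fraction of 1s, then
    E[mean(reported)] = p_keep * f + (1 - p_keep) * (1 - f)."""
    observed = sum(reported) / len(reported)
    return (observed - (1 - p_keep)) / (2 * p_keep - 1)
```

For example, if the perturbed sample has mean 0.4 under p_keep = 0.75, the estimated true fraction of 1s is (0.4 - 0.25) / 0.5 = 0.3. The accuracy of the aggregate estimate improves with sample size, even though each individual reported value remains deniable.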

 

Publications and Products

  • Alin Dobra and J. E. Gehrke. Bias Correction in Classification Tree Construction. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Massachusetts, June 2001.
  • Alin Dobra and Johannes Gehrke. SECRET: A Scalable Linear Regression Tree Algorithm. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2002). Edmonton, Alberta, Canada, July 2002.
  • Cristian Bucila, J. E. Gehrke, Daniel Kifer, and Walker White. DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2002). Edmonton, Alberta, Canada, July 2002.
  • Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, and J. E. Gehrke. Privacy Preserving Mining of Association Rules. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2002). Edmonton, Alberta, Canada, July 2002.
  • Shai Ben-David, J. E. Gehrke, and Reba Schuller. A Theoretical Framework for Learning from a Pool of Disparate Data Sources. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2002). Edmonton, Alberta, Canada, July 2002.
  • Daniel Kifer, J. E. Gehrke, Cristian Bucila, and Walker White. How to Quickly Find a Witness. In Proceedings of the 22nd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2003).  San Diego, CA, June 2003.
  • Alexandre Evfimievski, J. E. Gehrke, and Ramakrishnan Srikant. Limiting Privacy Breaches in Privacy Preserving Data Mining. In Proceedings of the 22nd ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 2003).  San Diego, CA, June 2003.

 

Project Impact

We have developed the first scalable algorithm for regression tree construction, and we have developed the first method for provably removing bias from split selection in decision trees. We are collaborating with scientists from the Cornell Lab of Ornithology and the Stanford Linear Accelerator Center on data mining projects where we are using the tools developed with this grant. Our tools are publicly available at http://himalaya-tools.sourceforge.net. The site averages about 50 page views per day, and our code is downloaded about three times per day.

 

Our work on privacy-preserving data mining has potentially huge impact on the way government agencies collaborate. Effective intelligence gathering and associated intelligence data processing by the U.S. Government is of utmost importance for national security. A recent effort, the Knowledge Discovery and Dissemination Working Group, has begun research on an information infrastructure that will meet the needs of U.S. intelligence and law enforcement agencies for the next century. From the resulting discussions and projects, some basic architectural requirements are emerging. In particular, operational and legal requirements for intelligence agencies lead to two separate architectural areas, which we call information spheres. A local information sphere exists within each government agency. Its goal is to perform online analysis and data mining of multiple high-speed data streams, with essentially unrestricted local access to all data managed by the system. The global information sphere spans all participating government agencies and mediates inter-agency collaboration. It addresses the sometimes conflicting requirements of allowing analysts from different government agencies to efficiently share information, hypotheses, and evidence without revealing confidential data or violating applicable laws regarding privacy and civil rights. Techniques for privacy-preserving data mining can have a huge impact on the feasibility and inner workings of the global information sphere.

 

Goals, Objectives and Targeted Activities

We are working on novel algorithms for decision tree construction, novel types of decision trees, pushing constraints into data mining model construction, and privacy-preserving data mining. Our activities currently encompass the following research goals:

  • Privacy-preserving data mining. How can we construct data mining models without access to the data over which the data mining models should be constructed? What are suitable models of privacy? How can we avoid privacy breaches in privacy-preserving data mining?
  • Data mining with constraints. How can we push constraints into data mining model construction? What types of constraints are amenable to being pushed deep into model construction algorithms, and what is the complexity of the resulting algorithms?
  • Novel probabilistic decision trees. We are working on a generalization of decision trees: instead of using deterministic split predicates, we will use probabilistic splits, with the goal of generating smooth models that have better generalization error than traditional decision trees while still allowing for interpretation of the resulting (single) tree.
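The idea behind probabilistic splits can be sketched as follows (a minimal illustration of the concept, not the project's algorithm): replace the hard threshold test with a sigmoid routing probability, so that each record contributes to both subtrees and the tree's prediction becomes a smooth function of the input.

```python
import math

def soft_split(x, threshold, steepness=1.0):
    """Probability of routing a record to the left child. A deterministic
    split is recovered in the limit of large steepness."""
    return 1.0 / (1.0 + math.exp(steepness * (x - threshold)))

def predict(x):
    # Tiny two-leaf tree with leaf values 0.0 (left) and 1.0 (right),
    # blended by the soft split: the prediction varies smoothly in x
    # instead of jumping at the threshold.
    p_left = soft_split(x, threshold=5.0, steepness=2.0)
    return p_left * 0.0 + (1 - p_left) * 1.0
```

At the threshold itself the prediction is 0.5, a blend of both leaves; far from the threshold the prediction approaches the nearest leaf value, so the hard decision tree is the limiting case.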

 

Area Background

There have been over three decades of work on decision tree construction, and decision trees are among the most widely used data mining models available. The area references below give a cross-section of recent work on decision trees, some of the early work on mining with constraints, and some of the early work on privacy-preserving data mining, a topic that has recently gained importance.

Area References

  • R. Agrawal, A. Evfimievski and R. Srikant, "Information Sharing across Private Databases". In Proceedings of the ACM SIGMOD Conference on Management of Data, San Diego, California, June 2003.
  • R. Agrawal and R. Srikant, "Privacy-Preserving Data Mining". In Proceedings of the ACM SIGMOD Conference on Management of Data, Dallas, May 2000.
  • L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, Belmont, 1984.
  • C. Clifton and D. Marks, “Security and Privacy Implications of Data Mining”, Proceedings of the ACM SIGMOD Conference Workshop on Research Issues in Data Mining and Knowledge Discovery, Montreal, June 1996.
  • D.J. Hand. Construction and Assessment of Classification Rules. John Wiley & Sons, Chichester, England, 1997.
  • Venkatesh Ganti, J. E. Gehrke, and Raghu Ramakrishnan. Mining very large databases. IEEE Computer, Vol. 32, No. 9, August 1999.
  • J. E. Gehrke, Venkatesh Ganti, Raghu Ramakrishnan, and Wei-Yin Loh. BOAT -- Optimistic Decision Tree Construction. In Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD 1999), Philadelphia, Pennsylvania, June 1999.
  • J. E. Gehrke, Raghu Ramakrishnan, and Venkatesh Ganti. RAINFOREST - A Framework for Fast Decision Tree Construction of Large Datasets. In Data Mining and Knowledge Discovery, Volume 4, Issue 2/3, July 2000.
  • L. V. S. Lakshmanan, R. Ng, J. Han and A. Pang, "Optimization of Constrained Frequent Set Queries with 2-Variable Constraints". In Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD 1999), Philadelphia, PA, June 1999, pp. 157-168.
  • Y. Lindell and B. Pinkas. Privacy preserving data mining. Journal of Cryptology, 15(3):177–206, 2002.
  • R. Ng, L. V. S. Lakshmanan, J. Han and A. Pang, "Exploratory Mining and Pruning Optimizations of Constrained Association Rules". In Proceedings of the ACM SIGMOD Conference on Management of Data (SIGMOD 1998). Seattle, Washington, June 1998.
  • B. Thuraisingham. Data Warehousing, Data Mining and Security. IFIP Database Security Conference, Como, Italy, July 1996.

 

Project Websites

http://www.cs.cornell.edu/database/himalaya

This website is the main website of the Cornell Himalaya Data Mining Project with links to all ongoing data mining activities.

 

Online Software

http://himalaya-tools.sourceforge.net

This website contains the open source data mining tools created by the Cornell Data Mining Group funded partly under this grant.