Scalable Decision Tree Construction
NSF Award IIS-0121175
Principal Investigator
Johannes E. Gehrke
Department of Computer Science
Cornell University
4105B Upson Hall
Ithaca, NY 14850
Phone: 607-255-1045
Fax: 607-255-4428
johannes@cs.cornell.edu
http://www.cs.cornell.edu/johannes
Keywords
Data mining, classification,
decision tree, classification tree, regression tree, data mining with
constraints, privacy-preserving data mining.
Project Summary
This project spans two areas: decision trees and
privacy-preserving data mining.
Decision trees are one of the most widely used data
mining models. Our research addresses three topics: (1) bias in split
selection, (2) scalable regression tree construction, and (3) novel types of
decision trees. We have developed the first provable method for eliminating
bias in decision tree construction and the first truly scalable regression tree
construction algorithm. Source code of our algorithms can be found at http://himalaya-tools.sourceforge.net.
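To give a feel for what split-selection bias is, the minimal sketch below ranks candidate split attributes by the p-value of their impurity gain, estimated with a simple permutation test, instead of by the raw gain, so that attributes with many distinct values no longer enjoy an artificial advantage. This is only an illustration of the general idea under our own assumptions; it is not the exact correction procedure of the ICML 2001 paper, and all data and names are made up.

import random
from collections import Counter

def gini(labels):
    # Gini impurity of a list of class labels.
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_gain(values, labels):
    # Impurity reduction of the best binary threshold split on a numeric attribute.
    n = len(labels)
    pairs = sorted(zip(values, labels))
    best = 0.0
    for i in range(1, n):
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        gain = gini(labels) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        best = max(best, gain)
    return best

def split_p_value(values, labels, trials=200):
    # Permutation p-value: how often a random relabeling achieves at least the
    # observed gain. A small p-value is evidence of a real association and is not
    # inflated simply because the attribute has many distinct values.
    observed = gini_gain(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(trials):
        random.shuffle(shuffled)
        if gini_gain(values, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)

# Choose the split attribute with the smallest p-value rather than the largest raw gain.
random.seed(0)
data = {"age": [23, 45, 31, 52, 40, 29], "zip_digit": [1, 7, 3, 9, 2, 5]}
labels = ["yes", "no", "yes", "no", "no", "yes"]
print("selected split attribute:", min(data, key=lambda a: split_p_value(data[a], labels)))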
In addition, we have recently started work on
privacy-preserving data mining. One
approach to managing large data collections is to fully centralize and
integrate the data and host it on huge server farms; this is the prevalent
approach today. This massive centralization raises concerns about the privacy
of personal information. Privacy issues are further exacerbated now that the
Internet makes it easy for new data to be automatically collected and added
to databases. These concerns over the massive collection of data extend
naturally to the analytic tools applied to it. Data mining, with its promise to
efficiently discover valuable, non-obvious information
from large databases, is particularly vulnerable to misuse. In this project, we
ask the question: Can we develop accurate data mining models while preserving
the privacy of individual records (or individual databases)? Recent work
includes methods for privacy-preserving mining of association rules, novel
definitions of privacy, and methods for eliminating privacy breaches in
privacy-preserving data mining.
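The minimal sketch below illustrates the randomization idea that underlies much of this work: each record reports its true value only with a known probability, so no individual answer can be trusted, yet the frequency of a sensitive item across all records can still be estimated accurately. This is classic randomized response under our own simplifying assumptions; the randomization operators studied in the publications listed below are more elaborate.

import random

def randomize(bit, p=0.7):
    # Report the true bit with probability p, the flipped bit otherwise.
    return bit if random.random() < p else 1 - bit

def estimate_true_frequency(reported, p=0.7):
    # Invert the randomization: if f_true is the real frequency of 1s, the expected
    # reported frequency is f_rep = f_true * p + (1 - f_true) * (1 - p).
    f_rep = sum(reported) / len(reported)
    return (f_rep - (1 - p)) / (2 * p - 1)

random.seed(0)
true_bits = [1] * 3000 + [0] * 7000           # 30% of records contain the sensitive item
reported = [randomize(b) for b in true_bits]  # only the perturbed bits ever leave the record owner
print("estimated frequency:", round(estimate_true_frequency(reported), 3))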
Publications and Products
- Alin Dobra and J. E. Gehrke. Bias Correction in Classification
Tree Construction. In Proceedings of the Eighteenth International
Conference on Machine Learning (ICML 2001), Williams College, Massachusetts, June 2001.
- Alin Dobra and Johannes Gehrke. SECRET: A Scalable Linear
Regression Tree Algorithm. In Proceedings of the Eighth ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining (SIGKDD
2002). Edmonton, Alberta, Canada, July 2002.
- Cristian Bucila, J. E.
Gehrke, Daniel Kifer, and Walker White. DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints. In Proceedings of the
Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (SIGKDD 2002). Edmonton, Alberta, Canada, July 2002.
- Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh
Agrawal, and J. E. Gehrke. Privacy Preserving
Mining of Association Rules. In Proceedings of the Eighth ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining (SIGKDD
2002). Edmonton, Alberta, Canada, July 2002.
- Shai Ben-David, J. E. Gehrke, and Reba Schuller.
A Theoretical Framework for Learning from a Pool of Disparate Data
Sources. In Proceedings of the Eighth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining (SIGKDD 2002). Edmonton, Alberta, Canada, July 2002.
- Daniel Kifer, J. E. Gehrke, Cristian Bucila, and Walker
White. How to Quickly Find a Witness. In Proceedings of the 22nd ACM
SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS
2003). San Diego, CA, June 2003.
- Alexandre Evfimievski,
J. E. Gehrke, and Ramakrishnan Srikant. Limiting Privacy Breaches in Privacy
Preserving Data Mining. In Proceedings of the 22nd ACM
SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS
2003). San Diego, CA, June 2003.
Project Impact
We have developed the first scalable algorithm for
regression tree construction, and we have developed the first method for
provably removing bias from split selection in decision trees. We are
collaborating with scientists from the Cornell Lab of Ornithology and the
Stanford Linear Accelerator Center on data mining projects where we are using
the tools developed under this grant. Our tools are publicly available at http://himalaya-tools.sourceforge.net.
The site averages about 50 page views per day, and our code is downloaded
about three times per day.
Our work on privacy-preserving data mining has
potentially huge impact on the way government agencies collaborate. Effective
intelligence gathering and associated intelligence data processing by the U.S.
Government is of utmost importance for national security. A recent effort
called the Knowledge Discovery and Dissemination Working Group has begun
research on an information infrastructure that will meet the needs of U.S. intelligence and law
enforcement agencies for the next century. From the resulting discussions and
projects, some basic architectural requirements are emerging. In particular,
operational and legal requirements for intelligence agencies lead to two separate
architectural areas, which we call information spheres. A local information
sphere exists within each government agency. Its goal is to perform online
analysis and data mining of multiple high-speed data streams, with essentially
unrestricted local access to all data managed by the system. The global
information sphere spans all participating government agencies, and mediates
inter-agency collaboration. It addresses the sometimes conflicting requirements
of allowing analysts from different government agencies to efficiently share
information, hypotheses and evidence without revealing confidential data or
violating applicable laws regarding privacy and civil rights. Techniques for
privacy-preserving data mining can have a huge impact on the feasibility and
inner workings of the global information sphere.
Goals, Objectives and Targeted Activities
We are working on novel algorithms for decision tree
construction, novel types of decision trees, pushing constraints into data
mining model construction, and privacy-preserving data mining. Our activities
currently encompass the following research goals:
- Privacy-preserving data mining. How can we construct data mining
models without access to the data over which the data mining models should
be constructed? What are suitable models of privacy? How can we avoid
privacy breaches in privacy-preserving data mining?
- Data mining with constraints. How can we push constraints into
data mining model construction? What types of constraints are amenable to
being pushed deep into model construction algorithms, and what is the
complexity of the resulting algorithms?
- Novel probabilistic decision trees. We are working on a
generalization of decision trees: instead of using deterministic split
predicates, we will use probabilistic splits, with the goal of generating
smooth models that have better generalization error than traditional
decision trees while still allowing interpretation of the resulting
(single) tree; a small sketch of the idea follows this list.
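The sketch below contrasts an ordinary deterministic split with a hypothetical probabilistic one: a record is routed to both children with weights given by a sigmoid of its distance to the threshold, and the prediction is the weighted average of the leaf values, which makes the model smooth in the input. The sigmoid form, the threshold, and the width parameter are illustrative assumptions, not the project's actual formulation.

import math

def hard_split_predict(x, threshold, left_value, right_value):
    # Ordinary deterministic split: a step function in x.
    return right_value if x >= threshold else left_value

def soft_split_predict(x, threshold, width, left_value, right_value):
    # Probabilistic split: route x to both children with sigmoid weights and
    # average the leaf values, yielding a smooth prediction surface.
    p_right = 1.0 / (1.0 + math.exp(-(x - threshold) / width))
    return (1.0 - p_right) * left_value + p_right * right_value

for x in (3.0, 4.9, 5.0, 5.1, 7.0):
    print(x,
          hard_split_predict(x, threshold=5.0, left_value=1.0, right_value=4.0),
          round(soft_split_predict(x, threshold=5.0, width=0.5, left_value=1.0, right_value=4.0), 3))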
Area Background
There have been over three decades of work on
decision tree construction, and decision trees are one of the most widely used
data mining models available. The area references give a cross-section of the
recent work in this area. There has also been work on mining with constraints;
again, the area references cite some of the early work. Privacy-preserving data
mining has recently gained importance, and the references cite some of the
early work in this area.
Area References
- R. Agrawal, A. Evfimievski
and R. Srikant, "Information Sharing across
Private Databases". In Proceedings of the ACM SIGMOD Conference on
Management of Data, San Diego, California, June 2003.
- R. Agrawal and R. Srikant,
"Privacy-Preserving Data Mining". In Proceedings of the ACM
SIGMOD Conference on Management of Data, Dallas, May 2000.
- L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification
and Regression Trees. Wadsworth, Belmont, 1984.
- C. Clifton and D. Marks,
"Security and Privacy Implications of Data Mining". In Proceedings of the ACM
SIGMOD Conference Workshop on Research Issues in Data Mining and Knowledge
Discovery, Montreal, June 1996.
- D.J. Hand. Construction and
Assessment of Classification Rules. John Wiley & Sons, Chichester, England, 1997.
- Venkatesh Ganti, J. E.
Gehrke, and Raghu Ramakrishnan. Mining Very Large Databases. IEEE Computer,
Vol. 32, No. 9, August 1999.
- J. E. Gehrke, Venkatesh Ganti, Raghu Ramakrishnan,
and Wei-Yin Loh. BOAT
-- Optimistic Decision Tree Construction. In Proceedings of the ACM
SIGMOD Conference on Management of Data (SIGMOD 1999), Philadelphia,
Pennsylvania, June 1999.
- J. E. Gehrke, Raghu Ramakrishnan, and Venkatesh Ganti. RAINFOREST
- A Framework for Fast Decision Tree Construction of Large Datasets. In Data
Mining and Knowledge Discovery, Volume 4, Issue 2/3, July 2000.
- L. V. S. Lakshmanan, R. Ng, J. Han, and
A. Pang, "Optimization of Constrained Frequent Set Queries with
2-Variable Constraints". In Proceedings of the ACM SIGMOD Conference
on Management of Data (SIGMOD 1999), Philadelphia, PA, June 1999, pp.
157-168.
- Y. Lindell and B. Pinkas.
Privacy preserving data mining. Journal of Cryptology,
15(3):177–206, 2002.
- R. Ng, L. V. S. Lakshmanan, J. Han, and
A. Pang, "Exploratory Mining and Pruning Optimizations of Constrained
Associations Rules". In Proceedings of the ACM SIGMOD Conference on
Management of Data (SIGMOD 1998). Seattle, Washington, June 1998.
- B. Thuraisingham. Data Warehousing, Data
Mining and Security. IFIP Database Security Conference, Como, Italy, July
1996.
Project Websites
http://www.cs.cornell.edu/database/himalaya
This website is the main website of the Cornell
Himalaya Data Mining Project with links to all ongoing data mining activities.
Online Software
http://himalaya-tools.sourceforge.net
This website contains the open-source data mining
tools created by the Cornell Data Mining Group, developed in part with support from this grant.