Systems for Large Data
CS 6322
Department of Computer Science
Cornell University
Fall 2008
Instructor:
Johannes Gehrke
Time: Tuesdays and Thursdays, 2:45-4:00pm.
Place: Thurston 202
Course Management System
Course Overview
The last decade has been a turning point in database research. The number of
research communities working on BIG data has grown significantly, and it now
includes not only the traditional database vendors but also industries such as
digital entertainment, social network analysis, e-science, advertising, and
search. At the same time as application scenarios expanded, the way that data
has traditionally been managed has changed significantly. The rise of cloud
computing requires fundamental changes in the architecture of data-driven
systems; Moore's law now delivers more processor cores rather than higher
clock speeds; systems with huge main memory sizes and large middle
tiers of solid-state disks are emerging; power consumption has become a major
concern for large systems.
This course covers recent research on the design and implementation of
scalable data-centric systems. Topics include infrastructure for cloud
computing, novel database architectures such as column stores and main-memory
data management, the convergence of search over unstructured data and querying
of structured data, power-aware data management, and data management for
computer games and virtual worlds.
The course prerequisites include basic undergraduate knowledge of database
systems as covered in the "cow book" (Ramakrishnan and Gehrke, Database
Management Systems).
Course Work
- 10 paper summaries (the papers are marked with a *).
Submission is required through the Course
Management System.
- One presentation in the course
Course Outline
Note that the course schedule is still under construction.
Data Services in the Cloud
Thursday, September 4, 2008 (Presenter: Johannes)
- Brian Hayes. Cloud Computing.
Communications of the ACM, Volume 51, Issue 7 (July 2008). Pages 9-11.
- Jeffrey Dean and Sanjay Ghemawat.
MapReduce:
Simplified Data Processing on Large Clusters. In Proceedings of
OSDI'04: Sixth Symposium on Operating System Design and Implementation,
San Francisco, CA, December, 2004.
- David DeWitt. MapReduce: A major step backwards. The Database
Column, January 17, 2008.
- David DeWitt, Michael Stonebraker.
MapReduce II. The Database Column, January 25,
2008.
- Hung-Chih Yang, Ali Dasdan, Ruey-Lung Hsiao, and
D. Stott Parker. Map-Reduce-Merge:
Simplified Relational Data Processing on Large Clusters. Proc. of ACM
SIGMOD, pp. 1029--1040, 2007.
Tuesday, September 9, 2008 (Presenter: Johannes)
- Michael Isard, Mihai Budiu, Yuan Yu, Andrew
Birrell, and Dennis Fetterly.
Dryad:
distributed data-parallel programs from sequential building blocks,
ACM SIGOPS Operating Systems Review, v.41 n.3, June 2007.
- http://en.wikipedia.org/wiki/Dryad
- http://en.wikipedia.org/wiki/Hadoop,
http://hadoop.apache.org/, and http://research.yahoo.com/node/90.
- C. Olston, B. Reed, U. Srivastava, R. Kumar and A. Tomkins. Pig Latin:
A Not-So-Foreign Language for Data Processing. ACM SIGMOD 2008
International Conference on Management of Data (Industrial Track),
Vancouver, Canada, June 2008.
- Read also http://en.wikipedia.org/wiki/Pig_Latin.
Thursday, September 11, 2008 (Presenter: Johannes)
- Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan. Interpreting
the Data: Parallel Analysis with Sawzall.
Scientific Programming Journal. Special Issue on Grids and
Worldwide Computing Programming Models and Infrastructure 13:4,
pp. 277-298.
- (*) Fay Chang, Jeffrey Dean, Sanjay Ghemawat,
Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar
Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable:
A Distributed Storage System for Structured Data. OSDI'06:
Seventh Symposium on Operating System Design and Implementation, Seattle,
WA, November, 2006.
- Note that this paper is
marked with a star and thus you need to write a brief summary of the
paper.
- Please also answer the
following question: What differences do you see between a standard DBMS
and BigTable? Make a list of differences and
explain each of them in at most a couple of sentences.
- The paper summary and
answer to the question are due in the Course Management System on
Thursday morning at 10:30am.
Tuesday, September 16, 2008 (Presenter: Johannes)
- Sanjay Ghemawat, Howard Gobioff,
and Shun-Tak Leung. The Google File System.
19th ACM Symposium on Operating Systems Principles,
Lake George, NY, October, 2003.
- The Hadoop Distributed File System: Architecture and
Design.
- Luiz Barroso,
Jeffrey Dean, and Urs Hoelzle. Web Search
for a Planet: The Google Cluster Architecture. IEEE Micro, vol. 23,
pp. 22-28, 2003.
- Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski,
Christos Kozyrakis. Evaluating
MapReduce for Multi-core and Multiprocessor
Systems. Proceedings of the 13th Intl. Symposium on
High-Performance Computer Architecture (HPCA),
Phoenix, AZ, February 2007.
Thursday, September 18, 2008 (Presenter: Ymir Vigfusson)
- Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon,
Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni. PNUTS: Yahoo!'s
Hosted Data Serving Platform. VLDB Conference (industry track),
Auckland, New Zealand, 2008.
- Adam Silberstein, Brian F. Cooper, Utkarsh
Srivastava, Erik Vee, Raghu Ramakrishnan and Ramana Yerneni. Efficient Bulk
Insertion into a Distributed Ordered Table. ACM SIGMOD Conference,
Vancouver, BC, Canada, 2008.
Background reading:
Parallel Database Systems
Tuesday, September 23, 2008 (Presenter: Johannes)
Background reading:
Class cancelled on Thursday, September 25.
Concurrency Control and Recovery
Tuesday, September 30, 2008 (Presenter: Johannes)
- Background material on concurrency control and serializability theory.
Background reading:
- Marianne Winslett. Interview
with Phil Bernstein. SIGMOD Record, Volume 33, Number 3, September
2004.
- H. T. Kung and John T. Robinson. On optimistic methods for concurrency
control. ACM Transactions on Database Systems (TODS), Volume 6, Issue 2
(June 1981). Pages 213-226.
Thursday, October 2, 2008 (Presenter: Alan Demers)
- Hal Berenson, Philip A. Bernstein, Jim Gray, Jim Melton,
Elizabeth J. O'Neil, Patrick E. O'Neil: A Critique of ANSI SQL
Isolation Levels. SIGMOD Conference 1995: 1-10.
- Atul Adya, Barbara Liskov, Patrick E. O'Neil: Generalized
Isolation Level Definitions. ICDE 2000: 67-78.
- D. B. Terry, M. M. Theimer, K. Petersen, A. J. Demers,
M. J. Spreitzer, and C. Hauser. Managing Update Conflicts in Bayou,
a Weakly Connected Replicated Storage System. Proceedings 15th
Symposium on Operating Systems Principles (SOSP-15), Copper Mountain,
Colorado, December 1995, pages 172-183.
Tuesday, October 7, 2008 (Presenter: Christoph Koch)
- C. Mohan, et al., "ARIES: A Transaction Recovery
Method Supporting Fine-Granularity Locking and Partial Rollbacks Using
Write-Ahead Logging", TODS 17(1), 1992.
- (*) Edmond Lau, Samuel Madden: An Integrated Approach to Recovery and
High Availability in an Updatable, Distributed Data Warehouse.
VLDB 2006: 703-714.
- Note that this paper is
marked with a star and thus you need to write a brief summary of the
paper.
Background reading:
No class on Thursday, October 9 due to Yom Kippur.
No class on Tuesday, October 14 due to Fall Break.
Thursday, October 16, 2008 (Class meets at 10:10am in Upson 111 together
with CS 6410)
Tuesday, October 21, 2008 (Presenter: Robbert van Renesse)
- Robbert van Renesse, Danny Dolev, Fred B. Schneider. Stepwise refinement of
consensus. Unpublished draft.
- Tushar Chandra, Robert Griesemer, and Joshua Redstone. Paxos Made Live – An Engineering Perspective. PODC
'07: 26th ACM Symposium on Principles of Distributed Computing,
2007.
Background Reading:
- Leslie Lamport. Paxos Made Simple. ACM SIGACT News (Distributed
Computing Column) 32, 4 (Whole Number 121, December 2001) 51-58.
Thursday, October 23, 2008
Tuesday, October 28, 2008 (Presenter: Christoph Koch)
- Marcos K. Aguilera, Arif
Merchant, Mehul Shah, Alistair Veitch, Christos Karamanolis,
Sinfonia: a new paradigm for building scalable
distributed systems, ACM SIGOPS Operating Systems Review, v.41 n.6,
December 2007
- (*) Mike Burrows. The Chubby Lock Service
for Loosely-Coupled Distributed Systems. OSDI'06: Seventh
Symposium on Operating System Design and Implementation, Seattle, WA,
November, 2006.
- Note that this paper is
marked with a star and thus you need to write a brief summary of the
paper.
Thursday, October 30, 2008 (Presenter: Hussam Abu-Libdeh)
- Welsh, M., Culler, D., and Brewer, E. 2001. SEDA: An architecture for well-conditioned, scalable internet
services. In Proceedings of the Eighteenth ACM Symposium on
Operating Systems Principles (Banff, Alberta, Canada, October 21 - 24,
2001). SOSP '01. ACM Press, New York, NY, 230-243.
- Werner Vogels. Amazon's
Dynamo Technology.
- (*) Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall
and Werner Vogels. Dynamo:
Amazon’s Highly Available Key-value Store. SOSP’07, October 14–17,
2007, Stevenson, Washington, USA.
- Note that this paper is
marked with a star and thus you need to write a brief summary of the
paper.
Background reading:
- Stoica, I., Morris, R., Karger, D., Kaashoek, M. F.,
and Balakrishnan, H. 2001. Chord:
A scalable peer-to-peer lookup service for internet applications. In Proceedings
of the 2001 Conference on Applications, Technologies, Architectures, and
Protocols For Computer Communications (San
Diego, California, United States). SIGCOMM '01. ACM Press, New York, NY,
149-160.
- Chord Project at MIT.
Column Stores
Tuesday, November 4, 2008 (Presenter: Johannes)
- (*) Mike Stonebraker, Daniel
Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O'Neil, Pat O'Neil,
Alex Rasin, Nga Tran
and Stan Zdonik. C-Store: A Column
Oriented DBMS. VLDB, pages 553-564, 2005.
- Integrating
Compression and Execution in Column-Oriented Database Systems. Daniel
J. Abadi, Samuel R. Madden, and Miguel C. Ferreira. Proceedings of SIGMOD,
June, 2006, Chicago, USA.
- Stavros Harizopoulos, Velen Liang, Daniel Abadi, and Samuel Madden. Performance
Tradeoffs in Read-Optimized Databases. Proceedings of VLDB, September,
2006, Seoul, Korea.
Thursday, November 6, 2008 (Presenter: Lyublena Antova)
- Daniel J. Abadi, Daniel S. Myers, David J. DeWitt, and
Samuel R. Madden. Materialization
Strategies in a Column-Oriented DBMS. Proceedings of ICDE, April,
2007, Istanbul, Turkey.
- I. Ivanova, M. L. Kersten, N. Nes. Self-organizing strategies for a
column-store database. In Proceedings of the International Conference
on Extending Database Technology (EDBT), pp 157-168, Nantes, France,
March 2008.
Tuesday, November 11, 2008 (Presenter: Michaela Goetz)
- S. Idreos, M. L. Kersten, S. Manegold. Database Cracking. In Proceedings
of the Biennial Conference on Innovative Data Systems Research (CIDR),
Asilomar, CA, USA, January 2007.
- S. Idreos, M. L. Kersten, S. Manegold. Updating a Cracked Database.
In Proceedings of the ACM SIGMOD International Conference on Management
of Data, Beijing, China, June 2007.
Thursday, November 13, 2008
- (*) Daniel J. Abadi, Adam Marcus, Samuel R. Madden, and
Kate Hollenbach. Scalable
Semantic Web Data Management Using Vertical Partitioning. Proceedings
of VLDB, September, 2007, Vienna, Austria.
- L. Sidirourgos, R. Goncalves, M. L. Kersten, N. Nes, S. Manegold.
Column-Store Support for RDF Data Management: not all swans are white.
In Proceedings of the International Conference on Very Large Data Bases
(VLDB), Auckland, New Zealand, September 2008.
Tuesday, November 18, 2008 (Presenter: Guozhang Wang)
- (*) Daniel J. Abadi, Samuel R. Madden, Nabil Hachem. Column-Stores
vs. Row-Stores: How Different Are They Really? In Proceedings of
SIGMOD, 2008, Vancouver, Canada.
- Michael Stonebraker, Chuck Bear, Ugur Cetintemel, Mitch Cherniack,
Tingjian Ge, Nabil Hachem, Stavros Harizopoulos, John Lifter,
Jennie Rogers, and Stan Zdonik.
"One Size Fits All? - Part 2: Benchmarking Results." In Proceedings
of the Third International Conference on Innovative Data Systems Research
(CIDR), Asilomar, CA, January 2007.
Thursday, November 20, 2008 (Presenter: Ben Sowell)
- Allison Holloway, David DeWitt. Read-Optimized
Databases, In Depth. VLDB 2008.
- Russell Sears, Mark Callaghan, and Eric Brewer. Rose:
Compressed, Log-Structured Replication. VLDB 2008.
Tuesday, November 25, 2008 (Presenter: Haoyuan Li)
- (*) Michael Stonebraker,
Samuel Madden, Daniel Abadi, Stavros Harizopoulos,
Nabil Hachem, and Pat Helland. "The End of an Architectural Era (It's
Time for a Complete Rewrite)." In Proceedings of the 33rd
International Conference on Very Large Data Bases (VLDB), Vienna,
Austria, September 2007.
- Stavros Harizopoulos, Daniel
Abadi, Samuel Madden, and Michael Stonebraker.
"OLTP Through the Looking Glass, and What We Found There." In Proceedings
of the ACM SIGMOD International Conference on Management of Data,
Vancouver, BC, Canada, June 2008.
No class on Thursday, November 27 due to Thanksgiving
break.
Potpourri
Tuesday, December 2, 2008 (Presenter: Raluca Tanase)
- Matthias Brantner, Daniela Florescu, David A. Graf, Donald Kossmann,
Tim Kraska: Building a database on S3. SIGMOD Conference 2008: 251-264.
- David DeWitt, Eric Robinson, Srinath
Shankar, Erik Paulson, Jeffrey Naughton, Andrew Krioukov, Joshua Royalty. Clustera:
An Integrated Computation and Data Management System.
Thursday, December 4, 2008 (Presenter: Johannes)
- Marcos Aguilera, Wojciech Golab, and Mehul Shah. A
Practical Scalable Distributed B-Tree. VLDB 2008.
- Ioannis Koltsidas,
Stratis Viglas.
Flashing Up The Storage Layer. VLDB 2008.