CS5412 Spring 2016 Lecture Schedule and Links to Slides
Notes:
We post these slide sets for other instructors to use, or for people exploring the class remotely: permission is granted without limitations. The PDF files are intended to be helpful when taking notes: attendance in class is still expected.
We often revise the slides the night before a given class. The initial links are to the spring 2015 slides, but we'll update many of the lectures and then the links will become 2016 versions.
VideoNotes:
In 2012 we attempted to record an old version of this class (the 2012 slides were NOT identical) using Cornell's videonotes service. The quality wasn't good and we decided not to use the videos. But sometimes you have no choice and need to miss a class, and the slide sets are similar enough that those old videos can be helpful despite the poor lighting and the so-so sound. Information on accessing them is here.
Date | Topic | Slides (pptx) | Slides (pdf) | Comments | Readings | |
Th, Jan 28 |
Introduction |
(1) Overview of the entire course, topics we’ll cover, the way grading will work, etc. Plus we'll discuss all those XaaS acronyms. Many people who think they know all about cloud computing basically only know that the X in XaaS can be {I,P,S} and perhaps {BP}! | Book: Preface & Intro | |||
Tue, Feb 2 |
The big picture |
(2) This is an overview lecture: a shallow
big-picture review touching on every part of the story, but in a
light way. Later we will rewind to the start and do deep
dives, but the idea is to get the big picture first. So the
topic is: How cloud client systems work. We’ll look at some typical cloud
applications (music or video players, Shazam) and ask how they do it.
We'll also start recitations this week and we want people to start on the projects this week. |
Book: Chapter 2, 3 | |||
Th, Feb 4 |
Network routing | (3) BGP, high availability routers, future secure networks. Routing has a key role to play in the cloud. But how does the Internet do routing? What options exist for controlling routing from within a cloud application? We’ll look mostly at BGP but will also discuss the resilient overlay network idea (RON), from MIT. Then we’ll ask how one might make the Internet robust in other dimensions such as security, and how it could support cloud-based routing control. |
Routers for the Cloud: Can the Internet Achieve 5-Nines Availability? Agapi et al., IEEE Internet Computing, vol. 15, no. 5, pp. 72-77, September/October 2011. The League of SuperNets, Ken Birman, IEEE Internet Computing, vol. 7, no. 5, 2003, pp. 92-96. |
|||
Tue, Feb 9 | P2P file sharing and search | (4) Peer-to-peer protocols have transformed the Internet by making
it easy for people who don’t know one another to still cooperate to
share music or other kinds of files. Underlying this are two aspects:
the Internet file transfer protocol (which is trivial) and the ability
to search for things in a network where you only know about a few of the
members. This latter functionality is often called a “key-value” store
and is frequently implemented using a “distributed hash table” (DHT). We’ll
study CAN and Chord, two widely known examples of DHTs. The key take-aways are: first, that a "key-value" storage system can scale extremely well; but second, that in the Internet, such systems also pay high costs due to indirect routing and churn. |
Book, Chapter 4 | |||
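To make the DHT idea concrete, here is a toy Python sketch of the successor rule used by systems like Chord (all names here are ours, and real Chord uses 160-bit SHA-1 identifiers plus finger tables so lookups take O(log n) hops without any node knowing the full membership):

```python
import hashlib

M = 16  # identifier bits for the toy ring; real Chord uses 160 (SHA-1)

def h(key: str) -> int:
    """Hash a string onto the 2**M identifier ring."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (2 ** M)

class Ring:
    """Toy Chord-style ring: each key is stored on its 'successor',
    the first node clockwise from the key's hashed identifier."""
    def __init__(self, node_names):
        self.nodes = sorted(h(n) for n in node_names)

    def successor(self, key_id: int) -> int:
        for n in self.nodes:          # first node at or after key_id...
            if n >= key_id:
                return n
        return self.nodes[0]          # ...wrapping around the ring

    def lookup(self, key: str) -> int:
        return self.successor(h(key))

ring = Ring([f"node{i}" for i in range(10)])
owner = ring.lookup("song.mp3")       # identifier of the node storing the key
```

The sketch finds the successor by scanning a global sorted list; the whole point of Chord is to reach the same answer with per-node routing tables instead.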
Th Feb 11 | First and second tier cloud services |
pptx
|
(5) We use the term “tier 2” to talk about services that are nearly
as elastic and scalable as the ones in tier 1, but have roles focused on
caching. Best-known examples include Memcached and Dynamo. We’ll also
look briefly at GFS, BigTable, Chubby, Zookeeper, TAO, MapReduce
and other services and will see where each lives. We’ll only get
detailed for a few of these (it would take too long to cover all of
them), but we will try to identify common themes. Interestingly, although these systems run inside the cloud (not in the network, like Chord), many use the same key-value structure as Chord, and in fact many evolved from ideas that originated in the P2P networking community! On the other hand, Chord itself is viewed as too slow for use in the tier 2 cloud: for a small DHT with 1000 members, Chord might need 8 or 9 routing hops to find an item, which is very costly. Another issue is that access distributions might be skewed: there could be hot spots. The key take-aways are that we can map a surprising number of functionalities onto a key-value model: things that look very different from a key-value store often run on top of one (for example, Memcached is just an API, and Dynamo is Amazon's version of Chord, but BigTable turns out to map rows and columns onto a key-value model). Plus, inside a data center the routing is far easier, because we just keep a list of the machines running the key-value system. Hint: for prelim 1, you should definitely study these take-away points! |
Sec 5.7 | ||
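A hedged Python sketch of why data-center routing is so much easier (our own helper names; production systems such as Memcached clients use consistent hashing rather than this naive modulo scheme): with the full server list in hand, one hash picks the owner directly, versus the multi-hop lookups Chord needs on the same membership.

```python
import hashlib
import math

def server_for(key: str, servers: list) -> str:
    """With a full membership list, one hash picks the server: one 'hop'.
    (Toy scheme: modulo hashing; real systems prefer consistent hashing
    so that adding a server doesn't remap almost every key.)"""
    idx = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(servers)
    return servers[idx]

servers = [f"cache{i:02d}" for i in range(1000)]
owner = server_for("user:42", servers)   # resolved locally, O(1)

# Compare with Chord on the same 1000 nodes: roughly log2(1000) hops
# in the worst case, which matches the "8 or 9 hops" in the lecture.
print(owner, round(math.log2(1000), 1))
```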
February break runs from February 13-16, instruction resumes Feb 17 | ||||||
Th Feb 18 | Adaptive Overlays for the tier 2 cloud | (6) Motivated by the desire to use a DHT in tier 2 to support technologies like the ones mentioned in lecture 5, we'll revisit the pure DHT concept and explore ways of using dynamic adaptation to compensate for skewed access distributions, skewed node-to-node latencies, etc. We’ll look at Beehive (a modified version of Chord that achieves O(c) lookup times for a constant c), Pastry (a Plaxton-tree DHT structure that creates flexibility as to which nodes a given node should peer with) and Kelips (a gossip-based DHT with O(1) lookup times). |
Book, Chapter 4 |
|||
Tue, Feb 23 | Torrents and Tit-for-Tat Incentives. BAR Gossip | (7) BitTorrent is widely used both in the P2P network world and inside cloud computing data centers. In this lecture, we’ll look closely at BitTorrent, to see exactly how its “tit for tat” concept of fairness works. This will lead us to look at an idea from UT Austin for a scheme called “BAR” gossip. |
Book: Chapter 5 |
|||
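The tit-for-tat idea fits in a few lines; a toy Python sketch (our own function and numbers, not BitTorrent's actual choking algorithm, which also rotates on timers): each interval a peer unchokes the few peers that uploaded to it fastest, plus one random "optimistic unchoke" so newcomers with nothing to trade can bootstrap.

```python
import random

def choose_unchoked(upload_rates: dict, k: int = 3, seed: int = 1) -> list:
    """Tit-for-tat sketch: reward the k fastest uploaders with an unchoke,
    plus one random 'optimistic unchoke' from everyone else."""
    ranked = sorted(upload_rates, key=upload_rates.get, reverse=True)
    regulars = ranked[:k]                 # reciprocate the best uploaders
    others = ranked[k:]
    rng = random.Random(seed)             # seeded only to keep the demo deterministic
    optimistic = [rng.choice(others)] if others else []
    return regulars + optimistic

rates = {"a": 50, "b": 10, "c": 80, "d": 5, "e": 30}
unchoked = choose_unchoked(rates)         # three fastest peers + one random peer
```

Peers that never upload ("free riders") only ever receive data through the occasional optimistic slot, which is the incentive at the heart of the lecture.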
Th Feb 25 |
Anatomy of a cloud. CAP theorem |
(8) At every level of the cloud we struggle with deep tradeoffs: rapid response and scalability tend to fight against techniques that provide strong guarantees. Berkeley’s Eric Brewer captured one of these in his CAP theorem, which focuses on tiers 1 and 2 and says you can have just two of Consistency, Availability, and Partition Tolerance. We’ll see how CAP applies and how it was proved by Gilbert and Lynch. Although the proof applies only in WAN settings, we’ll also find that CAP has implications in the first tier of a typical cloud computing data center. | Book, Sec. 5.3 | |||
Tue Mar 1 |
BASE methodology versus ACID model |
(9) Many cloud systems reject the database ACID guarantees in favor of what has come to be known as the BASE methodology, which weakens consistency to enhance responsiveness, but can make mistakes because of the weaker properties, mistakes that must later (“eventually”) be repaired. Amazon's Dynamo and DynamoDB are the most widely cited examples of BASE, but many other systems use the same approach. We'll see that while the acronym is cool (BASE is the opposite of ACID, or at least that's the idea), the details are rather hazy. Basically, BASE seems to mean "self-stabilizing" but is not really a guarantee of "consistency" in the normal sense of the term. |
Book (BASE is discussed in several chapters, but this lecture draws mostly on the material in the Preface, Introduction, and Chapter 5). BASE: An Acid Alternative. Dan Pritchett, eBay. ACM Queue. Overcoming the “D” in CAP: Using Vsync To Build Locally Responsive Cloud Services. K. P. Birman, Q. Huang, D. Freedman. IEEE Internet Computing, vol. 12, pp. 50-58, February 2012. |
|||
Th Mar 3 | Consistency and time. Logical and Vector clocks. Consistent cuts. | pptx | (10) To reason about stronger forms of consistency, we’ll need some abstractions. One tool that turns out to be especially useful was created by Leslie Lamport and offers a way to model time in terms of causal relationships. We’ll see how this leads to notions of logical clocks that sometimes make more sense than real clocks. | Book, Sec 10.1 and 10.2 | ||
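Lamport's rule is small enough to sketch directly. A toy Python illustration (class and method names are ours): each process increments its clock on local events, and on receiving a message it jumps its clock past the sender's timestamp, so causally related events always get increasing times.

```python
class Process:
    """Lamport logical clock sketch: tick on every local event; on receive,
    advance past the message's timestamp so causes precede effects."""
    def __init__(self):
        self.clock = 0

    def event(self) -> int:
        self.clock += 1
        return self.clock

    def send(self) -> int:
        return self.event()            # the timestamp travels with the message

    def receive(self, msg_ts: int) -> int:
        self.clock = max(self.clock, msg_ts) + 1
        return self.clock

p, q = Process(), Process()
t_send = p.send()                      # p's clock becomes 1
q.event()                              # q: 1 (an event concurrent with the send)
t_recv = q.receive(t_send)             # q jumps to max(1, 1) + 1 = 2
assert t_send < t_recv                 # causally related events are ordered
```

Note the converse fails: two concurrent events can also end up ordered by their timestamps, which is exactly the gap vector clocks close.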
Tue Mar 8 | 2 and 3-phase commit | pptx | (11) We’ll look at the famous two- and three-phase commit protocols, which play a wide range of roles in many kinds of systems. They are used as building blocks, but are also valuable in their own right. Oddly, it turns out that neither protocol is able to guarantee that progress will occur, although 3PC does better than 2PC. This motivates looking more closely at just when fault-tolerance can be achieved. | Section 10.3 | ||
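A toy sketch of the two-phase pattern in Python (our own class names; real 2PC also forces a log write at each step so crashed nodes can recover, and that blocking-on-crash behavior is exactly what the lecture examines):

```python
class Participant:
    """A participant that votes in phase 1 and obeys the outcome in phase 2."""
    def __init__(self, can_commit: bool = True):
        self.can_commit = can_commit
        self.state = "ready"

    def prepare(self) -> bool:          # phase 1: vote yes or no
        return self.can_commit

    def finish(self, decision: str):    # phase 2: apply the coordinator's decision
        self.state = decision

def two_phase_commit(participants) -> str:
    """Coordinator logic: commit only if *every* participant votes yes."""
    votes = [p.prepare() for p in participants]       # phase 1: voting
    decision = "commit" if all(votes) else "abort"    # unanimous or nothing
    for p in participants:                            # phase 2: completion
        p.finish(decision)
    return decision

group = [Participant(), Participant(), Participant(can_commit=False)]
outcome = two_phase_commit(group)       # one "no" vote forces a global abort
```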
Wed Mar 9 7:30-9:00pm |
Prelim 1 will be held on Wednesday evening in our standard recitation slot, Gates G01 We are scheduling the exam from 7:30-9:00 but we think it should be less than 1 hour for most people. Our reason for deciding to schedule what may be a double-length time slot is to avoid anyone feeling time pressure. The exam will be open book, open notes, including our web pages and slides, but read only. No emails to anyone, no postings on help sites! |
|||||
Th Mar 10 | Consensus and the FLP theorem | pptx | (12) We’ll look at a theoretical result most people refer to as the “FLP theorem”. FLP asserts that fault-tolerant consensus is “impossible” in asynchronous systems. What do these terms mean? How is FLP proved? What does FLP mean in practice? | Section 11.3 | ||
Tue Mar 15 | Paxos | pptx | (13) This lecture will present the “basic” Paxos protocol. Paxos is a powerful protocol for managing replicated data and widely used in cloud computing systems. The Chubby lock service mentioned in lecture 7 runs on top of Paxos. | Chapter 11 | ||
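A much-simplified single-decree Paxos sketch in Python (our own naming; no failures or message loss are modeled, so RPCs become direct calls): phase 1 gathers promises from a majority and adopts any value a majority member already accepted; phase 2 asks that majority to accept.

```python
class Acceptor:
    """One acceptor in basic (single-decree) Paxos."""
    def __init__(self):
        self.promised = -1        # highest ballot this acceptor has promised
        self.accepted = None      # (ballot, value) accepted so far, if any

    def prepare(self, ballot):
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted      # promise, reporting prior accept
        return False, None

    def accept(self, ballot, value) -> bool:
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False

def propose(acceptors, ballot, value):
    """Run one proposer's two phases; return the value chosen, or None."""
    majority = len(acceptors) // 2 + 1
    granted = [prev for ok, prev in (a.prepare(ballot) for a in acceptors) if ok]
    if len(granted) < majority:
        return None                          # lost phase 1; retry with higher ballot
    prior = [p for p in granted if p is not None]
    if prior:
        value = max(prior)[1]                # must re-propose highest-ballot value
    acks = sum(a.accept(ballot, value) for a in acceptors)
    return value if acks >= majority else None

accs = [Acceptor() for _ in range(3)]
first = propose(accs, ballot=1, value="X")   # "X" is chosen
second = propose(accs, ballot=2, value="Y")  # still "X": a chosen value sticks
```

The second call is the key behavior: once a majority has accepted "X", a later proposer is forced to re-propose it, which is how Paxos keeps replicas consistent.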
Th Mar 17 | Atomic multicast with Virtual Synchrony guarantees (Vsync) | pptx | (14) We’ll review the virtual synchrony model, a less stringent way to implement replication that introduces complexity not seen in Paxos, but gains performance advantages in doing so. Vsync employs the virtual synchrony model, but also supports Paxos (called SafeSend within the system). | Chapters 11-14 | ||
Tue Mar 22 | Atomic multicast tradeoffs | pptx | (15) We’ll look at the question of weak versus strong durability as it arises when applying Vsync to replicate data in tier-1 services. Here, we’ll look at the question of how durable a multicast needs to be if the corresponding data is managed in-memory and there isn’t any way to preserve state across node failures and restarts (i.e. there is only soft-state). This will help us see that the deep tradeoff between what Paxos is doing and what Vsync is doing centers on the "cost of durability". Vsync also offers a few ordering options, which can be useful, but the cost of ordering is low compared to the cost of durability. | Book,
Section 14.3. Optional: Chapter 15. You may also want to revisit this paper, from the Lecture 9 readings: Overcoming the “D” in CAP: Using Isis2 To Build Locally Responsive Cloud Services. K. P. Birman, Q. Huang, D. Freedman. IEEE Internet Computing, vol. 12, pp. 50-58, February 2012 |
||
Th Mar 24 | Transactional model, ACID | pptx | (16) We’ll look more closely at the ACID model mentioned in Lecture 10 and how it works. ACID is the standard way to think about database consistency. | Chapter 20 | ||
Spring Break: March 26 - April 3. Classes resume April 4 | ||||||
Tue Apr 5 | Implementing transactional services | pptx | (17) How can ACID-based systems be implemented? We’ll look at some basic mechanisms and will also touch upon snapshot isolation, an approach popular in cloud settings. | |||
Th, Apr 7 |
The value of cloud logging services. |
pptx | (18) This starts a series of “single topic”
lectures. The focus of this lecture will be on
the Corfu system. The basic idea here is to treat logging as a
distinct service. Once you do that, you can use multicast as
an event notification solution without worrying so much about
durability, since the log itself is durable and ordered. As an example, we'll look at the way modern event notification systems and modern logging or database systems are integrated in large-scale, demanding settings with strong assurance needs, like air traffic control or managing the electric power grid. We haven't talked much about transactions yet, so although we'll touch on CorfuDB, the relevance of that work will become clearer in upcoming lectures. |
Mahesh Balakrishnan, Dahlia Malkhi, John Davis, Vijayan Prabhakaran, Michael Wei, and Ted Wobber. CORFU: A Distributed Shared Log. ACM Transactions on Computer Systems, 2013. Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, Ming Wu, Vijayan Prabhakaran, Michael Wei, John D. Davis, Sriram Rao, Tao Zou, and Aviad Zuck. Tango: Distributed Data Structures over a Shared Log. SOSP, November 2013. |
||
Tue Apr 12 |
Bimodal Multicast. |
pptx | (19) Although they don't publish a lot of their system design decisions, Amazon is known to make heavy use of gossip-based mechanisms in EC2 and S3. Since we can't read directly about those commercial versions, instead we'll look at similar systems to understand the value these have: Bimodal Multicast and Astrolabe. | Chapter 21 | ||
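A small Python simulation (parameters and names are ours) of the push-gossip mechanism underlying Bimodal Multicast: each round, every node that has the message forwards it to a few random peers, so the infected set grows exponentially and nearly everyone is reached in O(log n) rounds.

```python
import random

def gossip_rounds(n: int = 1000, fanout: int = 2, seed: int = 7):
    """Simulate push gossip: each infected node pushes the message to
    `fanout` uniformly random peers per round. Returns (rounds, reached)."""
    rng = random.Random(seed)           # seeded only to make the demo repeatable
    infected = {0}                      # node 0 initially holds the message
    rounds = 0
    while len(infected) < n:
        new = set()
        for _ in infected:              # every holder pushes to `fanout` peers
            new.update(rng.randrange(n) for _ in range(fanout))
        infected |= new
        rounds += 1
        if rounds > 100:                # pure push can stall on stragglers
            break
    return rounds, len(infected)

rounds, reached = gossip_rounds()       # nearly all 1000 nodes, ~log2(1000) growth
```

The "bimodal" name comes from the resulting delivery distribution: with overwhelming probability almost everyone gets the message, and with tiny probability almost no one does; there is essentially no middle ground.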
Th Apr 14 |
Building complex overlay networks using Gossip | pptx | (20) We'll look at the work on the T-Man system at the University of Bologna. This work illustrates the self-healing and self-organizing properties of gossip-based solutions, which are leveraged by systems like the ones Amazon actually uses, and which have also made gossip popular in quite different settings: for example, swarm-robot systems use these protocols. | T-Man slide set. (and as a pdf) | ||
Tue Apr 19 | The RealTime Cloud | pptx | (21) In this lecture we'll talk about cloud computing systems that have real-time goals, and the notion of a real-time QoS guarantee for a data distribution service that has associated state. Our focus will be on work IBM did in 1995 for a new air traffic control system that was being created in the US at that time (in fact the project failed, but the reasons are very instructive). | Chapter 23 | ||
Th Apr 21 | Information Retrieval in the Facebook cloud | pptx |
(22) This lecture will focus on technologies used within the Facebook cloud. We'll look at the ways that Facebook uses caching to accelerate worldwide photo retrieval, and then we'll dive deeper and look at the TAO subsystem it uses to track relationships and the Haystack storage backend system. |
There are some nice published papers on this material. The first part of the lecture is drawn from a paper at SOSP 2013, and the second part from a paper at OSDI 2012 and one at NSDI 2013. | ||
Tue Apr 26 | Security concerns for consolidated systems. Synthetic diversity. | pptx | (23) Cloud security worries are a big barrier to adoption. The paper we’ll discuss was written by Professors Schneider and Birman as a report on a study looking at these tradeoffs. | Dangers of monocultures paper. | ||
Th Apr 28 | Future cloud I: Computing on encrypted data |
(24) One very new idea, still very much considered research, imagines that we might upload data to the cloud in an encrypted form and then compute on the data without actually decrypting it. We'll look at a cutting-edge idea that tackles this question, with surprisingly successful results (but also some limitations). |
Chapter 19. The core idea we'll discuss in this lecture is from a set of papers that you can read on the website maintained by the main author, Raluca Ada Popa | |||
Tue May 3 |
|
pptx | (25) We all know that virtualization plays a huge role in modern cloud settings, but virtual machine migration isn't really exploited very much. We'll talk about work by Hakim Weatherspoon and a team at IBM in which cloud virtualization is combined with migration to let an application access sensitive files that you don’t trust the cloud to host. Hakim's approach also allows you to create a virtual image that includes resources, like devices, that actually live on remote machines. | XenBlanket paper | ||
Th May 5 |
The cloud value proposition. |
pptx | (26) The cloud "Value Proposition". We'll discuss some of the major business models for cloud enterprises to try and understand what determines value in the cloud. This turns out to be a much trickier question than you might have expected. | |||
Tue May 10 | A Cloud for the Internet of Things | pptx | Optional lecture. Ken will be doing a dry run of his keynote talk for MesosCon, a "containerization" conference (one of the big new trends in cloud computing) that runs in June in Denver. The main focus will be on how to create a cloud that can support the Internet of Things, like the smart grid. These turn out to need new kinds of support from the cloud, and the talk will focus on tools Ken's group at Cornell is building. This material will definitely NOT appear on the final exam. | |||
Project Demos: Wednesday and Thursday May 9 and 10. Please sign up ahead of time using the sign-up web page Theo prepared. Final Exam: Thu, May 19, 7:00-9:00 PM (Hollister B14). Open book, open notes, ok to access course web pages. |