CS6465: Emerging Cloud Technologies and Systems Challenges
CS6465 is a PhD-level class that will look closely at how technology shifts are creating new challenges and research opportunities in the cloud computing and distributed systems arena, and especially at how these shifts are bringing the machine learning community together with the systems community. The intent is to undertake an informed, fairly deep exploration of how hardware is reshaping the research prospects for applications that depend on hardware characteristics. Although the focus of the class will be on systems, the implications of these shifts reach into databases, the Internet of Things, machine learning, NLP, robotics, and many other domains. The course is unusual in that we will have extensive class discussions that look at emerging cloud application areas (such as "smart highways") and think about the technology shifts in terms of how such applications might use the new hardware and software models. Thus, unlike a class that mostly reads papers, this class will mostly discuss papers from a fairly "applied" perspective, with these new use cases as the source of the applications.
Hardware we will consider includes new storage technologies, RDMA networking, FPGA computing, and GPUs. Also of interest are programmable network switches and routers, and network interface cards with customization capabilities. For completeness we will touch on Intel's SGX architecture.
Schedule and Readings/Slides
Date | Topic | Readings, other comments on the topic |
Tue Aug 22 | 1. Overview of our topic: bringing machine learning to the edge. | Our first meeting will center on breaking into groups and brainstorming about how one might build a smart highway system that captures data and uses machine learning to understand what the video cameras are seeing. U.C. Berkeley has a new center focused on this topic. Start by thinking about a small number of cameras, still images, and a simple use case. |
Thu Aug 24 | 2. Challenges of "online data filtering" for machine learning systems that capture large amounts of data. | Brief lecture on the methods used for video analysis, followed by more discussion. We will expand on the still-image scenario from lecture 1. Now we have far more cameras, they capture video, not just still photos, and we are trying to do far more with them. What pressures does this put on the system infrastructure? |
Tue Aug 29 | 3. Discussion of asynchronous machine learning techniques. Goal is just to understand the computational models favored in such systems. What patterns of communication do these induce? | Although we will not discuss them directly, Chris De Sa and Joey Gonzalez pointed to these papers (and a few others) as capturing the general flavor of the algorithms they are working with: Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. Feng Niu, Benjamin Recht, Christopher Ré, and Stephen J. Wright. 2011. Distributed Parallel Inference on Large Factor Graphs. Joseph E. Gonzalez, Yucheng Low, Carlos Guestrin, David O'Hallaron. 2009. (A tiny, lock-free SGD sketch in this spirit appears after the schedule.) |
Thu Aug 31 | 4. Distributed hash tables for scalability. Building DHT-hosted data pipelines that have strong consistency guarantees yet minimize pausing (locking, or other kinds of protocol-introduced wait states). | Chord: A scalable peer-to-peer lookup service for internet applications. Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, and Hari Balakrishnan. 2001. In Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications (SIGCOMM '01). ACM, New York, NY, USA, 149-160. DOI=http://dx.doi.org/10.1145/383059.383071 In-class discussion on mapping our smart highway with video feeds into a sharded DHT representation and building a solution that is smart about its use of memory and communication. Could we use "sharding" to spread the work broadly in a smart memory? (A small consistent-hashing sketch appears after the schedule.) |
Tue Sept 5 | 5. The Derecho System. Smart Memory concept over RDMA. Derecho overview, RDMC layer. Use of sharding / DHT model in Derecho. | Building Smart Memories and Scalable Cloud Services with Derecho. Sagar Jha, Jonathan Behrens, Theo Gkountouvas, Matthew Milano, Weijia Song, Edward Tremel, Sydney Zink, Kenneth P. Birman, Robbert van Renesse. Submitted for publication, September 2017. |
Thu Sept 7 | 6. Core of Derecho performance: Asynchronous protocols on Derecho's SST layer. Agreement on ordering, terminating a disrupted epoch and switching to a new one. | The main focus here is on understanding the idea of monotonicity as used in the Derecho SST predicates. In-class discussion topic: can a similar notion of asynchronous decision-making be created for our smart highway application? How does temporal urgency for decision-making play against the idea of receiver-oriented batching? |
Tue Sept 12 | 7. State machine replication is one of several important models for consistency. The other really big one is the transactional model from databases. We will review the idea, with a focus on how transactions can be used in distributed machine learning systems, notably to maintain and update Bayesian graphs. | Most people in CS6465 will have learned about transactions in prior classes, but this lecture is just to get everyone onto the same page. To keep the discussion focused, we will look primarily at transactions on graphs, notably the forms of graphs seen in social-networking settings and in Bayesian machine learning systems. In-class discussion: can transactions on these kinds of graphs be implemented as non-blocking asynchronous flows, or must they use blocking? |
Thu Sept 14 | 8. The FaRM DHT (Microsoft) | FaRM: Fast Remote Memory. Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. 2014. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (NSDI'14). USENIX Association, Berkeley, CA, USA, 401-414. |
Tue Sept 19 | 9. CMU's Herd DHT | Using RDMA Efficiently for Key-Value Services. Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2014. In Proceedings of the 2014 ACM conference on SIGCOMM (SIGCOMM '14). ACM, New York, NY, USA, 295-306. DOI: https://doi.org/10.1145/2619239.2626299 Discussion: In Herd, the use of RDMA primitives differs, with a focus on in-lining and tradeoffs between one-sided and two-sided operations. But is this comparison fair, or has one of the two papers (FaRM versus Herd) "stacked the deck" by focusing on a case with non-standard structure? |
Thu Sept 21 | 10. FaSST: A case for two-sided unreliable datagrams on RDMA | FaSST: Fast, Scalable and Simple Distributed Transactions with Two-Sided (RDMA) Datagram RPCs. Anuj Kalia, Michael Kaminsky, and David G. Andersen. 2016. In Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation (OSDI'16). USENIX Association, Berkeley, CA, USA, 185-201. Building on our discussion of Herd, this lecture will focus on the value of unreliable datagrams in RDMA settings. Again, the core question centers on the fairness of the problem setup and evaluation. Would this same set of claims be relevant to Derecho? Should Derecho be changed to run SST protocols and SSTMC over two-sided UD RDMA? |
Tue Sept 26 | 11. Microsoft's architecture for leveraging FPGA in Azure | A cloud-scale acceleration architecture. Adrian M. Caulfield; Eric S. Chung; Andrew Putnam; Hari Angepat; Jeremy Fowers; Michael Haselman; Stephen Heil; Matt Humphrey; Puneet Kaur; Joo-Young Kim; Daniel Lo; Todd Massengill; Kalin Ovtcharov; Michael Papamichael; Lisa Woods; Sitaram Lanka; Derek Chiou; Doug Burger. 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). Software/hardware support for FPGA programming in Azure. This topic is probably out of the best possible order and should be moved later in future offerings of CS6465. |
Thu Sept 28 | 12. How Facebook represents social network graphs | TAO: Facebook's Distributed Data Store for the Social Graph. Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov, Dmitri Petrov, Lovro Puzar, Yee Jiun Song, Venkat Venkataramani. 2013 USENIX Annual Technical Conference (USENIX ATC '13). Here we are back to looking at a social networking graph and its representation. Thinking back to the models used in FaRM and Herd, to what extent do their transactional evaluations actually match the intended use case, which would probably correspond to TAO? Would a smart highway generate the same style of graphs and transactions? |
Tue Oct 3 | 13. Facebook's caching architecture, and its backend Haystack image storage layer. | An Analysis of Facebook Photo Caching. Qi Huang, Ken Birman, Robbert van Renesse, Wyatt Lloyd, Sanjeev Kumar, and Harry C. Li. 2013. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). ACM, New York, NY, USA, 167-181. DOI: https://doi.org/10.1145/2517349.2522722 Finding a needle in Haystack: Facebook's photo storage. Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, and Peter Vajgel. 2010. In Proceedings of the 9th USENIX conference on Operating systems design and implementation (OSDI'10). USENIX Association, Berkeley, CA, USA, 47-60. Continues the theme from Thursday, Sept 28: What does the Facebook infrastructure teach us that would transfer to settings such as smart highway systems? Where are there notable differences (hint: think about real-time response requirements, data update patterns, the associated machine-learning model updates and where they need to be computed)? |
Thu Oct 5 | 14. Optimizing cache implementations to match properties of flash (SSD) storage hardware | RIPQ: Advanced Photo Caching on Flash for Facebook. Linpeng Tang, Qi Huang, Wyatt Lloyd, Sanjeev Kumar, and Kai Li. 2015. In Proceedings of the 13th USENIX Conference on File and Storage Technologies (FAST'15). USENIX Association, Berkeley, CA, USA, 373-386. Here we look at how the advanced caching concepts simulated in the prior Facebook papers turned out to stumble when actually implemented on modern SSD, because of performance issues relating to the way that SSDs implement small random writes. Then the authors identify the best possible SSD write patterns and show how the same algorithms can be approximated in a hardware-friendly way. |
Tue Oct 10 | Fall break, no class | |
Thu Oct 12 | 15. Leveraging network routers to obtain fast ordering for conflicting operations. | Just say NO to Paxos Overhead: Replacing Consensus with Network Ordering. Jialin Li, Ellis Michael, Naveen Kr. Sharma, Adriana Szekeres, and Dan R. K. Ports. 2016. In Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation (OSDI'16). USENIX Association, Berkeley, CA, USA, 467-483. This paper argues that totally ordered updates can be accelerated by taking advantage of features associated with advanced data center networks and routers. Question to discuss: if we think back to our discussion of consistency for transactions on social networking graphs or Bayesian knowledge graphs, that property centers on coordination and ordering. But does the pattern of updates we would see match the requirements of the NOPaxos concept, or did the NOPaxos designers actually assume something that would be poorly matched to that case? Would NOPaxos be likely to offer a useful speedup for transactions in FaRM or Herd / FaSST? |
Tue Oct 17 | 16. Getting RDMA technology to run over Ethernet (RoCE) | Congestion Control for Large-Scale RDMA Deployments. Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. 2015. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication (SIGCOMM '15). ACM, New York, NY, USA, 523-536. DOI: https://doi.org/10.1145/2785956.2787484 TIMELY: RTT-based Congestion Control for the Datacenter. Radhika Mittal, Vinh The Lam, Nandita Dukkipati, Emily Blem, Hassan Wassel, Monia Ghobadi, Amin Vahdat, Yaogong Wang, David Wetherall, and David Zats. 2015. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication (SIGCOMM '15). ACM, New York, NY, USA, 537-550. DOI: https://doi.org/10.1145/2785956.2787510 These two papers are the most famous ones for showing that RDMA can potentially run just as well on Ethernet (RoCE) as on Infiniband (which is like Ethernet, but has a credit-based permission-to-send mechanism that ensures no message is ever transmitted unless the receiver has a buffer waiting to receive it). The intriguing pattern here, shared with the RIPQ paper, centers on how the operating system software has to do part of the job: we have advanced hardware, but in order to "use it effectively" we need a special kind of partnership from the software side. Is this the main thing to expect in the future? |
Thu Oct 19 | 17. Microsoft's experience with a large-scale RDMA-over-Ethernet (RoCE) deployment that uses DCQCN. | RDMA over Commodity Ethernet at Scale. Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitu Padhye, and Marina Lipshteyn. 2016. In Proceedings of the 2016 ACM SIGCOMM Conference (SIGCOMM '16). ACM, New York, NY, USA, 202-215. DOI: https://doi.org/10.1145/2934872.2934908 This continues the topic from Tuesday by looking at actual experience with DCQCN at large scale. The solution really works, but there were a lot of surprises that the DCQCN paper didn't anticipate. I am not aware of any large-scale experience with TIMELY yet, but it would be interesting to know whether that method also leaves some puzzles that will only show up when people really bet on it! |
Tue Oct 24 | 18. Characteristics of NUMA computing systems. OSDI paper on how Linux kernel locking had to evolve in order to operate efficiently on NUMA platforms. | An Analysis of Linux Scalability to Many Cores. S. Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI '10), Oct. 4-6, 2010, Vancouver, BC, Canada. The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors. Austin Clements, Frans Kaashoek, Nickolai Zeldovich and Robert Morris. ACM TOCS 32:4 (Jan 2015). There has been a lot of recent work on NUMA machines and the issues that arise when multicore (or even just multithreaded) applications use the locking hardware to manage concurrency. The first (short) paper has a nice review of the work and focuses on a number of Linux-hosted applications to ask how well they co-exist with NUMA locking. The key finding is that a fairly small number of changes are needed, but that "NUMA-blind" locking would indeed be quite slow. The second paper is longer and focuses on a policy they call "scalable commutativity" for multicore program design. |
Thu Oct 26 | 19. ASPLOS paper on efficiency of Corfu over NUMA architectures. | Black-box concurrent data structures for NUMA architectures. Irina Calciu, Siddhartha Sen, Mahesh Balakrishnan, and Marcos K. Aguilera. ASPLOS 2017. This paper asks how to create a NUMA-friendly shared log (not exactly Corfu, but similar), and shows that there are some basic building blocks that make the task much easier. Not only does this yield a high-performance logging tool, but the primitives themselves are candidates to view as "operating systems help" for exploiting NUMA architectures. A question that arises in our "smart highways" use case is this: with very large data objects like videos, NUMA architectures, and RDMA, do additional complications enter the picture? Or would these kinds of building blocks be just what we want? |
Tue Oct 31 | 20. [Matt] Software Transactional Memory, Herlihy's hardware accelerator concept | Transactional memory: architectural support for lock-free data structures. Maurice Herlihy and J. Eliot B. Moss. 1993. In Proceedings of the 20th annual international symposium on computer architecture (ISCA '93). ACM, New York, NY, USA, 289-300. DOI=http://dx.doi.org/10.1145/165123.165164 A classic paper for our overall topic. Matt will present it and lead a discussion on the connections to modern machine learning, but the core issue is this: could we somehow get correct behavior from programs that run with concurrency and yet ignore locking? The basic idea centers on having the compiler automate the locking needed to provide atomicity but with help from a special kind of hardware accelerator. |
Thu Nov 2 | 21. [Matt] Software Transactional Memory, using the L2 cache as a simple way to sense transactional conflicts | ProteusTM: Abstraction Meets Performance in Transactional Memory. Diego Didona, Nuno Diegues, Anne-Marie Kermarrec, Rachid Guerraoui, Ricardo Neves, and Paolo Romano. 2016. SIGARCH Comput. Archit. News 44, 2 (March 2016), 757-771. DOI: https://doi.org/10.1145/2980024.2872385 |
Tue Nov 7 | 22. Graphics processing units (GPU). Challenge of sharing GPUs in cloud computing data centers. | Dandelion: a compiler and runtime for heterogeneous systems. Christopher J. Rossbach, Yuan Yu, Jon Currey, Jean-Philippe Martin, and Dennis Fetterly. 2013. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). ACM, New York, NY, USA, 49-68. DOI: https://doi.org/10.1145/2517349.2522715 This paper proposes a compiler and runtime to help with GPU program and memory management, and measures many of the key performance-limiting paths. |
Thu Nov 9 | 23. Difficulty of getting RDMA to work with GPU memory | SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs. Shai Bergman, Tanya Brokhman, Tzachi Cohen, Mark Silberstein. USENIX ATC, 2017. GPUnet: Networking Abstractions for GPU Programs. OSDI 2014. So here we have two more papers looking at issues that arise when treating a GPU system as a first-class participant in a data center. The first explores the challenges of giving the GPU direct access to file storage; the second focuses on networking GPUs to other kinds of datacenter resources. What interests me is that it turns out to be so challenging: from the Dandelion paper you might have thought these issues were all solved. But no, not at all... In fact the GPU is a "poor fit" for a classic computing environment and isn't easy to integrate with Linux-based host applications. |
Tue Nov 14 | 24. [Drew] Machine learning for tasks like speech understanding using hardware accelerators. | Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers. Johann Hauswald et al. SIGARCH Comput. Archit. News 43, 1 (March 2015), 223-238. DOI: https://doi.org/10.1145/2786763.2694347 There is also a journal version of the same paper with a little more detail and some additional experiments, but the shorter conference version is probably fine unless you find something puzzling or incomplete and want to read a little more. Drew will lead the discussion for this class. We'll be looking at how clusters of FPGA or GPU devices could yield breakthroughs for spoken language understanding. |
Thu Nov 16 | 25. Spark/DataBricks | Spark: Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010. Improving MapReduce Performance in Heterogeneous Environments. M. Zaharia, A. Konwinski, A.D. Joseph, R. Katz and I. Stoica. OSDI 2008, December 2008. Spark has been an incredible success for the machine-learning community. There are many Spark papers, but this one is the classic. It looks at how Spark gains big speedups over Hadoop, which is the main "back end" system in many machine learning systems, through a mixture of smart use of caching plus smart scheduling. But I should probably emphasize "back end," because a big issue that arises concerns the relevance of Spark to our kinds of front-end real-time machine learning applications. Could a smart highway somehow leverage Spark as a real-time tool? This lecture will be led by Xinwen Wang. Questions to think about when reading this paper: |
Tue Nov 21 | 26. TensorFlow. | TensorFlow: A System for Large-Scale Machine Learning. Martín Abadi et al. OSDI 2016. The big topic these days is TensorFlow: a dataflow system with a Python front end for building applications that process images and other kinds of machine-learned content. We'll limit ourselves to a big-picture overview of the system itself. There are many distributed computing and fault-tolerance challenges for the TF community, but the topic gets really obscure if you go too far down that path. |
Thu Nov 23 | Thanksgiving break, no class | |
Tue Nov 28 | 27. Hardware security technologies: Intel SGX. How Microsoft modified MapReduce to create an SGX-secured version. | Oblivious Multi-Party Machine Learning on Trusted Processors. Olya Ohrimenko, Felix Schuster, Cédric Fournet, Aastha Mehta, Sebastian Nowozin, Kapil Vaswani, Manuel Costa. In The 25th USENIX Security Symposium, USENIX, July 13, 2016. I wanted to cover one paper on SGX, but also wanted it to tie into modern machine learning systems. Those, of course, often run Hadoop. So I settled on this nice result by Microsoft's Cambridge research group. They adapt Hadoop to run in SGX-secured enclaves. |
Thu Nov 30 | 28. Disaggregated Data Centers: What are the challenges? |
Open discussion in class: For our last class, I would like you to spend the next two weeks thinking about a specific question, and come to class prepared to discuss your ideas. First, think about the machine-learning technologies we have heard about this semester, even though we didn't dive deep into any of them. They include Bayesian network graphical models (the TAO social network graph is an example, but some machine learning systems build more elaborate functionality into the graph), neural networks, and other forms of optimization where you learn the parameters of a model from training data and later use the model to understand real inputs.
For each type of machine learning model, read about it on Wikipedia if needed, and try to understand (1) what data is used to hold such a model; (2) what kinds of inputs it can learn from; (3) how the model is actually used at runtime. Example: suppose that our model is very simple: a neural network intended to distinguish cats from fish. The model itself would consist of a long (potentially VERY long) vector of numerical parameter settings, perhaps many gigabytes of floating-point numbers; these are the weights associated with the edges and nodes internal to the model. The training set would have vast numbers of photos of cats and vast numbers of photos of fish, each with a label such as "cat", "cat/kitten", or "fish/guppy".
Then later, at runtime, we can take a photo of something and feed it to the trained model. It will output the two labels and its weight for each. For example, given a photo of a lovely Persian cat, it should assign high likelihood to "cat" as the label and low likelihood to "fish". With a photo of a catfish or a sturgeon, both of which have very visible whiskers, it might lean mostly toward fish but a little toward cat, especially if it never trained on examples of fish with whiskers. With my golden retriever, Biscuit (who passed away a few years ago, sadly), it would probably guess cat because she was very furry. (A minimal "model as a vector of weights" sketch appears after the schedule.)
A database of airplane flight plans is also a kind of knowledge. We discussed how an air traffic control system works, so this is another form of learning. It involves less guesswork, but it can still be used to decide whether a situation is safe, etc. As we heard, for airplanes these might be individual records containing about 1MB to 10MB of data. A few good-quality still photos of the vehicle would probably push a record to the very large end of that range. On the other hand, you don't need to include a photo every time you talk about a plane, so perhaps the record holds URLs pointing to other databases, such as a database of photos.
So this is the basic level of knowledge I want you to have about the ML technologies you have heard about. Next, take that insight to our smart car scenario. We have decided to move machine learning to the edge, and our goals are to solve basic questions:
You can make this list longer, and please feel free to do that! In particular, add important time-critical functionality and tasks.
Last, and this is what we will discuss in class, ask yourself what demands this application will need to make on the communication layer, the storage layer, specialized hardware, and the low-level operating system platform. For example, if you think that we will need to multicast video snippets, that would put heavy load on something like Derecho. If you think an FPGA array will be needed because we will want to have our system talk to drivers, plan that into your answer. If you think that we will use a GPU for video image analysis, include the issues associated with that. Bring your top five big challenges for research to class on the last day: what are the big issues someone should write papers about? |
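To make the "model is just a big vector of weights" picture from the final assignment concrete, here is a minimal, purely illustrative Python sketch (not taken from any of the readings): a tiny two-layer network whose parameters are NumPy arrays and whose forward pass produces a weight for each of the two labels. The dimensions, the random weights, and the 64x64 grayscale input are assumptions chosen only for illustration; a real classifier would be trained on labeled photos and would be vastly larger.

# Minimal sketch (illustrative only): the "cat vs. fish" model is nothing but
# a set of numeric parameters plus a forward pass.  The weights here are
# random, so the outputs are meaningless; a real model would be trained.
import numpy as np

rng = np.random.default_rng(0)
LABELS = ["cat", "fish"]

# "The model": parameters for a 2-layer network over a 64x64 grayscale image.
W1 = 0.01 * rng.standard_normal((64 * 64, 128))   # input -> hidden
b1 = np.zeros(128)
W2 = 0.01 * rng.standard_normal((128, 2))         # hidden -> one score per label
b2 = np.zeros(2)

def predict(image):
    """Return a weight (probability) for each label, given one image."""
    x = image.reshape(-1)                 # flatten the pixels into a vector
    h = np.maximum(0.0, x @ W1 + b1)      # ReLU hidden layer
    scores = h @ W2 + b2
    p = np.exp(scores - scores.max())     # softmax over the two labels
    return dict(zip(LABELS, p / p.sum()))

photo = rng.random((64, 64))              # stand-in for one camera frame
print(predict(photo))                     # e.g. {'cat': 0.51..., 'fish': 0.48...}

The point for the systems discussion is that everything the model "knows" lives in W1, b1, W2, and b2; shipping, caching, and updating those arrays is what moving machine learning to the edge means at the storage and communication level.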
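For the lecture-3 discussion of asynchronous machine learning, here is a hedged, minimal sketch of the lock-free update pattern behind Hogwild-style SGD: several threads read and write one shared weight vector with no locking, betting that mostly non-overlapping updates keep the races harmless. The toy least-squares problem, thread count, and step size are assumptions chosen only to keep the example small; this is not the authors' code, and in CPython the GIL limits true parallelism, so the sketch shows the pattern rather than the speedup.

# Minimal sketch of Hogwild-style asynchronous SGD (illustrative only).
# Several threads update a shared weight vector with NO locks; the toy
# least-squares problem and all constants are assumptions for illustration.
import threading
import numpy as np

rng = np.random.default_rng(1)
n_features = 20
X = rng.standard_normal((1000, n_features))      # synthetic training data
true_w = rng.standard_normal(n_features)
y = X @ true_w

w = np.zeros(n_features)                          # shared model, no lock around it
step = 0.01

def worker(n_steps, seed):
    local_rng = np.random.default_rng(seed)       # per-thread RNG
    for _ in range(n_steps):
        i = local_rng.integers(len(X))            # pick one training example
        grad = (X[i] @ w - y[i]) * X[i]           # gradient of 0.5 * (x.w - y)^2
        w[:] = w - step * grad                    # racy read-modify-write: the Hogwild bet

threads = [threading.Thread(target=worker, args=(5000, s)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("distance from true weights:", np.linalg.norm(w - true_w))  # typically small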
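For the lecture-4 discussion of sharding camera feeds across a DHT-style smart memory, here is a hedged, minimal sketch of consistent hashing, the placement idea underlying Chord-like systems: keys and shards hash onto a ring, and each key is owned by the first shard clockwise from it. The shard names, camera IDs, virtual-node count, and use of MD5 are assumptions for illustration; a real DHT adds finger tables, replication, and membership handling.

# Minimal sketch of consistent hashing for sharding camera feeds (illustrative
# only; shard and camera names are made up).
import bisect
import hashlib

def ring_hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, shards, vnodes=64):
        # Place each shard at many virtual points to smooth out the load.
        self.points = sorted(
            (ring_hash(f"{s}#{v}"), s) for s in shards for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.points]

    def shard_for(self, key: str) -> str:
        # The owner is the first shard clockwise from the key's hash.
        i = bisect.bisect(self.keys, ring_hash(key)) % len(self.points)
        return self.points[i][1]

ring = HashRing([f"shard-{i}" for i in range(8)])
for cam in ["camera-17", "camera-18", "camera-19"]:
    print(cam, "->", ring.shard_for(cam))

The appeal for the smart highway is that adding or removing a shard only remaps the keys in one arc of the ring, so most cached video state stays where it is.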