using PyPlot
using LinearAlgebra
using Statistics
Scheduled for release on 5/18/2020 at 9:30 AM, and due 48 hours later.
So far, we've talked about machine learning running on two types of classical hardware: CPUs and GPUs.
But these are not the only options for training and inferring ML models.
An exciting new generation of computer processors is being developed to accelerate machine learning calculations.
These so-called machine learning accelerators (also called AI accelerators) have the potential to greatly increase the efficiency of ML tasks (usually deep neural network tasks), for both training and inference.
Beyond this, even the traditional-style CPU/GPU architectures are being modified to better support ML and AI applications.
Today, we'll talk about some of these trends.
As we saw in the first lecture, the ML pipeline has many different components.
Hardware can help improve performance pretty much everywhere in the pipeline, and there's interest from hardware vendors in designing better hardware for pretty much every aspect of the ML pipeline. There are two main ways to do this: (1) adapt existing general-purpose architectures (CPUs and GPUs) to better support ML workloads, and (2) build new specialized hardware designed specifically for ML.
Both of these methods have seen significant use in recent years.
What improvements can we hope to get from better hardware in the ML pipeline? How can hardware make our ML systems better?
Power efficiency – especially for edge devices
Better throughput – handle larger datasets, train better at scale
Better latency
More (fast) memory – scale up models with more parameters without destroying performance
What does this mean for the statistical performance of our algorithms?
Sometimes, when we move to a different hardware device, we expect to get the same results: same learned model, same accuracy.
But this is not always the case. Why?
x = (0.1 + 0.2) + 0.3   # sum three numbers, associating to the left
y = 0.1 + (0.2 + 0.3)   # the same three numbers, associating to the right
x == y                  # false: floating-point addition is not associative
abs(x - y)              # the difference is tiny, but nonzero
Generally, we expect specialized ML hardware to produce learned models that are similar in quality to, but not necessarily bit-for-bit identical with, those produced on baseline chips (CPUs/GPUs).
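To see this concretely, here is a small sketch on synthetic data (the array names and sizes below are arbitrary choices of ours, not tied to any particular accelerator): the same reduction computed in a lower precision, or accumulated in a different order, gives a slightly different answer. The same effect, writ large, is why models trained on different hardware tend to agree statistically but not bitwise.
using Random
Random.seed!(42)
v64 = rand(Float64, 1_000_000)   # synthetic data in double precision
v32 = Float32.(v64)              # the same data cast to single precision

s64     = sum(v64)               # built-in reduction in Float64
s32     = sum(v32)               # built-in reduction in Float32
s_naive = foldl(+, v64)          # explicit left-to-right accumulation in Float64

println("relative diff, Float32 vs Float64: ", abs(Float64(s32) - s64) / abs(s64))
println("relative diff, naive vs built-in:  ", abs(s_naive - s64) / abs(s64))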
One major issue when developing new hardware for ML, and the question we should always be asking: how is the device programmed?
One important thing to realize is that the only real distinction between GPUs and ML accelerators is that GPUs weren't originally designed for AI/ML.
But there's nothing in the architecture itself that fundamentally separates GPUs from more purpose-built ML accelerators.
In fact, as machine learning tasks capture more of the market for GPUs, GPU designers have been adjusting their architectures to fit ML applications.
For example, by supporting low-precision arithmetic.
For example, by creating specialized compute paths for tasks common to DNN architectures.
For example, by making it easier for multiple GPUs to communicate with each other so as to collaborate together on a single training task.
As GPU architectures become more specialized to AI tasks, it becomes more accurate to think of them as ML accelerators.
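As a rough software analogue of that low-precision support (a sketch only: the function below is ours, and real GPUs expose their mixed-precision units through libraries rather than through Julia loops like this), here is a matrix multiply that stores its inputs in Float16 but accumulates products in Float32, compared against a full Float32 reference.
# Software model of a mixed-precision matmul: Float16 inputs, Float32 accumulation.
function matmul_mixed(A::Matrix{Float16}, B::Matrix{Float16})
    m, k = size(A)
    k2, n = size(B)
    @assert k == k2
    C = zeros(Float32, m, n)
    for j in 1:n, i in 1:m
        acc = 0.0f0                                # accumulate in Float32
        for p in 1:k
            acc += Float32(A[i, p]) * Float32(B[p, j])
        end
        C[i, j] = acc
    end
    return C
end

A32 = randn(Float32, 64, 64); B32 = randn(Float32, 64, 64)
C_ref   = A32 * B32                                   # full Float32 reference
C_mixed = matmul_mixed(Float16.(A32), Float16.(B32))  # low-precision inputs
println("max abs deviation from Float32 result: ", maximum(abs.(C_ref .- C_mixed)))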
All computer processors are basically integrated circuits: electronic circuits etched onto a single piece of silicon.
Usually this circuit is fixed when the chip is designed.
A field-programmable gate array or FPGA is a type of chip that allows the end-user to reconfigure the circuit it computes in the field (hence the name).
You can program it by specifying the circuit you want it to compute in terms of logic gates: basic AND, OR, NOT, etc.
Note that this doesn't involve physically changing the circuit etched on the FPGA's silicon: that's fixed. Rather, the FPGA realizes a reconfigurable logical circuit by connecting or disconnecting parts of the fixed circuit etched on its silicon.
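As a toy illustration of specifying a circuit in terms of gates (a software model only; the gate and adder functions below are ours, and a real FPGA would typically be programmed in a hardware description language such as Verilog or VHDL, with a toolchain mapping the design onto the chip's lookup tables and interconnect), here is a 1-bit full adder built from AND, OR, and XOR, chained into a 4-bit ripple-carry adder.
# Basic gates as functions.
AND(a, b) = a & b
OR(a, b)  = a | b
XOR(a, b) = a ⊻ b

# A 1-bit full adder described purely in terms of gates.
function full_adder(a::Bool, b::Bool, cin::Bool)
    s    = XOR(XOR(a, b), cin)                  # sum bit
    cout = OR(AND(a, b), AND(cin, XOR(a, b)))   # carry-out bit
    return s, cout
end

# "Wire up" four full adders into a 4-bit ripple-carry adder
# (bits are least-significant first).
function add4(a::NTuple{4,Bool}, b::NTuple{4,Bool})
    carry = false
    out = Bool[]
    for i in 1:4
        s, carry = full_adder(a[i], b[i], carry)
        push!(out, s)
    end
    return out, carry
end

add4((true, false, false, false), (true, true, false, false))   # 1 + 3 = 4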
FPGAs were used historically for circuit simulation.
FPGAs consist of an array of programmable circuits that can each individually do a small amount of computation, as well as a programmable interconnect that connects these circuits together. The large number of programmable gates in the FPGA makes it a naturally highly parallel device.
An important property of FPGAs that distinguishes them from CPUs/GPUs: you can choose to have data flow through the chip however you want!
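Here is a purely conceptual sketch of that dataflow idea (software only; the stage names below are made up, and nothing here runs on an FPGA): each input element streams through a fixed chain of small stages, the way data can be routed directly from one configurable block to the next instead of round-tripping through a central memory.
# Each "stage" is a small unit of computation; the tuple plays the role of the
# programmable interconnect that fixes how data flows from stage to stage.
stage_scale(x) = 2.0 * x
stage_shift(x) = x + 1.0
stage_clamp(x) = clamp(x, 0.0, 10.0)

pipeline = (stage_scale, stage_shift, stage_clamp)

# Stream each input through the whole pipeline.
run_pipeline(x, stages) = foldl((v, f) -> f(v), stages; init = x)

ys = [run_pipeline(x, pipeline) for x in randn(8)]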
FPGAs often use less power to accomplish the same work compared with other architectures.
When would we want to use an FPGA vs. building our own application-specific integrated circuit (ASIC) from scratch for our ML application?
The main one is Microsoft's Project Catapult/Project Brainwave (https://www.microsoft.com/en-us/research/project/project-catapult/). For example, a 2017 milestone on their website says:
_MSR and Bing launched hardware microservices, enabling one web-scale service to leverage multiple FPGA-accelerated applications distributed across a datacenter. Bing deployed the first FPGA-accelerated Deep Neural Network (DNN). MSR demonstrated that FPGAs can enable real-time AI, beating GPUs in ultra-low latency, even without batching inference requests._
And from the paper "Serving DNNs in Real Time at Datacenter Scale with Project Brainwave":
Google's Tensor Processing Unit (TPU), deployed internally starting in 2015 and announced publicly in 2016, made a splash as one of the first specialized architectures for machine learning and AI applications.
The original version focused on fast inference via high-throughput 8-bit arithmetic.
This is now a real product you can buy and use yourself, both on the cloud (from https://cloud.google.com/tpu)
and in physical hardware
This was not the case for ML accelerator hardware even three years ago!
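To make the 8-bit inference idea concrete, here is a generic sketch of symmetric Int8 quantization with Int32 accumulation (the function names and the quantization recipe are a common pattern in 8-bit inference engines, chosen by us for illustration; this is not a description of the TPU's actual implementation).
# Symmetric quantization: map values in [-max, max] onto integers in [-127, 127].
function quantize_int8(W::AbstractArray{Float32})
    scale = maximum(abs.(W)) / 127.0f0
    Wq = round.(Int8, clamp.(W ./ scale, -127.0f0, 127.0f0))
    return Wq, scale
end

# Quantized matrix-vector product: Int8 inputs, Int32 accumulation, then rescale.
function matvec_int8(Wq::Matrix{Int8}, wscale::Float32,
                     xq::Vector{Int8}, xscale::Float32)
    acc = Int32.(Wq) * Int32.(xq)               # products accumulated in Int32
    return Float32.(acc) .* (wscale * xscale)   # back to floating point
end

W = randn(Float32, 16, 32); x = randn(Float32, 32)
Wq, ws = quantize_int8(W)
xq, xs = quantize_int8(x)
println("max abs error vs Float32 matvec: ", maximum(abs.(W * x .- matvec_int8(Wq, ws, xq, xs))))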
Pro for TPU: Google has some evidence that the TPU outperforms GPUs and other accelerators on benchmark tasks. From Google's blog (https://cloud.google.com/blog/products/ai-machine-learning/mlperf-benchmark-establishes-that-google-cloud-offers-the-most-accessible-scale-for-machine-learning-training):
"For example, itβs possible to achieve a 19\% speed-up with a TPU v3 Pod on a chip-to-chip basis versus the current best-in-class on-premise system when tested on ResNet-50"
But other hardware manufacturers make claims that their hardware is better...so you'll need to do some research to determine what is likely to be the best for your task and for the price point you care about.
Pro for TPU: Seems to have better power and somewhat better scalability than other options. E.g. you can scale up to 256 v3 TPUs in a pod.
Con for TPU: It can tie you to Google's Cloud Platform.
Con for TPU: Still might be a bit harder to program than GPUs/CPUs.
Intel's Nervana Neural Network Processor (NNP). (https://www.intel.ai/intel-nervana-neural-network-processors-nnp-redefine-ai-silicon/)
A new class of hardware called "Vision Processing Units" (VPUs) for computer vision on edge devices.
Intel acquired Habana Labs recently and has been building new AI processors
Main take-away here: there's buy-in for the idea of ML accelerator hardware from the biggest players in the computer processor space.
Apple's Neural Engine within the A11 Bionic system-on-a-chip (and subsequent chips) for neural networks on iPhones.
Many start-ups in the ML hardware space developing innovative new architectures.
Questions?
Scaling machine learning methods is increasingly important.
In this course, we addressed the high-level question: What principles underlie the methods that allow us to scale machine learning?
To answer this question, we used techniques from three broad areas: statistics, optimization, and systems.
We articulated three broad principles, one in each area.
Statistics Principle: Make it easier to process a large dataset by processing a small random subsample instead.
Optimization Principle: Write your learning task as an optimization problem, and solve it via fast general algorithms that update the model iteratively.
Systems Principle: Use algorithms that fit your hardware, and use hardware that fits your algorithms.
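As a closing illustration of the first two principles (a minimal sketch on synthetic data; the problem sizes and step size below are arbitrary choices of ours, not taken from the course assignments), here is minibatch SGD for least squares: each gradient is estimated from a small random subsample, and the model improves through iterative updates.
using Random, LinearAlgebra

Random.seed!(0)
n, d = 10_000, 20
X = randn(n, d)                       # synthetic features
w_true = randn(d)
y = X * w_true .+ 0.1 .* randn(n)     # noisy labels

w = zeros(d)                          # model parameters
alpha, B = 0.05, 64                   # step size and minibatch size
for t in 1:2000
    idx = rand(1:n, B)                # statistics principle: random subsample
    Xb, yb = X[idx, :], y[idx]
    g = Xb' * (Xb * w .- yb) ./ B     # stochastic gradient of the squared loss
    w .-= alpha .* g                  # optimization principle: iterative update
end
println("distance to true parameters: ", norm(w - w_true))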
Now, some open questions in scalable ML that relate to the principles.