Over the past three lectures, we've been talking about the architecture of the CPU and how it affects performance of machine learning models.
However, the CPU is not the only type of hardware that machine learning models are trained or run on.
In fact, most modern DNN training happens not on CPUs but on GPUs.
It's not just training a neural network.
The pipeline has many different components.
Everywhere!
There’s interest in using hardware everywhere in the pipeline
What improvements can we get?
Speed up the basic building blocks of machine learning computation
Add data/memory paths specialized to machine learning workloads
To answer this...we need to look at what GPU architectures look like.
GPUs were originally designed to support the 3D graphics pipeline, much of which was driven for demand for videogames with ever-increasing graphical fidelity.
Important properties of 3d graphics rendering:
The first era of GPUs ran a fixed-function graphics pipeline. They weren't programmed, but instead just configured to use a set of fixed functions designed for specific graphics tasks: mostly drawing and shading polygons in 3d space.
In the early 2000s, there was a shift towards programmable GPUs.
Programmable GPUs allowed for people to customize certain stages in the graphics pipeline by writing small programs called shaders which let developers process the vertices and pixels of the polygons they wanted to render in custom ways.
These shaders were capable of very high-throughput parallel processing
The GPU supported parallel programs that were more parallel than those of the CPU.
But unlike multithreaded CPUs, which supported computing different functions at the same time, the GPU focused on computing the same function simultaneously on multiple elements of data.
This illustrates a distinction between two types of parallelism:
How are the different types of parallelism we've discussed categorized according to this distinction?
Eventually, people started to use GPUs for tasks other than graphics rendering.
However, working within the structure of the graphics pipeline of the GPU placed limits on this.
To better support general-purpose GPU programming, in 2007 NVIDIA released CUDA, a parallel programming language/computing platform for general-purpose computation on the GPU.
Now, programmers no longer needed to use the 3d graphics API to access the parallel computing capabilities of the GPU.
This led to a revolution in GPU computing, with several major applications, including:
An illustration of this from the CUDA C programming guide:
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
int main()
{
...
// Kernel invocation with N threads
VecAdd<<<1, N>>>(A, B, C);
...
}
This syntax launches $N$ threads, each of which performs a single addition.
Importantly, spinning up hundreds or thousands of threads like this could be reasonably fast on a GPU, while on a CPU this would be way too slow due to the overhead of creating new threads.
Additionally, GPUs support many many more parallel threads running than a CPU.
CPU is a general purpose processor
GPU is optimized for
Machine learning workloads are compute intensive and often involve streaming data flows.
Is it compute bound or memory bound?
Ideally: it's compute bound
But often it is memory/communication bound
So we have to consider both in ML systems!
Because of their large amount of data parallelism, GPUs provide an ideal substrate for large-scale numerical computation.
In particular, GPUs can perform matrix multiplies very fast.
Just like BLAS on the CPU, there's an optimized library from NVIDIA "cuBLAS" that does matrix multiples efficiently on their GPUs.
There's even a specialized library of primitives designed for deep learning: cuDNN.
Machine learning frameworks, such as TensorFlow, are designed to support computation on GPUs.
And training a deep net on a GPU can decrease training time by an order of magnitude.
How do machine learning frameworks support GPU computing?
A core feature of ML frameworks is GPU support.
High-level approach:
Represent the computational graph in terms of vectorized linear-algebra operations.
For each operation, call a hand-optimized kernel that computes that op on the GPU.
Stitch those ops together by calling them from Python
More compute-intensive CPUs
Low-power devices
Configurable hardware like FPGAs and CGRAs
Accelerators that speed up matrix-matrix multiply
Accelerators special-designed to accelerate ML computations beyond just matrix-matrix multiply