md"""
# Playing with performance in Julia

In this notebook, we are going to look at the performance of a few different implementations of matrix-vector and matrix-matrix products in Julia. This serves three purposes:

1. It will give you a taste of Julia programming.
2. It will show how there are several different ways of thinking about even simple operations.
3. It will illustrate the performance advantages of using matrix-matrix products (implemented by someone else!) as a building block for later linear algebra algorithms.
"""
using Plots

md"""
## Matrix-vector products

We'll start by multiplying matrices by vectors. First, though, it's useful to understand how to write matrix and vector literals in Julia! Array literals (matrices and vectors) are surrounded by square brackets, and their elements are connected by horizontal concatenation (separation by space) and vertical concatenation (separation by semicolons). For example, let's build a little matrix and a column vector and compute their product.
"""
A = [1.0 2.0; 3.0 4.0]
x = [5.0; 6.0]
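A*x   # the product: [1*5 + 2*6, 3*5 + 4*6] = [17.0, 39.0]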

md"""
What happens under the covers of the expression `A*x`? There are actually a couple different ways of thinking about multiplying a matrix by a vector. One version is that each entry of the result of a matrix-vector product is a matrix row times a vector; this is the row-oriented or dot-product formulation. Another version is that the result is a linear combination of the columns of $A$, where the weights in the linear combination are the entries of the vector $x$. This gives us a column-oriented view. If we break everything down into low-level `for` loops, we end up with very similar code to implement these two approaches; the only thing that changes is the loop order.
"""
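md"""
In formulas: the row-oriented version computes each entry as a dot product, $y_i = \sum_j a_{ij} x_j$, while the column-oriented version accumulates a linear combination of columns, $y = \sum_j x_j A_{:,j}$. Both compute exactly the same sums; only the order of the scalar updates differs.
"""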
function matvec_row(A, x)
    m, n = size(A)
    y = zeros(m)
    for i = 1:m
        # y[i] = A[i,:]'*x -- a dot product of row i with x
        for j = 1:n
            y[i] += A[i,j]*x[j]
        end
    end
    y
end
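matvec_row(A, x)   # sanity check on the 2×2 example; should match A*x = [17.0, 39.0]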
function matvec_col(A, x)
    m, n = size(A)
    y = zeros(m)
    for j = 1:n
        # y += A[:,j]*x[j] -- weighted linear combination of columns
        for i = 1:m
            y[i] += A[i,j]*x[j]
        end
    end
    y
end
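matvec_col(A, x)   # same result, accumulated column by column: [17.0, 39.0]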

md"""
## Measuring performance

Now we have three different ways of computing the same matrix-vector product: the functions `matvec_row` and `matvec_col` and the built-in `*` operation. What is the speed difference between the three?

It doesn't make sense to do this type of thing for 2-by-2 matrices -- we really want something more serious. It turns out that interesting things happen to timings for matrix sizes that are near multiples of large powers of 2, for reasons that have to do with the cache hardware, so let's start with a list of test sizes.
"""
test_sizes = Int[]
for n = 128:128:1024
    push!(test_sizes, n-1)
    push!(test_sizes, n)
    push!(test_sizes, n+1)
end
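test_sizes   # 127, 128, 129, 255, 256, 257, ..., 1023, 1024, 1025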

md"""
Now we want a timing harness. The `BenchmarkTools` package in Julia is useful for this, but for our purposes we'll stick with the built-in `@elapsed` function to see how long a task takes. We also use the fact that functions are first-class objects, so we can pass the specific multiplication routine to be timed as an argument to the timing harness function.
"""
function matvec_gflops(matvec, n)
    A = rand(n, n)
    x = rand(n)
    ntrials = ceil(Int, 1e7 / n^2) # Scale the number of trials to get some minimal time
    time = @elapsed for j = 1:ntrials
        y = matvec(A, x)
    end
    2e-9*n^2*ntrials/time # A matvec costs about 2n^2 flops
end

md"""
Now let's actually do the timings to get a Gflop rate (billions of floating-point operations per second) for each of the different methods for each matrix size. We can capture all of the results in an array without an explicit loop using an *array comprehension*. Also notice that to time the built-in routine, we use the anonymous function syntax (aka "lambda function" syntax) in Julia.
"""
gflops_row = [matvec_gflops(matvec_row, n) for n in test_sizes]
gflops_col = [matvec_gflops(matvec_col, n) for n in test_sizes]
gflops_builtin = [matvec_gflops((A, x) -> A*x, n) for n in test_sizes]

md"""
There is a pretty serious difference in performance here! The row-oriented code runs at about half a gigaflop/s; column-oriented goes up to mostly around 3-4; and the built-in is up around 10! Of course, there is variation depending on the matrix sizes.
"""
begin
    plot(test_sizes, gflops_row, labels="row")
    plot!(test_sizes, gflops_col, labels="col")
    plot!(test_sizes, gflops_builtin, labels="builtin")
end

md"""
## Matrix-matrix multiply

Just as we can get different algorithms for matrix-vector products using different loopo orderings, we can also get different algorithms for matrix-matrix multiplication. Here, though, we have three indices, and so there are six possible orderings of the loops! Let's consider two, one of which is oriented toward inner products, the other oriented towards outer products.
"""
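md"""
In formulas: the inner-product view computes each entry as $c_{ij} = \sum_k a_{ik} b_{kj}$, a dot product of row $i$ of $A$ with column $j$ of $B$, while the outer-product view accumulates $C = \sum_k A_{:,k} B_{k,:}$, a sum of rank-one updates.
"""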
function matmul_ijk(A, B)
    m, p = size(A)
    p, n = size(B)
    C = zeros(m, n)
    for i = 1:m
        for j = 1:n
            # Could also write this loop as C[i,j] = A[i,:]'*B[:,j] -- an inner product
            for k = 1:p
                C[i,j] += A[i,k]*B[k,j]
            end
        end
    end
    C
end
function matmul_kji(A, B)
    m, p = size(A)
    p, n = size(B)
    C = zeros(m, n)
    for k = 1:p
        # Could write these loops as C += A[:,k]*B[k,:]' -- an outer product
        for j = 1:n
            for i = 1:m
                C[i,j] += A[i,k]*B[k,j]
            end
        end
    end
    C
end

md"""
We again write a uniform timing harness for the three codes.
"""
function matmul_gflops(matmul, n)
    A = rand(n, n)
    B = rand(n, n)
    ntrials = ceil(Int, 1e7 / n^3) # Scale the number of trials to get some minimal time
    time = @elapsed for j = 1:ntrials
        C = matmul(A, B)
    end
    2e-9*n^3*ntrials/time # A matmul costs about 2n^3 flops
end

md"""
Now the difference in timing between the two naively coded versions is still big, but the gap between our naive codes and the built-in is huge! Indeed, it takes a lot of detailed fiddling to write a matrix-matrix multiply code that runs this fast, and that work changes from machine to machine. But people do spend that time, as we can see here -- and they do so because once we have fast matrix-matrix multiplication codes, we can use it to write other fast linear algebra codes by thinking about matrices in a "block" fashion. We will see this repeatedly over the next couple months.
"""
gflops_inner = [matmul_gflops(matmul_ijk, n) for n in test_sizes]
gflops_outer = [matmul_gflops(matmul_kji, n) for n in test_sizes]
gflops_dgemm = [matmul_gflops((A, B) -> A*B, n) for n in test_sizes]
begin
    plot(test_sizes, gflops_inner, labels="inner prod")
    plot!(test_sizes, gflops_outer, labels="outer prod")
    plot!(test_sizes, gflops_dgemm, labels="builtin")
end
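md"""
To give a flavor of the "block" perspective, here is a sketch of a blocked matrix-matrix multiply that calls the fast built-in `*` on submatrices. The function name `matmul_blocked` and the block size of 256 are illustrative choices, not a tuned implementation.
"""
function matmul_blocked(A, B, bs=256)
    m, p = size(A)
    p, n = size(B)
    C = zeros(m, n)
    for jj = 1:bs:n, kk = 1:bs:p, ii = 1:bs:m
        J = jj:min(jj+bs-1, n)
        K = kk:min(kk+bs-1, p)
        I = ii:min(ii+bs-1, m)
        # Each block update is itself a small matrix-matrix product
        C[I, J] += A[I, K] * B[K, J]
    end
    C
end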