Playing with performance in Julia
In this notebook, we are going to look at the performance of a few different implementations of matrix-vector and matrix-matrix products in Julia. This serves three purposes:
It will give you a taste of Julia programming.
It will show how there are several different ways of thinking about even simple operations.
It will illustrate the performance advantages of using matrix-matrix products (implemented by someone else!) as a building block for later linear algebra algorithms.
Matrix-vector products
We'll start by multiplying matrices by vectors. First, though, it's useful to understand how to write matrix and vector literals in Julia! Array literals (matrices and vectors) are surrounded by square brackets; elements within a row are separated by spaces (horizontal concatenation) and rows are separated by semicolons (vertical concatenation). For example, let's build a little matrix and a column vector and compute their product.
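Concretely, the cells might look something like this (a minimal sketch; the variable names A and x are the ones used in the discussion below):

```julia
A = [1.0 2.0; 3.0 4.0]   # 2×2 matrix: spaces separate entries in a row, semicolon starts a new row
x = [5.0; 6.0]           # column vector
A*x                      # matrix-vector product
```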
2×2 Matrix{Float64}:
 1.0  2.0
 3.0  4.0

2-element Vector{Float64}:
 5.0
 6.0

2-element Vector{Float64}:
 17.0
 39.0
What happens under the covers of the expression A*x? There are actually a couple of different ways of thinking about multiplying a matrix by a vector. One version is that each entry of the result is a row of the matrix times the vector; this is the row-oriented or dot-product formulation. Another version is that the result is a linear combination of the columns of $A$, where the weights in the linear combination are the entries of the vector $x$; this gives us a column-oriented view. If we break everything down into low-level for loops, we end up with very similar code for the two approaches; the only thing that changes is the loop order.
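A sketch of the two variants (a reconstruction based on the description above; details like the element type may differ):

```julia
# Row-oriented (dot-product) version: for each i, the inner loop over j
# accumulates the dot product of row i of A with x.
function matvec_row(A, x)
    m, n = size(A)
    y = zeros(m)
    for i = 1:m, j = 1:n
        y[i] += A[i,j]*x[j]
    end
    y
end

# Column-oriented version: for each j, add x[j] times column j of A into y.
function matvec_col(A, x)
    m, n = size(A)
    y = zeros(m)
    for j = 1:n, i = 1:m
        y[i] += A[i,j]*x[j]
    end
    y
end
```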
matvec_row (generic function with 1 method)

2-element Vector{Float64}:
 17.0
 39.0

matvec_col (generic function with 1 method)

2-element Vector{Float64}:
 17.0
 39.0
Measuring performance
Now we have three different ways of computing the same matrix-vector product: the functions matvec_row and matvec_col and the built-in * operation. What is the speed difference between the three?
It doesn't make sense to run timings on 2-by-2 matrices – we really want something bigger. It turns out that interesting things happen to timings for matrix sizes that are near multiples of large powers of 2, for reasons that have to do with the cache hardware, so let's start with a list of test sizes clustered around multiples of 128.
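One compact way to build such a list (a sketch; the name ns is an assumption):

```julia
# Sizes just below, at, and just above each multiple of 128 up to 1024.
ns = vec([128*k + d for d in -1:1, k in 1:8])
```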
127, 128, 129, 255, 256, 257, 383, 384, 385, 511, 512, 513, 639, 640, 641, 767, 768, 769, 895, 896, 897, 1023, 1024, 1025
Now we want a timing harness. The BenchmarkTools package in Julia is useful for this, but for our purposes we'll stick with the built-in @elapsed macro to see how long a task takes. We also use the fact that functions are first-class objects, so we can pass the specific multiplication routine to be timed as an argument to the timing harness function.
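A sketch of what such a harness might look like (an n-by-n matrix-vector product does about 2n² flops; a more careful harness would average several runs):

```julia
# Time one matrix-vector product of size n and report the rate in Gflop/s.
function matvec_gflops(matvec, n)
    A = rand(n, n)
    x = rand(n)
    matvec(A, x)                 # warm-up call so we don't time compilation
    t = @elapsed matvec(A, x)
    2n^2 / t / 1e9               # roughly 2n^2 floating-point operations
end
```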
matvec_gflops (generic function with 1 method)
Now let's actually do the timings to get a Gflop rate (billions of floating-point operations per second) for each of the different methods for each matrix size. We can capture all of the results in an array without an explicit loop using an array comprehension. Also notice that to time the built-in routine, we use the anonymous function syntax (aka "lambda function" syntax) in Julia.
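For example, the comprehension might look like this (a sketch; the names rates and ns are assumptions, and the result has one row per method and one column per size):

```julia
# Gflop rates: rows are the three methods, columns follow the test sizes in ns.
rates = [matvec_gflops(f, n) for f in [matvec_row, matvec_col, (A, x) -> A*x], n in ns]
```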
Gflop rates for each method (columns follow the list of test sizes above):

matvec_row:      2.69253  2.69974  2.7363  2.33188  2.32023  2.28154  2.24791  2.18164  2.25071  2.18685  2.01554  2.16312  2.07435  2.17865  2.11238  2.08599  2.14502  2.02236  2.09427  2.08035  2.14941  2.08135  1.20196  1.99499
matvec_col:      4.24314  4.25128  4.23793  4.67305  4.60972  4.72296  4.7429  4.8531  4.62862  4.81059  4.87066  4.71287  4.80298  4.85792  4.68712  4.80784  4.88398  4.79263  4.7742  4.8313  4.85632  4.83404  4.74925  4.63107
built-in A*x:    11.9548  12.004  11.5722  11.0632  11.3406  11.0982  10.4462  10.7119  11.171  11.1419  11.2911  11.0954  11.2843  11.2135  11.0072  11.1501  11.4961  11.3448  11.2271  10.5431  11.155  11.2213  10.7975  10.8114
There is a pretty serious difference in performance here! On this machine, the row-oriented code runs at a bit over 2 Gflop/s; the column-oriented version gets up to around 4-5; and the built-in is up around 11! Of course, there is variation depending on the matrix sizes.
Matrix-matrix multiply
Just as we can get different algorithms for matrix-vector products using different loop orderings, we can also get different algorithms for matrix-matrix multiplication. Here, though, we have three indices, and so there are six possible orderings of the loops! Let's consider two of them: one oriented toward inner products, the other oriented toward outer products.
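A sketch of the two orderings (again a reconstruction; the suffixes name the loop order from outermost to innermost):

```julia
# ijk order: the innermost loop over k forms the inner (dot) product of
# row i of A with column j of B.
function matmul_ijk(A, B)
    m, p = size(A)
    _, n = size(B)
    C = zeros(m, n)
    for i = 1:m, j = 1:n, k = 1:p
        C[i,j] += A[i,k]*B[k,j]
    end
    C
end

# kji order: for each k, add the outer product of column k of A with row k of B;
# the innermost loop runs down columns, matching Julia's column-major storage.
function matmul_kji(A, B)
    m, p = size(A)
    _, n = size(B)
    C = zeros(m, n)
    for k = 1:p, j = 1:n, i = 1:m
        C[i,j] += A[i,k]*B[k,j]
    end
    C
end
```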
matmul_ijk (generic function with 1 method)
matmul_kji (generic function with 1 method)
We again write a uniform timing harness for the three codes.
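A sketch under the same assumptions as before (an n-by-n matrix-matrix product does about 2n³ flops):

```julia
# Time one n-by-n matrix-matrix product and report the rate in Gflop/s.
function matmul_gflops(matmul, n)
    A = rand(n, n)
    B = rand(n, n)
    matmul(A, B)                 # warm-up call so we don't time compilation
    t = @elapsed matmul(A, B)
    2n^3 / t / 1e9               # roughly 2n^3 floating-point operations
end
```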
matmul_gflops (generic function with 1 method)
Now the difference in timing between the two naively coded versions is still big, but the gap between our naive codes and the built-in is huge! Indeed, it takes a lot of detailed fiddling to write a matrix-matrix multiply that runs this fast, and that work changes from machine to machine. But people do spend that time, as we can see here – and they do so because once we have a fast matrix-matrix multiply, we can use it as a building block for other fast linear algebra codes by thinking about matrices in a "block" fashion. We will see this repeatedly over the next couple of months.
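The timing comprehension has the same shape as before (again a sketch with assumed names; one row per method, one column per size):

```julia
# Gflop rates for the two loop orderings and the built-in * across the test sizes.
rates_mm = [matmul_gflops(f, n) for f in [matmul_ijk, matmul_kji, (A, B) -> A*B], n in ns]
```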
Gflop rates for each method (columns follow the list of test sizes above):

matmul_ijk:      0.571575  0.571643  0.571884  0.577761  0.571945  0.566034  0.573173  0.569271  0.568995  0.566458  0.549464  0.565076  0.559691  0.553152  0.558069  0.56526  0.558081  0.567807  0.570227  0.564013  0.57084  0.562948  0.515153  0.56703
matmul_kji:      2.18057  2.18783  2.69892  3.18282  3.03315  3.10851  3.13942  3.18277  3.23578  3.02289  3.03604  3.03555  2.996  2.85529  2.8984  2.79002  2.94721  2.96753  2.66007  2.94652  2.93084  2.95061  2.76951  2.9084
built-in A*B:    10.6451  44.2204  35.9128  36.8919  41.6114  40.7085  41.4314  45.2864  43.0746  44.8529  46.4599  45.7491  46.2691  46.2951  45.9096  45.4504  46.4833  46.6703  45.9154  43.6015  46.4856  47.3433  47.5999  47.0725