In this walkthrough, we are going to set up a virtual development machine on GCP, install some profiling support, and see what we can see about the performance of a few toy codes.

Before we get started, you should set up a Linux VM on GCP. I recommend an e2-micro instance with the stock Debian 10 (Buster) OS and a 10 GB drive. This doesn’t include a lot of amenities, so you will probably want to grab a few packages before we begin: compilers, profilers, and tools. Here is how I set things up

  sudo apt-get install build-essential
  sudo apt-get install llvm
  sudo apt-get install clang
  sudo apt-get install git
  sudo apt-get install google-perftools
  sudo apt-get install valgrind

Performance counters and virtual despair

If you have access to a physical machine with no virtualization, and you are able to get root access, you can use performance counters to tell things like how many cache misses you suffered, how many misalignment penalties, and so forth. There are a few libraries that make use of these performance counters, including the Linux perf system and likwid (Like I Know What I’m Doing). There are also some very nice commercial packages like the Intel Parallel Studio (big chunks are available gratis in educational settings).

Unfortunately, the virtualization on the Google Compute Engine is set up not to share the hardware performance counters. I believe this is because it is considered a security risk; nonetheless, it’s annoying for fine-grain profiling work.

You should feel free to install these tools on your own systems, but we are going to assume they are off-limits for the rest of this walkthrough.

Google PerfTools

The Google PerfTools package (previously known as gperftools) is a simple sampling profiler. It works pretty well, but you have to know how to use it. To run the profiler, you have to run the code with the profiling library. This generates a profiling output file that you can look at with the viewer tool.

Using our centroid demo as an example, we the following line to collect the profiling information

  CPUPROFILE=/tmp/profile \
    LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libprofiler.so.0 \
    ./centroid.x

Yes, it’s a bit of a mouthful. The LD_PRELOAD variable tells the linker that we want to include the profiling library, and CPUPROFILE is the name of the file that we want to write data into. We don’t need to do anything at compile time to make this work; it is all at run time.

Once we have run the profile, we have to take a look at the output. We will do this with the profile viewer

  google-pprof ./centroid.x /tmp/profile

Details of viewer subcommands are listed). You will typically want to compile with debugging symbols (-g) in order to get the most insight possible out of this exercise.

An old favorite: gprof

An alternate way to collect profiling information is by compiling your code with the -pg flag in GCC. Once this is done, any run will produce a data file called gmon.out, which you can then view with the gprof profiler.

  gprof ./centroid.x

This shows a summary of the time spent in various functions. Note that if you use the default GCC version and stay with optimization level O2, some of the functions (particularly centroid1 and centroid3) are likely to be inlined on your behalf. This is a fine optimization, but it’s awful for understanding a profile.

It’s good to know about gprof, but we will find that we are somewhat limited in our ability to use it this semester, as it is not thread-safe.

Cachegrind

The valgrind program is really a suite of different tools that dynamically instrument a code and run it in partial simulation in order to figure out things about it. Maybe the most well-known of the valgrind tools is the memory checker, but you can also try using cachegrind to reason about potential cache issues in your code. However, cachegrind works by emulating a CPU and associated caches, and this is excruciatingly slow. It’s fine for timing short things that will be run a number of times, but it would kill us on a long run.

LLVM Machine Code Analyzer

If we want to dig a bit deeper, the LLVM Machine Code Analyzer (llvm-mca) can help us understand where the code that we’ve written actually is able to make good use of machine resources. The llvm-mca program runs a static analysis on the assembly code coming out from our favorite compiler (GCC or CLang) and tells us where we are going to have bad latencies and low throughputs.

Unfortunatey, this is pretty indecipherable stuff, at least at first! If you decide to play around with this at all, you will probably want to look at only a small segment of your code.

Above and beyond

The tools described above all tell us some variant of how long we are spending in one part of the code or another, potentially along with some auxiliary information about cache misses or empty pipeline cycles. This is notably not the same as knowing what changes will actually help us achieve good performance! One of the cooler pieces of profiling work that I’ve seen in the past few years addresses exactly that. The coz profiler out of UMass Amherst runs a “what if” experiment for each line of code of interest, asking “what would happen if this line took some percentage less time?” This helps us figure out which part of our code is most likely to really matter to the program performance.

Because coz is a Debian package, we only need one line to install it and start playing around:

  sudo apt-get install coz-profiler

The paper is worth reading, by the way. Or, if you don’t want to read the paper, the talk video is also pretty good.