Computing a Distance Matrix: Measuring Cache Utilization


In this exercise, we will measure the cache utilization of our distance matrix programs.

We will use the perf tool, which reads hardware performance counters to profile an entire process. For instance, to profile ls, run the command: perf stat ls. The output lists several performance counter statistics.

Note that in the description below, perf profiles the entire program. Since our program spends the vast majority of the time computing the distance matrix, using perf in this manner is suitable for this scenario. However, in programs where we would like to measure the performance of a section of code, the method below will not be suitable.

The tiled distance matrix solution should perform better than the row-wise solution. In this exercise, we will use perf to measure the fraction of cache misses for the row-wise and tiled distance matrix solutions. This will allow us to confirm (or refute) that the tiled solution has better cache reuse. We will count the cache-references and cache-misses events, which are selected with the -e option of perf stat.
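As a point of reference, the percentage that perf stat prints next to the cache-misses count is the ratio of the two counters. A sketch of that relationship, where $m$ and $r$ are symbols introduced here (not part of the original exercise) for the cache-misses and cache-references counts of a single process:

$$
\text{cache misses (\%)} = 100 \times \frac{m}{r}
$$

A lower value means that a larger share of cache references were served from the cache, which is what better cache reuse should look like.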

Test this by profiling ls as follows: /usr/bin/perf stat -B -e cache-references,cache-misses ls.

Running Perf and Collecting Performance Measurements

You will not program anything new. Rather, in this exercise, you will collect information regarding performance.

Create a new job script that measures performance by inserting /usr/bin/perf stat -B -e cache-references,cache-misses before the binary file (shown below). Depending on how you have implemented your job scripts, you may not be using the -n option that appears in the line below.

srun -n1 /usr/bin/perf stat -B -e cache-references,cache-misses /path/to/bin/ ...

In your solution, you must perform the following tasks:

  • Using perf, profile the scalability of the row-wise and tiled distance matrix solutions.
    • Regarding the tiled solution, use the best value of $b$ as before. You can copy and paste or append your job script(s) from the previous executions and insert the perf command as shown above.
    • Note that perf will profile each process separately. Therefore, when you run $p=20$ ranks, you will get 20 outputs from perf. Report the average percentage of cache misses across those outputs in the table below.
    • perf outputs to stderr. Make sure that you check your stderr file (defined by #SBATCH --error=), and not your stdout file.
  • Complete the template table below, which includes the fraction of cache misses as output by perf.
| # of Ranks ($p$) | % Cache Misses (Row-wise Distance Matrix) | % Cache Misses (Tiled Distance Matrix) | Job Script Name (*.sh) |
|---|---|---|---|
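Averaging 20 perf outputs by hand is error-prone, so the step can be scripted. The following is a minimal sketch, not a required part of the exercise. It assumes that your version of perf annotates each cache-misses line with a percentage followed by a standalone % sign (for example, "# 25.881 % of all cache refs"); this format can differ between perf versions, so check your stderr file first. The function name avg_cache_miss_pct is hypothetical, and the result is the unweighted mean of the per-rank percentages, which is one reasonable reading of "average" here.

```shell
#!/bin/bash
# Sketch: average the per-rank cache-miss percentages found in a perf stat
# stderr file. ASSUMPTION: each cache-misses line carries an annotation such as
#   "#   25.881 % of all cache refs"
# where the percentage and the % sign are separate whitespace-delimited fields.
# avg_cache_miss_pct is a hypothetical helper name, not part of perf or Slurm.
avg_cache_miss_pct() {
    awk '
        /cache-misses/ {
            # Look for a standalone % field; the field before it is the value.
            for (i = 2; i <= NF; i++) {
                if ($i == "%") { sum += $(i - 1); n++ }
            }
        }
        END {
            if (n > 0) {
                printf "%.3f\n", sum / n
            } else {
                print "no cache-miss percentages found" > "/dev/stderr"
                exit 1
            }
        }
    ' "$1"
}
```

To use it, source the script and pass the file named by #SBATCH --error=, for example avg_cache_miss_pct my_job.err, where my_job.err is a placeholder for your own stderr file. If the function reports that no percentages were found, your perf version likely formats the line differently and the pattern needs adjusting.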

Answer the following questions:

  • Q9: Examining the measured percentage of cache misses in the table, does the tiled solution improve cache reuse?