# Exercises 4

Deadline 2026-01-06 23:59

All filenames refer to the course machine g0.complang.tuwien.ac.at.
You should perform your measurements on this machine.

## Submission

Following the Setup below will give you a directory `4` in a working
directory of your git project, containing `Exercises4.txt` and other
files.

Answer the questions by editing `Exercises4.txt` with a text editor,
inserting parts of the transcript of your shell session(s) where
needed.  Don't forget to commit and push the directory and the
relevant files before the deadline.

Please do not just answer the questions, but document how you arrived
at the answers (this is also useful for preparing yourself for the
interview part of the exam later in the semester).  A good way to do
that for some questions is to provide transcripts of your shell
sessions.  You can either cut-and-paste the text from the terminal
session, or use `script -f <filename>` to write the transcript to the
named file (and then copy parts of this transcript into your
`Exercises4.txt`).

## Setup

`cd` into a working directory of your git project for Efficient
Programs on the course machine.  Perform

````
unzip /nfs/a5/anton/pub/anton/lvas/efficient/25/4.zip
git add 4
cd 4
make
````

The directory `4` contains `Exercises4.txt` (fill in the answers
there), the files mentioned below, and a `Makefile`.  Run `make` after
changing a C file and before measuring the binary built from that C
file.

The goal of this exercise is to give you hands-on experience with the
potential performance benefits of SIMD and with the limits of
auto-vectorization and manual vectorization.

There are source files `*X*.c` and `*X*m.c` that `make` compiles to
binaries `*X*-*V*-*C*`, where `*V*` indicates the compilation options
for `*X*.c`:

```
*V*       flags
noauto -Wall -O3 -march=native -mtune=znver4 -ffast-math -fno-tree-vectorize
auto   -Wall -O3 -march=native -mtune=znver4 -ffast-math
manual -Wall -O3 -march=native -mtune=znver4 -ffast-math
```

The meaning of the flags is:

```
-Wall  warn about potential bugs.
-O3    turn on lots of optimizations, including auto-vectorization.
-march=native allow all instructions for the current machine;
              on the course machine that includes AVX-512.
-mtune=znver4 Tune for a Zen4 (actually use AVX-512 instructions).
              For some reason, gcc avoids AVX-512 with -mtune=native.
-ffast-math   This allows some transformations that may change the results.
              In particular, it allows reassociating + and *,
              even though FP + and * are not associative.
-fno-tree-vectorize  Disable auto-vectorization.
```

For the `manual` binaries, `make` compiles and links a source file
`*X*-manual.c` instead of `*X*.c`.  This source file uses the GNU C
[Vector
extensions](https://gcc.gnu.org/onlinedocs/gcc-14.3.0/gcc/Vector-Extensions.html)
(which are also supported by clang) to manually vectorize all but a
tail of <16 iterations of the code.
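
As an illustration of the technique (a sketch with hypothetical names
and a hypothetical computation, not the actual contents of any
`*X*-manual.c`), a manually vectorized loop with a scalar tail for the
remaining <16 iterations might look like this:

```c
#include <stddef.h>

/* 16 floats = 64 bytes, the width of one AVX-512 register.  On
   targets without AVX-512, gcc and clang lower operations on this
   type to narrower SIMD instructions or scalar code. */
typedef float v16sf __attribute__ ((vector_size (64)));

/* y[i] += a*x[i]; the function name and computation are hypothetical. */
void saxpy(size_t n, float a, const float *x, float *restrict y)
{
  size_t i;
  for (i = 0; i + 16 <= n; i += 16) {           /* vectorized main loop */
    v16sf vx, vy;
    __builtin_memcpy(&vx, x + i, sizeof vx);    /* unaligned load */
    __builtin_memcpy(&vy, y + i, sizeof vy);
    vy += a * vx;        /* scalar a is broadcast to all 16 lanes */
    __builtin_memcpy(y + i, &vy, sizeof vy);    /* unaligned store */
  }
  for (; i < n; i++)                    /* scalar tail, <16 iterations */
    y[i] += a * x[i];
}
```

The `memcpy` calls are a portable way to express unaligned vector
loads and stores; gcc and clang compile them to single vector move
instructions.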

`*C*` indicates the compiler used: `gcc` or `clang`.  These compilers
produce different code with these flags, and sometimes one produces
faster code, sometimes the other.

To present measurement results from `perf stat`, I recommend that you
perform

````
export LC_NUMERIC=prog
````

before performing the measurements (or prepend `LC_NUMERIC=prog ` to
your `perf stat` commands), for more readable numbers.
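
For example, a measurement command might look like this (the binary
name follows the naming scheme above; the exact events you count are
up to you):

````
LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u ./dp1-auto-gcc
````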

## Q1 Speedup from vectorizing a data-parallel loop

`dp1-*V*-*C*` performs ~2M data-parallel computations on vectors of
n=1..999 4-byte elements, with 2*n FP ops, for a total of ~2G FP ops.
The SIMD width of AVX-512 is 64 bytes (16 elements).

What is the speedup (measured in cycles) of the best `dp1-auto-*C*`
variant over the best `dp1-noauto-*C*` variant?  What is the speedup
of the best `dp1-manual-*C*` variant over the best
`dp1-noauto-*C*` variant?  What is the number of FP ops per second
(also known as FLOPS) of the programs you used for the answers above
(if you assume that the number of FP ops is 2G, the answer is close
enough)?
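
To make the required arithmetic concrete, here is a sketch with
made-up example numbers (all values below are assumptions for
illustration; substitute your own `perf stat` results, and derive the
clock frequency from the reported cycle count and elapsed time):

```c
/* All numbers are made up for illustration, not measurements. */
static const double cycles_noauto = 2.0e9; /* best dp1-noauto cycle count */
static const double cycles_auto   = 2.5e8; /* best dp1-auto cycle count   */
static const double fp_ops        = 2.0e9; /* ~2G FP ops, as stated above */
static const double hz            = 4.5e9; /* assumed clock frequency     */

/* speedup in cycles: noauto cycles / auto cycles */
double speedup(void) { return cycles_noauto / cycles_auto; }

/* FLOPS: FP ops divided by the runtime in seconds (cycles / Hz) */
double flops(void) { return fp_ops / (cycles_auto / hz); }
```

With these example numbers, the speedup is 8 and the FLOPS figure is
36 GFLOPS.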

## Q2 What obstructs vectorization?

`dp2.c` and `dp3.c` are copies of `dp1.c` (and likewise for
`dp[123]m.c`).  Change `dp2.c` such that auto-vectorization does not
work with gcc.  Change `dp3.c` such that auto-vectorization does not
work with clang.  Ideally, these changes should be minimal; they are
allowed to produce different results from `dp1.c`.  To prevent
too-trivial solutions, the resulting programs should consume at least
150M cycles.  How do you determine whether auto-vectorization works?
You could look at the generated machine code, but that would require
you to analyze this code, and you could be misled by partial
auto-vectorization (probably not for this example, but in general).
Here we use the criterion that for compiler `*C*`, `dp[23]-auto-*C*`
is not more than a factor 1.5 faster than `dp[23]-noauto-*C*`.
Present the changes (shown with `diff -u dp1.c dp2.c` and likewise for
`dp3.c`) and the cycle numbers.

What change prevents auto-vectorization by gcc?
What change prevents auto-vectorization by clang?

## Q3 When does gcc auto-vectorization fail?

`dp4.c`, `dp4m.c`, and `dp4-manual.c` are copies of the corresponding
`dp1` files.  Change `dp4.c` in a way that you can still manually
vectorize (i.e., where you can change `dp4-manual.c` in a way that
produces the same results), but where auto-vectorization of gcc does
not work (using the speedup criterion from Q2 for determining the
success and failure of auto-vectorization and manual vectorization).
You can introduce additional parameters and change `dp4m.c`
correspondingly (including checking the outputs of the function in
`dp4.c`); the results of the `dp4` programs are allowed to be different
from the `dp1` programs.  I found it helpful to first change `dp4.c` and
examine the compiler output with

```
make dp4-auto-gcc.o && objdump -d dp4-auto-gcc.o
```

before making the corresponding changes to `dp4-manual.c` and
`dp4m.c`.

If you like a challenge, declare all the pointer/array parameters that
the function in `dp4.c` writes to as `restrict` (like the original
function does for `y`).

Please show a loop (extracted from `dp4.c`) that is vectorizable
(i.e., has equivalent manually vectorized code in `dp4-manual.c`),
but that gcc does not auto-vectorize.  Include the measurements that
indicate the non-vectorization of `dp4-auto-gcc` and the
vectorization of `dp4-manual-gcc`.

Note: The corresponding task for clang is not included, because for
simple data-parallel loops clang auto-vectorizes them even with a
large number of non-`restrict`ed written-to arrays (checked by looking
at the machine code, not by measuring the performance).  However, I
have seen clang's auto-vectorizer fail at other vectorizable code.
Unfortunately, that code cannot be expressed manually with the GNU C
vector extensions.

<!---
## Q4 When does clang auto-vectorization fail?

`dp5.c`, `dp5m.c`, and `dp5-manual.c` are copies of the corresponding
`dp1` files.  Use them for determining the limit of auto-vectorization
for clang: Change `dp5.c` in a way that you can still manually
vectorize (i.e., where you can change `dp5-manual.c` in a way that
produces the same results), but where auto-vectorization of clang does
not work (using the speedup criterion from Q2 for determining the
success and failure of auto-vectorization and manual vectorization).
You can introduce additional parameters and change `dp5m.c`
correspondingly (including checking the outputs of the function in
`dp5.c`).  Note that, in my experience, gcc and clang
auto-vectorization fails for very different programs.  I found it
helpful to first change `dp5.c` and examine the compiler output with

```
make dp5-auto-clang.o && objdump -d dp5-auto-clang.o
```

before making the corresponding changes to `dp5-manual.c` and
`dp5m.c`.

If you like a challenge, declare all the pointer/array parameters that
the function in `dp5.c` writes to as `restrict` (like the original
function does for `y`).

Show a program that clang does not auto-vectorize and that you can
vectorize manually.  Show the speedups of auto-vectorization and
manual vectorization.
--->

## Q4 Speedup for a reduction

`red1-*V*-*C*` performs ~1M reductions (input: vectors, output:
scalars) on vectors of n=1..1000 4-byte elements, with 2*n FP ops, for
a total of ~1G FP ops.

What is the speedup (measured in cycles) of the best `red1-auto-*C*`
variant over the best `red1-noauto-*C*` variant?  What is the number
of FP ops per second (also known as FLOPS) of the programs you used
for the answers above (if you assume that the number of FP ops is 1G,
the answer is close enough)?

Bonus question (food for thought, not as deliverable): Why are the
FP ops per cycle for the noauto variants worse than in dp1.c?  Why do
you not see the same effect for the auto variants?
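
For thinking about the bonus question, the general shape of such a
reduction may help (a generic sketch, not the actual code in
`red1.c`): the accumulator forms a loop-carried dependence chain, and
reassociation (allowed by `-ffast-math`) is what lets vectorized code
split it into independent partial sums.

```c
#include <stddef.h>

/* Generic dot-product reduction (a sketch, not the actual red1.c).
   Each iteration's += depends on the previous one, so without
   reassociation the additions execute serially, one FP-add latency
   apart; vectorization splits s into independent partial sums. */
float dot(size_t n, const float *x, const float *y)
{
  float s = 0.0f;
  for (size_t i = 0; i < n; i++)
    s += x[i] * y[i];
  return s;
}
```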

## Q5 Speedup for a generation

`gen1-*V*-*C*` performs ~1M generations (input: scalars, output:
vectors) of vectors of n=1..1000 4-byte elements.

What is the speedup (measured in cycles) of the best `gen1-auto-*C*`
variant over the best `gen1-noauto-*C*` variant?

## Q6 Speedup for data-parallel loop with conditional execution

`cond1-*V*-*C*` performs ~2M loops, each of which conditionally
performs 2 FP ops per iteration on 1..1000 4-byte elements of the
associated arrays.

What is the speedup (measured in cycles) of the best `cond1-auto-*C*`
variant over the best `cond1-noauto-*C*` variant?

Bonus question (food for thought, not as deliverable): Why are these
programs slower than the `dp1` programs, even though they perform
fewer useful FP ops?  In particular, why are the `noauto` programs so
much slower?
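
The general shape of such a loop, for reference (a sketch with a
hypothetical condition and computation, not the actual code in
`cond1.c`): a scalar build executes the condition as a data-dependent
branch, while a vectorized build can evaluate it as a per-lane mask
and use predicated or blended stores.

```c
#include <stddef.h>

/* Sketch of a data-parallel loop with conditional execution (the
   condition and computation are hypothetical, not from cond1.c). */
void cond_axpy(size_t n, float a, const float *x, float *restrict y)
{
  for (size_t i = 0; i < n; i++)
    if (x[i] > 0.0f)        /* scalar: branch; SIMD: per-lane mask */
      y[i] += a * x[i];
}
```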

## Q7 Vectorization overhead

`dp5-*V*-*C*` takes an argument *n* and performs 2M loops of a
data-parallel computation on vectors of m=1..2*n-1 elements (average
n; 4m FP ops, 3m loads, and 3m stores).

Which is the first *n* for which `dp5-auto-*C*` is a factor >=1.05
faster than the fastest `dp5-noauto-*C*` (speed measured in cycles)?
Only show the measurements for this *n* in `Exercises4.txt`, and the
others in a transcript that you also include.

Which is the last *n* for which `dp5-auto-*C*` is a factor <1.05
faster than the fastest `dp5-noauto-*C*` (speed measured in cycles)?
Only show the measurements for this *n* in `Exercises4.txt`, and the
others in a transcript that you also include.

Is there an *n* for which `dp5-noauto-*C*` is a factor >=1.05 faster
than the fastest `dp5-auto-*C*` (speed measured in cycles)?  If so,
present the measurements for that *n*.

## Q8 Effect of `restrict`

`dp6-*V*-*C*` performs the same computation as `dp5-*V*-*C*`, but the
written-to pointers in `axpy1()` are not defined as `restrict`.

For *n*=3, what is the speedup (measured in cycles) of the best
`dp5-noauto-*C*` over the best `dp6-noauto-*C*`?  For *n*=3, what is
the speedup of the best `dp5-auto-*C*` over the best `dp6-auto-*C*`?

Which is the first *n* for which `dp6-auto-*C*` is a factor >=1.05
faster than the fastest `dp6-noauto-*C*` (speed measured in cycles)?
Which is the last *n* for which `dp6-auto-*C*` is a factor <1.05
faster than the fastest `dp6-noauto-*C*` (speed measured in cycles)?
Which is the last *n* for which `dp6-noauto-*C*` is a factor >=1.05
faster than the fastest `dp6-auto-*C*` (speed measured in cycles)?

Note: If you define a pointer as `restrict` (i.e., you promise that
all memory accessed through this pointer is accessed only through this
pointer), your program actually must not access the same memory in
another way.  Based on this promise, the compiler may reorder the
memory accesses through this pointer relative to other memory
accesses, and if those other accesses access the same memory, this
reordering may lead to incorrect results.
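
A minimal sketch of this hazard (hypothetical function, not from these
exercises):

```c
#include <stddef.h>

/* restrict on dst promises that the elements written through dst are
   not accessed in any other way during this call, so the compiler may
   vectorize and reorder the loads from src and the stores to dst. */
void add1(size_t n, float *restrict dst, const float *src)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = src[i] + 1.0f;
}

/* OK:  add1(n, a, b) with arrays a and b that do not overlap.
   Bad: add1(n - 1, a + 1, a) breaks the promise, because memory
   written through dst is also read through src; with the accesses
   reordered, the result can differ from the scalar order (and the
   behavior is undefined). */
```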

