# Exercises 4

Deadline 2026-01-06 23:59

All filenames refer to the course machine g0.complang.tuwien.ac.at.
You should perform your measurements on this machine.

## Submission

Following the Setup below will give you a directory `4` in a working
directory of your git project, containing `Exercises4.txt` and other
files.

Answer the questions by editing `Exercises4.txt` with a text editor,
inserting parts of the transcript of your shell session(s) where
needed.  Don't forget to commit and push the directory and the
relevant files before the deadline.

Please do not just answer the questions, but document how you arrived
at the answers (this is also useful for preparing yourself for the
interview part of the exam later in the semester).  A good way to do
that for some questions is to provide transcripts of your shell
sessions.  You can either cut-and-paste the text from the terminal
session, or use `script -f <filename>` to write the transcript to the
named file (and then copy parts of this transcript into your
`Exercises4.txt`).

## Setup

`cd` into a working directory of your git project for Efficient
Programs on the course machine.  Perform

````
unzip /nfs/a5/anton/pub/anton/lvas/efficient/25/4.zip
git add 4
cd 4
make
````

The directory `4` contains `Exercises4.txt` (fill in the answers
there), the files mentioned below, and a `Makefile`.  Run `make` after
changing a C file and before measuring the binary built from that C
file.

The goal of this exercise is to give you hands-on experience with the
potential performance benefits of SIMD and with the limits of
auto-vectorization and manual vectorization.

There are source files `*X*.c` and `*X*m.c` that `make` compiles to
binaries `*X*-*V*-*C*`, where `*V*` indicates the compilation options
for `*X*.c`:

```
*V*       flags
noauto -Wall -O3 -march=native -mtune=znver4 -ffast-math -fno-tree-vectorize
auto   -Wall -O3 -march=native -mtune=znver4 -ffast-math
manual -Wall -O3 -march=native -mtune=znver4 -ffast-math
```

The meaning of the flags is:

```
-Wall  warn about potential bugs.
-O3    turn on lots of optimizations, including auto-vectorization.
-march=native allow all instructions for the current machine;
              on the course machine that includes AVX-512.
-mtune=znver4 Tune for a Zen4 (actually use AVX-512 instructions).
              For some reason, gcc avoids AVX-512 with -mtune=native.
-ffast-math   This allows some transformations that may change the results.
              In particular, it allows reassociating + and *,
              even though FP + and * are not associative.
-fno-tree-vectorize  Disable auto-vectorization.
```

For the `manual` binaries, `make` compiles and links a source file
`*X*-manual.c` instead of `*X*.c`.  This source file uses the GNU C
[Vector
extensions](https://gcc.gnu.org/onlinedocs/gcc-14.3.0/gcc/Vector-Extensions.html)
(which are also supported by clang) to manually vectorize all but a
tail of <16 iterations of the code.
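
As an illustration of the technique (a sketch with hypothetical names
and a hypothetical computation, not the actual contents of any
`*X*-manual.c`), a manually vectorized loop with a scalar tail for the
remaining <16 iterations might look like this:

```c
#include <stddef.h>

/* 16 floats = 64 bytes, the width of one AVX-512 register.  On
   targets without AVX-512, gcc and clang lower operations on this
   type to narrower SIMD instructions or scalar code. */
typedef float v16sf __attribute__ ((vector_size (64)));

/* y[i] += a*x[i]; the function name and computation are hypothetical. */
void saxpy(size_t n, float a, const float *x, float *restrict y)
{
  size_t i;
  for (i = 0; i + 16 <= n; i += 16) {           /* vectorized main loop */
    v16sf vx, vy;
    __builtin_memcpy(&vx, x + i, sizeof vx);    /* unaligned load */
    __builtin_memcpy(&vy, y + i, sizeof vy);
    vy += a * vx;        /* scalar a is broadcast to all 16 lanes */
    __builtin_memcpy(y + i, &vy, sizeof vy);    /* unaligned store */
  }
  for (; i < n; i++)                    /* scalar tail, <16 iterations */
    y[i] += a * x[i];
}
```

The `memcpy` calls are a portable way to express unaligned vector
loads and stores; gcc and clang compile them to single vector move
instructions.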

`*C*` indicates the compiler used: `gcc` or `clang`.  These compilers
produce different code with these flags, and sometimes one produces
faster code, sometimes the other.

To present measurement results from `perf stat`, I recommend that you
perform

````
export LC_NUMERIC=prog
````

before performing the measurements (or prepend `LC_NUMERIC=prog ` to
your `perf stat` commands), for more readable numbers.
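
For example, a measurement command might look like this (the binary
name follows the naming scheme above; the exact events you count are
up to you):

````
LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u ./dp1-auto-gcc
````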

## Q1 Speedup from vectorizing a data-parallel loop

`dp1-*V*-*C*` performs ~2M data-parallel computations on vectors of
n=1..999 4-byte elements, with 2*n FP ops, for a total of ~2G FP ops.
The SIMD width of AVX-512 is 64 bytes (16 elements).

What is the speedup (measured in cycles) of the best `dp1-auto-*C*`
variant over the best `dp1-noauto-*C*` variant?  What is the speedup
of the best `dp1-manual-*C*` variant over the best
`dp1-noauto-*C*` variant?  What is the number of FP ops per second
(also known as FLOPS) of the programs you used for the answers above
(if you assume that the number of FP ops is 2G, the answer is close
enough)?
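
To make the required arithmetic concrete, here is a sketch with
made-up example numbers (all values below are assumptions for
illustration; substitute your own `perf stat` results, and derive the
clock frequency from the reported cycle count and elapsed time):

```c
/* All numbers are made up for illustration, not measurements. */
static const double cycles_noauto = 2.0e9; /* best dp1-noauto cycle count */
static const double cycles_auto   = 2.5e8; /* best dp1-auto cycle count   */
static const double fp_ops        = 2.0e9; /* ~2G FP ops, as stated above */
static const double hz            = 4.5e9; /* assumed clock frequency     */

/* speedup in cycles: noauto cycles / auto cycles */
double speedup(void) { return cycles_noauto / cycles_auto; }

/* FLOPS: FP ops divided by the runtime in seconds (cycles / Hz) */
double flops(void) { return fp_ops / (cycles_auto / hz); }
```

With these example numbers, the speedup is 8 and the FLOPS figure is
36 GFLOPS.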

## Q2 What obstructs vectorization?

`dp2.c` and `dp3.c` are copies of `dp1.c` (and likewise for
`dp[123]m.c`).  Change `dp2.c` such that auto-vectorization does not
work with gcc.  Change `dp3.c` such that auto-vectorization does not
work with clang.  Ideally, these changes should be minimal; they are
allowed to produce different results from `dp1.c`.  To prevent
too-trivial solutions, the resulting programs should consume at least
150M cycles.  How do you determine whether auto-vectorization works?
You could look at the generated machine code, but that would require
you to analyze this code, and you could be misled by partial
auto-vectorization (probably not for this example, but in general).
Here we use the criterion that for compiler `*C*`, `dp[23]-auto-*C*`
is not more than a factor 1.5 faster than `dp[23]-noauto-*C*`.
Present the changes (shown with `diff -u dp1.c dp2.c` and likewise for
`dp3.c`) and the cycle numbers.

What change prevents auto-vectorization by gcc?
What change prevents auto-vectorization by clang?

## Q3 When does gcc auto-vectorization fail?

`dp4.c`, `dp4m.c`, and `dp4-manual.c` are copies of the corresponding
`dp1` files.  Change `dp4.c` in a way that you can still manually
vectorize (i.e., where you can change `dp4-manual.c` in a way that
produces the same results), but where auto-vectorization of gcc does
not work (using the speedup criterion from Q2 for determining the
success and failure of auto-vectorization and manual vectorization).
You can introduce additional parameters and change `dp4m.c`
correspondingly (including checking the outputs of the function in
`dp4.c`); the results of the `dp4` programs are allowed to be different
from the `dp1` programs.  I found it helpful to first change `dp4.c` and
examine the compiler output with

```
make dp4-auto-gcc.o && objdump -d dp4-auto-gcc.o
```

before making the corresponding changes to `dp4-manual.c` and
`dp4m.c`.

If you like a challenge, declare all the pointer/array parameters that
the function in `dp4.c` writes to as `restrict` (like the original
function does for `y`).

Please show a loop (extracted from `dp4.c`) that is vectorizable
(i.e., has equivalent manually vectorized code in `dp4-manual.c`),
but that gcc does not auto-vectorize.  Include the measurements that
indicate the non-vectorization of `dp4-auto-gcc` and the
vectorization of `dp4-manual-gcc`.

Note: The corresponding task for clang is not included, because for
simple data-parallel loops clang auto-vectorizes them even with a
large number of non-`restrict`ed written-to arrays (checked by looking
at the machine code, not by measuring the performance).  However, I
have seen clang's auto-vectorizer fail at other vectorizable code.
Unfortunately, that code cannot be expressed manually with the GNU C
vector extensions.

<!---
## Q4 When does clang auto-vectorization fail?

`dp5.c`, `dp5m.c`, and `dp5-manual.c` are copies of the corresponding
`dp1` files.  Use them for determining the limit of auto-vectorization
for clang: Change `dp5.c` in a way that you can still manually
vectorize (i.e., where you can change `dp5-manual.c` in a way that
produces the same results), but where auto-vectorization of clang does
not work (using the speedup criterion from Q2 for determining the
success and failure of auto-vectorization and manual vectorization).
You can introduce additional parameters and change `dp5m.c`
correspondingly (including checking the outputs of the function in
`dp5.c`).  Note that, in my experience, gcc and clang
auto-vectorization fails for very different programs.  I found it
helpful to first change `dp5.c` and examine the compiler output with

```
make dp5-auto-clang.o && objdump -d dp5-auto-clang.o
```

before making the corresponding changes to `dp5-manual.c` and
`dp5m.c`.

If you like a challenge, declare all the pointer/array parameters that
the function in `dp5.c` writes to as `restrict` (like the original
function does for `y`).

Show a program that clang does not auto-vectorize and that you can
vectorize manually.  Show the speedups of auto-vectorization and
manual vectorization.
--->

## Q4 Speedup for a reduction

`red1-*V*-*C*` performs ~1M reductions (input: vectors, output:
scalars) on vectors of n=1..1000 4-byte elements, with 2*n FP ops, for
a total of ~1G FP ops.

What is the speedup (measured in cycles) of the best `red1-auto-*C*`
variant over the best `red1-noauto-*C*` variant?  What is the number
of FP ops per second (also known as FLOPS) of the programs you used
for the answers above (if you assume that the number of FP ops is 1G,
the answer is close enough)?

Bonus question (food for thought, not as deliverable): Why are the
FP ops per cycle for the noauto variants worse than in dp1.c?  Why do
you not see the same effect for the auto variants?
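
For thinking about the bonus question, the general shape of such a
reduction may help (a generic sketch, not the actual code in
`red1.c`): the accumulator forms a loop-carried dependence chain, and
reassociation (allowed by `-ffast-math`) is what lets vectorized code
split it into independent partial sums.

```c
#include <stddef.h>

/* Generic dot-product reduction (a sketch, not the actual red1.c).
   Each iteration's += depends on the previous one, so without
   reassociation the additions execute serially, one FP-add latency
   apart; vectorization splits s into independent partial sums. */
float dot(size_t n, const float *x, const float *y)
{
  float s = 0.0f;
  for (size_t i = 0; i < n; i++)
    s += x[i] * y[i];
  return s;
}
```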

## Q5 Speedup for a generation

`gen1-*V*-*C*` performs ~1M generations (input: scalars, output:
vectors) of vectors of n=1..1000 4-byte elements.

What is the speedup (measured in cycles) of the best `gen1-auto-*C*`
variant over the best `gen1-noauto-*C*` variant?

## Q6 Speedup for data-parallel loop with conditional execution

`cond1-*V*-*C*` performs ~2M loops, each of which conditionally
performs 2 FP ops per iteration on 1..1000 4-byte elements of the
associated arrays.

What is the speedup (measured in cycles) of the best `cond1-auto-*C*`
variant over the best `cond1-noauto-*C*` variant?

Bonus question (food for thought, not as deliverable): Why are these
programs slower than the `dp1` programs, even though they perform
fewer useful FP ops?  In particular, why are the `noauto` programs so
much slower?
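
The general shape of such a loop, for reference (a sketch with a
hypothetical condition and computation, not the actual code in
`cond1.c`): a scalar build executes the condition as a data-dependent
branch, while a vectorized build can evaluate it as a per-lane mask
and use predicated or blended stores.

```c
#include <stddef.h>

/* Sketch of a data-parallel loop with conditional execution (the
   condition and computation are hypothetical, not from cond1.c). */
void cond_axpy(size_t n, float a, const float *x, float *restrict y)
{
  for (size_t i = 0; i < n; i++)
    if (x[i] > 0.0f)        /* scalar: branch; SIMD: per-lane mask */
      y[i] += a * x[i];
}
```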

## Q7 Vectorization overhead

`dp5-*V*-*C*` takes an argument *n* and performs 2M loops of a
data-parallel computation on vectors of m=1..2*n-1 elements (average
n; 4m FP ops, 3m loads, and 3m stores).

Which is the first *n* for which `dp5-auto-*C*` is a factor >=1.05
faster than the fastest `dp5-noauto-*C*` (speed measured in cycles)?
Only show the measurements for this *n* in `Exercises4.txt`, and the
others in a transcript that you also include.

Which is the last *n* for which `dp5-auto-*C*` is a factor <1.05
faster than the fastest `dp5-noauto-*C*` (speed measured in cycles)?
Only show the measurements for this *n* in `Exercises4.txt`, and the
others in a transcript that you also include.

Is there an *n* for which `dp5-noauto-*C*` is a factor >=1.05 faster
than the fastest `dp5-auto-*C*` (speed measured in cycles)?  If so,
present the measurements for that *n*.

## Q8 Effect of `restrict`

`dp6-*V*-*C*` performs the same computation as `dp5-*V*-*C*`, but the
written-to pointers in `axpy1()` are not defined as `restrict`.

For *n*=3, what is the speedup (measured in cycles) of the best
`dp5-noauto-*C*` over the best `dp6-noauto-*C*`?  For *n*=3, what is
the speedup of the best `dp5-auto-*C*` over the best `dp6-auto-*C*`?

Which is the first *n* for which `dp6-auto-*C*` is a factor >=1.05
faster than the fastest `dp6-noauto-*C*` (speed measured in cycles)?
Which is the last *n* for which `dp6-auto-*C*` is a factor <1.05
faster than the fastest `dp6-noauto-*C*` (speed measured in cycles)?
Which is the last *n* for which `dp6-noauto-*C*` is a factor >=1.05
faster than the fastest `dp6-auto-*C*` (speed measured in cycles)?

Note: If you define a pointer as `restrict` (i.e., you promise that
all memory accessed through this pointer is accessed only through this
pointer), your program actually must not access the same memory in
another way.  Based on this promise, the compiler may reorder the
memory accesses through this pointer relative to other memory
accesses, and if those other accesses access the same memory, this
reordering may lead to incorrect results.
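
A minimal sketch of this hazard (hypothetical function, not from these
exercises):

```c
#include <stddef.h>

/* restrict on dst promises that the elements written through dst are
   not accessed in any other way during this call, so the compiler may
   vectorize and reorder the loads from src and the stores to dst. */
void add1(size_t n, float *restrict dst, const float *src)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = src[i] + 1.0f;
}

/* OK:  add1(n, a, b) with arrays a and b that do not overlap.
   Bad: add1(n - 1, a + 1, a) breaks the promise, because memory
   written through dst is also read through src; with the accesses
   reordered, the result can differ from the scalar order (and the
   behavior is undefined). */
```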

