# Exercises 3

Deadline 2025-12-09 23:59

All filenames refer to g0.complang.tuwien.ac.at.  You should
perform your measurements on this machine.

## Submission

Following the Setup below will give you a directory 3 in a working
directory of your git project, containing `Exercises3.txt`.

Answer the questions by editing `Exercises3.txt` with a text editor,
inserting parts of the transcript where needed.  Don't forget to
commit and push the directory and the relevant files before the
deadline.

Please do not just answer the questions, but document how you arrived
at the answers (this is also useful for preparing yourself for the
interview part of the exam later in the semester).  A good way to do
that for some questions is to provide transcripts of your shell
sessions.  You can either cut-and-paste the text from the terminal
session, or use `script -f <filename>` to write the transcript to the
named file (and then copy parts of this transcript into your
`Exercises3.txt`).

## Setup

`cd` into a working directory of your git project for Efficient
Programs on the course machine.  Perform

```
unzip /nfs/a5/anton/pub/anton/lvas/efficient/25/3.zip
git add 3
```

The directory 3 contains `Exercises3.txt` (fill in the answers there).

## Topic and the program `memory1`

The goal of this exercise is to give you an idea of, and a feeling
for, the performance effects of the memory hierarchy.  Note that the
course machine has a Rocket Lake processor, so the values you will
measure differ from those shown on slide 21 (for a Skylake).

There is a program
`/nfs/a5/anton/pub/anton/lvas/efficient/25/3/memory1`, which performs
100M memory accesses, each dependent on the previous one, following
pointers through a cycle of references (i.e., a cyclic linked list
without payload).  Depending on how this list is set up, you will
measure very different numbers of cycles and other events.

You call this program with

```
/nfs/a5/anton/pub/anton/lvas/efficient/25/3/memory1 random|linear <elements> <stride>
```

`random`
: the elements of the linked list are shuffled such that they are not
visited in linear order.  Various hardware features that depend on
spatial locality will not work in most cases, and hardware prefetchers
will also not help.

`linear`
: the elements of the linked list are visited in increasing order,
with the next element being *stride* bytes away (except that the last
element cycles back to the first one).  Hardware prefetchers and
(depending on the stride) hardware features that benefit from spatial
locality will help.

\<elements\>
: the number of elements in the list (at least 1, at most 1000000 (or
a little more)).

\<stride\>
: the distance from the start of one element to the start of the next
one in memory (not in visiting order in case of `random`).

Use `LC_NUMERIC=prog perf stat -e cycles` to determine the number of cycles spent on
initializing the list and performing these 100M dependent memory
accesses.  The time for initializing the list (and the memory occupied
by it) is small in many cases, but can become significant for large
memory areas; the number of list elements and the randomization also
have their costs, but these are small enough that they do not drown
out the effects of the memory subsystem.

Use additional performance events as appropriate; useful events can
be:

```
L1-dcache-load-misses
l2_rqsts.demand_data_rd_miss (misses from prefetching not counted)
LLC-loads (L3 cache accesses, i.e., L2 cache misses, including prefetches)
LLC-load-misses (L3 cache misses)
longest_lat_cache.miss (L3 cache misses)
dtlb_load_misses.stlb_hit (L1 TLB miss that hits L2)
dTLB-load-misses (L2 TLB miss)
dtlb_load_misses.walk_completed (miss all TLB levels)
```

I have seen large differences between `LLC-load-misses` and
`longest_lat_cache.miss`.  I can only guess at the reason: maybe the
former does not include prefetches, while the latter does.

Remember that percentages in parentheses indicate that the numbers
were produced by multiplexing the limited number of hardware counters
and may be less accurate than without multiplexing.  You can avoid
that by asking for fewer events.  However, this program behaves so
uniformly that the multiplexing may cause less inaccuracy than it does
for more typical programs.

An example:

```
LC_NUMERIC=prog perf stat -e cycles -e L1-dcache-load-misses -e dtlb_load_misses.stlb_hit -e dTLB-load-misses \
/nfs/a5/anton/pub/anton/lvas/efficient/25/3/memory1 random 5000 8
```

produces as output:

```
size: 40000, offsets at 0x0 0x8 0x10 0x18 ..., randomized
[...]
501_793_612      cycles
     35_203      L1-dcache-load-misses
        449      dTLB-load-misses
```

So there are hardly any L1 cache misses and hardly any TLB misses,
i.e., this is the best case, and the 500M cycles for 100M dependent
memory accesses indicate that one memory access has a latency of 5
cycles in that case.

Note that the mapping from virtual to physical memory (which you
cannot influence) can cause significant variations (towards higher
misses) in the results of some of the following experiments.  Perform
your measurements several times; use the lowest numbers of
misses/cycles that you measure for the answers to the following
questions.  That's a useful basis for understanding the performance
effects, while the effects of virtual memory are too arbitrary to make
much use of, except for the following guideline:

If you size your data structures to fit into a specific cache, size
them for one way less than the cache actually has (e.g., for an 8-way
L1 cache with 4KB per way, make your data structure fit into 28KB,
such that only 7 ways of each set are used).

The following questions concentrate on L1 and L2, because the L3 cache
and main memory are shared between CPU cores, so you could see wildly
varying results when different groups perform such experiments at the
same time.  Feel free to experiment in that direction, though.

For the questions about main memory latency, it is expected that some
of you will report higher numbers than others; there is no need to
measure that several times with the hope that you will see a smaller
result the next time.

## Q1 L1 cache size

Increase the number of elements until the number of L1 cache misses
rises significantly.  At what size does this happen?  What is your
estimate of the L1 cache size?

## Q2 Cache line size

Switch to linear mode, and perform measurements with larger numbers of
elements.  You will see a plateau in the number of L1 cache misses.
At this plateau: given 100M accesses with a stride of 8, you get an L1
cache miss every n bytes.  What is n?

Increase the stride to n and adjust elements to 8*current-elements/n.
You should now see 100M L1 cache misses.  For the next questions, use
stride=n, unless otherwise noted.

## Q3 L2 cache size

Design an experiment for determining the L2 cache size and use it to
answer the following question: What is the L2 cache size?

## Q4 L2 cache latency

Design an experiment that has 100M L1 cache misses and ~0 TLB misses
and ~0 L2 cache misses ("~0" typically means <100_000 in this
exercise).  What is the latency of an L2 cache access with linear
accesses (i.e., with the prefetcher helping)?  What is the latency of
an L2 cache access in random mode (where the prefetcher cannot help)?

## Q5 Main memory latency

Use random mode, 1000000 elements and stride 64 to measure main memory
latency (plus TLB miss cost).  How many cycles per access do you
measure on your first run?  How many ns (user time) per access do you
measure on your first run?  How many times faster is an L1 cache hit
than a main memory access in your first run (in user time)?

[The processor core usually clocks significantly lower in this
measurement than in some other scenarios, so the "cycles" results are
somewhat misleading; apparently the frequency-controlling part of the
CPU notices that the core is waiting for main memory most of the time
and reduces the clock rate to save energy; you typically see this
behaviour on server and mobile CPUs, and typically not on desktop
CPUs.

Some cache miss numbers you will see in this experiment are
significantly larger than 100M.  This is probably due to page table
accesses from L2 TLB misses. I have no good explanation for the large
`longest_lat_cache.miss` numbers.]

## Q6 Main memory accesses with prefetcher help

Switch to linear mode such that the prefetcher helps you and TLB
misses will be reduced. How many cycles per access do you measure on
your first run?  How many ns per access do you measure on your first
run?  What is the effect on the cache misses?  What is the bandwidth
(in GB/s user time) of the main memory accesses, if you count, for
each access, the whole 64 bytes of a cache line that is actually
transferred from main memory?

[The theoretical maximum transfer speed of the DRAM interface used on
this processor is 46.9GB/s, but obviously `memory1` uses only a part
of that.]

## Q7 Conflict misses

Use stride=8192.  What is the last number of elements for which the
number of L1 cache misses is ~0?  Where do you see the difference for stride=4096?
Where for stride=2048?  What is the associativity (number of ways) in
the L1 cache?  How much capacity per way does the L1 cache have (L1
size/number of ways)?  What is the number of sets in the L1 cache
(capacity per way/cache line size)?

Extra question (not required to answer that as part of the exercise):
How would you size your data structures, given this knowledge?  What
benefits do you expect from these choices?

[There is no corresponding question for the L2 cache, because the L2
cache is physically addressed: Two accesses that are an L2 way size
apart in virtual memory usually are not a multiple of that size apart
in physical memory and therefore do not map to the same way in the L2
cache.  Therefore you cannot use measurements with `memory1` for this
purpose.]

## Q8 L1 TLB entries

Use stride=131136.  At which number of elements is the number of L1
TLB misses no longer ~0?  Repeat the same experiment with
stride=65600, stride=32832, stride=16448, stride=8256, stride=4160,
stride=2112.  How many entries does the L1 TLB have?  How much memory
does each entry cover (it's a power of 2, the strides above are chosen
to avoid conflict misses in the L1 cache)?

## Q9 L1 TLB miss penalty

Choose parameters such that the L1 TLB misses are close to 100M, but
the L1 cache misses and L2 TLB misses are ~0.  How much does
each access cost?

## Q10 L2 TLB entries

Use stride=65600.  What is the last number of elements for which the
L2 TLB miss rate is ~0?

[Thanks to virtual memory variations that you cannot control and low
associativity of the L2 TLB, this number is quite a bit lower than the
actual number of L2 TLB entries, but there is only so much that you
can find out easily with `memory1`.]

## Q11 L2 TLB miss penalty

Choose the parameters such that the number of L2 cache misses stays
<1M, but the number of L2 TLB misses rises to >90M.  What is the number
of cycles per memory access?

[The number of L1 cache misses can increase significantly because the
MMU accesses the page tables in the caches when an access misses all
TLB levels.  For additional insight (no need to put that in your
answer), compare your run for answering Q11 with a run with stride=64
and otherwise the same parameters.]