# Exercises 3

Deadline 2025-12-09 23:59

All filenames refer to g0.complang.tuwien.ac.at.  You should
perform your measurements on this machine.

## Submission

Following the Setup below will give you a directory 3 in a working
directory of your git project, containing `Exercises3.txt`.

Answer the questions by editing `Exercises3.txt` with a text editor,
inserting parts of the transcript where needed.  Don't forget to
commit and push the directory and the relevant files before the
deadline.

Please do not just answer the questions, but document how you arrived
at the answers (this is also useful for preparing yourself for the
interview part of the exam later in the semester).  A good way to do
that for some questions is to provide transcripts of your shell
sessions.  You can either cut-and-paste the text from the terminal
session, or use `script -f <filename>` to write the transcript to the
named file (and then copy parts of this transcript into your
`Exercises3.txt`).

## Setup

`cd` into a working directory of your git project for Efficient
Programs on the course machine.  Perform

```
unzip /nfs/a5/anton/pub/anton/lvas/efficient/25/3.zip
git add 3
```

The directory 3 contains `Exercises3.txt` (fill in the answers there).

## Topic and the program `memory1`

The goal of this exercise is to give you an idea of, and a feeling
for, the performance effects of the memory hierarchy.  Note that the
course machine has a Rocket Lake processor, so the values you will
measure differ from those shown on slide 21 (for a Skylake).

There is a program
`/nfs/a5/anton/pub/anton/lvas/efficient/25/3/memory1`, which performs
100M memory accesses, each dependent on the previous one, following
pointers through a cycle of references (i.e., a cyclic linked list
without payload).  Depending on how this list is set up, you will
measure very different numbers of cycles and other events.

You call this program with

```
/nfs/a5/anton/pub/anton/lvas/efficient/25/3/memory1 random|linear <elements> <stride>
```

`random`
: the elements of the linked list are shuffled such that they are not
visited in linear order.  Various hardware features that depend on
spatial locality will not work in most cases, and hardware prefetchers
will also not help.

`linear`
: the elements of the linked list are visited in increasing order,
with the next element being *stride* bytes away (except that the last
element cycles back to the first one).  Hardware prefetchers and
(depending on the stride) hardware features that benefit from spatial
locality will help.

\<elements\>
: the number of elements in the list (at least 1, at most 1000000 (or
a little more)).

\<stride\>
: the distance from the start of one element to the start of the next
one in memory (not in visiting order in case of `random`).

Use `LC_NUMERIC=prog perf stat -e cycles` to determine the number of cycles spent on
initializing the list and performing these 100M dependent memory
accesses.  The time for initializing the list (and the memory occupied
by it) is small in many cases, but can become significant for large
memory areas; the number of list elements and the randomization also
have their costs, but these are small enough that they do not drown
out the effects of the memory subsystem.

Use additional performance events as appropriate; useful events can
be:

```
L1-dcache-load-misses
l2_rqsts.demand_data_rd_miss (misses from prefetching not counted)
LLC-loads (L3 cache accesses, i.e., L2 cache misses, including prefetches)
LLC-load-misses (L3 cache misses)
longest_lat_cache.miss (L3 cache misses)
dtlb_load_misses.stlb_hit (L1 TLB miss that hits L2)
dTLB-load-misses (L2 TLB miss)
dtlb_load_misses.walk_completed (miss all TLB levels)
```

I have seen large differences between `LLC-load-misses` and
`longest_lat_cache.miss`.  I can only guess at the reason: maybe the
former does not include prefetches, while the latter does.

Remember that percentages in parentheses indicate that the numbers
were produced by multiplexing the limited number of hardware counters
and may be less accurate than without multiplexing.  You can avoid
that by asking for fewer events.  However, this program behaves so
uniformly that the multiplexing may cause less inaccuracy than it does
for more typical programs.

An example:

```
LC_NUMERIC=prog perf stat -e cycles -e L1-dcache-load-misses -e dtlb_load_misses.stlb_hit -e dTLB-load-misses \
/nfs/a5/anton/pub/anton/lvas/efficient/25/3/memory1 random 5000 8
```

produces as output:

```
size: 40000, offsets at 0x0 0x8 0x10 0x18 ..., randomized
[...]
501_793_612      cycles
     35_203      L1-dcache-load-misses
        449      dTLB-load-misses
```

So there are hardly any L1 cache misses and hardly any TLB misses,
i.e., this is the best case, and the 500M cycles for 100M dependent
memory accesses indicate that one memory access has a latency of 5
cycles in that case.

Note that the mapping from virtual to physical memory (which you
cannot influence) can cause significant variations (towards higher
misses) in the results of some of the following experiments.  Perform
your measurements several times; use the lowest numbers of
misses/cycles that you measure for the answers to the following
questions.  That's a useful basis for understanding the performance
effects, while the effects of virtual memory are too arbitrary to make
much use of, except for the following guideline:

If you size your data structures to fit into a specific cache, size
them for one way less than the cache actually has (e.g., for an 8-way
L1 cache with 4KB per way, make your data structure fit into 28KB,
such that only 7 ways of each set are used).

The following questions concentrate on L1 and L2, because the L3 cache
and main memory are shared between CPU cores, so you could see wildly
varying results when different groups perform such experiments at the
same time.  Feel free to experiment in that direction, though.

For the questions about main memory latency, it is expected that some
of you will report higher numbers than others; there is no need to
measure that several times with the hope that you will see a smaller
result the next time.

## Q1 L1 cache size

Increase the number of elements until the number of L1 cache misses
rises significantly.  At what size does this happen?  What is your
estimate of the L1 cache size?

## Q2 Cache line size

Switch to linear mode, and perform measurements with larger numbers of
elements.  You will see a plateau in the number of L1 cache misses.
At this plateau: given 100M accesses with a stride of 8, you get an L1
cache miss every n bytes.  What is n?

Increase the stride to n and adjust elements to 8*current-elements/n.
You should now see 100M L1 cache misses.  For the next questions, use
stride=n, unless otherwise noted.

## Q3 L2 cache size

Design an experiment for determining the L2 cache size and use it to
answer the following question: What is the L2 cache size?

## Q4 L2 cache latency

Design an experiment that has 100M L1 cache misses and ~0 TLB misses
and ~0 L2 cache misses ("~0" typically means <100_000 in this
exercise).  What is the latency of an L2 cache access with linear
accesses (i.e., with the prefetcher helping)?  What is the latency of
an L2 cache access in random mode (where the prefetcher cannot help)?

## Q5 Main memory latency

Use random mode, 1000000 elements and stride 64 to measure main memory
latency (plus TLB miss cost).  How many cycles per access do you
measure on your first run?  How many ns (user time) per access do you
measure on your first run?  How many times faster is an L1 cache hit
than a main memory access in your first run (in user time)?

[The processor core usually clocks significantly lower in this
measurement than in some other scenarios, so the "cycles" results are
somewhat misleading; apparently the frequency-controlling part of the
CPU notices that the core is waiting for main memory most of the time
and reduces the clock rate to save energy; you typically see this
behaviour on server and mobile CPUs, and typically not on desktop
CPUs.

Some cache miss numbers you will see in this experiment are
significantly larger than 100M.  This is probably due to page table
accesses from L2 TLB misses. I have no good explanation for the large
`longest_lat_cache.miss` numbers.]

## Q6 Main memory accesses with prefetcher help

Switch to linear mode such that the prefetcher helps you and TLB
misses will be reduced. How many cycles per access do you measure on
your first run?  How many ns per access do you measure on your first
run?  What is the effect on the cache misses?  What is the bandwidth
(in GB/s user time) of the main memory accesses, if you count, for
each access, the whole 64 bytes of a cache line that is actually
transferred from main memory?

[The theoretical maximum transfer speed of the DRAM interface used on
this processor is 46.9GB/s, but obviously `memory1` uses only a part
of that.]

## Q7 Conflict misses

Use stride=8192.  What is the last number of elements for which the
number of L1 cache misses is ~0?  Where do you see the difference for stride=4096?
Where for stride=2048?  What is the associativity (number of ways) in
the L1 cache?  How much capacity per way does the L1 cache have (L1
size/number of ways)?  What is the number of sets in the L1 cache
(capacity per way/cache line size)?

Extra question (not required to answer that as part of the exercise):
How would you size your data structures, given this knowledge?  What
benefits do you expect from these choices?

[There is no corresponding question for the L2 cache, because the L2
cache is physically addressed: Two accesses that are an L2 way size
apart in virtual memory usually are not a multiple of that size apart
in physical memory and therefore do not map to the same way in the L2
cache.  Therefore you cannot use measurements with `memory1` for this
purpose.]

## Q8 L1 TLB entries

Use stride=131136.  At which number of elements is the number of L1
TLB misses no longer ~0?  Repeat the same experiment with
stride=65600, stride=32832, stride=16448, stride=8256, stride=4160,
stride=2112.  How many entries does the L1 TLB have?  How much memory
does each entry cover (it's a power of 2, the strides above are chosen
to avoid conflict misses in the L1 cache)?

## Q9 L1 TLB miss penalty

Choose parameters such that the L1 TLB misses are close to 100M, but
the L1 cache misses and L2 TLB misses are ~0.  How much does
each access cost?

## Q10 L2 TLB entries

Use stride=65600.  What is the last number of elements for which the
L2 TLB miss rate is ~0?

[Thanks to virtual memory variations that you cannot control and low
associativity of the L2 TLB, this number is quite a bit lower than the
actual number of L2 TLB entries, but there is only so much that you
can find out easily with `memory1`.]

## Q11 L2 TLB miss penalty

Choose the parameters such that the number of L2 cache misses stays
<1M, but the number of L2 TLB misses rises to >90M.  What is the number
of cycles per memory access?

[The number of L1 cache misses can increase significantly because the
MMU accesses the page tables in the caches when an access misses all
TLB levels.  For additional insight (no need to put that in your
answer), compare your run for answering Q11 with a run with stride=64
and otherwise the same parameters.]