Control-flow integrity microbenchmark

An LWN article about control-flow integrity described that (with the right options) clang replaces indirect calls with a variant that checks whether the call target is among the functions whose address is taken and that have the right signature. The point of this is that if there is a vulnerability that allows overwriting a function pointer, this vulnerability cannot be exploited to execute arbitrary code.

The article describes this as follows: There are tables of addresses of functions, and a function pointer is represented by the address of the table entry. A dispatch function (one for each signature) checks whether the "function pointer" is inside the table, and if so, calls the function pointed to by the entry.

Nowadays, in security-critical code (such as code where you would enable control-flow integrity) you use retpolines instead of indirect branches/calls, in order to work around Spectre; retpolines cost 20 cycles or more because they guarantee a branch misprediction.

So I wondered if instead of the scheme described above one would not be better off by using a hard-coded binary search among the possible function pointers and then use direct jumps/calls to the targets. Here a function pointer is represented by an integer 0..n-1, if there are n target functions.

Later I refined this into a more B-Tree like scheme: discern among as many targets (for a leaf) or subtrees (for an internal tree node) as fits in a cache line, even at the cost of having a more linear search within a cache line. This allows us to decide between 8 targets in a leaf node and 8 subtrees in an internal node, at the cost of up to 5 executed branches per node. In the special case of a two-level tree, making use of short branches allows 9 subtrees in the root node, and 9 targets in one of the leaf nodes, for a total of 73 targets.

Paf asked about the cost of this scheme, so I wrote this microbenchmark. There are a number of cost aspects:

Size and Cache cost

Is the binary search bigger and does it cause more cache misses? In this microbenchmark, the binary search dispatch function is 68 bytes long on AMD64, but only 62 bytes are actually used when one of the 7 functions is called (the other 6 bytes are for dealing with the out-of-range case). Similarly, in the B-tree-like variant for a leaf node a cache line decides among 8 targets (and the last node ends with a jump to abort() that reaches into the next cache line), and an internal node decides among 8 subtrees (one by falling through into the next cache line).

By contrast, the table-using code (with retpolines) costs 27 bytes itself, 21 bytes in a different section for the retpoline code (which may be shared among several users), and a 8 bytes per target, 8 targets per cache line (i.e., the table alone is as big as the hard-coded B-tree-like search).

In terms of cache misses, if we assume that the code and tables are no longer in the L1 caches (but the retpoline code is), the binary search for 7 functions will encounter 1 I-cache miss, while the table-using version encounders 1 I-cache and 1 D-cache miss.

If we have more than 7 functions, the binary search will use more cache lines. And if we care about that, we will prefer the B-tree-like variant over the pure binary search. This allows us to call one of 73 targets with two cache accesses (i.e., a comparable cost to the table-using variant). For more targets, we need more I-cache accesses. With 3 I-cache accesses, we can reach at least 512 targets, and with 4, at least 4096 targets.

If you know which functions are called how often, you can rebalance the binary search to minimize the number of average cache misses.

Of course, the slowdown from retpolines of the table-using code in hot code may outweigh the cache-miss cost of the binary search in cold code.

Branch mispredictions

If you always call the same target function (and don't have too much other stuff going on between calls), the branch predictor works nicely and will always predict correctly. If you call this function from one place, and that function from another place, the predictors tend to work well, too. OTOH, if you call different functions based on the result of a good random-number generator, an indirect branch will have a high misprediction rate, while the conditional branches in a binary search each will have about 50% misprediction rate (and I expect that in a B-tree-like search we will see a similar number (not proportion) of mispredictions as for the binary search; so the tree searches are worse in the random case as soon as they have to select among 4 targets or more.

However, the random case is not very realistic, and one will have to measure production code to see how well the predictors behave. For stuff like Linux VFS code, I expect the conditional branch predictor to perform very well if you use tree search for dispatch.

Basic performance

Each iteration of this microbenchmark measures a dispatch, a loop around it, and a little bit of work done by the called function (basically add a constant to the input value and return it). It measures the in-cache and predictable-branch preformance (except that the retpoline variant throws the predictability away). There are still some things that are not clear under these circumstances. Optimization guides tell me that, e.g., Skylake can process 2 branches per cycles, but only one taken branch per cycle; they also tell me that, if there are too many branches in a 16-byte fetch unit, this will overload the branch predictor. The binary search dispatch is pretty dense in branches, and the B-tree like search even more so, so I expected this to have an effect.

The dispatch is among 7 different target functions for the binary search (method 1) an among 73 for the B-tree-like search (method 3).

Anyway, here are the results in cycles per iteration (on Sandy Bridge (Xeon E3-1220), Skylake (Core i5-6600K), Zen (Ryzen 7 1800X), Zen2 (Ryzen 9 3900X), Zen3 (Ryzen 7 5800X)), explanations below:

Sandy Skylk  Zen   Zen2  Zen3
 6.01  5.01  5.04  5.01  5.01 dispatch 0 6
 9.01  7.01  8.02  7.03  7.01 dispatch 2 6
32.03 38.15 57.09 45.96 24.01 dispatch-retpol 0 6
36.03 43.11 60.08 49.33 26.01 dispatch-retpol 2 6
 9.01  7.59  6.02  6.01  6.01 dispatch-retpol 1 0
 9.01  7.62  6.02  6.01  6.01 dispatch-retpol 1 1
12.02  8.57  8.02  8.02  8.01 dispatch-retpol 1 2
12.02  8.58  8.02  8.02  8.01 dispatch-retpol 1 3
10.01  6.01  7.03  7.01  7.01 dispatch-retpol 1 4
10.01  6.01  7.02  7.01  7.01 dispatch-retpol 1 5
12.01  7.01  8.02  8.02  8.01 dispatch-retpol 1 6
10.01  8.01  7.01  7.01  7.02 dispatch-retpol 4 0
10.01  8.01  7.01  7.01  7.01 dispatch-retpol 4 1
10.01  8.02  7.02  7.01  7.02 dispatch-retpol 4 2
10.01  8.02  7.01  7.03  7.01 dispatch-retpol 4 3
 8.01  7.49  6.01  6.01  6.01 dispatch-retpol 4 4
 8.01  7.54  6.01  6.01  6.01 dispatch-retpol 4 5
 8.01  7.57  6.01  6.01  6.02 dispatch-retpol 4 6
 8.01  7.55  6.02  6.01  6.01 dispatch-retpol 4 7
11-15  7-9   8-10  8-10  8-10 dispatch-retpol 3 *
"dispatch" is the benchmark compiled with clang-11.0 -Os without retpolines, dispatch-retpol is compiled with clang -Os -mretpoline. The first parameter specifies which dispatcher to use:
a plain indirect function call; this actually has the target in a register in the whole loop, which is probably unrealistically good (it's going to come from memory in more realistic settings), but that way we know we did not give an unfair advantage to the binary search variant.
A table-using version with bounds checking, similar to what the article describes for control-flow integrity, but we use an index into the table of functions instead of the address of a function-table entry.
Binary search among 7 targets
B-tree-like search among 8 targets (1 level)
B-tree-like search among 73 targets (2 levels)
The second parameter is the target function to call; this makes a difference for the tree-search variants, so we measured this variant with all different target functions. However, for the 73 different target functions of the B-tree-like search, I don't show them all, but instead show the range of results.

The results show that, for 7 target functions, the binary search is competetive with the no-retpoline version of the control-flow-integrity-checked indirect call. The single-level B-tree-like search has similar, sometimes even better performance to the binary tree, so the increase in executed branches does not hurt. The two-level B-Tree-like variant is typically ~2-3 cycles slower than the single-level one, so another B-tree level is relatively cheap as long as there are no I-cache misses and branch mispredictions.

Once retpolines come into play, the tree-based searches are quite a lot faster. The picture may look less rosy if you have more targets and the cache misses play a role, or if branch mispredictions come into play, but that actually needs measurements (with performance monitoring counters) in actual production code.

The effect of having too many branches in a 16-byte block is not really noticable, neither in Skylake nor in the other CPUs.

Anton Ertl
[ICO]NameLast modifiedSizeDescription

[DIR]Parent Directory  -  
[   ]Makefile23-May-2021 15:42 1.4K 
[   ]dis-btree.o23-May-2021 15:56 3.7K 
[   ]dis-btree.s23-May-2021 15:55 3.9K 
[   ]dispatch23-May-2021 15:56 17K 
[   ]dispatch-retpol23-May-2021 15:56 18K 
[   ]dispatch-retpol.o22-May-2021 19:02 2.6K 
[TXT]dispatch.c22-May-2021 19:01 966  
[   ]dispatch.o22-May-2021 19:02 2.4K 
[   ]dispatch.s22-May-2021 19:07 2.7K 
[   ]main-retpol.o23-May-2021 15:56 3.0K 
[TXT]main.c23-May-2021 15:56 1.4K 
[   ]main.o23-May-2021 15:56 3.1K 

Apache/2.2.22 (Debian) DAV/2 mod_fcgid/2.3.6 PHP/5.4.36-0+deb7u3 mod_python/3.3.1 Python/2.7.3 mod_ssl/2.2.22 OpenSSL/1.0.1e mod_perl/2.0.7 Perl/v5.14.2 Server at Port 80