An example where avoiding undefined behaviour hurts performance

Fans of undefined behaviour claim (usually without providing any empirical evidence) that it allows the compiler to produce better code, and that the disadvantages don't exist if you are a good programmer (and the rest of us should not be using C).

One of these disadvantages is that undefined behaviour vastly reduces the ways in which you can express your intent, including ways that compilers tend to compile into faster code (unless they miscompile the code because of assumptions based on undefined behaviour).

As a simple example, consider a programming language implementation written in assembly language for AArch64, or in C. (This example is actually derived from a real programming language implementation.) In this implementation, you want to always report division by zero and division overflow as errors. However, on AArch64 (the 64-bit ARM instruction set) the CPU does not trap on these conditions, but produces some result, so you have to perform the checking yourself (32-bit ARM and 32-bit and 64-bit PowerPC also work like this). In assembly language, you could write it as follows:

  sdiv    x2, x0, x1          // start the division x2 = x0/x1 (does not trap)
  cbz     x1, 3c              // divisor == 0? -> error code at 3c
  cmp     x0, x4              // x4 holds LONG_MIN: dividend == LONG_MIN ...
  ccmn    x1, #0x1, #0x0, eq  // ... and divisor == -1?
  b.eq    3c                  // overflow case -> error code at 3c
   

Here you start the division, and then perform the checks while the divider works. If you want to express this in (more-defined) C, you can do it as follows:

    q=n1/n2;
    if (n2==0)
      abort();
    if (n2==-1 && n1==LONG_MIN)
      abort();
  

If the compiler compiles it as intended (gcc-6.3.0 and 8.3.0 happen to do so), this is portable between CPUs: on CPUs where the division does not trap on these conditions, the ifs catch them; on CPUs where the division traps on these conditions, these traps already produce the desired result.
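Packaged as a complete function (the function name and signature are mine, for illustration), the fragment above reads:

```c
#include <limits.h>
#include <stdlib.h>

/* Division first, checks afterwards, as in the fragment above.
   On CPUs where the divide instruction does not trap, the ifs catch
   the error cases; the compiler is free to schedule the division early. */
long div_then_check(long n1, long n2)
{
  long q = n1 / n2;
  if (n2 == 0)
    abort();                       /* division by zero */
  if (n2 == -1 && n1 == LONG_MIN)
    abort();                       /* LONG_MIN / -1 overflows */
  return q;
}
```

Note that C integer division truncates toward zero, matching what sdiv produces on AArch64.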

However, this code has a problem: division by zero and signed integer overflow are undefined behaviour in C, so, because the division comes first, a nasal-demon compiler might assume that n1 and n2 are already such that the ifs can never trigger, even on CPUs where the division does not trap on such inputs. Consequently, it could "optimize" the ifs away, and the programming language we want to implement would not catch these cases. To avoid this problem, we could rewrite the code as

    if (n2==0)
      abort();
    if (n2==-1 && n1==LONG_MIN)
      abort();
    q=n1/n2;
  

The disadvantage of this variant is that, if compiled in a straightforward way, the division is only started after the checks are performed, so there is no overlap between the checking and the long-latency division. And indeed, gcc-6.3.0 and gcc-8.3.0 compile this into code that performs the sdiv after the checking code:

  cbz     x1, 50              // divisor == 0? -> error code at 50
  cmp     x2, x4              // x4 holds LONG_MIN: dividend == LONG_MIN ...
  ccmn    x1, #0x1, #0x0, eq  // ... and divisor == -1?
  b.eq    50                  // overflow case -> error code at 50
  sdiv    x0, x2, x1          // division starts only after all checks
  

Microbenchmarks

This directory contains microbenchmarks that measure these variants. You can build the microbenchmarks by creating a subdirectory sub, and then typing
    cd sub
    make -f ../Makefile
  
You should take a look at the machine code (with objdump -d divloop*.o ooo*.o) to see if the compiler actually compiled the benchmarks as intended.

You can find the code I have used for measuring various CPUs in the subdirectories here.

Microbenchmark for in-order CPUs

The first set of microbenchmarks is only useful on in-order CPUs and consists of main.c and divloop.c. The resulting binaries are md (division-before-checking) and ub (checking-before-division).
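divloop.c is not reproduced here, but the core of such a benchmark is a dependent chain of checked divisions, so the divider's latency cannot be hidden across iterations. A minimal sketch (all names are mine, not the actual benchmark code):

```c
#include <limits.h>
#include <stdlib.h>

/* A dependent chain of checked divisions (check-before-divide variant):
   each quotient feeds the next dividend, putting the division on the
   critical path of every iteration. */
long divloop(long n, long divisor, long start)
{
  long x = start;
  for (long i = 0; i < n; i++) {
    if (divisor == 0)
      abort();
    if (divisor == -1 && x == LONG_MIN)
      abort();
    x = x / divisor + start;   /* quotient feeds the next iteration */
  }
  return x;
}
```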

The division-before-checking variant takes 7 cycles per iteration on a Cortex-A53, while the checking-before-division variant takes 8 cycles (for divisions with quotient 100; 5 vs. 6 cycles for quotient 1; larger numbers for larger quotients, but, surprisingly to me, not bigger differences). So avoiding undefined behaviour slows down this program.

Microbenchmark for out-of-order CPUs

On CPUs with out-of-order (OoO) execution, the microbenchmark above has very predictable branches, and as a result the CPU's front end (instruction fetching and decoding) runs far ahead of the execution of the division instructions, which are on the critical data-dependence path. The branches of the checks are predicted correctly, and the comparisons are performed in parallel with the computations on the critical path.

After a misprediction, the situation changes: in the check-before-division code, the CPU has to fetch and decode the checking instructions before it sees the division, and this can take several cycles. In the division-before-check code, the CPU sees the division instruction right away and, if the data is present (which it usually is after a misprediction), can start dividing several cycles earlier. Programming language interpreters have a relatively high branch misprediction rate on bytecode/virtual-machine instruction dispatch, so this actually has practical relevance.

The microbenchmarks for OoO CPUs (which also work on in-order CPUs) consist of ooo.c, ooomain.c, and ooo.h. The resulting binaries are ooomd (divide-before-check) and oooub (check-before-divide). They use a pseudo-random switch between 4 variants of the division code (so on CPUs that predict the switch, there is a 25% probability of correctly predicting the case and a 75% chance of mispredicting). Right at the start of each switch case there is the division code. The quotient is used in the computation of the next switch value, to ensure that it plays a role when the division starts (in real code, long-latency operations like divisions are often on the critical data-flow path of some computation, so that's realistic).
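ooo.c itself is not shown here, but the structure just described can be sketched as follows (the function name and the LCG update are mine; the real benchmark differs in detail):

```c
#include <limits.h>
#include <stdlib.h>

/* Sketch of the OoO benchmark structure: a pseudo-random switch between
   four copies of the checked division (check-before-divide shown), with
   the quotient fed back into the next switch value so the division stays
   on the critical path. */
long ooo_sketch(long n, long n1, long n2)
{
  unsigned long state = 88172645463325252UL;  /* arbitrary seed */
  long q = 0;
  for (long i = 0; i < n; i++) {
    switch (state & 3) {  /* hard-to-predict 4-way dispatch */
    case 0:
      if (n2 == 0 || (n2 == -1 && n1 == LONG_MIN)) abort();
      q = n1 / n2;
      break;
    case 1:
      if (n2 == 0 || (n2 == -1 && n1 == LONG_MIN)) abort();
      q = n1 / n2;
      break;
    case 2:
      if (n2 == 0 || (n2 == -1 && n1 == LONG_MIN)) abort();
      q = n1 / n2;
      break;
    case 3:
      if (n2 == 0 || (n2 == -1 && n1 == LONG_MIN)) abort();
      q = n1 / n2;
      break;
    }
    /* the quotient feeds the next switch value (simple LCG step) */
    state = state * 6364136223846793005UL + (unsigned long)q;
  }
  return q;
}
```

Because the quotient enters the state update, the CPU cannot compute the next switch target before the division completes.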

Results (in cycles per iteration):

  ooomd   oooub
   55.6    55.8 Cortex-A9 (Pandaboard) arm gcc-4.6.3
   22.2    24.1 Cortex-A53 (Odroid C2, in-order) aarch64 gcc-6.3.0
   29.7    31.2 Cortex-A72 (Rockpro64) aarch64 gcc-6.3.0
   23.9    25.4 Cortex-A73 (Odroid N2) aarch64 gcc-6.3.0
   41.9    41.9 PPC 7447A (iBook G4) ppc (32-bit) gcc-4.3.2
  130.0   130.0 PPC 970 (PowerMac G5) ppc64  gcc-4.4.5
We see the predicted effect on the Cortex-A53, A72, and A73, but not on the other CPUs.

Discussion

Overall, the effect of this manual optimization is small, and probably vanishes in the noise in a real programming language interpreter. But at least here you get numbers, whereas the fans of "optimization" usually leave it to your imagination how big their speedups are; and that strategy is working: in discussions about this topic, quite a few people show vastly exaggerated expectations about the effectiveness of "optimizations". In my experiments (page 14 of this paper), I found that "optimizations" provide no speed difference in 13 out of 16 cases, speedups by factors of 1.02 and 1.04 in two cases, and a slowdown by a factor of 1.05 in one case.
Anton Ertl