Speed of various threading techniques on several processors V1

Speed of various threading techniques on several processors V1

This page is mainly interesting for the results on machines without BTBs and other indirect branch prediction mechanisms. You can find improved benchmarks and their results on V2 of this page.
						cycles per NEXT (i.e., runtime/(1e8*cycletime))
					gcc	sub-
Machine			Processor	version	routine	direct	indir.	switch	call	anti-btb
DECStation 5000/125	R3000 25MHz	2.2.2	 4.95	 4.425	 6.45	11.625
DecStation 5000/150	R4000 100MHz	2.4.5	 9.7	 7.5	10.5	16.8
SGI PowerChallenge XL	R10000 195MHz	2.7.2	16.4	 8.31	10.3	13.3
HP/Apollo 425		68040 25MHz	2.2.2	 9.575+	 5.575	 7.525	15.875
HP/Apollo 720		HP-PA 50MHz	2.3.2	 7.77	 5.495	 7.525	 9.87
Sun Ultra 1	      UltraSparc 143MHz	2.8.1	 9.86	11.43	13.29	17.29
UltraSPARC 30	   UltraSPARC II 248MHz	2.7.2.3	 8.08	11.33	13.49	17.51
AlphaPC 64	       21064A 300MHz egcs-1.0.3	12.27$	 9.12	12.57	18.15	19.74
Alpha 164LX	       21164A 600MHz egcs-1.0.3  7.8$	 7.8	10.32	10.86
Compaq XP1000	       21264 500MHz  egcs-1.0.3	 9.4$	17.25$	17.25$	 9.55
PowerMac		PPC604e 200MHz	2.7.2.1	 5.74	 7.24	 9.34	12.4
Powerbook G3	       PPC750 266MHz egcs-1.1.2	 4.24	 5.36	 7.36	11.73	11.68
486			486DX2 50Mhz	2.2.2d	10.15*	 7.2	 7.3	10.75
Pentium	PB Cache	Pentium 133MHz	2.6.3	 8.93*$	 3.73	 4.73	17.52$
IBM/Cyrix 6x86-P166+	IBM6x86 133MHz	2.7.2.1* 5.48	 5.71^	 7.29	 7.6
DEll XPS Pro200n      PentiumPro 200Mhz	2.7.2%	 6.66	 5.52	 6.54	15.56
Mendocino		Celeron 333MHz	2.7.2.3* 4.8	 5.8	 6.7	 7.3		11.2
AMD K6-2 Super 7	K6-2 300MHz	2.7.2.3	 4.23*	 7.32	10.08	48.57$
Thunderbird/VIA KT133	Athlon 800MHz	2.95.1*	21.44$	 5.36	 6.08	 7.2	10.48	12.8
Northwood/i845E        Pentium4 2.26GHz 2.95.3*  8.6    10.2    10.7    11.4    20.6    22.9

+manually unoptimized to become realistic
&gcc version cygnus-2.7-96q4
*with -fomit-frame-pointer
%gcc version cygnus-2.7.2-960712
^compiled with gcc-2.6.3 (bug in RedHat's 2.7.2.1)
$see text below

Note that a microbenchmark like this can uncover some problems, but not necessarily all of them, and cannot be used to predict the performance of real applications. This is particularly true for the more modern processors; read the notes below.

Thanks to Bernd Paysan for the values on the SPARCStation and the HP 700. Bernd does not know whether the SPARCStation is a 1 or 2; I guess from its slowness that it's a SPARCStation 1. Thanks to Franz Puntigam for the values on the 486. Thanks to Dominique de Waleffe for the PentiumPro numbers. Thanks to Thomas Gschwind for the 21164 and the IBM6x86 numbers. Thanks to Bernd Beuster for the results on the UltraSPARC 30.

All times are user time. The assembly code generated by the GNU C Compiler was inspected and found realistic, with one exception: I had to unoptimize the assembly code for subroutine threading on the 68040, since the compiler allocated the address of the function "next" to a register.

The benchmark consists of a loop that contains nine NEXTs and a looping instruction (a termination test and a jump back for subroutine threaded code), i.e. it primarily measures NEXT speed. This loop is executed 10,000,000 times (resulting in 100,000,000 NEXTs and a bit of overhead). It fits completely into the respective caches.

The older processors are relatively simple and perform quite predictably. For the newer ones, the performance on this benchmark is determined very much by microarchitectural properties like branch target buffers, return stacks, branch mispredict penalties, and cache consistency algorithms. Here are some notes on specific processors:

Alphas (all kinds)
There are two calling conventions for compiling subroutine.c: the general calling convention can call code within umptillion GBs and provides one data area per procedure. There is also another calling convention that restricts calls to within +/- 4MB, and has only one 64KB global data area. egcs-1.03 uses the general convention for the current subroutine.c, and the restricted convention if main() is put after the other routines. Using the restricted convention gives significant speedups: 5.97 cycles on the 21064a, 5.22 cycles on the 21164a, 4.6 cycles on the 21264. Which convention is more appropriate depends on the system (for RAFTS/Alpha we will probably use the restricted convention).
21064A
This processor has a return stack and no branch target buffer, resulting in good performance for subroutine threading (with the restricted convention) and slower performance for the other branches. Its high clock rate makes the mispredict penalty of 5 cycles bearable.

The Alpha architecture allows static branch prediction. Our direct and indirect threading benchmarks work with 0% prediction accuracy on the Alpha. 90% accuracy makes the results faster by about 4.5 cycles. In real Forth code, we can achieve 33% prediction accuracy for direct threading and 40% for indirect threading (see http://www.complang.tuwien.ac.at/forth/peep/).

21264
This CPU is extremely sensitive to code alignment (misalignment penalty ~8 cycles for direct.c) and insertion of independent instructions. By varying this I achieved results in the range 3.6-17.25 cycles for direct threading. If the NEXT is aligned to a 16-byte boundary, the result is 5.6 cycles. Putting in two more nops results in a speedup to 3.6 cycles; inserting independent instructions (as happens in a typical virtual machine) should have the same effect.
Pentium
The switch results show a problem with the cache management on the Pentium (apparently the Pentium does not implement the shared state of the MESI protocol, see http://www.complang.tuwien.ac.at/misc/pentium-switch/). This is responsible for the bad results on switch threading. This problem also plays a role for direct threading, but it does not show up in this benchmark (because there are no colon defs, constants, variables etc.). However, in a real Forth system like Gforth direct threading is significantly slower than indirect threading on the Pentium (whereas it is a little faster on the 486). Also, in an indirect threading system like Win32Forth, where a primitive is jumped to through a pointer preceding it, this results in catastrophic performance.

The Pentium subroutine threading results also have a peculiarity. They are something of a worst case for the branch target buffer (for predicting the target of the RET). A change that implements the best case for the branch prediction (every call calls a different next()), runs in 3.76 cycles (as fast as direct threading). In practice, the branch prediction accuracy will be somewhere in between (probably closer to the best case).

IBM/Cyrix 6x86
The return stack causes the good result for subroutine threading. But the CPU apparently lacks a branch target buffer; still at less than 6 cycles per NEXT the direct threading performance is respectable.
K6-2
This processor apparently suffers from cache consistency problems similar to the Pentium. After moving the jump table for the case statement away from the code, the time for case.c improved to 11.28 cycles. The return stack makes subroutine threading very fast on this processor (4 cycles/next). The branch target cache of this machine apparently does not help in the indirect jumps used in threaded code, resulting in 7 cycles/NEXT for direct and 10 cycles/NEXT for indirect threading.
P6 (Pentium Pro, Celeron, ...)
This microarchitecture has a BTB, a return stack and (for the Celeron) no cache consistency problems; it is quite fast (~4-5 cycles for direct threading) when it hits the BTB, but it has a high branch miss penalty (~12 cycles/NEXT for direct threading) when it misses; direct.c gives a 90% hit rate, but a slight variation, anti-btb.c gives only 10% hit rate and requires 11.2 cycles on the Celeron.
Athlon
This microarchitecture behaves similar to the P6 on this benchmark (and on Forth code). The high time on the subroutine threading benchmark seems to be specific to this microbenchmark; inserting just a nop at the start of the next() function reduces the time to 9.28 cycles.
Pentium 4
It seems that the long pipeline even hurts this machine quite a bit when the branch predictions should be mostly correct. I will have to take a look at performance counter measurements to learn more about what's going on.
R10000
This machine has neither a branch target buffer nor a return stack; and apparently it has a very high mispredict penalty. It therefore suffers on all kinds of threading, taking 16 cycles/NEXT on subroutine threading and 8 cycles/NEXT on direct threading.

Anton Ertl
[ICO]NameLast modifiedSizeDescription

[DIR]Parent Directory  -  
[TXT]subroutine.c10-Jul-1992 08:55 211  
[TXT]direct.c09-Jul-1992 14:37 259  
[TXT]call.c20-Apr-1999 19:51 309  
[TXT]indirect.c09-Jul-1992 17:07 320  
[TXT]anti-btb.c27-Nov-1998 16:00 436 direct.c modified to have only 10% BTB hit rate
[TXT]pro-btb.c15-Apr-2004 15:55 492  
[TXT]case.c03-Feb-1993 15:23 579 switch threading

Apache/2.2.22 (Debian) DAV/2 mod_fcgid/2.3.6 PHP/5.4.36-0+deb7u3 mod_python/3.3.1 Python/2.7.3 mod_ssl/2.2.22 OpenSSL/1.0.1e mod_perl/2.0.7 Perl/v5.14.2 Server at www.complang.tuwien.ac.at Port 80