Performance of Forth systems

This page is derived from Section 4.5 of [Ert96].

The benchmarks we used were the ubiquitous Sieve (counting the primes <16384 a thousand times); bubble-sorting (6000 integers) and matrix multiplication (200*200 matrices) come from the Stanford integer benchmarks (originally in Pascal, but available in C (The C version and Martin Fraeman's original translations to Forth can be found at ftp.complang.tuwien.ac.at/pub/forth/stanford-benchmarks.tar.gz)) and have been translated into Forth by Martin Fraeman and included in the TILE Forth package. These three benchmarks share one disadvantage: They have an unusually low amount of calls. To benchmark calling performance, we computed the 34th Fibonacci number using a recursive algorithm with exponential run-time complexity. You can find the benchmarks in Forth, C and forth2c-generated C at http://www.complang.tuwien.ac.at/forth/bench.tar.gz.

The measured systems fall into several classes:

C-based native code
f2c opt
the output of our prototype translator compiled with gcc-2.6.3 -O3 -fomit-frame-pointer.
Timbre
the output of the Forth-to-C translator included in the Timbre V.4 distribution, also compiled with gcc-2.6.3 -O3 -fomit-frame-pointer.
f2c noopt
the output of our prototype translator compiled with gcc-2.6.3 without optimization.
C
hand-coded C programs (the original C versions in case of the Stanford benchmarks) compiled with gcc-2.6.3 -O3 -fomit-frame-pointer.

Native code
compilers, typically employing the traditional technique of combining subroutine threading with inlining and peephole optimization.
bigForth
bigForth 386 (v1.20beta) by Bernd Paysan [Pay91].
iForth
iForth 1.06 by Marcel Hendrix.
mxForth
mxForth by Marcel Hendrix.
NT NCC
LMI's NT Forth NCC (written by Tom Almy).
FLK
FLK 1.2 by Lars Krueger.

Interpreters
written in assembly language
Win32F
Win32Forth 1.2093 by Andrew McKewan and Tom Zimmer.
NT Forth
LMI's NT Forth (beta, May 1994).
eforth
Marcel Hendrix' port of eforth to Linux.
eforth opt
Marcel Hendrix added peephole optimization of intermediate code [Bad95, Sch92] to his port of eforth.

Interpreters
written in portable languages
Gforth
Gforth 0.4.9, written in GNU C [Ert93] (compiled with gcc-2.95.1, default flags and -DFORCE_REG).
PFE
pfe-0.9.14 by Dirk Zoller, written in ANSI C (compiled with gcc-2.6.3 with the default configuration)
thisForth
ThisForth Beta written in C by Wil Baden (compiled with gcc-2.6.3 -O3 -fomit-frame-pointer); it employs peephole optimization of intermediate code.
TILE
TILE Release 2.1, written in C by Mikael Patel (compiled with gcc-2.6.3 -O3 -fomit-frame-pointer).

Win32Forth, NT Forth and its NCC were benchmarked under Windows NT, bigForth and iForth under DOS/GO32, all other systems under Linux. The results for Win32Forth, NT Forth and its NCC were provided by Kenneth O'Heskin, those for iForth and eForth (with and without peephole optimization of intermediate code) by Marcel Hendrix. All measurements were performed on PCs with an Intel 486DX2/66 CPU with 256K secondary cache with similar memory performance. The times given are the median of three measurements of the user time (the system time is negligible anyway).

relative   f2c      	f2c hand-  big-	       mx-  NT-Forth           Win32-    NT      eforth       This-      abs.\ time
    time  opt.Timbre no opt coded Forth	iForth Forth   NCC   FLK Gforth Forth Forth eforth +opt   PFE Forth  TILE  f2c opt.
sieve     1.00     -   7.03  0.86  1.87	  2.16  2.31  1.27  1.47   5.05  7.99  6.56  8.00  4.87  9.09 18.32 49.42     5.19s
bubble    1.00     -   8.28  0.87  2.34	  2.32  2.20  7.12  1.61   6.25  9.69 10.41 10.94  6.49 11.11     - 28.67     4.79s
matmul    1.00     -   9.35  1.10  3.02	  2.19  2.31  4.14  1.99   5.92  9.87  9.09  9.80  4.95 10.59     - 27.41     4.02s
fib       1.00  3.14   4.92  1.00  1.37	  1.32  0.95  1.47  0.83   3.79  6.62  5.81  5.31  3.76  7.56 12.99 18.68     7.96s

The table above (and a picture) shows the time that the systems need for the benchmarks, relative to the time of f2c opt (or, in other words, the speedup factor that our translator (and GCC) achieves over the other systems). Empty entries indicate that I did not succeed in running the benchmark on the system.

The result of Timbre's Forth-to-C translator is slow, as expected (since Timbre does not have DO..LOOP and friends, I could only measure fib, but I think this result is representative). Combining our translator with a non-optimizing GCC results in code that is even slower than the interpretive Gforth system. Hand-coded C is between 14% faster and 10% slower than the output of the Forth translator. I was a little surprised by the matrix multiplication result, where the code translated to Forth and back was faster than the original. Closer inspection showed that, in translating from C to Forth, an optimization had been performed, which the C compiler does not perform (it is an interprocedural optimization that requires interprocedural alias analysis) and which reduced the amount of memory accesses; this is probably responsible for the speedup, maybe combined with vagaries such as instruction cache alignment.

BigForth and iForth achieve a speedup of about 3 over the fastest interpreters, but there is still a lot of room for improvement (at least a factor of 1.3-3). Even on the fib Benchmark, which should be the strong point of Forth compilers, f2c opt was better. The results of NT Forth NCC are a bit worse on average and have a big variance (a speedup of 1-4 over Gforth, 1.2-7.2 times slower than f2c opt.). These results show that researching better native code generation techniques is not just a waste of time and that there is still a lot to be gained in this area. These results also show that the following statement has not become outdated yet: ``The resulting machine code is a factor of two or three slower than the equivalent code written in machine language, due mainly to the use of the stack rather than registers.'' [Ros86]

The interpretive systems written in assembly language (except eforth opt) are, surprisingly, slower than Gforth. One important reason for the disappointing performance of these systems is probably that they are not written optimally for the 486 (e.g., they use the lods instruction). eforth opt demonstrates that peephole optimization of intermediate code offers substantial gains. eforth opt is 3.5-6.5 times slower than f2c opt. We can expect even better results if the baseline interpreter is more efficient (e.g., Gforth).

Gforth is 3.5-6.5 times slower than f2c opt, PFE 7.5-11 times. The slowdown of PFE with respect to Gforth can be explained with the self-imposed restriction to standard C (although the measured configuration of PFE uses a GNU C extension: global register variables), which makes efficient threading impossible (PFE uses indirect call threading). ThisForth and TILE were obviously written with a certain negligence towards efficiency issues and the limited optimization abilities of state-of-the-art C compilers, resulting in a slowdown factor of more than 49 for TILE on the Sieve.

Executable code size

I not only measured run-time, but also code size and compile time. For threaded code (interp. size), I measured the space alloted in the Gforth system during compilation of the program and subtracted the alloted data; i.e., the interpreted code size includes the space needed for headers. Gforth uses one cell (32 bits in the measured system) per compiled word, two cells for the code field and it pads the header such that the body is maximally (i.e., 8-byte-) aligned. For the size of the machine code produced by the translator/compiler combination (.o size), I used the sum of the text and data sizes produced by the Unix size command, as applied to the object (.o) file. This does not include the size of the symbol table information included in the object file (which is easy to strip away after linking). The data size in the object file does not include the alloted space, as that is allocated later at run-time.

	   interp.	 .o	   size	compile source    C
	     size	size	  ratio	  time   lines lines
sieve         418        272       1.54   1.10s     25   482
bubble       1020        748       1.36   1.60s     72  1100
matmul        784        412       1.90   1.40s     55   793
fib           140        140       1.00   0.90s     10   169

The code size measurements dispell another popular myth, that of the inherent size advantage of stack architecture code and of the bloat produced by optimizing C compilers. While a comparison of a header-stripping 16-bit Forth with a RISC (about 50% bigger code than CISCs) would give a somewhat different result, the reported size differences of more than an order of magnitude need a different explanation: differences in the functionality of the software and different software engineering practices come to mind.

For the compile time measurements, only the user time needed by GCC to compile and link the program are displayed. The system time was constant at 0.6s. The compilation to Gforth's interpreted code needed a negligible amount of time; the translation to C also vanished in the measurement noise, although it was not written for speed and although the present implementation should be much slower than normal Forth compilation. The compile time data indicate that, after a startup time of about 1.4s (user+system), GCC compiles about 90 lines of Forth code (1500 lines of translator output) per second. Interestingly, less than one byte of machine code is generated per line of C code.

Acknowledgements

Martin Fraeman provided the C versions of the Stanford benchmarks. Kenneth O'Heskin kindly produced the benchmark results for Win32Forth, NT Forth, and NT Forth NCC. Marcel Hendrix provided the iForth and eForth results.

References

Bad95
Wil Baden. Pinhole optimization. Forth Dimensions, 17(2):29-35, 1995.

Ert93
M. Anton Ertl. A portable Forth engine. In EuroFORTH '93 conference proceedings, Mariánské Láznè (Marienbad), 1993.

Ert96
M. Anton Ertl. Implementation of Stack-Based Languages on Register Machines. PhD thesis, Technische Universität Wien, Austria, 1996.

Pay91
Bernd Paysan. Ein optimierender Forth-Compiler. Vierte Dimension, 7(3):22-25, September 1991.

Ros86
Anthony Rose. Design of a fast 68000-based subroutine-threaded Forth with inline code & an optimizer. Journal of Forth Application and Research, 4(2):285-288, 1986. 1986 Rochester Forth Conference.

Sch92
Udo Schütz. Optimierung von Fadencode. In FORTH-Tagung, Rostock, 1992. Forth Gesellschaft e.V.


Anton Ertl