FLK introduction

History and Objective

FLK started as an experiment during my 1997 summer holidays. I tried to improve the performance of FPC by using inlined assembler code. Within two evenings the basic concept that FLK uses was implemented.

Then the new semester began and there was little time for further work. Until February 1998 almost nothing happened except that I wrote the assembler. Within the next two months the system grew and became a fully ANS (draft) compatible FORTH.

I now want FLK to be a fast standard system. It is meant to be an experiment in both meta compilation and code generation, but it should be a fully functional standalone FORTH too.

Since most of my work revolves around neural networks, fast floating point support is necessary. Together with vector and matrix operations and some visualization tools, FLK could become a good system for experiments in that field.

A different field that I'm interested in is symbolic computation. Sooner or later FLK will contain a computer algebra tool. It therefore has to be fast in non-floating point calculations too.

Start-up and user interface

To start FLK, execute the program flk. Any command line options are interpreted as file names to be INCLUDED.

When FLK is up and running you can enter words to execute. To include a file use S" filename.ext" INCLUDED or INCLUDE. The latter lets you input a filename using a different history list and completer.

A history list is accessible from the word ACCEPT only. Pressing the up- or down-arrow key cycles through your previous inputs. Once you are sure that this is the text you want, press the return key: your text is appended to the history list and the word returns.

A completer is a special word to save you typing. Two completer words are implemented: One for the normal FORTH command line and one for filenames. If you want to learn to implement one yourself look in the files flkinput.fs and flktools.fs for the existing completers.

To activate the completer while ACCEPT is running, press the Tabulator or Set-Tabulator key. On the FORTH command line the completer looks at the beginning of the word the cursor is in or behind. All words in the current search order that start with this beginning are collected and the longest common string of their names is generated. This string replaces the beginning in the input line. If no word has this beginning, one alert is produced; if more than one word has it, two alerts are produced.

The filename completer takes the whole input line and behaves similarly to tcsh's completer. It tries to expand the path level by level, using the same longest-common-string method as the command line completer. Two alerts are produced if more than one file in the directory matches, one if none matches.
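The longest-common-string step that both completers rely on can be sketched in standard Forth. This is not the code from flkinput.fs or flktools.fs; the names common-prefix and demo are invented for illustration, and only two candidate names are compared (for more candidates the result would be folded over the whole list).

```forth
\ Length of the longest common prefix of two strings (illustrative sketch).
: common-prefix ( c-addr1 u1 c-addr2 u2 -- n )
  ROT MIN                      \ c-addr1 c-addr2 n   n = length of the shorter string
  DUP 0 ?DO                    \ compare character by character
    2 PICK I + C@              \ character I of the first string
    2 PICK I + C@              \ character I of the second string
    <> IF DROP I LEAVE THEN    \ first mismatch: the prefix has length I
  LOOP
  NIP NIP ;                    \ drop both addresses, keep the length

\ Two candidate names sharing the beginning "2D".
: demo ( -- )  S" 2DUP" S" 2DROP" common-prefix . ;
demo   \ prints 2
```

A completer built on this would replace the typed beginning with the first n characters of any candidate.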

Upon startup all copies of flkkern (flk is one of them) search for a system image to load in four places:

  1. The current executable in the current directory. You can produce standalone executables this way.
  2. The current executable in the installation directory. This is necessary because argv[0] does not contain the path to the program.
  3. The file flk.flk in the current directory. This is a way to store e.g. project specific systems under the same name.
  4. The file default.flk in the installation directory as noted in the Makefile. This is meant to be a default system if no file or directory specific images could be found.

Compiler basics

FLK is an optimizing native code compiler. It acts similarly to the so-called nano-compilers, where every word knows how to compile itself. In FLK only a few words, the so-called optimizers or primitives, can do this. Any word that isn't a primitive doesn't contain code to compile itself. Such words are compiled the usual way by COMPILE,.

With version 1.2 the so-called level 2 compiler words were introduced. They try to fold more than one word into less machine code than compiling the words separately would produce. Since they have access to the last few literals (including CONSTANTs and CREATEd words), these literals can be included directly in the code instead of loading a register and then working with that register.

The return stack is addressed by esp, the data stack by ebp. Since indexed access using ebp requires an offset value anyway, this is the first opportunity to save time. Instead of increasing and decreasing ebp itself, the offset is increased and decreased. Each access to the stack then needs one add operation less. Before calling another word or returning from this word, the accumulated offset has to be added to ebp.

Control-flow words like IF or DO have to save the offset to ebp and words like THEN or LOOP restore the value by adding the difference to ebp.
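The offset bookkeeping described above can be modelled in a few lines of standard Forth. This is only a toy sketch, not FLK's code generator: the names pending-offs, item-pushed, item-popped and flush-offs are invented for illustration, and a downward-growing data stack is assumed.

```forth
\ Toy model: track the pending ebp adjustment instead of emitting an add each time.
VARIABLE pending-offs   0 pending-offs !          \ bytes not yet added to ebp

: item-pushed ( -- )  -1 CELLS pending-offs +! ;  \ assumption: stack grows downward
: item-popped ( -- )   1 CELLS pending-offs +! ;

\ Called before a call or return: n is what a single "add ebp, n" would apply.
: flush-offs ( -- n )  pending-offs @  0 pending-offs ! ;

item-pushed item-pushed item-popped
flush-offs 1 CELLS / .   \ prints -1 : net effect is one cell pushed
```

Three stack movements end up as a single adjustment, which is the saving the text describes.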

The next possible optimization is to keep the top few items of the data stack in the CPU registers to reduce fetch and store operations. Since every word has a different number of accepted and produced items a defined state has to be reached at the beginning of each word. In this state eax caches the top of stack item and no other registers (except ebp and esp) have a defined meaning.

Each primitive first resets the register allocator and then requests the stack items and free registers (in that order) it needs, performs its operation and finally marks the requested registers free or puts free registers onto the stack.

One important point to mention is that each saved image contains a relocation table. This table contains the addresses of cells whose contents have to be corrected relative to the memory address of the first byte of the image. The contents of these cells are absolute addresses. Words are provided for the handling of relocation issues.
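The idea of a relocation table can be illustrated with a small model in standard Forth. It is a sketch under assumptions, not FLK's image format or its relocation words: the names image, reloc-table, #relocs and relocate are hypothetical, and the listed cells are assumed to hold image-relative offsets until the base address is added.

```forth
CREATE image  4 CELLS ALLOT                  \ toy image of four cells
CREATE reloc-table  1 CELLS ,  3 CELLS ,     \ offsets of the cells that hold addresses
2 CONSTANT #relocs

\ Add the load address to every cell listed in the table.
: relocate ( base -- )
  #relocs 0 ?DO
    DUP reloc-table I CELLS + @              \ base base offset-of-cell
    image + +!                               \ cell now holds an absolute address
  LOOP DROP ;

42       image 0 CELLS + !                   \ plain data, not relocated
2 CELLS  image 1 CELLS + !                   \ refers to cell 2 of the image
7        image 2 CELLS + !                   \ plain data, not relocated
0 CELLS  image 3 CELLS + !                   \ refers to cell 0 of the image

image relocate
image 1 CELLS + @ @ .   \ prints 7
image 3 CELLS + @ @ .   \ prints 42
```

After relocation the two listed cells can be followed with @ like any other address, while the data cells are untouched.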

Adding your own primitives

This section describes the creation of primitives using the word COUNT as an example. The only way to compile a primitive is to put it into the file flkprim.fs. If you want to write a compiling word without interpretation semantics, it is better to program an immediate word and throw an exception if interpreting.

COUNT can be written as the colon definition:

 : COUNT ( caddr -- caddr+1 len ) DUP CHAR+ SWAP C@ ; 

As a primitive it is written as:

 p: COUNT ( caddr -- caddr+1 len )
  regalloc-reset
  req-any
  req-free
  free0 free0 xor,
  0 [tos0] free0l mov,
  tos0 inc,
  0 free>tos ; 

The line p: COUNT ( caddr -- caddr+1 len ) defines the primitive and documents the stack effect. Only one space before the name of the primitive is allowed. Tabs are allowed after the name only if a space immediately follows the name.

The first thing to do is to reset the register allocator using regalloc-reset. Now we request one item from the stack and one free register by req-any req-free. Then the actual code generation starts.

  free0 free0 xor,
  0 [tos0] free0l mov,
  tos0 inc, 

The byte at caddr is fetched into the cleared free0 meta register, and the address in tos0 is then incremented. Which register is hidden behind free0 is not important. Neither the user nor the programmer needs to know it.

The last line puts the free0 register on top of the stack.

Other control words for the register allocator can be found in flkprim.fs in the definitions of the other primitives.

Adding your own level 2 compilers

This section contains the description of a level 2 compiler (found in flkopt.fs).

Each level 2 optimizer consumes zero items and produces no items either. To declare an optimizer edit flkopt.fs for optimizers that work in host and target or flktopt.fs for those that only run in the target.

The first thing to do is to declare the sequence to optimize away: opt( ''# '' + '' @ )opt: does this. This optimizer is declared for the sequence number, addition, fetch. Whenever this sequence is found, the following code is executed instead of the individual optimizers of these words.

The rest of the word is very similar to a primitive declaration. There are three exceptions: You have to delete the optimized words at the end of the word and you have to get or set the actual value of the number parameter. How to do this is shown in the code snippets below.

opt( ''# '' + '' @ )opt: 
    ( Get the actual value and a flag telling if it is an address. )
    0 opt-getlit 			\ x rel?
    ( Normal code generation. )
    regalloc-reset
    req-any 				\ tos0=offs
    ?+relocate
    [tos0] tos0 mov, 
    ( All items used up. )
    0 3 opt-remove
    ;opt
    
opt( ''# ''# '' + )opt: 
   ( get left parameter to + )
   1 opt-getlit 			\ x1 rel1 
   ( get right parameter to + )
   0 opt-getlit 			\ x1 rel1 x0 rel0
   ( If one is an address, result is an address too. )
   ROT OR -ROT 				\ rel x1 x0 
   ( Perform the actual calculation. )
   + SWAP 				\ x rel
   ( Store it back into the cache. )
   0 opt-setlit
   ( Delete the words optimized away from the cache. )
   1 2 opt-remove 
;opt
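To make the two patterns concrete, here is what source code matching them might look like in standard Forth. The words y-offs, point, y@ and seven are invented for illustration. Which machine code FLK actually emits for them is not shown here; the comments only name the sequence each definition contains.

```forth
1 CELLS CONSTANT y-offs             \ a CONSTANT counts as a literal for level 2 compilers
CREATE point  10 ,  20 ,            \ two cells: x and y

\ Sequence "number + @": a candidate for the first optimizer.
: y@ ( addr -- n )  y-offs + @ ;

\ Sequence "number number +": a candidate for the second optimizer,
\ which would replace both literals by their sum.
: seven ( -- n )  3 4 + ;

point y@ .   \ prints 20
seven .      \ prints 7
```

The results are the same with or without the optimizers; only the generated code differs.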

Unexpected behaviour

This section is meant to be a warning. FLK is still a construction site. You could end up having your car buried under a pile of dirt if you don't drive carefully. :-)

But seriously, some of the mistakes made by users (and programmers) are not reported at the moment. Some of them never will.

Among these unreported errors are data stack over- and underflows, return stack over- and underflows and floating point stack overflows. Some of them can produce unexpected or wrong results, some of them cause segmentation faults.

A more detailed list of ambiguous conditions is given elsewhere in the FLK documentation.

Benchmarks

To summarize this section: 63 % of all statistics are faked. 17 % of all people know that.

Seriously: I used the benchmarks of Anton Ertl's benchmark suite to compare the speed of FLK with that of gforth. You can indirectly compare several other systems with FLK at Anton Ertl's performance web page.

The following sections describe the benchmark programs, list the times of gforth and FLK and explain which optimizers have been implemented to achieve the speed-up.

The test system was a 133 MHz Pentium without MMX running Linux kernel 2.0.30 and KDE. All times can differ a bit due to limited timer resolution and CPU load. CPU usage was between 97 and 99 % in all tests.

The initial state had no optimizers except those combining OVER or 2DUP, relational operators and IF and WHILE to allocate fewer registers and to avoid generating an intermediate flag on the stack. All other changes are incremental. These tests were performed with version 1.2 but apply to later versions too.

Sieve

What is a benchmark without the sieve of Eratosthenes? The implementation in sieve.fs is a straightforward one: two nested loops to check and clear the flags. To get a reasonable time, the search for all primes between 1 and 8190 is repeated 1000 times. The most frequently used words are: I C@ C! DO +LOOP DUP

Time of FLK in sec.   Speed factor (gforth: 14.37 sec)   Optimization
3.94                  3.6                                initial test
3.9                   3.6                                DUP and +LOOP combined (1 register less used)
3.7                   3.9                                LOOP (short jumps when possible instead of always near jumps)
2.4                   6                                  I (esp access using SIB addressing instead of exchanges and ebp access using MOD/RM addressing)

As the second row shows, saving a register when enough of them are available gains very little. Removing unnecessary near jumps gains a bit more because it saves space in the branch predictor of the Pentium. The last change removes at least two AGIs (address generation interlocks) per I in the innermost loop. That gains at least four cycles per loop.

Bubble sort

Another classical benchmark: sorting 6000 random numbers. Implementation: two nested loops. The most frequently used words are: I 2@ > SWAP 2!

Time of FLK in sec.   Speed factor (gforth: 14.54 sec)   Optimization
4.29                  3.4                                initial test
2.72                  5.4                                all opt. above

Fibonacci

This little word has two recursive calls and measures mostly call/return performance.

Time of FLK in sec.   Speed factor (gforth: 17.13 sec)   Optimization
2.35                  7.3                                initial test
2.2                   7.8                                all opt. above
2.16                  7.9                                + changed to SWAP +

The change of + to SWAP + produces code that looks better before an EXIT. The four hundredths of a second saved can be blamed on the timer tolerance.
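For reference, a doubly recursive Fibonacci word of the kind described above could look like this in standard Forth. It is a sketch and not necessarily the exact source used in the benchmark suite; the names fib and fib-swap are chosen for illustration. The second definition shows the "+ changed to SWAP +" variant, which computes the same value because addition is commutative.

```forth
\ Doubly recursive Fibonacci, with fib(0) = fib(1) = 1.
: fib ( n -- m )
  DUP 2 < IF DROP 1 EXIT THEN
  DUP 1- RECURSE              \ n fib(n-1)
  SWAP 2 - RECURSE            \ fib(n-1) fib(n-2)
  + ;

\ Same word with the final + replaced by SWAP +.
: fib-swap ( n -- m )
  DUP 2 < IF DROP 1 EXIT THEN
  DUP 1- RECURSE
  SWAP 2 - RECURSE
  SWAP + ;

10 fib .        \ prints 89
10 fib-swap .   \ prints 89
```

Almost all the time in such a word is spent in calls and returns, which is why the benchmark mainly measures call/return performance.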