Chapter 8: The Dictionary

A Definition's Memory Image

Just like in ANS Forth, the strongForth dictionary is a list of definitions. Each definition consists of several components, e. g., the definition's name and attributes, parameters and a pointer to the next definition. In strongForth, a definition additionally contains its stack diagram. Otherwise, the interpreter and the compiler wouldn't be able to perform type checking and to distinguish overloaded words.

As you might remember from chapter 4, strongForth provides several distinct memory spaces:

Now, the data that constitutes a definition is actually distributed over three memory spaces, using pointers to link the various components. The name space contains everything that is required to compile a definition. Virtual machine code is stored in the constant data space and real machine code is stored in the code space. While each definition has a unique set of data in the name space, several definitions may share the same virtual machine code, and different blocks of virtual machine code might share the same real machine code. For example, all colon definitions use the same machine code to interpret their virtual machine code.

Let's begin with the those components of a definition that are stored in the name space:


name field

link field
attribute field
token field

input parameter field


output parameter field

Unless a definition is not explicitly defined by :NONAME, it has a name and a pointer to the previous definition in the dictionary. The name is just a counted string, i. e., a byte indicating the length, followed by the characters of the string. The link field contains an address that points to the name field of the previous definition that has a name field. The link field of the first definition in strongForth's dictionary contains a null pointer.

An item of data type DEFINITION may be casted to an item of data type FAR-ADDRESS to obtain a pointer to the attribute field of the definition. It's not a pointer to the name field, because there are some definitions without a name field and a link field. The attribute field is always present. It has the size of a single cell and consists of various bit fields whose contents provides additional information about the definition:

Unused I S N Size

Unused:
I:
S:
N:
Size:
reserved for implementation defined attributes
immediate attribute
smudge attribute
noname attribute
size of the stack diagram

For normal words like DUP and 1+, the immediate attribute is 1. If it is 0, the word is immediately executed even in compilation mode. Typical words of this category are IF and [CHAR]. Note that the semantics of the immediate attribute is inverted with respect to the expectations, i. e., words which are immediately executed in compilation state have a 0 in this bit field. This should not present a common cause of error, because the immediate attribute is usually set and queried by special words with an unambiguous name.

The smudge attribute prevents that definitions are actually used before their compilation is complete. It is automatically set when a new definition is created and reset after the compilation of the definition is finished. As long as the smudge attribute is set, the definition will not be recognised by FIND and by other words that search the dictionary.

The noname attribute marks all words that have no name field and no link field. Those are the words defined by :NANAME.

Finally, the size attribute is the size of the definition's stack diagram. It contains the number of basic data types that constitute the stack diagram. For example, in

DROP ( SINGLE -- )

the size attribute is 1. The definition

! ( SINGLE DATA -> 1ST -- )

has a stack diagram consisting of 3 basic data types. It's size attribute is 3.

The next field in the name space is the token field. It contains a pointer of data type CONST to the definition's code field in the constant data space. This pointer is the execution token of the definition. Whenever a word is compiled as part of a new colon definition, it's token is added to the virtual machine code. Colon definitions are actually stored as lists of tokens.

After the token field, the stack diagram of the definition is stored in the name space. The stack diagram is just a list of item's of data type DATA-TYPE as described in chapter 7. It is divided into two sections, the input parameter field and the output parameter field. Since a definition might have no input parameters, no output parameters, or no parameters at all, one of these sections or the complete stack diagram might be empty. If a definition has no stack diagram, its size attribute is 0.

Those parts of the dictionary, which are stored in the name space, are only required by the interpreter and the compiler. During execution of a word, all these data are irrelevant. In an embedded system that just executes already compiled code and does not compile or interpret any strongForth source code, it's often possible to completely remove the name space. Usually, this will save a lot of memory.

As mentioned before, a definition's token field contains a pointer to a part of the definition that is located in the constant data space:

code field

data field

This part consists of the code field and the variable-length data field. The code field contains a pointer of data type CODE, which is the address of the definition's machine code. Whenever the definition is executed, this machine code is actually executed. Pure machine code words like SWAP or + do not need anything else to execute. They do not have a data field. On the other hand, colon definitions or definitions generated by other defining words like VARIABLE and PROCREATES need some more information, because the definitions from each of these categories share the same machine code. This additional information is stored in the data field, and is accessed by the machine code. For example, the machine code of colon definitions is an interpreter for the virtual machine code contained in the data field. The data field of a definition created by VARIABLE contains a pointer to the location of the variable's value within the data space. The variable's value is not located directly in the data field, because the data field is in the constant data space, which is assumed to be read-only during runtime.

Finally, the machine code of a definition is stored in code space:


machine code

As a short summary, each definition has an attribute and a token field in the name space. Optionally, the name space might contain a name and link field and, if the definition has a stack diagram, input and/or output parameter fields. The token field contains a pointer to the code and data fields, which are located in the constant data space. Different definitions may share the same code and data fields. The code field contains a pointer to the definition's machine code. The same machine code can be used by multiple definitions.

Creating a Definition

Not surprisingly, most components of a definition are created by CREATE:

: CREATE ( -- )
  ?EXECUTE
  CONST-HERE [ 7 BIT 6 BIT OR ] LITERAL (CREATE)
  CONST-CELL-ALLOT ;

CREATE in turn takes advantage of another word (CREATE) that does most of the job. It just has to provide (CREATE) with the token of the new definition, which is the address of the next available cell in the constant data space, and an initialization value for the attribute field. After (CREATE) has done it's work, CREATE allocates one cell for the code field. (CREATE) actually creates a definition for an already existing code field, while CREATE additionally allocates a new code field. Most defining words just use CREATE. However, in those cases where you want to reuse an existing code field, for example to define an alias of an already existing word, (CREATE) is quite useful.

: (CREATE) ( CONST LOGICAL -- )
  SPACE@ NAME-SPACE OVER [ 5 BIT ] LITERAL AND 0=
  IF HERE PARSE-WORD ", ALIGN LINK,
  THEN LATEST!
  SWAP , \ attribute field \
  SWAP , \ token field \
  SPACE! ;

(CREATE) expects an address of data type CONST as the first input parameter. This address becomes the token of the new definition. The value of the second input parameter of data type LOGICAL becomes the content of the attribute field.

Since (CREATE) compiles stuff into the name space, it has to save the current memory space on entry and restore it at the end. SPACE@ and SPACE! were already presented in chapter 4. The noname bit in the attribute field determines whether the new definition gets a name and a link field. If this bit is not set, (CREATE) remembers the location where the definition starts and then uses PARSE-WORD ", to parse the definition's name and compile it into the name field. ALIGN is required because ", leaves the memory space pointer unaligned. LINK, and LATEST! compile the link field and update the pointer to the latest definition, respectively. These two words will be described below in more detail.

CREATE provides the initialization value of the attribute field with immediate and smudge attributes. This means, that the new definition is a non-immediate word by default. The smudge attribute remains set until the definition is complete. Because the noname attribute is clear, CREATE always creates the name and the link field. But there are other defining words, like :NONAME, which do not create these two fields and which use (CREATE) with the noname attribute in the second input parameter. The size of the stack diagram cannot yet be determined at this point.

A definition's link field contains a pointer to the name field of the previous definition. Establishing a linked list of definitions allows words like FIND to search the dictionary for a specific definition.

: LINK, ( ADDRESS -- )
  LINK @ , LINK ! ;

LINK, fetches the link to the previous definition and compiles it into the link field of the current definition. Afterwards, it stores the input parameter of LINK, into the variable LINK to be available for compiling the link field of the next definition. Remember that the input parameter of LINK, is the name field address of the definition created by (CREATE).

LINK ( -- DATA -> ADDRESS )

Although LINK looks like an ordinary system variable that contains an item of data type ADDRESS, it has a slightly different semantics. In fact, this variable is an attribute of the name space and the local data space, because these two memory spaces are the ones where definitions may be created. LINK contains the address of the name field of the most recent definition within the current memory space. This address is undefined if the current memory space is not either the name space or the local data space. You'll learn more about the additional local dictionary in chapter 12.

StrongForth provides another variable that also contains a pointer to the recently created definition. LATEST is defined as a value containing the latest definition:

NULL DEFINITION VALUE LATEST

Having an easy access to the currently compiled definition is often very useful. As mentioned before, an item of data type DEFINITION contains the full address of a definition's attribute field. Now, after (CREATE) has compiled the name and link fields, the next available address of the name space is the attribute field address. Casting this address to an item of data type DEFINITION yields the definition itself. LATEST! does nothing else but using this address to update LATEST:

: LATEST! ( -- )
  FAR-HERE CAST DEFINITION TO LATEST ;

Other than in ANS Forth, the strongForth version of CREATE does not create a complete definition. It only compiles the name field, the link field and the token field, and reserves space for the attribute field and the code field. The input and output parameter fields are compiled when the definition's stack diagram is specified. Furthermore, the size of the stack diagram determines the content of the Size bit field within the attribute field. All these information is usually not available at the time when CREATE is executed, which is typically at the beginning of a defining word.

So let's have a look at some typical defining words. A very simple defining word is CODE:

: CODE ( -- CODE STACK-DIAGRAM )
  CREATE CODE-HERE NULL STACK-DIAGRAM ;

Note that CODE is overloaded with both a defining word and a data type identifier. As expected, CODE starts with CREATE. It then obtains a pointer to the first unused location of the code space, which will be the start address of the definition's machine code. Finally, it pushes a null stack diagram onto the stack. Now, what's still missing in order to make a complete definition? Well, it's the stack diagram, the machine code, the attribute field and the code field. Pure machine code words do not have a data field.

Chapter 7 already explained how a stack diagram for a machine code word is generated. Starting with the null stack diagram, the input and output parameters are collected and finally compiled into the name space. Remember that the last cell CREATE compiles into the name space is the token field, so the next available address in the name space actually is the address of the input parameter field. If no stack diagram is specified for the new definition, nothing more is compiled into the name space, and the null stack diagram remains unchanged. Here's a simple example of creating a machine code word:

CODE DROP .S
CODE STACK-DIAGRAM  OK

CODE and STACK-DIAGRAM are the two items CODE leaves on the stack. The code address is the next available address in the code space and the stack diagram is empty:

DUP CAST DOUBLE .
0  OK

After the stack diagram has been specified, the second one of these two items changes:

( SINGLE -- ) .S
CODE STACK-DIAGRAM  OK
DUP CAST DOUBLE .
33  OK

The item of data type STACK-DIAGRAM now contains the information that a stack diagram with one parameter has been compiled. 33 is the decimal value of a stack diagram with 1 in the offset attribute and an additional output parameter attribute.

Next step is the compilation of machine code words:

AX POP, NEXT,

AX POP, compiles machine code to pop one cell off the data stack. This is the sole semantics of DROP. NEXT, is a macro that compiles strongForth's inner interpreter. Almost all machine code words end with this macro, just like machine code subroutines must end with a return instruction.

That's all? No, not really. You've compiled the input parameter field and the machine code, but the smudge attribute is still set, the code field is empty and there are two items of data types CODE and STACK-DIAGRAM on the stack. To finalise the machine code word, you have to enter

END-CODE
 OK

END-CODE clears the smudge attribute, fills the code field, and disposes two items on the data stack. The output parameter field remains empty, because DROP does not have any output parameters. Here's the definition of END-CODE:

: END-CODE ( CODE STACK-DIAGRAM -- )
  LATEST CAST FAR-ADDRESS -> STACK-DIAGRAM
  DUP @ DT-INPUT INVERT AND SWAP ! DROP
  LATEST CAST FAR-ADDRESS -> CONST -> CODE 1+ @ ! ;

Remember that LATEST delivers the currently compiled definition and that an item of data type DEFINITION may be casted to the address of the definition's attribute field. The value to be stored in the attribute field is composed of the size of the stack diagram, plus the noname, smudge and immediate attributes. The smudge attribute, which is at bit 6 of the attribute field, needs to be cleared, because the definition is now going to be completed. Everything else in the attribute field remains unchanged.

Next, END-CODE calculates the address of the code field and stores a pointer to the machine code in it. The code field address is contained in the token field, which is located next to the attribute field of the definition.

The usage of two type casts makes it obvious that END-CODE hides implementation details of definitions. The same statement applies to the words described in the next section, which access the various fields of a definition as well. However, by using these words instead of directly accessing the various components of definitions, you can easily get around those details and consider a definition as an abstract data type.

The Abstract Definition

Given an item of data type DEFINITION, how can you access the various components, i. e. obtain the information that is stored in them? Let's start with the name field. StrongForth offers the word NAME to access the name of a definition:

: NAME ( DEFINITION -- CFAR-ADDRESS -> CHARACTER UNSIGNED )
  ?NONAME CAST CFAR-ADDRESS -> CHARACTER 4 -
  BEGIN DUP @ BL >
  WHILE 1-
  REPEAT DUP 1+ SWAP @ CAST UNSIGNED ;

NAME throws an exception if the definition does not have a name field. Since an item of data type DEFINITION actually is the address of the attribute field, NAME starts at this address and traverses back through the characters of the name string until it finds the length byte. The length byte is clearly distinguished from any character, because it's value is always less than 32, which is the ASCII code of a space character. The first address where the length byte might be is 4 bytes below the attribute field. The address of the length byte is the name field address. This address is then converted into the address of the first character and the string length, which constitute the output parameters of NAME.

The link field contains a pointer to the name field of the previous definition. StrongForth provides the word PREV to access the previous definition in the dictionary:

: PREV ( DEFINITION -- 1ST )
  ?NONAME CAST FAR-ADDRESS -> ADDRESS DUP 1- @ DUP
  IF SWAP SPLIT NIP MERGE
     CAST CFAR-ADDRESS -> UNSIGNED NAME>DEFINITION
  ELSE DROP DROP NULL DEFINITION
  THEN ;

PREV, just like NAME, throws an exception if the definition was created by :NONAME. Starting at the attribute field, PREV steps one cell back to fetch the contents of the definition's link field. Let's first assume that this is not the first definition in the dictionary, where the link field contains zero. Since the link is an address of data type ADDRESS, it has to be combined with the memory segment or bank of the name space to get the full address of the previous definition's name field. But this is still not the desired result. To get from there to the attribute field, PREV has to skip the name field and the link field. This final task is performed by the word NAME>DEFINITION:

: NAME>DEFINITION ( CFAR-ADDRESS -> UNSIGNED -- DEFINITION )
  DUP @ DUP 31 >
  IF DROP CAST DEFINITION
  ELSE + 1+ ALIGNED 1 CELLS + CAST DEFINITION
  THEN ;

NAME>DEFINITION expects the address of the name field on the stack. It adds the length of the string, plus one for the location that contains the length byte, and then gets the address aligned to cell size. The address now points to the link field. After skipping the link field, NAME>DEFINITION returns the definition, i. e., the address of the attribute field,

But NAME>DEFINITION also works if its input parameter is the address of a noname definition's attribute field instead of a name field. Since definitions with no name field have bit 5 set in the attribute field, the false length byte has a value of 32 or greater. In this case, NAME>DEFINITION just needs to cast its input parameter to a definition. What is this feature useful for? PREV does not take advantage of it, because noname definitions are not linked in the dictionary. The answer will be given at the end of this section, where PREV's counterpart NEXT is presented.

Now, what if PREV is applied to the first word in the dictionary? The link field of the first word is zero, and it makes no sense at all to continue. In this case, PREV simply returns a null definition to indicate that there is no previous word.

A perfect illustration of how NAME and PREV can be applied is the word WORDS, which belongs to the ANS Forth Tools word set. In strongForth, WORDS has an extended semantics. If supplied with the name of a word as a parameter, WORDS displays only those words in the dictionary with the given name. Otherwise, all words in the dictionary are displayed. As you've seen several times in the previous chapters, WORDS is quite useful for finding all overloaded versions of a word. Here's the definiton:

: WORDS ( -- )
  PARSE-WORD LOCALS| COUNT ADDR | LATEST
  BEGIN DUP 0= INVERT
  WHILE COUNT
     IF DUP ADDR COUNT ROT NAME COMPARE 0=
     ELSE TRUE
     THEN
     IF DUP . CR
     THEN PREV
  REPEAT DROP ;

WORDS stores the parsed name in two locals and then iterates through the dictionary, starting with the latest definition. Within the loop, the current definition is printed if it's name matches the parsed name, or if the parsed name was empty. The loop exits when a null definition is encountered, i. e., after the first definition in the dictionary has been processed.

The attribute field is the next field within the name space. It's contents can easily be queried by using the strongForth words IMMEDIATE?, SMUDGE?, NONAME? and #PARAMS. IMMEDIATE?, SMUDGE? and NONAME? query the noname, smudge and immediate attribute, respectively. NONAME? and SMUDGE? return FALSE if the corresponding bit is 0, and TRUE if the bit is 1. With IMMEDIATE?, it's exactly the other way around. Remember that the immediate attribute is inverted with respect to the immediate property of a definition. #PARAMS returns the content of the Size bit field, which is actually the length of the definition's stack diagram, or zero if the smudge attribute is set.

: IMMEDIATE? ( DEFINITION -- FLAG )
  CAST FAR-ADDRESS -> LOGICAL @ [ 7 BIT ] LITERAL AND 0= ;

: SMUDGE? ( DEFINITION -- FLAG )
  CAST FAR-ADDRESS -> LOGICAL @ [ 6 BIT ] LITERAL AND 0<> ;

: NONAME? ( DEFINITION -- FLAG )
  CAST FAR-ADDRESS -> LOGICAL @ [ 5 BIT ] LITERAL AND 0<> ;
  
: #PARAMS ( DEFINITION -- UNSIGNED )
  CAST FAR-ADDRESS -> DATA-TYPE @ DUP DT-INPUT ATTRIBUTE?
  IF DROP 0 ELSE OFFSET THEN ;

The definition of #PARAMS takes advantage of the fact that the format of the combination of attribute field and token fiels resembles the format of an item of data type DATA-TYPE. On items of data type DATA-TYPE, the words ATTRIBUTE? and OFFSET can be applied. The relation between the bit fields of items of data type DATA-TYPE and the attribute field of a definition are as follows:

DATA-TYPE attribute field
prefix attribute immediate attribute
input parameter attribute smudge attribute
output parameter attribute noname attribute
offset attribute size of the stack diagram

The same programming trick is used in the definition of INNEDIATE. IMMEDIATE is an ANS Forth word that makes the most recent definition an immediate word. In strongForth, it just toggles the immediate attribute.

: IMMEDIATE ( -- )
  LATEST CAST FAR-ADDRESS -> DATA-TYPE
  DUP @ DT-PREFIX INVERT AND SWAP ! ;

Another word changes the content of the size bit field by adding the offset attribute of an item of data type STACK-DIAGRAM to it. Because this operation usually has to be performed immediately after compiling the input and output parameter fields of a definition, this word is called END-DIAGRAM:

: END-DIAGRAM ( STACK-DIAGRAM -- )
  OFFSET LATEST CAST FAR-ADDRESS -> STACK-DIAGRAM DUP @
  ROT OFFSET+ SWAP ! ;

You've already seen an application of END-DIAGRAM in the definition of ) in chapter 7. Since END-DIAGRAM is frequently used together with END-CODE, END-DEFINITION combines these two words:

: END-DEFINITION ( STACK-DIAGRAM CODE -- )
  SWAP DUP END-DIAGRAM END-CODE ;

Typical applications of END-DEFINITION are defining words that implicitely compile stack diagrams, like CONSTANT and VARIABLE. This is in contrast to explicitly compiling a stack diagram with ( ... -- ... ).

StrongForth provides two additional words that throw exceptions if a definition has certain attributes:

: ?NONAME ( DEFINITION -- 1ST )
  DUP NONAME? IF -267 THROW THEN ;

: ?SMUDGE ( DEFINITION -- 1ST )
  DUP SMUDGE? IF -268 THROW THEN ;

These two words are quite useful if further processing relies on the absence of these attributes. You already know ?NONAME from the definitions of NAME and PREV. Any attempt to access the name field or the link field of a definition that does not have these fields can be prevented by including ?NONAME at the beginning the code.

After the attribute field comes the token field. In order to obtain a pointer to a definition's code field, >CODE simply fetches the execution token and casts it to an item of data type CONST -> CODE. Remember that the pointer to the code field is identical to the definition's token.

: >CODE ( DEFINITION -- CONST -> CODE )
  CAST FAR-ADDRESS -> CONST 1+ @ -> CODE ;

>BODY is very similar to >CODE. It returns a pointer to the definition's data field, which is next to its code field. Since the contents of the data field may vary between different words, >BODY just returns an unspecified pointer of data type CONST. Note that >BODY is an ANS Forth word, while >CODE is not.

: >BODY ( DEFINITION -- CONST )
  >CODE 1+ CAST CONST ;

Finally, PARAM@ grants you access to a definition's input and ourput parameter fields. Given a definition and a zero-based index into it's stack diagram, PARAM@ returns the selected basic data type as an item of data type DATA-TYPE. PARAM@ has already been presented in chapter 7. It calculates the address of the first parameter by skipping the attribute and token fields. For that purpose, these two fields are again interpreted as one item of data type DATA-TYPE.

: PARAM@ ( DEFINITION UNSIGNED -- DATA-TYPE )
  SWAP CAST FAR-ADDRESS -> DATA-TYPE 1+ SWAP + @ ;

As promised at the beginning of this section, here's the definition of NEXT, the counterpart of PREV:

: NEXT ( DEFINITION -- 1ST )
  DUP LATEST <>
  IF ?SMUDGE DUP CAST FAR-ADDRESS -> DATA-TYPE SWAP #PARAMS 1+ +
     CAST CFAR-ADDRESS -> UNSIGNED NAME>DEFINITION
  ELSE DROP NULL DEFINITION
  THEN ;

Starting at the attribute field of a definition, NEXT advances to the next definition by skipping the attribute field, the token field, and the input and output parameter fields, provided that the definition is not unfinished. The result is the address of the next definition's name field, so all that's missing is now for NAME>DEFINITION to skip the name field and the link field of the next definition. If a definition has no successor, which means it's the latest definition, NEXT just returns a null definition.

Now it also becomes clear why NAME>DEFINITION's capability to deal with noname definitions is useful. It returns the correct result even if the next definition has no name field.

Searching the Dictionary

ANS Forth provides the word FIND for searching the dictionary for the latest occurrence of a word with a given name:

FIND ( c-addr -- c-addr 0  |  xt 1  |  xt -1 )

The stack diagram of strongForth's version of FIND looks different:

FIND ( CDATA -> CHARACTER UNSIGNED SINGLE UNSIGNED -- DEFINITION SIGNED )

What are the actual differences? First of all, strongForth has abandoned counted strings, i. e. strings with a leading length byte in memory. Therefore, FIND expects the name of the word as a pointer to the first character and a separate parameter for the character count:

CDATA -> CHARACTER UNSIGNED

Secondly, strongForth requires that stack diagrams are unique at compilation time. Alternative stack diagrams are not allowed. In ANS Forth, the stack diagram of FIND depends on the search result, which is unknown at compile time. It was therefore necessary to modify the output parameters of FIND. StrongForth's version of FIND always returns an object of data type DEFINITION and a signed integer to indicate the search result. If FIND is successful, it returns the definition and +1 or -1, depending on whether the definition is an immediate word or not. If FIND is not successful, it returns a null definition and 0. Note that FIND returns the definition itself instead of the execution token. As explained in the previous section, the token can easily be derived from a definition, because the token of a definition is just the address of it's code field.

SIGNED Search result
0 definition not found
+1 definition found, immediate word
-1 definition found, no immediate word

Finally, what about the two additional parameters of data types SINGLE and UNSIGNED? Since strongForth supports operator overloading, finding words in the dictionary can be a pretty complicated task. The two parameters are used to distinguish between different variants of the dictionary search. Generally, UNSIGNED specifies the type of the dictionary search, while SINGLE may in certain cases contain an additional parameter. The matching criteria as specified by UNSIGNED and SINGLE are considered in addition to the name of the word.

If UNSIGNED is 0, no additional search criteria are applied, i. e., FIND finds the latest word in the dictionary with the given name. The value of SINGLE is ignored. This is exactly the semantics of FIND in ANS Forth. A simple application of FIND with no additional search criteria is the word ', which is defined as follows:

: ' ( -- DEFINITION )
  PARSE-WORD 0 0 FIND 0= IF -13 THROW THEN ;

' always returns the latest definition whose name is identical to the one parsed from the input source. Other than ANS Forth, strongForth's version of ' does not return the execution token of a definition. To obtain the token, you have to use 'TOKEN:

: 'TOKEN ( -- TOKEN )
  ' >CODE CAST TOKEN ;

A very similar word, 'CODE, returns the content of the code field of a definition with a given name:

: 'CODE ( -- CODE )
  ' >CODE @ ;

Note that both 'TOKEN and 'CODE will only work correctly if either the definition name is unique, i. e., if the definition is not overloaded, or you know for sure that you want the latest definition with the given name.

If UNSIGNED is 1, FIND considers the content of the code field as an additional matching criteria. It finds the latest word in the dictionary with the given name, whose code field contains the value specified by SINGLE. This is useful if only words with a specific code field need to be considered. For example, DT tries to find a word that has been defined by PROCREATES, and TO may only find words that have been defined by VALUE. The definition of DT is presented in chapter 9.

If UNSIGNED is 8, the stack diagram of the word to be found in the dictionary has to match a given sample. The sample is to be provided at the top of the local name space, just below the first unused address. SINGLE contains the length of the stack diagram as an unsigned number. Manually creating a stack diagram can be quite a tedious task, as shown in the following sample code

LOCAL-SPACE
 OK
DT LOGICAL DT-INPUT OR ,
 OK
DT UNSIGNED DT-INPUT OR ,
 OK
DT LOGICAL DT-OUTPUT OR 1 OFFSET+ ,
 OK
PARSE-WORD LSHIFT 3 8 FIND . .
-1 LSHIFT ( LOGICAL UNSIGNED -- 1ST )  OK

It's much more convenient to take advantage of strongForth's special tools for creating stack diagrams. In the following definition of the word )', <DIAGRAM and DIAGRAM> are used to process a stack diagram in the common format:

: )' ( MEMORY-SPACE FLAG STACK-DIAGRAM -- DEFINITION )
  <DIAGRAM DUP OFFSET PARSE-WORD ROT 8 FIND
  0= IF -13 THROW THEN ROT ROT DIAGRAM> ;

)' is similar to ', but it allows to additionally specify the stack diagram of the word:

( LOGICAL UNSIGNED -- 1ST )' LSHIFT .
LSHIFT ( LOGICAL UNSIGNED -- 1ST )  OK
( LOGICAL -- 1ST )' LSHIFT .
LSHIFT ( LOGICAL -- 1ST )  OK
( UNSIGNED -- 1ST )' LSHIFT .

( UNSIGNED -- 1ST )' LSHIFT ? undefined word
DEFINITION

However, an exact match of a given stack diagram still does not implement the search criteria of strongForth's interpreter and compiler. The interpreter's and compiler's search criteria are more complex, because they allow a data type to match any of it's subtypes, or a basic data type to match compound data types. Even references to other data types can be provided within stack diagrams and are considered by the interpreter and the compiler. This matching algorithm will be described in detail in the next section.

If UNSIGNED is 4, FIND performs the interpreter's and compiler's matching rules by comparing the stack diagram of each word with the contents of the interpreter or compiler data type heap. In interpretation state, the interpreter data type heap is used. In compilation state, the selection of the data type heap depends on whether the word is immediate or not. The stack diagrams of immediate words are matched against the interpreter data type heap, while those of non-immediate words are matched against the compiler data type heap.

These rules work nicely for the interpreter and the compiler, but not for words like [COMPILE] and POSTPONE. These two words can only be executed in compilation state, because they always compile a word. They require a special rule to always match a stack diagram against the compiler data type heap, no matter whether the word is immediate or not. This special rule is applied if UNSIGNED is 6.

In both of these two cases, UNSIGNED = 4 and UNSIGNED = 6, the chosen data type heap is updated to carry out the data type transformations as determined by the stack diagram of the found word. If, for example, the data type heap contains CODE FLAG UNSIGNED at it's top and FIND finds ROT, these three data types are replaced by FLAG UNSIGNED CODE. Note that the word is not yet executed or compiled by FIND. Only the data type heap is updated.

Are we now done with all those matching criteria? No, there are actually two more variants of UNSIGNED = 4 and UNSIGNED = 6, which are activated by adding 16 to the value of UNSIGNED, giving 20 or 22, respectively. If UNSIGNED is 20 or 22, the matching rules for the input paraemters are the same as for UNSIGNED = 4 and UNSIGNED = 6. However, an additional matching rule is applied to the output parameters. After carrying out the data type transformations for a word with matching input parameters, the output parameters are being compared with those of a sample word whose address is provided by the parameter SINGLE: Actually, SINGLE contains only the lower part of an item of data type DEFINITION that identifies the sample word. If the output parameters of the word as calculated by the data type transformations are not exactly the same as the output parameters of the sample word, the word is rejected and the dictionary search is continued. An application of this rather complex matching rule will be presented in chapter 15 in connection with EXECUTE.

Before starting with a detailed description of the interpreter's and compiler's matching rules, let's summarise the matching criteria of FIND as specified by the two additional parameters SINGLE and UNSIGNED:

SINGLE UNSIGNED Matching criteria
- 0 The latest word in the dictionary with the given name is found.
code field 1 Only words whose code field contains the value of SINGLE can be found.
- 4 In interpretation state, only words whose input parameters match the data types on the interpreter data type heap can be found. In compilation state, only non-immediate words whose input parameters match the data types on the compiler data type heap, and immediate words whose input parameters match the data types on the interpreter data type heap can be found.
- 6 Only words whose input parameters match the data types on the compiler data type heap can be found. An ambiguous condition exists if executed in interpretation state.
length 8 Only words whose stack diagram is exactly the same as the temporary stack diagram stored at the top of the local name space can be found. SINGLE is the length of the temporary stack diagram.
sample definition 20 The matching criteria for the input parameters are the same as for UNSIGNED = 4. Additionally, the output parameter list has to be identical to the output parameter list of the definition specified by SINGLE, after resolving all data type references to the input parameters. The value in single is the low word of a definition.
sample definition 22 The matching criteria for the input parameters are the same as for UNSIGNED = 6. Additionally, the output parameter list has to be identical to the output parameter list of the definition specified by SINGLE, after resolving all data type references to the input parameters. The value in single is the low word of a definition.

Finally, it's important to note that some of these matching criteria can be combined by simply adding the respective values of UNSIGNED. It is possible to combine 4 or 6 with 1 or 8, resulting in 4 additional matching criteria:

For example, UNSIGNED = 5 means that the matching criteria of the interpreter or compiler will be applied, but only words with a code field identical to SINGLE can be found. UNSIGNED = 14 performs an exact match against the stack diagram stored at the top of the local name space, and then updates the compiler data type heap according to this stack diagram.

Matching Rules

When trying to find a word in the dictionary, the interpreter and the compiler do not only consider the name of the word. Since the concept of operator overloading allows to distinguish equally named words by the data types they are applied to, the stack diagrams become part of the matching rules. Note that ANS Forth allows redefining words as well, but once a new version has been defined, the old versions can no longer be found in the dictionary. In strongForth, previously defined versions stay alive, as long as they can be distinguished by their stack diagrams from versions defined later.

Generally, interpreter and compiler try to find a word with the given name that can be applied to the items on the data stack. If, for example, the item on top of the stack has the data type UNSIGNED, a word whose stack diagram has an item of data type FLAG as it's last input parameter won't match. But what about a word whose stack diagram has only one input parameter of data type INTEGER? This word matches, because it can be applied to data type INTEGER and all its subtypes.

So what are the exact matching rules? First of all, FIND checks whether the data stack contains at least as many items as the word has input parameters. Note that each item on the data stack corresponds to either a basic or a compound data type on the data type heap and in the input parameter field. Thus, the length of the input parameter field is not identical to the number of input parameters. For example, the input parameter field of the word

! ( SINGLE CONST -> 1ST -- )

consists of only 2 parameters, although the length of the input parameter field is 3.

Next, FIND compares the data type of each parameter with the data type of the corresponding item on the data stack. Four different cases have to be considered.

Case Input parameter Data stack item
1 Basic data type Basic data type
2 Basic data type Compound data type
3 Compound data type Basic data type
4 Compound data type Compound data type

Case 1

If both the input parameter and the corresponding item on the data stack are basic data types, they match if their data types are identical or if the item on the data stack is a subtype of the input parameter. For example,

ALIGNED ( ADDRESS -- 1ST )

matches if the item on top of the data stack is of data type ADDRESS, or any of its direct or indirect subtypes, like DATA, CONST or CDATA.

Case 2

Now. let's assume that the item on the data stack is a compound data type, while the corresponding input parameter is still a basic data type. A compound data type is always more specific than a basic data type. The input parameter matches the item on the data stack, if the header of the compound data type, which is the leftmost of the basic data types that constitute the compound data type, is identical to or a direct or indirect subtype of the basic data type of the input parameter. In the previous example, ALIGNED matches even if the item on top of the data stack is one of these:

DATA -> UNSIGNED
CONST -> CODE -> SINGLE
CDATA -> CHARACTER

Case 3

On the other hand, if the input parameter has a more specific data type than the corresponding item the data stack, they don't match. A word that requires a compound data type as an input parameter can't be satisfied by a basic data type. For example,

@ ( CODE -> SINGLE -- 2ND )

won't match if the item on top of the data stack has the basic data type CODE.

Case 4

What if both data types are compound data types? The matching rule for this case can be described by a simple recursion. Let

head1 -> tail1

be the compound data type of the input parameter and

head2 -> tail2

be the compound data type of the corresponding item on the data stack. head1 and head2 are basic data types, while tail1 and tail2 may be either basic or compound data types. The input parameter matches the item on the data stack if both of the following two conditions are met:

In the above example, @ matches if the item on the data stack is one of these:

CODE -> SIGNED
CODE -> CONST -> LOGICAL

The same version of @ don't match if the data type of the item on top of the data stack is either of the following:

CODE -> UNSIGNED-DOUBLE
CCODE -> CHARACTER
DATA -> SINGLE

Data Type References

But still, these four cases do not cover everything. What happens if an input parameter contains a data type reference? A data type reference can be either in a basic data type or in the last basic data type of a compound data type. This means, a basic data type with a data type reference does never have the prefix attribute:

2ND               \ Okay
DATA -> 1ST       \ Okay
3RD -> UNSIGNED   \ Wrong!

If a reference is part of a compound data type, all basic data types up to the reference are processes according to the matching rules described in cases 1 to 4. The basic data type containing the reference is then substituted by the referenced basic or compound data type. Any one of the basic data types in the input parameter list, whose index is lower that the index of the basic data type containing the reference, can be referenced:

( UNSIGNED CHARACTER 1ST -- )	\ Okay
( CDATA -> CHARACTER 1ST -- )   \ Okay
( CDATA -> CHARACTER 2ND -- )   \ Okay
( CDATA -> CHARACTER 3RD -- )   \ Wrong!

Note that the referenced data type can be basic data type, the head of a compound data type or even the tail of a compound data type.

But although the referenced data type is part of the input parameter list, the actual match is performed against the data types of the items on the data stack. While matching the input parameters one by one with the data types of the items on the data stack, each basic data type in the input parameter list is assigned to a unique basic data type on the data type heap. In order for the match to succeed, the basic or compound data type on the data type heap, which is assigned to the data type reference in the input parameter list, has to be identical to the basic or compound data type on the data type heap, which is assigned to the referenced basic or compound data type in the input parameter list. Note that subtype or prefix relationships are not considered when matching references. The data types have to be identical.

The matching rules for data type references are best explained by a number of examples.

! ( SINGLE DATA -> 1ST -- )

matches

FLAG DATA -> FLAG

and

CONST -> UNSIGNED DATA -> CONST -> UNSIGNED

1ST is a reference to the first basic data type in the input parameter list, which is SINGLE. However, the data type is substituted by the corresponding data type of the item on the data stack, which is FLAG or CONST -> UNSIGNED. According to this rule, the following data types will not match:

INTEGER DATA -> SINGLE
CODE -> SINGLE DATA -> CODE
PORT DATA -> PORT -> CHARACTER

The rationale of using a reference in this special case is that an item of a specific data type can only be stored in a memory location, which is specified by an address of exactly the same data type.

- ( ADDRESS -> SINGLE 1ST -- INTEGER )

matches every pair of identical data types, provided the first parameter has passed the matching rule for ADDRESS -> SINGLE.

The last input parameter of

FILL ( DATA -> SINGLE UNSIGNED 2ND -- )

is a reference to the tail of the first parameter. It matches for

DATA -> TOKEN UNSIGNED TOKEN
DATA -> CONST -> SIGNED UNSIGNED CONST -> SIGNED

but not for

DATA -> LOGICAL UNSIGNED FLAG
DATA -> SINGLE UNSIGNED DATA -> SINGLE

To make things even worse, references are allowed to be recursive. Stack diagrams like in the following example are possible, and the matching rules for data type references are applied correctly:

( INTEGER CCONST -> 1ST DATA -> 2ND -- )

A word with this stack diagram can be found in the dictionary if the data type heap contains

INTEGER CCONST -> INTEGER DATA -> CCONST -> INTEGER
or something that is based on subtypes of data types INTEGER, CCONST and DATA, like

UNSIGNED CCONST -> UNSIGNED DATA -> CCONST -> UNSIGNED

Again: Searching the Dictionary

FIND searches the dictionary for a given name of a definition. But in some cases, instead of the name you just have the token of a definition. Although the token is a unique identifier, there's no link back from the token to the definition. You have to search the complete dictionary in order to find out which definition a specific token belongs to. This kind of dictionary search is performed by FIND-TOKEN:

FIND-TOKEN ( TOKEN -- DEFINITION SIGNED )

FIND-TOKEN has only one item of data type TOKEN as input parameter. The output parameters are identical to the output parameters of FIND, and they have the same semantics. An item of data type SIGNED indicates whether the token is valid and whether the associated definition is an immediate word or not. You'll see an application of FIND-TOKEN in the next chapter.

Alias Definitions

In strongForth, it is not unusual to define a new word that has exactly the same semantics as an already existing word. There are mainly two reasons for doing this:

  1. Giving an existing definition a second name. This makes sense if a word is used for two different purposes. A good example is the word CELLS, which has the same semantics as 2* on machines where the size of a single cell is two address units.
  2. Providing an overloaded version of a definition for input parameters of different data types. For example, if you have already defined a word that performs some calculation on items of data type INTEGER, and you need an overloaded version for items of data type ADDRESS, the semantics is usually exactly the same.

Of course, these two reasons can be combined. In some cases, you might want to define a word with the same semantics as an existing word, but with a different name and a different stack diagram. These so-called alias definitions can be defined by the strongForth word ALIAS:

: ALIAS ( DEFINITION -- CODE STACK-DIAGRAM )
  ?EXECUTE ?SMUDGE
  >CODE DUP [ 6 BIT 7 BIT OR ] LITERAL (CREATE)
  @ NULL STACK-DIAGRAM ;

ALIAS expects the existing definition on the data stack. From this definition, it extracts the contents of the code field and then creates a new definition with the same code field. Just like CODE, ALIAS returns the pointer to the machine code and an empty stack diagram. Thus, the typical usage of ALIAS is as follows:

' name1 ALIAS name2 ( ... -- ... ) END-CODE

name1 is the existing definition, name2 is the alias definition. Now, here are the examples that were already mentioned in the above use cases:

' NOOP ALIAS CHARS ( INTEGER -- 1ST ) END-CODE
 OK
' 2* PREV ALIAS CELLS ( INTEGER -- 1ST ) END-CODE
 OK
' THEN ALIAS ENDIF ( ORIGIN -- ) END-CODE IMMEDIATE
 OK
: 2+ ( INTEGER -- 1ST )
  2 + ;
 OK
LATEST ALIAS 2+ ( ADDRESS -- 1ST ) END-CODE
 OK
7 2+ .
9  OK
HERE . HERE 2+ .
6852 6854  OK

The word PREV after ' 2* is required, because the first version of 2* that ' finds is the one for items of data type INTEGER-DOUBLE. Its predecessor in the dictionary is 2* for items of data type INTEGER:

' 2* .
2* ( INTEGER-DOUBLE -- 1ST )  OK
' 2* PREV .
2* ( INTEGER -- 1ST )  OK

Dr. Stephan Becher - October 14th, 2005