Xchars and Unicode - Gforth Manual

5.19.9 Xchars and Unicode

ASCII is only appropriate for the English language. Most western languages, however, fit reasonably well into the Forth frame, since a byte is sufficient to encode the few special characters in each (though the same encoding cannot always be used; Latin-1 is the most widely used). Other languages need different character sets, several of them variable-width; the most prominent of these encodings is UTF-8. Let's call these extended characters xchars. The primitive fixed-size characters stored as bytes are called pchars in this section.

The xchar words add a few data types:

xc is an extended char (xchar) on the stack. It occupies one cell, and is a subset of unsigned cell. Note: UTF-8 cannot store more than 31 bits; on 16-bit systems, only the UCS16 subset of the UTF-8 character set can be used.
xc-addr is the address of an xchar in memory. Alignment requirements are the same as c-addr. The memory representation of an xchar differs from the stack representation, and depends on the encoding used. An xchar may use a variable number of pchars in memory.
xc-addr u is a buffer of xchars in memory, starting at xc-addr, u pchars long.

xc-size       xc – u         xchar-ext       “xc-size”

Computes the memory size of the xchar xc in pchars.
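
As an illustration, the following interactive sketch queries the size of a few code points. It assumes the session encoding is UTF-8; under a fixed-width 8-bit encoding the results differ. The code points are arbitrary examples, not taken from the manual.

```forth
\ Assumes the session encoding is UTF-8.
'A'    xc-size .   \ 1  (U+0041)
$E4    xc-size .   \ 2  (U+00E4)
$20AC  xc-size .   \ 3  (U+20AC, euro sign)
$1F600 xc-size .   \ 4  (U+1F600)
```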

x-size       xc-addr u1 – u2         xchar       “x-size”

Computes the memory size of the first xchar stored at xc-addr in pchars.

xc@+       xc-addr1 – xc-addr2 xc         xchar-ext       “xc-fetch-plus”

Fetches the xchar xc at xc-addr1. xc-addr2 points to the first memory location after xc.
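
A minimal sketch of walking a buffer with xc@+, printing each xchar as a hexadecimal number. The word .codepoints is a hypothetical helper, not part of Gforth, and the sketch assumes the buffer holds well-formed text in the current encoding (a truncated final xchar could make xc@+ read past the end).

```forth
: .codepoints ( xc-addr u -- )   \ hypothetical helper
  over + >r                      \ keep the end address on the return stack
  begin dup r@ u< while          \ more pchars left?
    xc@+ hex.                    \ fetch one xchar, advance, print it
  repeat
  drop r> drop ;
```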

xc!+?       xc xc-addr1 u1 – xc-addr2 u2 f         xchar-ext       “xc-store-plus-query”

Stores the xchar xc into the buffer starting at address xc-addr1, u1 pchars large. xc-addr2 points to the first memory location after xc, u2 is the remaining size of the buffer. If the xchar xc fits into the buffer, f is true; otherwise f is false, and xc-addr2 u2 equal xc-addr1 u1. XC!+? is safe against buffer overflows, and is therefore preferred over XC!+.
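
The success and failure cases can be sketched as follows. The buffer name buf and its size are illustrative assumptions, and the printed values assume a UTF-8 session, where U+20AC needs 3 pchars.

```forth
create buf 8 allot           \ hypothetical 8-pchar buffer

$20AC buf 8 xc!+?            ( xc-addr2 u2 f )
. . drop                     \ prints -1 5 : stored, 5 pchars remain

$20AC buf 2 xc!+?            \ only 2 pchars available: does not fit
. . drop                     \ prints 0 2 : nothing stored, size unchanged
```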

xchar+       xc-addr1 – xc-addr2         xchar-ext       “xchar+”

Adds the size of the xchar stored at xc-addr1 to this address, giving xc-addr2.

xchar-       xc-addr1 – xc-addr2         xchar-ext       “xchar-”

Goes backward from xc-addr1 until it finds an xchar such that the size of this xchar added to xc-addr2 gives xc-addr1.
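
For instance, xchar- makes it possible to fetch the last xchar of a buffer without scanning from the start. The word last-xchar below is a hypothetical helper; it assumes u is greater than zero and that the buffer ends on a complete xchar.

```forth
: last-xchar ( xc-addr u -- xc )   \ hypothetical helper
  +          \ address just past the end of the buffer
  xchar-     \ step back to the start of the last xchar
  xc@+ nip ; \ fetch it and discard the advanced address
```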

+x/string       xc-addr1 u1 – xc-addr2 u2         xchar       “plus-x-slash-string”

Step forward by one xchar in the buffer defined by address xc-addr1, size u1 pchars. xc-addr2 is the address and u2 the size in pchars of the remaining buffer after stepping over the first xchar in the buffer.
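
Repeated use of +x/string gives a simple way to count the xchars in a buffer. The word count-xchars is a hypothetical helper, and the sketch assumes well-formed input, so that the remaining size reaches exactly zero.

```forth
: count-xchars ( xc-addr u -- n )   \ hypothetical helper
  0 >r                              \ running count on the return stack
  begin dup while                   \ while pchars remain
    +x/string  r> 1+ >r             \ skip one xchar, count it
  repeat
  2drop r> ;
```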

x\string-       xc-addr1 u1 – xc-addr1 u2         xchar       “x-back-string-minus”

Step backward by one xchar in the buffer defined by address xc-addr1 and size u1 in pchars, starting at the end of the buffer. xc-addr1 is the address and u2 the size in pchars of the remaining buffer after stepping backward over the last xchar in the buffer.
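
A one-line sketch of dropping the last xchar from a string. It assumes the source text is stored in the session encoding, so that the euro sign is one complete xchar.

```forth
s" abc€" x\string- type   \ prints abc
```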

-trailing-garbage       xc-addr u1 – addr u2         xchar-ext       “-trailing-garbage”

Examines the last xchar in the buffer xc-addr u1; if the encoding is correct and it represents a full xchar, u2 equals u1, otherwise u2 represents the string without the last (garbled) xchar.
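
This is useful after cutting a string at an arbitrary pchar count. The sketch below assumes a UTF-8 session, where the euro sign occupies 3 pchars, so removing one pchar leaves an incomplete sequence at the end.

```forth
\ "a€" is 4 pchars in UTF-8; 1- cuts the last xchar in the middle.
s" a€" 1- -trailing-garbage type   \ prints a
```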

x-width       xc-addr u – n         xchar-ext       “x-width”

n is the number of monospace ASCII pchars that take the same space to display as the xchar string starting at xc-addr, using u pchars; this assumes a monospaced display font, i.e. xchar width is always an integer multiple of the width of an ASCII pchar.
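
One use of x-width is aligning columns that contain wide or multi-pchar characters. The word type-padded is a hypothetical helper that prints a string and pads it with spaces to a given display width.

```forth
: type-padded ( xc-addr u width -- )   \ hypothetical helper
  >r 2dup type        \ print the string, keep a copy for measuring
  x-width             \ display width in ASCII columns
  r> swap -           \ columns still missing
  0 max spaces ;      \ never pad by a negative amount
```

Padding by u instead of the x-width result would misalign any string whose pchar count differs from its display width.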

xkey       – xc         xchar-ext       “xkey”

Reads an xchar from the terminal. This will discard all input events up to the completion of the xchar.

xemit       xc –         xchar-ext       “xemit”

Prints an xchar on the terminal.
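
A short sketch of both terminal words. The word echo-xchar is a hypothetical helper, and the first line assumes the terminal and session encoding can represent U+20AC.

```forth
$20AC xemit                       \ prints the euro sign
: echo-xchar ( -- ) xkey xemit ;  \ hypothetical: read one xchar and print it
```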

There's a new environment query:

xchar-encoding       – addr u         xchar-ext       “xchar-encoding”

Returns a printable ASCII string that represents the encoding, using the preferred MIME name (if any) or the name in http://www.iana.org/assignments/character-sets, like “ISO-LATIN-1” or “UTF-8”, with the exception of ASCII, where the alias “ASCII” is preferred over the registered name “US-ASCII”.
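
Assuming the query is made through environment? like other environment queries, the encoding name can be displayed as sketched below; the fallback message is illustrative.

```forth
s" xchar-encoding" environment? [if]
  type                \ e.g. UTF-8 in a UTF-8 session
[else]
  .( xchar-encoding query not available)
[then]
```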