ASCII is only appropriate for the English language. Most Western languages nevertheless fit reasonably well into the Forth frame, since a byte is sufficient to encode the few special characters each of them needs (though not always with the same encoding; Latin-1 is the most widely used). Other languages require different character sets, several of which are variable-width; the most prominent of these is UTF-8. Let's call these extended characters xchars. The primitive fixed-size characters stored as bytes are called pchars in this section.
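The xchar words add a few data types: xc is an extended character (xchar) on the stack; xc-addr is the address of an xchar in memory; xc-addr u is a buffer of xchars in memory, u pchars long.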
xc-size
xc – u xchar-ext “xc-size”
Computes the memory size of the xchar xc in pchars.
x-size
xc-addr u1 – u2 xchar “x-size”
Computes the memory size of the first xchar stored at xc-addr in pchars.
xc@+
xc-addr1 – xc-addr2 xc xchar-ext “xc-fetch-plus”
Fetches the xchar xc at xc-addr1. xc-addr2 points to the first memory location after xc.
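As an illustration of how xc@+ is typically used to walk a string, here is a minimal sketch that counts the xchars in a buffer. The name xc-count is hypothetical (it is not one of the xchar words), and the loop assumes that the buffer contains only complete, correctly encoded xchars.

```forth
\ Count the xchars in the buffer xc-addr u (xc-count is a hypothetical name).
\ Assumes the buffer holds only complete, correctly encoded xchars.
: xc-count ( xc-addr u -- n )
  over + >r              \ R: end address of the buffer
  0 swap                 \ n xc-addr
  begin dup r@ u< while
    xc@+ drop            \ step over one xchar, discard its value
    swap 1+ swap         \ n := n + 1
  repeat
  drop r> drop ;
```

In a UTF-8 session, a string such as Grüße occupies 7 pchars, and xc-count would be expected to return 5 for it.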
xc!+?
xc xc-addr1 u1 – xc-addr2 u2 f xchar-ext “xc-store-plus-query”
Stores the xchar xc into the buffer starting at address xc-addr1, which is u1 pchars large. xc-addr2 points to the first memory location after xc, and u2 is the remaining size of the buffer. If the xchar xc fits into the buffer, f is true; otherwise f is false, and xc-addr2 u2 equal xc-addr1 u1. XC!+? is safe against buffer overflows, and is therefore preferred over XC!+.
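The following sketch shows one way the flag returned by xc!+? can drive a loop: it stores copies of an xchar into a fixed buffer until the next copy no longer fits. The buffer xbuf, its size of 16 pchars, and the word fill-xbuf are assumptions made for this example only.

```forth
create xbuf 16 chars allot    \ hypothetical buffer, 16 pchars large

\ Store copies of xc into xbuf until one no longer fits;
\ n is the number of pchars actually used.
: fill-xbuf ( xc -- n )
  >r xbuf 16                      \ xc-addr u   R: xc
  begin r@ -rot xc!+? 0= until    \ stop at the first store that fails
  r> drop                         \ xc-addr u (unchanged by the failed store)
  nip 16 swap - ;
```

For an xchar that occupies a single pchar, such as char A, fill-xbuf returns 16. For an xchar that needs three pchars, only five copies fit, so the result would be 15.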
xchar+
xc-addr1 – xc-addr2 xchar-ext “xchar+”
Adds the size of the xchar stored at xc-addr1 to this address, giving xc-addr2.
xchar-
xc-addr1 – xc-addr2 xchar-ext “xchar-”
Goes backward from xc-addr1 until it finds an xchar such that the size of this xchar added to xc-addr2 gives xc-addr1.
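A short sketch of how xchar+ and xchar- combine with xc@+; the names second-xchar and last-xchar are hypothetical, and both words assume a well-formed buffer that is long enough (at least two xchars for the first, at least one for the second).

```forth
\ Fetch the xchar following the one at xc-addr.
: second-xchar ( xc-addr -- xc )
  xchar+ xc@+ nip ;

\ Fetch the last xchar of the non-empty buffer xc-addr u.
: last-xchar ( xc-addr u -- xc )
  +            \ address just past the end of the buffer
  xchar-       \ back up to the start of the last xchar
  xc@+ nip ;
```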
+x/string
xc-addr1 u1 – xc-addr2 u2 xchar “plus-x-slash-string”
Step forward by one xchar in the buffer defined by address xc-addr1, size u1 pchars. xc-addr2 is the address and u2 the size in pchars of the remaining buffer after stepping over the first xchar in the buffer.
x\string-
xc-addr1 u1 – xc-addr1 u2 xchar “x-back-string-minus”
Step backward by one xchar in the buffer defined by address xc-addr1 and size u1 in pchars, starting at the end of the buffer. xc-addr1 is the address and u2 the size in pchars of the remaining buffer after stepping backward over the last xchar in the buffer.
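Because +x/string and x\string- both take and return a buffer description, they chain naturally. The sketch below removes the first and the last xchar of a string; strip-ends is a hypothetical name, and the buffer is assumed to contain at least two complete xchars.

```forth
\ Drop the first and the last xchar of the buffer xc-addr1 u1.
\ Assumes the buffer contains at least two complete xchars.
: strip-ends ( xc-addr1 u1 -- xc-addr2 u2 )
  +x/string      \ step over the first xchar
  x\string- ;    \ step back over the last xchar
```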
-trailing-garbage
xc-addr u1 – addr u2 xchar-ext “-trailing-garbage”
Examines the last xchar in the buffer xc-addr u1. If the encoding is correct and it represents a full xchar, u2 equals u1; otherwise, u2 represents the string without the last (garbled) xchar.
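A typical use of -trailing-garbage is to clean up after cutting a string at an arbitrary pchar position, which may split a multi-pchar xchar. The following is a sketch only; safe-prefix is a hypothetical name.

```forth
\ Truncate the string xc-addr u1 to at most umax pchars
\ without leaving a partial xchar at the end.
: safe-prefix ( xc-addr u1 umax -- xc-addr u2 )
  min                  \ cut at a pchar boundary, possibly mid-xchar
  -trailing-garbage ;  \ drop the incomplete last xchar, if any
```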
x-width
xc-addr u – n xchar-ext “x-width”
n is the number of monospace ASCII pchars that take the same space to display as the xchar string starting at xc-addr and occupying u pchars. This assumes a monospaced display font, i.e. the width of an xchar is always an integer multiple of the width of an ASCII pchar.
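Since the display width of a string can differ from both its pchar count and its xchar count, x-width is the value to use when aligning columns. As a sketch (type-padded is a hypothetical name), the following prints a string and pads it with spaces to n display columns:

```forth
\ Print the string xc-addr u, then pad with spaces up to n columns.
\ Strings wider than n columns are printed without padding.
: type-padded ( xc-addr u n -- )
  >r 2dup type         \ print the string, keep xc-addr u
  x-width              \ display width of the string
  r> swap -            \ columns still missing (may be negative)
  0 max spaces ;
```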
xkey
– xc xchar-ext “xkey”
Reads an xchar from the terminal. This will discard all input events up to the completion of the xchar.
xemit
xc – xchar-ext “xemit”
Prints an xchar on the terminal.
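To illustrate xemit together with xc@+, here is a sketch that prints each xchar of a string on its own line. The name xtype-lines is hypothetical, and the buffer is assumed to contain only complete xchars.

```forth
\ Print every xchar of the buffer xc-addr u on a separate line.
: xtype-lines ( xc-addr u -- )
  over + >r                  \ R: end address of the buffer
  begin dup r@ u< while
    xc@+ xemit cr            \ fetch one xchar and print it
  repeat
  drop r> drop ;
```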
There is also a new environment query:
xchar-encoding
– addr u xchar-ext “xchar-encoding”
Returns a printable ASCII string that represents the encoding, using the preferred MIME name (if any) or the name in http://www.iana.org/assignments/character-sets, like “ISO-LATIN-1” or “UTF-8”. The exception is ASCII, where we prefer the alias “ASCII” over the preferred MIME name “US-ASCII”.
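As an environment query, xchar-encoding is accessed through environment? rather than called directly. A minimal sketch, with .encoding as a hypothetical name:

```forth
\ Print the name of the current xchar encoding, if the system reports one.
: .encoding ( -- )
  s" xchar-encoding" environment? if
    type                     \ query known: print the string addr u
  else
    ." (unknown)"            \ query not supported by this system
  then ;
```

On a system configured for UTF-8, .encoding would be expected to print UTF-8.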