Wayback Machine
SEP May SEP
Previous capture 30 Next capture
2003 2004 2005
15 captures
12 Sep 03 - 26 May 10
sparklines
Close Help

An Elementary Prolog Library

DRAFT 4

Section 0. Introduction

The ISO Prolog standard does not adequately support the needs of Prolog users. This may seem unbelievable, and it is certainly inexcusable, but the standard does not even define append/3, length/2, or member/2 (although an example on page 33 shows clauses for append/3 and an example on page 85 shows clauses for member/2).

Many areas are crying out for standardisation, including

This is an attempt to deal with basic data structures. Perhaps more important than the specific data structures and operations dealt with is the stress placed on consistent naming and argument order, and the use of projection as a means of ensuring that argument order is consistent.

0.1. Notation.

I provide three kinds of declaration for the "base" predicates:

The sample code uses a convention taken from DEC-10 Prolog via C Prolog: the dollar sign is treated as a lower case letter, and predicates whose names contain a dollar sign are "implementation detail" not visible to the user. Such predicates are only for specification purposes, and need not conform to quality requirements for visible predicates such as steadfastness.

Perhaps the most important notational issue is that I give no indication in the running text of the module a type or predicate is defined in. It is clear that the four "basic" list processing predicates append/3, length/2, member/2, and memberchk/2 are so useful and so used that they had better be treated as part of the language. It would be quite appropriate if the others were all in a module named 'lists'. See plface.pl for one distribution.

I assume the following predefined types:

:- type list(T)   ---> [] | [T|list(T)].
:- type order     ---> (<) | (=) | (>).
:- type pair(K,T) ---> K-T.
:- type integer.
:- type float.
:- type number.
:- type atom.
:- type char = atom.       % such that atom_codes(It, [_])
:- type chars = list(char).
:- type code = integer.    % such that between(0, 16'10FFFF, It).
:- type codes = list(code).

Other types will be introduced where they are used.

1. Characters.

1.1. Face reality, adopt Unicode.

With Prolog being used in e-business and stuff like that, we have to deal with HTML 4 (defined in terms of Unicode) and XML (defined in terms of Unicode). If we can't portably deal with Unicode, we can't portably deal with the Web. We can no longer afford to put up with Prolog systems that don't handle Unicode cleanly. Quintus Prolog handled 16-bit character sets via JIS and EUC nearly two decades ago. It was obvious then that the byte-at-a-time way that QP did so was a makeshift with a use-by date in the near future (now the rather less near past).

The ISO Prolog standard was published in 1995. In 1983, when I started work on an informal Prolog standard, it was obvious that 8-bit character sets were dead and Prolog would have to support 16-bit character sets. By 1993, it was obvious that there was a single "wide" character set to use in future language standards: ISO 10646/Unicode. Unicode 1.0.0 came out in October 1991. The reconciliation between ISO 10646 and Unicode resulted in Unicode 1.0.1, which came out in June 1992. This left the ISO Prolog committee with 3 years to reflect on the fact that there was now one ISO character set with round-trip conversion between the main national character set standards and itself.

This was certainly obvious to others. For example, the X/Open technical study E401, "Universal Multiple-Octet Coded Character Set coexistence and Migration" had been published the previous year. Of even greater relevance: a great deal of work had gone into updating Ada for what became Ada 95. By 1993, I already had a pile of Ada 95 background documentation and proposals in my office about 50cm high, and Unicode was clearly going to be the character basis of Ada 95 then.

The main reason why certain people at Quintus were not as hostile to the introduction of the single-character-atom design botch as they should have been was concern for EBCDIC on IBM mainframes. However, I have studied the current z/Architecture Principles of Operation, and note that current IBM mainframes have hardware support for Unicode.

With this background, we can see that the ISO Prolog committee made some questionable decisions:

1.2. char_conversion must go!

"This is not negotiable", for the reason that no sane programmer who is acquainted with the facts would wish to retain char_conversion.

That's strong language. Let's summarise what char_conversion/2 and current_char_conversion/2 do.

  1. There is a function ConvC which maps single characters in the implementation-defined character set to single characters in the same character set. This function starts out as the identity function; it is not, in general, expected to be a bijection.
  2. char_conversion('X', 'Y') replaces ConvC with µ(ConvC,x,y), where x is the character code of X and y is the character code of Y.
  3. current_char_conversion(A, B) is true when A and B are single-character atoms and b=ConvC(a) where a and b are the character codes of the first characters of the names of A and B respectively, provided (and this is the most confusing bit) a and b are different.
  4. When a term is read, every unquoted character is effectively replaced by its image under ConvC.

Concerning the excessive strength of char_conversion/2, consider the goal char_conversion('.', ','). If you give that as a query to the interactive top level, you haven't shot yourself in the foot, but in the head. There is no way to recover from this.

In fact it's worse than that. Section 5.5.3 reads "A processor may support some other initial value of ConvC ...", which means that there is no portable Prolog syntax at all. This implementation permission should have required all Prolog processors to start with ConvC mapping the characters used in the Core Prolog syntax to themselves. Indeed, it might have been advisable to decree that char_conversion should not be allowed to change the identity mappings for the core syntax characters.

The :- char_conversion(From, To) directive is unusable in portable programs, even programs intended to be used within a single locale using the same coded character set. The reason is that it is explicitly not specified whether char_conversion directives in a source file are limited in effect (the way it was desired by many that op directives should be) or unlimited in effect (the way op directives were in DEC-10 Prolog and C Prolog). More precisely, a program using this feature can be portable only if every source file spells out the same set of (typically 190) mappings: the fact that it is not specified whether the effect is limited or not means that a program cannot simply consult a single mapping file into each source file.

Concerning the weakness of char_conversion/2, we have to start by looking at the rationale:

I take this to refer to the "full width" (zenkaku) and "half width" (hankaku) copies of ASCII present in some Japanese character sets (and therefore also present in Unicode).

The char_conversion/2 machinery does not satisfy the explicit requirement in the rationale that it should apply to Prolog data. For example, it does not apply to atom_chars/2, and there is no analogous built-in predicate which does character conversion. "Words" read from a file are only converted to a normal form by ConvC mapping if they are read in Prolog syntax using one of the read family. There should surely have been

converted_atom_chars(Atom, Chars) :-
    (   atom(Atom) ->
        atom_chars(Atom, Name),
        $convert_chars(Name, Chars)
    ;   var(Atom) ->
	$convert_chars(Chars, Name),
	atom_chars(Atom, Name)
    ;   $error
    ).

$convert_chars([], []).
$convert_chars([Char|Chars], [Conv|Convs]) :-
    (   current_char_conversion(Char, Conv) -> true	
    ;   Conv = Char
    ),
    $convert_chars(Chars, Convs).

An even worse way in which this misfeature fails to satisfy its explicit rationale is that character conversion does not apply to number_chars/2, and there is no analogous built-in predicate which does character conversion. Numbers read from a file are only converted to a normal form by ConvC mapping if they are read in Prolog syntax using one of the read family.

The failure to deal with run-time number conversion is a particular nuisance for people using Arabic or Indic scripts. As a preliminary sketch, it would be useful if something like the following predicates were adopted:

:- pred integer_codes(integer, codes, integer, code).

integer_codes(Integer, Codes, Base, Zero) :-
    Delta is Zero - 0'0,
    (   integer(Integer) ->
        integer_codes(Integer, ASCII, Base),
        $map(ASCII, Codes, Delta, 0'., 0'E)
    ;   var(Integer) ->
	$map(ASCII, Codes, Delta, 0'., 0'E),
        integer_codes(Integer, ASCII, Base)
    ;   $error
    ).

integer_codes(Integer, Codes, Base) :-
    between(Base, 2, 36),
    like number_codes/2 but only for integers
    and using the first Base digits from "0..9A..Z".

integer_codes(Integer, Codes) :-
    integer_codes(Integer, Codes, 10).

:- type float_format
   ---> e(integer) | e | f(integer) | f | g(integer) | g
      | k(integer) | k | p.
:- pred float_codes(float, codes, float_format, code, code, code).

float_codes(Float, Codes, Format, Zero, Decimal, Exponent) :-
    Delta is Zero - 0'0,
    (   float(Float) ->
	float_codes(Float, ASCII, Format),
	$map(ASCII, Codes, Delta, Decimal, Exponent)
    ;   var(Float) ->
	$map(ASCII, Codes, Delta, Decimal, Exponent),
	float_codes(Float, ASCII, Format)
    ;   $error

float_codes(Float, Codes, Format, Zero, Decimal) :-
    float_codes(Float, Codes, Format, Zero, Decimal, 0'e).

float_codes(Float, Codes, Format, Zero) :-
    float_codes(Float, Codes, Format, Zero, 0'., 0'e).

float_codes(Float, Codes, Format) :-
    like number_codes/2 but only for floats,
    (   Format = e(P), integer(P), P >= 0 ->
        generate or recognise %.Pe format
    ;   Format = e ->
        generate: use default P, recognise: accept any %e
    ;   Format = f(P), integer(P), P >= 0 ->
        generate or recognise %.Pf format
    ;   Format = f ->
        generate: use default P, recognise: accept any %f
    ;   Format = g(P), integer(P), P >= 0 ->
        generate or recognise %.Pg format
    ;   Format = g ->
        generate: use default P, recognise: accept any float
    ;   Format = k(P), integer(P, P >= 0 ->
        like e(P), but with 1..3 digits before the decimal
        point so that the exponent is a multiple of 3; a very
        useful format when you have it
    ;   Format = k ->
        generate: use default P, recognise: accept any P
        but still allow only results of k(_) formatting
    ;   Format = p -> % precise
        generate: same as e(P) with P large enough that full
        precision is generated, enough digits for round trip
        conversion to be one-to-one
    ;   $error
    ).

$map([0'+|Xs], [0'+|Ys], Z, D, E) :- !, $map(Xs, Ys, Z, D, E).
$map([0'-|Xs], [0'-|Ys], Z, D, E) :- !, $map(Xs, Ys, Z, D, E).
$map([0'.|Xs], [ D |Ys], Z, D, E) :- !, $map(Xs, Ys, Z, D, E).
$map([0'e|Xs], [ E |Ys], Z, D, E) :- !, $map(Xs, Ys, Z, D, E).
$map([0'E|Xs], [ E |Ys], Z, D, E) :- !, $map(Xs, Ys, Z, D, E).
$map([ N |Xs], [ M |Ys], Z, D, E) :- !,
    plus(Z, N, M),
    between(0'0, 0'9, N),
    $map(Xs, Ys, Z, D, E).
With such predicates, we then define
:- pred number_codes(number, codes, code).

number_codes(Number, Codes, Zero) :-
    (   integer(Number) -> integer_codes(Number, Codes, Zero)
    ;   float(Number) -> float_codes(Number, Codes, p, Zero)
    ;   var(Number) >
	(   integer_codes(Number, Codes, Zero) -> true
	;   float_codes(Number, Codes, p, Zero)
	)
    ;   $error
    ).

number_codes(Number, Codes) :-
    number_codes(Number, Codes, 0'0).

At last, 10 years late, this would give people using Arabic or Indic scripts (amongst others) the ability to convert numbers using their own characters, and it would, 10 years late, give Europeans who like a rather ambiguous use of commas to convert floating-point numbers using their preferred decimal point character.

Is there any other way char_conversion/2 fails? Yes, there is. I shall take personal examples. The only languages that really matter to me personally (other than mattering because I think other people should be able to use their own language and script) are English, Latin, Greek, Hebrew, Serbo-Croatian, and Maori.

Writing in English, using ISO Latin 1, I would like to be able to use my father's name, Æneas. The letter ash is not one of the normal Prolog characters, so I would have to map it to two characters: A, e. But I can't do that. I cannot have the atom hæme mapped to haeme either, for the same reason. Since Latin1 includes the characters "¼", "½", and "¾", which are part of the number syntax of my native locale, I would like to use them in numbers, so that ¼=0.25, 2½=2.5, and 10¾=10.75. This could be done if I could map "¼" to ".25" and so on. But with char_conversion/2 so limited, I can't. If char_conversion/2 can't even handle ISO Latin 1, what good is it?

I suppose I should have said "Croatian" rather than "Serbo-Croatian", but there was a generation of New Zealanders who knew that their grandparents were from Yugoslavia who had to look the natal village up on a map to figure out which side they would have been on when Yugoslavia broke up. As it turned out, my relatives were all on the side labelled "refugee", and we have lost touch; the natal village is a heap of rubble. The NZ branch of the family figured out which to call themselves by looking at letters my grandfather's brother had written: they used the Latin script, not the Cyrillic script. Except that Croatian doesn't quite use the Latin script. There are four letters (DZ, DZ with caron, LJ, and NJ) which look like two English letters but are regarded as single letters in the script. It would probably be a good idea to map DZ to D,Z; Dz to D,z; and dz to d,z. But I can't do that with char_conversion/2.

New Zealand has two official languages, English and Maori. For historical reasons, there are three spelling conventions for Maori:

  1. Spell long and short vowels the same. This is the convention the missionaries used. It is the oldest convention, and is used in the word "Maori" itself.
  2. Spell long vowels by writing a long vowel as two copies of the corresponding short vowel. This is a Maaori invention.
  3. Spell long vowels by writing a macron over the corresponding short vowel. This is the convention adopted by the United Tribes, on the grounds that it's the least change to the oldest convention and is therefore most respectful to the ancestors.
  4. Spell long vowels by writing an umlaut over the corresponding short vowel. This is not an official convention for Mäori, but it's what you are most likely to see if you look at web pages containing Mäori text. The reason is that instead of adopting a character set containing macronised vowels, people said "let's use our Windows/MacRoman/Latin1 characters but with different fonts so the umlauts look like macrons."

Accordingly, in processing Maaori text in which vowel length is marked, the best scheme would be to map U+0100 (capital A with macron) and U+00C4 (capital A with daieresis) to A,A, and so on for the other 9 upper and lower case vowels. But I can't do that with char_conversion.

Do I need to mention the Unicode compatibility mappings which map one character to several? Maybe I had better. Take the character U+00A0 "DIAERESIS", for example. Its compatibility mapping is "<compat> 0020 0308". A quick tally of the number of replacement characters in the Unicode 3.1.1 database yielded
ReplacementNumber of
length occurrences
13005
21585
3360
462
515
62
81
181

This is of course a lower bound; the replacement rules in that file have to be applied recursively, so the average replacement length is somewhat longer.

In conclusion, the entire char_conversion/2 framework is too narrowly conceived, too little applied, and too dangerous in effect.

In order to deal effectively with Unicode (and it was the plain responsibility of the ISO Prolog committee to address this), some other means entirely will have to be found.

1.3 Normalisation

Unicode Standard Annex #15 defines four normalisation forms. Canonical Equivalence in Applications defines two more. We also need a "verbatim" normalisation form which acts as the identity function.
NameCodeSourceWhat it is
vHereNo change
NFDdTR15Canonical decomposition
NFCcTR15Canonical decomposition then canonical composition
FCDfcdTN5"Fast C or D" decomposition
FCCfcdTN5FCC form
NFKDkdTR15Compatibility decomposition
NFKCkcTR15Compatibility decomposition then canonical composition

Canonical decomposition and (re)composition apply, in principle, to any character set containing both precomposed accented letters and floating diacriticals. Unicode was by no means the first such character set.

FCC and FCD forms are intermediate between NFC and NFD; they make sense whenever NFC and NFD make sense.

Compatibility decomposition applies, in principle, to any character set containing multiple representations of the "same" character. In particular, the Japanese zenkaku and hankaku are just such characters. In fact, compatibility decomposition is pretty much exactly what char_conversion/2 was supposed to be all about, except that it is by no means true that all compatibility mappings in all such character sets have single-character replacements.

Basically, to really do the job that code_conversion/2 was supposed to do but doesn't, we need Normalisation Form KD or even better, Normalisation Form KC.

We need two modules, one for each representation of characters. (Unicode-hacking life would be so much easier if we only had to deal with integers in the range 0..16'10FFFF. For example, while '\xD800\xDC00' is, according to Unicode rules, the encoding of one character, it is not a 'char' according to ISO Prolog rules.)

The following type is shared by codes and chars:

:- type normalisation_form
   ---> v | c | d | fcc | fcd | kc | kd.

It is a matter for debate which normalisation forms should be supported by all Prolog systems. It is a practical fact that some normalisation forms should be supported by most Prolog systems. Whatever set of normalisation forms is eventually agreed on, implementors ought to be allowed to extend the set.

:- module(chars, [...]).
:- pred normalise(normalisation_form, chars, chars, chars).
:- normalise(NF, Xs, Ys0, Ys) when NF, Xs.
:- normalise(NF, Xs, Ys0, Ys) is semidet
   when ground(NF), ground(Xs).

normalise(NF, Xs, Ys0, Ys) :-
    Ys0\Ys is Xs in normalisation form NF.

normalise(NF, Xs, Ys) :-
    normalise(NF, Xs, Ys, []).

normalise(Xs, Ys) :-
    normalise(kd, Xs, Ys).

normalised_compare(NF, R, Xs, Ys) :-
    normalise(NF, Xs, Xn),
    normalise(NF, Ys, Yn),
    compare(R, Xn, Yn).

normalised_compare(R, Xs, Ys) :-
    normalised_compare(kd, R, Xs, Ys).

normalised(NF, Xs) :-
    normalise(NF, Xs, Xs).

normalised(Xs) :-
    normalise(Xs, Xs).
...

Please remember that these are specifications, not implementations. It is possible to implement normalised_compare/[3,4] so that they work in a single pass over their arguments without building any intermediate list. Similarly, it is possible to implement normalised/[1,2] so that they work in a single pass over their argument without building any intermediate list.

As specified above, normalised_compare/3 is defined only on character lists. It could be extended to all ground terms. I do not recommend this.

:- module(codes, [...]).
:- pred normalise(normalisation_form, codes, codes, codes).
:- normalise(NF, Xs, Ys0, Ys) when NF, Xs.
:- normalise(NF, Xs, Ys0, Ys) is semidet
   when ground(NF), ground(Xs).

normalise(NF, Xs, Ys0, Ys) :-
    Ys0\Ys is Xs in normalisation form NF.

normalise(NF, Xs, Ys) :-
    normalise(NF, Xs, Ys, []).

normalise(Xs, Ys) :-
    normalise(kd, Xs, Ys).

normalised_compare(NF, R, Xs, Ys) :-
    normalise(NF, Xs, Xn),
    normalise(NF, Ys, Yn),
    compare(R, Xn, Yn).

normalised_compare(R, Xs, Ys) :-
    normalised_compare(kd, R, Xs, Ys).

normalised(NF, Xs) :-
    normalise(NF, Xs, Xs).

normalised(Xs) :-
    normalise(Xs, Xs).
...

As specified above, normalised_compare/3 is defined only on character code lists. It could not be extended to all ground terms, because there is no general way of telling which lists of positive integers are supposed to be Unicode strings and which are not.

There needs to be a stream property normalisation_form(NF) which can be used to ensure that characters delivered to a Prolog application are already normalised, and that characters generated by a Prolog application are appropriately normalised before being delivered to their destination.

Several Prolog implementations (Xerox Quintus Prolog and SWI Prolog amongst them) have a "string" data type. Such implementations should have a string module analogous to codes and chars.

1.4. Character classification.

In the good old days, it was sufficient to have a library(ctypes) module. With ISO's thrice-accursed single-character-atom representation for characters, we need two modules, as noted in the previous section. So we might as well put the character classification predicates in 'chars' and 'codes'. The classification predicates in each module will have the same name and basically the same meaning, but different argument types.

The ISO Prolog standard uses the following character groups:

The term "graphic character" is very poorly chosen; one expects it either to have something to do with box drawing or to be strongly related to isgraph() in the C standards. Instead it refers to what was previously known as "symbol characters" or "operator characters".

That's a hint. We want to be able to do what C does, plus we want to be able to cover the categories used in Prolog itself.

Fortunately, there's a freely downloadable table classifying many tens of thousands of characters, actively maintained by someone else. Even more fortunately, the categories it uses can be mapped onto the categories we want quite easily. The table is part of the Unicode data base.

is_unicode/2 and is_digit/2

I propose a single base predicate, from which the others are easily derived.

:- type unicode_category
   ---> lu | ll | lt | lm | lo
      | mn | mc | me
      | nd | nl | no
      | pc | pd | ps | pe | pi | pf | po
      | sm | sc | sk | so
      | zs | zl | zp
      | cc | cf | cs | co | cn.

These codes are simply the Unicode category names converted to lower case for convenience in Prolog.

:- pred is_unicode(code, unicode_category).
:- is_unicode(C, K) when C ; K.
:- is_unicode(C, K) is semidet when ground(C).
:- is_unicode(C, K) is bounded when true.

is_unicode(C, K) :-
    between(0, 16'10FFFF, C),
    UnicodeData.txt says C has category K,
    K \== cn. % not assign

:- pred is_digit(code, integer).
:- is_digit(C, V) when C ; V.
:- is_digit(C, V) is semidet when ground(C).
:- is_digit(C, V) is bounded when true.

is_digit(C, K) :-
    is_unicode(C, nd),
    UnicodeData.txt gives V as the decimal value of C.

The Unicode standard and associated reports explain implementation techniques for this. Even with just one byte per character, a simple implementation would take 1.1 megabytes. Using multilevel tables can compress this dramatically, not least because only 3 of the 17 planes currently have any characters at all.

Derived predicates

is_unicode(C) :-
    is_unicode(C, _).

is_latin1(C, K) :-
    between(0, 16'FF, C),
    is_unicode(C, K).

is_latin1(C, K) :-
    is_latin1(C, _).

is_ascii(C, K) :-
    between(0, 16'7F, C),
    is_unicode(C, K).

is_ascii(C) :-
    is_ascii(C, _).

is_alnum(C) :-
    is_unicode(C, K),
    memberchk(K, [nd,lu,ll,lt,lm,lo]).

is_alpha(C) :-
    is_unicode(C, K),
    memberchk(K, [lu,ll,lt,lm,lo]).

is_blank(C) :- % known as is_white/1 in Quintus Prolog
    is_unicode(C, zs).

is_bracket(L, R) :-
    is_unicode(L, ps),
    is_unicode(R, pe),
    R is the closing bracket corresponding to L.

is_cntrl(C) :- % C0 and C1 controls + formats
    is_unicode(C, K),
    memberchk(C, [cc,cf]).

is_csym(C) :-
    is_unicode(C, K),
    memberchk(C, [pc,nd,lu,ll,lt,lm,lo]).

is_csymf(C) :-
    is_unicode(C, K),
    memberchk(C, [pc,lu,ll,lt,lm,lo]).

is_digit(C) :-
    is_unicode(C, nd).

is_endfile(-1).

is_endline(C) :-
    is_unicode(C, K),
    memberchk(K, [zl,zp]).

is_graph(C) :-
    is_unicode(C, K),
    nonmember(K, [zs,zl,zp,cc,cf]).

is_layout(C) :-

is_lower(C) :-
    is_unicode(C, ll).

is_print(C) :-
    is_unicode(C, K),
    nonmember(K, [zl,zp,cc,cf]).

is_punct(C) :-
    is_unicode(C, K),
    memberchk(K, [pc,pd,ps,pe,pi,pf,po,sm,sc,sk,so]).

is_title(C) :-
    is_unicode(C, lt).

is_quote(0'', 0'').
is_quote(0'", 0'").
is_quote(0'`, 0'`).
is_quote(L, R) :-
    is_unicode(L, pi),
    is_unicode(R, pf),
    R is the final quote corresponding to L.

is_space(C) :-
    is_unicode(C, K),
    memberchk(K, [zs,zl,zp]).

is_upper(C) :-
    is_unicode(C, lu).

is_xdigit(Base, C, V) :-
    between(2, 36, Base),
    Base1 is Base - 1,
    between(0, Base1, V),
    (   V >= 10 ->
        ( C is "A" - 10 + V ; C is "a" - 10 + V )
    ;   is_digit(C, V)
    ).

is_xdigit(C, V) :-
    is_xdigit(16, C, V).

is_xdigit(C) :-
    is_xdigit(16, C, _).

There are two debatable decisions here. In this draft I have decided that the "Other, Format" characters of Unicode should be grouped with the "Other, Control" characters for is_cntrl/1, on the grounds that these things fairly obviously are control characters and that if you just want the C0 and C1 controls it is easy enough to test for 'cc' yourself. Indeed, is_ascii(C, cc) will check for the ASCII (C0) controls. I have also decided that the "csymf" characters should include all the "Punctuation, Connector" characters, not just the LOW LINE "_".

i have read through rather more material about wide character classification in C than was enjoyable, and found stunningly little guidance about how non-ASCII characters should be classified, so I must emphasise that the definitions above are a draft.

The is_bracket/2 and is_quote/2 predicates do not correspond to anything in C, but I found them very useful in Quintus Prolog and equivalents very useful in a text editor. Unicode 3.1.1 adds only single quotes, double quotes, single guillemets, double guillemets, and single and double high reversed-9 quotes to the Pi category, so is_quote/2 isn't much of a burden. For the brackets we'd need a somewhat larger table, but there are plenty of nice looking brackets in Unicode it would be nice to recognise.

The "solo" characters in ISO Prolog are "!,;|%()[]{}". What is the appropriate generalisation for Unicode?

In ASCII, "()[]{}" is the full set of brackets. So Unicode solo characters should include all Unicode brackets. The other characters are all in the Po category, but there are many characters in the Po category which should be symbol characters, not solo characters. The simplest rule seems to be that a non-bracket character should count as a solo character if and only if it is in the Po category and is a script or compatibility variant of one of the existing non-bracket solo characters. In Unicode 3.1.1, that's just the following 24 characters:

It is possible to recognise this set of characters by pattern matching (carefully!) on the names, but since I do not propose access to the names, this has to be expressed by means of a table.

is_solo(C) :-
    is_unicode(C, K),
    memberchk(K, [po,ps,pe]),
    (   K = po -> 
        memberchk(C, [16'0021,16'0025,16'002C,16'003B,16'055C,16'055D,
                      16'060C,16'061B,16'066A,16'1363,16'1364,16'1802,
                      16'1808,16'3001,16'FE50,16'FE51,16'FE54,16'FE57,
		      16'FE6A,16'FF01,16'FF05,16'FF0C,16'FF1B,16'FF64])
    ;   true
    ).

The so-called "meta" characters in ISO Prolog are the ASCII umlaut, acute accent, grave accent, and reverse solidus. This is a rather puzzling grouping. Three of them are used for quoting string-like tokens, one of them isn't. Unlike Lisp, the reverse solidus has no "meta" semantics in Prolog except as part of string-like tokens; when it occurs unquoted it is just a plain symbol character.

Since we already have is_quote/2, which covers the characters that begin string-like tokens, I do not think that an *is_meta/1 character classification predicate is appropriate.

The confusingly named "graphic" characters of ISO Prolog are "#$&*+-./:<=>?@^~". This is especially confusing, because what we want to know is not "which characters stand for themselves in quoted tokens" (the answer is: anything except the corresponding closing quote, the reverse solidus, or a Zl or Zp character), but "which characters can be used in unquoted quoted literals that look vaguely like mathematical operators. That is, we need a character class that includes these characters and also includes the reverse solidus, for the sake of ISO Prolog atoms such as \=, \==, =\=, \+, /\, \/, and \ . This is in fact the set "graphic token char" of section 6.4.2.

If we look these characters up in the Unicode data base, we find that they fall into four groups:

These groupings actually fit our intuitions about which characters should be allowed in unquoted atoms very well; when extended to Latin 1 they yield exactly the set of characters my Prolog-in-C parser has accepted since Latin 1 came out. The only problem is that some Po characters are not symbol characters, but we already have a recogniser for those, so:

is_symbol(C) :-
    is_unicode(C, K),
    memberchk(K, [pd,po,sc,sm]),
    \+ is_solo(C).

Locales

If Prolog uses Unicode internally, then we don't have to worry much about locales for character classification, so I'll leave this section empty for a while.

1.5. Case conversion

In ASCII, every letter was either a lower case letter with a unique context-independent upper case letter or an upper case letter with a unique context-independent lower case letter, and this was the only variation between letter forms.

In the ISO 8859 family of 8-bit character sets, there could be letters which were neither upper nor lower case, and there could be letters which did belong to a case but did not have an opposite case equivalent (there are two of those in Latin-1), and there could be letters with positional variants (there are five of those in 8859-8). Worse still, there could be letters whose opposite case form was not the same length (ß).

Of course it's well known that Greek has two lower case forms for the letter "s" both corresponding to the same upper case form.

Interestingly enough, the Unicode 3.1.1 database defines case mappings, and its case mappings are one-to-one. the letter "ß" is handled by not mapping it to anything else at all, and the Greek letters are handled by mapping the capital "S" to the medial (not the final) letter "s", regardless of the context.

The Unicode standard points out that case mapping is locale-sensitive, and gives the example of Turkish.

Accordingly, we do need some kind of locale objects. This is a very preliminary draft here!

:- type character_locale.  % not specified

:- pred character_locale(atom, character_locale).
:- character_locale(Name, Locale) when Name.
:- character_locale(Name, Locale) is semidet when ground(Name).

character_locale(Name, Locale) :-
    Name is the name of a locale and
    Locale is a private representation of its character facet.

One special locale object is 'unicode', which represents the Unicode case conversion tables. We might also need 'posix' and/or 'C' locales for this. A locale name like 'en_NZ' might or might not be acceptable as a locale object at the implementor's choice.

We get two sets of case conversion predicates. None of them takes a single character as argument or delivers a single character as result. We have, in general, no right to expect that lc(x++y)=lc(x)++lc(y).

:- pred to_lower(character_locale, codes, codes, codes).
:- to_lower(L, Xs, Ys0, Ys) when L, Xs.
:- to_lower(L, Xs, Ys0, Ys) is semidet when ground(L), ground(Xs).

locale_to_lower(Locale, Xs, Ys0, Ys) :-
    Locale is a character locale and Xs is a string,
    Ys0\Ys is Xs converted to lower case according to Locale.

locale_to_lower(Locale, Xs, Ys) :-
    to_lower(Locale, Xs, Ys, []).

to_lower(Xs, Ys0, Ys) :-
    locale_to_lower(unicode, Xs, Ys0, Ys).

to_lower(Xs, Ys) :-
    to_lower(Xs, Ys, []).

:- pred to_title(character_locale, codes, codes, codes).
:- to_title(L, Xs, Ys0, Ys) when L, Xs.
:- to_title(L, Xs, Ys0, Ys) is semidet when ground(L), ground(Xs).

locale_to_title(Locale, Xs, Ys0, Ys) :-
    Locale is a character locale and Xs is a string,
    Ys0\Ys is Xs converted to title case according to Locale.

locale_to_title(Locale, Xs, Ys) :-
    locale_to_title(Locale, Xs, Ys, []).

to_title(Xs, Ys0, Ys) :-
    locale_to_title(unicode, Xs, Ys0, Ys).

to_title(Xs, Ys) :-
    to_title(Xs, Ys, []).

:- pred to_upper(character_locale, codes, codes, codes).
:- to_upper(L, Xs, Ys0, Ys) when L, Xs.
:- to_upper(L, Xs, Ys0, Ys) is semidet when ground(L), ground(Xs).

locale_to_upper(Locale, Xs, Ys0, Ys) :-
    Locale is a character locale and Xs is a string,
    Ys0\Ys is Xs converted to upper case according to Locale.

locale_to_upper(Locale, Xs, Ys) :-
    locale_to_upper(Locale, Xs, Ys, []).

to_upper(Xs, Ys0, Ys) :-
    locale_to_upper(unicode, Xs, Ys0, Ys).

to_upper(Xs, Ys) :-
    to_upper(Xs, Ys, []).

1.6. Text Element Boundaries

The title of this section refers to the corresponding section of the Unicode Standard, where rules are given for finding

In Latin1, there is a one to one correspondence between the thing you might call a `char' in a program and the thing a normal person looking at a printed text might call a `character'. The imaginary rock band ``Death-Töngue'' (from ``Bloom County'') clearly has 12 whatsits in its name on either construal. If we step through the Prolog list (of Latin1 codes) "Death-Töngue" one element at a time, we shall visit one written character at a time.

This neat equivalence did not hold in ASCII, although that is not widely known. In ASCII, it was legitimate to encode <ö> as <o,BS,"> or <",BS,O>. This accounts for some of the compromise character shapes: " has to serve for opening double quote, closing double quote, and diaresis, ' has to serve for apostrophe, right quote, and acute accent, , has to serve for comma and cedilla, and so on. The result is that except for US English, the relationship between a sequence of characters and a sequence of chars was neither one to one nor unique.

Unicode brings back the complexities of ASCII, doubled and redoubled. There is in principle no upper bound to the number of diacritical characters which may be attached to a base character, hence no upper bound on the number of Unicode code elements which may be required to represent a single scripteme.

I propose the following predicates for inclusion in both the 'chars' module and the 'codes' module. I illustrate only the 'codes' versions, but the 'chars' versions are obvious.

:- pred next_character(normalisation_form, codes, codes, codes).
:- next_character(NF, Xs, Ys0, Ys) when NF, Ys0.
:- next_character(NF, Xs, Ys0, Ys) is semidet
   when ground(NF), ground(Xs), ground(Ys0).
:- next_character(NF, Xs, Ys0, Ys) is semidet
   when NF==v, ground(Ys0).
:- next_character(NF, Xs, Ys0, Ys) is bounded
   when ground(Ys0).

next_character(NF, Xs, Ys0, Ys) :-
    append(Zs, Ys, Ys0),
    Zs \== [],
    normalised_compare(NF, =, Xs, Zs),
    Zs is the longest prefix of Ys0 not containing
      a character boundary.

next_character(Xs, Ys0, Ys) :-
    next_character(v, Xs, Ys0, Ys).

next_character(Xs, Ys0) :-
    next_character(Xs, Ys0, Ys).

:- pred next_word(normalisation_form, codes, codes, codes).
:- next_word(NF, Xs, Ys0, Ys) when NF, Ys0.
:- next_word(NF, Xs, Ys0, Ys) is semidet
   when ground(NF), ground(Xs), ground(Ys0).
:- next_word(NF, Xs, Ys0, Ys) is semidet
   when NF==v, ground(Ys0).
:- next_word(NF, Xs, Ys0, Ys) is bounded
   when ground(Ys0).

next_word(NF, Xs, Ys0, Ys) :-
    append(Zs, Ys, Ys0),
    Zs \== [],
    normalised_compare(NF, =, Xs, Zs),
    Zs is the longest prefix of Ys0 not containing
      a word boundary.

next_word(Xs, Ys0, Ys) :-
    next_word(v, Xs, Ys0, Ys).

next_word(Xs, Ys0) :-
    next_word(Xs, Ys0, Ys).

:- pred next_sentence(normalisation_form, codes, codes, codes).
:- next_sentence(NF, Xs, Ys0, Ys) when NF, Ys0.
:- next_sentence(NF, Xs, Ys0, Ys) is semidet
   when ground(NF), ground(Xs), ground(Ys0).
:- next_sentence(NF, Xs, Ys0, Ys) is semidet
   when NF==v, ground(Ys0).
:- next_sentence(NF, Xs, Ys0, Ys) is bounded
   when ground(Ys0).

next_sentence(NF, Xs, Ys0, Ys) :-
    append(Zs, Ys, Ys0),
    Zs \== [],
    normalised_compare(NF, =, Xs, Zs),
    Zs is the longest prefix of Ys0 not containing
      a sentence boundary.

next_sentence(Xs, Ys0, Ys) :-
    next_sentence(v, Xs, Ys0, Ys).

next_sentence(Xs, Ys0) :-
    next_sentence(Xs, Ys0, Ys).

:- pred next_line(normalisation_form, codes, codes, codes).
:- next_line(NF, Xs, Ys0, Ys) when NF, Ys0.
:- next_line(NF, Xs, Ys0, Ys) is semidet
   when ground(NF), ground(Xs), ground(Ys0).
:- next_line(NF, Xs, Ys0, Ys) is semidet
   when NF==v, ground(Ys0).
:- next_line(NF, Xs, Ys0, Ys) is bounded
   when ground(Ys0).

next_line(NF, Xs, Ys0, Ys) :-
    append(Zs, Ys, Ys0),
    Zs \== [],
    normalised_compare(NF, =, Xs, Zs),
    Zs is the longest prefix of Ys0 not containing
      a line boundary.

next_line(Xs, Ys0, Ys) :-
    next_line(v, Xs, Ys0, Ys).

next_line(Xs, Ys0) :-
    next_line(Xs, Ys0, Ys).

Section 2. Higher-order operations.

2.1. call/N

The ISO Prolog standard defines call/1 in section 7.8.3. Looking at the definition is like a slap in the face; the committee rejected my denotational semantics for Prolog on the grounds that it was written in Pascal, then when I pointed out that on the contrary, it was written in a pure functional language that wasn't anything like Pascal, it was rejected on the grounds that it was too "implementation-oriented". If that doesn't describe the definition of call/1, I don't know what does.

If call/1 was not in the language, we could define it thus:

    :- dynamic $call/1.

    call(Goal) :-
        asserta((
            $call(Goal) :-
                retract(($call(_) :- _)),
                !,
                Goal
	)),
	$call(Goal).

In the Prolog type checking paper that Alan Mycroft and I wrote, we proposed the call/N family, in order to permit type-checking Prolog higher-order code. Since then, some Prologs have implemented this feature in the library (notably DEC-10 Prolog, C Prolog, Quintus Prolog) with an upper bound on N, and some Prologs have built it into their language (SWI Prolog and, I believe, NU Prolog).

Roughly speaking, one expects

call(p(X1,X2,X3)) :- p(X1, X2, X3).

call(p(X1,X2), X3) :- p(X1, X2, X3).

call(p(X1), X2, X3) :- p(X1, X2, X3).

call(p, X1, X2, X3) :- p(X1, X2, X3).

Indeed, one way to implement the call/N family is to start with a skeleton

    call(P, ...) :- \+ callable(P), !, error(uncallable term).
    [special purpose code for control structures if N <= 2]
    [gap]
    call(P, ...) :- error(undefined predicate).

and then whenever a predicate p/N is defined, automatically add clauses to call/1, ..., call/N+1 in the appropriate gaps. In a WAM-based system, this can be done by exploiting switch-on-term without any choice-points being created, so call/N in the normal case is just a hash table look-up, a bit of argument shuffling, and a jump. The space requirement is O(P.A2), where P is the number of predicates and A is the arity. Since arities are usually fairly low, this is in effect linear in the size of the program. I have tried this, and found it to be a pleasantly efficient (as well as obvious) way to implement call/N.

There are other ways to implement call/N which are nearly as efficient, and have only O(P) space overhead, the hash table needed to map from a predicate's functor to its address.

This family of predicates has proven its usefulness. It's time they were in the community standard.

:- pred call(void(T1), T1).
:- pred call(void(T1,T2), T1, T2).
:- pred call(void(T1,T2,T3), T1, T2, T3).
:- pred call(void(T1,T2,T3,T4), T1, T2, T3, T4).
:- pred call(void(T1,T2,T3,T4,T5), T1, T2, T3, T4, T5).
:- pred call(void(T1,T2,T3,T4,T5,T6), T1, T2, T3, T4, T5, T6).

% Sample implementation.
% Beware: this is highly inefficient.  Don't do it this way.

call(P, Y1) :-
    P =.. L0,
    append(L0, [Y1], L1),
    Q =.. L1,
    call(Q).

call(P, Y1, Y2) :-
    P =.. L0,
    append(L0, [Y1,Y2], L2),
    Q =.. L2,
    call(Q).

call(P, Y1, Y2, Y3) :-
    P =.. L0,
    append(L0, [Y1,Y2,Y3], L3),
    Q =.. L3,
    call(Q).

call(P, Y1, Y2, Y3, Y4) :-
    P =.. L0,
    append(L0, [Y1,Y2,Y3,Y4], L4),
    Q =.. L4,
    call(Q).

call(P, Y1, Y2, Y3, Y4, Y5) :-
    P =.. L0,
    append(L0, [Y1,Y2,Y3,Y4,Y5], L5),
    Q =.. L5,
    call(Q).

call(P, Y1, Y2, Y3, Y4, Y5, Y6) :-
    P =.. L0,
    append(L0, [Y1,Y2,Y3,Y4,Y5,Y6], L6),
    Q =.. L6,
    call(Q).

2.2. meta_predicate

As soon as you have call/1, you have a problem. Cross-referencers don't work. You can fix this for the specific cases of call/1, findall/3, bagof/3, setof/3, and a few others, but as soon as a Prolog programmer writes

gcc(Goal) :- \+ \+ Goal. % garbage-collecting call

the cross-referencer stops working. The C call graph tool on my workstation is similarly limited; if you have a function which is passed to another function and not called directly, it thinks that function is never called. That's not what I call a usable call graph tool; it is worse than having no call graph at all.

Characteristically, the ISO committee missed the point. They thought that meta_predicate has something to do with modules. It does, but only accidentally. The real function of meta_predicate declarations is to provide just enough type information for a cross-referencer to work.

The idea of meta_predicate declarations was to start with existing mode declarations (so the arguments would be +, -, or ?) and change some of the arguments to something else to say "this argument is called".

Had call/1 been the only game in town, something like 'call' would have been adequate, so we'd have had

:- meta_predicate
    findall(?, call, -),    % not final version
    gcc(call).              % not final version

But the call/N family came into Prolog about the time the DEC-10 cross-referencer was written, and replaced prior code that did things like

all(P, [], []).
all(P, [X|Xs], [Y|Ys]) :-
    apply(P, [X,Y]),
    all(P, Xs, Ys).

Very often, when an argument is passed to a meta-predicate, it is a closure, not a goal; the callee will supply more arguments (at the right) before calling it. This is information that a cross-referencer needs.

So, quite independently of any module system, we need meta-predicate declarations of this form:

meta_predicate_directive -->
    [(:-),(meta_predicate)],
    meta_predicate_declarator,
    rest_meta_predicate_declarators.

rest_meta_predicate_declarators --> ['.'].
rest_meta_predicate_declarators --> [','],
    meta_predicate_declarator,
    rest_meta_predicate_declarators.
    
meta_predicate_declarator --> atom(_).
meta_predicate_declarator --> atom(_), ['('],
    meta_predicate_argument,
    rest_meta_predicate_arguments.

rest_meta_predicate_arguments --> [')'].
rest_meta_predicate_arguments --> [','],
    meta_predicate_argument,
    rest_meta_predicate_arguments.

meta_predicate_argument --> [+].
meta_predicate_argument --> [-].
meta_predicate_argument --> [?].
meta_predicate_argument --> [:].  % for modules.
meta_predicate_argument --> integer(N), {N >= 0}.

The meta_predicate arguments (+), (-), and (?) are ignored by the system, but may be processed by a mode inference program like Chris Mellish's, and indicate the intended use of the corresponding arguments.

The meta_predicate argument (:) is only useful when there is a module system, otherwise it is just like (+). It says that the corresponding argument will not be treated as a goal or closure, but should be subject to module name expansion anyway.

As a mode, the meta_predicate argument N has the same force as (+). But N tells the cross-referencer that the corresponding argument is to be a goal that is missing its rightmost N arguments.

With the aid of this declaration, we can write

:- meta_predicate
    findall(?, 0, ?),
    gcc(0),
    all(2, ?, ?).

all(_, [], []).
all(P, [X|Xs], [Y|Ys]) :-
    call(P, X, Y),
    all(P, Xs, Ys).

The ISO committee not only got hold of the wrong end of the stick, they broke it, in these ways:

The result is that the ISO metapredicate directives are useless for cross-referencers. They are best regarded as obsolete.

But what if you don't care about cross-referencing? I guess you've never used Masterscope for Lisp, or cscope(1) for C, or the Smalltalk IDE, or ... if you can ask such a question. Come to think of it, if you haven't used a reasonably reliable cross-referencer for Prolog, you don't realise what a useful debugging tool it can be. (I leave you to make your own inferences about the ISO Prolog committtee.) But let's suppose you don't care about cross-referencing, and you don't care about the added documentation value from stating the number of missing arguments. Is there any reason why you should care about meta_predicate?

Yes. From its inception (which I can speak with authority about, because I invented it), the intention was that the meta_predicate directive would licence a nonstandard translation of meta-calls. Take all/3 for example:

    all(append(Front), Xs, Ys)

can be translated as

    all(<{address of append/3}, Front>, Xs, Ys)

where <Address,Arg1,...> is a special kind of term that looks like a compound term, but has an address where the functor should be. This permits an implementation of call/3 which spreads the arguments and jumps straight to the given address. This eliminates the space overhead for meta-calls (there is no longer a need for a table of the addresses of all predicates; the compiler has arranged for the addresses of just the meta-called predicates to be available directly).

All right, all right; so that would also make cross-module meta-calls as efficient as any other meta-calls.

There is another thing that meta_predicate does, and it is related to the possibility of unusual implementations of meta-calls, but it has a rather more profound theoretical reason. If T1 and T2 are two terms representing goals, then T1 = T2 is, strictly speaking, a second-order unification (or worse). But it isn't implemented that way. In order to avoid the obvious soundness and completeness problems, we want to check at compile time that two meta-terms are unified only if at least one of them is an unbound variable. This means that a meta-argument of a predicate should only be a variable in the head, and that it should only be unified with another variable, and that it should only be passed to a meta-argument of some predicate. This is a type checking problem, and it is again a very useful debugging tool. But you can't even get started on it without meta_predicate; the ISO metapredicate abortion is quite useless for the job.

So even if we don't include modules in the community standard, and even if we don't include call/N, we still need either meta_predicate or a richer type declaration/checking system.

2.3. Lambda

We don't strictly speaking need lambda-expressions in Prolog; ordinary closures suffice quite often, and when they don't we can always write an extra predicate. But then, lambda-lifting shows that you don't strictly speaking need lambda-expressions in functional languages either; you can always move the thing out to top level, give it a name, and partially apply it at the original place of use.

Suppose we had a form something like

    all((\(X, Y) :- Body), Xs, Ys)

A compiler could implement this by finding the intersection of (free variables of Body \ free variables of \(X,Y)) with the free variables of the rest of the clause that contains it, and generating

p(FreeVars, X, Y) :- Body.
...
    all(p(FreeVars), Xs, Ys)

The interesting point is that it isn't possible to do this at run time. If we simply implemented

call((\(X1,X2) :- Body), X1, X2) :- call(Body).

then the first call to the lambda-expression would "use it up". We couldn't use this in a call to all/3, unless all of the elements of Xs were equal and so were all of the elements of Ys. If on the other hand we implemented

call((\(X1,X2):-Body), Y1, Y2) :-
    copy_term(foo(X1,X2,Body), foo(Y1,Y2,Goal)),
    call(Goal).

then

common_prefix(Front, Xs, Ys) :-
    all((\(X, Y) :- append(Front, Xs, Ys))).

wouldn't work, because Front would be copied, and it shouldn't be.

With compiler assistance (as in Mercury), lambda-expressions are fine. Without compiler assistance, lambda-expressions either don't work or require more, and more error-prone, annotation than I think the average programmer would be happy with. I do not want to recommend anything which requires compiler support or even that there be a compiler.

2.4. Recommendations

  1. At a minimum, meta_predicate directives with 0 as a meta-argument annotation distinct from ':'. This is of great practical use. A non-module compiler can simply ignore such directives. A module compiler that cannot exploit the extra information can simply treat them as alternative syntax for metapredicate.
  2. At least the first seven members of the call/N family; they are trivial to implement (I do not say trivial to implement efficiently, but even that is not particularly hard) and very useful.
  3. Do not have any kind of lambda-expression.

Section 3. Lists.

We can derive many of the operations in the Quintus Prolog library (which is a revision and extension of the DEC-10 Prolog library) by projection from a small number of predicates. These "base" predicates are specified in some detail, the specifications of the "derived" predicates are then obvious consequences. I do not claim that the "base" predicates are "basic" in any sense, only that they are "universal relations" from which a range of more useful "views" can be projected.

"Base" predicates have been chosen solely for convenience in specification. I do not say that they are "basic" in any way. They are generally harder to understand than some of their derived predicates. They are not, as a rule, a good implementation basis. (For example, it would be very silly to implement append/3 the way it is specified here.) What makes a predicate a "base" predicate is that it makes a good base for specification, so that details of types and so on have to be repeated as little as possible.

The notion of "projection" is closely related to the notion of "bootstrapping" (did Mrs Malaprop choose that word?) in the ISO Prolog standard. Many (but by no means all) of the "bootstrapped" predicates in the ISO Prolog standard are projections. Predicate P is a projection of predicate Q if and only if P is not only a special case of Q, but has its arguments in the same order as Q. The idea is that the strong semantic link between P and Q should make it easy for a programmer to memorise the argument order of Q and thereafter be able to correctly predict the argument order of P without any additional memory burden.

append_append_length/7

If any predicates have universal acceptance in Prolog, they must be append/3, member/2, and length/2. They are all projections of a single predicate.

:- pred append_append_length(
            list(X), list(X), list(X),
            list(Y), list(Y), list(Y),
            integer).

:- append_append_length(A, B, AB, C, D, CD, N)
   when A ; AB ; C ; CD ; N.

:- append_append_length(A, B, AB, C, D, CD, N) is semidet
   when list_skel(A) ; list_skel(C) ; nonvar(N).
:- append_append_length(A, B, AB, C, D, CD, N) is bounded
   when list_skel(AB) ; list_skel(CD).        

%  Sample implementation.

append_append_length(A, B, AB, C, D, CD, N) :-
    (   var(N) ->
        $append_append_length(A, B, AB, C, D, CD, 0, N)
    ;   integer(N) ->        
        N >= 0,
        $append_append_length(A, B, AB, C, D, CD, N)
    ;   $error
    ).

$append_append_length([], B, B, [], D, D, N, N).
$append_append_length([X|A], B, [X|AB], [Y|C], D, [Y|CD], N0, N) :-
    N1 is 1 + N0,
    $append_append_length(A, B, AB, C, D, CD, N1, N).

$append_append_length(A, B, AB, C, D, CD, N0) :-
    (   N0 =:= 0 ->
        A = [], B = AB, C = [], D = CD
    ;   N1 is N0 - 1,
        A = [X|A1], AB = [X|AB1],
        C = [Y|C1], CD = [Y|CD1],
        $append_append_length(A1, B, AB1, C1, D, CD1, N1)
    ).

Semantics: append_append_length(A, B, AB, C, D, CD, N) is true when length(A, N) and length(C, N) and append(A, B, AB) and append(C, D, CD). In Mercury, where the compiler reorders code according to data flow, we could write

append_append_length(A, B, AB, C, D, CD, N) :-
    length(A, N),
    length(C, N),
    append(A, B, AB),
    append(C, D, CD).

In ISO Prolog, however, this implementation is quite unsatisfactory.

?- append_append_length(A, "yz", "xyz", [X], R, "abc", N).

will find the (unique) solution, but if that is rejected, will then backtrack forever, trying ever longer bindings for A.

There is, in fact, no fixed ordering of these goals which is satisfactory for all queries, where "satisfactory" is defined as finding all solutions and terminating in a finite amount of time, for a query having a finite number of solutions.

Derived predicates

append_append(A, B, AB, C, D, CD) :-
    append_append_length(A, B, AB, C, D, CD, _).

append_length(A, B, AB, D, AD, N) :-
    append_append_length(A, B, AB, A, D, AD, N).

append(A, B, AB, D, AD) :-
    append_length(A, B, AB, D, AD, _).

append_length(A, B, AB, N) :-
    append_length(A, B, AB, A, B, AB, N).

append(A, B, AB) :-
    append_length(A, B, AB, _).

same_length(A, C, N) :-
    append(A, [], A, C, [], C, N).

same_length(A, C) :-
    same_length(A, C, _).

one_longer(A, C) :-
    same_length(A, [_|C]).

shorter_list(A, C) :-
    append([_|A], [], [_|A], _, _, C)  

cons(H, T, L) :-
    append([H], T, L).

The following predicates come from library(list_parts) in the Quintus library. However, in order that the argument order of these predicates should be consistent with the other list processing predicates, it is necessary to switch the argument order from Whole,Part to Part,Whole, because that's the way append/3 is.

head(H, L) :-
    cons(H, _, L).

tail(T, L) :-
    cons(_, T, L).

prefix(P, L) :-
    append(P, _, L).

proper_prefix(P, L) :-
    append(P, [_|_], L).

suffix(S, L) :-
    append(_, S, L).

proper_suffix(S, L) :-
    append([_|_], S, L).  

The next group of predicates are already fully consistent with append_append_length/7.

length(X, N) :-
    append_length(A, [], A, N).

member(X, L) :-
    append(_, [X|_], L).

memberchk(X, L) :-
    member(X, L),
    !.

:- nonmember(X, L)
   when ground(X), ground(L).

:- nonmember(X, L) is semidet
   when ground(X), ground(L).

nonmember(X, L) :-
    \+ member(X, L). 

select(X, L, R) :-
    append(_, [X|T], L, T, R).

selectchk(X, L, R) :-
    select(X, L, R),
    !.

select(X, Xs, Y, Ys) :-
    append(_, [X|Xs], L, [Y|Ys], R).

selectchk(X, L, Y, R) :-
    select(X, L, Y, R),
    !.

last(ButLast, X, L) :-
    append(ButLast, [X], L).

last(X, L) :-
    last(_, X, L).

nextto(X, Y, L) :-
    append(_, [X,Y|_], L).

This predicate corresponds to the old correspond/4 predicate, but has a consistent argument order:

member_member(X, Xs, Y, Ys) :-
    append(_, [X|_], Xs, _, [Y|_], Ys).

One operation which is commonly asked for is finding the Nth element of a list. Here, unfortunately, we have two clashing archetypes: arg/3 and the append_append_length/7 family. The Quintus library assimilated "Nth element" to arg/3, with predicates

  nth0(N, List, Member)
  nth0(N, List, Member, Residue)
  nth1(N, List, Member)
  nth1(N, List, Member, Residue)

The clash with member(Member, List) and select(Member, List, Residue) is unfortunate. The fact that 0 and 1 have other meanings in the Quintus library (and other meanings in connection with lists) is also unfortunate.

I am driven to the conclusion that the analogy with arg(N, Term, Arg) is bogus. In particular, arg/3 only works one way around: it does not solve for N (although at the price of notably worse generated code, it could) and it cannot solve for Term. This means that its first two arguments are, on any understanding of the term, "inputs", and by the inputs-before-outputs principle, should come first.

The "Nth element" predicates do not share arg/3's input/output restrictions. Indeed, they are more likely to be used to find N for a known Member than the other way around; it is tolerable for linear search to take linear time, but not for indexing.

I use the name component "_offset" to refer to a number of elements to skip. You can think of this as 0-origin indexing if you like, but it is actually more fruitful to think of it quite literally as the length of some list segment. Indeed, "length" could be used to be consistent with the other "_length" predicates, except that in this context it might be confusing.

I use the name component "_index" to refer to a position in a list. Like argument positions in a term, member positions in a list always start with 1.

The Nth element predicates are

member_member_offset(X, Xs, Y, Ys, Offset) :-
    append_length(_, [X|_], Xs, _, [Y|_], Offset).

member_member_index(X, Xs, Y, Ys, Index) :-
    member_member_index(X, [X|Xs], Y, [Y|Ys], Index),
    Index =\= 0.

member_offset(Member, List, Offset) :-
    append_length(_, [Member|_], List, Offset).

member_index(Member, List, Index) :-
    append_length(_, [Member|_], [Member|List], Index),
    Index =\= 0.

select_offset(X, Xs, Y, Ys, Offset) :-
    append_length(_, [X|T], Xs, [Y|T], Ys, Offset).

select_index(X, Xs, Y, Ys, Index) :-
    select_offset(X, [X|Xs], Y, [Y|Ys], Index),
    Index =\= 0.

select_offset(Member, List, Residue, Offset) :-
    append_length(_, [Member|Tail], List, Tail, Residue, Offset).

select_index(Member, List, Residue, Index) :-
    append_length(_, [Member|Tail], [Member|List], Tail, [Member|Residue], Index),
    Index =\= 0.

reverse_append_length/4

Prolog has traditionally had reverse/2. Common Lisp also provides (REVAPPEND - -), and such a predicate is commonly part of the implementation of reverse/2.

:- type reverse_append_length(list(T), list(T), list(T), integer).
:- reverse_append_length(L, B, RB, N)
   when L ; RB ; N.
:- reverse_append_length(L, B, RB, N) is semidet
   when list_skel(L) ; integer(N).
:- reverse_append_length(L, B, RB, N) is bounded
   when list_skel(RB).

%  Sample implementation.

reverse_append_length(L, B, RB, N) :-
    append_length(L, [], L, _, B, RB, N),
    $reverse(L, B, RB).
    
$reverse([], B, B).
$reverse([X|L], B, RB) :-
    $reverse(L, [X|B], RB).

Semantics: reverse_append(L, B, RB, N) is true when RB = reverse(L)++B and length(L) = N.

Notation: There is no new notation here, but I feel that I must repeat the warning that the "sample implementations" of "base" predicates are crafted for exposition and specification. This is not the way that I would implement this predicate for practical use. What implementor would use a two-pass algorithm when a one-pass algorithm could be used?

Except that L is to be reversed, the argument order of this predicate is identical to the argument order of append_length/4, as the name suggests.

Derived predicates

reverse_append(L, B, RB) :-
    reverse_append_length(L, B, RB, _).
    
reverse_length(L, R, N) :-
    reverse_append_length(L, [], R, N).

reverse(L, R) :-
    reverse_length(L, R, _).

permutation_length

:- pred permutation_length(list(T), list(T), integer).
:- permutation_length(A, B, N) is bounded
   when list_skel(A) ; list_skel(B) ; nonvar(N).

%  Sample implementation

permutation_length(A, B, N) :-
    same_length(A, B, N),
    $perm(A, B).

$perm([], []).
$perm([X|Xs], Ys1) :-
    $perm(Xs, Ys),
    append(_, R, Ys, [X|R], Ys1).

Derived predicates

permutation(A, B) :-
    permutation_length(A, B, _).

pairs_keys_values/3

Given the great practical utility of keysort/2, it is useful to be able to build and dismantle the lists it requires and produces.

I am not satisfied with the name that was used in the Quintus library (keys_and_values/3, because it does not make clear which argument is which. Nor is there any way to appeal to 'inputs before outputs' because the predicate is used both ways around. I am tentatively calling it pairs_keys_values/3. Whatever it's called, it is a surprisingly useful predicate.

:- pred pairs_keys_values(list(pair(K,V)), list(K), list(V)).
:- pairs_keys_values(Pairs, Keys, Values)
   when Pairs ; Keys ; Values.
:- pairs_keys_values(Pairs, Keys, Values) is semidet
   when list_skel(Pairs) ; list_skel(Keys) ; list_skel(Values).

%  Sample implementation

pairs_keys_values([], [], []).
pairs_keys_values([K-V|Pairs], [K|Keys], [K|Values]) :-
    pairs_keys_values(Pairs, Keys, Values).

Note that the sample implementation doesn't actually satisfy the requirements. In a traditional first-argument-indexing Prolog, if the first argument is a variable but one of the others is a list skeleton, the determinism of the query will not be detected.

Derived predicates

I'm not sure it's worth bothering with these.

pairs_keys(Pairs, Keys) :-
    pairs_keys_values(Pairs, Keys, _).

pairs_values(Pairs, Values) :-
    pairs_keys_values(Pairs, _, Values).

Deletion

The definition of the "delete" predicate is something I struggled with for years. Part of the problem is that if list R does not contain X, there are infinitely many lists L such that deleting X from L yields R, so the predicate really only works as a function. That tells us that L and X should precede R in the argument order. Should it be delete(Member, List, Residue) to be consistent with select/3? Or, because delete/3 is dangerously different from select/3, is it actually a good thing if they are different?

Prolog has three versions of equality: (=)/2, (==)/2, and (=:=)/2. Which of them should be used? Or should we have three predicates, the way Scheme used to have DELQ!, DELV!, and DELETE!?

Should the item to be deleted be a strict input, or should delete/3 be able to backtrack over the elements of the input list as in Cugini's library?

If X does not occur in L, is that a good input, or should it fail?

There is a definition in the Quintus library. It's basically this:

/* Historic background, NOT for inclusion */
:- pred delete(list(T), T, list(T)).
:- delete([], _, _) when true.
:- delete([H|T], X, R) when ground(H), ground(X).
:- delete(L, X, R) is semidet when ground(L), ground(X).
   
%  Sample implementation

delete([], _, []).
delete([H|T], X, R) :- H == X, !,
    delete(T, X, R).
delete([H|T], X, [H|R]) :-
    delete(T, X, R).

I don't want to take delete/3 away from the people who use it, if there are any, but it clearly doesn't belong in even an informal standard. The problem is that there are so many choices that none of them is going to be the right one for the majority of uses (if uses there be); none of them deserves the exclusive right to the name delete/3. The history of Scheme may be of interest here. There used to be three delete-from-list functions in Scheme, but there aren't any these days.

One of the questions about deleting from a list is how many elements to delete. I note that when I want to delete a known element from a list, I almost always want to delete just one copy (usually because I know there is only one copy). For that purpose, selectchk(X, L, R) works very well. The result is that despite writing delete/3, I cannot recall ever using it except to test it.

If delete/3 is not standard, what can programmers do instead? One easy answer is to use findall/3:

findall_member(Member, List, Test, Result) :-
    findall(Member, (member(Member, List), \+ \+call(Test)), Result).

... findall_member(Y, L, \+ Y = X, R)
... findall_member(Y, L, Y \== X,  R)
... findall_member(Y, L, Y =\= X,  R)

Using findall_member/4 instead of delete/3 has some important advantages:

The only bad thing about this predicate is efficiency. But that's why I introduced findall_member/4 in the first place, instead of calling findall/3 directly. It's comparatively simple for a compiler to recognise calls to findall_member/4 and generate a three-clause loop (essentially the structure of delete/3 shown above), using a cache to ensure that the same code is not generated repeatedly. Or for a compiler such as BinProlog, where findall/3 is fast, the compiler could just convert a findall_member/4 call to the corresponding findall/3 code.

The findall/4 predicate, which combines the effect of findall/3 with the effect of append/3, is unaccountably missing from ISO Prolog. When you know about it, you find nearly as many uses for findall/4 as for findall/3. This suggests to me that the right "base" predicate is not findall_member/4 but findall_member/5.

:- pred findall_member(T, list(T), void, list(T), list(T)).
:- findall_member(M, L, P, R) is semidet
   when list_skel(L), nonvar(P).

%  Sample implementation.

findall_member_append(_, [], _, R, R).
findall_member_append(M, [X|Xs], P, R0, R) :-
    \+ (X = M, call(P)),
    !,
    findall_member_append(M, Xs, P, R0, R).
findall_member_append(M, [X|Xs], P, [X|R0], R) :-
    findall_member_append(M, Xs, P, R0, R).

Derived predicates

findall_member(M, Xs, P, R) :-
    findall_member(M, Xs, P, R, []).

delete(L, X, R) :-
    findall_member(M, L, X \== M, R).

disjoint(S1, S2) :-
    findall_member(M, S1, memberchk(M, S2), []).

intersection(S1, S2, S) :-
    findall_member(M, S1, memberchk(M, S2), S).

difference(S1, S2, S) :-
    findall_member(M, S1, nonmember(M, S2), S).

union(S1, S2, S) :-
    findall_member(M, S1, nonmember(M, S2), S, S2).

symdiff(S1, S2, S) :-
    findall_member(M, S1, nonmember(M, S2), S, S3),
    difference(S1, S2, S3).

Recommendation

I am not actually recommending any of the predicates in this section. In particular, I am not recommending any of the traditional quadratic-time set-as-unordered-list predicates for inclusion in the "community standard library". What I do recommend is that either no such predicates (letting people call findall/3 themselves, or write and document their own deletion predicates), or else one predicate should be included in the library, and that predicate should be findall_member/5.

The Quintus library contains delete(L, X, N, R) which deletes the first N copies of X in L, or fewer if there aren't that many. To be honest, it is difficult to imagine any serious use of this. N=0 is just R=L. N=1 is just selectchk/3. N larger than the length of L is just delete/3. I never did find any other uses for it.

Sorting

sort/2 and keysort/2 may not be the most useful predicates in Prolog, but they are somewhere near the top. In a language without mutable data structures, but with a total order on the data structures it does have, sorting is often indispensable for getting code that is usefully fast.

The notion of a sorted list is clearly spelled out in the Prolog standard. Unbelievably, there are no means whatsoever of sorting a general list yourself; the only use of the "sorted list" notion is in setof/3. What's more, the Prolog standard doesn't even guarantee that the result of setof/3 will remain sorted, or even be sorted when it is first provided: the ordering on variables is only required to be consistent during the construction of a sorted list. Does this mean that the Prolog standard allows an implementation where

    length(L, N), N > 0,
    setof(X, member(X, L), S1),
    setof(X, member(X, L), S2),
    S1 == S2

may fail? Yes, it does.

In the same way that setof/3 needs sort/2 (yet the Prolog standard does not provide it), so bagof/3 needs keysort/2, and the Prolog standard does not provide it. While this makes it distressingly plain that the Prolog standardisers did not do a very good job, is it anything more than a nuisance? After all, it is quite easy to build a good sorting predicate on top of the compare/3 primitive.

Oddly enough, while ISO Prolog provides (@=<)/2, (@<)/2, (@>=)/2, (@>)/2, (==)/2, and (\==)/2, it does not provide compare/3. If one wants a minimal standard, this is back to front. Given compare/3, the other six comparisons can be implemented at low cost. (Although == and \== can be done even faster.) But given the other six comparisons, the best you can do for compare/3 is

compare(R, X, Y) :-
    (   X @> Y -> R = (>)
    ;   X == Y -> R = (=)
    ;   R = (<)
    ).

which would nearly double the cost of some sorting algorithms. Using the six comparison operators instead of compare/3 also forces you to use cuts or if-then-else in many cases where pure logic with no cuts of any kind could be used. This too is often an efficiency problem. There is an additional problem. In the presence of variables, the ISO Prolog term ordering is not required to be stable. It's required to be stable during a call to a built in predicate such as (@=<)/2 or setof/3, but it is not required to be consistent from call to call. This means that

  1. The following code fragments can all fail:
    X @< Y, X @< Y
    
    X @< Y, setof(Z, (Z=X;Z=Y), [X,Y])
    
    sort(X, p(X), S1), sort(X, p(X), S2), S1 == S2
    
  2. It is quite impossible for users to implement sort/2 or keysort/2 for themselves in ISO Prolog.

Both the efficiency and the instability problems suggest that some of the ISO Prolog committee didn't have a clue about what Prolog term ordering is for.

Essential recommendation

The built-in predicate compare/3 must be available with the same status as any other built-in predicate. For example, it should be in the same module as is/2 or clause/2.

The term ordering must be stable in the sense that only the disappearance of a variable (when it becomes bound to something else) can alter the order of two terms.

It is expected that compare(R,T1,T2) is linear in the size of its arguments (in fact, of the smaller of the two), and allocates no memory. It is possible to implement Prolog atoms with a symbol table so structured that atom comparisons take O(1) amortised time, but the requirement is actually "linear in the number of characters".

Derived comparison predicate

keycompare(R, K1-_, K2-_) :- compare(R, K1, K2).

This is useful for specifying keysort/2 and a number of other predicates.

Derived sorting predicates

:- pred sort(+void(?order,T,T), +list(T), ?list(T)).
:- sort(P, L, S) is semidet.
:- pred msort(+void(?order,T,T), +list(T), ?list(T)).
:- msort(P, L, S) is semidet.

The sort/3 predicate sorts its second argument, using the first argument to determine the order, and unifies its third argument with the result. If P(X1,X2) fails for any X1,X2 in L, sort(P, L, S) may fail. If P is not a partial order, the result is not defined. P may be called up to O(N.lgN) times, where length(L, N). In addition, only O(N.lgN) other work may be done. If two or more elements of L are equal according to P, only the first (leftmost) element will be retained.

The msort/3 predicate is like sort/3 but does not remove duplicates. It must be stable: if L = [...,X1,...,X2,...] and P(=,X1,X2), then S = [...,X1,...,X2,...].

Derived sorting predicates

sort(L, S) :- sort(compare, L, S).

msort(L, S) :- msort(compare, L, S).

keysort(L, S) :- msort(keycompare, L, S).

keysort(P, L, S) :- msort($keycompare(P), L, S).

    $keycompare(P, R, K1-_, K2-_) :- P(R, K1, K2).

The sort/2 and keysort/2 predicates are traditional. The need for msort/2 has often been felt; it is present in SWI Prolog, for example. The need for generalised versions has also been felt; LPA Prolog had a rather clumsy version which the higher order sorting predicates can simulate. With sort/3 we can do

    sort(codes:normalised_compare(kc), L, R)

without requiring a special-purpose predicate.

The derived comparison and sorting predicates should be in the same module as compare/3.

SWI Prolog includes merge/3, but not the obvious (and sometimes useful) keymerge/3; they are currently not included in this proposal. It also includes a merge_set/3 predicate, which appears to be an alias for ord_union/3.

Not only is it possible to have a stable sort with guaranteed worst case O(N.lgN), it is possible to have a stable sort which has linear cost when the input list is the concatenation of a small number of sorted lists (sorted by @=<, not just @<). For example,

merge(L1, L2, L) :-
    append(L1, L2, L12),
    msort(L12, L).

could be an O(|L|) implementation of merging. It is tempting to require that such an algorithm be used; this makes converting an already sorted list to ordered set form cheap, for example.

Locale-sensitive term ordering and sorting.

Term ordering in Prolog is based on the underlying coded character set. It is not intended to conform to the collating order of any natural language in any culture. Amongst other things, compare/3 must never report two terms as equal unless they are completely indistinguishable by Prolog. However, natural language collating orders may do things like ignoring alphabetic case so that 'bin' and 'BIN' sort together. Term ordering is also supposed to be fast; locale-sensitive ordering can be quite costly.

When producing output for people, it can be useful to compare and sort text in a way that people find reasonable. Just as ISO C has both strcmp() and strcoll(), so ISO Prolog needs both compare/3 and locale_compare/[3,4].

:- pred locale_compare(atom, order, T, T).
locale_compare(LocaleName, Order, T1, T2) is semidet
   when ground(LocaleName).
:- pred locale_compare(order, T, T).
locale_compare(Order, T1, T2) is semidet.

locale_compare(R, T1, T2) :-
    "L is the user's default locale",
    locale_compare(L, R, T1, T2).

These are defined in terms of locale names, not locale terms. Presumably they would be implemented by caching, so that sorting a list of a million terms would determine the locale's collation rules at most once.

Derived sorting predicates

locale_sort(R, S) :- sort(locale_compare, R, S).

locale_msort(R, S) :- msort(locale_compare, R, S).

locale_keysort(R, S) :- msort($locale_keycompare, R, S).

    $locale_keycompare(R, K1-_, K2-_) :-
        locale_compare(R, K1, K2).

locale_sort(L, R, S) :- sort(locale_compare(L), R, S).

locale_msort(L, R, S) :- msort(locale_compare(L), R, S).

locale_keysort(L, R, S) :- msort($locale_keycompare(L), R, S).

    $locale_keycompare(L, R, K1-_, K2-_) :-
        locale_compare(L, R, K1, K2).

Subsequences

People sometimes want a way to generate a subset of a given set. The Quintus Prolog predicates subseq/3, subseq0/2, and subseq1/2 address that.

subseq(AB, A, B) is true when AB is an interleaving of A and B. If AB represents a set, A will be a subset of AB, and B will be AB\A. It is regarded as a list operation, rather than a set operation, because it preserves the order of the elements.

subseq0(AB, A) :- subseq(AB, A, _).
subseq1(AB, A) :- subseq(AB, A, _), A \== AB.

The problem with this is that there are 2|AB| solutions. That makes these predicate useless except for rather small lists. In most potential applications for this predicate, there is some kind of

Recommendation

Do not include these predicates. In fact, think 50 times before including any predicates with exponential cost.

Undone

The following predicates from the DEC-10 and/or Quintus libraries have not been slotted into families yet.

Higher order list predicates

What self-respecting list-processing language would be caught dead without a basic kit of higher order list processing operations? Only Prolog, I guess.

The notation p/N+k is used for a family of predicates iterating over N lists, having k other arguments.

The following predicate families come from DEC-10 Prolog and Quintus Prolog:

maplist/N+1
The equivalent of 'map' and 'forall'. The functional programming community seem to have settled on 'map' as the name for this; if the name is changed at all, it should be to map/N+1.
maplist(P, [X11,...,X1n], ..., [Xm1,...,Xmn]) :-
    P(X11, ..., Xm1),
    ...
    P(Xm1, ..., Xmn).
scanlist/N+3
The name of this family was a mistake. I meant to copy the name from APL, which uses 'reduce' for this operation. OOPS. The functional community seem to have settled on the name 'foldl'. DEC-10 Prolog and Quintus Prolog had no equivalent of 'foldr'.
scanlist(P, [X11,...,X1n], ..., [Xm1,...,Xmn], V0, Vn) :-
    P(X11, ..., Xm1, V0, V1),
    ...
    P(Xm1, ..., Xmn, V', Vn).
cumlist/N+3
The APL name for this is 'scan' and the functional programming community have settled on 'scanl' for it.
cumlist(P, [X11,...,X1n], ..., [Xm1,...,Xmn], V0, [V0,V1,...,Vn]) :-
    P(X11, ..., Xmn, V0, V1),
    ...
    P(Xm1, ..., Xmn, V', Vn).
some/N+1
This serves the role of an existential quantifier. For lists, we don't actually need it.
some(P, [X11,...,X1n], ..., [Xm1,...,Xmn]) :-
  ( P(X11, ..., Xm1)
  ; ...
  ; P(Xm1, ..., Xmn)
  ).

Why don't we need it? Suppose we generalise member, to member/2N:

member(X1, [X1|_], ..., Xn, [Xn|_]).
member(X1, [_|Xs1], ..., Xn, [_|Xsn]) :-
    member(X1, Xs1, ..., Xn, Xsn).

Then instead of

    some(P, Xs1, ..., Xsn)

we can write

    member(X1, Xs1, ..., Xn, Xsn),
    P(X1, ..., Xn)

and we not only avoid all overheads of passing P and get to call it directly, if the query is try, we get to find out which values of X1...Xn make it true.

If an analogue of some/N+1 is implementable, so is an analogue of member/2N.

somechk/N+1
This is to some/N as memberchk/2 is to member/2. For lists, we don't actually need it.
somechk(P, [X11,...,X1n], ..., [Xm1,...,Xmn]) :-
  ( P(X11, ..., Xm1) -> true
  ; ...
  ; P(Xm1, ..., Xmn) -> true
  ).

Similarly, instead of

    somechk(P, Xs1, ..., Xsn)

we can write

    (member(X1, Xs1, ..., Xn, Xsn), P(X1, ..., Xn) -> true)

and get to find out why P is true as well that it is true.

include/N+2
The functional community calls this 'filter'.
include(P, [], ..., [], []).
include(P, [X1|Xs1], ..., [Xn|Xsn], Included) :-
    (   call(P, X1, ..., Xn) ->
        Included = [X1|Included1]
    ;   Included = Included1
    ),
    include(P, Xs1, ..., Xsn, Included1).

Note that this really requires P to be determinate, but it does allow variables in P to be bound. For a ground P, we could manage without this, writing

    findall(X1,
        ( member(X1, Xs1, .., Xn, Xsn), P(X1, ..., Xsn) ),
        Included)

would do the job. Some Prolog systems (BinProlog) may implement findall/3 so very efficiently to make this an attractive alternative.

exclude/N+2
This is include/N with the sense of the test reversed.
exclude(P, [], ..., [], []).
exclude(P, [X1|Xs1], ..., [Xn|Xsn], Included) :-
    (   call(P, X1, ..., Xn) ->
        Included = Included1
    ;   Included = [X1|Included1]
    ),
    exclude(P, Xs1, ..., Xsn, Included1).

There were and are others, but this will do to start with.

The obvious missing predicates are

foldr(P, [X11,...,X1n], ..., [Xm1,...,Xmn], V0, Vn) :-
    P(Xm1, ..., Xmn, V', Vn),
    ...
    P(X11, ..., Xm1, V0, V1).

scanr(P, [X11,...,X1n], ..., [Xm1,...,Xmn], Vn, [V0,V1,...,Vn]) :-
    P(Xm1, ..., Xmn, V', Vn),
    ...,
    P(X11, ..., Xm1, V0, V1).

Recommendations

This is very much a minimal set of higher-order list operations.

4. Sets

Many set representations can be used in Prolog: unordered lists, ordered lists, delta-coded ordered lists, bitstrings, binary search trees of various types, ternary search trees.

It is important that the commonly available implementation of sets should be efficient enough to use without hesitation. It is also important that the operation names should not conflict with widely used operation names for other set representations.

This means that either names such as union/3 must be avoided, or set predicates must not be used without a module prefix to identify the representation.

Without a change to ISO Prolog, we cannot enforce a requirement that module prefixes be used, so we have to avoid commonly used names.

The representation most widely used in Prolog textbooks is the unordered list representation. This is also the representation used in Common Lisp.

The ordered list representation has its flaws (adding, removing, or checking for the presence of a single element is O(|Set|) rather than O(lg|Set|)), but the unordered list representation has no virtues.

One of the flaws of any term comparison-based set representation in ISO Prolog is going to be that it allows variables to be set elements, so that two terms can be in a set because when they were entered they were not identical, but subsequent variable bindings can make them identical. The answer to this is to use a mode system which requires that set elements are sufficiently instantiated, perhaps via a "safe" version of compare/3.

I recommend the ordered set representation. It is space efficient, time efficient for bulk operations, and is already used in Prolog (the sort/2 and setof/3 predicates deliver results of this form, which we might as well exploit).

In order to preserve backwards compatibility and to avoid name clashes with the unordered list representation, all predicate names have the form ord_something.

The ordset representation cannot be implemented in Prolog without the use of Prolog's term ordering. As noted in the section on Sorting, compare/3 is not provided, even though term ordering is defined. That is really bad. The compare/3 predicate must be available to Prolog programmers.

ord_op/5

We can base all the ordered set operations on a single predicate. It uses a bit mask 2'byxBYX where X (x) says whether elements occurring only in Xs should be included in S1 (S2), Y (y) says whether elements occurring only in Ys should be included in S1 (S2), and B (b) says whether elements occurring in both sets should be included in S1 (S2).
:- type ordset(T) = list(T).

:- pred ord_op(integer, ordset(T), ordset(T), ordset(T), ordset(T)).
:- ord_op(M, Xs, Ys, S1, S2)
   when M, Xs, Ys.
:- ord_op(M, Xs, Ys, S1, S2) is semidet
   when ground(M), ground(Xs), ground(Ys).

ord_op(M, [], Ys, S1, S2) :-
    ( M /\ 2'000010 =:= 0 -> S1 = [] | S1 = Ys ),
    ( M /\ 2'010000 =:= 0 -> S2 = [] | S2 = Ys ).
ord_op(M, [X|Xs], Ys, S1, S2) :-
    $ord_op1(M, Xs, Ys, S1, S2, X).

$ord_op1(M, Xs, [], S1, S2, X) :-
    ( M /\ 2'000001 =:= 0 -> S1 = [] | S1 = [X|Xs] ),
    ( M /\ 2'001000 =:= 0 -> S2 = [] | S2 = [X|Xs] ).
$ord_op1(M, Xs, [Y|Ys], S1, S2, X) :-
    compare(R, X, Y),
    $ord_op12(R, Xs, Ys, S1, S2, X, Y, M).

$ord_op2(M, [], Ys, S1, S2, Y) :-
    ( M /\ 2'000001 =:= 0 -> S1 = [] | S1 = [Y|Ys] ),
    ( M /\ 2'001000 =:= 0 -> S2 = [] | S2 = [Y|Ys] ).
$ord_op2(M, [X|Xs], Ys, S1, S2, Y) :-
    compare(R, X, Y),
    $ord_op12(R, Xs, Ys, S1, S2, X, Y, M).

$ord_op12(<, Xs, Ys, S1, S2, X, Y, M) :-
    ( M /\ 2'000001 =:= 0 -> S1 = T1 | S1 = [X|T1] ),
    ( M /\ 2'001000 =:= 0 -> S2 = T2 | S2 = [X|T2] ),
    $ord_op2(M, Xs, Ys, T1, T2, Y).
$ord_op12(>, Xs, Ys, S1, S2, X, Y, M) :-
    ( M /\ 2'000010 =:= 0 -> S1 = T1 | S1 = [Y|T1] ),
    ( M /\ 2'010000 =:= 0 -> S2 = T2 | S2 = [Y|T2] ),
    $ord_op1(M, Xs, Ys, T1, T2, X).
$ord_op12(=, Xs, Ys, S1, S2, X, Y, M) :-
    ( M /\ 2'000100 =:= 0 -> S1 = T1 | S1 = [X|T1] ),
    ( M /\ 2'100000 =:= 0 -> S2 = T2 | S2 = [Y|T2] ),
    ord_op(M, Xs, Ys, T1, T2).

Looking at this predicate, we see that there is a clear notion of inputs (M, Xs, Ys) and outputs (S1, S2).

There are two ways to define is_ordset/1. One enforces the groundness of an ordered set. That is appropriate for systems which can delay calls to (some analogue of) compare/3. The other is useful for hacking, and allows set elements to be non-ground. At this stage I do not wish to make any recommendation, so I merely note the two definitions:

is_ordset(S) :-   % enforces groundness
    ground(S),
    sort(S, S).

is_ordset(S) :-   % allows sets of variables
    nonvar(S),
    $is_ordset(S).

$is_ordset([]).
$is_ordset([X|Xs]) :-
    nonvar(Xs),
    $is_ordset(Xs, X).

$is_ordset([], _).
$is_ordset([Y|Ys], X) :-
    nonvar(Ys),
    X @< Y,
    $is_ordset(Ys, Y).

Derived predicates

The list processing predicates, especially length/2, member/2, and select/3 can be used with ordered sets. The memberchk/2 and selectchk/3 predicates are rather more dubious, and need their own implementations.

list_to_ordset(List, Set) :-
    sort(List, Set).

ord_op(M, Xs, Ys, S1) :-
    ord_op(M, Xs, Ys, S1, []).

ord_compare(R, S1, S2) :-
    ord_op(2'010001, S1, S2, D12, D21),
    $ord_compare(R, D12, D21).

$ord_compare(<, [], [_|_]).
$ord_compare(>, [_|_], []).
$ord_compare(=, [], []).

ord_disjoint(S1, S2) :-
    ord_op(2'100, S1, S2, []).

ord_disjoint_union(S1, S2, S) :-
    ord_op(2'100011, S1, S2, S, []).

ord_intersect(S1, S2) :-
    ord_op(2'100, S1, S2, [_|_]).

ord_intersection(S1, S2, S) :-
    ord_op(2'100, S1, S2, S).

ord_intersection([], []).
ord_intersection([S1|Ss], S) :-
    ord_intersection(Ss, S2),
    ord_intersection(S1, S2, S).

ord_memberchk(X, S) :-
    ord_intersect([X], S).

ord_nonmember(X, S) :-
    ord_disjoint([X], S).

ord_proper_subset(S1, S2) :-
    ord_op(2'010001, S1, S2, [], [_|_]).

ord_proper_superset(S1, S2) :-
    ord_op(2'001010, S1, S2, [], [_|_]).

ord_selectchk(X, S1, S2) :-
    ord_op(2'010, [X], S1, S2).

ord_subset(S1, S2) :-
    ord_op(2'001, S1, S2, []).

ord_subtract(S1, S2, S) :-
    ord_op(2'001, S1, S2, S).

ord_superset(S1, S2) :-
    ord_op(2'010, S1, S2, []).

ord_symdiff(S1, S2, S) :-
    ord_op(2'011, S1, S2, S).

ord_union([], []).
ord_union([S1|Ss], S) :-
    ord_union(Ss, S2),
    ord_union(S1, S2, S).

ord_union(S1, S2, S) :-
    ord_op(2'111, S1, S2, S).

ord_union(S1, S2, U, D) :-
    ord_op(2'001111, S1, S2, U, D).

There are two names I am not happy with here. I would prefer it if ord_subtract/3 had a less "imperative" name, such as ord_difference/3. It would also be clearer if ord_union/4 were named ord_union_difference/4. However, that would break backwards compatibility, and that's bad manners

There are four predicates missing from the old library.

Of course, nothing would prevent an implementation providing those predicates for backwards compatibility.

Higher-order operations on sets

The higher order list operations can be applied to ordered sets. Results will not necessarily be ordered sets, but can be converted to ordered sets by sorting. (Except, of course, in the thrice-accursed ISO "Prolog", where there is no "standard" sort/2.) It is useful that [include,exclude]/N+2 do not alter the order of their 2nd argument and do not introduce duplicates, so filtering an ordered set yields an ordered set.

5. Queues

Should be based on q(Unary,L0,L) representation. Each basic operation is O(1).

A queue is a sequence, rather like a list. Adding items or lists of items at either end is rather like appending, and so is removing an item or list of items from the head end. (That is to say, what I'm calling a "queue" here is really an "output-restricted deque").

There are many ways to implement queues. A single list makes some operations cheap and other operations very expensive. A pair of back-to-back lists works nicely in a strict functional language, but since there is a slow rebalancing operation which may be executed repeatedly on backtracking, they are not efficient in Prolog. The representation I propose here was invented by someone at SRI. Fernando Pereira told me about it, and I am sorry to say that I have forgotten the inventor's name. This representation has the nice property that adding an element at either end or removing an element from the head end requires just a single unification. This makes it well suited to Prolog.

There is one flaw in this representation: it uses list differences so that appending an item at the tail end must be regarded as destroying the original queue. Queue uses should be single threaded. Mercury can enforce this, but does not support the partially instantiated terms the technique relies on. Prolog does support partially instantiated terms, but cannot enforce single threading.

We found a similar problem with sets; a representation which Prolog is good at has technical flaws which mean that it has to be used carefully, but the efficiency of the representation is such that we are unwilling to abandon it.

:- type queue(T)
   ---> q(unary, list(T), list(T)).
:- type unary
   ---> z
      ; s(unary).

is_queue(q(U,L0,L)) :-
    nonvar(U), nonvar(L0),
    $is_queue(U, L0, L).

$is_queue(z, L0, L) :-
    L0 == L.
$is_queue(s(U), [_|L1], L) :-
    nonvar(U), nonvar(L1),
    $is_queue(U, L1, L).

item_queue_append(X, q(U,L0,L), q(s(U),[X|L0],L)).

list_queue_append([], Q, Q).
list_queue_append([X|Xs], Q, Q0) :-
    item_queue_append(X, Q1, Q0),
    list_queue_append(Xs, Q, Q1).

queue_item_append(q(U,L0,[X|L]), X, q(s(U),L0,L)).

queue_list_append(Q, [], Q).
queue_list_append(Q0, [X|Xs], Q) :-
    queue_item_append(Q0, X, Q1),
    queue_list_append(Q1, Xs, Q).

queue_list(q(U,L0,L), Xs) :-
    $queue_list(U, L0, L, Xs).

$queue_list(z, L, L, []).
$queue_list(s(U), [X|L1], L, [X|Xs]) :-
    $queue_list(U, L1, L, Xs).

cons_queue(X, Q0, Q1) :-
    item_queue_append(X, Q0, Q1).

head_queue(X, Q1) :-
    item_queue_append(X, _, Q1).

tail_queue(Q0, Q1) :-
    item_queue_append(_, Q0, Q1).

member_queue_offset(X, q(s(U),L0,_), N) :-
    $member(X, U, L0, 0, N).

$member(X, _, [X|_], N, N).
$member(X, s(U), [_|L1], N0, N) :-
    N1 is N0 + 1,
    $member(X, U, L1, N1, N).

member_queue_index(X, q(s(U),L0,_), N) :-
    $member(X, U, L0, 1, N).

member_queue(X, Q) :-
    member_queue_index(X, Q, _).

memberchk_queue(X, Q) :-
    member_queue(X, Q),
    !.

queue_list_length(Q, Xs, N) :-
    Q and Xs have the same elements, N of them.

queue_empty(Q) :-
    queue_list(Q, []).

queue_length(q(U,L0,L), N) :-
    (   integer(N) ->
        N >= 0,
        $queue_length(U, L0, L, N)
    ;   var(N) ->
        $queue_length(U, L0, L, 0, N)
    ;   $error        
    ).

$queue_length(z, L, L, N, N).
$queue_length(s(U), [_|L1], L, N0, N) :-
    N1 is N0 + 1,
    $queue_length(U, L1, L, N1, N).

$queue_length(s(U), [_|L1], L, N0) :-
    N0 > 0,
    !,
    N1 is N0 - 1,
    $queue_length(U, L1, L, N1).
$queue_length(z, L, L, 0).

portray_queue(Q) :-
    (   write('[|'),
        member_queue_offset(X, Q, N),
        ( N =:= 0 -> true ; write(', ') ),
        write(X),
        fail
    ;   write('|]')
    ).

I have omitted the following predicates from the Quintus library:

Since the Quintus module is called 'newqueues', this should not be a problem.

Higher order operations on queues

I think I may already have too many operations providing access to the elements of a queue other than by popping them. It doesn't feel right to propose any higher-order operations on them. The idea of having 2N map*/N+1 predicates,

    map(P, L1, L2)
    map_list_queue(P, L1, Q2)
    map_queue_list(P, Q1, L2)
    map_queue_queue(P, Q1, Q2)

doesn't appeal either. Once the higher order list operations are there, anyone who wants higher order queue operations can easily program them him/herself.

6. Numbers.

6.1. Additional predicates.

There are three predicates which are so basic that the ISO Prolog standard itself uses them, while at the same time refusing to bless Prolog programmers with assured access to them.

succ/2 provides the successor function on natural numbers only. It is essential to the purposes for which succ/2 was invented that it should not accept or produce negative integers.

:- pred succ(integer, integer).
:- succ(X, Y) when X ; Y.
:- succ(X, Y) is semidet when X ; Y.

%  Sample implementation

succ(X, Y) :-
    (   integer(X) -> X >= 0, Y is X + 1
    ;   integer(Y) -> Y > 0, X is Y - 1
    ;   var(X), var(Y) -> instantiation fault
    ;   $error
    ).

plus/3 provides reversible arithmetic on exact numbers. It was intended to work with rational numbers as well as integers; since ISO Prolog omits rational numbers, it is here specified for integers only. It is essential to the purposes for which plus/3 was invented that it should be a proper logical predicate; the nature of floating point arithmetic means that it cannot behave as intended if it allows floating-point arguments.

:- pred plus(integer, integer, integer).
:- plus(X, Y, Z) when X, Y ; Y, Z ; Z, X.
:- plus(X, Y, S) is semidet when X, Y ; Y, Z ; Z, X.

%  Sample implementation

plus(X, Y, Z) :-
    (   integer(X), integer(Y) -> Z is X + Y
    ;   integer(Y), integer(Z) -> X is Z - Y
    ;   integer(Z), integer(X) -> Y is Z - X
    ;   $error
    ).

between/3 is astonishingly useful, not least because of its frequent use in bounding the search space of other predicates. It is difficult to take seriously any Prolog "standard" which omits it.

:- pred between(integer, integer, integer).
:- between(L, U, X) when L, U.
:- between(L, U, X) is semidet
   when ground(L), ground(U), ground(X).
:- between(L, U, X) is bounded
   when ground(L), ground(U).

%  Sample implementation.

between(L, U, X) :-
    (   integer(L), integer(U) ->
        (   integer(X) ->
            L =< X, X =< U
        ;   var(X) ->
            L =< U,
            $between(L, U, X)
	;   $error            
	)
    ;   integer(L), integer(X), L > X -> fail
    ;   integer(U), integer(X), X > U -> fail
    ;   $error
    ).

$between(L, L, L) :- !.
$between(L, _, L).
$between(L, U, X) :-
    M is L + 1,
    $between(M, U, X).

The order in which the sample implementation enumerates solutions for X is part of the specification.

There is of course a better alternative to between/3, and that is to extend the syntax of comparison.
Comparison formbetween/3 equivalent
E =< X =< FL is E,U is F,between(L, U, X)
E =< X < FL is E,U is F-1,between(L, U, X)
E < X < FL is E+1,U is F-1,between(L, U, X)
E < X =< FL is E+1,U is F,between(L, U, X)
E >= X >= FU is -F,L is -E,between(L, U, Z), X is -Z
E >= X > FU is 1-F,L is -E,between(L, U, Z), X is -Z
E > X > FU is 1-F,L is -1-E,between(L, U, Z), X is -Z
E > X >= FU is -F,L is -1-E,between(L, U, Z), X is -Z

This requires "distfix" operators, which are easy enough to add to Prolog syntax and have been so added in the past. These operators resemble the existing arithmetic comparison operators in evaluating their "outside" arguments; unlike the existing arithmetic comparison operators they do not evaluate their "inside" argument and only accept integers. This approach generalises neatly to multiple generated arguments, e.g., 1 =< X < Y =< N. However, between/3 is the simplest thing that could work and has been proven over many years, so let's stick with it.

6.2 Division Predicates.

The problem with integer division is that there are several different versions, and the one you get is seldom the one you need. We should consider seriously whether Common Lisp analogues such as

floor(N, D, Q, R)
ceiling(N, D, Q, R)
round(N, D, Q, R)
truncate(N, D, Q, R)

predicates with

floor(N, D), floor_remainder(N, D)
ceiling(N, D), ceiling_remainder(N, D)
round(N, D), round_remainder(N, D)
truncate(N, D), truncate_remainder(N, D)

as arithmetic functions, might not be a good idea. Obviously, one pair of these arithmetic functions would duplicate the effect of the existing // and mod, but can you instantly recall, without looking it up, which?

I brought this up in 1984 and with the BSI and with the ISO committee, to a deafening silence. Surely I can't be the only person who is never quite sure what // is going to do with negative numbers?

6.3 Additional Evaluable Functors.

Chapter 9 of the ISO Prolog standard lists the following evaluable functors:
FunctorNote
X + Ybut not +X
X - Y
- Y
X * Y
X / Y>td>It always delivers a floating-point result, even when X is divisble by Y. An implementation which has rational numbers is completely forbidden to return a rational answer to 1/2, can you believe it?
X // YYields an implementation-defined answer.
X rem Yyields the remainder corresponding to X // Y.
X mod Yyields X - floor(X/Y). Knuth's strong and carefully reasoned advice that X mod 0 should be allowed and should equal X has been ignored, sadly.
???There isn't any division corresponding to mod. Of course, this broke all the old Prolog code that relied on // and mod going together, but when did the ISO Prolog committee worry about that?
abs(X)
sign(X)Despite the overwhelming predominance of IEEE floating-point arithmetic on popular machines, there is no equivalent of IEEE's copysign(), nor is it clear that copysign() can be emulated.
float_integer_part(X)long-winded
float_fractional_part(X)
float(I)
floor(X)
truncate(X)
round(X)
ceiling(X)
X ** Y2**2 is 4.0, not 4.
sin(X)
cos(X)
atan(X)The one version of atan() no sane programmer uses!
exp(X)
log(X)A pity this wasn't spelled ln(X).
sqrt(X)
Bits << Shift
Bits >> ShiftUtterly useless definition.
X /\ YCrippled definition
X \/ YCrippled definition
\(X)Inconsistent definition

Let's start at the bottom and work up.

There are some obvious things missing. If you are using trig functions, sooner or later you need pi. We might as well have pi/0 and e/0 as evaluable functors.

Maximum and minimum are extremely useful. They've been in Fortran for years. They're in Lisp. They are easy to implement.

Recommendations

  1. Add the div infix operator.
  2. Ignore the standard sufficiently to allow the use of IEEE-754-conformant floating-point arithmetic.
  3. Add tan(X), atan2(X,Y), acos(X), and asin(X).
  4. Specify that >> acts as if it were a signed shift on twos-complement numbers.
  5. Specify that /\, \/, and \ act as if integers were twos-complement.
  6. Add max(X,Y) and min(X,Y); when the arguments are of different types use the usual floating-point contagion rule.
  7. Add pi and e constants.

7. Input/Output

7.1. Consulting Files

Recently my attention was drawn to the following facts:

  1. Prolog programs are in practice constructed using consult(File) or the alternative [File] syntax. All of the Prolog systems I have manuals for (Open Prolog, LPA Prolog, ALS Prolog, IBM PROLOG, Quintus Prolog, SICStus Prolog, Expert Systems Prolog/2, SWI Prolog, ECLiPSe, CIAO, NU Prolog, C Prolog, DEC-10 Prolog) support consult/1. Some of them support other operations as well.
  2. consult/1 is not in the standard. This means that while programs using consult/1 are de facto portable between Prolog implementations, they are not de jure standard; there may well be by now fully "ISO Prolog standard" compliant implementations which do not support consult/1.
  3. There were and are some relatively minor differences between Prolog systems about what exactly consult/1 does. It was pointed out in 1984 that these could be resolved fairly easily. For the sake of portability, any reasonable programmer would be willing to use the ISO Prolog source-file-loading operation instead of consult/1. Unfortunately, there isn't one. The ISO Prolog standard provides no operation at all for dynamically loading source files. The standard appears to assume that Prolog programs are built like C programs; there is no way to extend or revise an ISO Prolog program once built, and the means of building is completely unspecified.
  4. The ISO Prolog standard does provide two relevant forms.

    There are additional problems with those directives.

  5. This means that the fairly common practice of compiling some sort of problem-specific language (perhaps an expert system shell language) to Prolog, writing it to a file, and then loading that file, is simply not a possibility in ISO Prolog. We cannot even write a portable program that generates source files and a script to build a program from them. In fact, there isn't even any requirement that a Prolog implementation accept its source code from files at all ("Prolog text units" need not be files).
  6. We could live with this if the ISO Prolog standard provided all the necessary building blocks. Some public-spirited individual could provide an Open Source implementation of pconsult/1 and we could all use it. Unfortunately, the ISO Prolog standard does not provide all the necessary building blocks. There is, in particular, no way whatsoever to produce a static predicate, although a compile_predicate(Name/Arity) built-in predicate was proposed to the committee (by me) with a clear rationale. Here's a trivial implementation:
         compile_predicate(Name/Arity) :-
            scratch_file_name(Foobar),
            telling(Old), tell(Foobar),
            listing(Name/Arity),
            tell(Old), close(Foobar),
            consult(Foobar),  % 'compile' if you have it.
            delete_file(Foobar).
         

    That's not industrial-strength code, but it shows that all the Prolog systems listed above could easily support this operation.

  7. Worst of all, we can't even write our own loader that makes all predicates dynamic, using assert/1, because there is no standard way to open a file. More precisely, there is a standard predicate open/4 for opening a file, but there is no standard way to call that predicate. The format of the file name argument is implementation defined and there is no way for a program to discover what that format is. Contrary to the impression given by the examples in the standard, there is no requirement that any atoms ever be acceptable as file name arguments, nor, if accepted, that they be interepreted in accordance with the host file system.

Recommendations

  1. open/4 should be defined as accepting an atom, a non-empty list of atoms, or a non-empty list of character codes as a file name, plus additional implementation- defined ground terms. An atom provides a sequence of characters (its name). A non-empty list of atoms provides a sequence of characters determined by concatenating the names of its elements. A non-empty list of character codes provides a sequence of characters corresponding to the codes. In these three cases, the character sequence is taken verbatim and passed to the host file system for interpretation.
  2. :- include/1 should be defined as accepting an atom, a non-empty list of atoms, or a non-empty list of character codes as a text unit name, plus additional implementation-defined ground terms. A text unit name of these forms should always be interpreted as referring to the contents of a host file system file. If and only if there is no host file system file with exactly the name given, an implementation may try implementation-defined transformations of the file name. Typical transformations include adding a ".pl" or ".pro" suffix if there is no suffix, removing all "_" characters, changing " " to "-"...
  3. :- ensure_loaded/1 should be defined as accepting the same kinds of argument as :- include/1 and as interpreting them in the same way, except that it might try a different series of transformations, such as trying ".qof" before trying ".pl".
  4. ensure_loaded/1 should also be available as a built-in predicate.
  5. There should also be a consult/1 built-in predicate. This predicate should accept the same arguments as the :- include/1 directive and should use the same sequence of transformations, if any. From the host file system file name which finally works, a canonical form is computed, traditionally this has been the "absolute file name", with conversion to lower case for host file systems that ignore alphabetic case. Call this canonical form F, and the module that encloses the call M, unless the file is a module-file, in which case the module name in the :- module/2 directive is used for M. Each predicate property and clause loaded from the file is in effect tagged with the pair (M,F). This should work in three phases:
    1. Parse the file and any included files and generate an internal form, not changing the program.
    2. If the first phase completed without error, erase all existing predicate properties and clauses tagged (M,F).
    3. If the first phase completed without error, install the internal form, with (M,F) tags.

    The same (M,F) tagging should be done by :-include and :-ensure_loaded and by any other means of loading Prolog text units. For text units that are not files, F need not be a file name; it is simply a hidden canonical label.

7.2 Opening files by name

See the previous section. There are at least two dimensions of portability:

  1. Porting a program between different operating systems, using the same Prolog implementation (or implementations from a particular vendor that are claimed to be source compatible).
  2. Porting a program between different Prolog implementations on the same operating system.

In section 7.1 I complained that there is no way of specifying a file name to open/4 which is portable in the second sense. A Prolog implementor I shan't name has taken me to task over this, saying that it is quite reasonable for there to be no "portable" file name terms because only the first kind of portability matters, and the file names would not be portable anyway.

He is wrong. For example, I have Quintus Prolog, SICStus Prolog, XSB Prolog, SWI Prolog, Ciao, ECLiPsE, and a couple of others (including a version of Poplog I can't actually get at directly, as part of the SPARK/Ada toolkit), all running on the same machine. If you are trying to write code for publication, you want to test it under several Prolog systems. If you are trying to write library code for general use, you want to test it under several Prolog systems. Encouraging portability between different language implementations is one of the major purposes of a language standard. The really annoying thing here is that those implementations do share a common file name term (atom), so why can't the standard say so?

He is also wrong about the portability of file names. It is true that different operating systems have different file name syntax, but words matching the regular expression /[A-Z][A-Z0-9]{0,5}/ are portable between MVS, CMS, VMS, TOPS-10, TENEX, TOPS-20, RT-11, RSTS/E, all versions of UNIX from at least version 6 on, all versions of DOS, all versions of Windows, all versions of MacOS, the Burroughs (now Unisys) MCP, PR1MOS, ITS, and several other operating systems I haven't used. This is hardly surprising: that's the syntax for Fortran 66 identifiers, and Fortran programmers expected to be able to use Fortran identifiers as file names. The fact that some file names are not portable across a wide range of operating systems is no excuse for failing to provide standard access to file names that are portable across a wide range of operating systems.

In fact, there is a hidden assumption in the ISO Prolog standard which restricts ISO Prolog input/output to DOS, Windows, and POSIX-compatible systems such as Solaris and Macos-X. Since DOS and Windows accept forward slashes in file names as well as reverse slashes, then one can expect relative file names formed from the POSIX portable filename character set and using forward slash as the directory separator with names restricted to "8+3" form to be usable across all the file systems where ISO Prolog can be expected to work. What's more, while file names containing slashes are not legal OpenVMS (VMS, TOPS-10, TENEX, TOPS-20) file names, they can easily be mapped to file names which are legal. It is quite reasonable for open/4 to take as host file names those character sequences which are possible host file names and to use some kind of POSIX-to-host mapping for character sequences which are not.

7.3 Appending

Section 3.94 of the ISO standard lists the input/output modes read, write, and append. Section 7.10.1.1 repeats this. It is not clear whether an implementation may support additional input/output modes, such as rewrite.

The relevant wording in section 7.10.1.1 is
append -- Output. The source/sink is a sink. If the sink already exists then output shall start at the end of that sink, else an empty sink shall be created.

The problem is that "the end of the sink" is not a well defined term. Stream optionbs (section 7.10.2.11) include type(text) or type(binary). Now, in UNIX, Windows, and MacOS, the contents of a file are a sequence of bytes. (In many other operating systems, they are a sequence of records.) The end of a sequence of bytes is well defined. But a text file is a sequence of characters encoded as a sequence of bytes. For example, in TOPS-10 the last block of a file was typically padded with ASCII NUL characters. Where is the end? Is it at the end of the block, or is it at just after the last non-NUL character? Or consider DOS. It was based on CP/M, where a file was a sequence of 128-byte records. When you closed a character stream, if there was any space left in the last record, you'd get a Ctrl-Z followed by as many NULs as there were room for. Although Windows no longer uses this convention for output, it still obeys it for input. So if a 2 MB file contains a Ctrl-Z as its first character, a C program reading it as a text stream will think it is empty. Where is "the end of that sink"? Is it at the logical file end as reported by the file system, or is it at the first Ctrl-Z? This is actually serious problem, because if you start writing after the last byte in a file, the characters you write may be completely invisible when you read back:

    ?- open('HAIRY', write, Bin, [type(binary)]),
       put_byte(26),
       close(Bin),
       open('HAIRY', append, Txt, [type(text)]),
       put_code(0'*), nl,
       close(Txt),
       open('HAIRY', read, Src, [type(text)]),
       get_code(X),
       close(Src).

What is the value of X? Is it 0'*, or is it the end of file code?

To get text mode appending right, it is necessary to open the file for reading and writing, read and decode the current data, and then decide where the end of the text contents is found.

The matter of encoding is even more complicated than that. There is not a one-to-one correspondence between characters and bytes. There are over 98,000 characters in Unicode. To convert between a stream of characters and a stream of bytes it is necessary to use some kind of encoding. This encoding may be stateless (like UTF-8), or it may be stateful (like Unicode Technical Report 6), or for that matter like dynamic compression schemes such as dynamic Huffman encoding and Lempel-Ziv encoding. Stateless encodings are not a problem. Some stateful encodings can be restarted, by which I mean that you can put the encoder back into its initial state and keep or resume writing, and still recover the intended character stream without error or loss of information. Lempel-Ziv compression and Unicode Technical Report 6 are like that. If you create a file, write CS1, close it, open for append, write CS2, close it, and then read it back you will get CS1++CS2, even though you will not get the same byte stream that you would have got from writing CS1++CS2 in one go. But some stateful encodings cannot be restarted, because the original designer never thought of providing a code that says "forget what you know".

There is no way in ISO Prolog to get the effect of the O_APPEND flag in POSIX, which is something of a pity.

Recommendation

The behaviour of append mode for text files needs to be clarified.

When encodings are supported, the interaction between append and encodings needs to be clarified.

7.4 Positioning

An early draft of the ISO standard said that a stream position was an integer. In the days of ASCII/Latin-1 on UNIX and MacRoman on MacOS, the mapping between position in character sequence and position in byte sequence was the identity mapping. MS-DOS, with its CR-LF sequences, spoiled that, and meant that there was no general way to go from character position to byte position without reading all the previous bytes in the file. That was actually our good luck, because other encodings make it even harder.

However, the ISO standard makes stream positions harder to use than it has to. A normal use of stream positions is to remember them as you generate a file, and same them out in some kind of index structure. Then to look in the file, you load the index structure and use that to tell you where to go.

That usage is not allowed in ISO Prolog. The ISO Prolog standard says, in section 7.10.2.8, that stream position terms are stable while the stream is open. A stream position could, for example, include the stream identifier, making it unusable with any other stream for the same file.

Recommendation

It should be said that a stream-position term uniquely identifies a particular position in a source/sink that is connected to a persistent file as long as the contents of that file preceding the position are not modified.

Stream properties

encoding(EncName)

normalisation_form(NF)

7. Streams.

7.1. Encoding

7.2. User-defined stream properties

7.3. Binary input/output

7.4. UBF input/output

7.5. Encryption

7.6. Compression

7.7. Network streams

8. Files and Directories.

Drat, I ran out of time again.

9. Not really the end.

This is a draft. More operations will be added, and community discussion may result in adding, changing, or removing some of the operations listed here.

There is nothing here which was not obvious at least two years before the Prolog standard came out. In particular, at a conference in 1985, I warned explicitly that a "minimal" standard was not at all a good idea, and by 1986 most of the predicates listed above could have been designed, had a thorough overhaul of the library been desired instead of minor revisions.