Grammatical Notations

Copyright © 1993-2001 by the Xerox Corporation and Copyright © 2002-2005 by the Palo Alto Research Center. All rights reserved.

LFG vis-à-vis XLE notations

XLE uses only ASCII characters (so the up arrow is represented by ^ and the down arrow by !). The following table gives LFG notations and their XLE equivalents:

LFG notation               XLE equivalent             Description
↑                          ^                          f-structure metavariable
↓                          !                          f-structure metavariable
=                          =                          defining equality
                           =                          meta-category definition
=c                         =c or =C                   constraining equality
∈                          $                          set membership
¬                          ~                          negation (complementation)
d                          d                          existential constraint (standard notation)
d                          d                          existential constraint (Sadler)
←                          <-                         off-path constraint
→                          ->                         off-path constraint
⊑                          <<                         subsumption (subsumes)
⊒                          >>                         subsumption (is subsumed by)
{ a | b | c | ... | z }    { a | b | c | ... | z }    disjunction
( a )                      { a }                      optional f-structure constraint
symbol_                    symbol_                    instantiation
not used                   <h                         head precedence
not used                   >h                         head precedence
not used                   <s                         scope relation
not used                   >s                         scope relation
not used                   $<h<s                      surface adjunct scope
not used                   $<h>s                      surface adjunct scope

Section headers

The user directly edits the grammar files in XLE. This means that the user must be aware of the section headers that separate different grammatical resources in the grammar files. Each section begins with a two-part name, a resource type, and a version identifier (always (1.0)). Here are example section headers for each resource type:

TEST ENGLISH CONFIG (1.0)

TEST ENGLISH RULES (1.0)

TEST ENGLISH TEMPLATES (1.0)

TEST ENGLISH FEATURES (1.0)

TEST ENGLISH LEXICON (1.0)

Each section ends with four dashes:

----
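
The header and terminator conventions can be checked mechanically. The following is a minimal Python sketch, not part of XLE: it assumes the grammar file is available as plain text, that only the five resource types listed above occur, and that a line consisting of exactly four dashes closes a section. The name split_sections is hypothetical.

```python
import re

# Header shape described above: two-part name, resource type, "(1.0)".
HEADER = re.compile(
    r"^(\S+)\s+(\S+)\s+(CONFIG|RULES|TEMPLATES|FEATURES|LEXICON)\s+\(1\.0\)\s*$"
)

def split_sections(text):
    """Return a list of (name, resource_type, body) tuples."""
    sections, current, body = [], None, []
    for line in text.splitlines():
        m = HEADER.match(line)
        if current is None and m:
            # Start of a new section.
            current = (m.group(1) + " " + m.group(2), m.group(3))
            body = []
        elif current is not None and line.strip() == "----":
            # Four dashes end the current section.
            sections.append((current[0], current[1], "\n".join(body).strip()))
            current = None
        elif current is not None:
            body.append(line)
    return sections

sample = """TEST ENGLISH RULES (1.0)

S --> NP VP.

----
TEST ENGLISH LEXICON (1.0)

----
"""
sections = split_sections(sample)
```

Each tuple pairs the section name and resource type with the text between the header and the closing dashes.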

Regular predicates for c-structure rules

The daughter string of a c-structure node in LFG must belong to the regular language specified in the right-hand side of the relevant c-structure rule. Since every regular language can be specified by a regular expression applying only the operations of concatenation, union, and Kleene closure to primitive terms, a simple regular expression notation would be sufficient to characterize all and only the node expansions that are permitted by lexical-functional theory. But the rich mathematical properties of the regular languages offer many other possibilities for expressing linguistic generalizations without any increase in formal power or change to the set of acceptable trees. Many of these expressive possibilities are made available in XLE's c-structure rule notation. XLE allows c-structure expansions to be specified by means of Boolean combinations of `regular predicates' over primitive terms. Since the regular languages are closed under the operations of union, intersection, and complementation, Boolean combinations will be regular if each of the predicates itself only denotes a regular language. Despite their overall expressiveness, the predicates described below all have this characteristic.

An LFG c-structure rule in XLE is a specification of the form

M --> p.

where M is the mother category whose expansion is defined by this rule and p is a regular predicate that determines the (regular) set of possible daughter strings for M. In the rule S --> NP VP, for example, S is the mother category and NP VP is a predicate of concatenation. The set of regular predicates is constructed recursively from primitive terms as described below. A primitive term is a string-element predicate that may or may not be annotated with functional schemata.

string elements

category symbol

symbol[ arg ]

symbol[ arg1 , arg2 , ... , argn ]

e

?

A category symbol is a sequence of characters that matches nodes labeled with that sequence. Any alphanumeric character is allowed in the sequence, along with the special characters _ and -. For example: NP Nbar N-bar N_bar N2. Other characters may also be included provided they are preceded by the back-quote character `. Back-quote thus serves as an escape character for the regular predicate notation, making it possible to mention ordinary punctuation marks in a grammar.

A complex category is a category symbol followed by square brackets that have one or more atomic values separated by commas. For example: AP[attr] AP[pred] VP[fin,perf]. To avoid confusion with other uses of square brackets in the notation, the '[' must immediately follow the last character of the base symbol. However, there may be whitespace between the brackets, the arguments and the commas, which will be normalized away by the grammar reader. For example: AP[ attr ] VP[ fin , perf ]. Complex categories are just a funny-looking notation for ordinary symbols, i.e. they are allowed in all places where a symbol can appear, such as categories, feature names or values, names for macros and templates etc. See the section on parameterized rules for how complex categories can be used.

The symbol e is the "epsilon" symbol for an LFG grammar: the predicate e matches against the empty string of categories. In combination with the union operator described below, this offers one way of expressing optionality. An example of the use of e is shown in the next section.

The symbol ? denotes the disjunction of all categories that appear anywhere else in a rule. Thus, the following two (nonsense) rules are equivalent (disjunction is indicated by the {..|..} notation):

NP --> DET ? A N.

NP --> DET {DET | A | N} A N.

The ability to denote the complete set of categories may not seem particularly useful in isolation, but it is a powerful notational tool when intersected with other predicates. For example, the intersection

[?* NP ?*] & [?* VP ?*]

is one way of denoting the set of strings containing at least one NP and one VP in either order and separated by any number of categories drawn from NP, VP, and other ones mentioned elsewhere in the rule.
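
The effect of this intersection can be modeled outside XLE. The following is a minimal Python sketch under two assumptions: a daughter string is a list of category names, and ? is approximated by any single category token (in XLE it is the disjunction of the categories mentioned in the rule). The helper names are illustrative only.

```python
import re

ANY = r"(?:\S+ )"                 # stand-in for ?: any single category token
HAS_NP = rf"{ANY}*NP {ANY}*"      # [?* NP ?*]
HAS_VP = rf"{ANY}*VP {ANY}*"      # [?* VP ?*]

def encode(cats):
    """Write a daughter string with one trailing space per category."""
    return "".join(c + " " for c in cats)

def has_np_and_vp(cats):
    # Intersection (&): the same string must satisfy both predicates.
    s = encode(cats)
    return bool(re.fullmatch(HAS_NP, s)) and bool(re.fullmatch(HAS_VP, s))
```

For instance, has_np_and_vp(["VP", "PP", "NP"]) holds while has_np_and_vp(["NP", "NP"]) does not, since the second string contains no VP.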

primitive terms

element

element : schemata ;

The unadorned string elements just introduced are primitive terms of the regular predicate notation. Every string element can also be annotated with a set of functional schemata. These are attached with a colon and terminated by a semi-colon (the semi-colon may be omitted when a punctuation mark that terminates some enclosing predicate is present). When a term with schemata is matched in a daughter string, the schemata are instantiated and added to the f-description.

Attaching schemata to an e that is disjunctive with some other predicate permits functional requirements to be imposed in trees that do not satisfy that other predicate. For example, the disjunctive predicate

{NP:(^ SUBJ)=!; | e: (^ SUBJ PRED)='PRO';}

defines the subject to be a dummy pronoun if an overt subject NP does not appear. Since there is no node beneath the epsilon and hence no lexical information, ! has no meaning in a schema attached to e and such a schema is therefore disallowed.

XLE supports the special abbreviatory convention that Kaplan and Bresnan proposed for non-epsilon terms that either have no schemata attached or whose schemata do not contain the daughter metavariable !. If such a term appears in a positive context, that is, not in the scope of any of the complementation predicates described below, then it is interpreted as if the schema ^=! were attached to it. The term:

VP:(^ TENSE)
    ^=!;

is thus equivalent to the abbreviated:

VP:(^ TENSE);

In the unmarked case mother and daughter f-structures are simply identified with each other. This permits the specification of head relations to be suppressed so that the assignment of particular grammatical functions can stand out.

The full set of regular predicates is constructed from this base of primitive terms. In the following recursive description the symbols p and pi denote arbitrary regular predicates.

concatenation

p1 p2 ... pn

The concatenation predicate is expressed by a sequence of predicates separated by white-space (spaces, tabs, carriage returns). This predicate is satisfied by an ordered sequence of substrings satisfying each of the predicates p1, p2, ... pn in turn. You may interchangeably use spaces or carriage returns to separate the concatenated predicates as you type them in. For example, the two following rules are equivalent:

     VP --> V NP.
     VP --> V
            NP.
     

iteration or repetition

p *

p +

p#n

p#n#m

p* denotes the Kleene closure of p: it is satisfied by zero or more substrings each of which satisfies p. p+ is satisfied by a sequence of one or more substrings satisfying p. The Kleene closure is useful, for example, for stating that an NP can contain zero or more PPs:

NP --> N PP*.

Predicates with # involve a specified number of repetitions. p#n is satisfied by the concatenation of exactly n substrings satisfying p, where n is a non-negative integer (0, 1, 2, ...). The predicate NP#2, for example, requires exactly two adjacent NP's to appear. p#n#m is satisfied by an expression that has at least n but not more than m repetitions of p. m is either a non-negative integer greater than n or else it is the symbol *, indicating an unspecified upper bound. Thus, NP#3#5 requires 3, 4, or 5 NP's in a row, while NP#3#* requires 3 or more NP's. Note that NP* and NP#0#* are equivalent as are NP+ and NP#1#*. As a special convenience, repetition factors are allowed to appear between a categorial symbol and the colon that introduces its schemata. For example, PP*:(^ (! PCASE))=(! OBJ); is accepted as a variant form of PP:(^ (! PCASE))=(! OBJ);*.
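
The counting predicates correspond closely to bounded quantifiers in ordinary regular expressions. The following Python sketch illustrates that correspondence for unannotated categories only; the functions repeat and satisfies are hypothetical helpers, and the encoding of a daughter string as space-terminated category names is an assumption of the sketch.

```python
import re

def repeat(cat, n, m=None):
    """Regex for cat#n (m omitted) or cat#n#m; m='*' means no upper bound."""
    unit = f"(?:{re.escape(cat)} )"
    if m is None:
        return f"{unit}{{{n}}}"          # exactly n repetitions
    upper = "" if m == "*" else str(m)
    return f"{unit}{{{n},{upper}}}"      # between n and m repetitions

def satisfies(pattern, cats):
    """True if the whole daughter string matches the pattern."""
    return re.fullmatch(pattern, "".join(c + " " for c in cats)) is not None
```

With these definitions, repeat("NP", 3, 5) accepts three, four, or five NP's in a row, and repeat("NP", 0, "*") accepts the same strings as NP*.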

grouping

[ p ]

Square brackets are used merely to explicitly mark that the components of p are to be treated as a unit with respect to other enclosing predicates. This is useful when the normal precedence conventions would associate these components in an unintended way or when you are unsure exactly what the normal precedence conventions would do. It is always safe to include brackets. For example, if you want to state that a VP can have zero or more sequences of PP followed by ADVP in it (e.g. V PP ADVP PP ADVP), the rule would be:

VP --> V [PP ADVP]*.

If instead you had:

VP --> V PP* ADVP*.

this would be interpreted as zero or more PPs followed by zero or more ADVPs.
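
The difference between the two rules can be seen by testing a few daughter strings. This Python sketch approximates each right-hand side with an ordinary regular expression over space-terminated category names; that encoding and the names GROUPED and UNGROUPED are assumptions of the sketch, not XLE features.

```python
import re

def satisfies(pattern, cats):
    """True if the whole daughter string matches the pattern."""
    return re.fullmatch(pattern, "".join(c + " " for c in cats)) is not None

GROUPED = r"V (?:PP ADVP )*"           # VP --> V [PP ADVP]*.
UNGROUPED = r"V (?:PP )*(?:ADVP )*"    # VP --> V PP* ADVP*.

alternating = ["V", "PP", "ADVP", "PP", "ADVP"]
stacked = ["V", "PP", "PP", "ADVP"]
```

The alternating string satisfies only the grouped rule, and the stacked string satisfies only the ungrouped one.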

disjunction or union

{ p1 | p2 | ... | pn }

This predicate is satisfied by any string meeting the conditions of at least one of the individual pi predicates. For example:

NP --> { DET | DEMON | POSSP } N.

optionality

( p )

This predicate indicates that a substring satisfying p may or may not be present. The predicate is equivalent to the disjunction {p|e} and the repetition p#0#1. For example, the VP rule might state that the verb is optionally followed by an NP:

VP --> V (NP).

conjunction or intersection

p1 & p2

This predicate is satisfied by any string of items meeting the conditions of both p1 and p2. An example of using intersection for ID/LP rules is seen in the section on regular expression macros.

negation or complementation

~p

This predicate is satisfied by any string that does not satisfy the predicate p. For example, the predicate ~[NP:(^ SUBJ)=!; VP] will match any string of nodes provided that it does not consist of a subject NP followed by a VP. The meaning of complementation is quite obvious when the predicate p contains no functional annotations: it denotes just the set of all strings not in the regular language specified by p. When schemata are attached, the predicate denotes the complement of the set of strings whose categorial requirements and associated schemata are both satisfied. This includes all the node-sequences that belong to the unannotated complement (such as VP NP, NP NP VP, VP NP NP) but also includes all strings that satisfy

NP:(^ SUBJ)~=!; VP

The f-structures defined by an annotated complementation predicate may have unexpected properties because of the nonconstructive meaning that LFG assigns to negative functional schemata. Also, to avoid confusion, terms without !-containing schemata that appear in the scope of a complementation predicate are not interpreted as containing ^=!.

relative complementation

p1 - p2

This predicate is satisfied by any string that satisfies p1 but does not satisfy p2. It is exactly equivalent to p1 & ~p2. As an equivalence going the other way, note that ~p denotes the same strings as ?*-p. Note that the schemata in p2 may be assigned negative, nonconstructive meanings and any intended ^=! interpretations must be stated explicitly.  For example, you might want a VP rule in which the verb has to be followed by some constituent; that is, it cannot contain just a V. This could be captured by:

VP --> [ V (NP) PP* ADVP* ] - V.
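
As a rough check of this rule, the following Python sketch treats p1 - p2 as "matches p1 and does not match p2", which is the p1 & ~p2 reading given above. It covers unannotated categories only, and vp_ok is a hypothetical name.

```python
import re

P1 = re.compile(r"V (?:NP )?(?:PP )*(?:ADVP )*")  # [ V (NP) PP* ADVP* ]
P2 = re.compile(r"V ")                             # V

def vp_ok(cats):
    s = "".join(c + " " for c in cats)
    # p1 - p2: the string satisfies p1 and does not satisfy p2.
    return P1.fullmatch(s) is not None and P2.fullmatch(s) is None
```

A bare V is rejected, while V NP and V PP ADVP are accepted.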

term complementation

\ p

The term complement is any single category-schemata combination that does not satisfy the requirements of p. This is the same as ?-p in contrast to the ?*-p of the usual complement. The term-complement of an unannotated category is simply the disjunction of all the other categories that appear in the same rule: \NP is the disjunction {VP|AP} if VP and AP are mentioned elsewhere in the rule. Given De Morgan's laws, the predicate \NP:(^ SUBJ)=!; is equivalent to the disjunction { NP:(^ SUBJ)~=!| VP | AP}. Note also that \p* is equivalent to ~[?* p ?*] in the special case where p is a primitive term. As for the other complementation predicates, any ^=! interpretations intended in p must be stated explicitly.
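
For unannotated categories, the two claims about \NP can be checked by brute force. The sketch below assumes a rule that mentions only NP, VP, and AP, and compares [\NP]* with ~[?* NP ?*] on every string of up to three categories. All names are illustrative.

```python
from itertools import product

def term_complement(cat, rule_cats):
    """Categories covered by the term complement of cat within one rule."""
    return [c for c in rule_cats if c != cat]

rule_cats = ["NP", "VP", "AP"]
others = term_complement("NP", rule_cats)   # the disjunction {VP|AP}

def in_star_of_complement(s):   # [\NP]*
    return all(c in others for c in s)

def not_containing(s):          # ~[?* NP ?*]
    return "NP" not in s

# Compare the two predicates on all strings of length 0 to 3.
agree = all(in_star_of_complement(s) == not_containing(s)
            for n in range(4) for s in product(rule_cats, repeat=n))
```

The comparison covers only short strings, so it illustrates the equivalence rather than proving it.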

ignore or insert

p1 / p2

This predicate denotes strings that satisfy p1 when substrings satisfying p2 are freely removed or ignored. Alternatively, it denotes strings formed from the language of p1 by arbitrarily inserting strings from the language of p2. For example, the following rule expresses the generalization that adverbs can be freely inserted after the object of English VP's:

VP --> V (NP:(^ OBJ)=!)
       [(NP:(^ OBJ2)=!) PP*:(^ XCOMP)=!] / ADV:! $ (^ ADJ).

Without this special notation, an ADV* with its schemata would have to be repeated several times and the generalization would be lost.
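
The following Python sketch models the category-level effect of this rule, with schemata omitted. It assumes that the ignore operator applies only to the bracketed group, in line with the convention that / binds more tightly than concatenation, and it handles only a single-category p2. The functions ignoring and vp are hypothetical.

```python
import re
from itertools import combinations

def join(cats):
    return "".join(c + " " for c in cats)

def ignoring(pattern, cats, ignorable):
    """p1 / p2 for a single-category p2: some subset of the
    ignorable tokens can be deleted to leave a string matching p1."""
    spots = [i for i, c in enumerate(cats) if c == ignorable]
    for k in range(len(spots) + 1):
        for drop in combinations(spots, k):
            kept = [c for i, c in enumerate(cats) if i not in drop]
            if re.fullmatch(pattern, join(kept)):
                return True
    return False

def vp(cats):
    # V (NP) [[(NP) PP*] / ADV]: try every split point between the
    # V (NP) prefix and the group in which ADV may be ignored.
    return any(re.fullmatch(r"V (?:NP )?", join(cats[:i]))
               and ignoring(r"(?:NP )?(?:PP )*", cats[i:], "ADV")
               for i in range(len(cats) + 1))
```

Under these assumptions V NP ADV PP ADV is accepted, while ADV V NP is not, because the adverb falls outside the group that the ignore operator applies to.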

linear precedence

p1 < p2

p1 > p2

These predicates impose a linear ordering on strings. p1 < p2 is satisfied by any string in which there is no substring satisfying p2 in front of a substring satisfying p1. This is equivalent to ~[?* p2 ?* p1 ?*], which reduces to [\p2]*[\p1]* if p1 and p2 are primitive terms. The opposite ordering restriction is expressed by the p1 > p2 predicate. Thus the rule

S --> [NP:(^ SUBJ)=!, VP] & NP < VP.

is equivalent to the more familiar rule

S --> NP:(^ SUBJ)=!; VP.

but expresses a different set of generalizations. Since the linear precedence predicates are defined in terms of complementation, the general caveats about schemata in negative contexts apply.
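
The equivalence of the two S rules can be illustrated at the category level. This Python sketch takes the complement formulation ~[?* p2 ?* p1 ?*] as the definition of p1 < p2 for primitive terms, enumerates both orders of one NP and one VP, and keeps those that pass the precedence condition. Schemata are ignored and the function name precedes is hypothetical.

```python
from itertools import permutations

def precedes(c1, c2, cats):
    """c1 < c2: no c2 occurs anywhere in front of a c1."""
    return not any(a == c2 and b == c1
                   for i, a in enumerate(cats) for b in cats[i + 1:])

# [NP, VP] & NP < VP: both orders of one NP and one VP,
# filtered by the linear-precedence condition.
orders = [list(p) for p in permutations(["NP", "VP"])]
allowed = [o for o in orders if precedes("NP", "VP", o)]
```

Only the order NP VP survives the filter, matching the more familiar concatenation rule.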

shuffle

p1 , p2

This predicate is satisfied by any string that can be formed by `shuffling' together the elements of strings satisfying p1 and p2. That is, the elements that satisfy p1 must come in their p1-determined relative order and similarly for the elements satisfying p2, but elements satisfying those two predicates can be freely intermixed (provided they maintain their prescribed relative order). Thus, [A B],[X Y] is satisfied by any of the strings

A B X Y
A X Y B
X A Y B
A X B Y
X A B Y
X Y A B

As other examples, NP:(^ SUBJ)=!, VP is satisfied by strings containing one subject NP and one VP in either order, and NP#2#4,VP#3 would be satisfied by strings containing three VP's intermixed with between two and four NP's. Note, however, that the underlying finite-state machine can become quite large as the number of different elements being shuffled grows. This is because the shuffle operator is producing all possible orders, and there are a factorial number of different orders. This means that a rule that shuffles more than ten different items will make the grammar slow to load.
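
The six strings listed for [A B],[X Y] can be reproduced with a short recursive enumeration. This is a Python sketch of the shuffle of two fixed sequences only, not of XLE's finite-state construction; the function name shuffle is an assumption.

```python
def shuffle(xs, ys):
    """All interleavings of xs and ys that keep each list's internal order."""
    if not xs:
        return [list(ys)]
    if not ys:
        return [list(xs)]
    # The next element comes either from xs or from ys.
    return ([[xs[0]] + rest for rest in shuffle(xs[1:], ys)] +
            [[ys[0]] + rest for rest in shuffle(xs, ys[1:])])

results = [" ".join(s) for s in shuffle(["A", "B"], ["X", "Y"])]
```

The number of interleavings grows quickly with the length and number of the shuffled sequences, which is the growth the load-time warning above refers to.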

XLE's shuffle predicate provides a regular generalization of the immediate dominance (ID) specifications of other syntactic theories, such as GPSG (Gazdar et al., 1985).

comment

" any string of characters "

Character strings enclosed in double-quotes are treated as invisible comments and assigned no other interpretation. They may appear anywhere inside a regular predicate expression except between the characters of what you intend to be an atomic category symbol. This allows you to embed justifications, explanations, or other annotations at the most appropriate positions.

<COMMENT> any string of characters </COMMENT>

Character strings enclosed in <COMMENT> ... </COMMENT> are comments just like character strings enclosed in double-quotes except that they allow double-quotes and other <COMMENT> ... </COMMENT> strings to be embedded inside of them. This makes them useful for commenting out large sections of a grammar even if the sections have comments in them.

<comment> and </comment> are also acceptable.

Scope of Predicates

When these predicates are put together in a complicated formula, the meaning of the combined expression depends on the relative scope that is assigned to the different operators. If you are ever in doubt about the relative scope of operators, you can always include grouping brackets to make your intended meaning clear. Without explicit grouping brackets, the precedence of operators is determined by a simple set of conventions. With one exception, a potential scope ambiguity between most of the operators (* + # ~ - \ / , < >) is resolved so that the left-hand operator has wider scope than the right-hand one. Thus the expressions in the first column receive the interpretation notated more explicitly by the entries in the second:

A < B - C      A < [B - C]
A - B < C      A - [B < C]
A / B*         A / [B*]
~A, B          ~[A, B]

The one exception to this general principle involves combinations of the term complement and iteration operators. Since the term complement of an iteration makes no sense, an expression such as \A* is more conveniently interpreted as [\A]* instead of \[A*]. The operators listed above all bind more tightly than the intersection and concatenation operators, so that A < B & C is interpreted as [A < B] & C and A / B C is interpreted as [A / B] C. Potential ambiguities between intersection and concatenation are resolved in favor of a tighter binding for the intersection operator. Thus, A & B C is interpreted as [A & B] C. The notations used for the remaining operators (optionality, grouping, and union) are such that their interpretation relative to each other and to other operators is always unambiguous.

Empty strings and empty nodes

Apart from their use in expressing various kinds of generalizations, the epsilon and optionality regular predicates in certain arrangements permit the right-side of a c-structure rule to be satisfied by an empty string of daughters. Such empty nodes or traces are used in many syntactic theories (particularly Government Binding theory) to code in phrase-structure trees the positions of null anaphors, the extraction sites of long-distance dependencies, and other kinds of coindexing conspiracies. The original LFG theory of long-distance dependencies (Kaplan and Bresnan, 1982) was also based on c-structure epsilons appearing in certain control domains. Kaplan and Zaenen (1989b) criticized the motivation for the c-structure approach to extraction phenomena. They proposed the formal device of functional uncertainty as the basis of a linguistically superior account that is much more compatible with LFG's general functional orientation. This approach deals directly with functional assignments and does not rely on empty nodes or traces as surrogate phrasal coding devices for implicit functional identities. The functional approach thus permits empty nodes to be banished from c-structure trees without losing any linguistic generalizations. In the absence of empty nodes the possible phrase-structure trees for a sentence are much more strongly constrained by the actual words of the string.

XLE implements the modern, functional uncertainty approach to long-distance dependencies (see below), and by default it also maintains a prohibition against empty c-structure nodes. The regular predicate notation allows you to use optionality or unannotated epsilons (e's) to factor phrasal generalizations in the right sides of rules, and the normal interpretation of the notation might therefore allow a rule to be satisfied by the empty string. XLE disallows that possibility by imposing a special interpretation: any rule of the form M-->p is interpreted by default as the rule M-->[p - e]; that is, any realization of p except for nothing. For example, suppose that a language permits NP's to be realized with or without noun heads as described by the rule NP --> (DET) A* (N) and that sentences are defined by the rule S --> NP VP. XLE would not assign a c-structure to the string Walks the dog, since it treats the rule as requiring that at least one of DET, A, or N be present in the subject NP.

The default interpretation is more subtle when the regular predicate contains an e annotated with functional schemata, as in the rule

S --> {NP:(^ SUBJ)=!| e: (^ SUBJ PRED)='PRO'}
      VP.

This installs a pronominal subject in the case where an NP is not present in the string. XLE achieves the intended effect of this specification without producing a c-structure containing an empty e node by shifting the schemata to a non-empty node either to the left or right of the e. In essence, XLE treats this rule as the equivalent but more redundant

S --> { NP:(^ SUBJ)=!; VP
       |VP: (^ SUBJ PRED)='PRO'}.

Given the convention of subtracting e's from every right-side predicate, there is always at least one non-empty node to the left or the right of an e that its schemata can be transferred to. Since schemata containing ! cannot be associated with e predicates, these schemata can be shifted to non-empty categories without changing their meaning.

XLE also makes it possible to explore alternative theories in which empty nodes play a more explicit role in c-structure representations, even though these are not in good style according to current LFG standards. The default behavior with respect to empty nodes is controlled by two parameters that can be set in the PARAMETERS section of an analysis configuration. Setting the parameter ALLOWEMPTYNODES to TRUE suppresses the default behavior of subtracting the empty string from the right hand side of every rule. If this is done, XLE will interpret the NP rule above as allowing empty NP nodes and the string Walks the dog would be assigned the following c-structure: (S NP (VP (V walks) (NP (DET the)(N dog)))).
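
The two readings of the NP rule NP --> (DET) A* (N) can be contrasted in a small Python sketch. It models only which category strings are accepted, assuming the default reading is p with the empty string subtracted and the ALLOWEMPTYNODES reading is p as written. The function names are hypothetical.

```python
import re

P = re.compile(r"(?:DET )?(?:A )*(?:N )?")  # (DET) A* (N)

def encode(cats):
    return "".join(c + " " for c in cats)

def np_default(cats):
    """Default reading M --> [p - e]: p, but not the empty string."""
    return len(cats) > 0 and P.fullmatch(encode(cats)) is not None

def np_allow_empty(cats):
    """Reading with ALLOWEMPTYNODES set to TRUE: p as written."""
    return P.fullmatch(encode(cats)) is not None
```

The two functions differ only on the empty daughter string, which is the case that lets Walks the dog receive an analysis with an empty NP.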

The second parameter, PRESERVEEPSILONS, affects the treatment of e's that appear explicitly in rules. If this parameter is set to TRUE, then schemata will not be shifted from e's, e transitions will be preserved in the underlying finite-state automaton, and the resulting c-structures will be displayed with explicit e nodes. Given the S rule above, the string Throw the ball would be assigned the following c-structure: (S e (VP (V throw) (NP (DET the) (N ball)))).

The f-structure corresponding to this c-structure will have a pronominal subject.

Regular abbreviations

There may be several reasons for introducing a category and a c-structure rule that defines the realization of its daughters. Depending on your theoretical stance, the existence of a node in a c-structure tree may be motivated by intonational or coordination evidence that you intend c-structure arrangements to account for, or by a universal theory of category/function correspondences as suggested by Bresnan (1982a). But it may also be the case that a category and c-structure rule are introduced for grammar-internal reasons, as a means of collecting in one place a specification of complex category sequences that would otherwise be replicated in several different rules. The purpose of such a category is to help in expressing generalizations that hold across several different grammatical constructions, but this may have the undesired side-effect of cluttering up the c-structure trees with nodes that have no extrinsic justification. XLE provides separate notational devices, meta-categories and c-structure macros, that permit the factoring of recurring grammatical patterns without forcing an unmotivated elaboration of linguistic representations.

Meta-categories

A meta-category permits certain kinds of cross-categorial generalizations to be expressed. In English, for example, the open-complement function XCOMP may be associated with any of the constituents AP, PP, NP, or VP', and we might write the VP rule as

VP --> V
(NP:(^ OBJ)=!)
(NP:(^ OBJ2)=!)
({ AP:(^ XCOMP)=!
|PP:(^ XCOMP)=!
|NP:(^ XCOMP)=!
|VP':(^ XCOMP)=!}).

and expect to see trees such as:

(VP (V consider) (NP John) (NP (Det a) (N leader))).

We could bring out the common functional association by introducing a new ordinary category XP:

VP --> V
(NP:(^ OBJ)=!)
(NP:(^ OBJ2)=!)
(XP: (^ XCOMP)=!).

XP --> {AP|PP|NP|VP'}.

but then the XP would show up as an extra, presumably unwanted and otherwise unmotivated, node in the tree:

(VP (V consider) (NP John) (XP (NP (Det a) (N leader)))).

If instead we specify XP as a meta-category, the abbreviated version of the VP rule would still produce the original tree. A meta-category is distinguished from an ordinary category by a very small change in the rule that defines its regular-predicate expansion: we separate the left and right sides by an equal-sign instead of a rewriting arrow:

XP = {AP|PP|NP|VP'}.

Rules that reference the category symbol XP, such as the VP rule, are not modified when, by virtue of this simple change to its expansion, XP is converted from an ordinary category to a meta-category.

As another meta-category example we might consider the VP category itself. The status of VP as a constituent in all languages has been the subject of much linguistic debate. The arguments for the existence of such a constituent, at least in languages like English, are mostly transformational in nature, having to do with movements, copies, and deletions of common constituent patterns under the assumption that transformational rules only operate on single nodes. The same facts can be accounted for in LFG by letting VP be an ordinary category that appears in the predicates that define several other rules, as in these S and VP' rules:

S --> NP:(^ SUBJ)=!; VP.

VP' --> (to) VP.

But the VP does not correspond to a distinct functional unit, and the generalizations about surface constituent patterns can equally well be expressed by defining VP as a meta-category:

VP = V
(NP:(^ OBJ)=!)
(NP:(^ OBJ2)=!)
(XP:(^ XCOMP)=!).

This will give rise to flatter c-structures than the ordinary VP rule would provide, but the surface strings and f-structures will be exactly the same.

In general, a symbol M is defined to be a meta-category by a statement of the form

M = p.

where p is an arbitrary regular predicate. Any other rule that mentions M is interpreted as requiring the appearance of substrings satisfying p in all positions where a node labeled M would otherwise be required. Unless ALLOWEMPTYNODES is TRUE, the empty string is not included as a substitution for M, even if p provides for it.

A meta-category M does not correspond to a node in the c-structure. By default, a meta-category does not map to an f-structure.

In this case, any schemata attached to occurrences of M (such as (^ XCOMP)=! in the VP rule) are attached instead to every node required by p. Thus, converting a category from ordinary to meta (merely by switching --> to =) does not normally change the sentences recognized by the grammar nor the f-structures assigned to them. However, distributing the schemata across all the nodes may produce unexpected results if the schemata contain ! and p contains sequences instead of unit-length terms: the ! will denote a different f-structure at every node, and this may lead to unintended inconsistencies. For example, if VP is defined as a meta-category, the term VP:^=!; appearing in the S rule will have the presumably undesired effect of identifying the S f-structure with the OBJ, OBJ2, and XCOMP f-structures.

Alternatively, if the Metacategory-constraints config parameter is set to Relabeled, there will be an f-structure corresponding to M, just as though M were a rule with a c-structure node. In this case, there is no need to distribute schemata attached to invocations of M; they attach to M's f-structure.

Meta-categories are implemented by expanding the rules in which they appear, and this process would not terminate for meta-categories that are self-recursive either directly or indirectly. When XLE encounters such a recursive expansion, it prints a warning message and substitutes the empty regular language ~?* instead of p for the recursive appearance of M. This terminates the expansion.

An alternative to meta-categories is to use phantom nodes. Phantom nodes allow greater flexibility because they can function as regular nodes or be called in such a way that they do not appear in the c-structure.

Regular-expression macros

Meta-categories stand for what may be thought of as single units of c-structure; regular macros, on the other hand, expand to arbitrary regular predicates with no intuition that they correspond to singleton nodes. The cross-categorial treatment of constituent coordination offers a good example. As described by Kaplan and Maxwell (1988b), the f-structure element corresponding to a coordination construction is the set of f-structures corresponding to the coordinated nodes. This holds independent of the categories involved. For instance, standard V coordination can be handled by the following rule:

V --> V: ! $ ^;
      CONJ
      V: ! $ ^.

But adjectives, nouns, prepositions, and phrasal categories such as VP and AP can also be coordinated in just the same way. We can factor out the general pattern by defining a regular macro:

COORD(CAT) = CAT: ! $ ^;
             CONJ
             CAT: ! $ ^.

This defines COORD as the name of a regular macro with one formal parameter, CAT. We can now write

V --> @(COORD V).

as the formal equivalent of the rule above. The @ marks the invocation of the regular macro COORD and specifies that in this instance its parameter CAT is to be realized as V. The effect is first to substitute V for CAT wherever it appears in the regular expression that expands COORD, and then to substitute the result in place of @(COORD V) in the V rule. But we can now also invoke the COORD macro in other rules:

A --> @(COORD A).

P --> @(COORD P).

VP --> { V
(NP:(^ OBJ)=!)
(NP:(^ OBJ2)=!)
(XP: (^ XCOMP)=!)
|@(COORD VP)}.

etc.

The expansions will be different in each case, because of the differing realizations of the formal parameter. This example illustrates two of the features that distinguish regular macros from meta-categories: Their invocations are explicitly marked with the special character @, and they may have formal parameters whose realizations are specified in the invocation. In an invocation the name of the macro and the realizations of any formal parameters are grouped together in parentheses. In the definition of the macro, the names of the formal parameters are given in parentheses between the macro name and the equal-sign.
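
The substitution step described for COORD can be sketched as plain text replacement. This Python example is an illustration of the idea only: it stores the macro body as a string, replaces whole-word occurrences of each formal parameter, and does not attempt to parse regular predicates as XLE does. The MACROS table and the expand function are hypothetical.

```python
import re

MACROS = {
    # name: (formal parameters, body), following the COORD definition
    "COORD": (["CAT"], "CAT: ! $ ^; CONJ CAT: ! $ ^"),
}

def expand(name, *args):
    """Substitute each realization for its formal parameter in the macro body."""
    params, body = MACROS[name]
    if len(args) != len(params):
        raise ValueError("number of realizations differs from number of parameters")
    for param, real in zip(params, args):
        # \b keeps CAT from matching inside longer symbols.
        body = re.sub(rf"\b{re.escape(param)}\b", lambda m: real, body)
    return body

rule = "V --> " + expand("COORD", "V") + "."
```

The resulting rule string is the V coordination rule written out in full, and calling expand("COORD", "VP") gives the corresponding VP expansion.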

As another macro example, consider a strategy for creating an ID/LP-style c-structure grammar in XLE. You might start by factoring the S rule into an ID component and an LP component:

S --> [NP:(^ SUBJ)=!, VP] & NP<VP.

But you might then want to view the linear-precedence predicate of this rule as an instance of a more general set of conditions it shares with a number of other rules. You could define this condition as a regular macro and then reference it by name in all the appropriate rules:

LP = NP<{VP|AP|PP|VP'} & {V|A|N|P}<{VP|AP|NP|PP|VP'}.

The first clause of the LP predicate asserts that NP's come before other phrasal categories, and the second indicates that the language is head-initial. This particular macro has no formal parameters, so there are no parentheses in its definition. Also, parentheses can therefore be omitted from its invocations, as in the following rules:

S --> [NP:(^ SUBJ)=!, VP] & @LP.

VP --> [V, NP:(^ OBJ)=!, NP:(^ OBJ2)=!, XP:(^ XCOMP)=!] & @LP.

etc.

The @ at each invocation is thus the only indication that LP is a regular macro and not a meta-category. The macro invocations enable the linear-precedence constraints named by LP to be shared by these and other rules. The constraints are imposed (by intersection) as further conditions on the immediate-dominance specifications. In this example, clearly, the macro expansion has no interpretation as a single node.
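The way the ID and LP components combine by intersection can be sketched in Python. This is only an illustration, not XLE's implementation. It assumes that a bracketed ID specification admits exactly the orderings of the listed categories, and that a predicate x<y holds of a daughter string when no member of y precedes a member of x. The names LP_CLAUSES, satisfies_id, satisfies_lp, and admits are hypothetical.

```python
from collections import Counter

# The two clauses of the LP macro, each as (earlier categories, later categories).
LP_CLAUSES = [
    ({"NP"}, {"VP", "AP", "PP", "VP'"}),
    ({"V", "A", "N", "P"}, {"VP", "AP", "NP", "PP", "VP'"}),
]

def satisfies_id(daughters, required):
    """True if the daughters are some ordering of the required categories."""
    return Counter(daughters) == Counter(required)

def satisfies_lp(daughters, clauses=LP_CLAUSES):
    """True if, in every clause, no 'later' category precedes an 'earlier' one."""
    for earlier, later in clauses:
        last_earlier = max(
            (i for i, c in enumerate(daughters) if c in earlier), default=-1)
        first_later = min(
            (i for i, c in enumerate(daughters) if c in later), default=len(daughters))
        if last_earlier > first_later:
            return False
    return True

def admits(daughters, required):
    """Intersection of the immediate-dominance and linear-precedence conditions."""
    return satisfies_id(daughters, required) and satisfies_lp(daughters)
```

Under these assumptions, admits(["NP", "VP"], ["NP", "VP"]) holds while admits(["VP", "NP"], ["NP", "VP"]) does not, because the second ordering violates the first LP clause.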

A regular macro M is defined by a statement of either of the forms

M = p.

M (param1 param2 ... paramn) = p.

where p is a regular predicate. The first form is exactly the same as for the definition of a meta-category; the interpretation as either a macro or meta-category is determined at each invocation by whether or not it is marked with an @. The second form provides for some number n of formal parameters, where each parami is a symbol that presumably appears somewhere in p. The ith realization at a particular invocation is systematically substituted for the symbol parami everywhere in p. It is an error if the number of arguments (realizations) differs from the number of parameters.

An invocation of the macro M appearing in another rule (or macro or meta-category definition) is a regular-predicate term of either of the forms

@M

@(M real1 real2 ... realn )

The first is appropriate for a macro with no realizations. The realization expressions reali in the second form are themselves arbitrary regular predicates, perhaps with attached schemata. Those expressions will be substituted in place of the corresponding parami symbols in the predicate defining M. As with meta-categories, a warning message will be printed and the empty regular language ~?* will be substituted if a cyclic expansion of M is detected. Furthermore, unless a parameter name in the macro definition is followed by a question-mark, it is an error for a rule to contain an invocation in which the number of arguments (realizations) differs from the number of parameters specified in the definition for M.
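The substitution of realizations for formal parameters can be sketched as follows. This is a minimal Python illustration, not XLE's implementation: the macro body "CAT CONJ CAT" is a simplified stand-in for a coordination macro, the function name expand_macro is hypothetical, and optional (question-mark) parameters and cyclic expansions are not modeled.

```python
import re

def expand_macro(params, body, realizations):
    """Substitute the i-th realization for the i-th formal parameter in body."""
    if len(params) != len(realizations):
        raise ValueError(
            f"expected {len(params)} realizations, got {len(realizations)}")
    mapping = dict(zip(params, realizations))
    # Split the body into symbol tokens and single non-symbol characters so
    # that only whole symbols are replaced (CAT, but not the CAT in CATEGORY).
    tokens = re.findall(r"[A-Za-z0-9_']+|[^A-Za-z0-9_']", body)
    return "".join(mapping.get(token, token) for token in tokens)
```

For example, expand_macro(["CAT"], "CAT CONJ CAT", ["V"]) substitutes V for both occurrences of CAT, and a call with two realizations for the single parameter raises an error.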

Because regular macros do not correspond to categories, their expansions are treated somewhat differently. First, the empty-string expansion of a regular macro is retained whether or not ALLOWEMPTYNODES is TRUE. Second, given the intuition that a regular macro expands to an arbitrary regular predicate and not a singleton node, it makes no sense to distribute schemata attached to the macro invocation to all the terms in the expansion. Indeed, it makes little sense to attach schemata to the macro invocation at all, but for convenience we allow them to be attached and give them the following distinctive interpretation: macro invocations of the forms

@M: schemata;

@(M real1 real2 ... realn ): schemata;

are interpreted, respectively, as the sequences

@M e: schemata;

@(M real1 real2 ... realn ) e: schemata;

That is, the schemata are interpreted as if they were attached to an epsilon symbol that is concatenated with the macro-expansion predicate. On this interpretation, schemata containing ! have no meaning and are therefore disallowed. Since this e is implicit in the interpretation of the macro invocation and does not actually appear in the grammar, it is not preserved even when the parameter PRESERVEEPSILONS is TRUE.

Note that you cannot have off-path constraints in macros.

Phantom nodes

In some circumstances there may be motivation to treat a given pattern of c-structure nodes as the daughter sequence of a single mother node, while in other situations there may be no reason to isolate that pattern of nodes as a separate subtree. In the first case the pattern would appear as the right-side of an ordinary rewrite rule, while for the second case the pattern might appear as a meta-category definition. We used the English VP above to illustrate how a meta-category definition enables a pattern of nodes to occur in several c-structure positions without necessarily giving rise to a distinct constituent. The meta-category enables the c-structure generalizations to be factored out of other rules without introducing otherwise uninteresting nodes in the tree. Returning to that example, we also note that English VP's can be coordinated, and a separate node does seem necessary for the coordination construction so that the f-structure of each conjunct can be referred to as a separate unit. XLE allows a single VP specification to be used in these two different ways. The VP is defined as an ordinary rewrite rule, not as a meta-category, as in the following example:

VP --> { V
(NP:(^ OBJ)=!)
(NP:(^ OBJ2)=!)
(XP:(^ XCOMP)=!)
|@(COORD VP VP)}.

The VP references in the COORD macro invocation will result in the appearance of explicit VP constituents in this construction. In the situations where there is no need or desire to have a separate VP node and the VP would be better treated as a meta-category, that effect can be achieved by referring to the VP as if it were a macro without parameters. Thus the VP references in the following rules will not result in separate constituents in the c-structure:

S --> NP:(^ SUBJ)=!; @VP.

VP' --> (to) @VP.

In general, when a category defined by a rewriting rule is referenced as a macro, XLE treats that invocation as if it were a reference to a meta-category defined by the right-side of the rule. The conventions for interpreting schemata attached to meta-categories are applied to each such invocation. We say that such an invocation of a rewriting rule gives rise to a phantom node, to distinguish this kind of reference from its use as an ordinary phrasal expansion.

Note that if a METARULEMACRO is defined, it is NOT applied to the rule body before the rule is expanded as a macro. If you want the METARULEMACRO applied, you must apply it explicitly, as in the following:

S --> NP:(^ SUBJ)=!; @(METARULEMACRO VP VP @VP).

Parameterized rules (complex categories)

Complex categories are categories of the form symbol[arg1,...,argn]. There is only one place where complex categories get a special interpretation, namely when they are used as the left-hand side of a grammar rule. Such a grammar rule will be interpreted as a rule schema (called a parameterized rule), and will be instantiated to ordinary rules (still involving complex categories) for all relevant values of the parameters. For instance, the parameterized rule

NP[_NUM] -> N[_NUM]: (^ NUMBER)=_NUM.

represents a whole family of rules that differ in their value for _NUM:

NP[SG] -> N[SG]: (^ NUMBER)=SG.
NP[PL] -> N[PL]: (^ NUMBER)=PL.

...

Which of these rules actually get added to the grammar is determined by the use of NP[] as a complex category elsewhere in the grammar.
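The instantiation step can be pictured as a plain textual substitution over the rule schema. The following Python sketch is an illustration under that assumption only; the function name instantiate is hypothetical, and XLE's own procedure for deciding which instances are needed is not modeled.

```python
import re

def instantiate(schema, bindings):
    """Replace every whole-symbol occurrence of each parameter in a rule schema."""
    rule = schema
    for param, value in bindings.items():
        # The lookarounds keep a parameter such as _NUM from matching inside a
        # longer symbol such as _NUMBER.
        pattern = r"(?<![A-Za-z0-9_])" + re.escape(param) + r"(?![A-Za-z0-9_])"
        rule = re.sub(pattern, lambda _match, v=value: v, rule)
    return rule

SCHEMA = "NP[_NUM] -> N[_NUM]: (^ NUMBER)=_NUM."
```

With this definition, instantiate(SCHEMA, {"_NUM": "SG"}) and instantiate(SCHEMA, {"_NUM": "PL"}) produce the two rules listed above.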

The instantiation of the parameters will substitute the actual arguments for all occurrences of the parameters in the rule schema, regardless of whether the parameters appear in regular expressions, functional annotations, or even calls to macros, and regardless of whether they appear as symbols themselves or as arguments of embedded complex categories.

The instantiation of parameterized rules takes place before macro substitution. This means that the value of the parameters can influence the choice of which macro is called, but cannot affect symbols that appear in the body of a macro definition, even if they happen to be identical with parameters of the rule. For example, in the partial grammar

... ->  ... Cat[something] ... .
Cat[arg] ->  ... @macro ... OtherCat[arg]: (^ foo)=arg ... .
macro = ThirdCat: (^ bar)=arg.

calling Cat[something] will trigger an instantiation of the Cat[arg]-rule, resulting in

Cat[something] -> ... ThirdCat: (^ bar)=arg 
... OtherCat[something]: (^ foo)=something ...

where arg is replaced only in the functional annotations that appear directly in the parameterized rule, but not in those that are introduced via the macro call.

A parameterized rule can include parameter declarations that specify the possible set of values that a parameter can take. The notation for this is:

Cat[arg1 $ {a1 a2}, arg2 $ {b1 b2 b3}] -> ... 

This notation says that arg1 can be either a1 or a2, and that arg2 can be either b1, b2, or b3.  This might be used for the English VP where one argument states the form of the verb heading the VP and one the form of its complement:

 VP[_aux $ {fut modal pass perf prog},_form $ {base fin fut modal pass perf prog}]

XLE will complain if the arguments of a complex call do not match the parameter declarations of the corresponding parameterized rule.
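The kind of check involved can be sketched in Python. This is an illustration only: the table DECLARED mirrors the VP declaration above, while the function name check_call and the wording of the complaints are hypothetical and do not reproduce XLE's actual messages.

```python
# Declared value sets for each parameter position of a parameterized rule.
DECLARED = {
    "VP": [
        {"fut", "modal", "pass", "perf", "prog"},                 # _aux
        {"base", "fin", "fut", "modal", "pass", "perf", "prog"},  # _form
    ],
}

def check_call(category, args, declared=DECLARED):
    """Return complaints about a complex-category call; an empty list means it is fine."""
    allowed = declared[category]
    if len(args) != len(allowed):
        return [f"{category} expects {len(allowed)} arguments, got {len(args)}"]
    return [
        f"{category}: value {arg!r} is not declared for argument {position}"
        for position, (arg, values) in enumerate(zip(args, allowed), start=1)
        if arg not in values
    ]
```

Here check_call("VP", ["perf", "fin"]) returns no complaints, whereas check_call("VP", ["base", "fin"]) reports the first argument, since base is declared only for the second parameter.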

METARULEMACRO

If a macro named METARULEMACRO is defined, it is interpreted in a special way. The effective right-hand side of each rule in the grammar is taken to be the result of applying that macro to three parameters: the mother category of that rule, the base name of the mother category (for the complex categories of parameterized rules), and the specified right-hand side of the rule. The METARULEMACRO is useful for expressing generalizations that operate across all the rules of the grammar, such as coordination, brackets, parentheses, and linear precedence. Here is an example METARULEMACRO:

METARULEMACRO(_CAT _BASECAT _RHS) = {
_RHS & @LP "linear precedence rules"
|e: _BASECAT $ {NP N' N}; "treat NP coordination special"
@(NPCOORD _CAT _CAT)
|e: _BASECAT ~$ {NP N' N}; "regular coordination"
@(SCCOORD _CAT _CAT)
|LEFT_BRACKET
_CAT: { (* MOTHER LEFT_SISTER) "require containing rule to be branching"
|(* MOTHER RIGHT_SISTER) "to avoid a spurious ambiguity"
~(* MOTHER LEFT_SISTER)
|~(* MOTHER MOTHER)};
RIGHT_BRACKET}.

The first disjunct shows how you can add linear precedence constraints by intersecting them with the right-hand side of the rule. This makes the linear precedence constraints apply to all the rules in the grammar. The second disjunct shows how you can add coordination to a subset of the rules without having to enumerate all the possible parameter versions of these rules. Similarly for the third disjunct. The fourth disjunct shows how you can use c-structure constraints to avoid spurious ambiguities. The constraints involving LEFT_SISTER, RIGHT_SISTER, and MOTHER eliminate an analysis that uses a non-branching rule immediately above it. This means that the brackets will always attach to the top category in a non-branching rule chain.

The functional description language

The schemata that appear in c-structure rules and in lexical entries are written in XLE's functional description language. Except for the absence of bounded-domination metavariables, this is an extension of the notation presented by Kaplan and Bresnan (1982) and is used to express conditions that the f-structures and other projections corresponding to particular nodes must satisfy. Explicit projection symbols may be used in codescriptive statements that characterize structures representing different aspects of linguistic organization. The f-description language includes designators that denote symbols, semantic forms, f-structures, and other projection structures. Properties of those entities are asserted by propositions constructed from a small number of predicates and relations applied to those designators, and those propositions are combined together by the usual Boolean operators.

Designators

Designators are terms in the description language that stand for elements in the structures representing different projections. They are defined recursively in the following way:

attribute and value symbols

symbol

A symbol is a sequence of characters that are not separated by white-space (spaces, carriage returns, tabs) or by other delimiters in the f-description language (parentheses, brackets, predicate and relation symbols). Any character can be included in a symbol if it is prefixed with a backquote (`). The underscore and hyphen punctuation marks do not need to be prefixed by a backquote in order to be included in a symbol. For example: OBJ, OBJth, OBJ2, OBJ-TH, OBJ_TH.

A symbol character-sequence denotes the unique symbol with that name. The symbol may be an attribute in the f-structure (a feature or grammatical-function name) or in some other attribute-value structure, it may be the value of some attribute, or, in some cases, it may be both. The symbol OBL-GOAL, for example, may be both a grammatical function and the value of an attribute such as PCASE. LFG differs from many `unification-based' grammatical formalisms in that attributes and values are not necessarily disjoint. Symbols may also appear in special structures such as semantic forms, where they also retain their uniqueness property.

instantiated symbols

symbol_

Kaplan and Bresnan (1982) used the single-quote notation to denote semantic forms, as described below. These entities were used to formalize several different aspects of the interaction between the syntactic and semantic components, including a notion of individuation or unique instantiation. Further experience and understanding has suggested that unique instantiation is a formal device with motivation for purely syntactic features, and XLE therefore provides for instantiated feature values that carry no semantic entailments. An instantiated syntactic feature value is indicated by suffixing the underscore character _ to the name of an ordinary symbol. The notation is intended to suggest an indexing subscript.

Equalities of two ordinary symbols with the same spelling, such as NOM=NOM, are always true. In contrast, instantiated symbols, even those with identical spellings, are taken to denote distinct objects. Therefore, equalities of instantiated symbols are always false, and an equation such as OBL-ON_=OBL-ON_ will never be satisfied. For feature values that are ordinary symbols, the existential constraints in the f-description language can be used to discriminate between the absence of any defining equation and the presence of at least one but possibly several such equations. With instantiated symbols, the truth value of a Boolean formula can further discriminate between the appearance of exactly one defining equation as opposed to more than one.

As an illustration, suppose that the case-marking feature of a preposition is specified as an ordinary symbol, for example, by the schema (^ PCASE)=OBL-ON in the lexical entry for on. This will give rise to the proper functional configuration for a sentence like John relied on Bill. If the within-clause role of a fronted prepositional phrase is determined by functional uncertainty, the proper result will also emerge for the question On whom did John rely? The ungrammatical question On whom did John rely on Bill? will be rejected because the conflicting nominal predicates will both be assigned to the oblique-object grammatical function. But the equally ungrammatical question On whom did John rely on? will not be rejected because it involves doubling of only the semantically empty features of the preposition. Changing the PCASE value to an instantiated symbol via (^ PCASE)=OBL-ON_ will rule this out because of a clash between the two instantiations of OBL-ON_.

An instantiated symbol is interpreted as the corresponding ordinary symbol when it appears as an f-structure attribute, so that (^ OBL-ON) and (^ OBL-ON_) are exactly equivalent. The change of PCASE to an instantiated value thus still allows the proper assembly of an oblique object according to the schema (^ (! PCASE))=(! OBJ). An instantiated symbol is also treated as the corresponding ordinary symbol when it is used as a value in constraining as opposed to defining schemata. The existential constraint (! PCASE) will be satisfied if the PCASE value is OBL-ON_, and so will the constraining equation (! PCASE)=c OBL-ON.

Note that an equality between two designators for the same instantiation of an instantiated symbol will be true, as in the conjunction of (^ PCASE)=OBL-ON_ and (^ PCASE)=(^ PCASE).

It is possible to extract the base symbol from an instantiated symbol using the system-defined BASE attribute. For instance, if (^ PCASE)=ON_, then (^ PCASE BASE) will equal ON.
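The contrast between ordinary and instantiated symbols can be modeled in a few lines of Python. This is a sketch of the behavior described above, not of XLE's internals; the class names Symbol and Instantiated and the helper satisfies_constraining are hypothetical.

```python
class Symbol:
    """Ordinary symbol: two symbols with the same spelling are equal."""
    def __init__(self, name):
        self.name = name
    def __eq__(self, other):
        return type(other) is Symbol and self.name == other.name
    def __hash__(self):
        return hash(self.name)

class Instantiated:
    """Instantiated symbol: every occurrence of e.g. OBL-ON_ is a distinct object."""
    def __init__(self, name):
        # The base symbol is the spelling without the trailing underscore.
        self.base = name[:-1] if name.endswith("_") else name
    def __eq__(self, other):
        return self is other  # only the very same instantiation is equal
    def __hash__(self):
        return id(self)

def satisfies_constraining(value, expected):
    """=c comparison: an instantiated value is compared through its base symbol."""
    name = value.base if isinstance(value, Instantiated) else value.name
    return name == expected
```

Two separate Instantiated("OBL-ON_") objects compare unequal, a single one compared with itself is equal, and both satisfy satisfies_constraining(value, "OBL-ON").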

semantic forms

'function'

'function<a1..ak>'

'function<a1..ak>n1 ... nl'

A semantic form is an expression enclosed in single-quotes. It makes visible in the syntax only those aspects of semantic representation that interact in some way with syntactic properties. In particular, it defines the mapping between a predicate's governed grammatical relations and the particular argument positions of the semantic relation denoted by the function designator. Also, as discussed by Kaplan and Bresnan (1982), a semantic form acts like an instantiated symbol, carrying in this case a notion of semantic individuation: two semantic forms are taken to denote distinct semantic entities (and hence equality never holds between them) even if they are specified by identical expressions. The entities denoted by a semantic form may have other significant semantic properties, such as various kinds of scoping and entailment relations, but those have no syntactic consequences.

Included within the quotes of a semantic form is a designator for a semantic function, optional designators for one or more thematic argument functions a1... ak enclosed in angle brackets, and optional designators for one or more nonthematic grammatical functions n1 ... nl appearing after the brackets. The function is typically a symbol designating an atomic semantic relation (e.g. 'GIRL', 'WALK<(^ SUBJ)>'), but other types of designators are also allowed. For example, '(! PRED)<(^ SUBJ)>' might denote the semantic entity corresponding to a predicate nominal, constructed by applying a common-noun meaning (! PRED) to a single argument provided by a subject.

The thematic and nonthematic grammatical-function designators will typically be function-application expressions such as (^ SUBJ), (^ OBJ), etc. The attribute names in these expressions denote governable grammatical functions, and thus they should match one of the GOVERNABLERELATIONS patterns specified in the active configuration. The special symbol NULL may also appear and can be used to encode the fact that a particular semantic argument has no surface manifestation, such as in the passive which can be derived via a lexical rule. The symbol NULL thus serves as the null function ∅ discussed by Bresnan (1982b).

XLE's implementation of LFG's completeness and coherence conditions depends on the thematic and nonthematic grammatical-function designators. An f-structure is incomplete if it has a semantic-form PRED value with an ai or ni designating a grammatical function not locally present in the f-structure. An f-structure is incoherent if it contains a local semantic-form PRED value and a governable grammatical function (as specified by the GOVERNABLERELATIONS patterns of the current configuration) and that function is not sanctioned by one of the ai or ni in the PRED value. The distinction between the thematic and nonthematic designators (and the bracket-external notation for nonthematics) follows Bresnan (1982a): the thematic functions are mapped to arguments of the semantic relation and thus must satisfy its selectional restrictions. The nonthematic functions are purely syntactic and may be filled, for example, by semantically empty expletives, e.g. It rains 'rain<>(^ SUBJ)'. The equi predicate persuade thus differs from the raising predicate expect only in how grammatical functions are allocated to the thematic and nonthematic sets:

'PERSUADE<(^ SUBJ) (^ OBJ) (^ COMP)>'

'EXPECT<(^ SUBJ) (^ COMP)> (^ OBJ)'

XLE enforces this formal distinction by also marking f-structures as incomplete if a thematic function is present but does not have a semantic-form PRED value. No additional requirements are imposed on the values of nonthematic functions.
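The three conditions just described can be sketched as a check over a single local f-structure. This Python fragment is an illustration only: f-structures are modeled as nested dictionaries, the set GOVERNABLE is an assumed stand-in for the GOVERNABLERELATIONS patterns of a configuration, and the function name check is hypothetical.

```python
# Assumed stand-in for the GOVERNABLERELATIONS of the active configuration.
GOVERNABLE = {"SUBJ", "OBJ", "OBJ2", "COMP", "XCOMP"}

def check(fstructure, thematic, nonthematic):
    """Return the completeness and coherence violations of one local f-structure."""
    governed = list(thematic) + list(nonthematic)
    problems = []
    # Completeness: every designated function must be locally present.
    for gf in governed:
        if gf not in fstructure:
            problems.append(f"incomplete: {gf} is missing")
    # Completeness: a thematic function that is present must have a PRED.
    for gf in thematic:
        value = fstructure.get(gf)
        if gf in fstructure and not (isinstance(value, dict) and "PRED" in value):
            problems.append(f"incomplete: thematic {gf} has no PRED")
    # Coherence: every governable function present must be sanctioned.
    for attr in fstructure:
        if attr in GOVERNABLE and attr not in governed:
            problems.append(f"incoherent: {attr} is not governed")
    return problems
```

For the expletive subject of It rains, modeled as {"PRED": "rain", "SUBJ": {"FORM": "it"}}, the check passes when SUBJ is listed as nonthematic and fails when SUBJ is listed as thematic.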

As described below, the individual components of a semantic form may be referenced in functional descriptions by use of the special attributes FN, ARG1, ARG2, ..., NOTARG1, NOTARG2, ...

f-structure metavariables

^

!

These designators denote the f-structures corresponding to nodes of the c-structure via the c-structure to f-structure correspondence phi (notated f:: in XLE). The phi correspondence, or projection, was introduced as a fundamental organizing mechanism by Kaplan and Bresnan (1982), but its role in interpreting the f-structure metavariables was not formally elaborated. ! was instantiated to the f-structure corresponding to the current node (the node matching the category-symbol containing the ! annotation) and ^ was instantiated as the f-structure corresponding to the mother of the current node. These notions were formalized more carefully in Kaplan (1987, 1989, 1995): If * denotes the node matching a given category symbol annotated with schemata containing !, then that ! is taken as a convenient abbreviation for f::*, the result of applying the projection phi to the matching node. If M denotes the function that takes a c-structure node into its mother, then ^ is interpreted as a convenient abbreviation for f::M*. For example, (^ SUBJ)=! can be written (f::M*)=f::*. Defining the f-structure metavariables explicitly in terms of phi makes it easier to formalize relations such as functional precedence, which are based on the inverse of the phi correspondence. As discussed at the end of this section, XLE also supports * and M* as designators of the matching node and its mother, and this allows for correspondences that map from c-structure nodes to information structures other than the f-structure.

function application and uncertainty

(d r)

A parenthetic expression stands for the result of applying the monadic function denoted by the designator d to an argument specified by r, a regular predicate over designators. In the common case, d is an f-structure metavariable and r is merely a symbol designator such as SUBJ, OBJ, CASE, or PERSON that specifies the name of a grammatical function or feature. The designator (^ PERSON) thus denotes the value of the attribute PERSON in the f-structure corresponding to the current node's mother. In the somewhat more general case, r is a sequence of symbol designators, as in (^ SUBJ PERSON). As defined by Kaplan and Bresnan (1982), this denotes the result of applying the function denoted by (^ SUBJ) to the argument PERSON. In the most general case, r is an arbitrary regular predicate describing an arbitrary regular language. Kaplan and Zaenen (1989b) assign an existential meaning in this situation:

(d r)=v iff either

d=v and the empty string belongs to the language denoted by r, or

there is a symbol s such that ((d s) Suffix(s, r))=v

where Suffix(s, r) is the regular language { y | sy is in r }

The indeterminacy about exactly which symbol is chosen at each point in the unfolding of this recursive definition is what gives rise to the intuition of "functional uncertainty". XLE solves propositions involving functional uncertainty according to the algorithm described by Kaplan and Maxwell (1988a).
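The recursive definition can be read off almost directly as a program. The Python sketch below does so for f-structures modeled as acyclic nested dictionaries and for path languages built from symbols, concatenation, union, and Kleene star. It is a naive illustration of the definition, not the Kaplan and Maxwell algorithm that XLE uses, and all names in it (sym, cat, alt, star, nullable, suffix, resolve) are hypothetical.

```python
EMPTY = ("empty",)   # the empty language
EPS = ("eps",)       # the language containing only the empty string

def sym(s):
    return ("sym", s)

def cat(a, b):
    if a == EMPTY or b == EMPTY:
        return EMPTY
    if a == EPS:
        return b
    if b == EPS:
        return a
    return ("cat", a, b)

def alt(a, b):
    if a == EMPTY:
        return b
    if b == EMPTY:
        return a
    return ("alt", a, b)

def star(a):
    return ("star", a)

def nullable(r):
    """True if the empty string belongs to the language denoted by r."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "sym"):
        return False
    if tag == "cat":
        return nullable(r[1]) and nullable(r[2])
    return nullable(r[1]) or nullable(r[2])  # alt

def suffix(s, r):
    """Suffix(s, r): the language { y | sy is in r }."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "sym":
        return EPS if r[1] == s else EMPTY
    if tag == "alt":
        return alt(suffix(s, r[1]), suffix(s, r[2]))
    if tag == "star":
        return cat(suffix(s, r[1]), r)
    first = cat(suffix(s, r[1]), r[2])  # cat
    return alt(first, suffix(s, r[2])) if nullable(r[1]) else first

def resolve(d, r):
    """All v such that (d r)=v holds under the existential definition."""
    results = [d] if nullable(r) else []
    if isinstance(d, dict):
        for s, value in d.items():
            rest = suffix(s, r)
            if rest != EMPTY:
                results.extend(resolve(value, rest))
    return results
```

For instance, with f = {"COMP": {"COMP": {"OBJ": "x"}, "SUBJ": "y"}, "SUBJ": "z"} and the path language COMP* {SUBJ|OBJ}, built as cat(star(sym("COMP")), alt(sym("SUBJ"), sym("OBJ"))), resolve finds the three values x, y, and z, one for each way of unfolding the uncertainty.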

The regular predicates that can be combined in r are drawn from the set of regular operators that make up the c-structure description language, described above. For example, the designator

(^ COMP* GF - VCOMP)

denotes the infinite regular set whose strings begin with zero or more COMP's and end with any grammatical function other than VCOMP. Note, however, that in uncertainty predicates the relative complementation hyphen must be surrounded with white space; this permits the hyphen to be included as a character in the middle of names such as OBL-TO. The basic terms of these functional regular predicates are designators denoting grammatical function and feature names. XLE does not support the use of complex symbol designators such as (! PCASE) within a functional uncertainty.

In XLE, when a $ appears inside a designator, it represents the non-deterministic choice of a set element. For instance, (^ ADJUNCT $) will non-deterministically choose a set element from (^ ADJUNCT). This is useful for picking an element out of a set or testing for an element (e.g. (^ ADJUNCT $ PRED FN)=c never tests whether "never" is an adjunct modifier).

If a functional uncertainty appears within the scope of a negation, then the functional uncertainty changes from an existential to a universal. For instance, (^ ADJUNCT $ PRED FN) ~= never means that none of the adjuncts can be "never". This is because this constraint is interpreted as ~[(^ ADJUNCT $ PRED FN) = never], and so the functional uncertainty is considered to be in the scope of a negation.
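The shift from existential to universal under negation can be illustrated with a small Python sketch. It models the ADJUNCT set as a list of dictionaries and each PRED as a dictionary with an FN entry; these representations and the function names are assumptions made for the example, not part of XLE.

```python
def fn_values(fstructure):
    """All values of (^ ADJUNCT $ PRED FN): one per adjunct that has a PRED."""
    return [adj["PRED"]["FN"]
            for adj in fstructure.get("ADJUNCT", [])
            if "PRED" in adj]

def constraining_equals(fstructure, target):
    """(^ ADJUNCT $ PRED FN) =c target: some set element must match."""
    return any(value == target for value in fn_values(fstructure))

def negated_equals(fstructure, target):
    """(^ ADJUNCT $ PRED FN) ~= target: no set element may match."""
    return all(value != target for value in fn_values(fstructure))
```

With adjuncts for never and quickly, the constraining test for never succeeds and the negated one fails; with only quickly, the outcome is reversed.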

If a functional uncertainty appears in the scope of a sub-c constraint (including existentials), then the functional uncertainty will not introduce any disjunctions. Instead, it will test all of the different paths, and only fail if none of the tests are satisfied. This is done to avoid spurious ambiguities. If the functional uncertainty has an off-path constraint in it, then disjunctions will be introduced.

Functional predicates share the same epsilon symbol e as c-structure predicates. They may also make use of regular predicate abbreviations defined along with c-structure rules and abbreviations in the RULES section. Thus, you can include definitions such as

COMPFNS = {COMP | XCOMP}.

TERMFNS = {SUBJ | OBJ | OBJ2}.

and then make use of an uncertainty such as (^ COMPFNS* TERMFNS). There is no difference between metacategories and macros in functional regular predicates, since there are no schemata to be distributed differentially and there is no special treatment of epsilons to worry about.

Functional uncertainties are "nonconstructive". This means that all of the attributes that are within a regular operator other than concatenation only match against existing attributes in the f-structure; they never construct attributes where none existed before. For example, in

(^ A {B|C G} D E* F) =!

the attributes B, C, E, and G would be nonconstructive and the attributes A, D, and F would be constructive.
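This classification follows mechanically from the shape of the path expression. The Python sketch below encodes a path as nested tuples whose first element names the operator ("cat", "alt", or "star"); that encoding and the function name classify are assumptions made for illustration.

```python
def classify(path, inside_operator=False, out=None):
    """Map each attribute of a path expression to its constructive status."""
    if out is None:
        out = {}
    if isinstance(path, str):
        out[path] = "nonconstructive" if inside_operator else "constructive"
    else:
        op, *parts = path
        # Concatenation passes the current status down unchanged; any other
        # operator makes everything beneath it nonconstructive.
        nested = inside_operator or op != "cat"
        for part in parts:
            classify(part, nested, out)
    return out

# The path A {B|C G} D E* F from the example above.
PATH = ("cat", "A", ("alt", "B", ("cat", "C", "G")), "D", ("star", "E"), "F")
```

Applied to PATH, classify marks A, D, and F as constructive and B, C, E, and G as nonconstructive.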

As originally presented by Kaplan and Bresnan (1982), it would be inconsistent for the designator d to denote anything other than an f-structure. XLE implements a modified form of the extension described by Kaplan and Maxwell (1989a) that gives meaning to the case when d designates a set. XLE's version is that if d designates a set, then (d r) gets distributed to each of the elements of the set by means of subsumption. This permits a natural account of how features and functions distribute across the conjuncts in a constituent coordination construction. XLE provides for this behavior only for attributes that are not specified in the configuration as "nondistributive". If r is a nondistributive attribute, then d is interpreted as if it is an ordinary f-structure containing a value for attribute r. In effect, d will have a mix of set and f-structure properties. This modification allows certain features, such as the identity of the conjunction, to be expressed directly in the f-structure, whereas the original Kaplan and Maxwell formulation forced the conjunction out of the f-structure and into another projection.

XLE implements another extension that permits reference to the components of a semantic form. If d denotes a semantic form, then the designator (d FN) denotes its function component, (d ARG1), (d ARG2), ... denote its a1, a2, ... argument components, and (d NOTARG1), (d NOTARG2), ... denote its n1, n2 nonthematic function components. Thus you can ensure that some noun phrase is realized as a pronoun by asserting (^ PRED FN)=PRO. Asserting (^ PRED)='PRO' would not have this effect, because of the instantiation of semantic forms ('PRO'='PRO' is always false).
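As a rough illustration of how these attributes line up with the written form of a semantic form, the following Python sketch splits a semantic-form string into its components. It is a toy parser written for this example, assuming designators are either parenthesized expressions without nesting or bare symbols; the function name components is hypothetical and nothing here reflects how XLE parses semantic forms.

```python
import re

def components(semantic_form):
    """Split a semantic form into FN, ARG1.., and NOTARG1.. components."""
    text = semantic_form.strip().strip("'")
    match = re.fullmatch(r"([^<>]+?)\s*(?:<([^<>]*)>\s*(.*))?", text)
    if match is None:
        raise ValueError(f"not a semantic form: {semantic_form!r}")
    fn = match.group(1)
    thematic = match.group(2) or ""
    nonthematic = match.group(3) or ""
    # A designator is a parenthesized expression or a bare symbol such as NULL.
    designator = r"\([^()]*\)|[^\s()]+"
    parts = {"FN": fn}
    for i, arg in enumerate(re.findall(designator, thematic), start=1):
        parts[f"ARG{i}"] = arg
    for i, arg in enumerate(re.findall(designator, nonthematic), start=1):
        parts[f"NOTARG{i}"] = arg
    return parts
```

For 'EXPECT<(^ SUBJ) (^ COMP)> (^ OBJ)' this yields FN as EXPECT, ARG1 as (^ SUBJ), ARG2 as (^ COMP), and NOTARG1 as (^ OBJ).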

You can also access the id of a semantic form using (d SFID). If two semantic forms have different ids, then they come from different instantiations of a semantic form. The semantic form ids are ordered based on the left vertex of the edge where the semantic forms were instantiated. This means that if one id is larger than another, then the left string position of where the larger id was instantiated cannot precede the left string position of where the smaller id was instantiated.

XLE does not allow functional uncertainties that end with PRED values (e.g. (^ r PRED)='PRO', where r is a functional uncertainty). If you want to achieve this effect, use a meta-variable (e.g. (^ r) = %X (%X PRED)='PRO').

inside-out function application

(alpha d)

A parenthetic expression with a regular predicate alpha followed by a designator d is called an "inside-out" function application. Intuitively, an inside-out function application starts from the f-structure denoted by d, and moves to the left along the attributes specified in alpha. For instance, (XCOMP SUBJ !) starts with the f-structure denoted by !, and then moves to ^ if ^ has ! as its XCOMP SUBJ. The regular predicate alpha can denote not just a single string but a regular language of strings, thus providing the inside-out counterpart to the more familiar outside-in functional uncertainty. However, even when alpha is a string, an inside-out application can be non-deterministic. For instance, if (^ XCOMP SUBJ) = (^ SUBJ), and (^ SUBJ)=!, then (SUBJ !) can denote either ^ or (^ XCOMP).
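The non-determinism in this example can be reproduced with a small Python sketch. It handles only the case where alpha is a single string of attributes, models f-structures as dictionaries that may share substructures, and ignores sets entirely; the function names are hypothetical and the search strategy is not XLE's.

```python
def all_fstructures(root):
    """Collect every f-structure (dict) reachable from root, once each."""
    seen, stack, found = set(), [root], []
    while stack:
        node = stack.pop()
        if isinstance(node, dict) and id(node) not in seen:
            seen.add(id(node))
            found.append(node)
            stack.extend(node.values())
    return found

def follow(fstructure, path):
    """Outside-in application of a fixed attribute path; None if it is undefined."""
    for attr in path:
        if not isinstance(fstructure, dict) or attr not in fstructure:
            return None
        fstructure = fstructure[attr]
    return fstructure

def inside_out(root, path, target):
    """All g reachable from root such that (g path) is the f-structure target."""
    return [g for g in all_fstructures(root) if follow(g, path) is target]

# (^ XCOMP SUBJ) = (^ SUBJ) and (^ SUBJ) = !
down = {"PRED": "John"}
xcomp = {"SUBJ": down}
up = {"SUBJ": down, "XCOMP": xcomp}
```

Here inside_out(up, ["SUBJ"], down) returns both up and xcomp, corresponding to the two denotations of (SUBJ !), while inside_out(up, ["XCOMP", "SUBJ"], down) returns only up.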

The formal definition of inside-out function application is:

For an arbitrary predicate P and an inside-out application (alpha f), P((alpha f)) is true iff:

  1. epsilon is a member of alpha and P(f) is true;
  2. or there exists g and a such that g is an f-structure and a is an attribute and (g a) = f and P((Prefix(alpha, a) g)) is true;
  3. or there exists g and a such that g is a set and a is a nondistributive attribute and (g a) = f and P((Prefix(alpha, a) g)) is true;
  4. or Prefix(alpha, $) is not empty and there exists a g such that g is a set and f $ g and P((Prefix(alpha, $) g)) is true;
  5. or Prefix(alpha, $) is empty and there exists a g such that g is a set and f $ g and P(([alpha - epsilon] g)) is true.

We define Prefix(alpha, a) to be the set of all strings x such that xa is a member of alpha.

Most of this is obvious except for item 5. The basic idea is that by default inside-out functional uncertainties get to move from an element to the set containing it as long as there are more attributes to climb. The uncertainties do not do this if they are at the end of the inside-out path, however. The intuition for the latter is that inside-out uncertainties asserted within one set element shouldn't propagate to sister elements unless the grammar writer explicitly requests it.

Note that inside-out applications do not distribute across sets. If an inside-out application starts from within a set it may pass through the set, but there is no requirement that similar inside-out applications must originate from within the other set elements. Thus, when trying to decide whether to model a particular linguistic phenomenon using regular uncertainty or inside-out uncertainty, a crucial test is how the phenomenon behaves when coordination is present.

Except for set distribution and the direction of application, inside-out applications are very similar to functional uncertainties. Like ordinary uncertainties, inside-out applications are nonconstructive. They can have off-path constraints. If they don't have off-path constraints, then non-defining constraints don't introduce any disjunctions. They are converted from implicit existentials into implicit universals in the scope of a negation. This is true even when the regular predicate is unambiguous. For instance, if (^ XCOMP SUBJ)=(^ SUBJ), (^ SUBJ)=!, and ((SUBJ !) FOO)~=+, then neither ^ nor (^ XCOMP) can have FOO equal to +.

XLE does not allow inside-out functional uncertainties that end with PRED values (e.g. ((alpha ^) PRED)='PRO', where alpha is an inside-out functional uncertainty). If you want to achieve this effect, use a meta-variable (e.g. (alpha ^) = %X (%X PRED)='PRO').

off-path constraints

It is possible to add off-path constraints to the path language of a functional uncertainty or an inside-out application using the standard notation for annotating a category with constraints. Here are some examples:

(↑ XCOMP*: ~(-> FOCUS); GF)=↓

(^ XCOMP*: ~(-> FOCUS); GF)=!

(^ XCOMP* GF: (<- TENSED) = +;)=!

The only metavariables allowed within off-path constraints are "<-" and "->". The metavariable "<-" denotes the f-structure that contains the attribute that the off-path constraints are attached to. The metavariable "->" denotes the f-structure that is the value of the attribute that the off-path constraints are attached to. The arrows make more sense if you think about the constraints as being under the attribute that they match:

(^     XCOMP*           GF           )=!
   ~(-> FOCUS)

(^     XCOMP*           GF           )=!
                  (<- TENSED) = +

In the first example, the off-path constraint ~(-> FOCUS) on XCOMP says that each XCOMP that the uncertainty matches must have a value that does not have a FOCUS attribute. In the second example, the off-path constraint (<- TENSED) = + on GF says that the GF that matches the end of the uncertainty must belong to an f-structure that has TENSED = +.
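The reading of "<-" and "->" can be sketched in Python. This is only an illustrative model, not XLE's implementation: f-structures are assumed to be nested dictionaries, the set GF and the function name resolve are hypothetical, and the uncertainty is treated as a search over an already-built structure.

```python
GF = {"SUBJ", "OBJ"}  # illustrative choice of grammatical functions (assumption)

def resolve(f, xcomp_ok=lambda left, right: True, gf_ok=lambda left, right: True):
    """Yield every value reachable from f via the path XCOMP* GF.

    left plays the role of "<-" (the f-structure containing the attribute);
    right plays the role of "->" (the value of the attribute).
    """
    for attr, value in f.items():
        if attr in GF and gf_ok(f, value):
            yield value
    nxt = f.get("XCOMP")
    if isinstance(nxt, dict) and xcomp_ok(f, nxt):
        yield from resolve(nxt, xcomp_ok, gf_ok)

f = {
    "TENSED": "-",
    "SUBJ": "a",
    "XCOMP": {
        "TENSED": "+",
        "OBJ": "b",
        "XCOMP": {"TENSED": "+", "FOCUS": "x", "OBJ": "c"},
    },
}

# ~(-> FOCUS) on XCOMP: every XCOMP value on the path must lack FOCUS.
no_focus = list(resolve(f, xcomp_ok=lambda left, right: "FOCUS" not in right))

# (<- TENSED) = + on GF: the f-structure containing the final GF must be tensed.
tensed = list(resolve(f, gf_ok=lambda left, right: left.get("TENSED") == "+"))
```

In this model the first constraint blocks the path at the inner XCOMP (which has a FOCUS), while the second filters out the top-level SUBJ because its containing f-structure is not tensed.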

Off-path constraints are useful in describing linguistic phenomena that are sensitive to the presence or absence of a functional uncertainty. They may also be useful for things like parasitic gaps. For instance, the off-path constraint in (^ XCOMP* GF:((<- ADJUNCT $ OBJ) = ->)) = ! can handle simple parasitic gaps by asserting that ! (which -> corresponds to) also fills the OBJ of an ADJUNCT. If an off-path constraint occurs in the scope of a negation, then the off-path constraints are negated, too.

Off-path constraints can be undecidable if there are no restrictions placed on them. For instance, (^ A*:(-> A B)=+)=! can generate an unbounded chain of A attributes in ^ once the uncertainty matches something. This is because the (-> A B)=+ constraint creates the next A attribute for the uncertainty to match. To avoid this problem, XLE treats an attribute in an off-path constraint as nonconstructive if the off-path constraint is attached to a cycle and the attribute occurs in the cycle. For instance, the A attribute would be nonconstructive in the example given above, and the constraint would be interpreted as (^ A*: (-> {A} B)=+)=!.

local names

%symbol

A local name can be used as a variable whose scope is limited to the schemata associated with a particular category or lexical item. This makes it convenient, and in some cases possible, to make repeated references to a single entity. For example, suppose that the PERSON, NUMBER, and GENDER features of a noun are grouped together in the value of an AGR attribute. A given lexical entry might assert that

(^ AGR PERSON)=3

(^ AGR NUMBER)=PL

(^ AGR GENDER)=FEM

but these feature equations can be shortened by means of a local name %A that captures the value of the AGR feature:

(^ AGR)=%A

(%A NUMBER)=PL

(%A PERSON)=3

(%A GENDER)=FEM

In this example the local name helps to abbreviate constraints whose effect could also be achieved in the usual more cumbersome way. When a functional uncertainty is involved, however, a local name can make it possible to express conditions that would otherwise be unstatable. Suppose that the three agreement requirements are to be asserted on the value at the end of a functional uncertainty, not on the value of ^. The schemata

(^ COMPFNS* TERMFNS AGR PERSON)=3

(^ COMPFNS* TERMFNS AGR NUMBER)=PL

(^ COMPFNS* TERMFNS AGR GENDER)=FEM

would not have the desired effect. Each of the uncertainties is treated separately and can be instantiated by different strings drawn from the regular sets. Thus, the first schema might assert the PERSON of a complement's object while the second asserts the NUMBER of a different structure, the second complement's subject. A local name can be used not only to shorten the specification but also to ensure that all three requirements are imposed on the same structure:

(^ COMPFNS* TERMFNS AGR)=%A

(%A NUMBER)=PL

(%A PERSON)=3

(%A GENDER)=FEM

The uncertainty may still be resolved by the choice of different strings, and this may result in a disjunction of different assignments for %A. However, as desired, all three requirements will be imposed simultaneously on each such assignment.
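The difference between the two formulations can be sketched in Python. This is a simplified model under stated assumptions: f-structures are nested dictionaries, the memberships of COMPFNS and TERMFNS are invented for illustration, and the schemata are treated as tests on an existing structure rather than as defining equations.

```python
COMPFNS = {"COMP", "XCOMP"}  # assumed membership, for illustration only
TERMFNS = {"SUBJ", "OBJ"}    # assumed membership, for illustration only

def resolve(f):
    """Yield every f-structure reachable from f via COMPFNS* TERMFNS."""
    for attr, value in f.items():
        if not isinstance(value, dict):
            continue
        if attr in TERMFNS:
            yield value
        if attr in COMPFNS:
            yield from resolve(value)

def separate_uncertainties(f, requirements):
    """Each requirement may pick its own resolution of the uncertainty."""
    return all(
        any(t.get("AGR", {}).get(feat) == val for t in resolve(f))
        for feat, val in requirements.items()
    )

def shared_local_name(f, requirements):
    """All requirements must hold of one resolution (the value bound to %A)."""
    return any(
        all(t.get("AGR", {}).get(feat) == val for feat, val in requirements.items())
        for t in resolve(f)
    )

f = {
    "COMP": {
        "OBJ": {"AGR": {"PERSON": 3}},
        "SUBJ": {"AGR": {"NUMBER": "PL", "GENDER": "FEM"}},
    }
}
reqs = {"PERSON": 3, "NUMBER": "PL", "GENDER": "FEM"}
```

Here the three separate uncertainties are satisfied by mixing the complement's OBJ and SUBJ, whereas the version with a shared local name correctly finds no single structure carrying all three features.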

reference to stem

%stem

%stem is a local name with a special, built-in meaning. When it appears in a lexical entry or in a template invoked by a lexical entry, it is taken to denote a symbol whose name is the headword of the entry. Thus, %stem denotes the symbol walk when it appears in the definition of walk or in a template that walk invokes. If the template INTRANS is defined as

INTRANS(P) = (^ PRED)='P<(^ SUBJ)>'

then the particular predicate must be specified in each invocation, as in @(INTRANS walk). An alternative is to define INTRANS with no arguments, as in

INTRANS = (^ PRED)='%stem<(^ SUBJ)>'

With this definition, the simpler invocation @INTRANS in the entry for walk will produce (^ PRED)='walk<(^ SUBJ)>' but (^ PRED)='fall<(^ SUBJ)>' in the entry for fall. A %stem reference can thus be used to eliminate redundant specifications in many lexical entries. Note, however, that the spelling substituted for the %stem reference is identical to the spelling of the headword, including the same pattern of upper and lower casing.
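The substitution can be pictured as plain textual replacement, as in the following Python sketch. The function name expand_stem is hypothetical, and the sketch makes no claim about how XLE actually processes templates.

```python
def expand_stem(template_body, headword):
    """Replace every %stem reference with the headword, spelling unchanged."""
    return template_body.replace("%stem", headword)

INTRANS = "(^ PRED)='%stem<(^ SUBJ)>'"

walk_pred = expand_stem(INTRANS, "walk")
fall_pred = expand_stem(INTRANS, "fall")
# Casing is carried over exactly: a capitalized headword stays capitalized.
capitalized = expand_stem(INTRANS, "Walk")
```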

%stem can also be used as the argument to a complex category in a lexical entry. For instance:

-token !CAT[%stem] *; ONLY.

This is useful if you have brackets labeled with c-structure categories, and you don't want to add lexical entries for all of the categories. All that you need is the lexical entry above, plus a line in the METARULEMACRO like the following:

LeftSquareBracket CAT[_CAT] _CAT RightSquareBracket

projection-value designators

p d

The correspondence phi between c-structure and f-structure was the only projection discussed by Kaplan and Bresnan (1982). Subsequent work, especially Kaplan (1987, 1989), Halvorsen and Kaplan (1988), and Kaplan et al. (1989), presents a more general view of abstract structures representing different aspects of linguistic organization and of piecewise correspondences relating the elements of those structures. These piecewise correspondences permit abstract structures to be described in terms of the elements and relations of more concrete ones. XLE provides a simple notation for specifying the projection functions that map between different kinds of structures. The notation allows projections to be composed with each other, thus enabling the statement of codescriptive constraints.

In XLE a projection designator p is simply an upper or lower case letter followed by two colons. Thus s::, S::, t::, etc. might all be used to denote different projection correspondences.

If d is a designator of an element in one structure, then p d designates the element corresponding to d through the p projection. For example, if s:: denotes the semantic projection, then s::^ is the semantic structure corresponding to the ^ f-structure, and s::(^ SUBJ) is the semantic structure corresponding to ^'s subject. If r:: denotes the correspondence between, say, units of semantic structure and a structure that represents their referents, then r::s::(^ SUBJ) stands for the referent of the semantic structure corresponding to the SUBJ of ^. Composition of projections is thus indicated by simple concatenation of projection designators, with a leftward projection applying to the element denoted by the designator to its right.
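The right-to-left composition can be sketched in Python, assuming each projection is modeled as a dictionary from elements of one structure to elements of another. The element names and the function apply_projections are hypothetical.

```python
def apply_projections(prefix, element, projections):
    """Apply a prefix such as 'r::s::' to an element, rightmost projection first."""
    names = [n for n in prefix.split("::") if n]
    for name in reversed(names):
        element = projections[name][element]
    return element

projections = {
    "s": {"f_subj": "sem_subj"},    # f-structure -> semantic structure
    "r": {"sem_subj": "ref_john"},  # semantic structure -> referent
}

referent = apply_projections("r::s::", "f_subj", projections)
```

The s:: projection applies first because it is adjacent to the designator, and r:: then applies to its result.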

It is worth noting the difference between the conceptual principles embodied in LFG's projection architecture and the use of distinctive attributes in other grammatical formalisms to represent different kinds of linguistic information in a single structure. Projections are aimed at separating and modularizing different aspects of linguistic organization, so that systems with relatively little interaction can be treated as different kinds of entities. Thus, apart from the explicit difference in notation, XLE displays different projections as separate structures in different windows. Also, for certain operations, projections have behavior that is formally different from the values of ordinary attributes. Projection (but not attribute) inverses are used to define functional precedence and other predicates (see Zaenen & Kaplan, 1995), and generalization takes place over set-valued attributes but not projections, as described by Kaplan and Maxwell (1988b) and implemented in XLE.

c-structure metavariables

*

M*

LS*

RS*

The designators * and M* can appear in schemata to denote, respectively, the node matching the associated category and the mother of that node (LS* and RS* are described below). The f-structure metavariables ! and ^ are thus exactly equivalent to the specifications f::* and f::M*, where f:: is the c-structure-to-f-structure correspondence. These node designators make it possible to put information structures other than f-structures into correspondence with c-structure nodes. These alternative projections may be kept separate when corresponding f-structures are identified, and they may be equated when the nodes' f-structures are distinct. For example, defining an m:: projection to represent the morphosyntactic features of particular nodes can permit various morphological agreement requirements to be enforced without constraining the functional dependencies encoded in the f-structure (Butt, Nino, and Segond 1996). If English past-participles carry the schema (m::M* PARTICIPLE)=PAST, the perfect auxiliary have can ensure that its morphological complement is of the right type by the schema (m::M* MCOMP PARTICIPLE)=PAST. In m::-structure the features of the embedded verb would be grouped hierarchically under the MCOMP attribute even though the f-structure is arguably flat (by means of the schemata (m::M* MCOMP)=m::* and ^=! associated with the lower VP category).

The c-structure designators * and M* are typically used in the context of c-structure projections (e.g. x::* and y::M*). In addition, XLE allows a limited number of c-structure designators to test for the presence or absence of nearby c-structure nodes. The additional c-structure designators that XLE allows are: (* LEFT_SISTER), (* RIGHT_SISTER), and (* MOTHER) (equivalent to M*). You can also have one or more MOTHER relations between * and the tree relations LEFT_SISTER, RIGHT_SISTER, and MOTHER (e.g. (* MOTHER LEFT_SISTER), (* MOTHER MOTHER LEFT_SISTER), etc.). C-structure designators other than * and M* in the context of c-structure projections can only be used in positive or negative existential constraints (e.g. (* LEFT_SISTER) or ~(* LEFT_SISTER)). Since these constraints apply to c-structures they will not appear anywhere in the f-structure output. This means that if a solution is invalid because it violates one of these c-structure constraints, then the solution will be displayed with the message "BAD TREE CONSTRAINT" but the conflicting constraints will not appear in the display. LS* is a shorthand for (* LEFT_SISTER), and RS* is a shorthand for (* RIGHT_SISTER).

The c-structure designators (* WEIGHT) and (* MOTHER WEIGHT) can be used to designate the weight of a c-structure node. For instance, (* WEIGHT) = 5 indicates that the node designated by * must have a weight of 5. Similarly, (* MOTHER WEIGHT) < 3 indicates that the mother of the node designated by * must have a weight less than 3. These constraints can be used for heavy NP shift and other linguistic phenomena that are sensitive to the weight of a constituent. For instance, the following sample rule could be used for heavy NP shift:

VP --> V
       (NP: (^ OBJ)=!)
       (NP: (^ OBJ2)=!)
       PP*: ! $ (^ ADJUNCT);
       ( PP+: ! $ (^ ADJUNCT);
         NP: (^ OBJ) = ! (* WEIGHT) > 10 ).

This rule says that an object NP can follow one or more PPs if its weight is more than 10. Please see the section on determining the weight of an edge for more information.

closed sets

{NOM}

{NOM ACC}

{NOM ACC (! CASE)}

A closed set is a list of designators surrounded by curly brackets. It can only appear on the right of a relation, for example (^ CASE) = {NOM ACC}, (^ CASE) $ {NOM ACC}, and (^ CASE) << {NOM ACC}. The constraint (^ CASE) = {NOM ACC} is similar to having NOM $ (^ CASE) and ACC $ (^ CASE), except that it also specifies that the set can have no other elements in it (hence the name "closed"). You can test whether an element is in a closed set using constraints like DAT $ (^ CASE). This will fail immediately if (^ CASE) is a closed set and DAT cannot be unified with any of its elements. If a set can be either open or closed, you can test for set membership using DAT $c (^ CASE). This will also fail immediately if (^ CASE) is a closed set and DAT cannot be unified with any of its elements. It will fail eventually from incompleteness if (^ CASE) is an open set and DAT is never added to it.

Closed sets have an implicit narrow-scope disjunction. For instance, if the verb tests the CLASS of the SUBJ using the following constraint:

(^ SUBJ CLASS) $ {CL1 CL2}

and CLASS is not a non-distributive attribute, then if the SUBJ is coordinated, the membership test gets distributed. In particular, the following constraints are consistent with the membership test:

%N1 $ (^ SUBJ)
%N2 $ (^ SUBJ)
(%N1 CLASS) = CL1
(%N2 CLASS) = CL2

This would not be true if the disjunction were explicit:

{(^ SUBJ CLASS) = CL1 | (^ SUBJ CLASS) = CL2}

restriction

d\a

d\a denotes a restricted version of the f-structure denoted by d (cf. Kaplan and Wedekind 1993). This restricted f-structure is identical to the original f-structure except that it doesn't have the attribute a. For instance, ^\CASE denotes an f-structure identical to ^ except that it cannot have the CASE attribute. ^\A\B denotes the same f-structure as ^\B\A.
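As a rough illustration, restriction can be modeled in Python on f-structures represented as dictionaries. The function name restrict is hypothetical, and the sketch ignores the handling of governable attributes.

```python
def restrict(f, *attrs):
    """Return a copy of f-structure f without the given attributes."""
    return {a: v for a, v in f.items() if a not in attrs}

f = {"PRED": "dog", "CASE": "NOM", "NUM": "SG"}

without_case = restrict(f, "CASE")
# The order of restriction does not matter: f\CASE\NUM equals f\NUM\CASE.
same_either_way = restrict(restrict(f, "CASE"), "NUM") == restrict(restrict(f, "NUM"), "CASE")
```

The original structure is left untouched; only the restricted version lacks the attribute.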

If the attribute a is governable and a is in the list of governable attributes for d, then a will not be in the list of governable attributes for d\a. If a is governable and a is not in the list of governable attributes for d\a, then a will be in the list of governable attributes for d. This means that a constraint like !\SUBJ\OBJ = ^\SUBJ\OBL-AGT takes the governable attributes of !, subtracts SUBJ and OBJ from them to get the governable attributes for !\SUBJ\OBJ, and then adds SUBJ and OBL-AGT to them to get the governable attributes for ^. This can be useful as part of a syntactic rule for passivization, such as:

V --> V: !\SUBJ\OBJ = ^\SUBJ\OBL-AGT
         (! OBJ) = (^ SUBJ)
         (! SUBJ) = (^ OBL-AGT)
         {(! SUBJ) = NULL}
         (^ PASSIVE)=+
         (! PASSIVE)=-;
       PASSIVE_MARKER.

This is similar in effect to the lexical rewrite rules:

(^ OBJ) --> (^ SUBJ)
{ (^ SUBJ) --> (^ OBL-AGT)
| (^ SUBJ) --> NULL}

except that !\SUBJ\OBJ = ^\SUBJ\OBL-AGT doesn't require there to be an OBL-AGT in ^ (it just allows it). Note that there are very few ways that the passivization rule can be written using restriction. If you mimic the structure of the lexical rule such that the disjunction includes (! SUBJ) = (^ OBL-AGT) then you get unexpected results. Also, using {(! PRED ARG1) = NULL} fails. The only variation that works is using {(^ OBJ) = NULL}.

Restriction can also be used to add governable attributes to an f-structure, although it is not possible to add arguments to an existing PRED semantic value. For instance, if you wanted a mono-clausal f-structure for a light verb construction, you would still need a bi-clausal semantic structure. This could be done using the following grammar fragment:

   VP --> { V "regular verb construction"
            (NP: (^ OBJ)=!)
            (NP: (^ OBJ2)=!)
            (PPobl: (^ OBL-BY)=! (! PCASE)=by)
          | LV "light verb construction"
            VP: "^ equals ! except for SUBJ and PRED"
                !\SUBJ\PRED = ^\OBJ2\PRED
                "the SUBJ becomes an OBJ2"
                "the PRED becomes an argument of the light verb"
                (! SUBJ) = (^ OBJ2)
                (! PRED) = (^ PRED ARG2);
            (NP: (^ OBJ2)=!)
          }.

let LV * (^ PRED)='let<(^ SUBJ) %ARG2>'
"%ARG2 will be filled in with the lower predicate".

This grammar fragment will create a single f-structure where the SUBJ of the main verb becomes the OBJ2 of the light verb's f-structure. The PRED of the light verb has two arguments: the SUBJ and the PRED of the main verb.

In general, both variables involved in a restricted equality are included in the XLE output. However, if there is a restricted equality between ^ and !, then XLE will suppress the contents of !. Also, if a governable attribute is equal to NULL, then the governable attribute is not displayed.

Propositions

Propositions are elementary formulas in the functional description language that assert properties of f-structures and other projections. These propositions are true or false of a model consisting of a structure and an assignment of the variables in the proposition to the elements of that structure. The formulas involve a fixed set of predicates and relations as detailed below. These predicates and relations are asserted of arbitrary designators d1, d2, and d.

truth value

TRUE

FALSE

These are the truth-value constants that provide a grounding for the logical system. They do not commonly appear in grammatical specifications, but they may arise in the course of grammatical analysis and show up in some of XLE's output displays. They may also be useful in the invocation of functional templates as described below.

equality

d1 = d2

d1 =c d2

The equality predicates are true if d1 and d2 denote exactly the same entity. The difference between the defining equality = and the constraining equality =c is that the latter is true only if the equality is implied by all the defining propositions in the functional description. Technically, this is a slightly stronger condition than the one given in Kaplan and Bresnan (1982) because it requires that the constraining equality be true of all extensions of the minimal model, whereas Kaplan and Bresnan imposed the constraint on the minimal model alone. The letter c (which may be in upper or lower case) must appear immediately next to the =, with no intervening white-space. If d2 is a symbol designator, there must be white-space between it and the preceding c. Thus, (^ SPECIES)=COW is interpreted as a defining equation involving the symbol COW. The proposition must be written as (^ SPECIES)=C OW to obtain a constraining equality on the symbol OW.

set membership

d1 $ d2

The membership predicate is true if d2 denotes a set and d1 denotes one of its elements. This, for example, is used for assigning adjuncts to the adjunct set:  ! $ (^ ADJUNCT). It is also used in coordination to make each conjunct a member of a set: ! $ ^.

existential constraint

d

A designator standing by itself is true if the designator has a denotation in the minimal structure satisfying the defining propositions in the f-description. It makes no sense to assert an existential constraint of a symbol or semantic-form designator, since these always have a denotation. XLE therefore issues an error message when it encounters such a constraint. This is usually symptomatic either of a conceptual confusion or of a notational mistake elsewhere in the f-description. An example of an existential constraint would be to require all sentences to be tensed:

S --> NP: (^ SUBJ)=!;
      VP: (^ TENSE).

functional and abstract category constraints

The CAT predicate specifies conditions on units of f-structure or other projections that hold by virtue of the nodes that those units correspond to. The CAT predicate is defined in terms of the inverses of structural correspondences, and a CAT proposition is satisfied if and only if the entity denoted by the designator d is a projection of at least one node whose category is in 'categories'. If d denotes an f-structure, then

@(CAT d categories) iff there is some n in phi^-1(d) such that category(n) $ categories.

As with abstract precedence, if d involves a composition of other correspondences and thus denotes an element in some other abstract structure, that combination of correspondences is similarly inverted to produce a set of nodes whose categories can be tested.

The CAT predicate permits constraints to be imposed on the categories of the nodes that map to an f-structure without actually copying a category feature into the f-structure (or doing it surreptitiously as Kaplan and Bresnan (1982) did when they used grammatical-function names like VCOMP, ACOMP, and PCOMP). Asserting constraints by means of this correspondence-based predicate maintains the essential modularity of c-structure and f-structure properties while still allowing for limited and controlled interactions to be expressed.

The d argument is an ordinary designator (e.g. (^ XCOMP) or sigma::^). The categories argument is a collection of atomic category labels enclosed in set-brackets. For example, the proposition

@(CAT (^ XCOMP) {NP' AP})

might be included in the English lexical entry for become to indicate that the XCOMP of become must correspond to some node labeled either NP' or AP. This would disallow strings like The girl became to go and The girl became in the park, even though the VP rule might allow VP', AP, NP', and PP all to be associated with the category-independent XCOMP function. If the XCOMP is assigned to the top of an ^=! head chain, then the constraint can also be stated in terms of just the lexical category labels:

@(CAT (^ XCOMP) {N A})

The CAT predicate is distributed to the elements of a set, so that this lexical specification for become would allow the mixed-category coordination John became a Republican and proud of it while the string John became a Republican and to go would be ruled out.
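The definition of CAT and its distribution over sets can be sketched in Python. The model is an assumption made for illustration: f-structures are identified by names, phi_inverse maps each f-structure to the nodes that project to it, and category maps each node to its label. None of these names come from XLE.

```python
def cat(d, categories, phi_inverse, category):
    """True iff d is a projection of at least one node whose category is allowed.

    If d is a set, the test is distributed to each of its elements.
    """
    if isinstance(categories, str):
        categories = {categories}  # set-brackets omitted for a single category
    if isinstance(d, (set, frozenset)):
        return all(cat(e, categories, phi_inverse, category) for e in d)
    return any(category[n] in categories for n in phi_inverse.get(d, ()))

phi_inverse = {"f_republican": ["n1"], "f_proud": ["n2"], "f_go": ["n3"]}
category = {"n1": "NP'", "n2": "AP", "n3": "VP'"}

# "a Republican and proud of it": both conjuncts are acceptable.
mixed_ok = cat({"f_republican", "f_proud"}, {"NP'", "AP"}, phi_inverse, category)
# "a Republican and to go": the second conjunct fails the test.
mixed_bad = cat({"f_republican", "f_go"}, {"NP'", "AP"}, phi_inverse, category)
```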

If only a single category is specified as the categories argument, the set-brackets may be omitted. Thus the proposition

@(CAT (^ XCOMP) V)

can appear in the lexical entry of verbs that take only verbal complements.

One advantage of a CAT specification over a VCOMP, ... NCOMP arrangement is that a predicate such as consider that allows any kind of complement requires no marking at all instead of being four ways ambiguous.

subsumption

d1 << d2

d1 >> d2

These propositions provide explicit grammatical access to the subsumption ordering of f-structures and other abstract structures. The subsumption ordering corresponds to a notion of information inclusion, in the sense that one structure (the subsumer) subsumes another if the second (the subsumee) contains at least the information found in the first. Thus, if d1 and d2 denote f-structures, d1 << d2 is true if and only if the structure denoted by the subsumee d2 contains all the attributes of the subsumer d1 structure, and on each of the common attributes, its value in d1 is either equal to or subsumes its value in d2. Subsumption provides for asymmetric propagation of information. If you assert that (^ OBJ NUM)=1 and (^ SUBJ)<<(^ OBJ), this does not establish a value for (^ SUBJ NUM): an existential constraint on the subject's number would not be satisfied. On the other hand, if you instead assert that (^ SUBJ NUM)=1 along with the subsumption, the number value will be propagated `upwards' to the OBJ and would be inconsistent with any incompatible value independently asserted of the object's number. For convenience, d1 >> d2 is equivalent to d2 << d1.

If d1 and d2 denote sets, then d1 << d2 is true if and only if every element of d1 subsumes at least one element of d2. When the elements of both sets are simple symbols, d1 << d2 is equivalent to d2 contains d1. It also follows that if d1 and d2 both subsume some third set d3, then d3 contains d1 unioned with d2.
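The two definitions can be sketched in Python, assuming f-structures are nested dictionaries, sets are Python sets of symbols, and atomic values subsume only themselves. The sketch ignores reentrancy and is not a description of XLE's implementation.

```python
def subsumes(d1, d2):
    """Return True if d1 << d2 under the simplified model described above."""
    if isinstance(d1, dict) and isinstance(d2, dict):
        # Every attribute of the subsumer must appear in the subsumee,
        # with a value that is equal to or subsumed by the subsumer's value.
        return all(a in d2 and subsumes(v, d2[a]) for a, v in d1.items())
    if isinstance(d1, (set, frozenset)) and isinstance(d2, (set, frozenset)):
        # Every element of d1 must subsume at least one element of d2.
        return all(any(subsumes(e1, e2) for e2 in d2) for e1 in d1)
    return d1 == d2

subj = {"NUM": 1}
obj = {"NUM": 1, "CASE": "ACC"}

subj_subsumes_obj = subsumes(subj, obj)   # information flows from SUBJ to OBJ
obj_subsumes_subj = subsumes(obj, subj)   # but CASE does not flow back
symbols = subsumes({"NOM"}, {"NOM", "ACC"})
```

For sets of simple symbols the test reduces to containment, as stated above.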

The explicit subsumption introduced by f1 << f2 is a little different from the implicit subsumption that is introduced when properties are automatically distributed from a set to its elements. For the purpose of set distribution, "properties" include non-defining constraints such as (^ TENSE) or (^ NUM)~=SG. Thus, ~(^ TENSE) on a set is satisfied if and only if ~(f TENSE) is true for all f $ ^. Explicit subsumption doesn't work this way. If f1 << f2, then ~(f1 TENSE) can be satisfied even if f2 has a TENSE (i.e. the constraint doesn't get distributed to f2).

When displaying or printing unpacked f-structures, XLE checks whether a subsumee has the same content as its subsumer. If the defining constraints of the two f-structures are the same, then XLE makes the subsumee equal to the subsumer. Thus, a sentence like John wanted to arrive Monday and to leave Tuesday would have one f-structure with John as a PRED instead of three. If the defining constraints are different, then XLE does not make the subsumee equal to the subsumer. This can happen in a language where verbs assign quirky case, for instance. Sometimes the defining constraints are the same but the negative or sub-c constraints are different. To handle this case, XLE makes the subsumee equal to the subsumer unless sub-c and negated constraints are being displayed, in which case it displays separate f-structures for the subsumee and the subsumer.

Subsumption, particularly of reentrant structures, should be used with a degree of caution. When other propositions assert certain internal equalities in the subsumer, the f-description as a whole may be undecidable and XLE's sentence analysis procedure may not terminate. There have been proposals for restricted versions of subsumption that avoid the mathematical pitfalls, but the linguistic significance of these restrictions is not yet understood. Experience with the current relation in actual grammars may help to sharpen the issues and lead to theoretical refinements as well as improvements in XLE's implementation.

head precedence

f1 <h f2

f2 >h f1

f1 <h f2 is true if and only if f1 and f2 have heads and the head of f1 precedes the head of f2 in the c-structure. For the purpose of this definition, the "head" of an f-structure is the constituent where the f-structure's PRED semantic form was instantiated if the constituent also maps via the phi projection to the same f-structure. The requirement that the head also map to the same f-structure means that f-structures whose PRED semantic forms are introduced by pro-drop constraints such as {(^ SUBJ PRED)='PRO'} will not have a head. If f1 or f2 is a set, then the "head" will be the heads of all of the elements of the set. f1 >h f2 means the same thing as f2 <h f1.

You can use head precedence as an impoverished version of f-precedence (f-precedence is currently not implemented). For instance, (^ SUBJ) <h (^ OBJ) will behave just like (^ SUBJ) <f (^ OBJ) whenever (^ SUBJ) and (^ OBJ) both have heads as defined above and neither is discontinuous.

scope relations

f1 <s f2

f2 >s f1

f1 <s f2 asserts that f1 is in the scope of f2. The scope relations are transitive: f1 <s f2 and f2 <s f3 together imply that f1 <s f3. The scope relations are also anti-symmetric: it is not possible for f1 <s f2 and f1 >s f2 to hold at the same time. f2 >s f1 means the same thing as f1 <s f2.
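These two properties can be sketched in Python as a closure computation followed by a consistency check. The representation of scope facts as (inner, outer) pairs and the function name close_scope are assumptions made for illustration.

```python
def close_scope(pairs):
    """Compute the transitive closure of scope facts.

    Each pair (a, b) stands for a <s b. Raises ValueError if the closure
    contains both a <s b and b <s a, which the scope relations forbid.
    """
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    for a, b in closure:
        if (b, a) in closure:
            raise ValueError(f"inconsistent scope: {a} <s {b} and {b} <s {a}")
    return closure

consistent = close_scope({("f1", "f2"), ("f2", "f3")})

try:
    close_scope({("f1", "f2"), ("f2", "f1")})
    rejected = False
except ValueError:
    rejected = True
```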

The scope relations are useful for specifying the relative scope of the arguments of a predicate. For instance, if you want the SUBJ to outscope the OBJ, you can assert (^ SUBJ) >s (^ OBJ). If you want the scoping relations to be determined by linear order, you can use the following constraints:

     { (^ SUBJ) <h (^ OBJ)
       (^ SUBJ) >s (^ OBJ)
     | (^ SUBJ) ~<h (^ OBJ)
       (^ OBJ) >s (^ SUBJ)}

This will have the effect of making the SUBJ outscope the OBJ when the SUBJ head-precedes the OBJ, and it will make the OBJ outscope the SUBJ otherwise.

surface adjunct scope

f1 $<h<s f2

f1 $<h>s f2

f1 $<h<s f2 has the definition:

	f1 $ f2 and
	for all x in the set f2, if x $<h<s f2 and f1 <h x then f1 <s x.

f1 $<h>s f2 has the definition:

	f1 $ f2 and
	for all x in the set f2, if x $<h>s f2 and f1 <h x then f1 >s x.

The adjunct scope relations are useful for specifying relative surface scope within a set of adjuncts. They implicitly divide a set into three subsets of elements: the elements that were added using $<h<s, the elements that were added using $<h>s, and the elements that were added using $. The scope relations only apply within their own subset. For instance, the following rule will give left scope to the premodifiers and right scope to the postmodifiers, but will leave the premodifiers unscoped relative to the postmodifiers:

NP --> AP*: ! $<h>s (^ MODS);
       N
       AP*: ! $<h<s (^ MODS).

If the string a1 a2 a3 n a4 a5 a6 gets parsed with this rule, it will produce the following constraints:

	f1 >s f2
	f1 >s f3
	f2 >s f3
	f4 <s f5
	f4 <s f6
	f5 <s f6
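The way these constraints follow from the definitions can be sketched in Python. The sketch assumes each set member is described by a hypothetical triple of name, head position in the string, and the relation by which it was added; it is not XLE's algorithm.

```python
def adjunct_scope(members):
    """Derive scope constraints among members of one adjunct set.

    members: list of (name, head_position, relation), where relation is
    "$<h<s", "$<h>s", or "$". Constraints hold only between members that
    were added with the same scoping relation.
    """
    constraints = []
    for n1, p1, r1 in members:
        for n2, p2, r2 in members:
            if r1 == r2 and r1 != "$" and p1 < p2:
                op = "<s" if r1 == "$<h<s" else ">s"
                constraints.append(f"{n1} {op} {n2}")
    return constraints

# a1 a2 a3 n a4 a5 a6: premodifiers use $<h>s, postmodifiers use $<h<s.
mods = [
    ("f1", 0, "$<h>s"), ("f2", 1, "$<h>s"), ("f3", 2, "$<h>s"),
    ("f4", 4, "$<h<s"), ("f5", 5, "$<h<s"), ("f6", 6, "$<h<s"),
]
result = adjunct_scope(mods)
```

No constraint relates a premodifier to a postmodifier, because the two groups were added with different relations.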

This is an underspecified version of the scoping relations that would be obtained if the grammar treated adjuncts as heads:

NP --> {  N
        | AP
          N: (^ PRED ARG1)=!
             ^ >s !;
        | N: (^ PRED ARG1)=!
             ^ >s !;
          AP}.

comment

" any string of characters "

Just as in the regular predicate language, character strings enclosed in double-quotes are treated as invisible comments and assigned no other interpretation in the functional description language. They permit you to add explanatory notes at any position in a formula where a proposition is allowed.

Boolean combinations

The elementary propositions just described can be put together in Boolean combinations to make up the schemata that are attached to c-structure categories and appear in lexical entries. In some cases there are special rules of interpretation for the Boolean operators having to do with the defining/constraining distinction in LFG theory. The possible schemata are specified recursively as follows:

elementary propositions

p

All of the elementary propositions described above can serve as schemata.

conjunction

s1 s2 ... sn

A sequence of schemata combined with no other operator (but separated by white-space as needed) is interpreted conjunctively. The sequence is true if and only if all of the individual schemata are true.

grouping

[ s ]

Square brackets are used merely to explicitly mark that the components of s are to be treated as a unit with respect to other enclosing operators. This is especially useful in defining the scope of negation and in demarcating the values supplied for template substitution.

disjunction

{ s1 | s2 | ... | sn }

A disjunction is satisfied if and only if at least one of the si is satisfied. By the usual rules of logic a disjunction would reduce to TRUE if one of its disjuncts is TRUE. This reduction does not go through in LFG, however, because of the defining/constraining distinction. Consider the formula {(^ TENSE)=PAST | TRUE} which might be used to indicate that a sentence is or is not marked as past tense. If this is reduced to TRUE and hence ignored, the minimal f-structure solution will have no TENSE feature at all (assuming that TENSE is not elsewhere defined). But then that structure will fail to satisfy an existential constraint (^ TENSE) occurring elsewhere in the f-description, which is not the intended result. Thus, XLE does not perform this reduction. Instead it operates on even TRUE-containing disjunctions, always attempting to find minimal solutions for each of the disjuncts taken separately.

optionality

{ s }

This is true whether or not the s is satisfied. It is a short-hand equivalent for the disjunction {s | TRUE}, and as just explained, XLE attempts to find one minimal solution that satisfies an f-description containing s instead of the optionality and also one not containing s.
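The reason the reduction to TRUE is not performed can be sketched in Python. This is a toy model: an f-description is assumed to be a flat dictionary of defining equations, TRUE is the empty set of equations, and the function names are hypothetical.

```python
TRUE = {}  # a disjunct that adds no defining equations

def optional(s):
    """{ s } is shorthand for the disjunction { s | TRUE }."""
    return [s, TRUE]

def solutions(base, disjuncts):
    """Build one minimal solution per disjunct that is consistent with base."""
    out = []
    for d in disjuncts:
        merged = dict(base)
        consistent = True
        for attr, value in d.items():
            if attr in merged and merged[attr] != value:
                consistent = False
                break
            merged[attr] = value
        if consistent:
            out.append(merged)
    return out

sols = solutions({"PRED": "walk"}, optional({"TENSE": "PAST"}))
# The existential constraint (^ TENSE) filters the minimal solutions.
tensed = [s for s in sols if "TENSE" in s]
```

If the disjunction had been collapsed to TRUE, only the solution without TENSE would exist and the existential constraint could never be satisfied.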

negation

~s

A formula ~s is true of a model if s itself is false. Negation in LFG serves as a filter on the minimal structures satisfying the positive defining propositions in the f-description. XLE does not try to extend structures (for example, by adding conflicting features) so that a negative condition is satisfied. As a consequence, functional uncertainties in the scope of LFG's nonconstructive negation are decidable, whereas it is known that uncertainties in the scope of classical negation are not (Baader et al., 1991; Keller, 1991).

If s is an elementary proposition, the negation operator ~ may be placed as a prefix on the relation symbol: for example, ~(^ SUBJ)=! may be written instead as (^ SUBJ)~=!. Also, note that the negation operator binds more tightly than conjunction: ~(^ TENSE)(^ CASE) is interpreted as [~(^ TENSE)](^ CASE) instead of ~[(^ TENSE)(^ CASE)].

Functional templates and lexical rules

One of the hallmarks of Lexical Functional Grammar is its use of lexical redundancy rules to express generalizations that hold across large families of lexical entries. Bresnan (1982c) discusses the motivation for a passive redundancy rule, and Kaplan and Bresnan (1982) present a simple formalism for encoding such relation-changing lexical rules. Lexical rules map lexical entries to lexical entries either by adding new schemata or by systematically modifying the designators appearing in existing schemata. The result is an entry that stands as a disjunctive alternative to the original. The passive rule, for example, applies to transitive verb entries. It introduces a schema ensuring that the verb is in its past-participle form, it modifies existing schemata by converting object designators to subject designators, and it also changes existing subject designators, either converting them to oblique agent designators or else deleting them. Using XLE's notation for disjunction and the NULL symbol to indicate the absence of a grammatical function, the Kaplan/Bresnan passive rule would be written as

(^ PARTICIPLE)=c PAST
(^ OBJ)-->(^ SUBJ)
{ (^ SUBJ)-->(^ OBL-AG) | (^ SUBJ)-->NULL }

The rewriting arrow --> is the special operator that indicates how designators are to be systematically modified. Suppose this rule is applied to the functional annotation (^ PRED)='kick<(^ SUBJ)(^ OBJ)>' for the simple transitive verb kick. The result would be the alternative passive entries

(^ PARTICIPLE)=C PAST (^ PRED)='kick<(^ OBL-AG)(^ SUBJ)>'

(^ PARTICIPLE)=C PAST (^ PRED)='kick< NULL (^ SUBJ)>'

Note the importance of rewriting the complex designator (^ SUBJ) and not just the grammatical-function name SUBJ, as the simpler SUBJ-->OBL-AG would do. The simpler rewrite would have the unintended effect of also changing the embedded SUBJ in a functional-control equation, so that the schema (^ XCOMP SUBJ)=(^ OBJ) for a raising verb would be transformed to (^ XCOMP OBL-AG)=(^ SUBJ).

The early LFG literature was quite precise in its discussion of the effects of lexical redundancy rules, but it was much less clear about the circumstances that would trigger the application of those rules. The presence or absence of particular grammatical functions is one of the conditioning factors, but it is not sufficient. The verbs cost and weigh are well-known examples in English of words that take subjects and objects but do not passivize, and donate is a commonly used example of a ditransitive verb that does not undergo dative-shift. The existence of lexical exceptions to seemingly general syntactic variations was taken to support the claim that these phenomena were lexical in nature. However, it also called for some way of either explicitly marking exactly when particular rules can apply or predicting their application from other lexical properties.

The early literature assumed, mostly tacitly, that lexical items would be annotated with some set of morphosyntactic features, and that these, together with some default marking conventions, would govern the application of rules. A complete and formally coherent notation for this purpose was never proposed, and theoretical work in this area has shifted away from the marking approach. More recent work in Lexical Mapping Theory (Bresnan & Kanerva, 1989) has attempted to develop general principles for predicting lexico-syntactic variations from the semantic properties of the lexical predicates and a skeletal assignment of grammatical functions to particular semantic arguments. This is a much more interesting and promising approach, but it has not yet reached the stage where precise formal mechanisms have been defined and justified. Thus, in the absence of clear theoretical guidance, XLE provides its own abbreviatory conventions for expressing common f-description patterns that are shared by many lexical items and for marking lexical entries to indicate how they participate in those generalizations.

Templates in lexical entries

Redundancies in the lexicon can be factored out by means of functional templates. These are like regular-predicate macros in that they permit the name assigned to a complex formula to be used in place of that formula in larger expressions. For instance, we know that all English count nouns follow the general pattern illustrated by the following entries for airport and girl:

airport N XLE (^ PRED)='airport'
              { (^ NUM)=SG (^ SPEC)
               |(^ NUM)=PL}.

girl    N XLE (^ PRED)='girl'
              { (^ NUM)=SG (^ SPEC)
               |(^ NUM)=PL}.

Each of the two entries provides the appropriate 0-place semantic relation and asserts that the noun must appear with a specifier when singular but not when plural. A parameterized functional template (defined in a TEMPLATES section) can factor out the count specification that occurs in both entries, as follows:

COUNT-NOUN(P) = (^ PRED)='P'
              { (^ NUM)=SG (^ SPEC)
               |(^ NUM)=PL}.

The lexical entries for airport, girl, or any other count noun can then be abbreviated to a simple invocation of the template COUNT-NOUN:

airport N XLE @(COUNT-NOUN airport).

girl    N XLE @(COUNT-NOUN girl).

Just like a regular-predicate macro, a template invocation is marked by an @, followed by a parenthetic expression that gives the name of the template and a sequence of values to be substituted for the formal parameters in the template definition. In this case the single parameter is P, and it is replaced by airport and girl, respectively, to produce the formulas in the original lexical entries. If all count nouns are marked in this way, they would all be affected by any change to the template definition. For example, the extended definition

COUNT-NOUN(P) = (^ PRED)='P'
                (s::^ REL)=P
                { (^ NUM)=SG (^ SPEC)
                 |(^ NUM)=PL}.

with the added schema (s::^ REL)=P would make all count nouns also assert P as the value of a relation attribute in the s:: structure.
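Parameter substitution for a template such as COUNT-NOUN can be pictured with a small Python sketch. It is a simplified model under stated assumptions, not XLE's implementation: the TEMPLATES table, the expand function, and the purely textual whole-symbol replacement are all hypothetical.

```python
import re

# Hypothetical in-memory template table: name -> (formal parameters, body).
TEMPLATES = {
    "COUNT-NOUN": (["P"], "(^ PRED)='P' { (^ NUM)=SG (^ SPEC) | (^ NUM)=PL }"),
}

def expand(name, *args):
    """Substitute the invocation's arguments for the formal parameters."""
    params, body = TEMPLATES[name]
    if len(params) != len(args):
        raise ValueError(f"{name} expects {len(params)} argument(s)")
    for param, arg in zip(params, args):
        # Match the parameter only as a whole symbol, so that the P inside
        # PRED, SPEC, or PL is left untouched.
        pattern = rf"(?<![\w-]){re.escape(param)}(?![\w-])"
        body = re.sub(pattern, lambda m, arg=arg: arg, body)
    return body

print(expand("COUNT-NOUN", "airport"))
print(expand("COUNT-NOUN", "girl"))
```

Because every count noun goes through the same table entry, editing the body once changes the expansion of every invocation.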

Template invocations provide the XLE notation for assigning a lexical item to a lexical class, and lexical classes are defined by their template definitions. For example, the following template definitions could be used to define the lexical classes of 'transitive' and 'intransitive' verb:

TRANS(P)   = (^ PRED)='P<(^ SUBJ) (^ OBJ)>'.
                                    
INTRANS(P) = (^ PRED)='P<(^ SUBJ)>'.

The verbs in a lexicon can then be assigned to a lexical class by invoking the appropriate template, as illustrated below for the transitive verb devour and the intransitive verb sleep:

devour V XLE @(TRANS devour ).

sleep  V XLE @(INTRANS sleep).

Some verbs can be either transitive or intransitive ('ambitransitive') and would therefore be assigned to both of these lexical classes, as follows:

stop V XLE { @(TRANS stop) | @(INTRANS stop) }.

If there are many verbs that can be either transitive or intransitive, it may make sense to define a template for a class of ambitransitive verbs, as follows:

AMBITRANS(P) = { @(TRANS P) | @(INTRANS P) }.

The definition of the lexical entry for stop would thus look as follows:

stop V XLE @(AMBITRANS stop).

This illustrates the fact that template invocations may appear in the position of any proposition in the f-description language, so that the formulas they stand for can be combined by the usual Boolean operators. Note that the template calls can be made even more concise by using %stem in lieu of a hard-coded predicate name.

Lexical rules

Since the definition of a template can be any Boolean combination of schemata containing formal-parameter symbols in positions where schemata or symbols might otherwise appear, a template definition can also include other template invocations. Thus, the fact that most transitive and ditransitive verbs can also be passivized can be encoded in the definitions

TRANS (P) = @(PASS (^ PRED)='P<(^ SUBJ) (^ OBJ)>').

DITRANS (P) = @(PASS (^ PRED)='P<(^ SUBJ) (^ OBJ) (^ OBJ2)>').

The passive template can then be defined to specify the systematic rewriting of designators, in accordance with the original lexical redundancy rule described above. This might have the following form:

PASS(SCHEMATA) = { SCHEMATA
                  |SCHEMATA
                   (^ PARTICIPLE)=c PAST
                   (^ OBJ)-->(^ SUBJ)
                   { (^ SUBJ)-->(^ OBL-AG)
                    |(^ SUBJ)-->NULL}}.

The single parameter for this template can be a Boolean combination of schemata; in the case of TRANS and DITRANS, it is instantiated as a single PRED schema. The result after substituting this schema and expanding the template will be a disjunction with three branches. One disjunct will contain an isolated copy of the original schema, one will contain the agentive passive alternative, and the third will contain the agent-deletion passive. The passive alternations are created by rewriting designators according to the --> specifications.

A designator rewrite is an expression of the form

match --> result

where match and result are arbitrary designators. A designator rewrite can only appear in a position appropriate for a proposition and only in a template definition. A template definition containing such a specification is interpreted in the following way. First, the parameters of the template invocation are substituted for the formal parameters to give a parameter-free Boolean formula. The resulting formula is then converted to disjunctive normal form, a formula consisting of a top-level disjunction each of whose disjuncts is a conjunction of propositions and designator rewrites. Finally, the designator rewrites are removed from each disjunct and applied to the schemata remaining in the disjunct to systematically change any designators matching the match of a rewrite into the corresponding result. If the match for one rewrite is a subexpression of the match for another, the result corresponding to the larger match will be installed.

These three steps can be illustrated by the expansion of the template invocation @(TRANS kick). The symbol kick is first substituted for P in the TRANS definition to give the formula

@(PASS (^ PRED)='kick<(^ SUBJ) (^ OBJ)>')

Next, the PRED schema is substituted for SCHEMATA in the definition of PASS, producing

{ (^ PRED)='kick<(^ SUBJ) (^ OBJ)>'
|(^ PRED)='kick<(^ SUBJ) (^ OBJ)>'
(^ PARTICIPLE)=c PAST
(^ OBJ)-->(^ SUBJ)
{ (^ SUBJ)-->(^ OBL-AG)
|(^ SUBJ)-->NULL}}.

The inner disjunction is promoted to the top level to give the disjunctive normal form

{ (^ PRED)='kick<(^ SUBJ) (^ OBJ)>'
|(^ PRED)='kick<(^ SUBJ) (^ OBJ)>'
(^ PARTICIPLE)=c PAST
(^ OBJ)-->(^ SUBJ)
(^ SUBJ)-->(^ OBL-AG)
|(^ PRED)='kick<(^ SUBJ) (^ OBJ)>'
(^ PARTICIPLE)=c PAST
(^ OBJ)-->(^ SUBJ)
(^ SUBJ)-->NULL}.

The designator rewrites are then executed within their disjunct scopes to produce the final result:

{ (^ PRED)='kick<(^ SUBJ) (^ OBJ)>'
|(^ PRED)='kick<(^ OBL-AG) (^ SUBJ)>'
(^ PARTICIPLE)=c PAST
|(^ PRED)='kick<NULL (^ SUBJ)>'
(^ PARTICIPLE)=c PAST}.
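The three steps traced above (substitution, conversion to disjunctive normal form, and execution of the rewrites) can be mimicked in a few lines of Python. This is a hedged sketch rather than XLE's code: the data layout (schemata as strings, tuples for rewrites and disjunctions) and the helper names rw, disj, dnf, and apply_rewrites are assumptions made for illustration.

```python
import re
from itertools import product

def rw(match, result):
    """A designator rewrite: match --> result."""
    return ("rw", match, result)

def disj(*branches):
    """A disjunction whose branches are conjunctions (lists of items)."""
    return ("or", [list(b) for b in branches])

def dnf(conjunction):
    """Return the top-level disjuncts of a conjunction as flat lists."""
    alternatives = []
    for item in conjunction:
        if isinstance(item, tuple) and item[0] == "or":
            # Promote the disjuncts of every branch to this level.
            alternatives.append([d for branch in item[1] for d in dnf(branch)])
        else:
            alternatives.append([[item]])
    return [sum(choice, []) for choice in product(*alternatives)]

def apply_rewrites(disjunct):
    """Remove the rewrites from a disjunct and apply them to its schemata."""
    mapping = {i[1]: i[2] for i in disjunct if isinstance(i, tuple) and i[0] == "rw"}
    schemata = [i for i in disjunct if isinstance(i, str)]
    if not mapping:
        return schemata
    # One simultaneous pass, longest match first, so that OBJ --> SUBJ is not
    # fed back into SUBJ --> OBL-AG.
    longest_first = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(m) for m in longest_first))
    return [pattern.sub(lambda m: mapping[m.group(0)], s) for s in schemata]

pred = "(^ PRED)='kick<(^ SUBJ) (^ OBJ)>'"
pass_kick = [disj(
    [pred],
    [pred, "(^ PARTICIPLE)=c PAST", rw("(^ OBJ)", "(^ SUBJ)"),
     disj([rw("(^ SUBJ)", "(^ OBL-AG)")], [rw("(^ SUBJ)", "NULL")])],
)]
for disjunct in dnf(pass_kick):
    print(apply_rewrites(disjunct))
```

Matching whole designators such as (^ SUBJ), and not the bare function name, also leaves an embedded (^ XCOMP SUBJ) untouched in this sketch.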

The same process would be carried out in the expansion of the DITRANS template, with the result containing the appropriate three-place PRED schemata. A variant of the DITRANS template could be defined that would first invoke a DATIVE template to disjunctively rewrite the OBJ and OBJ2 designators before offering them to the PASS template. The result would allow for all appropriate dative/passive combinations. Presumably verbs like give but not donate would invoke this dative variant.

Another aspect of template interpretation is illustrated by the specification of functional control verbs. The verb expect can take a simple object alone (I expected Mary), a that-complement alone (I expected that Mary would go), or an infinitival complement with or without an object (I expected to go; I expected Mary to go). The complement complexities for verbs in this class can be isolated into a template RAISEOBJ, so that expect can be defined as

expect V XLE { @(TRANS expect) | @(RAISEOBJ expect) }.

The RAISEOBJ template provides a basic PRED schema for the that-complement configuration, substituting this for the formal parameter of an FCONTROL template:

RAISEOBJ(P) = @(FCONTROL (^ PRED)='P<(^ SUBJ) (^ COMP)>').

The interesting specifications appear in the definition of the FCONTROL template. This template will result in a set of three alternatives, one with just the original COMP schema and the others specifying either OBJ or SUBJ control of an XCOMP:

FCONTROL(SCHEMATA) = SCHEMATA {(^ COMP)-->(^ XCOMP)
{ (^ XCOMP SUBJ)=(^ OBJ)
|(^ XCOMP SUBJ)=(^ SUBJ)}}.

In this definition, instead of putting SCHEMATA in two branches of a disjunction as we did in the PASS template above, we use the equivalent but more succinct form of conjoining the original SCHEMATA to an optional specification of the infinitival variations. The result of @(RAISEOBJ expect) after substitution, DNF expansion, and designator rewriting is the three-way disjunction

{ (^ PRED)='expect<(^ SUBJ) (^ COMP)>'
 |(^ PRED)='expect<(^ SUBJ) (^ XCOMP)>'
  (^ XCOMP SUBJ)=(^ OBJ)
 |(^ PRED)='expect<(^ SUBJ) (^ XCOMP)>'
  (^ XCOMP SUBJ)=(^ SUBJ)}.

The middle object-control disjunct is not acceptable in its present form, since it calls for the governable OBJ function while at the same time asserting that the local PRED semantic form does not permit an object. To avoid such unintended incoherences, XLE carries out one further step in the template expansion process: if a disjunct contains a PRED-defining schema and a schema introducing a governable function not allowed by that semantic form, the semantic form is modified to permit the new function as a nonthematic grammatical relation. Thus, the PRED schema in the middle disjunct is changed to the more appropriate

(^ PRED)='expect<(^ SUBJ) (^ XCOMP)>(^ OBJ)'

Of course, RAISEOBJ can be composed with PASS to provide for the passive/raising occurrences of verbs like expect.

If the only argument ever specified in invocations of a particular template is the name of the current lexical predicate, the invocations can all be simplified by making use of the %stem designator instead. Thus, the TRANS and RAISEOBJ templates can be equivalently defined as

TRANS = @(PASS (^ PRED)='%stem<(^ SUBJ) (^ OBJ)>').

RAISEOBJ = @(FCONTROL (^ PRED)='%stem<(^ SUBJ) (^ COMP)>').

and the invocations for expect can be simplified to {@TRANS | @RAISEOBJ}.

Templates in rules

These examples have focussed on the use of templates to eliminate redundancies across lexical entries. But templates can also be used in the annotations on syntactic rules, as one way of factoring out collections of schemata that are common to several different constructions. In some cases it may even be advantageous for a template to be used in both a lexical entry and a grammatical rule. For example, suppose that a template BE has been defined to include the complex pattern of constraints that make up the definition of the verb be. These constraints express, among other things, how the various forms of be interact with participles to form passives and progressives. It has often been noted that reduced relative clauses and other constructions seem to honor these same constraints, and this is what motivated early proposals for transformations such as the `WH-IS' deletion rule that removes a deep-structure be. The shared constraints that such a transformation provided can be obtained in LFG simply by placing the invocation @BE in the c-structure rule that describes the surface pattern of reduced relative clauses.

As another set of examples, the template notation can be used to augment the common lexical/syntactic f-description language with certain abbreviatory conveniences. The template defined as

DEFAULT(D V) = { D D~=V
                | D=V}.

specifies that V is the default value for the designator D (the equation D~=V ensures that when D=V there is only one analysis; that is, it eliminates a spurious ambiguity). The invocation @(DEFAULT (^ GEN) FEM) expands to

{ (^ GEN) (^ GEN)~=FEM
 | (^ GEN)=FEM }

indicating that the gender feature takes on the value FEM unless another value is provided by a defining equation appearing somewhere else. The effect of the if-then and if-and-only-if logical operators can also be expressed as template definitions:

IF(P Q) = { ~P | Q}.

IFF(P Q) = @(IF P Q) @(IF Q P).

The invocation @(IF (^ NUM)=PL (^ CASE)=ACC) forces the case to be accusative whenever the number is plural.

System-defined templates

The COMPLETE template

The COMPLETE template has a built-in interpretation in XLE. It indicates that any sub-c constraints that apply to the designated f-structure should be replaced by FALSE if they are not satisfied within the local subtree (e.g. the daughter tree designated by * plus the local subtree constraints). Normally, sub-c constraints are replaced with FALSE if they are not satisfied within the entire input. Checking sub-c constraints locally changes the grammar. It shouldn't be done unless it is linguistically valid. However, when it is linguistically valid it can lead to substantial performance improvements, especially for complicated constructions such as long-distance dependencies that are triggered by the presence of a feature that rarely occurs. For instance:

  CPrel --> ... NP: (^ SUBJ)=!
(! ADJUNCT $)=%LOCAL
(%LOCAL OBJ PRON-TYPE)=c rel
@(COMPLETE (%LOCAL OBJ PRON-TYPE)) ...

In this case, if the NP doesn't have an ADJUNCT whose object is a relative pronoun, then the CPrel will be pruned immediately. If the COMPLETE template call weren't there, then XLE would continue with the CPrel on the chance that the PRON-TYPE would get filled in later. Checking for PRON-TYPE =c rel locally is linguistically valid since relative pronouns cannot be extracted from a relative clause.

The COMPLETE template is not recursive: it only applies to the given designator, and not to any attributes of the designator. For instance, @(COMPLETE (^ SUBJ)) would only check sub-c constraints and existentials on (^ SUBJ), and would not check (^ SUBJ NUM) or (^ SUBJ ADJUNCT). The COMPLETE template is implemented this way to avoid copying up the more embedded structure since this might negate the performance advantage of completing the given f-structure early.

Using the COMPLETE template when it is not linguistically valid can lead to subtle bugs. For instance, it might seem linguistically valid to complete the SPEC on the object of a VP:

  VP --> ... NP: (^ OBJ)=!
@(COMPLETE (! SPEC)) ...

However, this will eliminate linguistically valid solutions if the SPEC gets filled in by another rule higher up:

  VPanaphsubj --> ... VP: (^ OBJ SPEC SPEC-TYPE)= null ...

Finally, if a grammar gives a template definition for COMPLETE, then XLE will use that definition instead of giving COMPLETE the built-in interpretation.

The RIGHTBRANCHING template

The RIGHTBRANCHING template has a built-in interpretation in XLE. If a daughter category in a rule has the RIGHTBRANCHING template at the top level (e.g. not within a disjunction), then XLE will compile the rule up to that daughter with pseudo-categories that will make the rule right branching. For instance, if the rule looked like:

  S --> NP: (^ OBJ)=!;
NP: (^ OBJ2)=!;
V: @RIGHTBRANCHING;
NP: (^ SUBJ)=!.

Then XLE would compile the rule as if it had been written as:

  S --> S.1 NP: (^ SUBJ)=!.

S.1 --> NP: (^ OBJ)=! S.2.

S.2 --> NP: (^ OBJ2)=! S.3.

S.3 --> V.
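The shape of the compiled rules in this example can be reproduced by a short Python sketch. It only generalizes from the single example above, so it should be read as an assumption about the pattern and not as a description of XLE's compiler; the function right_branch and its arguments are hypothetical.

```python
def right_branch(mother, daughters, marked):
    """Rewrite a flat rule into right-branching rules over pseudo-categories.

    daughters: the daughters of the rule, as annotated strings.
    marked:    index of the daughter that carried @RIGHTBRANCHING (its
               template call is assumed to be stripped from the string).
    """
    prefix, rest = daughters[:marked + 1], daughters[marked + 1:]
    first = mother + ".1"
    rules = [f"{mother} --> {' '.join([first] + rest)}."]
    for i, daughter in enumerate(prefix, start=1):
        if i < len(prefix):
            # Each pseudo-category takes one daughter and passes on the rest.
            rules.append(f"{mother}.{i} --> {daughter} {mother}.{i + 1}.")
        else:
            rules.append(f"{mother}.{i} --> {daughter}.")
    return rules

flat = ["NP: (^ OBJ)=!", "NP: (^ OBJ2)=!", "V", "NP: (^ SUBJ)=!"]
for rule in right_branch("S", flat, marked=2):
    print(rule)
```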

Writing right-branching rules like this by hand can make XLE run faster when a rule is head-final. However, it is much more difficult to maintain such a grammar. Using the RIGHTBRANCHING template can produce a similar improvement in speed without sacrificing clarity.

Using RIGHTBRANCHING doesn't change how a grammar works unless the grammar has c-structure constraints that involve LEFT_SISTER, RIGHT_SISTER or MOTHER. If the grammar has constraints to eliminate spurious c-structure ambiguities, these may need to be rewritten to disprefer the spurious c-structure ambiguities instead.

The CONCAT template

The CONCAT template has a built-in interpretation in XLE: all arguments except the last argument are concatenated to produce the last argument. For instance, the constraint:

@(CONCAT look `- up %FN)

would set %FN to be "look-up".

CONCAT is defined as a relation instead of a function. This allows it to be used for generation as well as parsing. For instance, the constraint:

@(CONCAT %ROOT `- up look`-up)

would set %ROOT to look.

The arguments to CONCAT do not have to be bound to symbols within a lexical entry. However, all but one of the arguments must be eventually bound to a symbol in order for the CONCAT relation to be invoked and a variable to be set to a new value. If more than one argument is left unbound after all of the constraints in the chart have been processed, then the CONCAT template will be considered INCOMPLETE, and no solution will be produced. If the arguments are all bound but they do not form a valid concatenation relationship, then the constraint will be considered INCONSISTENT (e.g. @(CONCAT a b c abz)).
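The relational behavior of CONCAT described above can be modeled with a small Python function. This is a minimal sketch under stated assumptions: the name solve_concat, the use of None for an unbound variable, and the string return values for the INCOMPLETE and INCONSISTENT cases are all hypothetical and say nothing about how XLE represents these states internally.

```python
def solve_concat(*args):
    """Model CONCAT as a relation over strings.

    args: the parts followed by the whole; None marks an unbound variable.
    Returns the fully bound argument tuple, or the string "INCOMPLETE" or
    "INCONSISTENT".
    """
    unbound = [i for i, a in enumerate(args) if a is None]
    if len(unbound) > 1:
        return "INCOMPLETE"
    *parts, whole = args
    if not unbound:
        return args if "".join(parts) == whole else "INCONSISTENT"
    i = unbound[0]
    if i == len(args) - 1:
        # Parsing direction: build the whole from its parts.
        return (*parts, "".join(parts))
    # Generation direction: recover one part from the whole.
    prefix, suffix = "".join(parts[:i]), "".join(parts[i + 1:])
    fits = (len(whole) >= len(prefix) + len(suffix)
            and whole.startswith(prefix) and whole.endswith(suffix))
    if not fits:
        return "INCONSISTENT"
    middle = whole[len(prefix):len(whole) - len(suffix)]
    return (*parts[:i], middle, *parts[i + 1:], whole)

print(solve_concat("look", "-", "up", None))      # binds the last argument
print(solve_concat(None, "-", "up", "look-up"))   # binds the first argument
print(solve_concat("a", "b", "c", "abz"))
print(solve_concat(None, "-", None, "look-up"))
```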

Here is an example of how to use the CONCAT template for a particle verb:

look V XLE @(CONCAT %stem `- %PART %FN)
(^ PRED)='%FN<(^ SUBJ)(^ OBJ)>'
(^ PART) << %PART.

During parsing, %FN will be set to look-up once the up particle is found and (^ PART) is set to up. During generation, %PART will be set to up if the input specifies that %FN is look-up. Note that (^ PART) subsumes %PART, instead of being equated to it using =c. This is so the value of (^ PART) flows to %PART during parsing, but the value of %PART doesn't flow to (^ PART) during generation. Also, if (^ PART) is instantiated, you should use (^ PART BASE) << %PART instead. The BASE feature extracts the base of the instantiated value.

Here is an example of how the CONCAT template might be used to remove the prefix from a lexical entry:

ein#scannen +V-S xle @(CONCAT %PART `# %FN %stem)
(^ PRED)='%FN<(^ SUBJ)>'
(^ PRT-FORM) << %PART.

This is useful if you want to make verbs with attached prefixes have the same f-structure as verbs where the prefix is separated (instead of the other way around). Note that you still have to use subsumption for (^ PRT-FORM) in this situation.

In order for the generator to operate efficiently, XLE must be able to determine the possible PRED FN values of a lexical entry from the information in the lexical entry. If XLE cannot determine the possible PRED FN values of a lexical entry, then it will add that lexical entry to the generation chart of every generation attempt, not just those that might actually use it. For this reason, XLE prints a warning whenever it encounters a lexical entry like this while indexing lexical entries for generation. You can help XLE determine the possible PRED FN values of a lexical entry by adding more information to the lexical entry. For instance, you can add a $c constraint that lists the possible values of a particle. The $c constraint should not adversely affect performance. Here is an example:

look V XLE @(CONCAT %stem `- %PART %FN)
(^ PRED)='%FN<(^ SUBJ)(^ OBJ)>'
(^ PART) << %PART
(^ PART) $c {up over}.

Finally, the CONCAT template should not be used in an unknown entry (e.g. -unknown or -token). For instance, the following is NOT recommended:

-unknown V XLE @(CONCAT %stem `- %PART %FN)
(^ PRED)='%FN<(^ SUBJ)(^ OBJ)>'
(^ PART) << %PART.

This will work while parsing, because the parser knows the value of %stem from the morphology. However, it will not work while generating, because the generator cannot determine the value of %stem during lexical lookup, and so it cannot generate with this lexical entry. This may be fixed at some future point.

Template notation

To summarize, a functional template T is defined in a Template section by statements of either of the forms

T = s.

T (param1 param2 ... paramn) = s.

where s is a Boolean combination of schemata and designator rewrites. The second form provides for some number n of formal parameters, where each parami is a symbol that presumably appears in s in the position of either a proposition or a designator. The ith realization at a particular invocation is systematically substituted for the symbol parami everywhere in s.

An invocation of a template T appears in the position of a proposition in a lexical entry, in some other template definition, or in the annotation of a category in a grammatical rule. It is a term of either of the forms

@T

@(T real1 real2 ... realn )

The first form is appropriate for a template with no realizations. The realization expressions reali in the second form are themselves arbitrary schemata and designators. When an invocation is expanded, those expressions are first substituted in place of the corresponding parami symbols in the definition of T. Second, if there are designator rewrites in the result of performing any necessary substitutions, it is converted to disjunctive normal form and the rewrites are executed. Third, any nonthematic function adjustments are carried out. The formula that emerges from this expansion process takes the place of the template invocation. As templates mutually invoke each other in complex patterns of composition, the possibility arises, just as for metacategories and macros, that a template will lead back to itself in a self-recursive cycle. A warning message will be printed if this situation is detected, and the expansion will be terminated by substituting the constant FALSE for the self-recursive invocation.
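The treatment of self-recursive invocations can be illustrated with a simplified Python model. It handles only parameterless invocations of the form @T and works on plain strings; the expand function, the example templates A and B, and the wording of the warning are assumptions for illustration, not XLE's actual behavior in detail.

```python
import re
import warnings

def expand(name, templates, active=()):
    """Expand parameterless @T invocations, cutting off self-recursion."""
    if name in active:
        # The template has led back to itself: warn and substitute FALSE.
        warnings.warn(f"self-recursive invocation of template {name}")
        return "FALSE"
    body = templates[name]
    return re.sub(
        r"@([A-Z][A-Z0-9-]*)",
        lambda m: expand(m.group(1), templates, active + (name,)),
        body,
    )

# Hypothetical templates that invoke each other in a cycle.
templates = {
    "A": "{ (^ X)=1 | @B }",
    "B": "(^ Y)=2 @A",
}
print(expand("A", templates))
```

Tracking the chain of templates currently being expanded is one simple way to detect the cycle; only the recursive occurrence is replaced, so the rest of the formula survives.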

XLE's functional templates provide a well-defined and carefully implemented mechanism for capturing generalizations that would otherwise be distributed throughout the lexicon and grammar. Mutual invocation, parameter substitution, and designator rewriting can be combined in rich and powerful ways to represent a broad range of common patterns. Direct marking and Boolean specifications are perhaps not as elegant as the principles that might emerge from further theoretical efforts in LFG. But they do allow, given the current state of the art, for the precise control of lexical and syntactic behavior that explicit computational tests of large-scale grammars require.

Configuration components

In XLE, the configuration section is selected by the initial create-parser and create-generator commands, which specify an "initial" file containing a CONFIG section (if there is more than one CONFIG section in the file, the last one is used). A configuration identifies all the linguistic specifications that make up an active analysis environment and specifies other parameters that affect the behavior of the system when that environment is installed. A configuration has a version/language name just like any other linguistic specification. This name is used to identify the configuration for system operations such as create-parser and create-generator. The name appears at the top of the configuration section and is then followed by a sequence of component specifications each of which is a pair consisting of a component-name and a value terminated with a period. The component names are simple strings without punctuation or white space. Values can either be singletons or sequences, depending on the component.

For an example configuration, see the XLE walkthrough and the demo grammar.

The following list describes the components that make up a configuration and the way they affect system behavior when the configuration is installed.

FILES 

A list of files to be loaded into the grammar. These can include rules, lexicon, and template files, as well as feature declarations and morphconfig files. In the XLE configuration section, the scope of the configuration data base is given by a FILES clause:

      FILES filename_1  filename_2 .... filename_n.

defining the scope of the database for the configuration as the list of filenames plus the initial file (that specified by create-parser and create-generator). The specified filenames may be full path names, or simple names, or relative names such as ../xxxx, subdir/xxxx. In the latter two cases the names are assumed to be relative to the initial grammar file directory.

Lexicon Indexing and Lexicon Modification

Each file "filename" used in a FILES list is shadowed by an index file containing stored indexes for lexicon sections in the file, to avoid potentially costly reindexing of lexicons (for random access) at each XLE invocation. The index file is named (configfilename).fileindexdir/fileindex(N), where (configfilename) is the name of the file where the config is stored and N is the index in the FILES list (0 is used for the config file itself).

In general, users need not be aware of the indexing process, except in one respect. For convenience, lexicon sections may be edited during a session (i.e., intermixed with a series of parse or generate commands), and the modifications take effect immediately. These modifications cause XLE to reindex sections in the affected files, and print a "reindexing" message. So if a change to a LEXICON section is not followed by a "reindexing" message, it may be that the changes have not been saved/written (by the user), or some system problem exists.

Note: Changes may also be made to RULES and TEMPLATES sections, but these changes do not take effect within the session. Instead, you must restart XLE for these changes to take effect.

In more detail, for the curious, each lexicon section is potentially mirrored by two indexes, one for parsing, and a relatively larger one used in generation. The index files are created when the containing files are first referenced in a configuration, and contain parse or gen indexes for sections in that configuration. The index files are modified if

OTHERFILES

A list of non-grammar files to be included when running make-bug-grammar or make-release-grammar. Non-grammar files are included in the grammar using the following notation:

      OTHERFILES otherfile_1  otherfile_2 .... otherfile_n.

The make-bug-grammar and make-release-grammar commands will copy these files along with the grammar files mentioned in the grammar config. This is useful for making sure that documentation files and/or source files get included whenever you make a copy of a grammar.

GRAMMARVERSION

Contains a user-specified version name for this grammar. If GRAMMARVERSION has a value, then the value is stored as a property of any Prolog files along with the grammar name and the grammar date. It can also be retrieved using the Tcl command "grammar_version $chart". If a grammar config has "GRAMMARVERSION." in it, then make-release-grammar will store the release directory in its place (for example, "GRAMMARVERSION release-2006-09-07.").

BASECONFIGFILE

Specifies the file name of a base config for the current config. The current config inherits any config fields (e.g. RULES, TEMPLATES, PERFORMANCEVARSFILE) in the base config that do not have values in the current config. Config fields in the current config overwrite fields in the base config unless the current config field has an entry with a + or a - in front of it (e.g. +my-english-lexicon.lfg, -(STANDARD ENGLISH)). Entries with + in front of them are added to the end of the list of the base config field. Entries with a - in front of them are removed from the base config field. Entries with a - in front of them must match an entry in the base config field exactly (e.g. same prefix and same capitalization). Only the following config fields can have entries with a + or - in front of them: FILES, OTHERFILES, ENCRYPTFILES, PERFORMANCEVARSFILE, RULES, LEXENTRIES, TEMPLATES, FEATURES, GOVERNABLERELATIONS, SEMANTICFUNCTIONS, NONDISTRIBUTIVES, and EXTERNALATTRIBUTES. OPTIMALITYORDER cannot have entries with a + or - in front of them. However, it is possible to make OT marks NOGOOD using a performance vars file.

Any config, including ones called by other configs, can have a BASECONFIGFILE entry, to arbitrary depth. make-bug-grammar and make-release-grammar will copy the file structure as given (e.g. they will not attempt to produce a single config file).

Here is a sample config that uses BASECONFIGFILE:

MY ENGLISH CONFIG (1.0)

BASECONFIGFILE english.lfg.
PERFORMANCEVARSFILE my-performance-vars.txt.
FILES -english-lex.lfg
      +my-english-lex.lfg.

----

This sample config removes the file english-lex.lfg and adds the file my-english-lex.lfg. Note that the PERFORMANCEVARSFILE overrides any PERFORMANCEVARSFILE in the BASECONFIGFILE english.lfg.
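The inheritance rule for a single config field can be sketched in Python as follows. This is only an illustration of the behavior described above, not XLE's implementation. The function name merge_field and the base file english-rules.lfg are hypothetical, and rejecting a field that mixes plain entries with + or - entries is an assumption of the sketch, since the text does not say how such a field is treated.

```python
def merge_field(base, current):
    """Combine the entries of one config field from a base and a current config.

    base, current: lists of non-empty entry strings, e.g. ["+my-lex.lfg"].
    """
    if not current:
        # No value in the current config: the field is inherited from the base.
        return list(base)
    is_edit = [entry[0] in "+-" for entry in current]
    if not any(is_edit):
        # Plain entries override the base field entirely.
        return list(current)
    if not all(is_edit):
        # Assumption of this sketch: mixed plain and +/- entries are rejected.
        raise ValueError("mixed plain and +/- entries")
    merged = list(base)
    for entry in current:
        op, name = entry[0], entry[1:]
        if op == "+":
            merged.append(name)   # added to the end of the base list
        else:
            merged.remove(name)   # needs an exact match; ValueError otherwise
    return merged


# Hypothetical FILES list of the base config english.lfg.
base_files = ["english-rules.lfg", "english-lex.lfg"]
files = merge_field(base_files, ["-english-lex.lfg", "+my-english-lex.lfg"])
```

Here files ends up containing english-rules.lfg followed by my-english-lex.lfg, mirroring the sample config above.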

MORPHACCESSPATH

Sets the value of the ACCESS section of the morph config. This is useful for testing a new set of morphology transducers by redirecting the morph config to another directory.

ENCRYPTFILES

Specifies a list of files to be encrypted when make-release-grammar is run.

PERFORMANCEVARSFILE

A file containing settings for various performance variables.

Section Lists

RULES, LEXENTRIES, and TEMPLATES each list one or more sections of the type, in the form (VERSION LANGUAGE). For example:
      RULES (TOY ENGLISH) (SERIOUS ENGLISH).
The given sections must be found in the data base FILES. If a RULES, TEMPLATES, or LEXENTRIES section is not given, the grammar is assumed to not contain that type of section.

The section lists establish a precedence order for the sections, with the last section listed having highest precedence. For RULES and TEMPLATES, this means that if several sections contain an entry with the same name (e.g., a rule N-->...) the entry in the section with highest precedence (last listed) will be used.
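As a rough illustration of this "last listed wins" rule for RULES and TEMPLATES, the following Python sketch treats each section as a dictionary from entry names to definitions. The function name and the toy rule bodies are hypothetical; only the precedence behavior is taken from the text.

```python
def active_entries(sections):
    """sections: list of dicts (entry name -> definition), lowest precedence first."""
    active = {}
    for section in sections:
        active.update(section)  # entries in later sections replace earlier ones
    return active


toy = {"N": "toy N rule", "S": "toy S rule"}  # stands in for (TOY ENGLISH)
serious = {"N": "serious N rule"}             # stands in for (SERIOUS ENGLISH)
rules = active_entries([toy, serious])
```

With RULES (TOY ENGLISH) (SERIOUS ENGLISH), the rule N comes from the SERIOUS section, while S, which only the TOY section defines, is still available.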

For LEXICON sections, the same basic precedence relationship applies. However, lexicon entries in higher-precedence sections may be used to modify, rather than simply override, entries with matching headwords in lower-precedence sections. See the discussion of "LEXICON sections" below for a description of this interaction.

Note: Identically named entries in the same section are considered errors.

Duplicate Sections

For purposes of grammar development, a data base may contain more than one section of the same type and name, e.g. TOY ENGLISH RULES (1.0). In this case, all sections of that name will be used, and will be inserted in the precedence order as a group; within the group the occurrence in the highest precedence file will have highest precedence. For example, given:

   FILES  file1 file2 file3.
   RULES  (TOY ENGLISH) (SERIOUS ENGLISH).
and (TOY ENGLISH) sections in both file1 and file3, the effective section precedence list would be (lowest to highest)
   (TOY ENGLISH):file1  (TOY ENGLISH):file3  (SERIOUS ENGLISH)
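The grouping of duplicate sections can be sketched in Python as below. The sketch assumes, as the example suggests, that files listed later in FILES have higher precedence. It also assumes that (SERIOUS ENGLISH) lives in file2, which the example does not state. The function and variable names are hypothetical.

```python
def effective_precedence(files, contents, section_list):
    """Return (section, file) pairs ordered from lowest to highest precedence.

    files:        file names in FILES order (later assumed to rank higher)
    contents:     dict mapping a file name to the section names it contains
    section_list: section names as listed in the config, lowest precedence first
    """
    order = []
    for name in section_list:
        # All occurrences of one section name are inserted as a group,
        # ordered among themselves by the precedence of their files.
        for f in files:
            if name in contents.get(f, ()):
                order.append((name, f))
    return order


contents = {
    "file1": ["TOY ENGLISH"],
    "file2": ["SERIOUS ENGLISH"],  # assumed location, not given in the text
    "file3": ["TOY ENGLISH"],
}
order = effective_precedence(
    ["file1", "file2", "file3"], contents, ["TOY ENGLISH", "SERIOUS ENGLISH"]
)
```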

Dynamic Sections

If all the sections of a particular type in the data base are to be used, the section specification clause may be abbreviated as

     section-type  (all all).
e.g.
LEXENTRIES (all all).
In this case, the precedence order of the sections will follow the precedence order of the containing FILES, and, within a FILE, later sections take precedence over earlier ones.

Section Headers

Individual sections are headed by a four-part header:
  version language type xle-version 
e.g.,
  TOY ENGLISH RULES (1.0) 
The "xle-version" refers to the XLE version used to process the grammar, and until further notice must be (1.0).

RULES

A list of section names that specifies the grammatical rules that are active in this configuration. For instance:

RULES (STANDARD ENGLISH)(SLANG ENGLISH).
The rules are installed in the order in which they are specified, so that the last name has priority over the names that precede it.

LEXENTRIES

A list of section names that specifies the lexical entries that are active in this configuration. Like rules, lexical entries are installed in the order that they are specified, so that the last name has priority over all the names before it.

TEMPLATES

An ordered list of section names that name the templates that will be active in this configuration.

FEATURES

An ordered list of section names that name the feature declarations that will be active in this configuration. See the section on feature declarations for more details.

ROOTCAT

The default root category for the grammar. Defaults to S (for Sentence). Unless otherwise indicated, when a string is typed in, XLE will attempt to find a tree rooted in this category.

REPARSECAT

The category to reparse with if XLE fails to find a valid analysis for the root category. This is useful for robust parsing to produce a collection of fragments for the input.

GOVERNABLERELATIONS

The attributes that are defined to be `governable' in the LFG sense: an f-structure will be marked as incoherent if it contains a value for a governable relation that is not subcategorized by the local PRED. These attributes are specified as a list of elements, such as SUBJ OBJ OBJ2 COMP. The list can also contain patterns in a limited regular predicate notation, with the interpretation that all attributes that match a pattern are also governable. Unlike c-structure rules and expressions of functional uncertainty, the terms of these regular predicates are individual characters, and adjacent characters in the pattern are interpreted as matching a corresponding sequence in an attribute name. The only operators that are allowed are * and +. Also, the question mark sign (?) stands for any single letter, and the hyphen (-) is interpreted literally, not as an indicator of relative complementation. Thus, ?COMP matches any attribute that has 5 letters ending in COMP, and OBL-?* matches any attribute that begins with OBL- and has zero or more characters following. If no value is specified for this component, then XLE will use the default value SUBJ OBJ OBJ2 OBL-?* POSS COMP ?COMP.
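One way to picture the pattern notation is to translate each pattern into an ordinary regular expression, as in the Python sketch below. This is an illustration under stated assumptions rather than XLE's matcher: the ? sign is assumed to match any single character, * and + are assumed to apply to the immediately preceding term, and the function names are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a GOVERNABLERELATIONS-style pattern into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")            # assumed: any single character
        elif ch in "*+":
            parts.append(ch)             # repetition of the preceding term
        else:
            parts.append(re.escape(ch))  # everything else, including -, is literal
    return re.compile("".join(parts))


def is_governable(attribute, patterns):
    """True if the whole attribute name matches at least one pattern."""
    return any(pattern_to_regex(p).fullmatch(attribute) for p in patterns)


# The default value given above.
DEFAULT = "SUBJ OBJ OBJ2 OBL-?* POSS COMP ?COMP".split()
```

Under these assumptions, is_governable("XCOMP", DEFAULT) and is_governable("OBL-TO", DEFAULT) hold, while is_governable("ADJ", DEFAULT) does not.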

SEMANTICFUNCTIONS

A list of attributes that are not "governable" in the LFG sense yet are always associated with predicates. For instance, ADJ, RELMOD, TOPIC, and FOCUS might all be labeled semantic functions. If you declare an attribute to be `semantic', then XLE will mark an f-structure as incoherent if that attribute appears in an f-structure without a local PRED, and the f-structure value of such an attribute will be marked incomplete if it does not contain its own PRED. The default value for this component is ADJ XADJ. The semantic functions may also be specified in terms of the same regular-predicate patterns allowed in GOVERNABLERELATIONS.

NONDISTRIBUTIVES

A list of attributes which, when asserted of a set, are not distributed across the elements of the set but rather are taken as properties of the set itself (see Section 4). Thus, if ADJ, XADJ, and CONJ are registered as nondistributive, their values will not be distributed across the individual conjuncts of a coordination construction. Instead, they will show up as attribute values of the set itself, which will thus have a mix of f-structure and set properties. Regular-predicate patterns can also be used to specify the nondistributive attributes.

EXTERNALATTRIBUTES

A list of attributes or projections which are intended for use by external clients. Attributes and projections which are not external are intended to be grammar-internal, and subject to change by the grammar writers. If you include the token GOVERNABLERELATIONS in the list, then all of the attributes listed in GOVERNABLERELATIONS will also be considered external. If you include the token SEMANTICFUNCTIONS in the list, then all of the attributes listed in SEMANTICFUNCTIONS will also be considered external. The external attributes can be used to filter the display of f-structures by including @EXTERNALATTRIBUTES in the Tcl variable abbrevAttributes. They can also be used to filter the output by including "EXTERNAL" in the value for outputStructures. Finally, they can be used in the dontadd list of the generator by including EXTERNALATTRIBUTES, or in the addonly list by including INTERNALATTRIBUTES.

OPTIMALITYORDER

A list of Optimality Theory marks used for parsing in ranked order.

GENOPTIMALITYORDER

A list of Optimality Theory marks used for generation in ranked order.

EPSILON

Defines the name of the epsilon category.  Should be set to e.

CHARACTERENCODING

Specifies the character encoding for the grammar. See the character encoding section for more details.

PARAMETERS

This optional component provides a general escape-hatch for setting arbitrary environment variables that you might want to take on specific values only when a particular configuration is active. The value of this component is a list of variable-value pairs, each of which is enclosed in parentheses.

Here are the parameters that affect the grammar's behavior:

ALLOWEMPTYNODES: As discussed above, the empty string will not be subtracted from c-structure regular predicates if this parameter is explicitly set to TRUE.

PRESERVEEPSILONS: Also as discussed above, explicit epsilon symbols in c-structure rules will be preserved and appear as distinct tree nodes if this parameter is set to TRUE.

Pred-prefix-check: PRED-setting constraints usually look something like "(some-path-expression PRED) = some-semantic-form" where the semantic form may contain arguments of the form "(path-expression grammatical-function)". In analysis mode, the path to the PRED typically must be a prefix of the path to each grammatical function in the arguments to the semantic form: that is, "some-path-expression" should be a prefix of "path-expression". This is not typically the case in generation mode. When Pred-prefix-check is set to TRUE, a violation of this condition is treated as an error; when it is not, violations are ignored.

Warn-of-argument-adjustment: Nonthematic argument adjustment is subtle and the grammar engineer may not realize it is happening. If the grammar engineer wants an explicit warning, she can set this to TRUE.

Do-not-warn-about-subc-precedence: If a precedence relation is explicitly expressed as a subc relation, a warning is emitted (since precedence is always subc, this is taken to be an indication that the grammar engineer may be confused). If this parameter is set to TRUE, the warning is suppressed.

Ignore-grammar-errors: When TRUE, causes errors encountered during grammar processing to be ignored, allowing sentences to be parsed against the grammar.

Treat-grammar-warnings-as-errors: When TRUE, causes warnings produced during grammar processing to be treated as errors, preventing sentences from being parsed against the grammar.

Metacategory-constraints: Allows the user to choose the desired interpretation of meta-categories (cf. meta-categories section) by specifying either Distributed or Relabeled as the value of the parameter. Distributed is the default.

PathsConnectedByDefault: Allows the user to specify that certain paths are connected by default. For instance if XXX is in this list, then XLE will add (^ XXX)=(! XXX) to the constraints associated with a daughter in a rule if the constraints do not mention XXX. This is similar to the convention of adding ^=! if the constraints associated with a daughter in a rule do not mention !. Currently the only notations for paths that are supported are (^ XXX), (! YYY), ZZZ (interpreted as (^ ZZZ)), and x::*. The x::* notation causes XLE to add x::M*=x::* if the constraints associated with a daughter in a rule do not mention x::. Connecting paths by default is useful for things like in-situ interrogatives which can appear anywhere within a clause. If interrogatives are marked by INT=+ and you make INT be connected by default, then the INT feature will be passed up automatically from the lexical item it occurs in to the top of the clause where it is needed. The only thing that you need to do is to mention the INT feature at clause boundaries to prevent it from propagating too far (for instance, by adding the constraint (^ INT)=(^ INT)).

Other parameters that may appear in the PARAMETERS section are ignored by XLE.

Feature Declarations

You can declare features in a special section that begins with NAME1 NAME2 FEATURES (1.0) and ends with ----. In between is a list of features with declarations after them. The feature declarations use "->" to designate a feature's value. This lets the grammar writer assert constraints on possible feature values or structures. (Eventually, the feature declarations may support using "<-" to designate the containing f-structure.) Here is an example:

TEST ENGLISH FEATURES (1.0)

GENDER: -> $ {MASC FEM}.
NUMBER: -> $ {1 2 3}.
SPEC: -> << [SPEC-TYPE SPEC-FORM].
SPEC-TYPE.

----

Constant-valued features can be declared using a disjunction of equalities, for instance XXX: {-> = a | -> = b}. This says that XXX can only have the value "a" or "b". However, this can be cumbersome when there are a lot of features, so you can also declare the possible values using the notation XXX: -> $ {a b c d e f ...}. This is interpreted to mean that the value of XXX is a member of the set given, and so XXX can have any of the values of the set. This was seen above for GENDER and NUMBER.

Features which consist of a set of constant values can be declared using set subsumption. For instance CASE: -> << {NOM ACC GEN DAT} is interpreted to mean that the value of CASE is a subset of the set given by {NOM ACC GEN DAT}.

You can declare complex features using the notation YYY: -> << [ATTR1 ATTR2]. This is interpreted to mean that the value of YYY is an f-structure consisting of at most the attributes ATTR1 and ATTR2. No other attributes are possible. This is seen above for SPEC. It is assumed that ATTR1 and ATTR2 are declared elsewhere. It is not possible to declare attributes contextually, e.g. to have ATTR1 take one set of values when under YYY and another set of values when under ZZZ.

It is also possible to mix declaration types. For instance, NTYPE: {-> << [PROPER TIME] | -> $ {common null}} means that NTYPE is either a feature structure with the possible attributes PROPER and TIME, or it is one of the constants "common" or "null".

If you declare o:: (the optimality projection) to be @OPTIMALITYMARKS, then XLE will extract a declaration for o:: from OPTIMALITYORDER and GENOPTIMALITYORDER in the grammar config. This means that all optimality marks must be included in one of these lists. Optimality marks that are not in a list are normally interpreted as NEUTRAL, so you can declare missing optimality marks by grouping them with NEUTRAL at the end of the list (e.g. (NEUTRAL Mark1 Mark2)).

In rare cases, you may want to declare the type of a feature without declaring its values. You can declare that a feature only takes constant values by using a meta-variable: ZZZ: -> $ {%any}. You can declare that a feature must be an f-structure by using a meta-variable within square brackets: ZZZ: -> << [%any].
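The declaration types described in this section can be modeled as predicates over values, as in the following Python sketch. It is a simplified model for illustration only: constants are represented as strings, sets of constants as Python sets, and f-structures as dictionaries. The helper names and the sample values def and name are hypothetical.

```python
def in_set(*constants):
    """Model of  -> $ {a b c}: the value is one of the listed constants."""
    allowed = frozenset(constants)
    return lambda v: isinstance(v, str) and v in allowed


def subset_of(*constants):
    """Model of  -> << {a b c}: the value is a set drawn from the listed constants."""
    allowed = frozenset(constants)
    return lambda v: isinstance(v, (set, frozenset)) and v <= allowed


def fstructure_with(*attributes):
    """Model of  -> << [A B]: an f-structure with at most the listed attributes."""
    allowed = frozenset(attributes)
    return lambda v: isinstance(v, dict) and set(v) <= allowed


def either(*declarations):
    """Model of a disjunction  { decl1 | decl2 }."""
    return lambda v: any(d(v) for d in declarations)


gender = in_set("MASC", "FEM")
case = subset_of("NOM", "ACC", "GEN", "DAT")
spec = fstructure_with("SPEC-TYPE", "SPEC-FORM")
ntype = either(fstructure_with("PROPER", "TIME"), in_set("common", "null"))
```

For example, ntype accepts both the constant common and an f-structure such as {"PROPER": "name"}, but rejects an f-structure containing any other attribute.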

Feature Checking

XLE will warn you if a feature appears in the grammar that doesn't appear in the feature declaration and isn't governable or semantic. You can suppress this warning by entering the feature with no declaration, e.g. XXX. (note the period). This was seen above for SPEC-TYPE. If a feature is declared this way, then XLE will ignore the feature. If you want to find out what values a feature can take in the grammar, give it a dummy declaration such as -> $ {}. This will cause XLE to report all violations involving the feature.

XLE only checks the constraints of the parts of the grammar that it loads. If you want to check the whole grammar, then use create-generator instead of create-parser since the generator has to analyze all of the lexical entries. XLE aborts if a lexicon section has errors in it, so you may have to use create-generator repeatedly after updating the declarations in order to find all of the errors in the grammar. Don't stop until the generator loads successfully.

If you want to check for unused feature declarations, type print-unused-feature-declarations (). This will cause XLE to reload all of the lexicon indices for the grammar of the given chart and check for any feature declarations that are never used. If no chart is given, then XLE defaults to $defaultchart. XLE also checks for unused feature declarations whenever create-generator is used if all of the lexicon indices were reloaded.

XLE can also check feature values and some feature co-occurrence constraints. These checks are only applied to simple constraints. For instance, XLE can check a constraint like (^ NUM)=dual, but it won't check constraints like (^ NUM) = %VALUE %VALUE=dual or (^ {NUM|PERS})=dual. These sorts of constraints can only be checked after f-structures have been constructed, which XLE doesn't currently do. Eventually, XLE may support other sorts of constraints that can only be checked after f-structures have been constructed. For instance, you might want to assert that if an f-structure has an NTYPE, then it won't have a VTYPE and vice-versa. This could be done using the following constraints:

NTYPE: ~(<- VTYPE) ... .
VTYPE: ~(<- NTYPE) ... .
...


However, this is not currently implemented.

Multiple Feature Declarations

It is possible to have more than one set of feature declarations for a grammar. For instance, a variant grammar may use the standard grammar's feature declarations plus a set of feature declarations just for the variant grammar. The feature declarations are processed in the order that they appear in the grammar's CONFIG section:

FEATURES (STANDARD LANGUAGE) (VARIANT LANGUAGE).

If XLE encounters a feature declaration for a feature that already has a declaration, then XLE will print a warning message unless the feature name is preceded by one of the edit operators "&", "+", or "!" (these are related in spirit to the lexicon edit entries, but the notation and interpretation are slightly different). If an edit operator is used in a feature declaration but the feature hasn't already been declared, then XLE will print a warning message. Here are some examples of these edit operators in use:

VARIANT LANGUAGE FEATURES (1.0)

&CASE: -> $ {nom acc}.
+NUM: -> $ {dual}.
!PERS: -> $ {first second third}.

The "&" operator conjoins the given feature declaration with the prior feature declarations for this feature. This is useful when you want to eliminate some feature values from a standard feature declaration without eliminating the feature declaration check. For instance, suppose that the standard feature declaration had:

CASE: -> $ {nom acc gen dat}.

It is possible to remove the dat value by giving a new feature declaration with the "&" operator:

&CASE: -> $ {nom acc gen}.

The effective result of combining these feature declarations is:

CASE: -> $ {nom acc gen dat} -> $ {nom acc gen}.

The "&" operator can be used to remove feature values but not to add them. If you try to add new feature values using something like:

&CASE: -> $ {nom acc obl}.

then the obl value will be disallowed by the first feature declaration since both feature declarations are checked. Thus, the original feature declaration is still in force.

The "+" operator disjoins the given feature declaration with the prior feature declarations for this feature. This is useful when you want to add some feature values to a standard feature declaration without copying the whole feature declaration. For instance, the following feature declaration would add the obl value to the standard feature declaration:

+CASE: -> $ {obl}.

The effective result of combining this with the original feature declaration is:

CASE: { -> $ {nom acc gen dat} | -> $ {obl} }.

which can be rewritten as:

CASE: -> $ { nom acc gen dat obl }.

The "!" operator replaces the prior feature declarations with the given feature declaration. This is useful when the new feature declaration has little to do with the existing feature declaration. Here are some examples:

!CASE: -> $ {nom acc obl}.
!CASE: -> $ {}.

The second example shows how you can effectively delete a feature declaration from the prior feature declaration set.

These edit operators also allow parallel grammars to share a common set of feature declarations while still letting each grammar make minor changes to the feature declarations. Any changes to the common set of feature declarations will be marked by an edit operator, which will make it easier to keep track of the changes. For best results, you should use the "&" and "+" operators whenever possible, and only use the "!" operator when there is no other way to describe the change.

If you want to see the effective feature declarations for a grammar, the Tcl command "print-feature-declarations" will print the feature declarations with the edit operations already taken into account. This is useful if you want to compare the feature declarations between two grammars, or if you want to verify that the edit operations did what you intended.
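The effect of the three edit operators can be sketched in Python by treating each declaration as a predicate over values: "&" conjoins, "+" disjoins, and "!" replaces. This is a minimal model of the behavior described above, not XLE's implementation, and the function names are hypothetical.

```python
def member_of(*values):
    """Model of a declaration  -> $ {v1 v2 ...}."""
    allowed = frozenset(values)
    return lambda v: v in allowed


def apply_edit(prior, op, new):
    """Combine a prior declaration with a new one according to an edit operator."""
    if op == "&":
        return lambda v: prior(v) and new(v)  # both declarations must hold
    if op == "+":
        return lambda v: prior(v) or new(v)   # either declaration may hold
    if op == "!":
        return new                            # the prior declaration is discarded
    raise ValueError("unknown edit operator: " + op)


standard = member_of("nom", "acc", "gen", "dat")

narrowed = apply_edit(standard, "&", member_of("nom", "acc", "gen"))
blocked = apply_edit(standard, "&", member_of("nom", "acc", "obl"))
extended = apply_edit(standard, "+", member_of("obl"))
replaced = apply_edit(standard, "!", member_of("nom", "acc", "obl"))
```

In this model, blocked still rejects obl because the standard declaration remains in force, whereas extended and replaced both accept it.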

XLE MORPHOLOGY Section

Two aspects of XLE morphological processing are described below: the MORPHOLOGY section, and text-specified transducers. The MORPHOLOGY section describes the transducers used for parsing and generating the morphology. It begins with "XXX YYY MORPHOLOGY (1.0)", and ends with "----". When used for parsing, the morphology section specifies the sequence of transducers to be applied to an input string to obtain the lexical items assumed to make up that string. If no MORPHOLOGY section is given, then XLE uses a default tokenizer that tokenizes at spaces.

The next section describes the format of the MORPHOLOGY section, as used in both parsing and generation.

Text-specified transducers are simple specifications of morphological analyzers, placed in text files and referenceable from the MORPHOLOGY section, for purposes of adding-to or overriding the analyses of a primary morphological transducer. They are described further on.

MORPHOLOGY Section Format

First we discuss the structure of the MORPHOLOGY section if it is used only for parsing and then we discuss modifications needed if it is also used for generation.

The body of the MORPHOLOGY section consists of a collection of subsections. Each subsection begins with a line containing the subsection label, which must end with a ':' character, followed by one or more specification lines, followed by a blank line. An example of a MORPHOLOGY section containing all possible subsections would be:

MY LANGUAGE MORPHOLOGY (1.0)

ACCESS:
REFS GRAMMAR_DIRECTORY/../prelex/
BUILD rebuildtokenizers

BREAKTEXT:
breaktext-filename

TOKENIZE:
whitespace-normalizer-filename tokenizing-transducer-filename

#The first analysis
ANALYZE USEFIRST:
morph-override-filename
primary-analyzer-filename

#An alternative analysis
ANALYZE USEALL:
lower-caser-filename primary-analyzer-filename

#token normalization
NORMALIZE TOKENS:
removecapmarks.fst

#Multi-word processing

BuildMultiwordsFromLexicon:
Tag = +Prefer

BuildMultiwordsFromMorphology:
Tag = +Prefer

MULTIWORD:
multiword-filename

----
The sections may appear in any order, and all but the TOKENIZE section are optional, but only one section of each type is accepted. The filenames specified may be full path names, simple names, or relative names such as ../xxx, subdir/xxx. Simple names and relative names are assumed relative to the directory containing the root grammar file (N.B. not the file that contains the MORPHOLOGY section). If there is a REFS entry under ACCESS:, then simple names and relative names are assumed relative to the directory given there. Lines that begin with the character # are comment lines, and can occur anywhere.

The functions of the individual section types are as follows.

ACCESS. This is followed by a REFS line and/or a BUILD line. 

VERSIONFILES. This section contains a list of filenames that have version numbers in them. Each filename must appear on a separate line in the section. The version number must be on the first line of the file. Each version number will appear as a property value in the Prolog output file where the property name is the base name of the filename. The VERSIONFILES section must follow the REFS line, if it exists.

BREAKTEXT. This subsection should contain a single transducer that breaks a text up into segments (usually sentences) by inserting the multi-character symbol TEXTBREAK deterministically between segments. This is used by parse-file and make-testfile to break a text into a sequence of segments that can be given to the parser. The TEXTBREAK symbol is removed before a segment is given to the parser. Here is an example of the notation:
BREAKTEXT:
breaktext.fst

You can allow for sentence break ambiguity by having the BREAKTEXT transducer break the text into segments only where the breaks are very reliable (for instance, at paragraph boundaries) and then having the tokenizer insert sentence breaks non-deterministically. These sentence breaks will then appear in the input to the parser. For this to work, the grammar must be modified to accept sentence break markers.

TOKENIZE. The tokenize subsection is as illustrated in the example above. The named transducers are assumed to operate on full strings (sentences or other text segments), and to produce, as final output, strings with intervening @ characters specifying breaks between tokens. (If any of the morphological transducers have TB in their sigmas, then XLE will use TB as the token boundary symbol instead of @.) Using more than one transducer has the effect of applying the result of their composition. This can be useful if the actual composition becomes too large.

Notes:

  1. The tokenization transducers may also perform other functions, such as capitalization and multi-word recognition, but the insertion of @ or TB characters by the final transducer is required.
  2. The tokens output from the tokenize subsection can be used to match * morphcode entries in the XLE lexicon.
  3. If there are multiple lines in the tokenizer section, then a priority union is formed. This means that later lines are only applied if earlier lines fail to produce an output. This is useful if the first tokenizer is not robust (e.g. doesn't accept all possible strings).

ANALYZE USEFIRST. (May also be called just ANALYZE). This subsection contains one or more lines. Each line may specify one or more transducers, but has the effect of a single effective transducer (see Notes below). The ANALYZE transducers are assumed to take as input the individual tokens produced by the tokenizing transducer, and (if successful) to produce sequences of one or more output elements (e.g., a stem and a sequence of tags). So, for example, the TOKENIZE transducer might produce:
he@talks@
and an (effective) ANALYZE transducer might separately decompose the tokens he and talks to obtain analyses of the string such as:
he@+3SG@+PRON@  talk@+3SG@+Pres@+V@
he@+3SG@+PRON@ talk@+Pl@+N@
The effective ANALYZE USEFIRST transducers are applied to a token one by one until an analysis is found or there are no more lines. When an analysis is found, transducers specified in subsequent lines in the subsection are not applied to that token. So, for example, given the subsection
ANALYZE USEFIRST:
verb-analyzer
other-analyzer

the first analyzer might deliver the verb analysis of talks, blocking the noun interpretation, and the result of the combination would be:
he@+3SG@+PRON@  talk@+3SG@+Pres@+V@
A built-in transducer named *BLOCK* can be used with ANALYZE to prevent a guesser from being used. For instance, suppose that you didn't want the guesser to be applied to tokens that ended with a period. You could accomplish this with the following:
ANALYZE USEFIRST:
morph-analyzer
ends-with-period *BLOCK*
guesser
If a token isn't in morph-analyzer and satisfies ends-with-period, then it won't receive a morphological analysis. However, if it is not in morph-analyzer and does not satisfy ends-with-period, then the guesser does apply.

ANALYZE USEALL. This subsection has the same structure as ANALYZE USEFIRST: a sequence of lines each identifying one or more transducers, and each line representing a single effective transducer. However, all the lines of the subsection are applied to each TOKENIZE output token. So if, instead, we listed the above transducers within an ANALYZE USEALL subsection:
ANALYZE USEALL:
verb-analyzer
other-analyzer
both the verb and noun readings would be obtained. Thus, given an ANALYZE (USEFIRST) subsection containing m lines, and an ANALYZE USEALL subsection containing n lines, each token can have a maximum of n+1 successful interpretations: one from ANALYZE USEFIRST and n from ANALYZE USEALL.
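The difference between ANALYZE USEFIRST, ANALYZE USEALL, and *BLOCK* can be sketched in Python by modeling each line of a subsection as a function from a token to a list of analyses. This is a toy model under stated assumptions. Real lines are finite-state transducers, the lookup tables stand in for verb-analyzer and other-analyzer, and the guesser with its +Guess tag is hypothetical.

```python
BLOCK = object()  # stands in for the built-in *BLOCK* transducer


def lookup(table):
    """Build a toy analyzer from a dict mapping tokens to lists of analyses."""
    return lambda token: list(table.get(token, []))


def block_if(predicate):
    """Toy version of a line such as 'ends-with-period *BLOCK*'."""
    return lambda token: BLOCK if predicate(token) else []


def use_first(lines, token):
    """Apply lines in order; stop at the first analysis or at a *BLOCK*."""
    for line in lines:
        result = line(token)
        if result is BLOCK:
            return []          # later lines (e.g. a guesser) are not tried
        if result:
            return result
    return []


def use_all(lines, token):
    """Apply every line and collect all analyses."""
    analyses = []
    for line in lines:
        analyses.extend(line(token))
    return analyses


verb_analyzer = lookup({"talks": ["talk@+3SG@+Pres@+V@"]})
other_analyzer = lookup({"talks": ["talk@+Pl@+N@"], "he": ["he@+3SG@+PRON@"]})
guesser = lambda token: [token + "@+Guess@"]  # hypothetical guesser output
ends_with_period = block_if(lambda token: token.endswith("."))

first = use_first([verb_analyzer, other_analyzer], "talks")
every = use_all([verb_analyzer, other_analyzer], "talks")
```

Here first contains only the verb reading of talks, while every contains both the verb and the noun reading.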

NORMALIZE TOKENS. This subsection provides transducers for normalizing the tokens that are used to match * entries in the lexicon and to build multi-word entries from the lexicon and the morphology. If this section is missing, then XLE implicitly uses an identity transducer to normalize the tokens.

BuildMultiwordsFromLexicon. This section instructs XLE to build a multi-word transducer that optionally converts the tokens that are output by the morphology into multi-word tokens that appear in the lexicon. So if

New` York N * (^ PRED)='New` York'.
appears in the lexicon, then XLE will convert the morphological output
New +Token York +Token
into
{New York +MWToken | New +Token York +Token}
The +MWToken will cause the token New York to match the lexical entry New` York but not the special token entry -token that is used for fragment grammars.

BuildMultiwordsFromMorphology. This section instructs XLE to build a multi-word transducer that optionally converts the tokens that are output by the morphology into multi-word tokens that appear in the morphology. So if the morphology has the following transduction in it (as given to a text-specified transducer):
New York : /New_York/ +City
then XLE will convert the morphological output
New +Token York +Token
into
{New_York +City | New +Token York +Token}

N.B. BuildMultiwordsFromMorphology doesn't work if the multi-words are part of a line of transducers (e.g. an effective transducer).

Properties of BuildMultiwordsFromLexicon and BuildMultiwordsFromMorphology.

If something like Tag = +Prefer appears in the lines immediately after BuildMultiwordsFromLexicon or BuildMultiwordsFromMorphology, then XLE will add the given tag after the new multi-word construction (e.g. New York +MWToken +Prefer). This can be used to prefer multi-words over the individual tokens with an optimality mark. One instance of the tag will appear for each token that the multi-word token matches except for the last one. Thus, if the lexicon has:

New` York` City N * (^ PRED)='New` York` City'

in it, then XLE will produce New York City +MWToken +Prefer +Prefer.

If something like TokenBoundary = _ appears in the lines immediately after BuildMultiwordsFromLexicon or BuildMultiwordsFromMorphology, then XLE will match on the given value instead of space. For instance, if

TokenBoundary = _

and the lexicon had

New_York N * (^ PRED)='New_York'.

then XLE would convert New +Token York +Token into {New_York +MWToken | New +Token York +Token}.
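The shape of the multi-word alternative produced under the Tag and TokenBoundary properties can be sketched in Python as follows. The function only formats the resulting string for illustration. Its name and parameters are hypothetical, and it does not model the transducer that XLE actually builds.

```python
def multiword_token(words, tag=None, boundary=" "):
    """Format the multi-word alternative for a sequence of matched tokens.

    words:    the individual tokens that the multi-word entry matches
    tag:      value of the Tag property, e.g. "+Prefer" (optional)
    boundary: value of the TokenBoundary property (space if not given)
    """
    parts = [boundary.join(words), "+MWToken"]
    if tag is not None:
        # One instance of the tag per matched token, except for the last one.
        parts.extend([tag] * (len(words) - 1))
    return " ".join(parts)


three_words = multiword_token(["New", "York", "City"], tag="+Prefer")
underscored = multiword_token(["New", "York"], boundary="_")
```

Here three_words is "New York City +MWToken +Prefer +Prefer" and underscored is "New_York +MWToken", matching the examples above.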

If there are multi-word tokens or multi-word morphemes such as United States of America or oil filter that are treated as multi-words mainly for performance reasons, then you might consider using a multi-word transducer to add brackets to the multi-words instead of converting them to multi-word lexical entries. Otherwise, you lose the internal structure of the multi-word construction.

The contents of BuildMultiwordsFromLexicon and BuildMultiwordsFromMorphology are ignored in the generation direction. However, the generator doesn't break the input to the generation morphology at pre-specified locations, so multiwords will often generate just fine. For instance, the word morphology for the generator will convert New_York +City into "New York" TB. If the untokenizer allows spaces in its input, then the final output of the generator will be "New York".

MULTIWORD. This subsection of the morphology section is used for transducers that deal with multi-word phenomena such as multi-word tokens and multi-word morphology. It has the same structure as the TOKENIZE subsection. It usually contains only one line of transducers. Here is an example:
MULTIWORD:
multiword1 multiword2

BuildMultiwordsFromLexicon and BuildMultiwordsFromMorphology are implicitly added in front of the transducers listed in MULTIWORD. If you need BuildMultiwordsFromLexicon or BuildMultiwordsFromMorphology to apply after a multiword transducer, you can give its position explicitly in MULTIWORD:

MULTIWORD:
multiword1 BuildMultiwordsFromLexicon multiword2

Overall Architecture

Conceptually, the tokenizers, the word analyzers, and the multiword transducers are composed together into a system that looks something like the following:

                         multiword2
                            .o.
                         multiword1
                            .o.
               BuildMultiwordsFromMorphology
                            .o.
                 BuildMultiwordsFromLexicon
                            .o.
  [[analyzers | token-normalizer +Token:epsilon] epsilon:TB]*
                            .o.
                         tokenizer2
                            .o.
                         tokenizer1

In this diagram, we are following the convention of putting the lexical side on the top and the surface side on the bottom. The last multiword transducer appears first in the list because the xfst operator .o. goes in the generation direction (from lexical side to surface side).

Working from the bottom up, the tokenizers take the whole sentence as input and produce a finite-state machine as output. The tokens in the output are separated by TB, a multi-character symbol that represents a token boundary (some old grammars use @).

The line above the tokenizers shows how the word morphology fits in. analyzers is the union of the transducers in ANALYZE USEALL with the priority union of the transducers in ANALYZE USEFIRST. token-normalizer normalizes the tokens. The normalized tokens have +Token added to the end of them by +Token:epsilon. The result is then unioned with analyzers. Finally, the TB is removed by epsilon:TB (if TB appears in the alphabet of one of the multiword transducers, then TB is passed through instead). TB is not allowed to appear on the lower side of analyzers or token-normalizer to make it easier for XLE to apply the word morphology. This whole process repeats until the entire output of the tokenizers is matched.

If token-normalizer is not given explicitly, then XLE uses an identity transducer. The +Token transducer is used to match lexical entries with * as the morphology category. That is, XLE interprets strings of characters followed by +Token to match a lexical entry that has the same string of characters with a * as the morphology category.
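The way analyzers and the +Token path combine for a single token can be sketched in Python. This is a simplified model under stated assumptions: each transducer is represented as a function from a token to a set of analysis strings, and all names (use_first, use_all, word_morphology, table) and the sample analyses are hypothetical.

```python
def use_first(analyzers):
    """Priority union: the first analyzer that yields any analysis wins."""
    def apply(token):
        for analyze in analyzers:
            results = set(analyze(token))
            if results:
                return results
        return set()
    return apply


def use_all(analyzers):
    """Plain union: collect the analyses of every analyzer."""
    def apply(token):
        results = set()
        for analyze in analyzers:
            results |= set(analyze(token))
        return results
    return apply


def identity(token):
    """Default token normalizer: pass the token through unchanged."""
    return {token}


def word_morphology(token, first=(), all_=(), normalizer=identity):
    """Union of the analyzers with the normalized token marked by +Token."""
    analyses = use_first(first)(token) | use_all(all_)(token)
    analyses |= {f"{norm} +Token" for norm in normalizer(token)}
    return analyses


def table(entries):
    """Build a toy analyzer from a dict of token -> set of analyses."""
    return lambda token: entries.get(token, set())


override = table({"flies": {"fly +N +Plural"}})
primary = table({"flies": {"fly +V +Pres"}, "walks": {"walk +V +Pres"}})

print(sorted(word_morphology("flies", first=[override, primary])))
print(sorted(word_morphology("flies", first=[override], all_=[primary])))
```

In the first call the override shadows the primary analyzer for flies; in the second, both analyses survive because the primary analyzer is listed under the union.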

The output of the morphology consists of strings of characters used to represent stems followed by multi-character symbols used to represent tags (see Creating transducers for details).

BuildMultiwordsFromLexicon is an implicit transducer that adds multiwords from the lexicon (described in BuildMultiwordsFromLexicon). BuildMultiwordsFromMorphology is an implicit transducer that adds multiwords from the morphology (described in BuildMultiwordsFromMorphology). Multiword expressions need to be handled explicitly in the multiword transducers or implicitly using BuildMultiwordsFromLexicon and/or BuildMultiwordsFromMorphology.

Finally, the multiword transducers are applied to the output of the word morphology and the multiwords. This is the place to handle multiword phenomena that are sensitive to the morphological analysis of words.

When generating, the morphemes are composed with a virtual transducer that includes the multiword transducers and the word morphology as described above. If the generation succeeds, then the output is composed with the (un)tokenizers.

Notes on Effective Transducers. In many of the subsections described above, each line may specify more than one transducer; together these are treated as a single effective transducer in the following way. The first transducer on the line accepts a string as input and, if successful, produces a set of strings as output. The second transducer is then applied to these strings to produce another set of strings, and so on. An effective transducer succeeds as a whole if it produces at least one string as output. In xfst terms, a line such as:

 transducer1 transducer2 transducer3 

is equivalent to:

 transducer3 .o. transducer2 .o. transducer1 

The order is different in xfst because xfst treats the generation direction as primary, and XLE treats the parse direction as primary.
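The left-to-right application of the transducers on one line can be sketched in Python as follows. Transducers are modelled as functions from a string to a set of strings; lower_caser, primary_analyzer, and the analysis string are hypothetical stand-ins, not real XLE components.

```python
def effective_transducer(*transducers):
    """Chain transducers in parse order (left to right on the line)."""
    def apply(string):
        current = {string}
        for transducer in transducers:
            # Feed every output of the previous step into the next transducer.
            current = {out for s in current for out in transducer(s)}
            if not current:
                break  # no output at this step: the effective transducer fails
        return current
    return apply


def lower_caser(s):
    """Toy transducer: accepts only all-upper-case words and lowercases them."""
    return {s.lower()} if s.isupper() else set()


def primary_analyzer(s):
    """Toy analyzer with a single hypothetical entry."""
    return {"is": {"be +Verb +Pres"}}.get(s, set())


line = effective_transducer(lower_caser, primary_analyzer)
print(line("IS"))
print(line("is"))
```

The second call yields the empty set because the first transducer on the line rejects the input, so the line as a whole does not succeed.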

Writers of transducers should also pay careful attention to the recommendations in the XFST book (6.3ff) to be clear about the input and output alphabets of their transducers. In particular, the transducers produced by create-transducer map surface strings into strings that mix surface characters with multicharacter "tag" symbols (this is spelled out clearly in the documentation on Text-specified transducers). Thus, a line

food : /foo/ +Verb +Past

will cause a mapping between "food" and the 5-symbol sequence "f o o +Verb +Past". To operate correctly, each transducer must be written with the encodings of the transducers above and below it in mind.

Example of MORPHOLOGY section usage

The initial example is reproduced here to illustrate how these facilities might be used:
MY LANGUAGE MORPHOLOGY (1.0)

TOKENIZE:
whitespace_normalizer_filename tokenizing-transducer-filename

ANALYZE USEFIRST:
morph-override-filename
primary-analyzer-filename

ANALYZE USEALL:
lower-caser-filename primary-analyzer-filename

----
Assume first that the grammar writer disagrees with the analysis produced by the primary-analyzer-filename for a few words. This problem might be solved by building a small transducer morph-override-filename delivering the desired analysis, and placing it before the primary analyzer in the ANALYZE USEFIRST subsection. (See the Text-specified transducers section for how such a small transducer might be built.)

Next, assume that the text to be analyzed contains some words written in upper case just for emphasis, and these may overlap some acronyms or names accepted as such by the primary analyzer (e.g., IS = information systems or IS = emphatic is). So one might write a lowercasing transducer lower-caser-filename that accepts just such words, and translates them to lower case. Then the grammar writer might add that transducer to the ANALYZE USEALL subsection, followed by the primary analyzer primary-analyzer-filename.

Use of the MORPHOLOGY section in generation

The same MORPHOLOGY section used for parsing is also used for generation. Without additional annotation, the transducers specified for parsing are applied to generation output, but inverted and in the reverse direction. However, since some parse transducers may not be appropriate for generation (e.g., normalizers), files may be annotated as P!filename or G!filename to confine their use to parsing or generation respectively. Alternatively, one may choose to use completely separate parsing and generation transducers, e.g., in the tokenizer subsection, by annotating each filename. For example, given the MORPHOLOGY section:
MY LANGUAGE MORPHOLOGY (1.0)

TOKENIZE:
P!whitespace_normalizer_filename tokenizing-transducer-filename

ANALYZE USEFIRST:
morph-override-filename
primary-analyzer-filename

ANALYZE USEALL:
P!lower-caser-filename P!primary-analyzer-filename

----
Each stem-tag sequence in the generator output would be submitted first to morph-override-filename (logically inverted) and, if that fails, to primary-analyzer-filename (also logically inverted). The analyzers in the ANALYZE USEALL subsection would not be applied as none are applicable to generation. The resulting net of alternative word paths would then be composed with the inverted tokenizing-transducer-filename.
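The effect of the P! and G! annotations on one line of the MORPHOLOGY section can be sketched in Python. The function select_transducers is a hypothetical helper that only models which files are selected for a given direction; it does not model the inversion of the transducers.

```python
def select_transducers(line, direction):
    """Return the filenames on `line` that apply in `direction`.

    `direction` is "parse" or "generate" (names chosen for this sketch).
    Files prefixed with P! are parse-only, files prefixed with G! are
    generation-only, and unannotated files are used in both directions.
    """
    if direction not in ("parse", "generate"):
        raise ValueError(f"unknown direction: {direction!r}")
    excluded = "G!" if direction == "parse" else "P!"
    selected = []
    for name in line.split():
        if name.startswith(excluded):
            continue
        if name.startswith(("P!", "G!")):
            name = name[2:]  # strip the annotation, keep the filename
        selected.append(name)
    return selected


tokenize = "P!whitespace_normalizer_filename tokenizing-transducer-filename"
use_all = "P!lower-caser-filename P!primary-analyzer-filename"

print(select_transducers(tokenize, "parse"))
print(select_transducers(tokenize, "generate"))
print(select_transducers(use_all, "generate"))
```

For the example section above, the last call returns an empty list, matching the statement that none of the ANALYZE USEALL analyzers are applicable to generation.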

Creating transducers

There are several ways to create transducers to be used by the morphology section. One way is to use XFST to build transducers. Note that XLE currently ignores flag diacritics, so you should use the "eliminate flag FLAG" command to remove each flag and compose in the restriction that it encodes when you are done creating a transducer. Another is to use a built-in facility for creating Text-specified transducers. Finally, you can write C libraries for simulating transducers using Grammar library transducers.

XLE assumes that transducers have two different types of characters in them: standard characters (such as Unicode characters) that are encountered in running text, and multi-character symbols that look like strings of standard characters but are interpreted as a single character. The multi-character symbols are used to represent special entities recognized by the tokenizer or the morphology. For instance, the multi-character symbol "TB" is output by the tokenizer between tokens to represent a token boundary. Because it is a multi-character symbol, it cannot be confused with an abbreviation for tuberculosis.

Since there is no explicit separator for lexemes, XLE assumes that lexemes are separated by multi-character symbols, and that each multi-character symbol is a separate lexeme. Thus, XLE always breaks up the output of the morphology into standard characters followed by multi-character symbols, where the standard characters are stems and the multi-character symbols are tags. Because of this, stems must be followed by tags represented as multi-character symbols, and multi-character symbols cannot be embedded in stems. Morphological tags such as +N or +PL must be represented as multi-character symbols. Tokens are represented as a string of standard characters followed by the multi-character symbol +Token.
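The segmentation described above can be sketched in Python. The sketch assumes the morphology output is given as a list of symbols, where a one-character string is a standard character and a longer string is a multi-character symbol; split_lexemes is a hypothetical name, not an XLE function.

```python
def split_lexemes(symbols):
    """Group standard characters into stems; each multi-character symbol is a tag."""
    lexemes = []
    stem = []
    for symbol in symbols:
        if len(symbol) > 1:
            # A multi-character symbol ends the current stem and is its own lexeme.
            if stem:
                lexemes.append("".join(stem))
                stem = []
            lexemes.append(symbol)
        else:
            stem.append(symbol)
    if stem:
        lexemes.append("".join(stem))  # trailing stem with no tag after it
    return lexemes


output = list("disk") + ["+N", "+Sing"] + list("drive") + ["+N", "+Sing"]
print(split_lexemes(output))

# Without a tag between them, two stems cannot be told apart.
print(split_lexemes(list("disk") + list("drive") + ["+N"]))
```

The second call shows why each stem must be followed by a tag: with no multi-character symbol between them, adjacent stems run together into one.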

Text-specified transducers

To assist the grammar writer in specifying morphological analyses either extending or overriding those produced by a primary morphological analyzer, a transducer may be specified by a simple text file. This file may be referenced directly in a MORPHOLOGY section, or used as input to an fsm-file building capability.

Each line of the text file defines the analysis of a single word. For example:

talks  : /talk/ +Pres
walks : /walk/ +Pres
flies: /fly/ +N +Plural
disk drive : /disk/ +N +Sing /drive/ +N +Sing
The first field on the line, from the beginning of the line to the required : character, contains the input word. The input word may contain blanks, but leading and trailing blanks are ignored. So, for example, the fourth line of the above specification gives an analysis for the string disk drive.

The analysis into "stems" and "tags" follows the input-word field. Stems should be enclosed in / characters, and also may contain blanks; tags are not enclosed, but must be separated by blanks. (The significance of the distinction, which may not be of interest to the grammar writer, is that the transducer built will put tags on arcs with multi-character symbols.) Each stem MUST be followed by a tag. If a stem is not followed by a tag, then XLE may merge the stem with another stem that follows it.

The file may also contain initial and interspersed comment lines that start with the character #, as well as blank lines, e.g.:

# Override Transducer for Verbs  

talks : /talk/ +Pres

# Not sure about this one
walks : /walk/ +Pres

The standard XLE escape character, `, is used when a character is to function as a literal rather than a control character, for example,

`:  : /colon/  +Punct
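The line format described above can be approximated by a small Python parser. This is only a sketch of the format as documented here (input word, required colon, /stem/ and tag fields, # comments, and the ` escape in the input word); parse_line is a hypothetical helper and may not match XLE's own reader in edge cases.

```python
import re

# A stem is enclosed in slashes and may contain blanks; a tag is any
# other run of non-blank characters.
ITEM = re.compile(r"/(?P<stem>[^/]*)/|(?P<tag>\S+)")


def parse_line(line):
    """Parse one line into (input_word, [(kind, text), ...]).

    Returns None for blank lines and comment lines.
    """
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    word_chars = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "`" and i + 1 < len(text):
            word_chars.append(text[i + 1])  # escaped character is literal
            i += 2
        elif ch == ":":
            break  # first unescaped colon ends the input-word field
        else:
            word_chars.append(ch)
            i += 1
    else:
        raise ValueError(f"missing ':' in {line!r}")
    word = "".join(word_chars).strip()
    analysis = []
    for match in ITEM.finditer(text[i + 1:]):
        if match.group("stem") is not None:
            analysis.append(("stem", match.group("stem")))
        else:
            analysis.append(("tag", match.group("tag")))
    return word, analysis


print(parse_line("disk drive : /disk/ +N +Sing /drive/ +N +Sing"))
print(parse_line("`:  : /colon/  +Punct"))
print(parse_line("# Not sure about this one"))
```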

As mentioned, text-specified transducers may be referenced directly in the ANALYZE USEFIRST/USEALL subsections of the MORPHOLOGY section. For example, the morph-override-filename in one of the illustrations above might be the name of a transducer specification file.

However, the text file may also be input to a utility invoked from the XLE command line:

create-transducer  text-file-name  output-transducer-file-name
to produce a file-stored transducer output-transducer-file-name that may be examined via the c-fsm interactive facilities, and/or directly referenced in the MORPHOLOGY section.

XLE assumes that text-specified transducers are encoded in iso8859-1. If you want to add transducers in utf-8, it is recommended that you use XFST to build a transducer using the command 'read spaced-text'.

Grammar library transducers

Transducers can also be implemented using code libraries associated with the grammar. Any place that a transducer occurs, you can invoke a grammar library using the notation libfoo initfile or libfoo. XLE looks for libfoo-<platform>.<ext> first, and then libfoo.<ext>, where <platform> is the current platform (e.g. one of solaris, macosx, linux2.1, linux2.2, or linux2.3) and <ext> is a library extension (e.g. one of so, dylib, or dll). XLE calls the initialize function in the library with the initfile specified. It then uses the library to analyze strings. See grammar-library.h for more details on how to implement a grammar library.

All of the variations on libfoo described above that XLE can find are automatically included when you use make-bug-grammar or make-release-grammar, along with any init files. However, files used to build libfoo are not automatically included. If you want to make them part of the grammar, add them to the OTHERFILES config field in the root grammar file. This is useful if you only build libfoo for one or two platforms, and want to allow people to build libfoo on other platforms.

Note that grammar libraries only work in the parsing direction. XLE does not support grammar libraries for generation.

XLE Lexicon Entries and Lookup Model

XLE Lexicons - Lexicon File Specification

The lexicon sections to be used with a grammar are specified by the LEXENTRIES clause of the CONFIG section of the grammar, as

  LEXENTRIES section_1 section_2 ... section_n

e.g.

LEXENTRIES (STANDARD ENGLISH) (TRACTOR ENGLISH)

where the lexicon precedence order is "last first", i.e., in the example (TRACTOR ENGLISH) is of higher precedence than (STANDARD ENGLISH). This is discussed in the Walkthrough as well.

Standard Lexicon Entries

Lexicon entries should be constructed for each stem and tag to be returned by the morphological analysis transducers, and of interest to the application, e.g.,
     dream   N   XLE  @(COUNT-NOUN %stem);
             V   XLE  @(INTRANS %stem).

+Nsg NSFX XLE constraints.

+Adj ASFX XLE .

+AdjC ASFX XLE @(COMPARATIVE).

+AdjS ASFX XLE @(SUPERLATIVE).
The list of relevant tags will depend on the morphology being used.

A lexicon subentry for a category X with an XLE (non *) morphcode is given the category X_BASE in the grammar. The latter is used in constructing the sublexical rules and tells XLE how to display the words in the c-structure tree. For example:

  N --> N_BASE NSFX_BASE.
If the analyses returned by the morphological analyzer are insufficient for the application, additional "surface" entries or subentries for the words involved may be constructed, using the morphcode *, e.g.,
     dreamed A   *  @(ADJ %stem).
The categories for these subentries are not implicitly suffixed by _BASE, as they represent full lexical items.

To provide for stems analyzed by the morphological transducers but not given explicit entries in the lexicon, entries headed by the word -unknown may be used to obtain parses for sentences involving such stems.

 -unknown  N  XLE @(NOUN %stem);
           A  XLE @(ADJ %stem).

%stem stands for whatever actual stem matches -unknown. These entries (actually for N_BASE and A_BASE), when combined by sublexical rules with entries for the transducer-produced tags, can allow parses to go through, and produce the appropriate f-structures.

If you want to distinguish stems based on their case (upper or lower case), with most morphologies, you can use the tags produced by the morphology to produce different analyses based on the case. For instance, in sentence-initial position Bill might be analyzed as bill +N +Sg and Bill +PN +Sg. The different tags can then be used by a sub-lexical rule to pick out different lexical entries for the stem. Here is an example of what might be used:

N --> { N_BASE N_SFX_BASE N_SFX_BASE*
|PN_BASE PN_SFX_BASE N_SFX_BASE*}.

+N N_SFX XLE.

+PN PN_SFX XLE.

+Sg N_SFX XLE.

-unknown N  XLE @(COMMON-NOUN %stem);
         PN XLE @(PROPER-NOUN %stem).

There is also a special lexical entry named -token which matches any token (including those that have an explicit entry in the lexicon). The -token entry is useful for robust parsing, where you may want to skip over one or more tokens in order to stitch together a set of fragments.

-token TOKEN * (^ TOKEN)=%stem.

Basic lookup model

A grammar may have several lexicon files. These may reflect how reliable the source is: for example, hand-constructed entries, entries from a machine-readable dictionary, entries culled from a corpus. They may reflect different categories: for example, a verb lexicon, a noun lexicon, a lexicon of closed-class items. To use multiple lexicons, it is necessary to specify what happens when a word is found in more than one lexicon, e.g., does one entry override the other or is it added to the other. The following sections describe the tools that XLE provides for controlling entries across multiple lexicons.

The grammar CONFIG specifies which lexicons are called in the FILES field and which lexicon sections are given priority in the LEXENTRIES field. Given a configuration specification

  LEXENTRIES lex_1 lex_2 ... lex_n.
we can describe the lexicon lookup model, for lexicons containing only the standard entries above, as
  

Set EffectiveEntry = the -unknown entry.
For lex_1 .. lex_n, in that order
If there is an entry for abc in lex_i replace the entire
EffectiveEntry by the new entry.
Use final EffectiveEntry.

The final EffectiveEntry is the one used by XLE.
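The basic lookup model can be written out as a short Python sketch. Lexicon sections are modelled as dictionaries from headword to entry, listed in LEXENTRIES order; basic_lookup and the sample entry strings are hypothetical.

```python
def basic_lookup(word, lexicons, unknown_entry):
    """Return the effective entry for `word`.

    `lexicons` is the list lex_1 .. lex_n in LEXENTRIES order, so a later
    section overrides an earlier one ("last first" precedence).
    """
    effective = unknown_entry
    for lexicon in lexicons:
        if word in lexicon:
            effective = lexicon[word]  # replace the entire effective entry
    return effective


standard = {"bill": "N XLE @(COUNT-NOUN %stem)."}
tractor = {"bill": "V XLE @(TRANS %stem)."}
unknown = "N XLE @(NOUN %stem)."

print(basic_lookup("bill", [standard, tractor], unknown))
print(basic_lookup("dream", [standard, tractor], unknown))
```

Note that the replacement is wholesale: the noun reading from the earlier section does not survive once the later section supplies its own entry for the same headword.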

Note that a particular lexical entry can only appear once per lexical section, even if the parts of speech don't overlap. XLE will warn you if you try to load a lexicon with multiple entries for a given word. If you want to have more than one entry for a lexeme, you need to put them in different sections (e.g. (NOUN ENGLISH) and (VERB ENGLISH)) or use a single complex entry. For example:

bill V XLE @(TRANS %stem);
     N XLE @(COUNT-NOUN %stem).

Edit entries and augmented lookup model

Lexicon "edit" entries are supplied to allow greater flexibility in combining lexicons and in the use of -unknown. In an edit entry, each category is (and must be) prefixed by an operator. In addition, any lexicon entry (edit or non-edit) may be given a disposition of ONLY or ETC, with ONLY as the default. This gives the following alternatives:
  NonEdit Entry      Edit Entry
  abc C1 M1 ...;     abc +|-|!|=C1 M1 ...;
      C2 M2 ...;         +|-|!|=C2 M2 ...;
      ....               ....
      ; (ONLY/ETC).      ; (ONLY/ETC).
For example, an edit entry might look like:

bill !V XLE @(TRANS %stem);
     -N XLE; ETC.


More examples with explanations are provided below.

With these entries, and a configuration specification
  LEXICONFILES lex_1 lex_2...lex_n.
the full lookup model can be specified as:
  1. Set EffectiveEntry to the (effective) -unknown entry 
  2. For lex_1, lex_2, to lex_n, in that order 
    1. If the lexicon contains an edit entry for abc with part of speech C, for each subentry
    2. and then,
Treat non-edit entries for abc like edit entries with ! operators and a disposition of ONLY. That is, the current non-edit entry for abc will replace any previous (and hence lower priority) entry. For clarity, it is generally advisable to use either all edit entries or all non-edit entries.

Examples

In this section, we step through several examples to show how the edit entries work. First we assume that we have the -unknown entry:
  -unknown  ADJ  XLE @(ADJ %stem);
            ADV  XLE @(ADV %stem);
            ONLY.

This entry states that any form that does not have an entry elsewhere can either be an adjective (ADJ) or an adverb (ADV). Whether these forms are realized in a parse will depend on whether XLE can build a sublexical rule based on the tags provided by the morphology. In general, you should only put entries that have predictable subcategorization frames in the -unknown entry (e.g., adjectives, adverbs, and nouns, but not verbs) because there is no way of specifying which specific stem/word is matched. A verb entry for -unknown might be included with some basic frames (e.g., transitive or intransitive) if in addition to a verb lexicon the grammar also used a guesser for unknown verbs.

Note that all subentries (op)POS XLE in these examples are really for categories POS_BASE, because of the XLE morph-code.

To add an alternative constraint for the ADJ reading.

   abc  +ADJ  XLE @(ADJ-OBL %stem on);
        ETC.

This states that the stem abc has an additional adjective entry in which it takes an oblique on phrase.

To cancel the ADV reading

   abc  -ADV XLE;
        ETC.
The - operator must be accompanied by a part of speech category and a morph-code (* or XLE). This example states that any previously existing entry for abc as an adverb (ADV) known to the morphology (XLE) has been removed. This operator is often used when removing unwanted lexical items from the morphology (e.g., either mistakes or extremely rare words that are interfering with the parser).

To do both of the above. That is, to add an adjective (ADJ) reading while removing the adverb (ADV) one (and any other parts of speech that may have existed from previous entries).
   abc  +ADJ  XLE @(ADJ-OBL %stem on);
        ONLY.

To add a noun reading, and replace the ADJ constraints.

   abc  !N    XLE @(NOUN %stem);
        !ADJ  XLE @(ADJ-OBL %stem on);
        ETC.

The ! indicates that the current noun (N) and adjective (ADJ) entries replace any such entries from previous lexicons. The ETC states that entries for other parts of speech, such as adverbs or verbs, are still valid. Note that the ! is in fact optional because it is the default edit operator. However, for clarity it is recommended that you include the ! operator.

To retain only the adjective interpretation from the -unknown entry.

   abc =ADJ XLE; ONLY.
The retain operator only makes sense with a disposition of ONLY (the default); if a disposition of ETC is given, the retain operator is effectively purposeless.

The retain operator is used when you don't know what other parts of speech there may be entries for. To remove just a particular part of speech, you should use the - operator.
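The behaviour of the four operators and the two dispositions in the examples of this section can be modelled with a Python sketch. The semantics below are inferred from those examples only and are an assumption, not a specification of XLE; apply_edit_entry and the dictionary representation (category and morph-code mapped to a list of alternative constraints) are hypothetical.

```python
def apply_edit_entry(effective, subentries, disposition="ONLY"):
    """Apply one edit entry to the effective entry from lower-priority lexicons.

    `effective` maps (category, morphcode) to a list of alternative constraints.
    `subentries` is a list of (operator, key, constraint) triples.
    """
    # ETC keeps readings not mentioned in this entry; ONLY drops them.
    result = dict(effective) if disposition == "ETC" else {}
    for operator, key, constraint in subentries:
        previous = list(effective.get(key, []))
        if operator == "+":      # add an alternative to the reading
            result[key] = previous + [constraint]
        elif operator == "!":    # replace the reading
            result[key] = [constraint]
        elif operator == "=":    # retain the existing reading, if any
            if key in effective:
                result[key] = previous
        elif operator == "-":    # cancel the reading
            result.pop(key, None)
        else:
            raise ValueError(f"unknown operator: {operator!r}")
    return result


ADJ, ADV, N = ("ADJ", "XLE"), ("ADV", "XLE"), ("N", "XLE")
unknown = {ADJ: ["@(ADJ %stem)"], ADV: ["@(ADV %stem)"]}

# abc +ADJ XLE @(ADJ-OBL %stem on); ONLY.
print(apply_edit_entry(unknown, [("+", ADJ, "@(ADJ-OBL %stem on)")], "ONLY"))
# abc =ADJ XLE; ONLY.
print(apply_edit_entry(unknown, [("=", ADJ, None)], "ONLY"))
```

Under this model a non-edit entry corresponds to calling the function with ! on every subentry and the default disposition.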

Overlay Grammars

Sometimes a grammar writer will want to use a modified version of an existing grammar, but won't want to modify the existing grammar. This can happen if the existing grammar is maintained by someone else. If the grammar writer modifies the existing grammar, then new modifications will have to be made whenever the existing grammar is improved. One solution to this problem is to create an overlay grammar.

An overlay grammar is a grammar whose config file includes a BASECONFIGFILE. The base config file points to the existing grammar. The overlay grammar inherits all of the config entries of the base config file. It can also add to and subtract from the config entries. In addition, it can have new lexicon entries, new rule entries, and a new performance vars file. The new lexicon entries can modify the existing lexicon entries using edit entries. The new rules can modify the existing rules using edit rules. The morphology can either be replaced by the overlay grammar or it can be modified in the performance vars file.

Edit Rules

Just as edit entries allow the grammar writer to edit a previous lexical entry, edit rules allow a rule to edit a previous rule. The notation is similar to the notation for edit entries. If a rule name is preceded by +, -, &, or !, then the rule will be combined with the previous version of this rule based on the operator:

For example, the following rules:

  YP --> AP BP.

  +YP --> CP DP.
are equivalent to:
  YP --> {AP BP | CP DP}.

Morphology Commands

There are a limited number of modifications that can be made to the morphology via commands in the performance vars file: