Our goals for the Cyc-NL system include both understanding and generation. In the first case, we want to translate natural language texts into CycL, Cyc's internal representation language (see The Syntax of CycL for more information). In the second case, we want to provide natural language translations for CycL expressions. Building upon these two capabilities, Cyc and the Cyc-NL system can be applied to a wide array of tasks, including document indexing and retrieval, database querying, machine translation, enhanced speech recognition, and so on.
Large-scale natural language processing (NLP) is a famously difficult undertaking, for many reasons. Syntactic parsing, lexical representation, semantic interpretation, pragmatic processing, and discourse management, among other tasks, would be required to produce a truly functional natural-language dialogue system. Researchers have made important strides in many of these areas, but the language technology available today does not come close to mimicking conversation between human beings. See, for example, Ron Cole, Joseph Mariani, Hans Uszkoreit, Giovanni Batista Varile, Annie Zaenen, Antonio Zampolli, and Victor Zue (eds.), "Survey of the State of the Art in Human Language Technology" (1997), for a discussion of the challenges presented by various aspects of natural language processing and the level of performance that existing NL systems can achieve.
We believe that any application relying on large-scale natural language processing would benefit from a broad and deep repository of commonsense knowledge such as Cyc. Here are a number of representative linguistic problems which NL systems have been called upon to solve in the past few decades, and for which, we claim, an adequate solution requires an NL system to work in tandem with a knowledge base like Cyc:
Lexical Ambiguity and Polysemy
The majority of the most common words in English have multiple
meanings: "bat", "bank", "table", "can", "will", etc. In applications
such as machine translation and document indexing/retrieval, it is
crucial to be able to figure out which meaning of an ambiguous word is
intended. For example, if a user queries a text database for "bats and
other small mammals", a standard Boolean search engine will also
deliver documents about baseball bats, even though it is obvious to
any reasonable human that this is not what the querier intended. The
background knowledge in Cyc can be used in attempting to choose the
most appropriate meaning of a word in context.
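The filtering strategy described above can be sketched in a few lines of Python. The mini-taxonomy, sense inventory, and function names below are invented for illustration; they are not part of Cyc's actual API.

```python
# Hypothetical mini-taxonomy: each concept maps to its generalizations.
TAXONOMY = {
    "Bat-Mammal": {"Mammal", "Animal"},
    "BaseballBat": {"SportsEquipment", "Artifact"},
}

# Hypothetical sense inventory for an ambiguous word.
SENSES = {"bat": ["Bat-Mammal", "BaseballBat"]}

def senses_matching(word, required_type):
    """Keep only the senses whose concept generalizes to the required type."""
    return [concept for concept in SENSES[word]
            if required_type in TAXONOMY[concept]]

# A query for "bats and other small mammals" constrains "bat" to a Mammal
# reading, so the baseball-bat sense is filtered out.
print(senses_matching("bat", "Mammal"))  # -> ['Bat-Mammal']
```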
Syntactic Ambiguity
Syntactic ambiguity occurs when an input string has more than one
possible syntactic structure. For example, in the sentence "Fred
brought German beer and potato chips to the party", syntactic
information alone cannot tell us whether both the potato chips and the
beer were German, or whether just the beer was German. As with lexical
ambiguity, NL applications need to be able to select the intended
structure. Another pervasive type of syntactic ambiguity is
prepositional phrase attachment. In the sentence "John washed the
dishes on the table", the syntax alone allows two possibilities: John
washed the dishes which are now on the table, or John did the
dishwashing on the table. A system with access to a repository of
commonsense knowledge like Cyc would have an advantage over
non-knowledge-based systems in ruling out the second interpretation in
a principled manner. Cyc knows that dishwashing typically occurs in
sinks or dishwashers, not on tables, and this information could be
used to select the appropriate syntactic structure underlying the
sentence.
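One way to picture this pruning is as a rough Python sketch, with invented facts standing in for Cyc's knowledge: the verb-attachment reading survives only when background knowledge says the event typically occurs at that location.

```python
# Hypothetical typical-location facts (not Cyc's actual representation).
TYPICAL_LOCATIONS = {"dishwashing": {"sink", "dishwasher"}}

def plausible_attachments(verb_event, noun, location):
    """Return the PP-attachment readings that survive commonsense pruning."""
    readings = []
    # Reading 1: the PP modifies the noun ("the dishes on the table").
    readings.append(("noun-attachment", f"{noun} located on {location}"))
    # Reading 2: the PP modifies the verb -- plausible only if the event
    # typically occurs at that location.
    if location in TYPICAL_LOCATIONS.get(verb_event, set()):
        readings.append(("verb-attachment", f"{verb_event} occurs on {location}"))
    return readings

# "John washed the dishes on the table": only the noun-attachment survives.
print(plausible_attachments("dishwashing", "dishes", "table"))
```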
Coreference Resolution
In interpreting text, new pieces of information must be integrated
with what has been mentioned before. A pronoun may be used to refer
back to something already in the universe of discourse. For example,
in "Fred went to the store, bought a candy bar with his credit card,
walked to the park, and ate it", the pronoun "it" refers back to
"candy bar". Syntactically, the potential referents of "it" are the
nouns "store", "candy bar", "credit card", and "park". Cyc has rules
from which it can infer that parks, stores, and credit cards aren't
things which are normally eaten, and can thus help in determining that
"candy bar" is the only referent for "it" which makes sense in this
example.
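In the simplest case, a knowledge-based filter of this sort reduces to checking each candidate referent against a selectional constraint. The sets and names below are invented for illustration:

```python
# Hypothetical set of things the KB regards as normally edible.
EDIBLE = {"candy bar"}

def resolve_pronoun(candidates, selectional_constraint):
    """Keep only the candidate referents satisfying the constraint."""
    return [c for c in candidates if c in selectional_constraint]

# Syntactic candidates for "it" in the sentence about Fred; only one is
# something that is normally eaten.
candidates = ["store", "candy bar", "credit card", "park"]
print(resolve_pronoun(candidates, EDIBLE))  # -> ['candy bar']
```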
The lexicon represents two basic kinds of information about words: syntactic and semantic.
Representing Words In The Lexicon: Syntactic Information
Information in the Cyc lexicon centers around "word units", which are
actually instances of the collection #$EnglishWord. Each
word unit represents information about a root word in English. That
information is organized by part of speech and word sense. For
example, the word "bat" is represented by the Cyc constant
#$Bat-TheWord. "Bat" can be a noun; as a noun, it has (at least) three
word senses. "Bat" can also be a verb; as a verb, it has (at least)
three word senses.
Syntactically, the crucial information about a word includes its part of speech, and the features it carries (plural, past tense, etc.). This information is recorded using predicates which are instances of the collection #$NLSyntacticPredicate. For each of the main parts of speech, there is a set of corresponding predicates. Each predicate within a set indicates a distinctive feature for members of that part-of-speech (pos) category. For example, the predicates #$singular and #$plural are used to represent information about nouns. The former indicates a singular form, and the latter indicates a plural form. Predicates like #$singular and #$plural relate a Cyc word unit to a character string. From these assertions, forward inference derives assertions of the form (#$posForms [word-unit] [pos]), where [pos] is an instance of the collection #$SpeechPart. #$SpeechPart contains all parts of speech recognized by the Cyc-NL parser. Example uses of some of the syntactic predicates, and the #$posForms statements which they trigger, follow.
(#$singular #$Dog-TheWord "dog")
(#$plural #$Child-TheWord "children")
(#$posForms #$Dog-TheWord #$SimpleNoun)
(#$posForms #$Child-TheWord #$SimpleNoun)
Other lexical predicates relate Cyc terms directly to naming strings:
(#$givenNames #$FredSmith "Frederick")
(#$familyName #$FredSmith "Smith")
(#$nicknames #$FredSmith "Freddie")
(#$acronymString #$UNICEF "UNICEF")
(#$initialismString #$YMCA-Organization "YMCA")
(#$PlaceName-LongForm #$Cambodia "the People's Republic of Kampuchea")
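The forward derivation of #$posForms assertions described above can be sketched as follows. The predicate-to-part-of-speech table is illustrative, not Cyc's actual rule set.

```python
# Illustrative mapping from syntactic predicate to the part of speech
# it implies for the word unit in its first argument.
PREDICATE_POS = {
    "singular": "SimpleNoun",
    "plural": "SimpleNoun",
    "infinitive": "Verb",
}

def derive_pos_forms(assertions):
    """Forward-derive (posForms word pos) facts from (predicate word string)
    assertions, mirroring the inference described in the text."""
    derived = set()
    for predicate, word_unit, _string in assertions:
        pos = PREDICATE_POS.get(predicate)
        if pos:
            derived.add(("posForms", word_unit, pos))
    return derived

facts = [("singular", "Dog-TheWord", "dog"),
         ("plural", "Child-TheWord", "children")]
print(sorted(derive_pos_forms(facts)))
```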
Representing Words In The Lexicon: Semantic Information
The semantic information contained in the Cyc lexicon forms the core of
our NL system. The power of Cyc-NL comes from
the links between word units and Cyc concepts. Cyc provides a clean
semantics onto which the words of the Cyc lexicon can be mapped.
The most basic word-to-concept link is expressed with the predicate
#$denotation. Here are a few example #$denotation assertions:
(#$denotation #$Bat-TheWord #$SimpleNoun 0 #$Bat-Mammal)
(#$denotation #$Bat-TheWord #$SimpleNoun 1 #$BaseballBat)
(#$denotation #$Bat-TheWord #$Verb 0 #$BaseballBatting)
(#$denotation #$Red-TheWord #$Adjective 0 #$RedColor)
(#$denotation #$Walk-TheWord #$Verb 0 #$AnimalWalkingProcess)
The first argument to #$denotation is an instance of #$LexicalWord. The
second argument is an instance of #$SpeechPart. The third argument is an
integer representing the word sense number, and the fourth argument is a
Cyc constant. The first assertion above, for example, means that word
sense number 0 of the noun form of "bat" denotes the Cyc concept
#$Bat-Mammal. "Bat" as a noun has another word sense, which denotes
#$BaseballBat. It should be noted here that the word sense numbers do
not indicate frequency of occurrence of a particular word sense; they
simply act as unique identifiers of word senses.
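Operationally, a set of #$denotation assertions behaves like a table keyed by word unit, part of speech, and sense number. A minimal sketch in plain Python (not Cyc's API), mirroring the example assertions above:

```python
# Denotation table keyed by (word unit, part of speech, sense number).
DENOTATIONS = {
    ("Bat-TheWord", "SimpleNoun", 0): "Bat-Mammal",
    ("Bat-TheWord", "SimpleNoun", 1): "BaseballBat",
    ("Bat-TheWord", "Verb", 0): "BaseballBatting",
}

def denotation(word, pos, sense):
    """Look up the concept denoted by a given word sense, if any."""
    return DENOTATIONS.get((word, pos, sense))

def senses_of(word, pos):
    """All concepts a word denotes under a given part of speech."""
    return [concept for (w, p, _s), concept in sorted(DENOTATIONS.items())
            if w == word and p == pos]

print(denotation("Bat-TheWord", "SimpleNoun", 1))  # -> 'BaseballBat'
print(senses_of("Bat-TheWord", "SimpleNoun"))      # -> ['Bat-Mammal', 'BaseballBat']
```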
Another useful semantic predicate is #$denotationRelatedTo. It is used in cases where an exact mapping between a word and a Cyc concept is not available, but where you want to indicate that there is some relation between the two. For example, "shuffling", "perambulating", and "striding" are all types of walking, but currently there are no distinct Cyc concepts representing each of these forms. In such cases, we can assert
(#$denotationRelatedTo #$Shuffle-TheWord #$Verb 0 #$AnimalWalkingProcess)
(#$denotationRelatedTo #$Perambulate-TheWord #$Verb 0 #$AnimalWalkingProcess)
(#$denotationRelatedTo #$Stride-TheWord #$Verb 0 #$AnimalWalkingProcess)
Besides the #$denotation links, the Cyc lexicon also provides more precise mappings between words and phrases and Cyc concepts.
For many nouns, the #$denotation link
is all that is needed to specify the meaning of a word in CycL. In
some cases, though, the meaning of a noun is a more complicated Cyc
formula. Here, we use the predicate #$nounSemTrans
to express that relation. Here are a couple of examples:
(#$nounSemTrans #$Bachelor-TheWord 0
  (#$and (#$isa :NOUN #$AdultMalePerson)
         (#$maritalStatus :NOUN #$Single)))
(#$nounSemTrans #$Barmaid-TheWord 0
  (#$and (#$isa :NOUN #$Bartender)
         (#$isa :NOUN #$FemalePerson)))
The first rule states that the word "bachelor" can be used to refer to an unmarried adult male. The
second rule states that the word "barmaid" can be used to refer to a female bartender.
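Interpreting a #$nounSemTrans assertion amounts to substituting the noun's discourse referent for the :NOUN placeholder. Here is a sketch that represents CycL formulas as nested Python tuples; the encoding is illustrative, not Cyc's own.

```python
def instantiate(template, bindings):
    """Recursively replace keyword placeholders in a nested-tuple formula."""
    if isinstance(template, tuple):
        return tuple(instantiate(part, bindings) for part in template)
    return bindings.get(template, template)

# The "bachelor" template from above, with :NOUN as the placeholder.
bachelor = ("and",
            ("isa", ":NOUN", "AdultMalePerson"),
            ("maritalStatus", ":NOUN", "Single"))

# Bind :NOUN to a variable standing for the noun's referent.
print(instantiate(bachelor, {":NOUN": "?X"}))
```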
For verbs, adjectives, and adverbs, simple denotation rules are
provided, but more precise mapping templates are required as well. Verbs
act as the "glue" in a sentence: they display different argument
patterns, and it is important to specify how those argument positions
are filled. Consider these examples:
John bores Fred.
John likes Fred.
John gave Fred a car.
John wanted Fred to leave.
In the first sentence, we must capture the fact that it is the direct
object, Fred, who is experiencing the boredom. The second sentence looks
similar on the surface, but here it is the subject, John, who is
experiencing the liking. The last two sentences demonstrate
verb-argument patterns other than the simple transitive structure seen
in the first two sentences.
Mappings for verbs are given by assertions using the predicate
#$verbSemTrans. Here are a few examples:
(#$verbSemTrans #$Eat-TheWord 0 #$TransitiveNPCompFrame
  (#$and (#$isa :ACTION #$EatingEvent)
         (#$doneBy :ACTION :SUBJECT)
         (#$inputsDestroyed :ACTION :OBJECT)))
(#$verbSemTrans #$Feed-TheWord 0 #$DitransitiveNPCompFrame
  (#$and (#$isa :ACTION #$FeedingEvent)
         (#$fromPossessor :ACTION :SUBJECT)
         (#$objectOfPossessionTransfer :ACTION :OBJECT)
         (#$toPossessor :ACTION :INDIRECT-OBJECT)))
(#$verbSemTrans #$Like-TheWord 0 #$TransitiveNPCompFrame
  (#$likesObject :SUBJECT :OBJECT))
The first rule provides a template for interpreting "eat" when used
transitively. The rule gives a CycL translation, with keyword variables
as "placeholders" to be filled by the verb's syntactic arguments. In
this rule, slots are reserved for the syntactic subject and direct object. The
second rule gives a translation for "feed" when used ditransitively, as
in "I fed the horse an apple." In addition to subject and object slots,
this rule also provides for an indirect object slot. The third rule
demonstrates that verbs may refer to constants which are not #$Events. Many
verbs have "stative" meanings; they denote not an action, but a state of
affairs that holds. In such cases, the verb typically translates into a Cyc
predicate. Here, transitive "like" maps onto the predicate
#$likesObject. Each of these rules mentions the appropriate #$SubcategorizationFrame in which the given translation holds.
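Applying a #$verbSemTrans template can be sketched the same way: check that the subcategorization frame's slots are all supplied, then substitute them into the translation. The frame names mirror the examples above; the nested-tuple encoding and slot table are illustrative, not Cyc's internals.

```python
# Slots each subcategorization frame expects (illustrative).
FRAME_SLOTS = {
    "TransitiveNPCompFrame": {":SUBJECT", ":OBJECT"},
    "DitransitiveNPCompFrame": {":SUBJECT", ":OBJECT", ":INDIRECT-OBJECT"},
}

def apply_verb_template(frame, template, bindings):
    """Fill a verb translation template, requiring all of the frame's slots."""
    missing = FRAME_SLOTS[frame] - set(bindings)
    if missing:
        raise ValueError(f"frame {frame} is missing slots: {sorted(missing)}")
    def subst(part):
        if isinstance(part, tuple):
            return tuple(subst(p) for p in part)
        return bindings.get(part, part)
    return subst(template)

# The transitive "eat" template from above.
eat = ("and",
       ("isa", ":ACTION", "EatingEvent"),
       ("doneBy", ":ACTION", ":SUBJECT"),
       ("inputsDestroyed", ":ACTION", ":OBJECT"))

print(apply_verb_template("TransitiveNPCompFrame", eat,
                          {":ACTION": "?E", ":SUBJECT": "John", ":OBJECT": "Pizza"}))
```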
Finally, Cyc-NL has mechanisms for handling phrasal structures. See #$multiWordString and #$compoundString for information on how to handle multi-word terms like "swimming pool" or "attorney general".
For nouns, the plural is the only inflectional variant. A noun is assumed to have a regular plural variant if the plural is formed by adding "-s" or "-es" to the singular. So, "lunch" and "dog" have regular plurals, while "child" and "deer" do not.
For verbs, the infinitive is the root form, while the inflected forms are the gerund, the third person singular, the past tense, and the perfect. The gerund form is regular if it is formed by adding "-ing" to the infinitive (whether or not a consonant is doubled; so, both "singing" and "mapping" are considered regular forms). The third person singular form is regular if it is formed by adding "-s" or "-es" to the infinitive. The past tense and perfect forms are regular if they are formed by adding either "-d", "-ied", or "-ed" to the infinitive. So, "baked", "hurried", and "washed" are regular, but "eaten" and "went" are not.
Regularly inflected verb and noun forms need not be entered in the lexicon. Therefore, all uses of #$plural, #$gerund, #$pastTense, #$thirdPersonSg, and #$perfect should involve irregular forms.
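The regularity conventions above can be sketched as simple string rules. This is a deliberately simplified rendering of the conventions as stated, not a full treatment of English morphology:

```python
def regular_plural(noun):
    """Regular plural: "-es" after sibilant endings, otherwise "-s".
    (Simplified; real English has further cases.)"""
    if noun.endswith(("s", "sh", "ch", "x", "z")):
        return noun + "es"
    return noun + "s"

def regular_past(verb):
    """Regular past/perfect: "-d" after "e", "-ied" after consonant + "y",
    otherwise "-ed"."""
    if verb.endswith("e"):
        return verb + "d"
    if verb.endswith("y") and verb[-2] not in "aeiou":
        return verb[:-1] + "ied"
    return verb + "ed"

print(regular_plural("lunch"))  # -> 'lunches'
print(regular_plural("dog"))    # -> 'dogs'
print(regular_past("bake"))     # -> 'baked'
print(regular_past("hurry"))    # -> 'hurried'
```

Words whose actual forms differ from these outputs ("children", "eaten", "went") are exactly the ones that need explicit lexicon entries.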
For derivational morphological variants, instances of #$DerivedWordFormingFunction can be used to create new lexical items. For example, a lexical entry for "unhappy" can be composed as follows:
(#$WordWithPrefixFn #$Un_Neg-ThePrefix #$Happy-TheWord)
If the appropriate rules are stated using #$baseForm, #$posBaseForm, and #$generalSemantics, important syntactic and semantic information will be inferred about the derived word.
It should be noted that these derivational morphology functions are quite new and have not yet been fully implemented. Therefore, you will find word units of both forms in the lexicon: compositional word units like
(#$WordWithPrefixFn #$Un_Neg-ThePrefix #$Happy-TheWord)
as well as atomic word units like
#$Unconscious-TheWord
The Text Processor guides the NLU process. It controls the application of the various parsing subcomponents, using a heuristic best-first search mechanism that has information about the individual parsers, their applicability to coarse syntactic categories, cost, expected number of children, and so on. This information is used to perform a syntax-driven search over the parse space, applying relevant parsers to the sub-constituents until all are resolved, or until the parsing options have been exhausted. The parsers at the disposal of the Text Processor are the Template parser, the Noun Compound parser, and the Phrase Structure parser.
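The scheduling idea can be pictured as a small best-first sketch over an agenda of parsers. The parser names, costs, and category sets below are invented placeholders; the real Text Processor's heuristics also weigh expected number of children and other information.

```python
import heapq

# Illustrative parser inventory: (cost, name, categories it applies to).
PARSERS = [
    (1, "template", {"sentence"}),
    (2, "noun-compound", {"noun-phrase"}),
    (3, "phrase-structure", {"sentence", "noun-phrase", "verb-phrase"}),
]

def schedule(constituent_category):
    """Return the applicable parsers for a constituent, cheapest first."""
    heap = [(cost, name) for cost, name, categories in PARSERS
            if constituent_category in categories]
    heapq.heapify(heap)
    order = []
    while heap:
        order.append(heapq.heappop(heap)[1])
    return order

print(schedule("noun-phrase"))  # -> ['noun-compound', 'phrase-structure']
```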
The Template parser is essentially a string-matching mechanism driven by a set of templates compiled into an efficient internal format. These templates, like those used for generation, employ a simple format so that users can add templates as they are entering new knowledge into the system. The template parser is relatively fast, but is of limited flexibility. It tabulates semantic constraints during a parse, but does not attempt to verify them; that task is passed along to the next processing layer.
The Noun Compound parser uses a set of semantic templates combined with a generic chart-parsing approach to construct representations for noun compounds such as ``anthrax vaccine stockpile''. Unlike other parsing components, it makes heavy use of the Cyc ontology, and can therefore resolve many ambiguities that are impossible to handle on a purely syntactic level (e.g. ``Mozart symphonies'' vs. ``Mozart expert''). See #$NounCompoundRule for examples of rules used by the Noun Compound parser.
The Phrase Structure parser takes a similar bottom-up approach to constructing parses. After completing a syntactic parse, it uses semantic constraints gleaned from the KB to perform pruning and to build the semantic representation. Specialized sub-parsers are used to parse noun phrases and verb phrases; resulting constituent parses are combined to produce a complete semantic translation.
The Cyc NLG system allows users to view any rule or statement in Cyc as an English sentence instead of a CycL formula. In order to see the English version of a CycL statement, click on the ball next to the rule, and then, on the next page, click [Show English]. You can also choose to see all assertions in English as the default, by going to the "Options" menu and selecting "Show assertions in English".