well covered in the Stanford parser's training corpus); but the Stanford parser may be preferable in other respects: for instance, it is more easily generalizable to languages beyond English (for a language with structure fairly similar to English, one just has to supply a new marked-up training corpus, whereas porting RelEx rules to other languages requires more effort). An unsupervised corpus-based learning system like the one to be described in Chapter ?? is a little more distinct from rule-based systems, in that it is based on inducing patterns from natural rather than specially prepared data. But still, it is learning language as a phenomenon unto itself, rather than learning language as part and parcel of a system's overall experience in the world. The key distinction to be made, in our view, is between language systems that learn language in a social and physical context, versus those that deal with language in isolation. Dealing with language in context immediately changes the way the linguistics problem appears (to the AI system, and also to the researcher), and makes hand-coded rules and hand-tagged corpuses less viable, shifting attention toward experiential learning based approaches. Ultimately we believe that the "right" way to teach an AGI system language is via semi-supervised learning in a socially and physically embodied context. That is: talk to the system, and have it learn both from your reinforcement signals and from unsupervised analysis of the dialogue. However, we believe that other modes of teaching NLP systems can also contribute, especially if used in support of a system that also does semi-supervised learning based on embodied interactive dialogue. Finally, a note on one aspect of language comprehension that we don't deal with here: we deal only with text processing, not speech understanding or generation.
A CogPrime approach to speech would be quite feasible to develop, for instance using neural-symbolic hybridization with DeSTIN or a similar perceptual-motor hierarchy. However, this potential aspect of CogPrime has not been pursued in detail yet, and we won't devote space to it here.

44.2 Linguistic Atom Types

Explicit representation of linguistic knowledge in terms of Atoms is not a deep issue, more of a "plumbing" type of issue, but it must be dealt with before moving on to subtler aspects. In principle, for dealing with linguistic information coming in through ASCII, all we need besides the generic CogPrime structures and dynamics are two node types and one relationship type:

• CharacterNode
• CharacterInstanceNode
• a relationship concat denoting an externally-observed list of items

Sequences of characters may then be represented in terms of lists and the concat schema. For instance the word "pig" is represented by the list

concat(#p, #i, #g)

The concat operator can be used to help define special NL atom types, such as:

• MorphemeNode / MorphemeInstanceNode
• WordNode / WordInstanceNode
• PhraseNode / PhraseInstanceNode
• SentenceNode / SentenceInstanceNode
• UtteranceNode / UtteranceInstanceNode
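As a concrete illustration, the character-and-word representation above can be mimicked with a toy Atom structure. This is a sketch only: the Atom class here is an illustrative stand-in, not the real OpenCog API, and grounding a WordNode in a concat of CharacterNodes is a simplification of how the actual system stores words.

```python
# Toy sketch of the character/word representation described above.
# The Atom class is an illustrative stand-in, not the real OpenCog API.

class Atom:
    def __init__(self, atom_type, name=None, out=()):
        self.type, self.name, self.out = atom_type, name, tuple(out)

    def __repr__(self):
        if self.out:  # a link: show its outgoing set
            return f"{self.type}({', '.join(map(repr, self.out))})"
        return f"{self.type}#{self.name}"  # a node: show its name

def concat(*items):
    # The concat relationship: an externally observed list of items.
    return Atom("concat", out=items)

def word(s):
    # Ground a WordNode in the concat of its CharacterNodes.
    chars = [Atom("CharacterNode", c) for c in s]
    return Atom("WordNode", s), concat(*chars)

w, grounding = word("pig")
print(grounding)  # concat(CharacterNode#p, CharacterNode#i, CharacterNode#g)
```

The same pattern extends upward: a SentenceNode can be grounded in the concat of its WordInstanceNodes, and so on up the hierarchy of linguistic atom types listed above.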
44 Natural Language Comprehension

44.3 The Comprehension and Generation Pipelines

Exactly how the "comprehension pipeline" is broken down into component transformations depends on one's linguistic theory of choice. The approach taken in OpenCogPrime's engineered NLP framework, in use from 2008-2012, looked like:

Text --> Tokenizer --> Link Parser --> Syntactico-Semantic Relationship Extractor (RelEx) --> Semantic Relationship Extractor (RelEx2Frame) --> Semantic Nodes & Links

In 2012-13, a new approach has been undertaken, which simplifies things a little and looks like:

Text --> Tokenizer --> Link Parser --> Syntactico-Semantic Relationship Extractor (Syn2Sem) --> Semantic Nodes & Links

Note that many other variants of the NL pipeline include a "tagging" stage, which assigns part-of-speech tags to words based on the words occurring around them. In our current approach, tagging is essentially subsumed within parsing; the choice of a POS (part-of-speech) tag for a word instance is carried out within the link parser. However, it may still be valuable to derive information about likely POS tags for word instances from other techniques, and to use this information within a link parsing framework by allowing it to bias the probabilities used in the parsing process. None of the processes in this pipeline are terribly difficult to carry out, if one is willing to use hand-coded rules within each step, or to derive rules via supervised learning, to govern their operation.
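The 2012-13 pipeline can be sketched as a straight composition of stages. The stage bodies below are toy stubs (the real tokenizer, link parser and Syn2Sem are substantial external components, and the hard-wired links for the example sentence are an assumption made purely for illustration), but they show the shape of the data flow:

```python
# Stub pipeline mirroring: Text --> Tokenizer --> Link Parser --> Syn2Sem.
# All three stage bodies are toy stand-ins for the real components.

def tokenize(text):
    return text.lower().rstrip(".").split()

def link_parse(tokens):
    # The real link parser returns typed links between word pairs; here we
    # hard-wire the links for the running example sentence only.
    links = []
    if tokens == ["the", "cat", "chased", "a", "snake"]:
        links = [("S", "chased", "cat"), ("O", "chased", "snake")]
    return {"tokens": tokens, "links": links}

def syn2sem(parse):
    # Map syntactic link types to RelEx-style semantic relationships.
    rel_names = {"S": "_subj", "O": "_obj"}
    return [(rel_names[t], head, dep)
            for t, head, dep in parse["links"] if t in rel_names]

def comprehend(text):
    return syn2sem(link_parse(tokenize(text)))

print(comprehend("The cat chased a snake."))
# [('_subj', 'chased', 'cat'), ('_obj', 'chased', 'snake')]
```

The point of the composition is that each stage's output type is the next stage's input type, so stages can be swapped (RelEx/RelEx2Frame versus Syn2Sem) without disturbing the rest of the pipeline.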
The truly tricky aspects of NL comprehension are:

• arriving at the rules used by the various subprocesses, in a way that naturally supports generalization and modification of the rules based on ongoing experience
• allowing semantic understanding to bias the choice of rules in particular contexts
• knowing when to break the rules and be guided by semantic intuition instead

Importing rules straight from linguistic databases results in a system that (like the current RelEx system) is reasonably linguistically savvy on the surface, but lacks the ability to adapt its knowledge effectively based on experience, and has trouble comprehending complex language. Supervised learning based on hand-created corpuses tends to result in rule-bases with similar problems. This doesn't necessarily mean that hand-coding or supervised learning of linguistic rules has no place in an AGI system, but it means that if one uses these methods, one must take extra care to make one's rules modifiable and generalizable based on ongoing experience, because the initial version of one's rules is not going to be good enough.

Generation is the subject of the following chapter, but for comparison we give here a high-level overview of the generation pipeline, which may be conceived as:

1. Content determination: figuring out what needs to be said in a given context
2. Discourse planning: overall organization of the information to be communicated
3. Lexicalization: assigning words to concepts
4. Reference generation: linking words in the generated sentences using pronouns and other kinds of reference
5. Syntactic and morphological realization: the generation of sentences via a process inverse to parsing, representing the information gathered in the above phases
6. Phonological or orthographic realization: turning the above into spoken or written words, complete with timing (in the spoken case), punctuation (in the written case), etc.

In Chapter 46 we explain how this pipeline is realized in OpenCogPrime's current engineered NL generation system.

44.4 Parsing with Link Grammar

Now we proceed to explain some of the details of OpenCogPrime's engineered NL comprehension system. This section gives an overview of link grammar, a key part of the current OpenCog NLP framework, and explains what makes it different from other linguistic formalisms. We emphasize that this particular grammatical formalism is not, in itself, a critical part of the CogPrime design. In fact, it should be quite possible to create and teach a CogPrime AGI system without using any particular grammatical formalism, having it acquire linguistic knowledge in a purely experiential way. However, a great deal of insight into CogPrime-based language processing may be obtained by considering the relevant issues in the concrete detail that the assumption of a specific grammatical formalism provides. This insight is of course useful if one is building a CogPrime that makes use of that particular grammatical formalism, but it's also useful to some degree even if one is building a CogPrime that deals with human language entirely experientially.

This material will be more comprehensible to the reader who has some familiarity with computational linguistics, e.g. with notions such as parts of speech, feature structures, lexicons, dependency grammars, and so forth. Excellent references are [MS99, Jac03]. We will try to keep the discussion relatively elementary, but have opted not to insert a computational linguistics tutorial.
The essential idea of link grammar is that each word comes with a feature structure consisting of a set of typed connectors. Parsing consists of matching up connectors from one word with connectors from another. To understand this in detail, the best course is to consider an example sentence. We will use the following example, drawn from the classic paper "Parsing with a Link Grammar" by Sleator and Temperley [ST93]:

The cat chased a snake

The link grammar parse structure for this sentence is:

    +-Ds-+--Ss--+---Os---+-Ds-+
    |    |      |        |    |
   the  cat  chased      a  snake

In phrase structure grammar terms, this corresponds loosely to

(S (NP The cat) (VP chased (NP a snake)))
but the OpenCog linguistic pipeline makes scant use of this kind of phrase structure rendition (which is fine in this simple example; but in the case of complex sentences, construction of analogous mappings from link parse structures to phrase structure grammar parse trees can be complex and problematic). Currently the hierarchical view is used in OpenCog only within some reference resolution heuristics.

There is a database called the "link grammar dictionary" which contains connectors associated with all common English words. The notation used to describe feature structures in this dictionary is quite simple. Different kinds of connectors are denoted by letters or pairs of letters like S or SX. Then if a word W1 has the connector S+, this means that the word can have an S link coming out to the right side. If a word W2 has the connector S-, this means that the word can have an S link coming out to the left side. In this case, if W1 occurs to the left of W2 in a sentence, then the two words can be joined together with an S link. The features of the words in our example sentence, as given in the S&T paper, are:

Words           Formula
a, the          D+
snake, cat      D- & (O- or S+)
chased          S- & O+

To illustrate the role of syntactic sense disambiguation, we will use alternate formulas for one of the words in the example: the verb sense of "snake." We then have

Words           Formula
a, the          D+
snake_N, cat    D- & (O- or S+)
chased          S- & O+
snake_V         S-

The variables to be used in parsing this sentence are, for each word:

1. the features in the Agreement structure of the word (for any of its senses)
2. the words matching each of the connectors of the word

For example,

1. For "snake," there are features for "word that links to D-", "word that links to O-" and "word that links to S+". There are also features for "tense" and "person".
2. For "the", the only feature is "word that links to D+". No features for Agreement are needed.
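To make the connector mechanics concrete, here is a toy exhaustive parser over the lexicon in the table above. As a simplifying assumption, each word is given a single fixed disjunct ("cat" takes S+ and "snake" takes O-, resolving the (O- or S+) choice by hand), and the planarity and connectivity metarules discussed below are enforced as hard constraints. This is an illustrative sketch, not the real link parser:

```python
from itertools import combinations

# Toy lexicon: one disjunct per word, simplified from the table above
# ("cat" is given S+ and "snake" O-, resolving the (O- or S+) choice).
LEXICON = {
    "the":    ["D+"],
    "cat":    ["D-", "S+"],
    "chased": ["S-", "O+"],
    "a":      ["D+"],
    "snake":  ["D-", "O-"],
}

def candidate_links(words):
    # A link (i, j, T) is possible when word i has T+ and word j has T-.
    cands = []
    for i, j in combinations(range(len(words)), 2):
        for ci in LEXICON[words[i]]:
            for cj in LEXICON[words[j]]:
                if ci.endswith("+") and cj.endswith("-") and ci[:-1] == cj[:-1]:
                    cands.append((i, j, ci[:-1]))
    return cands

def planar(links):
    # No two links may cross: forbid i < k < j < l patterns.
    return not any(a < c < b < d or c < a < d < b
                   for (a, b, _), (c, d, _) in combinations(links, 2))

def connected(links, n):
    adj = {i: set() for i in range(n)}
    for a, b, _ in links:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = set(), [0]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adj[v])
    return len(seen) == n

def satisfied(links, words):
    # Every connector of every word must be used exactly once.
    used = {i: [] for i in range(len(words))}
    for a, b, t in links:
        used[a].append(t + "+")
        used[b].append(t + "-")
    return all(sorted(used[i]) == sorted(LEXICON[w])
               for i, w in enumerate(words))

def parses(words):
    cands = candidate_links(words)
    n = len(words)
    return [sorted(ls) for r in range(n - 1, len(cands) + 1)
            for ls in combinations(cands, r)
            if satisfied(ls, words) and planar(ls) and connected(ls, n)]

print(parses("the cat chased a snake".split()))
# [[(0, 1, 'D'), (1, 2, 'S'), (2, 4, 'O'), (3, 4, 'D')]]
```

The single parse found reproduces the D, S and O links of the diagram above. The real parser is far more efficient (dynamic programming rather than subset enumeration) and handles multiple disjuncts per word, but the constraint structure is the same.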
The nature of linkage imposes constraints on the variable assignments; for instance, if "the" is assigned as the value of the "word that links to D-" feature of "snake", then "snake" must be assigned as the value of the "word that links to D+" feature of "the." The rules of link grammar impose additional constraints: the planarity, connectivity, ordering and exclusion metarules described in Sleator and Temperley's papers. Planarity means that links don't cross, a rule that S&T's parser enforces absolutely, whereas we have found it is probably better to impose it as a probabilistic constraint, since sometimes it's really useful to let links cross (the representation of conjunctions is one example). Connectivity means that the links and words of a sentence must form a connected graph: all the words must be linked into the other words in the sentence via some path. Again, connectivity is a valuable constraint, but in some cases one wants to relax it: if one just can't understand the whole sentence, one may wish to understand at least some parts of it, meaning that one has a disconnected graph whose components are the phrases of the sentence that have been
successfully comprehended. Finally, linguistic transformations may potentially be applied while checking whether these constraints are fulfilled (that is, instead of just checking whether the constraints are fulfilled, one may check whether they are fulfilled after one or more transformations are performed).

We will use the term "Agreement" to refer to "person" values or ordered pairs (tense, person), and NAGR to refer to the number of agreement values (12-40, perhaps, in most realistic linguistic theories). Agreement may be dealt with alongside the connector constraints. For instance, "chased" has the Agreement values (past, third person), and it has the constraint that its S- argument must match the person component of its Agreement structure.

Semantic restrictions may be imposed in the same framework. For instance, it may be known that the subject of "chased" is generally animate. In that case, we'd say

Words           Formula
a, the          D+
snake_N, cat    D- & (O- or S+)
chased          (S- & Inheritance animate <.8>) & O+
snake_V         S-

where we've added the modifier (Inheritance animate) to the S- connector of the verb "chased," to indicate that, with strength .8, the word connecting to this S- connector should denote something inheriting from "animate." In this example, "snake" and "cat" inherit from "animate", so the probabilistic restriction doesn't help the parser any. If the sentence were instead

The snake in the hat chased the car

then the "animate" constraint would tell the parsing process not to start out by trying to connect "hat" to "chased", because the connection is semantically unlikely.

44.4.1 Link Grammar vs. Phrase Structure Grammar

Before proceeding further, it's worth making a couple of observations about the relationship between link grammars and typical phrase structure grammars.
These could also be formulated as observations about the relationship between dependency grammars and phrase structure grammars, but that gets a little more complicated, as there are many kinds of dependency grammars with different properties; for simplicity we will restrict our discussion here to the link grammar that we actually use in OpenCog. Two useful observations may be:

1. Link grammar formulas correspond to grammatical categories. For example, the link structure for "chased" is "S- & O+." In categorial grammar terms, this would seem to mean that " 'chased' belongs to the category of words with link structure 'S- & O+'." In other words, each "formula" in link grammar corresponds to a category of words attached to that formula.
2. Links to words might as well be interpreted as links to phrases headed by those words. For example, in the sentence "the cat chased a snake", there's an O-link from "chased" to "snake." This might as well be interpreted as "there's an O-link from the phrase headed by 'chased' to the phrase headed by 'snake'." Link grammar simplifies things by implicitly identifying each phrase by its head.
Based on these observations, one could look at phrase structure as implicit in a link parse; this does make sense, but also leads to some linguistic complexities that we won't enter into here.

Fig. 44.1: Dependency and Phrase-Structure Parses. A comparison of dependency (above) and phrase-structure (below) parses of "the man that came eats bananas with a fork". In general, one can be converted to the other algorithmically; dependency grammars tend to be easier to understand. (Image taken from G. Schneider, "Learning to Disambiguate Syntactic Relations", Linguistik online 17, 5/03)

44.5 The RelEx Framework for Natural Language Comprehension

Now we move forward in the pipeline from syntax toward semantics. The NL comprehension framework provided with OpenCog at its inception in 2008 is RelEx, an English-language semantic relationship extractor, which consists of two main components: the dependency extractor and the relationship extractor. It can identify subject, object, indirect object and many other dependency relationships between words in a sentence; it generates dependency trees, resembling those of dependency grammars. In 2012 we are in the process of replacing RelEx with a different approach that we believe will be more amenable to generalization based on experience. Here we will describe both approaches. The overall processing scheme of RelEx is shown in Figure 44.2.

The dependency extractor component carries out dependency grammar parsing via a customized version of Sleator and Temperley's open-source link parser, as reviewed above. The link parser outputs several parses, and the dependencies of the best one are taken.
The relationship extractor component is composed of a number of template-matching algorithms that act upon the link parser's output to produce a semantic interpretation of the parse. It contains three steps:
1. Convert the Link Parser output to a feature structure representation.
2. Execute the Sentence Algorithm Applier, which contains a series of Sentence Algorithms, to modify the feature structure.
3. Extract the final output representation by traversing the feature structure.

Fig. 44.2: An Overview of the RelEx Architecture for Language Comprehension

A feature structure, in the RelEx context, is a directed graph in which each node contains either a value, or an unordered list of features. A feature is just a labeled link to another node. The Sentence Algorithm Applier loads a list of SentenceAlgorithms from the algorithm definition file, and the SentenceAlgorithms are executed in the order they are listed in the file. RelEx iterates through every feature node in the feature structure, and attempts to apply each algorithm to each node. Then the modified feature structures are used to generate the final RelEx semantic relationships.

44.5.1 RelEx2Frame: Mapping Syntactico-Semantic Relationships into FrameNet-Based Logical Relationships

Next in the current OpenCog NL comprehension pipeline, the RelEx2Frame component uses hand-coded rules to map RelEx output into sets of relationships utilizing FrameNet and other similar semantic resources. This is definitively viewed as a "stopgap" without a role in a human-level AGI system, but it's described here because it's part of the current OpenCog system and
is now being used together with other OpenCog components in practical projects, including some with proto-AGI intentions.

The syntax currently used for describing semantic relationships drawn from FrameNet and other sources is exemplified by

^1_Benefit:Benefitor(give,$var1)

The ^n prefix indicates the data source, where 1 indicates that the resource is FrameNet. The "give" indicates the word in the original sentence from which the relationship is drawn, that embodies the given semantic relationship. So far the resources we've utilized are:

1. FrameNet
2. Custom relationship names

but using other resources in future is quite possible. An example using a custom relationship would be:

^2_inheritance($var1,$var2)

which defines an inheritance relationship: something that is part of CogPrime's ontology but not part of FrameNet. The "Benefit" part of the first example indicates the frame, and the "Benefitor" indicates the frame element. This distinction (frame vs. frame element) is particular to FrameNet; other knowledge resources might use a different sort of identifier. In general, whatever lies between the underscore and the initial parenthesis should be considered as particular to the knowledge resource in question, and may have different format and semantics depending on the knowledge resource (but shouldn't contain parentheses or underscores unless those are preceded by an escape character).
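Rules in this format, written as "IF condition THEN action" with $-prefixed variables (the full syntax is spelled out with examples in the next subsections), lend themselves to a simple pattern matcher with variable binding. The following is an illustrative sketch, not the actual RelEx2Frame engine:

```python
import re

# Illustrative matcher for "IF cond1 & cond2 THEN act1 act2 ..." rules,
# where relationships look like _obj(put,$var0) and $-prefixed arguments
# are variables. A sketch only, not the actual RelEx2Frame implementation.

def parse_rel(s):
    name, args = re.match(r"\s*([^()\s]+)\((.*)\)\s*$", s).groups()
    return (name, tuple(a.strip() for a in args.split(",")))

def parse_rule(text):
    cond, act = text.split(" THEN ")
    conds = [parse_rel(c) for c in cond.replace("IF ", "", 1).split("&")]
    acts = [parse_rel(a) for a in re.findall(r"\S+\([^)]*\)", act)]
    return conds, acts

def unify(pattern, fact, binding):
    # Extend `binding` so that pattern matches fact, or return None.
    (pname, pargs), (fname, fargs) = pattern, fact
    if pname != fname or len(pargs) != len(fargs):
        return None
    b = dict(binding)
    for p, f in zip(pargs, fargs):
        if p.startswith("$"):
            if b.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return b

def apply_rule(rule_text, facts):
    conds, acts = parse_rule(rule_text)
    bindings = [{}]
    for c in conds:  # every condition must match, with consistent bindings
        bindings = [b2 for b in bindings for f in facts
                    if (b2 := unify(c, f, b)) is not None]
    return [(name, tuple(b.get(a, a) for a in args))
            for b in bindings for name, args in acts]

facts = [parse_rel(s) for s in
         ["imperative(put)", "_obj(put,ball)", "on(put,table)"]]
rule = ("IF on(put,$var1) & _obj(put,$var0) THEN "
        "^1_Placing:Goal(put,$var1) ^1_Locative_relation:Figure($var0) "
        "^1_Locative_relation:Ground($var1)")
print(apply_rule(rule, facts))
```

Running this on the "Put the ball on the table" facts binds $var0 to "ball" and $var1 to "table" and emits the three frame relationships of the rule's action, in the spirit of the worked example that follows.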
As an example, consider:

Put the ball on the table

Here the RelEx output is:

imperative(put) [1]
_obj(put, ball) [1]
on(put, table) [1]
singular(ball) [1]
singular(table) [1]

The relevant FrameNet mapping rules, with bindings $var0 = ball and $var1 = table, are:

| IF imperative(put) THEN ^1_Placing:Agent(put,you)
| IF _obj(put,$var0) THEN ^1_Placing:Theme(put,$var0)
| IF on(put,$var1) & _obj(put,$var0) THEN ^1_Placing:Goal(put,$var1) \
  ^1_Locative_relation:Figure($var0) ^1_Locative_relation:Ground($var1)

Finally, the output FrameNet mapping is:
^1_Placing:Agent(put,you)
^1_Placing:Theme(put,ball)
^1_Placing:Goal(put,table)
^1_Locative_relation:Figure(put,ball)
^1_Locative_relation:Ground(put,table)

The textual syntax used for the hand-coded rules mapping RelEx to FrameNet, at the moment, looks like:

| IF imperative(put) THEN ^1_Placing:Agent(put,you)
| IF _obj(put,$var0) THEN ^1_Placing:Theme(put,$var0)
| IF on(put,$var1) & _obj(put,$var0) THEN ^1_Placing:Goal(put,$var1) \
  ^1_Locative_relation:Figure($var0) ^1_Locative_relation:Ground($var1)

Basically, this means each rule looks like

| IF condition THEN action

where the condition is a series of RelEx relationships, and the action is a series of FrameNet relationships. The arguments of the relationships may be words, or may be variables, in which case their names must start with $. The only variables appearing in the action should be ones that appeared in the condition.

44.5.2 A Priori Probabilities For Rules

It can be useful to attach a priori, heuristic probabilities to RelEx2Frame rules, say

| IF _obj(put,$var0) THEN ^1_Placing:Theme(put,$var0) <.5>

to denote that the a priori probability for the rule is 0.5. This is a crude mechanism, because the probability of a rule being useful in reality depends so much on context; but it still has some nonzero value.

44.5.3 Exclusions Between Rules

It may also be useful to specify that two rules can't semantically-consistently be applied to the same RelEx relationship.
To do this, we need to associate rules with labels, and then specify exclusion relationships such as

# IF on(put,$var1) & _obj(put,$var0) THEN ^1_Placing:Goal(put,$var1) \
  ^1_Locative_relation:Figure($var0) ^1_Locative_relation:Ground($var1) [1]

# IF on(put,$var1) & _subj(put,$var0) THEN \
  ^1_Performing_arts:Performance(put,$var1) \
  ^1_Performing_arts:Performer(put,$var0) [2]

# EXCLUSION 1 2

An escape character "\" must be used to handle cases where the character "$" starts a word.
In this example, Rule 1 would apply to "He put the ball on the table", whereas Rule 2 would apply to "He put on a show". The exclusion says that generally these two rules shouldn't be applied to the same situation. Of course some jokes, poetic expressions, etc., may involve applying excluded rules in parallel.

44.5.4 Handling Multiple Prepositional Relationships

Finally, one complexity arising in such rules is exemplified by the sentence:

Bob says killing for the Mafia beats killing for the government

whose RelEx mapping looks like

uncountable(Bob) [6]
present(says) [6]
_subj(says, Bob) [6]
_that(says, beats) [3]
uncountable(killing) [6]
for(killing, Mafia) [3]
singular(Mafia) [6]
definite(Mafia) [6]
hyp(beats) [3]
present(beats) [5]
_subj(beats, killing) [3]
_obj(beats, killing_1) [5]
uncountable(killing_1) [5]
for(killing_1, government) [2]
definite(government) [6]

In this case there are two instances of "for". The output of RelEx2Frame must thus take care to distinguish the two different for's (or we might want to modify RelEx to make this distinction). The mechanism currently used for this is to subscript the for's, as in

uncountable(Bob) [6]
present(says) [6]
_subj(says, Bob) [6]
_that(says, beats) [3]
uncountable(killing) [6]
for(killing, Mafia) [3]
singular(Mafia) [6]
definite(Mafia) [6]
hyp(beats) [3]
present(beats) [6]
_subj(beats, killing) [3]
_obj(beats, killing_1) [5]
uncountable(killing_1) [5]
for_1(killing_1, government) [2]
definite(government) [6]
so that upon applying the rule

| IF for($var0,$var1) & (present($var0) OR past($var0) OR future($var0)) \
  THEN ^2_Benefit:Benefitor(for,$var1) ^2_Benefit:Act(for,$var0)

we obtain

^2_Benefit:Benefitor(for,Mafia)
^2_Benefit:Act(for,killing)
^2_Benefit:Benefitor(for_1,government)
^2_Benefit:Act(for_1,killing_1)

Here the first argument of the output relationships allows us to correctly associate the different acts of killing with the different benefitors.

44.5.5 Comparatives and Phantom Nodes

Next, a bit of subtlety is needed to deal with sentences like

Mike eats more cookies than Ben.

which RelEx handles via

_subj(eat, Mike)
_obj(eat, cookie)
more(cookie, $cVar0)
$cVar0(Ben)

Then a RelEx2Frame mapping rule such as

IF
_subj(eat,$var0)
_obj(eat,$var1)
more($var1,$cVar0)
$cVar0($var2)
THEN
^2_AsymmetricEvaluativeComparison:ProfiledItem(more,$var1)
^2_AsymmetricEvaluativeComparison:StandardItem(more,$var1_1)
^2_AsymmetricEvaluativeComparison:Valence(more,more)
^1_Ingestion:Ingestor(eat,$var0)
^1_Ingestion:Ingested(eat,$var1)
^1_Ingestion:Ingestor(eat_1,$var2)
^1_Ingestion:Ingested(eat_1,$var1_1)

applies, which embodies the commonsense intuition about comparisons regarding eating. (Note that we have introduced a new frame, AsymmetricEvaluativeComparison, here, by analogy to the standard FrameNet frame Evaluative_comparison.) Note also that the above rule may be too specialized, though it's not incorrect. One could also try more general rules like
IF
%Agent($var0)
%Agent($var1)
_subj($var3,$var0)
_obj($var3,$var1)
more($var1,$cVar0)
$cVar0($var2)
THEN
^2_AsymmetricEvaluativeComparison:ProfiledItem(more,$var1)
^2_AsymmetricEvaluativeComparison:StandardItem(more,$var1_1)
^2_AsymmetricEvaluativeComparison:Valence(more,more)
_subj($var3,$var0)
_obj($var3,$var1)
_subj($var3_1,$var2)
_obj($var3_1,$var1_1)

However, this rule is a little different from most RelEx2Frame rules, in that it produces output that then needs to be processed by the RelEx2Frame rule-base a second time. There's nothing wrong with this; it's just an added layer of complexity.

44.6 Frame2Atom

The next step in the current OpenCog NLP comprehension pipeline is to translate the output of RelEx2Frame into Atoms. This may be done in a variety of ways; the current Frame2Atom script embodies one approach that has proved workable, but is certainly not the only useful one. The Node types currently used in Frame2Atom are:

• WordNode
• ConceptNode
  - DefinedFrameNode
  - DefinedLinguisticConceptNode
• PredicateNode
  - DefinedFrameElementNode
  - DefinedLinguisticRelationshipNode
• SpecificEntityNode

The special node types

• DefinedFrameNode
• DefinedFrameElementNode

have been created to correspond to FrameNet frames and elements respectively (or frames and elements drawn from resources similar to FrameNet, such as our own frame dictionary). Similarly, the special node types
• DefinedLinguisticConceptNode
• DefinedLinguisticRelationshipNode

have been created to correspond to RelEx unary and binary relationships respectively. The "Defined" is in the names because once we have a more advanced CogPrime system, it will be able to learn its own frames, frame elements, linguistic concepts and relationships. What distinguishes these "defined" Atoms is that they have names which correspond to specific external resources.

The Link types we need for Frame2Atom are:

• InheritanceLink
• ReferenceLink (currently using WRLink, aka "word reference link")
• FrameElementLink

ReferenceLink is a special link type for connecting concepts to the words that they refer to. (This could be eliminated by using more complex constructs, but it's a very common case, so for practical purposes it makes sense to define it as a link type.) FrameElementLink is a special link type connecting a frame to its elements. Its semantics (and how it could be eliminated at the cost of increased memory and complexity) will be explained below.

44.6.1 Examples of Frame2Atom

Below follow some examples to illustrate the nature of the mapping intended. The examples include a lot of explanatory discussion as well.

Note that, in these examples, [n] denotes an Atom with AtomHandle n. All Atoms have Handles, but Handles are only denoted in cases where this seems useful. (In the XML representation used in the current OpenCogPrime implementation, these are replaced by UUIDs.) The notation WordNode#pig denotes a WordNode with name pig, and a similar convention is used for other Atom types whose names are useful to know. These examples pertain to fragments of the parse

Ben slowly ate the fat chickens.

A:_advmod:V(slowly:A, eat:V)
N:_nn:N(fat:N, chicken:N)
N:definite(Ben:N)
N:definite(chicken:N)
N:masculine(Ben:N)
N:person(Ben:N)
N:plural(chicken:N)
N:singular(Ben:N)
V:_obj:N(eat:V, chicken:N)
V:_subj:N(eat:V, Ben:N)
V:past(eat:V)
^1_Ingestion:Ingestor(eat,Ben)
^1_Temporalcolocation:Event(past,eat)
^1_Ingestion:Ingestibles(eat,chicken)
^1_Activity:Agent(subject,Ben)
^1_Activity:Activity(verb,eat)
^1_Transitive_action:Event(verb,eat)
^1_Transitive_action:Patient(object,chicken)

44.6.1.1 Example 1

_obj(eat,chicken)

would map into

EvaluationLink
    DefinedLinguisticRelationshipNode #_obj
    ListLink
        ConceptNode [2]
        ConceptNode [3]

InheritanceLink
    [2]
    ConceptNode [4]

InheritanceLink
    [3]
    ConceptNode [5]

ReferenceLink [6]
    WordNode #eat [8]
    [4]

ReferenceLink [7]
    WordNode #chicken [9]
    [5]

Please note that the Atoms labeled 4, 5, 6, 7, 8, 9 would not normally have to be created when entering the relationship _obj(eat,chicken) into the AtomTable. They should already be there, assuming the system already knows about the concepts of eating and chickens. These would need to be newly created only if the system had never seen these words before. For instance, the Atom [2] represents the specific instance of "eat" involved in the relationship being entered into the system. The Atom [4] represents the general concept of "eat", which is what is linked to the word "eat."

Note that a very simple step of inference, from these Atoms, would lead to the conclusion
EvaluationLink
    DefinedLinguisticRelationshipNode #_obj
    ListLink
        ConceptNode [4]
        ConceptNode [5]

which represents the general statement that chickens are eaten. This is such an obvious and important step that perhaps, as soon as the relationship _obj(eat,chicken) is entered into the system, it should immediately be carried out (i.e. that link, if not present, should be created; and if present, should have its truth value updated). This is a choice to be implemented in the specific scripts or schemata that deal with ingestion of natural language text.

44.6.1.2 Example 2

masculine(Ben)

would map into

InheritanceLink
    SpecificEntityNode [40]
    DefinedLinguisticConceptNode #masculine

InheritanceLink
    [40]
    [10]

ReferenceLink
    WordNode #Ben
    [10]

44.6.1.3 Example 3

The mapping of the RelEx2Frame output

Ingestion:Ingestor(eat, Ben)

would use the existing Atoms

DefinedFrameNode #Ingestion [11]
DefinedFrameElementNode #Ingestion:Ingestor [12]

which would be related via

FrameElementLink [11] [12]

(Note that FrameElementLink may in principle be reduced to more elementary PLN link types.) Note that each FrameNet frame contains some core elements and some optional elements. This may be handled by giving core elements links such as

FrameElementLink F E <1>
and optional ones links such as

    FrameElementLink F E <.7>

Getting back to the example at hand, we would then have

    InheritanceLink
        [2]
        [11]

(recall, [2] is the instance of eating involved in Example 1, and [11] is the Ingestion frame), which says that this instance of eating is an instance of ingestion. (In principle, some instances of eating might not be instances of ingestion; or more generally, we can't assume that all instances of a given concept will always associate with the same FrameNodes. This could be assumed only if we assumed all word-associated concepts were disambiguated to a single known FrameNet frame, but this can't be assumed, especially if later on we want to use cognitive processes to do sense disambiguation.) We would then also have links denoting the role of Ben as an Ingestor in the frame-instance [2], i.e.

    EvaluationLink
        DefinedFrameElementNode "Ingestion:Ingestor" [12]
        ListLink
            [2]
            [40]

This says that the specific instance of Ben observed in that sentence ([40]) served the role of Ingestion:Ingestor in regard to the frame-instance [2] (which is an instance of eating, which is known to be an instance of the frame of Ingestion).

44.6.2 Issues Involving Disambiguation

Right now, OpenCogPrime's RelEx2Frame rulebase is far from adequately large (there are currently around 5000 rules), and the link parser and RelEx are also imperfect. The current OpenCog NLP system does work, but for complex sentences it tends to generate too many interpretations of each sentence; "parse selection", or more generally "interpretation selection", is not yet adequately addressed. This is a tricky issue that can be addressed to some extent via statistical linguistics methods, but we believe that solving it convincingly and thoroughly will require more cognitively sophisticated methods.
The most straightforward way to approach it statistically is to process a large number of sentences, and then tabulate co-occurrence probabilities of different relationships across all the sentences. This allows one to calculate the probability of a given interpretation conditional on the corpus, by looking at the probabilities of the combinations of relationships in the interpretation. This may be done using a Bayes net or using PLN; in either case the problem is one of calculating the probability of a conjunction of terms based on knowledge regarding the probabilities of various sub-conjunctions. As this method doesn't require marked-up training data, but is purely unsupervised, it is feasible to apply it to a very large corpus of text; the only cost is computer time.

What the statistical approach won't handle, though, are the more conceptually original linguistic constructs, containing combinations that didn't occur frequently in the system's training
corpus. It will rate innovative semantic constructs as unlikely, which will sometimes lead it into errors: errors of choosing an interpretation that seems odd in terms of the sentence's real-world interpretation, but matches well with things the system has seen before. The only way to solve this is with genuine understanding, with the system reasoning on each of the interpretations and seeing which one makes more sense. And this kind of reasoning generally requires some relevant commonsense background knowledge, which must be gained via experience, reading and conversing, or from a hand-coded knowledge base, or via some combination of the above.

Related issues also involving disambiguation include word sense disambiguation (words with multiple meanings) and anaphor resolution (recognizing the referents of pronouns, and of nouns that refer to other nouns, etc.). The current RelEx system contains a simple statistical parse ranker (which rates a parse higher if the links it includes occur more frequently in a large parsed corpus), statistical methods for word sense disambiguation inspired by those in Rada Mihalcea's work, and an anaphor resolution algorithm based on the classic Hobbs algorithm (customized to work with the link parser) [Hob78]. While reasonably effective in many cases, from an AGI perspective these must all be considered "stopgaps" to be replaced with code that handles these tasks using probabilistic inference. It is conceptually straightforward to replace statistical linguistic algorithms with comparable PLN-based methods; however, significant attention must be paid to code optimization, as using a more general algorithm is rarely as efficient as using a specialized one.
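As an illustration of the co-occurrence-based interpretation scoring described above, here is a minimal Python sketch. The corpus entries, the relationship strings, and the add-one smoothing are all illustrative assumptions of ours, not the actual OpenCog implementation:

```python
from collections import Counter
from itertools import combinations
from math import log

def train_cooccurrence(corpus):
    """Tabulate how often each pair of relationships co-occurs within a
    sentence, across a corpus; each entry is the set of semantic
    relationships extracted from one sentence."""
    pair_counts, n = Counter(), 0
    for rels in corpus:
        n += 1
        for pair in combinations(sorted(rels), 2):
            pair_counts[pair] += 1
    return pair_counts, n

def score_interpretation(rels, pair_counts, n, smoothing=1.0):
    """Log-probability-style score for one candidate interpretation:
    the sum of smoothed log co-occurrence frequencies of the
    relationship pairs it contains.  Higher = more corpus-typical."""
    score = 0.0
    for pair in combinations(sorted(rels), 2):
        score += log((pair_counts[pair] + smoothing) / (n + smoothing))
    return score

# Toy corpus: each sentence reduced to its set of extracted relationships.
corpus = [
    {"_subj(eat,Ben)", "_obj(eat,chicken)"},
    {"_subj(eat,Ben)", "_obj(eat,rice)"},
    {"_subj(fly,Ben)", "_obj(fly,kite)"},
]
pc, n = train_cooccurrence(corpus)
common = score_interpretation({"_subj(eat,Ben)", "_obj(eat,chicken)"}, pc, n)
rare = score_interpretation({"_subj(eat,kite)", "_obj(eat,Ben)"}, pc, n)
assert common > rare  # the corpus-typical reading ranks higher
```

As the text notes, such a scorer prefers whatever the corpus has seen before, which is exactly why it fails on conceptually original constructs.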
But once one is handling things in PLN and the Atomspace rather than in specialized computational linguistics code, there is the opportunity to use a variety of inference rules for generalization, analogy and so forth, which enables a radically more robust form of linguistic intelligence.

44.7 Syn2Sem: A Semi-Supervised Alternative to RelEx and RelEx2Frame

This section describes an alternative approach to the RelEx / RelEx2Frame approach described above, which is in the midst of implementation at time of writing. This alternative represents a sort of midway point between the rule-based RelEx / RelEx2Frame approach and a conceptually ideal, fully experiential-learning-based approach.

The motivations underlying this alternative approach have been to create an OpenCog NLP system with the capability to:

• support simple dialogue in a video-game-like world, and in a robot system
• leverage primarily semi-supervised experiential learning
• replace the RelEx2Frame rules, which are currently problematic, with a different way of mapping syntactic relationships into Atoms, one that is still reasoning- and learning-friendly
• require only relatively modest effort for implementation (not multiple human-years)

The latter requirement ruled out a pure "learn language from experience with no aid from computational linguistics tools" approach, which may well happen within OpenCog at some point.
44.8 Mapping Link Parses into Atom Structures

The core idea of the new approach is to learn "Syn2Sem" rules that map link parses into Atom structures. These rules may then be automatically reversed to form Sem2Syn rules, which may be used in language generation. Note that this is different from the RelEx approach as currently pursued (the "old approach"), which contains

• one set of rules (the RelEx rules) mapping link parses into semantic relation-sets ("RelEx relation-sets" or rel-sets)
• another set of rules (the RelEx2Frame rules) mapping rel-sets into FrameNet-based relation-sets
• another set of rules (the Frame2Atom rules) mapping FrameNet-based relation-sets into Atom-sets

In the old approach, all the rules were hand-coded. In the new approach:

• Nothing needs to be hand-coded (except the existing link parser dictionary); the rules can be learned from a corpus of (link parse, Atom-set) pairs. This corpus may be human-created, or may be derived via a system's experience in some domain where sentences are heard or read and can be correlated with observed nonlinguistic structures that can be described by Atoms.
• In practice, some hand-coded rules are being created to map RelEx rel-sets into Atom-sets directly (bypassing RelEx2Frame) in a simple way. These rules will be used, together with RelEx, to create a large corpus of (link parse, Atom-set) pairs, which will be used as a training corpus. This training corpus will have more errors than a hand-created corpus, but will have the compensating advantage of being significantly larger than any hand-created corpus would feasibly be.

In the old approach, NL generation was done by using a pattern-matching approach, applied to a corpus of (link parse, rel-set) pairs, to mine rules mapping rel-sets to sets of link parser links.
This worked to an extent, but the process of piecing together the generated sets of link parser links to form coherent "sentence parses" (that could then be turned into sentences) turned out to be subtler than expected, and appeared to require an escalatingly complex set of hand-coded rules in order to be extended beyond simple cases.

In the new approach, NL generation is done by explicitly reversing the mapping rules learned for mapping link parses into Atom-sets. This is possible because the rules are explicitly given in a form enabling easy reversal; whereas in the old approach, RelEx transformed link parses into rel-sets using a process of successively applying many rules to an ornamented tree, each rule acting on variables ("ornaments") deposited by previous rules. Put simply, RelEx transformed link parses into rel-sets via imperative programming, whereas in the new approach, link parses are transformed into Atom-sets using learned rules that are logical in nature. The movement from imperative to logical style dramatically eases automated rule reversal.
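To illustrate why the declarative format eases reversal, here is a minimal Python sketch. The tuple-based rule representation, the variable convention, and the function names are our own illustrative assumptions, not OpenCog's actual rule format:

```python
# A Syn2Sem rule as an explicit (antecedent, consequent) pattern pair.
# Patterns are tuples whose "$"-prefixed elements are variables; because
# the rule is declarative rather than imperative, reversing it for
# generation is just swapping the two sides.

Rule = tuple  # (syntactic_pattern, semantic_pattern)

sp_rule: Rule = (
    ("Sp", "$V1", "$V2"),          # link parse: plural-subject link
    ("Evaluation", "$V2", "$V1"),  # semantics: the verb $V2 as a predicate applied to $V1
)

def reverse(rule: Rule) -> Rule:
    """Turn a Syn2Sem rule into the corresponding Sem2Syn rule."""
    antecedent, consequent = rule
    return (consequent, antecedent)

def apply_rule(rule: Rule, fact):
    """Match `fact` against the rule's antecedent; on success, return
    the consequent with variables substituted, else None."""
    antecedent, consequent = rule
    if len(fact) != len(antecedent) or fact[0] != antecedent[0]:
        return None
    binding = {v: x for v, x in zip(antecedent[1:], fact[1:])}
    return tuple(binding.get(t, t) for t in consequent)

# Syn2Sem: Sp(trains, move)  =>  Evaluation(move, trains)
assert apply_rule(sp_rule, ("Sp", "trains", "move")) == ("Evaluation", "move", "trains")
# Sem2Syn, by automatic reversal: Evaluation(move, trains)  =>  Sp(trains, move)
assert apply_rule(reverse(sp_rule), ("Evaluation", "move", "trains")) == ("Sp", "trains", "move")
```

Nothing like `reverse` is possible for an imperative pipeline that destructively ornaments a tree, which is the crux of the old approach's generation difficulties.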
44.8.1 Example Training Pair

For concreteness, an example (link parse, Atom-set) pair would be as follows. For the sentence "Trains move quickly", the link parse looks like

    Sp(trains, move)
    MVa(move, quickly)

whereas the Atom-set looks like

    Inheritance move_1 move
    Evaluation move_1 train
    Inheritance move_1 quick

Rule learning proceeds, in the new approach, from a corpus consisting of such pairs.

44.9 Making a Training Corpus

44.9.1 Leveraging RelEx to Create a Training Corpus

To create a substantial training corpus for the new approach, we are leveraging the existence of RelEx. We have a large corpus of sentences parsed by the link parser and then processed by RelEx. A new collection of rules is being created, RelEx2Atom, that directly translates RelEx parses into Atoms, in a simple way, embodying the minimal necessary degree of disambiguation (in a sense to be described just below). Using these RelEx2Atom rules, one can transform a corpus of (sentence, link parse, RelEx rel-set) triples into a corpus of (link parse, Atom-set) pairs, which can then be used as training data for learning Syn2Sem rules.

44.9.2 Making an Experience Based Training Corpus

An alternate approach to making a training corpus would be to utilize a virtual world such as the Unity3D world now being used for OpenCog game AI research and development. A human game-player could create a training corpus by repeatedly:

• typing in a sentence
• indicating, via the graphical user interface, which entities or events in the virtual world were referred to by the sentence
Since OpenCog possesses code for transforming entities and events in the virtual world into Atom-sets, this would implicitly produce a training corpus of (sentence, Atom-set) pairs, which using the link parser could then be transformed into (link parse, Atom-set) pairs.

44.9.3 Unsupervised, Experience Based Corpus Creation

One could also dispense with the explicit reference-indication GUI, and just have a user type sentences to the AI agent as the latter proceeds through the virtual world. The AI agent would then have to figure out what specifically the sentences were referring to - maybe the human-controlled avatar is pointing at something; maybe one thing recently changed in the game world and nothing else did; etc. This mode of corpus creation would be reasonably similar to human first language learning in format (though of course there are many differences from human first language learning in the overall approach; for instance, we are assuming the link parser, whereas a human language learner has to learn grammar for themselves, based on complex and ill-understood genetically encoded prior probabilistic knowledge regarding the likely aspects of the grammar to be learned). This seems a very interesting direction to explore later on, but at time of writing we are proceeding with the RelEx-based training corpus, for sake of simplicity and speed of development.
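Whatever its provenance, each entry of the training corpus pairs a link parse with an Atom-set. In illustrative Python pseudodata (not OpenCog's actual serialization format), the "Trains move quickly" example given earlier might be represented as:

```python
# One training-corpus entry for "Trains move quickly", as a
# (link parse, Atom-set) pair.  Both sides are sets of simple tuples;
# move_1 denotes the specific instance of moving in this sentence.

training_pair = (
    # link parse: link-grammar links between the words
    {("Sp", "trains", "move"),
     ("MVa", "move", "quickly")},
    # semantic interpretation: the corresponding Atoms
    {("Inheritance", "move_1", "move"),
     ("Evaluation", "move_1", "train"),
     ("Inheritance", "move_1", "quick")},
)

link_parse, atom_set = training_pair
# Rule learning then looks for fragments of the left side that reliably
# predict fragments of the right side, across many such pairs.
assert ("Sp", "trains", "move") in link_parse
assert ("Inheritance", "move_1", "move") in atom_set
```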
The key difference is that:

• In the new approach, the syntax-to-semantics mapping rules attempt only the disambiguation that needs to be done to get the structure of the resultant Atom-set correct. Any further disambiguation is left to be done later, by MindAgents acting on the Atom-sets after they've already been placed in the AtomSpace.
• In the old approach, the RelEx2Frame rules attempted, in many cases, to disambiguate between different meanings beyond the level needed to disambiguate the structure of the Atom-set.

To illustrate the difference, consider the sentences

• Love moves quickly.
• Trains move quickly.

These sentences involve different senses of "move": change in physical location, versus a more general notion of progress. However, both sentences map to the same basic conceptual structure, e.g.
    Inheritance move_1 move
    Evaluation move_1 train
    Inheritance move_1 quick

versus

    Inheritance move_2 move
    Evaluation move_2 love
    Inheritance move_2 quick

The RelEx2Frame rules try to distinguish between these cases by, in effect, associating the two instances move_1 and move_2 with different frames, using hand-coded rules that map RelEx rel-sets into appropriate Atom-sets defined in terms of FrameNet relations. This is not a useless thing to do; however, doing it well requires a very large and well-honed rule-base. Cyc's natural language engine attempts something similar, though using a different parser than the link parser and a different ontology than FrameNet; it does a much better job than the current version of RelEx2Frame, but still does a surprisingly incomplete job given the massive amount of effort put into sculpting the relevant rule-sets.

The new approach does not try to perform this kind of disambiguation prior to mapping things into Atom-sets. Rather, this kind of disambiguation is left for inference to do, after the relevant Atoms have already been placed in the AtomSpace. The rule of thumb is: do precisely the disambiguation needed to map the parse into a compact, simple Atom-set whose component nodes correspond to English words; let the disambiguation of the meaning of the English words be done by some other process acting on the AtomSpace.

44.11 Rule Format

To represent Syn2Sem rules, it is convenient to represent link parses as Atom-sets. Each element of the training corpus will then be of the form (Atom-set representing link parse, Atom-set representing semantic interpretation). Syn2Sem rules are then rules mapping Atom-sets to Atom-sets. Broadly speaking, the format of a Syn2Sem rule is then
    Implication
        Atom-set representing portion of link parse
        Atom-set representing portion of semantic interpretation

44.11.1 Example Rule

A simple example rule would be

    Implication
        Evaluation
            Predicate: Sp
            $V1
            $V2
        Evaluation
            $V2
            $V1

This rule, in essence, maps verbs into predicates that take their subjects as arguments. A Sem2Syn rule, on the other hand, would look like the reverse:

    Implication
        Atom-set representing portion of semantic interpretation
        Atom-set representing portion of link parse

Our current approach is to begin with Syn2Sem rules because, due to the nature of natural language, these rules will tend to be more certain. That is: it is more strongly the case in natural languages that each syntactic construct maps into a small set of semantic structures, than that each semantic structure is realizable only via a small set of syntactic constructs. There are usually more structurally different, reasonably sensible ways to say an arbitrary thought than there are structurally different, reasonably sensible ways to interpret an arbitrary sentence. Because of this fact about language, the design of the Atom-sets in the corpus is based on the principle of finding an Atom structure that most simply represents the meaning of the sentence corresponding to each given link parse. Thus, there will be many Syn2Sem rules with a high degree of certitude attached to them. On the other hand, the Sem2Syn rules will tend to have less certitude, because there may be many different syntactic ways to realize a given semantic expression.

44.12 Rule Learning

Learning of Syn2Sem rules may be done via any algorithm that is able to search rule space for rules of the proper format with high truth value as evaluated across the training set. Currently we are experimenting with using OpenCogPrime's frequent subgraph mining algorithm in this context. MOSES could also potentially be used to learn Syn2Sem rules.
One suspects that MOSES might be better than frequent subgraph mining for learning complex rules, but based
on preliminary experimentation, frequent subgraph mining seems fine for learning the simple rules involved in simple sentences. PLN inference may also be used to generate new rules by combining previous ones, and to generalize rules into more abstract forms.

44.13 Creating a Cyc-Like Database via Text Mining

The discussion of these NL comprehension mechanisms leads naturally to one interesting potential application of the OpenCog NL comprehension pipeline, which is only indirectly related to CogPrime but would create a valuable resource for use by CogPrime if implemented. The possibility exists to use the OpenCog NL comprehension system to create a vaguely Cyc-like database of common-sense rules. The approach would be as follows:

1. Get a corpus of text.
2. Parse the text using OpenCog (RelEx or Syn2Sem).
3. Mine logical relationships among Atoms from the data thus produced, using greedy data-mining, MOSES, or other methods.

These mined logical relationships will then be loosely analogous to the rules the Cyc team have programmed in. For instance, there will be many rules like:

    IF _subj(understand,$var0) THEN ^1_Grasp:Cognizer(understand,$var0)
    IF _subj(know,$var0) THEN ^1_Grasp:Cognizer(know,$var0)

So statistical mining would learn rules like

    IF ^1_Mental_property(stupid) & ^1_Mental_property:Protagonist($var0)
    THEN ^1_Grasp:Cognizer(understand,$var0) <.3>

    IF ^1_Mental_property(smart) & ^1_Mental_property:Protagonist($var0)
    THEN ^1_Grasp:Cognizer(understand,$var0) <.8>

which mean that stupid people mentally grasp less than smart people do.

Note that these commonsense rules would come out automatically probabilistically quantified. Note also that to make such rules come out well, one needs to do some (probabilistic) synonym-matching on nouns, adverbs and adjectives, e.g. so that mentions of "smart", "intelligent", "clever", etc.
will count as instances of ^1_Mental_property(smart). By combining probabilistic synonym matching on words with mapping of RelEx output into FrameNet input, and doing statistical mining, it should be possible to build a database like Cyc, but far more complete and with coherent probabilistic weightings. Although this way of building a commonsense knowledge base requires a lot of human engineering, it requires far less than something like Cyc. One "just" needs to build the RelEx2FrameNet mapping rules, not all the commonsense knowledge relationships directly;
those come from text. We do not advocate this as a solution to the AGI problem, but merely suggest that it could produce a large amount of useful knowledge to feed into an AGI's brain. And of course, the better an AI one has, the better one can do the step labeled "Rank the parses and FrameNet interpretations using inference or heuristics or both." So there is a potential virtuous cycle here: more commonsense knowledge mined helps create a better AI mind, which helps mine better commonsense knowledge, and so on.

44.14 PROWL Grammar

We have described the crux of the NL comprehension pipeline currently in place in the OpenCog codebase, plus some ideas for fairly moderate modifications or extensions. This section is a little more speculative, and describes an alternative approach that fits better with the overall CogPrime design, but which has not yet been implemented. The ideas given here lead more naturally to a design for experience-based language learning and processing, a connection that will be pointed out in a later section.

What we describe here is a partially new theory of language, formed by combining ideas from three sources: Hudson's Word Grammar [Hud90, Hud07a], Sleator and Temperley's Link Grammar, and Probabilistic Logic Networks. Reflecting its origin in these three sources, we have named the new theory PROWL Grammar, meaning PRObabilistic Word Link Grammar. We believe PROWL has value purely as a conceptual approach to understanding language; however, it has been developed largely from the standpoint of computational linguistics, as part of an attempt to create a framework for computational language understanding and generation that both

1. yields broadly adequate behavior based on hand-coding of "expert rules" such as grammatical rules, combined with statistical corpus analysis
2. integrates naturally with a broader AI framework that combines language with embodied, social, experiential learning, and that ultimately will allow linguistic rules derived via expert encoding and statistical corpus analysis to be replaced with comparable, more refined rules resulting from the system's own experience

PROWL has been developed as part of the larger CogPrime project, but it is described in this section mostly in a CogPrime-independent way, and is intended to be independently evaluable (and, hopefully, valuable).

As an integration of three existing frameworks, PROWL could be presented in various different ways. One could choose any one of the three components as an initial foundation, and then present the combined theory as an expansion/modification of this component. Here we choose to present it as an expansion/modification of Word Grammar, as this is the way it originated, and it is also the most natural approach for readers with a linguistics background. From this perspective, to simplify a fair bit, one may describe PROWL as consisting of Word Grammar with three major changes:

1. Word Grammar's network knowledge representation is replaced with a richer PLN-based network knowledge representation.
   a. This includes, for instance, the replacement of Word Grammar's single "isa" relationship type with a more nuanced collection of logically distinct probabilistic inheritance relationship types.
2. Going along with the above, Word Grammar's "default inheritance" mechanism is replaced by an appropriate PLN control mechanism that guides the use of standard PLN inference rules.
   a. This allows the same default-inheritance based inferences that Word Grammar relies upon, but embeds these inferences in a richer probabilistic framework that allows them to be integrated with a wide variety of other inferences.
3. Word Grammar's small set of syntactic link types is replaced with a richer set of syntactic link types, as used in Link Grammar.
   a. The precise optimal set of link types is not clear; it may be that Link Grammar's syntactic link type vocabulary is larger than necessary, but we also find it clear that the current version of Word Grammar's syntactic link type vocabulary is smaller than feasible (at least, without the addition of large, new, and as yet unspecified ideas to Word Grammar).

In the following subsections we will review these changes in a little more detail. Basic familiarity with Word Grammar, Link Grammar and PLN is assumed. Note that in this section we will focus mainly on those issues that are somehow nonobvious. This means that a host of very important topics that come along with the Word Grammar / PLN integration are not even mentioned. The way Word Grammar deals with morphology, semantics and pragmatics, for instance, seems to us quite sensible and workable, and doesn't really change at all when one integrates Word Grammar with PLN, except that Word Grammar's crisp isa links become PLN-style probabilistic Inheritance links.

44.14.1 Brief Review of Word Grammar

Word Grammar is a theory of language structure which Richard Hudson began developing in the early 1980's [Hud90].
While partly descended from Systemic Functional Grammar, there are also significant differences. The main ideas of Word Grammar are as follows:¹

• It presents language as a network of knowledge, linking concepts about words, their meanings, etc. - e.g. the word "dog" is linked to the meaning 'dog', to the form /dog/, to the word-class 'noun', etc.
• If language is a network, then it is possible to decide what kind of network it is (e.g. it seems to be a scale-free, small-world network).
• It is monostratal: only one structure per sentence, no transformations.
• It uses word-word dependencies - e.g. a noun is the subject of a verb.
• It does not use phrase structure - e.g. it does not recognise a noun phrase as the subject of a clause, though these phrases are implicit in the dependency structure.

¹ The following list is paraphrased, with edits, from http://www.phon.ucl.ac.uk/home/dick/wg.htm, downloaded on June 27 2010.
• It shows grammatical relations/functions by explicit labels - e.g. 'subject' and 'object'.
• It uses features only for inflectional contrasts that are mentioned in agreement rules - e.g. number, but not tense or transitivity.
• It uses default inheritance as a very general way of capturing the contrast between 'basic' or 'underlying' patterns and 'exceptions' or 'transformations' - e.g. by default, English words follow the word they depend on, but exceptionally, subjects precede it; particular cases 'inherit' the default pattern unless it is explicitly overridden by a contradictory rule.
• It views concepts as prototypes rather than 'classical' categories that can be defined by necessary and sufficient conditions. All characteristics (i.e. all links in the network) have equal status, though some may for pragmatic reasons be harder to override than others.
• In this network there are no clear boundaries between different areas of knowledge - e.g. between 'lexicon' and 'grammar', or between 'linguistic meaning' and 'encyclopedic knowledge'; language is not a separate module of cognition.
• In particular, there is no clear boundary between 'internal' and 'external' facts about words, so a grammar should be able to incorporate sociolinguistic facts - e.g. that the speaker of "sidewalk" is an American.

44.14.2 Word Grammar's Logical Network Model

Word Grammar presents an elegant framework in which all the different aspects of language are encompassed within a single knowledge network. Representationally, this network combines two key aspects:

1. Inheritance (called "isa") is explicitly represented.
2. General relationships between n-ary predicates and their arguments, including syntactic relationships, are explicitly represented.

Dynamically, the network contains two key aspects:

1. An inference rule called "default inheritance".
2. Activation-spreading, similar to that in a neural network or standard semantic network.

The similarity between Word Grammar and CogPrime is fairly strong. In the latter, inheritance and generic predicate-argument relationships are explicitly represented; and a close analogue of activation spreading is present in the "attention allocation" subsystem. As in Word Grammar, important cognitive phenomena are grounded in the symbiotic combination of logical-inference and activation-spreading dynamics.

At the most general level, the reaction of the Word Grammar network to any situation is proposed to involve three stages:

1. Node creation and identification: nodes are created or identified representing the situation as understood, in its most relevant aspects.
2. Where choices need to be made (e.g. where an identified predicate needs to choose which other nodes to bind to as arguments), activation spreading is used, and the most active eligible argument is utilized (this is called "best fit binding").
3. Default inheritance is used to supply new links to the relevant nodes as necessary.
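As a concrete illustration of the default-inheritance stage, here is a minimal Python sketch of a lookup over an isa hierarchy, using the text's own English word-order example. The node names, the dictionary representation, and the breadth-first tie-breaking are our illustrative assumptions, not Word Grammar's formal machinery:

```python
# Minimal sketch of Word-Grammar-style default inheritance: a property
# value is looked up on a node itself, then on ever-more-distant "isa"
# ancestors, with the nearest explicitly stored value winning (so
# exceptions override defaults).

isa = {                      # child -> list of parents (multiple inheritance allowed)
    "subject": ["dependent"],
    "object":  ["dependent"],
}
facts = {                    # (node, property) -> explicitly stored value
    ("dependent", "order"): "follows head",   # default English word order
    ("subject", "order"):   "precedes head",  # the exception
}

def inherit(node, prop):
    """Return the value stored at `node`, else at the nearest ancestor
    that stores one (breadth-first, so closer ancestors win).  Ties
    among equally close ancestors could be broken by activation level,
    as Word Grammar suggests for multiple-inheritance conflicts."""
    frontier = [node]
    while frontier:
        for n in frontier:
            if (n, prop) in facts:
                return facts[(n, prop)]
        frontier = [p for n in frontier for p in isa.get(n, [])]
    return None

assert inherit("object", "order") == "follows head"    # inherits the default
assert inherit("subject", "order") == "precedes head"  # exception overrides
```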
Default inheritance is a process that relies on the placement of each node in a directed acyclic graph (dag) hierarchy of isa links. The basic idea is as follows. Suppose one has a node N, and a predicate f(N,L), where L is another argument or list of arguments. Then, if the truth value of f(N,L) is not explicitly stored in the network, N inherits the value from any ancestor A in the dag such that: f(A,L) is explicitly stored in the network; and there is no node P in between N and A for which f(P,L) is explicitly stored in the network. Note that multiple inheritance is explicitly supported, and in cases where this leads to multiple assignments of truth values to a predicate, confusion in the linguistic mind may ensue. In many cases the option coming from the ancestor with the highest level of activity may be selected.

Our suggestion is that Word Grammar's network representation may be replaced with PLN's logical network representation without any loss, and with significant gain. Word Grammar's network representation has not been fleshed out as thoroughly as that of PLN; it does not handle uncertainty; and it is not associated with general mechanisms for inference. The one nontrivial issue that must be addressed in porting Word Grammar to the PLN representation is the role of default inheritance in Word Grammar; this is covered in the following subsection. The integration of activation spreading and default inheritance proposed in Word Grammar should be easily achievable within CogPrime, assuming a functional attention allocation subsystem.

44.14.3 Link Grammar Parsing vs Word Grammar Parsing

From a CogPrime/PLN point of view, perhaps the most striking original contribution of Word Grammar is in the area of syntax parsing. Word Grammar's treatment of morphology and semantics is, basically, exactly what one would expect from representing such things in a richly structured semantic network.
PLN adds much additional richness to Word Grammar by allowing nuanced representation of uncertainty, which is critical on every level of the linguistic hierarchy - but this doesn't change the fundamental linguistic approach of Word Grammar. Regarding syntax processing, however, Word Grammar makes some quite specific and unique hypotheses, which if correct are very valuable contributions.

The conceptual assumption we make here is that syntax processing, while carried out using generic cognitive processes for uncertain inference and activation spreading, also involves some highly specific constraints on these processes. The extent to which these constraints are learned versus inherited is yet unknown; for the subtleties of this issue the reader is referred to [EI33-97]. Word Grammar and Link Grammar can then be understood as embodying different hypotheses regarding what these constraints actually are.

It is interesting to consider the contributions of Word Grammar to syntax parsing by comparing it to Link Grammar. Note that Link Grammar, while a less comprehensive conceptual theory than Word Grammar, has been used to produce a state-of-the-art syntax parser, which has been incorporated into a number of other software systems including OpenCog. So it is clear that the Link Grammar approach has a great deal of pragmatic value. On the other hand, it also seems clear that Link Grammar has certain theoretical shortcomings. It deals with many linguistic phenomena very elegantly, but there are other phenomena for which its approach can only be described as "hacky."
44 Natural Language Comprehension

Word Grammar contains fewer hacks than Link Grammar, but has not yet been put to the test of large-scale computational implementation, so it's not yet clear how many hacks would need to be added to give it the relatively broad coverage that Link Grammar currently has. Our own impression is that to make Word Grammar actually work as the foundation for a broad-coverage grammar parser (whether standalone, or integrated into a broader artificial cognition framework), one would need to move it somewhat in the direction of Link Grammar, via adding a greater number of specialized syntactic link types (more on this shortly). There are in fact concrete indications of this in [Hud07a].

The Link Grammar framework may be decomposed into three aspects:

1. The link grammar dictionary, which, for each word in English, contains a number of links of different types. Some links point left, some point right, and each link is labeled. Furthermore, some links are required and others are optional.
2. The "no-links-cross" constraint, which states that the correct parse of a sentence will involve drawing links between words, in such a way that all the required links of each word are fulfilled, and no two links cross when the links are depicted in two dimensions.
3. A processing algorithm, which involves first searching the space of all possible linkages among the words in a sentence to find all complete linkages that obey the no-links-cross constraint; and then applying various postprocessing rules to handle cases (such as conjunctions) that aren't handled properly by this algorithm.

In PROWL, what we suggest is that:

1.
The link grammar dictionary is highly valuable and provides a level of linguistic detail that is not present in Word Grammar; and, we suggest that in order to turn Word Grammar into a computationally tractable system, one will need something at least halfway between the currently minimal collection of syntactic link types used in Word Grammar and the much richer collection used in Link Grammar.
2. The no-links-cross constraint is an approximation of a deeper syntactic constraint ("landmark transitivity") that has been articulated in the most recent formulations of Word Grammar. Specifically: when a no-links-crossing parse is found, it is correct according to Word Grammar; but Word Grammar correctly recognizes some parses that violate this constraint.
3. The Link Grammar parsing algorithm is not cognitively natural, but is effective in a standalone-parsing framework. The Word Grammar approach to parsing is cognitively natural, but as formulated could only be computationally implemented in the context of an already-very-powerful general intelligence system. Fortunately, various intermediary approaches to parsing seem possible.

44.14.3.1 Using Landmark Transitivity with the Link Grammar Dictionary

An earlier version of Word Grammar utilized a constraint called "no tangled links", which is equivalent to the link parser's "no links cross" constraint. In the new version of Word Grammar this is replaced with a subtler and more permissive constraint called "landmark transitivity." While in Word Grammar landmark transitivity is used with a small set of syntactic link types, there is no reason why it can't be used with the richer set of link types that Link Grammar provides. In fact, this seems to us a probably effective method of eliminating most or all of the "postprocessing rules" that exist in the link parser, and that constitute the least elegant aspect of the Link Grammar framework.
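Stated computationally, the no-links-cross constraint is just a planarity check on a candidate linkage. The following minimal sketch assumes links are represented simply as pairs of word indices - an illustrative representation, not the actual link parser's data structures:

```python
# Minimal sketch of the "no-links-cross" (planarity) check over a
# candidate linkage. Words are indexed 0..n-1 and each link is a pair
# of word indices; this representation is illustrative only.

def no_links_cross(links):
    links = [tuple(sorted(l)) for l in links]
    for a in range(len(links)):
        for b in range(a + 1, len(links)):
            (i, j), (k, l) = links[a], links[b]
            # Two links cross iff exactly one endpoint of one link lies
            # strictly inside the span of the other.
            if i < k < j < l or k < i < l < j:
                return False
    return True
```

Nested and chained links pass the check; interleaved links fail it, which is exactly the two-dimensional "no two links cross" picture described above.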
The first foundational concept, on the path to the notion of landmark transitivity, is the notion of a syntactic parent. In Word Grammar each syntactic link has a parent end and a child end. In a dependency grammar context, the notion is that the child depends upon the parent. For instance, in Word Grammar, in the link between a noun and an adjective, the noun is the parent.

To apply landmark transitivity in the context of the Link Grammar, one needs to provide some additional information regarding each link in the Link Grammar dictionary. One needs to specify which end of each of the link grammar links is the "parent" and which is the "child." Examples of this kind of markup are as follows (with (P) marking the parent):

• S link: subject noun - finite verb (P)
• O link: transitive verb (P) - direct or indirect object
• D link: determiner - noun (P)
• MV link: verb (P) - verb modifier
• J link: preposition - object (P)
• ON link: on - time-expression (P)
• M link: noun (P) - modifiers

In some cases a word may have more than one parent. In this case, the rule is that the landmark is the one that is superordinate to all the other parents. In the rare case that two words are each other's parents, then either may serve as the landmark.

The concept of a parent leads naturally into that of a landmark. The first rule regarding landmarks is that a parent is a landmark for its child. Next, two kinds of landmarks are introduced: Before landmarks (in which the child is before the parent) and After landmarks (in which the child is after the parent). The Before/After distinction should be obvious in the Link Grammar examples given above.

The landmark transitivity rule, then, has two parts. If A is a landmark for B, of subtype L (where L is either Before or After), then:

1. Subordinate transitivity says that if B is a landmark for C, then A is also a type-L landmark for C
2.
Sister transitivity says that if A is a landmark for C, then B is also a landmark for C

Finally, there are some special link types that cause a word to depend on its grandparents or higher ancestors as well as its parents. We note that these are not treated thoroughly in (Hudson, 2007); one needs to look to the earlier, longer and rarer work [Hud90]. Some questions are dealt with this way. Another example is what in Word Grammar is called a "proxy link", as occurs between "with" and "whom" in

The person with whom she works
The link parser deals with this particular example via a Jw link:

[link-parser diagram for "The person with whom she works is silly", in which "with" and "whom" are joined by a Jw link]

So to apply landmark transitivity in the context of the Link Grammar, in this case, it seems one would need to implement the rule that in the case of two words connected by a Jw-link, the child of one of the words is also the child of the other. Handling other special cases like this in the context of Link Grammar seems conceptually unproblematic, though naturally some hidden rocks may appear. Basically a list needs to be made of which kinds of link parser links embody proxy relationships for which other kinds of link parser links.

According to the landmark transitivity approach, then, the criterion for syntactic correctness of a parse is that, if one takes the links in the parse and applies the landmark transitivity rule (along with the other special-case "raising" rules we've discussed), one does not arrive at any contradictions (i.e. no situations where A is both a Before landmark and an After landmark of B).

The main problem with the landmark-transitivity constraint seems to be computational tractability. The problem exists for both comprehension and generation, but we'll focus on comprehension here. To find all possible parses of a sentence using Hudson's landmark-transitivity-based approach, one needs to find all linkages that don't lead to contradictions when used as premises for reasoning based on the landmark-transitivity axioms. This appears to be extremely computationally intensive! So, it seems that Word Grammar style parsing is only computationally feasible for a system that has extremely strong semantic understanding, so as to be able to filter out the vast majority of possible parses on semantic rather than purely syntactic grounds.
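The contradiction test just described can be sketched directly. The toy code below implements the subordinate-transitivity closure and the resulting consistency check; sister transitivity and the special-case "raising" rules are omitted for brevity, and the triple representation of landmarks is our own illustrative choice, not Word Grammar's actual formalism:

```python
# Landmarks are triples (parent, child, subtype), subtype "Before" or
# "After". We compute the subordinate-transitivity closure and then
# check for contradictions (A being both a Before and an After
# landmark of B).

def subordinate_closure(landmarks):
    lm = set(landmarks)
    changed = True
    while changed:
        changed = False
        for (a, b, t) in list(lm):
            for (b2, c, _) in list(lm):
                # If A is a type-t landmark for B, and B is a landmark
                # for C, then A is a type-t landmark for C.
                if b2 == b and a != c and (a, c, t) not in lm:
                    lm.add((a, c, t))
                    changed = True
    return lm

def contradictory(landmarks):
    lm = subordinate_closure(landmarks)
    return any((a, b, "After") in lm
               for (a, b, t) in lm if t == "Before")
```

Even in this simplified form, the check requires computing a full closure per candidate linkage, which hints at why searching all linkages under this criterion is so expensive.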
On the other hand, it seems possible to apply landmark-transitivity together with no-links-cross, to provide parsing that is both efficient and general. If applying the no-links-cross constraint finds a parse in which no links cross, without using postprocessing rules, then this will always be a legal parse according to the landmark-transitivity rule. However, landmark-transitivity also allows a lot of other parses that Link Grammar either needs postprocessing rules to handle, or can't find even with postprocessing rules. So, it would make sense to apply no-links-cross parsing first, but then if this fails, apply landmark-transitivity parsing starting from the partial parses that the former stage produced. This is the approach suggested in PROWL, and a similar approach may be suggested for language generation.

44.14.3.2 Overcoming the Current Limitations of Word Grammar

Finally, it is worth noting that expanding the Word Grammar parsing framework to include the link grammar dictionary will likely allow us to solve some unsolved problems in Word Grammar. For instance, [Hud07a] notes that the current formulation of Word Grammar has no way to distinguish the behavior of last vs. this in

I ate last night
I ate this ham

The issue he sees is that in the first case, night should be considered the parent of last; whereas in the second case, this should be considered the parent of ham.
The current link parser also fails to handle this issue according to Hudson's intuition:

[link-parser diagrams for "I ate last night" and "I ate this ham"]

However, the link grammar framework gives us a clear possibility for allowing the kind of interpretation Hudson wants: just allow this to take a left-going O-link, and (in PROWL) let it optionally assume the parent role when involved in a D-link relationship. There are no funky link-crossing or semantic issues here; just a straightforward link-grammar dictionary edit. This illustrates the syntactic flexibility of the link parsing framework, and also its inelegance - adding new links to the dictionary generally solves syntactic problems, but at the cost of creating more complexity to be dealt with further down the pipeline, when the various link types need to be compressed into a smaller number of semantic relationship types for purposes of actual comprehension (as is done in RelEx, for example). However, as far as we can tell, this seems to be a necessary cost for adequately handling the full complexity of natural language syntax. Word Grammar holds out the hope of possibly avoiding this kind of complexity, but without filling in enough details to allow a clear estimate of whether this hope can ever be fulfilled.

44.14.4 Contextually Guided Greedy Parsing and Generation Using Word Link Grammar

Another difference between Link Grammar as currently utilized, and Word Grammar as described, is the nature of the parsing algorithm. Link Grammar operates in a manner that is fairly traditional among contemporary parsing algorithms: given a sentence, it produces a large set of possible parses, and then it is left to other methods/algorithms to select the right parse, and to form a semantic interpretation of the selected parse.
Parse selection may of course involve semantic interpretation: one way to choose the right parse is to choose the one that has the most contextually sensible semantic interpretation. We may call this approach whole-sentence purely-syntactic parsing, or WSPS parsing. One of the nice things about Link Grammar, as compared to many other computational parsing frameworks, is that it produces a relatively small number of parses, compared for instance to typical head-driven phrase-structure grammar parsers. For simple sentences the link parser generally produces only a handful of parses. But for complex sentences the link parser can produce hundreds of parses, which can be computationally costly to sift through.
Word Grammar, on the other hand, presents far fewer constraints regarding which words may link to other words. Therefore, to apply parsing in the style of the current link parser, in the context of Word Grammar, would be completely infeasible. The number of possible parses would be tremendous. The idea of Word Grammar is to pare down parses via semantic/pragmatic sensibleness, during the course of the syntax parsing process, rather than breaking things down into two phases (parsing followed by semantic/pragmatic interpretation). Parsing is suggested to proceed forward through a sentence: when a word is encountered, it is linked to the words coming before it in the sentence, in a way that makes sense. If this seems impossible, consistently with the links that have already been drawn in the course of the parsing process, then some backtracking is done and prior choices may be revisited. This approach is more like what humans do when parsing a sentence, and does not have the effect of producing a large number of syntactically possible, semantically/pragmatically absurd parses, and then sorting through them afterwards. It is what we call a contextually-guided greedy parsing (CGGP) approach.

For language generation, the link parser and Word Grammar approaches also suggest different strategies. Link Grammar suggests taking a semantic network, then searching holistically for a linear sequence of words that, when link-parsed, would give rise to that semantic network as the interpretation. On the other hand, Word Grammar suggests taking that same semantic network and iterating through it progressively, verbalizing each node of the network as one walks through it, and backtracking if one reaches a point where there is no way to verbalize the current node consistently with how one has already verbalized the previous nodes.
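The CGGP idea can be sketched as a simple backtracking search. In the toy code below, the plausible oracle is a stand-in for the combined syntactic/semantic/pragmatic filtering described above; the link representation and the most-recent-first ordering are illustrative assumptions, not part of any actual implementation:

```python
# Toy sketch of contextually-guided greedy parsing (CGGP): process words
# left to right, link each new word to some earlier word the oracle
# accepts, and backtrack when no consistent continuation exists.

def cggp(words, plausible):
    """Return one list of links (earlier_idx, later_idx), or None."""
    def extend(i, links):
        if i == len(words):
            return links
        # Try linking word i to each earlier word, most recent first
        # (recency is a crude stand-in for contextual guidance).
        for j in reversed(range(i)):
            link = (j, i)
            if plausible(words, links, link):
                result = extend(i + 1, links + [link])
                if result is not None:
                    return result      # greedy: commit to first success
                # otherwise backtrack and revisit this choice
        return None
    return extend(1, [])
```

The greedy commitment plus backtracking is what distinguishes this from WSPS parsing: only one contextually sensible linkage is pursued at a time, rather than all syntactically possible ones being enumerated up front.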
The main observation we want to make here is that, while Word Grammar by its nature (due to the relative paucity of explicit constraints on which syntactic links may be formed) can operate with CGGP but not WSPS parsing, Link Grammar, though currently utilized with WSPS parsing, can be used with CGGP parsing just as well. There is no objection to using CGGP parsing together with the link-parser dictionary, nor with the no-links-cross constraint rather than the landmark-transitivity constraint (in fact, as noted above, earlier versions of Word Grammar made use of the no-links-cross constraint).

What we propose in PROWL is to use the link grammar dictionary together with the CGGP parsing approach. The WSPS parsing approach may perhaps be useful as a fallback for handling extremely complex and perverted sentences where CGGP takes too long to come to an answer - it corresponds to sentences that are so obscure one has to do really hard, analytical thinking to figure out what they mean.

Regarding constraints on link structure, the suggestion in PROWL is to use the no-links-cross constraint as a first approximation. In comprehension, if no sufficiently high-probability interpretation obeying the no-links-cross constraint is found, then the scope of investigation should expand to include link-structures obeying landmark-transitivity but violating no-links-cross. In generation, things are a little subtler: a list should be kept of link-type combinations that often correctly violate no-links-cross, and when these combinations are encountered in the generation process, then constructs that satisfy landmark-transitivity but not no-links-cross should be considered.

Arguably, the PROWL approach is less elegant than either Link Grammar or Word Grammar considered on its own.
However, we are dubious of the proposition that human syntax processing, with all its surface messiness and complexity, is really generated by a simple, unified, mathematically elegant underlying framework. Our goal is not to find a maximally elegant theoretical framework, but rather one that works both as a standalone computational-linguistics system, and as an integrated component of an adaptively-learning AGI system.
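Schematically, the PROWL comprehension strategy proposed in this section reduces to a simple two-stage control flow. In the sketch below, the two parser arguments are placeholders for real components, not actual OpenCog functions:

```python
# Illustrative control flow for PROWL comprehension: attempt strict
# no-links-cross parsing first, and only widen the search to
# landmark-transitivity-only linkages if the strict stage fails.

def prowl_comprehend(sentence, planar_parse, landmark_parse):
    parses = planar_parse(sentence)     # strict: no links cross
    if parses:
        return parses
    # Widen the scope: admit linkages that violate no-links-cross but
    # still satisfy landmark transitivity.
    return landmark_parse(sentence)
```

The point of the ordering is cost: the strict stage is cheap and handles the common case, while the expensive landmark-transitivity search is reserved for the minority of sentences that genuinely need it.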
44.15 Aspects of Language Learning

Now we finally turn to language learning - a topic that spans the engineered and experiential approaches to NLP. In the experiential approach, learning is required to gain even simple linguistic functionality. In the engineered approach, even if a great deal of linguistic functionality is built in, learning may be used for adding new functionality and modifying the initially given functionality. In this section we will focus on a few aspects of language learning that would be required even if the current engineered OpenCog comprehension pipeline were completed to a high level of functionality. The more thoroughgoing language learning required for the experiential approach will then be discussed in the following section. Further, Chapter 45 will dig in depth into an aspect of language learning that to some extent cuts across the engineered/experiential dichotomy - unsupervised learning of linguistic structures from large corpora of text.

44.15.1 Word Sense Creation

In our examples above, we've frequently referred to ReferenceLinks between WordNodes and ConceptNodes. But how do these links get built? One aspect of this is the process of word sense creation. Suppose we have a WordNode W that has ReferenceLinks to a number of different ConceptNodes. A common case is that these ConceptNodes fall into clusters, each one denoting a "sense" of the word. The clusters are defined by the following relationships:

1. ConceptNodes within a cluster have high-strength SimilarityLinks to each other
2. ConceptNodes in different clusters have low-strength (i.e. dissimilarity-denoting) SimilarityLinks to each other

When a word is first learned, it will normally be linked only to mutually similar ConceptNodes, i.e. there will only be one sense of the word. As more and more instances of the word are seen, however, eventually the WordNode will gather more than one sense.
Sometimes different senses differ syntactically; other times they differ only semantically, but are involved in the same syntactic relationships. In the case of a word with multiple senses, most of the relevant feature structure information will be attached to word-sense-representing ConceptNodes, not to WordNodes themselves. The formation of sense-representing ConceptNodes may be done by the standard clustering and predicate mining processes, which will create such ConceptNodes when there are adequately many Atoms in the system satisfying the relevant criteria. It may also be valuable to create a particular SenseMining CIM-Dynamic, which uses the same criteria for node formation as the clustering and predicate mining CIM-Dynamics, but focuses specifically on creating predicates related to WordNodes and their nearby ConceptNodes.
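As a toy illustration of the word-sense-creation process, the sketch below groups a word's ConceptNodes into sense clusters by single-link threshold clustering over similarity strengths. The clustering method, threshold, and representation are arbitrary stand-ins for whatever clustering CIM-Dynamics the system actually runs:

```python
# Group a word's associated concepts into sense clusters: a concept
# joins a cluster if it has a strong-enough SimilarityLink to any
# member; clusters bridged by a new concept are merged.

def sense_clusters(concepts, similarity, threshold=0.5):
    clusters = []
    for c in concepts:
        merged = None
        for cl in clusters:
            if any(similarity(c, d) >= threshold for d in cl):
                if merged is None:
                    cl.append(c)
                    merged = cl
                else:
                    merged.extend(cl)   # c bridges two clusters: merge
                    cl.clear()
        clusters = [cl for cl in clusters if cl]
        if merged is None:
            clusters.append([c])
    return clusters
```

Each resulting cluster would then be reified as a sense-representing ConceptNode, with the word's ReferenceLinks redistributed accordingly.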
44.15.2 Feature Structure Learning

We've mentioned above the obvious fact that, to intelligently use a feature-structure based grammar, the system needs to be capable of learning new linguistic feature structures. Probing into this in more detail, we see that there are two distinct but related kinds of feature structure learning:

1. learning the values that features have for particular word senses
2. learning new features altogether

Learning the values that features have for particular word senses must be done when new senses are created; and even for features imported from resources like the link grammar, the possibility of corrections must obviously be accepted. This kind of learning can be done by straightforward inference - inference from examples of word usage, and by analogy from features for similar words. A simple example to think about, e.g., is learning the verb sense of "fax" when only the noun sense is known.

Next, the learning of new features can be viewed as a reasoning problem, in that inference can learn new relations applied to nodes representing syntactic senses of words. In principle, these "features" may be very general or very specialized, depending on the case. New feature learning, in practice, requires a lot of examples, and is a more fundamental but less common kind of learning than learning feature values for known word senses. A good example would be the learning of "third person" by an agent that knows only first and second person. In this example, it's clear that information from embodied experience would be extremely helpful. In principle, it could be learned from corpus analysis alone - but the presence of knowledge that certain words ("him", "her", "they", etc.) tend to occur in association with observed agents different from the speaker or the hearer would certainly help a lot with identifying "third person" as a separate construct.
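Learning feature values by analogy, as in the "fax" example above, can be caricatured as a similarity-weighted vote over words whose feature values are already known. All names, features, and weights below are invented for illustration and do not come from any actual OpenCog feature inventory:

```python
# Infer an unknown feature value for a word by a similarity-weighted
# vote over similar words that already have the feature.

def infer_feature(word, feature, known, similar):
    """known: {word: {feature: value}};
    similar: {word: [(other_word, weight), ...]}"""
    votes = {}
    for other, w in similar.get(word, []):
        value = known.get(other, {}).get(feature)
        if value is not None:
            votes[value] = votes.get(value, 0.0) + w
    if not votes:
        return None
    return max(votes, key=votes.get)
```

In PLN terms this corresponds to analogy-based inference over similar word senses; the dictionary-based vote here merely illustrates the shape of the computation.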
It seems that either a very large number of un-embodied examples, or a relatively small number of embodied examples, would be needed to support the inference of the "third person" feature. And we suspect this example is typical - i.e. that the most effective route to new feature structure learning involves both embodied social experience and rather deep commonsense knowledge about the world.

44.15.3 Transformation and Semantic Mapping Rule Learning

Word sense learning and feature structure learning are important parts of language learning, but they're far from the whole story. An equally important role is played by linguistic transformations, such as the rules used in RelEx and RelEx2Frame. At least some of these must be learned based on experience, for human-level intelligent language processing to proceed. Each of these transformations can be straightforwardly cast as an ImplicationLink between PredicateNodes, and hence formalistically can be learned by PLN inference, combined with one or another heuristic method for compound predicate creation. The question is what knowledge exists for PLN to draw on in assessing the strengths of these links, and more critically, to guide the heuristic predicate formation methods. This is a case that likely requires the full complexity of "integrative predicate learning" as discussed in Chapter 41. And, as with feature structure learning, it's a case that will be much more effectively handled using knowledge from social embodied experience alongside purely linguistic knowledge.
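The casting of a transformation as an implication between predicates can be shown in miniature: if a set of syntactic relations matching the rule's antecedent is present, the rule asserts its consequent semantic relation with some strength. The rule and relation encodings below are made up for illustration and are not actual RelEx or RelEx2Frame rules:

```python
# A transformation rule as (antecedent, consequent, strength): if the
# antecedent pattern of syntactic relations is present, assert the
# consequent semantic relation, weighted by the rule's strength (a
# crude stand-in for a PLN truth value on an ImplicationLink).

def apply_rule(rule, relations):
    antecedent, consequent, strength = rule
    if antecedent.issubset(relations):
        return (consequent, strength)
    return None
```

Learning such a rule then amounts to learning the ImplicationLink itself - its antecedent pattern, consequent, and strength - which is exactly the kind of structure PLN inference plus predicate formation is designed to acquire.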
44.16 Experiential Language Learning

We have talked a great deal about "engineered" approaches to NL comprehension and only peripherally about experiential approaches. But there has been a not-so-secret plan underlying this approach. There are many approaches to experiential language learning, ranging from a "tabula rasa" approach in which language is just treated as raw data, to an approach where the whole structure of a language comprehension system is programmed in, and "merely" the content remains to be learned. There isn't much to say about the tabula rasa approach - we have already discussed CogPrime's approach to learning, and in principle it is just as applicable to language learning as to any other kind of learning. The more structured approach has more unique aspects to it, so we will turn attention to it here. Of course, various intermediate approaches may be constructed by leaving out various structures.

The approach to experiential language learning we consider most promising is based on the PROWL approach, discussed above. In this approach one programs in a certain amount of "universal grammar," and then allows the system to learn content via experience that obeys this universal grammar. In a PROWL approach, the basic linguistic representational infrastructure is given by the Atomspace that already exists in OpenCog, so the content of "universal grammar" is basically:

• the propensity to identify words
• the propensity to create a small set of asymmetric (i.e. parent/child) labeled relationship types, to use to label relationships between semantically related word-instances. These are "syntactic link types."
• the set of constraints on syntactic links implicit in Word Grammar, e.g. landmark transitivity or no-links-cross

Building in the above items, without building in any particular syntactic links, seems enough to motivate a system to learn a grammar resembling that of human languages.
Of course, experiential language learning of this nature is very, very different from "tabula rasa" experiential language learning. But we note that, while PROWL style experiential language learning seems like a difficult problem given existing AI technologies, tabula rasa language learning seems like a nearly unapproachable problem. One could infer from this that current AI technologies are simply inadequate to approach the problem that the young human child mind solves. However, there seems to be some solid evidence that the young human child mind does contain some form of universal grammar guiding its learning. Though we don't yet know what form this universal prior linguistic knowledge takes in the human mind or brain, the evidence regarding common structures arising spontaneously in various unrelated Creole languages is extremely compelling, supporting ideas presented previously based on different lines of evidence. So we suggest that PROWL based experiential language learning is actually conceptually closer to human child language learning than a tabula rasa approach - although we certainly don't claim that the PROWL based approach builds in the exact same things as the human genome does.

What we need to make experiential language learning work, then, is a language-focused inference-control mechanism that includes, e.g.:

• a propensity to look for syntactic link types, as outlined just above
• a propensity to form new word senses, as outlined earlier
• a propensity to search for implications of the general form of RelEx and RelEx2Frame or Syn2Sem rules

Given these propensities, it seems reasonable to expect a PLN inference system to be able to "fill in the linguistic content" based on its experience, using links between linguistic and other experiential content as its guide. This is a very difficult learning problem, to be sure, but it seems in principle a tractable one, since we have broken it down into a number of interrelated component learning problems in a manner guided by the structure of language. Other aspects of language comprehension, such as word sense disambiguation and anaphor resolution, seem to plausibly follow from applying inference to linguistic data in the context of embodied experiential data, without requiring special attention to inference control or supplying prior knowledge. Chapter ?? presents an elaboration of this sort of perspective, in a limited case which enables greater clarity: the learning of linguistic content from an unsupervised corpus, based on the assumption of the linguistic infrastructure just summarized above.

44.17 Which Path(s) Forward?

We have discussed a variety of approaches to achieving human-level NL comprehension in the CogPrime framework. Which approach do we think is best? All things considered, we suspect that a tabula rasa experiential approach is impractical, whereas a traditional computational linguistics approach (whether based on hand-coded rules, corpus analysis, or a combination thereof) will reach an intelligence ceiling well short of human capability. On the other hand, we believe that all of these options:

1. the creation of an engineered NL comprehension system (as we have already done), and the adaptation and enhancement of this system using learning that incorporates knowledge from embodied experience
2.
the creation of an engineered NL comprehension system via unsupervised learning from a large corpus, as described in Chapter ?? below
3. the creation of an experiential learning based NL comprehension system using in-built structures, such as the PROWL based approach described above
4. the creation of an experiential learning based system as described above, using an engineered system (like the current one) as a "fitness estimation" resource in the manner described at the end of Chapter 43

have significant promise and are worthy of pursuit. Which of these approaches we focus on in our ongoing OpenCogPrime implementation work will depend on logistical issues as much as on theoretical preference.
Chapter 45
Language Learning via Unsupervised Corpus Analysis

Co-authored with Linas Vepstas

45.1 Introduction

The approach taken to NLP in the OpenCog project up through 2013, in practice, has involved engineering and integrating rule-based NLP systems as "scaffolding", with a view toward later replacing the rule content with alternative content learned via an OpenCog system's experience. In this chapter we present a variant on this approach, in which the rule content of the existing rule-based NLP system is replaced with new content learned via unsupervised corpus analysis. This content can then be modified and improved via an OpenCog system's experience, embodied and otherwise, as needed.

This unsupervised corpus analysis based approach deviates fairly far from human cognitive science. However, as discussed above, language processing is one of those areas where the pragmatic differences between young humans and early-stage AGI systems may be critical to consider. The automated learning of language from embodied, social experience is a key part of the path to AGI, and is one way that CogPrimes and other AGI systems should learn language. On the other hand, unsupervised corpus based language learning may perhaps also have a significant role to play in the path to linguistically savvy AGI, leveraging some advantages that AGIs have that humans do not, such as direct access to massive amounts of online text (without the need to filter the text through slow-paced sense-perception systems like eyes).

The learning of language from unannotated text corpora is not a major pursuit within the computational linguistics community currently. Supervised learning of linguistic structures from expert-annotated corpora plays a large role, but this is a wholly different sort of pursuit, more analogous to rule-based NLP, in that it involves humans explicitly specifying formal linguistic structures (e.g. parse trees for sentences in a corpus).
However, we hypothesize that unsupervised corpus-based language learning can be carried out by properly orchestrating the use of some fairly standard machine learning algorithms (already included in OpenCog/CogPrime), within an appropriate structured framework (such as OpenCog's current NLP framework).

The review of [KM04] provides a summary of the state of the art in automatic grammar induction (the third alternative listed above), as it stood a decade ago: it addresses a number of linguistic issues and difficulties that arise in actual implementations of algorithms. It is also notable in that it builds a bridge between phrase-structure grammars and dependency grammars, essentially pointing out that these are more or less equivalent, and that, in fact, significant progress can be achieved by taking on both points of view at once. Grammar induction has progressed somewhat since this review was written, and we will mention some of the more recent work below; but yet, it is fair to say that there has been no truly dramatic progress in this direction.

In this chapter we describe a novel approach to achieving automated grammar induction, i.e. to machine learning of linguistic content from a large, unannotated text corpus. The methods described may also be useful for language learning based on embodied experience; and may make use of content created using hand-coded rules or machine learning from annotated corpora. But our focus in this chapter will be on learning linguistic content from a large, unannotated text corpus.

The algorithmic approach given in this chapter is wholly in the spirit of the "PROWL" approach reviewed above in Chapter 44. However, PROWL is a quite general idea. Here we present a highly specific PROWL-like algorithm, which is focused on learning from a large unannotated corpus rather than from embodied experience. Because of the corpus-oriented focus, it is possible to tie the algorithm of this chapter in with the statistical language learning literature, more tightly than is possible with PROWL language learning in general. Yet, the specifics presented here could largely be generalized to a broader PROWL context.

(Footnote: Dr. Vepstas would properly be listed as the first author of this chapter; this material was developed in a collaboration between Vepstas and Goertzel. However, as with all the co-authored chapters in this book, final responsibility for any flaws in the presentation of the material lies with Ben Goertzel, the chief author of the book.)
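As one concrete point of contact between this approach and the statistical language learning literature, consider a basic ingredient of many grammar-induction schemes: the mutual information of nearby word pairs, which can seed the induction of syntactic link types. A minimal sketch follows; the windowing and normalization choices here are arbitrary simplifications, not the specific statistics the algorithm described below actually uses:

```python
# Estimate pointwise mutual information for word pairs occurring within
# a small window, from a tokenized corpus (list of token-list sentences).

import math
from collections import Counter

def pair_mi(sentences, max_dist=2):
    word_count, pair_count = Counter(), Counter()
    total_words, total_pairs = 0, 0
    for s in sentences:
        total_words += len(s)
        word_count.update(s)
        for i in range(len(s)):
            for j in range(i + 1, min(i + 1 + max_dist, len(s))):
                pair_count[(s[i], s[j])] += 1
                total_pairs += 1
    mi = {}
    for (a, b), n in pair_count.items():
        p_ab = n / total_pairs
        p_a = word_count[a] / total_words
        p_b = word_count[b] / total_words
        mi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return mi
```

High-MI pairs are candidates for syntactic linkage; in practice one would smooth the counts and use a much larger corpus, per the closing remarks of this section.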
We consider the approach described here as "deep learning" oriented because it is based on hierarchical pattern recognition in linguistic data: identifying patterns, then patterns among these patterns, etc., in a hierarchy that allows "higher level" (more abstract) patterns to feed back down the hierarchy and affect the recognition of lower level patterns. Our approach does not use conventional deep learning architectures like deep Boltzmann machines or recurrent neural networks. Conceptually, our approach is based on a similar intuition to these algorithms, in that it relies on the presence of hierarchical structure in its input data, and utilizes a hierarchical pattern recognition structure with copious feedback to adaptively identify this hierarchical structure. But the specific pattern recognition algorithms we use, and the specific nature of the hierarchy we construct, are guided by existing knowledge about what works and what doesn't in (both statistical and rule-based) computational linguistics.

While the overall approach presented here is novel, most of the detailed ideas are extensions and generalizations of the prior work of multiple authors, which will be referenced and in some cases discussed below. In our view, the body of ideas needed to enable unsupervised learning of language from large corpora has been gradually emerging during the last decade. The approach given here has unique aspects, but also many aspects already validated by the work of others.

For the sake of simplicity, we will deal here only with learning from written text. We believe that conceptually very similar methods can be applied to spoken language as well, but this brings extra complexities that we will avoid for the purposes of the present document. (In short: below we represent syntactic and semantic learning as separate but similarly structured and closely coupled learning processes.
To handle speech input thoroughly, we would suggest phonological learning as another separate, similarly structured and closely coupled learning process.)

Finally, we stress that the algorithms presented here are intended to be used in conjunction with a large corpus and a large amount of processing power. Without a very large corpus, some of the feedbacks required for the learning process described would be unlikely to happen (e.g. the ability of syntactic and semantic learning to guide each other). We have not yet sought
to estimate exactly how large a corpus would be required, but our informal estimate is that Wikipedia might or might not be large enough, and the Web is certainly more than enough.

We don't pretend to know just how far this sort of unsupervised, corpus-based learning can be pushed. To what extent can the content of a natural language like English be learned this way? How much ambiguity, if any, will be left over once this kind of learning has been thoroughly done (ambiguity only pragmatically disambiguable via embodied social learning)? Strong opinions on these sorts of issues abound in the cognitive science, linguistics and AI communities; but the only apparent way to resolve these questions is empirically.

45.2 Assumed Linguistic Infrastructure

While the approach outlined in this chapter aims to learn the linguistic content of a language from textual data, it does not aim to learn the idea of language. Implicitly, we assume a model in which a learning system begins with a basic "linguistic infrastructure" indicating the various parts of a natural language and how they generally interrelate; it then learns the linguistic content characterizing a particular language. In principle, it would also be possible to have an AI system learn the very concept of a language and build its own linguistic infrastructure. However, that is not the problem we address here; and we suspect such an approach would require drastically more computational resources.

The basic linguistic infrastructure assumed here includes:

• A formalism for expressing grammatical (dependency) rules is assumed.
  - The ideas given here are not tied to any specific grammatical formalism, but as in Chapter ?? we find it convenient to make use of a formalism in the style of dependency grammar [ST91]. Taking a mathematical perspective, different grammar formalisms can be translated into one another, using relatively simple rules and algorithms [KM04].
The primary difference between them is more a matter of taste, perceived linguistic 'naturalness', adaptability, and choice of parser algorithm. In particular, categorial grammars can be converted into link grammars in a straightforward way, and vice versa, but link grammars provide a more compact dictionary. Link grammars [ST91, ST93] are a type of dependency grammar; these, in turn, can be converted to and from phrase-structure grammars. We believe that dependency grammars provide a simpler and more natural description of linguistic phenomena. We also believe that dependency grammars have a more natural fit with maximum-entropy ideas, where a dependency relationship can be literally interpreted as the mutual information between word-pairs [Yur98]. Dependency grammars also work well with Markov models; dependency parsers can be implemented as Viterbi decoders. Figure 44.1 illustrates two different formalisms.
  - The discussion below assumes the use of a formalism similar to that of Link Grammar, as described above. In this theory, each word is associated with a set of 'connector disjuncts', each connector disjunct controlling the possible linkages that the word may take part in. A disjunct can be thought of as a jigsaw puzzle-piece; valid syntactic word orders are those for which the puzzle-pieces can be validly connected. A single connector can be thought of as a single tab on a puzzle-piece (shown in figure ??). Connectors are thus 'types' X with a + or - sign indicating that they connect to the left or right. For
example, a typical verb disjunct might be S- & O+, indicating that a subject (a noun) is expected on the left, and an object (also a noun) is expected on the right.
  - Some of the discussion below assumes select aspects of (Dick Hudson's) Word Grammar [Hud84, Hud07]. As reviewed above, Word Grammar theory (implicitly) uses connectors similar to those of Link Grammar, but allows each connector to be marked as the head of a link or not. A link then becomes an arrow from a head word to the dependent word. (Somewhat confusingly, the head of the arrow points at the dependent word; this means the tail of the arrow is attached to the head word.)
  - Each word is associated with a "lexical entry"; in Link Grammar, this is the set of connector disjuncts for that word. It is usually the case that many words share a common lexical entry; for example, most common nouns are syntactically similar enough that they can all be grouped under a single lexical entry. Conversely, a single word is allowed to have multiple lexical entries; so, for example, "saw", the noun, will have a different lexical entry from "saw", the past tense of the verb "to see". That is, lexical entries can loosely correspond to traditional dictionary entries. Whether or not a word has multiple lexical entries is a matter of convenience, rather than a fundamental aspect. Curiously, a single Link Grammar connector disjunct can be viewed as a very fine-grained part-of-speech. In this way, it is a stepping stone to the semantic meaning of a word.

• A parser, for extracting syntactic structure from sentences, is assumed. What's more, it is assumed that the parser is capable of using semantic relationships to guide parsing.
  - A paradigmatic example of such a parser is the "Viterbi Link Parser", currently under development for use with the Link Grammar. This parser is currently operational in a simple form.
The name refers to its use of the general ideas of the Viterbi algorithm. This algorithm seems biologically plausible, in that it applies only a local analysis of sentence structure, of limited scope, as opposed to a global optimization, thus roughly emulating the process of human listening. The current set of legal parses of a sentence is pruned incrementally and probabilistically, based on flexible criteria. These potentially include the semantic relationships extractable from the partial parse obtained at a given point in time. It also allows for parsing to be guided by inter-sentence relationships, such as pronoun resolution, to disambiguate otherwise ambiguous sentences.

• A formalism for expressing semantic relationships is assumed.
  - A semantic relationship generalizes the notion of a lexical entry, to allow for changes of word order, paraphrasing, tense, number, the presence or absence of modifiers, etc. An example of such a relationship would be eat(X, Y), indicating the eating of some entity Y by some entity X. This abstracts into common form several different syntactic expressions: "Ben ate a cookie", "A cookie will be eaten by Ben", "Ben sat, eating cookies".
  - Nothing particularly special is assumed here regarding semantic relationships, beyond a basic predicate-argument structure. It is assumed that predicates can have arguments that are other predicates, and not just atomic terms; this has an explicit impact on how predicates and arguments are represented. A "semantic representation" of a sentence is a network of arrows (defining predicates and arguments), each arrow or small subset of arrows defining a "semantic relationship". However, the beginning or end of an arrow is not necessarily a single node, but may land on a subgraph.
- Type constraints seem reasonable, but it's not clear if these must be made explicit, or if they are the implicit result of learning. Thus, eat(X, Y) requires that X and Y both be entities, and not, for example, actions or prepositions.
  - We have not yet thought through exactly how rich the semantic formalism should be for handling the full variety of quantifier constructs in complex natural language. But we suspect that it's OK to just use basic predicate-argument relationships and not build explicit quantification into the formalism, allowing quantifiers to be treated like other predicates.
  - Obviously, CogPrime's formalism for expressing linguistic structures in terms of Atoms, presented in Chapter 44, fulfills the requirements of the learning scheme presented in this chapter. However, we wish to stress that the learning scheme presented here does not depend on the particulars of CogPrime's representation scheme, though it is very compatible with them.

45.3 Linguistic Content To Be Learned

Given the above linguistic infrastructure, what remains for a language learning system to learn is the linguistic content that characterizes a particular language. Everything included in OpenCog's existing "scaffolding" rule-based NLP system would, in this approach, be learned to first approximation via unsupervised corpus analysis. Specifically, given the assumed framework, key things to be learned include:

• A list of 'link types' that will be used to form 'disjuncts' must be learned.
  - An example of a link type is the 'subject' link S. This link typically connects the subject of a sentence to the head verb. Given the normal English subject-verb word order, nouns will typically have an S+ connector, indicating that an S link may be formed only when the noun appears to the left of a word bearing an S- connector. Likewise, verbs will typically be associated with S- connectors.
The current Link Grammar contains roughly one hundred different link types, with additional optional subtypes that are used to further constrain syntactic structure. This number of different link types seems required simply because there are many relationships between words: there is not just a subject-verb or verb-object relationship, but also rather fine distinctions, such as those needed to form grammatical time, date, money, and measurement expressions, punctuation use, street addresses, cardinal and ordinal relationships, proper (given) names, titles and suffixes, and other highly constrained grammatical constructions. This is in addition to the usual linguistic territory of needing to indicate dependent clauses, comparatives, subject-verb inversion, and so on. It is expected that a comparable number of link types will need to be learned.
  - Some link types are rather strict, such as those that connect verb subjects and objects, while other types are considerably more ambiguous, such as those involving prepositions. This reflects the structure of English, where subject-verb-object order is fairly rigorously enforced, but the ordering and use of prepositions is considerably looser. When considering the looser cases, it becomes clear that there is no single, inherent 'right answer' for the creation and assignment of link types, and that several different, yet linguistically plausible linkage assignments may be made.
- The definition of a good link type is one that leads the parser, applied across the whole corpus, to allow parsing to be successful for almost all sentences, and yet not to be so broad as to enable parsing of word-salads. Significant pressure must be applied to prevent excess proliferation of link types, yet not so much as to over-simplify things and provide valid parses for unobserved, ungrammatical sentences.

• Lexical entries for different words must be learned.
  - Typically, multiple connectors are needed to define how a word can link syntactically to others. Thus, for example, many verbs have the disjunct S- & O+, indicating that they need a subject noun to the left, and an object to the right. All words have at least a handful of valid disjuncts that they can be used with, and sometimes hundreds or even more. Thus, a "lexical entry" must be learned for each word, the lexical entry being a set of disjuncts that can be used with that word.
  - Many words are syntactically similar; most common nouns can share a single lexical entry. Yet, there are many exceptions. Thus, during learning, there is a back-and-forth process of grouping and ungrouping words: clustering them so that they share lexical entries, but also splitting apart clusters when it is realized that some words behave differently. Thus, for example, the words "sing" and "apologize" are both verbs, and thus share some linguistic structure, but one cannot say "I apologized a song to Vicky"; if these two verbs were initially grouped together into a common lexical entry, they must later be split apart.
  - The definition of a good lexical entry is much the same as that for a good link type: observed sentences must be parsable; random sentences mostly must not be; and excessive proliferation and complexity must be prevented.

• Semantic relationships must be learned.
  - The semantic relationship eat(X, Y) is prototypical.
Foundationally, such a semantic relationship may be represented as a set whose elements consist of syntactico-semantic subgraphs. For the relation eat(X, Y), a subgraph may be as simple as a single (syntactic) disjunct S- & O+ for the normal word order "Ben ate a cookie", but it may also be a more complex set needed to represent the inverted word order in "a cookie was eaten by Ben". The set of all of these different subgraphs defines the semantic relationship. The subgraphs themselves may be syntactic (as in the example above), or they may be other semantic relationships, or a mixture thereof.
  - Not all re-phrasings are semantically equivalent. "Mr. Smith is late" has a rather different meaning from "The late Mr. Smith."
  - In general, place-holders like X and Y may be words or category labels. In the early stages of learning, it is expected that X and Y are each just sets of words. At some point, though, it should become clear that these sets are not specific to this one relationship, but can appropriately take part in many relationships. In the above example, X and Y must be entities (physical objects), and, as such, can participate in (most) any other relationships where entities are called for. More narrowly, X is presumably a person or animal, while Y is a foodstuff. Furthermore, as entities, it might be inferred when these refer to the same physical object (see the section 'reference resolution' below).
  - Categories can be understood as sets of synonyms, including hyponyms (thus, "grub" is a synonym for "food", while "cookie" is a hyponym).
• Idioms and set phrases must be learned.
  - English has a large number of idiomatic expressions whose meanings cannot be inferred from the constituent words (such as "to pull one's leg"). In this way, idioms present a challenge: their sometimes complex syntactic constructions belie their often simpler semantic content. On the other hand, idioms have a very rigid word-choice and word order, and are highly invariant. Set phrases take a middle ground: word-choice is not quite as fixed as for idioms, but, nonetheless, there is a conventional word order that is usually employed. Note that the manually-constructed Link Grammar dictionaries contain thousands of lexical entries for idiomatic constructions. In essence, these are multi-word constructions that are treated as if they were a single word.

Each of the above tasks has already been accomplished and described in the literature; for example, automated learning of synonymous words and phrases has been described by Lin [LP01] and Poon & Domingos [PD09]. The authors are not aware of any attempts to learn all of these, together, in one go, rather than presuming the pre-existence of dependent layers.

45.3.1 Deeper Aspects of Comprehension

While the learning of the above aspects of language is the focus of our discussion here, the search for semantic structure does not end there; more is possible. In particular, natural language generation has a vital need for lexical functions, so that appropriate word-choices can be made when vocalizing ideas. In order to truly understand text, one also needs, at a minimum, to discern referential structure; and sophisticated understanding requires discerning topics. We believe automated, unsupervised learning of these aspects is attainable, but is best addressed after the 'simpler' language learning described above.
We are not aware of any prior work aimed at automatically learning these, aside from relatively simple, unsophisticated (bag-of-words style) efforts at topic categorization.

45.4 A Methodology for Unsupervised Language Learning from a Large Corpus

The language learning approach presented here is novel in its overall nature. Each part of it, however, draws on prior experimental and theoretical research by others on particular aspects of language learning, as well as on our own previous work building computational linguistic systems. The goal is to assemble a system out of parts that are already known to work well in isolation. Prior published research, from a multitude of authors over the last few decades, has already demonstrated how many of the items listed above can be learned in an unsupervised setting (see e.g. [Yur98, KM04, LP01, CSW, PD09, Mih07, CSPC] for relevant background). All of the previously demonstrated results, however, were obtained in isolation, via research that assumed the pre-existence of surrounding infrastructure far beyond what we assume above. The approach proposed here may be understood as a combination, generalization and refinement
of these techniques, to create a system that can learn, more or less ab initio from a large corpus, with a final result of a working, usable natural language comprehension system. However, we must caution that the proposed approach is in no way a haphazard mash-up of techniques. There is a deep algorithmic commonality to the different prior methods we combine, which has not always been apparent in the prior literature due to the different emphases and technical vocabularies used in the research papers in question. In parallel with implementing the ideas presented here, we intend to work on fully formalizing the underlying mathematics of the undertaking, so that it becomes clear what approximations are being taken, and what avenues remain unexplored.

Some fairly specific directions in this regard suggest themselves. All of the prior research alluded to above invokes one or another variation of maximum entropy principles, sometimes explicitly, but usually implicitly. In general, entropy maximization principles provide the foundation for learning systems such as (hidden) Markov models, Markov networks and Hopfield neural networks, and they connect indirectly with Bayesian probability based analyses. However, the actual task of maximizing the entropy is an NP-hard problem; forward progress depends on short-cuts, approximations and clever algorithms, some of which are of a general nature, and some domain-dependent. Part of the task of refining the details of the language learning methodology presented here is to explore various short-cuts and approximations to entropy maximization, and to discover new, clever algorithms of this nature that are relevant to the language learning domain. As has been the case in physics and other domains, we suspect that progress here will be best achieved via a coupled exploration of experimental and mathematical aspects of the subject matter.
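To make the shared statistical core of these methods concrete, the following is a minimal sketch, of our own devising rather than drawn from any of the cited works, of the basic quantity they all manipulate: the pointwise mutual information of adjacent word pairs, estimated from raw corpus counts. The function name and toy corpus are illustrative assumptions only.

```python
import math
from collections import Counter

def pairwise_mi(tokens):
    """Pointwise mutual information log2 p(a,b) / (p(a) p(b)) for each
    adjacent word pair, estimated from raw counts over the token stream."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n, nb = len(tokens), len(tokens) - 1
    return {
        (a, b): math.log2((c / nb) / ((unigrams[a] / n) * (unigrams[b] / n)))
        for (a, b), c in bigrams.items()
    }

tokens = "new york is big . new york is old .".split()
mi = pairwise_mi(tokens)
```

On this toy corpus the pair ("new", "york") receives a higher MI score than, say, (".", "new"), since "new" and "york" co-occur every time either appears; it is exactly this kind of score table that the learning steps described in the next sections consume.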
45.4.1 A High Level Perspective on Language Learning

On an abstract conceptual level, the approach proposed here depicts language learning as an instance of a general learning loop such as:

1. Group together linguistic entities (i.e. words or linguistic relationships, such as those described in the previous section) that display similar usage patterns (where one is looking at usage patterns that are compactly describable given one's meta-language). Many but not necessarily all usage patterns for a given linguistic entity will involve its use in conjunction with other linguistic entities.
2. For each such grouping, make a category label.
3. Add these category labels to one's meta-language.
4. Return to Step 1.

It stands to reason that the result of this sort of learning loop, if successful, will be a hierarchically composed collection of linguistic relationships possessing the following Linguistic Coherence Property: linguistic entities are reasonably well characterizable in terms of the compactly describable patterns observable in their relationships with other linguistic entities.

Note that there is nothing intrinsically "deep" or hierarchical in this sort of linguistic coherence. However, the ability to learn the patterns relating linguistic entities with others, via a recursive hierarchical learning loop such as described above, is contingent on the presence of a fairly marked hierarchical structure in the linguistic data being studied. There is much evidence that such hierarchical structure does indeed exist in natural languages. The "deep learning" in
our approach is embedded in the repeated cycles through the loop given above: each time one goes through the loop, the learning gets one level deeper.

This sort of property has been observed to hold for many linguistic entities, an observation dating back at least to Saussure [dS77] and the start of structuralist linguistics. It is basically a fancier way of saying that the meanings of words and other linguistic constructs may be found via their relationships to other words and linguistic constructs. We are not committed to structuralism as a theoretical paradigm, and we have considerable respect for the aid that non-linguistic information, such as the sensorimotor data that comes from embodiment, can add to language, as should be apparent from the overall discussion in this book. However, the potential dramatic utility of non-linguistic information for language learning does not imply the impossibility or infeasibility of learning language from corpus data alone. It is inarguable that non-linguistic relationships comprise a significant portion of the everyday meaning of linguistic entities; but redundancy is prevalent in natural systems, and we believe that purely linguistic relationships may well provide sufficient data for learning of natural languages. If there are some aspects of natural language that cannot be learned via corpus analysis, it seems difficult to identify what these aspects are via armchair theorizing, and likely that they will only be accurately identified via pushing corpus linguistics as far as it can go.

This generic learning process is a special case of the general process of symbolization, described in Chaotic Logic [Goe94] and elsewhere as a key aspect of general intelligence.
In this process, a system finds patterns in itself and its environment, and then symbolizes these patterns via simple tokens or symbols that become part of the system's native knowledge representation scheme (and hence parts of its "metalanguage" for describing things to itself). Having represented a complex pattern as a simple symbolic token, it can then easily look at other patterns involving this pattern as a component.

Note that in its generic format as stated above, the "language learning loop" is not restricted to corpus-based analysis, but may also include extralinguistic aspects of usage patterns, such as gestures, tones of voice, and the physical and social context of linguistic communication. Linguistic and extra-linguistic factors may come together to comprise "usage patterns." However, the restriction to corpus data does not necessarily denude the language learning loop of its power: it merely restricts one to particular classes of usage patterns, whose informativeness must be empirically determined.

In principle, one might be able to create a functional language learning system based only on a very generic implementation of the above learning loop. In practice, however, biases toward particular sorts of usage patterns can be very valuable in guiding language learning. In a computational language learning context, it may be worthwhile to break down the language learning process into multiple instances of the basic language learning loop, each focused on different sorts of usage patterns, and coupled with each other in specific ways. This is in fact what we will propose here. Specifically, the language learning process proposed here involves:

• One language learning loop for learning purely syntactic linguistic relationships (such as link types and lexical entries, described above), which are then used to provide input to a syntax parser.
• One language learning loop for learning higher-level "syntactico-semantic" linguistic relationships (such as semantic relationships, idioms, and lexical functions, described above), which are extracted from the output of the syntax parser.
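As a toy illustration of the basic group/label/recurse loop described above (group entities by usage pattern, label each group, add the labels to the meta-language, repeat), the following sketch of our own merges symbols that share an identical context signature; this identity criterion is a deliberate oversimplification of the much subtler "similar usage patterns" criterion, and all names are illustrative.

```python
from collections import defaultdict

def context_signatures(tokens):
    """Map each symbol to the set of (left, right) neighbour pairs it occurs with."""
    sigs = defaultdict(set)
    for i, t in enumerate(tokens):
        left = tokens[i - 1] if i > 0 else "<s>"
        right = tokens[i + 1] if i < len(tokens) - 1 else "</s>"
        sigs[t].add((left, right))
    return sigs

def learning_loop(tokens, rounds=3):
    """Toy learning loop: symbols sharing an identical context signature are
    merged under a fresh category label, the corpus is rewritten in terms of
    those labels, and the process repeats on the rewritten corpus."""
    for r in range(rounds):
        groups = defaultdict(list)
        for sym, sig in context_signatures(tokens).items():
            groups[frozenset(sig)].append(sym)
        relabel = {}
        for n, members in enumerate(groups.values()):
            if len(members) > 1:  # only genuine groupings earn a category label
                for m in members:
                    relabel[m] = f"C{r}_{n}"
        if not relabel:  # fixed point: nothing left to merge
            break
        tokens = [relabel.get(t, t) for t in tokens]
    return tokens
```

On the corpus "the cat ran . the dog ran ." the first pass merges "cat" and "dog" into a single category label, since both occur only between "the" and "ran"; subsequent passes find no further merges and the loop halts.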
These two loops are not independent of one another; the second loop can provide feedback to the first, regarding the correctness of the extracted structures; then, as the first loop produces more correct, confident results, the second loop can in turn become more confident in its output. In this sense, the two loops attack the same sort of slow-convergence issues that 'deep learning' tackles in neural-net training.

The syntax parser itself, in this context, is used to extract directed acyclic graphs (dags), usually trees, from the graph of syntactic relationships associated with a sentence. These dags represent parses of the sentence. So the overall scope of the learning process proposed here is to learn a system of syntactic relationships that displays appropriate coherence and that, when fed into an appropriate parser, will yield parse trees that give rise to a system of syntactico-semantic relationships that displays appropriate coherence.

45.4.2 Learning Syntax

The process of learning syntax from a corpus may be understood fairly directly in terms of entropy maximization. As a simple example, consider the measurement of the entropy of the arrangement of words in a sentence. To a fair degree, this can be approximated by the sum of the mutual entropy between pairs of words. Yuret showed that by searching for and maximizing this sum of entropies, one obtains a tree structure that closely resembles that of a dependency parse [Yur98]. That is, the word pairs with the highest mutual entropy are more or less the same as the arrows in a dependency parse, such as that shown in figure 44.1. Thus, an initial task is to create a catalog of word-pairs with a large mutual entropy (mutual information, or MI) between them. This catalog can then be used to approximate the most-likely dependency parse of a sentence, although, at this stage, the link-types are as yet unknown.
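The MI-maximizing tree idea can be sketched as follows. This is our own toy, not Yuret's actual procedure: the MI scores are supplied by hand rather than estimated from a corpus, a greedy Prim-style maximum spanning tree stands in for his search, and the words of the sentence are assumed distinct.

```python
def max_spanning_tree(words, mi):
    """Greedy (Prim-style) maximum spanning tree over the words of a sentence,
    treating the symmetric MI score of each word pair as the edge weight.
    The highest-MI linkage approximates an (untyped) dependency parse."""
    def score(a, b):
        return mi.get((a, b), mi.get((b, a), 0.0))
    connected = {words[0]}
    edges = []
    while len(connected) < len(words):
        best = max(((a, b) for a in connected for b in words if b not in connected),
                   key=lambda e: score(*e))
        edges.append(best)
        connected.add(best[1])
    return edges
```

For instance, with high scores on ("the", "cat") and ("cat", "sat") and a low score on ("the", "sat"), the tree links determiner to noun and noun to verb, mirroring the dependency structure, while the link types remain to be discovered.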
Finding dependency links using mutual information is just the first step to building a practical parser. The generation of high-MI word-pairs works well for isolating which words should be linked, but it does have several major drawbacks. First and foremost, the word-pairs do not come with any sort of classification; there is no link type describing the dependency relationship between two words. Secondly, most words fall into classes (e.g. nouns, verbs, etc.), but the high-MI links do not tell us what these are. A compact, efficient parser appears to require this sort of type information.

To discover syntactic link types, it is necessary to start grouping together words that appear in similar contexts. This can be done with clustering and similarity techniques, which appear to be sufficient to discover not only basic parts of speech (verbs, nouns, modifiers, determiners), but also link types. So, for example, the computation of word-pair MI is likely to reveal the following high-MI word pairs: "big car", "fast car", "expensive car", "red car". It is reasonable to group together the words big, expensive, fast and red into a single category, interpreted as modifiers to car. The grouping can be further refined if these same modifiers are observed with other words (e.g. "big bicycle", "fast bicycle", etc.). This has two effects: it not only reinforces the correctness of the original grouping of modifiers, but also suggests that perhaps cars and bicycles should be grouped together. Thus, one has discovered two classes of words: modifiers and nouns. In essence, one has crudely discovered parts of speech.

The link between these two classes carries a type; the type of that link is defined by these two classes. The use of a pair of word classes to define a link type is a basic premise of categorial grammar [CSW]. In this example, a link between a modifier and a noun would be a type
denoted as M\N in categorial grammar, M denoting the class of modifiers, and N the class of nouns. In the system of Link Grammar, this is replaced by a simple name, but it's really one and the same thing. (In this case, the existing dictionaries use the A link for this relation, with A conjuring up 'adjective' as a mnemonic.) The simple name is a boon for readability, as categorial grammars usually have very complex-looking link-type names: e.g. (NP\S)/NP for the simplest transitive verbs.

Typing seems to be an inherent part of language; types must be extracted during the learning process. The introduction of types here has mathematical underpinnings provided by type theory. An introduction to type theory can be found in [Pro13], and an application of type theory to linguistics can be found in [CSPC]. This is a rather abstract work, but it sheds light on the nature of link types, word-classes, parts-of-speech and the like as formal types of type theory. This is useful in dispelling the seeming taint of ad hoc arbitrariness of clustering: in a linguistic context, it is not so much ad hoc as it is a way of guaranteeing that only certain words can appear in certain positions in grammatically correct sentences, a sort of constraint that seems to be an inherent part of language, and seems to be effectively formalizable via type theory.

Word-clustering, as worked through in the above example, can be viewed as another entropy-maximization technique. It is essentially a kind of factorization of dependent probabilities into most likely factors. By classifying a large number of words as 'modifiers of nouns', one is essentially admitting that they are equi-probable in that role, in the Markovian sense [Ash65] (equivalently, treating them as equally-weighted priors, in the Bayesian probability sense).
That is, given the word "car", we should treat big, fast, expensive and red as being equi-probable (in the absence of other information). Equi-probability is an axiom in Bayesian probability (the axiom of priors), but it derives from the principle of maximum entropy (as any other probability assignment would have a lower entropy).

We have described how link types may be learned in an unsupervised setting. Connector types are then trivially assigned to the left and right words of a word-pair. The dependency graph, as obtained by linking only those word pairs with a high MI, then allows disjuncts to be easily extracted, on a sentence-by-sentence basis. At this point, another stage of pattern recognition may be applied: given a single word, appearing in many different sentences, one should presumably find that this word only makes use of a relatively small, limited set of disjuncts. It is then a counting exercise to determine which disjuncts occur most often for this word: these then form this word's lexical entry. (This "counting exercise" may also be thought of as an instance of frequent subgraph mining, as will be elaborated below.)

A second clustering step may then be applied: it's presumably noticeable that many words use more-or-less the same disjuncts in syntactic constructions. These can then be grouped into the same lexical entry. However, we previously generated a different set of word groupings (into parts of speech), and one may ask: how does that grouping compare to this grouping? Is it close, or can the groupings be refined? If the groupings cannot be harmonized, then perhaps there is a certain level of detail that was previously missed: perhaps one of the groups should be split into several parts. Conversely, perhaps one of the groupings was incomplete, and should be expanded to include more words.
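The "counting exercise" for building lexical entries can be sketched in a few lines. This is an illustrative toy, not the system's actual implementation; the sentences, link types (D for determiner, S for subject) and the data layout are all invented for the example:

```python
from collections import Counter, defaultdict

def extract_disjuncts(parsed_sentences):
    """Tally disjuncts per word. Each parsed sentence is a pair
    (word list, links), where links are (head_index, dep_index, type)
    triples. A word's disjunct is the ordered tuple of typed connectors
    it uses: '-' for a link arriving from the left, '+' for one
    departing to the right (following Link Grammar convention)."""
    tallies = defaultdict(Counter)
    for words, links in parsed_sentences:
        connectors = defaultdict(list)
        for i, j, ltype in links:
            lo, hi = sorted((i, j))
            connectors[hi].append((lo, ltype + "-"))  # connects leftward
            connectors[lo].append((hi, ltype + "+"))  # connects rightward
        for idx, conns in connectors.items():
            disjunct = tuple(c for _, c in sorted(conns))
            tallies[words[idx]][disjunct] += 1
    return tallies

# two sentences sharing one link structure
sents = [
    (["the", "cat", "ran"], [(1, 0, "D"), (2, 1, "S")]),
    (["the", "dog", "ran"], [(1, 0, "D"), (2, 1, "S")]),
]
tallies = extract_disjuncts(sents)
# 'cat' and 'dog' both carry the disjunct (D-, S+): a determiner to the
# left and a subject link to the verb on the right; 'ran' carries (S-,)
# twice, so that disjunct dominates its lexical entry.
```

Disjuncts with the highest counts for a word then constitute that word's provisional lexical entry.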
Thus, there is a certain back-and-forth feedback between these different learning steps, with later steps reinforcing or refining earlier steps, forcing a new revision of the later steps.
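For concreteness, the pairwise mutual-information computation that seeds this whole feedback process can be sketched as follows. This is a toy estimator over an invented ten-sentence corpus, not the actual system; the window size and the corpus contents are arbitrary choices for illustration:

```python
import math
from collections import Counter

def word_pair_mi(sentences, window=2):
    """Estimate pointwise mutual information for word pairs that
    co-occur within a small window, from a toy corpus."""
    word_counts, pair_counts = Counter(), Counter()
    n_words = n_pairs = 0
    for sent in sentences:
        words = sent.lower().split()
        word_counts.update(words)
        n_words += len(words)
        for i, w1 in enumerate(words):
            for w2 in words[i + 1:i + 1 + window]:
                pair_counts[(w1, w2)] += 1
                n_pairs += 1
    mi = {}
    for (w1, w2), c in pair_counts.items():
        p_pair = c / n_pairs
        p1 = word_counts[w1] / n_words
        p2 = word_counts[w2] / n_words
        mi[(w1, w2)] = math.log2(p_pair / (p1 * p2))
    return mi

corpus = ["the big car", "the fast car", "the expensive car", "the red car",
          "the big bicycle", "the fast bicycle",
          "the dog ran", "the cat sat", "the sun rose", "the rain fell"]
mi = word_pair_mi(corpus)
# ("big", "car") scores about 1.91 bits here, above ("the", "car") at
# about 1.58 bits: "the" pairs indiscriminately with everything, so its
# MI with any particular word is diluted.
```

High-MI pairs such as (big, car) and (fast, bicycle) are exactly the candidate dependency links fed into the clustering steps above.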
428 45 Language Learning via Unsupervised Corpus Analysis

45.4.2.1 Loose language

A recognized difficulty with the direct application of Yuret's observation (that the high-MI word-pair tree is essentially identical to the dependency parse tree) is the flexibility of the preposition in the English language [?]. The preposition is so widely used, in such a large variety of situations and contexts, that the mutual information between it and any other word or word-set is rather low (its occurrence is so likely that it carries little information). The two-point, pairwise mutual information provides a poor approximation to what the English language is doing in this particular case. It appears that the situation can be rescued with the use of a three-point mutual information (a special case of interaction information [?]). The discovery and use of such constructs is described in [PD09].

A similar, related issue can be termed "the richness of the MV link type in Link Grammar". This one link type, describing verb modifiers (which includes prepositions), can be applied in a very large class of situations; as a result, discovering this link type, while at the same time limiting its deployment to only grammatical sentences, may prove to be a bit of a challenge. Even in the manually maintained Link Grammar dictionaries, it can present a parsing challenge, because so many narrower cases can often be treated with an MV link.

In summary, some constructions in English are so flexible that it can be difficult to discern a uniform set of rules for describing them; certainly, pairwise mutual information seems insufficient to elucidate these cases. Curiously, these more challenging situations occur primarily with more complex sentence constructions. Perhaps the flexibility is associated with the difficulty that humans have with composing complex sentences; short sentences are almost 'set phrases', while longer sentences can be a semi-grammatical jumble.
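The three-point quantity mentioned above can be estimated directly from counts. The sketch below computes interaction information I(X;Y;Z) = I(X;Y) - I(X;Y|Z) from observed triples; note that sign conventions for this quantity differ across authors, and the example data are invented:

```python
import math
from collections import Counter

def interaction_information(triples):
    """Estimate I(X;Y;Z) = I(X;Y) - I(X;Y|Z) from a list of observed
    (x, y, z) triples, using raw maximum-likelihood counts."""
    n = len(triples)
    cx, cy, cz = Counter(), Counter(), Counter()
    cxy, cxz, cyz, cxyz = Counter(), Counter(), Counter(), Counter()
    for x, y, z in triples:
        cx[x] += 1; cy[y] += 1; cz[z] += 1
        cxy[(x, y)] += 1; cxz[(x, z)] += 1; cyz[(y, z)] += 1
        cxyz[(x, y, z)] += 1
    total = 0.0
    for (x, y, z), c in cxyz.items():
        # expanded form: log [ p(xy) p(xz) p(yz) / (p(x) p(y) p(z) p(xyz)) ]
        total += (c / n) * math.log2(
            (cxy[(x, y)] * cxz[(x, z)] * cyz[(y, z)] * n)
            / (cx[x] * cy[y] * cz[z] * c))
    return total

# independent variables: interaction information is zero
indep = [(x, y, z) for x in "ab" for y in "cd" for z in "ef"]
# fully synergistic XOR triple: -1 bit under this sign convention
xor = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
```

A large-magnitude value for a (verb, preposition, noun) triple would signal exactly the kind of three-way dependence that pairwise MI misses.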
In any case, some of the trouble might be avoided by limiting the corpus to smaller, easier sentences at first, perhaps by working with children's literature.

45.4.2.2 Elaboration of the Syntactic Learning Loop

We now reiterate the syntactic learning process described above in a more systematic way. By getting more concrete, we also make certain assumptions and restrictions, some of which may end up getting changed or lifted in the course of implementation and detailed exploration of the overall approach. What is discussed in this section is merely one simple, initial approach to concretizing the core language learning loop we envision, in a syntactic context.

Syntax, as we consider it here, involves the following basic entities:

• words
• categories of words
• "co-occurrence links", each one defined as (in the simplest case) an ordered pair or triple of words, labeled with an uncertain truth value
• "syntactic link types", each one defined as a certain set of ordered pairs of words
• "disjuncts", each one associated with a particular word w, and consisting of an ordered set of link types involving the word w. That is, each of these links contains at least one word-pair containing w as first or second argument. (This nomenclature comes from Link Grammar; each disjunct is a conjunction of link types. A word is associated with a set of disjuncts. In the course of parsing, one must choose between the multiple disjuncts associated with a word, to fulfill the constraints required of an appropriate parse structure.)
An elementary version of the basic syntactic language learning loop described above would take the following form:

1. Search for high-MI word pairs. Define one's usage links as the given co-occurrence links.
2. Cluster words into categories based on the similarity of their associated usage links.
   • Note that this will likely be a tricky instance of clustering, and classical clustering algorithms may not perform well. One interesting, less standard approach would be to use OpenCog's MOSES algorithm [Loo06] to learn an array of program trees, each one serving as a recognizer for a single cluster, in the same general manner done with Genetic Programming in [BE07].
3. Define initial syntactic link types from categories that are joined by large bundles of usage links.
   • That is, if the words in category C1 have a lot of usage links to the words in category C2, then create a syntactic link type whose elements are (w1, w2), for all w1 ∈ C1, w2 ∈ C2.
4. Associate each word with an extended set of usage links, consisting of: its existing usage links, plus the syntactic links that one can infer for it based on the categories the word belongs to. One may also look at chains of (e.g.) 2 syntactic links originating at the word.
   • For example, suppose cat ∈ C1 and C1 has syntactic link type L1. Suppose (cat, eat) and (dog, run) are both in L1. Then if there is a sentence "The cat likes to run", the link type L1 lets one infer the syntactic link cat → run. The frequency of this syntactic link in a relevant corpus may be used to assign it an uncertain truth value.
   • Given the sentence "The cat likes to run in the park," a chain of syntactic links such as cat → run → park may be constructed.
5. Return to Step 2, but using the extended set of usage links produced in Step 4, with the goal of refining both clusters and the set of link types for accuracy.
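Step 2 of this loop, in its very crudest form, might look like the following greedy sketch. Real clustering would be far more careful; the min_overlap threshold, the greedy first-fit assignment, and the toy link list are all arbitrary simplifications:

```python
from collections import defaultdict

def cluster_by_shared_links(usage_links, min_overlap=2):
    """Group words whose sets of usage links (directed high-MI
    neighbors) overlap strongly. A first-fit greedy sketch of the
    category-formation step, not a serious clustering algorithm."""
    neighbors = defaultdict(set)
    for w1, w2 in usage_links:
        neighbors[w1].add(("R", w2))   # w2 occurs to the right of w1
        neighbors[w2].add(("L", w1))   # w1 occurs to the left of w2
    clusters = []
    for word in sorted(neighbors):
        for cluster in clusters:
            rep = cluster[0]   # compare against the cluster's first member
            if len(neighbors[word] & neighbors[rep]) >= min_overlap:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters

links = [("big", "car"), ("fast", "car"), ("red", "car"),
         ("big", "bicycle"), ("fast", "bicycle"), ("red", "bicycle")]
clusters = cluster_by_shared_links(links)
# big/fast/red share right-neighbors {car, bicycle} and form one
# cluster; car/bicycle share left-neighbors {big, fast, red} and form
# another -- crude modifier and noun classes, as in the earlier example.
```

The bundle of links running between the two resulting clusters is then a candidate syntactic link type, as in Step 3.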
Initially, all categories contain one word each, and there is a unique link type for each pair of categories. This is an inefficient representation of language, and so the goal of clustering is to have a relatively small set of clusters and link types, with many words/word-pairs assigned to each. This can be done by maximizing the sum of the logarithms of the sizes of the clusters and link types; that is, by maximizing entropy. Since the category assignments depend on the link types, and vice versa, a very large number of iterations of the loop are likely to be required. Based on the current Link Grammar English dictionaries, one expects to discover hundreds of link types (or more, depending on how subtypes are counted), and perhaps a thousand word clusters (most of these corresponding to irregular verbs and idiomatic phrases).

Many variants of this same sort of process are conceivable, and it's currently unclear what sort of variant will work best. But this kind of process is what one obtains when one implements the basic language learning loop described above on a purely syntactic level.

How might one integrate semantic understanding into this syntactic learning loop? Once one has semantic relationships associated with a word, one uses them to generate new "usage links" for the word, and includes these usage links in the algorithm from Step 1 onwards. This may be done in a variety of different ways, and one may give different weightings to syntactic versus semantic usage links, resulting in the learning of different links.
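The entropy-style objective mentioned above, the sum of the logarithms of the category sizes, is trivial to state in code; here is a minimal sketch (the example categories are invented):

```python
import math

def category_score(categories):
    """Sum of the logs of the category sizes: the quantity the loop
    tries to maximize. Singleton categories contribute nothing, so
    merging compatible words always raises the score; the requirement
    that merged words actually share usage links is what keeps
    everything from collapsing into one giant class."""
    return sum(math.log2(len(c)) for c in categories)

fine   = [{"big"}, {"fast"}, {"red"}, {"car"}, {"bicycle"}]
merged = [{"big", "fast", "red"}, {"car", "bicycle"}]
# fine scores 0 (all singletons); merged scores log2(3) + log2(2)
```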
The above process would produce a large set of syntactic links between words. A further series of steps then follows; these may be carried out concurrently with the above steps, as soon as Step 4 has been reached for the first time.

1. The syntactic graph (with nodes as words and syntactic links joining them) may be mined, using a variety of graph mining tools, to find common combinations of links. This gives the "disjuncts" mentioned above.
2. Given the set of disjuncts, one carries out parsing using a process such as link parsing or word grammar parsing, thus arriving at a set of parses for the sentences in one's reference corpus. Depending on the nature of one's parser, these parses may be ranked according to semantic plausibility. Each parse may be viewed as a directed acyclic graph (dag), usually a tree, with words at the nodes and syntactic-link type labels on the links.
3. One can now define new usage links for each word: namely, the syntactic links occurring in sentence parses containing the word in question. These links may be weighted based on the weights of the parses they occur in.
4. One can now return to Step 2 using the new usage links, alongside the previous ones. Weighting these usage links relative to the others may be done in various ways.

Several subtleties have been ignored in the above, such as the proper discovery and treatment of idiomatic phrases, the discovery of sentence boundaries, the handling of embedded data (price quotes, lists, chapter titles, etc.), as well as the potential speed bump that prepositions present. Fleshing out the details of this loop into a workable, efficient design is the primary engineering challenge. This will take significant time and effort.

45.4.3 Learning Semantics

Syntactic relationships provide only the shallowest interpretation of language; semantics comes next.
One may view semantic relationships (including semantic relationships close to the syntax level, which we may call "syntactico-semantic" relationships) as ensuing from syntactic relationships, via a similar but separate learning process to the one proposed above. Just as our approach to syntax learning is heavily influenced by our work with Link Grammar, our approach to semantics is heavily influenced by our work on the RelEx system [RVC03, GPPG06], which maps the output of the Link Grammar parser into a more abstract, semantic form. Prototype systems [Goe10b] have also been written mapping the output of RelEx into even more abstract semantic form, consistent with the semantics of the Probabilistic Logic Networks [GIGH08] formalism as implemented in CogPrime. These systems are largely based on hand-coded rules, and thus not in the spirit of language learning pursued in this proposal. However, they display the same structure that we assume here; the difference being that here we specify a mechanism for learning the linguistic content that fills in the structure via unsupervised corpus learning, obviating the need for hand-coding.

Specifically, we suggest that discovery of semantic relations requires the implementation of something similar to [LP01], except that this work needs to be generalized from 2-point relations to 3-point and N-point relations, roughly as described in [PD09]. This allows the automatic, unsupervised recognition of synonymous phrases, such as "Texas borders on Mexico" and "Texas is next to Mexico", to extract the general semantic relation next_to(X, Y), and the fact that this relation can be expressed in one of several different ways.
At the simplest level, in this approach, semantic learning proceeds by scanning the corpus for sentences that use similar or the same words, yet employ them in a different order, or have point substitutions of single words, or of small phrases. Sentences which are very similar, or identical, save for one word, offer up candidates for synonyms, or sometimes antonyms. Sentences which use the same words, but in seemingly different syntactic constructions, are candidates for synonymous sentences. These may be used to extract semantic relations: the recognition of sets of different syntactic constructions that carry the same meaning. In essence, similar contexts must be recognized, and then word and word-order differences between these otherwise similar contexts must be compared.

There are two primary challenges: how to recognize similar contexts, and how to assign probabilities. The work of [PD09] articulates solutions to both challenges. For the first, it describes a general framework in which relations such as next_to(X, Y) can be understood as lambda-expressions λx.λy.next_to(x, y), so that one can employ first-order logic constructions in place of graphical representations. This is partly a notational trick; it just shows how to split up input syntactic constructions into atoms and terms, for which probabilities can be assigned. For the second challenge, they show how probabilities can be assigned to these expressions, by making explicit use of the notion of conditional random fields (or rather, a certain special case, termed Markov Logic Networks). Conditional random fields, or Markov networks, are a mathematical formalism that provides the most general framework in which entropy maximization problems can be solved: roughly speaking, it can be understood as a means of properly distributing probabilities across networks.
Unfortunately, this work is quite abstract and rather dense. A much easier introduction to the general idea can be obtained from [LP01]; unfortunately, the latter fails to provide the general N-point case needed for semantic relations in general, and also fails to consider the use of maximum entropy principles to obtain similarity measures.

The above can be used to extract synonymous constructions, and, in this way, semantic relations. However, neither of the above references deals with distinguishing different meanings for a given word. That is, while eats(X, Y) might be a learnable semantic relation, the sentence "He ate it" does not necessarily justify its use. Of course: "He ate it" is an idiomatic expression meaning "he crashed", which should be associated with the semantic relation crash(X), not eat(X, Y). There are global textual clues that this may be the case: trouble resolving the reference "it", and a lack of mention of foodstuffs in neighboring sentences.

A viable yet simple algorithm for the disambiguation of meaning is offered by the Mihalcea algorithm [MTF04, SM07]. This is an application of the (Google) PageRank algorithm to word senses, taken across words appearing in multiple sentences. The premise is that the correct word-sense is the one that is most strongly supported by senses of nearby words; a graph between word senses is drawn, and then solved as a Markov chain. In the original formulation, word senses are defined by appealing to WordNet, and affinity between word-senses is obtained via one of several similarity measures. Neither of these can be applied in learning a language de novo. Instead, these must both be deduced by clustering and splitting, again. So, for example, it is known that word senses correlate fairly strongly with disjuncts (based on the authors' unpublished experiments), and thus a reasonable first cut is to presume that every different disjunct in a lexical entry conveys a different meaning, until proved otherwise.
The above-described discovery of synonymous phrases can then be used to group different disjuncts into a single "word sense". Disjuncts that remain ungrouped after this process are already considered to have distinct senses, and so can be used as distinct senses in the Mihalcea network.
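A minimal version of the PageRank-over-senses computation might look like the following sketch. The sense labels and affinity weights are invented for illustration; the real Mihalcea algorithm also specifies how the graph is built from context windows and similarity measures:

```python
def sense_pagerank(edges, damping=0.85, iters=50):
    """Toy PageRank over a graph of candidate word senses, in the
    spirit of the Mihalcea algorithm: nodes are (word, sense)
    candidates, weighted edges encode affinity between senses of
    nearby words, and the top-ranked sense of each word wins."""
    nodes, out = set(), {}
    for a, b, w in edges:
        nodes.update((a, b))
        out.setdefault(a, []).append((b, w))
        out.setdefault(b, []).append((a, w))   # affinity is undirected
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            total_w = sum(w for _, w in out[n])
            for m, w in out[n]:
                new[m] += damping * rank[n] * w / total_w
        rank = new
    return rank

edges = [
    (("bell", "instrument"), ("ring", "sound"), 2.0),
    (("bell", "instrument"), ("chime", "sound"), 2.0),
    (("ring", "sound"), ("chime", "sound"), 1.0),
    (("bell", "plant"), ("ring", "jewelry"), 0.5),
    (("ring", "jewelry"), ("gold", "metal"), 2.0),
]
rank = sense_pagerank(edges)
# the well-supported instrument sense of "bell" outranks the plant sense
```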
Sense similarity measures can then be developed by using the above-discovered senses, and measuring how well they correlate across different texts. That is, if the word "bell" occurs multiple times in a sequence of paragraphs, it is reasonable to assume that each of these occurrences is associated with the same meaning. Thus, each distinct disjunct for the word "bell" can then be presumed to still convey the same sense. One now asks, what words co-occur with the word "bell"? The frequent appearance of "chime" and "ring" can and should be noted. In essence, one is once again computing word-pair mutual information, except that now, instead of limiting word-pairs to words that are near each other, they can instead involve far-away words, several sentences apart. One can then expand the word sense of "bell" to include a list of co-occurring words (and indeed, this is the slippery slope leading to set phrases and eventually idioms).

Failures of co-occurrence can also further strengthen distinct meanings. Consider "he chimed in" and "the bell chimed". In both cases, chime is a verb. In the first sentence, chime carries the disjunct S- & K+ (here, K+ is the standard Link Grammar connector to particles) while the second has only the simpler disjunct S-. Thus, based on disjunct usage alone, one already suspects that these two have different meanings. This is strengthened by the lack of occurrence of words such as "bell" or "ring" in the first case, along with a frequent observation of words pertaining to talking.

There is one final trick that must be applied in order to get reasonably rapid learning; this can be loosely thought of as "the sigmoid function trick of neural networks", though it may also be manifested in other ways not utilizing specific neural net mathematics.
The key point is that semantics intrinsically involves a variety of uncertain, probabilistic and fuzzy relationships; but in order to learn a robust hierarchy of semantic structures, one needs to iteratively crispen these fuzzy relationships into strict ones.

In much of the above, there is a recurring need to categorize, classify and discover similarity. The most naive means of doing so is by counting, and applying basic probability (Bayesian, Markovian) to the resulting counts to deduce likelihoods. Unfortunately, such formulas distribute probabilities in essentially linear ways (i.e. form a linear algebra), and thus have a rather poor ability to discriminate or distinguish (in the sense of receiver operating characteristics, of discriminating signal from noise). Consider the last example: the list of words co-occurring with chime, over the space of a few paragraphs, is likely to be tremendous. Most of this is surely noise. There is a trick for overcoming this that is deeply embedded in the theory of neural networks, and yet completely ignored in probabilistic (Bayesian, Markovian) networks: the sigmoid function. The sigmoid function serves to focus in on a single stimulus and elevate its importance, while at the same time strongly suppressing all other stimuli. In essence, the sigmoid function looks at two probabilities, say 0.55 and 0.45, and says "let's pretend the first one is 0.9 and the second one is 0.1, and move forward from there". It builds a strong discrimination into all inputs. In the language of standard, textbook probability theory, such discrimination is utterly unwarranted; and indeed, it is. However, applying strong discrimination to learning can help speed learning by converting certain vague impressions into certainties. These certainties can then be built upon to obtain additional certainties, or be torn apart, as needed.
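The "pretend 0.55 is 0.9" operation can be made concrete by passing log-odds through a steep logistic. This is one possible formalization of the trick, not the book's specific proposal; the temperature parameter is an invented knob:

```python
import math

def sharpen(probs, temperature=0.1):
    """Sigmoid-style 'crispening': push vague probabilities toward
    confident 0/1 decisions by dividing the log-odds by a small
    temperature and passing them back through the logistic function.
    temperature=1.0 leaves values unchanged; smaller values sharpen."""
    out = []
    for p in probs:
        logit = math.log(p / (1 - p))
        out.append(1 / (1 + math.exp(-logit / temperature)))
    return out

crisp = sharpen([0.55, 0.45])
# the 0.55 is promoted to roughly 0.9 and the 0.45 suppressed to
# roughly 0.1, matching the "let's pretend" example in the text
```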
Thus, in all of the above efforts to gauge the similarity between different things, it is useful to have a sharp yes/no answer, rather than a vague muddling with likelihoods. In some of the above-described algorithms, this sharpness is already built in: so, Yuret approximates the mutual information of an entire sentence as the sum of mutual information between word pairs; the smaller, unlikely corrections are discarded. Clearly, they must also be revived in order to handle prepositions. Something similar must also be done in the extraction of synonymous
phrases, semantic relations, and meaning; the domain is that much likelier to be noisy, and thus the need to discriminate signal from noise that much more important.

45.4.3.1 Elaboration of the Semantic Learning Loop

We now provide a more detailed elaboration of a simple version of the general semantic learning process described above. The same caveat applies here as in our elaborated description of syntactic learning above: the specific algorithmic approach outlined here is a simple instantiation of the general approach we have in mind, which may well require refinement based on lessons learned during experimentation and further theoretical analysis.

One way to do semantic learning, according to the approach outlined above, is as follows:

1. An initial semantic corpus is posited, whose elements are parse graphs produced by the syntactic process described earlier.
2. A semantic relationship set (or rel-set) is computed from the semantic corpus, via calculating the frequent (or otherwise statistically informative) subgraphs occurring in the elements of the corpus. Each node of such a subgraph may contain a word, a category or a variable; the links of the subgraph are labeled with (syntactic or semantic) link types. Each parse graph is annotated with the semantic graphs associated with the words it contains (explicitly: each word in a parse graph may be linked via a ReferenceLink to each variable or literal within a semantic graph that corresponds to that word in the context of the sentence underlying the parse graph.)
• For instance, the link combination v1 → v2 → v3 may commonly occur (representing the standard Subject-Verb-Object (SVO) structure).
• In this case, for the sentence "The rock broke the window," we would have ReferenceLinks such as one joining "rock" to v1, connecting the nodes (such as the "rock" node) in the parse structure with nodes (such as v1) in this associated semantic subgraph.

3. Rel-sets are divided into categories based on the similarities of their associated semantic graphs.
   • This division into categories manifests the sigmoid-function-style crispening mentioned above. Each rel-set will have similarities to other rel-sets, to varying fuzzy degrees. Defining specific categories turns a fuzzy web of similarities into crisp categorial boundaries; this involves some loss of information, but also creates a simpler platform for further steps of learning.
   • Two semantic graphs may be called "associated" if they have a nonempty intersection. The intersection determines the type of association involved. Similarity assessment between graphs G and H may involve estimation of which graphs G and H are associated with, and in which ways.
   • For instance, "The cat ate the dog" and "The frog was eaten by the walrus" represent the semantic structure eat(cat, dog) in two different ways. In link parser terminology, they do so respectively via the subgraphs g1 = v1 → v2 → v3 and g2 = v1 → v2 → v3 → v4 → v5. These two semantic graphs will have a lot of the same associations. For instance, in our corpus we may have "The big cat ate the dog in the morning" (including big → cat) and also "The big frog was eaten by the walrus in the morning"
(including big → frog), meaning that big → v1 is a graph commonly associated with both g1 and g2. Due to having many commonly associated graphs like this, g1 and g2 are likely to be assigned to a common cluster.

4. Nodes referring to these categories are added to the parse graphs in the semantic corpus. Most simply, a category node C is assigned a link of type L pointing to another node x, if any element of C has a link of type L pointing to x. (More sophisticated methods of assigning links to category nodes may also be worth exploring.)
   • If g1 and g2 have been assigned to a common category C, then "I believe the pig ate the horse" and "I believe the law was invalidated by the revolution" will both appear as instantiations of the graph g3 = v1 → believe → C. This g3 is compact because of the recognition of C as a cluster, leading to its representation as a single symbol. The recognition of g3 will occur in Step 2 the next time around the learning loop.
5. Return to Step 2, with the newly enriched semantic corpus.

As before, one wants to discover not too many and not too few categories; again, the appropriate solution to this problem appears to be entropy maximization. That is, during the frequent subgraph mining stages, one maintains counts of how often these subgraphs occur in the corpus; from these, one constructs the equivalent of the mutual information associated with the subgraphs; categorization requires maximizing the sum of the logs of the sizes of the categories.

As noted earlier, these semantic relationships may be used in the syntactic phase of language understanding in two ways:

• Semantic graphs associated with words may be considered as "usage links" and thus included as part of the data used for syntactic category formation.
• During the parsing process, full or partial parses leading to higher-probability semantic graphs may be favored.
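The frequent-subgraph-mining step at the heart of this loop can be caricatured as counting co-occurring typed links; a real miner over general subgraphs would be far more sophisticated. The link types S and O and the toy parse graphs below are invented for the example:

```python
from collections import Counter
from itertools import combinations

def frequent_link_combinations(parse_graphs, min_count=2):
    """Count pairs of typed links that share a node, across a set of
    parse graphs, and keep those seen at least min_count times. Each
    graph is a list of (src, link_type, dst) triples; only the link
    types are retained in the pattern, which generalizes over the
    particular words (i.e. words become implicit variables)."""
    counts = Counter()
    for edges in parse_graphs:
        for (s1, t1, d1), (s2, t2, d2) in combinations(edges, 2):
            if {s1, d1} & {s2, d2}:            # the two links share a node
                counts[tuple(sorted((t1, t2)))] += 1
    return {k: v for k, v in counts.items() if v >= min_count}

graphs = [
    [("rock", "S", "broke"), ("broke", "O", "window")],
    [("cat", "S", "ate"), ("ate", "O", "dog")],
    [("dog", "S", "ran")],
]
common = frequent_link_combinations(graphs)
# the S-O combination (the subject-verb-object skeleton) occurs in two
# graphs and survives the threshold
```

Patterns surviving the count threshold play the role of the rel-sets described in Step 2.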
45.5 The Importance of Incremental Learning

The learning process described here builds up complex syntactic and semantic structures from simpler ones. To start it, all one needs are basic before-and-after relationships derived from a corpus. Everything else is built up from there, given the assumption of appropriate syntactic and semantic formalisms and a semantics-guided syntax parser.

As we have noted, the series of learning steps we propose falls into the broad category of "deep learning", or of hierarchical modeling. That is, learning must occur at several levels at once, each level reinforcing, and making use of results from, another. Link types cannot be identified until word clusters are found, and word clusters cannot be found until word-pair relationships are discovered. However, once link types are known, these can then be used to refine clusters and the selected word-pair relations. Further, the process of finding word clusters - both pre- and post-parsing - relies on a hierarchical build-up of clusters, each phase of clustering utilizing results of the previous "lower level" phase.

However, for this bootstrapping learning to work well, one will likely need to begin with simple language, so that the semantic relationships embodied in the text are not that far removed from the simple before/after relationships. The complexity of the texts may then be
ramped up gradually. For instance, the needed effect might be achieved via sorting a very large corpus in order of increasing reading level.

45.6 Integrating Language Learned via Corpus Analysis into CogPrime's Experiential Learning

Supposing everything in this chapter were implemented and tested and worked reasonably well as envisioned: what would this get us in terms of progress toward AGI?

Arguably, with a relatively modest additional effort, it could get us a natural language question answering system, answering a variety of questions based on the text corpus available to it. One would have to use the learned rules for language generation, but the methods of Chapter 46 would likely suffice for that. Such a dialogue system would be a valuable achievement in its own right, of scientific, commercial and humanistic interest - but of course, it wouldn't be AGI. To get something approaching AGI from this sort of effort, one would have to utilize additional reasoning and concept creation algorithms to enable the answering of questions based on knowledge not stored explicitly in the provided corpus. The dialogue system would have to be able to piece together new answers from various fragmentary, perhaps contradictory, pieces of information contained in the corpus. Ultimately, we suspect, one would need something like the CogPrime architecture, or something else with a comparable level of sophistication, to appropriately leverage the information extracted from texts via the learned language rules.

An open question, as indicated above, is how much of language itself a corpus-based language learning system like the one outlined here would miss, assuming a massive but realistic corpus (say, a significant fraction of the Web). This is unresolved and ultimately will only be determined via experiment.
Our suspicion is that a very large percentage of language can be understood via these corpus-based methods. But there may be exceptions that would require an unrealistically large corpus. As a simple example, consider the ability to interpret vaguely given spatial directions like "Go right out the door, past a few curves in the road, then when you get to a hill with a big red house on it (well, not that big, but bigger than most of the others you'll see on the walk), start heading down toward the water, till the brush gets thick, then start heading left.... Follow the ground as it rises and eventually you'll see the lake." Of course, it is theoretically possible for an AGI system to learn to interpret directions like this purely via corpus analysis. But it seems the task would be a lot easier for an AGI endowed with a body, so that it could actually experience routes like the one being described. And space and time are not the only source of relevant examples; social and emotional reasoning have a similar property. Learning to interpret language about these from reading is certainly possible, but one will have an easier time and do a better job if one is out in the world experiencing social and emotional life oneself.

Even if there turn out to be significant limitations regarding what can be learned in practice about language via corpus analysis, though, it may still prove a valuable contributor to the mind of a CogPrime system. As compared to hand-coded rules, comparably abstract linguistic knowledge achieved via statistical corpus analysis should be much easier to integrate with the results of probabilistic inference and embodied learning, due to its probabilistic weighting and its connection with the specific examples that gave rise to it.
Chapter 46
Natural Language Generation
Co-authored with Ruiting Lian and Rui Liu

46.1 Introduction

Language generation, unsurprisingly, shares most of the key features of language comprehension discussed in Chapter 44 - after all, the division between generation and comprehension is to some extent an artificial convention, and the two functions are intimately bound up both in the human mind and in the CogPrime architecture. In this chapter we discuss language generation, in a manner similar to the previous chapter's treatment of language comprehension. First we discuss our currently implemented, "engineered" language generation system, and then we discuss some alternative approaches:

• how a more experiential-learning-based system might be made by retaining the basic structure of the engineered system but removing the "pre-wired" contents.
• how a "Sem2Syn" system might be made, via reversing the Syn2Sem system described in Chapter 44. This is the subject of implementation effort at time of writing.

At the start of Chapter 44 we gave a high-level overview of a typical NL generation pipeline. Here we will focus largely but not entirely on the "syntactic and morphological realization" stage, which we refer to for simplicity as "sentence generation" (taking a slight terminological liberty, as "sentence fragment generation" is also included here). All of the stages of language generation are important, and there is a nontrivial amount of feedback among them. However, there is also a significant amount of autonomy, such that it often makes sense to analyze each one separately and then tease out its interactions with the other stages.

46.2 SegSim for Sentence Generation

The sentence generation approach currently taken in OpenCog (from 2009 - early 2012), which we call SegSim, is relatively simple and is depicted in Figure 46.1 and described as follows:

1.
The NL generation system stores a large set of pairs of the form (semantic structure, syntactic/morphological realization)
2. When it is given a new semantic structure to express, it first breaks this semantic structure into natural parts, using a set of simple syntactic-semantic rules
3. For each of these parts, it then matches the parts against its memory to find relevant pairs (which may be full or partial matches), and uses these pairs to generate a set of syntactic realizations (which may be sentences or sentence fragments)
4. If the matching has failed, then (a) it returns to Step 2 and carries out the breakdown into parts again. But if this has happened too many times, then (b) it resorts to a different algorithm (most likely a search or optimization based approach, which is more computationally costly) to determine the syntactic realization of the part in question.
5. If the above step generated multiple fragments, they are pieced together, and a certain rating function is used to judge if this has been done adequately (using criteria of grammaticality and expected comprehensibility, among others). If this fails, then Step 3 is tried again on one or more of the parts; or Step 2 is tried again. (Note that one option for piecing the fragments together is to string together a number of different sentences; but this may not be judged optimal by the rating function.)
6. Finally, a "cleanup" phase is conducted, in which correct morphological forms are inserted, and articles and certain other "function words" are inserted.

The specific OpenCog software implementing the SegSim algorithm is called "NLGen"; this is an implementation of the SegSim concept that focuses on sentence generation from RelEx semantic relationships. In the current (early 2012) NLGen version, Step 1 is handled in a very simple way using a relational database; but this will be modified in future so as to properly use the AtomSpace. Work is currently underway to replace NLGen with a different "Sem2Syn" approach, which will be described at the end of this chapter. But discussion of NLGen is still instructive regarding the intersection of language generation concepts with OpenCog concepts.
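As a concrete (and highly simplified) illustration, the six steps above can be sketched as a control loop. The following toy Python sketch uses invented data structures - relations as tuples, memory as a dictionary from relation-sets to stored text fragments - and is not NLGen's actual implementation:

```python
# Toy sketch of the SegSim control loop. "memory" maps frozen sets of
# semantic relations to stored syntactic realizations; the partitioner,
# rating function and cleanup step are drastically simplified stand-ins.

def segsim_generate(relations, memory, partition, max_tries=3):
    for _ in range(max_tries):
        parts = partition(relations)                # Step 2: break into parts
        fragments = []
        for part in parts:
            frag = memory.get(frozenset(part))      # Step 3: match vs. memory
            if frag is None:                        # Step 4b: costly fallback
                frag = " ".join(w for rel in part for w in rel[1:])
            fragments.append(frag)
        sentence = " ".join(fragments)              # Step 5: piece together
        if rate(sentence) > 0:                      # Step 5: rating function
            return cleanup(sentence)                # Step 6: cleanup phase
    return None

def rate(sentence):
    return 1.0 if sentence.strip() else 0.0         # toy grammaticality score

def cleanup(sentence):
    return sentence[0].upper() + sentence[1:] + "."  # toy morphology/articles

memory = {frozenset({("_subj", "study", "I")}): "I study",
          frozenset({("_obj", "study", "mathematics")}): "mathematics"}
relations = [("_subj", "study", "I"), ("_obj", "study", "mathematics")]
print(segsim_generate(relations, memory, lambda rels: [[r] for r in rels]))
# prints: I study mathematics.
```

The real system differs in every detail (subgraph matching, a trained rating function, a morphological cleanup stage), but the retry-and-repartition control flow is the essential point.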
The substructure currently used in Step 2 is defined by the predicates of the sentence; i.e., we define one substructure for each predicate, which can be described as follows:

Predicate(Argument_i(Modify_j))

where

• 1 ≤ i ≤ m and 0 ≤ j ≤ n, where m and n are integers
• "Predicate" stands for the predicate of the sentence, corresponding to the variable $0 of the RelEx relationship _subj($0, $1) or _obj($0, $1)
• Argument_i is the i-th semantic parameter related with the predicate
• Modify_j is the j-th modifier of Argument_i

If there is more than one predicate, then multiple substructures are extracted analogously. For instance, given the sentence "I happily study beautiful mathematics in beautiful China with beautiful people.", the substructure can be defined as in Figure 46.2.

For each of these substructures, Step 3 matches the substructures of a sentence against the system's global memory (which contains a large body of previously encountered (semantic structure, syntactic/morphological realization) pairs) to find the most similar or identical substructures and the relevant syntactic relations, generating a set of syntactic realizations, which may be sentences or sentence fragments. In our current implementation, a customized subgraph matching algorithm is used to match the substructures against the parsed corpus at this step.

If Step 3 generated multiple fragments, they must be pieced together. In Step 4, the Link Parser's dictionary is used for detecting the dangling syntactic links corresponding to the fragments, which can be used to integrate the multiple fragments. For instance, in the example of Figure 46.3, according to the last 3 steps, SegSim would generate two fragments: "the parser
Fig. 46.1: An Overview of the SegSim Architecture for Language Generation

will ignore the sentence" and "whose length is too long". Then it consults the Link Parser's dictionary, and finds that "whose" has a connector "Mr-", which is used for relative clauses involving "whose", to connect to the previous noun "sentence". Analogously, we can integrate the other fragments into a whole sentence.

Finally, a "cleanup" or "post-processing" phase is conducted, applying the correct inflections to each word depending on the word properties provided by the input RelEx relations. For example, we can use the RelEx relation "DEFINITE-FLAG(cover, T)" to insert the article "the" in front of the word "cover". We have considered five factors in this version of NLGen: article,
Fig. 46.2: Example of a substructure

Fig. 46.3: Linkage of an example

noun plural, verb tense, possessive and query type (the latter being only for interrogative sentences). In the "cleanup" step, we also use the chunk parser tool from OpenNLP (http://opennlp.sourceforge.net/) for adjusting the position of an article being inserted. For instance, consider the proto-sentence "I have big red apple." If we use the RelEx relation "noun_number(apple, singular)" to inflect the word "apple" directly, the final sentence will be "I have big red an apple", which is not well-formed. So we use the chunk parser to detect the phrase "big red apple" first, then apply the article rule in front of the noun phrase. This is a pragmatic approach which may be replaced with something more elegant and principled in later revisions of the NLGen system.
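The article-placement fix just described can be illustrated with a toy chunker. Everything here (the adjective list, the leftward-walk chunking rule, the a/an choice) is an invented stand-in for the OpenNLP chunk parser, shown only to make the rule concrete:

```python
# Toy version of the cleanup rule: find the start of the noun phrase by
# walking left across adjectives, then insert the article before the whole
# chunk rather than directly before the noun.

ADJECTIVES = {"big", "red", "small", "blue"}   # stand-in for real chunking

def insert_article(sentence, noun):
    words = sentence.split()
    start = words.index(noun)
    while start > 0 and words[start - 1] in ADJECTIVES:
        start -= 1                              # extend chunk over adjectives
    article = "an" if words[start][0] in "aeiou" else "a"
    return " ".join(words[:start] + [article] + words[start:])

print(insert_article("I have big red apple", "apple"))
# prints: I have a big red apple   (not "I have big red an apple")
```

Inserting directly before the noun would give the ill-formed "I have big red an apple"; chunking first puts the article in front of the full noun phrase.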
46.2.1 NLGen: Example Results

NLGen is currently in a relatively early stage of development, and does not handle the full range of linguistic and semantic phenomena that it will when it is completed. However, it can already express a variety of sentences encapsulating a variety of syntactic and semantic phenomena; in this section we give some specific examples of what it can do. The SegSim approach performs sentence generation by matching portions of propositional input against a large corpus of parsed sentences; therefore, when successful matches can be found in the corpus, it can generate similar, well-formed sentences via the relevant syntactic structures.

There currently do not exist any authoritative evaluation criteria for Natural Language Generation systems. Among the many complicating factors here is the fact that different language generation systems have different kinds of inputs, depending on many things including their application area. So it is complicated to compare the results of NLGen with those obtained by other systems.

It is easier, however, to test whether NLGen is implementing SegSim successfully. One approach is to take a sentence, run it through RelEx to generate a set of relationships, and see if NLGen can regenerate the sentence from the relationships. We show here the results of some simple tests of this nature, performed with the current version of the system using a very small test corpus for similarity matching.

Note: in each of these example results, the input to NLGen is the RelEx relationships produced by the sentence before "==>", and the sentence to the right of "==>" is one of the sentences generated by NLGen.

Example 1:
I study mathematics at Xiamen University. ==> I study mathematics at Xiamen_University.

Example 2:
The red block is next to the blue cup. ==> The red block is next_to the blue cup.

Example 3:
Ruiting walked from the table to the door.
==> Ruiting walked to the door from the table.

Example 4:
I intend to walk very slowly from Ben to Ruiting, with Hugo's pencil in my hand. ==> With Hugo's pencil in my hand, I intend to walk very slowly from Ben to Ruiting.

Example 5:
I'd like you to tell me where the red ball is. ==> I tell like you to 'd me where the red ball is.

Example 6:
I study at Xiamen University, which is located in China. ==> I study at Xiamen_University.

Example 7:
What is next to the tree? ==> What is next to the tree?

Example 8:
Where is the red ball? ==> Where is the red ball?

Example 9:
Pass me the ball. ==> Pass me the ball.

Example 10:
Tell Ben where the red ball is. ==> Tell Ben where the red ball is.

To make the process clearer, for Example 1 we also show the RelEx relationships produced from the sentence before the "==>": Figure 46.4 shows the relationships of Example 1 fed to NLGen as input. The types of the semantic relationships are documented in the RelEx wiki pages†.

These examples illustrate some key points about the current version of NLGen. It works well on simple, commonplace sentences (Examples 1, 2), though it may sometimes reorder the sentence fragments (Examples 3, 4). On the other hand, because of its reliance on matching against a corpus, NLGen is incapable of forming good sentences with syntactic structures not found in the corpus (Examples 5, 6); on a larger corpus these examples would have given successful results. In Example 5, the odd error is due to the presence of too many "_subj" RelEx relationships in the relationship-set corresponding to the sentence, which distracts the matching process when it attempts to find similar substructures in the small test corpus.
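The regeneration test used above can be framed as a small round-trip harness. Here `parse` and `generate` are injected callables standing in for RelEx and NLGen (both are toy stand-ins here); the harness only classifies the outcome as an exact match, a fragment reordering as in Examples 3-4, or a failure:

```python
# Round-trip check: sentence -> relations -> regenerated candidates.
# A "reordered" verdict means the words survive but fragments were permuted.

def round_trip(sentence, parse, generate):
    candidates = generate(parse(sentence))
    norm = lambda s: s.lower().strip(". ")
    if any(norm(c) == norm(sentence) for c in candidates):
        return "exact"
    bag = lambda s: sorted(norm(s).split())
    if any(bag(c) == bag(sentence) for c in candidates):
        return "reordered"
    return "failed"

# Toy stand-ins mimicking Example 3's behavior:
parse = lambda s: [s]                     # pretend relation extraction
generate = lambda rels: ["Ruiting walked to the door from the table."]
print(round_trip("Ruiting walked from the table to the door.", parse, generate))
# prints: reordered
```

Such a harness does not measure generation quality in general, but it gives a cheap regression check on whether the SegSim implementation preserves the content of its input relations.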
Then from Examples 7 to 10, we can see that NLGen still works well for question sentences and imperative sentences, provided the substructures we extract can be matched. But these substructures may be similar to those of assertive sentences, so we need to refine the output in the "cleanup" step. For example, the substructures we extract for the sentence "are you a student?" are the same as the ones for "you are a student", since the two sentences both have the same binary RelEx relationships:

_subj(be, you)
_obj(be, student)

† http://opencog.org/wiki/RelEx#Relations_and_Features
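This ambiguity can be made concrete: the binary relation sets are literally identical, so only an extra feature (such as the TRUTH-QUERY-FLAG discussed below) can tell the cleanup step to invert subject and verb. The relation encoding here is a toy sketch, not RelEx's actual data format:

```python
# Both surface forms yield the same binary RelEx relations, so substructure
# matching alone cannot distinguish a yes/no question from an assertion.

assertion = {("_subj", "be", "you"), ("_obj", "be", "student")}
question = assertion | {("TRUTH-QUERY-FLAG", "be", "T")}

def binary(rels):                    # what the substructure matcher sees
    return {r for r in rels if not r[0].endswith("FLAG")}

assert binary(assertion) == binary(question)   # identical to the matcher

def realize(rels):                   # cleanup step consults the flag
    if ("TRUTH-QUERY-FLAG", "be", "T") in rels:
        return "are you a student?"
    return "you are a student"

print(realize(question))             # prints: are you a student?
```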
Fig. 46.4: The RelEx relationships of Example 1 (figure content not reproducible here)
In the "cleanup" step, this is resolved using flags like "TRUTH-QUERY-FLAG(be, T)", which indicates that the referent "be" is a verb/event involved in a question.

The particular shortcomings demonstrated in these examples are simple to remedy within the current NLGen framework, by simply expanding the corpus. However, to get truly general behavior from NLGen it will be necessary to insert some other generation method to cover those cases where similarity matching fails, as discussed above. The NLGen2 system created by Blake Lemoine [Lem10] is one possibility in this regard: based on RelEx and the link parser, it carries out rule-based generation using an implementation of Chomsky's Merge operator. Integration of NLGen with NLGen2 is currently being considered. We note that the Merge operator is computationally inefficient by nature, so that it will likely never be suitable as the primary sentence generation method in a language generation system. However, pairing NLGen for generation of familiar and routine utterances with a Merge-based approach for generation of complex or unfamiliar utterances may prove a robust approach.

46.3 Experiential Learning of Language Generation

As in the case of language comprehension, there are multiple ways to create an experiential learning based language generation system, involving various levels of "wired in" knowledge. Our best guess is that for generation, as for comprehension, a "tabula rasa" approach will prove computationally intractable for quite some time to come, and an approach in which some basic structures and processes are provided, and then filled out with content learned via experience, will provide the greatest odds of success. A highly abstracted version of SegSim may be formulated as follows:

1. The AI system stores semantic and syntactic structures, and its control mechanism is biased to search for, and remember, linkages between them
2.
When it is given a new semantic structure to express, it first breaks this semantic structure into natural parts, using inference based on whatever implications it has in its memory that will serve this purpose
3. Its inference control mechanism is biased to carry out inferences with the following implication: For each of these parts, match it against its memory to find relevant pairs (which may be full or partial matches), and use these pairs to generate a set of syntactic realizations (which may be sentences or sentence fragments)
4. If the matching has failed to yield results with sufficient confidence, then (a) it returns to Step 2 and carries out the breakdown into parts again. But if this has happened too many times, then (b) it uses its ordinary inference control routine to try to determine the syntactic realization of the part in question.
5. If the above step generated multiple fragments, they are pieced together, and an attempt is made to infer, based on experience, whether the result will be effectively communicative. If this fails, then Step 3 is tried again on one or more of the parts; or Step 2 is tried again.
6. Other inference-driven transformations may occur at any step of the process, but are particularly likely to occur at the end. In some languages these transformations may result in the insertion of correct morphological forms or other "function words."

What we suggest is that it may be interesting to supply a CogPrime system with this overall process, and let it fill in the rest by experiential adaptation. In the case that the system is
learning to comprehend at the same time as it's learning to generate, this means that its early-stage generations will be based on its rough, early-stage comprehension of syntax - but that's OK. Comprehension and generation will then "grow up" together.

46.4 Sem2Syn

A subject of current research is the extension of the Syn2Sem approach mentioned above into a reverse-order Sem2Syn system for language generation.

Given that the Syn2Sem rules are expressed as ImplicationLinks, they can be reversed automatically and immediately - although the reversed versions will not necessarily have the same truth values. So if a collection of Syn2Sem rules is learned from a corpus, they can be used to automatically generate a set of Sem2Syn rules, each tagged with a probabilistic truth value. Application of the whole set of Sem2Syn rules to a given Atom-set in need of articulation will result in a collection of link-parse links. To produce a sentence from such a collection of link-parse links, another process is also needed, which will select a subset of the collection that corresponds to a complete sentence, legally parsable via the link parser. The overall collection might naturally break down into more than one sentence.

In terms of the abstracted version of SegSim given above, the primary difference between NLGen and Sem2Syn lies in Step 3: Sem2Syn replaces the SegSim "data-store matching" algorithm with inference based on implications obtained from reversing the implications used for language comprehension.

46.5 Conclusion

There are many different ways to do language generation within OpenCog, ranging from pure experiential learning to a database-driven approach like NLGen. Each of these different ways may have value for certain applications, and it's unclear which ones may be viable in a human-level AGI context. Conceptually we would favor a pure experiential learning approach,
but we are currently exploring a "compromise" approach based on Sem2Syn. This is an area where experimentation is going to tell us more than abstract theory.
Chapter 47
Embodied Language Processing
Co-authored with Samir Araujo and Welter Silva

47.1 Introduction

"Language" is an important abstraction - but one should never forget that it's an abstraction. Language evolved in the context of embodied action, and even the most abstract language is full of words and phrases referring to embodied experience. Even our mathematics is heavily based on our embodied experience: geometry is about space; calculus is about space and time; algebra is a sort of linguistic manipulation generalized from experience-oriented language; etc. (see [LN00] for detailed arguments in this regard). To consider language in the context of human-like general intelligence, one needs to consider it in the context of embodied experience.

There is a large literature on the importance of embodiment for child language learning, but perhaps the most eloquent case has been made by Michael Tomasello in his excellent book Constructing a Language ??. Citing a host of relevant research by himself and others, Tomasello gives a very clear summary of the value of social interaction and embodiment for language learning in human children. And while he doesn't phrase it in these terms, the picture he portrays includes central roles for reinforcement, imitative and corrective learning. Imitative learning is obvious: so much of embodied language learning has to do with the learner copying what it has heard others say in similar contexts. Corrective learning occurs every time a parent or peer rephrases something for a child.

In this chapter, after some theoretical discussion of the nature of symbolism and the role of gesture and sound in language, we describe some computational experiments run with OpenCog controlling virtual pets in a virtual world, regarding the use of embodied experience for anaphor resolution and question-answering.
These comprise an extremely simplistic example of the interplay between language and embodiment, but have the advantage of concreteness, since they were actually implemented and experimented with. Some of the specific OpenCog tools used in these experiments are no longer current (e.g. the use of RelEx2Frame, which is now deprecated in favor of alternative approaches to mapping parses into more abstract semantic relationships); but the basic principles and flow illustrated here are still relevant to current and future work.
47.2 Semiosis

The foundation of communication is semiosis - the relation between the signifier and the signified. Often the signified has to do with the external world or the communicating agent's body; hence the critical role of embodiment in language. Thus, before turning to the topic of embodied language use and learning per se, we will briefly treat the related topic of how an AGI system may learn semiosis itself via its embodied experience. This is a large and rich topic, but we will restrict ourselves to giving a few relatively simple examples intended to make the principles clear. We will structure our discussion of semiotic learning according to Charles Sanders Peirce's theory of semiosis [Pei34], in which there are three basic types of signs: icons, indices and symbols.

In Peirce's ontology of semiosis, an icon is a sign that physically resembles what it stands for. Representational pictures, for example, are icons because they look like the thing they represent. Onomatopoeic words are icons, as they sound like the object or fact they signify. The iconicity of an icon need not be immediately apparent: the fact that "kirikiriki" is iconic for a rooster's crow is not obvious to English-speakers, yet it is to many Spanish-speakers; and the converse is true for "cock-a-doodle-doo."

Next, an index is a sign whose occurrence probabilistically implies the occurrence of some other event or object (for reasons other than the habitual usage of the sign in connection with the event or object among some community of communicating agents). The index can be the cause of the signified thing, or its consequence, or merely be correlated with it. For example, a smile on your face is an index of your happy state of mind. Loud music and the sound of many people moving and talking in a room is an index for a party in the room. On the whole, more contextual background knowledge is required to appreciate an index than an icon.
Finally, any sign that is not an icon or index is a symbol. More explicitly, one may say that a symbol is a sign whose relation to the signified thing is conventional or arbitrary. For instance, the stop sign is a symbol for the imperative to stop; the word "dog" is a symbol for the concept it refers to.

The distinction between the various types of signs is not always obvious, and some signs may have multiple aspects. For instance, the thumbs-up gesture is a symbol for positive emotion or encouragement. It is not an index: unlike a smile, which is an index for happiness because smiling is intrinsically biologically tied to happiness, there is no intrinsic connection between the thumbs-up signal and positive emotion or encouragement. On the other hand, one might argue that the thumbs-up signal is very weakly iconic, in that its up-ness resembles the subjective up-ness of a positive emotion (note that in English an idiom for happiness is "feeling up").

Teaching an embodied virtual agent to recognize simple icons is a relatively straightforward learning task. For instance, suppose one wanted to teach an agent that in order to get the teacher to give it a certain type of object, it should go to a box full of pictures, select a picture of an object of that type, and bring it to the teacher. One way this may occur in an OpenCog-controlled agent is for the agent to learn a rule of the following form:
ImplicationLink
    ANDLink
        ContextLink
            Visual
            SimilarityLink $X $Y
    PredictiveImplicationLink
        SequentialANDLink
            ExecutionLink goto box
            ExecutionLink grab $X
            ExecutionLink goto teacher
        EvaluationLink give me teacher $Y

While not a trivial learning problem, this is straightforward for a CogPrime-controlled agent that is primed to consider visual similarities as significant (i.e. is primed to consider the visual-appearance context within its search for patterns in its experience).

Next, proceeding from icons to indices: suppose one wanted to teach an agent that in order to get the teacher to give it a certain type of object, it should go to a box full of pictures and select a picture of an object that has commonly been used together with objects of that type, and bring it to the teacher. This is a combination of iconic and indexical semiosis, and would be achieved via the agent learning a rule of the form

Implication
    AND
        Context
            Visual
            Similarity $X $Z
        Context
            Experience
            SpatioTemporalAssociation $Z $Y
    PredictiveImplication
        SequentialAnd
            Execution goto box
            Execution grab $X
            Execution goto teacher
        Evaluation give me teacher $Y

Symbolism, finally, may be seen to emerge as a fairly straightforward extension of indexing. After all, how does an agent come to learn that a certain symbol refers to a certain entity? An advanced linguistic agent can learn this via explicit verbal instruction, e.g. one may tell it "The word 'hideous' means 'very ugly'." But in the early stages of language learning, this sort of instructional device is not available, and so the way an agent learns that a word is associated with an object or an action is through spatiotemporal association. For instance, suppose the teacher wants to teach the agent to dance every time the teacher says the word "dance" - a very simple example of symbolism.
Assuming the agent already knows how to dance, this merely requires that the agent learn the implication

PredictiveImplication
    SequentialAND
        Evaluation say teacher me "dance"
        Execution dance
    Evaluation give teacher me Reward

And, once this has been learned, then simultaneously the relationship

SpatioTemporalAssociation dance "dance"

will be learned. What's interesting is what happens after a number of associations of this nature have been learned. Then, the system may infer a general rule of the form

Implication
    AND
        SpatioTemporalAssociation $X $Z
        HasType $X GroundedSchema
    PredictiveImplication
        SequentialAND
            Evaluation say teacher me $Z
            Execution $X
        Evaluation give teacher me Reward

This implication represents the general rule that if the teacher says a word corresponding to an action the agent knows how to do, and the agent does it, then the agent may get a reward from the teacher. Abstracting this from a number of pertinent examples is a relatively straightforward feat of probabilistic inference for the PLN inference engine.

Of course, the above implication is overly simplistic, and would lead an agent to stupidly start walking every time its teacher used the word "walk" in conversation and the agent overheard it. To be useful in a realistic social context, the implication must be made more complex so as to include some of the pragmatic surround in which the teacher utters the word or phrase $Z.

47.3 Teaching Gestural Communication

Based on the ideas described above, it is relatively straightforward to teach virtually embodied agents the elements of gestural communication. This is important for two reasons: gestural communication is extremely useful unto itself, as one sees from its role in communication among young children and primates [22]; and gestural communication forms a foundation for verbal communication, during the typical course of human language learning [23].
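Before turning to gestural examples, the word-symbol acquisition just described - linking a heard word like "dance" to the action it names via a SpatioTemporalAssociation - can be sketched as windowed co-occurrence counting over the agent's event stream. This toy Python version, with an invented event-log format, is illustrative only and is not OpenCog's actual mechanism:

```python
# Build word -> action association strengths from a time-ordered event log.
# An association is counted when an action follows a heard word within the
# time window; the strength approximates P(action soon after | word heard).

from collections import defaultdict

def learn_associations(events, window=2.0):
    counts, heard = defaultdict(int), defaultdict(int)
    for t, kind, value in events:
        if kind == "heard":
            heard[value] += 1
            for t2, kind2, action in events:
                if kind2 == "did" and 0 <= t2 - t <= window:
                    counts[(value, action)] += 1
    return {pair: n / heard[pair[0]] for pair, n in counts.items()}

events = [(0, "heard", "dance"), (1, "did", "dance"),
          (5, "heard", "dance"), (6, "did", "dance"),
          (9, "heard", "sit"), (10, "did", "sit")]
assoc = learn_associations(events)
print(assoc[("dance", "dance")])   # prints: 1.0
```

The resulting strengths play the role of the probabilistic truth values on SpatioTemporalAssociation links, from which the general word-action implication above could then be abstracted.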
Note for instance the study described in [22], which "reports empirical longitudinal data on the early stages of language development," concluding that

...the output systems of speech and gesture may draw on underlying brain mechanisms common to both language and motor functions. We analyze the spontaneous interaction with their parents of three typically-developing children (2 M, 1 F) videotaped monthly at home between 10 and 23 months of age. Data analyses focused on the production of actions, representational and deictic gestures and words, and gesture-word combinations. Results indicate that there is a continuity between the production of the first action schemes, the first gestures and the first words produced by children. The relationship between gestures and words changes over time. The onset of two-word speech was preceded by the emergence of gesture-word combinations.

If young children learn language as a continuous outgrowth of gestural communication, perhaps the same approach may be effective for (virtually or physically) embodied AIs.
An example of an iconic gesture occurs when one smiles explicitly to illustrate to some other agent that one is happy. Smiling is a natural expression of happiness, but of course one doesn't always smile when one's happy. The reason that explicit smiling is iconic is that the explicit smile actually resembles the unintentional smile, which is what it "stands for." This kind of iconic gesture may emerge in a socially-embedded learning agent through a very simple logic. Suppose that when the agent is happy, it benefits from its nearby friends being happy as well, so that they may then do happy things together. And suppose that the agent has noticed that when it smiles, this has a statistical tendency to make its friends happy. Then, when it is happy and near its friends, it will have a good reason to smile. So through very simple probabilistic reasoning, the use of explicit smiling as a communicative tool may result. But what if the agent is not actually happy, but still wants some other agent to be happy? Using the reasoning from the prior paragraph, it will likely figure out to smile to make the other agent happy - even though it isn't actually happy.

Another simple example of an iconic gesture would be moving one's hands towards one's mouth, mimicking the movements of feeding oneself, when one wants to eat. Many analogous iconic gestures exist, such as doing a small solo part of a two-person dance to indicate that one wants to do the whole dance together with another person. The general rule an agent needs to learn in order to generate iconic gestures of this nature is that, in the context of shared activity, mimicking part of a process will sometimes serve the function of evoking that whole process. This sort of iconic gesture may be learned in essentially the same way as an indexical gesture such as a dog repeatedly drawing the owner's attention to the owner's backpack, when the dog wants to go outside.
The dog doesn't actually care about going outside with the backpack - he would just as soon go outside without it - but he knows the backpack is correlated with going outside, which is his actual interest. The general rule here is

R := Implication
        SimultaneousImplication
            Execution $X
            $Y
        PredictiveImplication
            $X
            $Y

I.e., if doing $X often correlates with $Y, then maybe doing $X will bring about $Y. This sort of rule can bring about a lot of silly "superstitious" behavior, but it can also be particularly effective in social contexts, meaning in formal terms that

Context
    near_teacher
    R

holds with a higher truth value than R itself. This is a very small conglomeration of semantic nodes and links, yet it encapsulates a very important communicational pattern: that if you want something to happen, and act out part of it - or something historically associated with it - around your teacher, then the thing may happen.

Many other cases of iconic gesture are more complex and mix iconic with symbolic aspects. For instance, one waves one's hand away from oneself to try to get someone else to go away. The hand is moving, roughly speaking, in the direction one wants the other to move in. However, understanding the meaning of this gesture requires a bit of savvy or experience. Once one does grasp it, however, then one can understand its nuances: for instance, if I wave my hand in an
arc leading from your direction toward the direction of the door, maybe that means I want you to go out the door.

Purely symbolic (or nearly so) gestures include the thumbs-up symbol mentioned above, and many others, including valence-indicating symbols like a nodded head for YES, a head shaken side to side for NO, and shrugged shoulders for "I don't know." Each of these valence-indicating symbols actually indicates a fairly complex concept, which is learned from experience partly via attention to the symbol itself. So, an agent may learn that the nodded head corresponds with situations where the teacher gives it a reward, and also with situations where the agent makes a request and the teacher complies. The cluster of situations corresponding to the nodded head then forms the agent's initial concept of "positive valence," which encompasses, loosely speaking, both the good and the true.

Summarizing our discussion of gestural communication: an awful lot of language exists between intelligent agents even if no word is ever spoken. And our belief is that these sorts of non-verbal semiosis form the best possible context for the learning of verbal language, and that to attack verbal language learning outside this sort of context is to make an intrinsically difficult problem even harder than it has to be.

And this leads us to the final part of the chapter, which is a bit more speculative and adventuresome. The material in this section and the prior ones describes experiments of the sort we are currently carrying out with our virtual agent control software. We have not yet demonstrated all the forms of semiosis and non-linguistic communication described in the last section using our virtual agent control system, but we have demonstrated some of them and are actively working on extending our system's capabilities.
In the following section, we venture a bit further into the realm of hypothesis and describe some functionalities that are beyond the scope of our current virtual agent control software, but that we hope to put into place gradually during the next 1-2 years. The basic goal of this work is to move from non-verbal to verbal communication. It is interesting to enumerate the aspects in which each of the above components appears to be capable of tractable adaptation via experiential, embodied learning:

• Words and phrases that are found to be systematically associated with particular objects in the world may be added to the "gazetteer list" used by the entity extractor
• The link parser dictionary may be automatically extended. In cases where the agent hears a sentence that is supposed to describe a certain situation, and realizes that in order for the sentence to be mapped into a set of logical relationships accurately describing the situation, it would be necessary for a certain word to have a certain syntactic link that it doesn't have, the link parser dictionary may be modified to add the link to the word. (On the other hand, creating new link parser link types seems like a very difficult sort of learning - not to say it is unaddressable, but it will not be our focus in the near term.)
• As with the link parser dictionary, if it is apparent that to interpret an utterance in accordance with reality a RelEx rule must be added or modified, this may be done automatically. The RelEx rules are expressed in the format of relatively simple logical implications between Boolean combinations of syntactic and semantic relationships, so that learning and modifying them is within the scope of a probabilistic logic system such as OpenCogPrime's PLN inference engine.
• The rules used by RelEx2Frame may be experientially modified quite analogously to those used by RelEx
• Our current statistical parse ranker ranks an interpretation of a sentence based on the frequency of occurrence of its component links across a parsed corpus. A deeper approach, however, would be to rank an interpretation based on its commonsensical plausibility, as
inferred from experienced-world knowledge as well as corpus-derived knowledge. Again, this is within the scope of what an inference engine such as PLN should be able to do.
• Our word sense disambiguation and reference resolution algorithms involve probabilistic estimations that could be extended to refer to the experienced world as well as to a parsed corpus. For example, in assessing which sense of the noun "run" is intended in a certain context, the system could check whether stockings, sports events, or series of events are more prominent in the currently observed situation. In assessing the sentence "The children kicked the dogs, and then they laughed," the system could map "they" onto "children" via experientially acquired knowledge that children laugh much more often than dogs.
• NLGen uses the link parser dictionary, treated above, and also uses rules analogous to (but inverse to) RelEx rules, mapping semantic relations into brief word-sequences. The "gold standard" for NLGen is whether, when it produces a sentence S from a set R of semantic relationships, the feeding of S into the language comprehension subsystem produces R (or a close approximation) as output. Thus, as the semantic mapping rules in RelEx and RelEx2Frame adapt to experience, the rules used in NLGen must adapt accordingly, which poses an inference problem unto itself.

All in all, when one delves in detail into the components that make up our hybrid statistical/rule-based NLP system, one sees there is a strong opportunity for experiential adaptive learning to substantially modify nearly every aspect of the NLP system, while leaving the basic framework intact. This approach, we suggest, may provide means of dealing with a number of problems that have systematically vexed existing linguistic approaches.
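The dictionary-extension idea in the second bullet above - adding a syntactic link to a word's entry when experience shows the word needs it - can be sketched as a simple update rule. This is an illustrative sketch, not the link parser's real dictionary format; the link-type labels are merely suggestive of link-grammar connector names.

```python
def extend_dictionary(link_dict, word, needed_link):
    """If interpreting a heard sentence against the observed situation
    requires `word` to carry a syntactic link it lacks, add that link.
    Returns True if the dictionary was actually modified."""
    links = link_dict.setdefault(word, set())
    if needed_link in links:
        return False
    links.add(needed_link)
    return True

# Toy lexicon: "fetch" initially only takes a direct object ("O").
lexicon = {"fetch": {"O"}}
# Experience shows "fetch" also needs a verb-modifier link ("MV"),
# e.g. to parse "fetch it quickly" against the observed situation.
extend_dictionary(lexicon, "fetch", "MV")
print(lexicon["fetch"])
```

A real implementation would attach a confidence to each learned link (so a single misheard sentence doesn't permanently corrupt the lexicon), which is exactly the sort of bookkeeping PLN truth values are designed for.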
One example is parse ranking for complex sentences: this seems almost entirely a matter of the ability to assess the semantic plausibility of different parses, and doing this based on statistical corpus analysis alone seems unreasonable. One needs knowledge about a world to ground reasoning about plausibility. Another example is preposition disambiguation, a topic that is barely dealt with at all in the computational linguistics literature (see e.g. [33] for an indication of the state of the art). Consider the problem of assessing which meaning of "with" is intended in sentences like "I ate dinner with a fork", "I ate dinner with my sister", "I ate dinner with dessert." In performing this sort of judgment, an embodied system may use knowledge about which interpretations have matched observed reality in the case of similar utterances it has processed in the past, and for which it has directly seen the situations referred to by the utterances. If it has seen in the past, through direct embodied experience, that when someone said "I ate cereal with a spoon," they meant that the spoon was their tool, not part of their food or their eating-partner, then when it hears "I ate dinner with a fork," it may match "cereal" to "dinner" and "spoon" to "fork" (based on probabilistic similarity measurement) and infer that the interpretation of "with" in the latter sentence should also denote a tool. How does this approach to computational language understanding tie in with gestural and general semiotic learning as we discussed earlier? The study of child language has shown that early language use is not purely verbal by any means, but is in fact a complex combination of verbal and gestural communication [23].
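The analogy-based disambiguation of "with" just described can be sketched as nearest-neighbor transfer from remembered, directly-observed utterances. Everything here is a hypothetical stand-in: the function name, the toy category-based similarity (standing in for PLN similarity measurement), and the sense labels.

```python
def disambiguate_with(head, obj, experience, similarity):
    """Pick the sense of 'with' in '<head> with <obj>' by analogy to
    past utterances whose situations were directly observed.
    `experience` maps (head, obj) pairs to the sense that matched reality."""
    best, best_score = None, 0.0
    for (h, o), sense in experience.items():
        score = similarity(head, h) * similarity(obj, o)
        if score > best_score:
            best, best_score = sense, score
    return best

# Toy similarity: shared semantic category, standing in for probabilistic
# similarity measurement over the agent's concept network.
categories = {"cereal": "meal", "dinner": "meal", "spoon": "utensil",
              "fork": "utensil", "sister": "person"}
sim = lambda a, b: 1.0 if categories.get(a) == categories.get(b) else 0.1

# Directly experienced: the spoon was a tool; the sister was company.
experience = {("cereal", "spoon"): "instrument",
              ("cereal", "sister"): "accompaniment-person"}
print(disambiguate_with("dinner", "fork", experience, sim))  # instrument
```

"Dinner" matches "cereal" and "fork" matches "spoon", so the instrumental sense transfers - the same inference the text walks through verbally.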
With the exception of the first bullet point (entity extraction) above, every one of the instances of experiential modification of our language framework listed above involves the use of an understanding of what situation actually exists in the world, to help the system identify what the logical relationships output by the NLP system are supposed to be in a certain context. But a large amount of early-stage linguistic communication is social in nature, and a large amount of the remainder has to do with the body's relationship to physical objects. And, in understanding "what actually exists in the world" regarding social and physical relationships, a
full understanding of gestural communication is important. So, the overall pathway we propose for achieving robust, ultimately human-level NLP functionality is as follows:

• The capability for learning diverse instances of semiosis is established
• Gestural communication is mastered, via nonverbal imitative/reinforcement/corrective learning mechanisms such as we utilized for our embodied virtual agents
• Gestural communication, combined with observation of and action in the world and verbal interaction with teachers, allows the system to adapt numerous aspects of its initial NLP engine to allow it to more effectively interpret simple sentences pertaining to social and physical relationships
• Finally, given the ability to effectively interpret and produce these simple and practical sentences, probabilistic logical inference allows the system to gradually extend this ability to more and more complex and abstract sentences, incrementally adapting aspects of the NLP engine as its scope broadens.

In this brief section we will mention another factor that we have intentionally omitted in the above analysis - but that may wind up being very important, and that can certainly be taken into account in our framework if this proves necessary. We have argued that gesture is an important predecessor to language in human children, and that incorporating it in AI language learning may be valuable. But there is another aspect of early language use that plays a similar role to gesture, which we have left out in the above discussion: the acoustic aspects of speech. Clearly, pre-linguistic children make ample use of communicative sounds of various sorts. These sounds may be iconic, indexical or symbolic; and they may have a great deal of subtlety. Steven Mithen [Mit96] has argued that non-verbal utterances constitute a kind of proto-language, and that both music and language evolved out of this.
Their role in language learning is well-known. We are uncertain as to whether an exclusive focus on text rather than speech would critically impair the language learning process of an AI system. We are fairly strongly convinced of the importance of gesture because it seems bound up with the importance of semiosis - gesture, it seems, is how young children learn flexible semiotic communication skills, and then these skills are gradually ported from the gestural to the verbal domain. Semiotically, on the other hand, phonology doesn't seem to give anything special beyond what gesture gives. What it does give is an added subtlety of emotional expressiveness - something that is largely missing from virtual agents as implemented today, due to the lack of really fine-grained facial expressions. Also, it provides valuable clues to parsing, in that groups of words that are syntactically bound together are often phrased together acoustically. If one wished to incorporate acoustics into the framework described above, it would not be prohibitively difficult on a technical level. Speech-to-text and text-to-speech software both exist, but neither has been developed with a view specifically toward conveyance of emotional information. One could approach the problem of assessing the emotional state of an utterance based on its sound as a supervised categorization problem, to be solved via supplying a machine learning algorithm with training data consisting of human-created pairs of the form (utterance, emotional valence). Similarly, one could tune text-to-speech software for appropriate emotional expressiveness based on the same training corpus.
47.4 Simple Experiments with Embodiment and Anaphor Resolution

Now we turn to some fairly simple practical work that was done in 2008 with the OpenCog-based PetBrain software, involving the use of virtually embodied experience to help with interpretation of linguistic utterances. This work has been superseded somewhat by more recent work using OpenCog to control virtual agents; but the PetBrain work was especially clear and simple, and thus suitable for in-depth expository discussion here. One of the two ways the PetBrain related language processing to embodied experience was via using the latter to resolve anaphoric references in text produced by human-controlled avatars. The PetBrain-controlled agent lived in a world with many objects, each with its own characteristics. For example, we could have multiple balls, with varying colors and sizes. We represent this in the OpenCog Atomspace using multiple nodes: a single ConceptNode to represent the concept "ball", a WordNode associated with the word "ball", and numerous SemeNodes representing particular balls. There may of course also be ConceptNodes representing ball-related ideas not summarized in any natural language word, e.g. "big fat squishy balls," "balls that can usefully be hit with a bat", etc. As the agent interacts with the world, it acquires information about the objects it finds, through perceptions. The perceptions associated with a given object are stored as other nodes linked to the node representing the specific object instance. All this information is represented in the Atomspace using FrameNet-style relationships (exemplified in the next section). When the user says, e.g., "Grab the red ball", the agent needs to figure out which specific ball the user is referring to - i.e. it needs to invoke the Reference Resolution (RR) process.
RR uses the information in the sentence, along with a few heuristic rules, to select instances. Broadly speaking, Reference Resolution maps nouns in the user's sentences to actual objects in the virtual world, based on world-knowledge obtained by the agent through perceptions. In this example, the system first selects the ConceptNodes related to the word "ball". Then it examines all individual instances associated with these concepts, using the determiners in the sentence along with other appropriate restrictions (in this example the determiner is the adjective "red"; and since the verb is "grab" it also looks for objects that can be fetched). If it finds more than one "fetchable red ball", a heuristic is used to select one (in this case, it chooses the nearest instance). The agent also needs to map pronouns in the sentences to actual objects in the virtual world. For example, if the user says "I like the red ball. Grab it.", the agent must map the pronoun "it" to a specific red ball. This process is done in two stages: first using anaphor resolution to associate the pronoun "it" with the previously heard noun "ball"; then using reference resolution to associate the noun "ball" with the actual object. The subtlety of anaphor resolution is that there may be more than one plausible "candidate" noun corresponding to a given pronoun. As noted above, at time of writing RelEx's anaphor resolution system is somewhat simplistic and is based on the classical Hobbs algorithm [Hob78]. Basically, when a pronoun (it, he, she, they and so on) is identified in a sentence, the Hobbs algorithm searches through recent sentences to find the nouns that fit this pronoun according to number, gender and other characteristics. The Hobbs algorithm is used to create a ranking of candidate nouns, ordered by time (most recently mentioned nouns come first). We improved the Hobbs algorithm results by using the agent's world-knowledge to help choose the best candidate noun.
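This world-knowledge filtering of Hobbs candidates can be sketched as follows. The function and the affordance table are hypothetical illustrations, not PetBrain code: the candidate list is assumed to arrive most-recent-first from the Hobbs ranking, and the world model supplies a predicate saying whether an object can fill the verb's role (here, grabbing).

```python
def resolve_pronoun(candidates, can_grab):
    """Hobbs-style candidates come most-recent-first; discard those the
    world model says cannot fill the verb's role, keep the first survivor."""
    for noun in candidates:
        if can_grab(noun):
            return noun
    return None

# Toy world-knowledge: what the agent has perceived to be grabbable.
affordances = {"ball": True, "stick": True, "tree": False}

# "From here I can see a tree and a ball." / "Grab it."
# Plain Hobbs ranking would consider the tree, but it isn't grabbable.
print(resolve_pronoun(["tree", "ball"], affordances.get))  # ball
```

When every candidate survives the filter (e.g. both a ball and a stick), the recency ordering alone decides, which reproduces the baseline Hobbs behavior.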
Suppose the agent heard the sentences:

"The ball is red."
"The stick is brown."

and then it receives a third sentence, "Grab it.". The anaphor resolver will build a list containing two options for the pronoun "it" of the third sentence: ball and stick. Given that the stick is the most recently mentioned noun, the agent will grab the stick rather than the ball. Similarly, if the agent's history contains

"From here I can see a tree and a ball."
"Grab it."

the Hobbs algorithm returns as candidate nouns "tree" and "ball", in this order. But using our integrative Reference Resolution process, the agent will conclude that a tree cannot be grabbed, so this candidate is discarded and "ball" is chosen.

47.5 Simple Experiments with Embodiment and Question Answering

The PetBrain was also capable of answering simple questions about its feelings/emotions (happiness, fear, etc.) and about the environment in which it lives. After a question is asked to the agent, it is parsed by RelEx and classified as either a truth question or a discursive one. After that, RelEx rewrites the given question as a list of Frames (based on FrameNet, with some customizations), which represent its semantic content. The Frames version of the question is then processed by the agent and the answer is also written in Frames. The answer Frames are then sent to a module that converts them back to the RelEx format. Finally the answer, in RelEx format, is processed by the NLGen module, which generates the text of the answer in English. We will discuss this process here in the context of the simple question "What is next to the tree?", which in an appropriate environment receives the answer "The red ball is next to the tree."
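The pipeline just described (parse, rewrite to Frames, match against memory, convert back to RelEx, generate English) can be sketched as a composition of five stages. The stage functions below are hypothetical toy stand-ins for RelEx, the frame matcher, Frames2RelEx and NLGen; only the shape of the flow, including the two failure answers discussed later, reflects the source.

```python
def answer_question(text, parse, to_frames, match_memory,
                    frames_to_relex, nlgen):
    """PetBrain-style QA pipeline: text -> parse -> frames -> matched
    answer frames -> RelEx relations -> English sentence."""
    frames = to_frames(parse(text))
    answer_frames = match_memory(frames)
    if answer_frames is None:
        return "I don't know"
    relex = frames_to_relex(answer_frames)
    if relex is None:
        return "I know the answer, but I don't know how to say it"
    return nlgen(relex)

# Toy stand-ins for each stage:
memory = {("next_to", "tree"): "red ball"}          # perceived world state
parse = lambda s: s
to_frames = lambda p: ("next_to", "tree") if "next to the tree" in p else None
match_memory = lambda f: memory.get(f)
frames_to_relex = lambda a: {"_subj": a, "_obj": "tree"}
nlgen = lambda r: f"The {r['_subj']} is next to the {r['_obj']}."

print(answer_question("What is next to the tree?", parse, to_frames,
                      match_memory, frames_to_relex, nlgen))
```

Running this prints "The red ball is next to the tree.", while an unmatched question falls through to "I don't know" - mirroring the answer cases enumerated in the following subsections.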
Question answering (QA) of course has a long history in AI, and our approach fits squarely into the tradition of "deep semantic QA systems"; however it is innovative in its combination of dependency parsing with FrameNet and, most importantly, in the manner of its integration of QA with an overall cognitive architecture for agent control.

47.5.1 Preparing/Matching Frames

In order to answer an incoming question, the agent tries to match the Frames list, created by RelEx, against the Frames stored in its own memory. In general these Frames could come from a variety of sources, including inference, concept creation and perception; but in the current PetBrain they primarily come from perception, and simple transformations of perceptions. However, the agent cannot use the incoming perceptual Frames in their original format because they lack grounding information (information that connects the mentioned elements to

¹ http://framenet.icsi.berkeley.edu
the real elements of the environment). So, two steps are then executed before trying to match the Frames: Reference Resolution (described above) and Frames Rewriting. Frames Rewriting is a process that changes the values of the incoming Frame elements into grounded values. Here is an example:

Incoming Frame (generated by RelEx):

EvaluationLink
   DefinedFrameElementNode Color:Color
   WordInstanceNode "red@aaa"
EvaluationLink
   DefinedFrameElementNode Color:Entity
   WordInstanceNode "ball@bbb"
ReferenceLink
   WordInstanceNode "red@aaa"
   WordNode "red"

After Reference Resolution:

ReferenceLink
   WordInstanceNode "ball@bbb"
   SemeNode "ball_99"

Grounded Frame (after Rewriting):

EvaluationLink
   DefinedFrameElementNode Color:Color
   ConceptNode "red"
EvaluationLink
   DefinedFrameElementNode Color:Entity
   SemeNode "ball_99"

Frame Rewriting serves to convert the incoming Frames to the same structure used by the Frames stored in the agent's memory. After Rewriting, the new Frames are matched against the agent's memory, and if all Frames are found there, the answer is known by the agent; otherwise it is unknown. In the PetBrain system, if a truth question was posed and all Frames were matched successfully, the answer would be "yes"; otherwise the answer is "no". Mapping of ambiguous matching results into ambiguous responses was not handled in the PetBrain. If the question requires a discursive answer the process is slightly different. For known answers the matched Frames are converted into RelEx format by Frames2RelEx and then sent to NLGen, which prepares the final English text to be answered. There are two types of unknown answers. The first one is when at least one Frame cannot be matched against the agent's memory, and the answer is "I don't know".
And the second type of unknown answer occurs when all Frames were matched successfully but they cannot be correctly converted into RelEx format, or NLGen cannot identify the incoming relations. In this case the answer is "I know the answer, but I don't know how to say it".
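The Frames Rewriting step shown in the example above - substituting the grounded nodes found by Reference Resolution for the ungrounded word-instance values - amounts to a mapping over the frame's elements. The sketch below is illustrative; the instance-name and node-name strings are simplified stand-ins for the Atom handles an actual Atomspace would use.

```python
def rewrite_frame(frame, references):
    """Replace ungrounded word-instance values in a frame with the
    grounded nodes that Reference Resolution linked them to; values
    with no reference link are left untouched."""
    return {elem: references.get(value, value)
            for elem, value in frame.items()}

# Incoming frame from RelEx, with ungrounded word instances:
incoming = {"Color:Color": "red@aaa", "Color:Entity": "ball@bbb"}
# Groundings produced by Reference Resolution:
refs = {"red@aaa": "red", "ball@bbb": "ball_99"}

print(rewrite_frame(incoming, refs))
```

The output frame uses the same grounded structure as the frames stored in the agent's memory, so it can be matched directly.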
47.5.2 Frames2RelEx

As mentioned above, this module is responsible for receiving a list of grounded Frames and returning another list containing the relations, in RelEx format, which represent the grammatical form of the sentence described by the given Frames. That is, the Frames list represents a sentence that the agent wants to say to another agent. NLGen needs an input in RelEx format in order to generate an English version of the sentence; Frames2RelEx does this conversion. Currently, Frames2RelEx is implemented as a rule-based system in which the preconditions are the required frames and the output is one or more RelEx relations, e.g.

#Color(Entity,Color) => present($2) .a($2) adj($2) _predadj($1,$2)
   definite($1) .n($1) noun($1) singular($1) .v(be) verb(be)
   punctuation(.) det(the)

where the precondition comes before the symbol => and Color is a frame which has two elements: Entity and Color. Each element is interpreted as a variable: Entity = $1 and Color = $2. The effect, or output of the rule, is a list of RelEx relations. As in the case of RelEx2Frame, the use of hand-coded rules is considered a stopgap, and for a powerful AGI system based on this framework such rules will need to be learned via experience.

47.5.3 Example of the Question Answering Pipeline

Turning to the example "What is next to the tree?", Figure 47.1 illustrates the processes involved: the question passes from the virtual-world client to RelEx, which produces Locative_relation frames (Figure = $object, Ground = tree, Relation_type = next); the PetBrain matches these against its memory, producing grounded frames (Figure = ball_99, Ground = tree, Relation_type = next); and NLGen renders the answer "The ball is next to the tree."

Fig. 47.1: Overview of current PetBrain language comprehension process
The question is parsed by RelEx, which creates the frames indicating that the sentence is a question regarding a location reference (next) relative to an object (tree). The frame that represents questions is called Questioning, and it contains the elements Manner, which indicates the kind of question (truth-question, what, where, and so on); Message, which indicates the main term of the question; and Addressee, which indicates the target of the question. To indicate that the question is related to a location, the Locative_relation frame is also created, with a variable inserted in its element Figure, which represents the expected answer (in this specific case, the object that is next to the tree). The question-answer module tries to match the question frames in the Atomspace to fit the variable element. Suppose that the object that is next to the tree is the red ball. The module will then match all the frames requested and realize that the answer is the value of the element Figure of the frame Locative_relation stored in the AtomTable. Then, it creates location frames indicating the red ball as the answer. These frames will be converted into RelEx format by the Frames2RelEx rule-based system as described above, and NLGen will generate the expected sentence "the red ball is next to the tree".

47.5.4 Example of the PetBrain Language Generation Pipeline

To illustrate the process of language generation using NLGen, as utilized in the context of PetBrain query response, consider the sentence "The red ball is near the tree".
When parsed by RelEx, this sentence is converted to:

_obj(near, tree)
_subj(near, ball)
imperative(near)
hyp(near)
definite(tree)
singular(tree)
_to-do(be, near)
_subj(be, ball)
present(be)
definite(ball)
singular(ball)

So, if sentences with this format are in the system's experience, these relations are stored by NLGen and will be used to match future relations that must be converted into natural language. NLGen matches at an abstract level, so sentences like "The stick is next to the fountain" will also be matched even if the corpus contains only the sentence "The ball is near the tree". If the agent wants to say that "The red ball is near the tree", it must invoke NLGen with the above RelEx contents as input. However, the knowledge that the red ball is near the tree is stored as frames, and not in RelEx format. More specifically, in this case the related frame stored is the Locative_relation one, containing the following elements and respective values: Figure -> red ball, Ground -> tree, Relation_type -> near. So we must convert these frames and their elements' values into the RelEx format accepted by NLGen. For AGI purposes, a system must learn how to perform this conversion in a flexible and context-appropriate way. In our current system, however, we have implemented a temporary
short-cut: a system of hand-coded rules, in which the preconditions are the required frames and the output is the corresponding RelEx format that will generate the sentence that represents the frames. The output of a rule may contain variables that must be replaced by the frame elements' values. For the example above, the output _subj(be, ball) is generated from the rule output _subj(be, $var1), with $var1 replaced by the Figure element value. Considering specifically question-answering (QA), the PetBrain's Language Comprehension module represents the answer to a question as a list of frames. In this case, we may have the following situations:

• The frames match a precondition and the RelEx output is correctly recognized by NLGen, which generates the expected sentence as the answer;
• The frames match a precondition, but NLGen did not recognize the RelEx output generated. In this case, the answer will be "I know the answer, but I don't know how to say it", which means that the question was answered correctly by Language Comprehension, but NLGen could not generate the correct sentence;
• The frames didn't match any precondition; then the answer will also be "I know the answer, but I don't know how to say it".
• Finally, if no frames are generated as the answer by the Language Comprehension module, the agent's answer will be "I don't know".

If the question is a truth-question, then NLGen is not required. In this case, the creation of answer frames is interpreted as a "Yes"; otherwise, the answer will be "No", because it was not possible to find the corresponding frames as the answer.

47.6 The Prospect of Massively Multiplayer Language Teaching

Now we tie in the theme of embodied language learning with more general considerations regarding embodied experiential learning.
Potentially, this may provide a means to facilitate robust language learning on the part of virtually embodied agents, and lead to an experientially-trained AGI language facility that can then be used to power other sorts of agents such as virtual babies, and ultimately virtual adult-human avatars that can communicate with experientially-grounded savvy rather than in the manner of chat-bots. As one concrete, evocative example, imagine millions of talking parrots spread across different online virtual worlds - all communicating in simple English. Each parrot has its own local memories, its own individual knowledge and habits and likes and dislikes - but there's also a common knowledge-base underlying all the parrots, which includes a common knowledge of English. The interest of many humans in interacting with chatbots suggests that virtual talking parrots or similar devices would be likely to meet with a large and enthusiastic audience. Yes, humans interacting with parrots in virtual worlds can be expected to try to teach the parrots ridiculous things, obscene things, and so forth. But still, when it comes down to it, even pranksters and jokesters will have more fun with a parrot that can communicate better, and will prefer a parrot whose statements are comprehensible. And for a virtual parrot, the test of whether it has used English correctly, in a given instance, will come down to whether its human friends have rewarded it, and whether it has gotten what
it wanted. If a parrot asks for food incoherently, it's less likely to get food - and since the virtual parrots will be programmed to want food, they will have motivation to learn to speak correctly. If a parrot interprets a human-controlled avatar's request "Fetch my hat please" incorrectly, then it won't get positive feedback from the avatar - and it will be programmed to want positive feedback. And of course parrots are not the end of the story. Once the collective wisdom of throngs of human teachers has induced powerful language understanding in the collective bird-brain, this language understanding (and the commonsense understanding coming along with it) will be useful for many, many other purposes as well: humanoid avatars - both human-baby avatars that may serve as more rewarding virtual companions than parrots or other virtual animals, and language-savvy human-adult avatars serving various useful and entertaining functions in online virtual worlds and games. Once AIs have learned enough that they can flexibly and adaptively explore online virtual worlds and gather information from human-controlled avatars according to their own goals using their linguistic facilities, it's easy to envision dramatic acceleration in their growth and understanding. A baby AI has numerous disadvantages compared to a baby human being: it lacks the intricate set of inductive biases built into the human brain, and it also lacks a set of teachers with a similar form and psyche to it... and for that matter, it lacks a really rich body and world. However, the presence of thousands to millions of teachers constitutes a large advantage for the AI over human babies. And a flexible AGI framework will be able to effectively exploit this advantage.
If nonlinguistic learning mechanisms like the ones we've described here, utilized in a virtually-embodied context, can go beyond enabling interestingly trainable virtual animals and catalyze the process of language learning - then, within a few years' time, we may find ourselves significantly further along the path to AGI than most observers of the field currently expect.
Chapter 48
Natural Language Dialogue

Co-authored with Ruiting Lian

48.1 Introduction

Language evolved for dialogue - not for reading, writing or speechifying. So it's natural that dialogue is broadly considered a critical aspect of humanlike AGI - even to the extent that (for better or for worse) the conversational "Turing Test" is the standard test of human-level AGI. Dialogue is a high-level functionality rather than a foundational cognitive process, and in the CogPrime approach it is something that must largely be learned via experience rather than being programmed into the system. In that sense, it may seem odd to have a chapter on dialogue in a book section focused on engineering aspects of general intelligence. One might think: dialogue is something that should emerge from an intelligent system in conjunction with other intelligent systems, not something that should need to be engineered. And this is certainly a reasonable perspective! We do think that, as a CogPrime system develops, it will develop its own approach to natural language dialogue, based on its own embodiment, environment and experience - with similarities and differences to human dialogue. However, we have also found it interesting to design a natural language dialogue system based on CogPrime, with the goal not of emulating human conversation, but rather of enabling interesting and intelligent conversational interaction with CogPrime systems. We call this system "ChatPrime" and will describe its architecture in this chapter. The components used in ChatPrime may also be useful for enabling CogPrime systems to carry out more humanlike conversation, via their incorporation in learned schemata; but we will not focus on that aspect here. In addition to its intrinsic interest, consideration of ChatPrime sheds much light on the conceptual relationship between NLP and other aspects of CogPrime.
We are very aware that there is an active subfield of computational linguistics focused on dialogue systems [LDA05]; however we will not draw significantly on that literature here. Making practical dialogue systems in the absence of a generally functional cognitive engine is a subtle and difficult art, which has been addressed in a variety of ways; however, we have found that designing a dialogue system within the context of an integrative cognitive engine like CogPrime is a somewhat different sort of endeavor.
48.1.1 Two Phases of Dialogue System Development

In practical terms, we envision the ChatPrime system as possessing two phases of development:

1. Phase 1:
• "Lower levels" of NL comprehension and generation executed by a relatively traditional approach incorporating statistical and rule-based aspects (the RelEx and NLGen systems)
• Dialogue control utilizes hand-coded procedures and predicates (SpeechActSchema and SpeechActTriggers) corresponding to fine-grained types of speech act
• Dialogue control guided by a general cognitive control system (OpenPsi, running within OpenCog)
• SpeechActSchema and SpeechActTriggers, in some cases, will internally consult probabilistic inference, thus supplying a high degree of adaptive intelligence to the conversation
2. Phase 2:
• "Lower levels" of NL comprehension and generation carried out within the primary cognition engine, in a manner enabling their underlying rules and probabilities to be modified based on the system's experience. Concretely, one way this could be done in OpenCog would be via:
  - Implementing the RelEx and RelEx2Frame rules as PLN implications in the Atomspace
  - Implementing parsing via expressing the link parser dictionary as Atoms in the Atomspace, and using the SAT link parser to do parsing as an example of logical unification (carried out by a MindAgent wrapping a SAT solver)
  - Implementing NLGen within the OpenCog core, via making NLGen's sentence database a specially indexed Atomspace, and wrapping the NLGen operations in a MindAgent
• Reimplementing the SpeechActSchema and SpeechActTriggers in an appropriate combination of Combo and PLN logical link types, so they are susceptible to modification via inference and evolution

It's worth noting that the work required to move from Phase 1 to Phase 2 is essentially software development and computer science algorithm optimization work, rather than computational linguistics or AI theory.
Then after the Phase 2 system is built there will, of course, be significant work involved in "tuning" PLN, MOSES and other cognitive algorithms to experientially adapt the various portions of the dialogue system that have been moved into the OpenCog core and refactored for adaptiveness.

48.2 Speech Act Theory and its Elaboration

We review here the very basics of speech act theory, and then the specific variant of speech act theory that we feel will be most useful for practical OpenCog dialogue system development.
The core notion of speech act theory is to analyze linguistic behavior in terms of discrete speech acts aimed at achieving specific goals. This is a convenient theoretical approach in an OpenCog context, because it pushes us to treat speech acts just like any other acts that an OpenCog system may carry out in its world, and to handle speech acts via the standard OpenCog action selection mechanism.

Searle, who originated speech act theory, divided speech acts according to the following (by now well known) ontology:

• Assertives: The speaker commits herself to something being true. The sky is blue.
• Directives: The speaker attempts to get the hearer to do something. Clean your room!
• Commissives: The speaker commits to some future course of action. I will do it.
• Expressives: The speaker expresses some psychological state. I'm sorry.
• Declarations: The speaker brings about a different state of the world. The meeting is adjourned.

Inspired by this ontology, Twitchell and Nunamaker (in their 2004 paper "Speech Act Profiling: A Probabilistic Method for Analyzing Persistent Conversations and Their Participants") created a much more fine-grained ontology of 42 kinds of speech acts, called SWBD-DAMSL (DAMSL = Dialogue Act Markup in Several Layers). Nearly all of their 42 speech act types can be neatly mapped into one of Searle's 5 high-level categories, although a handful don't fit Searle's view and get categorized as "other." Figures 48.1 and 48.2 depict the 42 acts and their relationship to Searle's categories.

48.3 Speech Act Schemata and Triggers

In the suggested dialogue system design, multiple SpeechActSchema would be implemented, corresponding roughly to the 42 SWBD-DAMSL speech acts. The correspondence is "rough" because:

• we may wish to add new speech acts not in their list
• sometimes it may be most convenient to merge 2 or more of their speech acts into a single SpeechActSchema.
For instance, it's probably easiest to merge their YES ANSWER and NO ANSWER categories into a single TRUTH VALUE ANSWER schema, yielding affirmative, negative, and intermediate answers like "probably", "probably not", "I'm not sure", etc.
• sometimes it may be best to split one of their speech acts into several, e.g. to separately consider STATEMENTs which are responses to statements, versus statements that are unsolicited disbursements of "what's on the agent's mind."

Overall, the SWBD-DAMSL categories should be taken as guidance rather than doctrine. However, they are valuable guidance due to their roots in detailed analysis of real human conversations, and their role as a bridge between concrete conversational analysis and the abstractions of speech act theory.

Each SpeechActSchema would take in an input consisting of a DialogueNode, a Node type possessing a collection of links to:

• a series of past statements by the agent and other conversation participants, with
- each statement labeled according to the utterer
Fig. 48.1: The 42 DAMSL speech act categories.

Fig. 48.2: Connecting the 42 DAMSL speech act categories to Searle's 5 higher-level categories.

- each statement uttered by the agent, labeled according to which SpeechActSchema was used to produce it, plus (see below) which SpeechActTrigger and which response generator was involved
• a set of Atoms comprising the context of the dialogue. These Atoms may optionally be linked to some of the Atoms representing some of the past statements. If they are not so linked, they are considered as general context.

The enaction of SpeechActSchema would be carried out via PredictiveImplicationLinks embodying "Context AND Schema → Goal" schematic implications, of the general form

PredictiveImplication
    AND
        Evaluation
            SpeechActTrigger T
            DialogueNode D
        Execution
            SpeechActSchema S
            DialogueNode D
    Evaluation
        Goal G

with

ExecutionOutput
    SpeechActSchema S
    DialogueNode D
    UtteranceNode U

being created as a result of the enaction of the SpeechActSchema. (An UtteranceNode is a series of one or more SentenceNodes.)

A single SpeechActSchema may be involved in many such implications, with different probabilistic weights, if it naturally has many different Trigger contexts.

Internally each SpeechActSchema would contain a set of one or more response generators, each one of which is capable of independently producing a response based on the given input. These may also be weighted, where the weight determines the probability of a given response generation process being chosen in preference to the others, once the choice to enact that particular SpeechActSchema has already been made.

48.3.1 Notes Toward Example SpeechActSchema

To make the above ideas more concrete, let's consider a few specific SpeechActSchema. We won't fully specify them here, but will outline them sufficiently to make the ideas clear.

48.3.1.1 TruthValueAnswer

The TruthValueAnswer SpeechActSchema would encompass SWBD-DAMSL's YES ANSWER and NO ANSWER, and also more flexible truth-value-based responses.
Trigger context: when the conversation partner produces an utterance that RelEx maps into a truth-value query (this is simple, as truth-value-query is one of RelEx's relationship types).

Goal: the simplest goal relevant here is pleasing the conversation partner, since the agent may have noticed in the past that other agents are pleased when their questions are answered. (More advanced agents may of course have other goals for answering questions, e.g. providing the other agent with information that will let it be more useful in future.)

Response generation schema: for starters, this SpeechActSchema could simply operate as follows. It takes the relationship (Atom) corresponding to the query, and uses it to launch a query to the pattern matcher or PLN backward chainer. Then based on the result, it produces a relationship (Atom) embodying the answer to the query, or else updates the truth value of the existing relationship corresponding to the answer to the query. This "answer" relationship has a certain truth value. The schema could then contain a set of rules mapping the truth values into responses, with a list of possible responses for each truth value range. For example, a very high strength and high confidence truth value would be mapped into a set of responses like (definitely, certainly, surely, yes, indeed).

This simple case exemplifies the overall Phase 1 approach suggested here. The conversation will be guided by fairly simple heuristic rules, but with linguistic sophistication in the comprehension and generation aspects, and potentially subtle inference invoked within the SpeechActSchema or (less frequently) the Trigger contexts. Then in Phase 2 these simple heuristic rules will be refactored in a manner rendering them susceptible to experiential adaptation.
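The rule set mapping truth values into surface responses could be sketched as follows. The threshold values, rule boundaries and response word lists here are illustrative assumptions for the sketch, not fixed parts of the CogPrime design:

```python
import random

# Illustrative (strength range, minimum confidence) -> candidate responses.
# The boundaries and word lists are assumptions, not CogPrime-specified values.
RESPONSE_RULES = [
    # (min_strength, max_strength, min_confidence, responses)
    (0.9, 1.0, 0.7, ["definitely", "certainly", "surely", "yes", "indeed"]),
    (0.6, 0.9, 0.5, ["probably", "I believe so"]),
    (0.4, 0.6, 0.0, ["I'm not sure"]),
    (0.1, 0.4, 0.5, ["probably not"]),
    (0.0, 0.1, 0.7, ["no", "definitely not"]),
]

def truth_value_answer(strength, confidence):
    """Map the truth value of the 'answer' Atom to a surface response,
    using the first rule whose strength range and confidence floor match."""
    for lo, hi, min_conf, responses in RESPONSE_RULES:
        if lo <= strength <= hi and confidence >= min_conf:
            return random.choice(responses)
    return "I'm not sure"  # fallback when confidence is too low for any rule

print(truth_value_answer(0.95, 0.9))  # one of: definitely / certainly / surely / yes / indeed
print(truth_value_answer(0.5, 0.2))   # I'm not sure
```

In Phase 2, such hard-coded ranges would be refactored into Atoms whose boundaries PLN and MOSES could adapt experientially.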
48.3.1.2 Statement: Answer

The next few SpeechActSchema (plus maybe some similar ones not given here) are intended to collectively cover the ground of SWBD-DAMSL's STATEMENT OPINION and STATEMENT NON-OPINION acts.

Trigger context: The trigger is that the conversation partner asks a wh- question.
Goal: Similar to the case of a TruthValueAnswer, discussed above.

Response generation schema: When a wh- question is received, one reasonable response is to produce a statement comprising an answer. The question Atom is posed to the pattern matcher or PLN, which responds with an Atom-set comprising a putative answer. The answer Atoms are then pared down into a series of sentence-sized Atom-sets, which are articulated as sentences by NLGen. If the answer Atoms have very low-confidence truth values, or if the Atomspace contains knowledge that other agents significantly disagree with the agent's truth value assessments, then the answer Atom-set may have Atoms corresponding to "I think" or "In my opinion" etc. added onto it (this gives an instance of the STATEMENT OPINION act).

48.3.1.3 Statement: Unsolicited Observation

Trigger context: when in the presence of another intelligent agent (human or AI) and nothing has been said for a while, there is a certain probability of choosing to make a "random" statement.

Goal 1: Unsolicited observations may be made with a goal of pleasing the other agent, as it may have been observed in the past that other agents are happier when spoken to.

Goal 2: Unsolicited observations may be made with goals of increasing the agent's own pleasure or novelty or knowledge - because it may have been observed that speaking often triggers conversations, and conversations are often more pleasurable or novel or educational than silence.

Response generation schema: One option is a statement describing something in the mutual environment; another option is a statement derived from high-STI Atoms in the agent's Atomspace. The particulars are similar to the "Statement: Answer" case.
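The general selection mechanism described in Section 48.3 - weighted PredictiveImplications choosing among schemata whose triggers fire, followed by a weighted choice among a schema's response generators - might be sketched as follows. All names here (select_schema, the dialogue-context keys, the toy triggers and generators) are illustrative assumptions, not OpenCog APIs:

```python
import random

class SpeechActSchema:
    """Illustrative stand-in for an OpenCog SpeechActSchema: a named act
    with weighted response generators, each a (weight, fn(dialogue)->str) pair."""
    def __init__(self, name, generators):
        self.name = name
        self.generators = generators

    def respond(self, dialogue):
        # Choose a response generator with probability proportional to its weight.
        weights = [w for w, _ in self.generators]
        _, gen = random.choices(self.generators, weights=weights, k=1)[0]
        return gen(dialogue)

def select_schema(implications, dialogue):
    """Enact 'Context AND Schema -> Goal': among implications whose Trigger
    fires in this dialogue context, pick a schema with probability proportional
    to the implication's weight. Returns None if no trigger fires."""
    fired = [(weight, schema) for trigger, weight, schema in implications
             if trigger(dialogue)]
    if not fired:
        return None
    return random.choices([s for _, s in fired],
                          weights=[w for w, _ in fired], k=1)[0]

# -- toy usage sketch -------------------------------------------------------
truth_answer = SpeechActSchema("TruthValueAnswer", [(1.0, lambda d: "yes")])
observation = SpeechActSchema(
    "UnsolicitedObservation",
    [(0.7, lambda d: "The ball is by the table."),
     (0.3, lambda d: "It's quiet in here.")])

implications = [
    (lambda d: d["last_utterance_type"] == "truth-value-query", 0.9, truth_answer),
    (lambda d: d["seconds_of_silence"] > 30.0, 0.2, observation),
]

dialogue = {"last_utterance_type": "truth-value-query", "seconds_of_silence": 0.0}
schema = select_schema(implications, dialogue)
print(schema.name, "->", schema.respond(dialogue))  # TruthValueAnswer -> yes
```

In the actual design the trigger predicates, weights and generators would live in the Atomspace as PredictiveImplicationLinks and schema Atoms, so that PLN can revise the weights from experience.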
48.3.1.4 Statement: External Change Notification

Trigger context: when in a situation with another intelligent agent, and something significant changes in the mutually perceived situation, a statement describing it may be made.

Goal 1: External change notification utterances may be made for the same reasons as Unsolicited Observations, described above.

Goal 2: The agent may think a certain external change is important to the other agent it is talking to, for some particular reason. For instance, if the agent sees a dog steal Bob's property, it may wish to tell Bob about this.

Goal 3: The change may be important to the agent itself - and it may want its conversation partner to do something relevant to an observed external change ... so it may bring the change to the partner's attention for this reason. For instance, "Our friends are leaving. Please try to make them come back."

Response generation schema: The Atom-set for expression characterizes the change observed. The particulars are similar to the "Statement: Answer" case.

48.3.1.5 Statement: Internal Change Notification

Trigger context 1: when the importance level of an Atom increases dramatically while in the presence of another intelligent agent, a statement expressing this Atom (and some of its currently relevant surrounding Atoms) may be made.
Trigger context 2: when the truth value of a reasonably important Atom changes dramatically while in the presence of another intelligent agent, a statement expressing this Atom and its truth value may be made.

Goal: Similar goals apply here as to External Change Notification, considered above.

Response generation schema: Similar to the "Statement: External Change Notification" case.

48.3.1.6 WHQuestion

Trigger context: being in the presence of an intelligent agent thought capable of answering questions.

Goal 1: the general goal of increasing the agent's total knowledge.

Goal 2: the agent notes that, to achieve one of its currently important goals, it would be useful to possess an Atom fulfilling a certain specification.

Response generation schema: Formulate a query whose answer would be an Atom fulfilling that specification, and then articulate this logical query as an English question using NLGen.

48.4 Probabilistic Mining of Trigger Contexts

One question raised by the above design sketch is where the Trigger contexts come from. They may be hand-coded, but this approach may suffer from excessive brittleness. The approach suggested by Twitchell and Nunamaker's work (which involved modeling human dialogues rather
than automatically generating intelligent dialogues) is statistical. That is, they suggest marking up a corpus of human dialogues with tags corresponding to the 42 speech acts, and learning from this annotated corpus a set of Markov transition probabilities indicating which speech acts are most likely to follow which others. In their approach the transition probabilities refer only to series of speech acts.

In an OpenCog context one could utilize a more sophisticated training corpus in a more sophisticated way. For instance, suppose one wants to build a dialogue system for a game character conversing with human characters in a game world. Then one could conduct experiments in which one human controls a "human" game character, and another human puppeteers an "AI" game character. That is, the puppeteered character funnels its perceptions to the AI system, but has its actions and verbalizations controlled by the human puppeteer. Given the dialogue from this sort of session, one could then perform markup according to the 42 speech acts. As a simple example, consider the following brief snippet of annotated conversation:

speaker | utterance           | speech act type
Ben     | Go get me the ball  | ad
AI      | Where is it?        | qw
Ben     | Over there [points] | sd
AI      | By the table?       | qy
Ben     | Yeah                | ny
AI      | Thanks              | ft
AI      | I'll get it now.    | commits

A DialogueNode object based on this snippet would contain the information in the table, plus some physical information about the situation, such as, in this case: predicates describing the relative locations of the two agents, the ball and the table (e.g. the two agents are very near each other, the ball and the table are very near each other, but these two groups of entities are only moderately near each other); and, predicates involving ...

Then, one could train a machine learning algorithm such as MOSES to predict the probability of speech act type S1 occurring at a certain point in a dialogue history, based on the prior history of the dialogue.
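The transition-probability estimation Twitchell and Nunamaker suggest can be sketched over the annotated snippet above; the function and variable names are ours, and a real implementation would smooth the estimates over a much larger corpus:

```python
from collections import Counter, defaultdict

# SWBD-DAMSL-style tags for the annotated snippet above.
tagged_dialogue = ["ad", "qw", "sd", "qy", "ny", "ft", "commits"]

def transition_probs(sequences):
    """Estimate first-order Markov transition probabilities
    P(next act | previous act) from speech-act-tagged dialogues."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {prev: {nxt: n / sum(ctr.values()) for nxt, n in ctr.items()}
            for prev, ctr in counts.items()}

probs = transition_probs([tagged_dialogue])
print(probs["qy"])  # {'ny': 1.0}: in this tiny corpus, a yes/no question (qy)
                    # was always followed by a yes-answer (ny)
```

In the richer OpenCog setting sketched in the text, the conditioning history would include percepts and cognitions alongside the tag sequence, which is why MOSES-style predicate learning, rather than a plain Markov chain, is proposed.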
This prior history could include percepts and cognitions as well as utterances, since one has a record of the AI system's perceptions and cognitions in the course of the marked-up dialogue.

One question is whether to use the 42 SWBD-DAMSL speech acts for the creation of the annotated corpus, or whether instead to use the modified set of speech acts created in designing SpeechActSchema. Either way could work, but we are mildly biased toward the former, since this specific SWBD-DAMSL markup scheme has already proved its viability for marking up conversations. It seems unproblematic to map probabilities corresponding to these speech acts into probabilities corresponding to a slightly refined set of speech acts. Also, this way the corpus would be valuable independently of ongoing low-level changes in the collection of SpeechActSchema.

In addition to this sort of supervised training in advance, it will be important to enable the system to learn Trigger contexts online as a consequence of its life experience. This learning may take two forms:

1. Most simply, adjustment of the probabilities associated with the PredictiveImplicationLinks between SpeechActTriggers and SpeechActSchema
2. More sophisticatedly, learning of new SpeechActTrigger predicates, using an algorithm such as MOSES for predicate learning, based on mining the history of actual dialogues to estimate fitness

In both cases the basis for learning is information regarding the extent to which system goals were fulfilled by each past dialogue. PredictiveImplications that correspond to portions of successful dialogues will have their truth values increased, and those corresponding to portions of unsuccessful dialogues will have their truth values decreased. Candidate SpeechActTriggers will be valued based on the observed historical success of the responses they would have generated based on historically perceived utterances; and (ultimately) more sophisticatedly, based on the estimated success of the responses they generate. Note that, while somewhat advanced, this kind of learning is much easier than the procedure learning required to learn new SpeechActSchema.

48.5 Conclusion

While the underlying methods are simple, the above methods appear capable of producing arbitrarily complex dialogues about any subject that is represented by knowledge in the AtomSpace. There is no reason why dialogue produced in this manner should be indistinguishable from human dialogue; but it may nevertheless be humanly comprehensible, intelligent and insightful. What is happening in this sort of dialogue system is somewhat similar to current natural language query systems that query relational databases, but the "database" in question is a dynamically self-adapting weighted labeled hypergraph rather than a static relational database, and this difference means a much more complex dialogue system is required, as well as more flexible language comprehension and generation components. Ultimately, a CogPrime system - if it works as desired - will be able to learn increased linguistic functionality, and new languages, on its own.
But this is not a prerequisite for having intelligent dialogues with a CogPrime system. Via building a ChatPrime type system, as outlined here, intelligent dialogue can occur with a CogPrime system while it is still at relatively early stages of cognitive development, and even while the underlying implementation of the CogPrime design is incomplete. This is not closely analogous to human cognitive and linguistic development, but it can still be pursued in the context of a CogPrime development plan that follows the overall arc of human developmental psychology.
Section VIII

From Here to AGI
Chapter 49

Summary of Argument for the CogPrime Approach

49.1 Introduction

By way of conclusion, we now return to the "key claims" that were listed at the end of Chapter 1 of Part 1. Quite simply, this is a list of claims such that - roughly speaking - if the reader accepts these claims, they should accept that the CogPrime approach to AGI is a viable one. On the other hand, if the reader rejects one or more of these claims, they may well find one or more aspects of CogPrime unacceptable for some related reason.

In Chapter 1 of Part 1 we merely listed these claims; here we briefly discuss each one in the context of the intervening chapters, giving each one its own section or subsection. As we clarified at the start of Part 1, we don't fancy that we have provided an ironclad argument that the CogPrime approach to AGI is guaranteed to work as hoped, once it's fully engineered, tuned and taught. Mathematics isn't yet adequate to analyze the real-world behavior of complex systems like these; and we have not yet implemented, tested and taught enough of CogPrime to provide convincing empirical validation. So, most of the claims listed here have not been rigorously demonstrated, but only heuristically argued for. That is the reality of AGI work right now: one assembles a design based on the best combination of rigorous and heuristic arguments one can, then proceeds to create and teach a system according to the design, adjusting the details of the design based on experimental results as one goes along.

For an uncluttered list of the claims, please refer back to Chapter 1 of Part 1; here we will review the claims integrated into the course of discussion. The following chapter, aimed at the more mathematically-minded reader, gives a list of formal propositions echoing many of the ideas in the chapter - propositions such that, if they are true, then the success of CogPrime as an architecture for general intelligence is likely.
49.2 Multi-Memory Systems

The first of our key claims is that to achieve general intelligence in the context of human-intelligence-friendly environments and goals using feasible computational resources, it's important that an AGI system can handle different kinds of memory (declarative, procedural, episodic, sensory, intentional, attentional) in customized but interoperable ways. The basic idea is that
these different kinds of knowledge have very different characteristics, so that trying to handle them all within a single approach, while surely possible, is likely to be unacceptably inefficient.

The tricky issue in formalizing this claim is that "single approach" is an ambiguous notion: for instance, if one has a wholly logic-based system that represents all forms of knowledge using predicate logic, then one may still have specialized inference control heuristics corresponding to the different kinds of knowledge mentioned in the claim. In this case one has "customized but interoperable ways" of handling the different kinds of memory, and one doesn't really have a "single approach" even though one is using logic for everything. To bypass such conceptual difficulties, one may formalize cognitive synergy using a geometric framework as discussed in Appendix B, in which different types of knowledge are represented as metrized categories, and cognitive synergy becomes a statement about paths to goals being shorter in metric spaces combining multiple knowledge types than in those corresponding to individual knowledge types.

In CogPrime we use a complex combination of representations, including the Atomspace for declarative, attentional and intentional knowledge and some episodic and sensorimotor knowledge, Combo programs for procedural knowledge, simulations for episodic knowledge, and hierarchical neural nets for some sensorimotor knowledge (and related episodic, attentional and intentional knowledge). In cases where the same representational mechanism is used for different types of knowledge, different cognitive processes are used, and often different aspects of the representation (e.g.
attentional knowledge is dealt with largely by ECAN acting on AttentionValues and HebbianLinks in the Atomspace, whereas declarative knowledge is dealt with largely by PLN acting on TruthValues and logical links, also in the AtomSpace). So one has a mix of the "different representations for different memory types" approach and the "different control processes on a common representation for different memory types" approach.

It's unclear how closely dependent the need for a multi-memory approach is on the particulars of "human-friendly environments." We argued in Chapter 9 of Part 1 that one factor militating in favor of a multi-memory approach is the need for multimodal communication: declarative knowledge relates to linguistic communication; procedural knowledge relates to demonstrative communication; attentional knowledge relates to indicative communication; and so forth. But in fact the multi-memory approach may have a broader importance, even to intelligences without multimodal communication. This is an interesting issue but not particularly critical to the development of human-like, human-level AGI, since in the latter case we are specifically concerned with creating intelligences that can handle multimodal communication. So if for no other reason, the multi-memory approach is worthwhile for handling multimodal communication.

Pragmatically, it is also quite clear that the human brain takes a multi-memory approach, e.g. with the cerebellum and closely linked cortical regions containing special structures for handling procedural knowledge, with special structures for handling motivational (intentional) factors, etc. And (though this point is certainly not definitive, it's meaningful in the light of the above theoretical discussion) decades of computer science and narrow-AI practice strongly suggest that the "one memory structure fits all" approach is not capable of leading to effective real-world approaches.
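Schematically, the geometric statement of cognitive synergy invoked in this section can be written as follows; the notation is our gloss, and the precise construction in terms of metrized categories is given in Appendix B:

```latex
% Let (M_i, d_i) be the metric space associated with knowledge type i,
% and (M, d) a combined space with projections \pi_i : M \to M_i.
% Cognitive synergy asserts that, for typical current-state/goal pairs (x, g),
\[
  d(x, g) \;<\; \min_i \; d_i\bigl(\pi_i(x),\, \pi_i(g)\bigr)
\]
% i.e. the shortest path from the current state to the goal through the
% combined space is shorter than the shortest path available within any
% single knowledge type considered on its own.
```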
49.3 Perception, Action and Environment

The more we understand of human intelligence, the clearer it becomes how closely it has evolved to the particular goals and environments for which the human organism evolved. This is true
in a broad sense, as illustrated by the above issues regarding multi-memory systems, and is also true in many particulars, as illustrated e.g. by Changizi's [Cha09] evolutionary analysis of the human visual system. While it might be possible to create a human-like, human-level AGI by abstracting the relevant biases from human biology and behavior and explicitly encoding them in one's AGI architecture, it seems this would be an inordinately difficult approach in practice, leading to the claim that to achieve human-like general intelligence, it's important for an intelligent agent to have sensory data and motoric affordances that roughly emulate those available to humans.

We don't claim this is a necessity - just a dramatic convenience. And if one accepts this point, it has major implications for what sorts of paths toward AGI it makes most sense to follow. Unfortunately, though, the idea of a "human-like" set of goals and environments is fairly vague; and when you come right down to it, we don't know exactly how close the emulation needs to be to form a natural scenario for the maturation of human-like, human-level AGI systems. One could attempt to resolve this issue via a priori theory, but given the current level of scientific knowledge it's hard to see how that would be possible in any definitive sense ... which leads to the conclusion that our AGI systems and platforms need to support fairly flexible experimentation with virtual-world and/or robotic infrastructures.

Our own intuition is that neither current virtual world platforms nor current robotic platforms are quite adequate for the development of human-level, human-like AGI. Virtual worlds would need to become a lot more like robot simulators, allowing more flexible interaction with the environment, and more detailed control of the agent. Robots would need to become more robust at moving and grabbing - e.g.
with BigDog's movement ability but the grasping capability of the best current grabber arms. We do feel that development of adequate virtual world or robotics platforms is quite possible using current technology, and could be done at fairly low cost if someone were to prioritize this. Even without AGI-focused prioritization, it seems that the needed technological improvements are likely to happen during the next decade for other reasons. So at this point we feel it makes sense for AGI researchers to focus on AGI and exploit embodiment-platform improvements as they come along - at least, this makes sense in the case of AGI approaches (like CogPrime) that can be primarily developed in an embodiment-platform-independent manner.

49.4 Developmental Pathways

But if an AGI system is going to live in human-friendly environments, what should it do there? No doubt very many pathways leading from incompetence to adult-human-level general intelligence exist, but one of them is much better understood than any of the others, and that's the one normal human children take. Of course, given their somewhat different embodiment, it doesn't make sense to try to force AGI systems to take exactly the same path as human children, but having AGI systems follow a fairly close approximation to the human developmental path seems the smoothest developmental course ... a point summarized by the claim that: To work toward adult human-level, roughly human-like general intelligence, one fairly easily comprehensible path is to use environments and goals reminiscent of human childhood, and seek to advance one's AGI system along a path roughly comparable to that followed by human children.

Human children learn via a rich variety of mechanisms; but broadly speaking one conclusion one may draw from studying human child learning is that it may make sense to teach an
AGI system aimed at roughly human-like general intelligence via a mix of spontaneous learning and explicit instruction, and to instruct it via a combination of imitation, reinforcement and correction, and a combination of linguistic and nonlinguistic instruction. We have explored exactly what this means in Chapter 31 and others, via looking at examples of these types of learning in the context of virtual pets in virtual worlds, and exploring how specific CogPrime learning mechanisms can be used to achieve simple examples of these types of learning.

One important case of learning that human children are particularly good at is language learning; and we have argued that this is a case where it may pay for AGI systems to take a route somewhat different from the one taken by human children. Humans seem to be born with a complex system of biases enabling effective language learning, and it's not yet clear exactly what these biases are nor how they're incorporated into the learning process. It is very tempting to give AGI systems a "short cut" to language proficiency via making use of existing rule-based and statistical-corpus-analysis-based NLP systems; and we have fleshed out this approach sufficiently to have convinced ourselves it makes practical as well as conceptual sense, in the context of the specific learning mechanisms and NLP tools built into OpenCog. Thus we have provided a number of detailed arguments and suggestions in support of our claim that one effective approach to teaching an AGI system human language is to supply it with some in-built linguistic facility, in the form of rule-based and statistical-linguistics-based NLP systems, and then allow it to improve and revise this facility based on experience.
49.5 Knowledge Representation

Many knowledge representation approaches have been explored in the AI literature, and ultimately many of these could be workable for human-level AGI if coupled with the right cognitive processes. The key goal for a knowledge representation for AGI should be naturalness with respect to the AGI's cognitive processes - i.e. the cognitive processes shouldn't need to undergo complex transformative gymnastics to get information in and out of the knowledge representation in order to do their cognitive work. Toward this end we have come to a similar conclusion to some other researchers (e.g. Joscha Bach and Stan Franklin), and concluded that given the strengths and weaknesses of current and near-future digital computers, a (loosely) neural-symbolic network is a good representation for directly storing many kinds of memory, and interfacing between those that it doesn't store directly. CogPrime's AtomSpace is a neural-symbolic network designed to work nicely with PLN, MOSES, ECAN and the other key CogPrime cognitive processes; it supplies them with what they need without causing them undue complexities. It provides a platform that these cognitive processes can use to adaptively and automatically construct specialized knowledge representations for particular sorts of knowledge that they encounter.

49.6 Cognitive Processes

The crux of intelligence is dynamics, learning, adaptation; and so the crux of an AGI design is the set of cognitive processes that the design provides. These processes must collectively allow the AGI system to achieve its goals in its environments using the resources at hand. Given












