36.3 Inference Control in PLN

36.3.4 Basic Forward and Backward Inference Steps

The choice of an inference step, at the microscopic level, may be done in a number of ways, of which perhaps the simplest are:

• "Basic forward step". Choose an Atom A1, then choose a rule R. If R takes only one input, then apply R to A1. If R applies to two Atoms, then find another Atom A2 so that (A1, A2) may be taken as the inputs of R.
• "Basic backward step". Choose an Atom A1, then choose a rule R. If R takes only one input, then find an Atom A2 so that applying R to A2 yields A1 as output. If R takes two inputs, then find two Atoms (A2, A3) so that applying R to (A2, A3) yields A1 as output.

Given a target Atom such as

A1 = Inheritance $v1 breathe

the VariableAbstractionRule will do inferences such as

ExecutionLink
   VariableAbstractionRule
   HypotheticalLink
      Inheritance people breathe
   HypotheticalLink
      Inheritance $v1 breathe

This allows the basic backward step to carry out variable fulfillment queries as well as truth value queries.

We may encapsulate these processes in the Atomspace as

GroundedSchemaNode: BasicForwardInferenceStep
GroundedSchemaNode: BasicBackwardInferenceStep

which take as input some Atom A1, and also as

GroundedSchemaNode: AttentionalForwardInferenceStep
GroundedSchemaNode: AttentionalBackwardInferenceStep

which automatically choose the Atom A1 they start with, via choosing some Atom within the AttentionalFocus, with probability proportional to STI.

Forward chaining, in its simplest form, then becomes the process of repeatedly executing the AttentionalForwardInferenceStep SchemaNode. Backward chaining, in the simplest case (we will discuss more complex cases below), becomes the process of

1. Repeatedly executing the BasicBackwardInferenceStep SchemaNode, starting from a given target Atom
2. Concurrently,
repeatedly executing the AttentionalBackwardInferenceStep SchemaNode, to ensure that backward inference keeps occurring regarding Atoms that were created via Step 1

Inside the BasicForwardStep or BasicBackwardStep schema, there are two choices to be made: choosing a rule R, and then choosing additional Atoms A2 and possibly A3. The choice of the rule R should be made probabilistically, choosing each rule with probability proportional to a certain weight associated with each rule. Initially we can assign these weights
generically, by hand, separately for each application domain. Later on they should be chosen adaptively, based on information mined from the InferenceRepository, regarding which rules have been better in which contexts.

The choice of the additional Atoms A2 and A3 is subtler, and should be done using STI values as a guide:

• First the AttentionalFocus is searched, to find all the Atoms there that fit the input criteria of the rule R. Among all the Atoms found, an Atom is chosen with probability proportional to STI.
• If the AttentionalFocus doesn't contain anything suitable, then an effort may be made to search the rest of the Atomspace to find something suitable. If multiple candidates are found within the amount of effort allotted, then one should be chosen with probability proportional to STI.

If an Atom A is produced as output of a forward inference step, or is chosen as the input of a backward inference step, then the STI of this Atom A should be incremented. This will increase the probability of A being chosen for ongoing inference. In this way, attention allocation is used to guide the course of ongoing inference.

36.3.5 Interaction of Forward and Backward Inference

Starting from a target, a series of backward inferences can figure out ways to estimate the truth value of that target, or fill in the variables within that target. However, once the backward-going chain of inferences is done (to some reasonable degree of satisfaction), there is still the remaining task of using all the conclusions drawn during the series of backward inferences to actually update the target. Elegantly, this can be done via forward inference. So if forward and backward inference are both operating concurrently on the same pool of Atoms, it is forward inference that will propagate the information learned during backward chaining inference up to the target of the backward chain.
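As a concrete illustration, the basic forward step of Section 36.3.4 — weighted rule choice followed by STI-proportional Atom choice, with an STI reward for the conclusion — might be sketched as follows. All the data structures here (Atom, Rule, the rule-weight table) are simplified stand-ins for illustration, not the actual Atomspace API:

```python
import random

# Simplified stand-ins; the real Atomspace/PLN API differs.
class Atom:
    def __init__(self, name, sti=1.0):
        self.name = name
        self.sti = sti

class Rule:
    def __init__(self, name, arity, fits, apply_fn):
        self.name, self.arity = name, arity
        self.fits, self.apply = fits, apply_fn

def choose_weighted(items, weight):
    """Pick an item with probability proportional to weight(item)."""
    total = sum(weight(i) for i in items)
    r = random.uniform(0.0, total)
    acc = 0.0
    for i in items:
        acc += weight(i)
        if acc >= r:
            return i
    return items[-1]

def basic_forward_step(a1, rules, rule_weights, attentional_focus, atomspace):
    """One forward step: weighted rule choice, then STI-guided input choice."""
    rule = choose_weighted(rules, lambda ru: rule_weights[ru.name])
    if rule.arity == 1:
        inputs = [a1]
    else:
        # Search the AttentionalFocus first; fall back to the whole Atomspace.
        candidates = [a for a in attentional_focus if rule.fits(a1, a)]
        if not candidates:
            candidates = [a for a in atomspace if rule.fits(a1, a)]
        if not candidates:
            return None
        inputs = [a1, choose_weighted(candidates, lambda a: a.sti)]
    out = rule.apply(*inputs)
    if out is not None:
        out.sti += 1.0  # reward the conclusion so it stays in play
    return out
```

Note how the STI increment at the end implements the feedback described above: conclusions of forward steps become more likely inputs for subsequent steps.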
36.3.6 Coordinating Variable Bindings

Probably the thorniest subtlety that comes up in a PLN implementation is the coordination of the values assigned to variables, across different micro-level inferences that are supposed to be coordinated together as part of the same macro-level inference.

For a very simple example, suppose we have a truth-value query with target

A1 = InheritanceLink Bob rich

Suppose the deduction rule R is chosen. Then if we can find (A2, A3) that look like, say,

A2 = InheritanceLink Bob owns_mansion
A3 = InheritanceLink owns_mansion rich
our problem is solved. But what if there is no such simple solution available in the Atomspace? Then we have to build something like

A2 = InheritanceLink Bob $v1
A3 = InheritanceLink $v1 rich

and try to find something that works to fill in the variable $v1. But this is tricky, because $v1 now has two constraints (A2 and A3). So, suppose A2 and A3 are both created as a result of applying BasicBackwardInferenceStep to A1, and thus A2 and A3 both get high STI values. Then both A2 and A3 are going to be acted on by AttentionalBackwardInferenceStep. But as A2 and A3 are expanded via other inputs using backward inference, it is necessary that the values assigned to $v1 in the context of A2 and A3 remain consistent with each other. Note that, according to the operation of the Atomspace, the same VariableAtom will be used to represent $v1 no matter where it occurs.

For instance, it will be problematic if one inference rule schema tries to instantiate $v1 with "owns_mansion", but another tries to instantiate $v1 with "lives_in_Manhattan". That is, we don't want to find

InheritanceLink Bob lives_in_mansion
InheritanceLink lives_in_mansion owns_mansion
|-
InheritanceLink Bob owns_mansion

which binds $v1 to owns_mansion, and

InheritanceLink lives_in_Manhattan lives_in_top_city
InheritanceLink lives_in_top_city rich
|-
InheritanceLink lives_in_Manhattan rich

which binds $v1 to lives_in_Manhattan. We want A2 and A3 to be derived in ways that bind $v1 to the same thing.

The most straightforward way to avoid confusion in this sort of context is to introduce an additional kind of inference step:

• "Variable-guided backward step". Choose a set V of VariableNodes (which may just be a single VariableNode $v1), and identify the set S_V of all Atoms involving any of the variables in V.
- Firstly: if V divides into two sets V1 and V2, so that no Atom contains variables in both V1 and V2, then launch separate variable-guided backward steps for V1 and V2. This step is "Problem Decomposition".
- Carry out the basic backward step for all the Atoms in S_V, but restricting the search for Atoms A2, A3 in such a way that each of the variables in V is consistently instantiated. This is a non-trivial optimization, and more will be said about this below.

• "Variable-guided backward step, Atom-triggered". Choose an Atom A1. Identify the set V of VariableNodes targeted by A1, and then do a variable-guided backward step starting from V.

This variable guidance may, of course, be incorporated into the AttentionalBackwardInferenceStep as well. In this case, backward chaining becomes the process of
• Repeatedly executing the VariableGuidedBackwardInferenceStep SchemaNode, starting from a given target Atom
• Concurrently, repeatedly executing the AttentionalVariableGuidedBackwardInferenceStep SchemaNode, to ensure that backward inference keeps occurring regarding Atoms that were created via Step 1

The hard work here is then done in step 2 of the variable-guided backward step, which has to search for multiple Atoms, to fulfill the requirements of multiple inference rules, in a way that keeps variable instantiations consistent. But this same difficulty exists in a conventional backward chaining framework; it's just arranged differently, and not as neatly encapsulated.

36.3.7 An Example of Problem Decomposition

Illustrating a point raised above, we now give an example of a case where, given a problem of finding values to assign to a set of variables to make a set of expressions hold simultaneously, the appropriate course is to divide the set of expressions into two separate parts. Suppose we have the six expressions

E1 = Inheritance( $v1, Animal)
E2 = Evaluation( $v1, ($v2, Bacon) )
E3 = Inheritance( $v2, $v3)
E4 = Evaluation( Eat, ($v3, $v1) )
E5 = Evaluation( Eat, ($v7, $v9) )
E6 = Inheritance( $v9, $v6)

Since the set {E1, E2, E3, E4} doesn't share any variables with {E5, E6}, there is no reason to consider them all as one problem. Rather, we will do better to decompose it into two problems, one involving {E1, E2, E3, E4} and one involving {E5, E6}. In general, given a set of expressions, one can divide it into subsets, where each subset S has the property that: for every variable v contained in S, all occurrences of v in the Atomspace are in expressions contained in S.
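The decomposition criterion just stated amounts to computing connected components of the "shares a variable" relation among expressions. A minimal sketch, under the simplifying assumption that each expression is represented only by the set of variables it contains:

```python
def decompose(expressions):
    """Partition expressions into groups sharing no variables across groups.

    `expressions` maps an expression label to the set of variables it
    contains. Two expressions belong to the same group iff they are
    connected through chains of shared variables.
    """
    groups = []  # each group is a pair: (set of labels, set of variables)
    for label, vars_ in expressions.items():
        # Find every existing group this expression touches.
        touching = [g for g in groups if g[1] & vars_]
        merged_labels = {label}
        merged_vars = set(vars_)
        for g in touching:
            merged_labels |= g[0]
            merged_vars |= g[1]
            groups.remove(g)
        groups.append((merged_labels, merged_vars))
    return [sorted(g[0]) for g in groups]
```

Applied to the six expressions above, this yields the two groups {E1, E2, E3, E4} and {E5, E6}.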
36.3.8 Example of Casting a Variable Assignment Problem as an Optimization Problem

Suppose we have the four expressions

E1 = Inheritance( $v1, Animal)
E2 = Evaluation( $v2, ($v1, Bacon) )
E3 = Inheritance( $v2, $v3)
E4 = Evaluation( Enjoy, ($v1, $v3) )
where Animal, Bacon and Enjoy are specific Atoms. Suppose the task at hand is to find values for ($v1, $v2, $v3) that will make all of these expressions confidently true.

If there is some assignment ($v1, $v2, $v3) = (A1, A2, A3) ready to hand in the Atomspace that fulfills the equations E1, E2, E3, E4, then the Atomspace API's pattern matcher will find it. For instance, ($v1, $v2, $v3) = (Cat, Eat, Chew) would work here, since

E1 = Inheritance( Cat, Animal)
E2 = Evaluation( Eat, (Cat, Bacon) )
E3 = Inheritance( Eat, Chew)
E4 = Evaluation( Enjoy, (Cat, Chew) )

are all reasonably true.

If there is no such assignment ready to hand, then one is faced with a search problem. This can be approached as an optimization problem, e.g. one of maximizing the function

f($v1, $v2, $v3) = sc(E1) * sc(E2) * sc(E3) * sc(E4)

where

sc(A) = A.strength * A.confidence

The function f is then a function with signature

f: Atom^3 --> float

f can then be optimized by a host of optimization algorithms. For instance a genetic algorithm approach might work, but a BOA (Bayesian Optimization Algorithm) approach would probably be better. In a GA approach, mutation would work as follows. Suppose one had a candidate

($v1, $v2, $v3) = (A1, A2, A3)

Then one could mutate this candidate by (for instance) replacing A1 with some other Atom that is similar to A1, e.g. connected to A1 with a high-weight SimilarityLink in the Atomspace.

36.3.9 Backward Chaining via Nested Optimization

Given this framework that does inference involving variables via using optimization to solve simultaneous equations of logical expressions with overlapping variables, "backward chaining" becomes the iterative launch of repeated optimization problems, each one defined in terms of the previous ones. We will now illustrate this point via continuing with the example from above. Suppose one found an assignment ($v1, $v2, $v3) = (A1, A2, A3) that worked for every equation except E3.
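Before continuing, the fitness function and mutation operator of Section 36.3.8 might be sketched as follows. Here `evaluate` is a stand-in for evaluating an expression's truth value under a candidate assignment, and `similar_atoms` a stand-in for a SimilarityLink lookup; neither is part of any actual API:

```python
import random

def sc(tv):
    """Satisfaction score of a truth value: strength times confidence."""
    strength, confidence = tv
    return strength * confidence

def fitness(assignment, expressions, evaluate):
    """f(assignment) = product of sc(E) over all expressions E.

    `evaluate(expr, assignment)` returns a (strength, confidence) pair."""
    f = 1.0
    for expr in expressions:
        f *= sc(evaluate(expr, assignment))
    return f

def mutate(assignment, similar_atoms):
    """Replace one atom in the assignment with a similar atom.

    `similar_atoms(a)` stands in for querying Atoms connected to `a`
    by high-weight SimilarityLinks."""
    new = dict(assignment)
    var = random.choice(list(new))
    options = similar_atoms(new[var])
    if options:
        new[var] = random.choice(options)
    return new
```

A low-strength, high-confidence sub-expression drives the product toward zero, which is exactly the "seems hopeless, don't follow up" behavior described below.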
Then there is the problem of finding some way to make
E3 = Inheritance( A2, A3)

work. For instance, what if we have found the assignment

($v1, $v2, $v3) = (Cat, Eat, Chase)

In this case, we have

E1 = Inheritance( Cat, Animal) -- YES
E2 = Evaluation( Eat, (Cat, Bacon) ) -- YES
E3 = Inheritance( Eat, Chase) -- NO
E4 = Evaluation( Enjoy, (Cat, Chase) ) -- YES

so the assignment works for every equation except E3. Then there is the problem of finding some way to make

E3 = Inheritance( Eat, Chase)

work. But if the truth value of Inheritance( Eat, Chase) has a low strength and high confidence, this may seem hopeless, so this assignment may not get followed up on. On the other hand, we might have the assignment

($v1, $v2, $v3) = (Cat, Eat, SocialActivity)

In this case, for a particular CogPrime instance, we might have

E1 = Inheritance( Cat, Animal) -- YES
E2 = Evaluation( Eat, (Cat, Bacon) ) -- YES
E3 = Inheritance( Eat, SocialActivity) -- UNKNOWN
E4 = Evaluation( Enjoy, (Cat, SocialActivity) ) -- YES

The above would hold if the reasoning system knew that cats enjoy social activities, but did not know whether eating is a social activity. In this case, the reasoning system would have reason to launch a new inference process aimed at assessing the truth value of

E3 = Inheritance( Eat, SocialActivity)

This is backward chaining: launching a new inference process to figure out a question raised by another inference process. For instance, in this case the inference engine might choose an inference Rule (let's say it's Deduction, for simplicity), and then look for $v4 so that

Inheritance Eat $v4
Inheritance $v4 SocialActivity

are both true. In this case one has spawned a new Variable-Guided Backward Inference problem, which must be solved in order to make {A1, A2, A3} an OK solution for the problem of {E1, E2, E3, E4}.
Or it might choose the Induction rule, and look for $v4 so that

Inheritance $v4 Eat
Inheritance $v4 SocialActivity

Maybe then it would find that $v4 = Dinner works, because it knows that
Inheritance Dinner Eat
Inheritance Dinner SocialActivity

But maybe $v4 = Dinner doesn't boost the truth value of

E3 = Inheritance( Eat, SocialActivity)

high enough. In that case it may keep searching for more information about E3 in the context of this particular variable assignment. It might choose Induction again, and discover e.g. that

Inheritance Lunch Eat
Inheritance Lunch SocialActivity

In this example, we've assumed that some non-backward-chaining heuristic search mechanism found a solution that almost works, so that backward chaining is only needed on E3. But of course, one could backward chain on all of E1, E2, E3, E4 simultaneously, or various subsets thereof. For a simple example, suppose one backward chains on

E1 = Inheritance( $v1, Animal)
E3 = Inheritance( $v2, SocialActivity)

simultaneously. Then one is seeking, say, ($v4, $v5) so that

Inheritance $v1 $v5
Inheritance $v5 Animal
Inheritance $v2 $v4
Inheritance $v4 SocialActivity

This adds no complexity, as the four relations partition into two disjoint sets of two. Separate chaining processes may be carried out for E1 and E3. On the other hand, for a slightly more complex example, what if we backward chain on

E2 = Evaluation( $v2, ($v1, Bacon) )
E3 = Inheritance( $v2, SocialActivity)

simultaneously? (Assuming that a decision has already been made to explore the possibility $v3 = SocialActivity.) Then we have a somewhat more complex situation. We are trying to find $v2 that is a SocialActivity, so that $v1 likes to do $v2 in conjunction with Bacon.
If the Member2Evaluation rule is chosen for E2 and the Deduction rule is chosen for E3, then we have

E5 = Inheritance $v2 $v6
E6 = Inheritance $v6 SocialActivity
E7 = Member ($v1, Bacon) (SatisfyingSet $v2)

and if the Inheritance2Member rule is then chosen for E7, we have

E5 = Inheritance $v2 $v6
E6 = Inheritance $v6 SocialActivity
E8 = Inheritance ($v1, Bacon) (SatisfyingSet $v2)

and if Deduction is then chosen for E8, then we have

E5 = Inheritance $v2 $v6
E6 = Inheritance $v6 SocialActivity
E9 = Inheritance ($v1, Bacon) $v8
E10 = Inheritance $v8 (SatisfyingSet $v2)
Following these steps expands the search to involve more variables, and means the inference engine now gets to deal with

E1 = Inheritance( $v1, Animal)
E4 = Evaluation( Enjoy, ($v1, SocialActivity) )
E5 = Inheritance $v2 $v6
E6 = Inheritance $v6 SocialActivity
E9 = Inheritance ($v1, Bacon) $v8
E10 = Inheritance $v8 (SatisfyingSet $v2)

or some such; i.e., we have expanded our problem to include more and more simultaneous logical equations in more and more variables! Which is not necessarily a terrible thing, but it does get complicated. We might find, for example, that $v1 = Pig, $v6 = Dance, $v2 = Waltz, $v8 = PiggyWaltz:

E1 = Inheritance( Pig, Animal)
E4 = Evaluation( Enjoy, (Pig, SocialActivity) )
E5 = Inheritance Waltz Dance
E6 = Inheritance Dance SocialActivity
E9 = Inheritance (Pig, Bacon) PiggyWaltz
E10 = Inheritance PiggyWaltz (SatisfyingSet Waltz)

Here PiggyWaltz is a special dance that pigs do with their Bacon, as a SocialActivity! Of course, this example is extremely contrived. Real inference examples will rarely be this simple, and will not generally involve Nodes that have simple English names. This example is just for illustration of the concepts involved.

36.4 Combining Backward and Forward Inference Steps with Attention Allocation to Achieve the Same Effect as Backward Chaining (and Even Smarter Inference Dynamics)

While backward chaining is a powerful heuristic, one can achieve the same effect (and even smarter inference dynamics) via a combination of:

• heuristic search to satisfy simultaneous expressions
• boosting the STI of expressions being searched
• importance spreading (of STI)
• ongoing background forward inference

These can combine to yield the same basic effect as backward chaining, but without explicitly doing backward chaining.
The basic idea is: when a system of expressions involving variables is explored using a GA or whatever other optimization process is deployed, these expressions also get their STI boosted. Then the Atoms with high STI are explored by the forward inference process, which is always acting in the background on the Atoms in the Atomspace. Other Atoms related to these also get STI via importance spreading, and these other related Atoms are then acted on by forward inference as well. This forward chaining will then lead to the formation of new Atoms, which may make the solution of the system of expressions easier the next time it is visited by the backward inference process.
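A toy sketch of this dynamic: boost the STI of expressions under search, spread importance to associated Atoms, and let a background forward-inference pass act on whatever is currently important. The STI table, the neighbor map and the forward rule are all simplified stand-ins for the actual ECAN and PLN machinery:

```python
def boost(sti, atoms, amount=1.0):
    """Boost the STI of the given atoms (e.g. expressions under search)."""
    for a in atoms:
        sti[a] = sti.get(a, 0.0) + amount

def spread_importance(sti, neighbors, rate=0.5):
    """Spread a fraction of each atom's STI to its associated atoms."""
    updates = {}
    for a, s in sti.items():
        for b in neighbors.get(a, []):
            updates[b] = updates.get(b, 0.0) + rate * s / max(len(neighbors[a]), 1)
    for b, d in updates.items():
        sti[b] = sti.get(b, 0.0) + d

def background_forward_step(sti, forward_rule, threshold=0.4):
    """Apply forward inference to every atom whose STI exceeds a threshold.

    New conclusions are added to the STI table (so later passes can build on
    them) and returned."""
    new_atoms = []
    for a in [x for x, s in sti.items() if s >= threshold]:
        for out in forward_rule(a):
            if out not in sti:
                sti[out] = threshold
                new_atoms.append(out)
    return new_atoms
```

Iterating boost, spread and the background step gives the cycle described above: conclusions produced by forward inference become available the next time the backward process visits the expression system.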
In the above example, this means:

• E1, E2, E3, E4 will all get their STI boosted
• Other Atoms related to these (Animal, Bacon and Enjoy) will also get their STI boosted
• These other Atoms will get forward inference done on them
• This forward inference will then yield new Atoms that can be drawn on when the solution of the expression-system {E1, E2, E3, E4} is pursued the next time

So, for example, if the system did not know that eating is a social activity, it might learn this during forward inference on SocialActivity. The fact that SocialActivity has high STI would cause forward inferences such as

Inheritance Dinner Eat
Inheritance Dinner SocialActivity
|-
Inheritance Eat SocialActivity

to get done. These forward inferences would then produce links that could simply be found by the pattern matcher when trying to find variable assignments to satisfy {E1, E2, E3, E4}.

36.4.1 Breakdown into MindAgents

To make this sort of PLN dynamic work, we require a number of MindAgents to be operating "ambiently" in the background whenever inference is occurring; to wit:

• attentional forward chaining (i.e. each time this MindAgent is invoked, it chooses high-STI Atoms and does basic forward chaining on them)
• attention allocation (importance updating is critical; Hebbian learning is also useful)
• attentional (variable-guided) backward chaining

On top of this ambient inference, we may then have query-driven backward chaining inferences submitted by other processes (via these launching backward inference steps and giving the associated Atoms lots of STI). The ambient inference processes will help the query-driven inference processes to get fulfilled.

36.5 Hebbian Inference Control

A key aspect of the PLN control mechanism described here is the use of attention allocation to guide inference, and in particular to guide Atom choice in the course of forward and backward inference.
Figure 36.1 gives a simple illustrative example of the use of attention allocation, via HebbianLinks, for PLN backward chaining. The semantics of a HebbianLink between A and B is, intuitively: in the past, when A was important, B was also important. HebbianLinks are created via two basic mechanisms: pattern-mining of associations between importances in the system's history, and PLN inference based on HebbianLinks created via pattern mining (and inference). Thus, saying that PLN inference control relies largely on HebbianLinks is in part saying that PLN inference control relies on
PLN. There is a bit of a recursion here, but it's not a bottomless recursion, because it bottoms out with HebbianLinks learned via pattern mining.

Fig. 36.1: The Use of Attention Allocation for Guiding Backward Chaining Inference. (The figure shows deduction steps concerning whether Wilbur the pig is friendly, importance spreading giving "Wilbur", "pig" and "friendly" high STI, and a search of episodic memory for episodes with friendly or unfriendly pigs.)

As an example of the Atom-choices to be made by a forward or backward inference agent in the course of doing inference, consider that to evaluate (Inheritance A C) via the deduction Rule, some collection of intermediate nodes for the deduction must be chosen. In the case of higher-order deduction, each deduction may involve a number of complicated subsidiary steps, so perhaps only a single intermediate node will be chosen. This choice of intermediate nodes must be made via context-dependent prior probabilities. In the case of other Rules besides deduction, other similar choices must be made.

The basic means of using HebbianLinks in inferential Atom-choice is simple: if there are Atoms linked via HebbianLinks with the other Atoms in the inference tree, then these Atoms should be given preference in the selection process.

Along the same lines but more subtly, another valuable heuristic for guiding inference control is "on-the-fly associatedness assessment." If there is a chance to apply the chosen Rule via working with Atoms that are:

• strongly associated with the Atoms in the Atom being evaluated (via HebbianLinks)
• strongly associated with each other via HebbianLinks (hence forming a cohesive set)

then this should be ranked as a good thing.

For instance, it may be the case that, when doing deduction regarding relationships between humans, using relationships involving other humans as intermediate nodes in the deduction is often useful. Formally this means that, when doing inference of the form:

AND
   Inheritance A human
   Inheritance A B
   Inheritance C human
   Inheritance C B
|-
Inheritance A C

then it is often valuable to choose B so that:

HebbianLink B human

has high strength. This would follow from the above-mentioned heuristic.

Next, suppose one has noticed a more particular heuristic: that in trying to reason about humans, it is particularly useful to think about their wants. This suggests that in abductions of the above form it is often useful to choose B of the form:

SatisfyingSet [ wants(human, $X) ]

This is too fine-grained of a cognitive-control intuition to come from simple association-following. Instead, it requires fairly specific data-mining of the system's inference history. It requires the recognition of "Hebbian predicates" of the form:

HebbianImplication
   AND
      Inheritance $A human
      Inheritance $C human
      Similarity $B
         SatisfyingSet
            Evaluation wants (human, $X)
   AND
      Inheritance $A $B
      Inheritance $C $B

The semantics of:

HebbianImplication X Y

is that when X is being thought about, it is often valuable to think about Y shortly thereafter. So what is required to do inference control according to heuristics like "think about humans according to their wants" is a kind of backward-chaining inference that combines Hebbian implications with PLN inference rules. PLN inference says that to assess the relationship between two people, one approach is abduction. But Hebbian learning says that when setting up an abduction between two people, one useful precondition is if the intermediate term in the abduction regards wants.
Then a check can be made whether there are any relevant intermediate terms regarding wants in the system's memory.

What we see here is that the overall inference control strategy can be quite simple. For each Rule that can be applied, a check can be made for whether there is any relevant Hebbian knowledge regarding the general constructs involved in the Atoms this Rule would be manipulating.
If so, then the prior probability of this Rule is increased, for the purposes of the Rule-choice bandit problem. Then, if the Rule is chosen, the specific Atoms this Rule would involve in the inference can be summoned up, and the relevant Hebbian knowledge regarding these Atoms can be utilized.

To take another similar example, suppose we want to evaluate:

Inheritance pig dog

via the deduction Rule (which also carries out induction and abduction). There are a lot of possible intermediate terms, but a reasonable heuristic is to ask a few basic questions about them: How do they move around? What do they eat? How do they reproduce? How intelligent are they? Some of these standard questions correspond to particular intermediate terms; e.g. the intelligence question partly boils down to computing:

Inheritance pig intelligent

and:

Inheritance dog intelligent

So a link:

HebbianImplication animal intelligent

may be all that's needed to guide inference to asking this question. This HebbianLink says that when thinking about animals, it's often interesting to think about intelligence. This should bias the system to choose "intelligent" as an intermediate node for inference.

On the other hand, the what-do-they-eat question is subtler, and boils down to asking: find $X so that when:

R($X) = SatisfyingSet[$Y] eats ($Y, $X)

holds (R($X) is a concept representing the things that eat $X), then we have:

Inheritance pig R($X)

and:

Inheritance dog R($X)

In this case, a HebbianLink from animal to eat would not really be fine-grained enough. Instead we want a link of the form:

HebbianImplication
   Inheritance $X animal
   SatisfyingSet[$Y] eats ($X, $Y)

This says that when thinking about an animal, it's interesting to think about what that animal eats. The deduction Rule, when choosing which intermediate nodes to use, needs to look at the scope of available HebbianLinks and HebbianPredicates, and use them to guide its choice.
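The basic heuristic here — prefer intermediate terms Hebbian-linked to the terms already in the inference — can be sketched as follows, with `hebbian_strength` standing in for a HebbianLink weight lookup (not an actual API call):

```python
import random

def choose_intermediate(candidates, query_terms, hebbian_strength, base=0.1):
    """Choose an intermediate node B for deduction/abduction, biased toward
    nodes strongly Hebbian-linked to the terms already in the inference.

    `hebbian_strength(b, t)` returns the weight of HebbianLink b t (0 if
    there is no such link); `base` keeps unlinked candidates selectable.
    """
    def weight(b):
        return base + sum(hebbian_strength(b, t) for t in query_terms)

    # Roulette-wheel selection proportional to Hebbian-adjusted weight.
    total = sum(weight(b) for b in candidates)
    r = random.uniform(0.0, total)
    acc = 0.0
    for b in candidates:
        acc += weight(b)
        if acc >= r:
            return b
    return candidates[-1]
```

For the pig/dog example, a HebbianImplication from animal to intelligent would make "intelligent" a heavily weighted intermediate term whenever "animal" is among the query terms.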
And if there are no good intermediate nodes available, it may report that it doesn't have enough experience to assess with any confidence whether it can come up with a good conclusion. As a consequence of the bandit-problem dynamics, it may then be allocated reduced resources, or another Rule may be chosen altogether.
36.6 Inference Pattern Mining

Along with general-purpose attention spreading, it is very useful for PLN processes to receive specific guidance based on patterns mined from previously performed and stored inferences. This information is stored in CogPrime in a data repository called the InferencePatternRepository, which is, quite simply, a special "data table" containing inference trees extracted from the system's inference history, and patterns recognized therein. An "inference tree" refers to a tree whose nodes, called InferenceTreeNodes, are Atoms (or generally Atom-versions: Atoms with truth value relative to a certain context), and whose links are inference steps (so each link is labeled with a certain inference rule). In a large CogPrime system it may not be feasible to store all inference trees; but then a wide variety of trees should still be retained, including mainly successful ones, as well as a sampling of unsuccessful ones for purposes of comparison.

The InferencePatternRepository may then be used in two ways:

• An inference tree being actively expanded (i.e. utilized within the PLN inference system) may be compared to inference trees in the repository, in real time, for guidance. That is, if a node N in an inference tree is being expanded, then the repository can be searched for nodes similar to N, whose contexts (within their inference trees) are similar to the context of N within its inference tree. A study can then be made regarding which Rules and Atoms were most useful in these prior, similar inferences, and the results of this can be used to guide ongoing inference.
• Patterns can be extracted from the store of inference trees in the InferencePatternRepository, and stored separately from the actual inference trees (in essence, these patterns are inference subtrees with variables in place of some of their concrete nodes or links).
An inference tree being expanded can then be compared to these patterns instead of, or in addition to, the actual trees in the repository. This provides greater efficiency in the case of common patterns among inference trees. A reasonable approach may be to first check for inference patterns and see if there are any close matches; and if there are not, to then search for individual inference trees that are close matches.

Mining patterns from the repository of inference trees is a potentially highly computationally expensive operation, but this doesn't particularly matter, since it can be run periodically in the background while inference proceeds at its own pace in the foreground, using the mined patterns. Algorithmically, it may be done either by exhaustive frequent-itemset mining (as in deterministic greedy datamining algorithms), or by stochastic greedy mining. These operations should be carried out by an InferencePatternMiner MindAgent.

36.7 Evolution As an Inference Control Scheme

It is possible to use PEPL (Probabilistic Evolutionary Program Learning, such as MOSES) as, in essence, an inference control scheme. Suppose we are using an evolutionary learning mechanism such as MOSES or PLEASURE [Goe08a] to evolve populations of predicates or schemata. Recall that there are two ways to evaluate procedures in CogPrime: by inference
or by direct evaluation. Consider the case where inference is needed in order to provide high-confidence estimates of the evaluation or execution relationships involved. Then there is the question of how much effort to spend on inference, for each procedure being evaluated as part of the fitness evaluation process. Spending a small amount of effort on inference means that one doesn't discover much beyond what's immediately apparent in the AtomSpace. Spending a large amount of effort on inference means that one is trying very hard to use indirect evidence to support conjectures regarding the evaluation or execution Links involved.

When one is evolving a large population of procedures, one can't afford to do too much inference on each candidate procedure being evaluated. Yet, of course, doing more inference may yield more accurate fitness evaluations, hence decreasing the number of fitness evaluations required.

Often, a good heuristic is to gradually increase the amount of inference effort spent on procedure evaluation, during the course of evolution. Specifically, one may make the amount of inference effort roughly proportional to the overall population fitness. This way, initially, evolution is doing a cursory search, not thinking too much about each possibility. But once it has some fairly decent guesses in its population, then it starts thinking hard, applying more inference to each conjecture.

Since the procedures in the population are likely to be interrelated to each other, inferences done on one procedure are likely to produce intermediate knowledge that's useful for doing inference on other procedures. Therefore, what one has in this scheme is evolution as a control mechanism for higher-order inference. Combined with the use of evolutionary learning to achieve memory across optimization runs, this is a very subtle approach to inference control, quite different from anything in the domain of logic-based AI.
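The heuristic of scaling inference effort with population fitness might be sketched like this. `evaluate_with_inference` is a hypothetical fitness function that accepts an inference-effort budget, fitness values are assumed normalized to [0, 1], and the variation step is a trivial stand-in:

```python
def evolve(population, evaluate_with_inference, generations,
           min_effort=1, max_effort=100):
    """Evolutionary loop whose per-candidate inference effort grows roughly
    in proportion to the current mean population fitness."""
    effort = min_effort
    history = []
    for _ in range(generations):
        scored = [(evaluate_with_inference(p, effort), p) for p in population]
        mean_fitness = sum(f for f, _ in scored) / len(scored)
        # Cursory search early on; think harder once guesses get decent.
        effort = max(min_effort, int(max_effort * mean_fitness))
        history.append((mean_fitness, effort))
        scored.sort(key=lambda t: -t[0])
        survivors = [p for _, p in scored[: max(1, len(scored) // 2)]]
        population = survivors + survivors  # trivial stand-in for variation
    return population, history
```

As mean fitness rises across generations, the effort budget passed to each fitness evaluation rises with it, which is exactly the cursory-then-careful schedule described above.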
Rather than guiding individual inference steps on a detailed basis, this type of control mechanism uses evolutionary logic to guide the general direction of inference, pushing the vast mass of exploratory inferences in the direction of solving the problem at hand, based on a flexible usage of prior knowledge.

36.8 Incorporating Other Cognitive Processes into Inference

Hebbian inference control and inference pattern mining are valuable and powerful processes, but they are not always going to be enough. The solution of some problems that CogPrime chooses to address via inference will ultimately require the use of other methods, too. In these cases, one workaround is for inference to call on other cognitive processes to help it out. This is done via the forward or backward chaining agents identifying specific Atoms deserving of attention by other cognitive processes, and then spawning Tasks executing these other cognitive processes on the appropriate Atoms.

Firstly, which Atoms should be selected for this kind of attention? What we want are InferenceTreeNodes that:

• have high STI;
• have the potential to significantly change the overall truth value of the inference tree they are embedded in (something that can be calculated by hypothetically varying the truth value of the InferenceTreeNode and seeing how the truth value of the overall conclusion is affected);
• have truth values that are known with low confidence.
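The three criteria above can be combined into a single candidate score, as in the following sketch. The `node` fields (`sti`, `strength`, `confidence`) and the `evaluate_conclusion` callback are hypothetical interfaces chosen for illustration, not CogPrime's actual data structures.

```python
def candidate_score(node, evaluate_conclusion, delta=0.1):
    """Score an InferenceTreeNode as a candidate for attention by other
    cognitive processes.

    node: dict with 'sti' (short-term importance) plus the 'strength'
    and 'confidence' of its truth value.
    evaluate_conclusion: recomputes the strength of the tree's overall
    conclusion given a hypothetical strength for this node.
    """
    base = evaluate_conclusion(node["strength"])
    up = evaluate_conclusion(min(1.0, node["strength"] + delta))
    down = evaluate_conclusion(max(0.0, node["strength"] - delta))
    # Impact: how strongly the conclusion reacts to varying this node.
    impact = (abs(up - base) + abs(down - base)) / (2.0 * delta)
    # High STI, high impact and low confidence all raise the score.
    return node["sti"] * impact * (1.0 - node["confidence"])
```

A node whose truth value barely matters to the conclusion, or that is already known with high confidence, scores near zero and is left to ordinary inference.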
Truth values meeting these criteria should be taken as strong candidates for attention by other cognitive processes. The next question is: which other cognitive processes do we apply, in which cases?

MOSES in supervised categorization mode can be applied to a candidate InferenceTreeNode representing a CogPrime Node if it has a sufficient number of members (Atoms linked to it by MemberLinks), and a sufficient number of new members have been added to it (or have had their membership degree significantly changed) since MOSES in supervised categorization mode was last used on it.

Next, pattern mining can be applied to look for connectivity patterns elsewhere in the AtomTable, similar to the connectivity patterns of the candidate Atom, if the candidate Atom has changed significantly since pattern mining last visited it.

More subtly, what if we try to find whether "cross breed" implies "Ugliness", and we know that "bad genes" implies Ugliness, but can't find a way, by backward chaining, to prove that "cross breed" implies "bad genes"? Then we could launch a non-backward-chaining algorithm to measure the overlap of SatisfyingSet(cross breed) and SatisfyingSet(bad genes). Specifically, we could use MOSES in supervised categorization mode to find relationships characterizing "cross breed" and other relationships characterizing "bad genes", and then do some forward chaining inference on these relationships. This would be a general heuristic for what to do when there's a link with low confidence but high potential importance to the inference tree.

SpeculativeConceptFormation (see Chapter 38) may also be used to create new concepts and attempt to link them to the Atoms involved in an inference (via subsidiary inference processes, or HebbianLink formation based on usage in learned procedures, etc.), so that they may be used in inference.

36.9 PLN and Bayes Nets

Finally, we give some comments on the relationship between PLN and Bayes Nets [Pea88].
We have not yet implemented such an approach, but it may well be that Bayes Nets methods can serve as a useful augmentation to PLN for certain sorts of inference (specifically, for inference on networks of knowledge that are relatively static in nature).

We can't use standard Bayes Nets as the primary way of structuring reasoning in CogPrime because CogPrime's knowledge network is loopy. The peculiarities that allow belief propagation to work in standard loopy Bayes nets don't hold up in CogPrime, because of the way one has to update probabilities when managing a very large network in interaction with a changing world, such that different parts of the network get different amounts of focus. So in PLN we use different mechanisms (the "inference trail" mechanism) to avoid "repeated evidence counting", whereas loopy Bayes nets rely on the fact that, in the standard loopy configuration, extra evidence counting occurs in a fairly constant way across the network.

However, when you have within the AtomTable a set of interrelated knowledge items that you know are going to be static for a while, and you want to be able to query them probabilistically, then building a Bayes Net of some sort (i.e. "freezing" part of CogPrime's knowledge network and mapping it into a Bayes Net) may be useful. That is, one way to accelerate some PLN inference would be:
1. Freeze a subnetwork of the AtomTable which is expected not to change a lot in the near future.
2. Interpret this subnetwork as a loopy Bayes net, and use standard Bayesian belief propagation to calculate probabilities based on it.

This would be a highly efficient form of "background inference" in certain contexts. (Note that this ideally requires an "indefinite Bayes net" implementation that propagates indefinite probabilities through the standard Bayes-net local belief propagation algorithms, but this is not mathematically problematic.)

On the other hand, if you have a very important subset of the Atomspace, then it may be worthwhile to maintain a Bayes net modeling the conditional probabilities between these Atoms, but with a dynamically updated structure.
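To make step 2 concrete, here is a minimal sum-product loopy belief propagation routine over binary variables. The dictionary-based representation of the frozen subnetwork is an illustrative stand-in for an actual AtomTable snapshot, and ordinary (not indefinite) probabilities are used.

```python
def loopy_bp(unaries, pairwise, iters=30):
    """Minimal sum-product (loopy) belief propagation over binary variables.

    unaries: {var: [p0, p1]} prior potentials.
    pairwise: {(u, v): 2x2 compatibility matrix}, one entry per edge.
    On a tree-structured subnetwork this is exact; on loopy graphs it
    yields the usual approximate marginals."""
    neighbors = {v: set() for v in unaries}
    for (u, v) in pairwise:
        neighbors[u].add(v)
        neighbors[v].add(u)
    # msgs[(u, v)][x_v]: what u currently tells v about v's states
    msgs = {}
    for (u, v) in pairwise:
        msgs[(u, v)] = [1.0, 1.0]
        msgs[(v, u)] = [1.0, 1.0]

    def phi(u, v, xu, xv):
        return pairwise[(u, v)][xu][xv] if (u, v) in pairwise else pairwise[(v, u)][xv][xu]

    for _ in range(iters):
        new = {}
        for (u, v) in msgs:  # synchronous update from the old messages
            out = [0.0, 0.0]
            for xv in (0, 1):
                for xu in (0, 1):
                    prod = unaries[u][xu]
                    for w in neighbors[u]:
                        if w != v:
                            prod *= msgs[(w, u)][xu]
                    out[xv] += prod * phi(u, v, xu, xv)
            z = out[0] + out[1]
            new[(u, v)] = [out[0] / z, out[1] / z]
        msgs = new

    beliefs = {}
    for v in unaries:
        b = [unaries[v][x] for x in (0, 1)]
        for w in neighbors[v]:
            b = [b[x] * msgs[(w, v)][x] for x in (0, 1)]
        z = b[0] + b[1]
        beliefs[v] = [b[0] / z, b[1] / z]
    return beliefs
```

On a two-node "chain" the result matches exact marginalization, which is the kind of sanity check one would want before trusting the approximation on a loopy frozen subnetwork.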
Chapter 37
Pattern Mining
Co-authored with Jade O'Neill

37.1 Introduction

Having discussed inference in depth, we now turn to other, simpler but equally important approaches to creating declarative knowledge. This chapter deals with pattern mining - the creation of declarative knowledge representing patterns among other knowledge (which may be declarative, sensory, episodic, procedural, etc.) - and the following chapter deals with speculative concept creation.

Within the scope of pattern mining, we will discuss two basic approaches:

• supervised learning: given a predicate, finding a pattern among the entities that satisfy that predicate;
• unsupervised learning: undirected search for "interesting patterns".

The supervised learning case is easier, and we have done a number of experiments using MOSES for supervised pattern mining, on biological (microarray gene expression and SNP) and textual data. In the CogPrime case, the "positive examples" are the elements of the SatisfyingSet of the predicate P, and the "negative examples" are everything else. This can be a relatively straightforward problem if there are enough positive examples and they actually share common aspects ... but some trickiness emerges, of course, when the common aspects are, in each example, complexly intertwined with other aspects.

The unsupervised learning case is considerably trickier. The main issue here regards the definition of an appropriate fitness function. We are searching for "interesting patterns." So the question is: what constitutes an interesting pattern?

We will also discuss two basic algorithmic approaches:

• program learning, via MOSES or hillclimbing;
• frequent subgraph mining, using greedy algorithms.

The value of these various approaches is contingent on the environment and goal set being such that algorithms of this nature can actually recognize relevant patterns in the world and mind.
Fortunately, the everyday human world does appear to have the property of possessing multiple relevant patterns that are recognizable using varying levels of sophistication and effort. It has patterns that can be recognized via simple frequent pattern mining, and other patterns that are too subtle for this, and are better addressed by a search-based approach. In order for
an environment and goal set to be appropriate for the learning and teaching of a human-level AI, it should have the same property of possessing multiple relevant patterns recognizable using varying levels of subtlety.

37.2 Finding Interesting Patterns via Program Learning

As one important case of pattern mining, we now discuss the use of program learning to find "interesting" patterns in sets of Atoms. Clearly, "interestingness" is a multidimensional concept. One approach to defining it is empirical, based on observation of which predicates have and have not proved interesting to the system in the past (based on their long-term importance values, i.e. LTI). In this approach, one has a supervised categorization problem: learn a rule predicting whether a predicate will fall into the interesting category or the uninteresting category. Once one has learned this rule, expressed as a predicate itself, one can then use it as the fitness function for evolutionary learning.

There is also a simpler approach, which defines an objective notion of interestingness. This objective notion is a weighted sum of two factors:

• compactness;
• surprisingness of truth value.

Compactness is easy to understand: all else equal, a predicate embodied in a small Combo tree is better than a predicate embodied in a big one. There is some work hidden here in Combo tree reduction; ideally, one would like to find the smallest representation of a given Combo tree, but this is a computationally formidable problem, so one necessarily approaches it via heuristic algorithms.

Surprisingness of truth value is a slightly subtler concept. Given a Boolean predicate, one can envision two extreme ways of evaluating its truth value (represented by two different types of ProcedureEvaluator). One can use an IndependenceAssumingProcedureEvaluator, which deals with all AND and OR operators by assuming probabilistic independence.
Or, one can use an ordinary EffortBasedProcedureEvaluator, which uses dependency information wherever feasible to evaluate the truth values of AND and OR operators. These two approaches will normally give different truth values - but how different? The more different, the more surprising is the truth value of the predicate, and the more interesting the predicate may be.

In order to explore the power of this kind of approach in a simple context, we have tested pattern mining using MOSES on Boolean predicates as a data mining algorithm on a number of different datasets, including some interesting and successful work in the analysis of gene expression data, and some more experimental work analyzing sociological data from the National Longitudinal Survey of Youth (NLSY) (http://stats.bls.gov/nls/). A very simple illustrative result from the analysis of the NLSY data is the pattern:

OR
  (NOT(MothersAge(X)) AND NOT(FirstSexAge(X)))
  (Wealth(X) AND PIAT(X))

where the domain of X is individuals, meaning that:

• being the child of a young mother correlates with having sex at a younger age;
• being in a wealthier family correlates with better Math (PIAT) scores;
• the two sets previously described tend to be disjoint.

Of course, many data patterns are several times more complex than the simple illustrative pattern shown above. However, one of the strengths of the evolutionary learning approach to pattern mining is its ability to find simple patterns when they do exist, yet without (like some other mining methods) imposing any specific restrictions on the pattern format.

37.3 Pattern Mining via Frequent/Surprising Subgraph Mining

Probabilistic evolutionary learning is an extremely powerful approach to pattern mining, but may not always be realistic due to its high computational cost. A cheaper, though also weaker, alternative is to use frequent subgraph mining algorithms such as [HWP03, KK01], which may straightforwardly be adapted to hypergraphs such as the Atomspace.

Frequent subgraph mining is a port to the graph domain of the older, simpler idea of frequent itemset mining, which we now briefly review. There are a number of algorithms in the latter category; the classic is Apriori [AS94], and an alternative is Relim [Bor05], which is conceptually similar but seems to give better performance. The basic goal of frequent itemset mining is to discover frequent subsets ("itemsets") in a group of sets, whose members are all drawn from some base set of items. One knows that for a set of N items, there are 2^N - 1 possible subsets. The algorithm operates in several rounds. Round i heuristically computes frequent i-itemsets (i.e. frequent sets containing i items). A round has two steps: candidate generation and candidate counting. In the candidate generation step, the algorithm generates a set of candidate i-itemsets whose support - the percentage of sets in which the itemset appears - has not yet been computed.
In the candidate-counting step, the algorithm scans the database, counting the support of the candidate itemsets. After the scan, the algorithm discards candidates with support lower than the specified minimum (an algorithm parameter) and retains only the sufficiently frequent i-itemsets. The algorithm reduces the number of tested subsets by pruning a priori those candidate itemsets that cannot be frequent, based on the knowledge about infrequent itemsets obtained from previous rounds. So, for instance, if {A, B} is a frequent 2-itemset then {A, B, C} will be considered as a potential 3-itemset; conversely, if {A, B} is not a frequent itemset then {A, B, C}, as well as any other superset of {A, B}, will be discarded. Although the worst case of this sort of algorithm is exponential, practical executions are generally fast, depending essentially on the support limit.

Frequent subgraph mining follows the same pattern, but instead of a group of sets it deals with a group of graphs. There are many frequent subgraph mining algorithms in the literature, but the basic concept underlying nearly all of them is the same: first find small frequent subgraphs; then seek to find slightly larger frequent patterns encompassing these small ones; then seek to find slightly larger frequent patterns encompassing these, etc. This approach is much faster than something like MOSES, although management of the large number of subgraphs to be searched through can require subtle design and implementation of data structures.

If, instead of an ensemble of small graphs, one has a single large graph like the AtomSpace, one can follow the same approach, via randomly subsampling from the large graph to find the graphs forming the ensemble to be mined from; see [KK05] for a detailed treatment of this sort of
approach. The fact that the AtomSpace is a hypergraph rather than a graph doesn't fundamentally affect the matter, since a hypergraph may always be considered a graph via introduction of an additional node for each hyperedge (at the cost of a potentially great multiplication of the number of links).

Frequent subgraph mining algorithms appropriately deployed can find subgraphs which occur repeatedly in the Atomspace, including subgraphs containing Atom-valued variables. Each such subgraph may be represented as a PredicateNode, and frequent subgraph mining will find such PredicateNodes that have surprisingly high truth values when evaluated across the Atomspace. But unlike MOSES when applied as described above, such an algorithm will generally find such predicates only in a "greedy" way. For instance, a greedy subgraph mining algorithm would be unlikely to find

OR
  (NOT(MothersAge(X)) AND NOT(FirstSexAge(X)))
  (Wealth(X) AND PIAT(X))

as a surprising pattern in an AtomSpace, unless at least one (and preferably both) of

Wealth(X) AND PIAT(X)

and

NOT(MothersAge(X)) AND NOT(FirstSexAge(X))

were surprising patterns in that Atomspace on their own.

37.4 Fishgram

Fishgram is a simple example of an algorithm for finding patterns in an Atomspace, instantiating the general concepts presented in the previous section. It represents patterns as conjunctions (AndLink) of Links, which usually contain variables. It does a greedy search, so it can quickly find many patterns. In contrast, algorithms like MOSES are designed to find a small number of the best patterns.

Fishgram works by finding a set of objects that have links in common, so it will be most effective if the AtomSpace has a lot of raw data, with simple patterns. For example, it can be used on the perceptions from the virtual world. There are predicates for basic perceptions (e.g. what kind of object something is, objects being near each other, types of blocks,
and actions being performed by the user or the AI).

The details of the Fishgram code and design are not sufficiently general or scalable to serve as a robust, all-purpose pattern mining solution for CogPrime. However, Fishgram is nevertheless interesting as an existent, implemented and tested prototype of a greedy frequent/interesting subhypergraph mining system. A more scalable analogous system, with a similar principle of operation, has been outlined and is in the process of being designed at time of writing, but will not be presented here.

37.4.1 Example Patterns

Here is some example output from Fishgram, when run on the virtual agent's memories.
(AndLink
  (EvaluationLink is_edible:PredicateNode (ListLink $1000041))
  (InheritanceLink $1000041 Battery:ConceptNode))

This means a battery which can be "eaten" by the virtual robot. The variable $1000041 refers to the object (the battery). Fishgram can also find patterns containing a sequence of events. In this case, there is a list of EvaluationLinks or InheritanceLinks which describe the objects involved, followed by the sequence of events.

(AndLink
  (InheritanceLink $1007703 Battery:ConceptNode)
  (SequentialAndLink
    (EvaluationLink isHolding:PredicateNode
      (ListLink $1008725 $1007703))))

This means the agent was holding a battery. $1007703 is the battery, and there is also a variable for the agent itself. Many interesting patterns involve more than one object. This pattern would also include the user (or another AI) holding a battery, because the pattern does not refer to the AI character specifically. It can find patterns where it performs an action and achieves a goal.

There is code to create implications based on these conjunctions. After finding many conjunctions, it can produce ImplicationLinks based on some of them. Here is an example where the AI-controlled virtual robot discovers how to get energy.

(ImplicationLink
  (AndLink
    (EvaluationLink is_edible:PredicateNode (ListLink $1011619))
    (InheritanceLink $1011619 Battery:ConceptNode))
  (PredictiveImplicationLink
    (EvaluationLink actionDone:PredicateNode
      (ListLink (ExecutionLink eat:GroundedSchemaNode (ListLink $1011619))))
    (EvaluationLink increased:PredicateNode
      (ListLink (EvaluationLink EnergyDemandGoal:PredicateNode)))))

37.4.2 The Fishgram Algorithm

The core Fishgram algorithm, in pseudocode, is as follows:
initial layer = every pair (relation, binding)
while previous layer is not empty:
    foreach (conjunction, binding) in previous layer:
        let incoming = all (relation, binding) pairs containing an object in the conjunction
        let possible_next_events = all (event, binding) pairs where the event happens
            during or shortly after the last event in conjunction
        foreach (relation, relation_binding) in incoming and possible_next_events:
            (new_relation, new_conjunction_binding) =
                map_to_existing_variables(conjunction, binding, relation, relation_binding)
            if new_relation is already in conjunction, skip it
            new_conjunction = conjunction + new_relation
            if new_conjunction has been found already, skip it
            otherwise, add new_conjunction to the current layer

map_to_existing_variables(conjunction, conjunction_binding, relation, relation_binding):
    r', s' = a copy of the relation and binding using new variables
    foreach variable v, object o in relation_binding:
        foreach variable v2, object o2 in conjunction_binding:
            if o == o2:
                change r' and s' to use v2 instead of v

37.4.3 Preprocessing

There are several preprocessing steps to make it easier for the main Fishgram search to find patterns.

There is a list of things that have to be variables. For example, any predicate that refers to an object (including agents) will be given a variable so it can refer to any object. Other predicates or InheritanceLinks can be added to a pattern, to restrict it to specific kinds of objects, as shown above. So there is a step which goes through all of the links in the AtomSpace and records a list of predicates with variables, such as "X is red" or "X eats Y". This makes the search part simpler, because it never has to decide whether something should be a variable or a specific object.

There is also a filter system, so that things which seem irrelevant can be excluded from the search. There is a combinatorial explosion as patterns become larger.
Some predicates may be redundant with each other, or known not to be very useful. It can also try to find only patterns in the AI's "attentional focus", which is much smaller than the whole AtomSpace.

The Fishgram algorithm cannot currently handle patterns involving numbers, although it could be extended to do so. There are two options: either have a separate discretization step, creating predicates for different ranges of a value, or have predicates for mathematical operators. It would be possible to search for a "split point" as in decision trees.
So a number would be chosen, and only things above that value (or only things below that value) would count for a pattern. It would also be possible to have multiple numbers in a pattern, and compare them in various ways. It is uncertain how practical this would be in Fishgram. MOSES is good for finding numeric patterns, so it may be better to simply use those patterns inside Fishgram.

The "increased" predicate is added by a preprocessing step. The goals have a fuzzy TruthValue representing how well the goal is achieved at any point in time, so e.g. the EnergyDemandGoal represents how much energy the virtual robot has at some point in time. The predicate records times at which a goal's TruthValue increased. This only happens immediately after doing something to increase it, which helps avoid finding spurious patterns.

37.4.4 Search Process

Fishgram search is breadth-first. It starts with all predicates (or InheritanceLinks) found by the preprocessing step. Then it finds pairs of predicates involving the same variable. Then they are extended to conjunctions of three predicates, and so on. Many relations apply at a specific time, for example the agent being near an object, or an action being performed. These are included in a sequence, and are added in the order they occurred.

Fishgram remembers the examples for each pattern. If there is only one variable in the pattern, an example is a single object; otherwise each example is a vector of objects, one for each variable in the pattern. Each time a relation is added to a pattern, if it has no new variables, some of the examples may be removed, because they don't satisfy the new predicate. The new relation needs to have at least one variable in common with the previous relations; otherwise the patterns would combine many unrelated things. In frequent itemset mining (for example the Apriori algorithm), there is effectively one variable, and adding a new predicate will often decrease the number of items that match.
It can never increase it. The number of possible conjunctions increases with the length, up to some point, after which it decreases. But when mining for patterns with multiple objects there is a much larger combinatorial explosion of patterns.

Various criteria can be used to prune the search. The most basic criterion is the frequency: only patterns with at least N examples will be included, where N is an arbitrary constant. You can also set a maximum number of patterns allowed for each length (number of relations), and only include the best ones. The next level of the breadth-first search will only search for extensions of those patterns.

One can also use a measure of statistical interestingness, to make sure the relations in a pattern are correlated with each other. There are many spurious frequent patterns, because anything which is frequent will occur together with other things, whether they are relevant or not. For example "breathing while typing" is a frequent pattern, because people breathe at all times. But "moving your hands while typing" is a much more interesting pattern. As people only move their hands some of the time, a measure of correlation would prefer the second pattern. The best measure may be interaction information, which is a generalisation of mutual information that applies to patterns with more than two predicates. An early-stage AI would not have much knowledge of cause and effect, so it would rely on statistical measures to find useful patterns.
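Interaction information can be computed directly from co-occurrence samples of three Boolean predicates. The sketch below uses the entropy-based formula attributed to McGill; note that sign conventions differ across the literature (under this one, negative values indicate synergy, as in the XOR case).

```python
import math
from collections import Counter

def entropy(samples, idx):
    """Empirical joint entropy (in bits) of the variables at positions idx."""
    n = len(samples)
    counts = Counter(tuple(s[i] for i in idx) for s in samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def interaction_information(samples):
    """Interaction information of three variables (columns 0, 1, 2):

        I(X;Y;Z) = H(X)+H(Y)+H(Z) - H(X,Y) - H(X,Z) - H(Y,Z) + H(X,Y,Z)

    Zero for independent variables; negative when the three predicates
    carry information jointly that no pair carries on its own."""
    H = lambda idx: entropy(samples, idx)
    return (H([0]) + H([1]) + H([2])
            - H([0, 1]) - H([0, 2]) - H([1, 2])
            + H([0, 1, 2]))
```

For the "moving your hands while typing" example, one would feed in one sample per time step, with each column recording whether the corresponding predicate held at that moment.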
37.4.5 Comparison to Other Algorithms

Fishgram is more suitable for OpenCogPrime's purposes than existing graph mining algorithms, most of which were designed with molecular datasets in mind. The OpenCog AtomSpace is a different kind of graph in various ways. For one, there are many possible relations between nodes (much as in a semantic network). Many relations involve more than two objects, and there are also property predicates about a single object. So the relations are effectively directed links of varying arity. It also has events, and many states can change over time (e.g. an egg changes state while it's cooking). Fishgram is designed for general knowledge in an embodied agent.

There are other major differences. Fishgram uses a breadth-first search, rather than depth-first search like most graph mining algorithms. And it does an "embedding-based" search, searching for patterns that can be embedded multiple times in a large graph. Molecular datasets have many separate graphs for separate molecules, but the embodied perceptions are closer to a single, fairly well-connected graph. Depth-first search would be very slow on such a graph, as there are many very long paths through the graph, and the search would mostly find those; whereas the useful patterns tend to be compact and repeated many times.

Lastly, the design of Fishgram makes it easy to experiment with multiple different scoring functions, from simple ones like frequency to much more sophisticated functions such as interaction information.

As mentioned above, the current implementation of Fishgram is not sufficiently scalable to be utilized for general-purpose Atomspaces. The underlying data structure within Fishgram, used to store recognized patterns, would need to be replaced, which would lead to various other modifications within the algorithm. But the general principle and approach illustrated by Fishgram will persist in any more scalable reimplementation.
Chapter 38
Speculative Concept Formation

38.1 Introduction

One of the hallmarks of general intelligence is its capability to deal with novelty in its environment and/or goal-set. And dealing with novelty intrinsically requires creating novelty: it's impossible to efficiently handle new situations without creating new ideas appropriately. Thus, in any environment complex and dynamic enough to support human-like general intelligence (or any other kind of highly powerful general intelligence), the creation of novel ideas will be paramount.

New idea creation takes place in OpenCog via a variety of methods - e.g. inside MOSES, which creates new program trees; PLN, which creates new logical relationships; ECAN, which creates new associative relationships; etc. But there is also a role for explicit, purposeful creation of new Atoms representing new concepts, outside the scope of these other learning mechanisms.

The human brain gets by, in adulthood, without creating that many new neurons - although neurogenesis does occur on an ongoing basis. But this is achieved only via great redundancy, because for the brain it's cheaper to maintain a large number of neurons in memory at the same time than to create and delete neurons. Things are different in a digital computer: memory is more expensive, but creation and deletion of objects is cheaper. Thus in CogPrime, forgetting and creation of Atoms is a regularly occurring phenomenon.

In this chapter we discuss a key class of mechanisms for Atom creation, "speculative concept formation." Further methods will be discussed in following chapters. The philosophy underlying CogPrime's speculative concept formation is that new things should be created from pieces of good old things (a form of "evolution", broadly construed), and that probabilistic extrapolation from experience should be used to guide the creation of new things (inference).
It's clear that these principles are necessary for the creation of new mental forms, but it's not obvious that they're sufficient: this is a nontrivial hypothesis, which may also be considered a family of hypotheses, since there are many different ways to do extrapolation and intercombination.

In the context of mind-world correspondence, the implicit assumption underlying this sort of mechanism is that the relevant patterns in the world can often be combined to form other relevant patterns. The everyday human world does quite markedly display this kind of combinatory structure, and such a property seems basic enough that it's appropriate for use as an assumption underlying the design of cognitive mechanisms.

In CogPrime we have introduced a variety of heuristics for creating new Atoms - especially ConceptNodes - which may then be reasoned on and subjected to implicit (via attention allocation) and explicit (via the application of evolutionary learning to predicates obtained from concepts via "concept predicatization") evolution. Among these are the node logical operators described in the book Probabilistic Logic Networks, which allow the creation of new concepts via AND, OR, XOR and so forth. However, logical heuristics alone are not sufficient. In this chapter we will review some of the nonlogical heuristics that are used for speculative concept formation. These operations play an important role in creativity - to use cognitive-psychology language, they are one of the ways that CogPrime implements the process of blending, which Fauconnier and Turner (2003) have argued is key to human creativity on many different levels. Each of these operations may be considered as implicitly associated with a hypothesis that, in fact, the everyday human world tends to assign utility to patterns that are combinations of other patterns produced via said operation.

An evolutionary perspective may also be useful here, on a technical level as well as philosophically. As noted in The Hidden Pattern and hinted at in Chapter 3 of Part 1, one way to think about an AGI system like CogPrime is as a huge evolving ecology. The AtomSpace is a biosphere of sorts, and the mapping from Atom types into species has some validity to it (though not complete accuracy: Atom types do not compete with each other, but they do reproduce with each other; and according to most of the reproduction methods in use, Atoms of differing type cannot cross-reproduce). Fitness is defined by importance. Reproduction is defined by various operators that produce new Atoms from old, including the ones discussed in this chapter, as well as other operators such as inference and explicit evolutionary operators.

New ConceptNode creation may be triggered by a variety of circumstances.
If two ConceptNodes are created for different purposes, but later the system finds that most of their meanings overlap, then it may be more efficient to merge the two into one. On the other hand, a node may become overloaded with different usages, in which case it is more useful to split it into multiple nodes, each with a more consistent content. Finally, there may be patterns across large numbers of nodes that merit encapsulation in individual nodes. For instance, if there are 1000 fairly similar ConceptNodes, it may be better not to merge them all together, but rather to create a single node to which they all link, reifying the category that they collectively embody.

In the following sections, we will begin by describing operations that create new ConceptNodes from existing ones on a local basis: by mutating individual ConceptNodes or combining pairs of ConceptNodes. Some of these operations are inspired by evolutionary operators used in the GA; others are based on the cognitive psychology concept of "blending." Then we will turn to the use of clustering and formal concept analysis algorithms inside CogPrime to refine the system's knowledge about existing concepts, and create new concepts.

38.2 Evolutionary Concept Formation

A simple and useful way to combine ConceptNodes is to use GA-inspired evolutionary operators: crossover and mutation. In mutation, one replaces some number of a Node's links with other links in the system. In crossover, one takes two nodes and creates a new node containing some links from one and some links from the other. More concretely, to cross over two ConceptNodes X and Y, one may proceed as follows (in short, clustering the union of X and Y):

• Create a series of empty nodes Z1, Z2, ..., Zk.
• Form a "link pool" consisting of all of X's links and all of Y's links, and then divide this pool into clusters (clustering algorithms will be described below).
• For each cluster with significant cohesion, allocate the links in that cluster to one of the new nodes Z_i.

On the other hand, to mutate a ConceptNode, a number of different mutation processes are reasonable. For instance, one can:

• Cluster the links of a Node, and remove one or more of the clusters, creating a node with fewer links
• Cluster the links, remove one or more clusters, and then add new links that are similar to the links in the remaining clusters

The EvolutionaryConceptFormation MindAgent selects pairs of nodes from the system, where the probability of selecting a pair is determined by:

• the average importance of the pair
• the degree of similarity of the pair
• the degree of association of the pair

(Of course, other heuristics are possible too.) It then crosses over the pair, and mutates the result. Note that, unlike in some GA implementations, the parent node(s) are retained within the system; they are not replaced by the children. Regardless of how many offspring they generate by what methods, and regardless of their age, all Nodes compete and cooperate freely forever according to the fitness criterion defined by the importance updating function. The entire AtomSpace may be interpreted as a large evolutionary, ecological system, and the action of CogPrime dynamics, as a whole, is to create fit nodes.

A more advanced variant of the EvolutionaryConceptFormation MindAgent would adapt its mutation rate in a context-dependent way. But our intuition is that it is best to leave this kind of refinement for learned cognitive schemata, rather than to hard-wire it into a MindAgent.
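In outline, the link-pool crossover and cluster-based mutation just described might look as follows. This is a hypothetical sketch, not CogPrime code: `cluster_fn` and `cohesion_fn` are stand-ins for the clustering machinery described later in the chapter, and links are represented as plain values.

```python
import random

def crossover(links_x, links_y, cluster_fn, cohesion_fn, threshold=0.5):
    """Cross over two concepts, given as lists of their links: pool the
    links, cluster the pool, and let each sufficiently cohesive cluster
    become the link-set of a new node Z_i."""
    pool = list(links_x) + list(links_y)          # form the shared link pool
    children = []
    for cluster in cluster_fn(pool):
        if cluster and cohesion_fn(cluster) >= threshold:
            children.append(cluster)              # each cohesive cluster -> new node
    return children

def mutate(links, cluster_fn, n_drop=1):
    """Mutation variant: cluster a node's links and drop n_drop whole
    clusters, yielding a node with fewer links."""
    clusters = [c for c in cluster_fn(list(links)) if c]
    kept = random.sample(clusters, max(0, len(clusters) - n_drop))
    return [link for cluster in kept for link in cluster]
```

The second mutation variant in the text would additionally synthesize new links resembling those in the kept clusters, which is omitted here.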
To encourage the formation of such schemata, one may introduce elementary schema functions that embody the basic node-level evolutionary operators:

    ConceptNode ConceptCrossover(ConceptNode A, ConceptNode B)
    ConceptNode mutate(ConceptNode A, mutationAmount m)

There will also be a role for more abstract schemata that utilize these. An example cognitive schema of this sort would be one that said: "When all my schemata in a certain context seem unable to achieve their goals, then maybe I need new concepts in this context, so I should increase the rate of concept mutation and crossover, hoping to trigger some useful concept formation."

As noted above, this component of CogPrime views the whole AtomSpace as a kind of genetic algorithm - but the fitness function is "ecological" rather than fixed, and of course the crossover and mutation operators are highly specialized. Most of the concepts produced through evolutionary operations are going to be useless nonsense, but these will be recognized by the importance updating process and subsequently forgotten from the system. The useful ones will link into other concepts and become ongoing aspects of the system's mind. The importance updating process amounts to fitness evaluation, and it depends implicitly on the sum total of the cognitive processes going on in CogPrime.
To ensure that importance updating properly functions as fitness evaluation, it is critical that evolutionarily-created concepts (and other speculatively created Atoms) always comprise a small percentage of the total concepts in the system. This guarantees that importance will serve as a meaningful "fitness function" for newly created ConceptNodes. The reason for this is that the importance measures how useful the newly created node is, in the context of the previously existing Atoms. If there are too many speculative, possibly useless new ConceptNodes in the system at once, the importance becomes an extremely noisy fitness measure, as it's largely measuring the degree to which instances of new nonsense fit in with other instances of new nonsense. One may find interesting self-organizing phenomena in this way, but in an AGI context we are not interested in undirected spontaneous pattern-formation, but rather in harnessing self-organizing phenomena toward system goals. And the latter is achieved by having a modest but not overwhelming number of speculative new nodes entering the system.

Finally, as discussed earlier, evolutionary operations on maps may occur naturally and automatically as a consequence of other cognitive operations. Maps are continually mutated due to fluctuations in system dynamics; and maps may combine with other maps with which they overlap, as a consequence of the nonlinear properties of activation spreading and importance updating. Map-level evolutionary operations are not closely tied to their Atom-level counterparts (a difference from e.g. the close correspondence between map-level logical operations and underlying Atom-level logical operations).

38.3 Conceptual Blending

The notion of Conceptual Blending (aka Conceptual Integration) was proposed by Gilles Fauconnier and Mark Turner [FT02] as a general theory of cognition.
According to this theory, the basic operation of creative thought is the "blend", in which elements and relationships from diverse scenarios are merged together in a judicious way. As a very simple example, we may consider the blend of "tower" and "snake" to form a new concept of "snake tower" (a tower that looks somewhat like a snake). However, most examples of blends will not be nearly so obvious. For instance, the complex numbers could be considered a blend between 2D points and real numbers. Figure 38.1 gives a conceptual illustration of the blending process.

The production of a blend is generally considered to have three key stages (elucidated via the example of building a snake-tower out of blocks):

• composition: combining judiciously chosen elements from two or more concept inputs
  - Example: Taking the "buildingness" and "verticalness" of a tower, and the "head" and "mouth" and "tail" of a snake
• completion: adding new elements from implicit background knowledge about the concept inputs
  - Example: Perhaps a mongoose-building will be built out of blocks, poised in a position indicating it is chasing the snake-tower (incorporating the background knowledge that mongooses often chase snakes)
• elaboration: fine-tuning, which shapes the elements into a new concept, guided by the desire to optimize certain criteria
Fig. 38.1: Conceptual Illustration of Conceptual Blending

  - Example: The tail of the snake-tower is a part of the building that rests on the ground, and connects to the main tower. The head of the snake-tower is a portion that sits atop the main tower, analogous to the restaurant atop the Space Needle.

The "judiciousness" in the composition phase may be partially captured in CogPrime via PLN inference, via introducing a "consistency criterion": the elements chosen as part of the blend should not dramatically decrease in confidence after the blend's relationships are submitted to PLN inference. One especially doesn't want to choose mutually contradictory elements from the two inputs. For instance one doesn't want to choose "alive" as an element of "snake", and "non-living" as an element of "building." This kind of contradictory choice can be ruled out by inference, because after very few inference steps, this choice would lead to a drastic confidence reduction for the InheritanceLinks to both "alive" and "non-living."

Aside from consistency, some other criteria considered relevant to evaluating the quality of a blend are:

• topology principle: relations in the blend should match the relations of their counterparts in other concepts related to the concept inputs
• web principle: the representation in the blended space should maintain mappings to the concept inputs
• unpacking principle: given a blended concept, the interpreter should be able to infer things about other related concepts
• good reason principle: there should be simple explanations for the elements of the blend
• metonymic tightening: when metonymically related elements are projected into the blended space, there is pressure to compress the "distance" between them.

While vague-sounding in their verbal formulations, these criteria have been computationally implemented in the Sapper system, which uses blending theory to model analogy and metaphor [VlC91, VO07]; and in a different form in [Rit06]'s framework for computational creativity. In CogPrime terms, these various criteria essentially boil down to: the new, blended concept should get a lot of interesting links.

One could implement blending in CogPrime very straightforwardly via an evolutionary approach: search the space of possible blends, evaluating each one according to its consistency but also the STI that it achieves when released into the Atomspace. However, this would be quite computationally expensive, so a wiser approach is to introduce heuristics aimed at increasing the odds of producing important blends.

A simple heuristic is to calculate, for each candidate blend, the amount of STI that the blend would possess N cycles later if, at the current time, it were given a certain amount of STI. A blend that would accumulate more STI in this manner may be considered more promising, because this means that its components are more richly interconnected. Further, this heuristic may be used as a guide for greedy heuristics for creating blends: e.g. if one has chosen a certain element A of the first blend input, then one may seek an element B of the second blend input that has a strong HebbianLink to A (if such a B exists). However, it may also be interesting to pursue different sorts of heuristics, using information-theoretic or other mathematical criteria to preliminarily filter possible blends before they are evaluated more carefully via integrated cognition and importance dynamics.
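The STI-projection heuristic just described can be illustrated with a toy spreading rule. This is a deliberately simplified sketch: `hebbian[a][b]` stands for a Hebbian link strength between atoms, and the decay/spread rule is a hypothetical placeholder for CogPrime's actual economic attention dynamics.

```python
def projected_sti(seed_sti, neighbors, hebbian, n_cycles=3, decay=0.5):
    """Estimate the STI a candidate blend would hold after n_cycles, by
    seeding it with seed_sti and iterating a toy spread-and-decay rule
    over its Hebbian neighborhood."""
    sti = {a: 0.0 for a in neighbors}
    sti['blend'] = seed_sti
    for _ in range(n_cycles):
        nxt = {a: decay * s for a, s in sti.items()}        # decay ("rent")
        for a, s in sti.items():
            for b, w in hebbian.get(a, {}).items():
                nxt[b] = nxt.get(b, 0.0) + (1 - decay) * s * w   # spreading
        sti = nxt
    return sti['blend']
```

A blend whose components are richly interconnected gets STI fed back to it by its neighbors, so it retains more of its seed STI after N cycles; a weakly connected blend leaks its STI away, matching the intuition in the text.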
38.3.1 Outline of a CogPrime Blending Algorithm

A rough outline of a concept blending algorithm for CogPrime is as follows:

1. Choose a pair of concepts C1 and C2, which have a nontrivially-strong HebbianLink between them, but not an extremely high-strength SimilarityLink between them (i.e. the concepts should have something to do with each other, but not be extremely similar; blends of extremely similar things are boring). These parameters may be twiddled.
2. Form a new concept C3, which has some of C1's links, and some of C2's links.
3. If C3 has obvious contradictions, resolve them by pruning links. (For instance, if C1 inherits from "alive" to degree .9 and C2 inherits from "alive" to degree .1, then one of these two TruthValue versions of the InheritanceLink to "alive" has got to be pruned...)
4. For each of C3's remaining links L, make a vector indicating everything it or its targets are associated with (via HebbianLinks or other links). This is basically a list of "what's related to L". Then, assess whether there are a lot of common associations between the links L that came from C1 and the links L that came from C2.
5. If the filter in step 4 is passed, then let the PLN forward chainer derive some conclusions about C3, and see if it comes up with anything interesting (e.g. anything with surprising truth value, or anything getting high STI, etc.)

Steps 1 and 2 should be repeated over and over. Step 5 is basically "cognition as usual" - i.e. by the time the blended concept is thrown into the Atomspace and subjected to Step 5, it's being treated the same as any other ConceptNode.
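Steps 2-5 of this outline can be condensed into a skeleton over plain link lists. Everything here is a hypothetical stand-in: `contradictory`, `association_overlap` and `interesting` are plug-in callables for the open design choices the text discusses, not real CogPrime APIs, and step 1 (choosing the input pair) is left to the caller.

```python
import random

def blend_concepts(c1_links, c2_links, contradictory, association_overlap,
                   interesting, frac=0.5, overlap_threshold=0.3):
    """Skeleton of steps 2-5 of the blending meta-algorithm above."""
    take = lambda ls: random.sample(ls, max(1, int(frac * len(ls))))
    c3 = take(c1_links) + take(c2_links)                 # step 2: mix links
    c3 = [l for l in c3                                  # step 3: prune clashes
          if not any(contradictory(l, m) for m in c3 if m is not l)]
    if association_overlap(c3) < overlap_threshold:      # step 4: filter
        return None
    return c3 if interesting(c3) else None               # step 5: PLN check
```

In the real system, step 5 would hand C3 to the PLN forward chainer; here `interesting` is a placeholder for whatever measure of "surprising truth value or high STI" is used.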
The above is more of a meta-algorithm than a precise algorithm. Many avenues for variation exist, including:

• Step 1: heuristics for choosing what to try to blend
• Step 3: how far do we go here, at removing contradictions? Do we try simple PLN inference to see if contradictions are unveiled, or do we just limit the contradiction-check to seeing if the same exact link is given different truth-values?
• Step 4: there are many different ways to build this association-vector. There are also many ways to measure whether a set of association-vectors demonstrates "common associations". Interaction information [Bel03] is one fancy way; there are also simpler ones.
• Step 5: there are various ways to measure whether PLN has come up with anything interesting

38.3.2 Another Example of Blending

To illustrate these ideas further, consider the example of the SUV - a blend of "Car" and "Jeep". Among the relevant properties of Car are:

• appealing to ordinary consumers
• fuel efficient
• fits in most parking spots
• easy to drive
• 2 wheel drive

Among the relevant properties of Jeep are:

• 4 wheel drive
• rugged
• capable of driving off road
• high clearance
• open or soft top

Obviously, if we want to blend Car and Jeep, we need to choose properties of each that don't contradict each other. We can't give the Car/Jeep both 2 wheel drive and 4 wheel drive. 4 wheel drive wins for Car/Jeep because sacrificing it would get rid of "capable of driving off road", which is critical to Jeep-ness; whereas sacrificing 2WD doesn't kill anything that's really critical to car-ness. On the other hand, having a soft top would really harm "appealing to consumers", which from the view of car-makers is a big part of being a successful car. But getting rid of the soft top doesn't really harm other aspects of jeep-ness in any serious way.
However, what really made the SUV successful was that "rugged" and "high clearance" turned out to make SUVs look funky to consumers, thus fulfilling the "appealing to ordinary consumers" feature of Car. In other words, the presence of the links

• looks funky → appealing to ordinary consumers
• rugged & high clearance → looks funky
made a big difference. This is the sort of thing that gets figured out once one starts doing PLN inference on the links associated with a candidate blend. However, if one views each feature of the blend as a probability distribution over concept space - for instance indicating how closely associated each concept is with that feature (e.g. via HebbianLinks) - then we see that the mutual information (and more generally interaction information) between the features of the blend is a quick estimate of how likely it is that inference will lead to interesting conclusions via reasoning about the combination of features that the blend possesses.

38.4 Clustering

Next, a different method for creating new ConceptNodes in CogPrime is using clustering algorithms. There are many different clustering algorithms in the statistics and data mining literature, and no doubt many of them could have value inside CogPrime. We have experimented with several different clustering algorithms in the CogPrime context, and have selected one, which we call Omniclust [GCPM06], based on its generally robust performance on high-volume, noisy data. However, other methods such as EM (Expectation-Maximization) clustering [WF05] would likely serve the purpose very well also.

In the above discussion on evolutionary concept creation, we mentioned the use of a clustering algorithm to cluster links. The same algorithm we describe here for clustering ConceptNodes directly and creating new ConceptNodes representing these clusters, can also be used for clustering links in the context of node mutation and crossover.

The application of Omniclust or any other clustering algorithm for ConceptNode creation in CogPrime is simple. The clustering algorithm is run periodically, and the most significant clusters that it finds are embodied as ConceptNodes, with InheritanceLinks to their members.
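The cluster-to-ConceptNode step just described can be sketched as follows. This is a minimal illustration, not the Omniclust implementation: `cluster_fn` stands for any clustering routine, `significance` for any cluster-scoring measure, and the (member, concept) pairs stand in for InheritanceLinks.

```python
def reify_clusters(atoms, cluster_fn, significance, min_sig=0.5):
    """Turn each sufficiently significant cluster of atoms into a new
    concept, with links from its members standing in for InheritanceLinks."""
    concepts, links = [], []
    for i, cluster in enumerate(cluster_fn(atoms)):
        if significance(cluster) < min_sig:
            continue                                  # skip weak clusters
        name = f"cluster_{i}"
        concepts.append(name)                         # new ConceptNode
        links.extend((member, name) for member in cluster)
    return concepts, links
```

In the full scheme, the same routine would be applied recursively to subclusters, with additional inheritance links between clusters and subclusters.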
If these significant clusters have subclusters also identified by Omniclust, then these subclusters are also made into ConceptNodes, etc., with InheritanceLinks between clusters and subclusters.

Clustering technology is famously unreliable, but this unreliability may be mitigated somewhat by using clusters as initial guesses at concepts, and using other methods to refine the clusters into more useful concepts. For instance, a cluster may be interpreted as a disjunctive predicate, and a search may be made to determine sub-disjunctions about which interesting PLN conclusions may be drawn.

38.5 Concept Formation via Formal Concept Analysis

Another approach to concept formation is an uncertain version of Formal Concept Analysis [GSW05]. There are many ways to create such a version; here we describe one approach we have found interesting, called Fuzzy Concept Formation (FCF).

The general formulation of FCF begins with n objects O_1, ..., O_n, m basic attributes a_1, ..., a_m, and information that object O_i possesses attribute a_j to degree w_ij ∈ [0,1]. In CogPrime, the objects and attributes are Atoms, and w_ij is the strength of the InheritanceLink pointing from O_i to a_j.
In this context, we may define a concept as a fuzzy set of objects, and a derived attribute as a fuzzy set of attributes. Fuzzy concept formation (FCF) is, then, a process that produces N "concepts" C_{n+1}, ..., C_{n+N} and M "derived attributes" d_{m+1}, ..., d_{m+M}, based on the initial set of objects and attributes. We can extend the weight matrix w_ij to include entries involving concepts and derived attributes as well, so that e.g. w_{n+3,m+5} indicates the degree to which concept C_{n+3} possesses derived attribute d_{m+5}.

The learning engine underlying FCF is a clustering algorithm clust(X_1, ..., X_r; b), which takes in r vectors X_i ∈ [0,1]^n and outputs b or fewer clusters of these vectors. The overall FCF process is independent of the particular clustering algorithm involved, though the interestingness of the concepts and attributes formed will of course vary widely based on the specific clustering algorithm. Some clustering algorithms will work better with large values of b, others with smaller values of b.

We then define the process form_concepts(b) to operate as follows. Given a set S = {S_1, ..., S_h} containing objects, concepts, or a combination of objects and concepts, with an attribute vector w_i with entries in [0,1] corresponding to each S_i, one applies clust to find b clusters of attribute vectors B_1, ..., B_b. Each of these clusters may be considered as a fuzzy set, for instance by considering the membership of x in cluster B to be 2^(-d(x, centroid(B))) for an appropriate metric d. These fuzzy sets are the b concepts produced by form_concepts(b).

38.5.1 Calculating Membership Degrees of New Concepts

The degree to which a concept defined in this way possesses an attribute may be defined in a number of ways; perhaps the simplest is: weighted-summing the degree to which the members of the concept possess the attribute.
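This weighted-sum definition can be written directly as a small function. A minimal illustration, assuming the fuzzy membership functions χ are supplied as plain callables:

```python
def concept_attribute_degree(members, chi_concept, chi_attr):
    """Degree to which a fuzzy concept possesses an attribute: the
    membership-weighted average of the members' attribute degrees."""
    num = sum(chi_concept(w) * chi_attr(w) for w in members)
    den = sum(chi_concept(w) for w in members)
    return num / den if den > 0 else 0.0
```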
For instance, to figure out the degree to which beautiful women (a concept) are insane (an attribute), one would calculate

    ( Σ_{w ∈ beautiful_women} χ_beautiful_women(w) · χ_insane(w) ) / ( Σ_{w ∈ beautiful_women} χ_beautiful_women(w) )

where χ_X(w) denotes the fuzzy membership degree of w in X. One could probably also consider this as the strength of

    ExtensionalInheritance beautiful_women insane

38.5.2 Forming New Attributes

One may define an analogous process form_attributes(b) that begins with a set A = {A_1, ..., A_k} containing (basic and/or derived) attributes, and a column vector v_i
of length h with entries in [0,1] corresponding to each A_i (the column vector tells the degrees to which the various objects possess the attribute A_i). One applies clust to find b clusters of the vectors v_i: B_1, ..., B_b. These clusters may be interpreted as fuzzy sets, which are derived attributes.

38.5.2.1 Calculating Membership Degrees of New, Derived Attributes

One must then define the degree to which an object or concept possesses a derived attribute. One way to do this is using a geometric mean. For instance, suppose there is a derived attribute formed by combining the attributes vain, selfish and egocentric. Then, the degree to which the concept banker possesses this new derived attribute could be defined by

    ( Σ_{b ∈ banker} χ_banker(b) · (χ_vain(b) · χ_selfish(b) · χ_egocentric(b))^(1/3) ) / ( Σ_{b ∈ banker} χ_banker(b) )

38.5.3 Iterating the Fuzzy Concept Formation Process

Given a set S of concepts and/or objects with a set A of attributes, one may define:

• append_concepts(S', S) as the result of adding the concepts in the set S' to S, and evaluating all the attributes in A on these concepts, to get an expanded matrix w
• append_attributes(A', A) as the result of adding the attributes in the set A' to A, and evaluating all the attributes in A' on the concepts and objects in S, to get an expanded matrix w
• collapse(S, A) as the result of taking (S, A) and eliminating any concept or attribute that has distance less than ε from some other concept or attribute that comes before it in the lexicographic ordering of concepts or attributes. I.e., collapse removes near-duplicate concepts or attributes.

Now, one may begin with a set S of objects and attributes, and iteratively run a process such as

    b = r^c    // e.g. r=2, or r=1.5
    while (b > 1) {
        S = append_concepts(S, form_concepts(S, b))
        S = collapse(S)
        S = append_attributes(S, form_attributes(S, b))
        S = collapse(S)
        b = b / r
    }

with c corresponding to the number of iterations.
This will terminate in finite time with a finitely expanded matrix w containing a number of concepts and derived attributes in addition to the original objects and basic attributes. Or, one may look at

    while (S is different from old_S) {
        old_S = S
        S = append_concepts(S, form_concepts(S, b))
        S = collapse(S)
        S = append_attributes(S, form_attributes(S, b))
        S = collapse(S)
    }

This second version raises the mathematical question of the speed with which it will terminate (as a function of ε). I.e., when does the concept and attribute formation process converge, and how fast? This will surely depend on the clustering algorithm involved.
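The first (b = r^c) loop is straightforward to render as runnable code once the four operations are given. The plug-ins below are toy stand-ins for illustration only, not the real FCF machinery; `collapse` here is simple deduplication rather than ε-distance merging.

```python
def fcf(S, form_concepts, form_attributes, collapse, r=2.0, c=4):
    """First FCF loop above: start with b = r**c clusters and divide b
    by r each pass, so the loop runs for exactly c iterations."""
    b = r ** c
    while b > 1:
        S = collapse(S + form_concepts(S, int(b)))    # add + dedupe concepts
        S = collapse(S + form_attributes(S, int(b)))  # add + dedupe attributes
        b = b / r
    return S
```

A usage sketch with trivial plug-ins: `form_concepts` and `form_attributes` each emit one new labeled element per pass, and `collapse` removes duplicates while preserving order.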
Section VI
Integrative Learning
Chapter 39
Dimensional Embedding

39.1 Introduction

Among the many key features of the human brain omitted by typical formal neural network models, one of the foremost is the brain's three-dimensionality. The brain is not just a network of neurons arranged as an abstract graph; it's a network of neurons arranged in three-dimensional space, making use of this three-dimensionality directly and indirectly, in various ways and for various purposes. The somatosensory cortex contains a geometric map reflecting, approximately, the geometric structure of parts of the body. The visual cortex uses the 2D layout of cortical sheets to reflect the geometric structure of perceived space: motion detection neurons often fire in the actual physical direction of motion, etc. The degree to which the brain uses 2D and 3D geometric structure to reflect conceptual rather than perceptual or motoric knowledge is unclear, but we suspect it is considerable. One well-known idea in this direction is the "self-organizing map" or Kohonen net [Koh01], a highly effective computer science algorithm that performs automated classification and clustering via projecting higher-dimensional (perceptual, conceptual or motoric) vectors into a simulated 2D sheet of cortex.

It's not clear that the exploitation of low-dimensional geometric structure is something an AGI system necessarily must support - there are always many different approaches to any aspect of the AGI problem. However, the brain does make clear that exploitation of this sort of structure is a powerful way to integrate various useful heuristics. In the context of mind-world correspondence theory, there seems clear potential value in having a mind mirror the dimensional structure of the world, at some level of approximation. It's also worth emphasizing that the brain's 3D structure has minuses as well as plusses - one suspects it complexifies and constrains the brain, along with implicitly suggesting various useful heuristics.
Any mathematical graph can be represented in 3 dimensions without links crossing (unlike in 2 dimensions), but that doesn't mean the representation will always be efficient or convenient - sometimes it may result in conceptually related, and/or frequently interacting, entities being positioned far away from each other geometrically. Coupled with noisy signaling methods such as the brain uses, this occasional lack of alignment between conceptual/pragmatic and geometric structure can lead to various sorts of confusion (i.e. when neuron A sends a signal to a physically distant neuron B, this may cause various side-effects along the path, some of which wouldn't happen if A and B were close to each other).

In the context of CogPrime, the most extreme way to incorporate a brain-like 3D structure would be to actually embed an Atomspace in a bounded 3D region. Then the Atomspace would
be geometrically something like a brain, but with abstract nodes and links (some having explicit symbolic content) rather than purely subsymbolic neurons. This would not be a ridiculous thing to do, and could yield interesting results. However, we are unsure this would be an optimal approach. Instead we have opted for a more moderate approach: couple the non-dimensional Atomspace with a dimensional space, containing points corresponding to Atoms. That is, we perform an embedding of Atoms in the OpenCog AtomSpace into n-dimensional space - a judicious transformation of (hyper)graphs into vectors. This embedding has applications to PLN inference control, and to the guidance of instance generation in PEPL learning of Combo trees. It is also, in itself, a valuable and interesting heuristic for sculpting the link topology of a CogPrime AtomSpace. The basic dimensional embedding algorithm described here is fairly simple and not original to CogPrime, but it has not previously been applied in any similar context.

The intuition underlying this approach is that there are some cases (e.g. PLN control, and PEPL guidance) where dimensional geometry provides a useful heuristic for constraining a huge search space, via providing a compact way of storing a large amount of information. Dimensionally embedding Atoms lets CogPrime be dimensional like the brain when it needs to be, yet with the freedom of nondimensionality the rest of the time. This dual strategy is one that may be of value for AGI generally beyond the CogPrime design, and is somewhat related to (though different in detail from) the way the CLARION cognitive architecture [SZ04] maps declarative knowledge into knowledge appropriate for its neural net layer.

There is an obvious way to project CogPrime Atoms into n-dimensional space, by assigning each Atom a numerical vector based on the weights of its links.
But this is not a terribly useful approach, because the vectors obtained in this way will live, potentially, in a millions- or billions-dimensional space. The approach we describe here is a bit different. We are defining more specific embeddings, each one based on a particular link type or set of link types. And we are doing the embedding into a space whose dimensionality is high but not too high, e.g. n=50. This moderate-dimensional space could then be projected down into a lower-dimensional space, like a 3D space, if needed.

The philosophy underlying the ideas proposed here is similar to that underlying Principal Components Analysis (PCA) in statistics [Jol02]. The n-dimensional spaces we define here, like those used in PCA or LSI (Latent Semantic Indexing [LMDK07]), are defined by sets of orthogonal concepts extracted from the original space of concepts. The difference is that PCA and LSI work on spaces of entities defined by feature vectors, whereas the methods described here work for entities defined as nodes in weighted graphs. There is no precise notion of orthogonality for nodes in a weighted graph, but one can introduce a reasonable proxy.

39.2 Link Based Dimensional Embedding

In this section we define the type of dimensional embedding that we will be talking about. For concreteness we will speak in terms of CogPrime nodes and links, but the discussion applies much more generally than that.

A link based dimensional embedding is defined as a mapping that maps a set of CogPrime Atoms into points in an n-dimensional real space, by:

• mapping link strength into coordinate values in an embedding space, and
• representing nodes as points in this embedding space, using the coordinate values defined by the strengths of their links.

In the usual case, a dimensional embedding is formed from links of a single type, or from links whose types are very closely related (e.g. from all symmetrical logical links).

Mapping all the link strengths of the links of a given type into coordinate values in a dimensional space is a simple, but not a very effective, strategy. The approach described here is based on strategically choosing a subset of particular links and forming coordinate values from them. The choice of links is based on the desire for a correspondence between the metric structure of the embedding space, and the metric structure implicit in the weights of the links of the type being embedded. The basic idea of metric preservation is depicted in Figure 39.1.

Fig. 39.1: Metric-Preserving Dimensional Embedding. The basic idea of the sort of embedding described here is to map Atoms into numerical vectors, in such a way that, on average, distance between Atoms roughly correlates with distance between corresponding vectors. (The picture shows a 3D embedding space for convenience, but in reality the dimension of the embedding space will generally be much higher.)

More formally, let proj(A) denote the point in R^n corresponding to the Atom A. Then if, for example, we are doing an embedding based on SimilarityLinks, we want there to be a strong correlation (or rather anticorrelation) between:
    (SimilarityLink A B).tv = s

and

    d_E(proj(A), proj(B))

where d_E denotes the Euclidean distance on the embedding space. This is a simple case because SimilarityLink is symmetric. Dealing with asymmetric links like InheritanceLinks is a little subtler, and will be done below in the context of inference control.

Larger dimensions generally allow greater correlation, but add complexity. If one chooses the dimensionality equal to the number of nodes in the graph, there is really no point in doing the embedding. On the other hand, if one tries to project a huge and complex graph into 1 or 2 dimensions, one is bound to lose a lot of important structure. The optimally useful embedding will be into a space whose dimension is low but not too large. For internal CogPrime inference purposes, we should generally use a moderately high-dimensional embedding space, say n=50 or n=100.

39.3 Harel and Koren's Dimensional Embedding Algorithm

Our technique for embedding CogPrime Atoms into high-dimensional space is based on an algorithm suggested by David Harel and Yehuda Koren [HK02]. Their work is concerned with visualizing large graphs, and they propose a two-phase approach:

1. embed the graph into a high-dimensional real space
2. project the high-dimensional points into 2D or 3D space for visualization

In CogPrime, we don't always require the projection step (step 2); our focus is on the initial embedding step. Harel and Koren's algorithm for dimensional embedding (step 1) is directly applicable to the CogPrime context. Of course this is not the only embedding algorithm that would be reasonable to use in a CogPrime context; it's just one possibility that seems to make sense.

Their algorithm works as follows. Suppose one has a graph with symmetric weighted links. Further, assume that between any two nodes in the graph, there is a way to compute the weight that a link between those two nodes would have,
even if the graph in fact doesn't contain a link between the two nodes. In the CogPrime context, for instance, the nodes of the graph may be ConceptNodes, and the links may be SimilarityLinks. We will discuss the extension of the approach to deal with asymmetric links like InheritanceLinks, later on.

Let n denote the dimension of the embedding space (e.g. n = 50). We wish to map graph nodes into points in R^n, in such a way that the weight of the graph link between A and B correlates with the distance between proj(A) and proj(B) in R^n.

39.3.1 Step 1: Choosing Pivot Points

Choose n "pivot points" that are roughly uniformly distributed across the graph.
To do this, one chooses the first pivot point at random and then iteratively chooses the i'th point to be maximally distant from the previous (i-1) points chosen. One may also use additional criteria to govern the selection of pivot points. In CogPrime, for instance, we may use long-term stability as a secondary criterion for selecting Atoms to serve as pivot points. Greater computational efficiency is achieved if the pivot-point Atoms don't change frequently.

39.3.2 Step 2: Similarity Estimation

Estimate the similarity between each Atom being projected, and the n pivot Atoms. This is expensive. However, the cost is decreased somewhat in the CogPrime case by caching the similarity values produced in a special table (they may not be important enough otherwise to be preserved in CogPrime). Then, in cases where neither the pivot Atom nor the Atom being compared to it has changed recently, the cached value can be reused.

39.3.3 Step 3: Embedding

Create an n-dimensional space by assigning a coordinate axis to each pivot Atom. Then, for an Atom A, the i'th coordinate value is given by its similarity to the i'th pivot Atom. After this step, one has transformed one's graph into a collection of n-dimensional vectors.

39.4 Embedding Based Inference Control

One important application for dimensional embedding in CogPrime is to help with the control of

• Logical inference
• Direct evaluation of logical links

We describe how it can be used specifically to stop the CogPrime system from continually trying to make the same unproductive inferences. To understand the problem being addressed, suppose the system tries to evaluate the strength of the relationship

SimilarityLink foot toilet

Assume that no link exists in the system representing this relationship. Here "foot" and "toilet" are hypothetical ConceptNodes that represent aspects of the concepts of foot and toilet respectively.
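The pivot-selection and embedding steps above can be sketched as follows. This is a minimal Python illustration, not code from an actual CogPrime implementation; `dist` and `similarity` stand in for whatever graph distance and similarity estimates the system supplies:

```python
import random

def choose_pivots(nodes, n, dist):
    """Pick n pivots roughly uniformly spread over the graph: the first at
    random, then each new pivot maximizing its distance to those already
    chosen (a standard farthest-point heuristic)."""
    pivots = [random.choice(nodes)]
    while len(pivots) < n:
        # each candidate is scored by its distance to the nearest chosen pivot
        best = max(nodes, key=lambda a: min(dist(a, p) for p in pivots))
        pivots.append(best)
    return pivots

def embed(nodes, pivots, similarity):
    """Map each node to a vector whose i'th coordinate is the node's
    similarity to the i'th pivot Atom (Step 3)."""
    return {a: [similarity(a, p) for p in pivots] for a in nodes}
```

With n pivots, this turns the graph into a collection of n-dimensional vectors, as described in Step 3.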
In reality these concepts might well be represented by complex maps rather than individual nodes.
Suppose the system determines that the strength of this Link is very close to zero. Then (depending on a threshold in the MindAgent), it will probably not create a SimilarityLink between the "foot" and "toilet" nodes. Now, suppose that a few cycles later, the system again tries to evaluate the strength of the same Link,

SimilarityLink foot toilet

Again, very likely, it will find a low strength and not create the Link at all. The same problem may occur with InheritanceLinks, or any other (first or higher order) logical link type.

Why would the system try, over and over again, to evaluate the strength of the same nonexistent relationship? Because the control strategies of the current forward-chaining inference and pattern mining MindAgents are simple by design. These MindAgents work by selecting Atoms from the AtomTable with probability proportional to importance, and trying to build links between them. If the foot and toilet nodes are both important at the same time, then these MindAgents will try to build links between them - regardless of how many times they've tried to build links between these two nodes in the past and failed.

How do we solve this problem using dimensional embedding? Generally:

• one will need a different embedding space for each link type for which one wants to prevent repeated attempted inference of useless relationships. Sometimes, very closely related link types might share the same embedding space; this must be decided on a case-by-case basis.
• in the embedding space for a link type L, one only embeds Atoms of a type that can be related by links of type L.

It is too expensive to create a new embedding very often. Fortunately, when a new Atom is created or an old Atom is significantly modified, it's easy to reposition the Atom in the embedding space by computing its relationship to the pivot Atoms.
Once enough change has happened, however, new pivot Atoms will need to be recomputed, which is a substantial computational expense. We must update the pivot point set every N cycles, where N is large; or else, whenever the total amount of change in the system has exceeded a certain threshold.

Now, how is this embedding used for inference control? Let's consider the case of similarity first. Quite simply, one selects a pair of Atoms (A,B) for SimilarityMining (or inference of a SimilarityLink) based on some criterion such as, for instance:

importance(A) * importance(B) * simproj(A,B)

where

distproj(A,B) = d_E(proj(A), proj(B))
simproj(A,B) = 2^(-c * distproj(A,B))

and c is an important tunable parameter. What this means is that, if A and B are far apart in the SimilarityLink embedding space, the system is unlikely to try to assess their similarity.

There is a tremendous space efficiency to this approach, in that, where there are N Atoms and m pivot Atoms, N^2 similarity relationships are being approximately stored in m*N coordinate values. Furthermore, the cost of computation is m*N times the cost of assessing a single SimilarityLink. By accepting crude approximations of actual similarity values, one gets away with linear time and space cost.
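The selection criterion just given can be sketched as below (illustrative Python; the `importance` function and the embedding table `proj` are assumed to be supplied by the surrounding system):

```python
import math

def pair_priority(a, b, importance, proj, c=1.0):
    """Weight for selecting (a, b) for similarity mining:
    importance(a) * importance(b) * simproj(a, b), where
    simproj = 2^(-c * Euclidean distance in the embedding space)."""
    distproj = math.sqrt(sum((x - y) ** 2 for x, y in zip(proj[a], proj[b])))
    return importance(a) * importance(b) * 2.0 ** (-c * distproj)
```

Pairs far apart in the embedding space thus get an exponentially small chance of being examined, which is exactly the intended suppression of repeatedly attempted useless inferences.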
Because this is just an approximation technique, there are definitely going to be cases where A and B are not similar, even though they're close together in the embedding space. When such a case is found, it may be useful for the AtomSpace to explicitly contain a low-strength SimilarityLink between A and B. This link will prevent the system from making false embedding-based decisions to explore (SimilarityLink A B) in the future. Putting explicit low-strength SimilarityLinks in the system in these cases is obviously much cheaper than using them for all cases.

We've been talking about SimilarityLinks, but the approach is more broadly applicable. Any symmetric link type can be dealt with in about the same way. For instance, it might be useful to keep dimensional embedding maps for

• SimilarityLink
• ExtensionalSimilarityLink
• EquivalenceLink
• ExtensionalEquivalenceLink

On the other hand, dealing with asymmetric links in terms of dimensional embedding requires more subtlety - we turn to this topic below.

39.5 Dimensional Embedding and InheritanceLinks

Next, how can we use dimensional embedding to keep an approximate record of which links do not inherit from each other? Because inheritance is an asymmetric relationship, whereas distance in embedding spaces is a symmetric relationship, there's no direct and simple way to do so. However, there is an indirect approach that solves the problem, which involves maintaining two embedding spaces, and combining information about them in an appropriate way. In this section, we'll discuss an approach that should work for InheritanceLink, SubsetLink, ImplicationLink, ExtensionalImplicationLink and other related link types; but we'll explicitly present it only for the InheritanceLink case.

Although the embedding algorithm described above was intended for symmetric weighted graphs, in fact we can use it for asymmetric links in just about the same way.
The use of the embedding graph for inference control differs, but not the basic method of defining the embedding. In the InheritanceLink case, we can define pivot Atoms in the same way, and then we can define two vectors for each Atom A:

proj_parent(A)_i = (InheritanceLink A A_i).tv.s
proj_child(A)_i = (InheritanceLink A_i A).tv.s

where A_i is the i'th pivot Atom. If generally

proj_child(A)_i ≤ proj_child(B)_i

then qualitatively "children of A are children of B"; and if generally

proj_parent(A)_i ≥ proj_parent(B)_i

then qualitatively "parents of B are parents of A". The combination of these two conditions means heuristically that (InheritanceLink A B) is likely. So, by combining the two embedding vectors assigned to each Atom, one can get heuristic guidance regarding inheritance relations, analogous to the case with similarity relationships. One
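This heuristic can be sketched as follows (illustrative Python; the `proj_parent` and `proj_child` tables are assumed precomputed against the pivot Atoms, and `tol` is a hypothetical slack parameter, not part of the text's formulation):

```python
def inheritance_likely(a, b, proj_parent, proj_child, tol=0.0):
    """Heuristic test that (InheritanceLink a b) is worth exploring:
    children of a should be children of b (a's child-vector dominated by
    b's), and parents of b should be parents of a (a's parent-vector
    dominating b's)."""
    children_ok = all(ca <= cb + tol for ca, cb in zip(proj_child[a], proj_child[b]))
    parents_ok = all(pa + tol >= pb for pa, pb in zip(proj_parent[a], proj_parent[b]))
    return children_ok and parents_ok
```

Note the asymmetry: swapping the arguments tests the opposite inheritance direction, which is how two symmetric embedding spaces jointly encode an asymmetric relation.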
may produce mathematical formulas estimating the error of this approach under appropriate conditions, but in practice the error will depend on the probability distribution of the vectors.
Chapter 40
Mental Simulation and Episodic Memory

40.1 Introduction

This brief chapter deals with two important, coupled cognitive components of CogPrime: the component concerned with creating internal simulations of situations and episodes in the external physical world, and the one concerned with storing and retrieving memories of situations and episodes.

These are components that are likely significantly different in CogPrime from anything that exists in the human brain; yet the functions they carry out are obviously essential to human cognition (perhaps more so to human cognition than to CogPrime's cognition, because CogPrime is by design more reliant on formal reasoning than the human brain is). Much of human thought consists of internal, quasi-sensory "imaging" of the external physical world - and much of human memory consists of remembering autobiographical situations and episodes from daily life, or from stories heard from others or absorbed via media. Often this episodic remembering takes the form of visualization, but not always. Blind people generally think and remember in terms of non-visual imagery, and many sighted people think in terms of sounds, tastes or smells in addition to visual images.

So far, the various mechanisms proposed as part of CogPrime do not have much to do with either internal imagery or episodic remembering, even though both seem to play a large role in human thought. This is OK, of course, since CogPrime is not intended as a simulacrum of human thought, but rather as a different sort of intelligence. However, we believe it will actually be valuable to CogPrime to incorporate both of these factors. And for that purpose, we propose

• a novel mechanism: the incorporation within the CogPrime system of a 3D physical-world simulation engine
• an episodic memory store centrally founded on dimensional embedding, and linked to the internal simulation model
40.2 Internal Simulations

The current use of virtual worlds for OpenCog is to provide a space in which human-controlled agents and CogPrime-controlled agents can interact, thus allowing flexible instruction of the CogPrime system by humans, and flexible embodied, grounded learning by CogPrime systems. But this very same mechanism may be used internally to CogPrime, i.e. a CogPrime system may be given an internal simulation world, which serves as a sort of "mind's eye." Any sufficiently flexible virtual world software may be used for this purpose, for example OpenSim (http://opensim.org).

Atoms encoding percepts may be drawn from memory and used to generate forms within the internal simulation world. These forms may then interact according to

• the patterns via which they are remembered to act
• the laws of physics, as embodied in the simulation world

This allows a kind of "implicit memory," in that patterns emergent from the world-embedded interaction of a number of entities need not explicitly be stored in memory, so long as they will emerge when the entities are re-awakened within the internal simulation world.

The SimulatorMindAgent grabs important perceptual Atoms and uses them to generate forms within the internal simulation world, which then act according to remembered dynamical patterns, with the laws of physics filling in the gaps in memory. This provides a sort of running internal visualization of the world. Just as important, however, are specific schemata that utilize visualization in appropriate contexts. For instance, if reasoning is having trouble solving a problem related to physical entities, it may feed these entities to the internal simulation world to see what can be discovered. Patterns discovered via simulation can then be fed into reasoning for further analysis.
The process of perceiving events and objects in the simulation world is essentially identical to the process of perceiving events and objects in the "actual" world. And of course, an internal simulation world may be used whether the CogPrime system in question is hooked up to a virtual world like OpenSim, or to a physical robot.

Finally, perhaps the most interesting aspect of internal simulation is the generation of "virtual perceptions" from abstract concepts. Analogical reasoning may be used to generate virtual perceptions that were never actually perceived, and these may then be visualized. The need for "reality discrimination" comes up here, and is easier to enforce in CogPrime than in humans. A PerceptNode that was never actually perceived may be explicitly embedded in a HypotheticalLink, thus avoiding the possibility of confusing virtual percepts with actual ones. How useful the visualization of virtual perceptions will be to CogPrime cognition remains to be seen. This kind of visualization is key to human imagination, but this doesn't mean it will play the same role in CogPrime's quite different cognitive processes. But it is important that CogPrime has the power to carry out this kind of imagination.

40.3 Episodic Memory

Episodic memory refers to the memory of our own "life history" that each of us has. Loss of this kind of memory is the most common type of amnesia in fiction - such amnesia is particularly dramatic because our episodic memories constitute so much of what we consider as our "selves."
To a significant extent, we as humans remember, reason and relate in terms of stories - and the centerpiece of our understanding of stories is our episodic memory. A CogPrime system need not be as heavily story-focused as a typical human being (though it could be, potentially) - but even so, episodic memory is a critical component of any CogPrime system controlling an agent in a world.

The core idea underlying CogPrime's treatment of episodic memory is a simple one: two dimensional embedding spaces dedicated to episodes. An episode - a coherent collection of happenings, often with causal interrelationships, often (but not always) occurring near the same spatial or temporal locations as each other - may be represented explicitly as an Atom, and implicitly as a map whose key is that Atom. These episode-Atoms may then be mapped into two dedicated embedding spaces:

• one based on a distance metric determined by spatiotemporal proximity
• one based on a distance metric determined by semantic similarity

A story is then a series of episodes - ideally one that, if the episodes in the series become important sequentially in the AtomSpace, causes a significant important-goal-related (ergo emotional) response in the system. Stories may also be represented as Atoms, in the simplest case consisting of SequentialAND links joining episode-Atoms. Stories then correspond to paths through the two episodic embedding spaces. Each path through each embedding space implicitly has a sort of "halo" in the space - visualizable as a tube snaking through the space, centered on the path. This tube contains other paths - other stories - that relate to the given center story, either spatiotemporally or semantically.

The familiar everyday human experience of episodic memory may then be approximately emulated via the properties of the dimensional embedding space.
For instance, episodic memory is famously associative - when we think of one episode or story, we think of others that are spatiotemporally or semantically associated with it. This emerges naturally from the embedding space approach, due to the natural emergence of distance-based associative memory in an embedding space. Figures 40.1 and 40.2 roughly illustrate the link between episodic/perceptual and declarative memory.
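A minimal sketch of such distance-based associative recall (illustrative Python; the episode vectors would come from either of the two embedding spaces described above, and the episode names are invented for the example):

```python
import math

def recall_episodes(query_vec, episode_vecs, k=3):
    """Distance-based associative recall: return the k stored episodes
    whose embedding vectors lie closest to the query episode's vector."""
    def dist(v, w):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(v, w)))
    ranked = sorted(episode_vecs, key=lambda e: dist(episode_vecs[e], query_vec))
    return ranked[:k]
```

Running the same query against the spatiotemporal space and the semantic space yields the two flavors of association the text describes.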
Fig. 40.1: Relationship Between Episodic, Declarative and Perceptual Memory. The nodes and links at the bottom depict declarative memory stored in the Atomspace; the picture at the top illustrates an archetypal episode stored in episodic memory, and linked to the perceptual hierarchy enabling imagistic simulation.
Fig. 40.2: Relationship Between Episodic, Declarative and Perceptual Memory. Another example similar to the one in Figure 40.1, but referring specifically to events occurring in an OpenCogPrime-controlled agent's virtual world.
Chapter 41
Integrative Procedure Learning

41.1 Introduction

"Procedure learning" - learning step-by-step procedures for carrying out internal or external operations - is a highly critical aspect of general intelligence, and is carried out in CogPrime via a complex combination of methods. This somewhat heterogeneous chapter reviews several advanced aspects of procedure learning in CogPrime, mainly having to do with the integration between different cognitive processes.

In terms of general cognitive theory and mind-world correspondence, this is some of the subtlest material in the book. We are not concerned just with how the mind's learning of one sort of knowledge correlates with the way this sort of knowledge is structured in the mind's habitual environments, in the context of its habitual goals. Rather, we are concerned with how various sorts of knowledge intersect and interact with each other. The proposed algorithmic intersections between, for instance, declarative and procedural learning processes, are reflective of implicit assumptions about how declarative and procedural knowledge are presented in the world in the context of the system's goals - but these implicit assumptions are not always easy to tease out and state in a compact way. We will do our best to highlight these assumptions as they arise throughout the chapter.

Key among these assumptions, however, is that a human-like mind

• is presented with various procedure learning problems at various levels of difficulty (so that different algorithms may be appropriate depending on the difficulty level). This leads for instance to the possibility of using various different algorithms like MOSES or hill climbing, for different procedure learning problems.
• is presented with some procedure learning problems that may be handled in a relatively isolated way, and others that are extremely heavily dependent on context, often in a way that recurs across multiple learning instances in similar contexts.
This leads to situations where the value of bringing declarative (PLN), associative (ECAN) and episodic knowledge into the procedure learning process varies depending on the situation.
• is presented with a rich variety of procedure learning problems with complex interrelationships, including many problems that are closely related to previously solved problems in various ways. This highlights the value of using PLN analogical reasoning, and importance spreading along HebbianLinks learned by ECAN, to help guide procedure learning in various ways.
• needs to learn some procedures whose execution may be carried out in a relatively isolated way, and other procedures whose execution requires intensive ongoing interaction with other cognitive processes.

The diversity of procedure learning situations reflected in these assumptions leads naturally to the diversity of technical procedure learning approaches described in this chapter. Potentially one could have a single, unified algorithm covering all the different sorts of procedure learning, but instead we have found it more practical to articulate a small number of algorithms which are then combined in different ways to yield the different kinds of procedure learning.

41.1.1 The Diverse Technicalities of Procedure Learning in CogPrime

On a technical level, this chapter discusses two closely related aspects of CogPrime: schema learning and predicate learning, which we group under the general category of "procedure learning."

Schema learning - the learning of SchemaNodes and schema maps (explained further in Chapter 42) - is CogPrime lingo for learning how to do things. Learning how to act, how to perceive, and how to think - beyond what's explicitly encoded in the system's MindAgents. As an advanced CogPrime system becomes more profoundly self-modifying, schema learning will drive more and more of its evolution.

Predicate learning, on the other hand, is the most abstract and general manifestation of pattern recognition in the CogPrime system. PredicateNodes, along with predicate maps, are CogPrime's way of representing general patterns (general within the constraints imposed by the system parameters, which in turn are governed by hardware constraints). Predicate evolution, predicate mining and higher-order inference - specialized and powerful forms of predicate learning - are the system's most powerful ways of creating general patterns in the world and in the mind.
Simpler forms of predicate learning are grist for the mill of these processes.

It may be useful to draw an analogy with another (closely related) very hard problem in CogPrime, discussed in the book Probabilistic Logic Networks: probabilistic logical unification, which in the CogPrime/PLN framework basically comes down to finding the SatisfyingSets of given predicates. Hard logical unification problems can often be avoided by breaking down large predicates into small ones in strategic ways, guided by non-inferential mind processes, and then doing unification only on the smaller predicates. Our limited experimental experience indicates that the same "hierarchical breakdown" strategy also works for schema and predicate learning, to an extent. But still, as with unification, even when one does break down a large schema or predicate learning problem into a set of smaller problems, one is still in most cases left with a set of fairly hard problems.

More concretely, CogPrime procedure learning may be generally decomposed into three aspects:

1. Converting back and forth between maps and ProcedureNodes (encapsulation and expansion)
2. Learning the Combo Trees to be embedded in grounded ProcedureNodes
3. Learning procedure maps (networks of grounded ProcedureNodes acting in a coordinated way to carry out procedures)
Each of these three aspects of CogPrime procedure learning mentioned above may be dealt with somewhat separately, though relying on largely overlapping methods. CogPrime approaches these problems using a combination of techniques:

• Evolutionary procedure learning and hillclimbing for dealing with brand new procedure learning problems, requiring the origination of innovative, highly approximate solutions out of the blue
• Inferential procedure learning for taking approximate solutions and making them exact, and for dealing with procedure learning problems within domains where closely analogous procedure learning problems have previously been solved
• Heuristic, probabilistic data mining for the creation of encapsulated procedures (which then feed into inferential and evolutionary procedure learning), and the expansion of encapsulated procedures into procedure maps
• PredictiveImplicationLink formation (augmented by PLN inference on such links) as a CogPrime version of goal-directed reinforcement learning

Using these different learning methods together, as a coherently-tuned whole, one arrives at a holistic procedure learning approach that combines speculation, systematic inference, encapsulation and credit assignment in a single adaptive dynamic process.

We are relying on a combination of techniques to do what none of the techniques can accomplish on their own. The combination is far from arbitrary, however. As we will see, each of the techniques involved plays a unique and important role.

41.1.1.1 Comments on an Alternative Representational Approach

We briefly pause to contrast certain technical aspects of the present approach with analogous aspects of the Webmind AI Engine (one of CogPrime's predecessor AI systems, briefly discussed in Chapter 19.1).
This predecessor system used a knowledge representation somewhat similar to the Atomspace, but with various differences; for instance the base types were Node and Link rather than Atom, and there was a Node type not used in CogPrime called the SchemaInstanceNode (each one corresponding to a particular instance of a SchemaNode, used within a particular procedure).

In this approach, complex, learned schemata were represented as distributed networks of elementary SchemaInstanceNodes, but these networks were not defined purely by function application - they involved explicit passing of variable values through VariableNodes. Special logic-gate-bearing objects were created to deal with the distinction between arguments of a SchemaInstanceNode, and predecessor tokens giving a SchemaInstanceNode permission to act. While this approach is in principle workable, it proved highly complex in practice, and for the Novamente Cognition Engine and CogPrime we chose to store and manipulate procedural knowledge separately from declarative knowledge (via Combo trees).
41.2 Preliminary Comments on Procedure Map Encapsulation and Expansion

Like other knowledge in CogPrime, procedures may be stored in either a localized (Combo tree) or globalized (procedure map) manner, with the different approaches being appropriate for different purposes. Activation of a localized procedure may spur activation of a globalized procedure, and vice versa - so on the overall mind-network level the representation of procedures is heavily "glocal." One issue that looms large in this context is the conversion between localized and globalized procedures - i.e., in CogPrime lingo, the encapsulation and expansion of procedure maps. This matter will be considered in more detail in Chapter 42, but here we briefly review some key ideas.

Converting from grounded ProcedureNodes into maps is a relatively simple learning problem: one enacts the procedure, observes which Atoms are active at what times during the enaction process, and then creates PredictiveImplicationLinks between the Atoms active at a certain time and those active at subsequent times. Generally it will be necessary to enact the procedure multiple times and with different inputs, to build up the appropriate library of PredictiveImplicationLinks.

Converting from maps into ProcedureNodes is significantly trickier. First, it involves carrying out data mining over the network of ProcedureNodes, identifying subnetworks that are coherent schema or predicate maps. Then it involves translating the control structure of the map into explicit logical form, so that the encapsulated version will follow the same order of execution as the map version.
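The simpler direction - expanding a grounded procedure into a map by observing activity during enaction - can be sketched as follows. The log format and the frequency-based strength estimate are illustrative assumptions, not the system's actual bookkeeping:

```python
from collections import defaultdict

def expand_to_map(activity_log):
    """Procedure-map expansion sketch: given a log of which Atoms were
    active at each time step during several enactions of a procedure,
    count how often atom a's activity precedes atom b's, yielding
    candidate PredictiveImplicationLinks with frequency-based strengths."""
    counts = defaultdict(int)        # (a, b) -> times b followed a
    occurrences = defaultdict(int)   # a -> times a was active (non-final step)
    for enaction in activity_log:    # each enaction: list of sets of active atoms
        for t in range(len(enaction) - 1):
            for a in enaction[t]:
                occurrences[a] += 1
                for b in enaction[t + 1]:
                    counts[(a, b)] += 1
    return {pair: counts[pair] / occurrences[pair[0]] for pair in counts}
```

Repeating the enaction with varied inputs, as the text suggests, would prune coincidental successions and leave only the stable temporal structure of the procedure.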
This is an important case of the general process of map encapsulation, to be discussed in Chapter 42.

Next, the learning of grounded ProcedureNodes is carried out by a synergistic combination of multiple mechanisms, including pure procedure learning methods like hillclimbing and evolutionary learning, and logical inference. These two approaches have quite different characteristics. Evolutionary learning and hillclimbing excel at confronting a problem that the system has no clue about, and arriving at a reasonably good solution in the form of a schema or predicate. Inference excels at deploying the system's existing knowledge to form useful schemata or predicates. The choice of the appropriate mechanism for a given problem instance depends largely on how much relevant knowledge is available.

A relatively simple case of ProcedureNode learning is where one is given a ConceptNode and wants to find a ProcedureNode matching it. For instance, given a ConceptNode C, one may wish to find the simplest possible predicate whose corresponding PredicateNode P satisfies

SatisfyingSet(P) = C

On the other hand, given a ConceptNode C involved in inferred ExecutionLinks of the form

ExecutionLink C Ai Bi, i = 1, ..., n

one may wish to find a SchemaNode whose corresponding schema will fulfill this same set of ExecutionLinks. It may seem surprising at first that a ConceptNode might be involved with ExecutionLinks, but remember that a function can be seen as a set of tuples (ListLinks in CogPrime) where the first elements, the inputs of the function, are associated with a unique output.

These kinds of ProcedureNode learning may be cast as optimization problems, and carried out by hillclimbing or evolutionary programming. Once procedures are learned via evolutionary programming or other techniques, they may be refined via inference.
The other case of ProcedureNode learning is goal-driven learning. Here one seeks a SchemaNode whose execution will cause a given goal (represented by a GoalNode) to be satisfied. The details of GoalNodes have already been reviewed; all we need to know here is simply that a GoalNode presents an objective function, a function to be maximized; and that it poses the problem of finding schemata whose enaction will cause this function to be maximized in specified contexts.

The learning of procedure maps, on the other hand, is carried out by reinforcement learning, augmented by inference. This is a matter of the system learning HebbianLinks between ProcedureNodes, as will be described below.

41.3 Predicate Schematization

Now we turn to the process called "predicate schematization," by which declarative knowledge about how to carry out actions may be translated into Combo trees embodying specific procedures for carrying out actions. This process is straightforward and automatic in some cases, but in other cases requires significant contextually-savvy inference. This is a critical process because some procedural knowledge - especially that which is heavily dependent on context in either its execution or its utility - will be more easily learned via inferential methods than via pure procedure-learning methods. But, even if a procedure is initially learned via inference (or is learned by inference based on cruder initial guesses produced by pure procedure learning methods), it may still be valuable to have this procedure in compact and rapidly executable form such as Combo provides.

To proceed with the technical description of predicate schematization in CogPrime, we first need the notion of an "executable predicate". Some predicates are executable in the sense that they correspond to executable schemata; others are not. There are executable atomic predicates (represented by individual PredicateNodes),
and executable compound predicates (which are link structures). In general, a predicate may be turned into a schema if it is an atomic executable predicate, or if it is a compound link structure that consists entirely of executable atomic predicates (e.g. pick_up, walk_to, can_do, etc.) and temporal links (e.g. SimultaneousAND, PredictiveImplication, etc.).

Records of predicate execution may then be made using ExecutionLinks, e.g.

ExecutionLink pick_up (me, ball_7)

is a record of the fact that the schema corresponding to the pick_up predicate was executed on the arguments (me, ball_7).

It is also useful to introduce some special (executable) predicates related to schema execution:

• can_do, which represents the system's perceived ability to do something
• do, which denotes the system actually doing something; this is used to mark actions as opposed to perceptions
• just_done, which is true of a schema if the schema has very recently been executed.

The general procedure used in figuring out what predicates to schematize, in order to create a procedure achieving a certain goal, is: start from the goal and work backwards, following PredictiveImplications and EventualPredictiveImplications and treating can_do's as transparent, stopping when you find something that can currently be done, or else when the process dwindles due to lack of links or lack of sufficiently certain links.
41 Integrative Procedure Learning

In this process, an ordered list of perceptions and actions will be created. The Atoms in this perception/action series (PA-series) are linked together via temporal-logical links. The subtlety of this process, in general, arises because there may be many different paths to follow. One has the familiar combinatorial explosion of backward-chaining inference, and it may be hard to find the best PA-series amid all the mess. Experience-guided pruning is needed here just as with backward-chaining inference.

Specific rules for translating temporal links into executable schemata, used in this process, are as follows. All these rule-statements assume that B is in the selected PA-series. All node variables not preceded by do or can_do are assumed to be perceptions. The --> denotes the transformation from predicates to executable schemata.

EventualPredictiveImplicationLink (do A) B
-->
Repeat (do A) Until B

EventualPredictiveImplicationLink (do A) (can_do B)
-->
Repeat
    do A
    do B
Until
    Evaluation just_done B

the understanding being that the agent may try to do B and fail, and then try again the next time around the loop

PredictiveImplicationLink (do A) (can_do B) <time-lag T>
-->
do A
wait T
do B

SimultaneousImplicationLink A (can_do B)
-->
if A then do B

SimultaneousImplicationLink (do A) (can_do B)
-->
do A
do B

PredictiveImplicationLink A (can_do B)
-->
if A then do B

SequentialAndLink A1
... An
-->
A1
...
An

SequentialAndLink A1 ... An <time-lag T>
-->
A1
Wait T
A2
Wait T
...
Wait T
An

SimultaneousANDLink A1 ... An
-->
A1
...
An

Note how all instances of can_do are stripped out upon conversion from predicate to schema, and replaced with instances of do.

41.3.1 A Concrete Example

For a specific example of this process, consider the knowledge that: "If I walk to the teacher while whistling, and then give the teacher the ball, I'll get rewarded." This might be represented by the predicates

walk to the teacher while whistling

A_1
SimultaneousAND
    do Walk_to ExOutLink locate teacher
    EvaluationLink do whistle

If I walk to the teacher while whistling, eventually I will be next to the teacher

EventualPredictiveImplication
    A_1
    Evaluation next_to teacher

While next to the teacher, I can give the teacher the ball

SimultaneousImplication
    EvaluationLink next_to teacher
    can_do
        EvaluationLink give (teacher, ball)

If I give the teacher the ball, I will get rewarded
PredictiveImplication
    just_done
        EvaluationLink give (teacher, ball)
    Evaluation reward

Via goal-driven predicate schematization, these predicates would become the schemata

walk toward the teacher while whistling

Repeat:
    do WalkTo ExOut locate teacher
    do Whistle
Until:
    next_to (teacher)

if next to the teacher, give the teacher the ball

If:
    Evaluation next_to teacher
Then:
    do give (teacher, ball)

Carrying out these two schemata will lead to the desired behavior of walking toward the teacher while whistling, and then giving the teacher the ball when next to the teacher. Note that, in this example:

• The walk_to, whistle, locate and give used in the example schemata are procedures corresponding to the executable predicates walk_to, whistle, locate and give used in the example predicates
• next_to is evaluated rather than executed because (unlike the other atomic predicates in the overall predicate being made executable) it has no "do" or "can_do" next to it

41.4 Concept-Driven Schema and Predicate Creation

In this section we will deal with the "conversion" of ConceptNodes into SchemaNodes or PredicateNodes. The two cases involve similar but nonidentical methods; we will begin with the simpler PredicateNode case. Conceptually, the importance of this should be clear: sometimes knowledge may be gained via concept-learning or linguistic means, yet may be useful to the mind in other forms, e.g. as executable schemata or evaluable predicates. For instance, the system may learn conceptually about bicycle-riding, but then may also want to learn executable procedures allowing it to ride a bicycle. Or it may learn conceptually about criminal individuals, but may then want to learn evaluable predicates allowing it to quickly evaluate whether a given individual is a criminal or not.

41.4.1 Concept-Driven Predicate Creation

Suppose we have a ConceptNode C, with a set of links of the form
MemberLink A_i C, i = 1, ..., n

Our goal is to find a PredicateNode P so that, firstly,

MemberLink X C

is equivalent to

X "within" SatisfyingSet(P)

and secondly, P is as simple as possible. This is related to the "Occam's Razor," Solomonoff-induction-related heuristic to be presented later in this chapter.

We now have an optimization problem: search the space of predicates for P that maximize the objective function f(P,C), defined for instance as

f(P,C) = cp(P) × r(C,P)

where cp(P), the complexity penalty of P, is a positive function that decreases as P gets larger, and

r(C,P) =
GetStrength
    SimilarityLink C SatisfyingSet(P)

This is an optimization problem over predicate space, which can be solved in an approximate way by the evolutionary programming methods described earlier. The ConceptPredicatization MindAgent selects ConceptNodes based on

• Importance
• Total (truth value based) weight of attached MemberLinks and EvaluationLinks

and launches an evolutionary learning or hillclimbing task focused on learning predicates based on the nodes it selects.

41.4.2 Concept-Driven Schema Creation

In the schema learning case, instead of a ConceptNode with MemberLinks and EvaluationLinks, we begin with a ConceptNode C with ExecutionLinks. These ExecutionLinks were presumably produced by inference (the only CogPrime cognitive process that knows how to create ExecutionLinks for non-ProcedureNodes). The optimization problem we have here is: search the space of schemata for S that maximize the objective function f(S,C), defined as follows:

f(S,C) = cp(S) × r(S,C)

Let Q(C) be the set of pairs (X, Y) so that ExecutionLink C X Y, and
r(S,C) =
GetStrength
    SubsetLink Q(C) Graph(S)

where Graph(S) denotes the set of pairs (X, Y) so that ExecutionLink S X Y, where S has been executed over all valid inputs. Note that we use a SubsetLink here because in practice C would have been observed on only a partial set of inputs.

Operationally, the situation here is very similar to that with concept predicatization. The ConceptSchematization MindAgent must select ConceptNodes based on:

• Importance
• Total (truth value based) weight of ExecutionLinks

and then feed these to evolutionary optimization or hillclimbing.

41.5 Inference-Guided Evolution of Pattern-Embodying Predicates

Now we turn to predicate learning - the learning of PredicateNodes, in particular. Aside from logical inference and learning predicates to match existing concepts, how does the system create new predicates? Goal-driven schema learning (via evolution or reinforcement learning) provides one alternate approach: create predicates in the context of creating useful schemata. Pattern mining, discussed in Chapter 37, provides another. Here we will describe yet another complementary dynamic for predicate creation: pattern-oriented, inference-guided PredicateNode evolution.

In most general terms, the notion pursued here is to form predicates that embody patterns in the system itself and in the world. This brings us straight back to the foundations of the patternist philosophy of mind, in which mind is viewed as a system for recognizing patterns in itself and in the world, and then embodying these patterns in itself. This general concept is manifested in many ways in the CogPrime design, and in this section we will discuss two of them:

• Reward of surprisingly probable Predicates
• Evolutionary learning of pattern-embodying Predicates

These are emphatically not the only ways pattern-embodying PredicateNodes get into the system.
Inference and concept-based predicate learning also create PredicateNodes embodying patterns. But these two mechanisms complete the picture.

41.5.1 Rewarding Surprising Predicates

The TruthValue of a PredicateNode represents the expected TruthValue obtained by averaging its TruthValue over all its possible legal argument-values. Some Predicates, however, may have high TruthValue without really being worthwhile. They may not add any information to their
components. We want to identify and reward those Predicates whose TruthValues actually add information beyond what is implicit in the simple fact of combining their components. For instance, consider the PredicateNode

AND
    InheritanceLink X man
    InheritanceLink X ugly

If we assume the man and ugly concepts are independent, then this PredicateNode will have the TruthValue

man.tv.s × ugly.tv.s

In general, a PredicateNode will be considered interesting if:

1. Its Links are important
2. Its TruthValue differs significantly from what would be expected based on independence assumptions about its components

It is of value to have interesting Predicates allocated more attention than uninteresting ones. Factor 1 is already taken into account, in a sense: if the PredicateNode is involved in many Links, this will boost its activation, which will boost its importance. On the other hand, Factor 2 is not taken into account by any previously discussed mechanism. For instance, we may wish to reward a PredicateNode if it has a surprisingly large or small strength value. One way to do this is to calculate:

sdiff = |actual strength - strength predicted via independence assumptions| × weight_of_evidence

and then increment the value

K × sdiff

onto the PredicateNode's LongTermImportance value, and similarly increment STI using a different constant.

Another factor that might usefully be caused to increment LTI is the simplicity of a PredicateNode. Given two Predicates with equal strength, we want the system to prefer the simpler one over the more complex one. However, the OccamsRazor MindAgent, to be presented below, rewards simpler Predicates directly in their strength values. Hence if the latter is in use, it seems unnecessary to reward them for their simplicity in their LTI values as well. This is an issue that may require some experimentation as the system develops.
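As a concrete numerical illustration, the sdiff computation can be sketched as follows. The constant K and all the strength values here are assumed for the example; a real implementation would operate on PLN truth-value objects rather than bare floats.

```python
# Sketch of the surprisingness reward described above. The independence
# prediction multiplies component strengths (as in man.tv.s * ugly.tv.s);
# sdiff weights the gap by evidence, and K scales the LongTermImportance
# increment. K and all numbers are assumed values for illustration.

def independence_prediction(component_strengths):
    p = 1.0
    for s in component_strengths:
        p *= s
    return p

def lti_increment(actual_strength, component_strengths, weight_of_evidence, K=10.0):
    predicted = independence_prediction(component_strengths)
    sdiff = abs(actual_strength - predicted) * weight_of_evidence
    return K * sdiff

# AND(Inheritance X man, Inheritance X ugly): independence predicts 0.5 * 0.2 = 0.1
boost = lti_increment(actual_strength=0.30,
                      component_strengths=[0.5, 0.2],
                      weight_of_evidence=0.8)
print(round(boost, 3))
```

Note that a predicate whose actual strength matches the independence prediction receives no boost at all, which is exactly the intended behavior.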
Returning to the surprisingness factor, consider the PredicateNode representing

AND
    InheritanceLink X cat
    EvaluationLink (eats X) fish

If this has a surprisingly high truth value, this means that there are more X known to (or inferred by) the system that both inherit from cat and eat fish than one would expect given the probabilities of a random X both inheriting from cat and eating fish. Thus, roughly speaking, the conjunction of inheriting from cat and eating fish may be a pattern in the world.
We now see one very clear sense in which CogPrime dynamics implicitly leads to predicates representing patterns. Small predicates that have surprising truth values get extra activation, and hence are more likely to stick around in the system. Thus the mind fills up with patterns.

41.5.2 A More Formal Treatment

It is worth taking a little time to clarify the sense in which we have a pattern in the above example, using the mathematical notion of pattern reviewed in Chapter 3 of Part 1. Consider the predicate:

pred1(T).tv equals
GetStrength
AND
    Inheritance $X cat
    Evaluation eats ($X, fish)

where T is some threshold value (e.g. .8). Let B = SatisfyingSet(pred1(T)). B is the set of everything that inherits from cat and eats fish.

Now we will make use of the notion of basic complexity. If one assumes the entire AtomSpace A constituting a given CogPrime system as given background information, then the basic complexity c(B||A) may be considered as the number of bits required to list the handles of the elements of B, for lookup in A; whereas c(B) is the number of bits required to actually list the elements of B.

Now, the formula given above, defining the set B, may be considered as a process P whose output is the set B. The simplicity c(P||A) is the number of bits needed to describe this process, which is a fairly small number. We assume A is given as background information, accessible to the process. Then the degree to which P is a pattern in B is given by

1 - c(P||A)/c(B||A)

which, if B is a sizable category, is going to be pretty close to 1.

The key to there being a pattern here is that the relation

(Inheritance X cat) AND (eats X fish)

has a high strength and also a high count. The high count means that B is a large set, either by direct observation or by hypothesis (inference).
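The pattern-intensity formula can be illustrated numerically; the bit counts below are invented purely for the example.

```python
# Numerical illustration of the pattern-intensity formula 1 - c(P||A)/c(B||A):
# P is a pattern in B to the extent that describing the defining process
# (given the AtomSpace A as background) is cheaper than listing B's handles.
# The bit counts are invented for this example.

def pattern_degree(c_process_given_A, c_set_given_A):
    return 1.0 - c_process_given_A / c_set_given_A

# a ~50-bit formula standing in for a ~5000-bit handle listing: strong pattern
print(pattern_degree(50, 5000))

# if B is small, the same process is barely a pattern at all
print(pattern_degree(50, 60))
```

This matches the remark in the text: when B is a sizable category, the degree is close to 1.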
In the case where the count represents actual pieces of evidence observed by the system and retained in memory, then quite literally and directly, the PredicateNode represents a pattern in a subset of the system (relative to the background knowledge consisting of the system as a whole). On the other hand, if the count value has been obtained indirectly by inference, then it is possible that the system does not actually know any examples of the relation. In this case, the PredicateNode is not a pattern in the actual memory store of the system, but it is being hypothesized to be a pattern in the world in which the system is embedded.
41.6 PredicateNode Mining

We have seen how the natural dynamics of the CogPrime system, with a little help from special heuristics, can lead to the evolution of Predicates that embody patterns in the system's perceived or inferred world. But it is also valuable to more aggressively and directly create pattern-embodying Predicates. This does not contradict the implicit process, but rather complements it. The explicit process we use is called PredicateNode Mining and is carried out by a PredicateNodeMiner MindAgent.

Define an Atom structure template as a schema expression corresponding to a CogPrime Link in which some of the arguments are replaced with variables. For instance,

Inheritance $X cat

EvaluationLink (eats $X) fish

are Atom structure templates. (Recall that Atom structure templates are important in PLN inference control, as reviewed in Chapter 36.) What the PredicateNodeMiner does is to look for Atom structure templates, and logical combinations thereof, which

• Minimize PredicateNode size
• Maximize surprisingness of truth value

This is accomplished by a combination of heuristics. The first step in PredicateNode mining is to find Atom structure templates with high truth values. This can be done by a fairly simple heuristic search process. First, note that if one specifies an (Atom, Link type) pair, one is specifying a set of Atom structure templates. For instance, if one specifies (cat, InheritanceLink), then one is specifying the templates

InheritanceLink $X cat

and

InheritanceLink cat $X

One can thus find Atom structure templates as follows. Choose an Atom with high truth value, and then, for each Link type, tabulate the total truth value of the Links of this type involving this Atom. When one finds a promising (Atom, Link type) pair, one can then do inference to test the truth value of the Atom structure template one has found.
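The tabulation step just described can be sketched minimally as follows; the flat link-tuple format and the strength values are assumptions of this sketch, not the AtomSpace's actual representation.

```python
# Minimal sketch of the tabulation step above: for a chosen high-truth-value
# Atom, sum link strength per link type to surface promising (Atom, Link type)
# pairs. The flat link-tuple format and strengths are assumptions.

from collections import defaultdict

# (link_type, source, target, strength)
LINKS = [
    ("InheritanceLink", "tom", "cat", 0.9),
    ("InheritanceLink", "felix", "cat", 0.8),
    ("EvaluationLink", "eats(tom)", "fish", 0.7),
    ("InheritanceLink", "rex", "dog", 0.9),
]

def template_scores(atom, links):
    """Total strength of links of each type involving the given Atom."""
    scores = defaultdict(float)
    for link_type, src, tgt, strength in links:
        if atom in (src, tgt):
            scores[link_type] += strength
    return dict(scores)

print(template_scores("cat", LINKS))
```

Here (cat, InheritanceLink) would be flagged as a promising pair, to be followed up with inference on the corresponding templates.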
Next, given high-truth-value Atom structure templates, the PredicateNodeMiner experiments with joining them together using logical connectives. For each potential combination it assesses the fitness in terms of size and surprisingness. This may be carried out in two ways:

1. By incrementally building up larger combinations from smaller ones, at each incremental stage keeping only those combinations found to be valuable
2. For large combinations, by evolution of combinations

Option 1 is basically greedy data mining (which may be carried out via various standard algorithms, as discussed in Chapter 37), which has the advantage of being much more rapid
than evolutionary programming, but the disadvantage that it misses large combinations whose subsets are not as surprising as the combinations themselves. It seems there is room for both approaches in CogPrime (and potentially many other approaches as well). The PredicateNodeMiner MindAgent contains a parameter telling it how much time to spend on stochastic pattern mining vs. evolution, as well as parameters guiding the processes it invokes.

So far we have discussed the process of finding single-variable Atom structure templates. But multivariable Atom structure templates may be obtained by combining single-variable ones. For instance, given

eats $X fish

lives_in $X Antarctica

one may choose to investigate various combinations such as

(eats $X $Y) AND (lives_in $Y Antarctica)

(this particular example will have a predictably low truth value). So the introduction of multiple variables may be done in the same process as the creation of single-variable combinations of Atom structure templates.

When a suitably fit Atom structure template or logical combination thereof is found, a PredicateNode is created embodying it, and placed into the AtomSpace.

41.7 Learning Schema Maps

Next we plunge into the issue of procedure maps - schema maps in particular. A schema map is a simple yet subtle thing - a subnetwork of the AtomSpace consisting of SchemaNodes, computing some useful quantity or carrying out some useful process in a cooperative way. The general purpose of schema maps is to allow schema execution to interact with other mental processes in a more flexible way than is allowed by compact Combo trees with internal hooks into the AtomSpace - i.e., to handle cases where procedure execution needs to be very highly interactive, mediated by attention allocation and other CogPrime dynamics in a flexible way.

But how can schema maps be learned? The basic idea is simply reinforcement learning.
In a goal-directed system consisting of interconnected, cooperative elements, you reinforce those connections and/or elements that have been helpful for achieving goals, and weaken those that haven't. Thus, over time, you obtain a network of elements that achieves goals effectively.

The central difficulty in all reinforcement learning approaches is the "assignment of credit" problem. If a component of a system has been directly useful for achieving a goal, then rewarding it is easy. But if the relevance of a component to a goal is indirect, then things aren't so simple. Measuring indirect usefulness in a large, richly connected system is difficult - inaccuracies creep into the process easily. In CogPrime, reinforcement learning is handled via HebbianLinks, acted on by a combination of cognitive processes. Earlier, in Chapter 23, we reviewed the semantics of HebbianLinks, and discussed two methods for forming HebbianLinks:

1. Updating HebbianLink strengths via mining of the System Activity Table
2. Logical inference on HebbianLinks, which may also incorporate the use of inference to combine HebbianLinks with other logical links (for instance, in the reinforcement learning context, PredictiveImplicationLinks)

We now describe how HebbianLinks, formed and manipulated in this manner, may play a key role in goal-driven reinforcement learning. In effect, what we will describe is an implicit integration of the bucket brigade with PLN inference. The addition of robust probabilistic inference adds a new kind of depth and precision to the reinforcement learning process.

Goal Nodes have an important ability to stimulate a lot of SchemaNode execution activity. If a goal needs to be fulfilled, it stimulates schemata that are known to make this happen. But how is it known which schemata tend to fulfill a given goal? A link

PredictiveImplicationLink S G

means that after schema S has been executed, goal G tends to be fulfilled. If these links between goals and goal-valuable schemata exist, then activation spreading from goals can serve the purpose of causing goal-useful schemata to become active.

The trick, then, is to use HebbianLinks, and inference thereon, to implicitly guess PredictiveImplicationLinks. A HebbianLink between S1 and S says that when thinking about S1 was useful in the past, thinking about S was also often useful. This suggests that if doing S achieves goal G, maybe doing S1 is also a good idea. The system may then try to find (by direct lookup or reasoning) whether, in the current context, there is a PredictiveImplication joining S1 to S. In this way Hebbian reinforcement learning is being used as an inference control mechanism to aid in the construction of a goal-directed chain of PredictiveImplicationLinks, which may then be schematized into a contextually useful procedure.

Note finally that this process feeds back into itself in an interesting way, via contributing to ongoing HebbianLink formation.
Along the way, while leading to the on-the-fly construction of context-appropriate procedures that achieve goals, it also reinforces the HebbianLinks that hold together schema maps, sculpting new schema maps out of the existing field of interlinked SchemaNodes.

41.7.1 Goal-Directed Schema Evolution

Finally, as a complement to goal-driven reinforcement learning, there is also a process of goal-directed SchemaNode learning. This combines features of the goal-driven reinforcement learning and concept-driven schema evolution methods discussed above. Here we use a Goal Node to provide the fitness function for schema evolution. The basic idea is that the fitness of a schema is defined by the degree to which enactment of that schema causes fulfillment of the goal. This requires the introduction of CausalImplicationLinks, as defined in PLN. In the simplest case, a CausalImplicationLink is simply a PredictiveImplicationLink.

One relatively simple implementation of the idea is as follows. Suppose we have a Goal Node G, whose satisfaction we desire to have achieved by time T1. Suppose we want to find a SchemaNode S whose execution at time T2 will cause G to be achieved. We may define a fitness function for evaluating candidate S by:

f(S,G,T1,T2) = cp(S) × r(S,G,T1,T2)
r(S,G,T1,T2) =
GetStrength
    CausalImplicationLink
        EvaluationLink AtTime (T1, ExecutionLink S X Y)
        EvaluationLink AtTime (T2, G)

Another variant specifies only a relative time lag, not two absolute times:

f(S,G,T) = cp(S) × r(S,G,T)

r(S,G,T) =
    AND
        NonEmpty
            SatisfyingSet r(S,G,T1,T2)
        T1 > T2 - T

Using evolutionary learning or hillclimbing to find schemata fulfilling these fitness functions results in SchemaNodes whose execution is expected to cause the achievement of given goals.

This is a complementary approach to reinforcement-learning-based schema learning, and to schema learning based on PredicateNode concept creation. The strengths and weaknesses of these different approaches need to be extensively explored experimentally. However, prior experience with the learning algorithms involved gives us some guidance. We know that when absolutely nothing is known about an objective function, evolutionary programming is often the best way to proceed. Even when there is knowledge about an objective function, the evolution process can take it into account, because the fitness functions involve logical links, and the evaluation of these logical links may involve inference operations. On the other hand, when there's a lot of relevant knowledge embodied in previously executed procedures, using logical reasoning to guide new procedure creation can be cumbersome, due to the overwhelming number of potentially useful facts to choose among when carrying out inference. The Hebbian mechanisms used in reinforcement learning may be understood as inferential in their conceptual foundations (since a HebbianLink is equivalent to an ImplicationLink between two propositions about importance levels). But in practice they provide a much-streamlined approach to bringing knowledge implicit in existing procedures to bear on the creation of new procedures.
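The first fitness function can be sketched numerically as follows. The particular form of cp, the parameter k, and the causal strength (standing in for the strength of the CausalImplicationLink) are all assumptions of this sketch.

```python
# Sketch of the fitness f(S,G,T1,T2) = cp(S) * r(S,G,T1,T2) above, where
# causal_strength stands in for the strength of the CausalImplicationLink
# from "S executed at T1" to "G satisfied at T2". The form of cp, the
# parameter k, and all numbers are assumptions for illustration.

def cp(schema_size, k=0.05):
    """Complexity penalty: positive, decreasing in schema size."""
    return 1.0 / (1.0 + k * schema_size)

def fitness(schema_size, causal_strength):
    return cp(schema_size) * causal_strength

# given comparable causal strength, the smaller schema wins
print(fitness(10, 0.9) > fitness(40, 0.9))  # True
```

This is the same "simple but effective" Occam-style trade-off used in the concept-driven cases earlier: causal efficacy weighted against schema complexity.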
Reinforcement learning, we believe, will excel at combining existing procedures to form new ones, and at modifying existing procedures to work well in new contexts. Logical inference can also help here, acting in cooperation with reinforcement learning. But when the system has no clue how a certain goal might be fulfilled, evolutionary schema learning provides a relatively time-efficient way for it to find something minimally workable.

Pragmatically, the GoalDrivenSchemaLearning MindAgent handles this aspect of the system's operations. It selects Goal Nodes with probability proportional to importance, and then spawns problems for the Evolutionary Optimization Unit Group accordingly. For a given Goal Node, PLN control mechanisms are used to study its properties and to select among the above objective functions on a heuristic basis.
41.8 Occam's Razor

Finally we turn to an important cognitive process that fits only loosely into the category of "CogPrime procedure learning" - it is not actually a procedure learning process, but rather a process that utilizes the fruits of procedure learning.

The well-known "Occam's razor" heuristic says that, all else being equal, simpler is better. This notion is embodied mathematically in the Solomonoff-Levin "universal prior," according to which the a priori probability of a computational entity X is defined as a normalized version of:

m(X) = Σ_p 2^(-l(p))

where:

• the sum is taken over all programs p that compute X
• l(p) denotes the length of the program p

Normalization is necessary because these values will not automatically sum to 1 over the space of all X. Without normalization, m is a semimeasure rather than a measure; with normalization it becomes the "Solomonoff-Levin measure" [Lev]. Roughly speaking, Solomonoff's induction theorem [Sol] shows that, if one is trying to learn the computer program underlying a given set of observed data, and one does Bayesian inference over the set of all programs to try to obtain the answer, then if one uses the universal prior distribution one will arrive at the correct answer.

CogPrime is not a Solomonoff induction engine. The computational cost of actually applying Solomonoff induction is unrealistically large. However, as we have seen in this chapter, there are aspects of CogPrime that are reminiscent of Solomonoff induction. In concept-directed schema and predicate learning, in pattern-based predicate learning, and in causal schema learning, we are searching for schemata and predicates that minimize complexity while maximizing some other quality. These processes all implement the Occam's Razor heuristic in a Solomonoffian style.
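The universal prior can be illustrated with a toy computation. The finite table of (program length, output) pairs below stands in for enumeration over a real universal machine and is purely an assumption of this sketch.

```python
# Toy illustration of the normalized Solomonoff-Levin prior: m(X) sums
# 2^(-l(p)) over programs p computing X, then is normalized over outputs.
# The finite (length, output) table stands in for enumeration over a real
# universal machine and is an assumption of this sketch.

from collections import defaultdict

PROGRAMS = [(2, "X"), (3, "X"), (3, "Y"), (4, "Y")]  # (program length, output)

def semimeasure(programs):
    m = defaultdict(float)
    for length, output in programs:
        m[output] += 2.0 ** (-length)
    return dict(m)

def normalize(m):
    total = sum(m.values())
    return {x: v / total for x, v in m.items()}

prior = normalize(semimeasure(PROGRAMS))
# outputs computed by shorter programs receive higher prior probability
print(prior["X"] > prior["Y"])  # True
```

Note how the shortest program for each output dominates its sum, which is the observation used below to justify the simplified single-term approximation.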
Now we will introduce one more method of imposing the heuristic of algorithmic simplicity on CogPrime Atoms (and hence, indirectly, on CogPrime maps as well). This is simply to give a higher a priori probability to entities that are more simply computable. For starters, we may increase the node probability of ProcedureNodes proportionately to their simplicity. A reasonable formula here is simply:

2^(-r c(P))

where P is the ProcedureNode, c(P) is its complexity, and r > 0 is a parameter. This means that an infinitely complex P has a priori probability zero, whereas an infinitely simple P has a priori probability 1.

This is not an exact implementation of the Solomonoff-Levin measure, but it is a decent heuristic approximation. It is not pragmatically realistic to sum over the lengths of all programs that do the same thing as a given predicate P. Generally the first term of the Solomonoff-Levin summation is going to dominate the sum anyway, so if the ProcedureNode P is maximally compact, then our simplified formula will be a good approximation of the Solomonoff-Levin
summation. These a priori probabilities may be merged with node probability estimates from other sources, using the revision rule.

A similar strategy may be taken with ConceptNodes. We want to reward a ConceptNode C with a higher a priori probability if C ∈ SatisfyingSet(P) for a simple PredicateNode P. To achieve this formulaically, let sim(X, Y) denote the strength of the SimilarityLink between X and Y, and let:

sim'(C,P) = sim(C, SatisfyingSet(P))

We may then define the a priori probability of a ConceptNode as:

pr(C) = Σ_P sim'(C,P) × 2^(-r c(P))

where the sum goes over all P in the system. In practice, of course, it is only necessary to compute the terms of the sum corresponding to P for which sim'(C,P) is large.

As with the a priori PredicateNode probabilities discussed above, these a priori ConceptNode probabilities may be merged with other node probability information, using the revision rule, and using a default parameter value for the weight of evidence. There is one pragmatic difference here from the PredicateNode case, though. As the system learns new PredicateNodes, its best estimate of pr(C) may change. Thus it makes sense for the system to store the a priori probabilities of ConceptNodes separately from the node probabilities, so that when the a priori probability is changed, a two-step operation can be carried out:

• First, remove the old a priori probability from the node probability estimate, using the reverse of the revision rule
• Then, add in the new a priori probability

Finally, we can take a similar approach to any Atom Y produced by a SchemaNode. We can construct:

pr(Y) = Σ_{S,X} s(S,X,Y) × 2^(-r(c(S)+c(X)))

where the sum goes over all pairs (S, X) so that:

ExecutionLink S X Y

and s(S,X,Y) is the strength of this ExecutionLink. Here we are rewarding Atoms that are produced by simple schemata based on simple inputs.
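The two prior formulas can be sketched directly; the complexity values and the parameter r below are assumed for illustration.

```python
# Sketch of the a priori probability heuristics above: a ProcedureNode P gets
# prior 2^(-r*c(P)); a ConceptNode C gets the sum of sim'(C,P) * 2^(-r*c(P))
# over predicates P. Complexity values and r are assumed for illustration.

def procedure_prior(complexity, r=0.5):
    return 2.0 ** (-r * complexity)

def concept_prior(sim_and_complexity, r=0.5):
    """sim_and_complexity: list of (sim'(C,P), c(P)) pairs, one per predicate P."""
    return sum(s * 2.0 ** (-r * c) for s, c in sim_and_complexity)

print(procedure_prior(0))   # 1.0 (an infinitely simple P)
print(procedure_prior(2))   # 0.5
print(round(concept_prior([(0.8, 2), (0.3, 4)]), 3))  # 0.475
```

In practice, as the text notes, only the terms with large sim'(C,P) need be computed, since low-similarity predicates contribute negligibly to the sum.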
The combined result of these heuristics is to cause the system to prefer simpler explanations, analyses, procedures and ideas. But of course this is only an a priori preference, and if more complex entities prove more useful, these will quickly gain greater strength and importance in the system.

Implementationally, these various processes are carried out by the OccamsRazor MindAgent. This dynamic selects ConceptNodes based on a combination of:

• importance
• time since the a priori probability was last updated (a long time is preferred)

It selects ExecutionLinks based on importance and on the amount of time since they were last visited by the OccamsRazor MindAgent. And it selects PredicateNodes based on importance, filtering out PredicateNodes it has visited before.
Chapter 42
Map Formation

42.1 Introduction

In Chapter 20 we distinguished the explicit versus implicit aspects of knowledge representation in CogPrime. The explicit level consists of Atoms with clearly comprehensible meanings, whereas the implicit level consists of "maps" - collections of Atoms that become important in a coordinated manner, analogously to cell assemblies in an attractor neural net. The combination of the two is valuable because the world-patterns useful to human-like minds in achieving their goals involve varying degrees of isolation and interpenetration, and their effective goal-oriented processing involves both symbolic manipulation (for which explicit representation is most valuable) and associative, creative manipulation (for which distributed, implicit representation is most valuable).

The chapters since have focused primarily on explicit representation, commenting on the implicit "map" level only occasionally. There are two reasons for this: one theoretical, one pragmatic. The theoretical reason is that the majority of map dynamics and representations are implicit in Atom-level correlates. And the pragmatic reason is that, at this stage, we simply do not know as much about CogPrime maps as we do about CogPrime Atoms. Maps are emergent entities and, lacking a detailed theory of CogPrime dynamics, the only way we have to study them in detail is to run CogPrime systems and mine their System Activity Tables and logs for information. If CogPrime research goes well, then updated versions of this book may include more details on observed map dynamics in various contexts.

In this chapter, however, we finally turn our gaze directly to maps and their relationships to Atoms, and discuss processes that convert Atoms into maps (expansion) and vice versa (encapsulation). These processes represent a bridge between the concretely-implemented and emergent aspects of CogPrime's mind.
Map encapsulation is the process of recognizing Atoms that tend to become important in a coordinated manner, and then creating new Atoms grouping these. As such it is essentially a form of AtomSpace pattern mining. In terms of patternist philosophy, map encapsulation is a direct incarnation of the so-called "cognitive equation"; that is, the process by which the mind recognizes patterns in itself, and then embodies these patterns as new content within
itself - an instance of what Hofstadter famously labeled a "strange loop". In SMEPH terms, the encapsulation process is how CogPrime explicitly studies its own derived hypergraph and then works to implement this derived hypergraph more efficiently by recapitulating it at the concretely-implemented-mind level. This of course may change the derived hypergraph considerably. Among other things, map encapsulation has the possibility of taking the things that were the most abstract, highest-level patterns in the system and forming new patterns involving them and their interrelationships - thus building the highest level of patterns in the system higher and higher. Figures 42.2 and 42.1 illustrate concrete examples of the process.

Fig. 42.1: Illustration of the process of creating explicit Atoms corresponding to a pattern previously represented as a distributed "map."

Map expansion, on the other hand, is the process of taking knowledge that is explicitly represented and causing the AtomSpace to represent it implicitly, on the map level. In many cases this will happen automatically. For instance, a ConceptNode C may turn into a concept map if the importance updating process iteratively acts in such a way as to create/reinforce a map consisting of C and its relata. Or, an Atom-level InheritanceLink may implicitly spawn a map-level InheritanceEdge (in SMEPH terms). However, there is one important case in which Atom-to-map conversion must occur explicitly: the expansion of compound ProcedureNodes into procedure maps. This must occur explicitly because the process graphs inside ProcedureNodes have no dynamics going on except evaluation; there is no opportunity for them to manifest themselves as maps, unless a MindAgent is introduced that explicitly does so.
Of course, just unfolding a Combo tree into a procedure map doesn't intrinsically make it a significant part of the derived hypergraph - but it opens the door for the inter-cognitive-process integration that may make this occur.
Fig. 42.2: Illustration of the process of creating explicit Atoms corresponding to a pattern previously represented as a distributed "map." (Map formation creates new, abstract ConceptNodes symbolizing the patterns of co-importance it has noted - e.g. a ConceptNode symbolizing a map related to the agent talking, and one symbolizing a map related to happy people; guided by the results of map formation, PLN then seeks specific, related logical relationships such as an implication between the two.)

42.2 Map Encapsulation

Returning to encapsulation: it may be viewed as a form of symbolization, in which the system creates concrete entities to serve as symbols for its own emergent patterns. It can then study an emergent pattern's interrelationships by studying the interrelationships of the symbol with other symbols. For instance, suppose a system has three derived-hypergraph ConceptVertices A, B and C, and observes that:
InheritanceEdge A B
InheritanceEdge B C

Then encapsulation may create ConceptNodes A', B' and C' for A, B and C, and InheritanceLinks corresponding to the InheritanceEdges, where e.g. A' is a set containing all the Atoms contained in the static map A. First-order PLN inference will then immediately conclude:

InheritanceLink A' C'

and it may possibly do so with a higher strength than the strength corresponding to the (perhaps not significant) InheritanceEdge between A and C. But if the encapsulation is done right, then the existence of the new InheritanceLink will indirectly cause the formation of the corresponding:

InheritanceEdge A C

via the further action of inference, which will use (InheritanceLink A' C') to trigger the inference of further inheritance relationships between members of A' and members of C', which will create an emergent inheritance between members of A (the map corresponding to A') and C (the map corresponding to C'). The above example involved the conversion of static maps into ConceptNodes. Another approach to map encapsulation is to represent the fact that a set of Atoms constitutes a map as a predicate; for instance, if the nodes A, B and C are habitually used together, then the predicate P may be formed, where:

P:
AND
    A is used at time T
    B is used at time T
    C is used at time T

The habitualness of A, B and C being used together will be reflected in the fact that P has a surprisingly high truth value. By a simple concept formation heuristic, this may be used to form a link AND(A, B, C), so that:

AND(A, B, C) is used at time T

This composite link AND(A, B, C) is then an embodiment of the map in single-Atom form. Similarly, if a set of schemata is commonly used in a certain series, this may be recognized in a predicate, and a composite schema may then be created embodying the component schemata.
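The detection of such habitually co-used triples can be sketched in a few lines. The following is a minimal Python illustration (not the actual CogPrime code; the usage-log format and function name are our own assumptions): it flags Atom triples whose co-usage frequency is surprisingly high relative to what their individual usage frequencies would predict under independence - exactly the condition under which forming AND(A, B, C) is warranted.

```python
from itertools import combinations

def surprising_co_usage(usage_log, min_ratio=2.0):
    """Flag Atom triples used together far more often than independent
    usage would predict (candidates for an AND(A, B, C) map Atom).
    `usage_log` maps time -> set of Atoms used at that time (assumed format)."""
    n = len(usage_log)
    # marginal usage counts for each Atom
    freq = {}
    for atoms in usage_log.values():
        for a in atoms:
            freq[a] = freq.get(a, 0) + 1
    found = []
    for triple in combinations(sorted(freq), 3):
        # observed co-usage count of the triple
        co = sum(1 for atoms in usage_log.values() if set(triple) <= atoms)
        # expected count under an independence assumption
        expected = n * (freq[triple[0]] / n) * (freq[triple[1]] / n) * (freq[triple[2]] / n)
        if co > 0 and co / expected >= min_ratio:
            found.append(triple)
    return found
```

The ratio of observed to expected co-usage plays the role of the "surprisingly high truth value" of P above.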
For instance, suppose it is recognized as a pattern that:

AND
    S1 is used at time T on input I1, producing output O1
    S2 is used at time T+s on input O1, producing output O2

Then we may explicitly create a schema that consists of S1 taking input and feeding its output to S2. This cannot be done via any standard concept formation heuristic; it requires a special process. One might wonder why this Atom-to-map conversion process is necessary: why not just let maps combine to build new maps, hierarchically, rather than artificially transforming some maps into Atoms and letting maps then form from these map-representing Atoms? It is all a matter of precision. Operations on the map level are fuzzier and less reliable than operations on the Atom level. This fuzziness has its positive and its negative aspects. For example, it is
good for spontaneous creativity, but bad for constructing lengthy, confident chains of thought.

42.3 Atom and Predicate Activity Tables

A major role in map formation is played by a collection of special tables. Map encapsulation takes place, not by data mining directly on the AtomTable, but by data mining on these special tables constructed from the AtomTable, specifically with efficiency of map mining in mind. First, there is the Atom Utilization Table, which may be derived from the SystemActivityTable. The Atom Utilization Table, in its simplest possible version, takes the form shown in Table 42.1.

Time | Atom Handle H | ...
T    | (effort spent on Atom H at time T, utility derived from Atom H at time T) | ...

Table 42.1: Atom Utilization Table

The calculation of "utility" values for this purpose must be done in a "local" way by MindAgents, rather than by a global calculation of the degree to which utilizing a certain Atom has led to the achievement of a certain system goal (this kind of global calculation would be better in principle, but it would require massive computational effort to calculate for every Atom in the system at frequent intervals). Each MindAgent needs to estimate how much utility it has obtained from a given Atom, as well as how much effort it has spent on this Atom, and report these numbers to the Atom Utilization Table. The normalization of effort values is simple, since effort can be quantified in terms of time and space expended. Normalization of utility values is harder, as it is difficult to define a common scale to span all the different MindAgents, which in some cases carry out very different sorts of operations. One reasonably "objective" approach is to assign each MindAgent an amount of "utility credit", at time T, equal to the amount of currency that the MindAgent has spent since it last disbursed its utility credits.
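The utility-credit idea admits a very simple formalization. The sketch below is illustrative Python (the function name and data format are our own assumptions, not the OpenCog API): it divides a MindAgent's credit - the currency it has spent since the last disbursement - among the Atoms it utilized, proportionally to the agent's local utility estimates, yielding values on a common currency scale.

```python
def disburse_utility_credit(currency_spent, raw_utilities):
    """Split a MindAgent's utility credit among the Atoms it used,
    proportionally to the agent-local raw utility estimates.
    `raw_utilities` maps Atom -> this agent's utility estimate (assumed format)."""
    total = sum(raw_utilities.values())
    if total == 0:
        # no utility reported: no credit is assigned to any Atom
        return {atom: 0.0 for atom in raw_utilities}
    return {atom: currency_spent * u / total
            for atom, u in raw_utilities.items()}
```

Because every agent disburses exactly the currency it spent, the resulting per-Atom utilities are comparable across MindAgents with very different internal operations.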
It may then divide up its utility credit among the Atoms it has utilized. Other reasonable approaches may also be defined. The use of utility and utility credit for Atoms and MindAgents is similar to the stimulus used in the attention allocation system. There, MindAgents reward Atoms with stimulus to indicate that their short- and long-term importance should be increased. Merging utility and stimulus is a natural approach to implementing utility in OpenCogPrime. Note that there are many practical manifestations that the abstract notion of an ActivityTable may take. It could be an ordinary row-and-column style table, but that is not the only nor the most interesting possibility. An ActivityTable may also be effectively stored as a series of graphs corresponding to time intervals - one graph for each interval, consisting of HebbianLinks formed solely based on importance during that interval. In this case it is basically a set of graphs, which may be stored for instance in an AtomTable, perhaps with a special index. Then there is the Procedure Activity Table, which records the inputs and outputs associated with procedures:
Time | ProcedureNode Handle H | ...
T    | (inputs to H, outputs from H) | ...

Table 42.2: Procedure Activity Table for a Particular MindAgent

Data mining on these tables may be carried out by a variety of algorithms (see MapMining) - the more advanced the algorithm, the fuller the transfer from the derived-hypergraph level to the concretely-implemented level. There is a tradeoff here similar to that with attention allocation - if too much time is spent studying the derived hypergraph, then there will not be any interesting cognitive dynamics going on anymore because other cognitive processes get no resources, so the map encapsulation process will fail because there is nothing to study! These same tables may be used in the attention allocation process, for assigning of MindAgent-specific AttentionValues to Atoms.

42.4 Mining the AtomSpace for Maps

Searching for general maps in a complex AtomSpace is an unrealistically difficult problem, as the search space is huge. So, the bulk of map-mining activity involves looking for the most simple and obvious sorts of maps. A certain amount of resources may also be allocated to looking for subtler maps using more resource-intensive methods. The following categories of maps can be searched for at relatively low cost:

• Static maps
• Temporal motif maps

Conceptually, a static map is simply a set of Atoms that all tend to be active at the same time. Next, by a "temporal motif map" we mean a set of pairs (Ai, ti) of the type (Atom, int), so that for many activation cycle indices T, Ai is highly active at some time very close to index T + ti. The reason both static maps and temporal motif maps are easy to recognize is that they are both simply repeated patterns. Perceptual context formation involves a special case of static and temporal motif mining.
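To make the notion of a temporal motif map concrete, the sets a motif miner would search over can be constructed as follows. This is an illustrative Python sketch (the time -> active-Atoms log format and function name are our own assumptions): for each activation cycle t it builds the set S(t) of (Atom, offset) pairs with the Atom firing at time t + offset, for offsets within a window W.

```python
def motif_transactions(activity, window):
    """Build, for each activation cycle t, the set S(t) of (Atom, offset)
    pairs such that the Atom fires at time t + offset, 0 <= offset <= window.
    `activity` maps time -> set of firing Atoms (assumed format)."""
    times = sorted(activity)
    return {t: {(a, s) for s in range(window + 1)
                for a in activity.get(t + s, ())}
            for t in times}
```

A motif that recurs - the same set of (Atom, offset) pairs appearing in many S(t) - is then detectable by exactly the same repeated-pattern search used for static maps.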
In perceptual context formation, one specifically wishes to mine maps involving perceptual nodes associated with a single interaction channel (see Chapter 26 for the notion of an interaction channel). These maps then represent real-world contexts that may be useful in guiding real-world-oriented goal activity (via schema-context-goal triads). In CogPrime so far we have considered three broad approaches for mining static and temporal motif maps from AtomSpaces:
• Frequent subgraph mining, frequent itemset mining, or other sorts of datamining on Activity Tables
• Clustering on the network of HebbianLinks
• Evolutionary optimization based datamining on Activity Tables

The first two approaches are significantly more time-efficient than the latter, but also significantly more limited in the scope of patterns they can find. Any of these approaches can be used to look for maps subject to several types of constraints, such as for instance:

• Unconstrained: maps may contain any kinds of Atoms
• Strictly constrained: maps may only contain Atom types contained on a certain list
• Probabilistically constrained: maps must contain Atom types contained on a certain list, as x% of their elements
• Trigger-constrained: the map must contain an Atom whose type is on a certain list, as its most active element

Different sorts of constraints will lead to different sorts of maps, of course. We don't know at this stage which sorts of constraints will yield the best results. Some special cases, however, are reasonably well understood. For instance:

• Procedure encapsulation, to be discussed below, involves searching for (strictly-constrained) maps consisting solely of ProcedureInstanceNodes.
• To enhance goal achievement, it is likely useful to search for trigger-constrained maps triggered by GoalNodes.

What the MapEncapsulation CIM-Dynamic (Concretely-Implemented-Mind Dynamic; see Chapter 19) does once it finds a map is dependent upon the type of map it's found. In the special case of procedure encapsulation, it creates a compound ProcedureNode (selecting SchemaNode or PredicateNode based on whether the output is a TruthValue or not). For static maps, it creates a ConceptNode, which links to all members of the map with MemberLinks, the weight of which is determined by the degree of map membership. For dynamic maps, it creates PredictiveImplication links depicting the pattern of change.
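The constraint regimes listed above are easy to state precisely. The following sketch (illustrative Python; the map representation, parameter names and function name are our own assumptions) expresses each regime as a predicate over a candidate map, given each Atom's type:

```python
def satisfies_constraint(map_atoms, mode, allowed=None, min_fraction=0.0,
                         trigger_types=None, most_active=None):
    """Check a candidate map against one of the constraint regimes.
    `map_atoms` maps each Atom name to its type (illustrative format)."""
    types = list(map_atoms.values())
    if mode == "unconstrained":
        return True                                   # any Atoms allowed
    if mode == "strict":
        return all(t in allowed for t in types)       # only listed types
    if mode == "probabilistic":
        # at least min_fraction of the elements have a listed type
        return sum(t in allowed for t in types) / len(types) >= min_fraction
    if mode == "trigger":
        # the most active element must have a listed type
        return map_atoms[most_active] in trigger_types
    raise ValueError(mode)
```

Procedure encapsulation, for example, corresponds to the "strict" mode with the allowed-type list containing only ProcedureInstanceNode types.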
42.4.1 Frequent Itemset Mining for Map Mining

One class of technique that is useful here is frequent itemset mining (FIM), a process that looks to find all frequent combinations of items occurring in a set of data. Another useful class of algorithms is greedy or stochastic itemset mining, which does roughly the same thing as FIM but without being completely exhaustive (the advantage being greater execution speed). Here we will discuss FIM, but the basic concepts are the same if one is doing greedy or stochastic mining instead. The basic goal of frequent itemset mining is to discover frequent subsets in a group of items. One knows that for a set of N items, there are 2^N - 1 possible nonempty subsets. To avoid the exponential explosion of subsets, one may compute the frequent itemsets in several rounds. Round i computes all frequent i-itemsets.
A round has two steps: candidate generation and candidate counting. In the candidate generation step, the algorithm generates a set of candidate i-itemsets whose support - the percentage of events in which the itemset appears - has not yet been computed. In the candidate-counting step, the algorithm scans its memory database, counting the support of the candidate itemsets. After the scan, the algorithm discards candidates with support lower than the specified minimum (an algorithm parameter) and retains only the frequent i-itemsets. The algorithm reduces the number of tested subsets by pruning a priori those candidate itemsets that cannot be frequent, based on the knowledge about infrequent itemsets obtained from previous rounds. So for instance if {A, B} is a frequent 2-itemset then {A, B, C} may possibly be a frequent 3-itemset; on the contrary, if {A, B} is not a frequent itemset then {A, B, C}, as well as any superset of {A, B}, will be discarded. Although the worst case of this sort of algorithm is exponential, practical executions are generally fast, depending essentially on the support limit. To apply this kind of approach to search for static maps, one simply creates a large set of sets of Atoms - one set for each time-point. In the set S(t) corresponding to time t, we place all Atoms that were firing activation at time t. The itemset miner then searches for sets of Atoms that are subsets of many different S(t) corresponding to many different times t. These are Atom sets that are frequently co-active. Table ?? presents a typical example of data prepared for frequent itemset mining, in the context of context formation via static-map recognition. Columns represent important nodes and rows indicate time slices. For simplicity, we have thresholded the values and show only activity values, so that a 1 in a cell indicates that the Atom indicated by the column was being utilized at the time indicated by the row.
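An Apriori-style miner over such thresholded activity rows can be sketched directly: each row becomes a transaction S(t), and frequent sets are grown level by level, pruning any candidate with an infrequent subset. The following is an illustrative Python sketch, not the CogPrime implementation:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search for frequently co-active Atom sets.
    `transactions` is a list of sets: S(t) = Atoms active at time t."""
    n = len(transactions)
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n
    items = sorted({a for t in transactions for a in t})
    # round 1: frequent singletons
    current = [frozenset([a]) for a in items
               if support(frozenset([a])) >= min_support]
    result = list(current)
    while current:
        # candidate generation: join frequent k-itemsets into (k+1)-itemsets
        candidates = {a | b for a in current for b in current
                      if len(a | b) == len(a) + 1}
        # prune candidates with any infrequent k-subset, then count support
        current = [c for c in candidates
                   if all(frozenset(s) in result
                          for s in combinations(c, len(c) - 1))
                   and support(c) >= min_support]
        result.extend(current)
    return result
```

On six rows in which the hypothetical node sets {Q, R} and {Q, T, U} each appear in half the rows, a 50% support threshold recovers exactly those two co-activity sets (plus their subsets).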
In the example, if we assume a minimum support of 50 percent, the context nodes C1 = {Q, R} and C2 = {Q, T, U} would be created. Using frequent itemset mining to find temporal motif maps is a similar, but slightly more complex process. Here, one fixes a time-window W. Then, for each activation cycle index t, one creates a set S(t) consisting of pairs of the form (A, s), where A is an Atom and 0 ≤ s ≤ W is an integer temporal offset. We have (A, s) "within" S(t) if Atom A is firing activation at time t+s. Itemset mining is then used to search for common subsets among the S(t). These common subsets are common patterns of temporal activation, i.e. repeated temporal motifs. The strength of this approach is its ability to rapidly search through a huge space of possibly significant subsets. Its weakness is its restriction to finding maps that can be incrementally built up from smaller maps. How significant this weakness is depends on the particular statistics of map occurrence in CogPrime. Intuitively, we believe frequent itemset mining can perform rather well in this context, and our preliminary experiments have supported this intuition.

Frequent Subgraph Mining for Map Mining

A limitation of FIM techniques, from a CogPrime perspective, is that they are intended for relational databases (RDBs); but the information about co-activity in a CogPrime instance is generally going to be more efficiently stored as graphs rather than RDBs. Indeed an ActivityTable may be effectively stored as a series of graphs corresponding to time intervals - one
graph for each interval, consisting of HebbianLinks formed solely based on importance during that interval. From ActivityTable stores like this, the way to find maps is not frequent itemset mining but rather frequent subgraph mining - a variant of FIM that is conceptually similar but algorithmically more subtle, and on which there has arisen a significant literature in recent years. We have already briefly discussed this technology in Chapter 37 on pattern mining the Atomspace - map mining being an important special case of Atomspace pattern mining. As noted there, some of the many approaches to frequent subgraph mining are described in [HWP03, KK01].

42.4.2 Evolutionary Map Detection

Just as general Atomspace pattern mining may be done via evolutionary learning as well as greedy mining, the same holds for the special case of map mining. Complementary to the itemset mining approach, the CogPrime design also uses evolutionary optimization to find maps. Here the data setup is the same as in the itemset mining case, but instead of using an incremental search approach, one sets up a population of subsets of the sets S(t), and seeks to evolve the population to find optimally fit subsets. Fitness is defined simply as high frequency - relative to the frequency one would expect based on statistical independence assumptions alone. In principle one could use evolutionary learning to do all map encapsulation, but this isn't computationally feasible - it would limit too severely the amount of map encapsulation that could be done. Instead, evolutionary learning must be supplemented by some more rapid, less expensive technique.

42.5 Map Dynamics

Assume one has a collection of Atoms, with:

• Importance values I(A), assigned via the economic attention allocation mechanism.
• HebbianLink strengths (HebbianLink A B).tv.s, assigned as (loosely speaking) the probability of B's importance assuming A's importance.
Then, one way to search for static maps is to look for collections C of Atoms that are strong clusters according to HebbianLinks. That is, for instance, to find collections C so that:

• The mean strength of (HebbianLink A B).tv.s, where A and B are in the collection C, is large.
• The mean strength of (HebbianLink A Z).tv.s, where A is in the collection C and Z is not, is small.

(This is just a very simple cluster quality measurement; there is a variety of other cluster quality measurements one might use instead.) Dynamic maps may be more complex; for instance, there might be two collections C1 and C2 so that:

• Mean strength of (HebbianLink A B).tv.s, where A is in C1 and B is in C2
• Mean strength of (HebbianLink B A).tv.s, where B is in C2 and A is in C1

are both very large. A static map will tend to be an attractor for CogPrime's attention-allocation-based dynamics, in the sense that when a few elements of the map are acted upon, it is likely that other elements of the map will soon also come to be acted upon. The reason is that, if a few elements of the map are acted upon usefully, then their importance values will increase. Node probability inference based on the HebbianLinks will then cause the importance values of the other nodes in the map to increase, thus increasing the probability that the other nodes in the map are acted upon. Critical here is that the HebbianLinks have a higher weight of evidence than the node importance values. This is because the node importance values are assumed to be ephemeral - they reflect whether a given node is important at a given moment or not - whereas the HebbianLinks are assumed to reflect longer-lasting information. A dynamic map will also be an attractor, but of a more complex kind. The example given above, with C1 and C2, will be a periodic attractor rather than a fixed-point attractor.

42.6 Procedure Encapsulation and Expansion

One of the most important special cases of map encapsulation is procedure encapsulation. This refers to the process of taking a schema/predicate map and embodying it in a single ProcedureNode. This may be done by mining on the Procedure Activity Table, described in Section 42.3, using either:

• a special variant of itemset mining that searches for procedures whose outputs serve as inputs for other procedures, or
• evolutionary optimization with a fitness function that restricts attention to sets of procedures that form a digraph, where the procedures lie at the vertices and an arrow from vertex A to vertex B indicates that the outputs of A become the inputs of B.
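The itemset-mining variant mentioned in the first bullet can be sketched as a simple co-occurrence count over Procedure Activity Table rows. In the illustrative Python sketch below (the row format and all names are our own assumptions), a pair (A, B) becomes a pipeline candidate when A's outputs are later consumed as B's inputs sufficiently often:

```python
def frequent_pipelines(rows, min_count=2):
    """Find procedure pairs (A, B) where A's output is later consumed as
    B's input - candidates for encapsulation into a composite procedure.
    `rows` is a list of (time, procedure, inputs, outputs) tuples,
    an illustrative stand-in for the Procedure Activity Table."""
    counts = {}
    for t1, p1, _, out1 in rows:
        for t2, p2, in2, _ in rows:
            # A fires before B, and some output of A is an input of B
            if t2 > t1 and set(out1) & set(in2):
                counts[(p1, p2)] = counts.get((p1, p2), 0) + 1
    return [pair for pair, c in counts.items() if c >= min_count]
```

Each pair passing the threshold corresponds to an edge of the digraph mentioned in the second bullet, so the same counts can also seed the evolutionary fitness function.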
The reverse of this process, procedure expansion, is also interesting, though algorithmically easier - here one takes a compound ProcedureNode and expands its internals into a collection of appropriately interlinked ProcedureNodes. The challenge here is to figure out where to split a complex Combo tree into subtrees. But if the Combo tree has a hierarchical structure then this is very simple; the hierarchical subunits may simply be split into separate ProcedureNodes. These two processes may be used in sequence to interesting effect: expanding an important compound ProcedureNode so it can be modified via reinforcement learning, then encapsulating its modified version for efficient execution, then perhaps expanding this modified version later on. To an extent, the existence of these two different representations of procedures is an artifact of CogPrime's particular software design (and ultimately, a reflection of certain properties of the von Neumann computing architecture). But it also represents a more fundamental dichotomy, between:

• Procedures represented in a way that allows them to be dynamically, improvisationally restructured via interaction with other mental processes during the execution process.
• Procedures represented in a way that is relatively encapsulated and mechanical, allowing collaboration with other aspects of the mind during execution only in fairly limited ways.

Conceptually, we believe that this is a very useful distinction for a mind to make. In nearly any reasonable cognitive architecture, it's going to be more efficient to execute a procedure if that procedure is treated as something with a relatively rigid structure, so it can simply be executed without worrying about interactions except in a few specific regards. This is a strong motivation for an artificial cognitive system to have a dual (at least) representation of procedures, or else a subtle representation that is flexible regarding its degree of flexibility, and automagically translates constraint into efficiency.

42.6.1 Procedure Encapsulation in More Detail

A procedure map is a temporal motif: it is a set of Atoms (ProcedureNodes), which are habitually executed in a particular temporal order, and which implicitly pass arguments amongst each other. For instance, if procedure A acts to create node X, and procedure B then takes node X as input, then we may say that A has implicitly passed an argument to B. The encapsulation process can recognize some very subtle patterns, but a fair fraction of its activity can be understood in terms of some simple heuristics. For instance, the map encapsulation process will create a node h = B f g = f ∘ g (f composed with g; B as in combinatory logic) when there are many examples in the system of:

ExecutionLink g x y
ExecutionLink f y z

The procedure encapsulation process will also recognize larger repeated subgraphs, and their patterns of execution over time. But some of its recognition of larger subgraphs may be done incrementally, by repeated recognition of simple patterns like the ones just described.
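A minimal version of this heuristic simply counts, over a collection of (schema, input, output) ExecutionLink triples, how often some f consumes exactly what some g produced. The sketch below is illustrative Python (the triple format and names are our own assumptions); pairs passing the threshold are candidates for creating the composite node h = f ∘ g:

```python
def composition_candidates(execution_links, min_examples=2):
    """Detect (f, g) pairs with many instances of
        ExecutionLink g x y   followed by   ExecutionLink f y z,
    suggesting creation of a composite node h = B f g = f . g.
    `execution_links` is a list of (schema, input, output) triples."""
    counts = {}
    for g, x, y in execution_links:
        for f, y2, z in execution_links:
            # f's input matches g's output: one example of f after g
            if y == y2:
                counts[(f, g)] = counts.get((f, g), 0) + 1
    return [pair for pair, c in counts.items() if c >= min_examples]
```

Running this repeatedly over the growing link store, with newly created composite nodes fed back in, gives the incremental recognition of larger subgraphs described above.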
42.6.2 Procedure Encapsulation in the Human Brain

Finally, we briefly discuss some conceptual issues regarding the relation between CogPrime procedure encapsulation and the human brain. Current knowledge of the human brain is weak in this regard, but we won't be surprised if, in time, it is revealed that the brain stores procedures in several different ways, that one distinction between these different ways has to do with degree of openness to interactions, and that the less open ways lead to faster execution. Generally speaking, there is good evidence for a neural distinction between procedural, episodic and declarative memory. But knowledge about distinctions between different kinds of procedural memory is scanter. It is known that procedural knowledge can be "routinized" - so that, e.g., once you get good at serving a tennis ball or solving a quadratic equation, your brain handles the process in a different way than before, when you were learning. And it seems plausible that routinized knowledge, as represented in the brain, has fewer connections
back to the rest of the brain than the pre-routinized knowledge. But there will be much firmer knowledge about such things in the coming years and decades as brain scanning technology advances. Overall, there is more knowledge in cognitive and neural science about motor procedures than cognitive procedures (see e.g. [SW05]). In the brain, much of motor procedural memory resides in the pre-motor area of the cortex. The motor plans stored here are not static entities and are easily modified through feedback, and through interaction with other brain regions. Generally, a motor plan will be stored in a distributed way across a significant percentage of the premotor cortex; and complex or multipart actions will tend to involve numerous sub-plans, executed both in parallel and in serial. Often what we think of as separate/distinct motor-plans may in fact be just slightly different combinations of subplans (a phenomenon also occurring with schema maps in CogPrime). In the case of motor plans, a great deal of the routinization process has to do with learning the timing necessary for correct coordination between muscles and motor subplans. This involves integration of several brain regions - for instance, timing is handled by the cerebellum to a degree, and some motor-execution decisions are regulated by the basal ganglia. One can think of many motor plans as involving abstract and concrete sub-plans. The abstract sub-plans are more likely to involve integration with those parts of the cortex dealing with conceptual thought. The concrete sub-plans have highly optimized timings, based on close integration with cerebellum, basal ganglia and so forth - but are not closely integrated with the conceptualization-focused parts of the brain.
So, a rough CogPrime model of human motor procedures might involve schema maps coordinating the abstract aspects of motor procedures, triggering activity of complex SchemaNodes containing precisely optimized procedures that interact carefully with external actuators.

42.7 Maps and Focused Attention

The cause of map formation is important to understand. Formation of small maps seems to follow from the logic of focused attention, along with hierarchical maps of a certain nature. But the argument for this is somewhat subtle, involving cognitive synergy between PLN inference and economic attention allocation. The nature of PLN is that the effectiveness of reasoning is maximized by (among other strategies) minimizing the number of incorrect independence assumptions. If reasoning on N nodes, the way to minimize independence assumptions is to use the full inclusion-exclusion formula to calculate interdependencies between the N nodes. This involves 2^N terms, one for each subset of the N nodes. Very rarely, in practical cases, will one have significant information about all these subsets. However, the nature of focused attention is that the system seeks to find out about as many of these subsets as possible, so as to be able to make the most accurate possible inferences, hence minimizing the use of unjustified independence assumptions. This implies that focused attention cannot hold too many items within it at one time, because if N is too big, then doing a decent sampling of the subsets of the N items is no longer realistic. So, suppose that N items have been held within focused attention, meaning that a lot of predicates embodying combinations of the N items have been constructed and evaluated and reasoned on. Then, during this extensive process of attentional focus, many of the N items will be found useful in combination with each other - because of the existence of predicates joining the items.
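The combinatorial cost at issue here is easy to exhibit. The following Python sketch (toy event sets and universe size are our own assumptions) evaluates the full inclusion-exclusion formula for the probability that at least one of N events occurs, counting the intersection terms this requires - one for each nonempty subset, i.e. 2^N - 1 in all:

```python
from itertools import combinations

def union_probability(event_sets, universe_size):
    """Full inclusion-exclusion over all 2**N - 1 nonempty subsets of N
    events, illustrating why reasoning without independence assumptions
    on N items costs exponentially many terms."""
    n = len(event_sets)
    total = 0.0
    terms = 0
    for k in range(1, n + 1):
        for subset in combinations(event_sets, k):
            inter = set.intersection(*map(set, subset))
            # sign alternates: + for odd-size subsets, - for even-size
            total += (-1) ** (k + 1) * len(inter) / universe_size
            terms += 1
    return total, terms  # terms == 2**n - 1
```

Already at N = 3 the formula needs 7 intersection terms, and each additional item in the attentional focus doubles the count - the resource pressure that keeps focused attention small.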
Hence, many HebbianLinks will grow between the N items - causing the set of N items to form a map. By this reasoning, it seems that focused attention will implicitly be a map formation process - even though its immediate purpose is not map formation, but rather accurate inference (inference that minimizes independence assumptions by computing as many cross terms as is possible based on available direct and indirect evidence). Furthermore, it will encourage the formation of maps with a small number of elements in them (say, N < 10). However, these elements may themselves be ConceptNodes grouping other nodes together, perhaps grouping together nodes that are involved in maps. In this way, one may see the formation of hierarchical maps, formed of clusters of clusters of clusters..., where each cluster has N < 10 elements in it. These hierarchical maps manifest the abstract dual network concept that occurs frequently in CogPrime philosophy. It is tempting to postulate that any intelligent system must display similar properties - so that focused attention, in general, has a strictly limited scope and causes the formation of maps that have central cores of roughly the same size as its scope. If this is indeed a general principle, it is an important one, because it tells you something about the general structure of derived hypergraphs associated with intelligent systems, based on the computational resource constraints of the systems. The scope of an intelligent system's attentional focus would seem to generally increase logarithmically with the system's computational power. This follows immediately if one assumes that attentional focus involves free intercombination of the items within it.
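The logarithmic-scope claim can be illustrated with a toy model: if attending to N freely intercombining items costs on the order of 2^N elementary operations, then the largest affordable focus is the largest N whose cost fits the budget. A minimal sketch (illustrative only, with an assumed unit cost per combination):

```python
def max_focus_scope(ops_budget, combination_cost=1):
    """Largest N with combination_cost * 2**N <= ops_budget: under free
    intercombination of focus contents, attentional scope grows only
    logarithmically with computational power (a toy model)."""
    n = 0
    while combination_cost * 2 ** (n + 1) <= ops_budget:
        n += 1
    return n
```

Doubling the computational budget then enlarges the feasible focus by just one item, which is the logarithmic growth claimed above.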
If attentional focus is the major locus of map formation, then - lapsing into SMEPH-speak - it follows that the bulk of the ConceptVertices in the intelligent system's derived hypergraphs may correspond to maps focused on a fairly small number of other ConceptVertices. In other words, derived hypergraphs may tend to have a fairly localized structure, in which each ConceptVertex has very strong InheritanceEdges pointing from a handful of other ConceptVertices (corresponding to the other things that were in the attentional focus when that ConceptVertex was formed).

42.8 Recognizing and Creating Self-Referential Structures

Finally, this brief section covers a large and essential topic: how CogPrime will be able to recognize and create large-scale self-referential structures. Some of the most essential structures underlying human-level intelligence are self-referential in nature. These include:

• the phenomenal self (see Thomas Metzinger's book "Being No One")
• the will
• reflective awareness

These structures are arguably not critical for basic survival functionality in natural environments. However, they are important for adequate functionality within advanced social systems, and for abstract thinking regarding science, humanities, arts and technology. Recall that in Chapter 3 of Part 1 these entities are formalized in terms of hypersets, and the following recursive definitions are given:
• "S is conscious of X" is defined as: The declarative content that "S is conscious of X" correlates with "X is a pattern in S"
• "S wills X" is defined as: The declarative content that "S wills X" causally implies "S does X"
• "X is part of S's self" is defined as: The declarative content that "X is a part of S's self" correlates with "X is a persistent pattern in S over time"

Relatedly, one may posit multiple similar processes that are mutually recursive, e.g.

• S is conscious of T and U
• T is conscious of S and U
• U is conscious of S and T

The cognitive importance of this sort of mutual recursion is further discussed in Appendix ??.

According to the philosophy underlying CogPrime, none of these are things that should be programmed into an artificial mind. Rather, they must emerge in the course of a mind's self-organization in connection with its environment. However, a mind may be constructed so that, by design, these sorts of important self-referential structures are encouraged to emerge.

42.8.1 Encouraging the Recognition of Self-Referential Structures in the AtomSpace

How can we do this - encourage a CogPrime instance to recognize complex self-referential structures that may exist in its AtomTable? This is important because, according to the same logic as map formation: if these structures are explicitly recognized when they exist, they can then be reasoned on and otherwise further refined, which will then cause them to exist more definitively, and hence to be explicitly recognized as yet more prominent patterns ... etc. The same virtuous cycle via which ongoing map recognition and encapsulation is supposed to lead to concept formation may be posited on the level of complex self-referential structures, leading to their refinement, development and ongoing complexity.

One really simple way is to encode self-referential operators in the Combo vocabulary that is used to represent the procedures grounding GroundedPredicateNodes.
That way, one can recognize self-referential patterns in the AtomTable via standard CogPrime methods like MOSES and integrative procedure and predicate learning as discussed in Chapter 41, so long as one uses Combo trees that are allowed to include self-referential operators at their nodes. All that matters is that one is able to take one of these Combo trees, compare it to an AtomTable, and assess the degree to which that Combo tree constitutes a pattern in that AtomTable.

But how can we do this? How can we match a self-referential structure like:

EquivalenceLink
    EvaluationLink will (S, X)
    CausalImplicationLink
        EvaluationLink will (S, X)
        EvaluationLink do (S, X)

against an AtomTable or portion thereof? The question is whether there is some "map" of Atoms (some set of PredicateNodes) willMap, so that we may infer the SMEPH (see Chapter 14) relationship:
EquivalenceEdge
    EvaluationEdge willMap (S, X)
    CausalImplicationEdge
        EvaluationEdge willMap (S, X)
        EvaluationEdge doMap (S, X)

as a statistical pattern in the AtomTable's history over the recent past. (Here, doMap is defined to be the map corresponding to the built-in "do" predicate.) If so, then this map willMap may be encapsulated in a single new Node (call it willNode), which represents the system's will. This willNode may then be explicitly reasoned upon, used within concept creation, etc. It will lead to the spontaneous formation of a more sophisticated, fully-fleshed-out will map. And so forth.

Now, what is required for this sort of statistical pattern to be recognizable in the AtomTable's history? What is required is that EquivalenceEdges (which, note, must be part of the Combo vocabulary in order for any MOSES-related algorithms to recognize patterns involving them) must be defined according to the logic of hypersets rather than the logic of sets. What is fascinating is that this is no big deal! In fact, the AtomTable software structures support this automatically; it's just not the way most people are used to thinking about things. There is no reason, in terms of the AtomTable, not to create self-referential structures like the one given above.

The next question, though, is how we calculate the truth values of structures like those above. The truth value of a hyperset structure turns out to be an infinite-order probability distribution, which is a complex and peculiar entity [Goe08a]. Infinite-order probability distributions are partially ordered, and so one can compare the extent to which two different self-referential structures apply to a given body of data (e.g. an AtomTable), via comparing the infinite-order distributions that constitute their truth values. In this way, one can recognize self-referential patterns in an AtomTable, and carry out encapsulation of self-referential maps.
This sounds very abstract and complicated, but the class of infinite-order distributions defined in the above-referenced papers actually have their truth values defined by simple matrix mathematics, so there is really nothing that abstruse involved in practice.

Finally, there is the question of how these hyperset structures are to be logically manipulated within PLN. The answer is that regular PLN inference can be applied perfectly well to hypersets, but some additional hyperset operations may also be introduced; these are currently being researched.

Clearly, with this subtle, currently unimplemented component of the CogPrime design we have veered rather far from anything the human brain could plausibly be doing in detail. But yet, some meaningful connections may be drawn. In Chapter 13 of Part 1 we have discussed how probabilistic logic might emerge from the brain, and also how the brain may embody self-referential structures like the ones considered here, via (perhaps using the hippocampus) encoding whole neural nets as inputs to other neural nets. Regarding infinite-order probabilities, it is certainly the case that the brain is efficient at carrying out operations equivalent to matrix manipulations (e.g. in vision and audition), and [Goe08a] reduced infinite-order probabilities to finite matrix manipulations, so that it's not completely outlandish to posit the brain could be doing something mathematically analogous. Thus, all in all, it seems at least plausible that the brain could be doing something roughly analogous to what we've described here, though the details would obviously be very different.
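The flavor of "truth values defined by simple matrix mathematics" can be gestured at with a toy power iteration: a self-referential assignment of truth values is a fixed point of a matrix map. The particular matrices used for hyperset truth values in the cited work are not spelled out here; the matrix below is an arbitrary illustration.

```python
def stationary(mat, iters=1000):
    """Fixed point of a column-stochastic matrix, found by power iteration.

    `mat` is given as a list of rows; column-stochastic means each column
    sums to 1. Iterating the map v -> mat * v converges to the
    self-consistent assignment satisfying mat * v == v.
    """
    n = len(mat)
    v = [1.0 / n] * n
    for _ in range(iters):
        v = [sum(mat[i][j] * v[j] for j in range(n)) for i in range(n)]
    return v

# Toy 2-state self-referential system: each state's "truth" depends on both
m = [[0.9, 0.3],
     [0.1, 0.7]]
v = stationary(m)  # converges to the self-consistent assignment [0.75, 0.25]
```

The fixed point plays the role of a truth value that refers to itself consistently: feeding v back through the matrix reproduces v.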
Section VII

Communication Between Human and Artificial Minds
Chapter 43
Communication Between Artificial Minds

43.1 Introduction

Language is a key aspect of human intelligence, and seems to be one of two critical factors separating humans from other intelligent animals - the other being the ability to use tools. Steven Mithen [Mit96] argues that the key factor in the emergence of the modern human mind from its predecessors was the coming-together of formerly largely distinct mental modules for linguistic communication and tool making/use. Other animals do appear to have fairly sophisticated forms of linguistic communication, which we don't understand very well at present; but as best we can tell, modern human language has many qualitatively different aspects from these, which enable it to synergize effectively with tool making and use, and which have enabled it to co-evolve with various aspects of tool-dependent culture.

Some AGI theorists have argued that, since the human brain is largely the same as that of apes and other mammals without human-like language, the emulation of human-like language is not the right place to focus if one wants to build human-level AGI. Rather, this argument goes, one should proceed in the same order that evolution did - start with motivated perception and action, and then once these are mastered, human-like language will only be a small additional step. We suspect this would indeed be a viable approach - but may not be well suited for the hardware available today. Robot hardware is quite primitive compared to animal bodies, but the kind of motivated perception and action that non-human animals do is extremely body-centric (even more so than is the case in humans). On the other hand, modern computing technology is quite sophisticated as regards language - we program computers (including AIs) using languages of a sort, for example.
This suggests that on a pragmatic basis, it may make sense to start working with language at an earlier stage in AGI development than the analogue with the evolution of natural organisms would suggest.

The CogPrime architecture is compatible with a variety of different approaches to language learning and capability, and frankly at this stage we are not sure which approach is best. Our intention is to experiment with a variety of approaches and proceed pragmatically and empirically. One option is to follow the more "natural" course and let sophisticated non-linguistic cognition emerge first, before dealing with language in any serious way - and then encourage human-like language facility to emerge via experience. Another option is to integrate some sort of traditional computational linguistics system into CogPrime, and then allow CogPrime's learning algorithms to modify this system based on its experience. Discussion of this latter option occupies most of this section of the book - it involves many tricks and compromises, but
could potentially constitute a faster route to success. Yet another option is to communicate with young CogPrime systems using an invented language halfway between the human-language and programming-language domains, such as Lojban (this possibility is discussed in Appendix E).

In this initial chapter on communication, we will pursue a direction quite different from the later chapters, and discuss a kind of communication that we think may be very valuable in the CogPrime domain, although it has no close analogue among human beings. Many aspects of CogPrime closely resemble aspects of the human mind; but in the end CogPrime is not intended as an emulation of human intelligence, and there are some aspects of CogPrime that bear no resemblance to anything in the human mind, but exploit some of the advantages of digital computing infrastructure over neural wetware. One of the latter aspects is Psynese, a word we have introduced to refer to direct mind-to-mind information transfer between artificial minds.

Psynese has some relatively simple practical applications: e.g. it could aid with the use of linguistic resources and hand-coded or statistical language parsers within a learning-based language system, to be discussed in following chapters. In this use case, one sets up one CogPrime using the traditional NLP approaches, and another CogPrime using a purer learning-based approach, and lets the two systems share mind-stuff in a controlled way. Psynese may also be useful in the context of intelligent virtual pets, where one may wish to set up a CogPrime representing "collective knowledge" of multiple virtual pets. But it also has some grander potential implications, such as the ability to fuse multiple AI systems into "mindplexes" as discussed in Chapter 12 of Part 1.

One might wonder why a community of two or more CogPrimes would need a language at all, in order to communicate.
After all, unlike humans, CogPrime systems can simply exchange "brain fragments" - subspaces of their Atomspaces. One CogPrime can just send relevant nodes and links to another CogPrime (in binary form, or in an XML representation, etc.), bypassing the linear syntax of language. This is in fact the basis of Psynese: why transmit linear strings of characters when one can directly transmit Atoms?

But the details are subtler than they might at first seem. One CogPrime can't simply "transfer a thought" to another CogPrime. The problem is that the meaning of an Atom consists largely of its relationships with other Atoms, and so to pass a node to another CogPrime, it also has to pass the Atoms that it is related to, and so on, and so on. Atomspaces tend to be densely interconnected, and so to transmit one thought fully accurately, a CogPrime system is going to end up having to transmit a copy of its entire Atomspace! Even if privacy were not an issue, this form of communication (each utterance coming packaged with a whole mind-copy) would present a rather severe processing load on the communicators involved.

The idea of Psynese is to work around this interconnectedness problem by defining a mechanism for CogPrime instances to query each others' minds directly, and explicitly represent each others' concepts internally. This doesn't involve any unique cognitive operations besides those required for ordinary individual thought, but it requires some unique ways of wrapping up these operations and keeping track of their products. Another idea this leads to is the notion of a PsyneseVocabulary: a collection of Atoms, associated with a community of CogPrimes, approximating the most important Atoms inside that community. The combinatorial explosion of direct-Atomspace communication is then halted by an appeal to standardized Psynese Atoms.
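The interconnectedness problem, and the truncation any practical Atom-transfer scheme must perform, can be sketched in miniature. The graph representation and the depth cutoff below are our illustrative assumptions, not CogPrime data structures:

```python
from collections import deque

def neighborhood(links, start, max_depth):
    """Collect all Atoms reachable from `start` within `max_depth` hops.

    `links` maps each Atom name to the Atoms it is directly related to.
    With no depth limit, a densely connected Atomspace forces the whole
    graph to be shipped; the cutoff is what keeps a Psynese-style
    transfer bounded.
    """
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        atom, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for nbr in links.get(atom, ()):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    return seen

# Toy Atomspace: every concept touches neighbors, which touch more...
links = {
    "crazy": ["insane", "impractical"],
    "insane": ["lunatic", "crazy"],
    "impractical": ["unrealistic"],
    "lunatic": ["asylum"],
}
one_hop = neighborhood(links, "crazy", 1)   # just the direct associations
two_hop = neighborhood(links, "crazy", 2)   # already most of the graph
```

Each extra hop drags in more of the Atomspace, which is why transmitting "one thought fully accurately" degenerates into transmitting everything, and why a shared vocabulary is a better stopping rule than a depth cutoff.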
Pragmatically, a PsyneseVocabulary might be contained in a PsyneseVocabulary server, a special CogPrime instance that exists to mediate communications between other CogPrimes, and provide CogPrimes with information. Psynese makes sense both as a mechanism for peer-to-peer communication between CogPrimes, and as
a mechanism allowing standardized communication between a community of CogPrimes using a PsyneseVocabulary server.

43.2 A Simple Example Using a PsyneseVocabulary Server

Suppose CogPrime 1 wanted to tell CogPrime 2 that "Russians are crazy" (with the latter word meaning something in between "insane" and "impractical"); and suppose that both CogPrimes are connected to the same Psynese CogPrime with PsyneseVocabulary PV. Then, for instance, it must find the Atom in PV corresponding to its concept "crazy." To do this it must create an AtomStructureTemplate such as

Pred1(C1) equals
    ThereExists W1, C2, C3, W2, W3
        AND
            ConceptNode: C1
            ReferenceLink C1 W1
            WordNode: W1 #crazy
            ConceptNode: C2
            HebbianLink C1 C2
            ReferenceLink C2 W2
            WordNode: W2 #insane
            ConceptNode: C3
            HebbianLink C1 C3
            ReferenceLink C3 W3
            WordNode: W3 #impractical

encapsulating relevant properties of the Atom it wants to grab from PV. In this example the properties specified are:

• ConceptNode, linked via a ReferenceLink to the WordNode for "crazy"
• HebbianLinks with ConceptNodes linked via ReferenceLinks to the WordNodes for "insane" and "impractical"

So, what CogPrime 1 can do is fish in PV for "some concept that is denoted by the word 'crazy' and is associated with 'insane' and 'impractical'." The association with "insane" provides more insurance of getting the correct sense of the word "crazy," as opposed to e.g. the one in the phrase "He was crazy about her" or in "That's crazy, man, crazy" (in the latter slang usage "crazy" basically means "excellent"). The association with "impractical" biases away from the interpretation that all Russians are literally psychiatric patients.

So, suppose that CogPrime 1 has fished the appropriate Atoms for "crazy" and "Russian" from PV.
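The "fishing" query just described might be sketched as follows. The dict-of-dicts stand-in for a PsyneseVocabulary server, and the handle numbers, are illustrative assumptions; the real AtomStructureTemplate machinery is far richer, but this mirrors the shape of the "crazy"/"insane"/"impractical" query:

```python
def fish_for_concept(pv, word, associated_words):
    """Return handles of ConceptNodes in `pv` that are named by `word` and
    HebbianLinked to concepts named by each of `associated_words`.

    `pv` is a toy stand-in for a PsyneseVocabulary server: a dict mapping
    concept handles to {"word": ..., "hebbian": set-of-handles}.
    """
    matches = []
    for handle, atom in pv.items():
        if atom["word"] != word:
            continue
        # words attached to the HebbianLinked neighbor concepts
        neighbor_words = {pv[h]["word"] for h in atom["hebbian"]}
        if set(associated_words) <= neighbor_words:
            matches.append(handle)
    return matches

pv = {
    744444: {"word": "crazy", "hebbian": {100, 101}},   # the "insane" sense
    744445: {"word": "crazy", "hebbian": {102}},        # the slang sense
    100: {"word": "insane", "hebbian": set()},
    101: {"word": "impractical", "hebbian": set()},
    102: {"word": "excellent", "hebbian": set()},
}
found = fish_for_concept(pv, "crazy", ["insane", "impractical"])
```

Only the handle whose HebbianLink neighborhood covers both disambiguating words survives; the slang sense of "crazy" is filtered out exactly as the text describes.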
Then it may represent in its Atomspace something we may denote crudely (a better notation will be introduced later) as

InheritanceLink PV:477335:1256953732 PV:744444:1256953735 <.8,.6>

A similar but perhaps more compelling example would be the interpretation of the phrase "the accountant cooked the books." In this case both "cooked" and "books" are used in atypical senses, but specifying a HebbianLink to "accounting" would cause the right Nodes to get retrieved from PV.
where e.g. "PV:744444" means "the Atom with Handle 744444 in CogPrime PV at time 1256953735," and it may also wish to store additional information such as

PsyneseEvaluationLink <.9>
    PV
    Pred1
    PV:744444:1256953735

meaning that Pred1(PV:744444:1256953735) holds true with truth value <.9> if all the Atoms referred to within Pred1 are interpreted as existing in PV rather than CogPrime 1. The InheritanceLink then means: "In the opinion of CogPrime 1, 'Russian' as defined by PV:477335:1256953732 inherits from 'crazy' as defined by PV:744444:1256953735 with truth value <.8,.6>."

Suppose CogPrime 1 then sends the InheritanceLink to CogPrime 2. It is going to be meaningfully interpretable by CogPrime 2 to the extent that CogPrime 2 can interpret the relevant PV Atoms, for instance by finding Atoms of its own that correspond to them. To interpret these Atoms, CogPrime 2 must carry out the reverse process that CogPrime 1 did to find the Atoms in the first place. For instance, to figure out what PV:744444:1256953735 means to it, CogPrime 2 may find some of the important links associated with the Node in PV, and make a predicate accordingly, e.g.:

Pred2(C1) equals
    ThereExists W1, C2, C3, W2, W3
        AND
            ConceptNode: C1
            ReferenceLink C1 W1
            WordNode: W1 #crazy
            ConceptNode: C2
            HebbianLink C1 C2
            ReferenceLink C2 W2
            WordNode: W2 #lunatic
            ConceptNode: C3
            HebbianLink C1 C3
            ReferenceLink C3 W3
            WordNode: W3 #unrealistic

On the other hand, if there is no PsyneseVocabulary involved, then CogPrime 1 can submit the same query directly to CogPrime 2. There is no problem with this, but if there is a reasonably large community of CogPrimes it becomes more efficient for them all to agree on a standard vocabulary of Atoms to be used for communication - just as, at a certain point in human history, it was recognized as more efficient for people to use dictionaries rather than to rely on peer-to-peer methods for resolution of linguistic disagreements.
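The "instance:handle:timestamp" notation used above is easy to pin down as a small parser. The field names below are our own labels for the three components described in the text:

```python
from collections import namedtuple

# A cross-instance Atom reference of the form "PV:744444:1256953735",
# read as "the Atom with Handle 744444 in CogPrime PV at time 1256953735".
AtomRef = namedtuple("AtomRef", "instance handle timestamp")

def parse_atom_ref(text):
    """Split a reference string into its three components."""
    instance, handle, timestamp = text.split(":")
    return AtomRef(instance, int(handle), int(timestamp))

def format_atom_ref(ref):
    """Inverse of parse_atom_ref."""
    return f"{ref.instance}:{ref.handle}:{ref.timestamp}"

ref = parse_atom_ref("PV:744444:1256953735")
```

The timestamp matters because the Atom's links (and hence its meaning) in PV drift over time; a reference pins down which snapshot of the concept was intended.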
The above examples involve human natural language terms, but this does not have to be the case. PsyneseVocabularies can contain Atoms representing quantitative or other types of data, and can also contain purely abstract concepts. The basic idea is the same. A CogPrime has some Atoms it wants to convey to another CogPrime, and it looks in a PsyneseVocabulary to see how easily it can approximate these Atoms in terms of "socially understood" Atoms. This is particularly effective if the CogPrime receiving the communication is familiar with the PsyneseVocabulary in question. Then the recipient may already know the PsyneseVocabulary Atoms it is being pointed to; it may have already thought about the difference between these consensus concepts and its own related concepts. Also, if the sender CogPrime is encapsulating
maps for easy communication, it may specifically seek approximate encapsulations involving PsyneseVocabulary terms, rather than first encapsulating in its own terms and then translating into PsyneseVocabulary terms.

43.2.1 The Psynese Match Schema

One way to streamline the above operations is to introduce a PsyneseMatch Schema, with the property that

ExOut PsyneseMatch PV A

within CogPrime instance CP1, denotes the Atom within CogPrime instance PV that most closely matches the Atom A in CP1. Note that the PsyneseMatch schema implicitly relies on various parameters, because it must encapsulate the kind of process described explicitly in the above example. PsyneseMatch must, internally, decide how many and which Atoms related to A should be used to formulate a query to PV, and also how to rank the responses to the query (e.g. by strength x confidence). Using PsyneseMatch, the example written above as

InheritanceLink PV:477335:1256953732 PV:744444:1256953735 <.8,.6>

could be rewritten as

InheritanceLink <.8,.6>
    ExOut PsyneseMatch PV C1
    ExOut PsyneseMatch PV C2

where C1 and C2 are the ConceptNodes in CP1 corresponding to the intended senses of "crazy" and "Russian."

43.3 Psynese as a Language

The general definition of a Psynese expression for CogPrime is a set of Atoms that contains only:

• Nodes from PsyneseVocabularies
• Perceptual nodes (numbers, words, etc.)
• Relationships relating no nodes other than the ones in the above two categories, and relating no relationships except ones in this category
• Predicates or Schemata involving no relationships or nodes other than the ones in the above three categories, or in this category

The PsyneseEvaluationLink type indicated earlier forces interpretation of a predicate as a Psynese expression.

In what sense is the use of Psynese expressions to communicate a language? Clearly it is a formal language in the mathematical sense. It is not quite a "human language" as we normally
conceive it, but it is ideally suited to serve the same functions for CogPrimes as human language serves for humans. The biggest differences from human language are:

• Psynese uses weighted, typed hypergraphs (i.e. Atomspaces) instead of linear strings of symbols. This eliminates the "parsing" aspect of language (syntax being mainly a way of projecting graph structures into linear expressions).
• Psynese lacks subtle and ambiguous referential constructions like "this", "it" and so forth. These are tools allowing complex thoughts to be compactly expressed in a linear way, but CogPrimes don't need them. Atoms can be named and pointed to directly without complex, poorly-specified mechanisms mediating the process.
• Psynese has far less ambiguity. There may be Atoms with more than one aspect to their meanings, but the cost of clarifying such ambiguities is much lower for CogPrimes than for humans using language, and so habitually there will not be the rampant ambiguity that we see in human expressions.

On the other hand, mapping Psynese into Lojban - a syntactically formal, semantically highly precise language created for communication between humans - rather than a true natural language would be much more straightforward. Indeed, one could create a PsyneseVocabulary based on Lojban, which might be ideally suited to serve as an intermediary between different CogPrimes. And Lojban may be used to create a linearized version of Psynese that looks more like a natural language. We return to this point in Appendix ??.

43.4 Psynese Mindplexes

We now recall from Chapter 12 of Part 1 the notion of a mindplex: that is, an intelligent system that:

1. Is composed of a collection of intelligent systems, each of which has its own "theater of consciousness" and autonomous control system, but which interact tightly, exchanging large quantities of information frequently
2.
Has a powerful control system on the collective level, and an active "theater of consciousness" on the collective level as well

In informal discussions, we have found that some people, on being introduced to the mindplex concept, react by contending that either human minds or human social groups are mindplexes. However, we believe that, while there are significant similarities between mindplexes and minds, and between mindplexes and social groups, there are also major qualitative differences. It's true that an individual human mind may be viewed as a collective, both from a theory-of-cognition perspective (e.g. Minsky's "society of mind" theory [Min88]) and from a personality-psychology perspective (e.g. the theory of subpersonalities [Row90]). And it's true that social groups display some autonomous control and some emergent-level awareness. However, in a healthy human mind, the collective level rather than the cognitive-agent or subpersonality level is dominant, the latter existing in service of the former; and in a human social group, the individual-human level is dominant, the group-mind clearly "cognizing" much more crudely than its individual-human components, and exerting most of its intelligence via its impact on individual human minds. A mindplex is a hypothetical intelligent system in which neither level is dominant, and both levels are extremely powerful. A mindplex is like a human mind in
which the subpersonalities are fully-developed human personalities, with full independence of thought, and yet the combination of subpersonalities is also an effective personality. Or, from the other direction, a mindplex is like a human society that has become so integrated and so cohesive that it displays the kind of consciousness and self-control that we normally associate with individuals.

There are two mechanisms via which mindplexes may possibly arise in the medium-term future:

1. Humans becoming more tightly coupled via the advance of communication technologies, and a communication-centric AI system coming to embody the "emergent conscious theater" of a human-incorporating mindplex
2. A society of AI systems communicating amongst each other with a richness not possible for human beings, and coming to form a mindplex rather than merely a society of distinct AIs

The former sort of mindplex relates to the concept of a "global brain" discussed in Chapter 12 of Part 1. Of course, these two sorts of mindplexes are not mutually contradictory, and may coexist or fuse. The possibility also exists for higher-order mindplexes, meaning mindplexes whose component minds are themselves mindplexes. This would occur, for example, if one had a mindplex composed of a family of closely-interacting AI systems, which acted within a mindplex associated with the global communication network.

Psynese, however, is more directly relevant to the latter form of mindplex. It gives a concrete mechanism via which such a mindplex might be sculpted.

43.4.1 AGI Mindplexes

How does one get from CogPrimes communicating via Psynese to CogPrime mindplexes? Clearly, with the Psynese mode of communication, the potential is there for much richer communication than exists between humans.
There are limitations, posed by the private nature of many concepts - but these limitations are much less onerous than for human language, and can be overcome to some extent by the learning of complex cognitive schemata for translation between the "private languages" of individual Atomspaces and the "public languages" of Psynese servers.

But rich communication does not in itself imply the evolution of mindplexes. It is possible that a community of Psynese-communicating CogPrimes might spontaneously evolve a mindplex structure - at this point, we don't know enough about CogPrime individual or collective dynamics to say. But it is not necessary to rely on spontaneous evolution. In fact it is possible, and even architecturally simple, to design a community of CogPrimes in such a way as to encourage and almost force the emergence of a mindplex structure.

The solution is simple: beef up the PsyneseVocabulary servers. Rather than relatively passive receptacles of knowledge from the CogPrimes they serve, allow them to be active, creative entities, with their own feelings, goals and motivations. The PsyneseVocabulary servers serving a community of CogPrimes are absolutely critical to these CogPrimes. Without them, high-level inter-CogPrime communication is effectively impossible. And without the concepts the PsyneseVocabularies supply, high-level individual CogPrime thought will be difficult, because CogPrimes will come to think in Psynese to at least the same extent to which humans think in language.
Suppose each PsyneseVocabulary server has its own full CogPrime mind, its own "conscious theater". These minds are in a sense "emergent minds" of the CogPrime community they serve - because their contents are a kind of "nonlinear weighted average" of the mind-contents of the community. Furthermore, the actions these minds take will feed back and affect the community in direct and indirect ways - by affecting the language by which the minds communicate. Clearly, the definition of a mindplex is fulfilled.

But what will the dynamics of such a CogPrime mindplex be like? What will be the properties of its cognitive and personality psychology? We could speculate on this here, but would have very little faith in the possible accuracy of our speculations. The psychology of mindplexes will reveal itself to us experimentally as our work on AGI engineering, education and socialization proceeds.

One major issue that arises, however, is that of personality filtering. Put simply: each intelligent agent in a mindplex must somehow decide for itself which knowledge to grab from available PsyneseVocabulary servers and other minds, and which knowledge to avoid grabbing from others in the name of individuality. Different minds may make different choices in this regard. For instance, one choice could be to, as a matter of routine, take only extremely confident knowledge from the PsyneseVocabulary server. This corresponds roughly to ingesting "facts" from the collective knowledge pool, but not opinions or speculations. Less confident knowledge would then be ingested from the collective knowledge pool on a carefully calculated and as-needed basis. Another choice could be to accept only small networks of Atoms from the collective knowledge pool, on the principle that these can be reflectively understood as they are ingested, whereas large networks of Atoms are difficult to deliberate and reflect about.
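The two example policies just described might be sketched as a simple ingestion filter. The data shape and the threshold values are illustrative assumptions, not part of the CogPrime design:

```python
def filter_ingest(collective_atoms, confidence_floor=0.9, max_network_size=5):
    """Apply the two example policies from the text: keep only knowledge
    whose confidence is high enough ("facts", not opinions), and only
    networks of Atoms small enough to be reflected on as they arrive.

    Each item is a toy (atom_network, confidence) pair.
    """
    accepted = []
    for network, confidence in collective_atoms:
        if confidence < confidence_floor:
            continue  # opinion/speculation: skip, or fetch later as needed
        if len(network) > max_network_size:
            continue  # too large to reflectively understand on ingestion
        accepted.append(network)
    return accepted

pool = [
    (["cat", "mammal"], 0.99),                 # small and confident: ingest
    (["russians", "crazy"], 0.6),              # speculative: leave in pool
    ([f"atom{i}" for i in range(20)], 0.95),   # confident but too large
]
kept = filter_ingest(pool)
```

Different agents tuning `confidence_floor` and `max_network_size` differently is one concrete way "different minds may make different choices" about individuality.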
But any policies like this are merely heuristic ones.

43.5 Psynese and Natural Language Processing

Next we review a more near-term, practical application of the Psynese mechanism: the fusion of two different approaches to natural language processing in CogPrime, the experiential learning approach and the "engineered NLP subsystem" approach.

In the former approach, language is not given any extremely special role, and CogPrime is expected to learn language much as it would learn any other complex sort of knowledge. There may of course be learning biases programmed into the system, to enable it to learn language based on its experience more rapidly. But there is no concrete linguistic knowledge programmed in. In the latter approach, one may use knowledge from statistical corpus analysis, one may use electronic resources like WordNet and FrameNet, and one may use sophisticated, specialized tools like natural language parsers with hand-coded grammars. Rather than trying to emulate the way a human child learns language, one is trying to emulate the way a human adult comprehends and generates language.

Of course, there is not really a rigid dichotomy between these two approaches. Many linguistic theorists who focus on experiential learning also believe in some form of universal grammar, and would advocate for an approach where learning is foundational but is biased by in-built abstract structures representing universal grammar. There is of course very little knowledge (and few detailed hypotheses) about how universal grammar might be encoded in the human brain, though there is reason to think it may be at a very abstract level, due to the significant
43.5 Psynese and Natural Language Processing 377 overlaps between grammatical structure, social role structure ICE3001, and physical reasoning [Cam; The engineered approach to NLP provides better functionality right "out of the box," and enables the exploitation of the vast knowledge accumulated by computational linguists in the past decades. However, we suspect that computational linguistics may have hit a ceiling in some regards, in terms of the quality of the language comprehension and generation that it can deliver. It runs up against problems related to the disambiguation of complex syntactic constructs, which don't seem to be resolvable using either a tractable number of hand-coded rules, or supervised or unsupervised learning based on a tractably large set of examples. This conclusion may be disputed, and some researchers believe that statistical computational linguistics can eventually provide human-level functionality, once the Web becomes a bit larger and the computers used to analyze it become a bit more powerful. But in our view it is interesting to explore hybridization between the engineered and experiential approaches, with the motivation that the experiential approach may provide a level of flexibility and insight at dealing with ambiguity that the engineered approach apparently lacks. After all, the way a human child deals with the tricky disambiguation problems that stump current computational linguistics systems is not via analysis of trillion-word corpuses, bur rather via correlating language with non-linguistic experience. One may argue that the genome implic- itly contains a massive corpus of speech, but there it's also to be noted that this is experientially contextualized speech. And it seems clear front the psycholinguistic evidence Romo:31 that for young human children, language is part and parcel of social and physical experience, learned in a manner that's intricately tied up with the learning of many other sorts of skills. 
One interesting approach to this sort of hybridization, using Psynese, is to create multiple CogPrime instances taking different approaches to language learning, and let them communicate. Most simply, one may create:

• A CogPrime instance that learns language mainly based on experience, with perhaps some basic in-built structure and some judicious biasing to its learning (let's call this CPexp)
• A CogPrime instance using an engineered NLP system (let's call this CPeng)

In this case, CPexp can use CPeng as a cheap way to test its ideas. For instance, suppose CPexp thinks it has correctly interpreted a certain sentence S into Atom-set A. Then it can send its interpretation A to CPeng and see whether CPeng thinks A is a good interpretation of S, by consulting, in CPeng, the truth value of

ReferenceLink
    ExOut PsyneseMatch CPeng S
    ExOut PsyneseMatch CPeng A

Similarly, if CPexp believes it has found a good way S to linguistically express a collection of Atoms A, it can check whether these two match reasonably well in CPeng.

Of course, this approach could be abused in an inefficient and foolish way, for instance if CPexp did nothing but randomly generate sentences and then test them against CPeng. In this case we would have a much less efficient approach than simply using CPeng directly. However, effectively making use of CPeng as a resource requires a different strategy: throwing CPeng only a relatively small selection of things that seem to make sense, and using CPeng as a filter to avoid trying out rough-draft guesses in actual human conversation.
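The consultation step can be caricatured in a few lines of code. This is a toy stand-in, not the actual PsyneseMatch mechanism: the interpreter table, the Jaccard-overlap comparison, and the 0.8 threshold are all invented for illustration.

```python
def cpeng_interpret(sentence):
    # Hypothetical engineered interpreter: maps a sentence to relation tuples.
    table = {"Bob ate the cake": [("eat", "Bob", "cake")]}
    return table.get(sentence, [])

def psynese_match(sentence, atoms, threshold=0.8):
    # Crude stand-in for consulting, in CPeng, the truth value of the
    # ReferenceLink between sentence S and Atom-set A: compare CPeng's own
    # interpretation of S against the proposed Atom-set by set overlap.
    reference = set(cpeng_interpret(sentence))
    proposed = set(atoms)
    union = reference | proposed
    if not union:
        return False
    return len(reference & proposed) / len(union) >= threshold

# CPexp proposes an interpretation and checks it against CPeng
accepted = psynese_match("Bob ate the cake", [("eat", "Bob", "cake")])
```

The point of the sketch is the asymmetry of cost: generating a candidate interpretation may be expensive for CPexp, but checking one against CPeng is cheap, which is why CPeng works well as a filter rather than a generator.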
This hybrid approach, we suggest, may provide a way of getting the best of both worlds: the flexibility of an experiential-learning-based language approach, together with the exploitation of existing linguistic tools and resources. With this in mind, in the following chapters we will describe both engineered and experiential-learning-based approaches to NLP.

43.5.1 Collective Language Learning

Finally we bring the language-learning and mindplex themes together, in the notion of collective language learning. One of the most interesting uses for a mindplex architecture is to allow multiple CogPrime agents to share the linguistic knowledge they gain. One may envision a PsyneseVocabulary server into which a population of CogPrime agents specifically input their linguistic knowledge, and which these agents then consult when they wish to comprehend or express something in language and their individual NLP systems are not up to the task.

This could be a very powerful approach to language learning, because it would allow a potentially very large number of AI systems to effectively act as a single language learning system. It is an especially appealing approach in the context of CogPrime systems used to control animated agents in online virtual worlds or multiplayer games. The amount of linguistic experience undergone by, say, 100,000 virtually embodied CogPrime agents communicating with human virtual world avatars and game players would be far more than any single human child or any single agent could undergo. Thus, to the extent that language learning can be accelerated by additional experience, this approach could enable language to be learned quite rapidly.
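The contribute-and-consult pattern just described can be sketched as follows. This is a deliberately simplified model - the voting-by-frequency scheme and all class and method names are assumptions made for the example, not part of the Psynese specification.

```python
from collections import Counter, defaultdict

class PsyneseVocabularyServer:
    """Pools linguistic knowledge contributed by many agents; the shared
    entry for a word is the community's most commonly reported meaning."""
    def __init__(self):
        self.observations = defaultdict(Counter)

    def contribute(self, word, meaning):
        self.observations[word][meaning] += 1

    def lookup(self, word):
        votes = self.observations.get(word)
        return votes.most_common(1)[0][0] if votes else None

class Agent:
    def __init__(self, server):
        self.lexicon = {}       # the agent's individual NLP knowledge
        self.server = server

    def comprehend(self, word):
        # Consult the collective pool only when the individual
        # NLP system is not up to the task.
        return self.lexicon.get(word) or self.server.lookup(word)

server = PsyneseVocabularyServer()
for _ in range(3):
    server.contribute("ball", "round_toy")
server.contribute("ball", "formal_dance")
learner = Agent(server)
meaning = learner.comprehend("ball")   # falls through to the collective pool
```

With 100,000 contributing agents, the Counter per word would aggregate far more usage evidence than any single agent could gather, which is the point of the collective scheme.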
Chapter 44
Natural Language Comprehension

Co-authored with Michael Ross, Linas Vepstas and Ruiting Lian

44.1 Introduction

Two key approaches to endowing AGI systems with linguistic facility exist, as noted above:

• "Experiential" - shorthand here for "gaining most of its linguistic knowledge from interactive experience, in such a way that language learning is not easily separable from generic learning of how to survive and flourish"
• "Engineered" - shorthand here for "gaining most of its linguistic knowledge from sources other than the system's own experience in the world" (including learning language from resources like corpora)

This dichotomy is somewhat fuzzy, since getting experiential language learning to work well may involve some specialized engineering, and engineered NLP systems may also involve some learning from experience. However, in spite of the fuzziness, the dichotomy is still real and important; there are concrete choices to be made in designing an NLP system, and this dichotomy compactly symbolizes some of them. Much of this chapter and the next few will be focused on the engineered approach, but we will also devote some space to discussing the experience-based approach. Our overall perspective on the dichotomy is that:

• the engineering-based approach, on its own, is unlikely to take us to human-level NLP ... but this isn't wholly impossible, if the engineering is done in a manner that integrates linguistic functionality richly with other kinds of experiential learning
• using a combination of experience-based and engineering-based approaches may be the most practical option
• the engineering approach is useful for guiding the experiential approach, because it tells us a lot about what kinds of general structures and dynamics may be adequate for intelligent language processing.
To simplify a bit, one can prepare an AGI system for experiential learning by supplying it with structures and dynamics capable of supporting the key components of an engineered NLP system - and biased toward learning things similar to known engineered NLP systems - but requiring all, or the bulk of, the actual linguistic content to be learned via experience. This approach may be preferable to requiring a system to learn language based on more abstract structures and dynamics, and may indeed be more comparable to what human brains do, given the large amount of linguistic biasing that is probably built into the human genome.
Further distinctions, overlapping with this one, may also be useful. One may distinguish (at least) five modes of instructing NLP systems, the first three of which are valid only for engineered NLP systems, but the latter two of which are valid both for engineered and experiential NLP systems:

• hand-coded rules
• supervised learning on hand-tagged corpuses, or via other mechanisms of explicit human training
• unsupervised learning from static bodies of data
• unsupervised learning via interactive experience
• supervised learning via interactive experience

Note that, in principle, any of these modes may be used in a pure-language or a socially/physically embodied language context. Of course, there is also semi-supervised learning, which may be used in place of supervised learning in the above list [CSZ08].

Another key dichotomy related to linguistic facility is language comprehension versus language generation (each of which is typically divided into a number of different subprocesses). In language comprehension, we have processes like stemming, part-of-speech tagging, grammar-based parsing, semantic analysis, reference resolution and discourse analysis. In language generation, we have semantic analysis, syntactic sentence generation, pragmatic discourse generation, reference insertion, and so forth. In this chapter and the next two we will briefly review all these different topics and explain how they may be embodied in CogPrime. Then, in Chapter ?? we present a complementary approach to linguistic interaction with AGI systems based on the invented language Lojban; and in Chapter 48 we discuss the use of CogPrime cognition to regulate the dialogue process.
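The comprehension subprocesses listed above are conventionally chained into a pipeline. The toy version below shows only the shape of such a pipeline; the three-word lexicon and the single N-V-N pattern rule are invented stand-ins for real tagging and parsing components.

```python
# Toy lexicon standing in for a real part-of-speech tagger.
LEXICON = {"dogs": "N", "chase": "V", "cats": "N"}

def tokenize(sentence):
    # Crude tokenization plus stemming-style normalization (lowercasing).
    return sentence.lower().rstrip(".").split()

def pos_tag(tokens):
    return [(t, LEXICON.get(t, "UNK")) for t in tokens]

def parse(tagged):
    # Grammar-based parsing reduced to one rule: an N-V-N sequence
    # yields a (subject, verb, object) semantic relation.
    pattern = [tag for _, tag in tagged]
    if pattern == ["N", "V", "N"]:
        return {"subj": tagged[0][0], "verb": tagged[1][0], "obj": tagged[2][0]}
    return None

def comprehend(sentence):
    # Each stage feeds the next: tokenize -> tag -> parse.
    return parse(pos_tag(tokenize(sentence)))

relation = comprehend("Dogs chase cats.")
```

A real system would add semantic analysis, reference resolution and discourse analysis as further stages, and, as discussed below, would ideally let information flow between the stages rather than strictly forward.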
A typical, engineered computational NLP system involves hand-coded algorithms carrying out each of the specific tasks mentioned in the previous paragraph, sometimes with parameters, rules or number tables that are tuned or learned statistically based on corpuses of data. In fact, most NLP systems handle only understanding or only generation; systems that cover both aspects in a unified way are quite rare. The human mind, on the other hand, carries out these tasks in a much more interconnected way - using separate procedures for the separate tasks, to some extent, but allowing each of these procedures to be deeply informed by the information generated by the other procedures. This interconnectedness is what allows the human mind to really understand language - specifically because human language syntax is complex and ambiguous enough that the only way to master it is to infuse one's syntactic analyses with semantic (and to a lesser extent pragmatic) knowledge. In our treatment of NLP we will pay attention to connections between linguistic functionalities, as well as to linguistic functionalities in isolation.

It's worth emphasizing that what we mean by an "experience-based" language system is quite different from corpus-based language systems as are commonplace in computational linguistics today [MS99] (and from the corpus-based learning algorithm to be discussed in Chapter ?? below). In fact, we feel the distinction between corpus-based and rule-based language processing systems is often overblown. Whether one hand-codes a set of rules, or carefully marks up a corpus so that rules can be induced from it, doesn't ultimately make that much difference. For instance, OpenCogPrime's RelEx system (to be described below) uses hand-coded rules to do much the same thing that the Stanford parser does using rules induced from a tagged corpus. But both systems do roughly the same thing.
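One way to let semantic knowledge infuse syntactic analysis is to re-rank candidate parses by semantic plausibility. The sketch below assumes a hypothetical plausibility table over verb-object pairs; the table, the scores, and the candidate-parse format are all invented for the example.

```python
# Hypothetical semantic world knowledge: plausibility of verb-object pairs.
PLAUSIBILITY = {("eat", "cake"): 0.9, ("eat", "fork"): 0.05}

def rerank(parses):
    # Semantic knowledge informing the syntactic choice: among candidate
    # parses of an ambiguous sentence, prefer the one whose verb-object
    # pair is most plausible.
    return max(parses, key=lambda p: PLAUSIBILITY.get((p["verb"], p["obj"]), 0.0))

# Two candidate parses of "I ate the cake with a fork": one misattaches
# "fork" as the thing eaten, the other treats it as an instrument.
candidates = [
    {"verb": "eat", "obj": "fork", "attach": "object"},
    {"verb": "eat", "obj": "cake", "attach": "instrument"},
]
best = rerank(candidates)
```

The purely syntactic component proposes both readings; only the semantic score separates them, which is the kind of cross-procedure information flow the human mind exploits.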
RelEx is currently faster due to using fewer rules, and it handles some complex cases like comparatives better (presumably because they were not













