29.4 Multisensory Integration, and Perception-Action Integration

29.4.2 Thought-Experiment: Eye-Hand Coordination

For example, how would DeSTIN-OpenCog integration as described here carry out a simple task of eye-hand coordination? Of course the details of such a feat, as actually achieved, would be too intricate to describe in a brief space, but it is still meaningful to describe the basic ideas. Consider the case of a robot picking up a block, in plain sight immediately in front of the robot, by pinching it between two fingers and then lifting it. In this case,

• The visual scene, including the block, is perceived by DeSTIN, and appropriate patterns in various DeSTIN nodes are formed
• Predicates corresponding to the distribution of patterns among DeSTIN nodes are activated and exported to the OpenCog Atomspace
• Recognition that a block is present is carried out, either by
  - PLN inference within OpenCog, drawing the conclusion that a block is present from the exported predicates, using ImplicationLinks comprising a working definition of a "block"
  - A predicate comprising the definition of "block", previously imported into DeSTIN from OpenCog and utilized within DeSTIN nodes as a basic pattern to be scanned for. This option would obtain only if the system had perceived many blocks in the past, justifying the automation of block recognition within the perceptual hierarchy.
• OpenCog, motivated by one of its higher-level goals, chooses "picking up the block" as a subgoal, so it allocates effort to finding a procedure whose execution, in the current context, has a reasonable likelihood of achieving the goal of picking up the block. For instance, the goal could be curiosity (which might make the robot want to see what lies under the block), or the desire to please the agent's human teacher (in case the human teacher likes presents, and will reward the robot for giving it a block as a present), etc.
• OpenCog, based on its experience, uses PLN to reason that "grabbing the block" is a subgoal of "picking up the block"
• OpenCog utilizes a set of predicates corresponding to the desired state of "grabbing the block" as a target for an optimization algorithm, designed to figure out a series of servomotor actions that will move the robot's body from the current state to the target state. This is a relatively straightforward control theory problem.
• Once the chosen series of servomotor actions has been executed, the robot has its fingers poised around the block, ready to pick it up. At this point, the action-perception hierarchy perceives what is happening in the fingers. If the block is really being grabbed properly, then the fingers are reporting some force, due to the feeling of grabbing the block (haptic input is another possibility and would be treated similarly, but we will leave that aside for now). Importance spreads from these action-perception patterns into the Atomspace, and back down into the visual perception hierarchy, stimulating concepts and percepts related to "something is being grabbed by the fingers."
• If the fingers aren't receiving enough force, because the agent is actually only poking the block with one finger and grabbing the air with another finger, then the "something is being grabbed by the fingers" stimulation doesn't happen, and the agent is less sure it's actually grabbing anything. In that case it may withdraw its hand a bit, so that it can more easily assess its hand's state visually, and try the optimization-based movement planning again.
29 Bridging the Symbolic/Subsymbolic Gap

• Once the robot estimates the goal of grabbing the block has been successfully achieved, it proceeds to the next sub-subgoal, and asks the action-sequence optimizer to find a sequence of movements that will likely cause the predicates corresponding to "hold the block up" to obtain. It then executes this movement series and picks the block up in the air.

This simple example is a far cry from the perceptual-motor coordination involved in doing embroidery, juggling or serving a tennis ball. But we believe it illustrates, in a simple way, the same basic cognitive structures and dynamics used in these more complex instances.

29.5 A Practical Example: Using Subtree Mining to Bridge the Gap Between DeSTIN and PLN

In this section we describe some relatively simple practical experiments we have run, exploring the general ideas described above. The core idea of these experiments is to apply Yun Chi's Frequent Subtree Mining software to mine frequent patterns from a data-store of trees representing DeSTIN states. In this application, each frequent subtree represents a "common visual pattern." These patterns may then be reasoned about using PLN. This approach may also be extended to include additional quality metrics besides frequency, e.g. interaction information, which lets one measure how "surprising" a subtree is.

Figure 29.1 illustrates the overall architecture into which the use of frequent subtree mining to bridge DeSTIN and PLN is intended to fit. This architecture is not yet fully implemented, but is a straightforward extension of the current OpenCog architecture for processing data from game worlds [GEA08], and is scheduled for implementation later in 2013 in the course of a funded project involving the use of DeSTIN and OpenCog for humanoid robot control.
The components intervening between DeSTIN and OpenCog, in this architecture, are:

• DeSTIN State DB: Stores all DeSTIN states the system has experienced, indexed by time of occurrence
• Frequent Subtree Miner: Recognizes frequent subtrees in the database of DeSTIN states, and can also filter the frequent subtrees by other criteria such as information-theoretic surprisingness. These subtrees may sometimes span multiple time points.
• Frequent Subtree Recognizer: Scans DeSTIN output, and recognizes frequent subtrees therein. These subtrees are the high-level "visual patterns" that make their way from DeSTIN to OpenCog.
• Perception Collector: Linearly normalizes the spatial coordinates associated with its input subtrees, to compensate for movement of the camera. Filters out perceptions that didn't change recently (e.g. a static white wall), so that only new visual information is passed along to OpenCog. Translates the subtrees into Scheme files representing OpenCog logical Atoms.
• Experience DB: Stores all the normalized subtrees that have actually made their way into OpenCog

The frequent subtree mining software is available for download at http://www.nec-labs.com/~ychi/publication/software.html
Fig. 29.1: Graphical depiction of the architecture for DeSTIN/OpenCog integration using frequent subtree mining as a bridge. Semantic feedback is not yet implemented; and sophisticated processing of DeSTIN or other visual input is not yet handled by OpenCog's Perception Collector. The experiments presented here utilized a simplified, preliminary version of the architecture.

• Semantic Feedback: Allows the semantic associations OpenCog makes to a subtree to be fed back into DeSTIN as additional inputs to the nodes involved in the subtree. This allows perception to make use of cognitive information.
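The mining step at the heart of this pipeline can be illustrated in miniature. The sketch below is not Yun Chi's miner (which handles embedded subtrees and sophisticated pruning); it is a toy that treats each DeSTIN state as a (label, *children) tuple and keeps fully rooted subtrees that recur across enough states:

```python
from collections import Counter

def subtrees(tree):
    """Enumerate all rooted subtrees of a tree encoded as (label, *children)."""
    yield tree
    for child in tree[1:]:
        yield from subtrees(child)

def frequent_subtrees(states, min_support):
    """Return subtrees occurring in at least min_support distinct DeSTIN states."""
    counts = Counter()
    for state in states:
        for s in set(subtrees(state)):  # count presence per state, not multiplicity
            counts[s] += 1
    return {s for s, n in counts.items() if n >= min_support}
```

In a real run the labels would be centroid indices, and the Frequent Subtree Recognizer would then scan each new DeSTIN state for membership of the mined set before exporting matches toward OpenCog.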
29.5.1 The Importance of Semantic Feedback

One aspect of the above architecture not yet implemented, but worthy of note, is semantic feedback. Without semantic feedback, we expect to be able to emulate human object and event recognition insofar as they are done by the human brain in a span of less than 500ms or so. In this time frame, the brain cannot do much sophisticated cognitive feedback, and processes perceptual data in an essentially feedforward manner. On the other hand, properly tuned semantic feedback, along with appropriate symbolic reasoning in OpenCog, may allow us to emulate human object and event recognition as the human brain does it when it has more time available, and can use its cognitive understanding to guide vision processing.

A simple example of this sort of symbolic reasoning is analogical inference. Given a visual scene, OpenCog can reason about what the robot has seen in similar situations before, where its notion of "similarity" draws not only on visual cues but on other contextual information: what time it is, what else is in the room (even if not currently visible), who has been seen in the room recently, etc. For instance, recognizing familiar objects that are largely occluded and in dim light may be something requiring semantic feedback, and not achievable via feedforward dynamics alone. This can be tested in a robot vision context by showing the robot various objects and events in various conditions of lighting and occlusion, and observing its capability at recognizing the objects and events with and without semantic feedback, in each of the conditions. If the robot sees an occluded object in a half-dark area on a desk, and it knows that a woman was recently sitting at that desk and then got up and left the room, its symbolic analogical inference may make it more likely to conclude that the object is a purse.
Without this symbolic inference, it might not be able to recognize the object as a purse based on bottom-up visual clues alone.

29.6 Some Simple Experiments with Letters

To illustrate the above ideas in an elementary context, we now present results of an experiment using DeSTIN, subtree mining and PLN together to recognize patterns among a handful of black and white images comprising simple letter-forms. This is a "toy" example, but it exemplifies the key processes reviewed above. During the next year we will be working on deploying these same processes in the context of robot vision.

29.6.1 Mining Subtrees from DeSTIN States Induced via Observing Letterforms

Figure 29.2 shows the 7 input images utilized; Figure 29.3 shows the centroids found on each of the layers of DeSTIN (note that translation invariance was enabled for these experiments); and Figure 29.4 shows the most frequent subtrees recognized among the DeSTIN states induced by observing the 7 input images.
The centroid images shown in Figure 29.3 were generated as follows. For the bottom layer, centroids were directly represented as 4x4 grayscale images (ignoring the previous and parent belief sections of the centroid). For higher-level centroids, we proceeded as follows:

• Divide the centroid into 4 sub-arrays. An image is generated for each sub-array by treating the elements of the sub-array as weights in a weighted sum of the child centroid images. This weighted sum is used to superpose/blend the child images into one image.
• Then these 4 sub-array images are combined in a square to create the whole centroid image.
• Repeat the process recursively, till one reaches the top level.

The weighted averaging used a p-power approach, i.e. replacing each weight w_i with w_i^p / (w_1^p + ... + w_n^p) for a given exponent p > 0. The parameter p toggles how much attention is paid to nearby versus distant centroids. In generating Figure 29.3 we used p = 4.

The relation between subtrees and input images, in this example, was directly given via the subtree miner as:

tree #0 matches input image: 4 6
tree #1 matches input image: 1 2
tree #2 matches input image: 3 5
tree #3 matches input image: 0 1 2 4
tree #4 matches input image: 3 5
tree #5 matches input image: 4 5
tree #6 matches input image: 1 2
tree #7 matches input image: 0 3 4

29.6.2 Mining Subtrees from DeSTIN States Induced via Observing Letterforms

The subtree-image relationships listed above may be most directly expressed in PLN syntax/semantics via

Evaluation contained_in (Tree0 Image4)
Evaluation contained_in (Tree0 Image6)
Evaluation contained_in (Tree1 Image1)
Evaluation contained_in (Tree1 Image2)
Evaluation contained_in (Tree2 Image3)
Evaluation contained_in (Tree2 Image5)
Evaluation contained_in (Tree3 Image0)
Evaluation contained_in (Tree3 Image1)
Evaluation contained_in (Tree3 Image2)
Evaluation contained_in (Tree3 Image4)
Evaluation contained_in (Tree4 Image3)
Evaluation
contained_in (Tree4 Image5)
Evaluation contained_in (Tree5 Image4)
Evaluation contained_in (Tree5 Image5)
Evaluation contained_in (Tree6 Image1)
Evaluation contained_in (Tree6 Image2)
Evaluation contained_in (Tree7 Image0)
Evaluation contained_in (Tree7 Image3)
Evaluation contained_in (Tree7 Image4)
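The translation from the miner's match list into Evaluation links of this kind is purely mechanical; a sketch (the dictionary simply restates the tree-image matches listed earlier):

```python
# Tree-to-image match lists reported by the subtree miner, keyed by tree index.
matches = {0: [4, 6], 1: [1, 2], 2: [3, 5], 3: [0, 1, 2, 4],
           4: [3, 5], 5: [4, 5], 6: [1, 2], 7: [0, 3, 4]}

def to_evaluation_links(matches):
    """Render each (tree, image) pair as an Evaluation link in Atomese-style text."""
    return [f"Evaluation contained_in (Tree{t} Image{i})"
            for t in sorted(matches) for i in matches[t]]
```

Each rendered line corresponds to one EvaluationLink imported into the Atomspace.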
Fig. 29.2: Simple input images fed to DeSTIN for the experiment reported here. Panels (a)-(g) show Images 0-6.

But while this is a perfectly natural way to import such relationships into OpenCog, it is not necessarily the most convenient form for PLN to use to manipulate them. For some useful inference chains, it is most convenient for PLN to translate these into the more concise form

Inheritance Image4 hasTree0
Inheritance Image6 hasTree0
Inheritance Image3 hasTree7
Inheritance Image4 hasTree7

PLN performs the translation from Evaluation into Inheritance form via the inference steps

Evaluation contained_in (Tree0 Image4)
-->   \\ definition of SatisfyingSet
Member Image4 (SatisfyingSet (Evaluation contained_in (Tree0 *)))
-->   \\ definition of hasTree0
Member Image4 hasTree0
-->   \\ M2I, Member to Inheritance inference
Inheritance Image4 hasTree0
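The Similarity truth values reported in the text below can be reproduced by simple set arithmetic over the per-image subtree sets. The sketch assumes (our reading of how the numbers arise, not the general PLN formulas) strength = |shared trees| / |union| and confidence = n/(n+k), with evidence count n = |union| and personality parameter k = 5:

```python
# Subtree sets per image, read off from the tree-image match list above.
image_trees = {1: {1, 3, 6}, 2: {1, 3, 6}, 3: {2, 4, 7}, 5: {2, 4, 5}}

def similarity_tv(a, b, k=5):
    """<strength, confidence> of a SimilarityLink from shared-subtree evidence."""
    inter, union = len(a & b), len(a | b)
    return inter / union, union / (union + k)
```

With these inputs the sketch yields strength 1 and confidence .375 for Image1/Image2, and strength .5 and confidence about .444 for Image3/Image5, matching the values quoted below.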
Fig. 29.3: Example visualization of the centroids on the layers of the DeSTIN network (panels (a)-(h) show Layers 0-7). Each picture shows multiple centroids at the corresponding level. Higher level centroids are visualized as p'th power averages of lower level centroids, with p=4.

Finally, given the Inheritance relations listed above, PLN can draw some simple conclusions fairly directly, such as:

Similarity Image1 Image2 <1, .375>
Similarity Image3 Image5 <.5, .444>

The PLN truth values above are given in the form "<strength, confidence>", where strength is in this case effectively a probability, and confidence represents a scaling into the interval [0,1] of the amount of evidence on which that strength value is based. The confidence is calculated using a "personality parameter" of k = 5 (k may vary between 1 and infinity, with higher numbers indicating less value attached to each individual piece of evidence). For example, the truth value strength of 1 attached to "Similarity Image1 Image2" indicates that, according to the evidence provided by these subtrees (and ignoring all other evidence), Image1 and Image2 are the same. Of course they are not the same (one is a C and the other is an O), and once more evidence is given to PLN, it will decrease the strength value of this SimilarityLink. The confidence value of .375 indicates that PLN is not very certain of the sameness of these two letters.

What conclusion can we draw from this toy example, practically speaking? The conclusions drawn by the PLN system are not useful in this case - PLN thinks C and O are the same, as a provisional hypothesis based on this data. But this is not because DeSTIN and PLN are stupid. Rather, it's because they have not been fed enough data. The hypothesis that a C is an
occluded O is actually reasonably intelligent. If we fed these same systems many more pictures, then the subtree miner would recognize many more frequent subtrees in the larger corpus of DeSTIN states, and PLN would have a lot more information to go on, and would draw more commonsensically clever conclusions. We will explore this in our future work.

We present this toy example not as a useful practical achievement, but rather as a very simple illustration of the process via which subsymbolic knowledge (as in the states of the DeSTIN deep learning architecture) can be mapped into abstract logical knowledge, which can then be reasoned on via a probabilistic logical reasoning engine (such as PLN). We believe that the same process illustrated so simplistically in this example will also generalize to more realistic and interesting examples, involving more complex images and inferences. The integration of DeSTIN and OpenCog described here is being pursued in the context of a project aimed at the creation of a humanoid robot capable of perceiving, interpreting and acting in its environment with a high level of general intelligence.

29.7 Conclusion

We have described, at a high level, a novel approach to bridging the symbolic/subsymbolic gap, via very tightly integrating DeSTIN with OpenCog. We don't claim that this is the only way to bridge the gap, but we do believe it is a viable way. Given the existing DeSTIN and OpenCog designs and codebases, the execution of the ideas outlined here seems to be relatively straightforward, falling closer to the category of "advanced development" than that of blue-sky research. However, fine-tuning all the details of the approach will surely require substantial effort.
While we have focused on robotics applications here, the basic ideas described could be implemented and evaluated in a variety of other contexts as well, for example the identification of objects and events in videos, or intelligent video summarization. Our interests are broad; however, we feel that robotics is the best place to start, partly due to a general intuition regarding the deep coupling between human-like intelligence and human-like embodiment, and partly due to a more specific intuition regarding the value of action for perception, as reflected in Heinz von Foerster's dictum "if you want to see, learn how to act". We suspect there are important cognitive reasons why perception in the human brain centrally involves premotor regions.

The coupling of a perceptual deep learning hierarchy and a symbolic AI system doesn't intrinsically solve the combinatorial explosion problem inherent in looking for potential conceptual patterns in masses of perceptual data. However, a system with particular goals, and the desire to act in such a way as to achieve them, possesses a very natural heuristic for pruning the space of possible perceptual/conceptual patterns: it allows the mind to focus in on those percepts and concepts that are useful for action. Of course, there are other ways besides integrating action to enforce effective pruning, but the integration of perception and action has a variety of desirable properties that might be difficult to emulate via other methods, such as the natural alignment of the hierarchical structures of action and reward with that of perception.

The outcome of any complex research project is difficult to foresee in detail. However, our intuition, based on our experience with OpenCog and DeSTIN, and our work with the mathematical and conceptual theories underlying these two systems, is that the hybridization of OpenCog and DeSTIN as described here will constitute a major step along the path to human-level AGI.
It will enable the creation of an OpenCog instance endowed with the capability of
flexibly interacting with a rich stream of data from the everyday human world. This data will not only help OpenCog to guide a robot in carrying out everyday tasks, but will also provide raw material for OpenCogPrime's cognitive processes to generalize from in various ways - e.g. to use as the basis for the formation of new concepts or analogical inferences.
Fig. 29.4: Example subtrees extracted from the set of DeSTIN states corresponding to the input images given above. Each subtree is associated with a triple (level, centroid, position): (a) Subtree 0: (L6,C2,P0); (b) Subtree 1: (L6,C12,P0); (c) Subtree 2: (L6,C19,P0); (d) Subtree 3: (L6,C12,P1); (e) Subtree 4: (L6,C13,P1); (f) Subtree 5: (L6,C1,P2); (g) Subtree 6: (L6,C5,P2); (h) Subtree 7: (L6,C20,P3). The position is one of the four level n squares making up a level n-1 centroid. In this simple example, all these frequent subtrees happen to be from Level 6, but this is not generally the case for more complex images. Some of the centroids look like whitespace, but this is because a common region of whitespace was recognized among multiple input images.
Section IV
Procedure Learning
Chapter 30
Procedure Learning as Program Learning

30.1 Introduction

Broadly speaking, the learning of predicates and schemata (executable procedures) is done in CogPrime via a number of different methods, including for example PLN inference and concept predicatization (to be discussed in later chapters). Most of these methods, however, merely extrapolate procedures directly from other procedures or concepts in the AtomSpace, in a local way - a new procedure is derived from a small number of other procedures or concepts. General intelligence also requires a method for deriving new procedures that are more "fundamentally new." This is where CogPrime makes recourse to explicit procedure learning algorithms such as hillclimbing and MOSES, discussed in Chapters 32 and 33 below.

In this brief chapter we formulate the procedure learning problem as a program learning problem in a general way, and make some high-level observations about it. Conceptually, this chapter is a follow-up to Chapter 21, which discussed the choice to represent procedures as programs; here we make some simple observations regarding the implications of this choice for procedure learning, and the formal representation of procedure learning within CogPrime.

30.1.1 Program Learning

An optimization problem may be defined as follows: a solution space S is specified, together with some fitness function on solutions, where "solving the problem" corresponds to discovering a solution in S with a sufficiently high fitness.

In this context, we may define program learning as follows: given a program space P, a behavior space B, an execution function exec : P -> B, and a fitness function on behaviors, "solving the problem" corresponds to discovering a program p in P whose corresponding behavior, exec(p), has a sufficiently high fitness.

In evolutionary learning terms, the program space is the space of genotypes, and the behavior space is the space of phenotypes.
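The quadruple (P, B, exec, fitness) can be made concrete in a toy setting. In the sketch below (illustrative only; real program spaces are vastly larger), programs are token sequences over {+1, *2} applied to 0, behavior is the resulting integer, and learning is enumeration from smallest programs to largest:

```python
import itertools

OPS = ('+1', '*2')

def exec_program(prog):
    """exec : P -> B; run the token sequence starting from 0."""
    x = 0
    for op in prog:
        x = x + 1 if op == '+1' else x * 2
    return x

def learn(target, max_len=8):
    """Find a program whose behavior hits the target, trying smaller programs first."""
    for n in range(max_len + 1):                       # open-endedness: size-ordered
        for prog in itertools.product(OPS, repeat=n):  # exhaustive at each size
            if exec_program(prog) == target:
                return prog
    return None
```

Even this tiny space exhibits over-representation: ('+1', '*2') and ('*2', '+1', '+1') are distinct genotypes with the same phenotype (both evaluate to 2).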
This formalism of procedure learning serves well for explicit procedure learning in CogPrime, not counting cases like procedure learning within other systems (like DeSTIN) that may be hybridized with CogPrime.
Of course, this extended formalism can be entirely vacuous - the behavior space could be identical to the program space, and the execution function simply the identity, allowing any optimization problem to be cast as a problem of program learning. The utility of this specification arises when we make interesting assumptions regarding the program and behavior spaces, and the execution and fitness functions (thus incorporating additional inductive bias):

1. Open-endedness - P has a natural "program size" measure - programs may be enumerated from smallest to largest, and there is no obvious problem-independent upper bound on program size.
2. Over-representation - exec often maps many programs to the same behavior.
3. Compositional hierarchy - programs themselves have an intrinsic hierarchical organization, and may contain subprograms which are themselves members of P or some related program space. This provides a natural family of distance measures on programs, in terms of the number and type of compositions/decompositions needed to transform one program into another (i.e., edit distance).
4. Chaotic Execution - very similar programs (as conceptualized in the previous item) may have very different behaviors.

Precise mathematical definitions could be given for all of these properties but would provide little insight - it is more instructive to simply note their ubiquity in symbolic representations: human programming languages (LISP, C, etc.), Boolean and real-valued formulae, pattern-matching systems, automata, and many more. The crux of this line of thought is that the combination of these four factors conspires to scramble fitness functions - even if the mapping from behaviors to fitness is separable or nearly decomposable, the complex* program space and chaotic execution function will often quickly lead to intractability as problem size grows.
These properties are not superficial inconveniences that can be circumvented by some particularly clever encoding. On the contrary, they are the essential characteristics that give programs the power to compress knowledge and generalize correctly, in contrast to flat, inert representations such as lookup tables (see Baum [Bau04] for a full treatment of this line of argument).

The consequences of this particular kind of complexity, together with the fact that most program spaces of interest are combinatorially very large, might lead one to believe that competent program learning is impossible. Not so: real-world program learning tasks of interest have a compact structure† - they are not "needle in haystack" problems or uncorrelated fitness landscapes, although they can certainly be encoded as such. The most one can definitively state is that algorithm foo, methodology bar, or representation baz is unsuitable for expressing and exploiting the regularities that occur across interesting program spaces. Some of these regularities are as follows:

1. Simplicity prior - our prior assigns greater probability mass to smaller programs.
2. Simplicity preference - given two programs mapping to the same behavior, we prefer the smaller program (this can be seen as a secondary fitness function).
3. Behavioral decomposability - the mapping between behaviors and fitness is separable or nearly decomposable. Relatedly, fitnesses are more than scalars - there is a partial ordering corresponding to behavioral dominance, where one behavior dominates another if it exhibits

* Here "complex" means open-ended, over-representing, and hierarchical.
† Otherwise, humans could not write programs significantly more compact than lookup tables.
a strict superset of the latter's desideratum, according to the fitness function.‡ This partial order will never contradict the total ordering of scalar fitness.
4. White box execution - the mechanism of program execution is known a priori, and remains constant across many problems.

How these regularities may be exploited will be discussed in later sections and chapters. Another fundamental regularity of great interest for artificial general intelligence is patterns across related problems that may be solvable with similar programs (e.g., involving common modules).

30.2 Representation-Building

One important issue in achieving competent program learning is representation building. In an ideally encoded optimization problem, all prespecified variables would exhibit complete separability, and could be optimized independently. Problems with hierarchical dependency structure cannot be encoded this way, but are still tractable by dynamically learning the problem decomposition (as is done by the BOA and hBOA, described in Chapter 33). For complex problems with interacting subcomponents, finding an accurate problem decomposition is often tantamount to finding a solution. In an idealized run of a competent optimization algorithm, the problem decomposition evolves along with the set of solutions being considered, with parallel convergence to the correct decomposition and the global solution optima. However, this is certainly contingent on the existence of some compact§ and reasonably correct decomposition in the space (of decompositions, not solutions) being searched. Difficulty arises when no such decomposition exists, or when a more effective decomposition exists that cannot be formulated as a probabilistic model over representational parameters.
Accordingly, one may extend current approaches via either: a more general modeling language for expressing problem decompositions; or additional mechanisms that modify the representations on which modeling operates (introducing additional inductive bias). In CogPrime we have focused on the latter - the former would appear to require qualitatively more computational capacity than will be available in the near future. If one ignores this constraint, such a "universal" approach to general problem-solving is indeed possible, e.g. AIXI as discussed in Chapter 7.3. We refer to these additional mechanisms as "representation-building" because they serve the same purpose as the pre-representational mechanisms employed (typically by humans) in setting up an optimization problem - to present an optimization algorithm with the salient parameters needed to build effective problem decompositions and vary solutions along meaningful dimensions. We return to this issue in detail in Chapter 33 in the context of MOSES, the most powerful procedure-learning algorithm provided in CogPrime.

‡ For example, in supervised classification one rule dominates another if it correctly classifies all of the items that the second rule classifies correctly, as well as some which the second rule gets wrong.
§ The decomposition must be compact because in practice only a fairly small sampling of solutions may be evaluated (relative to the size of the total space) at a time, and the search mechanism for exploring decomposition-space is greedy and local. This is also in accordance with the general notion of learning corresponding to compression.
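The behavioral-dominance relation described in the supervised-classification footnote reduces to strict-superset comparison of correctness sets; a minimal sketch:

```python
def dominates(correct_a, correct_b):
    """Rule a dominates rule b if a correctly classifies a strict superset
    of the items that b classifies correctly."""
    return correct_a > correct_b  # strict-superset test on Python sets
```

Because strict-superset is only a partial order, many pairs of rules are incomparable under dominance even though their scalar fitness values can always be compared.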
30.3 Specification Based Procedure Learning

Now we explain how procedure learning fits in with the declarative and intentional knowledge representation in the AtomSpace.

The basic method that CogPrime uses to learn procedures that appear fundamentally new, from the point of view of the AtomSpace at a given point in time, is "specification-based procedure learning". This involves taking a PredicateNode with a ProcedureNode input type as a specification, and searching for ProcedureNodes that fulfill this specification (in the sense of making the specification PredicateNode as true as possible). In evolutionary computing lingo, the specification predicate is a fitness function.

Searching for PredicateNodes that embody patterns in the AtomSpace as a whole is a special case of this kind of learning, where the specification PredicateNode embodies a notion of what constitutes an "interesting pattern". The quantification of interestingness is of course an interesting and nontrivial topic in itself.

Finding schemata that are likely to achieve goals important to the system is also a special case of this kind of learning. In this case, the specification predicate is of the form:

F(S) =
    PredictiveImplicationLink
        (ExOutLink S)
        G

This measures the extent to which executing schema S is positively correlated with goal-predicate G being achieved shortly later.

Given a PredicateNode interpretable as a specification, how do we find a ProcedureNode satisfying the specification? Lacking prior knowledge sufficient to enable an incremental approach like inference, we must search the space of possible ProcedureNodes, using an appropriate search heuristic, hopefully one that makes use of the system's existing knowledge as fully as possible.
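In code, specification-based learning is just search against a predicate used as a fitness function. The sketch below scores a tiny candidate list exhaustively; every name here (the contexts, the goal, the candidate procedures) is an illustrative stand-in, not CogPrime's actual MOSES or hillclimbing machinery:

```python
def specification_search(candidates, spec):
    """Return the candidate procedure that the specification predicate rates highest."""
    return max(candidates, key=spec)

# Illustrative stand-ins: a goal predicate, some recorded contexts, and a spec
# scoring how often executing a procedure achieves the goal across contexts.
contexts = [2, 3, 5, 7, 9]

def goal(result):
    return result % 2 == 1  # "the outcome is odd"

def make_spec(contexts, goal):
    return lambda proc: sum(goal(proc(c)) for c in contexts) / len(contexts)

def double(x):      return x * 2
def double_plus(x): return x * 2 + 1
def increment(x):   return x + 1

best = specification_search([double, double_plus, increment],
                            make_spec(contexts, goal))  # selects double_plus
```

MOSES and hillclimbing replace the exhaustive `max` with guided search over a vastly larger procedure space, but the role of the specification predicate is the same.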
Chapter 31 Learning Procedures via Imitation, Reinforcement and Correction

Co-authored with Moshe Looks, Samir Araujo and Welter Silva

31.1 Introduction

In procedure learning as elsewhere in cognition, it's not enough to use the right algorithm; one has to use it in the right way, based on the data and context and affordances available. While Chapters ?? and 33 focus on procedure learning algorithms, this one focuses on procedure learning methodology. We will delve into the important special case of procedure learning in which the fitness function involves reinforcement and imitation supplied by a teacher and/or an environment, and look at examples of this in the context of teaching behaviors to virtual pets controlled by OpenCogPrime. While this may seem a very narrow context, many of the lessons learned are applicable more broadly; and the discussion has the advantage of being grounded in actual experiments done with OpenCogPrime's predecessor system, the Novamente Cognition Engine, and with an early OpenCog version as well, during the period 2007-2008. We will focus mainly on learning from a teacher, and then comment on the very similar case where the environment, rather than some specific agent, is the teacher.

31.2 IRC Learning

Suppose one intelligent agent (the "teacher") has knowledge of how to carry out a certain behavior, and wants to transfer this knowledge to another intelligent agent (the "student"). But suppose the student agent lacks the power of language (which might be, for example, because language is the thing being taught!). How may the knowledge be transferred? At least three methodologies are possible:

1. Imitative learning: The teacher acts out the behavior, showing the student by example
2. Reinforcement learning: The student tries to do the behavior himself, and the teacher gives him feedback on how well he did
3. Corrective learning: As the student attempts the behavior, the teacher actively corrects (i.e. changes) the student's actions, guiding him toward correct performance

Obviously, these three forms of instruction are not exclusive. What we describe here, and call IRC learning, is a pragmatic methodology for instructing AGI systems that combines these
three forms of instruction. We believe this combination is a potent one, and is certainly implicit in the way human beings typically teach young children and animals. For sake of concreteness, we present IRC learning here primarily in the context of virtually embodied AGI systems - i.e., AGI systems that control virtual agents living in virtual worlds. There is an obvious extension to physical robots living in the real world and capable of flexible interaction with humans. In principle, IRC learning is applicable more broadly as well, and could be explored in various non-embodied contexts such as (for instance) automated theorem-proving.

In general, the term "IRC learning" may be used to describe any teacher/student interaction that involves a combination of reinforcement, imitation and correction. While we have focused in our practical work so far on the use of IRC to teach simple "animal-like" behaviors, the application that interests us more in the medium term is language instruction, to which we will return in later chapters. Harking back to Chapter 9, it is clear that an orientation toward effective IRC learning will be valuable for any system attempting to achieve complex goals in an environment heavily populated by other intelligences possessing significant goal-relevant knowledge. Everyday human environments possess this characteristic, and we suggest the best way to create human-level AGIs will be to allow them to develop in environments possessing this characteristic as well.

31.2.1 A Simple Example of Imitation/Reinforcement Learning

Perhaps the best way to introduce the essential nature of the IRC teaching protocol is to give a brief snippet from a script that was created to guide the actual training of the virtual animals controlled by the PetBrain. This snippet involves only the I and the R; the C will be discussed afterwards.
This snippet demonstrates a teaching methodology that involves two avatars: Bob, who is being the teacher, and Jill, who is being an "imitation animal," showing the animal what to do by example.

1. Bob wants to teach the dog Fido a trick. He calls his friend Jill over. "Jill, can you help me teach Fido a trick?"
2. Jill comes over. "How much will you pay me for it?"
3. Bob gives her a kiss.
4. "All right," says Jill, "what do you want to teach him?"
5. "Let's start with fetching stuff," replies Bob.
6. So Bob and Jill start teaching Fido to fetch using the Pet language....
7. Bob says: "Fido, I'm going to teach you to play fetch with Jill."
8. Fido sits attentively, looking at Bob.
9. Bob says: "OK, I'm playing fetch now."
10. Bob picks up a stick from the ground and throws it. Jill runs to get the stick and brings it back to Bob.
11. Bob says: "I'm done fetching."
12. Bob says, "You try it."
13. Bob throws a stick. Fido runs to the stick, gets it, and brings it back.
14. Bob says "Good dog!"
15. Fido looks happy.
16. Bob says: "Ok, we're done with that game of fetch."
17. Bob says, "Now, let's try playing fetch again."
18. This time, Bob throws a stick in a different direction, where there's already a stick lying on the ground (call the other stick Stick 2).
19. Fido runs and retrieves Stick 2. As soon as he picks it up, Bob says "No." But Fido keeps on running and brings the stick back to Bob.
20. Bob says "No, that was wrong. That was the wrong stick. Stop trying!"
21. Jill says, "Furry little moron!"
22. Bob says to Jill, "Have some patience, will you? Let's try again."
23. Fido is slowly wandering around, sniffing the ground.
24. Bob says "Fido, stay." Fido returns near Bob and sits.
25. Bob throws Stick 2. Fido starts to get up and Bob repeats "Fido, stay."
26. Bob goes and picks up Stick 1, and walks back to his original position.
27. Bob says "Fido, I'm playing fetch with Jill again."
28. Bob throws the first stick in the direction of Stick 2.
29. Jill goes and gets Stick 1 and brings it back to Bob.
30. Bob says "I'm done playing fetch with Jill."
31. Bob says "Try playing fetch with me now." He throws Stick 1 in another direction, where Stick 3 and Stick 4 are lying on the ground, along with some other junk.
32. Fido runs and gets Stick 1 and brings it back.
33. Bob and Jill both jump up and down smiling and say "Good dog! Good dog, Fido!! Good dog!"
34. Fido smiles and jumps up and licks Jill on the face.
35. Bob says, "Fido, we're done practicing fetch."

In the above transcript, Line 7 initiates a formal training session, and Line 35 terminates this session. The training session is broken into "exemplar" intervals during which exemplars are being given, and "trial" intervals during which the animal is trying to imitate the exemplars, following which it receives reinforcement on its success or otherwise. For instance, line 9 initiates the presentation of an exemplar interval, and line 11 indicates the termination of this interval.
Line 12 indicates the beginning of a trial interval, and line 16 indicates the termination of this interval.

The above example of combined imitative/reinforcement learning involves two teachers, but this is of course not the only way things can be done. Jill could be eliminated from the above teaching example. The result of this would be that, in figuring out how to imitate the exemplars, Fido would have to figure out which of Bob's actions were "teacher" actions and which were "simulated student" actions. This is not a particularly hard problem, but it's harder than the case where Jill carries out all the simulated-student actions. So in the case of teaching fetch with only one teacher avatar, on average, more reinforcement trials will be required.

31.2.2 A Simple Example of Corrective Learning

Another interesting twist on the imitative/reinforcement teaching methodology described above is the use of explicit correctional instructions from the teacher to the animal. This is not shown in the above example but represents an important addition to the methodology shown there. One good example of the use of corrections would be the problem of teaching an animal to sit and wait until the teacher says "Get Up," using only a single teacher. Obviously,
using two teachers, this is a much easier problem. Using only one teacher, it's still easy, but involves a little more subtlety, and becomes much more tractable when corrections are allowed. One way that human dog owners teach their dogs this sort of behavior is as follows:

1. Tell the dog "sit"
2. Tell the dog "stay"
3. Whenever the dog tries to get up, tell him "no" or "sit", and then he sits down again
4. Eventually, tell the dog to "get up"

The real dog understands, in its own way, that the "no" and "sit" commands said after the "stay" command are meta-commands rather than part of the "stay" behavior. In our virtual-pet case, this would be more like

1. Tell the dog "I'm teaching you to stay"
2. Tell the dog "sit"
3. Whenever the dog tries to get up, tell him "no" or "sit", and then he sits down again
4. Eventually, tell the dog to "get up"
5. Tell the dog "I'm done teaching you to stay"

One easy way to do this, which deviates from the pattern of humanlike interaction, would be to give the agent knowledge about how to interpret an explicit META flag in communications directed toward it. In this case, the teaching would look like

1. Tell the dog "I'm teaching you to stay"
2. Tell the dog "META: sit"
3. Whenever the dog tries to get up, tell him "META: no" or "META: sit", and then he sits down again
4. Eventually, tell the dog to "get up"
5. Tell the dog "I'm done teaching you to stay"

Even without the META tag, this behavior (and other comparable ones) is learnable via CogPrime's learning algorithms within a modest number of reinforcement trials. So we have not actually implemented the META approach. But it well illustrates the give-and-take relationship between the sophistication of the teaching methodology and the number of reinforcement trials required.
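As a toy illustration of the explicit-META variant, an agent receiving teacher utterances might separate training meta-commands from object-level commands as follows (a hypothetical sketch; the `META:` string convention follows the lists above, but the function name and utterance format are invented here):

```python
def split_meta(utterances):
    """Separate META-flagged meta-commands (corrections and training
    control) from object-level commands belonging to the behavior
    being taught."""
    meta, obj = [], []
    for u in utterances:
        if u.startswith("META:"):
            meta.append(u[len("META:"):].strip())
        else:
            obj.append(u)
    return meta, obj
```

The point of the flag is precisely that this separation becomes a trivial parse, rather than an inference the animal must perform from context.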
In many cases, the best way to reduce the number of reinforcement trials required to learn a behavior is not to increase the sophistication of the learning algorithm, but rather to increase the information provided during the instruction process. No matter how advanced the learning algorithm, if the teaching methodology only gives a small amount of information, it's going to take a large number of reinforcement trials to work through the search space and find one of the right procedures satisfying the teacher's desires. One of the differences between the real-world learning that an animal or human child (or adult) experiences, and the learning "experienced" by standard machine-learning algorithms, is the richness and diversity of information that the real-world teaching environment provides, beyond simple reinforcement signals. Virtual worlds provide a natural venue in which to experiment with providing this sort of richer feedback to AI learning systems, which is one among the many reasons why we feel that virtual worlds are an excellent venue for experimentation with and education of early-stage AGI systems.
31.3 IRC Learning in the PetBrain

Continuing the theme of the previous section, we now discuss "trick learning" in the PetBrain, as tested using OpenCog and the Multiverse virtual world during 2007-2008. The PetBrain constitutes a specific cognitive infrastructure implementing the IRC learning methodology in the virtual-animal context, with some extensibility beyond this context as well. In the PetBrain, learning itself is carried out by a variant of hillclimbing as described in Chapter 32, which is a fast learning algorithm but may fail on harder behaviors (in the sense of requiring an unacceptably large number of reinforcement trials). For more complex behaviors, MOSES (Chapter 33) would need to be integrated as an alternative. Compared to hillclimbing, MOSES is much smarter but slower, and may take a few minutes to solve a problem. The two algorithms (as implemented for the PetBrain) share the same Combo knowledge representation and some other software components (e.g. normalization rules for placing procedures in an appropriate hierarchical normal form, as described in Chapter 21).

The big challenge involved in designing the PetBrain system, AI-wise, was that these learning algorithms, used in a straightforward way with feedback from a human-controlled avatar as the fitness function, would have needed an excessive number of reinforcement trials to learn relatively simple behaviors. This would bore the human beings involved with teaching the animals. This is not a flaw of the particular learning algorithms being proposed, but is a generic problem that would exist with any AI algorithm: to choose an appropriate behavior out of the space of all possible behaviors satisfying reasonable constraints requires more bits of information than is contained in a handful of reinforcement trials. Most "animal training" games (e.g.
Nintendogs may be considered as a reference case) work around this "hard problem" by not allowing teaching of novel behaviors. Instead, a behavior list is made up front by the game designers. The animals have preprogrammed procedures for carrying out the behaviors on the list. As training proceeds they make fewer errors, till after enough training they converge "miraculously" on the pre-programmed plan. This approach only works, however, if all the behaviors the animals will ever learn have been planned and scripted in advance.

The first key to making learning of non-pre-programmed behaviors work, without an excessive number of reinforcement trials, is "fitness estimation" - code that guesses the fitness of a candidate procedure at fulfilling the teacher's definition of a certain behavior, without actually having to try out the procedure and see how it works. This is where the I part of IRC learning comes in. At an early stage in designing the PetBrain application, we realized it would be best if the animals were instructed via a methodology where the same behaviors are defined by the teacher both by demonstration and by reinforcement signals. Learning based on reinforcement signals only can also be handled, but learning will be much slower. In evolutionary programming lingo, we have

1. Procedures = genotypes
2. Demonstrated exemplars, and behaviors generated via procedures = phenotypes
3. Reinforcement signals from pet owner = fitness
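This genotype/phenotype/fitness correspondence can be sketched in a few lines of illustrative Python (not PetBrain code: `run` stands in for procedure execution, and the similarity measure is a crude placeholder for the much subtler behavior-comparison heuristics the real system requires):

```python
def similarity(trace_a, trace_b):
    """Placeholder phenotype similarity: fraction of aligned actions
    that match (real behavior comparison is far more subtle)."""
    n = max(len(trace_a), len(trace_b))
    if n == 0:
        return 1.0
    return sum(a == b for a, b in zip(trace_a, trace_b)) / n

def estimate_fitness(procedure, exemplar_traces, run):
    """Estimate fitness of a procedure (genotype) by generating its
    behavior trace (phenotype) via `run` and scoring it against the
    teacher's demonstrated exemplar traces."""
    trace = run(procedure)
    return max(similarity(trace, ex) for ex in exemplar_traces)
```

The key design point is that fitness is computed against demonstrated phenotypes, so a candidate can be scored without waiting for a live reinforcement signal from the owner.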
One method of imitation-based fitness estimation used in the PetBrain involves an internal simulation world which we'll call CogSim, as discussed in Chapter 40.* CogSim can be visualized using a simple testing UI, but in the normal course of operations it doesn't require a user interface; it is an internal simulation world, which allows the PetBrain to experiment and see what a certain procedure would be likely to do if enacted in the SL virtual world. Of course, the accuracy of this kind of simulation depends on the nature of the procedure. For procedures that solely involve moving around and interacting with inanimate objects, it can be very effective. For procedures involving interaction with human-controlled avatars, other animals, or other complex objects, it may be unreliable - and making it even moderately reliable would require significant work that has not yet been done, in terms of endowing CogSim with realistic simulations of other agents and their internal motivational structures and so forth. But short of this, CogSim has nonetheless proved useful for estimating the fitness of simple behavioral procedures.

When a procedure is enacted in CogSim, this produces an object called a "behavior description" (BD), which is represented in the AtomSpace knowledge representation format. The BD generated by the procedure is then compared with the BDs corresponding to the "exemplar" behaviors that the teacher has generated, and that the student is trying to emulate. Similarities are calculated, which is a fairly subtle matter that involves some heuristic inferences. An estimate is thus formed of the likelihood that the procedure, if executed in the world, will generate a behavior adequately similar to the exemplar behaviors.

Furthermore, this process of estimation may be extended to make use of the animal's long-term episodic memory.
Suppose a procedure P is being evaluated in the context of exemplar-set E. Then

1. The episodic memory is mined for pairs (P', E') that are similar to (P, E)
2. The fitness of these pairs (P', E') is gathered from the experience base
3. An estimate of the fitness of (P, E) is then formed

Of course, if a behavior description corresponding to P has been generated via CogSim, this may also be used in the similarity matching against long-term memory. The tricky part here, of course, is the similarity measurement itself, which can be handled via simple heuristics, but if taken sufficiently seriously becomes a complex problem of uncertain inference.

One thing to note here is that in the PetBrain context, although learning is done by each animal individually, this learning is subtly guided by collective knowledge within the fitness estimation process. Internally, we have a "borg mind" with multiple animal bodies, and an architecture designed to ensure the maintenance of unique personalities on the part of the individual animals in spite of the collective knowledge and learning underneath.

At time of writing, we have just begun to experiment with the learning system as described above, and are using it to learn simple behaviors such as playing fetch, basic soccer skills, doing specific dances as demonstrated by the teacher, and so forth. We have not yet done enough experimentation to get a solid feel for the limitations of the methodology as currently implemented.

Note also the possibility of using CogPrime's PLN inference component to allow generalization of learned behaviors. For instance, with inference deployed appropriately, a pet that had learned how to play tag would afterwards have a relatively easy time learning to play "freeze tag."
A pet that had learned how to hunt for Easter eggs would have a relatively easy time

* The few readers familiar with obscure OpenCog documentation may remember that CogSim was previously called "Third Life", in reference to the Second Life virtual world that was being used to embody the OpenCog virtual pets at the time
learning to play hide-and-seek. Episodic memory can be very useful for fitness estimation here, but explicit use of inference may allow much more rapid and far-reaching generalization capabilities.

31.3.1 Introducing Corrective Learning

Next, how may corrections be utilized in the learning process we have described? Obviously, the corrected behavior description gets added into the knowledge base as an additional exemplar. And the fact of the correction acts as a partial reinforcement (up until the time of the correction, what the animal was doing was correct). But beyond this, what's necessary is to propagate the correction backward from the BD level to the procedure level. For instance, if the animal is supposed to be staying in one place, and it starts to get up but is corrected by the teacher (who says "sit" or physically pushes the animal back down), then the part of the behavior-generating procedure that directly generated the corrected "get up" action needs to be "punished."

How difficult this is to do depends on how complex the procedure is. It may be as simple as providing a negative reinforcement to a specific "program tree node" within the procedure, thus disincentivizing future procedures generated by the procedure learning algorithm from containing this node. Or it may be more complex, requiring the solution of an inference problem of the form "Find a procedure P' that is as similar as possible to procedure P, but that does not generate the corrected behavior, but rather generates the behavior that the teacher wanted instead." This sort of "working backwards from the behavior description to the procedure" is never going to be perfect except in extremely simple cases, but it is an important part of learning. We have not yet experimented with this extensively in our virtual animals, but plan to do so as the project proceeds.
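The simplest case - negative reinforcement of a specific program tree node - might look like the following outline (a hypothetical sketch; per-node sampling weights and the down-weighting factor are inventions for illustration, not the actual CogPrime mechanism):

```python
class ProgramNode:
    """A node of a behavior-generating procedure, carrying a sampling
    weight that biases future candidate generation."""
    def __init__(self, action, weight=1.0):
        self.action = action
        self.weight = weight

def punish(procedure, corrected_action, factor=0.5):
    """Down-weight every node that directly generated the action the
    teacher corrected, disincentivizing candidate procedures that
    contain it."""
    for node in procedure:
        if node.action == corrected_action:
            node.weight *= factor
```

The harder inference problem - finding a whole procedure P' similar to P but generating the desired behavior - has no such one-line analogue, which is exactly the point of the contrast drawn in the text.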
There is also an interesting variant of correction in which the agent's own memory serves implicitly as the teacher. That is, if a procedure generates a behavior that seems wrong based on the history of successful behavior descriptions for similar exemplars, then the system may suppress that particular behavior or replace it with another one that seems more appropriate - inference based on history thus serving the role of a correcting teacher.

31.4 Applying A Similar IRC Methodology to Spontaneous Learning

We have described the IRC teaching/learning methodology in the context of learning from a teacher - but in fact a similar approach can be utilized for purely unsupervised learning. In that case, the animal's intrinsic goal system acts implicitly as a teacher. For instance, suppose the animal wants to learn how to better get itself fed. In this case,

1. Exemplars are provided by instances in the animal's history when it has successfully gotten itself fed
2. Reinforcement is provided by whether or not, when executing a certain procedure, the animal actually gets itself fed
3. Correction as such doesn't apply, but implicit correction may be used via deploying history-based inference: if a procedure generates a behavior that seems wrong based on the history of successful behavior descriptions for the goal of getting fed, then the system may suppress that particular behavior
The only real added complexity here lies in identifying the exemplars. In surveying its own history, the animal must look at each previous instance in which it got fed (or some sample thereof), and for each one recollect the series of N actions that it carried out prior to getting fed. It then must figure out how to set N - i.e. which of the actions prior to getting fed were part of the behavior that led up to getting fed, and which were just other things the animal happened to be doing a while before getting fed. To the extent that this exemplar-mining problem can be solved adequately, innate-goal-directed spontaneous learning becomes closely analogous to teacher-driven learning as we've described it. Or in other words: experience, as is well known, can serve as a very effective teacher.
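The exemplar-mining step just described can be sketched as a simple window extraction over the agent's event history (illustrative code only; in reality choosing the window length N is the hard part, whereas here it is just a parameter):

```python
def mine_exemplars(history, reward_event, window):
    """Return, for each occurrence of the reward event (e.g. getting
    fed), the window of up to `window` preceding actions, as a
    candidate exemplar of goal-achieving behavior."""
    exemplars = []
    for i, event in enumerate(history):
        if event == reward_event:
            actions = [e for e in history[:i] if e != reward_event]
            exemplars.append(actions[-window:])
    return exemplars
```

A fixed window conflates goal-relevant actions with incidental ones; a real system would need to vary N and score candidate windows, e.g. by how consistently the same action subsequence precedes the reward.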
Chapter 32 Procedure Learning via Adaptively Biased Hillclimbing

Primary author: Nil Geisweiller

32.1 Introduction

Having chosen to represent procedures as programs, explicit procedure learning then becomes a matter of automated program learning. In its most general incarnation, automated program learning is obviously an intractable problem; so the procedure learning design problem then boils down to finding procedure learning algorithms that are effective on the class of problems relevant to CogPrime systems in practice. This is a subtle matter because there is no straightforward way to map from the vaguely-defined category of real-world, "everyday human world"-like goals and environments to any formally-defined class of relevant objective functions for a program learning algorithm.

However, this difficulty is not a particular artifact of the choice of programs to represent procedures; similar issues would arise with any known representational mechanism of suitable power. For instance, if procedures were represented as recurrent neural nets, there would arise similar questions of how many layers to give the networks, how to determine the connection statistics, what sorts of neurons to use, which learning algorithm, etc. One can always push such problems to the meta level and use automated learning to determine which variety of learning algorithm to use - but then one has to make some decisions on the metalearning level, based on one's understanding of the specific structure of the space of relevant program learning algorithms. In the fictitious world of unbounded computational resources no such judicious choices are necessary, but that's not the world we live in, and it's not relevant to the design of human-like AGI systems.

At the moment, in CogPrime, we utilize two different procedure learning systems, which operate on the same knowledge representation and rely on much of the same internal code.
One, which we roughly label "hillclimbing", is used for problems that are sufficiently "easy" in the sense that it's possible for the system to solve them using feasible resources without (implicitly or explicitly) building any kind of sophisticated model of the space of solutions to the problem. The other, MOSES, is used for problems that are sufficiently difficult that the right way to solve them is to progressively build a model of the program space as one tries out various solutions, and then use this model to guide ongoing search for better and better solutions. Hillclimbing is treated in this chapter; MOSES in the next.
32.2 Hillclimbing

"Hillclimbing," broadly speaking, is not a specific algorithm but a category of algorithms. It applies in general to any search problem where there is a large space of possible solutions, which can be compared as to their solution quality. Here we are interested in applying it specifically to problems of search through spaces of programs. In hillclimbing, one starts with a candidate solution to a problem (often a random one, which may be very low-quality), and iteratively makes small changes to the candidate to generate new possibilities, hoping one of them will be a better solution. If a new possibility is better than the current candidate, then the algorithm adopts the new possibility as its new candidate solution. When the current candidate solution can no longer be improved via small changes, the algorithm terminates. Ideally, at that point the current candidate solution is close to optimal - but this is not guaranteed!

Various tweaks to hillclimbing exist, including "restart", which means that one starts hillclimbing over and over again, taking the best solution from multiple trials; and "backtracking", which means that if the algorithm terminates at a solution that seems inadequate, then the search can "backtrack" to a previously considered candidate solution and try to make different small changes to that candidate, trying previously unexplored possibilities in search of a new candidate. The value of these and other tweaks depends on the specific problem under consideration. In the specific approach to hillclimbing described here, we use a hillclimber with backtracking, applied to programs that are represented in the same hierarchical normal form used with MOSES (based on the program normalization ideas presented in Chapter 21).
The basic pseudocode for the hillclimber may be given as:

Let L be the list (initially empty) of programs explored so far, in decreasing order with respect to their fitness. Let Np be the neighbors of program p.

1. Take the best program p ∈ L
2. Evaluate all programs of Np
3. Merge Np into L
4. Move p from L to the set of best programs found so far, and repeat from step 1 until time runs out

In the following sections of this chapter, we show how to speed up the hillclimbing search for learning procedures via four optimizations, which have been tested fairly extensively. For concreteness we will refer often to the specific case of using the hillclimbing algorithm to control a virtual agent in a virtual world - and especially the case of teaching a virtual pet tricks via imitation learning (as in Chapter 31) - but the ideas have more general importance. The four optimizations are:

• reduce candidates to normal form to minimize over-representation and increase the syntactic-semantic correlation (Chapter 21),
• filter perceptions using an entropy criterion to avoid building candidates that involve nodes unlikely to be contained in the solution (Section 32.3),
• use sequences of agent actions, observed during the execution of the program, as building blocks (Section 32.4),
• choose and calibrate a simplicity measure to focus first on simpler solutions (the "Occam bias") (Section 32.5).
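The pseudocode above can be rendered in Python roughly as follows (an illustrative sketch: programs are opaque values here, and the `neighbors` and `fitness` functions are supplied by the caller; in CogPrime they would operate on normalized Combo program trees):

```python
def hillclimb(initial, neighbors, fitness, max_evals=1000):
    explored = {initial: fitness(initial)}  # L: all programs explored so far
    frontier = {initial}                    # explored but not yet expanded
    best_found = []                         # set of best programs found so far
    evals = 1
    while frontier and evals < max_evals:
        # 1. take the best program p in L among those not yet expanded
        p = max(frontier, key=lambda q: explored[q])
        frontier.discard(p)
        # 2. evaluate all programs of Np; 3. merge Np into L
        for q in neighbors(p):
            if q not in explored:
                explored[q] = fitness(q)
                frontier.add(q)
                evals += 1
        # 4. move p to the best-programs-found set; repeat until time runs out
        best_found.append(p)
    return max(best_found, key=lambda q: explored[q])
```

Because the next program to expand is always the best one explored so far, the search automatically "backtracks" to an older candidate whenever the current region of program space stops yielding improvements.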
32.3 Entity and Perception Filters

The number of program candidates of a given size increases exponentially with the alphabet of the language; therefore it is important to narrow that alphabet as much as possible. This is the role of the two filters explained below, the Entity filter and the Entropy filter.

32.3.1 Entity filter

This filter is in charge of selecting the entities in the scene that the pet should take into account during an imitation learning session. These entities can be any objects, avatars or other pets. In general this is a very hard problem: for instance, if a bird is flying near the owner while a trick is being taught, should the pet ignore it? Perhaps the owner wants to teach the pet to bark at birds; if so, they should not be ignored. In our current and prior work with OpenCog controlling virtual world agents, we have used some fairly crude heuristics for entity filtering, which must be hand-tuned depending on the properties of the virtual world. However, our intention is to replace these heuristics with entity filtering based on Economic Attention Networks (ECAN) as described in Chapter 23.

32.3.2 Entropy perception filter

The perception filter is in charge of selecting all perceptions in the scene that are reasonably likely to be part of the solution to the program learning problem posed. A "perception" in the virtual world context means the evaluation of one of a set of pre-specified perception predicates, with an argument consisting of one of the entities in the observed environment. Given N entities (provided by the Entity filter), there are usually O(N²) potential perceptions in the Atomspace, due to binary perceptions like

near(owner bird)
inside(toy box)

The perception filter proceeds by computing the entropy of any potential perception happening during a learning session.
Indeed, if the entropy of a given perception P is high, that means that a conditional if(P E1 E2) has a rather balanced probability of taking branch E1 or E2. On the other hand, if the entropy is low then the probability of taking these branches is unbalanced; for instance, the probability of taking E1 may be significantly higher than the probability of taking E2, and therefore if(P E1 E2) could reasonably be substituted by E1. For example, assume that during the teaching sessions the predicate near(owner bird) is false 99% of the time; then near(owner bird) will have a low entropy and will possibly be discarded by the filter (depending on the threshold). If the bird is always far from the owner then it will have entropy 0 and will surely be discarded, but if the bird comes and goes it will have a high entropy and will pass the filter.

Let P be such a perception and let P_t return 1 when the perception is true at time t, or 0 otherwise, where t ranges over the set of instants, of size N, recorded between the beginning and the end of the demonstrated trick. The calculation goes as follows
Entropy(P) = H( (1/N) Σ_t P_t )

where H(p) = −p log(p) − (1 − p) log(1 − p). There are additional subtleties when the perception involves random operators, like near(owner random_object); that is, the entropy is calculated by taking into account a certain distribution over entities grouped under the term random_object. The calculation is optimized to ignore instants when the perception relates to objects that have not moved, which makes the calculation efficient enough; but there is room to improve it in various ways. For instance, it could be made to choose perceptions based not only on entropy but also on inferred relevancy with respect to the context, using PLN.

32.4 Using Action Sequences as Building Blocks

A heuristic that has been shown to work, in the "virtual pet trick" context, is to consider sequences of actions that are compatible with the behavior demonstrated by the avatar showing the trick as building blocks when defining the neighborhood of a candidate. For instance, if the trick is to fetch a ball, compatible sequences would be

goto(ball), grab(ball), goto(owner), drop
goto(random_object), grab(nearest_object), goto(owner), drop

Sub-sequences can be considered as well - though too many building blocks also increase the neighborhood exponentially, so one has to be careful when doing that. In practice, using the set of whole compatible sequences worked well. This can, for instance, speed up many-fold the learning of the trick triple_kick, as shown in Section 32.6.

32.5 Automatically Parametrizing the Program Size Penalty

A common heuristic for program learning is an "Occam penalty" that penalizes large programs, hence biasing search toward compact programs. The function we use to penalize program size is inspired by Ray Solomonoff's theory of optimal inductive inference [Sol64a, Sol64b]; simply said, a program is penalized exponentially with respect to its size.
Also, one may say that since the number of program candidates grows exponentially with their size, exploring larger solutions must be exponentially more promising to be worth the cost. In the next subsections we describe the particular penalty function we have used and how to tune its parameters.

32.5.1 Definition of the complexity penalty

Let p be a program candidate and penalty(p) a function with range [0,1] measuring the complexity of p. If we treat the complexity penalty function penalty(p) as if it denoted the prior probability of p, and score(p) (the quality of p as utilized within the hill climbing algorithm) as denoting the conditional probability of the desired behavior given p, then
Bayes' rule tells us that fitness(p) = score(p) × penalty(p) denotes the conditional probability of p given the right behavior to imitate: the fitness function that we want to maximize. It happens that in the pet trick learning context, which is our main example in this chapter, score(p) does not denote such a probability; instead it measures how similar the behavior generated by p and the behavior to imitate are. However, we utilize the above formula anyway, with a heuristic interpretation. One may construct assumptions under which score(p) does represent a probability, but this would take us too far afield. The penalty function we use is then given by:

penalty(p) = exp(−a × log(b × |A| + e) × |p|)

where |p| is the program size, |A| its alphabet size and e = exp(1). The reason |A| enters into the equation is that the alphabet size varies from one problem to another due to the perception and action filters; without that, the term log(b × |A| + e) could simply be folded into a. The higher a is, the more intense the penalty. The parameter b controls how that intensity varies with the alphabet size. It is important to remark the difference between such a penalty function and lexicographic parsimony pressure (simply put: everything else being equal, choose the shortest program). Because of the use of sequences as building blocks, without such a penalty function the algorithm may rapidly reach an apparent optimum (a mere long sequence of actions) and remain stuck there, missing the very logic of the action sequence that the human wants to convey.

32.5.2 Parameterizing the complexity penalty

Due to the nature of the search algorithm (hill climbing with restart), the choice of the candidate used to restart the search is crucial. In our case we restart with the candidate with the best fitness so far which has not yet been used as a restart.
The danger of such an approach is that when the algorithm enters a region with local optima (like a plateau), it may basically stay there as long as there exist better candidates in that region than outside of it not yet used for restart. Longer programs tend to generate larger regions of local optima (because they have exponentially more syntactic variations that lead to close behaviors), so if the search enters such a region via an overly complex program it is likely to take a very long time to get out of it. Introducing randomness into the choice of the restart may help avoid that sort of trap, but having experimented with that, it turned out not to be significantly better on average for learning relatively simple things (indeed, although the restart choice is more diverse, it still tends to occur in large regions of local optima). However, a significant improvement we have found is to carefully choose the size penalty function so that the search will tend to restart on simpler programs even if they do not exhibit

† Bayes rule as used here is P(M|D) = P(M)P(D|M)/P(D), where M denotes the model (the program) and D denotes the data (the behavior to imitate); here P(D) is ignored, that is, the data is assumed to be distributed uniformly
the best behaviors, but will still be able to reach the optimal solution even if it is a complex one. The solution we suggest is to choose a and b such that penalty(p) is:

1. as penalizing as possible, to focus on simpler programs first (although that constraint may possibly be lightened, as the experimentation shows),
2. but still correct, in the sense that the optimal solution p maximizes fitness(p).

And we want that to work for all problems we are interested in. That restriction is an important point, because it is likely that in general the second constraint will be too strict to produce a good penalty function. We will now formalize the above problem. Let i be an index that ranges over the set of problems of interest (in our case, pet tricks to learn); score_i and fitness_i denote the score and fitness functions of the i-th problem. Let Θ_i(s) denote the set of programs of score s

Θ_i(s) = {p | score_i(p) = s}

Define a family of partial functions f_i : [0,1] → N so that

f_i(s) = min_{p ∈ Θ_i(s)} |p|

What this says is that for any given score s, f_i(s) is the size of the shortest program p with that score. And f_i is partial because there may not be any program returning a given score. Let g_i be the family of partial functions g_i : [0,1] → [0,1], parametrized by a and b, such that

g_i(s) = s × exp(−a × log(b × |A| + e) × f_i(s))

That is: given a score s, g_i(s) returns the fitness fitness_i(p) of the shortest program p that attains that score.

32.5.3 Definition of the Optimization Problem

Let s_i be the highest score obtained for fitness function i (that is, the score of the program chosen as the current best solution of problem i). Now the optimization problem consists of finding some a and b such that

∀i: argmax_s g_i(s) = s_i

that is, the highest score also has the highest fitness. We started by choosing a and b as high as possible; this is a good heuristic but not the best one, which would be to choose a and b so
that they minimize the number of iterations (number of restarts) needed to reach a global optimum, which is a harder problem. Also, regarding the resolution of the above equation, it is worth noting that we do not need the analytical expression of score_i(p). Using past learning experiences, we can get a partial description of the fitness landscape of each problem just by looking at the traces of the search. Overall we have found this optimization works rather well; that is, tricks that would otherwise take several hours or days of computation can be learned in seconds or minutes. And the method also enables fast learning for new tricks; in fact, all tricks we have experimented with so far could be learned reasonably fast (in seconds or minutes) without the need to retune the penalty function. In the current CogPrime codebase, the algorithm in charge of calibrating the parameters of the penalty function has been written in Python. It takes as input the log of the imitation learning engine, which contains the score, the size, the penalty and the fitness of all candidates explored for all tricks taken into consideration for the parameterizing. The algorithm proceeds in 2 steps:

1. Reconstitute the partial functions f_i for all fitness functions i already attempted, based on the traces of these previously optimized fitness functions.
2. Try to find the highest a and b so that ∀i: argmax_s g_i(s) = s_i.

For step 2, since there are only two parameters to tune, we have used a 2D grid, enumerating all points (a, b) and zooming in when necessary. So the speed of the process depends largely on the resolution of the grid, but (on an ordinary 2009 PC processor) it usually does not require more than 20 minutes to both extract the f_i and find a and b with a satisfactory resolution.
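A minimal sketch of this calibration procedure follows. This is illustrative only: each "problem" is represented as a toy dictionary mapping scores to minimal program sizes (a stand-in for the f_i reconstituted from engine logs), and the grid is coarse with no zooming:

```python
import math

def g(s, size, a, b, alpha_size):
    # g_i(s) = s * exp(-a * log(b*|A| + e) * f_i(s))
    return s * math.exp(-a * math.log(b * alpha_size + math.e) * size)

def valid(a, b, problems):
    """Check that for every problem the top score also has the top g-value.
    Each problem is (alphabet_size, f) where f maps score -> minimal size."""
    for alpha_size, f in problems:
        best_s = max(f)
        g_vals = {s: g(s, sz, a, b, alpha_size) for s, sz in f.items()}
        if max(g_vals, key=g_vals.get) != best_s:
            return False
    return True

def calibrate(problems, grid=None):
    """Return the most penalizing valid (a, b) found on a coarse 2-D grid."""
    if grid is None:
        grid = [i / 100 for i in range(0, 101)]
    best = (0.0, 0.0)
    for a in grid:
        for b in grid:
            if valid(a, b, problems) and (a, b) > best:
                best = (a, b)
    return best
```

For instance, a problem where score 1.0 requires a size-10 program while score 0.5 is reachable at size 3 bounds how aggressive the penalty may be: too large an a makes the short mediocre program out-rank the true solution, violating the correctness constraint.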
32.6 Some Simple Experimental Results

To test the above ideas in a simple context, we initially used them to enable an OpenCog-powered virtual world agent to learn a variety of simple "dog tricks" based on imitation and reinforcement learning in the Multiverse virtual world. We have since deployed them on a variety of other applications in various domains. We began these experiments by running learning on two tricks, fetch_ball and triple_kick, described below, in order to calibrate the size penalty function:

1. fetch_ball, which corresponds to the Combo program

and_seq(goto(ball) grab(ball) goto(owner) drop)

2. triple_kick: if the stick is near the ball then kick 3 times with the left leg, and otherwise 3 times with the right leg. So for that trick the owner had to provide 2 exemplars, one for kickL (with the stick near the ball) and one for kickR, and move the ball away from the stick before showing the second exemplar. Below is the Combo program of triple_kick

if(near(stick ball)
   and_seq(kickL kickL kickL)
   and_seq(kickR kickR kickR))
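For orientation, the "hill climbing with restarts" loop underlying these experiments can be caricatured in a few lines of Python. This is a generic sketch over hashable candidates with a user-supplied neighborhood function, not the actual Combo implementation; it restarts from the best evaluated candidate not yet used as a restart point, as described in Section 32.5.2:

```python
def hillclimb_with_restarts(initial, neighbors, fitness, max_evals=10000):
    """First-improvement hillclimbing; on a local optimum, restart from the
    best already-evaluated candidate not yet used as a restart point."""
    pool = {initial: fitness(initial)}   # cache of all evaluated candidates
    best = initial
    used_restarts = set()
    current = initial
    evals = 1
    while evals < max_evals:
        improved = False
        for n in neighbors(current):
            if n not in pool:
                pool[n] = fitness(n)
                evals += 1
            if pool[n] > pool[current]:
                current = n              # take the first improving neighbor
                improved = True
                break
        if pool[current] > pool[best]:
            best = current
        if not improved:                 # local optimum reached: restart
            used_restarts.add(current)
            candidates = [c for c in pool if c not in used_restarts]
            if not candidates:
                break
            current = max(candidates, key=pool.get)
            used_restarts.add(current)
    return best
```

On a toy one-dimensional problem (maximize -(x-7)^2 over the integers, with neighbors x-1 and x+1), the loop climbs straight to the optimum at x = 7.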
Before choosing an exponential size penalty function and calibrating it, fetch_ball would be learned rather rapidly, in a few seconds, but triple_kick would take more than an hour. After calibration both fetch_ball and triple_kick would be learned rapidly, the latter in less than a minute. Then we experimented with a few new tricks, some simpler, like sit_under_tree

and_seq(goto(tree) sit)

and others more complex, like double_dance, where the trick consists of dancing until the owner emits the message "stop dancing", and changing the dance upon the owner's actions

while(not(says(owner "stop dancing"))
   if(last_action(owner "kickL")
      tap_dance
      lean_rock_dance))

That is, the pet performs a tap_dance when the last action of the owner is kickL, and otherwise performs a lean_rock_dance. We tested learning for 3 tricks, fetch_ball, triple_kick and double_dance. Each trick was tested in 10 settings, denoted conf1 to conf10, summarized in Table 32.1.

• conf1 is the default configuration of the system; the parameters of the size penalty function are a = 0.03 and b = 0.34, which is not exactly what is returned by the calibration technique but close to it. That is because in practice we have found that on average learning works slightly faster with these values.
• conf2 is the configuration with the exact values returned by the calibration, that is a = 0.05, b = 0.94.
• conf3 has the reduction engine disabled.
• conf4 has the entropy filter disabled (the threshold is null so all perceptions pass the filter).
• conf5 has the intensity of the penalty function set to 0.
• conf6 has the penalty function set with low intensity.
• conf7 and conf8 have the penalty function set with high intensity.
• conf9 has the action sequence building block enabled.
• conf10 has the action sequence building block enabled, but with a slightly lower intensity of the size penalty function than normal.
Setting  Reduct  ActSeq  Entropy  a       b
conf1    On      Off     0.1      0.03    0.34
conf2    On      Off     0.1      0.05    0.94
conf3    Off     Off     0.1      0.03    0.34
conf4    On      Off     0        0.03    0.34
conf5    On      Off     0.1      0       0.34
conf6    On      Off     0.1      0.0003  0.34
conf7    On      Off     0.1      0.3     0.34
conf8    On      Off     0.1      3       0.34
conf9    On      On      0.1      0.03    0.34
conf10   On      On      0.1      0.025   0.34

Table 32.1: Settings for each learning experiment
Setting  Percep  Restart  Eval     Time
conf1    3       4        653      5s21
conf2    3       3        245      2s
conf3    3       3        1073     8s42
conf4    136     3        28287    4mn7s
conf5    3       >700     >500000  >1h
conf6    3       3        653      5s21
conf7    3       8        3121     23s42
conf8    3       147      65948    8mn10s
conf9    3       0        89       410ms
conf10   3       0        33       161ms

Table 32.2: Learning time for fetch_ball

Setting  Percep  Restart  Eval   Time
conf1    1       18       2783   21s47
conf2    1       110      11426  1mn53s
conf3    1       49       15069  2mn15s
conf4    124     ∞        ∞      ∞
conf5    1       >800     >200K  >1h
conf6    1       7        1191   9s67
conf7    1       >2500    >200K  >1h
conf8    1       >2500    >200K  >1h
conf9    1       0        107    146ms
conf10   1       0        101    164ms

Table 32.3: Learning time for triple_kick

Setting  Eval    Time
conf1    113     4s
conf2    113     4s
conf3    150     6s20ms
conf4    >60K    >1h
conf5    113     4s
conf6    113     4s
conf7    113     4s
conf8    >300K   >1h
conf9    138     4s191ms
conf10   219K    56mn3s

Table 32.4: Learning time for double_dance

Tables 32.2, 32.3 and 32.4 contain the results of the learning experiments for the three tricks, fetch_ball, triple_kick and double_dance. In each table the column Percep gives the number of perceptions taken into account for the learning, Restart gives the number of restarts hill climbing had to do before reaching the solution, Eval gives the number of evaluations, and Time the search time. In Tables 32.2 and 32.4 we can see that fetch_ball and double_dance are learned in a few seconds both in conf1 and conf2. In Table 32.3, however, learning is about five times faster with conf1 than with conf2, which was the motivation for choosing conf1 as the default configuration; conf2 still performs well though.
As Tables 32.2, 32.3 and 32.4 demonstrate for setting conf3, the reduction engine speeds the search up by a factor of less than two for fetch_ball and double_dance, and by many times for triple_kick. The results for conf4 show the importance of the filtering function: learning is dramatically slowed down without it. A simple trick like fetch_ball took a few minutes instead of seconds, double_dance could not be learned after an hour, and triple_kick might never be learned because the search did not focus on the right perception from the start. The results for conf5 show that without any kind of complexity penalty, learning can be dramatically slowed down, for the reasons explained in Section 32.5; that is, the search loses itself in large regions of sub-optima. Only double_dance was not affected by that, which is probably explained by the fact that only one restart occurred in double_dance and it happened to be the right one. The results for conf6 show that when the action sequence building block is disabled, the intensity of the penalty function can be set even lower; for instance triple_kick is learned faster (9s67 instead of 21s47 for conf1). Conversely, the results for conf10 show that when the action sequence building block is enabled, too weak an Occam's razor can dramatically slow down the search. That is because in this circumstance the search is misled by longer candidates that fit, and takes a very long time before it can reach the more compact optimal solution.

32.7 Conclusion

In our experimentation with hillclimbing for learning pet tricks in a virtual world, we have shown that the combination of

1. candidate reduction into normal form,
2. filtering operators to narrow the alphabet,
3. using action sequences that are compatible with the shown behavior as building blocks,
4.
adequately choosing and calibrating the complexity penalty function,

can speed up imitation learning so that moderately complex tricks can be learned within seconds to minutes instead of hours, using a simple "hill climbing with restarts" learning algorithm. While we have discussed these ideas in the context of pet tricks, they have of course been developed with more general applications in mind, and have been applied in many additional contexts. Combo can be used to represent any sort of procedure, and both the hillclimbing algorithm and the optimization heuristics described here appear broad in their relevance. Natural extensions of the approach described here include the following directions:

1. improving the entity and entropy filters using ECAN and PLN, so that filtering is based not only on entropy but also on relevancy with respect to the context and background knowledge,
2. using transfer learning (see Section 33.5 of Chapter 33) to tune the parameters of the algorithm using contextual and background knowledge.

Indeed these improvements are under active investigation at time of writing, and some may well have been implemented and tested by the time you read this.
Chapter 33
Probabilistic Evolutionary Procedure Learning

Co-authored with Moshe Looks*

33.1 Introduction

The CogPrime architecture fundamentally requires, as one of its components, some powerful algorithm for automated program learning. This algorithm must be able to solve procedure learning problems relevant to achieving human-like goals in everyday human environments, relying on the support of other cognitive processes, and providing them with support in turn. The requirement is not that complex human behaviors need to be learnable via program induction alone, but rather that when the best way for the system to achieve a certain goal seems to be the acquisition of a chunk of procedural knowledge, the program learning component should be able to carry out the requisite learning. As CogPrime is a fairly broadly-defined architecture overall, there are no extremely precise requirements for its procedure learning component. There could be variants of CogPrime in which procedure learning carried more or less weight, relative to other components. Some guidance here may be provided by looking at which tasks are generally handled by humans primarily using procedural learning, a topic on which cognitive psychology has a fair amount to say, and which is also relatively amenable to commonsense understanding based on our introspective and social experience of being human. When we know how to do something, but can't explain very clearly to ourselves or others how we do it, the chances are high that we have acquired this knowledge using some form of "procedure learning" as opposed to declarative learning. This is especially the case if we can do this same sort of thing in many different contexts, each time displaying a conceptually similar series of actions, but adapted to the specific situation.
We would like CogPrime to be able to carry out procedural learning in roughly the same situations ordinary humans can (and potentially other situations as well: maybe even at the start, and definitely as development proceeds), largely via the action of its program learning component. In practical terms, our intuition (based on considerable experience with automated program learning, in OpenCog and other contexts) is that one requires a program learning component capable of learning programs with between dozens and hundreds of program tree nodes, in Combo or some similar representation - not able to learn arbitrary programs of this size, but rather able to solve problems arising in everyday human situations in which the simplest acceptable solutions involve programs of this size. We also suggest that the majority of procedure

* First author
learning problems arising in everyday human situations can be solved via programs with hierarchical structure, so that it likely suffices to be able to learn programs with between dozens and hundreds of program tree nodes, where the programs have a modular structure, consisting of modules each possessing no more than dozens of program tree nodes. Roughly speaking, with only a few dozen Combo tree nodes, complex behaviors seem achievable only via very subtle algorithmic tricks that aren't the sort of thing a human-like mind in the early stages of development could be expected to figure out; whereas, beyond a few hundred Combo tree nodes, one gets into the domain where an automated program learning approach is likely infeasible without rather strong restrictions on the program structure, so that a more appropriate approach within CogPrime would be to use PLN, concept creation or other methods to fuse together the results of multiple smaller procedure learning runs. While simple program learning techniques like hillclimbing (as discussed in Chapter 32 above) can be surprisingly powerful, they do have fundamental limitations, and our experience and intuition both indicate that they are not adequate to serve as CogPrime's primary program learning component. This chapter describes an algorithm that we do believe is thus capable - CogPrime's most powerful and general procedure learning algorithm, MOSES, an integrative probabilistic evolutionary program learning algorithm that was briefly overviewed in Chapter 6 of Part I. While MOSES as currently designed and implemented embodies a number of specific algorithmic and structural choices, at bottom it embodies two fundamental insights that are critical to generally intelligent procedure learning:

• Evolution is the right approach to the learning of difficult procedures
• Enhancing evolution with probabilistic methods is necessary.
Pure evolution, in the vein of the evolution of organisms and species, is too slow for broad use within cognition; so what is required is a hybridization of evolutionary and probabilistic methods, where probabilistic methods provide a more directed approach to generating candidate solutions than is possible with typical evolutionary heuristics like crossover and mutation. We summarize these insights in the phrase Probabilistic Evolutionary Program Learning (PEPL); MOSES is then one particular PEPL algorithm, and in our view a very good one. We have also considered other related algorithms such as the PLEASURE algorithm [Goe08a] (which may also be hybridized with MOSES), but for the time being it appears to us that MOSES satisfies CogPrime's needs. Our views on the fundamental role of evolutionary dynamics in intelligence were briefly presented in Chapter 3 of Part 1. Terrence Deacon said it even more emphatically: "At every step the design logic of brains is a Darwinian logic: overproduction, variation, competition, selection ... it should not come as a surprise that this same logic is also the basis for the normal millisecond-by-millisecond information processing that continues to adapt neural software to the world." [Dea98] He has articulated ways in which, during neurodevelopment, different computations compete with each other (e.g., to determine which brain regions are responsible for motor control). More generally, he posits a kind of continuous flux as control shifts between competing brain regions, again based on high-level "cognitive demand." Deacon's intuition is similar to the one that led Edelman to propose Neural Darwinism [Ede93], and Calvin and Bickerton [CB00] to pose the notion of mind as a "Darwin Machine". The latter have given plausible neural mechanisms ("Darwin Machines") for synthesizing short "programs".
These programs are for tasks such as rock throwing and sentence generation, which are represented as coherent firing patterns in the cerebral cortex. A population of such patterns,
competing for neurocomputational territory, replicates with variations, under selection pressure to conform to background knowledge and constraints. To incorporate these insights, a system is needed that can recombine existing solutions in a non-local synthetic fashion, learning nested and sequential structures, and incorporate background knowledge (e.g. previously learned routines). MOSES is a particular kind of program evolution intended to satisfy these goals, using a combination of probability theory with ideas drawn from genetic programming, and also incorporating some ideas we have seen in previous chapters, such as program normalization. The main conceptual assumption about CogPrime's world, implicit in the suggestion of MOSES as the primary program learning component, is that the goal-relevant knowledge that cannot effectively be acquired by the other methods at CogPrime's disposal (PLN, ECAN, etc.) forms a body of knowledge that can effectively be induced via probabilistic modeling on the space of programs for controlling a CogPrime agent. If this is not true, then MOSES will provide no advantage over simple methods like well-tuned hillclimbing as described in Chapter 32. If it is true, then the effort of deploying a complicated algorithm like MOSES is worthwhile. In essence, the assumption is that there are relatively simple regularities among the programs implementing those procedures that are most critical for a human-like intelligence to acquire via procedure learning rather than other methods.

33.1.1 Explicit versus Implicit Evolution in CogPrime

Of course, the general importance of evolutionary dynamics for intelligence does not imply the need to use explicit evolutionary algorithms in one's AGI system. Evolution can occur in an intelligent system whether or not the low-level implementation layer of the system involves any explicitly evolutionary processes.
For instance, it's clear that the human mind/brain involves evolution in this sense on the emergent level - we create new ideas and procedures by varying and combining ones that we've found useful in the past, and this occurs on a variety of levels of abstraction in the mind. In CogPrime, however, we have chosen to implement evolutionary dynamics explicitly, as well as encouraging them to occur implicitly. CogPrime is intended to display evolutionary dynamics on the derived-hypergraph level, and this is intended to be a consequence of both explicitly-evolutionary and not-explicitly-evolutionary dynamics. Cognitive processes such as PLN inference may lead to emergent evolutionary dynamics (as useful logical relationships are reasoned on and combined, leading to new logical relationships in an evolutionary manner); even though PLN in itself is not explicitly evolutionary in character, it becomes emergently evolutionary via its coupling with CogPrime's attention allocation subsystem, which gives more cognitive attention to Atoms with more importance, and hence creates an evolutionary dynamic with importance as the fitness criterion and the whole constellation of MindAgents as the novelty-generation mechanism. However, MOSES explicitly embodies evolutionary dynamics for the learning of new patterns and procedures that are too complex for hillclimbing or other simple heuristics to handle. And this evolutionary learning subsystem naturally also contributes to the creation of evolutionary patterns on the emergent, derived-hypergraph level.
33.2 Estimation of Distribution Algorithms

There is a long history in AI of applying evolution-derived methods to practical problem-solving; John Holland's genetic algorithm [Hol75], initially a theoretical model, has been adapted successfully to a wide variety of applications (see e.g. the proceedings of the GECCO conferences). Briefly, the methodology applied is as follows:

1. generate a random population of solutions to a problem
2. evaluate the solutions in the population using a predefined fitness function
3. select solutions from the population proportionate to their fitness
4. recombine/mutate them to generate a new population
5. go to step 2

Holland's paradigm has been adapted from the case of fixed-length strings to the evolution of variable-sized and shaped trees (typically Lisp S-expressions), which in principle can represent arbitrary computer programs [Koz92, Koz94]. More recently, replacements-for/extensions-of the genetic algorithm have been developed (for fixed-length strings) which may be described as estimation-of-distribution algorithms (see [Pel05] for an overview). These methods, which outperform genetic algorithms and related techniques across a range of problems, maintain centralized probabilistic models of the population, learned with sophisticated datamining techniques. One of the most powerful of these methods is the Bayesian optimization algorithm (BOA) [Pel05]. The basic steps of the BOA are:

1. generate a random population of solutions to a problem
2. evaluate the solutions in the population using a predefined fitness function
3. from the promising solutions in the population, learn a generative model
4. create new solutions using the model, and merge them into the existing population
5. go to step 2.
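The flavor of steps 3 and 4 can be conveyed with the simplest member of this family, a univariate EDA. This is far weaker than the BOA, which also learns dependencies between variables; here the "generative model" is just one independent probability per bit, and the fitness function is the toy "count the ones" problem:

```python
import random

def univariate_eda(n_bits=20, pop_size=60, n_select=20, generations=60):
    """A minimal univariate EDA (a drastically simplified cousin of BOA):
    instead of crossover/mutation, fit a probability vector to the best
    solutions and sample the next population from it."""
    def fitness(bits):                       # OneMax: count the 1s
        return sum(bits)
    probs = [0.5] * n_bits                   # initial model: uniform
    best = None
    for _ in range(generations):
        pop = [[1 if random.random() < p else 0 for p in probs]
               for _ in range(pop_size)]
        pop.sort(key=fitness, reverse=True)
        elite = pop[:n_select]               # the promising solutions
        # learn the model: marginal frequency of each bit among the elite
        probs = [sum(ind[i] for ind in elite) / n_select
                 for i in range(n_bits)]
        # keep probabilities away from 0/1 so sampling stays diverse
        probs = [min(max(p, 0.02), 0.98) for p in probs]
        if best is None or fitness(pop[0]) > fitness(best):
            best = pop[0]
    return best
```

The BOA replaces the independent per-bit marginals with a Bayesian network fitted to the elite, so that correlated groups of variables (the learned problem decomposition) are sampled jointly rather than independently.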
The neurological implausibility of this sort of algorithm is readily apparent - yet recall that in CogPrime we are attempting to roughly emulate human cognition on the level of behavior, not structure or dynamics. Fundamentally, the BOA and its ilk (the competent adaptive optimization algorithms) differ from classic selecto-recombinative search by attempting to dynamically learn a problem decomposition, in terms of the variables that have been pre-specified. The BOA represents this decomposition as a Bayesian network (a directed acyclic graph with the variables as nodes, and an edge from x to y indicating that y is probabilistically dependent on x). An extension, the hierarchical Bayesian optimization algorithm (hBOA), uses a Bayesian network with local structure to more accurately represent hierarchical dependency relationships. The BOA and hBOA are scalable and robust to noise across the range of nearly decomposable functions. They are also effective, empirically, on real-world problems with unknown decompositions, which may or may not be effectively representable by the algorithms; robust, high-quality results have been obtained for Ising spin glasses and MaxSAT, as well as a variety of real-world problems.
33.3 Competent Program Evolution via MOSES

In this section we summarize meta-optimizing semantic evolutionary search (MOSES), a system for competent program evolution, described more thoroughly in [Loo06]. Based on the viewpoint developed in the previous section, MOSES is designed around the central and unprecedented capability of competent optimization algorithms such as the hBOA: to generate new solutions that simultaneously combine sets of promising assignments from previous solutions, according to a dynamically learned problem decomposition. The novel aspects of MOSES described herein are built around this core to exploit the unique properties of program learning problems. This facilitates effective problem decomposition (and thus competent optimization).

33.3.1 Statics

The basic goal of MOSES is to exploit the regularities in program spaces outlined in the previous section, most critically behavioral decomposability and white box execution, to dynamically construct representations that limit and transform the program space being searched into a relevant subspace with a compact problem decomposition. These representations will evolve as the search progresses.

33.3.1.1 An Example

Let's start with an easy example. What knobs (meaningful parameters to vary) exist for the family of programs depicted in Figure ?? on the left? We can assume, in accordance with the principle of white box execution, that all symbols have their standard mathematical interpretations, and that x, y, and z are real-valued variables. In this case, all three programs correspond to variations on the behavior represented graphically on the right in the figure. Based on the principle of behavioral decomposability, good knobs should express plausible evolutionary variation and recombination of features in behavior space, regardless of the nature of the corresponding changes in program space.
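To make the notion of a knob concrete, here is one hypothetical encoding. The template x1 + x3*cos(x2) is an assumption (the figure is not reproduced here), chosen to be consistent with the small knob basis discussed in this example:

```python
import math

# Hypothetical "representation": a fixed program template x1 + x3*cos(x2)
# plus three knobs.  The template itself is an assumption made for
# illustration; only the knob basis is taken from the example.
KNOBS = {
    "x1": ["x", "z", "0"],      # discrete: left argument of the root +
    "x2": ["y", "x"],           # discrete: argument of cos
    "x3": (0.3, 0.4),           # continuous: multiplier of the cos-term
}

def instantiate(settings, env):
    """Evaluate the template under concrete knob settings, where env
    binds the variables x, y, z to numbers."""
    x1 = 0.0 if settings["x1"] == "0" else env[settings["x1"]]
    return x1 + settings["x3"] * math.cos(env[settings["x2"]])
```

Search then operates over knob-setting vectors rather than raw program trees; two settings that differ slightly yield behaviorally nearby programs, which is exactly the property demanded of a good representation.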
It's worth repeating once more that this goal cannot be meaningfully addressed on a syntactic level - it requires us to leverage background knowledge of what the symbols in our vocabulary (cos, +, 0.35, etc.) actually mean. A good set of knobs will also be orthogonal. Since we are searching through the space of combinations of knob settings (not a single change at a time, but a set of changes), any knob whose effects are equivalent to another knob or combination of knobs is undesirable.† Correspondingly, our set of knobs should span all of the given programs (i.e., be able to represent them as various knob settings). A small basis for these programs could be the 3-dimensional parameter space x1 ∈ {x, z, 0} (left argument of the root node), x2 ∈ {y, x} (argument of cos), and x3 ∈ [0.3, 0.4] (multiplier for the cos-expression). However, this is a very limiting view, and overly tied to the particulars

† First because this will increase the number of samples needed to effectively model the structure of knob-space, and second because this modeling will typically be quadratic with the number of knobs, at least for the BOA or hBOA.
of how these three programs happen to be encoded. Considering the space behaviorally (right of Figure ??), a number of additional knobs can be imagined which might be turned in meaningful ways, such as:

1. numerical constants modifying the phase and frequency of the cosine expression,
2. considering some weighted average of x and y instead of one or the other,
3. multiplying the entire expression by a constant,
4. adjusting the relative weightings of the two arguments to +.

33.3.1.2 Syntax and Semantics

This kind of representation-building calls for a correspondence between syntactic and semantic variation. The properties of program spaces that make this difficult are over-representation and chaotic execution, which lead to non-orthogonality, oversampling of distant behaviors, and undersampling of nearby behaviors, all of which can directly impede effective program evolution.

Non-orthogonality is caused by over-representation. For example, based on the properties of commutativity and associativity, a1 + a2 + ... + an may be expressed in exponentially many different ways, if + is treated as a non-commutative and non-associative binary operator. Similarly, operations such as addition of zero and multiplication by one have no effect, the successive addition of two constants is equivalent to the addition of their sum, etc. These effects are not quirks of real-valued expressions; similar redundancies appear in Boolean formulae (x AND x ≡ x), list manipulation (cdr(cons(x, L)) ≡ L), and conditionals (if x then y else z ≡ if NOT x then z else y). Without the ability to exploit these identities, we are forced to work in a greatly expanded space which represents equivalent expressions in many different ways, and will therefore be very far from orthogonality.
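The kind of heuristic redundancy elimination alluded to here can be sketched in a few lines. The tuple encoding and the particular rewrite rules below (additive identity, multiplicative identity, constant folding, sorting of commutative arguments) are our own illustrative choices, not the actual MOSES reduction library:

```python
# Sketch: expressions as nested tuples like ('+', 'x', 0). Local rewrites are
# applied bottom-up and iterated to a fixed point, mapping many syntactically
# distinct encodings to one canonical form.

def reduce_once(e):
    if not isinstance(e, tuple):
        return e
    op, *args = e
    args = [reduce_once(a) for a in args]
    if op == '+':
        args = [a for a in args if a != 0]                      # x + 0 -> x
        consts = [a for a in args if isinstance(a, (int, float))]
        rest = [a for a in args if not isinstance(a, (int, float))]
        if len(consts) > 1:
            rest.append(sum(consts))                            # fold constants
        else:
            rest += consts
        args = sorted(rest, key=repr)                           # canonical order
        if len(args) == 0: return 0
        if len(args) == 1: return args[0]
    if op == '*':
        args = [a for a in args if a != 1]                      # x * 1 -> x
        if len(args) == 0: return 1
        if len(args) == 1: return args[0]
    return (op, *args)

def normalize(e):
    while True:
        r = reduce_once(e)
        if r == e:
            return r
        e = r

# Two syntactically different encodings reduce to the same canonical form:
print(normalize(('+', 0, ('+', 'x', 1, 2))))   # ('+', 'x', 3)
print(normalize(('+', ('*', 'x', 1), 3)))      # ('+', 'x', 3)
```

In this toy, both inputs collapse to a single representative, which is exactly the effect needed to restore (approximate) orthogonality among knobs.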
Completely eliminating redundancy is infeasible, and typically NP-hard (in the domain of Boolean formulae it is reducible to the satisfiability problem, for instance), but one can go quite far with a heuristic approach.

Oversampling of distant behaviors is caused directly by chaotic execution, as well as by a somewhat subtle effect of over-representation, which can lead to simpler programs being heavily oversampled. Simplicity is defined relative to a given program space in terms of minimal length, the number of symbols in the shortest program that produces the same behavior.

Undersampling of nearby behaviors is the flip side of the oversampling of distant behaviors. As we have seen, syntactically diverse programs can have the same behavior; this can be attributed to redundancy, as well as to non-redundant programs that simply compute the same result by different means. For example, 3*x can also be computed as x+x+x; the first version uses fewer symbols, but neither contains any obvious "bloat" such as addition of zero or multiplication by one. Note however that the nearby behavior of 3.1*x is syntactically close to the former, and relatively far from the latter. The converse is the case for the behavior of x+x+y. In a sense, these two expressions can be said to exemplify differing organizational principles, or points of view, on the underlying function. Differing organizational principles lead to different biases in sampling nearby behaviors. A superior organizational principle (one leading to higher-fitness syntactically nearby programs for a particular problem) might be considered a metaptation (adaptation at the second tier). Since equivalent programs organized according to different principles will have identical fitness, some methodology beyond selection for high fitness must be employed to search for good organizational principles. Thus, the resolution of undersampling of nearby behaviors revolves
around the management of neutrality in search, a complex topic beyond the scope of this chapter.

These three properties of program spaces greatly affect the performance of evolutionary methods based solely on syntactic variation and recombination operators, such as local search or genetic programming. In fact, when quantified in terms of various fitness-distance correlation measures, they can be effective predictors of algorithm performance, although they are of course not the whole story. A semantic search procedure will address these concerns in terms of the underlying behavioral effects of and interactions between a language's basic operators; the general scheme for doing so in MOSES is the topic of the next subsection.

33.3.1.3 Neighborhoods and Normal Forms

The procedure MOSES uses to construct a set of knobs for a given program (or family of structurally related programs) is based on three conceptual steps: reduction to normal form, neighborhood enumeration, and neighborhood reduction.

Reduction to normal form - Redundancy is heuristically eliminated by reducing programs to a normal form. Typically, this will be via the iterative application of a series of local rewrite rules (e.g., ∀x, x + 0 → x), until the target program no longer changes. Note that the well-known conjunctive and disjunctive normal forms for Boolean formulae are generally unsuitable for this purpose; they destroy the hierarchical structure of formulae, and dramatically limit the range of behaviors (in this case Boolean functions) that can be expressed compactly. Rather, hierarchical normal forms for programs are required.

Neighborhood enumeration - A set of possible atomic perturbations is generated for all programs under consideration (the overall perturbation set will be the union of these).
The goal is to heuristically generate new programs that correspond to behaviorally nearby variations on the source program, in such a way that arbitrary sets of perturbations may be composed combinatorially to generate novel valid programs.

Neighborhood reduction - Redundant perturbations are heuristically culled to reach a more orthogonal set. A straightforward way to do this is to exploit the reduction to normal form outlined above; if multiple knobs lead to the same normal-form program, only one of them is actually needed. Additionally, note that the number of symbols in the normal form of a program can be used as a heuristic approximation for its minimal length - if the reduction to normal form of the program resulting from twiddling some knob significantly decreases its size, it can be assumed to be a source of oversampling, and hence eliminated from consideration. A slightly smaller program is typically
a meaningful change to make, but a large reduction in complexity will rarely be useful (and if so, can be accomplished through a combination of knobs that individually produce small changes).

At the end of this process, we will be left with a set of knobs defining a subspace of programs centered around a particular region in program space, and heuristically centered around the corresponding region in behavior space as well. This is part of the meta aspect of MOSES, which seeks not to evaluate variations on existing programs itself, but to construct parameterized program subspaces (representations) containing meaningful variations, guided by background knowledge. These representations are used as search spaces within which an optimization algorithm can be applied.

33.3.2 Dynamics

As described above, the representation-building component of MOSES constructs a parameterized representation of a particular region of program space, centered around a single program or family of closely related programs. This is consistent with the line of thought developed above: a representation constructed across an arbitrary region of program space (e.g., all programs containing less than n symbols), or spanning an arbitrary collection of unrelated programs, is unlikely to produce a meaningful parameterization (i.e., one leading to a compact problem decomposition).

A sample of programs within a region derived from representation-building, together with the corresponding set of knobs, will be referred to herein as a deme;† a set of demes (together spanning an arbitrary area within program space in a patchwork fashion) will be referred to as a metapopulation.†† MOSES operates on a metapopulation, adaptively creating, removing, and allocating optimization effort to various demes.
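The deme/metapopulation dynamic can be caricatured in a few lines of Python. Here a toy bit-string "program", its one-bit knobs, and plain random knob-twiddling stand in for real program representations and for the hBOA, so only the bookkeeping of the dynamic is shown:

```python
import random

random.seed(0)

class Deme:
    """A region of (toy) program space: an exemplar, its knobs, and a sample."""
    def __init__(self, exemplar):
        self.exemplar = exemplar                  # program the region centers on
        self.knobs = list(range(len(exemplar)))   # one binary knob per position
        self.sample = [exemplar]

def fitness(program):                             # toy fitness: count of 1-bits
    return sum(program)

def update(deme, n=20):
    """Sample knob settings near the exemplar; keep the most promising."""
    for _ in range(n):
        candidate = list(deme.exemplar)
        for k in deme.knobs:
            if random.random() < 0.25:            # stand-in for hBOA sampling
                candidate[k] = 1 - candidate[k]
        deme.sample.append(candidate)
    deme.sample.sort(key=fitness, reverse=True)
    deme.sample = deme.sample[:10]

metapop = [Deme([0] * 8)]
for _ in range(5):
    deme = max(metapop, key=lambda d: fitness(d.sample[0]))   # allocate effort
    update(deme)
    best = deme.sample[0]
    if fitness(best) > fitness(deme.exemplar):    # promising program -> new deme
        metapop.append(Deme(best))
        metapop = metapop[-4:]                    # displace less promising demes

print(max(fitness(d.sample[0]) for d in metapop))
```

The real system replaces each stand-in: representation-building constructs the knobs, the hBOA generates the new knob settings, and non-dominance governs deme creation.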
Deme management is the second fundamental meta aspect of MOSES, after (and above) representation-building; it essentially corresponds to the problem of effectively allocating computational resources to competing regions, and hence to competing programmatic organizational-representational schemes.

33.3.2.1 Algorithmic Sketch

The salient aspects of programs and program learning lead to requirements for competent program evolution that can be addressed via a representation-building process such as the one shown above, combined with effective deme management. The following sketch of MOSES, elaborating Figure 33.1 (repeated here from Chapter 8 of Part 1), presents a simple control flow that dynamically integrates these processes into an overall program evolution procedure:

1. Construct an initial set of knobs based on some prior (e.g., based on an empty program) and use it to generate an initial random sampling of programs. Add this deme to the metapopulation.
2. Select a deme from the metapopulation and update its sample, as follows:

† A term borrowed from biology, referring to a somewhat isolated local population of a species.
†† Another term borrowed from biology, referring to a group of somewhat separate populations (the demes) that nonetheless interact.
Fig. 33.1: The top-level architectural components of MOSES, with directed edges indicating the flow of information and program control.

a. Select some promising programs from the deme's existing sample to use for modeling, according to the fitness function.
b. Considering the promising programs as collections of knob settings, generate new collections of knob settings by applying some (competent) optimization algorithm.
c. Convert the new collections of knob settings into their corresponding programs, reduce the programs to normal form, evaluate their fitness, and integrate them into the deme's sample, replacing less promising programs.

3. For each new program that meets the criteria for creating a new deme, if any:

a. Construct a new set of knobs (via representation-building) to define a region centered around the program (the deme's exemplar), and use it to generate a new random sampling of programs, producing a new deme.
b. Integrate the new deme into the metapopulation, possibly displacing less promising demes.

4. Repeat from step 2.

The criterion for creating a new deme is behavioral non-dominance (programs which are not dominated by the exemplars of any existing demes are used as exemplars to create new demes), which can be defined in a domain-specific fashion. As a default, the fitness function may be used to induce dominance, in which case the set of exemplar programs for demes corresponds to the set of top-fitness programs.

33.3.3 Architecture

The preceding algorithmic sketch of MOSES leads to the top-level architecture depicted in Figure ??. Of the four top-level components, only the fitness function is problem-specific. The
representation-building process is domain-specific, while the random sampling methodology and optimization algorithm are domain-general. There is of course the possibility of improving performance by incorporating domain- and/or problem-specific bias into random sampling and optimization as well.

33.3.4 Example: Artificial Ant Problem

Let's go through all of the steps that are needed to apply MOSES to a small problem, the artificial ant on the Santa Fe trail [Koz92], and describe the search process. The artificial ant domain is a two-dimensional grid landscape where each cell may or may not contain a piece of food. The artificial ant has a location (a cell) and orientation (facing up, down, left, or right), and navigates the landscape via a primitive sensor, which detects whether or not there is food in the cell that the ant is facing, and primitive actuators move (take a single step forward), right (rotate 90 degrees clockwise), and left (rotate 90 degrees counter-clockwise). The Santa Fe trail problem is a particular 32 x 32 toroidal grid with food scattered on it (Figure 4), and a fitness function counting the number of unique pieces of food the ant eats (by entering the cell containing the food) within 600 steps (movements and 90 degree rotations are considered single steps).

Programs are composed of the primitive actions taking no arguments, a conditional (if-food-ahead),† which takes two arguments and evaluates one or the other based on whether or not there is food ahead, and progn, which takes a variable number of arguments and sequentially evaluates all of them from left to right. To evaluate a program's fitness, it is executed continuously until 600 time steps have passed, or all of the food is eaten (whichever comes first). Thus for example, the program if-food-ahead(m, r) moves forward as long as there is food ahead of it, at which point it rotates clockwise until food is again spotted.
It can successfully navigate the first two turns of the Santa Fe trail, but cannot cross "gaps" in the trail, giving it a final fitness of 11.

The first step in applying MOSES is to decide what our reduction rules should look like. This program space has several clear sources of redundancy leading to over-representation that we can eliminate, leading to the following reduction rules:

1. Any sequence of rotations may be reduced to either a left rotation, a right rotation, or a reversal; for example, progn(left, left, left) reduces to right.
2. Any if-food-ahead statement which is the child of an if-food-ahead statement may be eliminated, as one of its branches is clearly irrelevant; for example, if-food-ahead(m, if-food-ahead(l, r)) reduces to if-food-ahead(m, r).

† This formulation is equivalent to using a general three-argument if-then-else statement with a predicate as the first argument, as there is only a single predicate (food-ahead) for the ant problem.
3. Any progn statement which is the child of a progn statement may be eliminated and replaced by its children; for example, progn(progn(left, move), move) reduces to progn(left, move, move).

The representation language for the ant problem is simple enough that these are the only three rules needed - in principle there could be many more. The first rule may be seen as a consequence of general domain-knowledge pertaining to rotation. The second and third rules are fully general simplification rules based on the semantics of if-then-else statements and associative functions (such as progn), respectively.

These rules allow us to naturally parameterize a knob space corresponding to a given program (note that the arguments to the progn and if-food-ahead functions will be recursively reduced and parameterized according to the same procedure). Rotations will correspond to knobs with four possibilities (left, right, reversal, no rotation). Movement commands will correspond to knobs with two possibilities (move, no movement). There is also the possibility of introducing a new command in between, before, or after existing commands. Some convention (a "canonical form") for our space is needed to determine how the knobs for new commands will be introduced: a representation consists of a rotation knob, followed by a conditional knob, followed by a movement knob, followed by a rotation knob, etc.† The structure of the space (how large and what shape) and the default knob values will be determined by the "exemplar" program used to construct it. The default values are used to bias the initial sampling to focus around the prototype associated with the exemplar: all of the n direct neighbors of the prototype are first added to the sample, followed by a random selection of n programs at a distance of two from the prototype, n programs at a distance of three, etc., until the entire sample is filled.
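The three reduction rules are simple enough to implement directly. The tuple encoding below ('m' = move, 'l' = left, 'r' = right, 'v' = reversal) is our own illustrative convention, and this is a single bottom-up pass where a real reducer would iterate to a fixed point:

```python
# Sketch of the three ant-domain reduction rules on tuple-encoded programs.

ROT = {'l': 1, 'v': 2, 'r': 3}           # rotations as counter-clockwise quarter-turns
FROM_TURNS = {1: 'l', 2: 'v', 3: 'r'}

def reduce_ant(prog):
    if isinstance(prog, str):
        return prog
    op, *args = prog
    if op == 'if-food-ahead':
        arms = []
        for i, a in enumerate(args):
            # Rule 2: a conditional nested directly in arm i always takes its own arm i.
            if isinstance(a, tuple) and a[0] == 'if-food-ahead':
                a = a[1 + i]
            arms.append(reduce_ant(a))
        return ('if-food-ahead', *arms)
    if op == 'progn':
        flat = []
        for a in (reduce_ant(a) for a in args):
            # Rule 3: splice child progns into the parent.
            flat.extend(a[1:] if isinstance(a, tuple) and a[0] == 'progn' else [a])
        out, turns = [], 0
        for a in flat + [None]:          # Rule 1: collapse runs of rotations
            if isinstance(a, str) and a in ROT:
                turns = (turns + ROT[a]) % 4
            else:
                if turns:
                    out.append(FROM_TURNS[turns])
                turns = 0
                if a is not None:
                    out.append(a)
        return ('progn', *out)
    return prog

print(reduce_ant(('progn', ('progn', 'l', 'l', 'l'), 'm')))              # ('progn', 'r', 'm')
print(reduce_ant(('if-food-ahead', 'm', ('if-food-ahead', 'l', 'r'))))   # ('if-food-ahead', 'm', 'r')
```

Both outputs match the rule examples in the text: three lefts collapse to one right, and the irrelevant arm of the nested conditional disappears.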
Note that the hBOA can of course effectively recombine this sample to generate novel programs at any distance from the initial prototype. The empty program progn (which can be used as the initial exemplar for MOSES), for example, leads to the following prototype:

progn(
  rotate? [default no rotation],
  if-food-ahead(
    progn(rotate? [default no rotation],
          move? [default no movement]),
    progn(rotate? [default no rotation],
          move? [default no movement])),
  move? [default no movement])

There are six parameters here, three of which are quaternary (rotate), and three of which are binary (move). So the program

† That there is some fixed ordering on the knobs is important, so that two rotation knobs are not placed next to each other (as this would introduce redundancy). In this case, the precise ordering chosen (rotation, conditional, movement) does not appear to be critical.
progn(left, if-food-ahead(move, left)) would be encoded in the space as

[left, no rotation, move, left, no movement, no movement]

with knobs ordered according to a pre-order left-to-right traversal of the program's parse tree (this is merely for exposition; the ordering of the parameters has no effect on MOSES). For a prototype program already containing an if-food-ahead statement, nested conditionals would be considered.

Technique                  Computational Effort
Genetic Programming        450,000 evaluations
Evolutionary Programming   136,000 evaluations
MOSES                      23,000 evaluations

Fig. 33.2: On the top, histogram of the number of global optima found after a given number of program evaluations for 100 runs of MOSES on the artificial ant problem (each run is counted once for the first global optimum reached). On the bottom, computational effort required to find an optimal solution for various techniques with probability p=.99 (for MOSES p=1, since an optimal solution was found in all runs).

A space with six parameters in it is small enough that MOSES can reliably find the optimum (the program progn(right, if-food-ahead(progn(), left), move)), with a very small population. After no further improvements have been made in the search for a specified number of generations (calculated based on the size of the space, according to a model derived from [23] that is general to the hBOA, and not at all tuned for the artificial ant problem), a new representation is constructed centered around this program.† Additional knobs are introduced "in between" all

† MOSES reduces the exemplar program to normal form before constructing the representation; in this particular case, however, no transformations are needed. Similarly, in general, neighborhood reduction would be
existing ones (e.g., an optional move in between the first rotation and the first conditional), and possible nested conditionals are considered (a nested conditional occurring in a sequence after some other action has been taken is not redundant). The resulting space has 39 knobs, still quite tractable for hBOA, which typically finds a global optimum within a few generations. If the optimum were not to be found, MOSES would construct a new (possibly larger or smaller) representation, centered around the best program that was found, and the process would repeat.

The artificial ant problem is well-studied, with published benchmark results available for genetic programming as well as for evolutionary programming based solely on mutation (i.e., a form of population-based stochastic hill climbing). Furthermore, an extensive analysis of the search space has been carried out by Langdon and Poli [LP02], with the authors concluding:

1. The problem is "deceptive at all levels", meaning that the partial solutions that must be recombined to solve the problem to optimality have lower average fitness than the partial solutions that lead to inferior local optima.
2. The search space contains many symmetries (e.g., between left and right rotations).
3. There is an unusually high density of global optima in the space (relative to other common test problems).
4. Even though current evolutionary methods can solve the problem, they are not significantly more effective (in terms of the number of program evaluations required) than random sampling.
5. "If real program spaces have the above characteristics (we expect them to do so but be still worse) then it is important to be able to demonstrate scalable techniques on such problem spaces".
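Fitness evaluation in this domain is easy to sketch. Below, a small hand-made 8x8 toroidal trail and a reduced step budget are stand-ins for the 32x32 Santa Fe trail and its 600-step budget; programs are encoded as nested tuples ('m' = move, 'l' = left, 'r' = right), an illustrative convention of our own:

```python
# Toy fitness evaluator for ant programs: count unique food cells eaten
# within a step budget, looping the program until budget or food runs out.

DIRS = [(0, -1), (1, 0), (0, 1), (-1, 0)]    # up, right, down, left

def evaluate(program, trail, budget=60):
    food = set(trail)
    x, y, d, eaten, steps = 0, 0, 1, 0, 0    # start at origin, facing right

    def ahead():
        dx, dy = DIRS[d]
        return ((x + dx) % 8, (y + dy) % 8)  # toroidal 8x8 grid

    def run(p):
        nonlocal x, y, d, eaten, steps
        if steps >= budget or not food:
            return
        if p == 'm':                          # move one step forward
            x, y = ahead(); steps += 1
            if (x, y) in food:
                food.discard((x, y)); eaten += 1
        elif p == 'l':                        # rotate counter-clockwise
            d = (d - 1) % 4; steps += 1
        elif p == 'r':                        # rotate clockwise
            d = (d + 1) % 4; steps += 1
        elif p[0] == 'if-food-ahead':         # evaluate exactly one arm
            run(p[1] if ahead() in food else p[2])
        elif p[0] == 'progn':                 # sequential evaluation
            for a in p[1:]:
                run(a)

    while steps < budget and food:            # loop the program, as in the text
        run(program)
    return eaten

trail = [(1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (4, 2)]
print(evaluate(('if-food-ahead', 'm', 'r'), trail))   # eats all 6 pieces -> 6
```

On this gap-free toy trail the "move if food ahead, else rotate clockwise" program from the text eats everything; on a trail with gaps it would stall, mirroring the fitness-11 behavior described for the Santa Fe trail.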
33.3.4.1 Test Results

Koza [Koz92] reports on a set of 148 runs of genetic programming with a population size of 500, which had a 16% success rate after 51 generations, when the runs were terminated (a total of 25,500 program evaluations per run). The minimal "computational effort" needed to achieve success with 99% probability, attained by processing through generation 14, was 450,000 (based on parallel independent runs). Chellapilla [Che97] reports 47 out of 50 successful runs with a minimal computational effort (again, for success with 99% probability) of 136,000 for his stochastic hill climbing method.

In our experiment with the artificial ant problem, one hundred runs of MOSES were executed. Beyond the domain knowledge embodied in the reduction and knob construction procedure, the only parameter that needed to be set was the population scaling factor, which was set to 30 (MOSES automatically adjusts to generate a larger population as the size of the representation grows, with the base case determined by this factor). Based on these "factory" settings, MOSES found optimal solutions on every run out of 100 trials, within a maximum of 23,000 program evaluations (the computational effort figure corresponding to 100% success). The average number of program evaluations required was 6952, with 95% confidence intervals of ±856 evaluations.

Why does MOSES outperform other techniques? One factor to consider first is that the language programs are evolved in is slightly more expressive than that used for the other

used to eliminate any extraneous knobs (based on domain-specific heuristics). For the ant domain, however, no such reductions are necessary.
techniques; specifically, a progn is allowed to have no children (if all of its possible children are "turned off"), leading to the possibility of if-food-ahead statements which do nothing if food is present (or not present). Indeed, many of the smallest solutions found by MOSES exploit this feature. This can be tested by inserting a "do nothing" operation into the terminal set for genetic programming (for example). Indeed, this reduces the computational effort to 272,000; an interesting effect, but still over an order of magnitude short of the results obtained with MOSES (the success rate after 50 generations is still only 20%).

Another possibility is that the reductions in the search space via simplification of programs alone are responsible. However, the results of past attempts at introducing program simplification into genetic programming systems [27, 28] have been mixed; although the system may be sped up (because programs are smaller), no dramatic improvement in results has been noted. To be fair, these results have been primarily focused on the symbolic regression domain; I am not aware of any results for the artificial ant problem.

The final contributor to consider is the sampling mechanism (knowledge-driven knob-creation followed by probabilistic model-building). We can test to what extent model-building contributes to the bottom line by simply disabling it and assuming probabilistic independence between all knobs. The result here is of interest because model-building can be quite expensive (O(n²N) per generation, where n is the problem size and N is the population size†). In 50 independent runs of MOSES without model-building, a global optimum was still discovered in all runs. However, the variance in the number of evaluations required was much higher (in two cases over 100,000 evaluations were needed).
The new average was 26,355 evaluations to reach an optimum (about 3.5 times more than required with model-building). The contribution of model-building to the performance of MOSES is expected to be even greater for more difficult problems. Applying MOSES without model-building (i.e., with a model assuming no interactions between variables) is a way to test the combination of representation-building with an approach resembling the probabilistic incremental program learning (PIPE) algorithm [SS03], which learns programs based on a probabilistic model without any interactions. PIPE has been shown to provide results competitive with genetic programming on a number of problems (regression, agent control, etc.).

It is additionally possible to look inside the models that the hBOA constructs (based on the empirical statistics of successful programs) to see what sorts of linkages between knobs are being learned.†† For the 6-knob model given above, for instance, an analysis of the linkages learned shows that the three most common pairwise dependencies uncovered, occurring in over 90% of the models across 100 runs, are between the rotation knobs. No other individual dependencies occurred in more than 32% of the models. This preliminary finding is quite significant given Langdon and Poli's findings on symmetry, and their observation that "[t]hese symmetries lead to essentially the same solutions appearing to be the opposite of each other. E.g. either a pair of Right or pair of Left terminals at a particular location may be important."

In this relatively simple case, all of the components of MOSES appear to mesh together to provide superior performance - which is promising, though it of course does not prove that these same advantages will apply across the range of problems relevant to human-level AGI.

† The fact that reduction to normal form tends to reduce the problem size is another synergy between it and the application of probabilistic model-building.
†† There is in fact even more information available in the hBOA models concerning hierarchy and direction of dependence, but this is difficult to analyze.
33.3.5 Discussion

The overall MOSES design is unique. However, it is instructive at this point to compare its two primary unique facets (representation-building and deme management) to related work in evolutionary computation.

Rosca's adaptive representation architecture [Ros99] is an approach to program evolution which also alternates between separate representation-building and optimization stages. It is based on Koza's genetic programming, and modifies the representation based on a syntactic analysis driven by the fitness function, as well as a modularity bias. The representation-building that takes place consists of introducing new compound operators, and hence modifying the implicit distance function in tree-space. This modification is uniform, in the sense that the new operators can be placed in any context, without regard for semantics. In contrast to Rosca's work and other approaches to representation-building such as Koza's automatically defined functions [KA95], MOSES explicitly addresses the underlying (semantic) structure of program space independently of the search for any kind of modularity or problem decomposition. This preliminary stage critically changes neighborhood structures (syntactic similarity) and other aggregate properties of programs.

Regarding deme management, the embedding of an evolutionary algorithm within a superordinate procedure maintaining a metapopulation is most commonly associated with "island model" architectures [SWM90]. One of the motivations articulated for using island models has been to allow distinct islands to (usually implicitly) explore different regions of the search space, as MOSES does explicitly. MOSES can thus be seen as a very particular kind of island model architecture, where programs never migrate between islands (demes), and islands are created and destroyed dynamically as the search progresses.
In MOSES, optimization does not operate directly on program space, but rather on a subspace defined by the representation-building process. This subspace may be considered as being defined by a sort of template assigning values to some of the underlying dimensions (e.g., it restricts the size and shape of any resulting trees). The messy genetic algorithm [GKD89], an early competent optimization algorithm, uses a similar mechanism - a common "competitive template" is used to evaluate candidate solutions to the optimization problem, which are themselves underspecified. Search consequently centers on the template(s), much as search in MOSES centers on the programs used to create new demes (and thereby new representations). The issue of deme management can thus be seen as analogous to the issue of template selection in the messy genetic algorithm.

33.3.6 Conclusion

Competent evolutionary optimization algorithms are a pivotal development, allowing encoded problems with compact decompositions to be tractably solved according to normative principles. We are still faced with the problem of representation-building - casting a problem in terms of knobs that can be twiddled to solve it. Hopefully, the chosen encoding will allow for a compact problem decomposition. Program learning problems in particular rarely possess compact decompositions, due to particular features generally present in program spaces (and in the mapping between programs and behaviors). This often leads to intractable problem formulations, even if the mapping between behaviors and fitness has an intrinsic separable or nearly decomposable structure. As a consequence, practitioners must often resort to manually carrying out the analogue of representation-building, on a problem-specific basis. Working under the thesis that the properties of programs and program spaces can be leveraged as inductive bias to remove the burden of manual representation-building, leading to competent program evolution, we have developed the MOSES system, and explored its properties.

While the discussion above has highlighted many of the features that make MOSES uniquely powerful, in a sense it has told only half the story. Part of what makes MOSES valuable for CogPrime is that it's good on its own; the other part is that it cooperates well with the other cognitive processes within CogPrime. We have discussed aspects of this already in Chapter 8 of Part 1, especially in regard to the MOSES/PLN relationship. In the following section we proceed further to explore the interaction of MOSES with other aspects of the CogPrime system - a topic that will arise repeatedly in later chapters as well.

33.4 Integrating Feature Selection Into the Learning Process

In the typical workflow of applied machine learning, one begins with a large number of features, each applicable to some or all of the entities one wishes to learn about; then one applies some feature selection heuristics to whittle down the large set of features into a smaller one; then one applies a learning algorithm to the reduced set of features. The reason for this approach is that the more powerful among the existing machine learning algorithms tend to get confused when supplied with too many features. The problem with this approach is that sometimes one winds up throwing out potentially very useful information during the feature selection phase. This same sort of problem exists with MOSES in its simplest form, as described above.
The human mind, as best we understand it, does things a bit differently than this standard "feature selection followed by learning" process. It does seem to perform operations analogous to feature selection, and operations analogous to the application of a machine learning algorithm to a reduced feature set - but then it also involves feedback from these "machine learning like" operations to the "feature selection like" operations, so that the intermediate results of learning can cause the introduction into the learning process of features additional to those initially selected, thus allowing the development of better learning results.

Compositional spatiotemporal deep learning (CSDLN) architectures like HTM or DeSTIN [ARC09a], as discussed in Chapter 27, incorporate this same sort of feedback. The lower levels of such an architecture, in effect, carry out "feature selection" for the upper levels - but then feedback from the upper to the lower levels also occurs, thus in effect modulating the "feature selection like" activity at the lower levels based on the more abstract learning activity at the upper levels. However, such CSDLN architectures are specifically biased toward recognition of certain sorts of patterns - an aspect that may be considered a bug or a feature of this class of learning architecture, depending on the context. For visual pattern recognition, it appears to be a feature, since the hierarchical structure of such algorithms roughly mimics the architecture of visual cortex. For automated learning of computer programs carrying out symbolic tasks, on the other hand, CSDLN architectures are awkward at best and probably generally inappropriate. For cases like language learning or abstract conceptual inference, the jury is still out.
In this section we explore the question of how to introduce an appropriate feedback between feature selection and learning in the case of machine learning algorithms with general scope and without explicit hierarchical structure - such as MOSES. We introduce a specific technique
enabling this, which we call LIFES, short for Learning-Integrated Feature Selection. We argue that LIFES is particularly applicable to learning problems that possess the conjunction of two properties that we call data focusability and feature focusability. We illustrate LIFES in a MOSES context, by describing a specific incarnation of the LIFES technique that does feature selection repeatedly during the MOSES learning process, rather than just doing it initially, prior to MOSES learning.

33.4.1 Machine Learning, Feature Selection and AGI

The relation between feature selection and machine learning appears an excellent example of the way that, even when the same basic technique is useful in both narrow AI and AGI, the method of utilization is often quite different. In most applied machine learning tasks, the need to customize feature selection heuristics for each application domain (and in some cases, each particular problem) is not a major difficulty. This need does limit the practical utilization of machine learning algorithms, because it means that many ML applications require an expert user who understands something about machine learning, both to deal with feature selection issues and to interpret the results. But it doesn't stand in the way of ML's fundamental usability. In an AGI context, on the other hand, the situation is different, and the need for human-crafted, context-appropriate feature selection does stand in the way of the straightforward insertion of most ML algorithms into an integrative AGI system. For instance, in the OpenCog integrative AGI architecture that we have co-architected, the MOSES automated program learning algorithm plays a key role. It is OpenCog's main algorithm for acquiring procedural knowledge, and is used for generating some sorts of declarative knowledge as well.
However, when MOSES tasks are launched automatically via the OpenCog scheduler based on an OpenCog agent's goals, there is no opportunity for the clever choice of feature selection heuristics based on the particular data involved. And crude feature selection heuristics based on elementary statistics are often insufficiently effective, as they rule out too many valuable features (and sometimes rule out the most critical features). In this context, having a variant of MOSES that can sift through the scope of possible features in the course of its learning is very important.

An example from the virtual dog domain pursued in our earlier work would be as follows. Each procedure learned by the virtual dog combines a number of different actions, such as "step forward", "bark", "turn around", "look right", "lift left front leg", etc. In the virtual dog experiments done previously, the number of different actions permitted to the dog was less than 100, so that feature selection was not a major issue. However, this was an artifact of the relatively simplistic nature of the experiments conducted. For a real organism, or for a robot that learns its own behavioral procedures (say, via a deep learning algorithm) rather than using a pre-configured set of "animated" behaviors, the number of possible behavioral procedures to potentially be combined using a MOSES-learned program may be very large. In this case, one must either use some crude feature selection heuristic, have a human select the features, or use something like the LIFES approach described here. LIFES addresses a key problem in moving from the relatively simple virtual dog work done before, to related work with virtual agents displaying greater general intelligence.

As another example, suppose an OpenCog-controlled agent is using MOSES to learn procedures for navigating in a dynamic environment. The features that candidate navigation procedures will want to pay attention to may be different in a well-lit environment than in a dark environment. However, if the MOSES learning process is being launched internally via OpenCog's goal system, there is no opportunity for a human to adjust the feature selection heuristics based on the amount of light in the environment. Instead, MOSES has got to figure out what features to pay attention to all by itself. LIFES is designed to allow MOSES (or other comparable learning algorithms) to do this.

So far we have tested LIFES in genomics and other narrow-AI application areas, as a way of initially exploring and validating the technique. As our OpenCog work proceeds, we will explore more AGI-oriented applications of MOSES-LIFES. This will be relatively straightforward on a software level, as MOSES is fully integrated with OpenCog.

33.4.2 Data- and Feature-Focusable Learning Problems

Learning-integrated feature selection as described here is applicable across multiple domain areas and types of learning problem - but it is not completely broadly applicable. Rather, it is most appropriate for learning problems possessing two properties we call data focusability and feature focusability. While these properties can be defined with mathematical rigor, here we will not be proving any theorems about them, so we will content ourselves with semi-formal definitions, sufficient to guide practical work.

We consider a fitness function φ, defined on a space of programs f whose inputs are features defined on elements of a reference dataset S, and whose outputs lie in the interval [0,1]. The features are construed as functions mapping elements of S into [0,1]. Where F(x) = (F1(x), ..., Fn(x)) is the vector of features evaluated on x ∈ S, we use f(x) as a shorthand for f(F(x)).
We are specifically interested in φ which are "data focusable", in the sense that, for a large number of highly fit programs f, there is some subset S_f ⊆ S on which f is highly concentrated (note that S_f will be different for different f). By "concentrated" it is meant that the ratio

    ( Σ_{x ∈ S_f} f(x) ) / ( Σ_{x ∈ S} f(x) )

is large. A simple case is where f is Boolean and f(x) = 1 ⟺ x ∈ S_f.

One important case is where φ is "property-based", in the sense that each element x ∈ S has some Boolean or numeric property p(x), and the fitness function φ(f) rewards f for predicting p(x) given x for x ∈ S_f, where S_f is some non-trivial subset of S. For example, each element of S might belong to some category, and the fitness function might represent the problem of placing elements of S into the proper category - but with the twist that f gets rewarded if it accurately places some subset S_f of elements in S into the proper category, even if it has nothing to say about the elements in S but not in S_f.

For instance, consider the case where S is a set of images. Suppose the function p(x) indicates whether the image x contains a picture of a cat or not. Then, a suitable fitness function φ would be one measuring whether there is some non-trivially large set of images S_f so that if x ∈ S_f, then f can accurately predict whether x contains a picture of a cat or not. A key point is that the fitness function doesn't care whether f can accurately predict whether x contains a picture of a cat or not, for x outside S_f.
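The concentration ratio in the definition above can be computed directly. The following is a small illustrative sketch; the function name `concentration` and the toy data are ours, not part of MOSES:

```python
def concentration(f, S, S_f):
    """Fraction of f's total 'mass' over S that falls on the subset S_f.
    A program f is concentrated on S_f when this ratio is close to 1."""
    total = sum(f(x) for x in S)
    return sum(f(x) for x in S_f) / total if total else 0.0

# Boolean case: f(x) = 1 exactly when x is in S_f, giving concentration 1.
S = list(range(10))
S_f = {2, 3, 5}
f = lambda x: 1.0 if x in S_f else 0.0
print(concentration(f, S, S_f))  # -> 1.0
```

By contrast, a program that responds uniformly over all of S would score only |S_f|/|S| here, i.e. it is not focused on any particular subset.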
Or, consider the case where S is a discrete series of time points, and p(x) indicates the value of some quantity (say, a person's EEG) at a certain point in time. Then a suitable fitness function φ might measure whether there is some non-trivially large set of time-points S_f so that if x ∈ S_f, then f can accurately predict whether x will be above a certain level L or not.

Finally, in addition to the property of data-focusability introduced above, we will concern ourselves with the complementary property of "feature-focusability." This means that, while the elements of S are each characterized by a potentially large set of features, there are many highly fit programs f that utilize only a small subset of this large set of features. The case of most interest here is where there are various highly fit programs f, each utilizing a different small subset of the overall large set of features. In this case one has (loosely speaking) a pattern recognition problem, with approximate solutions comprising various patterns that combine various different features in various different ways. For example, this would be the case if there were many different programs for recognizing pictures containing cats, each one utilizing different features of cats and hence applying to different subsets of the overall database of images.

There may, of course, be many important learning problems that are neither data nor feature focusable. However, the LIFES technique presented here for integrating feature selection into learning is specifically applicable to objective functions that are both data and feature focusable. In this sense, the conjunction of data and feature focusability appears to be a kind of "tractability" that allows one to bypass the troublesome separation of feature selection and learning, and straightforwardly combine the two into a single integrated process.
Being property-based in the sense described above does not seem to be necessary for the application of LIFES, though most practical problems do seem to be property-based.

33.4.3 Integrating Feature Selection Into Learning

The essential idea proposed here is a simple one. Suppose one has a learning problem involving a fitness function that is both data and feature focusable. And suppose that, in the course of learning according to some learning algorithm, one has a candidate program f, which is reasonably fit but merits improvement. Suppose that f uses a subset F_f of the total set F of possible input features. Then, one may do a special feature selection step, customized just for f. Namely, one may look at the total set F of possible features, and ask which features or small feature-sets display desirable properties on the set S_f. This will lead to a new set of features potentially worthy of exploration; let's call it F'_f. We can then attempt to improve f by creating variants of f introducing some of the features in F'_f - either replacing features in F_f or augmenting them. The process of creating and refining these variants will then lead to new candidate programs g, potentially concentrated on sets S_g different from S_f, in which case the process may be repeated. This is what we call LIFES - Learning-Integrated Feature Selection.

As described above, the LIFES process is quite general, and applies to a variety of learning algorithms - basically any learning algorithm that includes the capability to refine a candidate solution via the introduction of novel features. The nature of the "desirable properties" used to evaluate candidate features or feature-sets on S_f needs to be specified, but a variety of standard techniques may be used here (along with more advanced ideas) - for instance, in the case where the fitness function is defined in terms of some property mapping p as described
above, then given a feature F', one can calculate the mutual information of F' with p over S_f. Other measures than mutual information may be used here as well.

The LIFES process doesn't necessarily obviate the need for up-front feature selection. What it does is prevent up-front feature selection from limiting the ultimate feature usage of the learning algorithm. It allows the initially selected features to be used as a rough initial guide to learning - and allows the candidates learned using these initial features to then be refined and improved using additional features chosen opportunistically along the learning path. In some cases, the best programs ultimately learned via this approach might not end up involving any of the initially selected features.

33.4.4 Integrating Feature Selection into MOSES Learning

The application of the general LIFES process in the MOSES context is relatively straightforward. Quite simply, given a reasonably fit program f produced within a deme, one then isolates the set S_f on which f is concentrated, and identifies a set F'_f of features within F that display desirable properties relative to S_f. One then creates a new deme with exemplar f, and with a set of potential input features consisting of F_f ∪ F'_f.

What does it mean to create a deme with a certain set of "potential input features" F_f ∪ F'_f? Abstractly, it means that the deme's feature set is F_f ∪ F'_f. Concretely, it means that the knobs in the deme's exemplar must be supplied with settings corresponding to the elements of F_f ∪ F'_f. The right way to do this will depend on the semantics of the features. For instance, it may be that the overall feature space F is naturally divided into groups of features. In that case, each new feature F' in F'_f would be added, as a potential knob setting, to any knob in f corresponding to a feature in the same group as F'.
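The mutual-information scoring step used in this LIFES-MOSES incarnation can be sketched as follows. The function names and toy data are hypothetical illustrations, not MOSES internals: each unused feature is scored by its mutual information with the property p, restricted to the focus set S_f.

```python
from math import log2
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information between two discrete sequences, in bits."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

def lifes_step(feature_values, p, S_f, used):
    """One LIFES feature-selection step: score every unused feature by its
    mutual information with the property p, restricted to the focus set S_f,
    and return candidate feature indices ranked best-first."""
    candidates = []
    for j, col in enumerate(feature_values):
        if j in used:
            continue
        xs = [col[i] for i in S_f]
        ys = [p[i] for i in S_f]
        candidates.append((mutual_information(xs, ys), j))
    return [j for _, j in sorted(candidates, reverse=True)]

# Feature 2 predicts p perfectly on the focus set S_f; feature 1 is constant.
feature_values = [[0, 0, 1, 1, 0, 1],   # feature 0, already used by f
                  [1, 1, 1, 1, 1, 1],   # feature 1, uninformative
                  [0, 1, 0, 1, 0, 1]]   # feature 2, matches p on S_f
p = [0, 1, 0, 1, 1, 1]
S_f = [0, 1, 2, 3]
ranked = lifes_step(feature_values, p, S_f, used={0})
print(ranked[0])  # -> 2
```

The top-ranked features form the candidate set F'_f from which the new deme's potential input features are drawn.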
On the other hand, if there is no knob in f corresponding to features in F''s group, then one has a different situation, and it is necessary to "mutate" f, adding a new node with a new kind of knob corresponding to F', or replacing an existing node with a new one corresponding to F'.

33.4.5 Application to Genomic Data Classification

To illustrate the effectiveness of LIFES in a MOSES context, we now briefly describe an example application, in the genomics domain. The application of MOSES to gene expression data is described in more detail elsewhere, and is only very briefly summarized here. To obtain the results summarized here, we have used MOSES, with and without LIFES, to analyze an Alzheimer's SNP (single nucleotide polymorphism) dataset previously analyzed using ensemble genetic programming. The dataset is of the form "Case vs. Control", where the Case category consists of data from individuals with Alzheimer's and Control consists of matched controls. MOSES was used to learn Boolean program trees embodying predictive models that take in a subset of the genes in an individual, and output a Boolean combination of their discretized expression values, which is interpreted as a prediction of whether the individual is in the Case or Control category. Prior to feeding them into MOSES, expression values were first Q-normalized, and then discretized via comparison to the median
expression measured across all genes on a per-individual basis (1 for greater than the median, 0 for less than). Fitness was taken as precision, with a penalty factor restricting attention to program trees with recall above a specified minimum level.

This study was carried out not merely for testing MOSES and LIFES, but as part of a practical investigation into which genes and gene combinations may be the best drug targets for Alzheimer's Disease. The overall methodology for the biological investigation is to find a (hopefully diverse) ensemble of accurate classification models, and then statistically observe which genes tend to occur most often in this ensemble, and which combinations of genes tend to co-occur most often in the models in the ensemble. These most frequent genes and combinations are taken as potential therapeutic targets for the Case category of the underlying classification problem (which in this case denotes inflammation). This methodology has been biologically validated by follow-up lab work in a number of cases; see e.g. [Gea05], where this approach resulted in the first evidence of a genetic basis for Chronic Fatigue Syndrome. A significant body of unpublished commercial work along these lines has been done by Biomind LLC (http://biomind.com) for its various customers.

Comparing MOSES-LIFES to MOSES with conventional feature selection, we find that the former finds model ensembles combining greater diversity with greater precision, and equivalent recall. This is because conventional feature selection eliminates numerous genes that actually have predictive value for the phenotype of inflammation, so that MOSES never gets to see them. LIFES exposes MOSES to a much greater number of genes, some of which MOSES finds useful.
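The preprocessing and fitness just described can be sketched roughly as follows. This is an illustrative reconstruction, not the actual pipeline: Q-normalization is omitted, the function names are ours, and a hard recall cutoff stands in for the penalty factor described in the text.

```python
from statistics import median

def discretize(expression_row):
    """Binarize one individual's expression values against that individual's
    own median: 1 for above the median, 0 for at-or-below."""
    m = median(expression_row)
    return [1 if v > m else 0 for v in expression_row]

def penalized_precision(predicted, actual, min_recall=0.5):
    """Precision of Case predictions, zeroed out when recall falls below the
    required minimum (a crude stand-in for the penalty factor)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision if recall >= min_recall else 0.0

print(discretize([0.1, 5.0, 2.0, 9.0, 0.3]))  # -> [0, 1, 0, 1, 0]
```

A learned Boolean program tree would then consume the discretized values and be scored by `penalized_precision` against the Case/Control labels.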
And LIFES enables MOSES to explore this larger space of genes without getting bollixed by the potential combinatorial explosion of possibilities.

Algorithm                                Train Precision  Train Recall  Test Precision  Test Recall
MOSES (best training precision)               .81             .51           .65            .42
MOSES (best test precision)                   .80             .52           .69            .43
MOSES-LIFES (best training precision)         .84             .51           .68            .38
MOSES-LIFES (best test precision)             .82             .51           .72            .48

Table 33.1: Impact of LIFES on MOSES classification of Alzheimer's Disease SNP data. The fitness function sought to maximize precision subject to the constraint of recall being at least 0.5. Precision and recall figures are averages over 10 folds, using 10-fold cross-validation. The results shown here are drawn from a larger set of runs, and are selected according to two criteria: best training precision (the fair way to do it) and best test precision (just for comparison).

We see that use of LIFES increases precision by around 3% in these tests, which is highly statistically significant according to permutation analysis. The genomics example shows that LIFES makes sense and works in the context of MOSES, broadly speaking. It seems very plausible that LIFES will also work effectively with MOSES in an integrative AGI context, for instance in OpenCog deployments where MOSES is used to drive procedure learning, with fitness functions supplied by other OpenCog components. However, the empirical validation of this plausible conjecture remains for future work.
33.5 Supplying Evolutionary Learning with Long-Term Memory

This section introduces an important enhancement to evolutionary learning, which extends the basic PEPL framework by forming an adaptive hybridization of PEPL optimization with PLN inference (rather than merely using PLN inference within evolutionary learning to aid with modeling). The first idea here is the use of PLN to supply evolutionary learning with a long-term memory. Evolutionary learning approaches each problem as an isolated entity, but in reality, a CogPrime system will be confronting a long series of optimization problems, with subtle interrelationships. When trying to optimize the function f, CogPrime may make use of its experience in optimizing other functions g. Inference allows optimizers of g to be analogically transformed into optimizers of f; for instance, it allows one to conclude:

    Inheritance f g
    EvaluationLink f x
    EvaluationLink g x

However, less obviously, inference also allows patterns in populations of optimizers of g to be analogically transformed into patterns in populations of optimizers of f. For example, if pat is a pattern in good optimizers of f, then we have:

    InheritanceLink f g
    ImplicationLink
        EvaluationLink f x
        EvaluationLink pat x
    ImplicationLink
        EvaluationLink g x
        EvaluationLink pat x

(with appropriate probabilistic truth values), an inference which says that patterns in the population of f-optimizers should also be patterns in the population of g-optimizers. Note that we can write the previous example more briefly as:

    InheritanceLink f g
    ImplicationLink (EvaluationLink f) (EvaluationLink pat)
    ImplicationLink (EvaluationLink g) (EvaluationLink pat)

A similar formula holds for SimilarityLinks.
We may also infer:

    ImplicationLink (EvaluationLink g) (EvaluationLink pat_g)
    ImplicationLink (EvaluationLink f) (EvaluationLink pat_f)
    ImplicationLink (EvaluationLink (g AND f)) (EvaluationLink (pat_g AND pat_f))

and:

    ImplicationLink (EvaluationLink f) (EvaluationLink pat)
    ImplicationLink (EvaluationLink -f) (EvaluationLink -pat)

Through these sorts of inferences, PLN inference can be used to give evolutionary learning a long-term memory, allowing knowledge about population models to be transferred from one optimization problem to another. This complements the more obvious use of inference to transfer knowledge about specific solutions from one optimization problem to another. For instance, in the problem of finding a compact program generating some given sequence of bits, the system might have noticed that when the number of 0s roughly balances the number of 1s (let us call this property STR_BALANCE), successful optimizers tend to give greater biases toward conditionals involving comparisons of the number of 0s and 1s inside the condition (let us call this property over optimizers COMP_CARD_DIGIT_BIAS). This can be expressed in PLN as follows:

    AverageQuantifierLink <tv>
        ListLink $X $Y
        ImplicationLink
            ANDLink
                InheritanceLink STR_BALANCE $X
                EvaluationLink SUCCESSFUL_OPTIMIZER_OF
                    ListLink $Y $X
            InheritanceLink COMP_CARD_DIGIT_BIAS $Y

which translates as: if the problem $X inherits from STR_BALANCE and $Y is a successful optimizer of $X, then, with probability p calculated according to tv, $Y tends to be biased according to the property described by COMP_CARD_DIGIT_BIAS.

33.6 Hierarchical Program Learning

Next we discuss hierarchical program structure, and its reflection in probabilistic modeling, in more depth. This is a surprisingly subtle and critical topic, which may be approached from several different complementary angles. To an extent, hierarchical structure is automatically accounted for in MOSES, but it may also be valuable to pay more explicit mind to it. In human-created software projects, one common approach for dealing with the existence of complex interdependencies between parts of a program is to give the program a hierarchical structure.
The program is then a hierarchical arrangement of programs within programs within programs, each one of which has relatively simple dependencies between its parts (however its
parts may themselves be hierarchical composites). This notion of hierarchy is essential to such programming methodologies as modular programming and object-oriented design.

Pelikan and Goldberg discuss the hierarchical nature of human problem-solving, in the context of the hBOA (hierarchical BOA) version of BOA. However, the hBOA algorithm does not incorporate hierarchical program structure nearly as deeply and thoroughly as the hierarchical procedure learning approach proposed here. In hBOA the hierarchy is implicit in the models of the evolving population, but the population instances themselves are not necessarily explicitly hierarchical in structure. In hierarchical PEPL as we describe it here, the population consists of hierarchically structured Combo trees, and the hierarchy of the probabilistic models corresponds directly to this hierarchical program structure. The ideas presented here have some commonalities with John Koza's ADFs and related tricks for putting reusable subroutines in GP trees, but there are also some very substantial differences, which we believe will make the current approach far more effective (though also involving considerably more computational overhead).

We believe that this sort of hierarchically-savvy modeling is what will be needed to get probabilistic evolutionary learning to scale to large and complex programs, just as hierarchy-based methodologies like modular and object-oriented programming are needed to get human software engineering to scale to large and complex programs.

33.6.1 Hierarchical Modeling of Composite Procedures in the AtomSpace

The possibility of hierarchically structured programs is (intentionally) present in the CogPrime design, even without any special effort to build hierarchy into the PEPL framework. Combo trees may contain Nodes that point to PredicateNodes, which may in turn contain Combo trees, etc.
However, our current framework for learning Combo trees does not take advantage of this hierarchy. What is needed, in order to do so, is for the models used for instance generation to include events of the form:

    The Combo tree Node at position x has type PredicateNode; and the PredicateNode at position x contains a Combo tree that possesses property P.

where x is a position in a Combo tree and P is a property that may or may not be true of any given Combo tree. Using events like this, a relatively small program explicitly incorporating only short-range dependencies may implicitly encapsulate long-range dependencies via the properties P.

But where do these properties P come from? These properties should be patterns learned as part of the probabilistic modeling of the Combo tree inside the PredicateNode at position x. For example, if one is using a decision tree modeling framework, then the properties might be of the form "decision tree D evaluates to True". Note that not all of these properties have to be statistically correlated with the fitness of the PredicateNode at position x (although some of them surely will be).

Thus we have a multi-level probabilistic modeling strategy. The top-level Combo tree has a probabilistic model whose events may refer to patterns that are parts of the probabilistic models of Combo trees that occur within it, and so on down. In instance generation, when a newly generated Combo tree is given a PredicateNode at position x, two possibilities exist:
• There is already a model for PredicateNodes at position x in Combo trees in the given population, in which case a population of PredicateNodes potentially living at that position is drawn from the known model, and evaluated.

• There is no such model (because no attempt has been made to create a PredicateNode at position x in this population before), in which case a new population of Combo trees is created corresponding to the position, and evaluated.

Note that the fitness of a Combo tree that is not at the top level of the overall process is assessed indirectly, in terms of the fitness of the higher-level Combo tree in which it is embedded, due to the requirement of having certain properties, etc.

Suppose each Combo tree in the hierarchy has on average R adaptable sub-programs (represented as Nodes pointing to PredicateNodes containing Combo trees to be learned), and suppose the hierarchy is K levels deep. Then we will have roughly R^K program tree populations in the tree. This suggests that hierarchies shouldn't get too big, and indeed, they shouldn't need to, for the same essential reason that human-created software programs, if well-designed, tend not to require extremely deep and complex hierarchical structures.

One may also introduce a notion of reusable components across various program learning runs, or across several portions of the same hierarchical program. Here one is learning patterns of the form:

    If property P1(C,x) applies to a Combo tree C and a node x within it, then it is often good for node x to refer to a PredicateNode containing a Combo tree with property P2.

These patterns may be assigned probabilities and may be used in instance generation. They are general or specialized programming guidelines, which may be learned over time.
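The two possibilities arising during instance generation can be sketched in miniature. Everything here (the cache name, the string-valued "populations") is hypothetical scaffolding for illustration, not CogPrime code:

```python
# Hypothetical sketch: either reuse an existing model of subtrees at a given
# position, or start a fresh subpopulation for that position.
position_models = {}   # position -> population/model of subtrees at that position

def subpopulation_for(position, new_population):
    """Return a subtree population for the given position: drawn from the
    known model if one exists, otherwise freshly created (and remembered)."""
    if position in position_models:
        return position_models[position]          # possibility 1: reuse the model
    position_models[position] = new_population()  # possibility 2: create anew
    return position_models[position]

pop1 = subpopulation_for("x1", lambda: ["tree_a", "tree_b"])
pop2 = subpopulation_for("x1", lambda: ["tree_c"])
print(pop1 is pop2)  # -> True: the second request reuses the stored model
```

In the real system each stored entry would be a probabilistic model over Combo trees rather than a literal list, but the reuse-or-create control flow is the same.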
33.6.2 Identifying Hierarchical Structure in Combo Trees via MetaNodes and Dimensional Embedding

One may also apply the concepts of the previous section to model a population of CTs that doesn't explicitly have a hierarchical structure, via introducing the hierarchical structure during the evolutionary process, through the introduction of special extra Combo tree nodes called MetaNodes. For instance, MetaNodes may represent subtrees of Combo trees which have proved useful enough that it seems justifiable to extract them as "macros." This concept may be implemented in a couple of different ways; here we will introduce a simple way of doing this based on dimensional embedding, and then in the next section we will allude to a more sophisticated approach that uses inference instead.

The basic idea is to couple decision tree modeling with dimensional embedding of subtrees, a trick that enables small decision tree models to cover large regions of a CT in an approximate way, and which leads naturally to a form of probabilistically-guided crossover. The approach as described here works most simply for CTs that have many subtrees that can be viewed as mapping numerical inputs into numerical outputs. There are clear generalizations to other sorts of CTs, but it seems advisable to test the approach on this relatively simple case first.

The first part of the idea is to represent subtrees of a CT as numerical vectors in a relatively low-dimensional space (say N=50 dimensions). This can be done using our existing dimensional embedding algorithm, which maps any metric space of entities into a dimensional space. All
that's required is that we define a way of measuring distance between subtrees. If we look at subtrees with numerical inputs and outputs, this is easy. Such a subtree can be viewed as a function mapping R^n into R^m, and there are many standard ways to calculate the distance between two functions of this sort; for instance, one can make a Monte Carlo estimate of the Lp metric, which is defined as:

    [ Sum_x |f(x) - g(x)|^p ]^(1/p)

Of course, the same idea works for subtrees with non-numerical inputs and outputs; the tuning and implementation are just a little trickier.

Next, one can augment a CT with meta-nodes that correspond to subtrees. Each meta-node is of a special CT node type MetaNode, and comes tagged with an N-dimensional vector. Exactly which subtrees to replace with MetaNodes is an interesting question that must be solved via some heuristics. Then, in the course of executing the PEPL algorithm, one does decision tree modeling as usual, but making use of MetaNodes as well as ordinary CT nodes. The modeling of MetaNodes is quite similar to the modeling of Nodes representing ConceptNodes and PredicateNodes using embedding vectors. In this way, one can use standard, small decision tree models to model fairly large portions of CTs (because portions of CTs are approximately represented by MetaNodes).

But how does one do instance generation, in this scheme? What happens when one tries to do instance generation using a model that predicts a MetaNode existing in a certain location in a CT? Then, the instance generation process has got to find some CT subtree to put in the place where the MetaNode is predicted. It needs to find a subtree whose corresponding embedding vector is close to the embedding vector stored in the MetaNode. But how can it find such a subtree? There seem to be two ways:
A reasonable solution is to look at the database of subtrees that have been seen before in the evolving population, and choose one from this database, with the probability of choosing subtree X decreasing with the distance between X's embedding vector and the embedding vector stored in the MetaNode.

2. One can simply choose good subtrees, where the goodness of a subtree is judged by the average fitness of the instances containing the target subtree.

One can use a combination of both of these processes during instance generation. But of course, what this means is that we're in a sense doing a form of crossover, because we're generating new instances that combine subtrees from previous instances. But we're combining subtrees in a judicious way guided by probabilistic modeling, rather than in a random way as in GP-style crossover.

33.6.2.1 Inferential MetaNodes

MetaNodes are an interesting and potentially powerful technique, but we don't believe that they, or any other algorithmic trick, are going to be the solution to the problem of learning hierarchical procedures. We believe that this is a cognitive science problem that probably isn't amenable to a purely computer-science-oriented solution. In other words, we suspect that the correct way to break a Combo tree down into hierarchical components depends on context; algorithms are of course required, but they're algorithms for relating a CT to its context rather
than pure CT-manipulation algorithms. Dimensional embedding is arguably a tool for capturing contextual relationships, but it's a very crude one. Generally speaking, what we need to be learning are patterns of the form "A subtree meeting requirements X is often fit when linked to a subtree meeting requirements Y, when solving a problem of type Z." Here the context requirements Y will not pertain to absolute tree position, but rather to abstract properties of a subtree.

The MetaNode approach as outlined above is a kind of halfway measure toward this goal, good because of its relative computational efficiency, but ultimately too limited in its power to deal with really hard hierarchical learning problems. The reason the MetaNode approach is crude is simply that it involves describing subtrees via points in an embedding space. We believe that the correct (but computationally expensive) approach is indeed to use MetaNodes - but with each MetaNode tagged, not with coordinates in an embedding space, but with a set of logical relationships describing the subtree that the MetaNode stands for. A candidate subtree's similarity to the MetaNode may then be determined by inference rather than by the simple computation of a distance between points in the embedding space. (And note that we may have a hierarchy of MetaNodes, with small subtrees corresponding to MetaNodes, larger subtrees comprising networks of small subtrees also corresponding to MetaNodes, etc.)

The question then becomes which logical relationships one tries to look for when characterizing a MetaNode. This may be partially domain-specific, in the sense that different properties will be more interesting when studying motor-control procedures than when studying cognitive procedures. To intuitively understand the nature of this idea, let's consider some abstract but commonsense examples.

Firstly, suppose one is learning procedures for serving a ball in tennis. Suppose all the successful procedures work by first throwing the ball up really high, then doing other stuff. The internal details of the different procedures for throwing the ball up really high may be wildly different. What we need is to learn the pattern

Implication
  Inheritance X "throwing the ball up really high"
  "X then Y" is fit

Here X and Y are MetaNodes. But the question is: how do we learn to break trees down into MetaNodes according to the formula "tree = 'X then Y', where X inherits from 'throwing the ball up really high'"?

Similarly, suppose one is learning procedures to do first-order inference. What we need is to learn a pattern such as:

Implication
  AND
    F involves grabbing pairs from the AtomTable
    G involves applying an inference rule to each of those pairs
    H involves putting the results back in the AtomTable
  "F(G(H()))" is fit

Here we need MetaNodes for F, G and H, but we need to characterize e.g. the MetaNode F by a relationship such as "involves grabbing pairs from the AtomTable." Until we can characterize MetaNodes using abstract descriptors like this, one might argue we're just doing "statistical learning" rather than "general intelligence style" procedure learning. But to do this kind of abstraction intelligently seems to require some background knowledge about the domain.
In the "throwing the ball up really high" case, the assignment of a descriptive relationship to a subtree involves looking not at the internals of the subtree itself, but at the state of the world after the subtree has been executed. In the "grabbing pairs from the AtomTable" case it's a bit simpler, but it still requires some kind of abstract model of what the subtree is doing, i.e. a model involving a logic expression such as "The output of F is a set S so that if P belongs to S then P is a set of two Atoms A1 and A2, and both A1 and A2 were produced via the getAtom operator."

How can this kind of abstraction be learned? It seems unlikely that abstractions like this will be found via evolutionary search over the space of all possible predicates describing program subtrees. Rather, they need to be found via probabilistic reasoning based on the terms combined in subtrees, put together with background knowledge about the domain in which the fitness function exists. In short, integrative cognition is required to learn hierarchically structured programs in a truly effective way, because the appropriate hierarchical breakdowns are contextual in nature, and searching for appropriate hierarchical breakdowns without using inference to take context into account involves intractably large search spaces.

33.7 Fitness Function Estimation via Integrative Intelligence

If instance generation is very cheap and fitness evaluation is very expensive (as is the case in many applications of evolutionary learning in CogPrime), one can accelerate evolutionary learning via a "fitness function estimation" approach. Given a fitness function embodied in a predicate P, the goal is to learn a predicate Q so that:

1. Q is much cheaper than P to evaluate, and
2. There is a high-strength relationship

Similarity Q P

or else

ContextLink C (Similarity Q P)

where C is a relevant context.
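One simple way to operationalize the Similarity requirement is to measure, over the candidates that P has already scored, how closely Q tracks P. The sketch below maps mean absolute error into (0, 1]; the scoring function is our illustrative choice and not a formula from the text.

```python
def estimator_similarity(q, evaluated):
    """Score a candidate estimator `q` (a cheap predicate returning a value
    in [0,1]) by how closely it tracks the true fitness P over `evaluated`,
    a list of (candidate, p_value) pairs for which P has already been run.
    Returns a similarity in (0, 1]; 1.0 means perfect agreement."""
    if not evaluated:
        return 0.0
    mean_err = sum(abs(q(c) - p) for c, p in evaluated) / len(evaluated)
    return 1.0 / (1.0 + mean_err)
```

A strength computed this way could serve as the truth value of the Similarity Q P relationship, restricted to the sample of candidates evaluated so far.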
Given such a predicate Q, one could proceed to optimize P by ignoring evolutionary learning altogether and just repeatedly following the algorithm:

• Randomly generate N candidate solutions.
• Evaluate each of the N candidate solutions according to Q.
• Take the k<N solutions that satisfy Q best, and evaluate them according to P.
• Q is then improved based on the new evaluations of P that are done.

Of course, this would not be as good as incorporating fitness function estimation into an overall evolutionary learning framework.

Heavy utilization of fitness function estimation may be appropriate, for example, if the entities being evolved are schemata intended to control an agent's actions in a real or simulated environment. In this case the specification predicate P, in order to evaluate P(S), has to actually use the schema S to control the agent in the environment. So one may search for Q that do not involve any simulated environment, but are constrained to be relatively small predicates involving only cheap-to-evaluate terms (e.g. one may allow standard combinators, numbers, strings, ConceptNodes, and predicates built up recursively from these). Then Q will be an abstract predictor of concrete environmental success.

We have left open the all-important question of how to find the "specification-approximating predicate" Q. One approach is to use evolutionary learning. In this case, one has a population of predicates, which are candidates for Q. The fitness of each candidate Q is judged by how well it approximates P over the set of candidate solutions for P that have already been evaluated. If one uses evolutionary learning to evolve Qs, then one is learning a probabilistic model of the set of Qs, which tries to predict what sort of Qs will better solve the optimization problem of approximating P's behavior. Of course, using evolutionary learning for this purpose potentially initiates an infinite regress, but the regress can be stopped by, at some level, finding Qs using a non-evolutionary-learning-based technique such as genetic programming, or a simple evolutionary learning based technique like standard BOA programming.

Another approach to finding Q is to use inference based on background knowledge. Of course, this is complementary rather than contradictory to using evolutionary learning for finding Q. There may be information in the knowledge base that can be used to "analogize" regarding which Qs may match P. Indeed, this will generally be the case in the example given above, where P involves controlling actions in a simulated environment but Q does not. An important point is that, if one uses a certain Q1 within fitness estimation, the evidence one gains by trying Q1 on numerous fitness cases may be utilized in future inferences regarding other Q2 that may serve the role of Q. So, once inference gets into the picture, the quality of fitness estimators may progressively improve via ongoing analogical inference based on the internal structures of the previously attempted fitness estimators.
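The screen-with-Q, verify-with-P loop described at the start of this section can be sketched as follows. Here propose, p and q are hypothetical stand-ins for the instance generator, the expensive specification predicate, and the cheap estimator, and the step where Q itself is improved from the new P evaluations is omitted for brevity.

```python
def estimate_then_verify(p, q, propose, n=200, k=10, rounds=5):
    """One style of fitness-function-estimation loop: generate candidates
    cheaply, screen them with the estimator q, and spend the expensive
    predicate p only on the top k survivors in each round."""
    best, best_fit = None, float("-inf")
    for _ in range(rounds):
        candidates = [propose() for _ in range(n)]
        # Screen with the cheap estimator q.
        screened = sorted(candidates, key=q, reverse=True)[:k]
        # Verify the survivors with the expensive predicate p.
        for c in screened:
            fit = p(c)
            if fit > best_fit:
                best, best_fit = c, fit
    return best, best_fit
```

The ratio k/n controls how much trust is placed in Q: a poor estimator with small k will discard good candidates before P ever sees them, which is exactly why the quality of Q matters so much.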
Section V
Declarative Learning
Chapter 34
Probabilistic Logic Networks

Co-authored with Matthew Iklé

34.1 Introduction

Now we turn to CogPrime's methods for handling declarative knowledge - beginning with a series of chapters discussing the Probabilistic Logic Networks (PLN) [GMIH08] approach to uncertain logical reasoning, and then turning to chapters on pattern mining and concept creation. In this first of the chapters on PLN, we give a high-level overview, summarizing material given in the book Probabilistic Logic Networks [GMIH08] more compactly and in a somewhat differently-organized way. For a more thorough treatment of the concepts and motivations underlying PLN, the reader is encouraged to read [GMIH08].

PLN is a mathematical and software framework for uncertain inference, operative within the CogPrime software framework and intended to enable the combination of probabilistic truth values with general logical reasoning rules. Some of the key requirements underlying the development of PLN were the following:

• To enable uncertainty-savvy versions of all known varieties of logical reasoning, including for instance higher-order reasoning involving quantifiers, higher-order functions, and so forth
• To reduce to crisp "theorem prover" style behavior in the limiting case where uncertainty tends to zero
• To encompass inductive and abductive as well as deductive reasoning
• To agree with probability theory in those reasoning cases where probability theory, in its current state of development, provides solutions within reasonable calculational effort based on assumptions that are plausible in the context of real-world embodied software systems
• To gracefully incorporate heuristics not explicitly based on probability theory, in cases where probability theory, at its current state of development, does not provide adequate pragmatic solutions
• To provide "scalable" reasoning, in the sense of being able to carry out inferences involving billions of premises
• To easily accept input from, and send output to, natural language processing software systems

In practice, PLN consists of

• a set of inference rules (e.g. deduction, Bayes rule, variable unification, modus ponens, etc.), each of which takes one or more logical relationships or terms (represented as CogPrime Atoms) as inputs, and produces others as outputs
• specific mathematical formulas for calculating the probability value of the conclusion of an inference rule based on the probability values of the premises plus (in some cases) appropriate background assumptions.

PLN also involves a particular approach to estimating the confidence values with which these probability values are held (weight of evidence, or second-order uncertainty). Finally, the implementation of PLN in software requires important choices regarding the structural representation of inference rules, and also regarding "inference control" - the strategies required to decide what inferences to do in what order, in each particular practical situation.

Currently PLN is being utilized to enable an animated agent to achieve goals via combining actions in a game world. For example, it can figure out that to obtain an object located on top of a wall, it may want to build stairs leading from the floor to the top of the wall. Earlier PLN applications have involved simpler animated agent control problems, and also other domains, such as reasoning based on information extracted from biomedical text using a language parser.

For all its sophistication, however, PLN falls prey to the same key weakness as other logical inference systems: combinatorial explosion. In trying to find a logical chain of reasoning leading to a desired conclusion, or to evaluate the consequences of a given set of premises, PLN may need to explore an unwieldy number of possible combinations of the Atoms in CogPrime's memory. For PLN to be practical beyond relatively simple and constrained problems (and most definitely, for it to be useful for AGI at the human level or beyond), it must be coupled with a powerful method for "inference tree pruning" - for paring down the space of possible inferences that the PLN engine must evaluate as it goes about its business in pursuing a given goal in a certain context.
Inference control will be addressed in Chapter 36.

34.2 A Simple Overview of PLN

The key elements of PLN are its rules and formulas. In general, a PLN rule has

• Input: a tuple of Atoms (which must satisfy certain criteria, specific to the rule)
• Output: a tuple of Atoms

Actually, in nearly all cases, the output is a single Atom; and the input is a single Atom or a pair of Atoms. The prototypical example is the DeductionRule. Its input must look like

X_Link A B
X_Link B C

And its output then looks like

X_Link A C

Here, X_Link may be either InheritanceLink, SubsetLink, ImplicationLink or ExtensionalImplicationLink.

A PLN formula goes along with a PLN rule, and tells the uncertain truth value of the output, based on the uncertain truth value of the input. For example, if we have

X_Link A B <sAB>
X_Link B C <sBC>

then the standard PLN deduction formula tells us

X_Link A C <sAC>

with

sAC = sAB sBC + (1 - sAB)(sC - sB sBC)/(1 - sB)

where e.g. sA denotes the strength of the truth value of node A. In this example, the uncertain truth value of each Atom is given as a single "strength" number. In general, uncertain truth values in PLN may take multiple forms, such as

• single strength values like .8, which may indicate probability or fuzzy truth value, depending on the Atom type
• (strength, confidence) pairs like (.8, .4)
• (strength, count) pairs like (.8, 15)
• indefinite probabilities like (.6, .9, .95), which indicate credible intervals of probabilities

34.2.1 Forward and Backward Chaining

Typical patterns of usage of PLN are forward-chaining and backward-chaining inference. Forward chaining basically means:

1. Given a pool (a list) of Atoms of interest,
2. one applies PLN rules to these Atoms, to generate new Atoms, hopefully also of interest;
3. adding these new Atoms to the pool, one returns to Step 1.

EXAMPLE: "People are animals" and "animals breathe" are in the pool of Atoms. These are combined by the Deduction rule to form the conclusion "people breathe".

Backward chaining falls into two cases. First:

• "Truth value query." Given a target Atom whose truth value is not known (or is too uncertainly known), plus a pool of Atoms, find a way to estimate the truth value of the target Atom, via combining the Atoms in the pool using the inference rules.

EXAMPLE: The target is "do people breathe?" (InheritanceLink people breathe). The truth value of the target is estimated via doing the inference "People are animals, animals breathe, therefore people breathe."

Second:

• "Variable fulfillment query." Given a target Link (Atoms may be Nodes or Links) with one or more VariableAtoms among its targets, figure out what Atoms may be put in place of these VariableAtoms, so as to give the target Link a high strength*confidence (i.e. a "high truth value").

EXAMPLE: The target is "what breathes?", i.e. "InheritanceLink $X breathe". Direct lookup into the Atomspace reveals the Atom "InheritanceLink animal breathe", indicating that the slot $X may be filled by "animal". Inference reveals that "Inheritance people breathe", so that the slot $X may also be filled by "people".
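The deduction strength formula given above translates directly into code. This sketch computes only the strength component; a real PLN implementation would also propagate confidence / weight of evidence.

```python
def deduction_strength(s_ab, s_bc, s_b, s_c):
    """PLN first-order deduction: estimate s(A->C) from s(A->B), s(B->C),
    and the node strengths s(B) and s(C), under the assumption that
    A->B and B->C are independent."""
    if s_b >= 1.0:
        raise ValueError("s_B must be < 1 for the correction term")
    return s_ab * s_bc + (1.0 - s_ab) * (s_c - s_b * s_bc) / (1.0 - s_b)
```

For instance, deduction_strength(0.9, 0.8, 0.2, 0.3) yields 0.7375; and when s(A->B) = 1, the formula collapses to s(A->C) = s(B->C), as it should.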
EXAMPLE: The target is "what breathes and adds?", i.e. "(InheritanceLink $X breathe) AND (InheritanceLink $X add)". Inference reveals that the slot $X may be filled by "people" but not "cats" or "computers."

Common-sense inference may involve a combination of backward chaining and forward chaining. The hardest part of inference is "inference control" - that is, knowing which among the many possible inference steps to take, in order to obtain the desired information (in backward chaining) or to obtain interesting new information (in forward chaining). In an Atomspace with a large number of (often quite uncertain) Atoms, there are many, many possibilities, and powerful heuristics are needed to choose between them. The best guide to inference control is some sort of induction based on the system's past history of which inferences have been useful. But of course, a young system doesn't have much history to go on. And relying on indirectly relevant history is, itself, an inference problem - which can be solved best by a system with some history to draw on!

34.3 First Order Probabilistic Logic Networks

We now review the essentials of PLN in a more formal way. PLN is divided into first-order and higher-order sub-theories (FOPLN and HOPLN). These terms are used in a nonstandard way drawn conceptually from NARS [Wan06]. We develop FOPLN first, and then derive HOPLN therefrom.

FOPLN is a term logic, involving terms and relationships (links) between terms. It is an uncertain logic, in the sense that both terms and relationships are associated with truth value objects, which may come in multiple varieties, ranging from single numbers to complex structures like indefinite probabilities. Terms may be either elementary observations, or abstract tokens drawn from a token-set T.
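Before the formal treatment, the variable fulfillment queries of Section 34.2.1 can be mocked up against a toy Atomspace. The tuple-keyed dictionary below is purely illustrative scaffolding, not the actual Atomspace API.

```python
def fulfill(atomspace, patterns, threshold=0.5):
    """Find the values of $X that give every pattern a high-truth-value
    match in the atomspace. `atomspace` maps (link_type, a, b) -> strength;
    each pattern is a (link_type, '$X', b) triple."""
    candidates = None
    for (ltype, _, target) in patterns:
        matches = {a for (lt, a, b), s in atomspace.items()
                   if lt == ltype and b == target and s >= threshold}
        candidates = matches if candidates is None else candidates & matches
    return candidates or set()

atoms = {
    ("Inheritance", "animal", "breathe"):   0.95,
    ("Inheritance", "people", "breathe"):   0.9,
    ("Inheritance", "people", "add"):       0.8,
    ("Inheritance", "computers", "add"):    0.9,
}
# "what breathes?"          -> {'animal', 'people'}
# "what breathes and adds?" -> {'people'}
```

Conjunctive queries simply intersect the candidate sets, mirroring the "what breathes and adds?" example; real PLN would additionally rank the fillers by strength*confidence and use inference, not just lookup, to generate candidates.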
34.3.1 Core FOPLN Relationships

"Core FOPLN" involves relationships drawn from the set: negation; Inheritance and probabilistic conjunction and disjunction; Member and fuzzy conjunction and disjunction. Elementary observations can have only Member links, while token terms can have any kinds of links.

PLN makes clear distinctions, via link type semantics, between probabilistic relationships and fuzzy set relationships. Member semantics are usually fuzzy relationships (though they can also be crisp), whereas Inheritance relationships are probabilistic, and there are rules governing the interoperation of the two types.

Suppose a virtual agent makes an elementary VisualObservation o of a creature named Fluffy. The agent might classify o as belonging, with degree 0.9, to the fuzzy set of furry objects. The agent might also classify o as belonging with degree 0.8 to the fuzzy set of animals. The agent could then build the following links in its memory:

Member o furry <0.9>
Member o animals <0.8>

The agent may later wish to refine its knowledge, by combining these MemberLinks. Using the minimum fuzzy conjunction operator, the agent would conclude:

fuzzyAND <0.8>
  Member o furry
  Member o animals

meaning that the observation o is a visual observation of a fairly furry, animal object.

The semantics of (extensional) Inheritance are quite different from, though related to, those of the MemberLink. ExtensionalInheritance represents a purely conditional probabilistic subset relationship, and is represented through the Subset relationship. If A is Fluffy and B is the set of cats, then the statement

Subset <0.9> A B

means that P(x is in the set B | x is in the set A) = 0.9.

34.3.2 PLN Truth Values

PLN is equipped with a variety of different truth-value types. In order of increasing information about the full probability distribution, they are:

• strength truth-values, which consist of single numbers; e.g., <s> or <.8>. Usually strength values denote probabilities, but this is not always the case.
• SimpleTruthValues, consisting of pairs of numbers. These pairs come in two forms: <s, w>, where s is a strength and w is a "weight of evidence"; and <s, N>, where N is a "count." "Weight of evidence" is a qualitative measure of belief, while "count" is a quantitative measure of accumulated evidence.
• IndefiniteTruthValues, which quantify truth-values in terms of an interval [L, U], a credibility level b, and an integer k (called the lookahead). IndefiniteTruthValues quantify the idea that after k more observations there is a probability b that the conclusion of the inference will appear to lie in [L, U].
• DistributionalTruthValues, which are discretized approximations to entire probability distributions.

34.3.3 Auxiliary FOPLN Relationships

Beyond the core FOPLN relationships, FOPLN involves additional relationship types of two varieties. There are simple ones, like Similarity.
We say a relationship R is simple if the truth value of R A B can be calculated solely in terms of the truth values of core FOPLN relationships between A and B. There are also complex "auxiliary" relationships like IntensionalInheritance, which, as discussed in depth in the Appendix, measures the extensional inheritance between the set of properties or patterns associated with one term and the corresponding set associated with another.

Returning to our example, the agent may observe that two properties of cats are that they are furry and purr. Since Fluffy is also a furry animal, the agent might then obtain, for example,

IntensionalInheritance <0.5> Fluffy cat

meaning that Fluffy shares about 50% of the properties of cat. Building upon this relationship even further, PLN also has a mixed intensional/extensional Inheritance relationship, which is defined simply as the disjunction of the Subset and IntensionalInheritance relationships. As this example illustrates, for a complex auxiliary relationship R, the truth value of R A B is defined in terms of the truth values of a number of different FOPLN relationships among different terms (others than A and B), specified by a certain mathematical formula.

34.3.4 PLN Rules and Formulas

A distinction is made in PLN between rules and formulas. PLN logical inferences take the form of "syllogistic rules," which give patterns for combining statements with matching terms. Examples of PLN rules include, but are not limited to:

• deduction ((A → B) ∧ (B → C) ⇒ (A → C)),
• induction ((A → B) ∧ (A → C) ⇒ (B → C)),
• abduction ((A → C) ∧ (B → C) ⇒ (A → B)),
• revision, which merges two versions of the same logical relationship that have different truth values,
• inversion ((A → B) ⇒ (B → A)).

The basic schematic of the first four of these rules is shown in Figure 34.1. We can see that the first three rules represent the natural ways of doing inference on three interrelated terms. We can also see that induction and abduction can be obtained from the combination of deduction and inversion, a fact utilized in PLN's truth value formulas.

Related to each rule is a formula which calculates the truth value resulting from application of the rule. As an example, suppose sA, sB, sC, sAB, and sBC represent the truth values for the terms A, B, C, as well as the truth values of the relationships A → B and B → C, respectively. Then, under suitable conditions imposed upon these input truth values, the formula for the deduction rule is given by:

sAC = sAB sBC + (1 - sAB)(sC - sB sBC)/(1 - sB)
Fig. 34.1: The four most basic first-order PLN inference rules

where sAC represents the truth value of the relationship A → C. This formula is directly derived from probability theory, given the assumption that A → B and B → C are independent.

For inferences involving solely fuzzy operators, the default version of PLN uses standard fuzzy logic with min/max truth value formulas (though alternatives may also be employed consistently with the overall PLN framework). Finally, the semantics of combining fuzzy and probabilistic operators is hinted at in [GMIH08] but addressed more rigorously in subsequent work, which gives a precise semantics for constructs of the form

Inheritance A B

where A and B are characterized by relationships of the form Member C A, Member D B, etc. It is easy to see that, in the crisp case, where all MemberLinks and InheritanceLinks have strength 0 or 1, FOPLN reduces to standard propositional logic. Where inheritance is crisp but membership isn't, FOPLN reduces to higher-order fuzzy logic (including fuzzy statements about terms or fuzzy statements, etc.).

34.3.5 Inference Trails

Inference trails are a mechanism used in some implementations of PLN, borrowed from the NARS inference engine [Wan06]. In this approach, each Atom contains a trail structure, which keeps a record of which Atoms were used in deriving the given Atom's TruthValue. In its simplest form, the trail can just be a list of Atoms. The total set of Atoms involved in a given trail, in principle, could be very large; but one can in practice cap trail size at 50 or some other similar number. In a more sophisticated version, one can record the rules used along with the Atoms in the trail as well, allowing recapitulation of the whole inference history producing an Atom's truth value. If the PLN MindAgents store all the inferences they do in some global inference history structure, then trails are obviated, as the information in the trail can be found via consulting this history structure.

The purpose of keeping inference trails is to avoid errors due to double-counting of evidence. If links L1 and L2 are both derived largely based on link L0, and L1 and L2 both lead to L4 as a consequence - do we want to count this as two separate, independent pieces of evidence about L4? Not really, because most of the information involved comes from the single Atom L0 anyway. If all the Atoms maintain trails, then this sort of overlapping evidence can be identified easily; otherwise it will be opaque to the reasoning system.

While trails can be a useful tool, there is reason to believe they're not strictly necessary. If one just keeps doing probabilistic inference iteratively without using trails, eventually the dependencies and overlapping evidence bases will tend to be accounted for, much as in a loopy Bayes net. The key question then comes down to: how long is "eventually", and can the reasoning system afford to wait? A reasonable strategy seems to be:

• Use trails for high-STI Atoms that are being reasoned about intensively, to minimize the amount of error.
• For lower-STI Atoms that are being reasoned on more casually in the background, allow the double-counting to exist in the short term, figuring it will eventually "come out in the wash", so it's not worth spending precious compute resources to more rigorously avoid it in the short term.

34.4 Higher-Order PLN

Higher-order PLN (HOPLN) is defined as the subset of PLN that applies to predicates (considered as functions mapping arguments into truth values). It includes mechanisms for dealing with variable-bearing expressions and higher-order functions.

A predicate, in PLN, is a special kind of term that embodies a function mapping terms or relationships into truth-values. HOPLN contains several relationships that act upon predicates, including Evaluation, Implication, and several types of quantifiers. The relationships can involve constant terms, variables, or a mixture. The Evaluation relationship, for example, evaluates a predicate on an input term. An agent can thus create a relationship of the form

Evaluation near (Bob's house, Fluffy)

or, as an example involving variables,

Evaluation near (X, Fluffy)
The Implication relationship is a particularly simple kind of HOPLN relationship, in that it behaves very much like FOPLN relationships, via substitution of predicates in place of simple terms. Since our agent knows, for example,

Implication
  is_Fluffy
  AND
    is_furry
    purrs

and

Implication
  AND
    is_furry
    purrs
  is_cat

the agent could then use the deduction rule to conclude

Implication
  is_Fluffy
  is_cat

PLN supports a variety of quantifiers, including traditional crisp and fuzzy quantifiers, plus the AverageQuantifier, defined so that the truth value of

AverageQuantifier X F(X)

is a weighted average of F(X) over all relevant inputs X. AverageQuantifier is used implicitly in PLN to handle logical relationships between predicates, so that e.g. the conclusion of the above deduction is implicitly interpreted as

AverageQuantifier X
  Implication
    Evaluation is_Fluffy X
    Evaluation is_cat X

We can now connect PLN with the SRAM model (defined in Chapter 7 of Part 1). Suppose for instance that the agent observes Fluffy from across the room, and that it has previously learned a Fetch procedure that tells it how to obtain an entity once it sees that entity. Then, if the agent has the goal of finding a cat, and it has concluded based on the above deduction that Fluffy is indeed a cat (since it is observed to be furry and to purr), the cognitive schematic (knowledge of the form Context & Procedure → Goal, as explained in Chapter 8 of Part 1) may suggest that it execute the Fetch procedure.

34.4.1 Reducing HOPLN to FOPLN

In [GMIH08] it is shown that in principle, over any finite observation set, HOPLN reduces to FOPLN. The key ideas of this reduction are the elimination of variables via use of higher-order functions, and the use of the set-theoretic definition of function embodied in the SatisfyingSet operator to map function-argument relationships into set-member relationships. As an example, consider the Implication link. In HOPLN, where X is a variable,
Implication
  R1 A X
  R2 B X

may be reduced to

Inheritance
  SatisfyingSet(R1 A X)
  SatisfyingSet(R2 B X)

where e.g. SatisfyingSet(R1 A X) is the fuzzy set of all X satisfying the relationship R1(A, X). Furthermore, in the Appendix we show how experience-based possible-world semantics can be used to reduce PLN's existential and universal quantifiers to standard higher-order PLN relationships using AverageQuantifier relationships. This completes the reduction of HOPLN to FOPLN in the SRAM context.

One may then wonder why it makes sense to think about HOPLN at all. The answer is that it provides compact expression of a specific subset of FOPLN expressions, which is useful in cases where agents have limited memory and these particular expressions provide agents practical value (it biases the agent's reasoning ability to perform just as well in first or higher orders).

34.5 Predictive Implication and Attraction

This section briefly reviews the notions of predictive implication and predictive attraction, which are critical to many aspects of CogPrime dynamics, including goal-oriented behavior. Define

Attraction A B <s>

as s = P(B|A) - P(B|¬A); or, in node and link terms,

s = (Inheritance A B).s - (Inheritance ¬A B).s

For instance,

(Attraction fat pig).s = (Inheritance fat pig).s - (Inheritance ¬fat pig).s

Relatedly, in the temporal domain, we have the link type PredictiveImplication, where

PredictiveImplication A B <s>

roughly means that s is the probability that

Implication A B <s>

holds and also A occurs before B. More sophisticated versions of PredictiveImplication come along with more specific information regarding the time lag between A and B: for instance, a time interval T in which the lag must lie, or a probability distribution governing the lag between the two events. We may then introduce

PredictiveAttraction A B <s>

to mean

s = (PredictiveImplication A B).s - (PredictiveImplication ¬A B).s

For instance
    (PredictiveAttraction kiss_Ben be_happy).s =
        (PredictiveImplication kiss_Ben be_happy).s − (PredictiveImplication ¬kiss_Ben be_happy).s

This is what really matters in terms of determining whether kissing Ben is worth doing in pursuit of the goal of being happy: not just how likely it is that you will be happy if you kiss Ben, but how differentially likely it is.

Along with predictive implication and attraction, sequential logical operations are important, represented by operators such as SequentialAND, SimultaneousAND and SimultaneousOR. For instance:

    PredictiveAttraction
        SequentialAND
            Teacher says 'fetch'
            I get the ball
            I bring the ball to the teacher
        I get a reward

combines SequentialAND and PredictiveAttraction. In this manner, an arbitrarily complex system of serial and parallel temporal events can be constructed.

34.6 Confidence Decay

PLN is all about uncertain truth values, yet there is an important kind of uncertainty it doesn't handle explicitly and completely in its standard truth value representations: the decay of information with time. PLN does have an elegant mechanism for handling this: in the <s,d> formalism for truth values, the strength s may remain untouched by time (except as new evidence specifically corrects it), but d may decay over time. So, our confidence in our old observations decreases with time. In the indefinite probability formalism, what this means is that old truth value intervals get wider, but retain the same mean as they had back in the good old days.

But the tricky question is: How fast does this decay happen? This can be highly context-dependent. For instance, 20 years ago we learned that the electric guitar is the most popular instrument in the world, and also that there are more bacteria than humans on Earth. The former fact is no longer true (keyboard synthesizers have outpaced electric guitars), but the latter is.
And, if you'd asked us 20 years ago which fact would be more likely to become obsolete, we would have answered the former, because we knew particulars of technology would likely change far faster than basic facts of earthly ecology.

On a smaller scale, it seems that estimating confidence decay rates for different sorts of knowledge in different contexts is a tractable data mining problem, which can be solved via the system keeping a record of the observed truth values of a random sampling of Atoms as they change over time. (Operationally, this record may be maintained in parallel with the SystemActivityTable and other tables maintained for purposes of effort estimation, attention allocation and credit assignment.) If the truth values of a certain sort of Atom in a certain context change a lot, then the confidence decay rate for Atoms of that sort should be increased.

This can be quantified nicely using the indefinite probabilities framework. For instance, we can calculate, for a given sort of Atom in a given context, separate b-level credible intervals for the L and U components of the Atom's truth value at time t−r, centered
about the corresponding values at time t. (This would be computed by averaging over all t values in the relevant past, where the relevant past is defined as some particular multiple of r, and over a number of Atoms of the same sort in the same context.) Since historically estimated credible intervals won't be available for every exact value of r, interpolation will have to be used between the values calculated for specific values of r.

Also, while separate intervals for L and U would be kept for maximum accuracy, for reasons of pragmatic memory efficiency one might want to maintain only a single number x, considered as the radius of the confidence interval about both L and U. This could be obtained by averaging together the empirically obtained intervals for L and U.

Then, when updating an Atom's truth value based on a new observation, one performs a revision of the old truth value with the new; but before doing so, one first widens the interval for the old one by the amount indicated by the above-mentioned credible intervals. For instance, if one gets a new observation about A with truth value (L_new, U_new), and the prior truth value of A, namely (L_old, U_old), is 2 weeks old, then one may calculate that L_old should really be considered as

    (L_old − x, L_old + x)

and U_old should really be considered as

    (U_old − x, U_old + x)

so that (L_new, U_new) should actually be revised with

    (L_old − x, U_old + x)

to get the total (L, U) for the Atom after the new observation.

Note that we have referred fuzzily to "sort of Atom" rather than "type of Atom" in the above. This is because Atom type is not really the right level of specificity to be looking at. Rather, as in the guitar vs. bacteria example above, confidence decay rates may depend on semantic categories, not just syntactic (Atom type) categories. To give another example, confidence in the location of a person should decay more quickly than confidence in the location of a building.
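The widen-then-revise update described above can be sketched in a few lines of Python. The revision step here is an illustrative stand-in (a plain weighted interval average); actual PLN revision weights the two truth values by the amount of evidence behind each:

```python
def widen(tv, x):
    """Widen an indefinite-probability interval (L, U) by radius x,
    clamping to [0, 1]; this models confidence decay since the last update."""
    L, U = tv
    return (max(0.0, L - x), min(1.0, U + x))

def revise(old_tv, new_tv, w_old=0.5, w_new=0.5):
    """Toy revision rule: a weighted average of interval bounds.
    Real PLN revision weights by evidence counts, not fixed weights."""
    (Lo, Uo), (Ln, Un) = old_tv, new_tv
    return (w_old * Lo + w_new * Ln, w_old * Uo + w_new * Un)

# A 2-week-old TV, widened by the empirically estimated radius x,
# then revised against a fresh observation
old = widen((0.6, 0.7), x=0.05)       # widened to roughly (0.55, 0.75)
updated = revise(old, (0.65, 0.72))
```

The widening radius x would come from the historically mined credible intervals discussed above, interpolated for the actual age of the truth value.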
So ultimately confidence decay needs to be managed by a pool of learned predicates, which are applied periodically. These predicates are mainly to be learned by data mining, but inference may also play a role in some cases.

The ConfidenceDecay MindAgent must take care of periodically applying the confidence-decaying predicates to the Atoms in the AtomTable. The ConfidenceDecayUpdater MindAgent must take care of:

• forming new confidence-decaying predicates via data mining, and then revising them with the existing relevant confidence-decaying predicates;
• flagging confidence-decaying predicates which pertain to important Atoms but are unconfident, by giving them STI currency, so as to make it likely that they will be visited by inference.

34.6.1 An Example

As an example of the above issues, consider that the confidence decay of
    Inh Ari male

should be low, whereas that of

    Inh Ari tired

should be higher, because we know that for humans, being male tends to be a more permanent condition than being tired. This suggests that concepts should have context-dependent decay rates; e.g. in the context of humans, the default decay rate of maleness is low, whereas the default decay rate of tiredness is high.

However, these defaults can be overridden. For instance, one can say "As he passed through his 80's, Grandpa just got tired, and eventually he died." This kind of tiredness, even in the context of humans, does not have a rapid decay rate. This example indicates why the confidence decay rate of a particular Atom needs to be able to override the default.

In terms of implementation, one mechanism to achieve the above example would be as follows. One could incorporate an interval confidence decay rate as an optional component of a truth value. As noted above, one can keep two separate intervals for the L and U bounds; or, to simplify things, one can keep a single interval and apply it to both bounds separately. Then, e.g., to define the decay rate for tiredness among humans, we could say:

    ImplicationLink_HOJ
        InheritanceLink $X human
        InheritanceLink $X tired <confidenceDecay = [0, .1]>

or else (preferably):

    ContextLink
        human
        InheritanceLink $X tired <confidenceDecay = [0, .1]>

Similarly, regarding maleness we could say:

    ContextLink
        human
        Inh $X male <confidenceDecay = [0, .00001]>

Then one way to express the violation of the default in the case of Grandpa's tiredness would be:

    InheritanceLink grandpa tired <confidenceDecay = [0, .001]>

(Another way to handle the violation from the default, of course, would be to create a separate Atom, tired_from_old_age, and consider this as a separate sense of "tired" from the normal one, with its own confidence decay setting.)

In this example we see that, when a new Atom is created (e.g.
InheritanceLink Ari tired), it needs to be assigned a confidence decay rate via inference based on relations such as the ones given above (this might be done e.g. by placing it on the queue for immediate attention by the ConfidenceDecayUpdater MindAgent). And periodically its confidence decay rate could be updated based on ongoing inferences (in case relevant abstract knowledge about confidence decay rates changes).

Making this sort of inference reasonably efficient might require creating a special index containing abstract relationships that tell you something about confidence decay adjustment, such as the examples given above.
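To make the mechanism concrete, here is a hypothetical sketch of how a pool of learned, context-dependent decay rules might be applied periodically. The rule store, the matching scheme, and the numeric rates are all invented for illustration; a real implementation would mine the rules from the system's history, as described above:

```python
# Hypothetical learned decay rules: (context, predicate) -> interval-widening
# radius per hour of elapsed time. Numbers are illustrative only.
DECAY_RULES = {
    ("human", "male"):  0.00001,   # being male barely decays
    ("human", "tired"): 0.1,       # being tired decays quickly
}

def decay_tv(tv, context, predicate, hours, rules=DECAY_RULES, default=0.01):
    """Widen an Atom's (L, U) truth-value interval according to the matching
    decay rule; the interval mean stays put while confidence drops."""
    rate = rules.get((context, predicate), default)
    L, U = tv
    x = rate * hours
    return (max(0.0, L - x), min(1.0, U + x))

# After 2 hours: 'Ari is tired' has become quite uncertain,
# while 'Ari is male' is essentially unchanged
tired = decay_tv((0.8, 0.9), "human", "tired", hours=2)
male  = decay_tv((0.95, 1.0), "human", "male", hours=2)
```

Per-Atom overrides (like Grandpa's old-age tiredness) would simply shadow the (context, predicate) entry with an Atom-specific rate before the lookup.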
34.7 Why is PLN a Good Idea?

We have explored the intersection of the family of conceptual and formal structures that is PLN with a specific formal model of intelligent agents (SRAM) and its extension using the cognitive schematic. The result is a simple and explicit formulation of PLN as a system by which an agent can manipulate tokens in its memory, and thus represent observed and conjectured relationships (between its observations, and between other relationships), in a way that assists it in choosing actions according to the cognitive schematic.

We have not, however, rigorously answered the question: What is the contribution of PLN to intelligence, within the formal agents framework introduced above? This is a quite subtle question, to which we can currently offer only an intuitive answer, not a rigorous one.

Firstly, there is the question of whether probability theory is really the best way to manage uncertainty, in a practical context. Theoretical results like those of Cox [Cox61] and de Finetti [dF37] demonstrate that probability theory is the optimal way to handle uncertainty, if one makes certain reasonable assumptions. However, these reasonable assumptions don't actually apply to real-world intelligent systems, which must operate with relatively severe computational resource constraints. For example, one of Cox's axioms dictates that a reasoning system must assign the same truth value to a statement, regardless of the route it uses to derive the statement. This is a nice idealization, but it can't be expected of any real-world, finite-resources reasoning system dealing with a complex environment. So an open question exists, as to whether probability theory is actually the best way for practical AGI systems to manage uncertainty. Most contemporary AI researchers assume the answer is yes, and probabilistic AI has achieved increasing popularity in recent years.
However, there are also significant voices of dissent, such as Pei Wang [Wan06] in the AGI community, and many within the fuzzy logic community. PLN is not strictly probabilistic, in the sense that it combines formulas derived rigorously from probability theory with others that are frankly heuristic in nature. PLN was created in a spirit of open-mindedness regarding whether probability theory is actually the optimal approach to reasoning under uncertainty using limited resources, versus merely an approximation to the optimal approach in this case. Future versions of PLN might become either more or less strictly probabilistic, depending on theoretical and practical advances.

Next, aside from the question of the practical value of probability theory, there is the question of whether PLN in particular is a good approach to carrying out significant parts of what an AGI system needs to do, to achieve human-like goals in environments similar to everyday human environments. Within a cognitive architecture where explicit utilization of the cognitive schematic (Context & Procedure → Goal) is useful, PLN is clearly useful if it works reasonably well; so this question partially reduces to: What are the environments in which agents relying on the cognitive schematic are intelligent, according to formal intelligence measures like those defined in Chapter 7 of Part 1?

And then there is the possibility that some uncertain reasoning formalism besides PLN could be even more useful in the context of the cognitive schematic. In particular, the question arises: What are the unique, peculiar aspects of PLN that make it more useful in the context of the cognitive schematic than some other, more straightforward approach to probabilistic inference? Actually there are multiple such aspects that we believe make it particularly useful. One is the indefinite probability approach to truth values, which we believe is more robust for AGI than known alternatives.
Another is the clean reduction of higher-order logic (as defined in PLN) to first-order logic (as defined in PLN), and the utilization
of term logic instead of predicate logic wherever possible; these aspects make PLN inferences relatively simple in most cases where, according to human common sense, they should be simple.

A relatively subtle issue in this regard has to do with PLN intension. The cognitive schematic is formulated in terms of PredictiveExtensionalImplication (or, equivalently, PredictiveExtensionalAttraction), which means that intensional PLN links are not required for handling it. The hypothesis of the usefulness of intensional PLN links embodies a subtle assumption about the nature of the environments that intelligent agents are operating in. As discussed in [Goe06], it requires an assumption related to Peirce's philosophical axiom of the "tendency to take habits," which posits that in the real world, entities possessing some similar patterns have a probabilistically surprising tendency to have more similar patterns.

Reflecting on these various theoretical subtleties and uncertainties, one may get the feeling that the justification for applying PLN in practice is quite insecure! However, it must be noted that no other formalism in AI has a significantly better foundation, at present. Every AI method involves certain heuristic assumptions, and the applicability of these assumptions in real life is nearly always a matter of informal judgment and copious debate. Even a very rigorous technique, like a crisp logic formalism or support vector machines for classification, requires non-rigorous heuristic assumptions in order to be applied to the real world (how do sensation and actuation get translated into logic formulas, or SVM feature vectors?). It would be great if it were possible to use rigorous mathematical theory to derive an AGI design, but that's not the case right now, and the development of this sort of mathematical theory seems quite a long way off. So for now, we must proceed via a combination of mathematics, practice and intuition.
In terms of demonstrated practical utility, PLN has not yet confronted any really ambitious AGI-type problems, but it has shown itself capable of simple practical problem-solving in areas such as virtual agent control and natural language based scientific reasoning [HM08]. The current PLN implementation within CogPrime can be used to learn to play fetch or tag, draw analogies based on observed objects, or figure out how to carry out tasks like finding a cat. We expect that further practical applications, as well as very ambitious AGI development, can be successfully undertaken with PLN without a theoretical understanding of exactly what properties of the environments and goals involved allow PLN to be effective. However, we expect that a deeper theoretical understanding may enable various aspects of PLN to be adjusted in a more effective manner.
Chapter 35
Spatiotemporal Inference

35.1 Introduction

Most of the problems and situations humans confront every day involve space and time explicitly and centrally. Thus, any AGI system aspiring to humanlike general intelligence must have some reasonably efficient and general capability to solve spatiotemporal problems. Regarding how this capability might get into the system, there is a spectrum of possibilities, ranging from rigid hard-coding to tabula rasa experiential learning. Our bias in this regard is that it's probably sensible to somehow "wire into" CogPrime some knowledge regarding space and time, these being, after all, very basic categories for any embodied mind confronting the world.

It's arguable whether the explicit insertion of prior knowledge about spacetime is necessary for achieving humanlike AGI using feasible resources. As an argument against the necessity of this sort of prior knowledge, Ben Kuipers and his colleagues [SNK21] have shown that an AI system can learn via experience that its perceptual stream comes from a world with three, rather than two or four, dimensions. There is a long way from learning the number of dimensions in the world to learning the full scope of practical knowledge needed for effectively reasoning about the world; but it does seem plausible, from their work, that a broad variety of spatiotemporal knowledge could be inferred from raw experiential data. On the other hand, it also seems clear that the human brain does not do it this way, and that a rich fund of spatiotemporal knowledge is "hard-coded" into the brain by evolution, often in ways so low-level that we take them for granted, e.g. the way some motion detection neurons fire in the physical direction of motion, and the way somatosensory cortex presents a distorted map of the body's surface. On a psychological level, it is known that some fundamental intuition for space and time is hard-coded into the human infant's brain [Joh05].
So while we consider the learning of basic spatiotemporal knowledge from raw experience a worthy research direction, and fully compatible with the CogPrime vision, for our main current research we have chosen to hard-wire some basic spatiotemporal knowledge.

If one does wish to hard-wire some basic spatiotemporal knowledge into one's AI system, multiple alternate or complementary methodologies may be used to achieve this, including spatiotemporal logical inference, internal simulation, or techniques like recurrent neural nets whose dynamics defy simple analytic explanation. Though our focus in this chapter is on inference, we must emphasize that inference, even very broadly conceived, is not the only way for an intelligent agent to solve spatiotemporal problems occurring in its life. For instance, if the agent has a detailed map of its environment, it may be able to answer some spatiotemporal questions by
directly retrieving information from the map. Or, logical inference may be substituted or augmented by (implicitly or explicitly) building a model that satisfies the initial knowledge, either abstractly or via incorporating "visualization" connected to sensory memory, and then interpreting new knowledge over that model instead of inferring it. The latter is one way to interpret what DeSTIN and other CSDLNs do; indeed, DeSTIN's perceptual hierarchy is often referred to as a "state inference hierarchy." Any CSDLN contains biasing toward the commonsense structure of space and time, in its spatiotemporal hierarchical structure. It seems plausible that the human mind uses a combination of multiple methods for spatiotemporal understanding, just as we intend CogPrime to do.

In this chapter we focus on spatiotemporal logical inference, addressing the problem of creating a spatiotemporal logic adequate for use within an AGI system that confronts the same sort of real-world problems that humans typically do. The idea is not to fully specify the system's understanding of space and time in advance, but rather to provide some basic spatiotemporal logic rules, with parameters to be adjusted based on experience, and the opportunity for augmenting the logic over time with experientially acquired rules. Most of the ideas in this chapter are reviewed in more detail, with more explanation, in the book Real World Reasoning [GC+11]; this chapter represents a concise summary, compiled with the AGI context specifically in mind.

A great deal of excellent work has already been done in the areas of spatial, temporal and spatiotemporal reasoning; however, this work does not quite provide an adequate foundation for a logic-incorporating AGI system to do spatiotemporal reasoning, because it does not adequately incorporate uncertainty.
Our focus here is to extend existing spatiotemporal calculi to appropriately encompass uncertainty, which we argue is sufficient to transform them into an AGI-ready spatiotemporal reasoning framework. We also find that a simple extension of the standard PLN uncertainty representations, inspired by P(Z)-logic [Yan10], allows more elegant expression of probabilistic fuzzy predicates such as arise naturally in spatiotemporal logic.

In the final section of the chapter, we discuss the problem of planning, which has been considered extensively in the AI literature. We describe an approach to planning that incorporates PLN inference using spatiotemporal logic, along with MOSES as a search method, and some record-keeping methods inspired by traditional AI planning algorithms.

35.2 Related Work on Spatio-temporal Calculi

We now review several calculi that have previously been introduced for representing and reasoning about space, time, and space-time combined.

Spatial Calculi

Calculi dealing with space usually model three types of relationships between spatial regions: topological, directional and metric. The most popular calculus dealing with topology is the Region Connection Calculus (RCC) [RCC92], relying on a base relationship C (for Connected) and building up other relationships from it, like P (for PartOf) or O (for Overlap). For instance P(X, Y), meaning X is a part of Y, can be defined using C as follows:
    P(X, Y) iff ∀Z ∈ U, C(Z, X) → C(Z, Y)        (35.1)

where U is the universe of regions. RCC-8 models eight base relationships: DC(X, Y), EC(X, Y), PO(X, Y), EQ(X, Y), TPP(X, Y), NTPP(X, Y), TPPi(X, Y) and NTPPi(X, Y); see Figure 35.1.

Fig. 35.1: The eight base relationships of RCC-8

It is also possible, using the notion of convexity, to model more relationships such as Inside(X, Y), P-Inside(X, Y) (partially inside) and Outside(X, Y); see Figure 35.2. For instance, RCC-23 is an extension of RCC-8 using relationships based on the notion of convexity. The 9-intersection calculus [WM95, Kur09] is

Fig. 35.2: Additional relationships using convexity

another calculus for reasoning on topological relationships, but one handling relationships between heterogeneous objects: points, lines, surfaces.

Regarding reasoning about direction, the Cardinal Direction Calculus considers directional relationships between regions, to express propositions such as "region A is to the north of region B". And finally, regarding metric reasoning, spatial reasoning involving qualitative distance (such as close, medium, far) and direction combined is considered in [CFH97]. Some work has also been done to extend and combine these various calculi, such as combining RCC-8 and the Cardinal Direction Calculus, or using size [GR00] or shape [Coh95] information in RCC.

Temporal Calculi

The best known temporal calculus is Allen's Interval Algebra, which considers 13 relationships over time intervals, such as Before, During, Overlap, Meet, etc. For instance, one
can express that digestion occurs after or right after eating by Before(Eat, Digest) ∨ Meet(Eat, Digest), equivalently denoted Eat {Before, Meet} Digest. There also exists a generalization of Allen's Interval Algebra that works on semi-intervals [FF92], which are intervals with possibly undefined start or end. There are also modal temporal logics such as LTL and CTL, mostly used to check temporal constraints on concurrent systems, such as deadlock or fairness, using Model Checking [Mai00].

Calculi with Space and Time Combined

There exist calculi combining space and time, first of all those obtained by "temporalizing" a spatial calculus, that is, tagging spatial predicates with timestamps or time intervals. For instance STCC (for Spatio-temporal Constraint Calculus) [GN02] is basically RCC-8 combined with Allen's Algebra. With STCC one can express spatiotemporal propositions such as

    Meet(DC(Finger, Key), EC(Finger, Key))

which means that the interval during which the finger is away from the key meets the interval during which the finger is against the key. Another way to combine space and time is by modeling motion; e.g. the Qualitative Trajectory Calculus (QTC) [WKB05] can be used to express whether 2 objects are going forward/backward or left/right relative to each other.

Uncertainty in Spatio-temporal Calculi

In many situations it is worthwhile or even necessary to consider non-crisp extensions of these calculi. For example, it is not obvious how one should decide in practice whether two regions are connected or disconnected. A desk against the wall would probably be considered connected to it, even if there is a small gap between the wall and the desk. Or, if A is not entirely part of B, it may still be valuable to consider to what extent it is, rather than formally rejecting PartOf(A, B).
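This graded reading of parthood is easy to make concrete. The sketch below (an illustration of the idea, not part of any published RCC extension) scores PartOf(A, B) for axis-aligned rectangles as the fraction of A's area lying inside B, so a crisp part scores 1.0 and a disjoint region scores 0.0:

```python
def overlap_1d(a, b):
    """Length of the overlap of 1-D intervals a = (lo, hi), b = (lo, hi)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def fuzzy_part_of(A, B):
    """Fraction of rectangle A contained in rectangle B.
    Rectangles are given as ((x_lo, x_hi), (y_lo, y_hi))."""
    ax, ay = A
    bx, by = B
    area_A = (ax[1] - ax[0]) * (ay[1] - ay[0])
    inter = overlap_1d(ax, bx) * overlap_1d(ay, by)
    return inter / area_A

desk = ((0.0, 1.0), (0.0, 1.0))
room = ((0.0, 10.0), (0.0, 10.0))
half_out = ((9.5, 10.5), (0.0, 1.0))
print(fuzzy_part_of(desk, room))      # 1.0
print(fuzzy_part_of(half_out, room))  # 0.5
```

The desk is crisply part of the room, while a box straddling the doorway is part of it only to degree 0.5; this is exactly the kind of intermediate value the fuzzy extensions below work with.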
There are several ways to deal with such phenomena; one way is to consider probabilistic or fuzzy extensions of spatiotemporal calculi. For instance, in [SDCK08b, SDCK08a] the RCC relationship C (for Connected) is replaced by a fuzzy predicate representing closeness between regions, and all other relationships based on it are extended accordingly. So e.g. DC (for Disconnected) is defined as follows:

    DC(X, Y) = 1 − C(X, Y)        (35.2)

P (for PartOf) is defined as

    P(X, Y) = inf_{Z ∈ U} I(C(Z, X), C(Z, Y))        (35.3)

where I is a fuzzy implication with some natural properties (usually I(x1, x2) = max(1 − x1, x2)).
Or, EQ (for Equal) is defined as

    EQ(X, Y) = min(P(X, Y), P(Y, X))        (35.4)

and so on. However, the inference rules cannot determine the exact fuzzy values of the resulting relationships, but only a lower bound; for instance

    T(P(X, Y), P(Y, Z)) ≤ P(X, Z)        (35.5)

where T(x1, x2) = max(0, x1 + x2 − 1). This is to be expected, since in order to know the resulting fuzzy value one would need to know the exact spatial configuration. For instance, Figure 35.3 depicts 2 possible configurations that would result in 2 different values of P(X, Z).

Fig. 35.3: Depending on where Z is, in dashed line, P(X, Z) gets a different value.

One way to address this difficulty is to reason with interval-valued fuzzy logic [DP00], with the downside of ending up with wide intervals. For example, applying the same inference rule from Equation 35.5 in the case depicted in Figure 35.4 would result in the interval [0, 1], corresponding to a state of total ignorance. This is the main reason why, as explained in the next section, we have decided to use distributional fuzzy values for our AGI-oriented spatiotemporal reasoning.

There also exist attempts to use probability with RCC. For instance, in [Win00] RCC relationships are extracted from computer images and weighted based on their likelihood as estimated by a shape recognition algorithm. However, to the best of our knowledge, no one has used distributional fuzzy values [Yan10] in the context of spatiotemporal reasoning; and we believe this is important for the adaptation of spatiotemporal calculi to the AGI context.

35.3 Uncertainty with Distributional Fuzzy Values

P(Z)-logic [Yan10] is an extension of fuzzy logic that considers distributions of fuzzy values rather than mere fuzzy values. That is, fuzzy connectors are extended to apply over probability density functions of fuzzy truth values.
For instance, the negation connector (often defined as ¬x = 1 − x) is extended such that the resulting distribution μ¬ : [0, 1] → R+ is

    μ¬(x) = μ(1 − x)        (35.6)

where μ is the probability density function of the unique argument. Similarly, one can define μ∧ : [0, 1] → R+ as the resulting density function of the connector x1 ∧ x2 = min(x1, x2) over the 2 arguments μ1 : [0, 1] → R+ and μ2 : [0, 1] → R+:
    μ∧(x) = μ1(x) ∫_x^1 μ2(x2) dx2 + μ2(x) ∫_x^1 μ1(x1) dx1        (35.7)

See [Yan10] for the justification of Equations 35.6 and 35.7.

Besides extending the traditional fuzzy operators, one can also define a wider class of connectors that can fully modulate the output distribution. Let F : [0, 1]^n → ([0, 1] → R+) be an n-ary connector that takes n fuzzy values and returns a probability density function. In that case the probability density function resulting from the extension of F over distributional fuzzy values is

    μ_F(x) = ∫_0^1 ⋯ ∫_0^1 F(x1, ..., xn)(x) μ1(x1) ⋯ μn(xn) dx1 ⋯ dxn        (35.8)

where μ1, ..., μn are the n input arguments. That is, it is the average of all density functions output by F applied over all fuzzy input values. Let us call this type of connector fuzzy-probabilistic. In the following we give an example of such a fuzzy-probabilistic connector.

Example with PartOf

Let us consider the RCC relationship PartOf (P for short, as defined in Equation 35.1). A typical inference rule in the crisp case would be

    P(X, Y)    P(Y, Z)
    ------------------        (35.9)
         P(X, Z)

expressing the transitivity of P. But using a distribution of fuzzy values, we would have the following rule:

    P(X, Y) (μ1)    P(Y, Z) (μ2)
    ----------------------------        (35.10)
          P(X, Z) (μ_POT)

where POT stands for PartOf Transitivity. The definition of μ_POT for that particular inference rule may depend on many assumptions, like the shapes and sizes of the regions X, Y and Z. In the following we will give an example of a definition of μ_POT with respect to some oversimplified assumptions, chosen to keep the example short.

Let us define the fuzzy variant of PartOf(X, Y) as the proportion of X which is part of Y (as suggested in [Yan10]). Let us also assume that every region is a unitary circle. In this case, the required proportion depends solely on the distance d_XY between the centers of X and Y, so we may define a function f that takes that distance and returns the corresponding fuzzy value; that is, f(d_XY) = P(X, Y):
    f(d_XY) = (4α − 2 d_XY sin(α)) / (2π)    if 0 ≤ d_XY ≤ 2
    f(d_XY) = 0                              if d_XY > 2        (35.11)

where α = cos⁻¹(d_XY / 2). For 0 ≤ d_XY ≤ 2, f(d_XY) is monotone decreasing, so the inverse of f, which takes a fuzzy value and returns a distance, is a function f⁻¹ : [0, 1] → [0, 2].

Let x_XY = P(X, Y), x_YZ = P(Y, Z), x = P(X, Z), d_XY = f⁻¹(x_XY), d_YZ = f⁻¹(x_YZ), l = |d_XY − d_YZ| and u = d_XY + d_YZ. For d_XY and d_YZ fixed, let g : [0, π] → [l, u] be the function that takes as input the angle β between the two lines from the center of Y to X and from the center of Y to Z (as depicted in Figure 35.4) and returns the distance d_XZ. g is defined as follows:

    g(β) = sqrt((d_XY − d_YZ cos(β))² + (d_YZ sin(β))²)

So l ≤ d_XZ ≤ u. It is easy to see that g is monotone increasing and surjective; therefore there exists an inverse g⁻¹ : [l, u] → [0, π]. Let h = f ∘ g, so h takes an angle as input and

Fig. 35.4: d_XZ, in dashed line, for 3 different angles

returns a fuzzy value, h : [0, π] → [0, 1]. Since f is monotone decreasing and g is monotone increasing, h is monotone decreasing. Note that the codomain of h is [0, f(l)] if l < 2, or {0} otherwise. Assuming that l < 2, the inverse of h is a function with the following signature: h⁻¹ : [0, f(l)] → [0, π].

Using h⁻¹, and assuming that the probability of picking β ∈ [0, π] is uniform, we can define the binary connector POT. Let us define ν = POT(x_XY, x_YZ), recalling that POT returns a density function. Assuming x < f(l),

    ν(x) = |h⁻¹'(x)| / π        (35.12)

where h⁻¹' is the derivative of h⁻¹. If x ≥ f(l) then ν(x) = 0. For the sake of simplicity the exact expressions of h⁻¹ and ν(x) have been left out, and the case where one of the fuzzy arguments
x_XY, x_YZ, or both, is null has not been considered, but it would be treated similarly, assuming some probability distribution over the distances d_XY and d_YZ.

It is now possible to define μ_POT in rule 35.10 (following Equation 35.8):

    μ_POT(x) = ∫_0^1 ∫_0^1 POT(x1, x2)(x) μ1(x1) μ2(x2) dx1 dx2        (35.13)

Obviously, assuming that regions are unitary circles is crude; in practice, regions might be of very different shapes and sizes. In fact it might be so difficult to choose the right assumptions (and, once they are chosen, to define POT correctly), that in a complex practical context it may be best to start with overly simplistic assumptions and then learn POT based on the experience of the agent. So the agent would initially perform spatial reasoning not too accurately, but would improve over time by adjusting POT, as well as the other connectors corresponding to other inference rules. It may also be useful to have more premises containing information about the sizes (e.g. Big(X)) and shapes (e.g. Long(Y)) of the regions, like

    B(X) (μ1)    L(Y) (μ2)    P(X, Y) (μ3)    P(Y, Z) (μ4)
    -----------------------------------------------------
                        P(X, Z) (μ)

where B and L stand respectively for Big and Long.

Simplifying Numerical Calculation

Using probability densities as described above is computationally expensive, and in many practical cases it's overkill. To decrease computational cost, several cruder approaches are possible, such as discretizing the probability density functions with a coarse resolution, or restricting attention to beta distributions and treating only their means and variances (as in [Yan10]). The right way to simplify depends on the fuzzy-probabilistic connector involved and on how much inaccuracy can be tolerated in practice.

35.4 Spatio-temporal Inference in PLN

We have discussed the representation of spatiotemporal knowledge, including the associated uncertainty. But ultimately what matters is what an intelligent agent can do with this knowledge.
We now turn to uncertain reasoning based on uncertain spatiotemporal knowledge, using the integration of the above-discussed calculi into the Probabilistic Logic Networks reasoning system, an uncertain inference framework designed specifically for AGI and integrated into the OpenCog AGI framework. We give here a few examples of spatiotemporal inference rules coded in PLN. Although the current implementation of PLN incorporates both fuzziness and probability, it does not have a built-in truth value to represent distributional fuzzy values (or rather, a distribution of distributions of fuzzy values, as this is how, in essence, confidence is represented in PLN). At this point, depending on design choices and experimentation, it is not clear whether we want to use
the existing truth values and treat them as distributional truth values, or implement a new type of truth value dedicated to that purpose; so for our present theoretical purposes we will just call it DF Truth Value. Due to the highly flexible HOJ formalism (Higher Order Judgment, explained in the PLN book in detail) we can express the inference rule for the relationship PartOf directly as Nodes and Links, as follows:

ForAllLink $X $Y $Z
    ImplicationLink_HOJ
        ANDLink
            PartOf($X, $Y) (tv1)
            PartOf($Y, $Z) (tv2)
        ANDLink tv3 = μ_POT(tv1, tv2)
            PartOf($X, $Z) (tv3)
(35.14)

where μ_POT is defined in Equation 35.13, but extended over the domain of PLN DF Truth Values instead of distributional fuzzy values. Note that PartOf($X, $Y) (tv) is a shorthand for

EvaluationLink (tv)
    PartOf
    ListLink
        $X
        $Y
(35.15)

and ForAllLink $X $Y $Z is a shorthand for

ForAllLink
    ListLink
        $X
        $Y
        $Z
(35.16)

Of course, one advantage of expressing the inference rule directly in Nodes and Links, rather than as a built-in PLN inference rule, is that we can use OpenCog itself to improve and refine it, or even create new spatiotemporal rules based on its experience. In the next two examples the fuzzy-probabilistic connectors are ignored (so no DF Truth Value is indicated), but one could define them similarly to μ_POT. First consider a temporal rule from Allen's Interval Algebra. For instance, "if $I1 meets $I2 and $I3 is during $I2 then $I3 is after $I1" would be expressed as

ForAllLink $I1 $I2 $I3
    ImplicationLink
        ANDLink
            Meet($I1, $I2)
            During($I3, $I2)
        After($I3, $I1)
(35.17)
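The crisp core of rule 35.17 is easy to sanity-check outside PLN. Below, intervals are represented as (start, end) pairs and the three Allen relations are defined in the standard way; this is only an illustrative check, not OpenCog code:

```python
def meets(i1, i2):
    # Allen's "meets": i1 ends exactly where i2 begins
    return i1[1] == i2[0]

def during(i1, i2):
    # Allen's "during": i1 lies strictly inside i2
    return i2[0] < i1[0] and i1[1] < i2[1]

def after(i1, i2):
    # Allen's "after": i1 begins after i2 ends
    return i1[0] > i2[1]

# a crisp instance of rule (35.17): Meet(I1,I2) & During(I3,I2) => After(I3,I1)
i1, i2, i3 = (0, 2), (2, 10), (4, 6)
print(meets(i1, i2), during(i3, i2), after(i3, i1))  # True True True
```

The implication holds in general: During($I3, $I2) forces $I3 to start strictly after $I2's start, which by Meet($I1, $I2) is exactly $I1's end.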
And a last example with a metric predicate could be "if $X is near $Y and $X is far from $Z then $Y is far from $Z":

ForAllLink $X $Y $Z
    ImplicationLink_HOJ
        ANDLink
            Near($X, $Y)
            Far($X, $Z)
        Far($Y, $Z)
(35.18)

That is only a small and partial illustrative example; for instance, other rules may be used to specify that Near and Far are reflexive and symmetric.

35.5 Examples

The ideas presented here have extremely broad applicability; but for the sake of concreteness, we now give a handful of examples illustrating applications to commonsense reasoning problems.

35.5.1 Spatiotemporal Rules

The rules provided here are reduced to the strict minimum needed for the examples:

1. At $T, if $X is inside $Y and $Y is inside $Z then $X is inside $Z

ForAllLink $T $X $Y $Z
    ImplicationLink_HOJ
        ANDLink
            atTime($T, Inside($X, $Y))
            atTime($T, Inside($Y, $Z))
        atTime($T, Inside($X, $Z))

2. If a small object $X is over $Y and $Y is far from $Z then $X is far from $Z

ForAllLink $X $Y $Z
    ImplicationLink_HOJ
        ANDLink
            Small($X)
            Over($X, $Y)
            Far($Y, $Z)
        Far($X, $Z)

That rule is expressed in a crisp way, but again it is to be understood in an uncertain way, although we haven't worked out the exact formulae.
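The crisp reading of rule 2 can be sketched as a one-sweep forward chainer over facts stored as tuples. This is illustrative only (the constants are invented for the sketch, and the uncertain version would attach truth values to every fact and conclusion):

```python
def forward_chain_rule2(facts):
    """One sweep of rule 2: Small($X) & Over($X,$Y) & Far($Y,$Z) => Far($X,$Z).
    Facts are crisp tuples such as ("Far", "table", "door")."""
    small = {a for (p, *args) in facts if p == "Small" for a in args}
    over = {tuple(args) for (p, *args) in facts if p == "Over"}
    far = {tuple(args) for (p, *args) in facts if p == "Far"}
    return {("Far", x, z)
            for (x, y) in over
            for (y2, z) in far
            if x in small and y == y2}

facts = {("Small", "cup"), ("Over", "cup", "table"), ("Far", "table", "door")}
print(forward_chain_rule2(facts))  # {('Far', 'cup', 'door')}
```

Dropping the Small premise blocks the conclusion, mirroring the rule's requirement that only a small object inherits the Far relation from what it sits on.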
35.5.2 The Laptop is Safe from the Rain

A laptop is over the desk in the hotel room, and the desk is far from the window; we want to assess to what extent the laptop is far from the window, and therefore safe from the rain. Note that the truth values are ignored, but each concept is to be understood as fuzzy, that is, as having a PLN Fuzzy Truth Value; the numerical calculations are left out. We want to assess how far the laptop is from the window:

Far(Window, Laptop)

Assuming the following:

1. The laptop is small
   Small(Laptop)
2. The laptop is over the desk
   Over(Laptop, Desk)
3. The desk is far from the window
   Far(Desk, Window)

Now we can show an inference trail that leads to the conclusion; the numerical calculations are left for later.

1. Using axioms 1, 2, 3 and the PLN AND rule:

ANDLink
    Small(Laptop)
    Over(Laptop, Desk)
    Far(Desk, Window)

2. Using spatiotemporal rule 2, instantiated with $X = Laptop, $Y = Desk and $Z = Window:

ImplicationLink_HOJ
    ANDLink
        Small(Laptop)
        Over(Laptop, Desk)
        Far(Desk, Window)
    Far(Laptop, Window)

3. Using the result of the previous step as premise, with the PLN implication rule:

Far(Laptop, Window)

35.5.3 Fetching the Toy Inside the Upper Cupboard

Suppose we know that there is a toy in an upper cupboard, near a bag, and we want to assess to what extent climbing on the pillow is going to bring us near the toy. Here are the assumptions:

1. The toy is near the bag and inside the cupboard. The pillow is near and below the cupboard.

Near(toy, bag) (tv1)
Inside(toy, cupboard) (tv2)
Below(pillow, cupboard) (tv3)
Near(pillow, cupboard) (tv4)
2. If the toy is near the bag inside the cupboard, how near is the toy to the edge of the cupboard?

ImplicationLink_HOJ
    ANDLink
        Near(toy, bag) (tv1)
        Inside(toy, cupboard) (tv2)
    ANDLink tv3 = F1(tv1, tv2)
        Near(toy, cupboard_edge) (tv3)

3. If I climb on the pillow, then shortly after I'll be on the pillow

PredictiveImplicationLink
    Climb_on(pillow)
    Over(self, pillow)

4. If I am on the pillow near the edge of the cupboard, how near am I to the toy?

ImplicationLink_HOJ
    ANDLink
        Below(pillow, cupboard) (tv1)
        Near(pillow, cupboard) (tv2)
        Over(self, pillow) (tv3)
        Near(toy, cupboard_edge) (tv4)
    ANDLink tv5 = F2(tv1, tv2, tv3, tv4)
        Near(self, toy) (tv5)

The target theorem is "how near am I to the toy if I climb on the pillow":

PredictiveImplicationLink
    Climb_on(pillow)
    Near(self, toy) (?)

And the inference chain is as follows:

1. Axiom 2 with axiom 1:

Near(toy, cupboard_edge) (tv6)

2. Step 1 with axioms 1 and 3:

PredictiveImplicationLink
    Climb_on(pillow)
    ANDLink
        Below(pillow, cupboard) (tv3)
        Near(pillow, cupboard) (tv4)
        Over(self, pillow) (tv7)
        Near(toy, cupboard_edge) (tv6)

3. Step 2 with axiom 4, giving the target theorem (how near am I to the toy if I climb on the pillow):

PredictiveImplicationLink
    Climb_on(pillow)
    Near(self, toy) (tv8)
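The numerical side of this chain, left out in the text, can be mimicked with a deliberately crude stand-in: treating every connector as a fuzzy AND computed by min. Both the connector choice and the axiom strengths below are invented for illustration, not taken from the text:

```python
def fuzzy_and(*tvs):
    # crude stand-in for the unspecified connectors: fuzzy AND as min
    return min(tvs)

# illustrative axiom strengths (invented numbers)
tv1, tv2 = 0.9, 0.8    # Near(toy, bag), Inside(toy, cupboard)
tv3, tv4 = 0.7, 0.9    # Below(pillow, cupboard), Near(pillow, cupboard)
tv_climb = 0.95        # Climb_on(pillow) => Over(self, pillow)

tv6 = fuzzy_and(tv1, tv2)                 # step 1: Near(toy, cupboard_edge)
tv8 = fuzzy_and(tv3, tv4, tv_climb, tv6)  # steps 2-3: Climb_on => Near(self, toy)
print(tv6, tv8)  # 0.8 0.7
```

Whatever connectors replace min in a real system, the shape of the computation is the same: each inference step combines the strengths of its premises into the strength of its conclusion.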
35.6 An Integrative Approach to Planning

Planning is a major research area in the mainstream AI community, and planning algorithms have advanced dramatically in the last decade. However, the best-of-breed planning algorithms are still not able to deal with planning in complex environments in the face of a high level of uncertainty, which is the sort of situation routinely faced by humans in everyday life. Really powerful planning, we suggest, requires an approach different from any of the dedicated planning algorithms, involving spatiotemporal logic combined with a sophisticated search mechanism (such as MOSES).

It may be valuable (or even necessary) for an intelligent system involved in planning-intensive goals to maintain a specialized planning-focused data structure to guide general learning mechanisms toward more efficient learning in a planning context. But even if so, we believe planning must ultimately be done as a case of more general learning, rather than via a specialized algorithm. The basic approach we suggest here is to:

• use MOSES for the core plan learning algorithm. That is, MOSES would maintain a population of "candidate partial plans", and evolve this population in an effort to find effective complete plans.
• use PLN to help in the fitness evaluation of candidate partial plans. That is, PLN would be used to estimate the probability that a partial plan can be extended into a high-quality complete plan. This requires PLN to make heavy use of spatiotemporal logic, as described in the previous sections of this chapter.
• use a GraphPlan-style [BF97] planning graph, to record information about candidate plans, and to propagate information about mutual exclusion between actions. The planning graph may be used to help guide both MOSES and PLN.
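The shape of the evolve-and-evaluate loop in the bullets above can be sketched in miniature. The code below is a toy stand-in, not MOSES: plans are action tuples, "evolution" is elitist selection plus point mutation, and the fitness function passed in plays the role of PLN's success-probability estimate. All names and the target plan are invented for the sketch:

```python
import random

def evolve_plans(actions, fitness, plan_len=3, pop_size=20, generations=30, seed=0):
    """Maintain a population of candidate plans, mutate them, and select
    by the supplied fitness function (the PLN stand-in)."""
    rng = random.Random(seed)
    pop = [tuple(rng.choice(actions) for _ in range(plan_len))
           for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]
        children = []
        for p in parents:
            q = list(p)
            q[rng.randrange(plan_len)] = rng.choice(actions)  # point mutation
            children.append(tuple(q))
        pop = parents + children  # elitism: parents survive unchanged
    return max(pop, key=fitness)

# toy fitness: fraction of a target plan matched position-by-position
target = ("reach", "grab", "lift")
fit = lambda plan: sum(a == b for a, b in zip(plan, target)) / len(target)
best = evolve_plans(["reach", "grab", "lift", "wave"], fit)
```

In the real design, the fitness call would be a PLN inference about the plan's chance of completion, and mutations would be biased by the planning graph rather than uniform.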
In essence, the planning graph simply records different states of the world that may be achievable, with a high-strength PredictiveImplicationLink pointing between states X and Y if X can sensibly serve as a predecessor to Y; and a low-strength (but potentially high-confidence) PredictiveImplicationLink between X and Y if the former excludes the latter. This may be a subgraph of the Atomspace, or it may be separately cached; but in either case it must be frequently accessed by PLN, in order for the latter to avoid making a massive number of unproductive inferences in the course of assisting with planning. One can think of this as being a bit like PGraphPlan [BL99], except that:

• MOSES is being used in place of forward or backward chaining search, enabling a more global search of the plan space (mixing forward and backward learning freely)
• PLN is being used to estimate the value of partial plans, replacing heuristic methods of value propagation

Regarding PLN, one possibility would be to (explicitly, or in effect) create a special API function looking something like

EstimateSuccessProbability(PartialPlan PP, Goal G)

(assuming the goal statement contains information about the time allotted to achieve the goal). The PartialPlan is simply a predicate composed of predicates linked together via temporal
links such as PredictiveImplication and SimultaneousAND. Of course, such a function could be used within many non-MOSES approaches to planning also. Put simply, the estimation of the success probability is "just" a matter of asking the PLN backward-chainer to figure out the truth value of a certain ImplicationLink, i.e.

PredictiveImplicationLink [time-lag T]
    EvaluationLink
        do
        PP
    G

But of course, this may be a very difficult inference without some special guidance to help the backward chainer. The GraphPlan-style planning graph could be used by PLN to guide it in doing the inference, by telling it which variables to look at in doing its inferences. This sort of reasoning also requires PLN to have a fairly robust capability to reason about time intervals and events occurring therein (i.e., basic temporal inference).

Regarding MOSES: given a candidate plan, it could look into the planning graph to aid with program tree expansion. That is, given a population of partial plans, MOSES would progressively add new nodes to each plan, representing predecessors or successors to the actions already described in the plans. In choosing which nodes to add, it could be probabilistically biased toward adding nodes suggested by the planning graph.

So, overall, what we have is an approach to doing planning via MOSES, with PLN for fitness estimation, but using a GraphPlan-style planning graph to guide MOSES's exploration of the neighborhood of partial plans, and to guide PLN's inferences regarding the success likelihood of partial plans.
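A minimal sketch of the planning-graph idea, with a crude stand-in for EstimateSuccessProbability: states are nodes, edges carry PredictiveImplicationLink-style strengths, and the estimate naively multiplies strengths along the plan as if steps were independent (real PLN inference would do far more). The class and state names are hypothetical:

```python
class PlanningGraph:
    """World states with strength-weighted successor edges, standing in
    for high-strength PredictiveImplicationLinks between states."""
    def __init__(self):
        self.succ = {}  # state -> {next_state: strength}

    def add_link(self, x, y, strength):
        self.succ.setdefault(x, {})[y] = strength

    def estimate_success_probability(self, partial_plan, goal):
        """Crude stand-in for EstimateSuccessProbability(PP, G): multiply
        edge strengths along the plan and into the goal; missing edges
        count as strength 0."""
        p = 1.0
        states = list(partial_plan) + [goal]
        for x, y in zip(states, states[1:]):
            p *= self.succ.get(x, {}).get(y, 0.0)
        return p

g = PlanningGraph()
g.add_link("at_door", "door_open", 0.9)
g.add_link("door_open", "inside_room", 0.8)
est = g.estimate_success_probability(["at_door", "door_open"], "inside_room")
print(round(est, 2))  # 0.72
```

Mutual-exclusion information would be represented in the same structure as low-strength links, letting a chainer prune plans that pass through mutually exclusive states.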
Chapter 36
Adaptive, Integrative Inference Control

36.1 Introduction

The subtlest and most difficult aspect of logical inference is not the logical rule-set nor the management of uncertainty, but the control of inference: the choice of which inference steps to take, in what order, in which contexts. Without effective inference control methods, logical inference is an unscalable and infeasible approach to learning declarative knowledge. One of the key ideas underlying the CogPrime design is that inference control cannot effectively be handled by looking at logic alone. Instead, effective inference control must arise from the intersection between logical methods and other cognitive processes. In this chapter we describe some of the general principles used for inference control in the CogPrime design.

Logic itself is quite abstract and relatively (though not entirely) independent of the specific environment and goals with respect to which a system's intelligence is oriented. Inference control, however, is (among other things) a way of adapting a logic system to operate effectively with respect to a specific environment and goal-set. So the reliance of CogPrime's inference control methods on the integration between multiple cognitive processes is a reflection of the foundation of CogPrime on the assumption (articulated in Chapter 9) that the relevant environment and goals embody interactions between world-structures and interaction-structures best addressed by these various processes.

36.2 High-Level Control Mechanisms

The PLN implementation in CogPrime is complex and lends itself to utilization via many different methods. However, a convenient way to think about it is in terms of three basic backward-focused query operations:

• findtv, which takes in an expression and tries to find its truth value.
• findExamples, which takes an expression containing variables and tries to find concrete terms to fill in for the variables.
• createExamples, which takes an expression containing variables and tries to create new Atoms to fill in for the variables, using concept creation heuristics as discussed in Chapter 38, coupled with inference for evaluating the products of concept creation.
and one forward-chaining operation:

• findConclusions, which takes a set of Atoms and seeks to draw the most interesting possible set of conclusions via combining them with each other and with other knowledge in the AtomTable.

These inference operations may of course call themselves and each other recursively, thus creating lengthy chains of diverse inference.

Findtv is quite straightforward, at the high level of discussion adopted here. Various inference rules may match the Atom; in our current PLN implementation, loosely described below, these inference rules are executed by objects called Rules. In the course of executing findtv, a decision must be made regarding how much attention to allocate to each one of these Rule objects, and some choices must be made by the objects themselves; these issues involve processes beyond pure inference, and will be discussed later in this chapter. Depending on the inference rules chosen, findtv may lead to the construction of inferences involving variable expressions, which may then be evaluated via findExamples or createExamples queries.

The findExamples operation sometimes reduces to a simple search through the AtomSpace. On the other hand, it can also be done in a subtler way. If the findExamples Rule wants to find examples of $X so that F($X), but can't find any, then it can perform some sort of heuristic search, or else it can run another findExamples query, looking for $G so that

Implication $G F

and then run findExamples on $G rather than F. But what if this findExamples query doesn't come up with anything? Then it needs to run a createExamples query on the same implication, trying to build a $G satisfying the implication.

Finally, forward-chaining inference (findConclusions) may be conceived of as a special heuristic for handling special kinds of findExamples problems.
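The two simplest operations can be illustrated with a toy, dictionary-backed stand-in for the Atomspace. This is not the OpenCog API; the class, table layout and threshold are invented purely to make the query semantics concrete:

```python
class MiniAtomSpace:
    """Toy Atomspace stand-in: maps (predicate, argument) to a strength.
    Illustrates findtv and findExamples only."""
    def __init__(self, atoms):
        self.atoms = dict(atoms)

    def findtv(self, pred, arg):
        # find the truth value of pred(arg), if known
        return self.atoms.get((pred, arg))

    def find_examples(self, pred, min_tv=0.5):
        # find $X such that pred($X) holds with strength >= min_tv
        return sorted(a for (p, a), tv in self.atoms.items()
                      if p == pred and tv >= min_tv)

space = MiniAtomSpace({("breathes", "cat"): 0.95,
                       ("breathes", "rock"): 0.02,
                       ("breathes", "dog"): 0.90})
print(space.findtv("breathes", "cat"))   # 0.95
print(space.find_examples("breathes"))   # ['cat', 'dog']
```

The fallback behaviors described above slot in naturally: when find_examples returns nothing, a real system would search for implications into the target predicate, and failing that, invoke concept creation (createExamples).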
Suppose we have K Atoms and want to find out what consequences logically ensue from these K Atoms, taken together. We can form the conjunction of the K Atoms (let's call it C), and then look for $D so that

Implication C $D

Conceptually, this can be approached via findExamples, which defaults to createExamples in cases where nothing is found. However, this sort of findExamples problem is special, involving appropriate heuristics for combining the conjuncts contained in the expression C, which embody the basic logic of forward-chaining rather than backward-chaining inference.

36.2.1 The Need for Adaptive Inference Control

It is clear that in humans, inference control is all about context. We use different inference strategies in different contexts, and learning these strategies is most of what learning to think is all about. One might think to approach this aspect of cognition, in the CogPrime design, by introducing a variety of different inference control heuristics, each one giving a certain algorithm for choosing which inferences to carry out in which order in a certain context. (This is similar to what has been done within Cyc, for example: http://cyc.com.) However, in keeping with the integrated intelligence theme that pervades CogPrime, we have chosen an alternate strategy for PLN. We have one inference control scheme, which is quite simple, but which relies partially on
structures coming from outside PLN proper. The requisite variety of inference control strategies is provided by variety in the non-PLN structures, such as:

• HebbianLinks existing in the AtomTable
• patterns recognized via pattern-mining in the corpus of prior inference trails

36.3 Inference Control in PLN

We will now describe the basic "inference control" loop of PLN in CogPrime. Pre-2013 OpenCog versions used a somewhat different scheme, more similar to a traditional logic engine. The approach presented here is more cognitive-synergy oriented, achieving PLN control via a combination of logic-engine-style methods and integration with attention allocation.

36.3.1 Representing PLN Rules as GroundedSchemaNodes

PLN inference rules may be represented as GroundedSchemaNodes. So, for instance, the PLN Deduction Rule becomes a GroundedSchemaNode with the properties:

• Input: a pair of links (L1, L2), where L1 and L2 are the same type, which must be one of InheritanceLink, ImplicationLink, SubsetLink or ExtensionalImplicationLink
• Output: a single link, of the same type as the input

The actual PLN Rules and Formulas are then packed into the internal execution methods of GroundedSchemaNodes. In the current PLN code, each inference rule has a Rule class and a separate Formula class. So then, e.g., the PLNDeductionRule GroundedSchemaNode invokes a function of the general form

Link PLNDeductionRule(Link L1, Link L2)

which calculates the deductive consequence of two links.
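The Rule/Formula split can be sketched in miniature. The shell below mirrors the input condition just described (same allowed link type, matching middle term), and the strength calculation uses the independence-based deduction formula associated with PLN (sAC = sAB·sBC + (1−sAB)(sC − sB·sBC)/(1−sB)); the tuple representation of links and the extra node-probability arguments are simplifications invented for this sketch:

```python
ALLOWED_TYPES = {"InheritanceLink", "ImplicationLink",
                 "SubsetLink", "ExtensionalImplicationLink"}

def simple_deduction_strength(s_ab, s_bc, s_b, s_c):
    """Independence-based deduction strength:
    sAC = sAB*sBC + (1 - sAB)*(sC - sB*sBC)/(1 - sB)."""
    if s_b >= 1.0:
        return s_c  # degenerate case: B covers everything
    return s_ab * s_bc + (1 - s_ab) * (s_c - s_b * s_bc) / (1 - s_b)

def pln_deduction_rule(l1, l2, s_b, s_c):
    """Rule shell: check the input condition (same allowed link type,
    l1's target == l2's source), then delegate to the Formula."""
    (t1, a, b1, s_ab), (t2, b2, c, s_bc) = l1, l2
    if t1 != t2 or t1 not in ALLOWED_TYPES or b1 != b2:
        raise ValueError("inputs do not satisfy the rule's input condition")
    return (t1, a, c, simple_deduction_strength(s_ab, s_bc, s_b, s_c))

out = pln_deduction_rule(("InheritanceLink", "people", "animal", 1.0),
                         ("InheritanceLink", "animal", "breathe", 0.9),
                         s_b=0.2, s_c=0.3)
print(out)  # ('InheritanceLink', 'people', 'breathe', 0.9)
```

When sAB = 1 the formula reduces to sBC, and when sAB = 0 it reduces to P(C|¬B), which is a quick sanity check on its shape.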
This function then invokes a function of the form

TruthValue PLNDeductionFormula(TruthValue tAB, TruthValue tBC,
                               TruthValue tA, TruthValue tB, TruthValue tC)

which in turn invokes functions such as

SimpleTruthValue SimplePLNDeductionFormula(SimpleTruthValue tAB, SimpleTruthValue tBC,
                                           SimpleTruthValue tA, SimpleTruthValue tB,
                                           SimpleTruthValue tC)

IndefiniteTruthValue IndefinitePLNDeductionFormula(IndefiniteTruthValue tAB, IndefiniteTruthValue tBC,
                                                   IndefiniteTruthValue tA, IndefiniteTruthValue tB,
                                                   IndefiniteTruthValue tC)

36.3.2 Recording Executed PLN Inferences in the Atomspace

Once an inference has been carried out, it can be represented in the Atomspace, e.g. as
ExecutionLink
    GroundedSchemaNode: PLNDeductionRule
    ListLink
        HypotheticalLink
            InheritanceLink people animal <tv1>
        HypotheticalLink
            InheritanceLink animal breathe <tv2>
        HypotheticalLink
            InheritanceLink people breathe <tv3>

Note that a link such as

InheritanceLink people breathe <.8,.2>

will have its truth value stored as a truth-value version within a CompositeTruthValue object. In the above, e.g.,

InheritanceLink people animal

is used as shorthand for

InheritanceLink C1 C2

where C1 and C2 are ConceptNodes representing "people" and "animal" respectively. We can also have records of inferences involving variables, such as

ExecutionLink
    GroundedSchemaNode: PLNDeductionRule
    ListLink
        HypotheticalLink
            InheritanceLink $V1 animal <tv1>
        HypotheticalLink
            InheritanceLink animal breathe <tv2>
        HypotheticalLink
            InheritanceLink $V1 breathe <tv3>

where $V1 is a specific VariableNode.

36.3.3 Anatomy of a Single Inference Step

A single inference step, then, may be viewed as follows:

1. Choose an inference rule R, and a tuple of Atoms that collectively match the input conditions of the rule
2. Apply the chosen rule R to the chosen input Atoms
3. Create an ExecutionLink recording the output found
4. In addition to retaining this ExecutionLink in the Atomspace, also save a copy of it to the InferenceRepository (this is not needed for the very first implementation, but will be very useful once PLN is in regular use)

The InferenceRepository, referred to here, is a special Atomspace that exists just to save a record of PLN inferences. It can be mined, after the fact, to learn inference patterns, which can be used to guide future inferences.
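The four-step loop can be sketched end to end. Here a dictionary record plays the role of the ExecutionLink, and plain lists stand in for the Atomspace and the InferenceRepository; the toy deduction over triples is invented for the sketch:

```python
def run_inference_step(rule_name, rule_fn, inputs, atomspace, repository):
    """One inference step: apply the chosen rule to the chosen inputs,
    build an ExecutionLink-style record, retain it, and copy it to the
    InferenceRepository stand-in."""
    output = rule_fn(*inputs)                                   # step 2
    record = {"rule": rule_name,                                # step 3
              "inputs": list(inputs),
              "output": output}
    atomspace.append(record)                                    # step 3 (retain)
    repository.append(record)                                   # step 4 (copy)
    return output

def deduction(l1, l2):
    # toy deduction over ("Inheritance", source, target) triples
    assert l1[2] == l2[1], "l1's target must match l2's source"
    return ("Inheritance", l1[1], l2[2])

atomspace, repository = [], []
out = run_inference_step("PLNDeductionRule", deduction,
                         [("Inheritance", "people", "animal"),
                          ("Inheritance", "animal", "breathe")],
                         atomspace, repository)
print(out)  # ('Inheritance', 'people', 'breathe')
```

Mining the repository then amounts to scanning these records for recurring rule sequences, which is exactly the role the text assigns to the InferenceRepository.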















