22.12 Goals and Time

The CogPrime system maintains an explicit list of "Ubergoals", which, as will be explained in Chapter 24, receive attentional currency that they may then allocate to their subgoals according to a particular mechanism. However, there is one subtle factor involved in the definition of the Ubergoals: time. The truth value of an Ubergoal is typically defined as the average level of satisfaction of some Demand over some period of time, but the time scale of this averaging can be very important. In many cases, it may be worthwhile to have separate Ubergoals corresponding to the same Demand but doing their truth-value time-averaging over different time scales. For instance, corresponding to Demands such as Novelty or Health, we may posit both long-term and short-term versions, leading to Ubergoals such as CurrentNovelty, LongTermNovelty, CurrentHealth, LongTermHealth, etc. Of course, one could also wrap multiple Ubergoals corresponding to a single Demand into a single Ubergoal combining estimates over multiple time scales; this is not a critical issue, and the only point of splitting Demands into multiple Ubergoals is that it can make things slightly simpler for other cognitive processes.

For instance, if the agent has a goal of pleasing Bob, and it knows Bob likes to be presented with surprising structures and ideas, then the agent has some tricky choices to make. Among other choices, it must balance between focusing on

• creating things and then showing them to Bob
• studying basic knowledge and improving its skills.

Perhaps studying basic knowledge and skills will give it a foundation to surprise Bob much more dramatically in the mid-term future, but in the short run it will not allow it to surprise Bob much at all, because Bob already knows all the basic material. This is essentially a variant of the general "exploration versus exploitation" dichotomy, which lacks any easy solution.
Young children are typically poor at carrying out this kind of balancing act, and tend to focus overly much on near-term satisfaction. There are also significant cultural differences in the heuristics with which adult humans face these issues; e.g., in some contexts East Asian cultures tend to focus more on mid- to long-term satisfaction, whereas Western cultures are more short-term oriented.
Chapter 23
Attention Allocation

Co-authored with Joel Pitt, Matt Iklé, and Rui Liu

23.1 Introduction

The critical factor shaping real-world general intelligence is resource constraint. Without this issue, we could just have simplistic program-space-search algorithms like AIXItl instead of complicated systems like the human brain or CogPrime. Resource constraint is managed implicitly within various components of CogPrime, for instance in the population size used in evolutionary learning algorithms, and in the depth of forward or backward chaining inference trees in PLN. But there is also a component of CogPrime that manages resources in a global and cognitive-process-independent manner: the attention allocation component.

The general principles the attention allocation process should follow are easy enough to see: history should be used as a guide, and an intelligence should make probabilistic judgments based on its experience, guessing which resource-allocation decisions are likely to maximize its goal-achievement. The problem is that this is a difficult learning and inference problem, and to carry it out with excellent accuracy would require a limited-resources intelligent system to spend nearly all its resources deciding what to pay attention to, and nearly none of them actually paying attention to anything else. Clearly this would be a very poor allocation of an AI system's attention! So simple heuristics are called for, to be supplemented by more advanced and expensive procedures on those occasions where time is available and correct decisions are particularly crucial.

Attention allocation plays, to a large extent, a "meta" role in enabling mind-world correspondence.
Without effective attention allocation, the other cognitive processes can't do their jobs of helping an intelligent agent to achieve its goals in an environment, because they won't be able to pay attention to the most important parts of the environment, and won't get computational resources at the times when they need them. Of course, this need could be addressed in multiple different ways. For example, in a system with multiple complex cognitive processes, one could have attention allocation handled separately within each cognitive process, and then a simple "top layer" of attention allocation managing the resources allocated to each cognitive process. On the other hand, one could also do attention allocation via a single dynamic, pervasive both within and between individual cognitive processes. The CogPrime design gravitates more toward the latter approach, though also with some specific mechanisms within various MindAgents; and efforts have been made to have these specific mechanisms modulated by the generic attention allocation structures and dynamics wherever possible.
In this chapter we will dig into the specifics of how these attention allocation issues are addressed in the CogPrime design. In short, they are addressed via a set of mechanisms and equations for dynamically adjusting importance values attached to Atoms and MindAgents. Different importance values exist pertinent to different time scales, most critically the short-term (STI) and long-term (LTI) importances. The use of two separate time-scales here reflects fundamental aspects of human-like general intelligence and real-world computational constraints. The dynamics of STI are oriented partly toward the need for real-time responsiveness, and partly toward the more thoroughgoing need for cognitive processing at speeds vaguely resembling the speed of "real time" social interaction. The dynamics of LTI are based on the fact that some data tends to be useful over long periods of time, years or decades in the case of human life, but the practical capability to store large amounts of data in a rapidly accessible way is limited. One could imagine environments in which very-long-term multiple-type memory was less critical than it is in typical human-friendly environments; and one could envision AGI systems carrying out tasks in which real-time responsiveness was unnecessary (though even then some attention focusing would certainly be necessary). For AGI systems like these, an attention allocation system based on STI and LTI with CogPrime-like equations would likely be inappropriate. But for an AGI system intended to control a vaguely human-like agent in an environment vaguely resembling everyday human environments, the focus on STI and LTI values, and the dynamics proposed for these values in CogPrime, appear to make sense.
Two basic innovations are involved in the mechanisms attached to these STI and LTI importance values:

• treating attention allocation as a data mining problem: the system records information about what it's done in the past and what goals it's achieved in the past, and then recognizes patterns in this history and uses them to guide its future actions via probabilistically adjusting the (often context-specific) importance values associated with internal terms, actors and relationships, and adjusting the "effort estimates" associated with Tasks
• using an artificial-economics approach to update the importance values (attached to Atoms, MindAgents, and other actors in the CogPrime system) that regulate system attention. (And, more speculatively, using an information geometry based approach to execute the optimization involved in the artificial economics approach efficiently and accurately.)

The integration of these two aspects is crucial. The former aspect provides fundamental data about what's of value to the system, and the latter aspect allows this fundamental data to be leveraged to make sophisticated and integrative judgments rapidly. The need for the latter, rapid-updating aspect exists partly because of the need for real-time responsiveness, imposed by the need to control a body in a rapidly dynamic world, and the prominence in the architecture of an animal-like cognitive cycle. The need for the former, data-mining aspect (or something functionally equivalent) exists because, in the context of the tasks involved in human-level general intelligence, the assignment of credit problem is hard: the relations between various entities in the mind and the mind's goals are complex, and identifying and deploying these relationships is a difficult learning problem requiring application of sophisticated intelligence.
Both of these aspects of attention allocation dynamics may be used in computationally lightweight or computationally sophisticated manners:

• For routine use in real-time activity
- "data mining" consists of forming HebbianLinks (involved in the associative memory and inference control; see Section 23.5), where the weight of the link from Atom A to Atom B is based on the probability of shared utility of A and B
- economic attention allocation consists of spreading ShortTermImportance and LongTermImportance "artificial currency" values (both grounded in the universal underlying "juju" currency value defined further below) between Atoms, according to specific equations that somewhat resemble neural net activation equations but respect the conservation of currency

• For use in cases where large amounts of computational resources are at stake based on localized decisions, hence allocation of substantial resources to specific instances of attention allocation is warranted

- "data mining" may be more sophisticated, including use of PLN, MOSES and pattern mining to recognize patterns regarding what probably deserves more attention in what contexts
- economic attention allocation may involve more sophisticated economic calculations, involving the expected future values of various "expenditures" of resources

The particular sort of "data mining" going on here is definitely not exactly what the human brain does, but we believe this is a case where slavish adherence to neuroscience would be badly suboptimal (even if the relevant neuroscience were well known, which is not the case). Doing attention allocation entirely in a distributed, formal-neural-net-like way is, we believe, extremely and unnecessarily inefficient; given realistic resource constraints, it leads to the rather poor attention allocation that we experience every day in our ordinary waking state of consciousness.
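The currency-conserving spreading mentioned in the bullets above can be sketched in a few lines. This is a minimal illustration under assumed names (`spread_sti`, dict-based Atom and link tables), not the actual OpenCog implementation; the point it demonstrates is the property that separates ECAN spreading from neural-net activation: the total amount of currency is conserved.

```python
def spread_sti(sti, links, spread_fraction=0.2):
    """sti: dict mapping atom -> STI currency.
    links: dict mapping (src, dst) -> Hebbian weight in [0, 1].
    Each atom spends a fraction of its STI, divided among its outgoing
    links in proportion to link weight. Total STI is conserved."""
    delta = {a: 0.0 for a in sti}
    for src in sti:
        out = {dst: w for (s, dst), w in links.items() if s == src}
        total_w = sum(out.values())
        if total_w == 0:
            continue
        budget = sti[src] * spread_fraction  # currency the atom spends
        delta[src] -= budget
        for dst, w in out.items():
            delta[dst] += budget * w / total_w
    return {a: sti[a] + delta[a] for a in sti}

sti = {"A": 10.0, "B": 0.0, "C": 0.0}
links = {("A", "B"): 0.9, ("A", "C"): 0.1}
new = spread_sti(sti, links)
```

A MindAgent-driven version would differ in detail, but the conservation invariant would be the same.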
Several aspects of attention allocation can be fruitfully done in a distributed, neural-net-like way, but not having a logically centralized repository of system-history information (regardless of whether it's physically distributed or not) seems intrinsically problematic in terms of effective attention allocation. And we argue that, even for those aspects of attention allocation that are best addressed in terms of distributed, vaguely neural-net-like dynamics, an artificial-economics approach has significant advantages over a more strictly neural-net-like approach, due to the greater ease of integration with other cognitive mechanisms such as forgetting and data mining.

23.2 Semantics of Short and Long Term Importance

We now specify the two types of importance value (short and long term) that play a key role in CogPrime dynamics. Conceptually, ShortTermImportance (STI) is defined as

STI(A) = P(A will be useful in the near future)

whereas LongTermImportance (LTI) is defined as

LTI(A) = P(A will be useful eventually, in the foreseeable future)

Given a time-scale T, in general we can define an importance value relative to T as

I_T(A) = P(A will be useful during the next T seconds)
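The time-scale-relative importance I_T(A) can be grounded empirically. As a toy illustration (the windowing scheme and names are our own assumptions, not part of the design), one can take a log of the times an Atom was actually useful and count how often a window of length T contains at least one use:

```python
def importance_T(use_times, horizon, T):
    """Empirical stand-in for I_T(A) = P(A will be useful during the next
    T seconds): the fraction of window start times t in [0, horizon - T]
    such that A was used at some time in (t, t + T]."""
    starts = range(horizon - T + 1)
    hits = sum(1 for t in starts if any(t < u <= t + T for u in use_times))
    return hits / len(starts)

# An Atom used at times 2 and 7, over a 10-step history:
short_term = importance_T([2, 7], horizon=10, T=3)
long_term = importance_T([2, 7], horizon=10, T=5)
```

In this example the estimate grows with T, illustrating why STI (small T) and LTI (large T) can disagree sharply for the same Atom.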
In the ECAN module in CogPrime, we deal only with STI and LTI rather than any other importance values, and the dynamics of STI and LTI are dealt with by treating them as two separate "artificial currency" values, which are however interconvertible via being mutually grounded in a common currency called "juju."

For instance, if the agent is intensively concerned with trying to build interesting blocks structures, then knowledge about interpreting biology research paper abstracts is likely to be of very little current importance. So its biological knowledge will get low STI; but, assuming the agent expects to use biology again, that knowledge should maintain reasonably high LTI so it can remain in memory for future use. And if in its brainstorming about what blocks structures to build, the system decides to use some biological diagrams as inspiration, STI can always spread to some of the biology-related Atoms, increasing their relevance and getting them more attention.

While the attention allocation system contains mechanisms to convert STI to LTI, it also has parameter settings biasing it to spend its juju on both kinds of importance; i.e., it contains an innate bias to both focus its attention judiciously and manage its long-term memory conscientiously.

Because in CogPrime most computations involving STI and LTI are required to be very rapid (as they're done for many Atoms in the memory very frequently), in most cases when dealing with these quantities it will be appropriate to sacrifice accuracy for efficiency. On the other hand, it's useful to occasionally be able to carry out expensive, highly accurate computations involving importance. An example where doing expensive computations about attention allocation might pay off would be the decision whether to use biology-related or engineering-related metaphors in creating blocks structures to please a certain person.
In this case it could be worth doing a few steps of inference to figure out whether there's a greater intensional similarity between that person's interests and biology or engineering, and then using the results to adjust the STI levels of whichever of the two comes out most similar. This would not be a particularly expensive inference to carry out, but it's still much more effort than what can be expended on Atoms in the memory most of the time. Most attention allocation in CogPrime involves simple neural-net type spreading dynamics rather than explicit reasoning.

Figure 23.1 illustrates the key role of LTI in the forgetting process. Figure 23.2 illustrates the key role of STI in maintaining a "moving bubble of attention", which we call the system's AttentionalFocus.

23.2.1 The Precise Semantics of STI and LTI

Now we precisiate the above definitions of STI and LTI. First, we introduce the notion of reward. Reward is something that Goals give to Atoms. In principle a Goal might give an Atom reward in various different forms, though in the design given here, reward will be given in units of a currency called juju. The process by which Goals assign reward to Atoms is part of the "assignment of credit" process (and we will later discuss the various time-scales on which assignment of credit may occur, and their relationship to the time-scale parameter within LTI). Next, we define

J(A, t1, t2, r) = the expected amount of reward A will receive between t1 and t2 time-steps in the future, if its STI has percentile rank r among all Atoms in the AtomTable
Fig. 23.1: LongTermImportance and Forgetting. [Figure: decision flow for an Atom, from "stay in AtomSpace" to "remove from AtomSpace", and then either "save to backup store" or "permanently delete".]

The percentile rank r of an Atom is the rank of that Atom in a list of Atoms ordered by decreasing STI, divided by the total number of Atoms. The reason for using a percentile rank instead of the STI itself is that at any given time only a limited number of Atoms can be given attention, so all Atoms below a certain percentile rank, depending on the amount of available resources, will simply be ignored. This is a fine-grained measure of how worthwhile it is expected to be to increase A's STI, in terms of getting A rewarded by Goals.

For practical purposes it is useful to collapse J(A, t1, t2, r) to a single number:

J(A, t1, t2) = ( Σ_r J(A, t1, t2, r) w_r ) / ( Σ_r w_r )

where w_r weights the different percentile ranks (and should be chosen to be monotone increasing in r). This is a single-number measure of the responsiveness of an Atom's utility to its STI level. So, for instance, if A has a lot of STI and it turns out to be rewarded, then J(A, t1, t2) will be high. On the other hand, if A has little STI, then whether it gets rewarded or not will not influence J(A, t1, t2) much. To simplify notation, it's also useful to define a single-time-point version J(A, t) = J(A, t, t).

23.2.1.1 Formalizing STI

Using these definitions, one simple way to make the STI definition precise is:

STI_thresh(A, t) = P( J(A, t, t + t_short) > s_threshold )

where s_threshold demarcates the "attentional focus boundary." This is a way of saying that we don't want to give STI to Atoms that would not get rewarded if they were given attention.
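To make the bookkeeping concrete, here is a sketch of the percentile-rank computation and the weighted collapse of J described above; the function names and the particular weight function are illustrative assumptions, not CogPrime code.

```python
def percentile_ranks(sti):
    """Percentile rank r of each Atom: its position in a list ordered by
    decreasing STI, divided by the total number of Atoms."""
    order = sorted(sti, key=sti.get, reverse=True)
    n = len(order)
    return {atom: (i + 1) / n for i, atom in enumerate(order)}

def collapse_J(J_by_rank, w):
    """Collapse J(A, t1, t2, r) over ranks: a weighted average with a
    weight function w that is monotone increasing in r."""
    num = sum(J_by_rank[r] * w(r) for r in J_by_rank)
    den = sum(w(r) for r in J_by_rank)
    return num / den

ranks = percentile_ranks({"A": 5.0, "B": 2.0, "C": 1.0})
J_collapsed = collapse_J({0.25: 1.0, 0.5: 2.0, 1.0: 4.0}, w=lambda r: r)
```

With the linear weight w(r) = r, high-rank behavior dominates the collapsed value, as the "responsiveness to STI" interpretation suggests it should.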
Fig. 23.2: Formation of the AttentionalFocus. The dynamics of STI are configured to encourage the emergence of richly cross-connected networks of Atoms with high STI (above a threshold called the AttentionalFocusBoundary), passing STI among each other as long as this is useful and forming new HebbianLinks among each other. The collection of these Atoms is called the AttentionalFocus.

Or one could make the STI definition precise in a fuzzier way, and define

STI_fuzzy(A, t) = Σ_{s=0}^{∞} J(A, t + s) c^s

for some appropriate parameter c, 0 < c < 1 (or something similar with a decay function less severe than exponential).
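Numerically, the fuzzy definition above is just an exponentially discounted sum of expected future reward, familiar from reinforcement learning. A toy version, with a hypothetical J sequence supplied as a list:

```python
def sti_fuzzy(J_seq, c):
    """STI_fuzzy = sum over s of J(t+s) * c**s, truncated at the horizon
    given by the length of J_seq (0 < c < 1)."""
    return sum(j * c**s for s, j in enumerate(J_seq))

# An Atom expected to earn one unit of reward per step for three steps:
value = sti_fuzzy([1.0, 1.0, 1.0], c=0.5)
```

The discounting means reward expected sooner contributes more, which is exactly the short-term bias that distinguishes STI from LTI.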
In either case, the goal of the ECAN subsystem, regarding STI, is to assign each Atom A an STI value that corresponds as closely as possible to the theoretical STI values defined by whichever of the above equations is selected (or some other similar equation).

23.2.2 STI, STIFund, and Juju

But how can one estimate these probabilities in practice? In some cases they may be estimated via explicit inference. But often they must be estimated by heuristics. The estimative approach taken in the current CogPrime design is an artificial economy, in which each Atom maintains a certain fund of artificial currency. In the current proposal this currency is called juju, and it is the same currency used to value LTI. Let us call the amount of juju owned by Atom A the STIFund of A. Then, one way to formalize the goal of the artificial economy is to state that: if one ranks all Atoms by the wealth of their STIFund, and separately ranks all Atoms by their theoretical STI value, the rankings should be as close as possible to the same. One may also formalize the goal in terms of value correlation instead of rank correlation, of course. Proving conditions under which the STIFund values will actually correlate well with the theoretical STI values is an open math problem.

Heuristically, one may map STIFund values into theoretical STI values by a mapping such as

A.STI = α + β (A.STIFund - STIFund_min) / (STIFund_max - STIFund_min)

where STIFund_min = min_X X.STIFund and STIFund_max = max_X X.STIFund. However, we don't currently have rigorous grounding for any particular functional form for such a mapping; the above is just a heuristic approximation.

The artificial economy approach leads to a variety of supporting heuristics. For instance, one such heuristic is: if A has been used at time t, then it will probably be useful at time t + s for small s.
Based on this heuristic, whenever a MindAgent uses an Atom A, it may wish to increase A's STIFund (so as to hopefully increase the correlation of A's STIFund with its theoretical STI). It does so by transferring some of its juju to A's STIFund.

23.2.3 Formalizing LTI

Similarly to STI, with LTI we will define theoretical LTI values, and posit an LTIFund associated with each Atom, which seeks to create values correlated with the theoretical LTI.

For LTI, the theoretical issues are subtler. There is a variety of different ways to precisiate the above loose conceptual definition of LTI. For instance, one can (and we will below) create formalizations of both:

1. LTI_cont(A) = (some time-weighting or normalization of) the expected value of A's total usefulness over the long-term future
2. LTI_burst(A) = the probability that A ever becomes highly useful at some point in the long-term future
(here "cont" stands for "continuous"). Each of these may be formalized, in similar but nonidentical ways. These two forms of LTI may be viewed as extremes along a continuum; one could posit a host of intermediary LTI values between them. For instance, one could define

LTI_p(A) = the p'th power average¹ of the expectation of the utility of A over brief time intervals, measured over the long-term future

Then we would have LTI_burst = LTI_∞ and LTI_cont = LTI_1, and could vary p to vary the sharpness of the LTI computation. This might be useful in some contexts, but our guess is that it's overkill in practice and that looking at LTI_burst and LTI_cont is enough (or more than enough; the current OCP code uses only one LTI value, and that has not been problematic so far).

23.2.4 Applications of LTI_burst versus LTI_cont

It seems that the two forms of LTI discussed above might be of interest in different contexts, depending on the different ways that Atoms may be used so as to achieve reward. If an Atom is expected to get rewarded for the results of its being selected by MindAgents that carry out diffuse, background thinking (and hence often select low-STI Atoms from the AtomTable), then it may be best associated with LTI_cont. On the other hand, if an Atom is expected to get rewarded for the results of its being selected by MindAgents that are focused on intensive foreground thinking (and hence generally only select Atoms with very high STI), it may be best associated with LTI_burst. In principle, Atoms could be associated with particular LTI_p based on the particulars of the selection mechanisms of the MindAgents expected to lead to their reward. But the issue with this is that it would result in Atoms carrying around an excessive abundance of different LTI_p values for various p, resulting in memory bloat; and it would also require complicated analyses of MindAgent dynamics.
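The claim that the two forms of LTI sit at the extremes of the LTI_p continuum follows from the behavior of the power average: p = 1 gives the plain mean (the "continuous" reading), and large p approaches the maximum (the "burst" reading). A quick numerical check, with an arbitrary utility sequence:

```python
def power_mean(xs, p):
    """The p'th power average: ((1/n) * sum x_i**p) ** (1/p)."""
    return (sum(x**p for x in xs) / len(xs)) ** (1.0 / p)

xs = [1.0, 2.0, 4.0]       # hypothetical per-interval utilities
mean_p1 = power_mean(xs, 1)   # ordinary mean: the "continuous" extreme
mean_big = power_mean(xs, 64) # approaches max(xs): the "burst" extreme
```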
If we do need more than one LTI value, one would hope that two will be enough, for memory conservation reasons. And of course, if an Atom has only one LTI value associated with it, this can reasonably be taken to stand in for the other one: either of LTI_burst or LTI_cont may, in the absence of information to the contrary, be taken as an estimate of the other.

23.2.4.1 LTI with Various Time Lags

The issue of the p value in the average in the definition of LTI is somewhat similar to (though orthogonal to) the point that there are many different interpretations of LTI, achieved via

¹ The p'th power average of x_1, ..., x_n is defined as ( (1/n) Σ_i x_i^p )^(1/p).
considering various time-lags. Our guess is that a small set of time-lags will be sufficient. Perhaps one wants an exponentially increasing series of time-lags: i.e., to calculate LTI over k cycles where k is drawn from {τ, 2τ, 4τ, 8τ, ...}.

The time-lag in LTI seems related to the time-lag in the system's goals. If a Goal object is disseminating juju, and the Goal has an intrinsic time scale of t, then it may be interested in LTI on time-scale t. So when a MA (MindAgent) is acting in pursuit of that goal, it should spend a bunch of its juju on LTI on time-scale t. Complex goals may be interested in multiple time-scales (for instance, a goal might place greater value on things that occur in the next hour, but still have nonzero interest in things that occur in a week), and hence may have different levels of interest in LTI on multiple time-scales.

23.2.4.2 Formalizing Burst LTI

Regarding burst LTI, two approaches to formalization seem to be the threshold version

LTI_burst,thresh(A) = P(A will receive a total of at least s_threshold amount of normalized stimulus during some time interval of length t_short in the next t_long time steps)

and the fuzzy version,

LTI_burst,fuzzy(A, t) = Σ_{s=0}^{∞} J(A, t + s, t + s + t_short) f(s, t_long)

where f(t, t_long) : R+ × R+ → R+ is a nonincreasing function that remains roughly constant in t up till a point t_long steps in the future, and then begins slowly decaying.

23.2.4.3 Formalizing Continuous LTI

The threshold version of continuous LTI is quite simply:

LTI_cont,thresh(A, t) = P( J(A, t, t + t_long) > s_threshold )

That is, smooth threshold LTI is just like smooth threshold STI, but the time-scale involved is longer. On the other hand, the fuzzy version of smooth LTI is:

LTI_cont,fuzzy(A, t) = Σ_{s=0}^{∞} J(A, t + s) f(s, t_long)

using the same decay function f that was introduced above in the context of burst LTI.

23.3 Defining Burst LTI in Terms of STI

It is straightforward to define burst LTI in terms of STI, rather than directly in terms of juju.
We have
LTI_burst,thresh(A, t) = P( ∃ s with 0 ≤ s ≤ t_long such that J(A, t + s, t + s + t_short) > s_threshold )

Or, using the fuzzy definitions, we obtain instead the approximate equation

LTI_burst,fuzzy(A, t) ≈ Σ_{s=0}^{∞} α(s) STI_fuzzy(A, t + s) f(s, t_long)

where

α(s) = (1 - c) / (1 - c^{s+1})

or the more complex exact equation:

LTI_burst,fuzzy(A, t) = Σ_{s=0}^{∞} STI_fuzzy(A, t + s) ( f(s, t_long) - Σ_{r=1}^{s} c^r f(s - r, t_long) )

23.4 Valuing LTI and STI in terms of a Single Currency

We now further discuss the approach of defining LTIFund and STIFund in terms of a single currency: juju (which, as noted, corresponds in the current ECAN design to normalized stimulus). In essence, we can think of STIFund and LTIFund as different forms of financial instrument, which are both grounded in juju. Each Atom has two financial instruments attached to it: "STIFund of Atom A" and "LTIFund of Atom A" (or more, if multiple versions of LTI are used). These financial instruments have the peculiarity that, although many agents can put juju into any one of them, no record is kept of who put juju in which one. Rather, the MAs are acting so as to satisfy the system's Goals, and are adjusting the STIFund and LTIFund values in a heuristic manner that is expected to approximately maximize the total utility propagated from Goals to Atoms. Finally, each of these financial instruments has a value that gets updated by a specific update equation.

To understand the logic of this situation better, consider the point of view of a Goal with a certain amount of resources (juju, to be used as reward), and a certain time-scale on which its satisfaction is to be measured. Suppose that the goal has a certain amount of juju to expend on getting itself satisfied. This Goal clearly should allocate some of its juju toward getting processor time allocated toward the right Atoms to serve its ends in the near future, and some of its juju toward ensuring that, in the future, the memory will contain the Atoms it will want to see processor time allocated to.
Thus, it should allocate some of its juju toward boosting the STIFunds of Atoms that it thinks will (if chosen by appropriate MindAgents) serve its needs in the near future, and some of its juju toward boosting the LTIFunds of Atoms that it thinks will serve its needs further in the future (if they remain in RAM). Thus, when a Goal invokes a MindAgent (giving the MindAgent the juju it needs to access Atoms and carry out its work), it should tell this MindAgent to put some of its juju into LTIFunds and some into STIFunds.
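A Goal's split of its juju budget between the two kinds of fund can be sketched as follows. The even division among target Atoms and the single `sti_fraction` parameter are simplifying assumptions standing in for the Goal's actual weighting of near versus far future:

```python
def allocate_goal_juju(budget, sti_fraction, sti_targets, lti_targets):
    """Split a Goal's juju: sti_fraction of the budget goes to the STIFunds
    of Atoms useful soon, the rest to LTIFunds of Atoms worth keeping."""
    sti_pool = budget * sti_fraction
    lti_pool = budget - sti_pool
    sti_grants = {a: sti_pool / len(sti_targets) for a in sti_targets}
    lti_grants = {a: lti_pool / len(lti_targets) for a in lti_targets}
    return sti_grants, lti_grants

# A Goal heavily weighted toward the near term:
sti_g, lti_g = allocate_goal_juju(10.0, 0.8, ["blocks"], ["biology", "physics"])
```

Note that, as with spreading, the allocation conserves juju: everything the Goal spends ends up in some fund.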
If a MindAgent receives a certain amount of juju each cycle, independently of what the system Goals are explicitly telling it, then this should be viewed as reflecting an implicit goal of "ambient cognition", and the balance of STI and LTI associated with this implicit goal must be a system parameter.

In general, the trade-off between STI and LTI boils down to the weighting between near and far future that is intrinsic to a particular Goal. Simplistically: if a Goal values getting processor allocated to the right stuff immediately 25 times more than getting processor allocated to the right stuff 20K cycles in the future, then it should be willing to spend 25 times more of its juju on STI than on LTI_{20K cycles}. (This simplistic picture is complicated a little by the relationship between different time-scales. For instance, boosting LTI_{10K cycles}(A) will have an indirect effect of increasing the odds that A will still be in memory 20K cycles in the future.) However, this isn't the whole story, because multiple Goals are setting the importance values of the same set of Atoms. If M1 pumps all its juju into STI for certain Atoms, then M2 may decide it's not worthwhile for it to bother competing with M1 in the STI domain, and to spend its juju on LTI instead.

Note that the current system doesn't allow a MA to change its mind about LTI allocations. One can envision a system where a MindAgent could, in January, pay juju to have Atom A kept around for a year, but then change its mind 6 months later, in June, and ask for some of the money back. But this would require an expensive accounting procedure, keeping track of how much of each Atom's LTI had been purchased by which MindAgent; so it seems a poor approach. A more interesting alternative would be to allow MAs to retain adjustable "reserve funds" of juju. This would mean that a MindAgent would never see a purpose to setting LTI on a long time-scale, instead of repeatedly setting LTI on a shorter time-scale, unless a substantial transaction cost were incurred with each transaction of adjusting an Atom's LTI. Introducing a transaction cost, plus an adjustable per-MindAgent juju reserve fund, and LTIs on multiple time scales, would give the LTI framework considerable flexibility. (To prevent MAs from hoarding their juju, one could place a tax rate on reserve juju.)

The conversion rate between STI and LTI becomes an interesting matter, though it seems not a critical one, since in the practical dynamics of the system it's juju that is used to produce STI and LTI. In the current design there is no apparent reason to spread STI of one Atom to LTI of another Atom, or convert the STI of an Atom into LTI of that same Atom, etc., but such an application might come up. (For the rest of this paragraph, let's just consider LTI with one time scale, for simplicity.) Each Goal will have its own preferred conversion rate between STI and LTI, based on its own balancing of different time scales. But each Goal will also have a limited amount of juju; hence one can only trade a certain amount of STI for LTI, if one is trading with a specific goal G. One could envision a centralized STI-for-LTI market where different MAs would trade with each other, but this seems overcomplicated, at least at the present stage.

As a simpler software design point, this all suggests a value in associating each Goal with a parameter telling how much of its juju it wants to spend on STI versus LTI; or, more subtly, how much of its juju it wants to spend on LTI on various time-scales. On the other hand, in a simple ECAN implementation this balance may be assumed constant across all Goals.
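The reserve-fund-with-tax idea from the preceding paragraphs can be sketched as follows; the flat tax rate and the recycling of collected juju are our own assumptions for illustration, not a settled part of the design:

```python
def apply_reserve_tax(reserves, tax_rate):
    """Tax each MindAgent's reserve juju (to discourage hoarding); returns
    the post-tax reserves and the total collected, which could, e.g., be
    recycled back to the system's Goals."""
    taxed = {ma: r * (1.0 - tax_rate) for ma, r in reserves.items()}
    collected = sum(reserves.values()) - sum(taxed.values())
    return taxed, collected

# A hoarding agent and a spendthrift agent, taxed at 10% per cycle:
taxed, collected = apply_reserve_tax({"MA1": 10.0, "MA2": 0.0}, tax_rate=0.1)
```

Because the tax is conserved (collected rather than destroyed), total juju in circulation is unchanged, matching the currency-conservation principle of the rest of the design.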
23.5 Economic Attention Networks

Economic Attention Networks (ECANs) are dynamical systems based on the propagation of STI and LTI values. They are similar in many respects to Hopfield nets, but are based on a different conceptual foundation involving the propagation of amounts of (conserved) currency rather than neural-net activation. Further, ECANs are specifically designed for integration with a diverse body of cognitive processes, as embodied in integrative AI designs such as CogPrime. A key aspect of the CogPrime design is the imposition of ECAN structure on the CogPrime AtomSpace. Specifically, ECANs have been designed to serve two main purposes within CogPrime: to serve as an associative memory for the network, and to facilitate effective allocation of the attention of other cognitive processes to appropriate knowledge items.

An ECAN is simply a graph, consisting of un-typed nodes and links, and also "Hebbian" links that may have types such as HebbianLink, InverseHebbianLink, or SymmetricHebbianLink. Each node and link in an ECAN is weighted with two currency values, called STI (short-term importance) and LTI (long-term importance); and each Hebbian link is weighted with a probabilistic truth value. The equations of an ECAN explain how the STI, LTI and Hebbian link weight values get updated over time. As alluded to above, the metaphor underlying these equations is the interpretation of STI and LTI values as (separate) artificial currencies. The fact that STI and LTI are currencies means that, except in unusual instances where the ECAN controller decides to introduce inflation or deflation and explicitly manipulate the amount of currency in circulation, the total amounts of STI and LTI in the system are conserved. This fact makes the dynamics of an ECAN dramatically different from those of an attractor neural network.
In addition to STI and LTI as defined above, the ECAN equations also contain the notion of an Attentional Focus (AF), consisting of those Atoms in the ECAN with the highest STI values (demarcated by a threshold STI value in the above equations). These Atoms play a privileged role in the system and, as such, are treated using an alternate set of equations.

23.5.1 Semantics of Hebbian Links

Conceptually, the probability value of a HebbianLink from A to B is the odds that if A is in the AF, so is B; and correspondingly, the InverseHebbianLink from A to B is weighted with the odds that if A is in the AF, then B is not. An ECAN will often be coupled with a "Forgetting" process that removes low-LTI Atoms from memory according to certain heuristics. A critical aspect of the ECAN equations is that Atoms periodically spread their STI and LTI to other Atoms that connect to them via HebbianLinks and InverseHebbianLinks; this is the ECAN analogue of activation spreading in neural networks.

Multiple varieties of HebbianLink may be constructed, for instance:

• Asymmetric HebbianLinks, whose semantics are as mentioned above: the truth value of HebbianLink A B denotes the probability that if A is in the AF, so is B
• Symmetric HebbianLinks, whose semantics are that the truth value of SymmetricHebbianLink A B denotes the probability that if one of A or B is in the AF, both are
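These AF-based semantics can be made concrete with a small frequency-count sketch (illustrative function names, not CogPrime code), treating the AF history as a list of snapshot sets of Atoms:

```python
# Sketch: estimating Hebbian truth values from Attentional Focus (AF)
# history. The TV of HebbianLink A->B is the probability that if A is
# in the AF, B is too; InverseHebbianLink A->B is the probability that
# if A is in the AF, B is not. Names are illustrative, not CogPrime API.

def hebbian_tv(af_history, a, b):
    """P(b in AF | a in AF), estimated by counting over AF snapshots."""
    with_a = [af for af in af_history if a in af]
    if not with_a:
        return 0.0
    return sum(1 for af in with_a if b in af) / len(with_a)

def inverse_hebbian_tv(af_history, a, b):
    return 1.0 - hebbian_tv(af_history, a, b)

def symmetric_hebbian_tv(af_history, a, b):
    """P(both in AF | at least one of a, b in AF)."""
    either = [af for af in af_history if a in af or b in af]
    if not either:
        return 0.0
    return sum(1 for af in either if a in af and b in af) / len(either)

history = [{"A", "B"}, {"A"}, {"A", "B", "C"}, {"C"}]
assert abs(hebbian_tv(history, "A", "B") - 2 / 3) < 1e-9
assert abs(inverse_hebbian_tv(history, "A", "B") - 1 / 3) < 1e-9
assert abs(symmetric_hebbian_tv(history, "A", "B") - 2 / 3) < 1e-9
```

In the actual system these probabilities are not recomputed from scratch like this; they are maintained incrementally by the conjunction equations given below.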
It is also worth noting that one can combine ContextLinks with HebbianLinks, to express contextual associations such as: in context C, there is a strong HebbianLink between A and B.

23.5.2 Explicit and Implicit Hebbian Relations

In addition to explicit HebbianLinks, it can be useful to treat other links implicitly as HebbianLinks. For instance, if ConceptNodes A and B are found to connote similar concepts, and a SimilarityLink is formed between them, then this gives reason to believe that maybe a SymmetricHebbianLink between A and B should exist as well. One could incorporate this insight in CogPrime in at least three ways:

• creating HebbianLinks paralleling other links (such as SimilarityLinks)
• adding "Hebbian weights" to other links (such as SimilarityLinks)
• implicitly interpreting other links (such as SimilarityLinks) as HebbianLinks

Further, these strategies may potentially be used together.

There are some obvious semantic relationships to be used in interpreting other link types implicitly as HebbianLinks: for instance, Similarity maps into SymmetricHebbian, and Inheritance A B maps into Hebbian A B. One may express these as inference rules, e.g.

SimilarityLink A B <tv1>
|-
SymmetricHebbianLink A B <tv2>

where tv2.s = tv1.s. Clearly, tv2.c < tv1.c; but the precise magnitude of tv2.c must be determined by a heuristic formula. One option is to set tv2.c = c * tv1.c, where the constant c is set empirically via data mining the System Activity Tables to be described below.

23.6 Dynamics of STI and LTI Propagation

We now get more specific about how some of these ideas are implemented in the currently implemented ECAN subsystem of CogPrime. We'll discuss mostly STI here, because in the current design LTI works basically the same way.
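The Similarity-to-Hebbian rule can be sketched as follows (the tuple representation of truth values and the default value of the constant c are illustrative assumptions; in practice c would be mined from the System Activity Tables):

```python
# Sketch of the rule mapping a SimilarityLink truth value into an
# implicit SymmetricHebbianLink truth value: strength carries over
# unchanged, confidence is discounted by an empirically tuned constant c.
# Truth values are represented here as (strength, confidence) pairs.

def similarity_to_hebbian(tv1, c=0.5):
    """Derive the Hebbian truth value tv2 from the Similarity tv1."""
    s1, conf1 = tv1
    return (s1, c * conf1)   # tv2.s = tv1.s ; tv2.c = c * tv1.c < tv1.c

tv2 = similarity_to_hebbian((0.8, 0.9), c=0.5)
assert tv2 == (0.8, 0.45)
```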
MindAgents send out stimulus to Atoms whenever they use them (or else, sometimes, just for the purpose of increasing an Atom's STI). Before these stimulus values are used to update the STI levels of the receiving Atoms, each one is normalized: it is divided by the total amount of stimulus sent out by the MindAgent in that cycle, and multiplied by the total amount of STI currency that the MindAgent decided to spend in that cycle. The normalized stimulus is what has above been called juju. This normalization preserves fairness among MAs, and conservation of currency. (The reason "stimuli" exist, separately from STI, is that stimulus-sending needs to be very computationally cheap: in general it's done frequently by each MA each cycle, and we don't want each action a MA takes to invoke some costly importance-updating computation.) Then, Atoms exchange STI according to certain equations (related to HebbianLinks and other links), and have their STI values updated according to certain equations (which involve, among other operations, transferring STI to the "central bank").
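The normalization step can be sketched as follows (hypothetical function name; the reading of "normalized" as divide-by-total-then-scale-by-budget follows the description above):

```python
# Sketch of per-MindAgent stimulus normalization ("juju"): each raw
# stimulus is divided by the total stimulus that MindAgent emitted this
# cycle, then scaled by the STI currency the MindAgent chose to spend.
# This keeps MindAgents with different stimulus habits on an equal
# footing, and conserves currency. Names are illustrative.

def normalize_stimuli(raw_stimuli, sti_to_spend):
    """raw_stimuli: dict atom -> raw stimulus sent this cycle.
    Returns dict atom -> juju (normalized stimulus)."""
    total = sum(raw_stimuli.values())
    if total == 0:
        return {atom: 0.0 for atom in raw_stimuli}
    return {atom: (s / total) * sti_to_spend
            for atom, s in raw_stimuli.items()}

juju = normalize_stimuli({"A": 3.0, "B": 1.0}, sti_to_spend=8.0)
assert juju == {"A": 6.0, "B": 2.0}
assert abs(sum(juju.values()) - 8.0) < 1e-9  # spends exactly the budget
```

Note that however much raw stimulus a MindAgent emits, the juju it distributes always sums to exactly the STI budget it chose, which is what makes the scheme both cheap and fair.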
23.6.1 ECAN Update Equations

The CogServer is understood to maintain a kind of central bank of STI and LTI funds. When a non-ECAN MindAgent finds an Atom valuable, it sends that Atom a certain amount of Stimulus, which results in that Atom's STI and LTI values being increased (via equations to be presented below, that transfer STI and LTI funds from the CogServer to the Atoms in question). Then, the ECAN ImportanceUpdating MindAgent carries out multiple operations, including some that transfer STI and LTI funds from some Atoms back to the CogServer. There are multiple ways to embody this process equationally; here we briefly describe two variants.

23.6.1.1 Definition and Analysis of Variant 1

We now define a specific set of equations in accordance with the ECAN conceptual framework described above. We define s = [s_1, ..., s_n] to be the vector of STI values, and

C = [c_11 ... c_1n ; ... ; c_n1 ... c_nn]

to be the connection matrix of Hebbian probability values, where it is assumed that the existence of a HebbianLink and of an InverseHebbianLink between A and B are mutually exclusive possibilities. We also define

C_LTI = [g_11 ... g_1n ; ... ; g_n1 ... g_nn]

to be the matrix of LTI values for each of the corresponding links.

We assume an updating scheme in which, periodically, a number of Atoms are allocated Stimulus amounts, which causes the corresponding STI values to change according to the equations

∀i: s_i := s_i − rent + wages,

where rent and wages are given by

rent = ⟨Rent⟩ · max(0, log(20 s_i / recentMaxSTI)), if s_i > e
rent = 0, if s_i ≤ e

and

wages = ⟨Wage⟩⟨Stimulus⟩ / Σ_{i=1}^n p_i, if p_i = 1
wages = ⟨Wage⟩⟨Stimulus⟩ / (n − Σ_{i=1}^n p_i), if p_i = 0

where P = [p_1, ..., p_n], with p_i ∈ {0,1}, is the cue pattern for the pattern that is to be retrieved. All quantities enclosed in angle brackets are system parameters, and LTI updating is accomplished using a completely analogous set of equations.
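A minimal sketch of this rent-and-wage update, under the above reading of the equations (illustrative code, not the production implementation; parameter names follow the angle-bracketed system parameters):

```python
import math

# Sketch of the Variant-1 rent-and-wage STI update. Rent grows
# logarithmically with STI above the threshold e; wages are shared
# equally among cue nodes (p_i = 1) and, separately, among non-cue
# nodes (p_i = 0). Illustrative code only.

def rent(s_i, recent_max_sti, Rent, e):
    if s_i <= e:
        return 0.0
    return Rent * max(0.0, math.log(20.0 * s_i / recent_max_sti))

def wage(p_i, P, Wage, Stimulus):
    n, ones = len(P), sum(P)
    if p_i == 1:
        return Wage * Stimulus / ones
    return Wage * Stimulus / (n - ones)

def update_sti(s, P, recent_max_sti, Rent=1.0, Wage=1.0, Stimulus=1.0, e=0.0):
    """One rent-and-wage pass: for all i, s_i := s_i - rent + wages."""
    return [s_i - rent(s_i, recent_max_sti, Rent, e)
            + wage(p_i, P, Wage, Stimulus)
            for s_i, p_i in zip(s, P)]

s = update_sti([2.0, 0.05, 0.0], P=[1, 1, 0], recent_max_sti=2.0)
# the low-STI cue node pays no rent: log(20 * 0.05 / 2) = log(0.5) < 0
assert abs(s[1] - 0.55) < 1e-9   # 0.05 + wage 1/2, shared by two cue nodes
assert abs(s[2] - 1.0) < 1e-9    # non-cue wage: 1 / (3 - 2)
```

The key property visible here is that high-STI nodes pay more rent while cue nodes collect more stimulus-driven wages, which is what drives retrieval dynamics in the associative-memory experiments below.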
The changing STI values then cause updating of the connection matrix, according to the "conjunction" equations. First define

norm_i = s_i / recentMaxSTI, if s_i ≥ 0
norm_i = s_i / recentMinSTI, if s_i < 0.

Next define

conj = Conjunction(s_i, s_j) = norm_i × norm_j

and

c'_ij = ⟨ConjDecay⟩ conj + (1 − conj) c_ij.

Finally, update the matrix elements by setting c_ij := c'_ij, where positive entries correspond to HebbianLinks and negative entries to InverseHebbianLinks. We are currently also experimenting with updating the connection matrix in accordance with the equations suggested by Storkey (1997, 1998, 1999).

A key property of these equations is that both wages paid to, and rent paid by, each node are positively correlated with its STI values. That is, the more important nodes are paid more for their services, but they also pay more in rent. A fixed percentage of the links with the lowest LTI values is then forgotten (which corresponds equationally to setting the LTI to 0).

Separately from the above, the process of Hebbian probability updating is carried out via a diffusion process in which some nodes "trade" STI utilizing a diffusion matrix D, a version of the connection matrix C normalized so that D is a left stochastic matrix. D acts on a similarly scaled vector v, normalized so that v is equivalent to a probability vector of STI values.

The decision about which nodes diffuse in each diffusion cycle is carried out via a decision function. We currently are working with two types of decision functions: a standard threshold function, by which nodes diffuse if and only if they are in the AF; and a stochastic decision function in which nodes diffuse with probability

(tanh(shape(s_i − FocusBoundary)) + 1) / 2,

where shape and FocusBoundary are parameters.

The details of the diffusion process are as follows. First, construct the diffusion matrix from the entries in the connection matrix: if c_ij ≥ 0, then set d_ij = c_ij; else, set d_ji = −c_ij. Next, we normalize the columns of D to make D a left stochastic matrix.
In so doing, we ensure that each node spreads no more than a ⟨MaxSpread⟩ proportion of its STI, by setting:

if Σ_{i≠j} d_ij > ⟨MaxSpread⟩, then:
    d_ij := d_ij × ⟨MaxSpread⟩ / Σ_{i≠j} d_ij, for i ≠ j
    d_jj := 1 − ⟨MaxSpread⟩
else:
    d_jj := 1 − Σ_{i≠j} d_ij.

Now we obtain a scaled STI vector v by setting minSTI = min_i s_i and maxSTI = max_i s_i, and

v_i = (s_i − minSTI) / (maxSTI − minSTI).

The diffusion matrix is then used to update the node STIs,

v' = Dv,

and the STI values are rescaled to the interval [minSTI, maxSTI].

In both the rent-and-wage stage and in the diffusion stage, the total STI and LTI funds of the system each separately form a conserved quantity: in the case of diffusion, the vector v is simply the total STI times a probability vector. To maintain overall system funds within homeostatic bounds, a mid-cycle tax and rent adjustment can be triggered if necessary; the equations currently used for this are:

1. ⟨Rent⟩ is rescaled in proportion to ⟨Wage⟩ and the recent stimulus awarded before the AF update;
2. tax = x/n, where x is the distance from the current AtomSpace funds to the center of the homeostatic range for AtomSpace funds;
3. ∀i: s_i := s_i − tax.

23.6.1.2 Investigation of Convergence Properties of Variant 1

Now we investigate some of the properties that the above ECAN equations display when we use an ECAN defined by them as an associative memory network, in the manner of a Hopfield network. We consider a situation where the ECAN is supplied with memories via a "training" phase in which one imprints it with a series of binary patterns of the form P = [p_1, ..., p_n], with p_i ∈ {0,1}. Noisy versions of these patterns are then used as cue patterns during the retrieval process. We obviously desire that the ECAN retrieve the stored pattern corresponding to a given cue pattern. In order to achieve this goal, the ECAN must converge to the correct fixed point.

Theorem 23.1. For a given value of e in the STI rent calculation, there is a subset of hyperbolic decision functions for which the ECAN dynamics converge to an attracting fixed point.

Proof. Rent is zero whenever e ≤ s_i ≤ recentMaxSTI/20, so we consider this case first. The updating process for the rent and wage stage can then be written as f(s) = s + constant.
The next stage is governed by the hyperbolic decision function
g(s) = (tanh(shape(s − FocusBoundary)) + 1) / 2.

The entire updating sequence is obtained by the composition (g ∘ f)(s), whose derivative is then

(g ∘ f)'(s) = sech²(f(s)) · (shape/2) · 1,

which has magnitude less than 1 whenever −2 < shape < 2.

We next consider the case s_i > recentMaxSTI/20. The function f now takes the form

f(s) = s − log(20s/recentMaxSTI)/2 + constant,

and we have

(g ∘ f)'(s) = sech²(f(s)) · (shape/2) · (1 − 1/(2s)),

which has magnitude less than 1 whenever |shape| < |4s/(2s − 1)|. Choosing the shape parameter to satisfy

0 < shape < min(2, |4s/(2s − 1)|)

then guarantees that |(g ∘ f)'| < 1. Finally, g ∘ f maps the closed interval [recentMinSTI, recentMaxSTI] into itself, so applying the Contraction Mapping Theorem completes the proof.

23.6.1.3 Definition and Analysis of Variant 2

The ECAN variant described above has performed completely acceptably in our experiments so far; however, we have also experimented with an alternate variant, with different convergence properties. In Variant 2, the dynamics of the ECAN are specifically designed so that a certain conceptually intuitive function serves as a Lyapunov function of the dynamics.

At a given time t, for a given Atom indexed i, we define two quantities:

OUT_i(t) = the total amount that Atom i pays in rent, tax and diffusion during the time-t iteration of ECAN;

IN_i(t) = the total amount that Atom i receives in diffusion, stimulus and welfare during the time-t iteration of ECAN.

Note that welfare is a new concept to be introduced below. We then define

DIFF_i(t) = IN_i(t) − OUT_i(t),

and define AVDIFF(t) as the average of DIFF_i(t) over all i in the ECAN. The design goal of Variant 2 of the ECAN equations is to ensure that, if the parameters are tweaked appropriately, AVDIFF can serve as a (deterministic or stochastic, depending on the details) Lyapunov function for ECAN dynamics.
This implies that, with appropriate parameters, the ECAN dynamics will converge toward a state where AVDIFF = 0, meaning that no Atom is making any profit or incurring any loss. It must be noted that this kind of convergence is not always desirable, and sometimes one might want the parameters set otherwise. But if one wants the STI components of an ECAN to converge to some specific values, as for instance in a classic associative memory application, Variant 2 can guarantee this easily.

In Variant 2, each ECAN cycle begins with rent collection and welfare distribution, which occurs via collecting rent via the Variant 1 equation, and then performing the following two steps:
• Step A: calculate X, defined as the positive part of the total amount by which AVDIFF has been increased via the overall rent collection process.
• Step B: redistribute X to needy Atoms as follows: for each Atom z, calculate the positive part of OUT − IN, defined as deficit(z). Distribute X + ε wealth among all Atoms z, giving each Atom a percentage of X that is proportional to deficit(z), but never so much as to cause OUT < IN for any Atom (the welfare being given counts toward IN). Here ε > 0 ensures AVDIFF decrease; ε = 0 may be appropriate if convergence is not required in a certain situation.

Step B is the welfare step, which guarantees that rent collection will decrease AVDIFF. Step A calculates the amount by which the rich have been made poorer, and uses this to make the poor richer. In the case that the sum of deficit(z) over all nodes z is less than X, a mid-cycle rent adjustment may be triggered, calculated so that Step B will decrease AVDIFF. (I.e., we cut rent on the rich, if the poor don't need their money to stay out of deficit.)

Similarly, in each Variant 2 ECAN cycle, there is a wage-paying process, which involves the wage-paying equation from Variant 1 followed by two steps:

• Step A: calculate Y, defined as the positive part of the total amount by which AVDIFF has been increased via the overall wage payment process.
• Step B: exert taxation based on the surplus Y as follows: for each Atom z, calculate the positive part of IN − OUT, defined as surplus(z). Collect Y + ε₁ wealth from all Atoms z, collecting from each node a percentage of Y that is proportional to surplus(z), but never so much as to cause IN < OUT for any node (the new STI being collected counts toward OUT).

In case the total of surplus(z) over all nodes z is less than Y, one may trigger a mid-cycle wage adjustment, calculated so that Step B will decrease AVDIFF. I.e., we cut wages, since there is not enough surplus to support them.
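The rent-side welfare step (Step B) can be sketched as follows (illustrative code; `out_totals` and `in_totals` stand in for the per-Atom OUT and IN tallies, and the ε term is omitted for simplicity):

```python
# Sketch of Variant 2's welfare step: redistribute the surplus X
# collected as rent to Atoms whose outgoings exceed their income,
# proportionally to their deficits, but never pushing any Atom past
# the break-even point OUT = IN. Illustrative code only.

def welfare(X, out_totals, in_totals):
    """Returns dict atom -> welfare payment."""
    deficit = {z: max(0.0, out_totals[z] - in_totals[z])
               for z in out_totals}
    total_deficit = sum(deficit.values())
    if total_deficit == 0.0:
        return {z: 0.0 for z in out_totals}   # nobody is in deficit
    # proportional share of X, capped at each atom's own deficit
    return {z: min(deficit[z], X * deficit[z] / total_deficit)
            for z in out_totals}

pay = welfare(6.0,
              out_totals={"a": 5.0, "b": 3.0, "c": 1.0},
              in_totals={"a": 1.0, "b": 1.0, "c": 2.0})
assert pay == {"a": 4.0, "b": 2.0, "c": 0.0}
```

The wage-side taxation step is the mirror image: collect from Atoms with positive surplus(z) = IN − OUT, proportionally, capped at each Atom's surplus.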
Finally, in the Variant 2 ECAN cycle, diffusion is done a little differently, via iterating the following process: if AVDIFF has increased during the diffusion round so far, then choose a random node whose diffusion would decrease AVDIFF, and let it diffuse; if AVDIFF has decreased during the diffusion round so far, then choose a random node whose diffusion would increase AVDIFF, and let it diffuse. In carrying out these steps, we avoid letting the same node diffuse twice in the same round. This algorithm does not let all Atoms diffuse in each cycle, but it stochastically lets a lot of diffusion happen in a way that maintains AVDIFF roughly constant. The iteration may be modified to bias toward an average decrease in AVDIFF.

The random element in the diffusion step, together with the logic of the rent/welfare and wage/tax steps, combines to yield the result that for Variant 2 of ECAN dynamics, AVDIFF is a stochastic Lyapunov function. The details of the proof of this will be omitted, but the outline of the argument should be clear from the construction of Variant 2. Note that by setting the ε and ε₁ parameters to 0, the convergence requirement can be eliminated, allowing the network to evolve more spontaneously, as may be appropriate in some contexts; these parameters allow one to explicitly adjust the convergence rate.

One may also derive results pertaining to the meaningfulness of the attractors, in various special cases. For instance, if we have a memory consisting of a set M of m nodes, and we imprint the memory on the ECAN by stimulating m nodes during an interval of time, then we want to be able to show that the condition where precisely those m nodes are in the AF is a fixed-point attractor. However, this is not difficult, because one must only show that if these m nodes and no others are in the AF, this condition will persist.
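The alternating diffusion-round rule described above can be sketched as follows, under the simplifying assumption that each node's effect on AVDIFF is known up front (the real system would measure it as diffusion proceeds); treating a level AVDIFF like the "decreased" case at the start of a round is our assumption, as the text leaves the starting branch unspecified:

```python
import random

# Sketch of one Variant-2 diffusion round: if AVDIFF has risen so far
# this round, pick a random node whose diffusion would lower it; if it
# has fallen (or not yet moved), pick a random node whose diffusion
# would raise it. No node diffuses twice in one round.

def diffusion_round(deltas, rng):
    """deltas: dict node -> change in AVDIFF that node's diffusion
    would cause. Returns the order in which nodes are chosen."""
    remaining = dict(deltas)
    change = 0.0          # net change in AVDIFF so far this round
    order = []
    while remaining:
        if change > 0:    # AVDIFF up: pick an AVDIFF-decreasing node
            pool = [n for n, d in remaining.items() if d < 0]
        else:             # AVDIFF down (or level): pick an increaser
            pool = [n for n, d in remaining.items() if d > 0]
        if not pool:
            break         # no node can move AVDIFF the required way
        node = rng.choice(pool)
        change += remaining.pop(node)
        order.append(node)
    return order

order = diffusion_round({"a": 1.0, "b": -1.0}, random.Random(0))
assert order == ["a", "b"]   # the riser goes first, then the decreaser
```

This makes the Lyapunov intuition visible: the chosen sequence keeps the running AVDIFF change oscillating around zero rather than drifting.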
23.6.2 ECAN as Associative Memory

We have carried out experiments gauging the performance of Variant 1 of ECAN as an associative memory, using the implementation of ECAN within CogPrime, and using both the conventional and Storkey Hebbian updating formulas. As with a Hopfield net memory, the memory capacity (defined as the number of memories that can be retrieved from the network with high accuracy) depends on the sparsity of the network, with denser networks leading to greater capacity. In the ECAN case the capacity also depends on a variety of parameters of the ECAN equations, and the precise unraveling of these dependencies is a subject of current research.

However, one interesting dependency has already been uncovered in our preliminary experimentation, which has to do with the size of the AF versus the size of the memories being stored. Define the size of a memory (a pattern being imprinted) as the number of nodes that are stimulated during imprinting of that memory. In a classical Hopfield net experiment, the mean size of a memory is usually around, say, .2-.5 of the number of neurons. In typical CogPrime associative memory situations, we believe the mean size of a memory will be one or two orders of magnitude smaller than that, so that each memory occupies only a relatively small portion of the overall network.

What we have found is that the memory capacity of an ECAN is generally comparable to that of a Hopfield net with the same number of nodes and links, if and only if the ECAN parameters are tuned so that the memories being imprinted can fit into the AF. That is, the AF threshold or (in the hyperbolic case) shape parameter must be tuned so that the size of the memories is not so large that the active nodes in a memory cannot stably fit into the AF.
This tuning may be done adaptively by testing the impact of different threshold/shape values on various memories of the appropriate size; or potentially a theoretical relationship between these quantities could be derived, but this has not been done yet. This is a reasonably satisfying result given the cognitive foundation of ECAN: in loose terms, what it means is that ECAN works best for remembering things that fit into its focus of attention during the imprinting process.

23.7 Glocal Economic Attention Networks

In order to transform ordinary ECANs into glocal ECANs, one may proceed in essentially the same manner as with glocal Hopfield nets, as discussed in Chapter 13 of Part 1. In the language normally used to describe CogPrime, this would be termed a "map encapsulation" heuristic. As with glocal Hopfield nets, one may proceed most simply via creating a fixed pool of nodes intended to provide locally-representative keys for the maps formed as attractors of the network. Links may then be formed to these key nodes, with weights and STI and LTI values adapted by the usual ECAN algorithms.
23.7.1 Experimental Explorations

To compare the performance of glocal ECANs with glocal Hopfield networks in a simple context, we ran experiments using ECAN in the manner of a Hopfield network. That is, a number of nodes take on the equivalent role of the neurons that are presented patterns to be stored. These patterns are imprinted by setting the corresponding nodes of active bits to have their STI within the AF, whereas nodes corresponding to inactive bits of the pattern are below the AF threshold. Link weight updating then occurs, using one of several update rules; in this case the update rule of [SV99] was used. Attention was spread using a diffusion approach, by representing the weights of Hebbian links between pattern nodes within a left stochastic Markov matrix, and multiplying it by the vector of normalised STI values to give a vector representing the new distribution of STI.

To explore the effects of key nodes on ECAN Hopfield networks, in [Goe08b] we used the palimpsest testing scenario of [SV99], where all the local neighbours of the imprinted pattern, within a single bit change, are tested. Each neighbouring pattern is used as input to try to retrieve the original pattern. If all the retrieved patterns are the same as the original (within a given tolerance), then the pattern is deemed successfully retrieved, and recall of the previous pattern is attempted via its neighbours. The number of patterns this can repeat for successfully is called the palimpsest storage of the network.

As an example, consider one simple experiment that was run with recollection of 10 x 10 pixel patterns (so, 100 nodes, each corresponding to a pixel in the grid), a Hebbian link density of 30%, and with 1% of links being forgotten before each pattern is imprinted.
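The palimpsest measure itself can be sketched independently of the network internals (here `retrieve` is a stand-in for the ECAN's cue-driven recall; the toy "network" below is only for checking the harness):

```python
# Sketch of the palimpsest storage measure: a pattern counts as
# retrievable if every one-bit-flipped neighbour, used as a cue,
# recalls the original pattern; storage is how many of the most
# recently imprinted patterns in a row pass this test.

def neighbours(pattern):
    """All patterns within a single bit change."""
    for i in range(len(pattern)):
        p = list(pattern)
        p[i] = 1 - p[i]
        yield tuple(p)

def palimpsest_storage(patterns, retrieve, tolerance=0):
    """patterns: imprinting order, oldest first. Counts back from the
    newest pattern until some neighbour cue fails to recall it."""
    count = 0
    for original in reversed(patterns):
        ok = all(sum(a != b for a, b in zip(retrieve(cue), original))
                 <= tolerance
                 for cue in neighbours(original))
        if not ok:
            break
        count += 1
    return count

# Toy check: a perfect "network" that always recalls the nearest pattern.
stored = [(0, 0, 1, 1), (1, 1, 0, 0)]
def nearest(cue):
    return min(stored, key=lambda p: sum(a != b for a, b in zip(p, cue)))
assert palimpsest_storage(stored, nearest) == 2
```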
The results demonstrated that, when the mean palimpsest storage is calculated for each of 0, 1, 5 and 10 key nodes, we find that the storage is 22.6, 22.4, 24.9, and 26.0 patterns respectively, indicating that key nodes do improve memory recall on average.

23.8 Long-Term Importance and Forgetting

Now we turn to the forgetting process (carried out by the Forgetting MindAgent), which is driven by LTI dynamics, but has its own properties as well. Overall, the goal of the "forgetting" process is to maximize the total utility of the Atoms in the AtomSpace throughout the future. The most basic heuristic toward this end is to remove the Atoms with the lowest LTI, but this isn't the whole story.

Clearly, the decision to remove an Atom from RAM should depend on factors beyond just the LTI of the Atom. For example, one should also take into account the expected difficulty in reconstituting the given Atom from other Atoms. Suppose the system has the relations:

"dogs are animals"
"animals are cute"
"dogs are cute"

and the strength of the third relation is not dissimilar from what would be obtained by deduction and revision from the first two relations and others in the system. Then, even if the system judges it will be very useful to know dogs are cute in the future, it may reasonably choose to remove "dogs are cute" from memory anyway, because it knows it can be so easily reconstituted,
by a few inference steps, for instance. Thus, as well as removing the lowest-LTI Atoms, the Forgetting MindAgent should also remove Atoms meeting certain other criteria, such as the combination of:

• low STI
• easy reconstitutability in terms of other Atoms that have LTI not less than its own

23.9 Attention Allocation via Data Mining on the System Activity Table

In this section we'll discuss an object called the System Activity Table, which contains a number of subtables recording various activities carried out by the various objects in the CogPrime system. These tables may be used for sophisticated attention allocation processes, according to an approach in which importance values and HebbianLink weight values are calculated via direct data mining of a centralized knowledge store (the System Activity Table). This approach provides highly accurate attention allocation, but at the cost of significant computational effort.

The System Activity Table is actually a set of tables, with multiple components. The precise definition of the tables will surely be adapted based on experience as the work with CogPrime progresses; what is described here is a reasonable first approximation.

First, there is a MindAgent Activity Table, which includes, for each MindAgent in the system, a table such as Table 23.1 (in which the time-points recorded are the last T system cycles, and the Atom-lists recorded are lists of Handles for Atoms).

System Cycle | Effort Spent | Memory Used | Atom Combo 1 Utilized | Atom Combo 2 Utilized | ...
Now          | 3.3          | 4000        | Atom21, Atom44        | Atom44, Atom47, Atom345 | ...
Now - 1      | 0.4          | 6079        | Atom123, Atom33       | Atom345               | ...
...          | ...          | ...         | ...                   | ...                   | ...

Table 23.1: Example MindAgent Table

The MindAgent's activity table records, for that MindAgent and for each system cycle, which Atom-sets were acted on by that MindAgent at that point in time.
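A minimal sketch of such a table (field names follow Table 23.1; the class itself is illustrative, not CogPrime code):

```python
# Sketch of a per-MindAgent activity table: for each system cycle,
# record effort spent, memory used, and the Atom combinations utilized.
# The query method illustrates the kind of data mining the section
# describes: pulling out all combinations involving a given Atom.

class MindAgentActivityTable:
    def __init__(self, max_cycles=1000):
        self.rows = []                 # newest first
        self.max_cycles = max_cycles   # keep only the last T cycles

    def record(self, cycle, effort, memory, atom_combos):
        self.rows.insert(0, {"cycle": cycle, "effort": effort,
                             "memory": memory, "combos": atom_combos})
        del self.rows[self.max_cycles:]

    def combos_using(self, atom):
        """All recorded Atom combinations that include the given Atom --
        raw material for mining HebbianLinks and importance values."""
        return [c for row in self.rows for c in row["combos"] if atom in c]

t = MindAgentActivityTable()
t.record(1, 0.4, 6079, [["Atom123", "Atom33"], ["Atom345"]])
t.record(2, 3.3, 4000, [["Atom21", "Atom44"],
                        ["Atom44", "Atom47", "Atom345"]])
assert t.combos_using("Atom44") == [["Atom21", "Atom44"],
                                    ["Atom44", "Atom47", "Atom345"]]
```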
Similarly, a table of this nature must be maintained for each Task-type, e.g. InferenceTask, MOSESCategorizationTask, etc. The Task tables are used to estimate Effort values for various Tasks, which are used in the procedure execution process. If it can be estimated how much spatial and temporal resources a Task is likely to use, via comparison to a record of previous similar tasks (in the Task table), then a MindAgent can decide whether it is appropriate to carry out this Task (versus some other one, or versus some simpler process not requiring a Task) at a given point in time; this process is discussed in a later chapter.

In addition to the MindAgent and Task-type tables, it is convenient if tables are maintained corresponding to various goals in the system (as shown in Table 23.2), including the Ubergoals but also potentially derived goals of high importance. For each goal, at minimum, the degree of achievement of the goal at a given time must be recorded. Optionally, at each point in time, the degree of achievement of a goal relative to some
System Cycle | Total Achievement | Achievement for Atom44 | Achievement for set {Atom44, Atom233} | ...
Now          | .8                | .4                     | .5                                    | ...
Now - 1      | .9                | .5                     | .55                                   | ...
...          | ...               | ...                    | ...                                   | ...

Table 23.2: Example Goal Table

particular Atoms may be recorded. Typically the list of Atom-specific goal-achievements will be short, and will be different for different goals and different time points. Some goals may be applied to specific Atoms or Atom sets; others may only be applied more generally. The basic idea is that attention allocation and credit assignment may be effectively carried out via data mining on these tables.

23.10 Schema Credit Assignment

How, then, do we apply a similar approach to clarifying the semantics of schema credit assignment? From the above-described System Activity Tables, one can derive information of the form

Achieve(G, E, T)
"Goal G was achieved to extent E at time T"

which may be grounded as, for example:

Similarity
    E
    ExOut
        GetTruthValue
        Evaluation
            atTime
                HypLink G

and more refined versions such as

Achieve(G, E, T, A, P)
"Goal G was achieved to extent E using Atoms A (with parameters P) at time T"

Enact(S, I, $T_1, O, $T_2)
"Schema S was enacted on inputs I at time $T_1, producing outputs O at time $T_2"

The problem of schema credit assignment is then, in essence: given a goal G and a distribution of times D, figure out what schema to enact in order to cause G's achievement at some time in the future, where the desirability of times is weighted by D.

The basic method used is the learning of predicates of the form

ImplicationLink F(C, P1, ..., Pn) G

where
• the Pi are Enact() statements, in which the times T1 and T2 are variable, and the S, I and O may be concrete or variable
• C is a predicate representing a context
• G is an Achieve() statement, whose arguments may be concrete or abstract
• F is a Boolean function

Typically, the variable expressions in the T1 and T2 positions will be of the form T + offset, where offset is a constant value and T is a time value representing the time of inception of the whole compound schema. T may then be defined as T0 − offset_g, where offset_g is a constant value and T0 is a variable denoting the time of achievement of the goal.

In CogPrime, these predicates may be learned by a combination of statistical pattern mining, PLN inference and MOSES or hill-climbing procedure learning.

The choice of what action to take at a given point in time is then a probabilistic decision. Based on the time-distribution D given, the system will know a certain number of compound expressions CS = F(C, P1, ..., Pn) of the type described above. Each of these will be involved in an ImplicationLink with a certain estimated strength. It may select the "compound schema" CS with the highest strength.

One might think to introduce other criteria here, e.g. to choose the schema with the highest strength but the lowest cost of execution. However, it seems better to include all pertinent criteria in the goal, so that if one wants to consider cost of execution, one assumes the existence of a goal that incorporates cost of execution (which may be measured in multiple ways, of course) as part of its internal evaluation function.

Another issue that arises is whether to execute multiple CS simultaneously. In many cases this won't be possible, because two different CS's will contradict each other. It seems simplest to assume that CS's that can be fused together into a single plan of action are presented to the schema execution process as a single fused CS.
In other words, the fusion is done during the schema learning process rather than the execution process.

A question emerges regarding how this process deals with false causality, e.g. with a schema that, due to the existence of a common cause, often happens to occur immediately prior to the occurrence of a given goal. For instance, roosters' crowing often occurs prior to the sun rising. This matter is discussed in more depth in the PLN book and The Hidden Pattern; but in brief, the answer is: in the current approach, if roosters' crowing has often been observed to precede the sunrise, then if the system wants to cause the sun to rise, it may well cause a rooster to crow. Once this fails, the system will no longer hold the false belief, and afterwards will choose a different course of action. Furthermore, if it holds background knowledge indicating that roosters' crowing is not likely to cause the sun to rise, then this background knowledge will be invoked by inference to discount the strength of the ImplicationLink pointing from rooster-crowing to sun-rising, so that the link will never be strong enough to guide schema execution in the first place.

The problem of credit assignment thus becomes a problem of creating appropriate heuristics to guide inference of ImplicationLinks of the form described above. Assignment of credit is then implicit in the calculation of truth values for these links. The difficulty is that the predicates F involved may be large and complex.
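The strength-based schema choice described in this section can be sketched as follows (illustrative code; the candidate schema names echo the rooster example, and cost of execution is assumed to be already folded into the goal's evaluation, as the text recommends):

```python
# Sketch of the schema-choice step: among the known compound schemata,
# each paired with the estimated strength of its ImplicationLink toward
# the goal, pick the strongest. Illustrative, not CogPrime code.

def choose_schema(candidates):
    """candidates: list of (compound_schema, strength) pairs.
    Returns the schema of the highest-strength ImplicationLink."""
    schema, strength = max(candidates, key=lambda cs: cs[1])
    return schema

best = choose_schema([("crow_like_rooster", 0.2),
                      ("wait_for_dawn", 0.9),
                      ("feed_rooster", 0.1)])
assert best == "wait_for_dawn"
```

A failed rooster-crowing experiment would lower the corresponding link strength, so subsequent calls to this selection step would pick differently; that is the sense in which credit assignment is implicit in the link truth values.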
23.11 Interaction between ECANs and other CogPrime Components

We have described above a number of interactions between attention allocation and other aspects of CogPrime; in this section we gather a few comments on these interactions, and some additional ones.

23.11.1 Use of PLN and Procedure Learning to Help ECAN

MOSES or hillclimbing may be used to help mine the SystemActivityTable for patterns of usefulness, and create HebbianLinks reflecting these patterns. PLN inference may be carried out on HebbianLinks, by treating (HebbianLink A B) as a virtual predicate evaluation relationship, i.e. as

EvaluationLink HebbianPredicate (A, B)

PLN inference on HebbianLinks may then be used to update node importance values, because node importance values are essentially node probabilities corresponding to HebbianLinks. And similarly, MindAgent-relative node importance values are node probabilities corresponding to MindAgent-relative HebbianLinks.

Note that, conceptually, the nature of this application of PLN is different from most other uses of PLN in CogPrime. Here, the purpose of PLN is not to draw conclusions about the outside world, but rather about what the system should focus its resources on, in what context. PLN, used in this context, effectively constitutes a nonlinear-dynamical iteration governing the flow of attention through the CogPrime system. Finally, inference on HebbianLinks leads to the emergence of maps, via the recognition of clusters in the graph of HebbianLinks.

23.11.2 Use of ECAN to Help Other Cognitive Processes

First of all, associative-memory functionality is directly important in CogPrime, because it is used to drive concept creation.
The CogPrime heuristic called "map formation" creates new Nodes corresponding to prominent attractors in the ECAN, a step that (according to our preliminary results) not only increases the memory capacity of the network beyond what can be achieved with a pure ECAN but also enables attractors to be explicitly manipulated by PLN inference. Equally important to associative memory is the capability of ECANs to facilitate effective allocation of the attention of other cognitive processes to appropriate knowledge items (Atoms). For example, one key role of ECANs in CogPrime is to guide the forward and backward chaining processes of PLN (Probabilistic Logic Networks) inference. At each step, the PLN inference chainer is faced with a great number of inference steps (branches) from which to choose; and a choice is made using a statistical "bandit problem" mechanism that selects each possible inference step with a probability proportional to its expected "desirability." In this context, there is considerable appeal in the heuristic of weighting inference steps using probabilities
proportional to the STI values of the Atoms they contain. One thus arrives at a combined PLN/ECAN dynamic as follows:

1. An inference step is carried out, involving a choice among multiple possible inference steps, which is made using STI-based weightings (and made among Atoms that LTI weightings have deemed valuable enough to remain in RAM).
2. The Atoms involved in the inference step are rewarded with STI and LTI proportionally to the utility of the inference step (how much it increases the confidence of Atoms in the system's memory).
3. The ECAN operates, and multiple Atoms' importance values are updated.
4. Return to Step 1 if the inference isn't finished.

An analogous interplay may occur between ECANs and MOSES. It seems intuitively clear that the same attractor-convergence properties highlighted in the above analysis of associative-memory behavior will also be highly valuable for the application of ECANs to attention allocation. If a collection of Atoms is often collectively useful for some cognitive process (such as PLN), then the associative-memory-type behavior of ECANs means that once a handful of the Atoms in the collection are found useful in a certain inference process, the other Atoms in the collection will get their STI significantly boosted, and will be likely to get chosen in subsequent portions of that same inference process. This is exactly the sort of dynamics one would like to see occur. Systematic experimentation with these interactions between ECAN and other CogPrime processes is one of our research priorities going forward.

23.12 MindAgent Importance and Scheduling

So far we have discussed economic transactions between Atoms and Atoms, and between Atoms and Units. MindAgents have played an indirect role, via spreading stimulation to Atoms, which causes them to get paid wages by the Unit. Now it is time to discuss the explicit role of MindAgents in economic transactions.
This has to do with the integration of economic attention allocation with the Scheduler that schedules the core MindAgents involved in the basic cognitive cycle. This integration may be done in many ways, but one simple approach is:

1. When a MindAgent utilizes an Atom, this results in sending stimulus to that Atom. (Note that we don't want to make MindAgents pay for using Atoms individually; that would penalize MindAgents that use more Atoms, which doesn't really make much sense.)
2. MindAgents then get currency from the Lobe (as defined in Chapter 19) periodically, and get extra currency based on usefulness for goal achievement as determined by the credit assignment process. The Scheduler then gives more processor time to MindAgents with more STI.
3. However, any MindAgent with LTI above a certain minimum threshold will get some minimum amount of processor time (i.e. get scheduled at least once each N cycles).

As a final note: in a multi-Lobe Unit, the Unit may use the different LTI values of MindAgents in different Lobes to control the distribution of MindAgents among Lobes: e.g. a very important (high-LTI) MindAgent might get cloned across multiple Lobes.
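A minimal sketch of this scheduling rule; the helper function, agent names, and numbers are invented for illustration.

```python
# Illustrative scheduler: processor time is split proportionally to STI,
# but any MindAgent whose LTI exceeds a floor is guaranteed a minimum
# number of cycles, matching rules 2 and 3 above.

def allocate_cycles(agents, budget, lti_floor=1.0, guaranteed=1):
    """agents: dict name -> (sti, lti). Returns dict name -> processor cycles."""
    total_sti = sum(sti for sti, _ in agents.values()) or 1.0
    alloc = {name: int(budget * sti / total_sti)
             for name, (sti, _) in agents.items()}
    for name, (_, lti) in agents.items():
        if lti > lti_floor:                      # LTI guarantee: never starve
            alloc[name] = max(alloc[name], guaranteed)
    return alloc

agents = {"PLN": (8.0, 5.0), "MOSES": (2.0, 5.0), "Forgetting": (0.0, 2.0)}
print(allocate_cycles(agents, budget=100))
# {'PLN': 80, 'MOSES': 20, 'Forgetting': 1}
```

Note how the zero-STI "Forgetting" agent still receives its guaranteed cycle because its LTI exceeds the floor.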
23.13 Information Geometry for Attention Allocation

Appendix ?? outlines some very broad ideas regarding the potential utilization of information geometry and related ideas for modeling cognition. In this section, we present some more concrete and detailed experiments inspired by the same line of thinking. We model CogPrime's Economic Attention Networks (ECAN) component using information geometric language, and then use this model to propose a novel information geometric method of updating ECAN networks (based on an extension of Amari's ANGL algorithm). Tests on small networks suggest that information-geometric methods have the potential to vastly improve ECAN's capability to shift attention from current preoccupations to desired preoccupations. However, there is a high computational cost associated with the simplest implementations of these methods, which has prevented us from carrying out large-scale experiments so far. We are exploring the possibility of circumventing these issues via using sparse matrix algorithms on GPUs.

23.13.1 Brief Review of Information Geometry

"Information geometry" is a branch of applied mathematics concerned with the application of differential geometry to spaces of probability distributions. In [GI11] we have suggested some extensions to traditional information geometry aimed at allowing it to better model general intelligence. However, for the concrete technical work in this chapter, the traditional formulation of information geometry will suffice. One of the core mathematical constructs underlying information geometry is the Fisher information (FI), a statistical quantity which has a variety of applications ranging far beyond statistical data analysis, including physics [Fri98], psychology, and AI. Put simply, FI is a formal way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ upon which the probability of X depends.
FI forms the basis of the Fisher-Rao metric, which has been proved to be the only Riemannian metric on the space of probability distributions satisfying certain natural properties regarding invariance with respect to coordinate transformations. Typically θ in the FI is considered to be a real multidimensional vector; however, [Dab00] has presented an FI variant that imposes basically no restrictions on the form of θ. Here the multidimensional FI will suffice, but the more general version is needed if one wishes to apply FI to AGI more broadly, e.g. to declarative and procedural as well as attentional knowledge.

In the set-up underlying the definition of the ordinary finite-dimensional Fisher information, the probability function for X, which is also the likelihood function for θ ∈ ℝⁿ, is a function f(X; θ); it is the probability mass (or probability density) of the random variable X conditional on the value of θ. The partial derivative with respect to θᵢ of the log of the likelihood function is called the score with respect to θᵢ. Under certain regularity conditions, it can be shown that the first moment of the score is 0. The second moment is the Fisher information:

\[
\mathcal{I}(\theta)_i = E\left[\left(\frac{\partial}{\partial \theta_i} \ln f(X;\theta)\right)^{2} \,\middle|\, \theta\right]
\]

where, for any given value of θᵢ, the expression E[· | θ] denotes the conditional expectation over values for X with respect to the probability function f(X; θ) given θ. Note that 0 ≤ 𝓘(θ)ᵢ < ∞. Also note that, in the usual case where the expectation of the score is zero, the Fisher information is also the variance of the score.

One can also look at the whole Fisher information matrix

\[
\mathcal{I}(\theta)_{i,j} = E\left[\frac{\partial \ln f(X;\theta)}{\partial \theta_i}\, \frac{\partial \ln f(X;\theta)}{\partial \theta_j} \,\middle|\, \theta\right]
\]

which may be interpreted as a metric gᵢⱼ that provably is the only "intrinsic" metric on probability distribution space. In this notation we have 𝓘(θ)ᵢ = 𝓘(θ)ᵢ,ᵢ. Dabak [Dab99] has shown that the geodesic between two parameter vectors θ and θ′ is given by the exponentially weighted curve

\[
\gamma(t)(x) = \frac{f(x;\theta)^{\,1-t}\, f(x;\theta')^{\,t}}{\int f(y;\theta)^{\,1-t}\, f(y;\theta')^{\,t}\, dy}
\]

under the weak condition that the log-likelihood ratios with respect to f(X, θ) and f(X, θ′) are finite. Also, along this sort of curve, the sum of the Kullback-Leibler distances between θ and θ′, known as the J-divergence, equals the integral of the Fisher information along the geodesic connecting θ and θ′. This suggests that if one is attempting to learn a certain parameter vector based on data, and one has a certain other parameter vector as an initial value, it may make sense to use algorithms that try to follow the Fisher-Rao geodesic between the initial condition and the desired conclusion. This is what Amari [Ama85, ANO01] calls "natural gradient" based learning, a conceptually powerful approach which subtly accounts for dependencies between the components of θ.

23.13.2 Information-Geometric Learning for Recurrent Networks: Extending the ANGL Algorithm

Now we move on to discuss the practicalities of information-geometric learning within CogPrime's ECAN component. As noted above, Amari [Ama85, ANO01] introduced the natural gradient as a generalization of the direction of steepest descent on the space of loss functions of the parameter space. Issues with the original implementation include the requirement of calculating both the Fisher information matrix and its inverse.
To resolve these and other practical considerations, Amari [Ama98] proposed an adaptive version of the algorithm, the Adaptive Natural Gradient Learning (ANGL) algorithm. Park, Amari, and Fukumizu [PAF00] extended ANGL to a variety of stochastic models including stochastic neural networks, multi-dimensional regression, and classification problems. In particular, they showed that, assuming a particular form of stochastic feedforward neural network and under a specific set of assumptions concerning the form of the probability distributions involved, a version of the Fisher information matrix can be written as

\[
G(\theta) = E_{x}\!\left[\, E\!\left[\left(\frac{r'(\xi)}{r(\xi)}\right)^{2}\right] \nabla H \,(\nabla H)^{T} \right]
\]

Although Park et al. considered only feedforward neural networks, their result also holds for more general neural networks, including the ECAN network. What is important is the decomposition of the probability distribution as

\[
p(y \mid x; \theta) = \prod_{i=1}^{L} r\big(y_i - H_i(x,\theta)\big)
\]
where y = H(x; θ) + ξ, with y = (y₁, …, y_L)ᵀ, H = (H₁, …, H_L)ᵀ, ξ = (ξ₁, …, ξ_L)ᵀ, and ξ is added noise. If we assume further that each ξᵢ has the same form as a Gaussian distribution with zero mean and standard deviation σ, then the Fisher information matrix simplifies further to

\[
G(\theta) = \frac{1}{\sigma^{2}}\, E\big[\nabla H \,(\nabla H)^{T}\big].
\]

The adaptive estimate for the inverse, \(\hat{G}^{-1}_{t+1}\), is given by

\[
\hat{G}^{-1}_{t+1} = (1+\epsilon_t)\,\hat{G}^{-1}_{t} - \epsilon_t\, \hat{G}^{-1}_{t}\, \nabla H \,\big(\hat{G}^{-1}_{t}\, \nabla H\big)^{T},
\]

and the loss function for our model takes the form

\[
\ell(x, y; \theta) = -\sum_{i=1}^{L} \log r\big(y_i - H_i(x,\theta)\big).
\]

The learning algorithm for our connection matrix weights B is then given by

\[
B_{t+1} = B_{t} - \eta_t\, \hat{G}^{-1}_{t}\, \nabla \ell(\theta_t).
\]

23.13.3 Information Geometry for Economic Attention Allocation: A Detailed Example

Fig. 23.3: Results from Experiment 1 (sum of squares of errors versus number of nodes)

We now present the results of a series of small-scale, exploratory experiments comparing the original ECAN process running alone with the ECAN process coupled with ANGL. We are interested in determining which of these two lines of processing result in focusing attention more accurately.
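The ANGL update rules above can be exercised on a runnable toy model; here they are reduced to a single scalar parameter (the model, constants, and seed are illustrative assumptions, not the ECAN coupling itself).

```python
# Toy scalar-case sketch of the ANGL recursion: H(x; theta) = theta * x with
# Gaussian noise. g_inv plays the role of the adaptive inverse Fisher
# estimate; the matrix rule above reduces to a scalar recursion.
import random

random.seed(1)
true_theta, theta = 2.0, 0.0
g_inv = 1.0                 # adaptive estimate of the inverse Fisher information
eps, eta, sigma = 0.01, 0.05, 0.1

for t in range(5000):
    x = random.gauss(0.0, 1.0)
    y = true_theta * x + random.gauss(0.0, sigma)
    grad_H = x                            # for a linear model, dH/dtheta = x
    # G^-1 <- (1 + eps) G^-1 - eps (G^-1 grad_H)^2  (scalar case of the matrix rule)
    v = g_inv * grad_H
    g_inv = (1 + eps) * g_inv - eps * v * v
    grad_loss = -(y - theta * x) * x      # gradient of -log-likelihood (up to 1/sigma^2)
    theta -= eta * g_inv * grad_loss

print(round(theta, 2))
```

With E[x²] = 1 the recursion drives g_inv toward 1/E[(∇H)²], so the update behaves like a properly preconditioned stochastic gradient step, and theta converges near its true value of 2.0.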
The experiment started with base patterns of various sizes to be determined by the two algorithms. In the training stage, noise was added, generating a number of instances of noisy base patterns. The learning goal is to identify the underlying base patterns from the noisy patterns, as this will indicate how well the different algorithms can focus attention on relevant versus irrelevant nodes. Next, the ECAN process was run, resulting in the determination of the connection matrix C. In order to apply the ANGL algorithm, we need the gradient, ∇H, of the ECAN training process with respect to the input x. While calculating the connection matrix C, we used Monte Carlo simulation to simultaneously calculate an approximation to ∇H.

Fig. 23.4: Results from Experiment 2 (sum of squares of errors versus training noise, 16 nodes)

After ECAN training was completed, we bifurcated the experiment. In one branch, we ran fuzzed cue patterns through the retrieval process. In the other, we first applied the ANGL algorithm, optimizing the weights in the connection matrix, prior to running the retrieval process on the same fuzzed cue patterns. At a constant value of σ = 0.8 we ran several samples through each branch with pattern sizes of 4 × 4, 7 × 7, 10 × 10, 15 × 15, and 20 × 20. The results are shown in Figure 23.3. We also ran several experiments comparing the sum of squares of the errors to the input training noise as measured by the value of σ; see Figures 23.4 and ??. These results suggest two major advantages of the ECAN+ANGL combination compared to ECAN alone. Not only was the performance of the combination better in every trial, save for one involving a small number of nodes and little noise, but the combination clearly scales significantly better both as the number of nodes increases and as the training noise increases.
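For concreteness, a minimal sketch of this kind of evaluation harness — invented code, not the experiment scripts actually used — showing how a noisy cue is generated from a base pattern and scored by sum of squared errors:

```python
# Illustrative harness: corrupt a base pattern with Gaussian noise, run some
# retrieval process on the cue, and score the result by sum of squared errors
# (SSE), the metric plotted in the experiments above.
import random

def make_noisy(pattern, sigma, rng):
    """Generate a noisy training/cue instance of a base pattern."""
    return [v + rng.gauss(0.0, sigma) for v in pattern]

def sse(a, b):
    """Sum of squares of errors between two flattened patterns."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

rng = random.Random(42)
base = [1.0, 0.0, 1.0, 1.0]               # a flattened 2x2 base pattern
cue = make_noisy(base, sigma=0.8, rng=rng)
retrieved = [min(1.0, max(0.0, round(v))) for v in cue]  # stand-in for ECAN retrieval
print(round(sse(base, retrieved), 3))
```

In the real experiments the "retrieval" step is the ECAN attractor dynamics (with or without ANGL-optimized weights); the harness structure is the same.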
Chapter 24
Economic Goal and Action Selection

24.1 Introduction

A significant portion of CogPrime's dynamics is explicitly goal-driven - that is, based on trying (inasmuch as possible within the available resources) to figure out which actions will best help the system achieve its goals, given the current context. A key aspect of this explicit activity is guided by the process of "goal and action selection" - prioritizing goals, and then prioritizing actions based on these goals. We have already outlined the high-level process of action selection in Chapter 22 above. Now we dig into the specifics of the process, showing how action selection is dynamically entwined with goal prioritization, and how both processes are guided by economic attention allocation as described in Chapter 23. While the basic structure of CogPrime's action selection aspect is fairly similar to MicroPsi (due to the common foundation in Dorner's Psi model), the dynamics are less similar. MicroPsi's dynamics are a little closer to being a formal neural net model, whereas ECAN's economic foundation tends to push it in different directions. The CogPrime goal and action selection design involves some simple simulated financial mechanisms, building on the economic metaphor of ECAN, that are different from, and more complex than, anything in MicroPsi.

The main actors (apart from the usual ones like the AtomTable, economic attention allocation, etc.) in the tale to be told here are as follows:

• Structures:
  - UbergoalPool
  - ActiveSchemaPool
• MindAgents:
  - GoalBasedSchemaSelection
  - GoalBasedSchemaLearning
  - GoalAttentionAllocation
  - FeasibilityUpdating
  - SchemaActivation

The Ubergoal Pool contains the Atoms that the system considers as top-level goals. These goals must be treated specially by attention allocation: they must be given funding by the Unit so that they can use it to pay for getting themselves achieved. The weighting among different
top-level goals is achieved via giving them different amounts of currency. STICurrency is the key kind here, but of course ubergoals must also get some LTICurrency so they won't be forgotten. (Inadvertently deleting your top-level supergoals from memory is generally considered to be a bad thing ... it's in a sense a sort of suicide...)

24.2 Transfer of STI "Requests for Services" Between Goals

Transfer of "attentional funds" from goals to subgoals, and from schema modules to other schema modules in the same schema, takes place via a mechanism of promises of funding (or 'requests for service,' to be called 'RFS's' from here on). This mechanism relies upon and interacts with ordinary economic attention allocation but also has special properties. Note that we will sometimes say that an Atom "issues" an RFS or "transfers" currency while what we really mean is that some MindAgent working on that Atom issues an RFS or transfers currency. The logic of these RFS's is as follows. If agent A issues an RFS of value x to agent B, then

1. When B judges it appropriate, B may redeem the note and ask A to transfer currency of value x to B.
2. A may withdraw the note from B at any time.

(There is also a little more complexity here, in that we will shortly introduce the notion of RFS's whose value is defined by a set of constraints. But this complexity does not contradict the two above points.) The total value of the RFS's possessed by an Atom may be referred to as its "promise." A rough schematic depiction of this RFS process is given in Figure 24.1.

Now we explain how RFS's may be passed between goals. Given two predicates A and B, if A is being considered as a goal, then B may be considered as a subgoal of A (and A the supergoal of B) if there exists a Link of the form

PredictiveImplication B A

I.e., achieving B may help to achieve A.
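The two RFS rules above can be sketched as follows; the classes are hypothetical, not the CogPrime API, but they illustrate that redemption conserves real currency while promises themselves are not conserved.

```python
# Illustrative sketch of the RFS mechanism: issuance creates a promise,
# redemption converts it into actual currency, withdrawal cancels it.

class Agent:
    def __init__(self, name, currency=0.0):
        self.name, self.currency = name, currency
        self.rfs_issued = {}          # recipient -> promised value

    def issue_rfs(self, recipient, value):
        self.rfs_issued[recipient] = value

    def withdraw_rfs(self, recipient):
        self.rfs_issued.pop(recipient, None)

    def redeem_rfs(self, issuer):
        """Called by the holder: convert the promise into actual currency."""
        value = issuer.rfs_issued.pop(self, 0.0)
        value = min(value, issuer.currency)   # real currency is conserved
        issuer.currency -= value
        self.currency += value
        return value

goal = Agent("play_fetch", currency=10.0)
subgoal = Agent("get_ball")
goal.issue_rfs(subgoal, 6.0)
print(subgoal.redeem_rfs(goal), goal.currency)   # 6.0 4.0
```

A goal may promise the same currency to several alternative subgoals; only the first redemption actually moves money, after which the other promises would be withdrawn.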
Of course, the strength of this link and the temporal characteristics of this link are important in terms of quantifying how strongly and how usefully B is a subgoal of A. Supergoals (not only top-level ones, aka ubergoals) allocate RFS's to subgoals as follows. Supergoal A may issue an RFS to subgoal B if it is judged that achievement (i.e., predicate satisfaction) of B implies achievement of A. This may proceed recursively: subgoals may allocate RFS's to subsubgoals according to the same justification.

Unlike actual currency, RFS's are not conserved. However, the actual payment of real currency upon redemption of RFS's obeys the conservation of real currency. This means that agents need to be responsible in issuing and withdrawing RFS's. In practice this may be ensured by having agents follow a couple of simple rules in this regard.

1. If B and C are two alternatives for achieving A, and A has x units of currency, then A may promise both B and C x units of currency. Whoever asks for a redemption of the promise first will get the money, and then the promise will be rescinded from the other one.
2. On the other hand, if the achievement of A requires both B and C to be achieved, then B and C may be granted RFS's that are defined by constraints. If A has x units of currency, then B and C receive an RFS tagged with the constraint (B+C<=x). This means that in
Fig. 24.1: The RFS Propagation Process. An illustration of the process via which RFS's propagate from goals to abstract procedures, and finally must get cashed out to pay for the execution of actual concrete procedures that are estimated relatively likely to lead to goal fulfillment.

order to redeem the note, either one of B or C must confer with the other one, so that they can simultaneously request constraint-consistent amounts of money from A.

As an example of the role of constraints, consider the goal of playing fetch successfully (a subgoal of "get reward").... Then suppose it is learned that:

ImplicationLink
    SequentialAND get_ball deliver_ball
    play_fetch

where SequentialAND A B is the conjunction of A and B but with B occurring after A in time. Then, if play_fetch has $10 in STICurrency, it may know it has $10 to spend on a combination
of get_ball and deliver_ball. In this case both get_ball and deliver_ball would be given RFS's labeled with the constraint:

RFS.get_ball + RFS.deliver_ball <= $10

The issuance of RFS's embodying constraints is different from (and generally carried out prior to) the evaluation of whether the constraints can be fulfilled. An ubergoal may rescind offers of reward for service at any time. And, generally, if a subgoal gets achieved and has not spent all the money it needed, the supergoal will not offer any more funding to the subgoal (until/unless it needs that subgoal achieved again). As there are no ultimate sources of RFS in OCP besides ubergoals, promise may be considered as a measure of "goal-related importance." Transfer of RFS's among Atoms is carried out by the GoalAttentionAllocation MindAgent.

24.3 Feasibility Structures

Next, there is a numerical data structure associated with goal Atoms, which is called the feasibility structure. The feasibility structure of an Atom G indicates the feasibility of achieving G as a goal using various amounts of effort. It contains triples of the form (t, p, E) indicating the truth value t of achieving goal G to degree p using effort E. Feasibility structures must be updated periodically, via scanning the links coming into an Atom G; this may be done by a FeasibilityUpdating MindAgent. Feasibility may be calculated for any Atom G for which there are links of the form:

Implication (Execution S) G

for some S. Once a schema has actually been executed on various inputs, its cost of execution on other inputs may be empirically estimated. But this is not the only case in which feasibility may be estimated. For example, if goal G inherits from goal G1, and most children (e.g. subgoals) of G1 are achievable with a certain feasibility, then probably G is achievable with a similar feasibility as well. This allows feasibility estimation even in cases where no plan for achieving G yet exists, e.g.
if the plan can be produced via predicate schematization, but such schematization has not yet been carried out.

Feasibility then connects with importance as follows. Important goals will get more STICurrency to spend, and thus will be able to spawn more costly schemata. So, the GoalBasedSchemaSelection MindAgent, when choosing which schemata to push into the ActiveSchemaPool, will be able to choose more costly schemata corresponding to goals with more STICurrency to spend.

24.4 Goal Based Schema Selection

Next, the GoalBasedSchemaSelection (GBSS) MindAgent selects schemata to be placed into the ActiveSchemaPool. It does this by choosing goals G, and then choosing schemata that are alleged to be useful for achieving these goals. It chooses goals via a fitness function that combines promise and feasibility. This involves solving an optimization problem: figuring out how to
maximize the odds of getting a lot of goal-important stuff done within the available amount of (memory and space) effort. Potentially this optimization problem can get quite subtle, but initially some simple heuristics are satisfactory. (One subtlety involves handling dependencies between goals, as represented by constraint-bearing RFS's.)

Given a goal, the GBSS MindAgent chooses a schema to achieve that goal via the heuristic of selecting the one that maximizes a fitness function balancing the estimated effort required to achieve the goal via executing the schema with the estimated probability that executing the schema will cause the goal to be achieved. When searching for schemata to achieve G, and estimating their effort, one factor to be taken into account is the set of schemata already in the ActiveSchemaPool. Some schemata S may simultaneously achieve two goals; or two schemata achieving different goals may have significant overlap of modules. In this case G may be able to get achieved using very little or no effort (no additional effort, if there is already a schema S in the ActiveSchemaPool that is going to cause G to be achieved). But if G "decides" it can be achieved via a schema S already in the ActiveSchemaPool, then it should still notify the ActiveSchemaPool of this, so that G can be added to S's index (see below). If the other goal G1 that placed S in the ActiveSchemaPool decides to withdraw S, then S may need to hit up G1 for money, in order to keep itself in the ActiveSchemaPool with enough funds to actually execute.
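A hedged sketch of the selection heuristic just described — ranking candidate schemata by promise-weighted probability of success per unit of net effort, with a discount for overlap with schemata already in the ActiveSchemaPool. All names and numbers here are invented for illustration.

```python
# Illustrative GBSS fitness: expected goal-value achieved per unit of effort.
# `active_overlap` discounts effort already paid for by a schema in the
# ActiveSchemaPool that shares modules with this candidate.

def schema_fitness(promise, p_success, effort, active_overlap=0.0):
    """Higher is better. promise: goal's RFS-backed value; p_success:
    estimated probability the schema achieves the goal; effort: estimated cost."""
    net_effort = max(effort * (1.0 - active_overlap), 1e-6)
    return promise * p_success / net_effort

candidates = [
    ("fetch_ball_plan", schema_fitness(promise=10.0, p_success=0.8, effort=4.0)),
    ("dig_tunnel_plan", schema_fitness(promise=10.0, p_success=0.9, effort=20.0)),
    ("reuse_active_plan", schema_fitness(promise=10.0, p_success=0.5,
                                         effort=4.0, active_overlap=0.75)),
]
best = max(candidates, key=lambda c: c[1])
print(best[0])   # reuse_active_plan
```

Note how the lower-probability candidate wins here purely because most of its effort is already being paid for by an active schema — the "little or no additional effort" case described above.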
24.4.1 A Game-Theoretic Approach to Action Selection

Min Jiang has observed that, mathematically, the problem of action selection (represented in CogPrime as the problem of goal-based schema selection) can be modeled in terms of game theory, as follows:

• the intelligent agent is one player, the world is another player
• the agent's model of the world lets it make probabilistic predictions of how the world may respond to what the agent does (i.e. to estimate what mixed strategy the world is following, considering the world as a game player)
• the agent itself chooses schemata probabilistically, so it's also following a mixed strategy
• so, in principle the agent can choose schemata that it thinks will lead to a mixed Nash equilibrium (in game theory, a Nash equilibrium is a state in which no player can do better by unilaterally changing its strategy)

But the world's responses are very high-dimensional, which means that finding a mixed Nash equilibrium even approximately is a very hard computational problem. Thus, in a sense, the crux of the problem seems to come down to feature identification. If the world's response (real or predicted) can be represented as a low-dimensional set of features, then these features can be considered as the world's "move" in the game ... and the game theory problem becomes tractable via approximation schemes. But without the reduction of the world to a low-dimensional set of features, finding the mixed Nash equilibrium even approximately will not be computationally tractable...

Some AI theorists would argue that this division into "feature identification" versus "action selection" is unnecessarily artificial; for instance, Hawkins [HB04] or Arel [ARC09b] might suggest to use a single hierarchical neural network to do both of them. But the brain, after all,
contains many different regions, with different architectures and dynamics.... In the visual cortex, it seems that feature extraction and object classification are done separately. And it seems that in the brain, action selection has a lot to do with the basal ganglia, whereas feature extraction is done in the cortex. So the neural analogy provides some inspiration for an architecture in which feature identification and action selection are separated.

There is literature discussing numerical methods for calculating approximate Nash equilibria; however, this is an extremely tricky topic in the CogPrime context because action selection must generally be done in real-time. Like perception processing, this may be an area calling for the use of parallel processing hardware. For instance, a neural network algorithm for finding mixed Nash equilibria could be implemented on a GPU supercomputer, enabling rapid real-time action selection based on a reduced-dimensionality model of the world produced by intelligent feature identification.

Consideration of the application of game theory in this context brings out an important point, which is that to do reasonably efficient and intelligent action selection, the agent needs some rapidly-evaluable model of the world, i.e. some way to rapidly evaluate the predicted response of the world to a hypothetical action by the agent. In the game theory approach (or any other sufficiently intelligent approach), for the agent to evaluate the fitness of a schema-set S for achieving certain goals in a certain context, it has to (explicitly or implicitly) estimate

• how the world will respond if the agent does S
• how the agent could usefully respond to the world's response (call this action-set S1)
• how the world will respond to the agent doing S1
• etc.

and so to rapidly evaluate the fitness of S, the agent needs to be able to quickly estimate how the world will respond.
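As a toy illustration of why dimensionality reduction makes the game tractable: once the world's response is compressed to two "feature moves", the agent's mixed equilibrium strategy for a zero-sum 2x2 game has a closed form. This is a standard game-theory computation, not drawn from the original text.

```python
# 2x2 zero-sum game: the agent (row player) mixes so as to make the world
# (column player) indifferent between its two feature-moves.

def mixed_equilibrium_2x2(A):
    """A[i][j]: agent's payoff when agent plays i and world plays j.
    Returns the agent's equilibrium probability of playing action 0,
    assuming the game has no pure-strategy saddle point."""
    a, b = A[0]
    c, d = A[1]
    # choose p so that  p*a + (1-p)*c == p*b + (1-p)*d
    return (d - c) / (a - b - c + d)

payoffs = [[1.0, -1.0],
           [-1.0, 1.0]]                 # matching pennies
print(mixed_equilibrium_2x2(payoffs))   # 0.5
```

Scaling this beyond a handful of feature-moves is where the computational difficulty returns, and any such scheme still presupposes that the agent can rapidly estimate the world's response to a hypothetical action.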
This may be done via simulation, or it may be done via inference (which however will rarely be fast enough, unless with a very accurate inference control mechanism), or it may be done by learning some compacted model of the world as represented, for instance, in a hierarchical neural network.

24.5 SchemaActivation

And what happens with schemata that are actually in the ActiveSchemaPool? Let us assume that each of these schemata is a collection of modules (subprograms), connected via ActivationLinks, which have the semantics: (ActivationLink A B) means that if the schema that placed module A in the schema pool is to be completed, then after A is activated, B should be activated. (We will have more to say about schemata, and their modularization, in Chapter 25.)

When a goal places a schema in the ActiveSchemaPool, it grants that schema an RFS equal in value to the total or some fraction of the promissory+real currency it has in its possession. The heuristics for determining how much currency to grant may become sophisticated; but initially we may just have a goal give a schema all its promissory currency; or, in the case of a top-level supergoal, all its actual currency. When a module within a schema actually executes, it must redeem some of its promissory currency to turn it into actual currency, because executing costs money (paid to the Lobe).
Once a schema is done executing, if it hasn't redeemed all its promissory currency, it gives the remainder back to the goal that placed it in the ActiveSchemaPool. When a module finishes executing, it passes promissory currency to the other modules to which it points with ActivationLinks. The network of modules in the ActiveSchemaPool is a digraph (whose links are ActivationLinks), because some modules may be shared among different overall schemata. Each module must be indexed via which schemata contain it, and each schema must be indexed via which goal(s) want it in the ActiveSchemaPool.

24.6 GoalBasedSchemaLearning

Finally, we have the process of trying to figure out how to achieve goals, i.e. trying to learn links between ExecutionLinks and goals G. This process should be focused on goals that have a high importance but for which feasible achievement-methodologies are not yet known. Predicate schematization is one way of achieving this; another is MOSES procedure evolution.
Chapter 25 Integrative Procedure Evaluation 25.1 Introduction Procedural knowledge must be learned, an often subtle and difficult process - but it must also be enacted. Procedure enaction is not as tricky a topic as procedure learning, but still is far from trivial, and involves the real-time interaction of procedures, during the course of execution, with other knowledge. In this brief chapter we explain how this process may be most naturally and flexibly carried out, in the context of CogPrime's representation of procedures as programs ("Combo trees"). While this may seem somewhat of a "mechanical", implementation-level topic, it also involves some basic conceptual points, on which CogPrime as an AGI design does procedure evaluation fundamentally differently from narrow-AI systems or conventional programming language in- terpreters. Basically, what makes CogPrime Combo tree evaluation somewhat subtle is due to the interfacing between the Combo evaluator itself and the rest of the CogPrime system. In the CogPrime design, Procedure objects (which contain Combo trees, and are associated with ProcedureNodes) are evaluated by ProcedureEvaluator objects. Different ProcedureEvalu- ator objects may evaluate the same Combo tree in different ways. Here we explain these various sorts of evaluation - how they work and what they mean. 25.2 Procedure Evaluators In this section we will mention three different ProcedureEvaluators: • Simple procedure evaluation • Effort-based procedure evaluation, which is more complex but is required for integration of inference with procedure evaluation • Adaptive evaluation order based procedure evaluation In the following section we will delve more thoroughly into the interactions between inference and procedure evaluation. Another related issue is the modularization of procedures. This issue however is actually orthogonal to the distinction between the three ProcedureEvaluators mentioned above. 
Modularity simply requires that particular nodes within a Combo tree be marked as "module roots",
so that they may be extracted from the Combo tree as a whole and treated as separate modules (in other words, subroutines), if the ExecutionManager judges this appropriate.

25.2.1 Simple Procedure Evaluation

The SimpleComboTreeEvaluator simply does Combo tree evaluation as described earlier. When an Atom is encountered, it looks into the AtomTable to evaluate the object. In the case that a Schema refers to an ungrounded SchemaNode (one not defined by a Combo tree as described in Chapter 19), and an appropriate EvaluationLink value isn't in the AtomTable, there is an evaluation failure, and the whole procedure evaluation returns the truth value (.5,0): i.e., a zero-weight-of-evidence truth value, which is essentially equivalent to returning no value. In the case that a Predicate refers to an ungrounded PredicateNode, and an appropriate EvaluationLink isn't in the AtomTable, then some very simple "default thinking" is done: the truth value of the predicate on the given arguments is assigned to be the TruthValue of the corresponding PredicateNode (which is defined as the mean truth value of the predicate across all arguments known to CogPrime).

25.2.2 Effort Based Procedure Evaluation

The next step is to introduce the notion of "effort": the amount of effort that the CogPrime system must undertake in order to carry out a procedure evaluation. The notion of effort is encapsulated in Effort objects, which may take various forms. The simplest Effort objects measure only elapsed processing time; more advanced Effort objects take into consideration other factors such as memory usage. An effort-based Combo tree evaluator keeps a running total of the effort used in evaluating the Combo tree. This is necessary if inference is to be used to evaluate Predicates, Schemata, Arguments, etc. Without some control of effort expenditure, the system could do an arbitrarily large amount of inference to evaluate a single Atom.
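The lookup-with-fallback behavior of simple evaluation - AtomTable lookup, with the zero-weight (.5, 0) truth value on failure - can be sketched as follows. This is a toy illustration, not OpenCog code: the AtomTable is modeled as a plain dictionary and all function names are hypothetical.

```python
# Illustrative sketch of simple Combo tree evaluation with AtomTable lookup.
# A TruthValue is modeled as (strength, weight_of_evidence); all names are
# hypothetical, not drawn from the actual OpenCog codebase.

NO_VALUE = (0.5, 0.0)  # zero-weight-of-evidence: "no value", per the text

def simple_evaluate(node, atom_table):
    """Evaluate a Combo tree node against an AtomTable (a dict here)."""
    if isinstance(node, tuple):            # internal node: (operator, children...)
        op, *children = node
        args = [simple_evaluate(c, atom_table) for c in children]
        return op(*args)
    # leaf: an Atom reference -- look it up in the AtomTable
    if node in atom_table:
        return atom_table[node]
    return NO_VALUE                        # evaluation failure: (.5, 0)

def fuzzy_and(a, b):
    # toy connective: min of strengths, min of evidence weights
    return (min(a[0], b[0]), min(a[1], b[1]))

atoms = {"near_teacher": (0.9, 0.8), "holding_ball": (0.7, 0.6)}
tree = (fuzzy_and, "near_teacher", "holding_ball")
print(simple_evaluate(tree, atoms))            # (0.7, 0.6)
print(simple_evaluate("unknown_atom", atoms))  # (0.5, 0.0)
```

The recursive traversal is all there is to the simple evaluator; the subtleties discussed next arise when a lookup fails and inference could be used instead.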
The matter of evaluation effort is nontrivial because in many cases a given node of a Combo tree may be evaluated in more than one way, with a significant effort differential between the different methodologies. If a Combo tree Node refers to a predicate or schema that is very costly to evaluate, then the ProcedureEvaluator managing the evaluation of the Combo tree must decide whether to evaluate it directly (expensive) or estimate the result using inference (cheaper but less accurate). This decision depends on how much effort the ProcedureEvaluator has to play with, and what percentage of this effort it finds judicious to apply to the particular Combo tree Node in question. In the relevant prototypes we built within OpenCog, this kind of decision was made based on some simple heuristics inside ProcedureEvaluator objects. However, it's clear that, in general, more powerful intelligence must be applied here, so that one needs to have ProcedureEvaluators that - in cases of sub-procedures that are both important and highly expensive - use PLN inference to figure out how much effort to assign to a given subproblem.
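The direct-versus-inferential decision just described can be sketched as a simple budget heuristic. This is only an illustration of the control pattern, with made-up names and thresholds; the real decision, as noted above, may itself be made by PLN inference.

```python
# Sketch of an effort-budgeted evaluation decision: evaluate a costly node
# directly, or fall back to a cheaper inferential estimate, depending on the
# remaining effort budget. All names and costs are illustrative.

class EffortBudget:
    def __init__(self, max_effort):
        self.remaining = max_effort

    def spend(self, amount):
        self.remaining -= amount

def evaluate_node(node, budget, direct_cost, infer_cost=1):
    """Pick direct evaluation when affordable, inference otherwise."""
    if budget.remaining >= direct_cost:
        budget.spend(direct_cost)
        return node["direct"]()            # expensive, accurate
    if budget.remaining >= infer_cost:
        budget.spend(infer_cost)
        return node["infer"]()             # cheap, approximate
    return (0.5, 0.0)                      # out of effort: no value

node = {"direct": lambda: (0.8, 0.9), "infer": lambda: (0.7, 0.3)}
b = EffortBudget(11)
print(evaluate_node(node, b, direct_cost=10))  # direct:    (0.8, 0.9), 1 left
print(evaluate_node(node, b, direct_cost=10))  # inference: (0.7, 0.3), 0 left
print(evaluate_node(node, b, direct_cost=10))  # exhausted: (0.5, 0.0)
```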
The simplest useful kind of effort-based Combo tree evaluator is the EffortIntervalComboTreeEvaluator, which utilizes an Effort object that contains three numbers (yes, no, max). The yes parameter tells it how much effort should be expended to evaluate an Atom if there is a ready answer in the AtomTable. The no parameter tells it how much effort should be expended in the case that there is not a ready answer in the AtomTable. The max parameter tells it how much effort should be expended, at maximum, to evaluate all the Atoms in the Combo tree, before giving up. Zero effort, in the simplest case, may be heuristically defined as simply looking into the AtomTable - though in reality this does of course take effort, and a more sophisticated treatment would incorporate this as a factor as well. Quantification of amounts of effort is nontrivial, but a simple heuristic guideline is to assign one unit of effort for each inference step. Thus, for instance:

• (yes, no, max) = (0, 5, 1000) means that if an Atom can be evaluated by AtomTable lookup, this is done, but if AtomTable lookup fails, a minimum of 5 inference steps are done to try to do the evaluation. It also says that no more than 1000 evaluations will be done in the course of evaluating the Combo tree.
• (yes, no, max) = (3, 5, 1000) says the same thing, but with the change that even if evaluation could be done by direct AtomTable lookup, 3 inference steps are tried anyway, to try to improve the quality of the evaluation.

25.2.3 Procedure Evaluation with Adaptive Evaluation Order

While tracking effort enables the practical use of inference within Combo tree evaluation, if one has truly complex Combo trees, then a higher degree of intelligence is necessary to guide the evaluation process appropriately.
The order of evaluation of a Combo tree may be determined adaptively, based on up to three things:

• The history of evaluation of the Combo tree
• The past history of evaluation of other Combo trees, as stored in a special AtomTable consisting only of relationships about Combo tree evaluation-order probabilities
• New information entering into CogPrime's primary AtomTable during the course of evaluation

ProcedureEvaluator objects may be selected at runtime by cognitive schemata, and they may also utilize schemata and MindAgents internally. The AdaptiveEvaluationOrderComboTreeEvaluator is more complex than the other ProcedureEvaluators discussed, and will involve various calls to CogPrime MindAgents, particularly those concerned with PLN inference.

25.3 The Procedure Evaluation Process

Now we give a more thorough treatment of the procedure evaluation process, as embodied in the effort-based or adaptive-evaluation-order evaluators discussed above. The process of procedure evaluation is somewhat complex, because it encompasses three interdependent processes:
• The mechanics of procedure evaluation, which in the CogPrime design involves traversing Combo trees in an appropriate order. When a Combo tree node referring to a predicate or schema is encountered during Combo tree traversal, the process of predicate evaluation or schema execution must be invoked.
• The evaluation of the truth values of predicates - which involves a combination of inference and (in the case of grounded predicates) procedure evaluation.
• The computation of the truth values of schemata - which may involve inference as well as procedure evaluation.

We now review each of these processes.

25.3.1 Truth Value Evaluation

What happens when the procedure evaluation process encounters a Combo tree Node that represents a predicate or compound term? The same thing as when some other CogPrime process decides it wants to evaluate the truth value of a PredicateNode or CompoundTermNode: the generic process of predicate evaluation is initiated. This process is carried out by a TruthValueEvaluator object. There are several varieties of TruthValueEvaluator, which fall into the following hierarchy:

TruthValueEvaluator
  DirectTruthValueEvaluator (abstract)
    SimpleDirectTruthValueEvaluator
  InferentialTruthValueEvaluator (abstract)
    SimpleInferentialTruthValueEvaluator
  MixedTruthValueEvaluator

A DirectTruthValueEvaluator evaluates a grounded predicate by directly executing it on one or more inputs; an InferentialTruthValueEvaluator evaluates via inference based on the previously recorded, or specifically elicited, behaviors of other related predicates or compound terms. A MixedTruthValueEvaluator contains references to a DirectTruthValueEvaluator and an InferentialTruthValueEvaluator, and contains a weight that tells it how to balance the outputs from the two. Direct truth value evaluation has two cases.
In one case, there is a given argument for the predicate; then one simply plugs this argument in to the predicate's internal Combo tree, and passes the problem off to an appropriately selected ProcedureEvaluator. In the other case, there is no given argument, and one is looking for the truth value of the predicate in general. In this latter case, some estimation is required. It is not plausible to evaluate the truth value of a predicate on every possible argument, so one must sample a collection of arguments and then record the resulting probability distribution. A greater or smaller number of samples may be taken, based on the amount of effort that's been allocated to the evaluation process. It's also possible to evaluate the truth value of a predicate in a given context (information that's recorded via embedding in a ContextLink); in this case the random sampling is restricted to inputs that lie within the specified context. On the other hand, the job of an InferentialTruthValueEvaluator is to use inference rather than direct evaluation to guess the truth value of a predicate (sometimes on a particular argument, sometimes in general). There are several different control strategies that may be applied here, depending on the amount of effort allocated. The simplest strategy is to rely on analogy,
simply searching for similar predicates and using their truth values as guidance. (In the case where a specific argument is given, one searches for similar predicates that have been evaluated on similar arguments.) If more effort is available, then a more sophisticated strategy may be taken. Generally, an InferentialTruthValueEvaluator may invoke a SchemaNode that embodies an inference control strategy for guiding the truth value estimation process. These SchemaNodes may then be learned like any others. Finally, a MixedTruthValueEvaluator operates by consulting a DirectTruthValueEvaluator and/or an InferentialTruthValueEvaluator as necessary, and merging the results. Specifically, in the case of an ungrounded PredicateNode, it simply returns the output of the InferentialTruthValueEvaluator it has chosen. But in the case of a GroundedPredicateNode, it returns a weighted average of the directly evaluated and inferred values, where the weight is a parameter. In general, this weighting may be done by a SchemaNode that is selected by the MixedTruthValueEvaluator; and these schemata may be adaptively learned.

25.3.2 Schema Execution

Finally, schema execution is handled similarly to truth value evaluation, but it's a bit simpler in the details. Schemata have their outputs evaluated by SchemaExecutor objects, which in turn invoke ProcedureEvaluator objects. We have the hierarchy:

SchemaExecutor
  DirectSchemaExecutor (abstract)
    SimpleDirectSchemaExecutor
  InferentialSchemaExecutor (abstract)
    SimpleInferentialSchemaExecutor
  MixedSchemaExecutor

A DirectSchemaExecutor evaluates the output of a schema by directly executing it on some inputs; an InferentialSchemaExecutor evaluates via inference based on the previously recorded, or specifically elicited, behaviors of other related schemata.
A MixedSchemaExecutor contains references to a DirectSchemaExecutor and an InferentialSchemaExecutor, and contains a weight that tells it how to balance the outputs from the two (not always obvious, depending on the output type in question). Contexts may be used in schema execution, but they're used only indirectly, via being passed to TruthValueEvaluators used for evaluating truth values of PredicateNodes or CompoundTermNodes that occur internally in schemata being executed.
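The direct / inferential / mixed division described in this section can be illustrated with a toy sketch: direct evaluation by sampling arguments, inferential estimation by analogy to similar predicates, and a weighted mix of the two. Every function here is a hypothetical stand-in, not the OpenCog implementation, and the confidence formulas are made up for illustration.

```python
# Sketch of the TruthValueEvaluator hierarchy: direct evaluation by sampling,
# inferential estimation by analogy, and a weighted mix. Predicates are plain
# Python functions; every name and formula here is illustrative.

import random

def direct_evaluate(predicate, sample_args, effort):
    """Estimate a predicate's general truth value by sampling arguments."""
    n = min(effort, len(sample_args))          # more effort -> more samples
    sample = random.sample(sample_args, n)
    strength = sum(predicate(a) for a in sample) / n
    return (strength, n / (n + 1))             # crude confidence from sample size

def inferential_evaluate(similar_truth_values):
    """Analogy: average the truth values of similar predicates."""
    s = sum(tv[0] for tv in similar_truth_values) / len(similar_truth_values)
    w = sum(tv[1] for tv in similar_truth_values) / len(similar_truth_values)
    return (s, w * 0.5)                        # discount evidence from analogy

def mixed_evaluate(direct_tv, inferred_tv, weight=0.7):
    """Weighted average of direct and inferred truth values."""
    mix = lambda a, b: weight * a + (1 - weight) * b
    return (mix(direct_tv[0], inferred_tv[0]), mix(direct_tv[1], inferred_tv[1]))

is_small = lambda x: x < 10
tv_direct = direct_evaluate(is_small, list(range(20)), effort=10)
tv_inferred = inferential_evaluate([(0.4, 0.8), (0.6, 0.6)])
print(mixed_evaluate(tv_direct, tv_inferred))
```

In the real design the mixing weight may itself be chosen by a learned SchemaNode, as noted above; the fixed parameter here is the simplest stand-in for that.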
Section III
Perception and Action
Chapter 26
Perceptual and Motor Hierarchies

26.1 Introduction

Having discussed declarative, attentional, intentional and procedural knowledge, we are left only with sensorimotor and episodic knowledge to complete our treatment of the basic CogPrime "cognitive cycle" via which a CogPrime system can interact with an environment and seek to achieve its goals therein. The cognitive cycle in its most basic form leaves out the most subtle and unique aspects of CogPrime, which all relate to learning in various forms. But nevertheless it is the foundation on which CogPrime is built, and within which the various learning processes dealing with the various forms of memory all interact. The CogPrime cognitive cycle is more complex in many respects than it would need to be if not for the need to support diverse forms of learning. And this learning-driven complexity is present to some extent in the contents of the present chapter as well. If learning weren't an issue, perception and actuation could more likely be treated as wholly (or near-wholly) distinct modules, operating according to algorithms and structures independent of cognition. But our suspicion is that this sort of approach is unlikely to be adequate for achieving high levels of perception and action capability under real-world conditions. Instead, we suspect, it's necessary to create perception and action processes that operate fairly effectively on their own, but are capable of cooperating with cognition to achieve yet higher levels of functionality. And the benefit in such an approach goes both ways. Cognition helps perception and actuation deal with difficult cases, where the broad generalization that is cognition's specialty is useful for appropriately biasing perception and actuation based on subtle environmental regularities.
And, the patterns involved in perception and actuation help cognition, by supplying a rich reservoir of structures and processes to use as analogies for reasoning and learning at various levels of abstraction. The prominence of visual and other sensory metaphors in abstract cognition is well known [Arn69, Gar00]; and according to Lakoff and Nunez [LN00] even pure mathematics is grounded in physical perception and action in very concrete ways. We begin by discussing the perception and action mechanisms required to interface CogPrime with an agent operating in a virtual world. We then turn to the more complex mechanisms needed to effectively interface CogPrime with a robot possessing vaguely humanoid sensors and actuators, focusing largely on vision processing. This discussion leads up to deeper discussions in Chapters 27, 28 and 29, where we describe in detail the strategy that would be used to integrate
CogPrime with the DeSTIN framework for AGI perception/action (which was described in some detail in Chapter 4 of Part 1). In terms of the integrative cognitive architecture presented in Chapter 5, the material presented in the chapters in this section has mostly to do with the perceptual and motor hierarchies, also touching on the pattern recognition and imprinting processes that play a role in the interaction between these hierarchies and the conceptual memory. The commitment to a hierarchical architecture for perception and action is not critical for the CogPrime design as a whole - one could build a CogPrime with non-hierarchical perception and action modules, and the rest of the system would be about the same. The role of hierarchy here is a reflection of the obvious hierarchical structure of the everyday human environment, and of the human body. In a world marked by hierarchical structure, a hierarchically structured perceptual system is advantageous. To control a body marked by hierarchical structure, a hierarchically structured action system is advantageous. It would be possible to create a CogPrime system without this sort of in-built hierarchical structure, and have it gradually self-adapt in such a way as to grow its own internal hierarchical structure, based on its experience in the world. However, this might be a case of pushing the "experiential learning" perspective too far. The human brain definitely has hierarchical structure built into it; it doesn't need to learn to experience the world in hierarchical terms; and there seems to be no good reason to complicate an AGI's early development phase by forcing it to learn the basic facts of the world's and its body's hierarchality.

26.2 The Generic Perception Process

We have already discussed the generic action process of CogPrime, in Chapter 25 on procedure evaluation.
Action sequences are generated by Combo programs, which execute primitive actions, including those corresponding to actuator control signals as well as those corresponding to, say, mathematical or cognitive operations. In some cases the actuator control signals may directly dictate movements; in other cases they may supply inputs and/or parameters to other software (such as DeSTIN, in the integrated CogBot architecture to be described below). What about the generic perception process? We distinguish sensation from perception, in a CogPrime context, by defining:

• perception as what occurs when some signal from the outside world registers itself in either a CogPrime Atom, or some other sort of node (e.g. a DeSTIN node) that is capable of serving as the target of a CogPrime Link
• sensation as any "preprocessing" that occurs between the impact of some signal on some sensor, and the creation of a corresponding perception

Once perceptual Atoms have been created, various perceptual MindAgents come into play, taking perceptual schemata (schemata whose arguments are perceptual nodes or relations therebetween) and applying them to recently created Atoms (creating appropriate ExecutionLinks to store the results). The need to have special, often modality-specific perception MindAgents to do this, instead of just leaving it to the generic SchemaExecution MindAgent, has to do with computational efficiency, scheduling and parameter settings. Perception MindAgents do schema execution urgently, and do it with parameter settings tuned for perceptual processing. This means that, except in unusual circumstances, newly received stimuli will be processed immediately by the appropriate perceptual schemata.
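The core loop of such a perception MindAgent - apply each perceptual schema to each newly created perceptual Atom, recording the results - can be sketched as follows. Atoms are modeled as dictionaries and the (schema, atom, result) tuples stand in for ExecutionLinks; all names are illustrative.

```python
# Sketch of a perception MindAgent cycle: apply perceptual schemata to recently
# created perceptual Atoms. Atoms are dicts; the result tuples stand in for
# ExecutionLinks. All names here are illustrative.

def perception_cycle(recent_atoms, perceptual_schemata):
    """Apply each perceptual schema to each recently created Atom."""
    execution_links = []
    for atom in recent_atoms:
        for name, schema in perceptual_schemata.items():
            result = schema(atom)
            if result is not None:          # None: schema not applicable
                execution_links.append((name, atom["id"], result))
    return execution_links

schemata = {
    "is_red":  lambda a: a.get("color") == "red",
    "is_near": lambda a: a["distance"] < 2.0 if "distance" in a else None,
}
atoms = [{"id": "cube1", "color": "red", "distance": 1.5},
         {"id": "ball7", "color": "blue"}]
print(perception_cycle(atoms, schemata))
# [('is_red', 'cube1', True), ('is_near', 'cube1', True), ('is_red', 'ball7', False)]
```

In the actual design this loop runs urgently, with its own scheduling and parameters, which is the motivation given above for separating it from generic schema execution.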
Some newly formed perceptual Atoms will have links to existing Atoms, ready-made at their moment of creation. CharacterInstanceNodes and NumberInstanceNodes are examples; they are born linked to the appropriate CharacterNodes and NumberNodes. Of course, Atoms representing perceived relationships, perceived groupings, etc., will not have ready-made links and will have to grow such links via various cognitive processes. Also, the ContextFormation MindAgent looks at perceptual Atom creation events and creates ContextNodes accordingly; and this must be timed so that the ContextNodes are entered into the system rapidly, so that they can be used by the processes doing initial-stage link creation for new perceptual Atoms. In a full CogPrime configuration, newly created perceptual nodes and perceptual schemata may reside in special perception-oriented Units, so as to ensure that perceptual processes occur rapidly, not delayed by slower cognitive processes.

26.2.1 The ExperienceDB

Separate from the ordinary perception process, it may also be valuable for there to be a direct route from the system's sensory sources to a special "ExperienceDB" database that records all of the system's experience. This does not involve perceptual schemata at all, nor is it left up to the sensory source; rather, it is carried out by the CogPrime server at the point where it receives input from the sensory source. This experience database is a record of what the system has seen in the past, and may be mined by the system in the future for various purposes. The creation of new perceptual Atoms may also be stored in the experience database, but this must be handled with care as it can pose a large computational expense; it will often be best to store only a subset of these. Obviously, such an ExperienceDB is something that has no correlate in the human mind/brain.
This is a case where CogPrime takes advantage of the non-brainlike properties of its digital computer substrate. The CogPrime perception process is intended to work perfectly well without access to the comprehensive database of experiences potentially stored in the ExperienceDB. However, a complete record of a mind's experience is a valuable thing, and there seems no reason for the system not to exploit it fully. Advantages like this allow the CogPrime system to partially compensate for its lack of some of the strengths of the human brain as an AI platform, such as massive parallelism.

26.3 Interfacing CogPrime with a Virtual Agent

We now discuss some of the particularities of connecting CogPrime to a virtual world (such as Second Life, Multiverse, or Unity3D, to name some of the virtual world / gaming platforms to which OpenCog has already been connected in practice).
26.3.1 Perceiving the Virtual World

The most complex, high-bandwidth sensory data coming in from a typical virtual world is visual data, so that will be our focus here. We consider three modes in which a virtual world may present visual data to CogPrime (or any other system):

• Object vision: CogPrime receives information about polygonal objects and their colors, textures and coordinates (each object is a set of contiguous polygons, and sometimes objects have "type" information, e.g. cube or sphere)
• Polygon vision: CogPrime receives information about polygons and their colors, textures and coordinates
• Pixel vision: CogPrime receives information about pixels and their colors and coordinates

In each case, coordinates may be given either in "world coordinates" or in "relative coordinates" (relative to the gaze). This distinction is not a huge deal, since within an architecture like CogPrime supplying schemata for coordinate transformation is trivial; and, even if treated as a machine learning task, this sort of coordinate transformation is not very difficult to learn. Our current approach is to prefer relative coordinates, as this approach is more natural in terms of modern Western human psychology; but we note that in some other cultures world coordinates are preferred and considered more psychologically natural. Currently we have not yet done any work with pixel vision in virtual worlds. We have been using object vision for most of our experiments, and consider a combination of polygon vision and object vision as the "right" approach for early AGI experiments in a virtual worlds context. The problem with pure object vision is that it removes the possibility for CogPrime to understand object segmentation. If, for instance, CogPrime perceives a person as a single object, then how can it recognize a head as a distinct sub-object? Feeding the system a prefigured hierarchy of objects, sub-objects and so forth seems inappropriate in the context of an experiential learning system. On the other hand, the use of polygon vision instead of pixel vision seems to meet no such objections. This may take different forms on different platforms. For instance, in our work with a Minecraft-like world in the Unity3D environment, we have relied heavily on virtual objects made of blocks, in which case the polygons of most interest are the faces of the blocks. Momentarily sticking with the object vision case for simplicity, examples of the perceptions emanating from the virtual world perceptual preprocessor into CogPrime are things like:

1. I am at world-coordinates $W
2. Object with metadata $M is at world-coordinates $W
3. Part of object with metadata $M is at world-coordinates $W
4. Avatar with metadata $M is at world-coordinates $W
5. Avatar with metadata $M is carrying out animation $A
6. Statements in natural language, from the pet owner

The perceptual preprocessor takes these signals and translates them into Atoms, making use of the special Atomspace mechanisms for efficiently indexing spatial and temporal information, as appropriate.
26.3.1.1 Transforming Real-World Vision into Virtual Vision

One approach to enabling CogPrime to handle visual data coming from the real world is to transform this data into data of the type CogPrime sees in the virtual world. While this is not the approach we are taking in our current work, we do consider it a viable strategy, and we briefly describe it here. One approach along these lines would involve multiple phases:

• Use a camera eye and a LiDAR (Light Detection And Ranging, used for high-resolution topographic mapping) sensor in tandem, so as to avoid having to deal with stereo vision
• Using the above two inputs, create a continuous 3D contour map of the perceived visual world
• Use standard mathematical transforms to polygonize the 3D contour map into a large set of small polygons
• Use heuristics to merge together the small polygons, obtaining a smaller set of larger polygons (but retaining the large set of small polygons for the system to reference in cases where a high level of detail is necessary)
• Feed the polygons into the perceptual pattern mining subsystem, analogously to the polygons that come in from the virtual world

In this approach, preprocessing is used to make the system see the physical world in a manner analogous to how it sees the virtual world. This is quite different from the DeSTIN-based approach to CogPrime vision that we will discuss in Chapter 28, but may well also be feasible.

26.3.2 Acting in the Virtual World

Complementing the perceptual preprocessor is the action postprocessor: code that translates the actions and action-sequences generated by CogPrime into instructions the virtual world can understand (such as "launch thus-and-thus animation").
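The postprocessor's job can be sketched as a simple translation table from high-level action Atoms to virtual-world instructions. The command vocabulary and instruction strings below are made up for illustration; they are not the actual virtual-world protocol.

```python
# Sketch of an action postprocessor: translating CogPrime's high-level actions
# into virtual-world instructions (animation launches with parameters). The
# command vocabulary and output strings are illustrative.

def postprocess_action(action, world_queue):
    """Translate one (verb, args) action into a virtual-world instruction."""
    verb, args = action
    translations = {
        "move_forward": lambda d, s: f"walk_anim distance={d} speed={s}",
        "grab":         lambda obj:  f"grab_anim target={obj}",
    }
    if verb not in translations:
        raise ValueError(f"no virtual-world translation for {verb!r}")
    world_queue.append(translations[verb](*args))

queue = []
postprocess_action(("move_forward", (2.0, 1.0)), queue)
postprocess_action(("grab", ("red_cube",)), queue)
print(queue)
# ['walk_anim distance=2.0 speed=1.0', 'grab_anim target=red_cube']
```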
Due to the particularities of current virtual world architectures, the current OpenCogPrime system carries out actions via executing pre-programmed high-level procedures, such as "move forward one step", "bend over forward" and so forth. Example action commands are:

1. Move ($D, $S) : $D is a distance, $S is a speed
2. Turn ($A, $S) : $A is an angle, $S is a speed
3. Pitch ($A, $S) : turn vertically up/down... [for birds only]
4. Jump ($D, $H, $S) : $H is a maximum height, at the center of the jump
5. Say ($T) : $T is text; for agents with linguistic capability, which is not enabled in the current version
6. pick_up($O) : $O is an object
7. put_down($O)

This is admittedly a crude approach, and if a robot simulator rather than a typical virtual world were used, it would be possible for CogPrime to emanate detailed servomotor control commands rather than high-level instructions such as these. However, as noted in Chapter 16 of Part 1, at the moment there is no such thing as a "massive multiplayer robot simulator," and so the choice is between a multi-participant virtual environment (like the Multiverse environment
currently used with the PetBrain) or a small-scale robot simulator. Our experiments with virtual worlds so far have used the high-level approach described here; but we are also experimenting with using physical robots and corresponding simulators, as will be described below.

26.4 Perceptual Pattern Mining

Next we describe how perceptual pattern mining may be carried out, to recognize meaningful structures in the stream of data produced via perceiving a virtual or physical world. In this subsection we discuss the representation of knowledge, and then in the following subsection we discuss the actual mining. We discuss the process in the context of virtual-world perception as outlined above, but the same processes apply to robotic perception, whether one takes the "physical world as virtual world" approach described above or a different sort of approach such as the DeSTIN hybridization approach described below.

26.4.1 Input Data

First, we may assume that each perception is recorded as a set of "transactions", each of which is of the form

Time, 3D coordinates, object type

or

Time, 3D coordinates, action type

Each transaction may also come with an additional list of (attribute, value) pairs, where the list of attributes is dependent upon the object or action type. Transactions are represented as Atoms, and don't need to be a specific Atom type - they are referred to here by the special name "transactions" simply to make the discussion clear. Next, define a transaction template as a transaction with location and time information set to wild cards - and potentially, some other attributes set to wild cards as well. (These are implemented in terms of Atoms involving VariableNodes.) For instance, some transaction templates in the current virtual world might be informally represented as:

• Reward
• Red cube
• kick
• move_forward
• Cube
• Cube, size 5
• inc
• Teacher
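The transaction / template distinction just defined can be sketched directly: a template is a transaction whose time and coordinates (and possibly other attributes) are replaced by wildcards, and matching checks only the non-wildcard fields. The dictionary representation and names are illustrative, standing in for the Atom/VariableNode encoding mentioned above.

```python
# Sketch of transactions and transaction templates: a template is a transaction
# with time/coordinates (and possibly other attributes) set to wildcards.
# The dict representation stands in for Atoms with VariableNodes.

WILD = "*"

def make_transaction(time, coords, kind, **attrs):
    return {"time": time, "coords": coords, "kind": kind, **attrs}

def matches(template, transaction):
    """A transaction matches a template if every non-wildcard field agrees."""
    return all(v == WILD or transaction.get(k) == v
               for k, v in template.items())

t = make_transaction(17.0, (1, 2, 3), "object", type="cube", size=5)
cube_template = {"time": WILD, "coords": WILD, "kind": "object", "type": "cube"}
cube5_template = dict(cube_template, size=5)
print(matches(cube_template, t), matches(cube5_template, t))  # True True
print(matches(dict(cube_template, size=7), t))                # False
```

Note that "Cube" and "Cube, size 5" from the list above correspond to `cube_template` and `cube5_template`: the more attributes a template pins down, the fewer transactions it matches.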
26.4.2 Transaction Graphs

Next we may conceive a transaction graph, whose nodes are transactions and whose links are labeled with labels like after, SimAND, SSeqAND (short for SimultaneousSequentialAND), near, in_front_of, and so forth (and whose links are weighted as well). We may also conceive a transaction template graph, whose nodes are transaction templates, and whose links are the same as in the transaction graph. Examples of transaction template graphs are

near(Cube, Teacher)
SSeqAND(move_forward, Reward)

where Cube, Teacher, etc. are transaction templates, since Time and 3D coordinates are left unspecified. And finally, we may conceive a transaction template relationship graph (TTRG), whose nodes may be any of: transactions; transaction templates; basic spatiotemporal predicates evaluated at tuples of transactions or transaction templates. For instance

SimAND(near(Cube, Teacher), above(Cube, Chair))

26.4.3 Spatiotemporal Conjunctions

Define a temporal conjunction as a conjunction involving SimultaneousAND and SequentialAND operators (including SSeqAND as a special case of SeqAND: the special case that interests us in the short term). The conjunction is therefore ordered, e.g.

A SSeqAND B SimAND C SSeqAND D

We may assume that the order of operations favors SimAND, so that no parenthesizing is necessary. Next, define a basic spatiotemporal conjunction as a temporal conjunction that conjoins terms that are either

• transactions, or
• transaction templates, or
• basic spatiotemporal predicates applied to tuples of transactions or transaction templates

I.e., a basic spatiotemporal conjunction is a temporal conjunction of nodes from the transaction template relationship graph.
An example would be:

(hold ball) SimAND (near(me, teacher)) SSeqAND Reward

This assumes that the hold action has an attribute that is the type of object held, so that hold ball in the above temporal conjunction is a shorthand for the transaction template specified by

action type: hold
object_held_type: ball

This example says that if the agent is holding the ball and is near the teacher, then shortly after that, the agent will get a reward.
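Checking such a conjunction against a time-stamped stream of matched transactions can be sketched as follows. Events are (time, label) pairs, and the simultaneity window and sequential gap are illustrative parameters; the real system derives such strengths from weighted SimAND/SSeqAND links rather than hard thresholds.

```python
# Sketch of checking a temporal conjunction like
#   hold(ball) SimAND near(me, teacher) SSeqAND Reward
# against a time-stamped event stream. Events are (time, label) pairs;
# the tolerance windows are illustrative parameters.

def sim_and(events, a, b, window=0.5):
    """Times at which a and b occur (roughly) simultaneously."""
    return [ta for ta, la in events if la == a
            for tb, lb in events if lb == b and abs(ta - tb) <= window]

def sseq_and(times, events, c, max_gap=3.0):
    """Does c follow shortly after any of the given times?"""
    return any(0 < tc - t <= max_gap
               for t in times for tc, lc in events if lc == c)

stream = [(1.0, "hold_ball"), (1.2, "near_teacher"), (3.5, "reward")]
both = sim_and(stream, "hold_ball", "near_teacher")
print(both, sseq_and(both, stream, "reward"))  # [1.0] True
```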
26.4.4 The Mining Task

The perceptual mining task, then, is to find basic spatiotemporal conjunctions that are interesting. What constitutes interestingness is multifactorial, and includes whether the conjunction:

• involves important Atoms (e.g. Reward)
• has a high temporal cohesion (i.e. the strength of the time relationships embodied in the SimAND and SeqAND links is high)
• has a high spatial cohesion (i.e. the near() relationships have high strength)
• has a high frequency
• has a high surprise value (its frequency is far from what would be predicted by its component sub-conjunctions)

Note that a conjunction can be interesting without satisfying all these criteria; e.g. if it involves something important and has a high temporal cohesion, we want to find it regardless of its spatial cohesion. In preliminary experiments we have worked with a provisional definition of "interestingness" as the combination of frequency and temporal cohesion, but this must be extended; and one may even wish to have the combination function optimized over time (slowly), where the fitness function is defined in terms of the STI and LTI of the concepts generated.

26.4.4.1 A Mining Approach

One tractable approach to perceptual pattern mining is greedy and iterative, involving the following steps:

1. Build an initial transaction template graph G
2. Greedily mine some interesting basic spatiotemporal conjunctions from it, adding each interesting conjunction found as a new node in G (so that G becomes a transaction template relationship graph), repeating this step until boredom results or time runs out

The same TTRG may be maintained over time, but of course will require a robust forgetting mechanism once the history gets long or the environment gets nontrivially complex.
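A multifactorial interestingness score of the kind listed above can be sketched as a weighted combination of the individual criteria. The weights and the linear combination are illustrative choices (and, as noted, the combination function could itself be optimized over time); only the factor names come from the text.

```python
# Sketch of a multifactorial interestingness score for a candidate conjunction,
# combining importance, cohesion, frequency, and surprise. The weights and the
# linear combination are illustrative choices.

def interestingness(conj, weights=None):
    w = weights or {"importance": 0.3, "temporal_cohesion": 0.25,
                    "spatial_cohesion": 0.15, "frequency": 0.15, "surprise": 0.15}
    return sum(w[k] * conj.get(k, 0.0) for k in w)

def surprise(observed_freq, predicted_freq):
    """How far the observed frequency is from the component-based prediction."""
    return abs(observed_freq - predicted_freq) / max(predicted_freq, 1e-9)

candidate = {"importance": 0.9,          # involves Reward
             "temporal_cohesion": 0.8,   # strong SimAND/SeqAND links
             "spatial_cohesion": 0.2,    # weak near() relations -- still fine
             "frequency": 0.4,
             "surprise": surprise(0.4, 0.1)}
print(round(interestingness(candidate), 3))
```

Because the combination is a weighted sum rather than a conjunction of thresholds, a candidate weak on spatial cohesion but strong on importance and temporal cohesion still scores highly, matching the remark above.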
The greedy mining step may involve simply grabbing SeqAND or SimAND links with probability determined by the importance and/or interestingness of their targets, and by the probabilistic strength and temporal strength of the temporal AND relationship, and then creating conjunctions based on these links (which then become new nodes in the TTRG, so they can be built up into larger conjunctions).

26.5 The Perceptual-Motor Hierarchy

The perceptual pattern mining approach described above is "flat," in the sense that it simply proposes to recognize patterns in a stream of perceptions, without imposing any kind of explicitly hierarchical structure on the pattern recognition process or the memory of perceptual patterns. This is different from how the human visual system works, with its clear hierarchical
structure, and also different from many contemporary vision architectures, such as DeSTIN or Hawkins' Numenta system, which likewise utilize hierarchical neural networks. However, the approach described above may easily be made hierarchical within the CogPrime architecture, and this is likely the most effective way to deal with complex visual scenes. Most simply, in this approach, a hierarchy may be constructed corresponding to different spatial regions within the visual field. The RegionNodes at the lowest level of the hierarchy correspond to small spatial regions, the ones at the next level up correspond to slightly larger spatial regions, and so forth. Each RegionNode also corresponds to a certain interval of time, and there may be different RegionNodes corresponding to the same spatial region but with different time-durations attached to them. RegionNodes may correspond to overlapping rather than disjoint regions.

Within each region mapped by a RegionNode, then, perceptual pattern mining as defined in the previous section may occur. The patterns recognized in a region are linked to the corresponding RegionNode - and are then fed as inputs to the RegionNodes corresponding to larger, encompassing regions; and as suggestions-to-guide-pattern-recognition to nearby RegionNodes on the same level. This architecture involves the fundamental hierarchical structure/dynamic observed in the human visual cortex. Thus, the hierarchy induces a dynamic of patterns-within-patterns-within-patterns, and the heterarchy induces a dynamic of patterns-spawning-similar-patterns. Also, patterns found in a RegionNode should be used to bias the pattern-search in the RegionNodes corresponding to smaller, contained regions: for instance, if many of the sub-regions corresponding to a certain region have revealed parts of a face, then the pattern-mining processes in the remaining sub-regions may be instructed to look for other face-parts.
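The region hierarchy just described can be sketched as a quadtree over the visual field. The RegionNode class and the fixed four-way split below are illustrative assumptions; the text also allows overlapping regions and multiple time-durations per region, omitted here for brevity:

```python
class RegionNode:
    """A node covering a square spatial region over a time interval.
    Patterns mined within the region would be attached to `patterns`."""
    def __init__(self, level, bounds, duration):
        self.level = level          # 1 = smallest regions
        self.bounds = bounds        # (x, y, side_length)
        self.duration = duration    # time interval covered, in seconds
        self.children = []          # smaller, contained regions
        self.patterns = []          # patterns recognized in this region

def build_hierarchy(size, levels, duration):
    """Build a quadtree-style RegionNode hierarchy over a size x size field:
    each node above the bottom level has four children that partition it."""
    def build(level, x, y, s):
        node = RegionNode(level, (x, y, s), duration)
        if level > 1:
            half = s // 2
            for dx in (0, half):
                for dy in (0, half):
                    node.children.append(build(level - 1, x + dx, y + dy, half))
        return node
    return build(levels, 0, 0, size)

root = build_hierarchy(16, 3, duration=10)
```

Mined patterns attached to a node would then be fed upward to its parent and sideways to its siblings, per the scheme above.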
This architecture permits the hierarchical dynamics utilized in standard hierarchical vision models, such as Jeff Hawkins' and other neural net models, but within the context of CogPrime's pattern-mining approach to perception. It is a good example of the flexibility intrinsic to the CogPrime architecture. Finally, why have we called it a perceptual-motor hierarchy above? This is because, due to the embedding of the perceptual hierarchy in CogPrime's general Atom-network, the percepts in a certain region will automatically be linked to actions occurring in that region. So, there may be some perception-cognition-action interplay specific to a region, occurring in parallel with the dynamics in the hierarchy of multiple regions. Clearly this mirrors some of the complex dynamics occurring in the human brain, and is also reflected in the structure of sophisticated perceptual-motor approaches like DeSTIN, to be discussed below.

26.6 Object Recognition from Polygonal Meshes

Next we describe a more specific perceptual pattern recognition algorithm - a strategy for identifying objects in a visual scene that is perceived as a set of polygons. It is not a thoroughly detailed algorithmic approach, but rather a high-level description of how this may be done effectively within the CogPrime design. It is offered here largely as an illustration of how specialized perceptual data processing algorithms may be designed and implemented within the CogPrime framework.
We deal here with an agent whose perception of the world, at any point in time, is understood to consist of a set of polygons, each one described in terms of a list of corners. The corners may be assumed to be described in coordinates relative to the viewing eye of the agent.

What we mean by "identifying objects" here is something very simple. We don't mean identifying that a particular object is a chair, or is Ben's brown chair, or anything like that - we simply mean identifying that a given collection of polygons is meaningfully grouped into an object. That is the task considered here. The object could be a single block, it could be a person, or it could be a tower of blocks (which appears as a single object until it is taken apart). Of course, not all approaches to polygon-based vision processing would require this sort of phase: it would be possible, as an alternative, to simply compare the set of polygons in the visual field to a database of prior experience and then do object identification (in the present sense) based on this database-comparison. But in the approach described in this section, one begins instead with an automated segmentation of the set of perceived polygons into a set of objects.

26.6.1 Algorithm Overview

The algorithm described here falls into three stages:

1. Recognizing PersistentPolygonNodes (PPNodes) from PolygonNodes.
2. Creating Adjacency Graphs from PPNodes.
3. Clustering in the Adjacency Graph.

Each of these stages involves many details, not all of which have been fully resolved; this section just gives a conceptual overview. We will speak in terms of objects such as PolygonNode, PPNode and so forth, because inside the CogPrime AI engine, observed and conceived entities are represented as nodes in a graph.
However, this terminology is not very important here, and what we call a PolygonNode here could just as well be represented in a host of other ways within the overall CogPrime framework.

26.6.2 Recognizing PersistentPolygonNodes (PPNodes) from PolygonNodes

A PolygonNode represents a polygon observed at a point in time. A PPNode represents a series of PolygonNodes that are heuristically guessed to represent the same polygon at different moments in time. Before "object permanence" is learned, the heuristics for recognizing PPNodes will only work in the case of a persistent polygon that, over an interval of time, is experiencing relative motion within the visual field, but is never leaving the visual field. For example, a reasonable heuristic is:

If P1 occurs at time t, P2 occurs at time s where s is very close to t, and P1 and P2 are similar in shape, size, color and position, then P1 and P2 should be grouped together into the same PPNode.
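A minimal executable version of this heuristic follows. The dictionary representation of an observed polygon and all threshold values are assumptions of this sketch:

```python
import math

def same_persistent_polygon(p1, p2, max_dt=0.1, max_shift=0.2, size_tol=0.2):
    """Return True if two observed polygons plausibly belong to one PPNode:
    observed at nearby times, same shape (side count) and color, similar
    size, and only a small shift in position."""
    if abs(p1["time"] - p2["time"]) > max_dt:
        return False
    if p1["n_sides"] != p2["n_sides"] or p1["color"] != p2["color"]:
        return False
    if abs(p1["size"] - p2["size"]) > size_tol * p1["size"]:
        return False
    shift = math.dist(p1["position"], p2["position"])
    return shift <= max_shift
```

The more advanced heuristics mentioned next would relax individual clauses, e.g. dropping the color test for an object that is rapidly changing color.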
More advanced heuristics would deal carefully with the case where some of these similarities did not hold, which would allow us to deal e.g. with the case where an object was rapidly changing color. In the case where the polygons are coming from a simulation world like OpenSim, then from our positions as programmers and world-masters, we can see that a PPNode is supposed to correspond to a certain side of a certain OpenSim object; but it doesn't appear immediately that way to CogPrime when controlling an agent in OpenSim, since CogPrime isn't perceiving OpenSim objects, it's perceiving polygons. On the other hand, in the case where polygons are coming from software that postprocesses the output of a LiDAR-based vision system, the piecing together of PPNodes from PolygonNodes is really necessary.

26.6.3 Creating Adjacency Graphs from PPNodes

Having identified PPNodes, we may then draw a graph between PPNodes, a PPGraph (also called an "Adjacency Graph"), wherein the links are AdjacencyLinks (with weights indicating the degree to which the two PPNodes tend to be adjacent, over time). A more refined graph might also involve SpatialCoordinationLinks (with weights indicating the degree to which the vector between the centroids of the two PPNodes tends to be consistent over time). We may then use this graph to do object identification:

• First-level objects may be defined as clusters in the graph of PPNodes.
• One may also make a graph between first-level objects, an ObjectGraph with the same kinds of links as in the PPGraph. Second-level objects may be defined as clusters in the ObjectGraph.

The "strength" of an identified object may be assigned as the "quality" of the cluster (measured in terms of how tight the cluster is, and how well separated it is from other clusters). As an example, consider a robot with two parts: a body and a head.
The whole body may have a moderate strength as a first-level object, but the head and body individually will have significantly greater strengths as first-level objects. On the other hand, the whole body should have a pretty strong strength as a second-level object. It seems convenient (though not necessary) to have a PhysicalObjectNode type to represent the objects recognized via clustering; but the first-level versus second-level object distinction should not need to be made on the Atom type level.

Building the adjacency graph requires a mathematical formula defining what it means for two PPNodes to be adjacent. Creating this formula may require a little tuning. For instance, the adjacency between two PPNodes PP1 and PP2 may be defined as the average over time of the adjacency of the PolygonNodes PP1(t) and PP2(t) observed at each time t. (A p'th power average may be used here, and different values of p may be tried.) Then, the adjacency between two (simultaneous) PolygonNodes P1 and P2 may be defined as the average over all x in P1 of the minimum over all y in P2 of sim(x,y), where sim(,) is an appropriately scaled similarity function. This latter average could arguably be made a maximum; or perhaps even better a p'th power average with large p, which approximates a maximum. Here the p'th power average of x_1, ..., x_n is defined as ((x_1^p + ... + x_n^p)/n)^(1/p).
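The adjacency formula and the p'th power average can be written down directly. Here sim() is left abstract, and the choice of p is merely an example:

```python
def power_average(xs, p):
    """The p'th power average ((x1^p + ... + xn^p)/n)^(1/p).
    As p grows, this approaches max(xs)."""
    return (sum(x ** p for x in xs) / len(xs)) ** (1.0 / p)

def polygon_adjacency(corners1, corners2, sim, p=None):
    """Adjacency between two simultaneous PolygonNodes: the average over
    x in P1 of the minimum over y in P2 of sim(x, y). If p is given, the
    outer average becomes a p'th power average, which for large p
    approximates the maximum discussed in the text."""
    vals = [min(sim(x, y) for y in corners2) for x in corners1]
    return power_average(vals, p) if p is not None else sum(vals) / len(vals)
```

The PPNode-level adjacency would then be a (possibly p'th power) average of polygon_adjacency over the observation times shared by PP1 and PP2.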
26.6.4 Clustering in the Adjacency Graph

As noted above, the idea is that objects correspond to clusters in the adjacency graph. This means we need to implement some hierarchical clustering algorithm that is tailored to find clusters in symmetric weighted graphs. Probably some decent algorithms of this character exist; if not, it would be fairly easy to define one, e.g. by adapting some standard hierarchical clustering algorithm to deal with graphs rather than vectors.

Clusters will then be mapped into PhysicalObjectNodes, interlinked appropriately via PhysicalPartLinks and AdjacencyLinks. (E.g. there would be a PhysicalPartLink between the PhysicalObjectNode representing a head and the PhysicalObjectNode representing a body, where the body is considered as including the head.)

26.6.5 Discussion

It seems probable that, for simple scenes consisting of a small number of simple objects, clustering for object recognition will be fairly unproblematic. However, there are two cases that are potentially tricky:

• Sub-objects: e.g. the head and torso of a body, which may move separately; or the nose of the head, which may wiggle; or the legs of a walking dog; etc.
• Coordinated objects: e.g. if a character's hat is on a table, and then later on his head, then when it's on his head we basically want to consider him and his hat as the same object, for some purposes.

These examples show that partitioning a scene into objects is a borderline-cognitive rather than purely lower-level-perceptual task, which cannot be hard-wired in any very simple way. We also note that, for complex scenes, clustering may not work perfectly for object recognition and some reasoning may be needed to aid with the process. Intuitively, these may correspond to scenes that, in human perceptual psychology, require conscious attention and focus in order to be accurately and usefully perceived.
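The clustering step of Section 26.6.4 can be caricatured very simply: keep AdjacencyLinks above a threshold and take connected components. This is only a stand-in for a proper hierarchical graph clustering algorithm, but it illustrates how clusters in a symmetric weighted graph become candidate objects:

```python
def threshold_clusters(nodes, edges, threshold):
    """Cluster a symmetric weighted graph: union-find over all edges whose
    weight meets the threshold, then collect the connected components.
    edges: list of (node_a, node_b, weight)."""
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n
    for a, b, w in edges:
        if w >= threshold:
            parent[find(a)] = find(b)
    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return list(clusters.values())
```

Varying the threshold yields a hierarchy of clusterings, loosely corresponding to the first-level versus second-level object distinction above.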
26.7 Interfacing the Atomspace with a Deep Learning Based Perception-Action Hierarchy

We have discussed how one may do perception processing such as object recognition within the Atomspace, and this is indeed a viable strategy. But an alternate approach is also interesting, and likely more valuable in the case of robotic perception/action: build a separate perceptual-motor hierarchy, and link it in with the Atomspace. This approach is appealing in large part because a lot of valuable and successful work has already been done using neural networks and related architectures for perception and actuation. And it is not necessarily contradictory to doing perception processing in the Atomspace - obviously, one may have complementary, synergetic perception processing occurring in two different parts of the architecture.
This section reviews some general ideas regarding the interfacing of CogPrime with deep learning hierarchies for perception and action; the following chapter then discusses one example of this in detail, involving the DeSTIN deep learning architecture.

26.7.1 Hierarchical Perception/Action Networks

CogPrime could be integrated with a variety of different hierarchical perception/action architectures. For the purpose of this section, however, we will consider a class of architectures that is neither completely general nor extremely specific. Many of the ideas to be presented here are in fact more broadly applicable beyond the architecture described here. The following assumptions will be made about the HPANs (Hierarchical Perception/Action Networks) to be hybridized with CogPrime. It may be best to use multiple HPANs, at least one for declarative/sensory/episodic knowledge (we'll call this the "primary HPAN") and one for procedural knowledge. A HPAN for intentional knowledge (a goal hierarchy; in DeSTIN called the "critic hierarchy") may be valuable as well. We assume that each HPAN has the following properties:

1. It consists of a network of nodes, endowed with a learning algorithm, whose connectivity pattern is largely but not entirely hierarchical (and whose hierarchy contains feedback, feedforward and lateral connections)
2. It contains a set of input nodes, receiving perceptual inputs, at the bottom of the hierarchy
3. It has a set of output nodes, which may span multiple levels of the hierarchy. The "output nodes" send informational signals to cognitive processes lying outside the HPAN, or else control signals to actuators, which may be internal or external.
4. Other nodes besides I/O nodes may potentially be observed or influenced by external processes; for instance they may receive stimulation
5.
Link weights in the HPAN get updated via some learning algorithm that is, roughly speaking, "statistically Hebbian," in the sense that on the whole, when a set of nodes gets activated together for a period of time, they will tend to become attractors. By an attractor we mean: a set S of nodes such that the activation of a subset of S during a brief interval tends to lead to the activation of the whole set S during a reasonably brief interval to follow
6. As an approximate but not necessarily strict rule, nodes higher in the hierarchy tend to be involved in attractors corresponding to events or objects localized in larger spacetime regions

Examples of specific hierarchical architectures broadly satisfying these requirements are the visual pattern recognition networks constructed by Hawkins [HB06] and [PCP01], and Arel's DeSTIN system discussed earlier (and in more depth in following chapters). The latter appears to fit the requirements particularly snugly, due to having dynamics very well suited to the formation of a complex array of attractors, and a richer methodology for producing outputs. These are all not only HPANs but have a more particular structure that in Chapter 27 is called a Compositional Spatiotemporal Deep Learning Network or CSDLN.

The particulars of the use of HPANs with OpenCog are perhaps best explained via enumeration of memory types and control operations.
26.7.2 Declarative Memory

The key idea here is linkage of primary HPAN attractors to CogPrime ConceptNodes via MemberLinks. This is in accordance with the notion of glocal memory, in the language of which the HPAN attractors are the maps and the corresponding ConceptNodes are the keys. Put simply, when a HPAN attractor is recognized, MemberLinks are created between the HPAN nodes comprising the main body of the attractor and a ConceptNode in the AtomTable representing the attractor. MemberLink weights may be used to denote fuzzy attractor membership. Activation may spread from HPAN nodes to ConceptNodes, and STI may spread from ConceptNodes to HPAN nodes; a conversion rate between HPAN activation and STI currency must be maintained by the CogPrime central bank (see Chapter 23), for ECAN purposes.

Both abstract and concrete knowledge may be represented in this way. For instance, the Eiffel Tower would correspond to one attractor, the general shape of the Eiffel Tower would correspond to another, and the general notion of a "tower" would correspond to yet another. As these three examples are increasingly abstract, the corresponding attractors would be weighted increasingly heavily on the upper levels of the hierarchy.

26.7.3 Sensory Memory

CogPrime may also use its primary HPAN to store memories of sense-perceptions and low-level abstractions therefrom. MemberLinks may join concepts in the AtomTable to percept-attractors in the HPAN. If the HPAN is engineered to associate specific neural modules to specific spatial regions or specific temporal intervals, then this may be accounted for by automatically indexing ConceptNodes corresponding to attractors centered in those modules in the AtomTable's TimeServer and SpaceServer objects, which index Atoms according to time and space.
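The map/key linkage and the activation/STI exchange can be sketched as follows. The fuzzy weights, the proportional spreading rule, and the fixed conversion rate are placeholder assumptions, not the actual ECAN mechanism:

```python
def spread_sti_to_attractor(concept_sti, member_weights, rate=0.5):
    """Convert a fraction of a key ConceptNode's STI into HPAN activation,
    split across the attractor's nodes in proportion to MemberLink weight.
    `rate` plays the role of the central bank's STI/activation exchange rate."""
    total = sum(member_weights.values())
    return {node: rate * concept_sti * w / total
            for node, w in member_weights.items()}

# Hypothetical fuzzy MemberLink weights from a "tower" key ConceptNode
# to the HPAN nodes comprising the corresponding attractor:
tower_members = {"hpan_node_17": 1.0, "hpan_node_42": 0.5}
activation = spread_sti_to_attractor(90.0, tower_members)
```

Spreading in the opposite direction (HPAN activation stimulating the key ConceptNode's STI) would use the inverse of the same rate.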
An attractor representing something specific like the Eiffel Tower, or Bob's face, would be weighted largely in the lower levels of the hierarchy, and would correspond mainly to sensory rather than conceptual memory.

26.7.4 Procedural Memory

The procedural HPAN may be used to learn procedures, such as low-level motion primitives, that are more easily learned using HPAN training than using more abstract procedure learning methods. For example, a Combo tree learned by MOSES in CogPrime might contain a primitive corresponding to the predicate-argument relationship pick_up(ball); but the actual procedure for controlling a robot hand to pick up a ball might be expressed as an activity pattern within the low-level procedural HPAN. A procedure P stored in the low-level procedural HPAN would be represented in the AtomTable as a ConceptNode C linked to key nodes in the HPAN attractor corresponding to P. The invocation of P would be accomplished by transferring STI currency to C and then allowing ECAN to do its work.

On the other hand, CogPrime's interfacing of the high-level procedural HPAN with the CogPrime ProcedureRepository is intimately dependent on the particulars of the MOSES procedure learning algorithm. As will be outlined in more depth in Chapter 33, MOSES is a complex, multi-stage process that tries to find a program maximizing some specified fitness function, and that involves doing the following within each "deme" (a deme being an island of roughly-similar programs):

1. casting program trees into a hierarchical normal form
2. evaluating the program trees on a fitness function
3. building a model distinguishing fit versus unfit program trees, which involves: 3a. figuring out what program tree features the model should include; 3b. building the model using a learning algorithm
4. generating new program trees that are inferred likely to give high fitness, based on the model
5. returning to step 1 with these new program trees

There is also a system for managing the creation and deletion of demes. The weakest point in CogPrime's current MOSES-based approach to procedure learning appears to be step 3. And the main weakness is conceptual rather than algorithmic; what is needed is to replace the current step 3 with something that uses long-term memory to do model-building and feature-selection, rather than (like the current code) doing these things in a manner that's restricted to the population of program trees being evolved to optimize a particular fitness function. One promising approach to resolving this issue is to replace step 3b (and, to a limited extent, 3a) with an interconnection between MOSES and a procedural HPAN. A HPAN can do supervised categorization, and can be designed to handle feature selection in a manner integrated with categorization, and also to integrate long-term memory into its categorization decisions.

26.7.5 Episodic Memory

In a hybrid CogPrime/HPAN architecture, episodic knowledge may be handled via a combination of:

1.
using a traditional approach to store a large ExperienceDB of actual experienced episodes (including sensory inputs and actions; and also the states of the most important items in memory during the experience)
2. using the Atomspace (with its TimeServer and SpaceServer components) to store declarative knowledge about experiences
3. using dimensional embedding to index the AtomSpace's episodic knowledge in a spatiotemporally savvy way, as described in Chapter 40
4. training a large HPAN to summarize the scope of experienced episodes (this could be the primary HPAN used for declarative and sensory memory, or could potentially be a separate episodic HPAN)

Such a network should be capable of generating imagined episodes based on cues, as well as recalling real episodes. The HPAN would serve as a sort of index into the memory of episodes. There would be HebbianLinks from the AtomTable into the episodic HPAN.
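The index role can be caricatured in a few lines: each episode is stored along with the concepts that were highly active when it occurred, and a cue set retrieves the episode whose stored concept set best overlaps it. This toy dictionary-based index is purely illustrative of cue-driven recall, not of the HPAN's attractor dynamics:

```python
def recall_episode(episodes, cues):
    """episodes: list of (episode_name, active_concept_set) pairs.
    Return the episode whose active concepts best overlap the cue set,
    mimicking recall by simultaneous stimulation of cue concepts."""
    return max(episodes, key=lambda e: len(e[1] & cues))[0]

# Hypothetical episode index: concepts active while each episode occurred.
episodes = [
    ("built_tall_tower", {"tall", "tower", "blocks"}),
    ("chased_ball", {"ball", "run", "outdoors"}),
]
```

Stimulating "tall" and "tower" together thus retrieves the tower-building episode, as in the example below.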
For instance, suppose that once the agent built an extremely tall tower of blocks, taller than any others in its memory. Perhaps it wants to build another very tall tower again, so it wants to summon up the memory of that previous occasion, to see if there is possibly guidance therein. It then proceeds by thinking about tallness and towerness at the same time, which stimulates the relevant episode, because at the time of building the extremely tall tower, the agent was thinking a lot about tallness (so thoughts of tallness are part of the episodic memory).

26.7.6 Action Selection and Attention Allocation

CogPrime's action selection mechanism chooses procedures based on which ones are estimated most likely to achieve current goals given current context, and places these in an "active procedure pool" where an ExecutionManager object mediates their execution. Attention allocation spans all components of CogPrime, including an HPAN if one is integrated. Attention flows between the two components due to the conversion of STI to and from HPAN activation. And in this manner, assignment of credit flows from GoalNodes into the HPAN, because this kind of simultaneous activation may be viewed as "rewarding" a HPAN link. So, the HPAN may receive reward signals from GoalNodes via ECAN, because when a ConceptNode gets rewarded, if the ConceptNode points to a set of nodes, these nodes get some of the reward.

26.8 Multiple Interaction Channels

Now we discuss a broader issue regarding the interfacing between CogPrime and the external world. The only currently existing embodied OpenCog applications, PetBrain and CogBot, are based on a loosely human model of perception and action, in which a single CogPrime instance controls a single mobile body; but this of course is not the only way to do things.
More generally, what we can say is that a variety of external-world events come into a CogPrime system from physical or virtual world sensors, plus from other sources such as database interfaces and Web spiders. The external systems providing CogPrime with data may be generically referred to as sensory sources (and in the terminology we adopt here, once Atoms have been created to represent external data, then one is dealing with perceptions rather than sensations). The question arises how to architect a CogPrime system, in general, for dealing with a variety of sensory sources.

We introduce the notion of an "interaction channel": a collection of sensory sources that is intended to be considered as a whole as a synchronous stream, and that is also able to receive CogPrime actions - in the sense that when CogPrime carries out actions relative to the interaction channel, this directly affects the perceptions that CogPrime receives from the interaction channel. A CogPrime meant to have conversations with 10 separate users at once might have 10 interaction channels. A human mind has only one interaction channel in this sense (although humans may become moderately adept at processing information from multiple external-world sources coming in through the same interaction channel). Multiple-interaction-channel digital psychology may become extremely complex - and hard for us, with our single interaction channels, to comprehend. This is one among many cases
where a digital mind, with its more flexible architecture, will have a clear advantage over our human minds with their fixed and limited neural architectures. For simplicity, however, in the following chapters we will often focus on the single-interaction-channel case.

Events coming in through an interaction channel are presented to the system as new perceptual Atoms, and relationships amongst these. In the multiple interaction channel case, the AttentionValues of these newly created Atoms require special treatment. Not only do they require special rules, they require additional fields to be added to the AttentionValue object, beyond what has been discussed so far. We require newly created perceptual Atoms to be given a high initial STI. And we also require them to be given a high amount of a quantity called "interaction-channel STI." To support this, the AttentionValue objects of Atoms must be expanded to contain interaction-channel STI values; and the ImportanceUpdating MindAgent must compute interaction-channel importance separately from ordinary importance. And, just as we have channel-specific AttentionValues, we may also have channel-specific TruthValues. This allows the system to separately account for the frequency of a given perceptual item in a given interaction channel. However, no specific mechanism is needed for these; they are merely contextual truth values, to be interpreted within a ContextNode associated with the interaction channel.
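The extended AttentionValue described above can be sketched as a small data structure. The field names and the initial-STI value are illustrative assumptions, not the OpenCog implementation:

```python
from dataclasses import dataclass, field

@dataclass
class AttentionValue:
    """AttentionValue extended with per-interaction-channel STI, alongside
    the ordinary STI and LTI fields discussed in earlier chapters."""
    sti: float = 0.0
    lti: float = 0.0
    channel_sti: dict = field(default_factory=dict)  # channel id -> STI

def on_new_percept(av, channel, initial_sti=100.0):
    """New perceptual Atoms get both a high initial STI and a high
    interaction-channel STI for the channel they arrived on."""
    av.sti += initial_sti
    av.channel_sti[channel] = av.channel_sti.get(channel, 0.0) + initial_sti
    return av
```

An ImportanceUpdating process would then decay and normalize each channel_sti entry separately from the ordinary STI.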
Chapter 27
Integrating CogPrime with a Compositional Spatiotemporal Deep Learning Network

27.1 Introduction

Many different approaches to "low-level" perception and action processing are possible within the overall CogPrime framework. We discussed several in the previous chapter, all elaborations of the general hierarchical pattern recognition approach. Here we describe one sophisticated approach to hierarchical pattern recognition based perception in more detail: the tight integration of CogPrime with a sophisticated hierarchical perception/action oriented learning system such as the DeSTIN architecture reviewed in Chapter 4 of Part 1.

We introduce here the term "Compositional Spatiotemporal Deep Learning Network" (CSDLN), to refer to deep learning networks whose hierarchical structure directly mirrors the hierarchical structure of spacetime. In the language of Chapter 26, a CSDLN is a special kind of HPAN (hierarchical perception/action network), which has the special property that each of its nodes refers to a certain spatiotemporal region and is concerned with predicting what happens inside that region. Current exemplifications of the CSDLN paradigm include Itamar Arel's DeSTIN [ARC09a], which we will focus on here, along with Jeff Hawkins' Numenta "HTM" system [HB06]¹, Arel's HDRN² system (the proprietary, closed-source sibling of DeSTIN), Dileep George's spin-off from Numenta³, and work by Mohamad Tarifi [TSH11], Bundzel and Hashimoto [BH10], and others.
CSDLNs are reasonably well proven as an approach to intelligent sensory data processing, and have also been hypothesized as a broader foundation for artificial general intelligence at the human level and beyond [HB06, ABC09a]. While CSDLNs have been discussed largely in the context of perception, the specific form of CSDLN we will pursue here goes beyond perception processing, and involves the coupling of three separate hierarchies, for perception, action and goals/reinforcement [G+10]. The "action" CSDLNs discussed here correspond to the procedural HPAN discussed in Chapter 26. Abstract learning and self-understanding are then hypothesized as related to systems of attractors emerging from the close dynamic coupling of the upper levels of the three hierarchies.

¹ While the Numenta system is the best-known CSDLN architecture, other CSDLNs appear more impressively functional in various respects; and many CSDLN-related ideas existed in the literature well before Numenta's advent.
² http://binatix.com
³ http://vicarioussystems.com
DeSTIN is our paradigm case of this sort of CSDLN, but most of the considerations given here would apply to any CSDLN of this general character. CSDLNs embody a certain conceptual model of the nature of intelligence, and to integrate them appropriately with a broader architecture, one must perform the integration not only on the level of software code but also on the level of conceptual models.

Here we focus on the problem of integrating an extended version of the DeSTIN CSDLN system with the CogPrime integrative AGI (artificial general intelligence) system. The crux of the issue is how to map DeSTIN's attractors into CogPrime's more abstract, probabilistic "weighted, labeled hypergraph" representation (called the Atomspace). The main conclusion reached is that in order to perform this mapping in a conceptually satisfactory way, one requires a system of hierarchies involving the structure of DeSTIN's network but the semantic structures of the Atomspace. The DeSTIN perceptual hierarchy is augmented by motor and goal hierarchies, leading to a tripartite "extended DeSTIN". In this spirit, three "semantic-perceptual" hierarchies are proposed, corresponding to the three extended-DeSTIN CSDLN hierarchies and explicitly constituting an intermediate level of representation between attractors in DeSTIN and the habitual cognitive usage of CogPrime Atoms and Atom-networks. For simple reference we refer to this as the "Semantic CSDLN" approach. A "tripartite semantic CSDLN" consisting of interlinked semantic perceptual, motoric and goal hierarchies could be coupled with DeSTIN or another CSDLN architecture to form a novel AGI approach; or (our main focus here) it may be used as glue between a CSDLN and a more abstract semantic network such as the cognitive Atoms in CogPrime's Atomspace.
One of the core intuitions underlying this integration is that, in order to achieve the desired level of functionality for tasks like picture interpretation and assembly of complex block structures, a convenient route is to perform a fairly tight integration of a highly capable CSDLN like DeSTIN with other CogPrime components. For instance, we believe it's necessary to go deeper than just using DeSTIN as an input/output layer for CogPrime, by building associative links between the nodes inside DeSTIN and those inside the Atomspace.

This "tightly linked integration" approach is obviously an instantiation of the general cognitive synergy principle, which hypothesizes particular properties that the interactions between components in an integrated AGI system should display, in order for the overall system to display significant general intelligence using limited computational resources. Simply piping output from a CSDLN to other components, and issuing control signals from these components to the CSDLN, is likely an inadequate mode of integration, incapable of leveraging the full potential of CSDLNs; what we are suggesting here is a much tighter and more synergetic integration.

In terms of the general principle of mind-world correspondence, the conceptual justification for CSDLN/CogPrime integration would be that the everyday human world contains many compositional spatiotemporal structures relevant to human goals, but also contains many relevant patterns that are not most conveniently cast into a compositional spatiotemporal hierarchy. Thus, in order to most effectively perceive, remember, represent, manipulate and enact the full variety of relevant patterns in the world, it is sensible to have a cognitive structure containing a CSDLN as a significant component, but not the only component.
27.2 Integrating CSDLNs with Other AI Frameworks

CSDLNs represent knowledge as attractor patterns spanning multiple levels of hierarchical networks, supported by nonlinear dynamics and (at least in the case of the overall DeSTIN design) involving cooperative activity of perceptual, motor and control networks. These attractors are learned and adapted via a combination of methods including localized pattern recognition algorithms and probabilistic inference. Other AGI paradigms represent and learn knowledge in a host of other ways. How then can CSDLNs be integrated with these other paradigms? A very simple form of integration, obviously, would be to use a CSDLN as a sensorimotor cortex for another AI system that's focused on more abstract cognition. In this approach, the CSDLN would stream state-vectors to the abstract cognitive system, and the abstract cognitive system would stream abstract cognitive inputs to the CSDLN (which would then consider them together with its other inputs). One thing missing in this approach is the possibility of the abstract cognitive system's insights biasing the judgments inside the CSDLN. Also, abstract cognition systems aren't usually well prepared to handle a stream of quantitative state vectors (even ones representing intelligent compressions of raw data). An alternate approach is to build a richer intermediate layer, which in effect translates between the internal language of the CSDLN and the internal language of the other AI system involved. The particulars, and the viability, of this will depend on the particulars of the other AI system. What we'll consider here is the case where the other AI system contains explicit symbolic representations of patterns (including patterns abstracted from observations that may have no relation to its prior knowledge or any linguistic terms).
In this case, we suggest, a viable approach may be to construct a "semantic CSDLN" to serve as an intermediary. The semantic CSDLN has the same hierarchical structure as a CSDLN, but inside each node it contains abstract patterns rather than numerical vectors. This approach has several potential major advantages: the other AI system is not presented with a large volume of numerical vectors (which it may be unprepared to deal with effectively); the CSDLN can be guided by the other AI system, without needing to understand symbolic control signals; and the intermediary semantic CSDLN can serve as a sort of "blackboard" which the CSDLN and the other AI system can update in parallel, and be guided by in parallel, thus providing a platform encouraging "cognitive synergy". The following sections go into more detail on the concept of semantic CSDLNs. The discussion mainly concerns the specific context of DeSTIN/CogPrime integration, but the core ideas would apply to the integration of any CSDLN architecture with any other AI architecture involving uncertain symbolic representations susceptible to online learning.

27.3 Semantic CSDLN for Perception Processing

In the standard perceptual CSDLN hierarchy, a node N on level k (considering level 1 as the bottom) corresponds to a spatiotemporal region S with size s_k (s_k increasing monotonically and usually exponentially with k), and has children on level k-1 corresponding to spatiotemporal regions that collectively partition S. For example, a node on level 3 might correspond to a 16x16 pixel region S of 2D space over a time period of 10 seconds, and might have 4 level 2 children corresponding to disjoint 8x8 regions of 2D space over 10 seconds, collectively composing S.
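To make the partition structure concrete, here is a minimal sketch of the kind of quadtree hierarchy described above, in which each level-k node covers a square region that its four level-(k-1) children tile exactly. The function and field names are illustrative assumptions, not DeSTIN's actual API.

```python
# Hypothetical sketch of the regular spatial partition described in the text:
# a level-k node covers a square region, and its four level-(k-1) children
# tile that region exactly. (The temporal dimension is omitted for brevity.)

def child_regions(x, y, size):
    """Split a square region (top-left (x, y), side `size`) into 4 quadrants."""
    h = size // 2
    return [(x, y, h), (x + h, y, h), (x, y + h, h), (x + h, y + h, h)]

def build_hierarchy(x, y, size, level):
    """Recursively build the quadtree of (region, level) nodes down to level 1."""
    node = {"region": (x, y, size), "level": level, "children": []}
    if level > 1:
        node["children"] = [build_hierarchy(cx, cy, cs, level - 1)
                            for cx, cy, cs in child_regions(x, y, size)]
    return node

root = build_hierarchy(0, 0, 16, 3)            # a level-3 node over a 16x16 region
assert len(root["children"]) == 4
assert root["children"][0]["region"] == (0, 0, 8)  # each child covers an 8x8 quadrant
```

Note that the four children of a 16x16 region are necessarily 8x8, which is why the partition is exact at every level.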
This kind of hierarchy is very effective for recognizing certain types of visual patterns. However it is cumbersome for recognizing some other types of patterns, e.g. the pattern that a face typically contains two eyes beside each other, but at variable distance from each other. One way to remedy this deficiency is to extend the definition of the hierarchy, so that nodes do not refer to fixed spatial or temporal positions, but only to relative positions. In this approach, the internals of a node are basically the same as in a CSDLN, and the correspondence of the nodes on level k with regions of size s_k is retained, but the relationships between the nodes are quite different. For instance, a variable-position node of this sort could contain several possible 2D pictures of an eye, but be nonspecific about where the eye is located in the 2D input image. Figure 27.1 depicts this "semantic-perceptual CSDLN" idea heuristically, showing part of a semantic-perceptual CSDLN indicating the parts of a face, and also the connections between the semantic-perceptual CSDLN, a standard perceptual CSDLN, and a higher-level cognitive semantic network like CogPrime's Atomspace. More formally, in the suggested "semantic-perceptual CSDLN" approach, a node N on level k, instead of pointing to a set of level k-1 children, points to a small (but not necessarily connected) semantic network, such that the nodes of the semantic network are (variable-position) level k-1 nodes, and the edges of the semantic network possess labels representing spatial or temporal relationships, for example horizontally_aligned, vertically_aligned, right_side, left_side, above, behind, immediately_right, immediately_left, immediately_above, immediately_below, after, immediately_after.
The edges may also be weighted either with numbers or probability distributions, indicating the quantitative weight of the relationship indicated by the label. So for example, a level 3 node could have a child network of the form horizontally_aligned(N1, N2), where N1 and N2 are variable-position level 2 nodes. This would mean that N1 and N2 are along the same horizontal axis in the 2D input but don't need to be immediately next to each other. Or one could say, e.g., on_axis_perpendicular_to(N1, N2, N3, N4), meaning that N1 and N2 are on an axis perpendicular to the axis between N3 and N4. It may be that the latter sort of relationship is fundamentally better in some cases, because horizontally_aligned is still tied to a specific orientation in an absolute space, whereas on_axis_perpendicular_to is fully relative. But it may be that both sorts of relationship are useful. Next, development of learning algorithms for semantic CSDLNs seems a tractable research area. First of all, it would seem that, for instance, the DeSTIN learning algorithms could straightforwardly be utilized in the semantic CSDLN case, once the local semantic networks involved in the network are known. So at least for some CSDLN designs, the problem of learning the semantic networks may be decoupled somewhat from the learning occurring inside the nodes. DeSTIN nodes deal with clustering of their inputs, and calculation of probabilities based on these clusters (and based on the parent node states). The difference between the semantic CSDLN and the traditional DeSTIN CSDLN has to do with what the inputs are. Regarding learning the local semantic networks, one relatively straightforward approach would be to data mine them from a standard CSDLN.
Fig. 27.1: Simplified depiction of the relationship between a semantic-perceptual CSDLN, a traditional perceptual CSDLN (like DeSTIN), and a cognitive semantic network (like CogPrime's AtomSpace). The perceptual CSDLN shown is unrealistically small for complex vision processing (only 4 layers), and only a fragment of the semantic-perceptual CSDLN is shown (a node corresponding to the category face, and then a child network containing nodes corresponding to several components of a typical face). In a real semantic-perceptual CSDLN, there would be many other nodes on the same level as the face node, many other parts to the face subnetwork besides the eyes, nose and mouth depicted here; the eye, nose and mouth nodes would also have child subnetworks; there would be links from each semantic node to centroids within a large number of perceptual nodes; and there would also be many nodes not corresponding clearly to any single English language concept like eye, nose, face, etc.

That is, if one runs a standard CSDLN on a stream of inputs, one can then run a frequent pattern mining algorithm to find semantic networks (using a given vocabulary of semantic relationships) that occur frequently in the CSDLN as it processes input. A subnetwork identified via this sort of mining can then be grouped together in the semantic CSDLN, and a parent node can be created and pointed to it. Also, the standard CSDLN can be searched for frequent patterns involving the clusters (referring to DeSTIN here, where the nodes contain clusters of input sequences) inside the nodes in the semantic CSDLN. Thus, in the "semantic DeSTIN" case, we have a feedback interaction wherein:

1) the standard CSDLN is formed via processing input;
2) frequent pattern mining on the standard CSDLN is used to create subnetworks and corresponding parent nodes in the semantic CSDLN;
3) the newly created nodes in the semantic CSDLN get their internal clusters updated via standard DeSTIN dynamics;
4) the clusters in the semantic nodes are used as seeds for frequent pattern mining on the standard CSDLN, returning us to Step 2 above.

After the semantic CSDLN is formed via mining the perceptual CSDLN, it may be used to bias the further processing of the perceptual CSDLN. For instance, in DeSTIN each node carries out probabilistic calculations involving knowledge of the prior probability of the "observation" coming into that node over a given interval of time.
In the current DeSTIN version, this prior probability is drawn from a uniform distribution, but it would be more effective to draw the prior probability from the semantic network - observations matching things represented in the semantic network would get a higher prior probability. One could also use subtler strategies
such as using imprecise probabilities in DeSTIN [Goe14], and assigning a greater confidence to probabilities involving observations contained in the semantic network. Finally, we note that the nodes and networks in the semantic CSDLN may either

• be linked into the nodes and links in a semantic network such as CogPrime's AtomSpace
• actually be implemented in terms of an abstract semantic network language like CogPrime's AtomSpace (the strategy to be suggested in Chapter 29).

This allows us to think of the semantic CSDLN as a kind of bridge between the standard CSDLN and the cognitive layer of an AI system. In an advanced implementation, the cognitive network may be used to suggest new relationships between nodes in the semantic CSDLN, based on knowledge gained via inference or language.

27.4 Semantic CSDLN for Motor and Sensorimotor Processing

Next we consider a semantic CSDLN that focuses on movement rather than sensation. In this case, rather than a 2D or 3D visual space, one is dealing with an n-dimensional configuration space (C-space). This space has one dimension for each degree of freedom of the agent in question. The more joints with more freedom of movement an agent has, the higher the dimensionality of its configuration space. Using the notion of configuration space, one can construct a semantic-motoric CSDLN hierarchy analogous to the semantic-perceptual CSDLN hierarchy. However, the curse of dimensionality demands a thoughtful approach here. A square of side 2 can be tiled with 4 squares of side 1, but a 50-dimensional cube of side 2 can only be tiled with 2^50 50-dimensional cubes of side 1. If one is to build a CSDLN hierarchy in configuration space analogous to that in perceptual space, some sort of sparse hierarchy is necessary.
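The tiling counts behind this curse-of-dimensionality argument can be checked directly; the short sketch below is purely illustrative arithmetic, not part of any CSDLN implementation.

```python
# Halving the side of a d-dimensional cube multiplies the number of unit
# tiles by 2**d: the tile count is (side ratio) ** dimension.

def tiles(side_ratio, dim):
    """Number of unit sub-cubes tiling a `dim`-dimensional cube of side `side_ratio`."""
    return side_ratio ** dim

assert tiles(2, 2) == 4           # a 2x2 square -> 4 unit squares
assert tiles(2, 50) == 2 ** 50    # a 50-dim cube of side 2 -> 2^50 unit cubes
```

The exponential growth in the second assertion is exactly why a dense hierarchy over C-space is infeasible and a sparse one is required.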
There are many ways to build a sparse hierarchy of this nature, but one simple approach is to build a hierarchy where the nodes on level k represent motions that combine the motions represented by nodes on level k-1. In this case the most natural semantic label predicates would seem to be things like simultaneously, after, immediately_after, etc. So a level k node represents a sort of "motion plan" composed by chaining together (serially and/or in parallel) the motions encoded in level k-1 nodes. Overlapping regions of C-space correspond to different complex movements that share some of the same component movements, e.g. if one is trying to slap one person while elbowing another, or run while kicking a soccer ball forwards. Also note, the semantic CSDLN approach reveals perception and motor control to have essentially similar hierarchical structures, more so than with the traditional CSDLN approach and its fixed-position perceptual nodes. Just as the semantic-perceptual CSDLN is naturally aligned with a traditional perceptual CSDLN, similarly a semantic-motoric CSDLN may be naturally aligned with a "motor CSDLN". A typical motoric hierarchy in robotics might contain a node corresponding to a robot arm, with children corresponding to the hand, upper arm and lower arm; the hand node might then contain child nodes corresponding to each finger, etc. This sort of hierarchy is intrinsically spatiotemporal because each individual action of each joint of an actuator like an arm is intrinsically bounded in space and time. Perhaps the most ambitious attempt along these lines is [Alb01], which shows how perceptual and motoric hierarchies are constructed and aligned in an architecture for intelligent automated vehicle control.
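The chaining of level k-1 motions into a level k "motion plan" under temporal predicates like simultaneously and after might be encoded as nested terms; the sketch below is one possible illustrative encoding, with all predicate and motion names assumed for the example.

```python
# A hypothetical "motion plan": a nested term whose head is a temporal
# predicate (simultaneously / after / ...) and whose arguments are either
# primitive motions (strings) or sub-plans from the level below.

plan = ("after",                                   # first reach, then close the hand
        ("simultaneously", "extend_elbow", "rotate_wrist"),
        "close_fingers")

def flatten(plan):
    """List the primitive motions referenced by a (possibly nested) plan."""
    if isinstance(plan, str):
        return [plan]
    _predicate, *parts = plan
    return [motion for part in parts for motion in flatten(part)]

assert flatten(plan) == ["extend_elbow", "rotate_wrist", "close_fingers"]
```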
Figure 27.2 gives a simplified illustration of the potential alignment between a semantic-motoric CSDLN and a purely motoric hierarchy (like the one posited above in the context of extended DeSTIN). In the figure, the motoric hierarchy is assumed to operate somewhat like DeSTIN, with nodes corresponding to (at the lowest level) individual servomotors, and (on higher levels) natural groupings of servomotors. The node corresponding to a set of servos is assumed to contain centroids of clusters of trajectories through configuration space. The task of choosing an appropriate action is then executed by finding the appropriate centroids for the nodes. Note an asymmetry between perception and action here. In perception the basic flow is bottom-up, with top-down flow used for modulation and for "imaginative" generation of percepts. In action, the basic flow is top-down, with bottom-up flow used for modulation and for imaginative, "fiddling around" style generation of actions. The semantic-motoric hierarchy then contains abstractions of the C-space centroids from the motoric hierarchy - i.e., actions that bind together different C-space trajectories that correspond to the same fundamental action carried out in different contexts or under different constraints. Similarly to the perceptual case, the semantic hierarchy here serves as glue between lower-level function and higher-level cognitive semantics.

Fig. 27.2: Simplified depiction of the relationship between a semantic-motoric CSDLN, a motor control hierarchy (illustrated by the hierarchy of servos associated with a robot arm), and a cognitive semantic network (like CogPrime's AtomSpace). In the figure, only a fragment of the semantic-motoric CSDLN is shown (a node corresponding to the "get object" action category, and then a child network containing nodes corresponding to several components of the action). In a real semantic-motoric CSDLN, there would be many other nodes on the same level as the get-object node, many other parts to the get-object subnetwork besides the ones depicted here; the subnetwork nodes would also have child subnetworks; there would be links from each semantic node to centroids within a large number of motoric nodes; and there might also be many nodes not corresponding clearly to any single English language concept like "grasp object", etc.
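The action-selection idea just described, with motoric nodes holding centroids of C-space trajectories and action choice amounting to picking the best-matching centroid, can be sketched as follows; all structures, names and numbers here are illustrative assumptions rather than the actual DeSTIN design.

```python
# A hypothetical motoric-hierarchy node for a 2-servo group. Each named
# centroid is a short trajectory through a 2D slice of configuration space
# (one coordinate per servo). Selection = least summed squared deviation
# from a desired trajectory.

arm_node = {
    "servos": ["shoulder", "elbow"],
    "centroids": {                       # named C-space trajectory centroids
        "reach_forward": [(0.0, 0.0), (0.5, 0.3), (1.0, 0.6)],
        "retract":       [(1.0, 0.6), (0.5, 0.3), (0.0, 0.0)],
    },
}

def select_action(node, desired):
    """Name of the trajectory centroid closest to the desired trajectory."""
    def cost(traj):
        return sum((a - b) ** 2
                   for p, q in zip(traj, desired) for a, b in zip(p, q))
    return min(node["centroids"], key=lambda k: cost(node["centroids"][k]))

assert select_action(arm_node, [(0.0, 0.1), (0.4, 0.3), (0.9, 0.5)]) == "reach_forward"
```

The semantic-motoric hierarchy would then sit above this, binding together different centroids (across contexts and constraints) that realize the same fundamental action.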
27.5 Connecting the Perceptual and Motoric Hierarchies with a Goal Hierarchy

One way to connect perceptual and motoric CSDLN hierarchies is using a "semantic-goal CSDLN" bridging the semantic-perceptual and semantic-motoric CSDLNs. The semantic-goal CSDLN would be a "semantic CSDLN" loosely analogous to the perceptual and motor semantic CSDLNs - and could optionally be linked into the reinforcement hierarchy of a tripartite CSDLN like extended DeSTIN. Each node in the semantic-goal CSDLN would contain implications of the form "Context & Procedure -> Goal", where Goal is one of the AI system's overall goals or a subgoal thereof, and Context and Procedure refer to nodes in the perceptual and motoric semantic CSDLNs respectively. For instance, a semantic-goal CSDLN node might contain an implication of the form "I perceive my hand is near object X & I grasp object X -> I possess object X." This would be useful if "I possess object X" were a subgoal of some higher-level system goal, e.g. if X were a food object and the system had the higher-level goal of obtaining food. To the extent that the system's goals can be decomposed into hierarchies of progressively more and more spatiotemporally localized subgoals, this sort of hierarchy will make sense, leading to a tripartite hierarchy as loosely depicted in Figure 27.3. One could attempt to construct an overall AGI approach based on a tripartite hierarchy of this nature, counting on the upper levels of the three hierarchies to come together dynamically to form an integrated cognitive network, yielding abstract phenomena like language, self, reasoning and mathematics. On the other hand, one may view this sort of hierarchy as a portion of a larger integrative AGI architecture, containing a separate cognitive network, with a less rigidly hierarchical structure and less of a tie to the spatiotemporal structure of physical reality.
The latter view is the one we are primarily taking within the CogPrime AGI approach, viewing perceptual, motoric and goal hierarchies as "lower level" subsystems connected to a "higher level" system based on the CogPrime AtomSpace and centered on its abstract cognitive processes. Learning of the subgoals and implications in the goal hierarchy is of course a complex matter, which may be addressed via a variety of algorithms, including online clustering (for subgoals or implications) or supervised learning (for implications, the "supervision" being purely internal and provided by goal or subgoal achievement). The diagram in Figure 27.3 is simplified in many ways, e.g. only a handful of nodes in each hierarchy are shown (rather than the whole hierarchy), lines without arrows are used to indicate bidirectional arrows, and nearly all links are omitted. The purpose is just to show the general character of interaction between the components in a simplified context.
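The "Context & Procedure -> Goal" implications described above might be represented along the following lines; this is a hedged sketch under assumed names and structures, not the actual CogPrime representation.

```python
# Hypothetical semantic-goal CSDLN node contents: implications linking a
# perceptual context and a motoric procedure to a goal, plus a lookup that
# proposes procedures whose context currently holds.

from dataclasses import dataclass

@dataclass(frozen=True)
class Implication:
    context: str      # a node in the semantic-perceptual hierarchy
    procedure: str    # a node in the semantic-motoric hierarchy
    goal: str         # a system goal or subgoal

KB = [
    Implication("hand_near_object_X", "grasp_object_X", "possess_object_X"),
    Implication("object_X_visible", "move_hand_toward_X", "hand_near_object_X"),
]

def procedures_for(goal, percepts):
    """Procedures whose implication targets `goal` and whose context holds now."""
    return [i.procedure for i in KB if i.goal == goal and i.context in percepts]

assert procedures_for("possess_object_X", {"hand_near_object_X"}) == ["grasp_object_X"]
```

Note how the second implication's goal is the first one's context: chaining such lookups backward from a top-level goal yields exactly the kind of subgoal decomposition the text describes.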
Fig. 27.3: Simplified illustration of the proposed interoperation of perceptual, motoric and goal semantic CSDLNs.
Chapter 28 Making DeSTIN Representationally Transparent

Co-authored with Itamar Arel

28.1 Introduction

In this chapter and the next we describe one particular incarnation of the above ideas on semantic CSDLNs in more depth: the integration of CogPrime with the DeSTIN architecture reviewed in Chapter 4 of Part 1. One of the core intuitions underlying this integration is that, in order to achieve the desired level of functionality for tasks like picture interpretation and assembly of complex block structures, it will be necessary to integrate DeSTIN (or some similar system) and CogPrime components fairly tightly - going deeper than just using DeSTIN as an input/output layer for CogPrime, by building a number of explicit linkages between the nodes inside DeSTIN and CogPrime respectively. The general DeSTIN design has been described in talks as comprising three crosslinked hierarchies, handling perception, action and reinforcement; but so far only the perceptual hierarchy (also called the "spatiotemporal inference network") has been implemented or described in detail in publications. In this chapter we will focus on DeSTIN's perception hierarchy. We will explain DeSTIN's perceptual dynamics and representations as we understand them, more thoroughly than was done in the brief review above; and we will describe a series of changes to the DeSTIN design, made in the spirit of easing DeSTIN/OpenCog integration. In the following chapter we will draw action and reinforcement into the picture, deviating somewhat in the details from the manner in which these things would be incorporated into a standalone DeSTIN, but pursuing the same concepts in an OpenCog integration context. What we describe here is a way to make a "Uniform DeSTIN", in which the internal representation of perceived visual forms is independent of affine transformations (translation, scaling, rotation and shear).
This "representational transparency" means that, when Uniform DeSTIN perceives a pattern, no matter how that pattern is shifted or linearly transformed, the way Uniform DeSTIN represents that pattern internally is going to be basically the same. This makes it easy to look at a collection of DeSTIN states, obtained by exposing a DeSTIN perception network to the world at different points in time, and see the commonalities in what they are perceiving and how they are interpreting it. By contrast, in the original version of DeSTIN (here called "classic DeSTIN"), it may take significant effort to connect the internal representation of a visual pattern and the representation of its translated or linearly transformed versions. The uniformity of Uniform DeSTIN makes it easier for humans to inspect DeSTIN's state and understand what's going on, and also (more to the point) makes it easier for other
AI components to recognize patterns in sets of DeSTIN states. The latter fact is critical for the DeSTIN/OpenCog integration.

28.2 Review of DeSTIN Architecture and Dynamics

The hierarchical architecture of DeSTIN's spatiotemporal inference network comprises an arrangement into multiple layers of "nodes", each layer comprising multiple instantiations of an identical processing unit. Each node corresponds to a particular spatiotemporal region, and uses a statistical learning algorithm to characterize the sequences of patterns that are presented to it by nodes in the layer beneath it. More specifically, at the very lowest layer of the hierarchy nodes receive as input raw data (e.g. pixels of an image) and continuously construct a belief state that attempts to characterize the sequences of patterns viewed. The second layer, and all those above it, receive as input the belief states of nodes at their corresponding lower layers, and attempt to construct belief states that capture regularities in their inputs. Each node also receives as input the belief state of the node above it in the hierarchy (which constitutes "contextual" information, utilized in the node's prediction process). Inside each node, an online clustering algorithm is used to identify regularities in the sequences received by that node. The centroids of the clusters learned are stored in the node and comprise the basic visual patterns recognized by that node. The node's "belief" regarding what it is seeing is then understood as a probability density function defined over the centroids at that node. The equations underlying this centroid formation and belief updating process are identical for every node in the architecture, and were given in their original form in [ARC09a], though the current open-source DeSTIN codebase reflects some significant improvements not yet reflected in the publication record.
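The per-node dynamics just described, online centroid updating plus a belief distribution over centroids, can be sketched minimally as follows; real DeSTIN uses more elaborate update equations, and the learning rate and similarity function here are assumptions made for illustration.

```python
# A toy DeSTIN-style node: incremental centroid updating and a "belief"
# distribution assigning more probability to centroids nearer the input.

def update_centroid(centroid, x, lr=0.1):
    """Move the winning centroid a small step toward the observation."""
    return [c + lr * (xi - c) for c, xi in zip(centroid, x)]

def belief(centroids, x):
    """Probability distribution over centroids, higher for nearer centroids."""
    dists = [sum((c - xi) ** 2 for c, xi in zip(cen, x)) for cen in centroids]
    sims = [1.0 / (1.0 + d) for d in dists]
    z = sum(sims)
    return [s / z for s in sims]

cents = [[0.0, 0.0], [1.0, 1.0]]
b = belief(cents, [0.9, 1.1])
assert abs(sum(b) - 1.0) < 1e-9 and b[1] > b[0]   # nearer centroid dominates
cents[1] = update_centroid(cents[1], [0.9, 1.1])  # centroid drifts toward input
```

In the real system the belief update is also conditioned on the parent node's state; that contextual term is omitted here for brevity.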
In short, the way DeSTIN represents an item of knowledge is as a probability distribution over "network activity patterns" in its hierarchical network. An activity pattern, at each point in time, comprises an indication of which centroids in each node are most active, meaning they have been identified as most closely resembling what that node has perceived, as judged in the context of the perceptions of the other nodes in the system. Based on this methodology, the DeSTIN perceptual network serves the critical role of building and maintaining a model of the state of the world as visually perceived. This methodology allows for powerful unsupervised classification. If shown a variety of real-world scenes, DeSTIN will automatically form internal structures corresponding to the various natural categories of objects shown in the scenes, such as trees, chairs, people, etc.; and also to the various natural categories of events it sees, such as reaching, pointing, falling. In order to demonstrate the informativeness of these internal structures, experiments have been done using DeSTIN's states as input feature vectors for supervised learning algorithms, enabling high-accuracy supervised learning of classification models from labeled image data [KAR10]. A closely related algorithm developed by the same principal researcher (Itamar Arel) has proven extremely successful at audition tasks such as phoneme recognition [ABS+11].
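The feature-vector experiments mentioned above can be illustrated schematically. In the sketch below, a nearest-centroid classifier stands in for the stronger supervised learners used in the cited work, and the "state vectors" and labels are fabricated for illustration only.

```python
# Treating (flattened) DeSTIN states as feature vectors for a simple
# supervised classifier: average each class's vectors into a class centroid,
# then classify new states by nearest class centroid.

def train(features, labels):
    """Average the feature vectors of each class into a class centroid."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        s = sums.setdefault(y, [0.0] * len(f))
        sums[y] = [a + b for a, b in zip(s, f)]
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(model, f):
    """Label of the class centroid nearest to feature vector f."""
    d2 = lambda c: sum((a - b) ** 2 for a, b in zip(c, f))
    return min(model, key=lambda y: d2(model[y]))

model = train([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], ["tree", "tree", "chair"])
assert predict(model, [0.8, 0.2]) == "tree"
```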
28.2.1 Beyond Gray-Scale Vision

The DeSTIN approach may easily be extended to other senses beyond gray-scale vision. For color vision, it suffices to replace the one-dimensional signals coming into DeSTIN's lower layer with 3D signals representing points in the color spectrum; the rest of the DeSTIN process may be carried over essentially without modification. Extension to further senses is also relatively straightforward on the mathematical and software structure level, though they may of course require significant additional tuning and refinement of details. For instance, olfaction does not lend itself well to hierarchical modeling, but audition and haptics (touch) do:

• for auditory perception, one could use a DeSTIN architecture in which each layer is one-dimensional rather than two-dimensional, representing a certain pitch. Or one could use two dimensions for pitch and volume. This results in a system quite similar to the DeSTIN-like system shown to perform outstanding phoneme recognition in [ABS+11], and is conceptually similar to Hierarchical Hidden Markov Models (HHMMs), which have proven quite successful in speech recognition and which Ray Kurzweil has argued are the central mechanism of human intelligence [Kur12]. Note also recent results published by Microsoft Research, showing dramatic improvements over prior speech recognition results based on use of a broadly HHMM-like deep learning system.
• for haptic perception, one could use a DeSTIN architecture in which the lower layer of the network possesses a 2D topology reflecting the topology of the surface of the body. Similar to the somatosensory cortex in the human brain, the map could be distorted so that more "pixels" are used for regions of the body from which more data is available (e.g. currently this might be the fingertips, if these were implemented using Syntouch technology [FL12], which has proved excellent at touch-based object identification).
Input could potentially be multidimensional if multiple kinds of haptic sensors were available, e.g. temperature, pressure and movement as in the Syntouch case. Augmentation of DeSTIN to handle action as well as perception is also possible, and will be discussed in Chapter 29.

28.3 Uniform DeSTIN

It would be possible to integrate DeSTIN in its original form with OpenCog or other AI systems with symbolic aspects, via using an unsupervised machine learning algorithm to recognize patterns in sets of states of the DeSTIN network as originally defined. However, this pattern recognition task becomes much easier if one suitably modifies DeSTIN, so as to make the commonalities between semantically similar states more obviously perceptible. This can be done by making the library of patterns recognized within each DeSTIN node invariant with respect to translation, scale, rotation and shear - a modification we call "Uniform DeSTIN". This "uniformization" decreases DeSTIN's degree of biological mimicry, but eases integration of DeSTIN with symbolic AI methods.
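As a toy illustration of what such invariance buys (covering only translation and uniform scaling, not rotation or shear, and not the actual Uniform DeSTIN algorithm), consider canonicalizing a 2D point set so that shifted and scaled copies of a shape map to the same internal representation:

```python
# Canonicalize a 2D point set: center it on its mean and scale to unit
# "radius", so translated/scaled copies of a shape become identical.

def canonicalize(points):
    """Center the points and scale them to unit maximum coordinate."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centered = [(x - cx, y - cy) for x, y in points]
    r = max(max(abs(x), abs(y)) for x, y in centered) or 1.0
    return [(round(x / r, 9), round(y / r, 9)) for x, y in centered]

square = [(0, 0), (0, 2), (2, 0), (2, 2)]
shifted_scaled = [(10, 10), (10, 18), (18, 10), (18, 18)]   # translated, 4x scaled
assert canonicalize(square) == canonicalize(shifted_scaled)
```

With a canonicalization like this in front of pattern matching, two presentations of the same shape hit the same library entry, which is the essence of the representational transparency argued for above.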
28.3.1 Translation-Invariant DeSTIN

The first revision to the "classic DeSTIN" to be suggested here is: all the nodes on the same level of the DeSTIN hierarchy should share the same library of patterns. In the context of classic DeSTIN (i.e. in the absence of further changes to DeSTIN to be suggested below, which extend the type of patterns usable by DeSTIN), this means: the nodes on the same level should share the same list of centroids. This makes DeSTIN's pattern recognition capability translation-invariant. This translation invariance can be achieved without any change to the algorithms for updating centroids and matching inputs to centroids. In this approach, it's computationally feasible to have a much larger library of patterns utilized by each node, as compared to in classic DeSTIN. Suppose we have an n x n pixel grid, where the lowest level has nodes corresponding to 4 x 4 squares. Then, there are (n/4)^2 nodes on the lowest level, and on the k'th level there are (n/2^(k+1))^2 nodes. This means that, without increasing computational complexity (actually decreasing it, under reasonable assumptions), in translation-invariant Uniform DeSTIN we can have a factor of (n/2^(k+1))^2 more centroids on level k. One can achieve a much greater decrease in computational complexity (with the same amount of centroid increase) via use of a clever data structure like a cover tree [BKL06] to store the centroids at each level. Then the nearest-neighbor matching of input patterns to the library (centroid) patterns would be very rapid, much faster than linearly comparing the input to each pattern in the list.

28.3.1.1 Conceptual Justification for Uniform DeSTIN

Generally speaking, one may say that: if the class of images that the system will see is invariant with respect to linear translations, then without loss of generality, we can assume that the library of patterns at each node on the same level is the same.
In reality this assumption isn't quite going to hold. For instance, for an eye attached to a person or humanoid robot, the top of the pixel grid will probably look at a person's hair more often than the bottom ... because the person stands right-side-up more often than they stand upside-down, and because they will often fixate the center of their view on a person's face, etc. For this reason, we can recognize our friend's face better if we're looking at them directly, with their face centered in our vision. However, we suggest that this kind of peculiarity is not really essential to vision processing for general intelligence. There's no reason you can't have an intelligent vision system that recognizes a face just as well whether it's centered in the visual field or not. (In fact you could straightforwardly explicitly introduce this kind of bias within a translation-invariant DeSTIN, but it's not clear this is a useful direction.) By and large, in almost all cases, it seems to us that in a DeSTIN system exposed to a wide variety of real-world inputs in complex situations, the library of patterns in the different nodes at the same level would turn out to be substantially the same. Even if they weren't exactly the same, they would be close to the same, embodying essentially the same regularities. But of course, this sameness would be obscured, because centroid 7 in a certain node X on level 4 might actually be the same as centroid 18 in some other node Y on level 4 ... and there would be no way to tell that centroid 7 in node X and centroid 18 in node Y were actually referring to the same pattern, without doing a lot of work.
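The shared-library bookkeeping of translation-invariant DeSTIN can be sketched as follows: one centroid library per level, so a given centroid index names the same pattern in every node on that level. The brute-force nearest-neighbor lookup below stands in for the cover tree mentioned above, and all names and data are illustrative.

```python
# One shared centroid library per level: every node on a level encodes its
# input as an index into the same list, so indices are directly comparable
# across nodes (unlike in classic DeSTIN, where each node had its own list).

shared_library = {                      # level -> list of centroid vectors
    4: [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
}

def nearest_centroid(level, x):
    """Index (shared by all nodes on `level`) of the centroid closest to x."""
    lib = shared_library[level]
    d2 = lambda c: sum((ci - xi) ** 2 for ci, xi in zip(c, x))
    return min(range(len(lib)), key=lambda i: d2(lib[i]))

# Two different nodes on level 4 seeing the same patch agree on its code:
assert nearest_centroid(4, [0.9, 0.1]) == nearest_centroid(4, [0.9, 0.1]) == 1
```

In a serious implementation the linear scan would be replaced by a cover tree or similar metric index, making the lookup sublinear in the (much larger) shared library size.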
28.3.1.2 Comments on Biological Realism

Translation-invariant DeSTIN deviates further from human brain structure than classic DeSTIN, but for good reason. The brain has a lot of neurons, since adding new neurons was fairly easy and cheap for evolution; and it tends to do things in a massively parallel manner, with great redundancy. For the brain, it's not so problematically expensive to have the functional equivalent of a lot of DeSTIN nodes on the same level, all simultaneously using and learning libraries of patterns that are essentially identical to each other. Using current computer technology, on the other hand, this sort of strategy is rather inefficient. In the brain, messaging between separated regions is expensive, whereas replicating function redundantly is cheap. In most current computers (with some partial exceptions such as GPUs), messaging between separated regions is fairly cheap (so long as those regions are stored on the same machine), whereas replicating function redundantly is expensive. Thus, even in cases where the same concept and abstract mathematical algorithm can be effectively applied in both the brain and a computer, the specifics needed for efficient implementation may be quite different.

28.3.2 Mapping States of Translation-Invariant DeSTIN into the Atomspace

Mapping classic DeSTIN's states into a symbolic pattern-manipulation engine like OpenCog is possible, but relatively cumbersome. Doing the same thing with Uniform DeSTIN is much more straightforward. In Uniform DeSTIN, for example, Cluster 7 means the same thing in ANY node on level 4. So after a Uniform DeSTIN system has seen a fair number of images, one can be fairly confident its library of patterns is going to be relatively stable.
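The state-tree representation defined below can be sketched concretely in Python (an illustration with invented names, not actual OpenCog or DeSTIN code). The point is that integer pattern codes carry level-wide meaning, so counting code occurrences across a database of state-trees is a sensible first step toward mining frequent state-subtrees:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class StateTreeNode:
    """One node of a DeSTIN state-tree (a sketch, not actual DeSTIN code).

    'patterns' is the short list of (pattern_code, weight) pairs: the
    centroids that closely match this node's current state.  Because the
    library is shared per level, a pattern code means the same thing in
    every node on the same level.
    """
    patterns: List[Tuple[int, float]] = field(default_factory=list)
    children: List["StateTreeNode"] = field(default_factory=list)  # quaternary

def frequent_codes(root, counts=None):
    """Count how often each pattern code occurs anywhere in a state-tree.

    Aggregated over a database of state-trees, such counts seed the search
    for frequent state-subtrees and, from there, "image grammar" relations.
    """
    if counts is None:
        counts = {}
    for code, _weight in root.patterns:
        counts[code] = counts.get(code, 0) + 1
    for child in root.children:
        frequent_codes(child, counts)
    return counts
```

For example, a tree in which code 7 appears at two different nodes yields `counts[7] == 2`; in classic DeSTIN the two occurrences would carry different, node-local codes and the coincidence would be invisible.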
Some clusters may come and go as learning progresses, but there's going to be a large and solid library of clusters at each level that persists, because all of its member clusters occur reasonably often across a variety of inputs.

Define a DeSTIN state-tree as a (quaternary) tree with one node for each DeSTIN node; and living at each node, a small list of (integer pattern_code, float weight) pairs. That is, at each node, the state-tree has a short list of the patterns that closely match a given state at that node. The weights may be assumed between 0 and 1. The integer pattern codes have the same meaning for every node on the same level. As you feed DeSTIN inputs, at each point in time it will have a certain state, representable as a state-tree.

So, suppose you have a large database of DeSTIN state-trees, obtained by showing various inputs to DeSTIN over a long period of time. Then, you can do various kinds of pattern recognition on this database of state-trees. More formally, define a state-subtree as a (quaternary) tree with a single integer at each node. Two state-subtrees may have various relationships with each other within a single state-tree - for instance they may be adjacent to each other, or one may appear atop or below the other, etc. In these terms, one interesting kind of pattern recognition to do is: recognize frequent state-subtrees in the stored library of state-trees; and then recognize frequent relationships between these frequent state-subtrees. The latter relationships will form a kind of "image grammar," conceptually similar and formally related to those described in [ZM06]. Further, temporal patterns may be recognized in the same way as spatial ones, as part of the state-subtree grammar (e.g. state-subtree A often occurs right before state-subtree B; state-subtree C often occurs right before and right below state-subtree D; etc.).

The flow of activation from OpenCog back down to DeSTIN is also fairly straightforward in the context of translation-invariant DeSTIN. If relationships have been stored between concepts in OpenCogPrime's memory and grammatical patterns between state-subtrees, then whenever concept C becomes important in OpenCogPrime's memory, this can cause a top-down increase in the probability of matching inputs to those DeSTIN node centroids that would cause the DeSTIN state-tree to contain the grammatical patterns corresponding to concept C.

28.3.3 Scale-Invariant DeSTIN

The next step, moving beyond translation invariance, is to make DeSTIN's pattern recognition mostly (not wholly) scale invariant. We will describe a straightforward way to map centroids on one level of DeSTIN into centroids on the other levels of DeSTIN. This means that when a centroid has been learned on one level, it can be experimentally ported to all the other levels, to see if it may be useful there too. To make the explanation of this mapping clear, we reiterate some DeSTIN basics in slightly different language:

• A centroid on Level N is: a spatial arrangement (e.g. k × k square lattice) of beliefs on Level N − 1. (More generally it is a spatiotemporal arrangement of such beliefs, but we will ignore this for the moment.)
• A belief on Level N is: a probability distribution over centroids on Level N. For heuristic purposes one can think about this as a mixture of Gaussians, though this won't always be the best model.
• Thus, a belief on Level N is: a probability distribution over spatial (or more generally, spatiotemporal) arrangements of beliefs on Level N − 1.

On Level 1, the role of centroids is played by simple k × k squares of pixels. Level 1 beliefs are probability distributions over these small pixel squares. Level 2 centroids are hence spatial arrangements of probability distributions over small pixel-squares; and Level 2 beliefs are probability distributions over spatial arrangements of probability distributions over small pixel-squares.

A small pixel-square S may be mapped into a single pixel P via a heuristic algorithm such as:

• if S has more black than white pixels, then P is black
• if S has more white than black pixels, then P is white
• if S has an equal number of white and black pixels, then use some heuristic. For instance if S is 4 × 4 you could look at the central 2 × 2 square and assign P to the color that occurs most often there. If that is also a tie, then you can just arbitrarily assign P the color that occurs in the upper left corner of S.

A probability distribution over small pixel-squares may then be mapped into a probability distribution over pixel values (B or W). A probability distribution over the two values B and W may be approximately mapped into a single pixel value - the one that occurs most often
in the distribution, with a random choice made to break a tie.

This tells us how to map Level 2 beliefs into spatial arrangements of pixels; and thus, it tells us how to map Level 2 beliefs into Level 1 beliefs. But this tells us how to map Level N beliefs into Level N − 1 beliefs, inductively. Remember, a Level N belief is a probability distribution (pdf for short) over spatial arrangements of beliefs on Level N − 1. For example: a Level 3 belief is a pdf over arrangements of Level 2 beliefs. But since we can map Level 2 beliefs into Level 1 beliefs, this means we can map a Level 3 belief into a pdf over arrangements of Level 1 beliefs - which means we can map a Level 3 belief into a Level 2 belief. Etc.

Of course, this also tells us how to map Level N centroids into Level N − 1 centroids. A Level N centroid is a pdf over arrangements of Level N − 1 beliefs; a Level N − 1 centroid is a pdf over arrangements of Level N − 2 beliefs. But Level N − 1 beliefs can be mapped into Level N − 2 beliefs, so Level N centroids can be represented as pdfs over arrangements of Level N − 2 beliefs, and hence mapped into Level N − 1 centroids.

In practice, one can implement this idea by moving from the bottom up. Given the mapping from Level 1 "centroids" to pixels, one can iterate through the Level 1 beliefs and identify which pixels they correspond to. Then one can iterate through the Level 2 beliefs and identify which Level 1 beliefs they correspond to. Etc. Each Level N belief can be explicitly linked to a corresponding Level N − 1 belief. Likewise, as one moves up the hierarchy, Level N centroids can be explicitly linked to corresponding Level N − 1 centroids.

Since there are in principle more possible Level N beliefs than Level N − 1 beliefs, the mapping from Level N beliefs to Level N − 1 beliefs is many-to-one. This is a reason not to simply maintain a single centroid pool across levels.
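The pixel-square reduction heuristic at the base of this induction can be written out directly. This is a sketch: the 'B'/'W' encoding, and the assumption of a 4 × 4 input in the tie-break, are illustrative choices.

```python
def reduce_square(square):
    """Map a small black/white pixel square to a single pixel value.

    Implements the heuristic described above: the majority color wins; a
    tie is broken by the central 2x2 sub-square; a further tie falls back
    arbitrarily to the upper-left corner.  'square' is a list of rows,
    each entry 'B' or 'W'.
    """
    flat = [p for row in square for p in row]
    blacks = flat.count('B')
    whites = flat.count('W')
    if blacks != whites:
        return 'B' if blacks > whites else 'W'
    # tie: look at the central 2x2 square (assumes a 4x4 input)
    c = len(square) // 2
    center = [square[i][j] for i in (c - 1, c) for j in (c - 1, c)]
    if center.count('B') != center.count('W'):
        return 'B' if center.count('B') > center.count('W') else 'W'
    return square[0][0]  # still tied: arbitrary upper-left corner
```

Applying this map to every 4 × 4 block of a Level 2 arrangement yields the Level 1 pixel picture, which is the base case of the level-to-level mapping described above.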
However, when a new centroid C is added to the Level N pool, it can be mapped into a Level N − 1 centroid to be added to the Level N − 1 pool (if not there already). And, it can also be used to spawn a Level N + 1 centroid, drawn randomly from the set of possible Level N + 1 centroids that map into C.

Also, note that it is possible to maintain a single centroid numbering system across levels, so that a reference like "centroid #175" has only one meaning in an entire DeSTIN network, even though some of these centroids may only be meaningful above a certain level in the network.

28.3.4 Rotation-Invariant DeSTIN

With a little more work, one can make DeSTIN rotation and shear invariant as well¹. Considering rotation first:

• When comparing an input A to a Level N node with a Level N centroid B, consider various rotations of A, and see which rotation gives the closest match.
• When you match a centroid to an input observation-or-belief, record the rotation angle corresponding to the match.

The second of these points implies the tweaked definitions:

• A centroid on Level N is: a spatial arrangement (e.g. k × k square lattice) of beliefs on Level N − 1.
• A belief on Level N is: a probability distribution over (angle, centroid) pairs on Level N.

¹ The basic idea in this section, in the context of rotation, is due to Jade O'Neill (private communication).
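The rotation-matching tweak can be sketched concretely. We restrict to 90-degree rotations, which are exact on a pixel grid; a real implementation would presumably use finer angular steps with interpolation.

```python
import math

def rotations(patch):
    """Yield (angle_degrees, patch) for the four 90-degree rotations.

    A sketch: 90-degree steps keep the example exact on a pixel grid;
    finer discretizations of the angle are a straightforward extension.
    """
    cur = [list(row) for row in patch]
    for angle in (0, 90, 180, 270):
        yield angle, cur
        cur = [list(row) for row in zip(*cur[::-1])]  # rotate 90 deg clockwise

def best_match(patch, centroids):
    """Return (centroid_index, angle, distance) over all rotations of patch.

    'centroids' holds flattened (row-major) numeric vectors.  The returned
    angle is the extra piece of information the tweaked belief definition
    records alongside the centroid.
    """
    best = (None, None, math.inf)
    for angle, rot in rotations(patch):
        flat = [p for row in rot for p in row]
        for i, c in enumerate(centroids):
            d = sum((a - b) ** 2 for a, b in zip(flat, c)) ** 0.5
            if d < best[2]:
                best = (i, angle, d)
    return best
```

For instance, a patch bright in its top-right corner matches a "top-left bright" centroid exactly at a recorded angle of 270 degrees.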
From these it follows that a belief on Level N is: a probability distribution over (angle, spatial arrangement of beliefs) pairs on Level N − 1.

An additional complexity here is that two different (angle, centroid) pairs (on the same level) could be (exactly or approximately) equal to each other. This necessitates an additional step of "centroid simplification", in which ongoing checks are made to see if there are any two centroids C1, C2 on the same level such that there exist angles A1, A2 for which (A1, C1) is very close to (A2, C2). In this case the two centroids may be merged into one.

To apply these same ideas to shear, one may simply replace "rotation angle" in the above by "(rotation angle, shear factor) pair."

28.3.5 Temporal Perception

Translation and scale invariant DeSTIN can be applied perfectly well if the inputs to DeSTIN, at level 1, are movies rather than static images. Then, in the simplest version, Level 1 consists of pixel cubes instead of pixel squares, etc. (the third dimension in the cube representing time). The scale invariance achieved by the methods described above would then be scale invariance in time as well as in space.

In this context, one may enable rectangular shapes as well as cubes. That is, one can look at a Level N centroid consisting of m time-slices of a k × k arrangement of Level N − 1 beliefs - without requiring that m = k. This would make the centroid learning algorithm a little more complex, because at each level one would want to consider centroids with various values of m, from m = 1, ..., k (and potentially m > k also).

28.4 Interpretation of DeSTIN's Activity

Uniform DeSTIN constitutes a substantial change in how DeSTIN does its business of recognizing patterns in the world - conceptually as well as technically. To explicate the meaning of these changes, we briefly present our favored interpretation of DeSTIN's dynamics.
The centroids in the DeSTIN library represent points in "spatial pattern space", i.e. they represent exemplary spatial patterns. DeSTIN's beliefs, as probability distributions over centroids, represent guesses as to which of the exemplary spatial patterns are the best models of what's currently being seen in a certain space-time region. This matching between observations and centroids might seem to be a simple matter of "nearest neighbor matching"; but the subtle point is, it's not immediately obvious how best to measure the distance between observations and centroids. The optimal way of measuring distance is going to depend on context; that is to say, on the actual distribution of observations in the system's real environment over time.

DeSTIN's algorithm for calculating the belief at a node, based on the observation and centroids at that node plus the beliefs at other nearby nodes, is essentially a way of tweaking the distance measurement between observations and centroids, so that this measurement accounts for the context (the historical distribution of observations). There are many possible ways of doing this tweaking. Ideally one could use probability theory explicitly, but that's not always
going to be computationally feasible, so heuristics may be valuable, and various versions of DeSTIN have contained various heuristics in this regard.

The various ways of "uniformizing" DeSTIN described above (i.e. making its pattern recognition activity approximately invariant with respect to affine transformations) don't really affect this story - they just improve the algorithm's ability to learn based on small amounts of data (and its rapidity at learning from data in general), by removing the need for the system to repeatedly re-learn transformed versions of the same patterns. So the uniformization just lets DeSTIN carry out its basic activity faster and using less data.

28.4.1 DeSTIN's Assumption of Hierarchical Decomposability

Roughly speaking, DeSTIN will work well to the extent that the average distance between each part of an actually observed spatial pattern, and the closest centroid pattern, is not too large (note: the choice of distance measure in this statement is potentially subtle). That is: DeSTIN's set of centroids is supposed to provide a compact model of the probability distribution of spatial patterns appearing in the experience of the cognitive system of which DeSTIN is a part. DeSTIN's effective functionality relies on the assumption that this probability distribution is hierarchically decomposable - i.e. that the distribution of spatial patterns appearing over a k × k region can be compactly expressed, to a reasonable degree of approximation, as a spatial combination of the distributions of spatial patterns appearing over (k/4) × (k/4) regions. This assumption of hierarchical decomposability greatly simplifies the search problem that DeSTIN faces, but also restricts DeSTIN's capability to deal with more general spatial patterns that are not easily hierarchically decomposable.
However, the benefits of this approach seem to outweigh the costs, given that visual patterns in the environments humans naturally encounter do seem (intuitively at least) to have this hierarchical property.

28.4.2 Distance and Utility

Above we noted that the choice of distance measure involved in the assessment of DeSTIN's effective functionality is subtle. Further above, we observed that the function of DeSTIN's belief assessment is basically to figure out the contextually best way to measure the distance between the observation and the centroids at a node. These comments were both getting at the same point. But what is the right measure of distance between two spatial patterns?

Ultimately, the right measure is: the probability that the two patterns A and B can be used in the same way. That is: the system wants to identify observation A with centroid B if it has useful action-patterns involving B, and it can substitute A for B in these patterns without loss. This is difficult to calculate in general, though - a rough proxy, which it seems will often be acceptable, is to measure the distance between A and B in terms of both

• the basic (extensional) distance between the physical patterns they embody (e.g. pixel-by-pixel distance)
• the contextual (intensional) distance, i.e. the difference between the contexts in which they occur
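A toy version of this two-part proxy can be written as follows. The representation is our own illustrative choice: flat numeric vectors for the extensional part, sets of co-occurring pattern codes for the intensional part, and an arbitrary blending weight `alpha`.

```python
def combined_distance(a, b, ctx_a, ctx_b, alpha=0.5):
    """Blend extensional and intensional distance between two patterns.

    A sketch of the proxy described above.  'a' and 'b' are flat pixel
    vectors; 'ctx_a' and 'ctx_b' are sets of pattern codes observed to
    co-occur with each pattern, standing in for "the contexts in which
    they occur".
    """
    # extensional: direct pixel-by-pixel (Euclidean) distance
    ext = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    # intensional: Jaccard distance between the two context sets
    union = ctx_a | ctx_b
    inten = 1.0 - len(ctx_a & ctx_b) / len(union) if union else 0.0
    return alpha * ext + (1 - alpha) * inten
```

Two pixel-identical patterns that occur in disjoint contexts thus still sit at a nonzero distance, which is exactly the behavior the utility-based view of distance calls for.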
Via enabling the belief in a node's parent to play a role in modulating that node's belief, DeSTIN's core algorithm enables contextual/intensional factors to play a role in distance assessment.

28.5 Benefits and Costs of Uniform DeSTIN

We now summarize the main benefits and costs of Uniform DeSTIN a little more systematically. The key point we have made here regarding Uniform DeSTIN and representational transparency may be summarized as follows:

• Define an "affine perceptual equivalence class" as a set of percepts that are equivalent to each other, or nearly so, under affine transformation. An example would be views of the same object from different perspectives or distances.
• Suppose one has an embodied agent using DeSTIN for visual perception, whose perceptual stream tends to include a lot of reasonably large affine perceptual equivalence classes.
• Then, supposing the "mechanics" of DeSTIN can be transferred to the Uniform DeSTIN case without dramatic loss of performance, Uniform DeSTIN should be able to recognize patterns based on many fewer examples than classic DeSTIN. As soon as Uniform DeSTIN has learned to recognize one element of a given affine perceptual equivalence class, it can recognize all of them; whereas classic DeSTIN must learn each element of the equivalence class separately.

So, roughly speaking, the number of cases required for unsupervised training of Uniform DeSTIN will be less than that for classic DeSTIN, by a ratio equal to the average size of the affine perceptual equivalence classes in the agent's perceptual stream.

Counterbalancing this, we have the performance cost of comparing the input to each node against a much larger set of centroids (in Uniform DeSTIN as opposed to classic DeSTIN). However, if a cover tree or other efficient data structure is used, this cost is not so onerous.
The cost of a nearest neighbor query in a cover tree storing n items (in this case, n centroids) is O(c¹² log n), where the constant c represents the "intrinsic dimensionality" of the data; and in practice the cover tree search algorithm seems to perform quite well. So, the added time cost for online clustering in Uniform DeSTIN as opposed to classic DeSTIN is a factor on the order of the log of the number of nodes in the DeSTIN tree. We believe this moderate added time cost is well worth paying, to gain a significant decrease in the number of training examples required for unsupervised learning.

Beyond increases in computational cost, there is also the risk that the online clustering may just not work as well when one has so many clusters in each node. This is the sort of problem that can really only be identified, and dealt with, during extensive practice - since the performance of any clustering algorithm is largely determined by the specific distribution of the data it's dealing with. It may be necessary to improve DeSTIN's online clustering in some way to make Uniform DeSTIN work optimally, e.g. improving its ability to form clusters with markedly non-spherical shapes. This ties in to a point raised in Chapter 29 - the possibility of supplementing traditional clusters with predicates learned by CogPrime, which may live inside DeSTIN nodes alongside centroids. Each such predicate in effect defines a (generally nonconvex) "cluster".
28.6 Imprecise Probability as a Tool for Linking CogPrime and DeSTIN

One key aspect of vision processing is the ability to preferentially focus attention on certain positions within a perceived visual scene. In this section we describe a novel strategy for enabling this in a hybrid CogPrime/DeSTIN system, via use of imprecise probabilities. In fact the basic idea suggested here applies to any probabilistic sensory system, whether deep-learning-based or not, and whether oriented toward vision or some other sensory modality. However, for the sake of concreteness, we will focus here on the case of DeSTIN/CogPrime integration.

28.6.1 Visual Attention Focusing

Since visual input streams contain vast amounts of data, it's beneficial for a vision system to be able to focus its attention specifically on the most important parts of its input. Sometimes knowledge of what's important will come from cognition and long-term memory, but sometimes it may come from mathematical heuristics applied to the visual data itself.

In the human visual system the latter kind of "low level attention focusing" is achieved largely in the context of the eye changing its focus frequently, looking preferentially at certain positions in the scene [Itt09]. This works because the center of the eye corresponds to a greater density of neurons than the periphery. So for example, consider a computer vision algorithm like SIFT (Scale-Invariant Feature Transform) [Low99], which (as shown in Figure 28.1) mathematically isolates certain points in a visual scene as "keypoints" particularly important for identifying what the scene depicts (e.g. these may be corners, or easily identifiable curves in edges). The human eye, when looking at a scene, would probably spend a greater percentage of its time focusing on the SIFT keypoints than on random points in the image.
The human visual system's strategy for low-level attention focusing is obviously workable (at least in contexts similar to those in which the human eye evolved), but it's also somewhat complex, requiring the use of subtle temporal processing to interpret even static scenes. We suggest here that there may be a simpler way to achieve the same thing, in the context of vision systems that are substantially probabilistic in nature: using imprecise probabilities. The crux of the idea is to represent the most important data, e.g. keypoints, using imprecise probability values with greater confidence.

Similarly, cognition-guided visual attention focusing occurs when a mind's broader knowledge of the world tells it that certain parts of the visual input may be more interesting to study than others. For example, in a picture of a person walking down a dark street, the contours of the person may not be tremendously striking visually (according to SIFT or similar approaches); but even so, if the system as a whole knows that it's looking at a person, it may decide to focus extra visual attention on anything person-like. This sort of cognition-guided visual attention focusing, we suggest, may be achieved similarly to visual attention focusing guided by lower-level cues - by increasing the confidence of the imprecise probabilities associated with those aspects of the input that are judged more cognitively significant.
28.6.2 Using Imprecise Probabilities to Guide Visual Attention Focusing

Suppose one has a vision system that internally constructs probabilistic values corresponding to small local regions in visual input (these could be pixels or voxels, or something a little larger), and then (perhaps via a complex process) assigns probabilities to different interpretations of the input based on combinations of these input-level probabilities. For this sort of vision system, one may be able to achieve focusing of attention via appropriately replacing the probabilities with imprecise probabilities. Such an approach may be especially interesting in hierarchical vision systems that also involve the calculation of probabilities corresponding to larger regions of the visual input. Examples of the latter include deep learning based vision systems like HTM or DeSTIN, which construct nested hierarchies corresponding to larger and larger regions of the input space, and calculate probabilities associated with each of the regions on each level, based in part on the probabilities associated with other related regions.

In this context, we now state the basic suggestion of the section:

1. Assign higher confidence to the low-level probabilities that the vision system creates corresponding to the local visual regions that one wants to focus attention on (based on cues from visual preprocessing or cognitive guidance).
2. Carry out the vision system's processing using imprecise probabilities rather than single-number probabilities.
3. Wherever the vision system makes a decision based on "the most probable choice" from a number of possibilities, change the system to make a decision based on "the choice maximizing the product (expectation * confidence)".

28.6.3 Sketch of Application to DeSTIN

Internally to DeSTIN, probabilities are assigned to clusters associated with local regions of the visual input.
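The decision rule in step 3, as it might apply to a choice among competing beliefs, can be sketched as follows. The interval representation, and the particular conventions for expectation (interval midpoint) and confidence (one minus interval width), are our own illustrative choices, not the only possible ones.

```python
def choose(beliefs):
    """Pick among candidate interpretations using imprecise probabilities.

    Each belief is (label, (lo, hi)): an interval probability.  Instead of
    arg-max probability, we arg-max expectation * confidence, following
    the recipe above.
    """
    def score(interval):
        lo, hi = interval
        expectation = (lo + hi) / 2.0
        confidence = 1.0 - (hi - lo)   # narrow interval = high confidence
        return expectation * confidence
    return max(beliefs, key=lambda b: score(b[1]))[0]
```

For example, a wide interval (0.2, 0.9) with midpoint 0.55 loses to a narrow interval (0.45, 0.55) with midpoint 0.5: the high-confidence interpretation wins even though its expectation is slightly lower.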
If a system such as SIFT is run as a preprocessor to DeSTIN, then those small regions corresponding to SIFT keypoints may be assumed semantically meaningful, and the internal DeSTIN probabilities associated with them can be given a high confidence. A similar strategy may be taken if a cognitive system such as OpenCog is run together with DeSTIN, feeding DeSTIN information on which portions of a partially-processed image appear most cognitively relevant. The probabilistic calculations inside DeSTIN can be replaced with corresponding calculations involving imprecise probabilities. And critically, there is a step in DeSTIN where, among a set of beliefs about the state in each region of an image (on each of a set of hierarchical levels), the one with the highest probability is selected. In accordance with the above recipe, this step should be modified to select the belief with the highest probability * confidence.

28.6.3.1 Conceptual Justification

What is the conceptual justification for this approach? One justification is obtained by assuming that each percept has a certain probability of being erroneous, and that those percepts that appear to more closely embody the semantic meaning of the
visual scene are less likely to be erroneous. This follows conceptually from the assumption that the perceived world tends to be patterned and structured, so that being part of a statistically significant pattern is (perhaps weak) evidence of being real rather than artifactual. Under this assumption, the proposed approach will maximize the accuracy of the system's judgments.

A related justification is obtained by observing that this algorithmic approach follows from the consideration of the perceived world as mutable. Consider a vision system that has the capability to modify even the low-level percepts that it intakes - i.e. to use what it thinks and knows to modify what it sees. The human brain certainly has this potential [Cha09]. In this case, it will make sense for the system to place some constraints regarding which of its percepts it is more likely to modify. Confidence values semantically embody this - a higher confidence being sensibly assigned to percepts that should be less likely to be modified based on feedback from the system's higher (more cognitive) processing levels. In that case, a higher confidence should be given to those percepts that seem to more closely embody the semantic meaning of the visual scene - which is exactly what we're suggesting here.

28.6.3.2 Enabling Visual Attention Focusing in DeSTIN via Imprecise Probabilities

We now refer back to the mathematical formulation of DeSTIN summarized in Section 4.3.1 of Chapter 4 above, in the context of which the application of imprecise probability based attention focusing to DeSTIN is almost immediate. The probabilities P(o|s) may be assigned greater or lesser confidence depending on the assessed semantic criticality of the observation o in question.
So for instance, if one is using SIFT as a preprocessor to DeSTIN, then one may assign the probabilities P(o|s) higher confidence if they correspond to observations o of SIFT keypoints than if they do not.

These confidence levels may then be propagated throughout DeSTIN's probabilistic mathematics. For instance, if one were using Walley's interval probabilities, then one could carry out the probabilistic equations using interval arithmetic. Finally, one wishes to replace Equation 4.3.1.2 in Chapter 4 with

c = arg max_s ( (b_p(s)).strength × (b_p(s)).confidence ) ,    (28.1)

or some similar variant. The effect of this is that hypotheses based on high-confidence observations are more likely to be chosen, which of course has a large impact on the dynamics of the DeSTIN network.
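The interval-arithmetic propagation mentioned above is straightforward for the basic operations; for instance, the product of two nonnegative interval probabilities is computed endpoint-wise. A minimal sketch:

```python
def imul(p, q):
    """Multiply interval probabilities p = (p_lo, p_hi), q = (q_lo, q_hi).

    For nonnegative intervals the product interval is simply
    (p_lo * q_lo, p_hi * q_hi).  A sketch of how DeSTIN's probabilistic
    update equations could be carried over to interval arithmetic; the
    resulting confidence can be read off as 1 minus the interval width.
    """
    return (p[0] * q[0], p[1] * q[1])
```

Note that products of intervals tend to widen relative to their inputs, so confidence naturally decays as uncertain evidence is chained - which is the qualitatively desired behavior.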
Fig. 28.1: The SIFT algorithm finds keypoints in an image, i.e. localized features that are particularly useful for identifying the objects in an image. The top row shows images that are matched against the image in the middle row. The bottom-row image shows some of the keypoints used to perform the matching (i.e. these keypoints demonstrate the same features in the top-row images and their transformed middle-row counterparts). SIFT keypoints are identified via a staged filtering approach. The first stage identifies key locations in scale space by looking for locations that are maxima or minima of a difference-of-Gaussian function. Each point is used to generate a feature vector that describes the local image region sampled relative to its scale-space coordinate frame. The features achieve partial invariance to local variations, such as affine or 3D projections, by blurring image gradient locations.
Chapter 29
Bridging the Symbolic/Subsymbolic Gap

29.1 Introduction

While it's widely accepted that human beings carry out both symbolic and subsymbolic processing, as integral parts of their general intelligence, the precise definition of "symbolic" versus "subsymbolic" is a subtle issue, which different AI researchers will approach in different ways depending on their differing overall perspectives on AI. Nevertheless, the intuitive meaning of the concepts is commonly understood:

• "subsymbolic" refers to things like pattern recognition in high-dimensional quantitative sensory data, and real-time coordination of multiple actuators taking multidimensional control signals
• "symbolic" refers to things like natural language grammar and (certain or uncertain) logical reasoning, that are naturally modeled in terms of manipulation of symbolic tokens according to particular (perhaps experientially learned) rules

Views on the relationship between these two aspects of intelligence in human and artificial cognition are quite diverse, including perspectives such as:

1. Symbolic representation and reasoning are the core of human-level intelligence; subsymbolic aspects of intelligence are of secondary importance and can be thought of as pre- or post-processors to symbolic representation and reasoning
2. Subsymbolic representation and learning are the core of human intelligence; symbolic aspects of intelligence
   a. emerge from the subsymbolic aspects as needed; or,
   b. arise via a relatively simple, thin layer on top of subsymbolic intelligence, that merely applies subsymbolic intelligence in a slightly different way
3. Symbolic and subsymbolic aspects of intelligence are best considered as different subsystems, which
   a. have a significant degree of independent operation, but also need to coordinate closely together; or,
   b. operate largely separately and can be mostly considered as discrete modules
In evolutionary terms, it is clear that subsymbolic intelligence came first, and that most of the human brain is concerned with the subsymbolic intelligence that humans share with other animals. However, this observation doesn't have clear implications regarding the relationship between symbolic and subsymbolic intelligence in the context of everyday cognition.

In the history of the AI field, the symbolic/subsymbolic distinction was sometimes aligned with the dichotomy between logic-based and rule-based AI systems (on the symbolic side) and neural networks (on the subsymbolic side) [Smo88b]. However, this dichotomy has become much blurrier in the last couple decades, with developments such as neural network models of language parsing [CH16] and logical reasoning [LB10], and symbolic approaches to perception and action [SR04]. Integrative approaches have also become more common, with one of the major traditional symbolic AI systems, ACT-R, spawning a neural network version [LA93] with parallel structures and dynamics to the traditional explicitly symbolic version, and a hybridization with a computational neuroscience model [JL08]; and another one, SOAR, incorporating perception processing components as separate modules [Lai12]. The field of "neural-symbolic computing" has emerged, covering the emergence of symbolic rules from neural networks, and the hybridization of neural networks with explicitly symbolic systems [HH07].

Our goal here is not to explore the numerous deep issues involved with the symbolic/subsymbolic dichotomy, but rather to describe the details of a particular approach to symbolic/subsymbolic integration, inspired by Perspective 3a in the above list: the consideration of symbolic and subsymbolic aspects of intelligence as different subsystems, which have a significant degree of independent operation, but also need to coordinate closely together.
We believe this kind of integration can serve a key role in the quest to create human-level general intelligence. The approach presented here is at the beginning rather than the end of its practical implementation; what we are describing here is the initial design intention of a project in progress, which is sure to be revised in some respects as implementation and testing proceed. We will focus mainly on the tight integration of a subsymbolic system enabling gray-scale vision processing into a cognitive architecture with significant symbolic aspects, and will then briefly explain how the same ideas can be used for color vision, and for multi-sensory and perception-action integration.

The approach presented here begins with two separate AI systems, OpenCog (introduced in Chapter 6.3) and DeSTIN (introduced in Chapter 4.3.1), both currently implemented in open-source software. Here are the relevant features of each as they pertain to our current effort of bridging the symbolic/subsymbolic gap:

• OpenCog is centered on a "weighted, labeled hypergraph" knowledge representation called the Atomspace, and features a number of different, sophisticated cognitive algorithms acting on the Atomspace. Some of these cognitive algorithms are heavily symbolic in focus (e.g. a probabilistic logic engine); others are more subsymbolic in nature (e.g. a neural-net-like system for allocating attention and assigning credit). However, OpenCog in its current form cannot deal with high-dimensional perceptual input, nor with detailed real-time control of complex actuators. OpenCog is now being used to control intelligent characters in an experimental virtual world, where the perceptual inputs are the 3D coordinate locations of objects or small blocks, and the actions are movement commands like "step forward" or "turn head to the right."
• DeSTIN is a deep learning system consisting of a hierarchy of processing nodes, in which the nodes on higher levels correspond to larger regions of space-time, and each node carries out prediction regarding events in the space-time region to which it corresponds. Feedback and feedforward dynamics between nodes combine with the predictive activity within nodes, to create a complex nonlinear dynamical system whose state self-organizes to reflect the
state of the world being perceived. However, the specifics of DeSTIN's dynamics have been designed in what we consider a particularly powerful way, and the system has shown good results on small-scale test problems [KAR10]. So far DeSTIN has been utilized only for vision processing, but a similar proprietary system has been used for auditory data as well; and DeSTIN was designed to work together with an accompanying action hierarchy.

These two systems were not originally designed to work together, but we will describe a method for achieving their tight integration via:

1. Modifying DeSTIN in several ways, so that
   a. the patterns in its states over time will have more easily recognizable regularities
   b. its nodes are able to scan their inputs not only for simple statistical patterns (DeSTIN "centroids"), but also for patterns recognized by routines supplied to it by an external source (e.g. another AI system such as OpenCog)
2. Utilizing one of OpenCogPrime's cognitive processes (the "Fishgram" frequent subhypergraph mining algorithm) to recognize patterns in sets of DeSTIN states, and then recording these patterns in OpenCogPrime's Atomspace knowledge store
3. Utilizing OpenCogPrime's other cognitive processes to abstract concepts and draw conclusions from the patterns recognized in DeSTIN states by Fishgram
4. Exporting the concepts and conclusions thus formed to DeSTIN, so that its nodes can explicitly scan for their presence in their inputs, thus allowing the results of symbolic cognition to explicitly guide subsymbolic perception
5. Creating an action hierarchy corresponding closely to DeSTIN's perceptual hierarchy, and also corresponding to the actuators of a particular robot. This allows action learning to be done via an optimization approach ([LKP05], [YKL+04]), where the optimization algorithm uses DeSTIN states corresponding to perceived actuator states as part of its inputs.
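To make modification 1b above a bit more concrete, here is a minimal Python sketch of a node that scores its inputs against both statistical centroids and externally supplied predicate routines. The class and method names, and the "bright_corner" predicate, are our own illustrative inventions, not part of the actual DeSTIN codebase:

```python
import math

class DeSTINNode:
    """Toy sketch of a DeSTIN node extended per modification 1b: inputs are
    matched against statistical centroids AND against predicate routines
    supplied by an external system (e.g. OpenCog)."""

    def __init__(self, centroids):
        self.centroids = centroids           # list of numeric vectors
        self.external_predicates = {}        # name -> callable(input) -> bool

    def add_predicate(self, name, fn):
        # e.g. a predicate exported from OpenCog's Atomspace
        self.external_predicates[name] = fn

    def observe(self, x):
        # distance to each centroid (the classic DeSTIN belief ingredient)
        dists = [math.dist(x, c) for c in self.centroids]
        best_centroid = dists.index(min(dists))
        # also evaluate the externally supplied symbolic patterns
        fired = [n for n, p in self.external_predicates.items() if p(x)]
        return best_centroid, fired

node = DeSTINNode(centroids=[[0.0, 0.0], [1.0, 1.0]])
node.add_predicate("bright_corner", lambda x: x[0] > 0.8 and x[1] > 0.8)
print(node.observe([0.9, 0.95]))   # -> (1, ['bright_corner'])
```

The point of the sketch is only that centroid matching and predicate evaluation run side by side on the same input, so that symbolically learned patterns can participate in the node's state.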
The ideas presented here are compatible with those described in [Goe11a], but different in emphasis. That paper described a strategy for integrating OpenCog and DeSTIN via creating an intermediate "semantic CSDLN" hierarchy to translate between OpenCog and DeSTIN, in both directions. In the approach suggested here, this semantic CSDLN hierarchy exists conceptually but not as a separate software object: it exists as the combination of

• OpenCog predicates exported to DeSTIN and used alongside DeSTIN centroids, inside DeSTIN nodes
• OpenCog predicates living in the OpenCog knowledge repository (AtomSpace), and interconnected in a hierarchical way using OpenCog nodes and links (thus reflecting DeSTIN's hierarchical structure within the AtomSpace).

This hierarchical network of predicates, spanning the two software systems, plays the role of a semantic CSDLN as described in [Goe11a].

29.2 Simplified OpenCog Workflow

The dynamics inside an OpenCog system may be highly complex, defying simple flowcharting, but from the point of view of OpenCog-DeSTIN integration, one important pattern of information flow through the system is as follows:
1. Perceptions come into the Atomspace. In the current OpenCog system, these are provided via a proxy to the game engine where the OpenCog-controlled character interacts. In an OpenCog-DeSTIN hybrid, these will be provided via DeSTIN.
2. Hebbian learning builds HebbianLinks between perceptual Atoms representing percepts that have frequently co-occurred.
3. PLN inference, concept blending and other methods act on these perceptual Atoms and their HebbianLinks, forming links between them and linking them to other Atoms stored in the Atomspace reflecting prior experience and generalizations therefrom.
4. Attention allocation gives higher short- and long-term importance values to those Atoms that appear likely to be useful based on the links they have obtained.
5. Based on the system's current goals and subgoals (the latter learned from the top-level goals using PLN), and the goal-related links in the Atomspace, the OpenPsi mechanism triggers the PLN-based planner, which chooses a series of high-level actions that are judged likely to help the system achieve its goals in the current context.
6. The chosen high-level actions are transformed into series of lower-level, directly executable actions. In the current OpenCog system, this is done by a set of hand-coded rules based on the specific mechanics of the game engine where the OpenCog-controlled character interacts. In an OpenCog-DeSTIN hybrid, the lower-level action sequence will be chosen by an optimization method acting based on the motor control and perceptual hierarchies.

This pattern of information flow omits numerous aspects of OpenCog cognitive dynamics, but gives the key parts of the picture in terms of the interaction of OpenCog cognition with perception and action.
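As a toy illustration of Step 2 above, co-occurrence-based HebbianLink formation can be sketched as follows. This is a deliberate simplification of our own devising: OpenCog's actual Hebbian updating is driven by attention values rather than raw co-occurrence counts.

```python
from collections import defaultdict
from itertools import combinations

def build_hebbian_links(percept_sets, min_count=2):
    """Count how often pairs of perceptual Atoms co-occur in the same
    perceptual moment, and form a (toy) HebbianLink for every pair that
    co-occurs at least min_count times."""
    counts = defaultdict(int)
    for percepts in percept_sets:
        for a, b in combinations(sorted(percepts), 2):   # canonical pair order
            counts[(a, b)] += 1
    return {pair for pair, n in counts.items() if n >= min_count}

# three perceptual moments, each a set of percept Atoms
moments = [{"red", "ball", "bounce"}, {"red", "ball"}, {"ball", "bounce"}]
print(build_hebbian_links(moments))
```

Here "ball" co-occurs twice with both "red" and "bounce", so those two pairs get links, while the pair ("bounce", "red") occurs only once and does not.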
Most of the other aspects of the dynamics have to do with the interaction of multiple cognitive processes acting on the Atomspace, and the interaction between the Atomspace and several associated specialized memory stores, dealing with procedural, episodic, temporal and spatial aspects of knowledge. From the present point of view, these additional aspects may be viewed as part of Step 3 above, wrapped up in the phrase "and other methods act on these perceptual Atoms." However, it's worth noting that in order to act appropriately on perceptual Atoms, a lot of background cognition regarding more abstract conceptual Atoms (often generalized from previous perceptual Atoms) may be drawn on. This background inference incorporates both symbolic and subsymbolic aspects, but goes beyond the scope of the present discussion, as its particulars do not impinge on the particulars of DeSTIN-OpenCog integration.

OpenCog also possesses a specialized facility for natural language comprehension and generation [LCE10, Goe10b], which may be viewed as a parallel perception/action pathway, bypassing traditional human-like sense perception and dealing with text directly. Integrating OpenCogPrime's current linguistics processes with DeSTIN-based auditory and visual processing is a deep and important topic, but one we will bypass here, for the sake of brevity and because it's not our current research priority.

29.3 Integrating DeSTIN and OpenCog

The integration of DeSTIN and OpenCog involves two key aspects:

• recognition of patterns in sets of DeSTIN states, and exportation of these patterns into the OpenCog Atomspace
• use of OpenCog-created concepts within DeSTIN nodes, alongside statistically-derived "centroids"

From here on, unless specified otherwise, when we mention "DeSTIN" we will refer to "Uniform DeSTIN" as presented in Chapter 28, an extension of "classic DeSTIN" as defined in [AK09].

29.3.1 Mining Patterns from DeSTIN States

The first step toward using OpenCog tools to mine patterns from sets of DeSTIN states is to represent these states in Atom form in an appropriate way. A simple but workable approach, restricting attention for the moment to purely spatial patterns, is to use the six predicates:

• hasCentroid(node N, int k)
• hasParentCentroid(node N, int k)
• hasNorthNeighborCentroid(node N, int k)
• hasSouthNeighborCentroid(node N, int k)
• hasEastNeighborCentroid(node N, int k)
• hasWestNeighborCentroid(node N, int k)

For instance, hasNorthNeighborCentroid(N, 3) means that N's north neighbor has centroid #3. One may also consider the predicates

• hasParent(node N, Node M)
• hasNorthNeighbor(node N, Node M)
• hasSouthNeighbor(node N, Node M)
• hasEastNeighbor(node N, Node M)
• hasWestNeighbor(node N, Node M)

Now suppose we have a stored set of DeSTIN states, saved from the application of DeSTIN to multiple different inputs. What we want to find are predicates P that are conjunctions of instances of the above 10 predicates, which occur frequently in the stored set of DeSTIN states. A simple example of such a predicate would be the conjunction of

• hasNorthNeighbor($N, $M)
• hasParentCentroid($N, 5)
• hasParentCentroid($M, 5)
• hasNorthNeighborCentroid($N, 6)
• hasWestNeighborCentroid($M, 4)

This predicate could be evaluated at any pair of nodes ($N, $M) on the same DeSTIN level. If it is true for atypically many of these pairs, then it's a "frequent pattern", and should be detected and stored.
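The evaluation-and-counting step just described can be sketched in Python as follows, using a single-node pattern for simplicity rather than the two-variable ($N, $M) conjunction. We assume, purely for illustration, that a DeSTIN layer is represented as a 2D grid of centroid ids; the function names are our own, not DeSTIN's:

```python
def neighbor_centroid(grid, r, c, direction):
    """Centroid id of a node's neighbor in a grid-shaped DeSTIN layer,
    or None at the border. grid[r][c] holds node (r, c)'s centroid id."""
    dr, dc = {"north": (-1, 0), "south": (1, 0),
              "east": (0, 1), "west": (0, -1)}[direction]
    nr, nc = r + dr, c + dc
    if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]):
        return grid[nr][nc]
    return None

def pattern_support(states, pattern):
    """Count (state, node) instances at which every predicate in the
    conjunction holds; high support marks a 'frequent pattern'."""
    hits = 0
    for grid in states:
        for r in range(len(grid)):
            for c in range(len(grid[0])):
                if all(p(grid, r, c) for p in pattern):
                    hits += 1
    return hits

# toy conjunction: "my centroid is 3 and my north neighbor's centroid is 6"
pattern = [lambda g, r, c: g[r][c] == 3,
           lambda g, r, c: neighbor_centroid(g, r, c, "north") == 6]
states = [[[6, 1], [3, 3]], [[6, 6], [3, 3]]]
print(pattern_support(states, pattern))   # -> 3
```

The real setting quantifies over node pairs and many more predicates, but the support-counting logic is the same.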
OpenCogPrime's pattern mining component, Fishgram, exists precisely for the purpose of mining this sort of conjunction from sets of relationships that are stored in the Atomspace. It may be applied to this problem as follows:

• Translate each DeSTIN state into a set of relationships drawn from: hasNorthNeighbor, hasSouthNeighbor, hasEastNeighbor, hasWestNeighbor, hasCentroid, hasParent
• Import these relationships, describing each DeSTIN state, into the OpenCog Atomspace
• Run pattern mining on this Atomspace.

29.3.2 Probabilistic Inference on Mined Hypergraphs

Patterns mined from DeSTIN states can then be reasoned on by OpenCogPrime's PLN inference engine, allowing analogy and generalization.

Suppose centroids 5 and 617 are estimated to be similar - either via DeSTIN's built-in similarity metric, or, more interestingly, via OpenCog inference on the Atom representations of these centroids. As an example of the latter, consider: 5 could represent a person's nose and 617 could represent a rabbit's nose. In this case, DeSTIN might not judge the two centroids particularly similar on a purely visual level, but OpenCog may know that the images corresponding to both of these centroids are called "noses" (e.g. perhaps via noticing people indicate these images in association with the word "nose"), and may thus infer (using a simple chain of PLN inferences) that these centroids seem probabilistically similar. If 5 and 617 are estimated to be similar, then a predicate like

ANDLink
  EvaluationLink
    hasNorthNeighbor
    ListLink $N $M
  EvaluationLink
    hasParentCentroid
    ListLink $N 5
  EvaluationLink
    hasParentCentroid
    ListLink $M 5
  EvaluationLink
    hasNorthNeighborCentroid
    ListLink $N 6
  EvaluationLink
    hasWestNeighborCentroid
    ListLink $M 4

mined from DeSTIN states, could be extended via PLN analogical reasoning to

ANDLink
  EvaluationLink
    hasNorthNeighbor
    ListLink $N $M
  EvaluationLink
    hasParentCentroid
    ListLink $N 617
  EvaluationLink
    hasParentCentroid
    ListLink $M 617
  EvaluationLink
    hasNorthNeighborCentroid
    ListLink $N 6
  EvaluationLink
    hasWestNeighborCentroid
    ListLink $M 4

29.3.3 Insertion of OpenCog-Learned Predicates into DeSTIN's Pattern Library

Suppose one has used Fishgram, as described in the earlier part of this chapter, to recognize predicates embodying frequent or surprising patterns in a set of DeSTIN states or state-sequences. The next natural step is to add these frequent or surprising patterns to DeSTIN's pattern library, so that the pattern library contains not only classic DeSTIN centroids, but also these corresponding "image grammar" style patterns. Then, when a new input comes into a DeSTIN node, in addition to being compared to the centroids at the node, it can be fed as input to the predicates associated with the node.

What is the advantage of this approach, compared to DeSTIN without these predicates? The capability for more compact representation of a variety of spatial patterns. In many cases, a spatial pattern that would require a large number of DeSTIN centroids to represent can be represented by a single, fairly compact predicate.

It is an open question whether these sorts of predicates are really critical for human-like vision processing. However, our intuition is that they do have a role in human as well as machine vision. In essence, DeSTIN is based on a fancy version of nearest-neighbor search, applied in a clever way on multiple levels of a hierarchy, using context-savvy probabilities to bias the matching. But we suspect there are many visual patterns that are more compactly and intuitively represented using a more flexible language, such as OpenCog predicates formed by combining elementary predicates involving appropriate spatial and temporal relations.
For example, consider the archetypal spatial pattern of a face as: either two eyes that are next to each other, or sunglasses, above a nose, which is in turn above a mouth. (This is an oversimplified toy example, but we're positing it for illustration only. The same point applies to more complex and realistic patterns.) One could represent this in OpenCogPrime's Atom language as something like:

AND
  InheritanceLink N B_nose
  InheritanceLink M B_mouth
  EvaluationLink
    above
    ListLink E N
  EvaluationLink
    above
    ListLink N M
  OR
    AND
      MemberLink E1 E
      MemberLink E2 E
      EvaluationLink
        next_to
        ListLink E1 E2
      InheritanceLink E1 B_eye
    AND
      InheritanceLink E B_sunglasses

where e.g. B_eye is a DeSTIN belief that corresponds roughly to recognition of the spatial pattern of a human eye. To represent this using ordinary DeSTIN centroids, one couldn't represent the OR explicitly; instead one would need to split it into two different sets of centroids, corresponding to the eye case and the sunglasses case - unless the DeSTIN pattern library contained a belief corresponding to "eyes or sunglasses." But the question then becomes: how would classic DeSTIN actually learn a belief like this? In the suggested architecture, pattern mining on the database of DeSTIN states is proposed as an algorithm for learning such beliefs.

This sort of predicate-enhanced DeSTIN will have advantages over the traditional version only if the actual distribution of images observed by the system contains many (reasonably high probability) images modeled accurately by predicates involving disjunctions and/or negations as well as conjunctions. If the system's perceived world is simpler than this, then good old DeSTIN will work just as well, and the OpenCog-learned predicates are a needless complication.

Without these sorts of predicates, how might DeSTIN be extended to include beliefs like "eyes or sunglasses"? One way would be to couple DeSTIN with a reinforcement learning subsystem that reinforced the creation of beliefs that were useful for the system as a whole. If reasoning in terms of faces (independent of whether they have eyes or sunglasses) got the system reward, presumably it could learn to form the concept "eyes or sunglasses." We believe this would also be a workable approach, but that given the strengths and weaknesses of contemporary computer hardware, the proposed DeSTIN-OpenCog approach will prove considerably simpler and more effective.
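The disjunctive face predicate above can be rendered as a runnable toy in Python. The region representation, integer coordinates, and the above/next_to geometry are simplifying assumptions of ours, not part of any OpenCog or DeSTIN API; the point is only that the OR over "two adjacent eyes" and "sunglasses" is expressed directly, rather than split into two centroid sets:

```python
def above(a, b):     # a's region is higher in the image than b's (toy geometry)
    return a["y"] < b["y"]

def next_to(a, b):   # horizontally adjacent regions on the same row
    return a["y"] == b["y"] and abs(a["x"] - b["x"]) == 1

def face_pattern(regions):
    """Toy evaluation of the face predicate: (two adjacent eyes OR sunglasses)
    above a nose, which is above a mouth. 'regions' maps DeSTIN-style beliefs
    (B_eye, B_sunglasses, ...) to lists of detected regions."""
    eyes, shades = regions.get("B_eye", []), regions.get("B_sunglasses", [])
    # the OR branch: any adjacent eye pair, or any sunglasses region
    upper = [e1 for e1 in eyes for e2 in eyes if e1 is not e2 and next_to(e1, e2)]
    upper += shades
    return any(above(u, n) and above(n, m)
               for u in upper
               for n in regions.get("B_nose", [])
               for m in regions.get("B_mouth", []))

face = {"B_sunglasses": [{"x": 1, "y": 0}],
        "B_nose": [{"x": 1, "y": 1}],
        "B_mouth": [{"x": 1, "y": 2}]}
print(face_pattern(face))   # -> True: the OR is satisfied via the sunglasses branch
```

Swapping the sunglasses region for two adjacent B_eye regions satisfies the same predicate via the other disjunct, which is exactly the flexibility that plain centroid matching lacks.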
29.4 Multisensory Integration, and Perception-Action Integration

In Chapter 28.2.1 we briefly indicated how DeSTIN could be extended beyond vision to handle other senses such as audition and touch. If one had multiple perception hierarchies corresponding to multiple senses, the easiest way to integrate them within an OpenCog context would be to use OpenCog as the communication nexus: representing DeSTIN centroids in the various modality-specific hierarchies as OpenCog Atoms (PerceptualCentroidNodes), and building HebbianLinks in OpenCogPrime's Atomspace between these PerceptualCentroidNodes as appropriate based on their association. So for instance the sound of a person's footsteps would correspond to a certain belief (probability distribution over centroids) in the auditory DeSTIN network, and the sight of a person's feet stepping would correspond to a certain belief (probability distribution over centroids) in the visual DeSTIN network; and the OpenCog
Atomspace would contain links between the sets of centroids assigned high weights in these two belief distributions. Importance spreading between these various PerceptualCentroidNodes would cause a dynamic wherein seeing feet stepping would bias the system to think it was hearing footsteps, and hearing footsteps would bias it to think it was seeing feet stepping.

And, suppose there are similarities between the belief distributions for the visual appearance of dogs and the visual appearance of cats. Via the intermediary of the Atomspace, this would bias the auditory and haptic DeSTIN hierarchies to assume a similarity between the auditory and haptic characteristics of dogs and the analogous characteristics of cats. This is because PLN analogical reasoning would extrapolate from, e.g.,

• HebbianLinks joining cat-related visual PerceptualCentroidNodes and dog-related visual PerceptualCentroidNodes
• HebbianLinks joining cat-related visual PerceptualCentroidNodes to cat-related haptic PerceptualCentroidNodes; and others joining dog-related visual PerceptualCentroidNodes to dog-related haptic PerceptualCentroidNodes

to yield HebbianLinks joining cat-related haptic PerceptualCentroidNodes and dog-related haptic PerceptualCentroidNodes. This sort of reasoning would then cause the system to, for example, upon touching a cat, vaguely expect to maybe hear dog-like things. This sort of simple analogical reasoning will be right sometimes and wrong sometimes - a cat walking sounds a fair bit like a dog walking, and cat and dog growls sound fairly similar, but a cat meowing doesn't sound that much like a dog barking. More refined inferences of the same basic sort may be used to get the details right as the system explores and understands the world more accurately.
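The cross-modal biasing dynamic described above can be sketched as a single step of importance spreading along HebbianLinks. The node names, the link-weight table, and the spreading rate are illustrative assumptions of ours; OpenCog's actual economic attention allocation is considerably more elaborate:

```python
def spread_importance(weights, active, rate=0.5):
    """One toy step of importance spreading: activity on
    PerceptualCentroidNodes in one modality biases nodes linked to them
    in another modality. weights maps (node_a, node_b) -> link strength,
    treated as symmetric; active maps node -> current activation."""
    bias = {}
    for (a, b), w in weights.items():
        if a in active:
            bias[b] = bias.get(b, 0.0) + rate * w * active[a]
        if b in active:
            bias[a] = bias.get(a, 0.0) + rate * w * active[b]
    return bias

links = {("visual:feet_stepping", "audio:footsteps"): 0.8,
         ("visual:feet_stepping", "audio:dog_bark"): 0.1}
seen = {"visual:feet_stepping": 1.0}   # the agent sees feet stepping
print(spread_importance(links, seen))  # audio:footsteps receives the largest bias
```

Seeing feet stepping thus biases the auditory hierarchy most strongly toward footstep-related centroids, as in the example in the text.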
29.4.1 Perception-Action Integration

While experimentation with DeSTIN has so far been restricted to perception processing, the system was designed from the beginning with robotics applications in mind, involving integration of perception with action and reinforcement learning. As OpenCog already handles reinforcement learning on a high level (via OpenPsi), our approach to robot control using DeSTIN and OpenCog involves creating a control hierarchy parallel to DeSTIN's perceptual hierarchy, and doing motor learning using optimization algorithms guided by reinforcement signals delivered from OpenPsi and incorporating DeSTIN perceptual states as part of their input information.

Our initial research goal, where action is concerned, is not to equal the best purely control-theoretic algorithms at fine-grained control of robots carrying out specialized tasks, but rather to achieve basic perception/control/cognition integration in the rough manner of a young human child. A two-year-old child is not particularly well coordinated, but is capable of coordinating actions involving multiple body parts using an integration of perception and action with unconscious and deliberative reasoning. Current robots, in some cases, can carry out specialized actions with great accuracy, but they lack this sort of integration, and thus generally have difficulty effectively carrying out actions in unforeseen environments and circumstances.

We will create an action hierarchy with nodes corresponding to different parts of the robot body, where e.g. the node corresponding to an arm would have child nodes corresponding to a shoulder, elbow, wrist and hand; and the node corresponding to a hand would have child nodes corresponding to the fingers of the hand; etc. Physical self-perception is then achieved
by creating a DeSTIN "action-perception" hierarchy with nodes corresponding to the states of body components. In the simplest case this means the lowest-level nodes will correspond to individual servomotors, and their inputs will be numerical vectors characterizing servomotor states. If one is dealing with a robot endowed with haptic technology, e.g. SynTouch fingertips, then numerical vectors characterizing haptic inputs may be used alongside these.

The configuration space of an action-perception node, corresponding to the degrees of freedom of the servomotors of the body part the node represents, may be approximated by a set of "centroid" vectors. When an action is learned by the optimization method used for this purpose, this involves movements of the servomotors corresponding to many different nodes, and thus creates a series of "configuration vectors" in each node. These configuration vector series may be subjected to online clustering, similar to percepts in a DeSTIN perceptual hierarchy. The result is a library of "codewords", corresponding to discrete trajectories of movement, associated with each node. The libraries may be shared by identical body parts (e.g. shared among legs, shared among fingers), but will be distinct otherwise. Each coordinated whole-body action thus results in a series of (node, centroid) pairs, which may be mined for patterns, similarly to the perception case.

The set of predicates needed to characterize states in this action-perception hierarchy is simpler than the one described for visual perception above; here one requires only

• hasCentroid(node N, int k)
• hasParentCentroid(node N, int k)
• hasParent(node N, Node M)
• hasSibling(node N, Node M)

and most of the patterns will involve specific nodes rather than node variables. The different nodes in a DeSTIN vision hierarchy are more interchangeable (in terms of their involvement in various patterns) than, say, a leg and a finger.
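The online clustering of configuration vectors into codewords can be sketched with a simple leader-style clustering routine. The radius threshold and running-mean update are our own illustrative choices, standing in for whatever online clustering scheme a real action-perception node would use:

```python
import math

def online_codewords(vectors, radius=0.5):
    """Toy online clustering for an action-perception node: each incoming
    servo configuration vector either joins (and shifts, via a running mean)
    the nearest codeword within 'radius', or founds a new codeword."""
    codewords, counts = [], []
    for v in vectors:
        if codewords:
            d = [math.dist(v, c) for c in codewords]
            i = d.index(min(d))
            if d[i] <= radius:
                counts[i] += 1
                # running-mean update of the matched codeword
                codewords[i] = [c + (x - c) / counts[i]
                                for c, x in zip(codewords[i], v)]
                continue
        codewords.append(list(v))   # found a new codeword
        counts.append(1)
    return codewords

# a servo trajectory with two clearly separated configuration regimes
trajectory = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]]
print(online_codewords(trajectory))   # two codewords, near [0.05, 0.0] and [0.95, 1.05]
```

Each node's resulting codeword library then plays the role of its "centroid" set, so that whole-body actions reduce to series of (node, codeword) pairs mineable by the same machinery as the perception case.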
In a pure DeSTIN implementation, the visual and action-perception hierarchies would be directly linked. In the context of OpenCog integration, it is simplest to link the two via OpenCog, in a sense using cognition as a bridge between action and perception. It is unclear whether this strategy will be sufficient in the long run, but we believe it will be more than adequate for experimentation with robotic perceptual-motor coordination in a variety of everyday tasks.

OpenCogPrime's Hebbian learning process can be used to find common associations between action-perception states and visual-perception states, via mining a data store containing time-stamped state records from both hierarchies. Importance spreading along the HebbianLinks learned in this way can then be used to bias the weights in the belief states of the nodes in both hierarchies. So, for example, the action-perception patterns related to clenching the fist would be Hebbianly correlated with the visual-perception patterns related to seeing a clenched fist. When a clenched fist was perceived via servomotor data, importance spreading would increase the weighting of visual patterns corresponding to clenched fists, within the visual hierarchy. When a clenched fist was perceived via visual data, importance spreading would increase the weighting of servomotor data patterns corresponding to clenched fists, within the action-perception hierarchy.






