Introduction

Logic-based methods have been a keystone of Artificial intelligence since its very beginnings. Deriving their origin in John McCarthy’s 1959 seminal paper “Programs with Common Sense”, they have grown into a general-purpose tool for complex reasoning. A modern example of their success is the Cyc Knowledge Base, the result of an 18 year effort to encode common sense knowledge into a logical form. As the state of the art in logic-based techniques advances, collections of logical facts and inference rules expand to the point where they can not fit into a computer’s memory and must be stored on the hard drive. This has significant reprecussions for knowledge bases, the programs used to store and manipulate the logical expressions, because they must manipulate data stored on a slow medium (about 2000 times slower than memory access). If logic-based methods are to scale to large information domains, knowledge bases must deal explicitly with the slowdowns imposed by hard drive access.

Background

The difficulty in optimizing knowledge bases to hard drive access lies in the size and the complexity of the data they store. A logical expression is a tree structure composed of nested lists, with predicates and variables for leaves. For example, (isa John Human) is an expression stating that John is Human, while (GDP (USA 6.2) (CAN 1.3) (FRA 2.4)) is an expression listing the Gross Domestic Product of several nations. Logical expressions can be further nested, allowing for large structures. Variables may be used as follows: (implies (isa ?x Human) (isa ?x Mammal)) says that one being a Human implies that they are a mammal and (follows (anniversary ?x ?y) (wedding ?x ?y)) says that a couple’s anniversary comes after their wedding.

Knowledge Bases work by allowing the user to query them to see what information can be derived from the expressions stored inside. Typically, a logical statement is proposed and the knowledge base checks its contents to see if a proof can be constructed supporting the statement. This complex process typically requires the knowledge base to perform many complex queries on its expressions and is the step that will experience significant slowdowns when forced to access the hard drive to retrieve the expressions. My work on the Grex Knowledge Base optimizes the way in which a knowledge base queries its contents, attempting to ensure that while processing a query it only needs to touch the expressions that are results of the query.

Method

A logical expression can be viewed as a set of features. For example, the expression (implies (isa ?x Human) (isa ?x Mammal)) has 10 features. First, it is a list of 3 elements. The first element is the atomic proposition “implies” while the second and third are both 3 element lists. There are also features corresponding to the elements in the two sublists. For any given expression Grex examines its features upto depth 5, looking until the 7^th element in any given sublist, for a total of approximately 30,000 features with which to describe an expression. Expressions are stored and queried according to which features they have. For example, if we store in the same file on the disk all expressions of length 3 whose first element is “isa” and whose third element is “Human”, the query (isa ?x Human) will only have to read its matches from this file alone, giving us 100% efficient hard drive access. However, the query (isa John ?x) (ie. what categories does John fall under) will cause us to read multiple files, containing many extraneous expressions. Clearly, in organizing a knowledge base that contains a great deal of data, it is vital to know the kinds of queries that will be performed upon it.

In order to achieve near-optimal retrieval efficiency without resorting to expensive techniques based on exhaustive search, the Grex Knowledge Base uses the Decision Trie algorithm to store its expressions. Borrowing ideas from both decision trees and retrieval tries, we developed Decision Tries specifically for Grex to work by hierarchically dividing their contents in response to new expressions being inserted, optimizing with respect to the kinds of queries that may be performed on them. Given a list of sample queries, a decision trie ensures that its expressions will be stored in a way which minimizes the number of extraneous expressions accessed by a given query without going to the extreme of storing each expression in a separate file (which introduces other overheads).

A decision trie starts out storing no expressions but having a set of potential queries with respect to which it will optimize itself. As expressions start coming into the Knowledge Base, all are put into the same file until this file has too many expressions (say a cap of 20). Now we pick the expression feature that the majority of our sample queries use to pick out their matching expressions and split the expressions we have according to that feature. (ie. if the feature was the length of an expression’s top-level list then we would put all expressions of the same length into the same file together). Thus, when the majority of the sample queries are actually performed, we will look at the files that directly match each query, resulting in many fewer disk accesses. The algorithm proceeds by adding more and more expressions and hierarchically subdividing as necessary. The resulting performance is much better than that attained by non-hard drive optimized knowledge base implementations without incurring significant overheads.