Datalog User Manual


Next: 

This manual documents Datalog version 1.6, a lightweight deductive database system.

Copyright © 2004 The MITRE Corporation.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. The terms are contained in the file COPYING.LIB in the source distribution of the software, or at http://www.gnu.org/licenses/lgpl.txt.

Contents

The software described in this manual was written by John D. Ramsdell of The MITRE Corporation. The work was supported by the MITRE-Sponsored Research Program.


Next: , Previous: Top, Up: Top

1 Introduction

The Datalog package contains a lightweight deductive database system. Queries and database updates are expressed using Datalog—a declarative logic language in which each formula is a function-free Horn clause, and every variable in the head of a clause must appear in the body of the clause. The use of Datalog syntax and an implementation based on tabling intermediate results, ensures that all queries terminate.

The components in this package are designed to be small, and usable on memory constrained devices. The package includes an interactive interpreter for Datalog, and a library that can be employed to embed a small deductive database into C programs. The library uses the tabled logic programming algorithm described in “Efficient Top-Down Computation of Queries under the Well-Founded Semantics”, Chen, W., Swift, T., and Warren, D. S., J. Logic Prog., Vol. 24, No. 3, pp. 161-199. Another important reference is “Tabled Evaluation with Delaying for General Logic Programs”, Chen, W., and Warren, D. S., J. ACM, Vol. 43, No. 1, Jan. 1996, pp. 20-74. Datalog is described in “What You Always Wanted to Know About Datalog (And Never Dared to Ask)”, Stefano Ceri, Georg Gottlob, and Letizia Tanca, IEEE Transactions of Knowledge and Data Engineering, Vol. 1, No. 1, March 1989.


Next: , Previous: Introduction, Up: Top

2 Tutorial

Datalog is introduced by showing a sequence of interactions with a Datalog interpreter. The interpreter is started with the datalog command.

     $ datalog
     Datalog 1.6
     
     >

Facts are stored in tables. If the name of the table is ‘parent’, and ‘john’ is the parent of ‘douglas’, store the fact in the database with this:

     > parent(john, douglas).

Each item in the parenthesized list following the name of the table is called a term. A term can be either a logical variable or a constant. Thus far, all the terms shown have been constant terms.

A query can be used to see if a particular row is in a table. Type this to see if ‘john’ is the parent of ‘douglas’:

     > parent(john, douglas)?
     parent(john, douglas).

Type this to see if ‘john’ is the parent of ‘ebbon’:

     > parent(john, ebbon)?

The query produced no results because ‘john’ is not the parent of ‘ebbon’. Let's add more rows.

     > parent(bob, john).
     > parent(ebbon, bob).

Type the following to list all rows in the ‘parent’ table:

     > parent(A, B)?
     parent(john, douglas).
     parent(bob, john).
     parent(ebbon, bob).

Type the following to list all the children of ‘john’:

     > parent(john, B)?
     parent(john, douglas).

A term that begins with a capital letter is a logical variable. When producing a set of answers, the Datalog interpreter lists all rows that match the query when each variable in the query is substituted for a constant. The following example produces no answers, as there are no substitutions for the variable ‘A’ that produce a fact in the database. This is because no one is the parent of oneself.

     > parent(A, A)?

A deductive database can use rules of inference to derive new facts. Consider the following rule:

     > ancestor(A, B) :- parent(A, B).

The rule says that if A is the parent of B, then A is an ancestor of B. The other rule defining an ancestor says that if A is the parent of C, C is an ancestor of B, then A is an ancestor of B.

     > ancestor(A, B) :- \
     >>   parent(A, C),  \
     >>   ancestor(C, B).

In the interpreter, a line can be continued by ending it with the backslash character.

Rules are used to answer queries just as is done for facts.

     > ancestor(A, B)?
     ancestor(ebbon, bob).
     ancestor(ebbon, douglas).
     ancestor(bob, john).
     ancestor(john, douglas).
     ancestor(ebbon, john).
     ancestor(bob, douglas).
     > ancestor(X,john)?
     ancestor(bob, john).
     ancestor(ebbon, john).

A fact or a rule can be retracted from the database using tilde syntax:

     > parent(bob, john)~
     > parent(A,B)?
     parent(john, douglas).
     parent(ebbon, bob).
     > ancestor(A,B)?
     ancestor(john, douglas).
     ancestor(ebbon, bob).

Unlike Prolog, the order in which clauses are asserted is irrelevant. All queries terminate, and every possible answer is derived.

     > q(X) :- p(X).
     > q(a).
     > p(X) :- q(X).
     > q(X)?
     q(a).


Next: , Previous: Tutorial, Up: Top

3 Interpreter

The datalog program provides access to a deductive database in both interactive or batch mode. In batch mode, the program reads a Datalog program, prints the answers to its query, and then exits. In interactive mode, the program treats each line of text it reads as a Datalog program, prints the answer to its query, and then prompts for more input.

The datalog interpreter accepts the following options:

     $ datalog -h
     Usage: datalog [options] [file]
     Options:
       -o file -- output to file (default is standard output)
       -i      -- enter interactive mode after loading file
       -v      -- print version information
       -h      -- print this message
     Use - as a file name to specify standard input

When in interactive mode, if the first character of a line of input is the equals sign, the remaining text on the line is used as the name of a file that is loaded into the interpreter. An incomplete line of text can be continued by ending the line with the backslash character.


Next: , Previous: Interpreter, Up: Top

4 Primitive Predicates

The interpreter provides an equality predicate with infix notation. Two terms are equal if both terms are the same constant, or one is a constant and the other is a variable.

     > 1 = 2?
     > 1 = 1?
     1 = 1.
     > X = 1?
     1 = 1.
     > X = X?


Next: , Previous: Primitives, Up: Top

5 Syntax

In Datalog input, whitespace characters are ignored except when they separate adjacent tokens or when they occur in strings. Comments are also considered to be whitespace. The character ‘%’ introduces a comment, which extends to the next line break. Comments do not occur inside strings.

The characters in Datalog input are collected into tokens according to the rules that follow. There are four classes of tokens: punctuations, variables, identifiers, and strings. The punctuation tokens are: ‘(’, ‘,’, ‘)’, ‘=’, ‘:-’, ‘.’, ‘~’, ‘?’, and ‘"’.

A variable is a sequence of Latin capital and small letters, digits, and the underscore character. A variable must begin with a Latin capital letter.

An identifier is a sequence of printing characters that does not contain any of the following characters: ‘(’, ‘,’, ‘)’, ‘=’, ‘:’, ‘.’, ‘~’, ‘?’, ‘"’, ‘%’, and space. An identifier must not begin with a Latin capital letter. Note that the characters that start punctuation are forbidden in identifiers, but the hyphen character is allowed.

A string is a sequence of characters enclosed in double quotes. Characters other than double quote, newline, and backslash may be directly included in a string. The remaining characters may be specified using escape characters, ‘\"’, ‘\n’, and ‘\\’ respectively.

Other escape characters can be used to improve the readability of the input. If a string is too long to fit conveniently on one line, all but the final line containing the string can be ended with a backslash character, and each backslash newline pair is ignored. The character escape codes from the C programming language are allowed—‘\a’, ‘\b’, ‘\f’, ‘\n’, ‘\r’, ‘\t’, ‘\v’, ‘\'’, and ‘\?’. The numeric escape codes consist of exactly two uppercase hex digits. Thus the ASCII character newline is ‘\0A’, and zero is ‘\00’.

5.1 Literals

A literal, is a predicate symbol followed by an optional parenthesized list of comma separated terms. A predicate symbol is either an identifier or a string. A term is either a variable or a constant. As with predicate symbols, a constant is either an identifier or a string. The following are literals:

     parent(john, douglas)
     zero-arity-literal
     aBcD(-0, "\n\FF")
     "="(3,3)
     ""(-0-0-0,&&&,***,"\00")

5.2 Clauses

A clause is a head literal followed by an optional body. A body is a comma separated list of literals. A clause without a body is called a fact, and a rule when it has one. The punctuation ‘:-’ separates the head of a rule from its body. A clause is safe if every variable in its head occurs in some literal in its body. The following are safe clauses:

     parent(john, douglas)
     ancestor(A, B) :-
         parent(A, B)
     ancestor(A, B) :-
         parent(A, C),
         ancestor(C, B)

5.3 Programs

A Datalog reader consumes a Datalog program. A program is a sequence of zero or more statements, followed by an optional query. A statement is an assertion or a retraction. An assertion is a clause followed by a period, and it adds the clause to the database if it is safe. A retraction is a clause followed by a tilde, and it removes the clause from the database. A query is a literal followed by a question mark. The effect of reading a Datalog program is to modify the database as directed by its statements, and then to return the literal designated as the query. If no query is specified, a reader returns a literal know to have no answers. The following is a program:

     edge(a, b). edge(b, c). edge(c, d). edge(d, a).
     path(X, Y) :- edge(X, Y).
     path(X, Y) :- edge(X, Z), path(Z, Y).
     path(X, Y)?


Next: , Previous: Syntax, Up: Top

6 Library

The Datalog library makes the services provided by a Datalog interpreter available to C programs as a set of C functions and types. The interface is declared in the header file datalog.h.

The state of a Datalog interpreter is maintained by a dynamically allocated structure. A pointer to this structure is passed as the first argument to every C function in the library, except to dl_open, which allocates the structure, and returns it as a pointer of type dl_db_t.

     dl_db_t dl_open(void);

If an error occurs, dl_open returns the null pointer. To release the resources associated with a Datalog interpreter, call dl_close.

     void dl_close(dl_db_t);

The package name followed by the version number of the library is returned by dl_version.

     const char* dl_version(void);

6.1 Loading

The loader reads a Datalog program. A program is a sequence of zero or more assertions or retractions, followed by an optional query expressed as a literal. While reading, the loader performs the assertions and retractions requested by the program. It finishes by pushing the query literal on the Datalog interpreter's stack. If no query is specified, it pushes a literal known to have no answers.

The loader obtains buffers containing characters by calling a function given to it of type dl_reader_t.

     typedef const char *(*dl_reader_t)(void *data, size_t *size);

When called by the loader, the reader is given the data supplied to the loader as its first argument. If there is no more input available, the reader returns a null pointer, otherwise it assigns the size of the buffer to the size parameter, and then returns a pointer to the buffer.

The loader reports errors by calling a function given to it of type dl_loaderror_t.

     typedef void (*dl_loaderror_t)(void *data, int lineno,
                                    int colno, const char *msg);

When a loader error occurs, the loader calls this function with the data supplied to the loader, the location of the error as a line number and column number, and an error message.

     int dl_load(dl_db_t database, dl_reader_t reader,
                 dl_loaderror_t loaderror, void *data);

This function loads a Datalog program. It uses a reader to obtain a sequence of buffers that contain the program. The load error reporter is called when the loader detects an error, and then a non-zero value is returned, otherwise, the loader returns zero. The value in the data parameter is passed to both of the supplied functions whenever they are called.

Upon success, the loader leaves a literal on the Datalog interpreter's stack. To ignore this literal, call dl_pop.

     int dl_pop(dl_db_t db);

The dl_loadbuffer function loads a Datalog program contained in one buffer.

     int dl_loadbuffer(dl_db_t database, const char *buffer,
                       size_t size, dl_loaderror_t loaderror);

6.2 Querying

The literal left on the stack by the loader can be used to query the database.

     int dl_ask(dl_db_t db, dl_answers_t *a);

This function asks for a list that contains all ground instances of a literal that are a logical consequence of the clauses stored in the database. An instance of a literal is ground, if it is the result of substituting some constant for each variable that occurs in the literal. The function always pops the literal from the stack. If some answers are found, they are returned by assigning to the a parameter, a pointer to a freshly allocated list of answers of type dl_answers_t. The a parameter is assigned the null pointer when there are errors or no answers. The dl_ask function returns a non-zero value on error.

     void dl_free(dl_answers_t a);

The dl_free function frees the space associated with a list of answers.

A description of the functions that access the components of a list of answers completes this section.

     char *dl_getpred(dl_answers_t a);

This function gets the predicate symbol associated with the answers. If the length of the predicate is n, n + 1 character locations are returned, and the last location is zero. Other character locations in the predicate may be zero. This function returns the null pointer if given it.

     size_t dl_getpredlen(dl_answers_t a);

This function gets the length of the predicate symbol associated with the answers. This function returns zero if given the null pointer.

     size_t dl_getpredarity(dl_answers_t a);

This function gets the number of terms that makes up each answer. This number is called the predicate's arity. It returns zero if given the null pointer.

     char *dl_getconst(dl_answers_t a, int i, int j);

This function gets the constant associated with term j in answer i. Zero-based indexing is used throughout. If the length of the constant is n, n + 1 character locations are returned, and the last location is zero. Other character locations in the constant may be zero. If there is no specified term or no answers at all, the null pointer is returned.

     size_t dl_getconstlen(dl_answers_t a, int i, int j);

This function gets the length of the constant associated with term j in answer i. Zero-based indexing is used throughout. If there is no specified term or no answers at all, zero is returned.

6.3 Printing

The functions in this section provide support for printing using the loader's syntax for constants.

     void dl_putlconst(FILE *out, const char *s, size_t n);

Print a constant of a given length. The function assumes the constant has n + 1 character locations, and the last location is zero. Other character locations in the constant may be zero.

     void dl_putconst(FILE *out, const char *s);

Print a zero terminated constant.

     size_t dl_widthoflconst(const char *s, size_t n);

Determine the columns needed for a constant with a given length. The function assumes the constant has n + 1 character locations, and the last location is zero. Other character locations in the constant may be zero.

     size_t dl_widthofconst(const char *s);

Determine the columns needed for a zero terminated constant.


Next: , Previous: Library, Up: Top

7 Extending

Adding primitive predicates to the Datalog interpreter requires knowledge of its internals. Much of the functionality of the interpreter is implemented using an embedded language called Lua, available at http://www.lua.org. Lua provides the needed infrastructure and a small footprint. As an added bonus, the Lua language is a gem. Knowledge of Lua is a prerequisite for what follows.

A pointer of type dl_db_t, introduced in the previous chapter, is implemented as a pointer to a structure that holds the state of a Lua interpreter. When a Datalog interpreter is opened, a Lua interpreter is opened, some Lua libraries are loaded, and then a Lua program that implements much of the Datalog interpreter's functionality is loaded into the Lua interpreter. Adding primitives to the Datalog interpreter involves adding code into the Lua interpreter returned by the dl_open function.

The mechanics of adding primitives is straightforward. The pointer of type dl_db_t is cast to a pointer of type lua_State *. This pointer can then be used with a variety of functions to load code into Lua, such as the lua_dobuffer function. The more difficult part is to decide what code should be loaded.

The source file for the Datalog specific Lua code is installed in the Datalog package's data directory. The path name of the source file typically ends with share/datalog/datalog.lua. The section with a comment containing the words “PRIMITIVES” is relevant.

There is good support for the case in which a primitive can be defined in terms of a Lua iterator. The file datalog.lua defines the datalog package, and it contains the function add_iter_prim. This function defines a primitive when given a predicate symbol, its arity, and an iterator.

When given a literal, the iterator generates a sequences of answers. Each answer must be an array. Each element in the array must be either a number or a string. The length of the array must be equal to the arity of the predicate.

What follows is a predicate defined by an iterator that produces one answer, and the answer has one element in it—the number three.

     local function all_threes(literal)
       return function (s, v)
         if v then
           return nil
         else
           return {3}
         end
     end
     datalog:add_iter_prim("three", 1, all_threes)

To better integrate with other software written Lua, an existing pointer to Lua's state can be made to be used as a Datalog interpreter. The Datalog specific initialization is performed by casting a pointer to Lua's state to a pointer of type dl_db_t, and giving to

     int dl_init(dl_db_t);


Next: , Previous: Extending, Up: Top

Appendix A Copying the Software

The software described by this manual is covered by the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. The terms are contained in the file COPYING.LIB in the source distribution of the software, or at http://www.gnu.org/licenses/lgpl.txt.


Previous: Copying, Up: Top

Index