What Is Datalog? The Declarative Query Language

Datalog is a declarative logic programming language designed for querying and reasoning about data. It lets you define facts and rules, then automatically derives new information by applying those rules, including through recursion. First introduced in the 1980s as a subset of Prolog, Datalog has seen a major resurgence because of its applications in security analysis, static code analysis, data integration, and networking.

Unlike general-purpose languages where you write step-by-step instructions, Datalog programs describe *what* you want to know rather than *how* to compute it. The system figures out the rest.

How Datalog Programs Work

A Datalog program consists of two things: facts and rules. Facts are simple statements you assert to be true. Rules describe how to derive new facts from existing ones. That’s the entire language at its core.

Imagine you have a directed graph of connections between cities. You could state facts about direct routes:

route("Chicago", "Detroit").
route("Detroit", "Toronto").
route("Toronto", "Montreal").

These are your base facts, known as the extensional database (EDB). They’re the raw input, the data you start with. Then you write rules to derive new relationships from those facts. The derived relations are called the intensional database (IDB). A predicate is always one or the other, never both.
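To make the EDB/IDB distinction concrete, here is a small Python sketch. The set-of-tuples representation and the `two_hop` predicate name are illustrative choices of ours, not part of any real engine:

```python
# EDB: base facts asserted directly, stored as a set of tuples.
route = {
    ("Chicago", "Detroit"),
    ("Detroit", "Toronto"),
    ("Toronto", "Montreal"),
}

# IDB: a derived relation. The (hypothetical) rule
#   two_hop(x, y) :- route(x, z), route(z, y).
# joins route with itself on the shared variable z.
two_hop = {(x, y) for (x, z) in route for (z2, y) in route if z == z2}

print(sorted(two_hop))
# Derives ("Chicago", "Toronto") and ("Detroit", "Montreal")
```

Note that `two_hop` contains only derived facts; nothing is ever added to `route`, which is exactly the EDB/IDB separation the text describes.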

A Classic Example: Transitive Closure

The most famous Datalog example computes the transitive closure of a graph: finding all pairs of nodes connected by any path, not just direct edges. Here’s how it looks:

T(x,y) :- R(x,y).
T(x,y) :- T(x,z), R(z,y).

The first rule says: if there’s a direct edge from x to y in relation R, then T(x,y) is true. The second rule says: if T(x,z) is already true and there’s an edge from z to y, then T(x,y) is also true. The system applies these rules repeatedly until no new facts can be derived, a process called reaching a fixed point.

Notice that T appears on both sides of the second rule. This is recursion, and it’s the key feature that separates Datalog from simple database queries. The evaluation engine keeps cycling through the rules, discovering new connections, until there’s nothing left to discover.
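That repeated application can be sketched directly in Python. This naive bottom-up loop is the simplest possible evaluation strategy, shown for intuition; optimized engines use smarter algorithms such as semi-naive evaluation:

```python
# Naive bottom-up evaluation of the transitive closure rules:
#   T(x,y) :- R(x,y).
#   T(x,y) :- T(x,z), R(z,y).
# R is the city graph from earlier; the loop reapplies the recursive
# rule until a pass derives nothing new -- the fixed point.

R = {
    ("Chicago", "Detroit"),
    ("Detroit", "Toronto"),
    ("Toronto", "Montreal"),
}

T = set(R)  # first rule: every direct edge is in T
while True:
    new_facts = {(x, y) for (x, z) in T for (z2, y) in R if z == z2} - T
    if not new_facts:
        break  # fixed point reached: no rule can derive anything new
    T |= new_facts

print(sorted(T))  # includes ("Chicago", "Montreal"), reachable in 3 hops
```

Because the graph is finite and facts are only ever added, the loop must stop, which previews the termination guarantee discussed below.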

Datalog vs. SQL for Recursive Queries

SQL added recursive query support through recursive Common Table Expressions (CTEs) in the SQL-99 standard, and most modern databases support them. You can compute transitive closure in SQL. But there are real differences in both readability and performance.

The Datalog version of transitive closure is two lines. The equivalent SQL recursive CTE is significantly more verbose and harder to read, especially as the logic gets more complex. More importantly, dedicated Datalog engines can be dramatically faster for recursive workloads. In benchmarks comparing systems on a transitive closure task over a graph with 1,000 nodes, the Datalog engine Soufflé was roughly 5.5 times faster than the baseline logic programming system XSB. Meanwhile, PostgreSQL was about 3.8 times slower than XSB, and SQLite was about 5.6 times slower. That puts Soufflé at roughly 20 to 30 times faster than those SQL databases for this type of recursive computation.
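To see the verbosity difference for yourself, here is one way to phrase the same two rules as a recursive CTE, run against an in-memory SQLite database from Python. The table and column names are our own choices:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE route (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO route VALUES (?, ?)",
    [("Chicago", "Detroit"), ("Detroit", "Toronto"), ("Toronto", "Montreal")],
)

# The same two Datalog rules, spelled out as a recursive CTE.
closure = conn.execute("""
    WITH RECURSIVE t(src, dst) AS (
        SELECT src, dst FROM route           -- T(x,y) :- R(x,y).
        UNION
        SELECT t.src, route.dst              -- T(x,y) :- T(x,z), R(z,y).
        FROM t JOIN route ON t.dst = route.src
    )
    SELECT src, dst FROM t
""").fetchall()

print(sorted(closure))
```

Even this minimal version needs table setup, CTE scaffolding, and an explicit join condition, where the Datalog program is two lines.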

The gap widens for more complex recursive patterns. On a “same generation” benchmark (finding all pairs of nodes at the same depth in a tree), Soufflé was about 7 times faster than the baseline, while PostgreSQL was nearly 1.8 times slower and MariaDB was 3.5 times slower.

Safety and Guaranteed Termination

One of Datalog’s design choices is that every program is guaranteed to terminate on finite data. This is a deliberate restriction compared to general-purpose logic languages like Prolog, which can loop forever.

Datalog enforces this through a property called safety: every variable that appears in the head (the conclusion) of a rule must also appear in the body (the conditions). This ensures the system never tries to generate infinite results. Safe Datalog queries are also domain independent, meaning the answer depends only on the data in the database, not on what’s outside it.
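Because safety is a purely syntactic condition, it is easy to check mechanically. Here is a toy checker, with a rule represented simply as its head and body variable lists (this representation is ours):

```python
def is_safe(head_vars, body_vars):
    """A rule is safe when every head variable also occurs in the body."""
    return set(head_vars) <= set(body_vars)

# T(x,y) :- T(x,z), R(z,y).   -- safe: x and y both appear in the body
print(is_safe(["x", "y"], ["x", "z", "z", "y"]))   # True

# Bad(x,y) :- R(x,x).         -- unsafe: y is unconstrained, so the rule
# would have to range over an infinite set of possible values for y
print(is_safe(["x", "y"], ["x", "x"]))             # False
```

A real engine would reject the second rule at compile time rather than attempt to enumerate values for y.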

The computational cost of evaluating a Datalog program is polynomial relative to the size of the input data when the program itself is fixed. Computer scientists describe this as PTIME-complete data complexity. If both the program and the data can vary, the combined complexity rises to EXPTIME-complete, but in practice, programs are small and data is large, so the polynomial guarantee is what matters.

Negation and Stratification

Pure Datalog only deals in positive statements: this is true, therefore that is true. But real-world reasoning often needs negation: something is true *because* something else is *not* true.

Adding negation to Datalog creates a problem. When rules can negate derived relations, there can be multiple minimal models, meaning the system can’t determine a single “correct” answer. The solution is stratification. A stratified Datalog program organizes its rules into layers (strata), where each layer fully computes its results before the next layer is allowed to negate them. If a rule says “X is true because Y is not true,” then Y must be computed in an earlier stratum.

For example, imagine computing “monopoly” routes in a board game: paths that use only red edges, excluding any node reachable by green edges. You’d first compute all green-reachable nodes (stratum 0), then subtract them from the red paths (stratum 1). The stratified model extends the standard fixed-point semantics cleanly. When a program has no negation at all, the stratified model collapses to the unique minimal model, so nothing is lost.
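Here is a rough Python sketch of those two strata, with made-up red and green edge sets and a `closure` helper that reruns the same fixed-point loop as the transitive closure example:

```python
# Hypothetical edge sets; the node names are ours.
red   = {("a", "b"), ("b", "c"), ("c", "d")}
green = {("x", "c")}  # node c is reachable via a green edge

def closure(edges):
    """Transitive closure via the naive fixed-point loop."""
    t = set(edges)
    while True:
        new = {(x, y) for (x, z) in t for (z2, y) in edges if z == z2} - t
        if not new:
            return t
        t |= new

# Stratum 0: fully compute which nodes any green edge can reach.
green_reachable = {y for (_, y) in closure(green)}

# Stratum 1: keep only red edges that avoid green-reachable nodes,
# then take paths through them. The negation ("not in") consults
# stratum 0 only after it is completely finished.
safe_red = {(x, y) for (x, y) in red
            if x not in green_reachable and y not in green_reachable}
monopoly = closure(safe_red)

print(sorted(monopoly))  # only ("a", "b") survives: c is green-reachable
```

The key point is ordering: `green_reachable` is frozen before stratum 1 reads it, so the negation never sees a half-computed relation.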

Not every program with negation can be stratified. Unstratified programs require additional mechanisms, like a third truth value (“undefined”), and allow multiple valid models.

Where Datalog Is Used Today

Datalog’s resurgence is driven by industrial applications where recursive reasoning over large datasets is essential.

  • Cloud security analysis. Amazon Web Services uses the Soufflé Datalog engine in production to verify properties of virtual network configurations. The idea is to encode network rules and access policies as Datalog facts, then query whether any unintended access paths exist.
  • Static code analysis. Oracle uses Soufflé for large-scale Java program analysis. By expressing code relationships (which function calls which, what data flows where) as Datalog facts, analysts can query for bugs, vulnerabilities, or policy violations across entire codebases.
  • Data integration. When you need to merge and reason over data from multiple sources with different schemas, Datalog’s rule-based approach provides a natural way to define mappings and derive unified views.
  • Networking. Network routing and configuration verification map naturally to graph reachability problems, which are exactly what Datalog’s recursion handles well.

Soufflé is the most prominent high-performance Datalog engine in active use. It compiles Datalog programs into efficient parallel C++ code, which explains its significant speed advantages in benchmarks. Other implementations exist across the ecosystem, including Datalog-based query layers embedded in databases and program analysis tools.

Datalog’s Relationship to Prolog

Datalog originated as a subset of Prolog, the general-purpose logic programming language. But that description undersells how different the two are in practice. Prolog is a full programming language with features like function symbols, cut operators, and input/output, and it evaluates queries top-down using depth-first search, which can loop infinitely. Datalog strips all of that away. It restricts itself to flat relations (no nested data structures), guarantees termination, and typically evaluates bottom-up, starting from known facts and building toward conclusions.

The syntax isn’t formally standardized across implementations, which means different Datalog engines may use slightly different notation. But the core semantics are consistent: facts, rules with Horn clauses, recursion, and a fixed-point evaluation strategy that always terminates.