Concepts

Syntax, AST, and the SVO Model

Understanding how CNL-PL bridges the gap between readable English and rigid logic through deterministic structural rules.

The SVO Foundation (Subject-Verb-Object)

At the core of the Controlled Natural Language (CNL) architecture lies the SVO triplet. In natural languages like English, sentence structure is fluid: a sentence might start with a prepositional phrase, use passive voice, or bury the subject deep within a clause. This flexibility allows for poetic expression, but it is catastrophic for a deterministic compiler.

CNL-PL solves this by enforcing a strict Subject-Verb-Object structure for every atomic assertion. This is not merely a grammatical preference; it is the fundamental data structure of the language. When the parser encounters a sentence, it is not trying to "understand" it in the AI sense. Instead, it is mechanically mapping tokens into slots.

For example, in the sentence "The server handles high load", the architecture does not see a flat sequence of words. It sees a Subject entity (server), a Relation or verb phrase (handles), and an Object entity (high load). This triplet forms the "atomic fact" or Assertion. Complex logic is built by combining these atomic triplets with boolean operators, but the triplets themselves remain the immutable building blocks of truth in the system.
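The triplet and its boolean combinations can be pictured as a small data structure. The following is a minimal sketch, not CNL-PL's actual implementation: the class names Assertion, And, and Or, the atoms helper, and the second fact ("server runs linux") are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Assertion:
    """One atomic fact: a subject, a relation (verb phrase), and an object."""
    subject: str
    relation: str
    obj: str


@dataclass(frozen=True)
class And:
    left: "Formula"
    right: "Formula"


@dataclass(frozen=True)
class Or:
    left: "Formula"
    right: "Formula"


# A formula is either an atomic triplet or a boolean combination of formulas.
Formula = Union[Assertion, And, Or]


def atoms(formula: Formula) -> list[Assertion]:
    """Collect the atomic triplets a formula is built from, left to right."""
    if isinstance(formula, Assertion):
        return [formula]
    return atoms(formula.left) + atoms(formula.right)


# "The server handles high load" mapped into the three slots.
handles = Assertion("server", "handles", "high load")
# Hypothetical second fact, used only to show composition.
runs = Assertion("server", "runs", "linux")
combined = And(handles, runs)
```

The dataclasses are frozen so that a triplet cannot be modified after construction, which mirrors the idea that atomic facts are immutable while the boolean structure around them can grow.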

Quick examples

Short examples show how strict syntax prevents ambiguity.

Valid atomic sentence:
Truck_A is assigned to Warehouse_7.

Invalid (missing determiner):
driver is assigned to a route.

Valid relative clause:
A user who is active and who knows python is an admin.

Aggregations and comparators

Aggregations produce numeric values, and comparators express numeric constraints without becoming predicates. Both are parsed into structured AST nodes.

The number of packages is greater than 10.
The sum of weight of every package is less than 5000.
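One way to picture the structured nodes behind these two sentences is sketched below. This assumes a simple in-memory record store; the node names Aggregation and Comparison, their fields, and the holds function are hypothetical and not taken from the CNL-PL source.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Aggregation:
    func: str                         # "count" or "sum" (assumed names)
    target: str                       # entity being aggregated, e.g. "package"
    attribute: Optional[str] = None   # e.g. "weight"; None for a plain count


@dataclass(frozen=True)
class Comparison:
    left: Union[Aggregation, float]
    op: str                           # ">" or "<"
    right: Union[Aggregation, float]


def value_of(term, records):
    """Reduce a term to a number: evaluate aggregations, pass literals through."""
    if not isinstance(term, Aggregation):
        return term
    rows = records[term.target]
    if term.func == "count":
        return len(rows)
    if term.func == "sum":
        return sum(row[term.attribute] for row in rows)
    raise ValueError(f"unknown aggregation: {term.func}")


def holds(cmp: Comparison, records) -> bool:
    """Check a numeric constraint against the records."""
    left, right = value_of(cmp.left, records), value_of(cmp.right, records)
    if cmp.op == ">":
        return left > right
    if cmp.op == "<":
        return left < right
    raise ValueError(f"unknown comparator: {cmp.op}")


# "The number of packages is greater than 10."
count_rule = Comparison(Aggregation("count", "package"), ">", 10)
# "The sum of weight of every package is less than 5000."
weight_rule = Comparison(Aggregation("sum", "package", "weight"), "<", 5000)

# Made-up sample data: twelve packages of 400 units each.
records = {"package": [{"weight": 400} for _ in range(12)]}
```

In this sketch the aggregation yields a number and the comparison wraps it in a constraint, so neither is modeled as an SVO triplet.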

The Lossless AST

The AST (Abstract Syntax Tree) in CNL-PL differs significantly from ASTs in traditional programming languages like JavaScript or Python. In those languages, the AST is often "lowered" or simplified immediately—comments are discarded, whitespace is irrelevant, and syntactic sugar is desugared.

In CNL-PL, the AST is designed to be lossless. It preserves the exact lexical spans (start and end positions) of every node relative to the source text. This is critical for the "Explain" pragmatic. When the system needs to report an error or explain a deduction, it must point back to the exact words the user typed, not a reconstructed approximation.
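To illustrate why spans matter, here is a minimal sketch of a node that carries its source offsets and a helper that points back at the user's exact words. The Span and Node types and the node_for, source_text, and point_at helpers are assumptions made for this example; the real AST layout may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # offset of the first character, inclusive
    end: int    # offset one past the last character, exclusive


@dataclass(frozen=True)
class Node:
    kind: str   # e.g. "Subject", "Relation", "Object"
    span: Span


def node_for(source: str, kind: str, text: str) -> Node:
    """Demo helper: build a node by locating `text` in `source`.

    A real parser would record offsets while consuming tokens instead.
    """
    start = source.index(text)
    return Node(kind, Span(start, start + len(text)))


def source_text(source: str, node: Node) -> str:
    """Recover the exact characters the user typed for this node."""
    return source[node.span.start:node.span.end]


def point_at(source: str, node: Node) -> str:
    """Render the source line with carets under the node's span."""
    width = node.span.end - node.span.start
    return source + "\n" + " " * node.span.start + "^" * width


source = "The server handles high load."
relation = node_for(source, "Relation", "handles")
print(point_at(source, relation))
```

Because the node stores offsets rather than a copy of the text, any diagnostic can quote the original input verbatim instead of reconstructing it from the tree.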

Furthermore, the AST is strict about Canonical Forms. While the user might write "X is greater than Y", the AST normalizes this into a specific Comparison node. However, the node still records that it originated from a natural language sentence. This duality (rigid logical structure on the inside, natural language mapping on the outside) is what makes the AST the "source of truth" for the compiler.
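The duality can be sketched as a node that stores both the canonical operator and the surface phrase it came from. This is an illustrative assumption, not the documented node layout: the CANONICAL_OPS table, the field names, and the normalize function are hypothetical, and only the two phrasings that appear in this page are covered.

```python
from dataclasses import dataclass
from typing import Optional

# Surface phrase -> canonical operator (assumed mapping for this sketch).
CANONICAL_OPS = {"is greater than": ">", "is less than": "<"}


@dataclass(frozen=True)
class Comparison:
    left: str
    op: str        # canonical form used by the logic
    right: str
    surface: str   # exact phrase the user wrote
    start: int     # offset where the phrase begins in the sentence
    end: int       # offset one past where the phrase ends


def normalize(sentence: str) -> Optional[Comparison]:
    """Map a comparison sentence to its canonical node, keeping the wording."""
    for phrase, op in CANONICAL_OPS.items():
        idx = sentence.find(f" {phrase} ")
        if idx == -1:
            continue
        start = idx + 1                   # skip the leading space
        end = start + len(phrase)
        left = sentence[:idx].strip()
        right = sentence[end:].strip().rstrip(".")
        return Comparison(left, op, right, phrase, start, end)
    return None


sentence = "X is greater than Y."
node = normalize(sentence)
```

The logical layer reads only op, while an explanation can quote sentence[node.start:node.end] to show the user's own wording.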

Grammar and EBNF

The rules governing what constitutes a valid sentence are defined using EBNF (Extended Backus-Naur Form). This formal notation describes the syntax recursively. For instance, an EBNF rule might state that a Statement consists of a Sentence followed by a period EOF marker.

Crucially, CNL-PL uses strict token definitions (IDENT for identifiers, plus dedicated tokens for strings and numbers). One of the most common sources of ambiguity in natural language parsers is deciding what is a keyword and what is a name. CNL-PL solves this with the "Longest Match" rule in the lexer and strict Action Blocks. If a word is a reserved keyword, it cannot be used as an identifier unless quoted or escaped. This eliminates the "garden path sentence" problem, where the parser gets halfway through a sentence before realizing it misread the first word.
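The interaction between longest match and reserved keywords can be sketched with a small lexer. This is a simplified illustration under stated assumptions: the KEYWORDS set, the token patterns, and the tokenize function are hypothetical, the actual reserved word list is not given here, and escaping is left out.

```python
import re

# Hypothetical reserved words; the real list is defined by the grammar.
KEYWORDS = {"is", "a", "an", "the", "who", "and"}

# Candidate token patterns. On a tie in length, the earlier entry wins.
TOKEN_PATTERNS = [
    ("NUMBER", re.compile(r"\d+(?:\.\d+)?")),
    ("STRING", re.compile(r'"[^"]*"')),
    ("IDENT", re.compile(r"[A-Za-z_][A-Za-z0-9_]*")),
    ("PERIOD", re.compile(r"\.")),
]


def tokenize(text: str) -> list[tuple[str, str]]:
    """Split text into (kind, lexeme) pairs using longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best = None
        for kind, pattern in TOKEN_PATTERNS:
            match = pattern.match(text, pos)
            # Keep the candidate that consumes the most characters.
            if match and (best is None or match.end() > best[1].end()):
                best = (kind, match)
        if best is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        kind, match = best
        lexeme = match.group()
        # A word is a keyword only when the whole lexeme is reserved.
        if kind == "IDENT" and lexeme.lower() in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, lexeme))
        pos = match.end()
    return tokens


tokens = tokenize('island is a "who".')
```

Here "island" begins with the reserved word "is" but is consumed whole as an IDENT, and the quoted "who" becomes a STRING rather than a KEYWORD, so the token kind is settled before the parser sees the sentence.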