
Adversarial Testing Against the Compiler Chain

How the team tries to break the compiler and what those tests can and cannot prove about the formal system.

2026-03-24T00:00:00.000Z

The Question We Asked Ourselves

When you build a compiler that transpiles code between 10 input languages and 14 output targets, that certifies programs mathematically, and that compiles itself — there is one question that towers above everything else:

How do you know it actually works?

The industry-standard answer is "we tested it pretty well." We looked at that answer and decided it was an embarrassment. "Pretty well" is not engineering. So we did something different.

What Are Abyssal Tests?

We call them abyssal tests because they go all the way down — to the absolute bottom of the system. These are not your typical integration tests that verify "the button works." These are tests designed to destroy our own compiler. Every atomic operation. Every value combination. Every backend. Every control flow pattern. We tried to break it. Systematically.

110,227 Tests Across 7 Categories

The tests span 7 categories: individual monomer operations, multi-family compositions, cross-target consistency, determinism verification, real execution with verified I/O, security and abuse resistance, and regression coverage. Every single test verifies a concrete, specific property. None of these are randomly generated. Each one exists because it targets a specific execution path that could fail — and we made sure it does not.

What We Tried to Break

Level 1: Individual Operations

Every monomer in the full catalog was tested with boundary values: 0, 1, 127, 128, 255, and every dangerous combination between them. ADD8(255, 1) must wrap around: not a crash, not undefined behavior. DIV8(x, 0) must produce a controlled error, not a segfault. SHL(1, 7) must produce 128 under declared constraints. No "it depends." No platform-specific behavior.
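To make the boundary sweep concrete, here is a minimal sketch in Python. It is a model, not the BRIK64 implementation: it assumes ADD8, DIV8, and SHL operate on unsigned 8-bit values, and the names add8, div8, shl8, ControlledError, and check_boundaries are hypothetical stand-ins invented for illustration.

```python
from itertools import product

# Boundary values named in the text above.
BOUNDARY_VALUES = (0, 1, 127, 128, 255)


class ControlledError(Exception):
    """Hypothetical stand-in for the compiler's controlled error."""


def add8(a: int, b: int) -> int:
    # Assumed semantics: unsigned 8-bit addition that wraps modulo 256.
    return (a + b) & 0xFF


def div8(a: int, b: int) -> int:
    # Division by zero raises a controlled error instead of crashing.
    if b == 0:
        raise ControlledError("DIV8: division by zero")
    return a // b


def shl8(a: int, n: int) -> int:
    # Left shift, truncated to 8 bits.
    return (a << n) & 0xFF


def check_boundaries() -> int:
    """Run every boundary pair through the model; return the number of checks."""
    checks = 0
    for a, b in product(BOUNDARY_VALUES, repeat=2):
        # ADD8 must always stay inside the 8-bit range.
        assert 0 <= add8(a, b) <= 255
        checks += 1
        # DIV8 must either return a quotient or raise the controlled error.
        try:
            quotient = div8(a, b)
            assert b != 0 and quotient == a // b
        except ControlledError:
            assert b == 0
        checks += 1
    return checks


assert add8(255, 1) == 0   # wrap-around, under the unsigned assumption
assert shl8(1, 7) == 128
print(check_boundaries(), "boundary checks passed")
```

Five boundary values give 25 ordered pairs, and each pair is checked against two operations. The real corpus presumably asserts many more properties per monomer; the point here is only the shape of the sweep.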

Level 2: Compositions

Here is where most compilers fall apart. An individual monomer can work perfectly and fail catastrophically when composed with another. We generated chains of 2, 3, 4, 5, and 6 operations mixing families: arithmetic with logic, logic with strings, strings with float, float with trigonometry. If ADD8 works and SIN works, does SIN(ADD8(1,2)) work?

Yes. In every chain we generated. Every family combination. Every ordering.
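A composition sweep of this kind can be sketched as follows. This is an illustration under stated assumptions, not the team's generator: the catalog below has only four unary operations, and INC8, NOT8, and HALF are hypothetical names invented for the example (only ADD8 and SIN appear in the text). The sketch enumerates every chain of length 2 to 6 that mixes at least two families, runs it, and checks that the result is finite and repeatable.

```python
import math
from itertools import product


def add8(a: int, b: int) -> int:
    # Assumed semantics: unsigned 8-bit addition with wrap-around.
    return (a + b) & 0xFF


# Hypothetical mini-catalog: name -> (family, unary function).
UNARY_OPS = {
    "INC8": ("arithmetic", lambda x: (int(x) + 1) & 0xFF),
    "NOT8": ("logic", lambda x: ~int(x) & 0xFF),
    "HALF": ("float", lambda x: x / 2.0),
    "SIN": ("trigonometry", math.sin),
}


def mixed_chains(min_len: int = 2, max_len: int = 6):
    """Yield every chain of operation names that mixes at least two families."""
    for length in range(min_len, max_len + 1):
        for names in product(UNARY_OPS, repeat=length):
            families = {UNARY_OPS[name][0] for name in names}
            if len(families) >= 2:
                yield names


def run_chain(names, value):
    """Apply the named operations left to right."""
    for name in names:
        value = UNARY_OPS[name][1](value)
    return value


def check_all(seed) -> int:
    """Run every mixed chain on the seed; return how many chains were checked."""
    count = 0
    for names in mixed_chains():
        result = run_chain(names, seed)
        assert math.isfinite(result), names
        assert result == run_chain(names, seed), names  # same chain, same answer
        count += 1
    return count


# SIN(ADD8(1, 2)) from the text, expressed as a one-step chain on a seed value.
assert run_chain(("SIN",), add8(1, 2)) == math.sin(3)
print(check_all(add8(1, 2)), "mixed-family chains checked")
```

With four operations the sweep is tiny. The number of chains grows as the catalog size raised to the chain length, which is why the choice of chain lengths and families matters in a real corpus.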

Level 3: Cross-Target

The same PCD program must produce correct code in JavaScript, Python, Rust, Go, C, C++, PHP, and Java. Each monomer generates idiomatic code in the target language — not a transliteration, but native semantics appropriate to that language. And here is the hard part: all backends must produce the same result for the same input. Identical outputs. Across languages.

2,864 tests verify this for monomer combinations alone. Every single one passes.
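The consistency check itself is simple to state in code. The sketch below does not invoke any real backend. It models three "targets" as three Python functions that reach 8-bit wrap-around through different native mechanisms (bit masking, modulo arithmetic, and a C unsigned 8-bit integer via ctypes), then asserts that all of them agree on every boundary pair. The backend names are hypothetical.

```python
import ctypes
from itertools import product

BOUNDARY_VALUES = (0, 1, 127, 128, 255)


def add8_mask(a: int, b: int) -> int:
    # Style 1: arbitrary-precision integers, truncated with a bit mask.
    return (a + b) & 0xFF


def add8_modulo(a: int, b: int) -> int:
    # Style 2: explicit modulo arithmetic.
    return (a + b) % 256


def add8_native(a: int, b: int) -> int:
    # Style 3: a C unsigned 8-bit integer, which wraps without overflow checks.
    return ctypes.c_uint8(a + b).value


# Hypothetical stand-ins for generated code in different target languages.
BACKENDS = {
    "mask": add8_mask,
    "modulo": add8_modulo,
    "native_u8": add8_native,
}


def cross_target_mismatches():
    """Return every input pair on which the modeled backends disagree."""
    mismatches = []
    for a, b in product(BOUNDARY_VALUES, repeat=2):
        results = {name: fn(a, b) for name, fn in BACKENDS.items()}
        if len(set(results.values())) != 1:
            mismatches.append(((a, b), results))
    return mismatches


assert cross_target_mismatches() == []
```

In a real harness each entry in BACKENDS would compile and run the generated program in its own toolchain; the comparison step would look the same.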

Level 4: Determinism

This is the most important property of BRIK64: the same input produces the same output. Always. Not "usually." Not "in most cases." Always. No garbage collection pausing between two runs. No JIT optimizing differently the second time. No scheduler reordering operations behind your back.

Every program is compiled twice. The hashes of the two outputs are compared. If they differ by a single bit, the test fails. 600 determinism tests. Zero failures.
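A compile-twice check can be sketched in a few lines. The sketch assumes SHA-256 as the hash (the text does not say which one is used), and compile_stub and compile_with_build_id are hypothetical stand-ins for a compiler, included only so the check has something to run against.

```python
import hashlib
import itertools
from typing import Callable


def compile_stub(source: str) -> bytes:
    # Hypothetical deterministic "compiler": normalizes whitespace, nothing more.
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    return "\n".join(lines).encode("utf-8")


_build_counter = itertools.count()


def compile_with_build_id(source: str) -> bytes:
    # Hypothetical non-deterministic "compiler": embeds a changing build number.
    suffix = f"\n# build {next(_build_counter)}".encode("utf-8")
    return compile_stub(source) + suffix


def is_deterministic(
    source: str,
    compile_fn: Callable[[str], bytes] = compile_stub,
    runs: int = 2,
) -> bool:
    """Compile the same source several times and compare output digests."""
    digests = {hashlib.sha256(compile_fn(source)).hexdigest() for _ in range(runs)}
    return len(digests) == 1


program = "ADD8 1 2\n"
assert is_deterministic(program)
assert not is_deterministic(program, compile_with_build_id)
```

The second stub shows the kind of thing such a test catches: anything that leaks run-specific state (a counter, a timestamp, a memory address) into the artifact changes the digest.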

Level 5: Real Execution

The first roughly 100,000 tests verify code generation: that the compiler produces valid, compilable code. The last roughly 10,000 go further. They verify real execution: that the generated code, when actually run, produces the correct values. Not just valid syntax. Correct answers.

ADD8(1, 2) must not only generate code that compiles — it must produce 3 when executed. SIN(0) must produce 0.0. A loop that accumulates 10 times must produce exactly 10.

These tests execute the BIR (BRIK Intermediate Representation) with known input values and verify that the output is exactly what the mathematics predicts. Not approximately. Exactly.
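The three examples above can be written as a small expected-value table. This sketch does not execute BIR, whose format is not described here. It runs plain Python models of the operations (add8 and accumulate are hypothetical helpers, and the 8-bit unsigned semantics are an assumption) and compares each result to its expected value with exact equality.

```python
import math


def add8(a: int, b: int) -> int:
    # Assumed semantics: unsigned 8-bit addition with wrap-around.
    return (a + b) & 0xFF


def accumulate(times: int) -> int:
    # A loop that adds 1 to an accumulator the given number of times.
    total = 0
    for _ in range(times):
        total = add8(total, 1)
    return total


# (label, computation, expected value) for each case named in the text.
CASES = [
    ("ADD8(1, 2)", lambda: add8(1, 2), 3),
    ("SIN(0)", lambda: math.sin(0), 0.0),
    ("accumulate 10 times", lambda: accumulate(10), 10),
]


def run_cases():
    """Execute every case; return (label, actual, expected) for each failure."""
    failures = []
    for label, compute, expected in CASES:
        actual = compute()
        if actual != expected:  # exact comparison, no tolerance
            failures.append((label, actual, expected))
    return failures


assert run_cases() == []
```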

Level 6: Security and Abuse

What happens when someone deliberately tries to attack the compiler? SQL injection in a PCD variable name. XSS in a string literal. Path traversal in a filesystem argument. Unicode homoglyphs designed to confuse the parser. We threw everything we could think of at it.

484 regression and security tests verify that the system rejects or correctly handles every single malicious case. The compiler is not just correct — it is hostile to attackers.
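Two of these attack classes lend themselves to a short sketch: hostile variable names and path traversal. The rules below are assumptions made for illustration, not BRIK64's actual grammar: identifiers are restricted to ASCII letters, digits, and underscores (which rejects both injection payloads and Unicode homoglyphs), and filesystem arguments must be relative paths that stay inside their base directory. Both function names are hypothetical.

```python
import posixpath
import re

# Assumed rule: ASCII-only identifiers, at most 64 characters.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,63}")


def is_safe_identifier(name: str) -> bool:
    """Accept only names that match the ASCII identifier rule in full."""
    return IDENTIFIER.fullmatch(name) is not None


def is_safe_relative_path(path: str) -> bool:
    """Reject absolute paths, backslashes, NUL bytes, and upward traversal."""
    if not path or "\x00" in path or "\\" in path or path.startswith("/"):
        return False
    normalized = posixpath.normpath(path)
    return normalized != ".." and not normalized.startswith("../")


# A plain name passes; an injection payload and a homoglyph do not.
assert is_safe_identifier("total_count")
assert not is_safe_identifier("x'; DROP TABLE users; --")
assert not is_safe_identifier("p\u0430yload")  # U+0430 is Cyrillic, not Latin "a"

# Paths that climb out of the base directory are rejected.
assert is_safe_relative_path("data/in.txt")
assert not is_safe_relative_path("../../etc/passwd")
assert not is_safe_relative_path("data/../../etc/passwd")
```

An allow-list like this rejects by default: anything the rule does not explicitly describe is refused, so new attack strings do not need their own special case.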

Level 7: Regression

Every bug we found and fixed during development became a permanent, immortal test case. The array overflow that caused a segfault in ELF generation. The variable scoping in if blocks that did not propagate to the outer scope. The ENV function that did not exist as a monomer and returned garbage.

These bugs can never come back. Not tomorrow. Not next year. Not ever. Their tests are embedded in the artifact forever.

What We Did NOT Find

This is the part that matters most. After 110,227 deliberate, systematic attempts to break our own system:

0 failures in core operations. Every review-scoped monomer check and closure check passed. The mathematical certification holds under adversarial conditions.

0 determinism failures. Same input, same output. Always.

0 uncontrolled crashes in the compilation pipeline.

0 cross-target inconsistencies. All backends produce code with equivalent behavior. Write once, run anywhere, and get the same answer everywhere.

Why This Is Possible

The secret is not that we are better testers than everyone else. It is that the operation space is finite. And that changes everything.

A conventional program has a virtually infinite state space: any combination of calls to any function with any argument. Exhaustively verifying a 1,000-line Python program is computationally infeasible. Nobody will ever do it. It cannot be done.

A PCD program is composed of exactly 128 atomic operations. Each one has a known signature, a known domain, and a known range. You can verify every combination because the space is finite. This is not cleverness. This is architecture.

It is the same reason you can formally verify a digital circuit with 128 gates but you cannot formally verify a modern processor with a billion transistors. We made the deliberate architectural decision to keep the component space finite. And that decision is what makes exhaustive verification not just viable — but inevitable.
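The circuit analogy can be made literal. Assuming an 8-bit unsigned domain, a binary operation has only 256 x 256 = 65,536 possible input pairs, so it can be checked against an independent reference on every one of them. The sketch below compares a gate-level ripple-carry adder with modulo arithmetic over the whole domain. Both functions are illustrative models, not BRIK64 code.

```python
def ripple_add8(a: int, b: int) -> int:
    """Add two 8-bit values one bit at a time, as a chain of full adders."""
    result, carry = 0, 0
    for bit in range(8):
        x = (a >> bit) & 1
        y = (b >> bit) & 1
        total = x ^ y ^ carry                # sum output of the full adder
        carry = (x & y) | (carry & (x ^ y))  # carry output of the full adder
        result |= total << bit
    return result  # the final carry is dropped, which is the wrap-around


def reference_add8(a: int, b: int) -> int:
    # Independent reference: integer addition modulo 256.
    return (a + b) % 256


def verify_exhaustively() -> int:
    """Check every input pair in the 8-bit domain; return the number checked."""
    checked = 0
    for a in range(256):
        for b in range(256):
            assert ripple_add8(a, b) == reference_add8(a, b), (a, b)
            checked += 1
    return checked


print(verify_exhaustively(), "input pairs verified")
```

No sampling is involved, and no input is left to chance. That is the difference between testing a finite domain and testing an open-ended one.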

The Result

110,227 tests. 0 failures. This is not a marketing claim. It is not a rounded number. It is a verifiable fact. Every test is in the repository. Every one runs on every commit. Every one produces the same result today that it produced yesterday and will produce tomorrow and will produce a decade from now.

Because that is what "deterministic by construction" means. Not a promise. A mathematical property.

Run the Corpus

git clone https://github.com/brik64/brik64-demos.git
cd brik64-demos
./run_demo.sh adversarial-corpus

The abyssal tests cover the full monomer catalog, 14 backends, 10 input languages, control flow, multi-family compositions, determinism, real execution, security, and regression. The code and the tests are part of the same verifiable, immutable artifact. Run them yourself. The numbers do not change.


