Your First ZK Program

This page walks you through one ZK program and grows it, piece by piece. By the end you'll have a proof that says, "I know a 4 MB blob whose XOR checksum is T" — around 80 lines of Rust, under two seconds on a laptop, verification in tens of milliseconds.

We start as minimal as it gets: three columns, one transition constraint, no chiplets. The whole thing works because of one quirk of the field Hekate runs on — in GF(2^k), addition is XOR. That's the hook. The two chapters that follow add one feature each, every time because the previous chapter hits a wall it can't climb.

The idea

Imagine a service hands you a 4 MB blob. You want to prove to someone else that you've seen every byte of it, and that when you XOR the whole thing together you get a public value T — all without ever showing them the blob itself.

Hekate's field is GF(2^k), where addition literally is XOR at the hardware level. The whole proof shrinks to a one-line recurrence: acc_{i+1} = acc_i + data_i. No hash gadgets, no bit decomposition, no workarounds. That's the quirk we're leaning on for the rest of this page.

The trace

An execution trace is just a table: rows are moments in time, columns are state. For this program we only need three columns and 2^20 rows. At 4 bytes of blob per row, that's the 4 MB we set out to process.

Column	Type	What it holds
`DATA`	`B32`	One 4-byte chunk of the blob at this row
`ACC`	`B32`	Running XOR of everything we've seen so far
`Q`	`Bit`	`1` on active rows, `0` on the final row to end the chain

A quick word on column types. Hekate lets you store each column in whatever width actually fits — Bit for booleans, B32 for u32s, B128 for full field elements. Since our data is just a stream of 4-byte words, B32 is the right choice: 4 bytes per cell instead of 16. When the prover needs to do real field arithmetic it lifts those B32 values into the full Block128 field for you, so you keep the compact storage and still get full-field security.

What "zero-knowledge" means in this proof

The trace is private. The verifier never sees DATA, never sees the running ACC, never sees the blob. What the verifier learns is exactly two things: the public tag T (which they already knew, they're checking against it) and the single bit "yes, the prover knew a 4 MB blob whose XOR is T". Nothing else. Two distinct blobs with the same XOR produce indistinguishable proofs — the protocol erases everything except the claim itself.

Mechanically this works because Hekate's prover injects fresh AES-NI noise into every committed column before hashing it into the Merkle tree, and an extra blinding column past the active rows absorbs whatever Sumcheck would otherwise leak about specific cells. You don't write any of that. The sumcheck_blinding_factor field on Config controls how much noise — 2 is the default for ZK builds, 0 turns blinding off and is only safe for fully public traces.

The split between what's public and what's private gets a name in the API:

ProgramInstance<F> — the public side. Carries num_rows, public_inputs (the tag, in our case), and whatever else the verifier needs to recreate the statement they're verifying. Same instance goes to prover and verifier.
ProgramWitness<F, T> — the private side. Wraps the trace plus any chiplet traces. Never crosses the network. The prover holds it, generates the proof, and the witness stops there.

Same Program<F> definition on both sides. Same instance on both sides. Witness only on the prover side. The proof is the bridge.

The constraints

The AIR (Algebraic Intermediate Representation) is where you declare the rules the trace must obey.

use hekate_core::config::Config;
use hekate_core::errors;
use hekate_core::trace::{ColumnTrace, ColumnType, TraceBuilder};
use hekate_math::{Bit, Block32, Block128, TowerField};
use hekate_program::constraint::{BoundaryConstraint, ConstraintAst};
use hekate_program::constraint::builder::ConstraintSystem;
use hekate_program::define_columns;
use hekate_program::{Air, Program, ProgramInstance, ProgramWitness};
use hekate_sdk::{HekateClient, ProcessTransport};
use rand::{TryRngCore, rngs::OsRng};

type F = Block128;

define_columns! {
    ChecksumCols {
        DATA: B32,
        ACC:  B32,
        Q:    Bit,
    }
}

#[derive(Clone)]
struct ChecksumProgram {
    num_rows: usize,
}

impl Air<F> for ChecksumProgram {
    fn num_columns(&self) -> usize {
        ChecksumCols::NUM_COLUMNS
    }

    fn column_layout(&self) -> &[ColumnType] {
        static L: std::sync::OnceLock<Vec<ColumnType>> = std::sync::OnceLock::new();
        L.get_or_init(ChecksumCols::build_layout)
    }

    fn boundary_constraints(&self) -> Vec<BoundaryConstraint<F>> {
        vec![
            // ACC[0]       = public_inputs[0] = 0
            BoundaryConstraint::with_public_input(ChecksumCols::ACC, 0, 0),
            // ACC[last]    = public_inputs[1] = expected tag
            BoundaryConstraint::with_public_input(ChecksumCols::ACC, self.num_rows - 1, 1),
        ]
    }

    fn constraint_ast(&self) -> ConstraintAst<F> {
        let cs = ConstraintSystem::<F>::new();

        let [data, acc, q] = [
            cs.col(ChecksumCols::DATA),
            cs.col(ChecksumCols::ACC),
            cs.col(ChecksumCols::Q),
        ];
        let next_acc = cs.next(ChecksumCols::ACC);

        cs.assert_boolean(q);
        cs.constrain(q * (next_acc + acc + data));

        cs.build()
    }
}

impl Program<F> for ChecksumProgram {
    fn num_public_inputs(&self) -> usize {
        2
    }
}

What's the AIR actually saying? The heart of it is one transition: q * (next_acc + acc + data) = 0. Read it as "between every pair of consecutive rows, next_acc must equal acc + data" — and remember, in GF(2^k) subtracting and adding are the same thing, so next_acc = acc + data and next_acc + acc + data = 0 are two ways of writing the same equation. The selector Q acts as an on/off switch so the constraint goes silent on the final row and doesn't run off the end of the trace.

Then two boundary constraints pin the endpoints: the accumulator starts at zero, and at the last row it must equal the public tag. One subtle thing to notice: the third argument of BoundaryConstraint::with_public_input isn't a literal value, it's an index into the public inputs. So we pass the zero and the tag through ProgramInstance::public_inputs at indices 0 and 1.

Why do we need the starting boundary at all? Because without it the program is actually unsound. An attacker could start acc[0] at any value they like and still land exactly on tag at the last row by choosing the right blob — the terminal boundary alone isn't enough. Pinning both endpoints is what forces the running XOR of the data to really equal the tag.

Filling the trace

fn generate_trace(blob: &[u8], num_vars: usize) -> errors::Result<(ColumnTrace, Block32)> {
    let num_rows = 1 << num_vars;
    assert_eq!(blob.len(), (num_rows - 1) * 4);

    let mut tb = TraceBuilder::new(&ChecksumCols::build_layout(), num_vars)?;
    let mut acc = Block32::ZERO;

    for i in 0..num_rows - 1 {
        let word = u32::from_le_bytes(blob[i * 4..(i + 1) * 4].try_into().unwrap());
        let chunk = Block32::from(word);

        tb.set_b32(ChecksumCols::DATA, i, chunk)?;
        tb.set_b32(ChecksumCols::ACC, i, acc)?;
        tb.set_bit(ChecksumCols::Q, i, Bit::ONE)?;

        acc = acc + chunk;
    }

    tb.set_b32(ChecksumCols::ACC, num_rows - 1, acc)?;

    Ok((tb.build(), acc))
}

Nothing clever here. One row per 4-byte word of the blob, and a final row that just holds the accumulated tag with Q = 0. Setting Q to zero on that last row makes the transition vacuously true — 0 * anything = 0 — so we don't have to invent a phantom data byte for it. The terminal boundary still pins the value at that row, so the math works out.

Proving and verifying

We have an AIR, a witness, and two public inputs. All that's left is to hand everything to the SDK. You don't have to know anything about Sumcheck or Brakedown to drive it — the SDK exposes three methods and each does one thing:

preflight — sanity-check the witness against your constraints before you generate a proof. Cheap, instant, and it'll save you from lots of painful debugging.
prove — ship the program + instance + witness to the prover binary and get an InnerProof back.
verify — check that proof locally. No binary needed, no network.

fn main() {
    let num_vars = 20;
    let num_rows = 1 << num_vars;

    let mut blob = vec![0u8; (num_rows - 1) * 4];
    OsRng.try_fill_bytes(&mut blob).unwrap();

    let (trace, expected_tag) = generate_trace(&blob, num_vars).unwrap();

    let program = ChecksumProgram { num_rows };
    let instance = ProgramInstance::new(
        num_rows,
        vec![Block128::ZERO, Block128::from(expected_tag)],
    );
    let witness = ProgramWitness::new(trace);

    let config = Config::default();

    let mut client = HekateClient::new()
        .transport(ProcessTransport::new("./prover"));

    let report = client.preflight(&program, &instance, &witness).unwrap();
    assert!(report.is_clean(), "preflight found violations: {:#?}", report);

    let proof = client.prove(&program, &instance, &witness, &config).unwrap();
    let ok = client.verify(&program, &instance, &proof, &config).unwrap();

    assert!(ok);
}

To run it, spin up a fresh cargo project, add the Hekate crates to Cargo.toml, drop the program into src/main.rs, and:

cargo run --release

Measured on Apple Silicon (release, 2^16 rows, single thread of work + Rayon for SpMV):

Trace Generation done in 315.92µs
Proving         done in 29.10ms
Verifying       done in  2.69ms

Total Proof size: 314.32 KB (321866 bytes)
  Trace Commitment:           38 bytes
  Main AIR ZeroCheck:        850 bytes
  Eval Sumcheck:             578 bytes
  Tensor Vector (q):     142667 bytes
  Point Evaluations:         445 bytes
  LDT Merkle Paths:       71841 bytes
  LDT Opened Columns:    105441 bytes

The eval-batch components (tensor + LDT) dominate; everything else is in the noise. Bumping num_vars to 20 (the 4 MB blob this chapter set out to process) scales prove time roughly 16×.

Why preflight matters

The first time you write an AIR you will get a constraint wrong. That's normal. The rough part is what happens when you skip preflight: the prover will happily finish, and the verifier will reject, and the only thing you get back is "sumcheck failed at round N" — which tells you absolutely nothing about which row or column of your trace actually broke.

Preflight spares you all of that. It evaluates the constraint system row by row on your concrete trace before any proving work starts. The report that comes back is structured (PreflightReport<F> with constraint_violations, boundary_violations, lagrange_pin_violations, bus_diagnostics), and the Debug output it formats reads like this:

A clean trace:

PREFLIGHT: 0 constraint, 0 boundary, 0 lagrange pin, 0 bus

A failed transition constraint (the AIR rule that pairs of rows must satisfy next_acc + acc + data = 0):

PREFLIGHT: 2 constraint, 0 boundary, 0 lagrange pin, 0 bus

    [Main] Constraint 0 "fib_a" failed at row 1
    [Main] Constraint 1 "fib_b" failed at row 2

A boundary mismatch (the value at a pinned row doesn't match the public input):

PREFLIGHT: 0 constraint, 2 boundary, 0 lagrange pin, 0 bus

    Boundary #0: col=0 row=0 actual=Flat(Block128(0)) expected=Flat(Block128(2206...))
    Boundary #1: col=1 row=0 actual=Flat(Block128(1)) expected=Flat(Block128(2654...))

A bus mismatch (the multiset on one side of the LogUp bus doesn't match the other):

PREFLIGHT: 0 constraint, 0 boundary, 0 lagrange pin, 1 bus

    Bus "lookup_bus" (2 endpoints):
      Main: 8 rows, 4 active, product=Flat(Block128(3339...))
      Chiplet 0: 8 rows, 2 active, product=Flat(Block128(1795...))

Each line names the table (Main or Chiplet N), the constraint or boundary or bus that failed, and the row. The constraint label ("fib_a") is whatever string you passed to cs.constrain_named(...); if you used plain cs.constrain(...) you get the index alone. Debugging goes from hours to seconds. Get in the habit of calling it first.

What you just proved, and what you didn't

Congratulations — you have a working zero-knowledge proof. The prover knows a 4 MB blob whose XOR fingerprint is T, the verifier confirms it, and the blob itself stays private. Good enough for parity checks, secret-sharing combiners, and any protocol whose primitive is XOR.

There's a hole, though, and it's the kind that only shows up once an adversary is in the picture. XOR is not collision-resistant. Given any blob, an attacker can swap two bytes, flip eight bits in the right places, or substitute an entirely different 4 MB payload — and as long as the new bytes XOR to the same T, your proof accepts it without complaint. The fingerprint says "some blob with this XOR." It does not say "the blob you were supposed to process." For a checksum that's fine. For anything where the content matters — a firmware image, a deterministic transcript, a program a CPU is about to execute — the gap is a soundness break dressed up as a feature.

That's what Chapter 2 closes.

Constraining what goes in

The fix is to publish the blob's positional content up front, not just its fingerprint. We'll commit to a specific 4 MB byte stream — a "program image" — and add a second guarantee on top of the existing checksum: not only does the blob XOR to T, but byte i of the blob equals byte i of the published image, for every i. Two guarantees, one proof. The XOR check still runs unchanged; the new check rules out every blob except the one image we sanctioned.

The Hekate primitive we'll use is positional equality: row i on the CPU side must match row i on the ROM side, on every active row. It's the exact contract you want whenever the CPU reads a pinned sequence in order — continuation-style rollups, streaming data pipelines, this checksum example.

A second table, linked by a bus

Hekate keeps side tables in chiplets. A chiplet is just an independent AIR — its own trace, its own commitment, its own sumcheck — running next to your main trace. The two are connected by a LogUp bus.

A bus spec carries a kind: Permutation or Lookup. The two do different jobs:

Permutation — cross-bus cancellation is multiset equality. Both endpoints emit the same multiset of rows, any order. Used where one side reshuffles the rows of the other (Chapter 3's RAM sorts events by address; the order doesn't match the CPU side but the contents do).
Lookup — cross-bus cancellation is pointwise equality on the padded hypercube. Row i on the CPU side must equal row i on the ROM side, on every active row. Used where the CPU is executing a pinned sequence: program images, deterministic tables, anything where position matters and not just contents.

For a committed program image we want Lookup. Both CpuFetchUnit::linking_spec() and RomChiplet::linking_spec() return Lookup-kind specs; you don't pick the kind yourself, the library specs pick it for you.

The bus contract under Lookup is: every active row on the CPU side emits (pc, opcode, args) into a ledger, every active row on the ROM side emits the same tuple, and the verifier checks that the two sides agree row for row. The moment CPU row 3 disagrees with ROM row 3 — a flipped bit in the opcode, a swapped argument, a PC off by one — the bus leaves a stray term and the proof fails.

Reshaping `ChecksumCols` to match the bus

use hekate_chiplets::{CpuFetchUnit, Instruction, RomChiplet, generate_rom_trace};
use hekate_program::chiplet::ChipletDef;
use hekate_program::permutation::PermutationCheckSpec;

Before we can wire up the bus, there's a small piece of reality to deal with. The LogUp bus works at byte granularity: each Source::Column(i) in a spec points at one column that holds one byte of the key. The shipped RomChiplet has a fixed 8-byte tuple shape — four PC bytes, one opcode, three args — and the matching CPU side is defined by CpuFetchUnit::linking_spec(), which hard-codes the column indices it expects. If we want to reuse that off-the-shelf spec (and we do — hand-rolling your own is Chapter-4 territory), the first nine columns of our main trace need to look exactly like CpuFetchColumns.

So we reshape ChecksumCols. The four data bytes move into the OPCODE and ARG* slots. We keep the four PC slots as zero-filled columns because we don't actually need a PC, but we need the shape. Q is renamed to SELECTOR to line up with the fetch unit's selector. The accumulator columns come after — they only matter for the checksum logic, not the bus.

define_columns! {
    ChecksumCols {
        // Fetch-shaped prefix (must match CpuFetchColumns position-for-position):
        PC_B0:    B32,
        PC_B1:    B32,
        PC_B2:    B32,
        PC_B3:    B32,
        OPCODE:   B32,
        ARG0:     B32,
        ARG1:     B32,
        ARG2:     B32,
        SELECTOR: Bit,

        // XOR accumulator, one lane per data byte:
        ACC_B0: B32,
        ACC_B1: B32,
        ACC_B2: B32,
        ACC_B3: B32,
    }
}

OPCODE and ARG0..ARG2 carry the four bytes of the data word now; PC_B0..B3 stay zero; SELECTOR does the same job Q did in Chapter 1. The accumulator splits into four parallel lanes, one per data byte, each with its own running XOR:

let sel = cs.col(ChecksumCols::SELECTOR);

for (data_idx, acc_idx) in [
    (ChecksumCols::OPCODE, ChecksumCols::ACC_B0),
    (ChecksumCols::ARG0,   ChecksumCols::ACC_B1),
    (ChecksumCols::ARG1,   ChecksumCols::ACC_B2),
    (ChecksumCols::ARG2,   ChecksumCols::ACC_B3),
] {
    let data = cs.col(data_idx);
    let acc = cs.col(acc_idx);
    let next = cs.next(acc_idx);

    cs.constrain(sel * (next + acc + data));
}

The boundary constraints grow to match — every accumulator lane needs its own start-at-zero and its own end-at-the-tag-byte. That means the public inputs vector grows from two entries to five: [ZERO, tb0, tb1, tb2, tb3], one zero everyone shares, plus the four bytes of the expected tag.

Wiring the bus

Now that the main trace has the right shape, wiring the bus is almost a formality. The CPU side is off-the-shelf (CpuFetchUnit::linking_spec()), and we wrap the RomChiplet with ChipletDef::from_air(...) to expose it as a proper independent table:

impl Air<F> for ChecksumProgram {
    // num_columns, column_layout, constraint_ast,
    // boundary_constraints as described above.

    fn permutation_checks(&self) -> Vec<(String, PermutationCheckSpec)> {
        vec![(RomChiplet::BUS_ID.into(), CpuFetchUnit::linking_spec())]
    }
}

impl Program<F> for ChecksumProgram {
    fn num_public_inputs(&self) -> usize { 5 }

    fn chiplet_defs(&self) -> errors::Result<Vec<ChipletDef<F>>> {
        Ok(vec![ChipletDef::from_air(&RomChiplet::new(self.num_rows))?])
    }
}

Notice what just happened: the main AIR didn't gain a single new transition constraint. The XOR checksum still runs, just across four byte lanes instead of one. All the new validity logic lives in the chiplet and the bus match. ChipletDef::from_air snapshots the chiplet's layout, constraints, and virtual expander in one call, and the ROM table runs as a fully independent trace — its own commitment, its own sumcheck, sitting alongside the main trace. The prover draws a per-bus challenge r_bus after every chiplet root has been absorbed into the transcript; that challenge is what weights the Lookup bus-sum term so positional equality is actually enforced at cryptographic soundness, not just at multiset level.

Building the program image

The chiplet side uses Instruction records with the same 8-byte layout the bus expects:

pub struct Instruction {
    pub pc: u32,
    pub opcode: u8,
    pub args: [u8; 3],
}

For a pure data channel we leave pc at zero and pack each 4-byte blob word into opcode and args. The program image is the blob itself, instruction-for-instruction. Since the bus enforces row-for-row equality with the main trace, the ROM needs the same number of rows as the main trace — 2^20:

let program_image: Vec<Instruction> = blob
    .chunks_exact(4)
    .map(|w| Instruction::new(0, w[0], [w[1], w[2], w[3]]))
    .collect();

let rom_trace = generate_rom_trace(&program_image, num_rows)?;

generate_rom_trace is a small helper in hekate-chiplets. It walks the Instruction list, fills the nine RomChiplet columns (four PC bytes, one opcode, three args, one selector) via TraceBuilder, and leaves the selector at zero on any unused rows so they stay quiet on the bus.

The height constraint is real, not cosmetic. A Lookup bus pads both endpoints to N_max = max(CPU_rows, ROM_rows) and demands row-i-equals-row-i on every row of that padded hypercube, padding included. Padding rows all carry selector = 0 on both sides — they match trivially. But that also means active CPU rows past the ROM's real height would be forced to selector = 0 too (to match ROM's padding), which disables them. Practical consequence: the number of active CPU rows is capped at the number of active ROM rows, and the active positions on both sides must align 1:1. For a linear execution pattern like this checksum the fit is natural (one CPU row per program record, trace ≈ program). Programs with control flow that revisit the same opcode many times need a different primitive and don't belong in this chapter.

The witness gains a chiplet trace

let (main_trace, _) = generate_trace(&blob, num_vars)?;
let rom_trace = generate_rom_trace(&program_image, num_rows)?;

let witness = ProgramWitness::new(main_trace).with_chiplets(vec![rom_trace]);

with_chiplets is a builder on ProgramWitness. The order of chiplet traces has to match chiplet_defs(), but that's the only rule — each chiplet has its own row count. For this program both are 2^20 because the bus is positional, but nothing forces them to match a priori. Different heights are fine for Permutation buses (Chapter 3) and for Lookup buses where the ROM is a short pinned table read from many execution rows — the bus handles N_max padding automatically.

Everything else — preflight, prove, verify — is the same call as Chapter 1. Preflight now also reports bus diagnostics: if the blob diverges from the program image at any row, preflight points at the row and at both sides' bytes before any proving work starts.

cargo run --release

Measured at 2^16 rows on the same hardware:

Trace Generation (main) done in 1.00ms
Trace Generation (ROM)  done in 1.79ms
Proving                 done in 86.77ms
Verifying               done in  6.65ms

Total Proof size: 789.02 KB (807960 bytes)
  Trace Commitment:           38 bytes
  Main AIR ZeroCheck:        850 bytes
  Eval Sumcheck:             578 bytes
  Tensor Vector (q):     142667 bytes
  Point Evaluations:         657 bytes
  LDT Merkle Paths:       71841 bytes
  LDT Opened Columns:    207841 bytes
  Chiplets (1):
    Commitments:              39 bytes
    ZeroChecks:              851 bytes
    Eval Arguments:       382489 bytes
  LogUp Bus Aux (2 specs):
    Main:                     54 bytes
    Chiplets:                 55 bytes

Cost over Chapter 1: ~3× prove time, ~2.5× proof size. Both deltas come from the new chiplet table — its eval argument (~382 KB) and the doubled LDT opened columns on the main side (now serving a bus consumer on every active row).

What you just proved, and what you didn't

Same ChecksumProgram as Chapter 1, now with a second guarantee alongside the original one: the blob is exactly the program image you handed to generate_rom_trace (byte-for-byte, row-for-row), and its XOR fingerprint still equals the tag. The fingerprint catches integrity bugs; the Lookup bus catches "wrong program" bugs. For a linear execution pattern like this checksum — CPU row i always processes program word i — positional equality is a tight fit.

There's still a structural gap, and it's the one that separates streaming programs from real programs. Try to extend ChecksumProgram with anything that needs to hold state across distant rows — a histogram bin you update on row 5 and again on row 1700, a stack you push to in one phase and pop from in another, an intermediate result you compute early and consume late — and you discover the main trace can't help you. AIR constraints only see the current row and the next. The Lookup bus only enforces row-i-equals-row-i. Neither gives you an address you can write to at one moment in time and read back from at another. Pure forward-walking arithmetic has no notion of memory.

That's the gap Chapter 3 closes — with a different chiplet that uses a different bus kind, doing a different job.

Remembering what you did

The new chiplet is RAM, and the contract it provides is exactly what the name suggests: the program writes (addr, val) pairs to a scratch buffer at one row and reads them back at a later row, and the chiplet proves that every read returned the value most recently written to that address. We'll use it to extend the running example: treat the blob as an access log. Every word becomes a write of (addr_i, word_i). After the writes finish, the program reads back a handful of addresses chosen by the verifier and claims the values match a public expectation. The proof now carries a third guarantee:

Every write happened, every read-back returned the value that was actually written, and nothing tampered with the buffer in between.

You genuinely cannot build this with a Lookup bus. That bus proves "row i on side A equals row i on side B" — it has no way to express reads and writes that need to agree across distant rows. Random-access memory needs a different contract: one where the two sides hold the same multiset of events but the ordering differs (program order on one side, sorted-by-address on the other), and the chiplet proves read-after-write consistency locally on the sorted side. That's a Permutation bus on top of a sort — the second bus kind we skipped in Chapter 2.

How RAM verification works

The RAM chiplet in Hekate is what the literature calls an offline memory checker. The idea is clean once you see it: the main trace emits a stream of memory events in execution order (one event per memory-op row), and the RAM chiplet takes the exact same events, but sorts them by (address, clock) first. On that sorted side the chiplet proves two simple local properties:

The sort is correct — adjacent rows really do satisfy (addr_i, clk_i) ≤ (addr_{i+1}, clk_{i+1}).
The values are consistent — within any run of rows at the same address, every read returns the value of the most recent write (or a zero default on first touch).

The two sides talk through the "ram_link" bus, which carries a 13-field tuple for every event: (addr_b0..b3, clk_b0..b3, val_b0..b3, is_write). The CPU side adds events in program order, the RAM side adds the same events in sorted order, and the LogUp bus proves the two multisets match. Local sorting + local consistency + global multiset equality = a fully verified memory oracle. Nothing materialises a merged trace, nothing forces the two sides to line up row for row.

Memory events

use hekate_chiplets::{CpuMemoryUnit, MemoryEvent, RamChiplet, generate_ram_trace};

pub struct MemoryEvent {
    pub addr: u32,
    pub clk: u32,
    pub val: u32,
    pub is_write: bool,
}

impl MemoryEvent {
    pub fn write(addr: u32, clk: u32, val: u32) -> Self { /* ... */ }
    pub fn read(addr: u32, clk: u32, val: u32) -> Self { /* ... */ }
}

One MemoryEvent per memory op. The four CPU-side clock bytes clk_b0..b3 are not real columns. The bus generates them virtually from the CPU's row index via Source::RowIndexByte(0) through Source::RowIndexByte(3). On the RAM side the sorted trace does store CLK_B0..B3 as real columns, because the sort has reshuffled the events and the chiplet can no longer rely on row-index-as-time.

Split the blob into 4-byte chunks, pick the public read-back schedule, and size the RAM table to cover every event (power of two):

let chunks: Vec<[u8; 4] > = blob
    .chunks_exact(4)
    .map(|c| c.try_into().unwrap())
    .collect();

let public_reads: Vec<(u32, u32) > = /* (addr, expected_value) pairs known to the verifier */;

let ram_rows: usize = (chunks.len() + public_reads.len()).next_power_of_two();

Build the event list from those inputs, writes first and then read-backs:

let mut events = Vec::with_capacity(ram_rows);

for (i, chunk) in chunks.iter().enumerate() {
    events.push(MemoryEvent::write(i as u32, i as u32, u32::from_le_bytes(*chunk)));
}

for (addr, expected) in &public_reads {
    let later_clk = events.len() as u32;
    events.push(MemoryEvent::read(*addr, later_clk, *expected));
}

let ram_trace = generate_ram_trace(&events, ram_rows)?;

generate_ram_trace does a lot under the hood: it sorts the events by (addr, clk), checks that the sequence is locally consistent (no read disagrees with its last write), fills the 20 physical RAM columns, and flat-out refuses to build the trace if it finds a mismatch. It's your first line of defence — preflight would catch the same problems, but generate_ram_trace catches them before the witness even leaves your process.

What grows on the main side

Everything from Chapter 2 stays exactly where it was. We just append ten new columns at the end of ChecksumCols — four address bytes, four value bytes, IS_WRITE, and a memory selector MEM_SEL — to carry the CPU side of the memory bus:

define_columns! {
    ChecksumCols {
        // Chapter 2 layout (unchanged):
        PC_B0:    B32,
        PC_B1:    B32,
        PC_B2:    B32,
        PC_B3:    B32,
        OPCODE:   B32,
        ARG0:     B32,
        ARG1:     B32,
        ARG2:     B32,
        SELECTOR: Bit,
        ACC_B0:   B32,
        ACC_B1:   B32,
        ACC_B2:   B32,
        ACC_B3:   B32,

        // Chapter 3 additions:
        MEM_ADDR_B0:  B32,
        MEM_ADDR_B1:  B32,
        MEM_ADDR_B2:  B32,
        MEM_ADDR_B3:  B32,
        MEM_VAL_B0:   B32,
        MEM_VAL_B1:   B32,
        MEM_VAL_B2:   B32,
        MEM_VAL_B3:   B32,
        MEM_IS_WRITE: Bit,
        MEM_SEL:      Bit,
    }
}

There's one small gotcha. CpuMemoryUnit::linking_spec() was written assuming CpuMemColumns starts at index 0 — but in our layout MEM_ADDR_B0 sits at index 13, after everything we already had. So before we hand the spec to the bus, we shift its column indices by the right offset. That's what shift_column_indices does. The Air impl gains one entry for the RAM bus, and Program<F> gains one entry for the RAM chiplet:

fn permutation_checks(&self) -> Vec<(String, PermutationCheckSpec)> {
    let mut mem_spec = CpuMemoryUnit::linking_spec();
    mem_spec.shift_column_indices(ChecksumCols::MEM_ADDR_B0);

    vec![
        (RomChiplet::BUS_ID.into(), CpuFetchUnit::linking_spec()),
        (RamChiplet::BUS_ID.into(), mem_spec),
    ]
}

impl Program<F> for ChecksumProgram {
    fn num_public_inputs(&self) -> usize { 5 }

    fn chiplet_defs(&self) -> errors::Result<Vec<ChipletDef<F>>> {
        Ok(vec![
            ChipletDef::from_air(&RomChiplet::new(self.num_rows))?,
            ChipletDef::from_air(&RamChiplet::new(self.num_rows))?,
        ])
    }
}

The public-input count stays at 5 from Chapter 2 (ZERO plus four tag bytes). Chapter 3 adds no boundary constraints; the RAM chiplet does all the value-binding work through the bus.

Two chiplets, two buses, two independent traces, and the main AIR didn't gain a single new transition constraint for it. All the new logic lives in the RAM chiplet itself and in the bus match. This is the whole payoff of the multi-table architecture: features compose by adding tables, not by bloating your main AIR.

Keeping padding inert

We now have two selectors — SELECTOR for the checksum side and MEM_SEL for the memory side. Padding rows set both to zero, and that's enough: the XOR transition goes silent whenever SELECTOR = 0, and the memory bus contribution goes silent whenever MEM_SEL = 0 (both specs are gated by their selectors). So padding rows simply vanish from the protocol — no extra AIR constraints required.

If you were building a program that could halt mid-trace (an early return, an interrupt), you'd also need the Ghost Protocol: an extra constraint s_halt * MEM_SEL = 0 that forces halted rows to stay off the memory bus even if an attacker tried to fill them with fake events. We don't halt here, so a plain boolean selector is all we need. The Cryptographic Chiplets page covers the full Ghost Protocol story.

The witness gains a third trace

generate_trace grows one argument: the public read-back schedule &[(addr, value)]. The body still does everything it did in Chapter 2 (fill OPCODE/ARG*, ACC_B*, SELECTOR, run the XOR), and then layers on two new phases for the memory columns: writes for the data words, reads for the public check-backs. The return type is the same as before, so the tag boundary constraint keeps working without any downstream change:

fn generate_trace(
    blob: &[u8],
    public_reads: &[(u32, u32)],
    num_vars: usize,
) -> errors::Result<(ColumnTrace, Block32)> {
    // ... Chapter 2 writes:
    // PC_B*=0, OPCODE/ARG*=data bytes,
    // ACC_B*, SELECTOR, running XOR ...

    let chunks: Vec<[u8; 4]> = blob
        .chunks_exact(4)
        .map(|c| c.try_into().unwrap())
        .collect();

    // Phase 1:
    // one write event per data row, 
    // addr = row index, val = data word.
    for (i, chunk) in chunks.iter().enumerate() {
        let addr_bytes = (i as u32).to_le_bytes();
        let val_bytes = *chunk;

        for b in 0..4 {
            tb.set_b32(ChecksumCols::MEM_ADDR_B0 + b, i, Block32::from(addr_bytes[b] as u32))?;
            tb.set_b32(ChecksumCols::MEM_VAL_B0 + b, i, Block32::from(val_bytes[b] as u32))?;
        }
        
        tb.set_bit(ChecksumCols::MEM_IS_WRITE, i, Bit::ONE)?;
        tb.set_bit(ChecksumCols::MEM_SEL, i, Bit::ONE)?;
    }

    // Phase 2:
    // append read-back rows for
    // every scheduled (addr, value).
    for (k, (addr, val)) in public_reads.iter().enumerate() {
        let row = chunks.len() + k;
        let addr_bytes = addr.to_le_bytes();
        let val_bytes = val.to_le_bytes();

        for b in 0..4 {
            tb.set_b32(ChecksumCols::MEM_ADDR_B0 + b, row, Block32::from(addr_bytes[b] as u32))?;
            tb.set_b32(ChecksumCols::MEM_VAL_B0 + b, row, Block32::from(val_bytes[b] as u32))?;
        }
        
        tb.set_bit(ChecksumCols::MEM_IS_WRITE, row, Bit::ZERO)?;
        tb.set_bit(ChecksumCols::MEM_SEL, row, Bit::ONE)?;
    }

    // Unused rows stay zero => MEM_SEL = 0 => padding stays off the bus.
    Ok((tb.build(), expected_tag))
}

Three traces in total, one main plus two chiplets:

let (main_trace, _) = generate_trace(&blob, &public_reads, num_vars)?;
let rom_trace = generate_rom_trace(&allow_list, rom_rows)?;
let ram_trace = generate_ram_trace(&events,     ram_rows)?;

let witness = ProgramWitness::new(main_trace)
    .with_chiplets(vec![rom_trace, ram_trace]);

Three traces in, one witness out. The order has to match chiplet_defs() but row counts are fully independent — the main trace at 2^20, ROM at 256, RAM at 2^20 (it must hold every event). Preflight now reports bus diagnostics for every bus; if the read-back at address A expected value V but you actually wrote W, preflight tells you the row, the address, and both values.

cargo run --release

Measured at 2^16 rows on the same hardware:

Trace Generation (main) done in   4.32ms
Trace Generation (ROM)  done in   1.75ms
Trace Generation (RAM)  done in 105.89ms
Proving                 done in 388.49ms
Verifying               done in  11.57ms

Total Proof size: 1371.88 KB (1404810 bytes)
  Trace Commitment:           38 bytes
  Main AIR ZeroCheck:        850 bytes
  Eval Sumcheck:             578 bytes
  Tensor Vector (q):     142667 bytes
  Point Evaluations:         933 bytes
  LDT Merkle Paths:       71841 bytes
  LDT Opened Columns:    294881 bytes
  Chiplets (2):
    Commitments:              77 bytes
    ZeroChecks:             2517 bytes
    Eval Arguments:       890213 bytes
  LogUp Bus Aux (4 specs):
    Main:                    106 bytes
    Chiplets:                109 bytes

Cost over Chapter 2: ~4.5× prove time, ~1.7× proof size. RAM trace generation alone is 106 ms — the chiplet sorts every event by (addr, clk) and verifies local read-after-write consistency before the prover ever touches the trace. The RAM chiplet's eval argument adds ~510 KB; the LDT-opened-columns line on the main side grows again because the new cpu_mem bus claims another column at every active row.

What you just proved

Same ChecksumProgram you started with — doing three things at once now:

The blob's XOR fingerprint equals the public tag. (Chapter 1.)
The blob is exactly the pinned program image, row-for-row. (Chapter 2.)
The program wrote each word into a scratch buffer, and every public read-back returns the value that was actually written. (Chapter 3.)

That's the full zkVM trick in miniature: a streaming linear trace, a positional bus to bind the program image, a memory oracle for random access. Every larger Hekate program — AES, Keccak, ML-KEM, ML-DSA, rollup state transitions — is built out of exactly these primitives. You already know all of them.

Where to go next

Chiplets, the full library: AES, Keccak, ROM, RAM, NTT, ML-KEM, ML-DSA, and how to write your own.
Writing AIR Constraints, the constraint DSL in depth.
Preflight, row-level diagnostics and what the report fields mean.
System Architecture, Sumcheck + Brakedown + LogUp end to end.

Your First ZK Program

The idea

The trace

What "zero-knowledge" means in this proof

The constraints

Filling the trace

Proving and verifying

Why preflight matters

What you just proved, and what you didn't

Constraining what goes in

A second table, linked by a bus

Reshaping ChecksumCols to match the bus

Wiring the bus

Building the program image

The witness gains a chiplet trace

What you just proved, and what you didn't

Remembering what you did

How RAM verification works

Memory events

What grows on the main side

Keeping padding inert

The witness gains a third trace

What you just proved

Where to go next

Reshaping `ChecksumCols` to match the bus