Oumuamua Labs

Hekate ZK Engine Docs

Zero-knowledge proof system over binary tower fields. Streaming architecture. Bounded memory. Edge-native.

Hekate proves computations in GF(2^128) using Sumcheck + Brakedown PCS with O(N) prover time and O(N) memory. No FFTs, no trace materialization, no server-grade RAM requirements. It proves ML-KEM decapsulation and ML-DSA signature verification on a laptop or a mobile device.


Why Hekate Exists

Current ZK provers (RISC Zero, Plonky2, Plonky3, Binius, Stwo, Winterfell) materialize the full execution trace in RAM before proving. Most then run FFT-based commitments (FRI, Circle FRI) that blow up memory by 2x–8x on top of the trace, with O(N log N) prover time. This "monolithic trace + FFT blowup" architecture imposes a hard floor on memory: 128 GB+ for real workloads, 76 GB just for Keccak at 2^20 scale (Binius), and swap death at 2^24 (Plonky3).

That floor kills client-side proving. No mobile device, no browser, no edge node can run these provers.

Hekate eliminates the floor. The prover streams through the trace, folds in-place, and discards intermediate state. Peak memory is bounded per-table, not per-computation. A 2^24 Keccak proof runs in 29.7 GB on a consumer laptop where Binius and Plonky3 crash or thrash.
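
One way to read "folds in-place" is the standard sumcheck step that binds one variable of a multilinear evaluation table and halves the buffer. The sketch below illustrates that step only and is not Hekate's prover. It uses GF(2^8) with the AES reduction polynomial as a small stand-in field, pairs adjacent entries (the variable order is an assumption), and the names gf256_mul and fold_in_place are hypothetical.

```rust
/// Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (the AES polynomial).
/// A small stand-in field; Hekate itself works in tower fields up to GF(2^128).
pub fn gf256_mul(mut a: u8, mut b: u8) -> u8 {
    let mut acc = 0u8;
    while b != 0 {
        if b & 1 == 1 {
            acc ^= a;
        }
        let carry = a & 0x80 != 0;
        a <<= 1;
        if carry {
            a ^= 0x1B; // reduce by the low bits of the field polynomial
        }
        b >>= 1;
    }
    acc
}

/// Bind one variable of a multilinear table to `r`, in place.
/// Each output is lo + r * (lo + hi); addition is XOR in characteristic 2.
/// The buffer shrinks to half its length and no second copy is allocated.
pub fn fold_in_place(evals: &mut Vec<u8>, r: u8) {
    assert!(evals.len() >= 2 && evals.len().is_power_of_two());
    let half = evals.len() / 2;
    for i in 0..half {
        // Reads at 2i and 2i+1 never fall behind the write at i.
        let lo = evals[2 * i];
        let hi = evals[2 * i + 1];
        evals[i] = lo ^ gf256_mul(r, lo ^ hi);
    }
    evals.truncate(half);
}
```

Calling fold_in_place once per variable leaves a single value, and every call reuses the same allocation.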


What It Does

Binary tower field arithmetic: GF(2^8) through GF(2^128) by recursive tower extension, hardware-accelerated via PMULL on aarch64 (PCLMULQDQ on x86_64 is on the roadmap). Constant-time by default.

Chiplet architecture: independent AIR tables (Keccak, AES, RAM, NTT, ML-KEM, ML-DSA), each with its own trace and commitment. No column waste, no forced padding. Tables are linked by a LogUp bus.

Virtual packing: Keccak stores its 1600-bit state in 25 physical B64 columns instead of 1600 single-bit columns. Bits are expanded just in time in registers, for a 16x memory saving.

Linear-code commitments: Brakedown PCS with O(N) prover time and O(N) memory. No FFT blowup. The Merkle tree is built over encoded columns only; the raw trace is never hashed (true ZK).

Post-quantum crypto suite: ML-DSA (Dilithium) signature verification, ML-KEM (Kyber) decapsulation, and AES-128/256, all proven natively in binary fields without bit-decomposition overhead.
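
The virtual-packing idea can be sketched in a few lines: keep the Keccak state as 25 words of 64 bits and derive each logical bit on demand instead of storing it. This is a minimal illustration under stated assumptions, not Hekate's column types: PackedState, bit, and bits are hypothetical names, and the least-significant-bit-first ordering is an assumption.

```rust
/// Keccak state packed as 25 lanes of 64 bits (25 * 64 = 1600 bits).
/// `PackedState` is an illustrative type, not a Hekate type.
pub struct PackedState {
    pub lanes: [u64; 25],
}

impl PackedState {
    /// Expand logical bit `i` (0..1600) on demand from its physical lane.
    /// Bit order within a lane is assumed to be least-significant first.
    pub fn bit(&self, i: usize) -> bool {
        assert!(i < 1600, "bit index out of range");
        let lane = self.lanes[i / 64];
        (lane >> (i % 64)) & 1 == 1
    }

    /// Iterate over all 1600 logical bits without materializing them.
    pub fn bits(&self) -> impl Iterator<Item = bool> + '_ {
        (0..1600).map(move |i| self.bit(i))
    }
}
```

Only the 25 lanes live in memory; a constraint that needs bit i reads it through bit(i) and never writes it back.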


Architecture at a Glance

       you write here
            │
   ┌────────▼────────┐
   │   hekate-sdk    │   author API, serialization, preflight
   │ hekate-program  │   AIR + constraint DSL + chiplet composition
   │ hekate-chiplets │   Keccak, AES, RAM, ROM, NTT, ML-KEM, ML-DSA
   └────────┬────────┘
            │
   ┌────────▼────────┐
   │   hekate-core   │   trace, transcript, Merkle, polys
   │  hekate-crypto  │   Blake3, SHA3, SHA-256
   │   hekate-math   │   tower fields (external, sealed)
   └────────┬────────┘
            │
   ┌────────┴────────┐
   ▼                 ▼
hekate-prover   hekate-verifier
(closed)        (open)

Quick Example

Real 32-bit-integer Fibonacci. The CPU side holds five columns and the two Fibonacci transition constraints. Every u32 ADD is offloaded to the IntArithmeticChiplet, which has its own trace, commitment, ZeroCheck, and evaluation argument, and is wired in by a LogUp bus keyed on (val_a, val_b, val_res, opcode, request_idx) with a row-index clock.

type F = Block128;

#[derive(Clone)]
struct FibProgram {
    num_rows: usize,
}

impl Air<F> for FibProgram {
    fn num_columns(&self) -> usize {
        CpuArithColumns::NUM_COLUMNS
    }

    fn column_layout(&self) -> &[ColumnType] {
        static LAYOUT: std::sync::OnceLock<Vec<ColumnType>> = std::sync::OnceLock::new();
        LAYOUT.get_or_init(CpuArithColumns::build_layout)
    }

    fn boundary_constraints(&self) -> Vec<BoundaryConstraint<F>> {
        vec![BoundaryConstraint::with_public_input(
            CpuArithColumns::VAL_B,
            self.num_rows - 1,
            0,
        )]
    }

    fn permutation_checks(&self) -> Vec<(String, PermutationCheckSpec)> {
        vec![(
            IntArithmeticChiplet::BUS_ID.into(),
            CpuIntArithmeticUnit::linking_spec(),
        )]
    }

    fn constraint_ast(&self) -> ConstraintAst<F> {
        let cs = ConstraintSystem::<F>::new();

        let s = cs.col(CpuArithColumns::SELECTOR);
        let val_b = cs.col(CpuArithColumns::VAL_B);
        let val_res = cs.col(CpuArithColumns::VAL_RES);
        let next_a = cs.next(CpuArithColumns::VAL_A);
        let next_b = cs.next(CpuArithColumns::VAL_B);

        cs.assert_boolean(s);
        cs.constrain(s * (next_a + val_b));     // next_a = b
        cs.constrain(s * (next_b + val_res));   // next_b = a + b (chiplet provides val_res)

        cs.build()
    }
}

impl Program<F> for FibProgram {
    fn num_public_inputs(&self) -> usize { 1 }

    fn chiplet_defs(&self) -> errors::Result<Vec<ChipletDef<F>>> {
        let arith = IntArithmeticChiplet::new(32, self.num_rows)?;
        Ok(vec![ChipletDef::from_air(&arith)?])
    }
}
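
The two transition constraints in constraint_ast rely on a property of binary fields: addition is XOR, so x + y = 0 exactly when x = y. The stand-alone sketch below replays that logic on plain u32 values to show what the constraints accept and reject. Row, transitions_hold, and fib_rows are hypothetical names for illustration, and u32 XOR is only a stand-in for Block128 arithmetic.

```rust
/// One CPU row, using plain u32 values as a stand-in for field elements.
#[derive(Clone, Copy, Debug)]
pub struct Row {
    pub selector: bool,
    pub val_a: u32,
    pub val_b: u32,
    pub val_res: u32,
}

/// Check both transition constraints on every adjacent pair of rows.
/// In characteristic 2, x + y = 0 exactly when x == y, so `^` models `+`.
pub fn transitions_hold(rows: &[Row]) -> bool {
    rows.windows(2).all(|w| {
        let (cur, next) = (w[0], w[1]);
        // s * (next_a + val_b) = 0 and s * (next_b + val_res) = 0
        !cur.selector
            || ((next.val_a ^ cur.val_b) == 0 && (next.val_b ^ cur.val_res) == 0)
    })
}

/// Build a toy Fibonacci trace; the last row is padding with selector = 0.
pub fn fib_rows(n: usize) -> Vec<Row> {
    let (mut a, mut b) = (0u32, 1u32);
    let mut rows = Vec::with_capacity(n);
    for i in 0..n {
        let active = i + 1 < n;
        let res = if active { a.wrapping_add(b) } else { 0 };
        rows.push(Row { selector: active, val_a: a, val_b: b, val_res: res });
        if active {
            a = b;
            b = res;
        }
    }
    rows
}
```

The checker never verifies that val_res equals a + b as integers; in the design described here that job belongs to the chiplet behind the bus.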

Trace generation builds the CPU columns and the chiplet trace independently; they meet on the bus.

fn generate_traces(num_rows: usize) -> errors::Result<(ColumnTrace, ColumnTrace, u32)> {
    let num_vars = num_rows.trailing_zeros() as usize;

    let mut tb = TraceBuilder::new(&CpuArithColumns::build_layout(), num_vars)?;
    let mut ops: Vec<IntArithmeticOp> = Vec::with_capacity(num_rows - 1);

    let mut a: u32 = 0;
    let mut b: u32 = 1;

    for i in 0..num_rows - 1 {
        let res = a.wrapping_add(b);

        tb.set_b32(CpuArithColumns::VAL_A, i, Block32::from(a))?;
        tb.set_b32(CpuArithColumns::VAL_B, i, Block32::from(b))?;
        tb.set_b32(CpuArithColumns::VAL_RES, i, Block32::from(res))?;
        tb.set_b32(CpuArithColumns::OPCODE, i, Block32::from(ArithmeticOpcode::ADD as u32))?;
        tb.set_bit(CpuArithColumns::SELECTOR, i, Bit::ONE)?;

        ops.push(IntArithmeticOp::U32 {
            op: ArithmeticOpcode::ADD,
            a,
            b,
            request_idx: i as u32,
        });

        a = b;
        b = res;
    }

    // Padding row: selector = 0, val_b carries fib[N-1] for the boundary check.
    tb.set_b32(CpuArithColumns::VAL_A, num_rows - 1, Block32::from(a))?;
    tb.set_b32(CpuArithColumns::VAL_B, num_rows - 1, Block32::from(b))?;

    let cpu_trace = tb.build();

    let arith_layout = IntArithmeticLayout::compute(32);
    let arith_trace = generate_arithmetic_trace(&ops, &arith_layout, num_rows)?;

    Ok((cpu_trace, arith_trace, b))
}
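
generate_traces derives num_vars with trailing_zeros, which equals log2(num_rows) only when num_rows is a power of two. Whether TraceBuilder::new rejects other sizes is not stated here, so a caller may want an explicit guard. The helper below is a hypothetical sketch, not part of the Hekate SDK.

```rust
/// Return log2(num_rows) when `num_rows` is a power of two, else `None`.
/// `num_vars_for` is a hypothetical helper, not a Hekate SDK function.
pub fn num_vars_for(num_rows: usize) -> Option<usize> {
    if num_rows.is_power_of_two() {
        Some(num_rows.trailing_zeros() as usize)
    } else {
        // Example: 24 has 3 trailing zeros, which would silently mean 8 rows.
        None
    }
}
```

The function above also indexes row num_rows - 1, so a real guard would additionally require num_rows to be at least 2.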

Wiring it together for the prover:

let (cpu, arith, fib_n) = generate_traces(num_rows)?;
let instance = ProgramInstance::new(num_rows, vec![F::from(fib_n as u128)]);
let witness  = ProgramWitness::new(cpu).with_chiplets(vec![arith]);

The chiplet enforces 32-bit ADD with carry, boolean-checks its own selectors, and zero-pins shadow columns when its row is idle. The CPU AIR needs only the two transition constraints above; the LogUp bus guarantees val_res = a + b for every row where s = 1.
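
To make the bus guarantee concrete, here is a toy version of the log-derivative (LogUp) balance check: each side compresses its tuples into keys and sums 1 / (alpha - key), and the bus balances when both sums agree. This sketch deliberately swaps in a prime field (modulus 2^61 - 1) for readability, whereas Hekate runs over binary fields. It also fixes alpha and beta as constants, where a real protocol draws them as verifier challenges. All function names are hypothetical.

```rust
/// Toy prime modulus 2^61 - 1; Hekate's bus runs over binary fields instead.
const P: u128 = (1u128 << 61) - 1;

fn pow_mod(mut base: u128, mut exp: u128) -> u128 {
    let mut acc = 1u128;
    base %= P;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % P;
        }
        base = base * base % P;
        exp >>= 1;
    }
    acc
}

/// Modular inverse via Fermat's little theorem (x must be nonzero mod P).
fn inv(x: u128) -> u128 {
    pow_mod(x, P - 2)
}

/// Compress a bus tuple (a, b, res, opcode, idx) into one field element
/// by Horner evaluation: t0 + beta*t1 + beta^2*t2 + ... (beta < P).
pub fn key(t: &[u64; 5], beta: u128) -> u128 {
    t.iter()
        .rev()
        .fold(0u128, |acc, &x| (acc * beta + (x as u128) % P) % P)
}

/// Sum of 1 / (alpha - key) over all tuples (alpha < P).
pub fn log_derivative_sum(tuples: &[[u64; 5]], alpha: u128, beta: u128) -> u128 {
    tuples
        .iter()
        .map(|t| inv((alpha + P - key(t, beta)) % P))
        .fold(0u128, |acc, x| (acc + x) % P)
}

/// The bus balances when requests and responses give the same sum.
pub fn bus_balances(
    req: &[[u64; 5]],
    resp: &[[u64; 5]],
    alpha: u128,
    beta: u128,
) -> bool {
    log_derivative_sum(req, alpha, beta) == log_derivative_sum(resp, alpha, beta)
}
```

Row order does not matter, only the multiset of tuples. Changing a single val_res on one side changes one key and breaks the equality.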


Performance

All numbers were measured on an Apple M3 Max (16 cores, 48 GB RAM) with --release and -C target-cpu=native, features std parallel blake3 table-math, on the master branch, using the example binaries in hekate/examples/. Peak and total heap are measured via dhat-heap.

Reproduce:

RUSTFLAGS="-C target-cpu=native" cargo run --release \
  --no-default-features --features "std parallel blake3 table-math" \
  --example <name> [-- <arg>]

Post-Quantum Crypto and AES

              ML-KEM-768  ML-DSA-44   ML-DSA-65   ML-DSA-87   AES-128     AES-256
Proving       1.40 s      2.43 s      2.54 s      3.98 s      2.15 s      2.27 s
Verification  30.6 ms     69.0 ms     70.7 ms     115.6 ms    24.5 ms     25.9 ms
Proof Size    4,232 KiB   5,139 KiB   5,156 KiB   8,620 KiB   3,405 KiB   3,706 KiB
Peak Heap     331 MB      294 MB      294 MB      580 MB      772 MB      1,005 MB
Total Alloc   1.58 GB     3.75 GB     3.76 GB     7.28 GB     2.05 GB     2.40 GB
Chiplets      6           7           7           7           2           2

Chiplet trace sizes:

AES note: both AES-128 and AES-256 prove 31,250 blocks (~500 KB of plaintext) per run. The CPU trace has 2^16 rows; the Round-AIR and S-box ROM chiplets have 2^19 rows each. Per-block proving cost: ~69 µs (AES-128) / ~73 µs (AES-256).
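
The per-block figures follow directly from the proving times and the block count. A small sketch of that arithmetic, with hypothetical helper names:

```rust
/// Per-block proving cost in microseconds, from total seconds and block count.
pub fn per_block_micros(total_secs: f64, blocks: u32) -> f64 {
    total_secs * 1e6 / blocks as f64
}

/// Plaintext size in bytes for a number of 16-byte AES blocks.
pub fn plaintext_bytes(blocks: u32) -> u32 {
    blocks * 16
}
```

With 31,250 blocks, 2.15 s gives 68.8 µs per block and 2.27 s gives 72.64 µs, which round to the quoted ~69 µs and ~73 µs; 31,250 blocks of 16 bytes are 500,000 bytes.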

Keccak-f[1600] scaling

hekate/examples/keccak_inline.rs <num_vars>, default 20.

Scale (rows)  Permutations  Hashed    Proving   Verify    Proof Size  Peak Heap  Total Alloc
2^15          1,310         ~178 KB   919 ms    23.3 ms   1,312 KiB   92 MB      255 MB
2^20          41,943        ~5.4 MB   14.16 s   87.0 ms   5,156 KiB   2,278 MB   3,747 MB
2^24          671,088       ~91 MB    268.08 s  333.9 ms  20,209 KiB  31,088 MB  51,535 MB

Fibonacci (32-bit integer add) scaling

hekate/examples/fibonacci_raw.rs <num_vars>, default 24. Each row: bit-sliced 32-bit add with explicit carry chain, virtual-expanded into 32 bit + 32 sum + 32 carry columns.

Scale (rows)  Proving  Verify   Proof Size  Peak Heap  Total Alloc
2^20          745 ms   10.1 ms  1,125 KiB   209 MB     361 MB
2^24          11.30 s  36.9 ms  4,237 KiB   3,077 MB   5,210 MB
2^26          47.20 s  76.1 ms  8,378 KiB   12,072 MB  20,486 MB
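
One way to read the Fibonacci numbers against the O(N) claims is to normalize them per trace row. The helpers below are a hypothetical sketch of that normalization and assume the table's MB means 10^6 bytes.

```rust
/// Proving time per trace row in nanoseconds for a trace of 2^num_vars rows.
pub fn nanos_per_row(proving_secs: f64, num_vars: u32) -> f64 {
    proving_secs * 1e9 / (1u64 << num_vars) as f64
}

/// Peak heap per trace row in bytes, assuming MB means 10^6 bytes.
pub fn peak_bytes_per_row(peak_mb: f64, num_vars: u32) -> f64 {
    peak_mb * 1e6 / (1u64 << num_vars) as f64
}
```

Across the three scales the proving cost stays at roughly 670 to 710 ns per row and peak heap at roughly 180 to 200 bytes per row, which is what linear scaling in N looks like.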

Hardware Support

Architecture  Status       Instructions
aarch64       Production   PMULL, NEON
x86_64        Development  Software fallback (PCLMULQDQ roadmap)
WASM          Fallback     Software multiply
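
For context, the primitive that PMULL and PCLMULQDQ compute in hardware is a carry-less multiply: polynomial multiplication over GF(2). A generic software version is sketched below. It shows the primitive only and makes no claim about how Hekate's fallback is written; clmul64 is a hypothetical name. The mask avoids a data-dependent branch, in keeping with the constant-time goal.

```rust
/// Carry-less (polynomial over GF(2)) multiplication of two 64-bit values.
/// Branch-free: each partial product is selected with an all-ones/zero mask.
pub fn clmul64(a: u64, b: u64) -> u128 {
    let mut acc = 0u128;
    for i in 0..64 {
        // mask is all ones when bit i of b is set, else zero
        let mask = 0u128.wrapping_sub(((b >> i) & 1) as u128);
        acc ^= ((a as u128) << i) & mask;
    }
    acc
}
```

For example, clmul64(3, 3) is 5, because (x + 1)^2 = x^2 + 1 when coefficients are reduced mod 2. A field multiplication follows this step with a reduction by the field's defining polynomial.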

Status

Hekate verifier, core SDK, and chiplets are being open-sourced. The prover and recursive engine remain closed-source, licensed as proprietary binaries.

Contact

[email protected]