Phase 5 — AlphaEvolve → OpenEvolve

Status: scaffold landed (pccx-evolve trait scaffolds + speculative); implementation kicks off in Phase 5 proper. Scope: roadmap Weeks 19-30 (±4 weeks uncertainty band); milestones 5A, 5B, 5C + user-requested 5D, 5E. Thesis: Fix the weaknesses of existing AlphaEvolve-style systems by combining LLM + Reinforcement Learning + Formal Methods + Surrogate Models.

1. Architecture overview

            ┌─────────────────────────────────────────────────┐
            │              pccx-evolve                         │
            │                                                  │
User spec ──▶│  ┌──────────┐   ┌──────────┐   ┌──────────┐  │── accepted
            │  │ LLM prop. │──▶│ PRM gate │──▶│ Surrogate│  │   candidate
            │  │ (Sonnet)  │   │ (fast)   │   │ (GNN)    │  │──▶
            │  └──────────┘   └──────────┘   └──────────┘  │
            │        ▲                                       │
            │        │   evolutionary loop (mutate+cross)   │
            │        └───────────────────────────────────────┤
            │                                                │
            │  Sail refinement check (pccx-verification)    │
            │  Formal property check (Lean 4)               │
            └────────────────────────────────────────────────┘

Five lanes:

| Lane | Input | Output | Audience |
| --- | --- | --- | --- |
| 5A Chip DSE | RTL + target | RTL variants (Pareto front) | HW engineer |
| 5B Compiler DSE | High-level code | LLVM pass order + RL’d alloc | SW engineer |
| 5C OS/Kernel formal | Kernel C | Proven kernel module | Systems engineer |
| 5D Model → API | HF model + spec | Target-specific driver code | AI researcher |
| 5E Model → RTL | HF model + spec | Custom NPU RTL + proof | Chip architect |

5D and 5E were requested by the user on 2026-04-24 and build on top of 5A and 5C.
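All five lanes share the propose → gate → score pipeline from the diagram above. A minimal Rust sketch of how the stages might compose — trait and type names here are hypothetical stand-ins, not the actual pccx-evolve scaffolds:

```rust
/// A candidate produced by the LLM proposer (e.g. an RTL variant).
/// Illustrative types only; the real pccx-evolve traits may differ.
#[derive(Clone, Debug)]
struct Candidate {
    source: String,
}

/// Fast static checks (Verilator elaborate, verible-lint, timing sanity).
trait PrmGate {
    fn admit(&self, c: &Candidate) -> bool;
}

/// Cheap learned cost model standing in for full synthesis.
trait Surrogate {
    /// Predicted (area, power, fmax) for a candidate.
    fn predict(&self, c: &Candidate) -> (f64, f64, f64);
}

/// One generation: filter through the gate, then rank by surrogate fitness.
fn rank_generation<G: PrmGate, S: Surrogate>(
    gate: &G,
    surrogate: &S,
    pop: Vec<Candidate>,
) -> Vec<(Candidate, f64)> {
    let mut scored: Vec<_> = pop
        .into_iter()
        .filter(|c| gate.admit(c))
        .map(|c| {
            let (area, power, fmax) = surrogate.predict(&c);
            // Toy scalar fitness: favour fmax, penalise area and power.
            let fitness = fmax - area - power;
            (c, fitness)
        })
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored
}

struct LintGate;
impl PrmGate for LintGate {
    fn admit(&self, c: &Candidate) -> bool {
        !c.source.is_empty() // stand-in for real lint/elaboration checks
    }
}

struct FlatSurrogate;
impl Surrogate for FlatSurrogate {
    fn predict(&self, c: &Candidate) -> (f64, f64, f64) {
        (c.source.len() as f64 * 0.1, 1.0, 100.0) // toy numbers
    }
}

fn main() {
    let pop = vec![
        Candidate { source: "module a; endmodule".into() },
        Candidate { source: String::new() }, // fails the gate
    ];
    let ranked = rank_generation(&LintGate, &FlatSurrogate, pop);
    println!("{} survivor(s)", ranked.len());
}
```

The point of the split is cost: the gate and surrogate spend zero LLM tokens, so only gate survivors ever reach an expensive model.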

2. Milestones

5A — Chip Design Space Exploration (Weeks 19-22)

Problem: RTL design space is enormous (instruction width, opcode encoding, DSP cluster size, pipeline depth).

Solution:

  1. Surrogate — GNN on RTL AST predicts area / power / delay / fmax without synthesis. Trained on ~10 K historical Vivado runs from pccx-FPGA-NPU-LLM-kv260. Target latency: < 10 ms / query.

  2. Evolutionary loop — population = RTL variants, fitness = surrogate prediction + Verilator pass + verible-lint pass + timing-sanity check.

  3. PRM gate — a deep cloud LLM proposes RTL → Verilator elaborates → verible lints → a static timing sanity check runs → survivors go to the surrogate.

  4. Formal diff — promoted variants must pass pccx-verification::GoldenDiffGate + Sail refinement check.

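Steps 1–3 can be sketched as a single generation loop. Everything below is a toy stand-in (the gate predicate, the score, and the mutation are placeholders for Verilator/verible, the GNN surrogate, and the LLM proposer):

```rust
// Toy evolutionary loop over RTL variants: mutate, gate, score, keep top-k.

fn passes_prm_gate(variant: &str) -> bool {
    // Stand-in for: Verilator elaborate + verible-lint + timing sanity.
    !variant.contains("latch")
}

fn surrogate_score(variant: &str) -> f64 {
    // Stand-in for the GNN area/power/delay predictor (< 10 ms/query).
    1000.0 / (variant.len() as f64 + 1.0)
}

fn mutate(variant: &str, gen: usize) -> String {
    // Stand-in for an LLM-proposed mutation.
    format!("{variant} // mut{gen}")
}

fn evolve(seed: &str, generations: usize, top_k: usize) -> Vec<String> {
    let mut pop = vec![seed.to_string()];
    for gen in 0..generations {
        let mut next: Vec<String> = pop
            .iter()
            .flat_map(|v| vec![v.clone(), mutate(v, gen)])
            .filter(|v| passes_prm_gate(v))
            .collect();
        next.sort_by(|a, b| {
            surrogate_score(b).partial_cmp(&surrogate_score(a)).unwrap()
        });
        next.truncate(top_k);
        pop = next;
    }
    pop
}

fn main() {
    let survivors = evolve("module npu; endmodule", 3, 4);
    // Survivors would now face GoldenDiffGate + Sail refinement (step 4).
    println!("{} survivors after 3 generations", survivors.len());
}
```

Note that the formal diff of step 4 runs only on promoted variants, outside this loop, because it is far slower than the surrogate.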
Deliverable: “design an NPU for Gemma-3N E4B decoding at 20 tok/s on KV260” in < 1 day wall-clock.

5B — Compiler Superoptimization (Weeks 22-24)

Problem: -O3 leaves performance on the table. Register allocation + instruction scheduling are NP-hard, so compilers use heuristics.

Solution:

  1. MCTS over LLVM pass orderings. Reward = measured runtime on the target (or cycle-accurate sim on TinyNPU).

  2. GNN + RL for register allocation and instruction scheduling. The policy network ingests the data-flow graph; actions are “assign physical register R to virtual register V”. Reward = -pipeline-stalls - register-pressure.

  3. Compiler explainer — after each run, Sonnet explains why the discovered pass order beats -O3 in terms the developer understands.

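The reward-driven selection in step 1 can be illustrated with a tiny enumerable space; a real system needs MCTS precisely because the space of pass orderings is far too large to enumerate. `measure_runtime` is a hypothetical stand-in for the real reward (measured runtime on the target, or the cycle-accurate TinyNPU sim), hard-coded here so the example is deterministic:

```rust
// Toy search over LLVM pass orderings.
const PASSES: [&str; 3] = ["inline", "licm", "gvn"];

fn measure_runtime(order: &[&str]) -> f64 {
    // Stand-in reward: pretend "inline -> licm -> gvn" is fastest.
    match order {
        ["inline", "licm", "gvn"] => 1.0,
        _ => 2.0,
    }
}

fn best_order() -> (Vec<&'static str>, f64) {
    // Exhaustive search over all 3! permutations; MCTS replaces this
    // brute force once the pass list grows beyond a handful.
    let mut best: Option<(Vec<&'static str>, f64)> = None;
    for a in 0..3 {
        for b in 0..3 {
            for c in 0..3 {
                if a == b || b == c || a == c {
                    continue;
                }
                let order = vec![PASSES[a], PASSES[b], PASSES[c]];
                let t = measure_runtime(&order);
                if best.as_ref().map_or(true, |(_, bt)| t < *bt) {
                    best = Some((order, t));
                }
            }
        }
    }
    best.unwrap()
}

fn main() {
    let (order, t) = best_order();
    println!("best order {:?} at {:.1} (relative runtime)", order, t);
}
```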
Deliverable: AI-compiled kernel beats hand-tuned expert kernel on ≥ 3 benchmarks (matmul, attention, layer-norm).

5C — OS / Kernel Formal Co-Design (Weeks 24-27)

Non-negotiable: stability > everything.

Hybrid architecture:

  1. LLM drafts kernel module / driver / scheduler.

  2. Feed to Lean 4 theorem prover (extract proof obligations automatically).

  3. Prove: no memory leaks, no deadlocks, mutex correctness, scheduler starvation-free.

  4. On failure: return counter-example trace to LLM → propose fix → re-prove. Iterate until mathematically correct.

Start narrow: reuse seL4-style libraries; don’t reinvent formal primitives.
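The draft → prove → repair loop of steps 1–4 can be sketched as follows. Both `prove` and `repair` are hypothetical stand-ins: the first for Lean 4 obligation extraction plus the prover, the second for the LLM fix conditioned on the counter-example trace:

```rust
/// Result of a proof attempt: proved, or a counter-example trace.
enum ProofResult {
    Proved,
    CounterExample(String),
}

fn prove(module: &str) -> ProofResult {
    // Stand-in: pretend an unguarded unlock is the only bug class.
    if module.contains("unlock_without_lock") {
        ProofResult::CounterExample("mutex released while not held".into())
    } else {
        ProofResult::Proved
    }
}

fn repair(module: &str, trace: &str) -> String {
    // Stand-in for the LLM fix; a real repairer conditions on the trace.
    let _ = trace;
    module.replace("unlock_without_lock", "guarded_unlock")
}

/// Iterate until the prover signs off or the repair budget runs out.
fn prove_or_repair(mut module: String, budget: usize) -> Option<String> {
    for _ in 0..budget {
        match prove(&module) {
            ProofResult::Proved => return Some(module),
            ProofResult::CounterExample(trace) => {
                module = repair(&module, &trace);
            }
        }
    }
    None
}

fn main() {
    let draft = "fn isr() { unlock_without_lock(); }".to_string();
    match prove_or_repair(draft, 5) {
        Some(m) => println!("proved: {m}"),
        None => println!("repair budget exhausted"),
    }
}
```

The budget matters: the loop only terminates when a proof lands, so an explicit iteration cap is what keeps a stubborn obligation from burning Opus tokens forever.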

Deliverable: pccx-NPU driver with signed Lean 4 correctness proof bundled.

5D — Model → ISA-API Compiler (Weeks 27-30) USER BIG BET

Input: a HuggingFace model (.safetensors + config.json + tokenizer) + pccx ISA spec.

Pipeline:

  1. Parse the model’s computation graph → tensor op sequence.

  2. Map each op to pccx ISA opcodes (Sail spec is the ground truth).

  3. A deep cloud LLM generates Rust/C driver code that issues those opcodes in order.

  4. Run pccx-lab simulator against PyTorch reference trace → bit-exact check (or pccx-verification::GoldenDiffGate).

  5. Emit the signed driver + a verification report.

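Step 2 is the heart of the pipeline. A minimal sketch of the lowering, where the op set and opcode names are purely illustrative — the Sail spec, not this table, is the ground truth:

```rust
// Toy op-to-opcode lowering for step 2.
#[derive(Debug, PartialEq)]
enum TensorOp {
    MatMul,
    Softmax,
    LayerNorm,
}

fn lower(op: &TensorOp) -> Result<Vec<&'static str>, String> {
    match op {
        TensorOp::MatMul => Ok(vec!["LOAD_A", "LOAD_B", "MAC_TILE", "STORE_C"]),
        TensorOp::Softmax => {
            Ok(vec!["LOAD_A", "EXP", "REDUCE_SUM", "DIV", "STORE_C"])
        }
        // Ops the ISA cannot express surface as compile-time errors
        // (see the risk register): fall back to the CPU reference path.
        TensorOp::LayerNorm => Err("LayerNorm not in pccx v002 ISA".into()),
    }
}

fn lower_graph(ops: &[TensorOp]) -> Result<Vec<&'static str>, String> {
    let mut program = Vec::new();
    for op in ops {
        program.extend(lower(op)?);
    }
    Ok(program)
}

fn main() {
    let graph = [TensorOp::MatMul, TensorOp::Softmax];
    match lower_graph(&graph) {
        Ok(prog) => println!("emitted {} opcodes", prog.len()),
        Err(e) => eprintln!("compile error: {e}"),
    }
}
```

Making unsupported ops a hard `Err` rather than a silent fallback is what lets step 4’s bit-exact check trust that every emitted opcode was deliberate.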
Deliverable: drop a model file in → get a uca_run_<model> function out, validated by the Sail oracle.

Depends on: Phase 4 M4.8-M4.10 Sail completion (for reliable refinement check).

5E — Generative Chip Design (Weeks 30+) USER ULTIMATE GOAL

Input: same model file + target silicon family (KV260 / ASIC-22nm).

Pipeline:

  1. Run 5D, inspect the resulting .pccx trace, identify bottleneck (compute-bound vs memory-bound).

  2. Feed bottleneck + model structure to 5A’s evolutionary loop.

  3. Candidates must pass:

    • Verilator + verible-lint (PRM gate),

    • Surrogate Pareto threshold (area / power / fmax),

    • Sail refinement (every ISA op behaves equivalently to the spec),

    • Formal property check (every pccx invariant holds — e.g. “no MAC overflow”).

  4. Synthesize top-K survivors in parallel via 5C-authorised Vivado runners.

  5. Pick Pareto front; user selects final.

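The compute-bound vs memory-bound call in step 1 is a roofline-style comparison: a kernel is memory-bound when its arithmetic intensity (FLOPs per byte moved) falls below the machine balance point (peak FLOP/s divided by peak bytes/s). A sketch, with illustrative numbers rather than measured KV260 figures:

```rust
// Toy roofline-style bottleneck classifier for step 1.
#[derive(Debug, PartialEq)]
enum Bottleneck {
    ComputeBound,
    MemoryBound,
}

fn classify(flops: f64, bytes: f64, peak_flops: f64, peak_bw: f64) -> Bottleneck {
    let intensity = flops / bytes; // FLOPs per byte for this kernel
    let balance = peak_flops / peak_bw; // machine balance point
    if intensity < balance {
        Bottleneck::MemoryBound
    } else {
        Bottleneck::ComputeBound
    }
}

fn main() {
    // Assumed figures for illustration: 1 TFLOP/s peak compute and
    // 10 GB/s DDR bandwidth give a balance point of 100 FLOPs/byte.
    let peak_flops = 1.0e12;
    let peak_bw = 1.0e10;
    // A decode-time GEMV touches every weight byte once: ~2 FLOPs/byte,
    // far below 100, so the evolve loop should be biased toward bandwidth.
    let b = classify(2.0e9, 1.0e9, peak_flops, peak_bw);
    println!("{b:?}");
}
```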
Deliverable: feed gemma-3n-e4b.safetensors + “KV260” → receive a tailor-made NPU RTL + bitstream + correctness proof in < 48 hours.

3. Risk register

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Surrogate accuracy poor on out-of-distribution designs | Medium | High | Keep the “truth” escape valve — any variant whose predicted metrics diverge > 20% from actual synth gets the surrogate retrained on it. |
| Lean 4 proof-obligation auto-extraction brittle | Medium | High | Start with known-provable kernel modules; expand only as tooling matures. |
| Model input shapes outside pccx v002 ISA capacity | Low | High | Surface as a compile-time error in 5D; fall back to the CPU reference path. |
| Sonnet RTL proposals generate linter-clean but timing-broken candidates | High | Medium | PRM gate does a static timing sanity check (critical-path estimate) before the surrogate. |
| 5E wall-clock target (48 h) unachievable on KV260 workstations | High | Medium | Offload synth to a cloud Vivado cluster; document the trade-off. |

4. Decision — internal first, open later

Phase 5.0 Gate: use the engine on pccx-lab’s own RTL and kernel for the first 3 months before exposing it publicly. Rationale:

  • Proves value on code we understand.

  • Surfaces infra bugs before customer exposure.

  • Generates training data for the surrogate.

  • Establishes a credible launch story (“we used it to build ourselves”).

Open to external users once:

  • Surrogate accuracy ≥ 90% on pccx-lab’s internal benchmarks.

  • Formal gate signs off ≥ 3 non-trivial kernel modules.

  • 5D succeeds on ≥ 3 third-party models (Gemma 3N, Llama-2, BERT).

Target public release: pccx-lab v0.5 (roughly Q1-2027 at current cadence).

5. Token budget

  • Surrogate queries: 0 LLM tokens (pure inference).

  • PRM gate: 0 LLM tokens (static analysis only).

  • LLM mutation proposals: Haiku (500-1 K tokens/mutation; thousands/day).

  • LLM final-round refinement: Sonnet (2-5 K tokens/candidate; tens/day).

  • LLM Lean 4 repair (5C): Sonnet/Opus (5-20 K tokens/iteration; hundreds/week).

  • LLM concept-to-RTL narration (5D/5E): Opus (10-50 K tokens/session; dozens/week).

Cache all narrations by (input hash, prompt template hash) — 60% hit rate target in steady state.
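The narration cache above keys on the pair (input hash, prompt template hash), so a change to either the trace or the template invalidates the entry. A minimal sketch — it uses std’s `DefaultHasher` for brevity, whereas a real cache would want a stable content hash (e.g. SHA-256) so keys survive process restarts:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

fn fingerprint(text: &str) -> u64 {
    let mut h = DefaultHasher::new();
    text.hash(&mut h);
    h.finish()
}

/// Narration cache keyed on (input hash, prompt template hash).
struct NarrationCache {
    entries: HashMap<(u64, u64), String>,
}

impl NarrationCache {
    fn new() -> Self {
        NarrationCache { entries: HashMap::new() }
    }

    /// Return a cached narration, or run `narrate` once and store the result.
    fn get_or_insert<F: FnOnce() -> String>(
        &mut self,
        input: &str,
        template: &str,
        narrate: F,
    ) -> &String {
        let key = (fingerprint(input), fingerprint(template));
        self.entries.entry(key).or_insert_with(narrate)
    }
}

fn main() {
    let mut cache = NarrationCache::new();
    let first = cache
        .get_or_insert("trace-a", "tmpl-1", || "narration".into())
        .clone();
    // Same key: the closure must NOT run again — that skipped call is
    // exactly the token saving the 60% hit-rate target is counting.
    let second = cache
        .get_or_insert("trace-a", "tmpl-1", || "other".into())
        .clone();
    assert_eq!(first, second);
    println!("cache holds {} entry(ies)", cache.entries.len());
}
```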

6. Dependencies on earlier phases

  • Phase 1 scaffold — done (pccx-evolve traits landed).

  • Phase 2 M2.6 (target-aware suggestions) — feeds FPGA presets into 5A.

  • Phase 3 M3.4 (sandboxed sessions) — runs Vivado in isolation.

  • Phase 4 M4.5 (what-if engine) — visualises 5A’s Pareto front.

  • Phase 4 M4.8-4.10 (Sail finale) — refinement oracle for 5D/5E.

Don’t start 5E before all of the above land.