Phase 5 — AlphaEvolve → OpenEvolve

Status: scaffold landed (pccx-evolve trait scaffolds + speculative); implementation kicks off in Phase 5 proper. Scope: roadmap Weeks 19-30 (±4 weeks uncertainty band); milestones 5A, 5B, 5C + user-requested 5D, 5E. Thesis: Fix the weaknesses of existing AlphaEvolve-style systems by combining LLM + Reinforcement Learning + Formal Methods + Surrogate Models.

1. Architecture overview

            ┌─────────────────────────────────────────────────┐
            │              pccx-evolve                         │
            │                                                  │
User spec ──▶│  ┌──────────┐   ┌──────────┐   ┌──────────┐  │── accepted
            │  │ LLM prop. │──▶│ PRM gate │──▶│ Surrogate│  │   candidate
            │  │ (Sonnet)  │   │ (fast)   │   │ (GNN)    │  │──▶
            │  └──────────┘   └──────────┘   └──────────┘  │
            │        ▲                                       │
            │        │   evolutionary loop (mutate+cross)   │
            │        └───────────────────────────────────────┤
            │                                                │
            │  Sail refinement check (pccx-verification)    │
            │  Formal property check (Lean 4)               │
            └────────────────────────────────────────────────┘

Five lanes:

| Lane | Input | Output | Audience |
| --- | --- | --- | --- |
| 5A Chip DSE | RTL + target | RTL variants (Pareto front) | HW engineer |
| 5B Compiler DSE | High-level code | LLVM pass order + RL’d alloc | SW engineer |
| 5C OS/Kernel formal | Kernel C | Proven kernel module | Systems engineer |
| 5D Model → API | HF model + spec | Target-specific driver code | AI researcher |
| 5E Model → RTL | HF model + spec | Custom NPU RTL + proof | Chip architect |

5D and 5E were requested by the user on 2026-04-24 and build on top of 5A and 5C.
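All five lanes share the propose → gate → score pipeline from the diagram above. A minimal Rust sketch of how the stages might compose — trait and type names here are hypothetical stand-ins, not the actual pccx-evolve scaffolds:

```rust
/// A candidate produced by the LLM proposer (e.g. an RTL variant).
/// Illustrative types only; the real pccx-evolve traits may differ.
#[derive(Clone, Debug)]
struct Candidate {
    source: String,
}

/// Fast static checks (Verilator elaborate, verible-lint, timing sanity).
trait PrmGate {
    fn admit(&self, c: &Candidate) -> bool;
}

/// Cheap learned cost model standing in for full synthesis.
trait Surrogate {
    /// Predicted (area, power, fmax) for a candidate.
    fn predict(&self, c: &Candidate) -> (f64, f64, f64);
}

/// One generation: filter through the gate, then rank by surrogate fitness.
fn rank_generation<G: PrmGate, S: Surrogate>(
    gate: &G,
    surrogate: &S,
    pop: Vec<Candidate>,
) -> Vec<(Candidate, f64)> {
    let mut scored: Vec<_> = pop
        .into_iter()
        .filter(|c| gate.admit(c))
        .map(|c| {
            let (area, power, fmax) = surrogate.predict(&c);
            // Toy scalar fitness: favour fmax, penalise area and power.
            let fitness = fmax - area - power;
            (c, fitness)
        })
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored
}

struct LintGate;
impl PrmGate for LintGate {
    fn admit(&self, c: &Candidate) -> bool {
        !c.source.is_empty() // stand-in for real lint/elaboration checks
    }
}

struct FlatSurrogate;
impl Surrogate for FlatSurrogate {
    fn predict(&self, c: &Candidate) -> (f64, f64, f64) {
        (c.source.len() as f64 * 0.1, 1.0, 100.0) // toy numbers
    }
}

fn main() {
    let pop = vec![
        Candidate { source: "module a; endmodule".into() },
        Candidate { source: String::new() }, // fails the gate
    ];
    let ranked = rank_generation(&LintGate, &FlatSurrogate, pop);
    println!("{} survivor(s)", ranked.len());
}
```

The point of the split is cost: the gate and surrogate spend zero LLM tokens, so only gate survivors ever reach an expensive model.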

2. Milestones

5A — Chip Design Space Exploration (Weeks 19-22)

Problem: RTL design space is enormous (instruction width, opcode encoding, DSP cluster size, pipeline depth).

Solution:

  1. Surrogate — GNN on RTL AST predicts area / power / delay / fmax without synthesis. Trained on ~10 K historical Vivado runs from pccx-FPGA-NPU-LLM-kv260. Target latency: < 10 ms / query.

  2. Evolutionary loop — population = RTL variants, fitness = surrogate prediction + Verilator pass + verible-lint pass + timing-sanity check.

  3. PRM gate — a deep cloud LLM proposes RTL → Verilator elaborates → verible lints → a static timing sanity check runs → survivors go to the surrogate.

  4. Formal diff — promoted variants must pass pccx-verification::GoldenDiffGate + Sail refinement check.

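Steps 1–3 can be sketched as a single generation loop. Everything below is a toy stand-in (the gate predicate, the score, and the mutation are placeholders for Verilator/verible, the GNN surrogate, and the LLM proposer):

```rust
// Toy evolutionary loop over RTL variants: mutate, gate, score, keep top-k.

fn passes_prm_gate(variant: &str) -> bool {
    // Stand-in for: Verilator elaborate + verible-lint + timing sanity.
    !variant.contains("latch")
}

fn surrogate_score(variant: &str) -> f64 {
    // Stand-in for the GNN area/power/delay predictor (< 10 ms/query).
    1000.0 / (variant.len() as f64 + 1.0)
}

fn mutate(variant: &str, gen: usize) -> String {
    // Stand-in for an LLM-proposed mutation.
    format!("{variant} // mut{gen}")
}

fn evolve(seed: &str, generations: usize, top_k: usize) -> Vec<String> {
    let mut pop = vec![seed.to_string()];
    for gen in 0..generations {
        let mut next: Vec<String> = pop
            .iter()
            .flat_map(|v| vec![v.clone(), mutate(v, gen)])
            .filter(|v| passes_prm_gate(v))
            .collect();
        next.sort_by(|a, b| {
            surrogate_score(b).partial_cmp(&surrogate_score(a)).unwrap()
        });
        next.truncate(top_k);
        pop = next;
    }
    pop
}

fn main() {
    let survivors = evolve("module npu; endmodule", 3, 4);
    // Survivors would now face GoldenDiffGate + Sail refinement (step 4).
    println!("{} survivors after 3 generations", survivors.len());
}
```

Note that the formal diff of step 4 runs only on promoted variants, outside this loop, because it is far slower than the surrogate.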
Deliverable: “design an NPU for Gemma-3N E4B decoding at 20 tok/s on KV260” in < 1 day wall-clock.

5B — Compiler Superoptimization (Weeks 22-24)

Problem: -O3 leaves performance on the table. Register allocation + instruction scheduling are NP-hard, so compilers use heuristics.

Solution:

  1. MCTS over LLVM pass orderings. Reward = measured runtime on the target (or cycle-accurate sim on TinyNPU).

  2. GNN + RL for register allocation and instruction scheduling. The policy network ingests the data-flow graph; actions are “assign physical register R to virtual register V”. Reward = -pipeline-stalls - register-pressure.

  3. Compiler explainer — after each run, Sonnet explains why the discovered pass order beats -O3 in terms the developer understands.

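The reward-driven selection in step 1 can be illustrated with a tiny enumerable space; a real system needs MCTS precisely because the space of pass orderings is far too large to enumerate. `measure_runtime` is a hypothetical stand-in for the real reward (measured runtime on the target, or the cycle-accurate TinyNPU sim), hard-coded here so the example is deterministic:

```rust
// Toy search over LLVM pass orderings.
const PASSES: [&str; 3] = ["inline", "licm", "gvn"];

fn measure_runtime(order: &[&str]) -> f64 {
    // Stand-in reward: pretend "inline -> licm -> gvn" is fastest.
    match order {
        ["inline", "licm", "gvn"] => 1.0,
        _ => 2.0,
    }
}

fn best_order() -> (Vec<&'static str>, f64) {
    // Exhaustive search over all 3! permutations; MCTS replaces this
    // brute force once the pass list grows beyond a handful.
    let mut best: Option<(Vec<&'static str>, f64)> = None;
    for a in 0..3 {
        for b in 0..3 {
            for c in 0..3 {
                if a == b || b == c || a == c {
                    continue;
                }
                let order = vec![PASSES[a], PASSES[b], PASSES[c]];
                let t = measure_runtime(&order);
                if best.as_ref().map_or(true, |(_, bt)| t < *bt) {
                    best = Some((order, t));
                }
            }
        }
    }
    best.unwrap()
}

fn main() {
    let (order, t) = best_order();
    println!("best order {:?} at {:.1} (relative runtime)", order, t);
}
```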
Deliverable: AI-compiled kernel beats hand-tuned expert kernel on ≥ 3 benchmarks (matmul, attention, layer-norm).

5C — OS / Kernel Formal Co-Design (Weeks 24-27)

Non-negotiable: stability > everything.

Hybrid architecture:

  1. LLM drafts kernel module / driver / scheduler.

  2. Feed to Lean 4 theorem prover (extract proof obligations automatically).

  3. Prove: no memory leaks, no deadlocks, mutex correctness, scheduler starvation-free.

  4. On failure: return counter-example trace to LLM → propose fix → re-prove. Iterate until mathematically correct.

Start narrow: reuse seL4-style libraries; don’t reinvent formal primitives.
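The draft → prove → repair loop of steps 1–4 can be sketched as follows. Both `prove` and `repair` are hypothetical stand-ins: the first for Lean 4 obligation extraction plus the prover, the second for the LLM fix conditioned on the counter-example trace:

```rust
/// Result of a proof attempt: proved, or a counter-example trace.
enum ProofResult {
    Proved,
    CounterExample(String),
}

fn prove(module: &str) -> ProofResult {
    // Stand-in: pretend an unguarded unlock is the only bug class.
    if module.contains("unlock_without_lock") {
        ProofResult::CounterExample("mutex released while not held".into())
    } else {
        ProofResult::Proved
    }
}

fn repair(module: &str, trace: &str) -> String {
    // Stand-in for the LLM fix; a real repairer conditions on the trace.
    let _ = trace;
    module.replace("unlock_without_lock", "guarded_unlock")
}

/// Iterate until the prover signs off or the repair budget runs out.
fn prove_or_repair(mut module: String, budget: usize) -> Option<String> {
    for _ in 0..budget {
        match prove(&module) {
            ProofResult::Proved => return Some(module),
            ProofResult::CounterExample(trace) => {
                module = repair(&module, &trace);
            }
        }
    }
    None
}

fn main() {
    let draft = "fn isr() { unlock_without_lock(); }".to_string();
    match prove_or_repair(draft, 5) {
        Some(m) => println!("proved: {m}"),
        None => println!("repair budget exhausted"),
    }
}
```

The budget matters: the loop only terminates when a proof lands, so an explicit iteration cap is what keeps a stubborn obligation from burning Opus tokens forever.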

Deliverable: pccx-NPU driver with signed Lean 4 correctness proof bundled.

5D — Model → ISA-API Compiler (Weeks 27-30) USER BIG BET

Input: a HuggingFace model (.safetensors + config.json + tokenizer) + pccx ISA spec.

Pipeline:

  1. Parse the model’s computation graph → tensor op sequence.

  2. Map each op to pccx ISA opcodes (Sail spec is the ground truth).

  3. A deep cloud LLM generates Rust/C driver code that issues those opcodes in order.

  4. Run pccx-lab simulator against PyTorch reference trace → bit-exact check (or pccx-verification::GoldenDiffGate).

  5. Emit the signed driver + a verification report.

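Step 2 is the heart of the pipeline. A minimal sketch of the lowering, where the op set and opcode names are purely illustrative — the Sail spec, not this table, is the ground truth:

```rust
// Toy op-to-opcode lowering for step 2.
#[derive(Debug, PartialEq)]
enum TensorOp {
    MatMul,
    Softmax,
    LayerNorm,
}

fn lower(op: &TensorOp) -> Result<Vec<&'static str>, String> {
    match op {
        TensorOp::MatMul => Ok(vec!["LOAD_A", "LOAD_B", "MAC_TILE", "STORE_C"]),
        TensorOp::Softmax => {
            Ok(vec!["LOAD_A", "EXP", "REDUCE_SUM", "DIV", "STORE_C"])
        }
        // Ops the ISA cannot express surface as compile-time errors
        // (see the risk register): fall back to the CPU reference path.
        TensorOp::LayerNorm => Err("LayerNorm not in pccx v002 ISA".into()),
    }
}

fn lower_graph(ops: &[TensorOp]) -> Result<Vec<&'static str>, String> {
    let mut program = Vec::new();
    for op in ops {
        program.extend(lower(op)?);
    }
    Ok(program)
}

fn main() {
    let graph = [TensorOp::MatMul, TensorOp::Softmax];
    match lower_graph(&graph) {
        Ok(prog) => println!("emitted {} opcodes", prog.len()),
        Err(e) => eprintln!("compile error: {e}"),
    }
}
```

Making unsupported ops a hard `Err` rather than a silent fallback is what lets step 4’s bit-exact check trust that every emitted opcode was deliberate.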
Deliverable: drop a model file in → get a uca_run_<model> function out, validated by the Sail oracle.

Depends on: Phase 4 M4.8-M4.10 Sail completion (for reliable refinement check).

5E — Generative Chip Design (Weeks 30+) USER ULTIMATE GOAL

Input: same model file + target silicon family (KV260 / ASIC-22nm).

Pipeline:

  1. Run 5D, inspect the resulting .pccx trace, identify bottleneck (compute-bound vs memory-bound).

  2. Feed bottleneck + model structure to 5A’s evolutionary loop.

  3. Candidates must pass:

    • Verilator + verible-lint (PRM gate),

    • Surrogate Pareto threshold (area / power / fmax),

    • Sail refinement (every ISA op behaves equivalently to the spec),

    • Formal property check (every pccx invariant holds — e.g. “no MAC overflow”).

  4. Synthesize top-K survivors in parallel via 5C-authorised Vivado runners.

  5. Pick Pareto front; user selects final.

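The compute-bound vs memory-bound call in step 1 is a roofline-style comparison: a kernel is memory-bound when its arithmetic intensity (FLOPs per byte moved) falls below the machine balance point (peak FLOP/s divided by peak bytes/s). A sketch, with illustrative numbers rather than measured KV260 figures:

```rust
// Toy roofline-style bottleneck classifier for step 1.
#[derive(Debug, PartialEq)]
enum Bottleneck {
    ComputeBound,
    MemoryBound,
}

fn classify(flops: f64, bytes: f64, peak_flops: f64, peak_bw: f64) -> Bottleneck {
    let intensity = flops / bytes; // FLOPs per byte for this kernel
    let balance = peak_flops / peak_bw; // machine balance point
    if intensity < balance {
        Bottleneck::MemoryBound
    } else {
        Bottleneck::ComputeBound
    }
}

fn main() {
    // Assumed figures for illustration: 1 TFLOP/s peak compute and
    // 10 GB/s DDR bandwidth give a balance point of 100 FLOPs/byte.
    let peak_flops = 1.0e12;
    let peak_bw = 1.0e10;
    // A decode-time GEMV touches every weight byte once: ~2 FLOPs/byte,
    // far below 100, so the evolve loop should be biased toward bandwidth.
    let b = classify(2.0e9, 1.0e9, peak_flops, peak_bw);
    println!("{b:?}");
}
```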
Deliverable: feed gemma-3n-e4b.safetensors + “KV260” → receive a tailor-made NPU RTL + bitstream + correctness proof in < 48 hours.

3. Risk register

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Surrogate accuracy poor on out-of-distribution designs | Medium | High | Keep the “truth” escape valve — any variant whose predicted metrics diverge > 20% from actual synth gets the surrogate retrained on it. |
| Lean 4 proof-obligation auto-extraction brittle | Medium | High | Start with known-provable kernel modules; expand only as tooling matures. |
| Model input shapes outside pccx v002 ISA capacity | Low | High | Surface as a compile-time error in 5D; fall back to the CPU reference path. |
| Sonnet RTL proposals generate linter-clean but timing-broken candidates | High | Medium | PRM gate does a static timing sanity check (critical-path estimate) before the surrogate. |
| 5E wall-clock target (48 h) unachievable on KV260 workstations | High | Medium | Offload synth to a cloud Vivado cluster; document the trade-off. |

4. Decision — internal first, open later

Phase 5.0 Gate: use the engine on pccx-lab’s own RTL and kernel for the first 3 months before exposing it publicly. Rationale:

  • Proves value on code we understand.

  • Surfaces infra bugs before customer exposure.

  • Generates training data for the surrogate.

  • Establishes a credible launch story (“we used it to build ourselves”).

Open to external users once:

  • Surrogate accuracy ≥ 90% on pccx-lab’s internal benchmarks.

  • Formal gate signs off ≥ 3 non-trivial kernel modules.

  • 5D succeeds on ≥ 3 third-party models (Gemma 3N, Llama-2, BERT).

Target public release: pccx-lab v0.5 (roughly Q1-2027 at current cadence).

5. Token budget

  • Surrogate queries: 0 LLM tokens (pure inference).

  • PRM gate: 0 LLM tokens (static analysis only).

  • LLM mutation proposals: Haiku (500-1 K tokens/mutation; thousands/day).

  • LLM final-round refinement: Sonnet (2-5 K tokens/candidate; tens/day).

  • LLM Lean 4 repair (5C): Sonnet/Opus (5-20 K tokens/iteration; hundreds/week).

  • LLM concept-to-RTL narration (5D/5E): Opus (10-50 K tokens/session; dozens/week).

Cache all narrations by (input hash, prompt template hash) — 60% hit rate target in steady state.
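The narration cache above keys on the pair (input hash, prompt template hash), so a change to either the trace or the template invalidates the entry. A minimal sketch — it uses std’s `DefaultHasher` for brevity, whereas a real cache would want a stable content hash (e.g. SHA-256) so keys survive process restarts:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

fn fingerprint(text: &str) -> u64 {
    let mut h = DefaultHasher::new();
    text.hash(&mut h);
    h.finish()
}

/// Narration cache keyed on (input hash, prompt template hash).
struct NarrationCache {
    entries: HashMap<(u64, u64), String>,
}

impl NarrationCache {
    fn new() -> Self {
        NarrationCache { entries: HashMap::new() }
    }

    /// Return a cached narration, or run `narrate` once and store the result.
    fn get_or_insert<F: FnOnce() -> String>(
        &mut self,
        input: &str,
        template: &str,
        narrate: F,
    ) -> &String {
        let key = (fingerprint(input), fingerprint(template));
        self.entries.entry(key).or_insert_with(narrate)
    }
}

fn main() {
    let mut cache = NarrationCache::new();
    let first = cache
        .get_or_insert("trace-a", "tmpl-1", || "narration".into())
        .clone();
    // Same key: the closure must NOT run again — that skipped call is
    // exactly the token saving the 60% hit-rate target is counting.
    let second = cache
        .get_or_insert("trace-a", "tmpl-1", || "other".into())
        .clone();
    assert_eq!(first, second);
    println!("cache holds {} entry(ies)", cache.entries.len());
}
```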

6. Dependencies on earlier phases

  • Phase 1 scaffold — done (pccx-evolve traits landed).

  • Phase 2 M2.6 (target-aware suggestions) — feeds FPGA presets into 5A.

  • Phase 3 M3.4 (sandboxed sessions) — runs Vivado in isolation.

  • Phase 4 M4.5 (what-if engine) — visualises 5A’s Pareto front.

  • Phase 4 M4.8-4.10 (Sail finale) — refinement oracle for 5D/5E.

Don’t start 5E before all of the above land.