# Phase 5: AlphaEvolve → OpenEvolve

**Status:** scaffold landed (`pccx-evolve` trait scaffolds + speculative); implementation kicks off in Phase 5 proper.

**Scope:** roadmap Weeks 19-30 (±4 weeks uncertainty band); milestones 5A, 5B, 5C + user-requested 5D, 5E.

**Thesis:** Fix the weaknesses of existing AlphaEvolve-style systems by combining **LLM + Reinforcement Learning + Formal Methods + Surrogate Models**.

## 1. Architecture overview

```
              ┌─────────────────────────────────────────────────┐
              │                   pccx-evolve                   │
              │                                                 │
 User spec ──▶│  ┌──────────┐   ┌──────────┐   ┌──────────┐     │── accepted
              │  │ LLM prop.│──▶│ PRM gate │──▶│ Surrogate│     │   candidate
              │  │ (Sonnet) │   │  (fast)  │   │  (GNN)   │     │──▶
              │  └──────────┘   └──────────┘   └──────────┘     │
              │   ▲                                    │        │
              │   │ evolutionary loop (mutate+cross)   │        │
              │   └────────────────────────────────────┘        │
              │                                                 │
              │  Sail refinement check (pccx-verification)      │
              │  Formal property check (Lean 4)                 │
              └─────────────────────────────────────────────────┘
```

Five lanes:

| Lane | Input | Output | Audience |
|---|---|---|---|
| **5A** Chip DSE | RTL + target | RTL variants (Pareto front) | HW engineer |
| **5B** Compiler DSE | High-level code | LLVM pass order + RL'd alloc | SW engineer |
| **5C** OS/Kernel formal | Kernel C | Proven kernel module | Systems engineer |
| **5D** Model → API | HF model + spec | Target-specific driver code | AI researcher |
| **5E** Model → RTL | HF model + spec | Custom NPU RTL + proof | Chip architect |

5D + 5E were user-requested on 2026-04-24 and build on top of 5A + 5C.

## 2. Milestones

### 5A: Chip Design Space Exploration (Weeks 19-22)

**Problem:** RTL design space is enormous (instruction width, opcode encoding, DSP cluster size, pipeline depth).

**Solution:**

1. **Surrogate:** GNN on RTL AST predicts area / power / delay / fmax without synthesis. Trained on ~10 K historical Vivado runs from `pccx-FPGA-NPU-LLM-kv260`. Target latency: < 10 ms / query.
2. **Evolutionary loop:** population = RTL variants, fitness = surrogate prediction + Verilator pass + verible-lint pass + timing-sanity check.
3. **PRM gate:** deep cloud LLM proposes RTL → Verilator elaborates → verible lints → timing-check sanity-tests → survivors go to the surrogate.
4. **Formal diff:** promoted variants must pass `pccx-verification::GoldenDiffGate` + Sail refinement check.

Deliverable: "design an NPU for Gemma-3N E4B decoding at 20 tok/s on KV260" in < 1 day wall-clock.

### 5B: Compiler Superoptimization (Weeks 22-24)

**Problem:** `-O3` leaves performance on the table. Register allocation + instruction scheduling are NP-hard, so compilers use heuristics.

**Solution:**

1. **MCTS** over LLVM pass orderings. Reward = measured runtime on the target (or cycle-accurate sim on TinyNPU).
2. **GNN + RL** for register allocation and instruction scheduling. Policy network ingests the data-flow graph; actions are "assign register R to virtual V". Reward = -(pipeline stalls) - (register pressure).
3. **Compiler explainer:** post-run, Sonnet narrates *why* the found pass order beats `-O3` in terms the developer understands.

Deliverable: AI-compiled kernel beats hand-tuned expert kernel on ≥ 3 benchmarks (matmul, attention, layer-norm).

### 5C: OS / Kernel Formal Co-Design (Weeks 24-27)

**Non-negotiable: stability > everything.**

Hybrid architecture:

1. LLM drafts kernel module / driver / scheduler.
2. Feed to Lean 4 theorem prover (extract proof obligations automatically).
3. Prove: no memory leaks, no deadlocks, mutex correctness, scheduler starvation-free.
4. On failure: return counter-example trace to LLM → propose fix → re-prove. Iterate until mathematically correct.

Start narrow: reuse seL4-style libraries; don't reinvent formal primitives.

Deliverable: pccx-NPU driver with signed Lean 4 correctness proof bundled.

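
The draft, prove, repair cycle in 5C can be sketched as a bounded loop. This is a minimal illustration, not the `pccx-evolve` implementation: `ProofResult`, `repair_loop`, and the three stub callables are hypothetical names, the stubs stand in for the LLM and the Lean 4 checker, and the `max_iters` bound is an added assumption (the plan itself says only "iterate until mathematically correct").

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class ProofResult:
    ok: bool
    counterexample: Optional[str] = None  # trace handed back to the LLM on failure


def repair_loop(
    draft: Callable[[], str],
    prove: Callable[[str], ProofResult],
    repair: Callable[[str, Optional[str]], str],
    max_iters: int = 5,  # assumption: the plan sets no explicit bound
) -> Tuple[Optional[str], int]:
    """Return (proven_candidate, prover_calls), or (None, max_iters) if the budget runs out."""
    candidate = draft()
    for attempt in range(1, max_iters + 1):
        result = prove(candidate)
        if result.ok:
            return candidate, attempt
        # Feed the counter-example back and try again with the patched draft.
        candidate = repair(candidate, result.counterexample)
    return None, max_iters


# Hypothetical stand-ins for the LLM and the prover, for demonstration only.
def draft() -> str:
    return "lock(); work();"


def prove(code: str) -> ProofResult:
    if "unlock();" in code:
        return ProofResult(ok=True)
    return ProofResult(ok=False, counterexample="lock acquired but never released")


def repair(code: str, trace: Optional[str]) -> str:
    return code + " unlock();"


proven, attempts = repair_loop(draft, prove, repair)
print(proven, attempts)
```

Returning `None` when the budget is exhausted keeps an unproven module from being promoted by accident; the caller has to handle the failure explicitly.
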
### 5D: Model → ISA-API Compiler (Weeks 27-30)

**USER BIG BET**

**Input:** a HuggingFace model (`.safetensors` + `config.json` + tokenizer) + pccx ISA spec.

**Pipeline:**

1. Parse the model's computation graph → tensor op sequence.
2. Map each op to pccx ISA opcodes (Sail spec is the ground truth).
3. Deep cloud LLM generates Rust/C driver code that issues those opcodes in order.
4. Run pccx-lab simulator against PyTorch reference trace → bit-exact check (or `pccx-verification::GoldenDiffGate`).
5. Emit the signed driver + a verification report.

**Deliverable:** drop a model file in → get a `uca_run_` function out, validated by the Sail oracle.

Depends on: Phase 4 `M4.8-M4.10` Sail completion (for reliable refinement check).

### 5E: Generative Chip Design (Weeks 30+)

**USER ULTIMATE GOAL**

**Input:** same model file + target silicon family (KV260 / ASIC-22nm).

**Pipeline:**

1. Run 5D, inspect the resulting `.pccx` trace, identify bottleneck (compute-bound vs memory-bound).
2. Feed bottleneck + model structure to 5A's evolutionary loop.
3. Candidates must pass:
   - Verilator + verible-lint (PRM gate),
   - Surrogate Pareto threshold (area / power / fmax),
   - Sail refinement (every ISA op behaves equivalently to the spec),
   - Formal property check (every pccx invariant holds, e.g. "no MAC overflow").
4. Synthesize top-K survivors in parallel via 5C-authorised Vivado runners.
5. Pick Pareto front; user selects final.

**Deliverable:** feed `gemma-3n-e4b.safetensors` + "KV260" → receive a tailor-made NPU RTL + bitstream + correctness proof in < 48 hours.

## 3. Risk register

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Surrogate accuracy poor on out-of-distribution designs | Medium | High | Keep the "truth" escape valve: any variant whose predicted metrics diverge > 20% from actual synth gets the surrogate retrained on it. |
| Lean 4 proof obligations auto-extraction brittle | Medium | High | Start with known-provable kernel modules; expand only as tooling matures. |
| Model input shapes outside pccx v002 ISA capacity | Low | High | Surface as a compile-time error in 5D; fall back to CPU reference path. |
| Sonnet RTL proposals generate linter-clean but timing-broken candidates | High | Medium | PRM gate does static timing sanity check (critical-path estimate) before the surrogate. |
| 5E wall-clock target (48 h) unachievable on KV260 workstations | High | Medium | Offload synth to a cloud Vivado cluster; document the trade-off. |

## 4. Decision: internal first, open later

**Phase 5.0 Gate:** use the engine on pccx-lab's **own** RTL and kernel for the first 3 months before exposing publicly.

Rationale:

- Proves value on code we understand.
- Surfaces infra bugs before customer exposure.
- Generates training data for the surrogate.
- Establishes a credible launch story ("we used it to build ourselves").

Open to external users once:

- Surrogate accuracy ≥ 90% on pccx-lab's internal benchmarks.
- Formal gate signs off ≥ 3 non-trivial kernel modules.
- 5D succeeds on ≥ 3 third-party models (Gemma 3N, Llama-2, BERT).

Target public release: **pccx-lab v0.5** (roughly Q1-2027 at current cadence).

## 5. Token budget

- Surrogate queries: 0 LLM tokens (pure inference).
- PRM gate: 0 LLM tokens (static analysis only).
- LLM mutation proposals: Haiku (500-1 K tokens/mutation; thousands/day).
- LLM final-round refinement: Sonnet (2-5 K tokens/candidate; tens/day).
- LLM Lean 4 repair (5C): Sonnet/Opus (5-20 K tokens/iteration; hundreds/week).
- LLM concept-to-RTL narration (5D/5E): Opus (10-50 K tokens/session; dozens/week).

Cache all narrations by `(input hash, prompt template hash)`; the steady-state target is a 60% hit rate.

## 6. Dependencies on earlier phases

- Phase 1 scaffold: done (pccx-evolve traits landed).
- Phase 2 M2.6 (target-aware suggestions): feeds FPGA presets into 5A.
- Phase 3 M3.4 (sandboxed sessions): runs Vivado in isolation.
- Phase 4 M4.5 (what-if engine): visualises 5A's Pareto front.
- Phase 4 M4.8-4.10 (Sail finale): refinement oracle for 5D/5E.

Don't start 5E before all of the above land.
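
The 5E start condition can be expressed as a simple check over the prerequisites listed in this section. A minimal sketch, assuming a hypothetical `PREREQS` status table: only the Phase 1 entry reflects the list ("done"); the other values are placeholders, not reported status.

```python
# Hypothetical status table; keys mirror the dependency list in this section.
# Only "Phase 1 scaffold" is taken from the document; the rest are placeholders.
PREREQS = {
    "Phase 1 scaffold": True,
    "Phase 2 M2.6": False,
    "Phase 3 M3.4": False,
    "Phase 4 M4.5": False,
    "Phase 4 M4.8-4.10": False,
}


def missing_prereqs(status):
    """Return the prerequisites that have not landed, in listed order."""
    return [name for name, landed in status.items() if not landed]


def can_start_5e(status):
    """5E may start only when every prerequisite has landed."""
    return not missing_prereqs(status)


blockers = missing_prereqs(PREREQS)
if blockers:
    print("5E blocked by:", ", ".join(blockers))
else:
    print("5E clear to start")
```

Listing the blockers, not just returning a boolean, makes it obvious which milestone is holding 5E back.
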