Design Rationale: v001 → v002¶

pccx v001 reached late-stage RTL implementation before being archived to docs/archive/experimental_v001/ rather than taken through to tape-out. This page documents the architectural weaknesses that drove that decision and how v002 addresses each of them.

1. Core Flaws in v001 & v002’s Response¶

The table below visualizes how each architectural weakness in v001 directly drove a specific design decision in v002.

        flowchart LR
  subgraph v001 ["v001 Flaws"]
    F1[Ambiguous core roles]
    F2[Too many buses]
    F3[L2 / Global Cache overlap]
    F4[Inefficient HP port layout]
    F5[Under-utilized systolic array]
  end

  subgraph v002 ["v002 Responses"]
    R1[Three-core organization]
    R2[Bus simplification]
    R3[Centralized L2 Cache]
    R4[Distributed HP ports]
    R5[Dual-channel bit packing]
  end

  F1 -->|Fuzzy boundaries| R1
  F2 -->|Routing congestion| R2
  F3 -->|Duplicated data| R3
  F4 -->|Bottlenecked weight supply| R4
  F5 -->|1 MAC per DSP| R5

  style v001 fill:#f5edd5,stroke:#dbe1ea,stroke-width:2px,color:#000
  style v002 fill:#dae7f4,stroke:#dbe1ea,stroke-width:2px,color:#000

Three-core organization: GEMV, GEMM, and SFU are cleanly separated. Each core is wired to the L2 cache and weight buffer that suits its access pattern.
Bus simplification: Everything collapses onto two orthogonal axes (WEIGHT BUS and ACTIVATION BUS) to avoid routing contention.
Centralized L2: Global Cache responsibilities are folded into L2, which is placed in the center of the floorplan.
Distributed HP ports: HP2 and HP3 are assigned to independent slices, eliminating the weight-supply bottleneck.
Dual-channel bit packing: 1 DSP = 2 MACs, yielding 2,048 MACs per clock cycle across the systolic array.

3. Speedup Analysis — 3.125×¶

The theoretical throughput gain over v001 comes from three independent levers, multiplied together.

Lever	Factor	Justification
Higher internal clock	× 400 / 250 = 1.6	External AXI 250 MHz decoupled from internal core 400 MHz.
Dual HP ports	(already consumed at 400 MHz)	2 of 4 HP ports (HP2 / HP3) are independently assigned to the upper and lower slices, doubling weight-supply bandwidth.
Bit packing	× 2	1 DSP now executes 2 MACs simultaneously.

Multiplying the three levers gives 1.6 × 2 × (bottleneck removed) ≈ 3.125× effective throughput.

3.1 Load-Side Derivation¶

v001: 250 MHz × 1 HP × 1 MAC/DSP = 250 units of throughput. v002: HP2 and HP3 stream weights at 250 MHz into a shared buffer that the internal 400 MHz domain drains at 2 MACs per DSP, yielding 800 units of internal consumption rate.

\[\frac{800}{250} \;=\; \mathbf{3.125\,\times}\]

The external port rate is unchanged. The improvement is structural: weights are buffered externally at 250 MHz, drained at the higher internal 400 MHz clock, and each DSP executes two MACs per cycle. The effective throughput seen by the systolic array is 3.125× higher.

3.2 Per-Cycle Internal Throughput¶

        flowchart LR
  subgraph ext[External 250 MHz Domain]
    HP2[AXI HP2] --> BUF[Weight Buffer<br/>CDC FIFO]
    HP3[AXI HP3] --> BUF
  end
  subgraph core[Internal 400 MHz Domain]
    BUF -->|broadcast| SA[Systolic Array<br/>32×32 · 1 DSP = 2 MAC<br/>cascade break @ row 16]
    SA --> ACC[Result Accumulator<br/>819 GMAC/s peak]
  end

The single 32 × 32 grid holds 1,024 PEs × 2 MAC = 2,048 MAC/clk. Running at 400 MHz, this yields a 819 GMAC/s theoretical peak.

4. New Trade-offs¶

v002 accepts the following constraints in exchange for the throughput gain.

Constraint	Description
Weight precision ceiling	Beyond W4, guard bits are exhausted and the maximum representable accumulated value (`N_max`) drops sharply. W5/W6 support would require a separate mode.
K-split required	Layers with K > 4,096 must be tiled by the driver or compiler.
Sign-recovery post-processing	Each PE adds a 1-bit adder and 23-bit split logic. No throughput impact, but additional area cost.
CDC complexity	Asynchronous 250 MHz ↔ 400 MHz FIFOs need careful design and verification.

5. Summary vs. Archived v001¶

Aspect	v001 (Archived)	v002
Design bias	GEMM-centric (prefill-optimized)	Three-core layout: GEMM · GEMV · SFU
L2 cache placement	Peripheral	Central, symmetric interconnect on both sides
Global Cache	Separate block	Absorbed into L2
Quantization	W4A16 (BF16 activations)	W4A8 (INT8 activations)
HP port	One per SA	HP2 / HP3 distributed (upper / lower slices)
DSP utilization	1 DSP = 1 MAC	1 DSP = 2 MAC
Peak throughput (400 MHz)	~320 GMAC/s	819 GMAC/s (~2.56× measured improvement expected)