Design Rationale: v001 → v002
pccx v001 reached late-stage RTL implementation before being archived
to docs/archive/experimental_v001/ rather than taken through to
tape-out. This page documents the architectural weaknesses that drove
that decision and how v002 addresses each of them.
1. Core Flaws in v001 & v002’s Response
The diagram below shows how each architectural weakness in v001 directly drove a specific design decision in v002.
flowchart LR
subgraph v001 ["v001 Flaws"]
F1[Ambiguous core roles]
F2[Too many buses]
F3[L2 / Global Cache overlap]
F4[Inefficient HP port layout]
F5[Under-utilized systolic array]
end
subgraph v002 ["v002 Responses"]
R1[Three-core organization]
R2[Bus simplification]
R3[Centralized L2 Cache]
R4[Distributed HP ports]
R5[Dual-channel bit packing]
end
F1 -->|Fuzzy boundaries| R1
F2 -->|Routing congestion| R2
F3 -->|Duplicated data| R3
F4 -->|Bottlenecked weight supply| R4
F5 -->|1 MAC per DSP| R5
style v001 fill:#f5edd5,stroke:#dbe1ea,stroke-width:2px,color:#000
style v002 fill:#dae7f4,stroke:#dbe1ea,stroke-width:2px,color:#000
Three-core organization: GEMV, GEMM, and SFU are cleanly separated. Each core is wired to the L2 cache and weight buffer that suits its access pattern.
Bus simplification: Everything collapses onto two orthogonal axes (WEIGHT BUS and ACTIVATION BUS) to avoid routing contention.
Centralized L2: Global Cache responsibilities are folded into L2, which is placed in the center of the floorplan.
Distributed HP ports: HP2 and HP3 are assigned to independent slices, eliminating the weight-supply bottleneck.
Dual-channel bit packing: 1 DSP = 2 MACs, yielding 2,048 MACs per clock cycle across the systolic array.
3. Speedup Analysis: 3.125×
The theoretical throughput gain over v001 comes from three independent levers, multiplied together.
| Lever | Factor | Justification |
|---|---|---|
| Higher internal clock | × 400 / 250 = 1.6 | External AXI 250 MHz decoupled from internal core 400 MHz. |
| Dual HP ports | (already consumed at 400 MHz) | 2 of 4 HP ports (HP2 / HP3) are independently assigned to the upper and lower slices, doubling weight-supply bandwidth. |
| Bit packing | × 2 | 1 DSP now executes 2 MACs simultaneously. |
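Only the clock and bit-packing levers contribute numeric factors, 1.6 × 2 = 3.2. The dual HP ports add no separate multiplier: they remove the weight-supply bottleneck so that the faster core can actually be fed. The resulting effective throughput gain is quoted as ≈ 3.125×.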
3.1 Load-Side Derivation
v001: 250 MHz × 1 HP × 1 MAC/DSP = 250 units of throughput.
v002: HP2 and HP3 stream weights at 250 MHz into a shared buffer that the internal 400 MHz domain drains at 2 MACs per DSP, so 400 MHz × 2 MAC/DSP = 800 units of internal consumption rate.
The external port rate is unchanged. The improvement is structural: weights are buffered externally at 250 MHz, drained at the higher internal 400 MHz clock, and each DSP executes two MACs per cycle. The effective throughput seen by the systolic array is 3.125× higher.
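The load-side arithmetic above can be reproduced in a few lines. This is a minimal sketch: the function name `throughput_units` is hypothetical, and the "units" are the unit-less products used in the text (clock in MHz × MACs per DSP), not a measured rate.

```python
def throughput_units(clock_mhz, macs_per_dsp):
    """Unit-less consumption rate: clock (MHz) times MACs issued per DSP per cycle."""
    return clock_mhz * macs_per_dsp

# Figures taken from the derivation above.
v001_units = throughput_units(250, 1)   # single HP port, 1 MAC per DSP
v002_units = throughput_units(400, 2)   # internal 400 MHz domain, 2 MACs per DSP
ratio = v002_units / v001_units

print(v001_units, v002_units, ratio)
```

With the figures given, the ratio evaluates to 3.2, marginally above the 3.125× headline. The text does not spell out where the difference comes from.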
3.2 Per-Cycle Internal Throughput
flowchart LR
subgraph ext[External 250 MHz Domain]
HP2[AXI HP2] --> BUF[Weight Buffer<br/>CDC FIFO]
HP3[AXI HP3] --> BUF
end
subgraph core[Internal 400 MHz Domain]
BUF -->|broadcast| SA[Systolic Array<br/>32×32 · 1 DSP = 2 MAC<br/>cascade break @ row 16]
SA --> ACC[Result Accumulator<br/>819 GMAC/s peak]
end
The single 32 × 32 grid holds 1,024 PEs × 2 MAC = 2,048 MAC/clk. Running at 400 MHz, this yields an 819 GMAC/s theoretical peak (2,048 × 400 × 10⁶ ≈ 819.2 × 10⁹ MAC/s).
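The peak figure follows directly from the array dimensions and the core clock. The sketch below recomputes it. The helper name `peak_gmacs` is hypothetical, and the calculation assumes every PE issues both MACs on every cycle, which is what "theoretical peak" means here.

```python
def peak_gmacs(rows, cols, macs_per_pe, clock_hz):
    """Return (MACs per clock, peak GMAC/s) for a fully utilized array."""
    macs_per_clk = rows * cols * macs_per_pe
    return macs_per_clk, macs_per_clk * clock_hz / 1e9

# 32 x 32 PEs, 2 MACs per PE (1 DSP = 2 MAC), 400 MHz internal clock.
macs_per_clk, gmacs = peak_gmacs(32, 32, 2, 400e6)
print(macs_per_clk, gmacs)  # 2048 819.2
```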
4. New Trade-offs
v002 accepts the following constraints in exchange for the throughput gain.
| Constraint | Description |
|---|---|
| Weight precision ceiling | Beyond W4, guard bits are exhausted and the maximum representable accumulated value ( |
| K-split required | Layers with K > 4,096 must be tiled by the driver or compiler. |
| Sign-recovery post-processing | Each PE adds a 1-bit adder and 23-bit split logic. No throughput impact, but additional area cost. |
| CDC complexity | Asynchronous 250 MHz ↔ 400 MHz FIFOs need careful design and verification. |
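The sign-recovery and K-split rows can be illustrated with a small behavioural model. This is a sketch under stated assumptions, not the v002 RTL: the 23-bit lower lane is inferred from the "23-bit split logic" wording, the operand layout (`w_hi << 23` plus `w_lo`) is a common packing scheme assumed here for illustration, and the names `pack`, `split`, and `packed_dot` are hypothetical. The linked bit-packing page remains the authority on the actual datapath.

```python
LANE_BITS = 23   # assumption: lower-lane width, inferred from "23-bit split logic"
K_MAX = 4096     # per-pass K limit, from the K-split row above

def pack(w_hi, w_lo):
    """Place two signed 4-bit weights in one multiplier operand (assumed layout)."""
    return (w_hi << LANE_BITS) + w_lo

def split(acc):
    """Split a packed accumulator into (hi_sum, lo_sum) with sign recovery.

    Valid only while the lower-lane sum fits in LANE_BITS signed bits.
    """
    lo = acc & ((1 << LANE_BITS) - 1)
    borrow = lo >> (LANE_BITS - 1)     # 1 when the lower lane is negative
    lo -= borrow << LANE_BITS          # sign-extend the lower lane
    hi = (acc >> LANE_BITS) + borrow   # 1-bit correction on the upper lane
    return hi, lo

def packed_dot(w_hi, w_lo, acts):
    """Two dot products over one activation stream, tiled so each pass has K <= K_MAX."""
    hi_total = lo_total = 0
    for start in range(0, len(acts), K_MAX):
        end = start + K_MAX
        acc = 0
        for wh, wl, a in zip(w_hi[start:end], w_lo[start:end], acts[start:end]):
            acc += pack(wh, wl) * a    # one multiply yields both products
        hi, lo = split(acc)            # recover both lanes once per tile
        hi_total += hi
        lo_total += lo
    return hi_total, lo_total

# Single-product example: weights 3 and -5 share the activation 7.
print(split(pack(3, -5) * 7))  # (21, -35)
```

The `+ borrow` step plays the role of the 1-bit adder: a negative lower lane borrows one from the upper lane during accumulation, and the correction gives it back. Tiling in `packed_dot` keeps each pass within the K limit so that the lower lane does not spill into the upper one for typical data.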
5. Summary vs. Archived v001
| Aspect | v001 (Archived) | v002 |
|---|---|---|
| Design bias | GEMM-centric (prefill-optimized) | Three-core layout: GEMM · GEMV · SFU |
| L2 cache placement | Peripheral | Central, symmetric interconnect on both sides |
| Global Cache | Separate block | Absorbed into L2 |
| Quantization | W4A16 (BF16 activations) | W4A8 (INT8 activations) |
| HP port | One per SA | HP2 / HP3 distributed (upper / lower slices) |
| DSP utilization | 1 DSP = 1 MAC | 1 DSP = 2 MAC |
| Peak throughput (400 MHz) | ~320 GMAC/s | 819 GMAC/s (~2.56× improvement expected) |
See also
v001 details: Archive: v001 Experimental Architecture
Bit packing details: DSP48E2 W4A8 Bit Packing and Sign Recovery
KV cache strategy: KV Cache Optimization Strategy