PREPROCESS RTL Reference
The PREPROCESS subdirectory contains five SystemVerilog modules.
barrel_shifter_BF16.sv is not present in the current working tree.
preprocess_fmap
module preprocess_fmap #(
    parameter fmap_width = `DEVICE_ACP_WIDTH_BIT
) (
    input logic clk,
    input logic rst_n,
    input logic i_clear,
    // AXI4-Stream Interfaces from ACP
    axis_if.slave S_AXIS_ACP_FMAP,  // ACP (128-bit)
    // Control from Brain
    input logic i_rd_start,
    // Output to Branch Engines (Systolic / GEMV / CVO)
    output logic [`FIXED_MANT_WIDTH-1:0] o_fmap_broadcast[0:`ARRAY_SIZE_H-1],
    output logic o_fmap_valid,
    output logic [`BF16_EXP_WIDTH-1:0] o_cached_emax[0:`ARRAY_SIZE_H-1]
);
Receives a 128-bit BF16 stream from the ACP S_AXIS_ACP_FMAP interface,
buffers it into 256-bit words through an XPM block FIFO, and runs
exponent caching and mantissa alignment in parallel. The aligned 432-bit
output is written to fmap_cache; o_fmap_broadcast and o_cached_emax
are driven to MAT_CORE.
fmap_width defaults to `DEVICE_ACP_WIDTH_BIT.
ARRAY_SIZE_H controls the lane count for both output arrays.
preprocess_bf16_fixed_pipeline
module preprocess_bf16_fixed_pipeline (
    input logic clk,
    input logic rst_n,
    // AXI-Stream Slave (Input from 256-bit FIFO)
    input logic [255:0] s_axis_tdata,
    input logic s_axis_tvalid,
    output logic s_axis_tready,
    // AXI-Stream Master (Output to SRAM Cache - 16 x 27-bit = 432-bit)
    output logic [431:0] m_axis_tdata,
    output logic m_axis_tvalid,
    input logic m_axis_tready
);
Accepts 256-bit words (16 × BF16) on the AXI-Stream slave port and produces 432-bit words (16 × 27-bit fixed-point) on the master port. The conversion spans three registered pipeline stages:
Stage 1 (phase / buffer_low / block_valid): On the even beat, stores the lower sixteen BF16 words and their local e_max. On the odd beat, combines both halves into a 32-element block and computes the block-global e_max.
Stage 2 (shift_phase / shift_trigger / shift_target_data): Processes the block over two clocks, sixteen lanes at a time. Each lane inserts the hidden bit into a 27-bit container and right-shifts by (e_max - e_val). Two's-complement negation is applied when the BF16 sign bit is set. A delta_e ≥ 27 check flushes the lane result to zero.
Stage 3 (m_axis_tvalid / m_axis_tdata output register): Latches the 432-bit result only on cycles where shift_trigger is asserted.
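The per-lane arithmetic of Stages 1 and 2 can be sketched as a pair of pure functions. This is an illustrative model, not the module's actual code: the package and function names are hypothetical, lane i is assumed to occupy bits [16*i +: 16] of the input word, and the hidden bit is assumed to sit directly below the sign position of the 27-bit container (bit 25). The source text does not state either layout. Zero and subnormal inputs are not treated specially here.

```systemverilog
// Hypothetical helper package; none of these names come from the RTL.
package bf16_align_sketch_pkg;
  localparam int FIXED_W = 27;  // container width stated in the text

  // Local e_max: largest exponent field among the 16 BF16 words of one beat.
  // ASSUMPTION: lane i occupies bits [16*i +: 16] of the 256-bit word.
  function automatic logic [7:0] max_exp16(input logic [255:0] beat);
    logic [7:0] e_max;
    e_max = '0;
    for (int i = 0; i < 16; i++) begin
      if (beat[16*i + 7 +: 8] > e_max) e_max = beat[16*i + 7 +: 8];
    end
    return e_max;
  endfunction

  // Block-global e_max: the larger of the two per-beat maxima.
  function automatic logic [7:0] block_e_max(input logic [7:0] lo, input logic [7:0] hi);
    return (hi > lo) ? hi : lo;
  endfunction

  // One lane of Stage 2: hidden-bit insertion, alignment shift, sign handling.
  function automatic logic [FIXED_W-1:0] align_lane(
      input logic [15:0] bf16,
      input logic [7:0]  e_max
  );
    logic               sign;
    logic [7:0]         e_val;
    logic [7:0]         delta_e;
    logic [FIXED_W-1:0] mag;

    sign    = bf16[15];
    e_val   = bf16[14:7];
    delta_e = e_max - e_val;  // non-negative when e_max is the block maximum

    // ASSUMPTION: {sign slot, hidden 1, 7-bit mantissa, 18 guard bits} = 27 bits.
    mag = {1'b0, 1'b1, bf16[6:0], 18'b0};

    if (delta_e >= 8'd27) return '0;      // flush: value is shifted out entirely
    mag = mag >> delta_e;                 // align to the shared exponent
    return sign ? (~mag + 1'b1) : mag;    // two's-complement negate if negative
  endfunction
endpackage
```

Under these assumptions, align_lane(16'h3F80, 8'h7F) (BF16 1.0 with matching e_max) returns 27'h2000000, and the same input with e_max one higher returns 27'h1000000.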
s_axis_tready is hardwired to 1; the module never asserts backpressure
to the upstream FIFO.
bf16_to_INT8_pipeline_power_of_two_scale
hw/rtl/PREPROCESS/bf16_to_INT8_pipeline_power_of_two_scale.sv is the
placeholder module for the Option A (power-of-two scale) INT8 quantizer.
The port declaration accepts 256-bit input and emits 256-bit output
(32 × INT8), but the body contains an incomplete always_ff block
with an empty index expression (buffer_low[]) and does not synthesize.
The internal logic is a copy of preprocess_bf16_fixed_pipeline carried
over as scaffolding. Full implementation follows the scale-policy decision
in TODO.md §A-1; the file is currently untracked in the RTL repo.
bf16_to_INT8_pipeline_true_symmetric_INT8
hw/rtl/PREPROCESS/bf16_to_INT8_pipeline_true_symmetric_INT8.sv is the
placeholder module for the Option B (true symmetric INT8) quantizer.
Port structure and body state are identical to
bf16_to_INT8_pipeline_power_of_two_scale. The max_abs-based real-valued
scale path is intended to be implemented with driver-computed S_a
stored in the Constant Cache via MEMSET; implementation requirements
are specified in TODO.md §A-1. The file is currently untracked in the
RTL repo as well.
fmap_cache
module fmap_cache #(
    parameter DATA_WIDTH = 27,     // Fixed-point Mantissa width
    parameter WRITE_LANES = 16,    // 16 words per write
    parameter CACHE_DEPTH = 2048,  // Accommodates 1x2048 vector
    parameter LANES = 32           // Number of vertical lanes to feed
) (
    input logic clk,
    input logic rst_n,
    // ===| Write Interface (From BF16-to-Fixed Shifter) |=======
    input logic [(DATA_WIDTH*WRITE_LANES)-1:0] wr_data,
    input logic wr_valid,
    input logic [ 6:0] wr_addr,  // log2(2048/16) = 7 bits
    input logic wr_en,
    // ===| Read Interface (To Staggered Delay Line) |=======
    input logic rd_start,  // Trigger to start broadcasting
    output logic [DATA_WIDTH-1:0] rd_data_broadcast[0:LANES-1],  // 32 identical copies
    output logic rd_valid
);
Receives the preprocess_bf16_fixed_pipeline output, stages it in a
2048-deep BRAM, and broadcasts one word per clock to MAT_CORE.
Four parameters govern geometry: DATA_WIDTH (default 27),
WRITE_LANES (default 16), CACHE_DEPTH (default 2048), and
LANES (default 32). The write port maps to xpm_memory_sdpram
Port A (7-bit address, 432-bit data); the read port maps to Port B
(11-bit address, 27-bit data). READ_LATENCY_B = 2 is set for
400 MHz operation; the read-valid signal is delayed through a 3-stage
shift chain (rd_valid_pipe_1 → rd_valid_pipe_2 → rd_valid) to align
with BRAM output.
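The asymmetric port widths follow directly from the default parameters. The sketch below only derives the numbers quoted above; the package name and the localparam names are hypothetical, and the comments naming xpm_memory_sdpram parameters indicate where such values would typically be passed rather than how this module actually passes them.

```systemverilog
// Hypothetical package; mirrors the fmap_cache defaults to derive port geometry.
package fmap_cache_geometry_sketch_pkg;
  localparam int DATA_WIDTH  = 27;
  localparam int WRITE_LANES = 16;
  localparam int CACHE_DEPTH = 2048;

  // Port A (write side): one 16-lane row per address.
  localparam int WR_DATA_W = DATA_WIDTH * WRITE_LANES;           // 432, e.g. WRITE_DATA_WIDTH_A
  localparam int WR_ADDR_W = $clog2(CACHE_DEPTH / WRITE_LANES);  // 7,   e.g. ADDR_WIDTH_A

  // Port B (read side): one 27-bit word per address.
  localparam int RD_DATA_W = DATA_WIDTH;                         // 27,  e.g. READ_DATA_WIDTH_B
  localparam int RD_ADDR_W = $clog2(CACHE_DEPTH);                // 11,  e.g. ADDR_WIDTH_B

  // Both views cover the same storage: 128 x 432 = 2048 x 27 bits.
  localparam int MEMORY_BITS = DATA_WIDTH * CACHE_DEPTH;         // 55296, e.g. MEMORY_SIZE
endpackage
```

Both port views address the same 55,296 bits, which is why a 7-bit write address and an 11-bit read address can share one memory.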
The read FSM initialises rd_addr to zero on rd_start and
de-asserts is_reading after the address reaches CACHE_DEPTH - 1.
The broadcast assignment updates all LANES outputs simultaneously on
every cycle where rd_valid_pipe_2 is asserted.
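A minimal sketch of the read-side control described above, written as a standalone module for illustration. The module name is hypothetical, the synchronous active-low reset is an assumption, and feeding is_reading into the first stage of the valid chain is also an assumption; the text names the three stages but not the chain's input. The BRAM and the broadcast registers are omitted.

```systemverilog
// Hypothetical standalone model of the fmap_cache read FSM and valid chain.
module fmap_read_ctrl_sketch #(
    parameter int CACHE_DEPTH = 2048
) (
    input  logic                           clk,
    input  logic                           rst_n,     // ASSUMPTION: synchronous reset
    input  logic                           rd_start,
    output logic [$clog2(CACHE_DEPTH)-1:0] rd_addr,
    output logic                           rd_valid
);
  logic is_reading;
  logic rd_valid_pipe_1, rd_valid_pipe_2;

  // Address counter: restart at zero on rd_start, stop after CACHE_DEPTH - 1.
  always_ff @(posedge clk) begin
    if (!rst_n) begin
      rd_addr    <= '0;
      is_reading <= 1'b0;
    end else if (rd_start) begin
      rd_addr    <= '0;
      is_reading <= 1'b1;
    end else if (is_reading) begin
      if (rd_addr == CACHE_DEPTH - 1) is_reading <= 1'b0;
      else                            rd_addr    <= rd_addr + 1'b1;
    end
  end

  // Three-stage valid chain, so rd_valid trails is_reading by three clocks.
  always_ff @(posedge clk) begin
    if (!rst_n) begin
      rd_valid_pipe_1 <= 1'b0;
      rd_valid_pipe_2 <= 1'b0;
      rd_valid        <= 1'b0;
    end else begin
      rd_valid_pipe_1 <= is_reading;       // ASSUMPTION: chain input
      rd_valid_pipe_2 <= rd_valid_pipe_1;
      rd_valid        <= rd_valid_pipe_2;
    end
  end
endmodule
```

In this model a single rd_start pulse yields exactly CACHE_DEPTH cycles of rd_valid, one per address issued.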
Last verified against commit 8c09e5e @ pccxai/pccx-FPGA-NPU-LLM-kv260 (2026-04-29).