PREPROCESS RTL Reference¶

The PREPROCESS subdirectory contains five SystemVerilog modules. barrel_shifter_BF16.sv is not present in the current working tree.

`preprocess_fmap`¶

Listing 23 hw/rtl/PREPROCESS/preprocess_fmap.sv¶

module preprocess_fmap #(
    parameter fmap_width = `DEVICE_ACP_WIDTH_BIT
) (
    input logic clk,
    input logic rst_n,
    input logic i_clear,

    // AXI4-Stream Interfaces from ACP
    axis_if.slave S_AXIS_ACP_FMAP,  // ACP (128-bit)

    // Control from Brain
    input logic i_rd_start,




    // Output to Branch Engines (Systolic / GEMV / CVO)
    output logic [`FIXED_MANT_WIDTH-1:0] o_fmap_broadcast[0:`ARRAY_SIZE_H-1],
    output logic                         o_fmap_valid,

    output logic [`BF16_EXP_WIDTH-1:0] o_cached_emax[0:`ARRAY_SIZE_H-1]
);

Receives a 128-bit BF16 stream from the ACP S_AXIS_ACP_FMAP interface, buffers it into 256-bit words through an XPM block FIFO, and runs exponent caching and mantissa alignment in parallel. The aligned 432-bit output is written to fmap_cache; o_fmap_broadcast and o_cached_emax are driven to MAT_CORE.

fmap_width defaults to `DEVICE_ACP_WIDTH_BIT. ARRAY_SIZE_H controls the lane count for both output arrays.

`preprocess_bf16_fixed_pipeline`¶

Listing 24 hw/rtl/PREPROCESS/preprocess_bf16_fixed_pipeline.sv¶

module preprocess_bf16_fixed_pipeline (
    input logic clk,
    input logic rst_n,

    // AXI-Stream Slave (Input from 256-bit FIFO)
    input  logic [255:0] s_axis_tdata,
    input  logic         s_axis_tvalid,
    output logic         s_axis_tready,

    // AXI-Stream Master (Output to SRAM Cache - 16 x 27-bit = 432-bit)
    output logic [431:0] m_axis_tdata,
    output logic         m_axis_tvalid,
    input  logic         m_axis_tready
);

Accepts a 256-bit AXI-Stream slave (16 × BF16) and produces a 432-bit master (16 × 27-bit fixed-point). The conversion spans 3 registered pipeline stages.

Stage 1 (phase / buffer_low / block_valid): On the even beat, stores the lower sixteen BF16 words and their local e_max. On the odd beat, combines both halves into a 32-element block and computes the block-global e_max.
Stage 2 (shift_phase / shift_trigger / shift_target_data): Processes the block over two clocks, sixteen lanes at a time. Each lane inserts the hidden bit into a 27-bit container and right-shifts by (e_max - e_val). Two’s-complement negation is applied when the BF16 sign bit is set. A delta_e ≥ 27 check flushes the lane result to zero.
Stage 3 (m_axis_tvalid / m_axis_tdata output register): Latches the 432-bit result only on cycles where shift_trigger is asserted.

s_axis_tready is hardwired to 1; the module never asserts backpressure to the upstream FIFO.

`bf16_to_INT8_pipeline_power_of_two_scale`¶

hw/rtl/PREPROCESS/bf16_to_INT8_pipeline_power_of_two_scale.sv is the placeholder module for the Option A (power-of-two scale) INT8 quantizer. The port declaration accepts 256-bit input and emits 256-bit output (32 × INT8), but the body contains an incomplete always_ff block with an empty index expression (buffer_low[]) and does not synthesize. The internal logic is a copy of preprocess_bf16_fixed_pipeline carried over as scaffolding. Full implementation follows the scale-policy decision in TODO.md §A-1; the file is currently untracked in the RTL repo.

`bf16_to_INT8_pipeline_true_symmetric_INT8`¶

hw/rtl/PREPROCESS/bf16_to_INT8_pipeline_true_symmetric_INT8.sv is the placeholder module for the Option B (true symmetric INT8) quantizer. Port structure and body state are identical to bf16_to_INT8_pipeline_power_of_two_scale. The max_abs-based real-valued scale path is intended to be implemented with driver-computed S_a stored in the Constant Cache via MEMSET; implementation requirements are specified in TODO.md §A-1. The file is currently untracked in the RTL repo as well.

`fmap_cache`¶

Listing 25 hw/rtl/PREPROCESS/fmap_cache.sv¶

module fmap_cache #(
    parameter DATA_WIDTH  = 27,    // Fixed-point Mantissa width
    parameter WRITE_LANES = 16,    // 16 words per write
    parameter CACHE_DEPTH = 2048,  // Accommodates 1x2048 vector
    parameter LANES       = 32     // Number of vertical lanes to feed
) (
    input logic clk,
    input logic rst_n,

    // ===| Write Interface (From BF16-to-Fixed Shifter) |=======
    input logic [(DATA_WIDTH*WRITE_LANES)-1:0] wr_data,
    input logic                                wr_valid,
    input logic [                         6:0] wr_addr,   // log2(2048/16) = 7 bits
    input logic                                wr_en,

    // ===| Read Interface (To Staggered Delay Line) |=======
    input  logic                  rd_start,                      // Trigger to start broadcasting
    output logic [DATA_WIDTH-1:0] rd_data_broadcast[0:LANES-1],  // 32 identical copies
    output logic                  rd_valid
);

Receives the preprocess_bf16_fixed_pipeline output, stages it in a 2048-deep BRAM, and broadcasts one word per clock to MAT_CORE.

Four parameters govern geometry: DATA_WIDTH (default 27), WRITE_LANES (default 16), CACHE_DEPTH (default 2048), and LANES (default 32). The write port maps to xpm_memory_sdpram Port A (7-bit address, 432-bit data); the read port maps to Port B (11-bit address, 27-bit data). READ_LATENCY_B = 2 is set for 400 MHz operation; the read-valid signal is delayed through a 3-stage shift chain (rd_valid_pipe_1 → rd_valid_pipe_2 → rd_valid) to align with BRAM output.

The read FSM initialises rd_addr to zero on rd_start and de-asserts is_reading after the address reaches CACHE_DEPTH - 1. The broadcast assignment updates all LANES outputs simultaneously on every cycle where rd_valid_pipe_2 is asserted.

Last verified against

Commit 8c09e5e @ pccxai/pccx-FPGA-NPU-LLM-kv260 (2026-04-29).

PREPROCESS RTL Reference¶

preprocess_fmap¶

preprocess_bf16_fixed_pipeline¶

bf16_to_INT8_pipeline_power_of_two_scale¶

bf16_to_INT8_pipeline_true_symmetric_INT8¶

fmap_cache¶

`preprocess_fmap`¶

`preprocess_bf16_fixed_pipeline`¶

`bf16_to_INT8_pipeline_power_of_two_scale`¶

`bf16_to_INT8_pipeline_true_symmetric_INT8`¶

`fmap_cache`¶