Shared Library

The v002 RTL collects common numeric operations and data structures under the Library/ directory. In the compile order recorded by filelist.f, these files appear immediately after the package tier (A–D) and before isa_pkg. Compute cores depend on this shared library so that each operation has a single implementation rather than per-core duplicates.

Algorithms Package

algorithms_pkg (Library/Algorithms/Algorithms.sv): Defines queue_stat_t, the QUEUE status type: a packed struct with two 1-bit fields, empty and full. External logic that needs to inspect queue state uses this type rather than reading the raw signals directly. A STACK counterpart (stack_stat_t) is reserved as a commented-out stub.

Listing 28 Library/Algorithms/Algorithms.sv
package algorithms_pkg;

  /*─────────────────────────────────────────────
  QUEUE
  ─────────────────────────────────────────────*/
  typedef struct packed {
    logic empty;
    logic full;
  } queue_stat_t;

  /*─────────────────────────────────────────────
  STACK
  ─────────────────────────────────────────────*/
  // typedef struct packed { ... } stack_stat_t;

endpackage
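
As a minimal sketch of how external logic might consume queue_stat_t, the hypothetical module below packs raw empty/full flags into the struct and derives two convenience signals from it. The module name, the port names, and the can_push_o/can_pop_o outputs are illustrative assumptions; none of them appear in the repository.

```systemverilog
// Hypothetical example; assumes Algorithms.sv is compiled first.
module queue_status_example
  import algorithms_pkg::*;
(
  input  logic        empty_i,     // raw "queue is empty" flag
  input  logic        full_i,      // raw "queue is full" flag
  output queue_stat_t stat_o,      // packed status handed to external logic
  output logic        can_push_o,  // illustrative: a push would be accepted
  output logic        can_pop_o    // illustrative: a pop would return data
);

  // Pack the two flags by field name so the bit order of the struct
  // never has to be known by the consumer.
  assign stat_o = '{empty: empty_i, full: full_i};

  assign can_push_o = !stat_o.full;
  assign can_pop_o  = !stat_o.empty;

endmodule
```

Because the struct is packed, stat_o can also be carried on a plain 2-bit bus and cast back with queue_stat_t'(...) on the receiving side.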

bf16_math_pkg (Library/Algorithms/BF16_math.sv): Provides BF16 arithmetic as a SystemVerilog package. The file header documents the bit layout: [15]=sign, [14:7]=exp(8b), [6:0]=mantissa(7b). The hidden bit (implicit leading 1) is not stored.

Exposed types and functions:

  • bf16_t — Packed struct with a 1-bit sign, 8-bit exponent, and 7-bit mantissa.

  • bf16_aligned_t — Packed struct holding an 8-bit emax and a 24-bit two’s-complement aligned value.

  • to_bf16(raw[15:0]) — Automatic function that casts a raw 16-bit value to bf16_t.

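  • align_to_emax(val, emax) — Aligns a BF16 value to a given emax and returns a 24-bit two’s-complement integer. It restores the hidden bit, pads the significand into a 23-bit magnitude ({1'b1, mantissa, 15'd0}), shifts that magnitude right by diff = emax - val.exp, and then negates the result in two’s complement when val.sign is set.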

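  • bf16_add(a[15:0], b[15:0]) — Adds two packed BF16 values and returns a packed BF16 result. It aligns both operands to the larger exponent, adds the two 24-bit aligned values in a 25-bit signed accumulator, then renormalises by locating the leading 1 and keeping the 7 bits below it (truncation, no rounding). A sum of exactly zero returns 16'd0. Denormal, NaN, and Inf handling are not included; per the source comment, softmax on the autoregressive decode path uses normalised BF16 operands, so these corner cases are not expected to occur there.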

Listing 29 Library/Algorithms/BF16_math.sv
package bf16_math_pkg;

  /*─────────────────────────────────────────────
  BF16 struct
  [15]=sign  [14:7]=exp(8b)  [6:0]=mantissa(7b)
  hidden bit is implicit (not stored)
  ─────────────────────────────────────────────*/
  typedef struct packed {
    logic       sign;
    logic [7:0] exp;
    logic [6:0] mantissa;
  } bf16_t;

  /*─────────────────────────────────────────────
  Aligned output
  24-bit 2's complement
  ─────────────────────────────────────────────*/
  typedef struct packed {
    logic [7:0]  emax;
    logic [23:0] val;
  } bf16_aligned_t;

  /*─────────────────────────────────────────────
  cast raw 16-bit → bf16_t
  ─────────────────────────────────────────────*/
  function automatic bf16_t to_bf16(input logic [15:0] raw);
    return bf16_t'{sign: raw[15], exp: raw[14:7], mantissa: raw[6:0]};
  endfunction

  /*─────────────────────────────────────────────
  align one BF16 value to a given emax
  returns 24-bit 2's complement
  ─────────────────────────────────────────────*/
  function automatic logic [23:0] align_to_emax(input bf16_t val, input logic [7:0] emax);
    logic [ 7:0] diff;
    logic [22:0] mag;
    logic [23:0] result;

    diff   = emax - val.exp;
    mag    = ({1'b1, val.mantissa, 15'd0}) >> diff;
    result = val.sign ? (~{1'b0, mag} + 24'd1) : {1'b0, mag};
    return result;
  endfunction

  /*─────────────────────────────────────────────
  BF16 add: a + b as packed 16-bit values
  - aligns to the larger exponent
  - signed-adds the 24-bit aligned mantissas
  - renormalizes by counting the leading one
  - repacks to BF16
  First-pass implementation: no denormal / NaN / Inf handling; softmax
  uses normalized BF16 operands so the subtle corner cases don't fire
  on the autoregressive decode path. Used by CVO_top's sub-emax stage.
  ─────────────────────────────────────────────*/
  function automatic logic [15:0] bf16_add(input logic [15:0] a,
                                           input logic [15:0] b);
    bf16_t         av, bv;
    logic [7:0]    emax;
    logic [23:0]   aa, ba;
    logic signed [24:0] sum;
    logic               out_sign;
    logic [23:0]   mag;
    int            lead;
    logic [7:0]    out_exp;
    logic [6:0]    out_mant;

    av   = to_bf16(a);
    bv   = to_bf16(b);
    emax = (av.exp > bv.exp) ? av.exp : bv.exp;

    aa = align_to_emax(av, emax);
    ba = align_to_emax(bv, emax);
    sum = $signed({aa[23], aa}) + $signed({ba[23], ba});

    out_sign = sum[24];
    mag      = out_sign ? (~sum[23:0] + 24'd1) : sum[23:0];

    if (mag == 24'd0) return 16'd0;

    // Find the position of the leading 1 (MSB-first).
    lead = 23;
    while (lead > 0 && mag[lead] == 1'b0) lead = lead - 1;

    // Re-bias exponent. align_to_emax places the implicit leading 1 at
    // bit 22 ({1'b1, mantissa, 15'd0} is 23 bits wide), so "lead - 22"
    // is the net exponent correction.
    out_exp  = emax + 8'(lead - 22);

    // 7 mantissa bits immediately below the leading 1.
    if (lead >= 7)
      out_mant = mag[lead-1 -: 7];
    else
      out_mant = 7'(mag << (7 - lead));

    return {out_sign, out_exp, out_mant};
  endfunction

endpackage
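
The alignment step can be checked by hand against the listing. The testbench below is a minimal sketch, not part of the repository: the module name is hypothetical, and it assumes BF16_math.sv has been compiled first. Each expected value follows directly from {1'b1, mantissa, 15'd0} >> diff in align_to_emax.

```systemverilog
// Hypothetical self-checking testbench for to_bf16 / align_to_emax.
module tb_align_to_emax_example;
  import bf16_math_pkg::*;

  initial begin
    // +1.0 = 16'h3F80 (exp = 127, mantissa = 0) aligned to its own
    // exponent: diff = 0, so the restored hidden 1 sits at bit 22.
    assert (align_to_emax(to_bf16(16'h3F80), 8'd127) == 24'h400000)
      else $error("+1.0 at emax=127 mismatch");

    // Same value aligned to emax = 128: diff = 1, one right shift.
    assert (align_to_emax(to_bf16(16'h3F80), 8'd128) == 24'h200000)
      else $error("+1.0 at emax=128 mismatch");

    // +1.5 = 16'h3FC0 (mantissa = 7'b1000000): bits 22 and 21 are set.
    assert (align_to_emax(to_bf16(16'h3FC0), 8'd127) == 24'h600000)
      else $error("+1.5 at emax=127 mismatch");

    // -1.0 = 16'hBF80: same magnitude as +1.0, negated in two's complement.
    assert (align_to_emax(to_bf16(16'hBF80), 8'd127) == 24'hC00000)
      else $error("-1.0 at emax=127 mismatch");

    $display("align_to_emax example checks finished");
    $finish;
  end
endmodule
```

The caller is expected to pass an emax that is at least val.exp; otherwise the 8-bit diff wraps around and the magnitude is shifted out.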

QUEUE Interface

The QUEUE primitive is split across two files: an interface (IF_queue) and a module (QUEUE).

IF_queue (Library/Algorithms/QUEUE/IF_queue.sv): A parameterised SystemVerilog interface with parameters DATA_WIDTH (default 32) and DEPTH (default 8). The interface takes clk and rst_n as ports and derives the pointer width internally as PTR_W = $clog2(DEPTH). The storage array mem[0:DEPTH-1] and the pointers wr_ptr/rd_ptr are declared inside the interface, and the empty and full flags are assigned combinationally.

Three modports:

  • producer — Imports the push() task only. Drives push_data/push_en; reads empty/full.

  • consumer — Imports the pop() task only. Reads pop_data/empty/full; drives pop_en.

  • owner — Used by the QUEUE module itself. Receives all handshake signals as inputs; drives wr_ptr/rd_ptr and references mem via ref.

Listing 30 Library/Algorithms/QUEUE/IF_queue.sv
  modport producer(import push, input empty, full, clk, rst_n, output push_data, push_en);

  // consumer : only pops
  modport consumer(import pop, input empty, full, pop_data, clk, rst_n, output pop_en);

  // owner : the FIFO module itself. Reads producer/consumer handshake
  // signals, updates its own pointers + memory contents.
  modport owner(input  clk, rst_n, push_data, push_en, pop_en, full, empty,
                output wr_ptr, rd_ptr, ref mem);
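
A module on the producing side sees only what the producer modport exposes. The sketch below is a hypothetical producer, not code from the repository: the module name and the valid_i/data_i/ready_o ports are assumptions, and the data width assumes the default DATA_WIDTH of 32. Because the signature of the push() task is not shown here, the sketch drives push_data and push_en directly instead of calling it.

```systemverilog
// Hypothetical producer-side module using the producer modport.
module cmd_source_example (
  IF_queue.producer   q,        // producer view: can push, cannot pop
  input  logic        valid_i,  // assumed upstream "word available" strobe
  input  logic [31:0] data_i,   // assumes the default DATA_WIDTH of 32
  output logic        ready_o   // high while the FIFO can accept a word
);

  // Back-pressure towards the upstream logic.
  assign ready_o = !q.full;

  // Present the word and request a push only when there is room,
  // matching the owner's push_en && !full write condition.
  assign q.push_data = data_i;
  assign q.push_en   = valid_i && !q.full;

endmodule
```

A consumer-side module would mirror this with the consumer modport, reading pop_data and driving pop_en while empty is low.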

QUEUE (Library/Algorithms/QUEUE/QUEUE.sv): A module with a single port, IF_queue.owner q. It re-derives the pointer width as PTR_W = $clog2($size(q.mem)) because modports cannot export parameters. The always_ff block initialises both pointers to zero on reset, writes a word when push_en && !full, and advances the read pointer when pop_en && !empty.
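
The two files are meant to be instantiated as a pair. The testbench below is a minimal wiring sketch under stated assumptions: the testbench name, instance names, clock period, and stimulus are hypothetical, while the parameter names, the clk/rst_n interface ports, the q port, and the handshake signal names are taken from the descriptions above. It prints the observed flags rather than asserting values, since the exact timing of QUEUE is not shown in this section.

```systemverilog
// Hypothetical wiring example; assumes IF_queue.sv and QUEUE.sv are compiled.
module tb_queue_wiring_example;

  logic clk   = 1'b0;
  logic rst_n = 1'b0;

  always #5 clk = ~clk;  // arbitrary clock period for simulation

  // One interface instance holds the storage, pointers, and flags.
  IF_queue #(.DATA_WIDTH(32), .DEPTH(8)) q_if (.clk(clk), .rst_n(rst_n));

  // The FIFO logic attaches through the owner modport.
  QUEUE u_queue (.q(q_if.owner));

  initial begin
    // Idle handshake signals while reset is asserted.
    q_if.push_en   <= 1'b0;
    q_if.pop_en    <= 1'b0;
    q_if.push_data <= '0;
    repeat (2) @(posedge clk);
    rst_n = 1'b1;
    @(posedge clk);

    // Push a single word for one clock cycle.
    q_if.push_data <= 32'hCAFE_F00D;
    q_if.push_en   <= 1'b1;
    @(posedge clk);
    q_if.push_en   <= 1'b0;
    @(posedge clk);

    $display("empty=%0b full=%0b pop_data=%h",
             q_if.empty, q_if.full, q_if.pop_data);
    $finish;
  end
endmodule
```

In a real design the testbench stimulus would be replaced by modules connected through q_if.producer and q_if.consumer.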

Quantizations

Quantize_BF16.sv (Library/Quantizations/BF16/Quantize_BF16.sv): Currently an empty placeholder. It marks the intended location for BF16 quantization helpers that will provide a common conversion path between the offline quantization pipeline and the RTL datapath.

Usage Patterns

The table reflects import statements and interface instantiations confirmed directly in each source file.

Table 7 Library dependencies by compute core

Module (core)                                  algorithms_pkg  bf16_math_pkg  IF_queue  QUEUE
CVO_top (CVO_CORE)                             -               o              -         -
AXIL_CMD_IN (sub-module of ctrl_npu_frontend)  o               -              o         o

o = import or instantiation confirmed in source. - = not present in that file.

CVO_top declares import bf16_math_pkg::*; directly. Per the source comment, the FLAG_SUB_EMAX path (the sub-emax stage of the CVO softmax) uses this package’s BF16 arithmetic. AXIL_CMD_IN, which buffers AXI4-Lite commands in a FIFO and is itself instantiated by ctrl_npu_frontend, imports algorithms_pkg and instantiates IF_queue and QUEUE. GEMM_systolic_top, GEMV_top, and the PREPROCESS modules do not import any library package; they use only `define headers.
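
Since the package exposes only an adder, a subtraction stage can be built by flipping the sign bit of the subtrahend before calling bf16_add. The module below is a hypothetical illustration of that pattern and is not the CVO_top implementation: the module and port names are assumptions, and it inherits the limitations of bf16_add (no denormal, NaN, or Inf handling).

```systemverilog
// Hypothetical BF16 subtractor built on bf16_add; assumes BF16_math.sv
// is compiled first.
module bf16_sub_example
  import bf16_math_pkg::*;
(
  input  logic [15:0] x_i,    // packed BF16 minuend
  input  logic [15:0] sub_i,  // packed BF16 subtrahend (illustrative name)
  output logic [15:0] diff_o  // packed BF16 result of x_i - sub_i
);

  logic [15:0] neg_sub;

  // BF16 is sign-magnitude: negation only inverts the sign bit [15].
  assign neg_sub = {~sub_i[15], sub_i[14:0]};

  // x - y is computed as x + (-y).
  assign diff_o = bf16_add(x_i, neg_sub);

endmodule
```

When both inputs are equal, the aligned values cancel and bf16_add returns 16'd0 through its zero-magnitude early return.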


Last verified against

Commit 8c09e5e @ pccxai/pccx-FPGA-NPU-LLM-kv260 (2026-04-29).