공유 라이브러리¶

v002 RTL은 공통 수치 연산과 데이터 구조를 Library/ 디렉터리에 분리한다. 컴파일 순서(filelist.f)상 이 파일들은 패키지 계층(A–D) 직후, isa_pkg 이전에 위치한다. 컴퓨트 코어가 공통 라이브러리에 의존함으로써 동일 연산의 중복 구현을 막는다.

Algorithms 패키지¶

algorithms_pkg (Library/Algorithms/Algorithms.sv) QUEUE 상태 구조체 queue_stat_t를 정의한다. empty / full 두 필드로 구성된 packed struct이며, QUEUE 모듈 외부에서 상태를 참조할 때 사용한다. STACK 지원은 스텁(stub)으로 예약되어 있다.

리스팅 28 Library/Algorithms/Algorithms.sv¶

package algorithms_pkg;

  /*─────────────────────────────────────────────
  QUEUE
  ─────────────────────────────────────────────*/
  typedef struct packed {
    logic empty;
    logic full;
  } queue_stat_t;

  /*─────────────────────────────────────────────
  STACK
  ─────────────────────────────────────────────*/
  // typedef struct packed { ... } stack_stat_t;

bf16_math_pkg (Library/Algorithms/BF16_math.sv) BF16 산술 함수를 패키지로 제공한다. 파일 상단에서 BF16 비트 레이아웃을 [15]=sign, [14:7]=exp(8b), [6:0]=mantissa(7b)로 명시한다. 은닉 비트(hidden bit)는 저장하지 않는다.

노출되는 타입과 함수:

bf16_t — 부호 1비트, 지수 8비트, 가수 7비트의 packed struct.
bf16_aligned_t — emax 기준으로 정렬된 24비트 2의 보수 값과 8비트 emax 필드를 보유하는 packed struct.
to_bf16(raw[15:0]) — 16비트 원시값을 bf16_t로 캐스팅하는 자동 함수.
align_to_emax(val, emax) — BF16 값을 주어진 emax에 맞춰 정렬하여 24비트 2의 보수로 반환한다. 지수 차(diff)만큼 가수를 오른쪽으로 시프트한 뒤 부호 확장한다.
bf16_add(a[15:0], b[15:0]) — 두 BF16 값을 더해 BF16 결과를 반환한다. 큰 쪽 지수로 정렬 후 24비트 부호합산, 선행 1 탐색으로 재정규화한다. 비정규화/NaN/Inf 처리는 포함하지 않는다(autoregressive 디코딩 경로의 정규화 BF16 피연산자 전제).

리스팅 29 Library/Algorithms/BF16_math.sv¶

package bf16_math_pkg;

  /*─────────────────────────────────────────────
  BF16 struct
  [15]=sign  [14:7]=exp(8b)  [6:0]=mantissa(7b)
  hidden bit is implicit (not stored)
  ─────────────────────────────────────────────*/
  typedef struct packed {
    logic       sign;
    logic [7:0] exp;
    logic [6:0] mantissa;
  } bf16_t;

  /*─────────────────────────────────────────────
  Aligned output
  24-bit 2's complement
  ─────────────────────────────────────────────*/
  typedef struct packed {
    logic [7:0]  emax;
    logic [23:0] val;
  } bf16_aligned_t;

  /*─────────────────────────────────────────────
  cast raw 16-bit → bf16_t
  ─────────────────────────────────────────────*/
  function automatic bf16_t to_bf16(input logic [15:0] raw);
    return bf16_t'{sign: raw[15], exp: raw[14:7], mantissa: raw[6:0]};
  endfunction

  /*─────────────────────────────────────────────
  align one BF16 value to a given emax
  returns 24-bit 2's complement
  ─────────────────────────────────────────────*/
  function automatic logic [23:0] align_to_emax(input bf16_t val, input logic [7:0] emax);
    logic [ 7:0] diff;
    logic [22:0] mag;
    logic [23:0] result;

    diff   = emax - val.exp;
    mag    = ({1'b1, val.mantissa, 15'd0}) >> diff;
    result = val.sign ? (~{1'b0, mag} + 24'd1) : {1'b0, mag};
    return result;
  endfunction

  /*─────────────────────────────────────────────
  BF16 add: a + b as packed 16-bit values
  - aligns to the larger exponent
  - signed-adds the 24-bit aligned mantissas
  - renormalizes by counting the leading one
  - repacks to BF16
  First-pass implementation: no denormal / NaN / Inf handling; softmax
  uses normalized BF16 operands so the subtle corner cases don't fire
  on the autoregressive decode path. Used by CVO_top's sub-emax stage.
  ─────────────────────────────────────────────*/
  function automatic logic [15:0] bf16_add(input logic [15:0] a,
                                           input logic [15:0] b);
    bf16_t         av, bv;
    logic [7:0]    emax;
    logic [23:0]   aa, ba;
    logic signed [24:0] sum;
    logic               out_sign;
    logic [23:0]   mag;
    int            lead;
    logic [7:0]    out_exp;
    logic [6:0]    out_mant;

    av   = to_bf16(a);
    bv   = to_bf16(b);
    emax = (av.exp > bv.exp) ? av.exp : bv.exp;

    aa = align_to_emax(av, emax);
    ba = align_to_emax(bv, emax);
    sum = $signed({aa[23], aa}) + $signed({ba[23], ba});

    out_sign = sum[24];
    mag      = out_sign ? (~sum[23:0] + 24'd1) : sum[23:0];

    if (mag == 24'd0) return 16'd0;

    // Find the position of the leading 1 (MSB-first).
    lead = 23;
    while (lead > 0 && mag[lead] == 1'b0) lead = lead - 1;

    // Re-bias exponent. The mantissa's implicit leading-1 is at bit 15
    // before alignment; "lead - 15" is the net exponent correction.
    out_exp  = emax + 8'(lead - 15);

    // 7 mantissa bits immediately below the leading 1.
    if (lead >= 7)
      out_mant = mag[lead-1 -: 7];
    else
      out_mant = 7'(mag << (7 - lead));

    return {out_sign, out_exp, out_mant};
  endfunction

QUEUE 인터페이스¶

QUEUE는 인터페이스(IF_queue)와 모듈(QUEUE) 두 파일로 구성된다.

IF_queue (Library/Algorithms/QUEUE/IF_queue.sv) 파라미터 DATA_WIDTH(기본 32)와 DEPTH(기본 8)를 받는 SystemVerilog 인터페이스다. clk, rst_n을 인터페이스 포트로 받는다. PTR_W = $clog2(DEPTH) 비트의 포인터(wr_ptr, rd_ptr)와 mem[0:DEPTH-1] 스토리지를 인터페이스 내부에서 선언한다. empty/full 플래그는 순수 조합 논리로 할당된다.

세 modport:

producer — push() 태스크만 import. push_data/push_en을 출력, empty/full을 입력.
consumer — pop() 태스크만 import. pop_data/empty/full을 입력, pop_en을 출력.
owner — QUEUE 모듈 자신이 사용. 핸드셰이크 신호를 입력으로 받고, 포인터와 ref mem을 출력한다.

리스팅 30 Library/Algorithms/QUEUE/IF_queue.sv¶

  modport producer(import push, input empty, full, clk, rst_n, output push_data, push_en);

  // consumer : only pops
  modport consumer(import pop, input empty, full, pop_data, clk, rst_n, output pop_en);

  // owner : the FIFO module itself. Reads producer/consumer handshake
  // signals, updates its own pointers + memory contents.
  modport owner(input  clk, rst_n, push_data, push_en, pop_en, full, empty,
                output wr_ptr, rd_ptr, ref mem);

QUEUE (Library/Algorithms/QUEUE/QUEUE.sv) IF_queue.owner q를 단일 포트로 받는 모듈이다. PTR_W = $clog2($size(q.mem))로 포인터 폭을 재유도한다. always_ff 블록에서 리셋 시 양쪽 포인터를 0으로 초기화하고, push_en && !full 조건에서 데이터를 기록하며, pop_en && !empty 조건에서 읽기 포인터를 증가시킨다.

Quantizations¶

Quantize_BF16.sv (Library/Quantizations/BF16/Quantize_BF16.sv) 현재 내용이 없는 자리 표시자 파일이다. BF16 양자화 헬퍼를 위해 예약된 위치로, 향후 오프라인 양자화 파이프라인에서 RTL과 소프트웨어 스택 간 공통 변환 로직을 이곳에 배치할 계획이다.

사용 패턴¶

아래 표는 소스에서 직접 확인된 import 구문과 인터페이스 인스턴스화를 기준으로 컴퓨트 코어와 라이브러리 파일 간 의존 관계를 보여준다.

표 7 컴퓨트 코어별 라이브러리 의존성¶
모듈 (코어)	`algorithms_pkg`	`bf16_math_pkg`	`IF_queue`	`QUEUE`
`CVO_top` (CVO_CORE)	—	o	—	—
`AXIL_CMD_IN` (ctrl_npu_frontend 하위)	o	—	o	o

o = 소스에서 직접 확인된 import 또는 인스턴스화. — = 해당 파일에 없음.

bf16_math_pkg는 CVO_top이 import bf16_math_pkg::*;로 직접 가져온다. 소스 주석에 따르면 FLAG_SUB_EMAX 경로(softmax의 sub-emax 스테이지)가 이 패키지의 BF16 산술을 사용한다. algorithms_pkg와 IF_queue / QUEUE는 AXIL_CMD_IN에서 인스턴스화된다. AXIL_CMD_IN은 AXI4-Lite 커맨드 채널의 FIFO 버퍼로, ctrl_npu_frontend가 내부에서 인스턴스화한다. GEMM_systolic_top, GEMV_top, PREPROCESS 모듈은 라이브러리 패키지를 직접 import하지 않고 `define 헤더만 사용한다.

마지막 검증 대상

커밋 8c09e5e @ pccxai/pccx-FPGA-NPU-LLM-kv260 (2026-04-29).