HAL — AXI-Lite MMIO Layer

The uca_hal_* function set is the AXI-Lite MMIO layer that sits between the public C API (uca_*, see :doc:api) and the NPU hardware. No code above this layer accesses physical addresses or register offsets directly.

The implementation lives in codes/v002/sw/driver/uCA_v1_hal.c / uCA_v1_hal.h.

HAL Position

The driver stack is organized into two layers.

Table 2 Driver layer structure

==========  =============  ====
Layer       Symbol prefix  Role
==========  =============  ====
Public API  uca_*          Compute and memory primitives. Assembles 64-bit VLIW instructions and passes them through the HAL. See :doc:api.
HAL         uca_hal_*      AXI-Lite register reads and writes, 64-bit instruction latching, status polling. Depends directly on KV260 bare-metal pointer MMIO.
==========  =============  ====

The HAL stores all state in a single file-scope singleton, g_mmio_base (volatile uint32_t *). No context pointer is used; a single process is expected to communicate with one NPU instance.

Listing 1 uCA_v1_hal.c — MMIO base pointer declaration
static volatile uint32_t *g_mmio_base = NULL;
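
The listings on this page call uca_hal_write32 and uca_hal_read32 without showing their bodies. As a minimal sketch, assuming they index the base pointer by 32-bit word (byte offset divided by 4), they might look like the following. The fake_regs array and every sim_ name are hypothetical stand-ins so the code runs on a host without NPU hardware; the real implementations may differ.

```c
#include <stdint.h>

/* Hypothetical stand-in for the AXI-Lite register window (3 words). */
static uint32_t fake_regs[3];

/* Mirrors the HAL singleton, but points at fake_regs instead of 0xA0000000. */
static volatile uint32_t *sim_mmio_base = fake_regs;

/* Assumed shape of uca_hal_write32: byte offset -> 32-bit word index. */
static void sim_write32(uint32_t byte_offset, uint32_t value) {
    sim_mmio_base[byte_offset / 4U] = value;
}

/* Assumed shape of uca_hal_read32. */
static uint32_t sim_read32(uint32_t byte_offset) {
    return sim_mmio_base[byte_offset / 4U];
}
```

Under this assumption, sim_write32(0x04, v) stores v into the second word of the window. The volatile qualifier on the base pointer keeps the compiler from eliding or merging the accesses.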

Register Map

The MMIO base address is UCA_MMIO_BASE_ADDR = 0xA0000000. This value must match the AXI-Lite slave address assigned in the Vivado block design.

Listing 2 uCA_v1_hal.h — base address and register offset definitions
// ===| MMIO Base Address |=======================================================
// Must match the AXI-Lite slave address assigned in the Vivado block design.
#define UCA_MMIO_BASE_ADDR  0xA0000000UL

// ===| Register Offsets |========================================================
// All offsets are byte offsets from UCA_MMIO_BASE_ADDR.
// The 64-bit instruction register is split into two 32-bit words.
// Write LO first; writing HI triggers the NPU instruction latch.
#define UCA_REG_INSTR_LO    0x00  // [31:0]  lower 32 bits of 64-bit VLIW instruction
#define UCA_REG_INSTR_HI    0x04  // [63:32] upper 32 bits; writing this latches the instruction
#define UCA_REG_STATUS      0x08  // [31:0]  NPU status (read-only)
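
As a small worked example of the offsets above, the absolute bus address of each register is the base address plus its byte offset. The helper name uca_reg_addr is hypothetical and is not part of the HAL.

```c
#include <stdint.h>

#define UCA_MMIO_BASE_ADDR  0xA0000000UL
#define UCA_REG_INSTR_LO    0x00
#define UCA_REG_INSTR_HI    0x04
#define UCA_REG_STATUS      0x08

/* Hypothetical helper: absolute bus address of a register,
 * computed in 64 bits so the sum cannot wrap. */
static uint64_t uca_reg_addr(uint32_t byte_offset) {
    return (uint64_t)UCA_MMIO_BASE_ADDR + byte_offset;
}
```

With these definitions, the status register sits at 0xA0000008 and the latching HI word at 0xA0000004.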

Table 3 Register map

================  ======  ======  ===========
Name              Offset  Access  Description
================  ======  ======  ===========
UCA_REG_INSTR_LO  0x00    Write   Lower 32 bits of the 64-bit VLIW instruction. Written first.
UCA_REG_INSTR_HI  0x04    Write   Upper 32 bits of the 64-bit VLIW instruction. Writing this register triggers the NPU instruction latch.
UCA_REG_STATUS    0x08    Read    NPU status register (read-only). Contains UCA_STAT_BUSY and UCA_STAT_DONE bits.
================  ======  ======  ===========

A 64-bit instruction is written as a pair, LO first, HI second. The HI write triggers the controller’s instruction latch.

Listing 3 uCA_v1_hal.c — uca_hal_issue_instr implementation
void uca_hal_issue_instr(uint64_t instr) {
    // Write lower word first.
    // Writing the upper word triggers the NPU instruction latch (ISA §8).
    uca_hal_write32(UCA_REG_INSTR_LO, (uint32_t)(instr & 0xFFFFFFFFULL));
    uca_hal_write32(UCA_REG_INSTR_HI, (uint32_t)(instr >> 32));
}
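
To make the split and the write order concrete, the sketch below routes the same two writes through a recording mock instead of real MMIO. The mock_ functions and the log arrays are hypothetical test scaffolding, not part of uCA_v1_hal.c.

```c
#include <stdint.h>

#define UCA_REG_INSTR_LO 0x00
#define UCA_REG_INSTR_HI 0x04

/* Hypothetical write log standing in for the AXI-Lite bus. */
static uint32_t log_offset[2];
static uint32_t log_value[2];
static int      log_count = 0;

/* Mock of uca_hal_write32 that records each write instead of touching MMIO. */
static void mock_write32(uint32_t offset, uint32_t value) {
    if (log_count < 2) {
        log_offset[log_count] = offset;
        log_value[log_count]  = value;
    }
    log_count++;
}

/* Same split as uca_hal_issue_instr, routed through the mock. */
static void mock_issue_instr(uint64_t instr) {
    mock_write32(UCA_REG_INSTR_LO, (uint32_t)(instr & 0xFFFFFFFFULL));
    mock_write32(UCA_REG_INSTR_HI, (uint32_t)(instr >> 32));
}
```

For the instruction 0x1122334455667788, the log holds 0x55667788 at offset 0x00 first and 0x11223344 at offset 0x04 second, so the latch-triggering write always carries the upper word and always comes last.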

CMD_IN / STAT_OUT Mechanics

uca_hal_issue_instr submits a 64-bit instruction to the NPU’s CMD_IN path by writing the register pair. The call returns immediately; the NPU controller executes the instruction independently inside its pipeline.
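
Because the call returns immediately, a caller that submits several instructions would typically wait for the NPU to go idle before each one. The following is a minimal sketch of that pattern, assuming a poll-count budget instead of a microsecond timeout. All mock_ names, submit_when_idle, and the simulated status source are hypothetical; on hardware the real uca_hal_wait_idle and uca_hal_issue_instr would take their place.

```c
#include <stdint.h>

#define UCA_STAT_BUSY (1U << 0)

/* Hypothetical simulated NPU: reports BUSY for the next N status reads. */
static uint32_t mock_busy_polls = 0;
static int      mock_issued     = 0;

static uint32_t mock_read_status(void) {
    if (mock_busy_polls > 0) {
        mock_busy_polls--;
        return UCA_STAT_BUSY;
    }
    return 0;
}

/* Stand-in for uca_hal_wait_idle: gives up after max_polls status reads. */
static int mock_wait_idle(uint32_t max_polls) {
    while (max_polls--) {
        if (!(mock_read_status() & UCA_STAT_BUSY)) {
            return 0;   /* Idle */
        }
    }
    return -1;          /* Timeout */
}

/* Stand-in for uca_hal_issue_instr: only counts submissions. */
static void mock_issue_instr(uint64_t instr) {
    (void)instr;
    mock_issued++;
}

/* Hypothetical caller-side helper: issue only once BUSY has cleared. */
static int submit_when_idle(uint64_t instr, uint32_t max_polls) {
    if (mock_wait_idle(max_polls) != 0) {
        return -1;      /* Still busy: nothing was issued */
    }
    mock_issue_instr(instr);
    return 0;
}
```

Checking before issuing, rather than after, keeps the helper safe even when the previous instruction was submitted by a different code path.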

Status register UCA_REG_STATUS bit fields:

Listing 4 uCA_v1_hal.h — status bit definitions
// ===| Status Register Bit Fields |==============================================
#define UCA_STAT_BUSY       (1U << 0)  // NPU is executing — do not issue new instruction
#define UCA_STAT_DONE       (1U << 1)  // Last operation completed successfully

  • UCA_STAT_BUSY (bit 0) — NPU is executing an instruction. Do not issue a new instruction while this bit is set.

  • UCA_STAT_DONE (bit 1) — Last operation completed successfully.
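
A raw status word can be classified with two mask tests. The sketch below is illustrative only: the sketch_ names are hypothetical, and giving BUSY priority when both bits are set is an assumption, since the source does not say whether that combination can occur.

```c
#include <stdint.h>

#define UCA_STAT_BUSY (1U << 0)
#define UCA_STAT_DONE (1U << 1)

/* Hypothetical classification of a raw status word. */
typedef enum {
    SKETCH_STATE_IDLE,  /* BUSY clear, DONE clear */
    SKETCH_STATE_BUSY,  /* BUSY set: do not issue a new instruction */
    SKETCH_STATE_DONE   /* BUSY clear, DONE set */
} sketch_state_t;

static sketch_state_t sketch_decode_status(uint32_t stat) {
    if (stat & UCA_STAT_BUSY) {
        return SKETCH_STATE_BUSY;   /* Assumption: BUSY takes priority */
    }
    if (stat & UCA_STAT_DONE) {
        return SKETCH_STATE_DONE;
    }
    return SKETCH_STATE_IDLE;
}
```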

Polling is performed by uca_hal_wait_idle. Because no hardware timer driver is yet available on the bare-metal KV260, the current implementation uses a busy-wait loop whose iteration budget is timeout_us * 400, an estimate based on the 400 MHz core rate. The resulting timeout is therefore approximate rather than exact.

Listing 5 uCA_v1_hal.c — uca_hal_wait_idle implementation
int uca_hal_wait_idle(uint32_t timeout_us) {
    // Bare-metal busy-wait.
    // TODO: replace with a hardware timer once a timer driver is available.
    uint32_t count = timeout_us * 400;  // 400 iterations per us: ~1 iteration per 2.5 ns cycle at the 400 MHz estimate
    while (count--) {
        if (!(uca_hal_read_status() & UCA_STAT_BUSY)) {
            return 0;  // Idle
        }
    }
    return -1;  // Timeout
}

When the iteration counter reaches zero before UCA_STAT_BUSY clears, -1 is returned. The NPU state is not forcibly reset on timeout; the caller is responsible for error recovery.
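
One detail of the listing is worth spelling out: timeout_us * 400 is evaluated in uint32_t, so any timeout above 10,737,418 µs (about 10.7 s) wraps around and yields a much shorter budget than requested. A minimal sketch of a saturating alternative follows; timeout_to_iters and ITERS_PER_US are hypothetical names, not part of the HAL.

```c
#include <stdint.h>

#define ITERS_PER_US 400U  /* Same 400 MHz estimate as uca_hal_wait_idle */

/* Hypothetical helper: iteration budget that saturates instead of wrapping. */
static uint32_t timeout_to_iters(uint32_t timeout_us) {
    uint64_t iters = (uint64_t)timeout_us * ITERS_PER_US;  /* 64-bit product */
    return (iters > UINT32_MAX) ? UINT32_MAX : (uint32_t)iters;
}
```

A timeout of 1000 µs maps to 400,000 iterations, while anything past the wrap point is clamped to UINT32_MAX. Note that a timeout of 0 still maps to 0 iterations, so the loop in the listing would return -1 without reading the status register at all.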

uca_hal_init Flow

uca_hal_init performs three operations in sequence.

  1. Sets g_mmio_base to (volatile uint32_t *)UCA_MMIO_BASE_ADDR. Physical addresses are directly accessible in the KV260 bare-metal environment.

  2. Calls uca_hal_read32(UCA_REG_STATUS) to read the status register.

  3. If the value read is 0xFFFFFFFF, the AXI bus is treated as not responding and the function returns -1. Otherwise it returns 0.

Listing 6 uCA_v1_hal.c — uca_hal_init implementation
int uca_hal_init(void) {
    // On bare-metal KV260, physical addresses are directly accessible.
    g_mmio_base = (volatile uint32_t *)UCA_MMIO_BASE_ADDR;

    // Sanity check: status register reads all-ones on an unconnected AXI bus.
    uint32_t stat = uca_hal_read32(UCA_REG_STATUS);
    if (stat == 0xFFFFFFFFU) {
        return -1;  // Hardware not responding
    }
    return 0;
}

uca_hal_deinit sets g_mmio_base to NULL. Any subsequent uca_hal_write32 or uca_hal_read32 call will dereference a null pointer; the caller must ensure no HAL calls follow uca_hal_deinit.
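
One way a caller or a debug build could guard against this is a checked accessor that refuses to touch MMIO while the base pointer is NULL. The sketch below is an illustration under that assumption, not the HAL's behavior; fake_regs and all sim_ names are hypothetical and simulate the register window on a host.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t fake_regs[3];                    /* Hypothetical register window */
static volatile uint32_t *sim_mmio_base = NULL;  /* Mirrors g_mmio_base */

static void sim_init(void)   { sim_mmio_base = fake_regs; }
static void sim_deinit(void) { sim_mmio_base = NULL; }

/* Hypothetical checked write: returns -1 before init and after deinit. */
static int sim_write32_checked(uint32_t byte_offset, uint32_t value) {
    if (sim_mmio_base == NULL) {
        return -1;  /* HAL not initialized: skip the access entirely */
    }
    sim_mmio_base[byte_offset / 4U] = value;
    return 0;
}
```

The extra comparison costs one branch per access, which is why a production HAL might reasonably leave the check to the caller, as the current one does.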

See also

  • Public API primitives: :doc:api

  • AXI-Lite command path architecture: :doc:../Architecture/top_level

  • ISA instruction encoding: :doc:../ISA/encoding

Last verified against

Commit 8c09e5e @ pccxai/pccx-FPGA-NPU-LLM-kv260 (2026-04-29)