Vivado Build
This page documents the Vivado build flow for the pccx v002 NPU core.
All build scripts reside under hw/vivado/; build.sh is the single
entry point. Source files are managed in the
pccxai/pccx-FPGA-NPU-LLM-kv260 repository.
Build Flow
build.sh is a thin wrapper around Vivado batch-mode invocations.
The first argument selects one of four stages.
```shell
./hw/vivado/build.sh project   # create project only
./hw/vivado/build.sh synth     # create project + OOC synthesis
./hw/vivado/build.sh impl      # full implementation + bitstream
./hw/vivado/build.sh clean     # remove build/ directory
```
The script first searches PATH for vivado; if not found it falls
back through /tools/Xilinx/2025.2, 2024.1, 2023.2 in that order.
Vivado 2023.2 or later is required.
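The lookup order described above can be sketched as a small shell function. This is an illustration, not the code in build.sh: the function name `find_vivado`, its optional root argument, and the `bin/vivado` suffix under each version directory are assumptions made for the example.

```shell
# Sketch of the Vivado lookup order; NOT the actual build.sh implementation.
# ASSUMPTION: each install exposes its launcher at <root>/<version>/bin/vivado.
# The optional first argument (install root) is hypothetical, added for testing.
find_vivado() {
    local root="${1:-/tools/Xilinx}"
    local version candidate

    # 1. Prefer whatever is already on PATH.
    if command -v vivado >/dev/null 2>&1; then
        command -v vivado
        return 0
    fi

    # 2. Otherwise try the known versions, newest first.
    for version in 2025.2 2024.1 2023.2; do
        candidate="$root/$version/bin/vivado"
        if [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done

    echo "error: vivado not found on PATH or under $root" >&2
    return 1
}
```

A caller could capture the result with `VIVADO="$(find_vivado)"` and then run a stage script in batch mode, for example `"$VIVADO" -mode batch -source <script>.tcl`.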
Stage descriptions:
- create_project.tcl: Creates a Vivado project targeting part xck26-sfvc784-2LV-c (KV260 ZU5EV) at build/pccx_v002_kv260/. Parses filelist.f to populate the sources_1 fileset, then adds every *.xdc from hw/constraints/ to constrs_1.
- synth.tcl: Runs synth_design -mode out_of_context -flatten_hierarchy rebuilt. On completion, writes utilization_post_synth.rpt, clocks_post_synth.rpt, timing_summary_post_synth.rpt, and drc_post_synth.rpt to build/reports/. Check WNS in the timing summary before proceeding to implementation.
- impl.tcl: Verifies that synth_1 progress is 100%, then launches impl_1 -to_step write_bitstream -jobs 4. On success, the bitstream is copied to build/pccx_v002_kv260.bit. Implementation is an hour-scale job; run it only after OOC synthesis is clean.
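The WNS check between synthesis and implementation can be scripted. The sketch below is not part of the build scripts; the helper names `extract_wns` and `wns_is_met` are hypothetical. It assumes the usual layout of a Vivado timing summary report: a header row beginning with `WNS(ns)`, a row of dashes, then the row of values.

```shell
# Print the design-level WNS from a Vivado timing summary report.
# ASSUMPTION: the first row whose first column is "WNS(ns)" is the design
# summary header, followed by a dashes row and then the values row.
extract_wns() {
    awk '
        $1 == "WNS(ns)"            { header = NR; next }
        header && NR == header + 2 { print $1; found = 1; exit }
        END                        { if (!found) exit 1 }
    ' "$1"
}

# Succeed only when WNS is non-negative, i.e. setup timing is met.
# Returns 2 when no WNS value could be found in the report.
wns_is_met() {
    local wns
    wns="$(extract_wns "$1")" || return 2
    awk -v w="$wns" 'BEGIN { exit !(w + 0 >= 0) }'
}
```

With these helpers, the implementation stage could be gated with something like `wns_is_met build/reports/timing_summary_post_synth.rpt && ./hw/vivado/build.sh impl`, adjusting the report path to wherever build/ is created in your checkout.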
Out-of-context (OOC) mode is required because NPU_top uses SystemVerilog
interface ports (axil_if / axis_if). OOC synthesis leaves those
ports unbound, allowing the core to be synthesised and checked in
isolation before a block-design wrapper is available.
SV Interface Wrapper
hw/vivado/npu_core_wrapper.sv converts NPU_top’s SystemVerilog
interface ports into plain AXI4-Lite and AXI4-Stream signal bundles so
the core can be packaged as a Vivado IP and placed alongside the Zynq PS
in a block design. The wrapper contains no registers and no CDC logic;
it performs one-to-one signal expansion only.
```systemverilog
module npu_core_wrapper #(
    parameter int AXIL_ADDR_W = 32,
    parameter int AXIL_DATA_W = 32,
    parameter int HP_DATA_W   = 128,
    parameter int ACP_DATA_W  = 128
) (
    // ===| Clocks and resets |==================================================
    input  logic                     clk_core,
    input  logic                     rst_n_core,
    input  logic                     clk_axi,
    input  logic                     rst_axi_n,
    input  logic                     i_clear,
    // ===| S_AXIL_CTRL (AXI4-Lite slave) |======================================
    input  logic [AXIL_ADDR_W-1:0]   s_axil_awaddr,
    input  logic                     s_axil_awvalid,
    output logic                     s_axil_awready,
    input  logic [AXIL_DATA_W-1:0]   s_axil_wdata,
    input  logic [AXIL_DATA_W/8-1:0] s_axil_wstrb,
    input  logic                     s_axil_wvalid,
    output logic                     s_axil_wready,
    output logic [1:0]               s_axil_bresp,
    output logic                     s_axil_bvalid,
    input  logic                     s_axil_bready,
    input  logic [AXIL_ADDR_W-1:0]   s_axil_araddr,
    input  logic                     s_axil_arvalid,
    output logic                     s_axil_arready,
    output logic [AXIL_DATA_W-1:0]   s_axil_rdata,
    output logic [1:0]               s_axil_rresp,
    output logic                     s_axil_rvalid,
    input  logic                     s_axil_rready,
    // ===| S_AXI_HP0..3_WEIGHT (AXIS slave, 128-bit each) |====================
    input  logic [HP_DATA_W-1:0]     s_axis_hp0_tdata,
    input  logic                     s_axis_hp0_tvalid,
    output logic                     s_axis_hp0_tready,
    input  logic [HP_DATA_W-1:0]     s_axis_hp1_tdata,
    input  logic                     s_axis_hp1_tvalid,
    output logic                     s_axis_hp1_tready,
    input  logic [HP_DATA_W-1:0]     s_axis_hp2_tdata,
    input  logic                     s_axis_hp2_tvalid,
    output logic                     s_axis_hp2_tready,
    input  logic [HP_DATA_W-1:0]     s_axis_hp3_tdata,
    input  logic                     s_axis_hp3_tvalid,
    output logic                     s_axis_hp3_tready,
    // ===| ACP FMap (AXIS slave) + Result (AXIS master) |======================
    input  logic [ACP_DATA_W-1:0]    s_axis_acp_fmap_tdata,
    input  logic                     s_axis_acp_fmap_tvalid,
    output logic                     s_axis_acp_fmap_tready,
    output logic [ACP_DATA_W-1:0]    m_axis_acp_result_tdata,
    output logic                     m_axis_acp_result_tvalid,
    input  logic                     m_axis_acp_result_tready
);
```
The external interface exposed by the wrapper is as follows.
| Port group | Direction | Width | Description |
|---|---|---|---|
| | Input | 1-bit | Core-domain clock and active-low reset |
| | Input | 1-bit | AXI-domain clock and active-low reset |
| | Slave | 32-bit | AXI4-Lite control channel (CMD_IN / STAT_OUT) |
| | Slave | 128-bit | AXI4-Stream HP ports × 4 (weight streaming) |
| | Slave | 128-bit | AXI4-Stream ACP FMap input |
| | Master | 128-bit | AXI4-Stream ACP result output |
The Vivado IP packager auto-infers AXI interfaces from plain signal
ports, so this wrapper allows NPU_top to be placed directly into a
block design alongside the Zynq PS.
Constraints
hw/constraints/pccx_timing.xdc is a timing-only constraint file.
Pin and IO constraints are absent; they are delegated to the block
design that wraps this core.
Two clock domains are declared.
| Clock name | Period | Frequency | Scope |
|---|---|---|---|
| | 4.000 ns | 250 MHz | AXI-Lite MMIO, CDC FIFO drain sides, DMA path |
| | 2.500 ns | 400 MHz | DSP48E2 array, GEMV lanes, CVO SFU |
The two domains are genuinely asynchronous. set_clock_groups -asynchronous is applied; every domain crossing uses a CDC FIFO or a
properly-staged reset synchroniser.
Additional path exceptions are set.
- False paths: reset bridge first-flop paths and XPM_FIFO_ASYNC Gray-coded pointer crossings.
- Multicycle path (setup 2, hold 1): from DSP48E2 P-registers inside the GEMM systolic array to mat_result_normalizer. The controller stalls new MACs during accumulator flush, so the drain path tolerates two cycles.
File Manifest
hw/vivado/filelist.f is the source list for both OOC synthesis and
xvlog lint. create_project.tcl parses this file to add sources to
the Vivado project.
Compile order follows the declaration order in the file. Packages and interfaces must appear before the modules that import them.
```
rtl/Constants/compilePriority_Order/B_device_pkg/device_pkg.sv
# ===| C: dtype / mem packages (depend on B) |=================================
rtl/Constants/compilePriority_Order/C_type_pkg/dtype_pkg.sv
rtl/Constants/compilePriority_Order/C_type_pkg/mem_pkg.sv
# ===| D: vector-core configuration package (depends on B+C) |=================
rtl/Constants/compilePriority_Order/D_pipeline_pkg/vec_core_pkg.sv
# ===| Library packages and interfaces |=======================================
```
The full file is organized into the following sections.
| Section | Contents |
|---|---|
| A (comment only) | |
| B | |
| C | |
| D | |
| Library | BF16 math library, algorithms, QUEUE interface |
| ISA | |
| MAT_CORE | GEMM systolic array and result packer |
| VEC_CORE | GEMV core (accumulate, LUT generation, reduction, top) |
| CVO_CORE | CORDIC unit, SFU, CVO top |
| PREPROCESS | BF16↔fixed-point pipeline, fmap cache |
| MEM_control | L2 cache, HP buffer, CVO bridge, dispatcher |
| NPU_Controller | AXI-Lite decoder, dispatcher, front-end, Global Scheduler, top |
| Top level | |
When adding a new .sv file, insert it in dependency order inside
filelist.f.
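After editing filelist.f, a quick sanity check can catch entries that point at missing files or appear twice. The sketch below is not part of the repository; `list_sources` and `check_filelist` are hypothetical helpers. It assumes the file holds one path per line with `#` comment lines, as in the excerpt above, and that the paths are relative to a base directory passed as the second argument. It does not verify dependency order, which still has to be checked by reading the file or by running the lint or synthesis step.

```shell
# Print the source entries of a filelist in compile order.
# ASSUMPTION: one path per line; lines starting with '#' are comments.
list_sources() {
    sed -e '/^[[:space:]]*#/d' -e '/^[[:space:]]*$/d' "$1"
}

# Report entries that are missing on disk or listed more than once.
# $1 = filelist, $2 = base directory the paths are relative to (assumed).
check_filelist() {
    local filelist="$1" base="${2:-.}" problems
    problems="$(
        list_sources "$filelist" | while IFS= read -r entry; do
            [ -f "$base/$entry" ] || printf 'missing: %s\n' "$entry"
        done
        list_sources "$filelist" | sort | uniq -d | sed 's/^/duplicate: /'
    )"
    [ -z "$problems" ] && return 0
    printf '%s\n' "$problems" >&2
    return 1
}
```

Running `check_filelist hw/vivado/filelist.f <base-dir>` prints nothing and returns 0 when every entry resolves to exactly one existing file.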