AI Edge Platform on Zynq FPGA
Artificial Neural Networks
Artificial neural networks (ANNs), or connectionist systems, are computing systems vaguely inspired by the biological neural networks that constitute animal brains. Such systems "learn" to perform tasks by considering examples, generally without being programmed with task-specific rules; in other words, you can get them to learn things, recognize patterns, and make decisions in a "human-like" way. The patterns recognized are numerical, contained in tensors, into which all real-world data, be it images, sound, text, or time series, must be translated.

(Laurence Moroney and Karmel Allison)
How does a neural network learn?
Information flows through a neural network in two different ways. Whether the model is learning (being trained) or operating normally (being used or tested after training), patterns of information from the dataset are fed into the network via the input neurons, which trigger the layers of hidden neurons, and these, in turn, arrive at the output neurons; this is called a feedforward network. Not all neurons “fire” all the time. Each neuron receives inputs from the neurons to its left, and the inputs are multiplied by the weights of the connections they travel along. Every neuron adds up all the inputs it receives in this way and, in the simplest kind of neural network, if the sum is more than a certain threshold value, the neuron “fires” and triggers the neurons it is connected to (the neurons on its right).
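To make this concrete, here is a minimal C++ sketch of a single artificial neuron of the kind just described; the inputs, weights, and threshold are arbitrary example values:

```cpp
#include <iostream>
#include <vector>

// Returns 1.0 if the weighted sum of the inputs exceeds the
// threshold (the neuron "fires"), 0.0 otherwise.
double neuron(const std::vector<double>& inputs,
              const std::vector<double>& weights,
              double threshold) {
    double sum = 0.0;
    for (std::size_t i = 0; i < inputs.size(); ++i)
        sum += inputs[i] * weights[i];  // each input scaled by its connection weight
    return sum > threshold ? 1.0 : 0.0;
}

int main() {
    std::vector<double> inputs  = {0.5, 0.3, 0.9};  // signals from neurons on the left
    std::vector<double> weights = {0.4, 0.7, 0.2};  // weights of the incoming connections
    std::cout << "fires: " << neuron(inputs, weights, 0.5) << "\n";  // 0.59 > 0.5, so it fires
}
```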
A practical definition of Deep Learning is:
Deep Learning refers to neural networks with multiple hidden layers that can learn increasingly abstract representations of the input data.
DNNDK™ (Deep Neural Network Development Kit)

DNNDK unleashes the productivity and efficiency of deploying AI inference on Xilinx Edge AI platforms.
The Deep Neural Network Development Kit (DNNDK) is a full-stack deep learning SDK for the Deep Learning Processor Unit (DPU). It provides a unified solution for deep neural network inference applications, covering pruning, quantization, compilation, optimization, and run-time support.
DNNDK consists of:
- DEep ComprEssioN Tool (DECENT)
- Deep Neural Network Compiler (DNNC)
- Neural Network Runtime (N2Cube)
The most important feature of DNNDK v3.0 is that the popular deep learning framework TensorFlow is officially supported by the DNNDK toolchain.
The TensorFlow framework is supported by DECENT, and since DNNDK v3.0 both CPU and GPU versions of the DECENT binary tool have been released for user convenience.
Caffe and TensorFlow are supported by a single DNNC binary tool.
The evaluation boards supported by this framework (as of version 3.1) are:
- Xilinx® ZCU102
- Xilinx ZCU104
- Avnet Ultra96

Zynq UltraScale+ MPSoC ZCU104 Evaluation Kit
The performance of these boards is very impressive. Take a look at the ZCU104 (figures based on the Xilinx AI Model Zoo, https://github.com/Xilinx/AI-Model-Zoo):
| Model | E2E latency (ms, single thread) | E2E throughput (fps, single thread) | E2E throughput (fps, multi-thread) |
| --- | --- | --- | --- |
| resnet50 | 12.13 | 82.45 | 151.8 |
| Inception_v1 | 5.07 | 197.333 | 404.933 |
| Inception_v2 | 6.33 | 158.033 | 310.15 |
| Inception_v3 | 16.03 | 62.3667 | 126.283 |
| mobileNet_v2 | 3.85 | 259.833 | 536.95 |
| tf_resnet50 | 11.31 | 88.45 | 163.65 |
| tf_inception_v1 | 6.35 | 157.367 | 305.467 |
| tf_mobilenet_v2 | 5.21 | 191.867 | 380.933 |
| ssd_adas_pruned_0.95 | 10.69 | 93.5333 | 242.917 |
| ssd_pedestrain_pruned_0.97 | 12.13 | 82.45 | 236.083 |
| ssd_traffic_pruned_0.9 | 16.48 | 60.6667 | 159.617 |
| ssd_mobilnet_v2 | 37.78 | 26.4667 | 116.433 |
| tf_ssd_voc | 75.09 | 13.3167 | 33.5667 |
| densebox_320_320 | 2.33 | 428.533 | 1167.35 |
| densebox_360_640 | 4.65 | 215.017 | 626.317 |
| yolov3_adas_prune_0.9 | 10.51 | 95.1667 | 228.383 |
| yolov3_voc | 66.37 | 15.0667 | 33 |
| tf_yolov3_voc | 66.74 | 14.9833 | 32.8 |
| refinedet_pruned_0.8 | 28 | 35.7167 | 79.1333 |
| refinedet_pruned_0.92 | 14.54 | 68.7833 | 160.6 |
| refinedet_pruned_0.96 | 10.39 | 96.2333 | 241.783 |
| FPN | 15.72 | 63.6167 | 177.333 |
| VPGnet_pruned_0.99 | 8.91 | 112.233 | 355.717 |
| SP-net | 1.6 | 626.5 | 1337.33 |
| Openpose_pruned_0.3 | 267.86 | 3.73333 | 12.1333 |
| yolov2_voc | 37.66 | 26.55 | 63.7833 |
| yolov2_voc_pruned_0.66 | 17.51 | 57.1167 | 158.917 |
| yolov2_voc_pruned_0.71 | 15.63 | 63.9667 | 186.867 |
| yolov2_voc_pruned_0.77 | 13.78 | 72.55 | 224.883 |
| Inception-v4 | 32.33 | 30.9333 | 64.6 |
| SqueezeNet | 3.52 | 284.033 | 940.917 |
| face_landmark | 1.02 | 977.683 | 1428.2 |
| reid | 2.45 | 407.583 | 702.717 |
| yolov3_bdd | 69.77 | 14.3333 | 31.7 |
| tf_mobilenet_v1 | 3.03 | 330.25 | 728.35 |
| resnet18 | 4.84 | 206.65 | 428.55 |
| resnet18_wide | 31.23 | 32.0167 | 62.7667 |
Key Features
- Provides a complete set of toolchains with compression, compilation, deployment, and profiling
- Supports mainstream frameworks and the latest models capable of diverse deep learning tasks
- Provides a lightweight standard C/C++ programming API (no RTL programming knowledge required)
- Scalable board support from cost-optimized to performance-driven platforms
- Supports system integration with both SDSoC and Vivado
DNNDK Framework
DNNDK is composed of Deep Compression Tool (DECENT), Deep Neural Network Compiler (DNNC), Deep Neural Network Assembler (DNNAS), Neural Network Runtime (N2Cube), DPU Simulator, and Profiler.
DECENT
The process of inference is computation-intensive and requires high memory bandwidth to satisfy the low-latency and high-throughput requirements of edge applications. The Deep Compression Tool, DECENT, employs coarse-grained pruning, trained quantization, and weight sharing to address these issues while achieving high performance and high energy efficiency with very small accuracy degradation.
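To give a flavour of what quantization involves, the sketch below shows generic symmetric linear quantization of floating-point weights to 8-bit fixed-point values, the kind of arithmetic a DPU favours. This is an illustration of the idea only, not DECENT's actual algorithm:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative symmetric linear quantization: map floats to int8 using a
// power-of-two scale (frac_bits fractional bits). Generic sketch only;
// DECENT's real algorithm also calibrates scales per layer.
std::vector<int8_t> quantize(const std::vector<float>& w, int frac_bits) {
    const float scale = static_cast<float>(1 << frac_bits);
    std::vector<int8_t> q(w.size());
    for (std::size_t i = 0; i < w.size(); ++i) {
        // Round to nearest and saturate to the int8 range.
        float v = std::round(w[i] * scale);
        q[i] = static_cast<int8_t>(std::clamp(v, -128.0f, 127.0f));
    }
    return q;
}

// Map a quantized value back to float, e.g. to estimate the accuracy loss.
float dequantize(int8_t q, int frac_bits) {
    return static_cast<float>(q) / static_cast<float>(1 << frac_bits);
}
```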
DNNC
DNNC™ (Deep Neural Network Compiler) is the dedicated proprietary compiler designed for the DPU. It maps the neural network algorithm to the DPU instructions, achieving maximum utilization of DPU resources by balancing computing workload and memory access.
DNNAS
The Deep Neural Network Assembler (DNNAS) is responsible for assembling DPU instructions into ELF binary code. It is part of the DNNC code-generation back end and cannot be invoked alone.
Profiler
The DPU profiler is composed of two components: the DPU tracer and DSight. The DPU tracer is implemented in the DNNDK runtime N2Cube and is responsible for gathering raw profiling data while running neural networks on the DPU. From the raw profiling data, DSight can generate visual charts for performance analysis.
N2Cube
The Cube of Neural Networks (N2Cube) is the DPU runtime engine. It acts as the loader for DNNDK applications and handles resource allocation and DPU scheduling. Its core components include the DPU driver, DPU loader, tracer, and programming APIs for application development.

Network Deployment Overview
There are two stages for developing deep learning applications: training and inference.
The DNNDK toolchain provides an innovative workflow to efficiently deploy deep learning inference applications on the DPU with five simple steps:
1. Compress the neural network model.
2. Compile the neural network model.
3. Program with DNNDK APIs.
4. Compile the hybrid DPU application.
5. Run the hybrid DPU executable.
Programming with DNNDK
To develop deep learning applications on the DPU, three types of work must be done (see the sketch after this list):
- Use DNNDK APIs to manage DPU kernels:
  - DPU kernel creation and destruction
  - DPU task creation
  - Managing input and output tensors
- Implement kernels not supported by the DPU on the CPU.
- Add pre-processing and post-processing routines to read in data or calculate results.
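The sketch below shows the typical shape of such an application using the N2Cube C/C++ API, modelled on the DNNDK sample code. The kernel name "resnet50" and the node names "conv1" and "fc1000" are placeholders; the real names depend on the model compiled by DNNC:

```cpp
// Minimal DNNDK application sketch (N2Cube API, DNNDK v3.x header layout).
#include <dnndk/dnndk.h>
#include <vector>

int main() {
    dpuOpen();                                      // attach to the DPU device
    DPUKernel *kernel = dpuLoadKernel("resnet50");  // load the DNNC-compiled kernel
    DPUTask *task = dpuCreateTask(kernel, T_MODE_NORMAL);

    // Pre-processing on the CPU: fill the input tensor (placeholder data here;
    // a real application would load and normalize an image).
    std::vector<float> input(224 * 224 * 3, 0.0f);
    dpuSetInputTensorInHWCFP32(task, "conv1", input.data(),
                               static_cast<int>(input.size()));

    dpuRunTask(task);                               // run inference on the DPU

    // Post-processing on the CPU: fetch the output tensor for, e.g., softmax.
    int size = dpuGetOutputTensorSize(task, "fc1000");
    std::vector<float> output(size);
    dpuGetOutputTensorInHWCFP32(task, "fc1000", output.data(), size);

    dpuDestroyTask(task);
    dpuDestroyKernel(kernel);
    dpuClose();
    return 0;
}
```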

Conclusions
We have presented the DNNDK tool for the FPGA integration of deep neural networks and convolutional neural networks. If you need more information about this tool, or if you need support or consultancy, please send us an email at
[email protected]
Could you please share the performance data of the ZCU102 and Ultra96 on the Xilinx Model Zoo with me? I want to choose which board to use in my project.
Hi, thanks for your interest. What kind of performance data do you need? The ZCU102 and Ultra96 are very different in terms of speed and power consumption.
If you only need latency figures for the neural networks, you can visit the Vitis AI Model Zoo repository (https://github.com/Xilinx/Vitis-AI/tree/master/models/AI-Model-Zoo).