AI Edge Platform on Zynq FPGA

Artificial Neural Networks

Artificial neural networks (ANNs), also called connectionist systems, are computing systems loosely inspired by the biological neural networks that make up animal brains. Such systems "learn" to perform tasks by considering examples, generally without being programmed with task-specific rules, so they can learn, recognize patterns, and make decisions in a "human-like" way. The patterns they recognize are numerical and are stored in tensors, into which all real-world data (images, sound, text, or time series) must be translated.

Difference between traditional programming and machine learning
(Laurence Moroney and Karmel Allison)

How does a neural network learn?

Information flows through a neural network in two different ways. When the model is learning (being trained) or operating normally (being used or tested after training), patterns of information from the dataset are fed into the network through the input neurons, which trigger the layers of hidden neurons, which in turn trigger the output neurons; this design is called a feedforward network. Not all neurons “fire” all the time. Each neuron receives inputs from the neurons in the previous layer (to its left), and each input is multiplied by the weight of the connection it travels along. Every neuron adds up all the inputs it receives in this way and, in the simplest kind of network, if the sum exceeds a certain threshold value, the neuron “fires” and triggers the neurons it is connected to in the next layer (to its right). The second flow occurs only during training: the difference between the network's output and the expected output is propagated backward through the layers and used to adjust the weights, a process called backpropagation.
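The forward pass described above can be sketched in a few lines of Python. This is a minimal illustration, not a trained model: the weights, the threshold, and the helper names `fires` and `layer` are all hand-picked assumptions chosen to make the arithmetic easy to follow.

```python
def fires(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > threshold else 0


def layer(inputs, weight_rows, threshold):
    """Apply one layer: each row of weights belongs to one neuron."""
    return [fires(inputs, row, threshold) for row in weight_rows]


# Illustrative, hand-picked values (not trained).
THRESHOLD = 0.5
inputs = [1.0, 0.0, 1.0]
hidden_weights = [
    [0.6, 0.2, 0.5],  # hidden neuron 0: weighted sum 1.1, fires
    [0.1, 0.9, 0.2],  # hidden neuron 1: weighted sum 0.3, stays silent
]
output_weights = [
    [0.7, 0.7],       # single output neuron: weighted sum 0.7, fires
]

hidden = layer(inputs, hidden_weights, THRESHOLD)
output = layer(hidden, output_weights, THRESHOLD)
print(hidden, output)
```

Only one of the two hidden neurons fires for this input, which shows why not all neurons are active all the time. Practical networks replace the hard threshold with smooth activation functions so that the weights can be adjusted gradually during training.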

A practical definition of Deep Learning is:
Deep Learning refers to neural networks with multiple hidden layers that can learn increasingly abstract representations of the input data.

DNNDK™ (Deep Neural Network Development Kit)

DNNDK unleashes the productivity and efficiency of deploying AI inference on Xilinx Edge AI platforms.
Deep Neural Network Development Kit (DNNDK) is a full-stack deep learning SDK for the Deep Learning Processor Unit (DPU). It provides a unified solution for deep neural network inference applications, covering pruning, quantization, compilation, optimization, and run-time support.
DNNDK consists of:

  • DEep ComprEssioN Tool (DECENT)
  • Deep Neural Network Compiler (DNNC)
  • Neural Network Runtime (N2Cube)

The most important feature of DNNDK v3.0 is that the DNNDK toolchain officially supports the popular deep learning framework TensorFlow.
DECENT supports the TensorFlow framework and, since DNNDK v3.0, DECENT binaries are released in both CPU and GPU versions for user convenience.
A single DNNC binary supports both Caffe and TensorFlow.

The evaluation boards supported by DNNDK (as of version 3.1) are:

  • Xilinx® ZCU102
  • Xilinx ZCU104
  • Avnet Ultra96
A Zynq architecture evaluation board:
Zynq UltraScale+ MPSoC ZCU104 Evaluation Kit

Performance on these boards is strong. The following figures for the ZCU104 are based on the Xilinx AI Zoo:

Model                       E2E latency (ms)  E2E throughput (fps)  E2E throughput (fps)
                            Thread num=1      (Single Thread)       (Multi Thread)
resnet50                    12.13             82.45                 151.8
Inception_v1                5.07              197.333               404.933
Inception_v2                6.33              158.033               310.15
Inception_v3                16.03             62.3667               126.283
mobileNet_v2                3.85              259.833               536.95
tf_resnet50                 11.31             88.45                 163.65
tf_inception_v1             6.35              157.367               305.467
tf_mobilenet_v2             5.21              191.867               380.933
ssd_adas_pruned_0.95        10.69             93.5333               242.917
ssd_pedestrain_pruned_0.97  12.13             82.45                 236.083
ssd_traffic_pruned_0.9      16.48             60.6667               159.617
ssd_mobilnet_v2             37.78             26.4667               116.433
tf_ssd_voc                  75.09             13.3167               33.5667
densebox_320_320            2.33              428.533               1167.35
densebox_360_640            4.65              215.017               626.317
yolov3_adas_prune_0.9       10.51             95.1667               228.383
yolov3_voc                  66.37             15.0667               33
tf_yolov3_voc               66.74             14.9833               32.8
refinedet_pruned_0.8        28                35.7167               79.1333
refinedet_pruned_0.92       14.54             68.7833               160.6
refinedet_pruned_0.96       10.39             96.2333               241.783
FPN                         15.72             63.6167               177.333
VPGnet_pruned_0.99          8.91              112.233               355.717
SP-net                      1.6               626.5                 1337.33
Openpose_pruned_0.3         267.86            3.73333               12.1333
yolov2_voc                  37.66             26.55                 63.7833
yolov2_voc_pruned_0.66      17.51             57.1167               158.917
yolov2_voc_pruned_0.71      15.63             63.9667               186.867
yolov2_voc_pruned_0.77      13.78             72.55                 224.883
Inception-v4                32.33             30.9333               64.6
SqueezeNet                  3.52              284.033               940.917
face_landmark               1.02              977.683               1428.2
reid                        2.45              407.583               702.717
yolov3_bdd                  69.77             14.3333               31.7
tf_mobilenet_v1             3.03              330.25                728.35
resnet18                    4.84              206.65                428.55
resnet18_wide               31.23             32.0167               62.7667
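A quick sanity check helps when reading these numbers: the single-thread throughput appears to be close to the reciprocal of the end-to-end latency (1000 ms divided by the latency). The sketch below checks this for three rows copied from the table and also reports the ratio of the last column to the single-thread column. The reciprocal relationship is an observation about this data, not a documented formula, and the last column is assumed here to be the multi-threaded measurement; the helper name `fps_from_latency` is hypothetical.

```python
# (latency_ms, single_thread_fps, last_column_fps), copied from the table above.
rows = {
    "resnet50": (12.13, 82.45, 151.8),
    "mobileNet_v2": (3.85, 259.833, 536.95),
    "face_landmark": (1.02, 977.683, 1428.2),
}


def fps_from_latency(latency_ms):
    """Frames per second if frames are processed strictly one after another."""
    return 1000.0 / latency_ms


for name, (latency_ms, single_fps, multi_fps) in rows.items():
    estimate = fps_from_latency(latency_ms)
    gain = multi_fps / single_fps
    print(f"{name}: estimated {estimate:.1f} fps, "
          f"reported {single_fps:.1f} fps, last column is {gain:.2f}x single-thread")
```

The gain from the last column varies by model, so the latency figure alone does not predict the best achievable throughput.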

Key Features

  • Provides a complete set of toolchains with compression, compilation, deployment, and profiling.
  • Supports mainstream frameworks and the latest models capable of diverse deep learning tasks
  • Provides a lightweight standard C/C++ programming API (no RTL programming knowledge required)
  • Scalable board support from cost-optimized to performance-driven platforms
  • Supports system integration with both SDSoC and Vivado

DNNDK Framework

DNNDK is composed of Deep Compression Tool (DECENT), Deep Neural Network Compiler (DNNC), Deep Neural Network Assembler (DNNAS), Neural Network Runtime (N2Cube), DPU Simulator, and Profiler.


Inference is computation-intensive and requires high memory bandwidth to meet the low-latency and high-throughput requirements of edge applications. The Deep Compression Tool (DECENT) employs coarse-grained pruning, trained quantization, and weight sharing to address these issues, achieving high performance and high energy efficiency with very small accuracy degradation.
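To make two of these ideas concrete, here is a generic sketch of coarse-grained pruning (dropping whole channels with the smallest L1 norm) followed by symmetric 8-bit quantization. It is not DECENT's actual algorithm, which the text above does not detail: the ranking criterion, the int8 range, the sample weights, and the function names are all assumptions for illustration, and weight sharing is not shown.

```python
def prune_channels(channels, keep_ratio):
    """Coarse-grained pruning: drop whole channels with the smallest L1 norm."""
    keep = max(1, round(len(channels) * keep_ratio))
    ranked = sorted(
        range(len(channels)),
        key=lambda i: sum(abs(w) for w in channels[i]),
        reverse=True,
    )
    kept = sorted(ranked[:keep])  # preserve the original channel order
    return [channels[i] for i in kept]


def quantize_int8(weights):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Map quantized integers back to approximate float values."""
    return [q * scale for q in q_weights]


# Four channels of three weights each (made-up values).
channels = [
    [0.50, -0.25, 0.10],
    [0.01, 0.02, -0.01],   # very small L1 norm: pruned
    [-0.80, 0.40, 0.20],
    [0.03, -0.02, 0.01],   # very small L1 norm: pruned
]

pruned = prune_channels(channels, keep_ratio=0.5)
flat = [w for channel in pruned for w in channel]
q_weights, scale = quantize_int8(flat)
restored = dequantize(q_weights, scale)
max_error = max(abs(a - b) for a, b in zip(flat, restored))
print(pruned)
print(q_weights, scale, max_error)
```

Pruning halves the number of weights in this example, and quantization stores each remaining weight in 8 bits with a rounding error of at most half a quantization step. In practice the network is usually fine-tuned after such steps to recover accuracy.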


DNNC™ (Deep Neural Network Compiler) is the dedicated proprietary compiler designed for the DPU. It maps the neural network algorithm to DPU instructions to achieve maximum utilization of DPU resources by balancing computing workload and memory access.


The Deep Neural Network Assembler (DNNAS) is responsible for assembling DPU instructions into ELF binary code. It is part of the DNNC code-generation back end and cannot be invoked on its own.


The DPU profiler is composed of two components: DPU tracer and DSight. DPU tracer is implemented in the DNNDK runtime, N2Cube, and is responsible for gathering raw profiling data while neural networks run on the DPU. DSight then uses this raw profiling data to generate charts for performance analysis.


The Cube of Neural Networks (N2Cube) is the DPU runtime engine. It acts as the loader for DNNDK applications and handles resource allocation and DPU scheduling. Its core components include the DPU driver, DPU loader, tracer, and programming APIs for application development.

The optimization process of a DNN

Network Deployment Overview

There are two stages for developing deep learning applications: training and inference.
The DNNDK toolchain provides an innovative workflow to efficiently deploy deep learning inference applications on the DPU with five simple steps:

1. Compress the neural network model.
2. Compile the neural network model.
3. Program with DNNDK APIs.
4. Compile the hybrid DPU application.
5. Run the hybrid DPU executable.

Programming with DNNDK

To develop deep learning applications on the DPU, three types of work must be done:

  • Use DNNDK APIs to manage DPU kernels:
      • DPU kernel creation and destruction;
      • DPU task creation;
      • Managing input and output tensors;
  • Implement kernels not supported by the DPU on the CPU;
  • Add pre-processing and post-processing routines to read in data or calculate results.
DNNDK Programming architecture
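The division of work listed above can be sketched with a small stand-in runtime. Everything below is hypothetical: `FakeKernel` and `FakeTask` are not DNNDK classes, the "accelerated" step is a placeholder computation, and running softmax on the CPU is only an assumed example of a layer left outside the DPU. The real DNNDK API needs the board runtime, so it is not used here.

```python
import math


class FakeKernel:
    """Hypothetical stand-in for a compiled DPU kernel (not a DNNDK class)."""

    def __init__(self, name):
        self.name = name
        self.loaded = True

    def destroy(self):
        self.loaded = False


class FakeTask:
    """Hypothetical stand-in for one running instance of a kernel."""

    def __init__(self, kernel):
        self.kernel = kernel
        self.input = None
        self.output = None

    def set_input(self, tensor):
        self.input = list(tensor)

    def run(self):
        # Placeholder for the accelerated part: a fixed linear map.
        self.output = [2.0 * x for x in self.input]

    def get_output(self):
        return list(self.output)


def preprocess(pixels, mean=128.0):
    """CPU pre-processing: subtract a mean value from the raw pixel values."""
    return [p - mean for p in pixels]


def softmax(logits):
    """CPU-side layer, assumed here not to run on the accelerator."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(probs):
    """CPU post-processing: index of the most likely class."""
    return max(range(len(probs)), key=lambda i: probs[i])


kernel = FakeKernel("demo_net")               # kernel creation
task = FakeTask(kernel)                       # task creation
task.set_input(preprocess([130, 127, 129]))   # input tensor
task.run()
probs = softmax(task.get_output())            # output tensor, then CPU layer
best = postprocess(probs)
kernel.destroy()                              # kernel destruction
print(best, probs)
```

The point of the structure is that the application code owns the kernel and task lifecycle and the data movement, while any layer the accelerator cannot execute becomes ordinary CPU code between the accelerated step and the final result.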


We have presented the DNNDK tool for integrating Deep Neural Networks and Convolutional Neural Networks on FPGAs. If you need more information about this tool, or need support or consultancy, please send us an email at
[email protected]