Accelerate AI Applications on Xilinx VCK190 Evaluation Kit with Design Gateway's IP Cores

By Design Gateway Co., Ltd.

2022-11-18

Xilinx's Versal AI Core series devices are designed to solve the unique and most difficult problem of AI inference by using a high computation efficiency ASIC-class AI compute engine together with a flexible programmable fabric to build an AI application with accelerators that maximize efficiency for any given workload while delivering low power and low latency.

The Versal AI Core series VCK190 evaluation kit features the VC1902 device which has the best AI performance in the portfolio. The kit is made for designs requiring high-throughput AI inference and signal processing computing performance. Delivering 100 times greater computing power than current server-class CPUs and having various connectivity options makes the VCK190 kit an ideal evaluation and prototyping platform for a diverse range of applications from the cloud to the edge.

Image of Xilinx Versal AI Core series VCK190 evaluation kit Figure 1: Xilinx Versal AI Core series VCK190 evaluation kit. (Image source: AMD, Inc)

Key features of the VCK190 evaluation kit

Onboard Versal AI Core Series device
- Equipped with Versal ACAP XCVC1902 production silicon
- AI and DSP Engines providing 100X greater compute performance over today’s server-class CPUs
- Pre-built partner reference designs for rapid prototyping
Latest Connectivity technology for leading edge application development
- Built-in PCIe® Gen4 Hard IP for high performance device interface such as NVMe SSD and Host Processors
- Built-in 100G EMAC Hard IP for high speed 100G network interfaces
- DDR4 and LPDDR4 memory interfaces
Co-Optimized Tools and Debug Method
- Vivado® ML, Vitis™ unified software platform, Vitis AI, AI Engine tools for AI Inference application development

AI Interface Acceleration with Xilinx's Versal AI Core Series devices

Diagram of Xilinx Versal AI Core VC1902 ACAP device block diagram Figure 2: Xilinx Versal AI Core VC1902 ACAP device block diagram. (Image source: AMD, Inc)

The Versal® AI Core adaptive compute acceleration platform (ACAP) is a highly integrated, multicore, heterogeneous device that can dynamically adapt at the hardware and software level for a wide range of AI workloads, making it ideal for AI edge computing applications or cloud accelerator cards. The platform integrates next-generation Scalar Engines for embedded compute, Adaptable Engines for hardware flexibility, and Intelligent Engines consisting of DSP Engines and revolutionary AI Engines for inference and signal processing. The result is an adaptable accelerator that exceeds the performance, latency, and power efficiency of traditional FPGAs and GPUs for AI/ML workloads.

Versal ACAP Platform highlight

Adaptable Engines:
- Custom memory hierarchy optimizes data movement and management for accelerator kernels
- Pre- and post-processing functions including neural network RT compression and image scaling
AI Engines (DPU)
- Tiled array of vector processors, up to 133 INT8 TOPS performance with XCVC1902 devices, called, Deep Learning Processing Unit or DPU
- Ideal for neural networks ranging from CNN, RNN, and MLP; hardware adaptable to optimize for evolving algorithms
Scalar Engines
- Quad-Core ARM processing subsystem, Platform management controller for security, power and bitstream management

VCK190 AI Inference performance

VCK190 is capable of delivering over 100X greater compute performance compared to current server class CPUs. Below is an example of performance based on AI Engine implementation by C32B6 DPU Core with batch = 6. Refer to the following table for the throughput performance (in frames/sec or fps) for various neural network samples on VCK190 with the DPU running at 1250 MHz.

No	Neural Network	Input Size	GOPS	Performance (fps) (Multiple thread)
1	face_landmark	96x72	0.14	24605.3
2	facerec_resnet20	112x96	3.5	5695.3
3	inception_v2	224x224	4	1845.8
4	medical_seg_cell_tf2	128x128	5.3	3036.3
5	MLPerf_resnet50_v1.5_tf	224x224	8.19	2744.2
6	RefineDet-Medical_EDD_tf	320x320	9.8	1283.6
7	tiny_yolov3_vmss	416x416	5.46	1424.4
8	yolov2_voc_pruned_0_77	448x448	7.8	1366.0

Table 1: Example of VCK190 AI Inference performance.

See more detail of VCK190 AI performance from Vitis AI Library User Guide (UG1354), r2.5.0 at https://docs.xilinx.com/r/en-US/ug1354-xilinx-ai-sdk/VCK190-Evaluation-Board

How Design Gateway's IP cores accelerate AI application performance?

Design Gateway's IP Cores are designed to handle Networking and Data Storage protocol without need for CPU intervention. This makes it ideal to fully offload CPU systems from complicated protocol processing and which enables them to utilize most of their computing power for AI applications including AI inference, pre and post data processing, user interface, network communication and data storage access for the best possible performance.

Figure 3: Block diagram of example an AI Application with Design Gateway's IP Cores. (Image source: Design Gateway)

Design Gateway's TCP Offload Engine IP (TOExxG-IP) performance

Processing high speed, high throughput TCP data streams over 10GbE or 25GbE by traditional CPU systems needs more than 50% of CPU time which reduces overall performance of AI applications. According to 10G TCP performance test on Xilinx's MPSoC Linux systems, CPU usage during 10GbE TCP transmission is more than 50%, TCP send and receive data transfer speed could be achieved just around 40% to 60% of 10GbE speed or 400 MB/s to 600 MB/s.

By implementing Design Gateway's TOExxG-IP Core, CPU usage for TCP transmission over 10GbE and 25GbE can be reduced to almost 0% while ethernet bandwidth utilization can be achieved close to 100%. This allows the sending and receiving of data over the TCP network directly by pure hardware logic and be fed into the Versal AI Engine with minimum CPU usage and the lowest possible latency. Figure 4 below shows the CPU usage and TCP transmission speed comparison between TOExxG-IP and MPSoC Linux systems.

Image of performance comparison of 10G/25G TCP transmission by MPSoC Linux systems Figure 4: Performance comparison of 10G/25G TCP transmission by MPSoC Linux systems and Design Gateway's TOExxG-IP Core. (Image source: Design Gateway)

Design Gateway’s TOExxG-IP for Versal devices

Diagram of TOExxG-IP systems overview Figure 5: TOExxG-IP systems overview. (Image source: Design Gateway)

The TOExxG-IP core implements the TCP/IP stack (in hardwire logic) and connects with Xilinx’s EMAC Hard IP and Ethernet Subsystem module for the lower-layer hardware interface with 10G/25G/100G Ethernet speed. The user interface of the TOExxG-IP consists of a Register interface for control signals and a FIFO interface for data signals. The TOExxG-IP is designed to connect with Xilinx's Ethernet subsystem through the AXI4-ST interface. The clock frequency of the user interface depends on the Ethernet interface speed (e.g., 156.625 MHz or 322.266 MHz).

TOExxG-IP’s features

Full TCP/IP stack implementation without need of the CPU
Supports one session with one TOExxG-IP
Multi-session can be implemented by using multiple TOExxG-IP instances
Support for both Server and Client mode (Passive/Active open and close)
Supports Jumbo frame
Simple data interface by standard FIFO interface
Simple control interface by single port RAM interface

FPGA resource usages on the XCVC1902-VSVA2197-2MP-ES FPGA device are shown in Table 2 below.

Family	Example Device	Fmax (MHz)	CLB Regs	CLB LUTs	Slice	IOB	BRAMTile¹	URAM	Design Tools
Versal AI Core	XCVC1902-VSVA2197-2MP-ES	350	11340	10921	2165	-	51.5	-	Vivado2021.2

Table 2: Example Implementation Statistics for Versal device.

More details of the TOExxG-IP are described in its datasheet which can be downloaded from Design Gateway’s website at the following links:

Design Gateway's NVMe Host Controller IP performance

NVMe Storage interface speed with PCIe Gen3 x4 or PCIe Gen4 x4 has data rates up to 32 Gbps and 64 Gbps. This is three to six times higher than 10GbE Ethernet speed. Processing complicated NVMe storage protocol by the CPU to achieve the highest possible disk access speed requires more CPU time than TCP protocol over 10GbE.

Design Gateway solved this problem by developing the NVMe IP core that is able to run as a standalone NVMe host controller, able to communicate with an NVMe SSD directly without the CPU. This enables a high efficiency and performance of the NVMe PCIe Gen3 and Gen4 SSD access, which simplifies the user interface and standard features for ease of usage without needing knowledge of the NVMe protocol. NVMe PCIe Gen4 SSD performance can achieve up to a 6 GB/s transfer speed with NVMe IP as shown in Figure 6.

Image of performance comparison of NVMe PCIe Gen3 and Gen4 SSD Figure 6: Performance comparison of NVMe PCIe Gen3 and Gen4 SSD with Design Gateway's NVMe-IP Core. (Image source: Design Gateway)

Design Gateway's NVMe-IP’s for Versal devices

Diagram of NVMe-IP systems overview Figure 7: NVMe-IP systems overview. (Image source: Design Gateway)

NVMe-IP’s features

Able to implement application layer, transaction layer, data link layer, and some parts of the physical layer to access the NVMe SSD without a CPU or external DDR memory
Operates with Xilinx PCIe Gen3 and Gen4 Hard IP
The ability to utilize BRAM and URAM as data buffers without needing an external memory interface
Supports six commands: Identify, Shutdown, Write, Read, SMART, and Flush (optional additional command support available)

FPGA resource usages on the XCVC1902-VSVA2197-2MP-E-S FPGA device are shown in Table 3.

Family	Example Device	Fmax (MHz)	CLB Regs	CLB LUTs	Slice	IOB	BRAMTile¹	URAM	Design Tools
Versal AI Core	XCVC1902-VSVA2197-2MP-E-S	375	6280	3948	1050	-	4	8	Vivado2022.1

Table 3: Example Implementation Statistics for Versal device.

More details of the NVMe-IP for Versal device are described in its datasheet which can be downloaded from Design Gateway’s website at the link below:

NVMe IP Core for Gen4 Xilinx Datasheet

Conclusion

Both the TOExxG-IP and NVMe-IP core could help accelerate AI application performance by fully offloading CPU systems from compute and memory intensive protocol such as TCP and NVMe Storage protocol which is critical for real-time AI application. This allows Xilinx's Versal AI Core series device to to perform AI inference and high-performance computing applications without bottleneck or delay from network and data storage protocol processing.

The VCK190 evaluation kit and Design Gateway’s network and storage IP solution enable the best possible performance in AI applications with the lowest possible FPGA resource usage and very high power efficiency on Xilinx's Versal AI Core device.

Disclaimer: The opinions, beliefs, and viewpoints expressed by the various authors and/or forum participants on this website do not necessarily reflect the opinions, beliefs, and viewpoints of DigiKey or official policies of DigiKey.

Accelerate AI Applications on Xilinx VCK190 Evaluation Kit with Design Gateway's IP Cores

Key features of the VCK190 evaluation kit

AI Interface Acceleration with Xilinx's Versal AI Core Series devices

Versal ACAP Platform highlight

VCK190 AI Inference performance

How Design Gateway's IP cores accelerate AI application performance?

Design Gateway's TCP Offload Engine IP (TOExxG-IP) performance

Design Gateway’s TOExxG-IP for Versal devices

TOExxG-IP’s features

Design Gateway's NVMe Host Controller IP performance

Design Gateway's NVMe-IP’s for Versal devices

NVMe-IP’s features

Conclusion

Related Product Highlight

About this author

INFORMATION

HELP

CONTACT US

FOLLOW US