Chapter 3: DPU Configuration 

 

DPU IP Product Guide

 

www.xilinx.com

 


PG338 (v1.2) March 26, 2019 

 

DSP Usage 

You can select whether DSP48E slices are used for accumulation in the DPU convolution module.
In low DSP usage mode, the DPU IP uses DSP slices only for multiplication in the convolution.
In high DSP usage mode, the DSP slices are used for both multiplication and accumulation.
As a result, high DSP usage consumes more DSP slices and fewer LUTs. The difference in logic
utilization between high and low DSP usage is shown in Table 9. The data was measured on the
Xilinx ZCU102 platform without the Depthwise Conv, Average Pooling, ReLU6, and Leaky ReLU features.

Table 9: Resources of Different DSP Usage

         High DSP Usage                     Low DSP Usage
Arch     LUT     Register  BRAM    DSP      Arch     LUT     Register  BRAM    DSP
B512     20177   31782     69.5    98       B512     20759   33572     69.5    66
B800     20617   35065     87      142      B800     21050   33752     87      102
B1024    27377   46241     101.5   194      B1024    29155   49823     101.5   130
B1152    28698   46906     117.5   194      B1152    30043   49588     117.5   146
B1600    30877   56267     123     282      B1600    33130   60739     123     202
B2304    34379   67481     161.5   386      B2304    37055   72850     161.5   290
B3136    38555   79867     203.5   506      B3136    41714   86132     203.5   394
B4096    40865   92630     249.5   642      B4096    44583   99791     249.5   514
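
The trade-off in Table 9 can be made concrete by subtracting the two halves of the table for each architecture. The following Python sketch is illustrative only and is not part of the DPU IP or the DNNDK tools. The dictionary names and the helper dsp_tradeoff are hypothetical; the numbers are copied from Table 9.

```python
# Table 9 values measured on ZCU102: arch -> (LUT, Register, BRAM, DSP).
# HIGH_DSP, LOW_DSP and dsp_tradeoff are illustrative names, not part of any Xilinx tool.
HIGH_DSP = {
    "B512":  (20177, 31782, 69.5, 98),
    "B800":  (20617, 35065, 87.0, 142),
    "B1024": (27377, 46241, 101.5, 194),
    "B1152": (28698, 46906, 117.5, 194),
    "B1600": (30877, 56267, 123.0, 282),
    "B2304": (34379, 67481, 161.5, 386),
    "B3136": (38555, 79867, 203.5, 506),
    "B4096": (40865, 92630, 249.5, 642),
}
LOW_DSP = {
    "B512":  (20759, 33572, 69.5, 66),
    "B800":  (21050, 33752, 87.0, 102),
    "B1024": (29155, 49823, 101.5, 130),
    "B1152": (30043, 49588, 117.5, 146),
    "B1600": (33130, 60739, 123.0, 202),
    "B2304": (37055, 72850, 161.5, 290),
    "B3136": (41714, 86132, 203.5, 394),
    "B4096": (44583, 99791, 249.5, 514),
}


def dsp_tradeoff(arch):
    """Return (DSP slices saved, extra LUTs, extra registers) when
    switching the given architecture from high to low DSP usage."""
    high_lut, high_reg, _, high_dsp = HIGH_DSP[arch]
    low_lut, low_reg, _, low_dsp = LOW_DSP[arch]
    return high_dsp - low_dsp, low_lut - high_lut, low_reg - high_reg


if __name__ == "__main__":
    for arch in HIGH_DSP:
        saved, extra_lut, extra_reg = dsp_tradeoff(arch)
        print(f"{arch}: {saved} DSP saved, {extra_lut:+d} LUT, {extra_reg:+d} Register")
```

For B4096, for example, low DSP usage saves 128 DSP slices at a cost of 3718 additional LUTs and 7161 additional registers. BRAM usage is identical in both modes for every architecture in the table.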
