Note: The DGX-2 will not operate with fewer than three PSUs.

 

WARNING: To avoid electric shock or fire, do not connect other power cords to the DGX-2. For more details, see B.6. Electrical Precautions.

 

Summary of Contents for DGX-2 SYSTEM

Page 1: ......

Page 2: ...4 1 4 2...

Page 3: ...7 4 1...

Page 4: ...9 1...

Page 5: ...12 1 12 1 3 12 3 1 12 6 1...

Page 6: ......

Page 7: ...s...

Page 8: ......

Page 9: ...Hardware Overview...

Page 10: ...ration Chapters 5 6 Network and storage configuration instructions Chapter 7 Special Features and Configuration Chapters 8 10 Software and firmware update instructions Chapter 11 How to use the BMC Ch...

Page 11: ......

Page 12: ...Note The DGX 2 will not operate with less than three PSUs WARNING To avoid electric shock or fire do not connect other power cords to the DGX 2 For more details see B 6 Electrical Precautions...

Page 13: ......

Page 14: ...IMPORTANT See the section Turning the DGX 2 On and Off for instructions on how to properly turn the system on or off...

Page 15: ......

Page 16: ...enp134s0f0 enp134s0f1...

Page 17: ...enp6s0 enp134s0f0 enp134s0f1...

Page 18: ...Mellanox ConnectX 5 Firmware Download...

Page 19: ...enp134s0f0 enp134s0f1...

Page 20: ...Note Some of the documentation listed below is not available at the time of publication See https docs nvidia com dgx for the latest status...

Page 21: ...NVIDIA Enterprise Support portal enterprisesupport nvidia com NVIDIA Enterprise Support Phone Numbers...

Page 22: ...installs Docker CE which uses the 172 17 xx xx subnet by default for Docker containers If the DGX 2 System is on the same subnet you will not be able to establish a network connection to the DGX 2 Sys...

Page 23: ...DGX 2 Server Front DGX 2 Server Back...

Page 24: ...Configuring Static IP Address for the BMC https bmc ip address...

Page 25: ......

Page 26: ...Network Ports Configuring Static IP Addresses for the Network Ports...

Page 27: ...Connecting to the DGX 2 Console...

Page 28: ...n credentials the default admin admin login will no longer work Note The BMC software will not accept sysadmin for a user name If you create this user name for the system log in sysadmin will not be av...

Page 29: ...th the option to configure the network manually https usn ubuntu com a Run the package manager sudo apt update b Upgrade to the latest version sudo apt full upgrade Note RAID 1 Rebuild in Progress Whe...

Page 30: ...https docs nvidia com dgx NVIDIA GPU Cloud for DGX https www nvidia com en us support enterprise...

Page 31: ...e installed by NVIDIA service personnel or trained Advanced Technology Program ATP installation partner If not performed by NVIDIA or an ATP partner your DGX 2 hardware warranty will be voided https d...

Page 32: ...r io nvidia cuda 10 0 runtime nvidia smi https docs nvidia com deeplearning frameworks user guide index html WARNING Risk of Danger Removing power cables or using Power Distribution Units PDUs to shut...

Page 33: ...which already has the nvidia docker2 package installed then the instructions for using the NVIDIA Container Runtime for Docker can still be used docker run gpus docker run gpus all docker run gpus 2 d...

Page 34: ...mes nvidia path usr bin nvidia container runtime runtimeArgs docker run docker run CAUTION If you build Docker images while nvidia is set as the default runtime make sure the build scripts executed by...

Page 35: ...Instructions for specifying the GPU architecture depend on the application and are beyond the scope of this document Consult the specific application build process for guidance...

Page 36: ...port https_proxy https username password host port no_proxy localhost 127 0 0 1 localaddress localdomain com HTTP_PROXY http username password host port FTP_PROXY ftp username password host port HTTPS...

Page 37: ...Acquire http proxy http myproxy server com 8080 Acquire ftp proxy ftp myproxy server com 8080 Acquire https proxy https myproxy server com 8080 https docs docker com engine admin systemd http proxy 1...

Page 38: ...verlay2 LimitMEMLOCK infinity LimitSTACK 67108864 Service ExecStart ExecStart usr bin dockerd H fd s overlay2 bip 192 168 127 1 24 fixed cidr 192 168 127 128 25 LimitMEMLOCK infinity LimitSTACK 671088...

Page 39: ...ps nvcr io v2 2018 08 01 19 42 58 https nvcr io v2 Resolving nvcr io nvcr io 52 8 131 152 52 9 8 8 Connecting to nvcr io nvcr io 52 8 131 152 443 connected HTTP request sent awaiting response 401 Unau...

Page 40: ...oard directly to the DGX 2 System sudo ipmitool lan print 1 Set the IP address source to static sudo ipmitool lan set 1 ipsrc static Set the appropriate address information sudo ipmitool lan set 1 ipa...

Page 41: ......

Page 42: ......

Page 43: ......

Page 44: ...Note If you cannot access the DGX 2 System remotely then connect a display 1440x900 or lower resolution and keyboard directly to the DGX 2 System...

Page 45: ...c netplan 01 netcfg yaml network version 2 renderer networkd ethernets port designation dhcp4 no dhcp6 no addresses 10 10 10 2 24 gateway4 10 10 10 1 nameservers search mydomain other domain addresses...

Page 46: ...boot the system https help ubuntu com lts serverguide network configuration html en sudo mst start sudo mst status not MST modules MST PCI module is not loaded MST PCI configuration module is not load...

Page 47: ...r reg 88 data reg 92 Chip revision is 00 dev mst mt4119_pciconf4 PCI configuration cycles access domain bus dev fn 0000 86 00 0 addr reg 88 data reg 92 Chip revision is 00 dev mst mt4119_pciconf5 PCI...

Page 48: ...E_P1 IB 1 Device 4 Device type ConnectX5 Device 0000 e1 00 0 LINK_TYPE_P1 IB 1 Device 5 Device type ConnectX5 Device 0000 35 00 0 LINK_TYPE_P1 IB 1 Device 6 Device type ConnectX5 Device 0000 5d 00 0 L...

Page 49: ...4 set LINK_TYPE_P1 2 sudo mlxconfig y d dev mst mt4119_pciconf5 set LINK_TYPE_P1 2 sudo mlxconfig y d dev mst mt4119_pciconf6 set LINK_TYPE_P1 2 sudo mlxconfig y d dev mst mt4119_pciconf7 set LINK_TYP...

Page 50: ..._pciconf0 set LINK_TYPE_P1 1 sudo mlxconfig y d dev mst mt4119_pciconf1 set LINK_TYPE_P1 1 sudo mlxconfig y d dev mst mt4119_pciconf2 set LINK_TYPE_P1 1 sudo mlxconfig y d dev mst mt4119_pciconf3 set...

Page 51: ...IB 1 Device 5 Device type ConnectX5 Device 0000 35 00 0 LINK_TYPE_P1 IB 1 Device 6 Device type ConnectX5 Device 0000 5d 00 0 LINK_TYPE_P1 IB 1 Device 7 Device type ConnectX5 Device 0000 e6 00 0 LINK_...

Page 52: ...Configure an NFS mount for the DGX 2 System a Edit the filesystem tables configuration sudo vi etc fstab b Add a new line for the NFS mount using the local mount point of mnt nfs_server export_path mn...

Page 53: ...Verify the NFS server is reachable ping nfs_server Mount the NFS export sudo mount mnt mnt Verify caching is enabled cat proc fs nfsfs volumes...

Page 54: ...ter MaxQ is not supported on DGX 2H systems Commands to switch to MaxP or MaxQ or to see the current power state are not supported on DGX 2H systems Setting MaxP MaxQ is not supported on DGX 2 systems...

Page 55: ...sudo nvsm set powermode maxp sudo nvsm show chassis localhost...

Page 56: ...usr sbin dgx kdump config enable dmesg dump usr sbin dgx kdump config enable vmcore dump usr sbin dgx kdump config disable ipmitool I lanplus H bmc ip address U BMC USERNAME P BMC PASSWORD sol activat...

Page 57: ...Managing CPU Mitigations cat sys devices system cpu vulnerabilities Mitigation...

Page 58: ...al RSB filling Mitigation Clear CPU buffers SMT vulnerable Vulnerable KVM Vulnerable Mitigation PTE Inversion VMX vulnerable Vulnerable SMT vulnerable Vulnerable Vulnerable Vulnerable __user pointer s...

Page 59: ...ilities nvidia smi i 3 q grep UUID etc modprobe d nvidia conf nvidia smi q egrep GPU 00000000 GPU UUID options nvidia NVreg_GpuBlacklist gpu uuid GPU 00000000 34 00 0 GPU UUID GPU 00000000 36 00 0 GPU...

Page 60: ...sudo update initramfs u dracut force boot initramfs uname r img uname r sudo reboot...

Page 61: ...SB Flash Drive Note The DGX OS Server software is restored on one of the two NVMe M 2 drives When the system is booted after restoring the image software RAID begins the process of rebuilding the RAID 1 a...

Page 62: ...rprise Support Re Imaging the System from a USB Flash Drive DGX 2 Obtaining the DGX 2 Software ISO Image and Checksum File Install DGX Server without formatting RAID Retaining the RAID Partition While...

Page 63: ...Up the DGX 2 System Note If you are restoring the software image remotely through the BMC you do not need a bootable installation medium and you can omit this task Creating a Bootable USB Flash Drive b...

Page 64: ...o perform a simple file copy of the image the resulting flash drive may not be bootable lsblk sudo dd if path to software image bs 2048 of usb drive device name CAUTION The dd command erases all data...

Page 65: ...Akeo Reliable USB Formatting Utility Rufus DGX 2 DGX 2 Akeo Reliable USB Formatting Utility Rufus Start...

Page 66: ...de OK Re Imaging the System Remotely DGX 2 Install DGX Server without formatting RAID Retaining the RAID Partition While Installing the OS Enter Note The Mellanox InfiniBand driver installation may ta...

Page 67: ...GX 2 System Install DGX Software Install DGX Server without formatting RAID RUN yes etc default cachefilesd raid etc fstab cachefilesd raid etc default cachefiled etc fstab sudo mount raid systemctl s...

Page 68: ...ease wget O f3 usarchive http us archive ubuntu com ubuntu dists bionic Release wget O f4 security http security ubuntu com ubuntu dists bionic Release wget O f5 download http download docker com linu...

Page 69: ...that you installed yourself If you want to prevent an application from being updated you can instruct the Ubuntu package manager to keep the current version For more information see Introduction to H...

Page 70: ...r Do not update the DGX 2H firmware using an earlier container as this will result in version mismatch with the DGX 2H DGX 2 System Firmware Update Container Release Notes Caution Stop all unnecessary...

Page 71: ...software assists in determining the activity state of the DGX system and provides a warning if it detects that activity levels are above a predetermined threshold If the warning is encountered you ar...

Page 72: ..._18 09 3 tar gz REPOSITORY TAG IMAGE ID CREATED SIZE nvfw dgx2_18 09 3 latest aa681a4ae600 1 hours ago 278MB nvfw_dgx2_version nvfw_dgx2 tag tag nvfw dgx2_19 03 1 tar gz REPOSITORY TAG IMAGE ID CREATE...

Page 73: ...ge name update_fw f target target all SBIOS BMC Note Other components may be supported beyond those listed here Query the firmware manifest to see all the components supported by the container Additio...

Page 74: ...oceed with firmware update Y N Y Note While the progress output shows the current and manifest firmware versions the versions may be truncated due to space limitations You can confirm the updated vers...

Page 75: ...2 19 03 1 update_fw SBIOS Following components will be updated with new firmware version IMPORTANT Firmware update is disruptive and may require system reboot Stop system activities before performing...

Page 76: ...fw target sudo docker run rm privileged v hostfs image name show_fw_manifest sudo docker run rm privileged v hostfs image name show_version sudo docker run rm privileged ti v hostfs image name update_...

Page 77: ...name chmod x run file name run sudo run file name run update_fw all sudo run file name run show_fw_manifest sudo run file name run show_version sudo run file name run update_fw target sudo run file na...

Page 78: ...te backup bmc Note The ability to update the secondary SBIOS using the firmware update container is available only with firmware update container version 19 12 1 and later sudo run file name run sbios...

Page 79: ...nvsm dump health https nvid nvidia com dashboard...

Page 80: ...https bmc ip address...

Page 81: ...ware version the following quick links may appear Maintenance Firmware Update Settings NbMeManagement NvMe P3700Vpd Info Do not access these tasks using the Quick Links dropdown menu as the resulting...

Page 82: ......

Page 83: ...om versions earlier than 01 04 02 using the BMC UI as the sensor data record SDR is erroneously preserved which can result in the BMC UI reporting a critical 3V Battery sensor error This is corrected...

Page 84: ...It is strongly recommended that you create a unique password as soon as possible https bmc ip address...

Page 85: ...hpm hpm...

Page 86: ...telinit 1 umount raid sync ipmitool chassis power off...

Page 87: ...https www linux kvm org Note NVIDIA KVM is also supported on the NVIDIA DGX 2H References to DGX 2 in this chapter also apply to DGX 2H...

Page 88: ...on to switch to languages such as Chinese To set up a guest VM for a different language install the appropriate language pack onto the guest VM Example of installing a Chinese language pack Guest vm 2...

Page 89: ...KVM Performance Tuning section of the DGX Best Practices guide https linux die net man 1 virsh nvidia vm nvidia vm nvidia vm sudo nvidia vm help...

Page 90: ...ther operations sudo apt get update sudo apt install y dgx bionic updates repo sudo apt update sudo apt full upgrade y sudo apt cache policy dgx kvm image dgx kvm sw sudo apt get install dgx kvm sw dg...

Page 91: ...date sudo apt full upgrade y CAUTION Reverting the server back to a bare metal system destroys all guest GPU VMs that were created as well as any data Be sure to save your data before removing the KVM...

Page 92: ...s assigned 8 GPUs from index 0 through 7 my lab vm2 1g0 This VM is assigned 1 GPU from index 0 my lab vm3 4g8 11 This VM is assigned 4 GPUs from index 8 through 11 nvidia vm About nvidia vm Syntax sud...

Page 93: ...0 2 4 6 8 10 12 14 4 0 4 8 12 8 0 8 16 0 image Managing the Images user data Using cloud init to Initialize the Guest VM meta data Using cloud init to Initialize the Guest VM options Command Help sudo...

Page 94: ...rted sudo nvidia vm create gpu count 2 gpu index 2 image dgx kvm image 4 1 1 rootTue1308 2g2 3 dgx kvm image 4 1 1 Note If you encounter the following message when creating a VM Error setting up logfi...

Page 95: ...g Cloud init Releases the CPUs memory GPUs and NVLink Retains allocation of the OS and data disks Note Since allocation of the OS and data disks is retained the creation of other VMs is still impacte...

Page 96: ...vidia vm as explained in About nvidia vm Syntax sudo nvidia vm delete domain vm domain Command Help sudo nvidia vm delete help Command Examples sudo nvidia vm delete domain dgx2vm labTue1308 4g12 15 s...

Page 97: ...s Protocol Address lo 00 00 00 00 00 00 ipv4 127 0 0 1 8 ipv6 1 128 enp1s0 52 54 00 1e 23 2b ipv4 10 120 28 219 24 ipv6 fe80 5054 ff fe1e 232b 64 enp2s0 52 54 00 1b 3b 1c ipv4 192 168 254 150 24 ipv6...

Page 98: ...ge the Login Credentials Add SSH Keys Creating a new user account sudo useradd m new username p new password Deleting the nvidia user account deluser r nvidia sudo usermod a G libvirt new username sud...

Page 99: ...e If the source image is ever uninstalled the guest VM may not work properly To keep guest VMs running uninterrupted save the KVM source image to another location before uninstalling it nvidia vm About...

Page 100: ...Docker Version 2 0 3 docker18 09 4 1 Nvidia Container Runtime Version 2 0 0 docker18 09 4 1 Libnvidia Container Version 1 0 2 1 nvidia vm image install Syntax sudo nvidia vm image install kvm image Ex...

Page 101: ...o bare metal for example to recover space on the Hypervisor or to upgrade to a newer image then you should make a copy of the image first A KVM guest VM runs a thin provisioned copy of the source image I...

Page 102: ...dev vda1 dev vdb1 raid Resource Allocation Show storage pool virsh pool list Name State Autostart dgx kvm pool active yes Create a VM sudo nvidia vm create gpu count 1 gpu index 0...

Page 103: ...rootTue1616 1g0 raid dgx kvm vol dgx2vm labTue1616 1g0 file 1 74 TiB 3 71 GiB Viewing the Data Volume from the Guest VM virsh console dgx2vm labTue1616 1g0 Connected to domain dgx2vm rootTue1616 1g0 n...

Page 104: ...Configuration Host VM VM VM VM External macvtap macvtap Private Yes KVM Networking Best Practices Guide privateIP sudo nvidia vm create gpu count 4 gpu index 12 privateIP...

Page 105: ...iated KVM image Guest VMs created from this older image will no longer be available To keep guest VMs save the older KVM image to another location and then restore the image after updating th...

Page 106: ...ce image If the source image is ever uninstalled the guest VM may not work properly To keep guest VMs running uninterrupted save the KVM source image to another location before uninstalling it sudo apt...

Page 107: ...10 22 46 92 Memory GB 90 180 361 723 1446 InfiniBand N A 1 2 4 8 OS Drive GB 50 50 50 50 50 Data Drive TB 1 92 3 84 7 68 15 36 31 72 Ethernet macvtap macvtap macvtap macvtap macvtap NVLink N A 1 3 6...

Page 108: ...sudo nvidia vm health check options...

Page 109: ...m health check show virsh shutdown vm name sudo virt edit vm name lib systemd system nvidia fabricmanager service ExecStart usr bin nv hostengine l g log level 4 log rotate log filename var log fabric...

Page 110: ...l dcgmi health host vm ip address s a dcgmi health host vm ip address check Health Monitor Report Overall Health Healthy sudo nvidia vm create gpu count 8 gpu index 8 ERROR GPU 12 is in unexpected st...

Page 111: ...5b1c87e ERROR 2 GPU s are unavailable unable to start this VM dgx2vm labMon1559 8g8 15 Note If you attempt to launch a VM with a failed GPU before the system has identified its failed state the VM wil...

Page 112: ...nvsysinfo grep i error fail HOME cache virt manager virt install log sudo egrep i error fail var log libvirt qemu vm name log virsh console vm name...

Page 113: ...dr vm name source agent virsh domifaddr 1gpu vm 1g2 source agent Name MAC address Protocol Address lo 00 00 00 00 00 00 ipv4 127 0 0 1 8 ipv6 1 128 enp1s0 52 54 00 3c 07 62 ipv4 10 120 28 227 24 ipv6...

Page 114: ...Node SN Model Namespace Usage Format FW Rev dev nvme0n1 S2X6NX0K501953 SAMSUNG MZ1LW960HMJP 00003 1 61 79 GB 960 20 GB 512 B 0 B CXV8601Q snip dev nvme9n1 18141C246847 Micron_9200_MTFDHAL3T8TCT 1 3 8...

Page 115: ...Size 937034752 893 63 GiB 959 52 GB Raid Devices 2 Total Devices 2 Persistence Superblock is persistent Intent Bitmap Internal Update Time Thu Aug 15 16 31 29 2019 State clean Active Devices 2 Working...

Page 116: ...GiB 19 62 GiB 858 95 GiB From within the guest VM run the following command lsblk NAME MAJ MIN RM SIZE RO TYPE MOUNTPOINT vda 252 0 0 50G 0 disk vda1 252 1 0 50G 0 part vdb 252 16 0 13 9T 0 disk vdb1...

Page 117: ...sudo virt log d vm name sudo virt cat d vm name var log syslog sudo virt edit d vm name var log syslog...

Page 118: ...name sudo virt df d 1gpu vm 1g0 Filesystem 1K blocks Used Available Use 1gpu vm 1g0 dev sda1 51341792 3313160 45390912 7 sudo virt filesystems d vm name DGX 2 Server Software Release Notes Linux KVM...

Page 119: ...NVIDIA DGX 2 Service Manual...

Page 120: ...Creating a Unique BMC Password...

Page 121: ...nvmecli Do not use a root file system Execute a shell in the installer environment sudo nvme list Node SN Model Namespace Usage Format FW Rev...

Page 122: ...8R0 dev nvme7n1 18xxxxxx Micron_9200_xx 1 3 84 TB 3 84 TB 512 B 0 B 101008R0 dev nvme8n1 18xxxxxx Micron_9200_xx 1 3 84 TB 3 84 TB 512 B 0 B 101008R0 dev nvme9n1 18xxxxxx Micron_9200_xx 1 3 84 TB 3 84...

Page 123: ...have made on the DGX 2 System Be sure to back up any data that you want to preserve and push any Docker images that you want to keep to a trusted registry a Log on to the NVIDIA Enterprise Support sit...

Page 124: ...os release number scheme https docs nvidia com dgx pdf DGX OS server 4 1 relnotes update guide pdf Note These procedures apply only to upgrades within the same major release such as 4 x 4 y It does n...

Page 125: ...ttp security ubuntu com ubuntu bionic security universe deb http security ubuntu com ubuntu bionic security multiverse deb http archive ubuntu com ubuntu bionic main multiverse universe deb http archi...

Page 126: ...com dgx repos bionic bionic updates main restricted universe multiverse Only for DGX OS 4 1 0 deb i386 http international download nvidia com dgx repos bionic bionic r418 cuda10 1 main multiverse res...

Page 127: ...onic r418 cuda11 0 repo list deb file media usb repository mirror international download nvidia com dgx repos bionic bionic r418 cuda10 1 main multiverse restricted universe etc apt sources list d dgx...

Page 128: ...om dgx repos bionic bionic r418 cuda10 1 multiverse amd64 Packages 10 1 kB Get 8 file media usb repository mirror international download nvidia com dgx repos bionic bionic r418 cuda10 1 restricted amd...

Page 129: ...rces list d dgx bionic r450 cuda11 0 repo list sudo apt install cuda toolkit 11 0 Note If you did not configure apt to use the NVIDIA DGX OS packages in the file etc apt sources list d dgx bionic r450...

Page 130: ...docker images docker save nvcr io nvidia repository tag framework tar docker load i framework tar docker images...

Page 131: ...d named after the user The default file contains a dummy value which must be replaced with your own groups Optional Additional groups to add the user to Defaults to none shell Shell for the created us...

Page 132: ...instance data json https cloudinit readthedocs io en latest topics instancedata html format of instance data json...

Page 133: ......

Page 134: ......

Page 135: ...plug your system into a surge suppressor and disconnect telecommunication lines to your modem during an electrical storm Provided with a properly grounded wall outlet Provided with sufficient space to...

Page 136: ...Do not attempt to modify or use the AC power cord s if they are not the exact type required to fit into the grounded electrical outlets The power cord s must meet the following criteria...

Page 137: ...rs when removing access cover s Upon completion of accessing inside the product refasten access cover with original screws or fasteners Do not access the inside of the power supply There are no service...

Page 138: ......

Page 139: ...www dtsc ca gov perchlorate...

Page 140: ...o make sure you have not left loose tools or parts inside the system Check that cables add in cards and other components are properly installed Attach the covers to the chassis according to the product...

Page 141: ...nt This equipment generates uses and can radiate radio frequency energy and if not installed and used in accordance with the instruction manual may cause harmful interference to radio communications O...

Page 142: ...ent sur le materiel brouilleur du Canada European Conformity Conformit Europ enne CE This is a Class A product In a domestic environment this product may cause radio frequency interference in which ca...

Page 143: ...t this product may cause radio interference in which case the user may be required to take corrective actions VCCI A A Japanese regulatory requirement defined by specification JIS C 0950 2008 mandates...

Page 144: ...Pb Hg Cd Cr VI PBB PBDE 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 JIS C 0950 2008 2006 7 1 DGX 2 1 0 JIS C 0950 2008 2 JIS C 09...

Page 145: ...tion JIS C 0950 2008 mandates that manufacturers provide Material Content Declarations for certain categories of electronic products offered for sale after July 1 2006 Product Model Number DGX 2 Major...

Page 146: ...ical and Electronic Products SJ T 11364 2014 The table according to SJ T 11364 2014 O GB T 26572 2011 O Indicates that this hazardous substance contained in all of the homogeneous materials for this p...

Page 147: ...Notice...
