
NVIDIA DGX-1

DU-08033-001 _v13.1 | December 2017

User Guide

Summary of Contents for DGX-1

Page 1: ...NVIDIA DGX 1 DU 08033 001 _v13 1 December 2017 User Guide ...

Page 2: ... 1 12 2 7 Attaching the Bezel 13 2 8 Connecting the Power Cables 14 2 9 Connecting the Network Cables 15 2 10 Setting Up the DGX 1 16 2 11 Post Setup Instructions for DGX OS Server Software Version 2 x and Earlier 18 Chapter 3 Preparing for Using Docker Containers 20 3 1 Installing Docker and NVIDIA Docker on DGX OS Server Software 2 x or Earlier 20 3 2 Configuring Docker IP Addresses 21 3 2 1 Con...

Page 3: ...ely 43 5 2 3 Creating a Bootable Installation Medium 46 5 2 3 1 Creating a Bootable USB Flash Drive by Using the dd Command 46 5 2 3 2 Creating a Bootable USB Flash Drive by Using Akeo Rufus 47 5 2 4 Re Imaging the System From a USB Flash Drive 49 5 2 5 Retaining the RAID Partition While Installing the OS 49 5 3 Updating the System BIOS 50 5 4 Updating the BMC 53 5 5 Replacing the System and Compo...

Page 4: ...cation Uses 104 8 3 Site Selection 104 8 4 Equipment Handling Practices 105 8 5 Electrical Precautions 105 8 6 System Access Warnings 106 8 7 Rack Mount Warnings 106 8 8 Electrostatic Discharge 107 8 9 Other Hazards 108 Chapter 9 Compliance 110 9 1 United States 110 9 2 United States Canada 110 9 3 Canada 111 9 4 CE 111 9 5 Japan 111 9 6 Australia 112 9 7 China 112 9 8 Israel 114 9 9 South Korea 1...

Page 5: ...t can be deployed quickly and easily. 1.1 Using the DGX-1: Overview. The NVIDIA DGX-1 comes with a base operating system consisting of an Ubuntu OS, Docker, Docker Engine Utility for NVIDIA GPUs, and NVIDIA drivers. This system is designed to run a number of NVIDIA-optimized deep learning framework applications packaged in Docker containers. You can use your own scheduling and management software to run jo...

Page 6: ...ts Power Supply: 4, 1600 W each. CPU: 2, Intel Xeon E5-2698 v4, 20-core, 2.2 GHz, 135 W. GPU: 8. Option 1: Tesla P100, featuring 170 teraflops FP16, 16 GB memory per GPU, 28,672 NVIDIA CUDA Cores. Option 2: Tesla V100, featuring 960 teraflops FP16, 16 GB memory per GPU, 40,960 NVIDIA CUDA Cores, 5120 NVIDIA Tensor Cores. System Memory: 16, 32 GB DDR4 LRDIMM (512 GB total). SAS RAID Controller: 1, 8-port LSI SAS 3108 RAID Mezzan...

Page 7: ...00-240 V ac, 3500 W max; 1600 W @ 200-240 V, 8 A, 50-60 Hz. The DGX-1 contains four load-balancing power supplies with 3+1 redundancy. 1.2.4 Connections and Controls. ID | Type | Qty | Description. 1 | Power button | 1 | Press to turn the DGX-1 on or off. Blue: System power on. Off: System power off. Amber (blinking): DC Off and fault. Amber and blue (blinking): DC On and fault. 2 | ID button | 1 | Press to cause an LED on the back of th...

Page 8: ...ng 7 AC input 4 Power supply inputs 8 Ethernet RJ45 2 10GBASE T dual port network adapter Mezzanine 9 IPMI RJ45 1 10 100BASE T Intelligent Platform Management Interface IPMI port 1 2 5 Rear Panel Power Controls ID Type Qty Description 1 Power button 1 Press and immediately release the power button for a graceful shutdown of the host OS Press and hold the power button for at least four seconds to s...

Page 9: ...ber blinking LAN access off when there is traffic 1 Port 1 Link Activity Off Disconnected Green 10 Gb s Amber 1 Gb s 2 Port 1 Speed Off 100 Mb s Amber steady LAN link Amber blinking LAN access off when there is traffic 3 Port 0 Link Activity Off Disconnected Green 10 Gb s Amber 1 Gb s 4 Port 0 Speed Off 100 Mb s 1 2 7 IPMI Port LEDs LEDs on the IPMI port indicate the connection status as described...

Page 10: ...iption 1 Button and release lever for removing the HDD 2 HDD present LED Blue Steady Drive present Blue Blinking twice sec Identification such as when initializing or locating through the SBIOS Blue Blinking once sec Rebuilding such as when creating a RAID array Amber Steady Warning failure Off Slot empty 3 HDD activity LED Blue Access 1 2 9 Power Supply Unit PSU LED The PSU LED indicates the oper...

Page 11: ...ction to the NVIDIA DGX 1 Deep Learning System www nvidia com NVIDIA DGX 1 DU 08033 001 _v13 1 7 Activity Description Green Normal operation Amber blinking Power off Fault Green blinking Power on Standby mode ...

Page 12: ...roduct Registration http www nvidia com object dgx product registration page 2 Enter all required information and then click SUBMIT to complete the registration process and receive all warranty entitlements and if applicable DGX 1 support services entitlements Refer to the Customer Support chapter for customer support contact information 2 2 Obtaining Software and Software Updates You must registe...

Page 13: ... 2cm of clearance in the back of the rack to allow for sufficient airflow and ease in servicing Always make sure the rack is secured and stable before adding or removing the appliance or any other component Prepare adequate sound proofing The equipment fans can generate 72 100 dBA Environmental Conditions Operating environment Temperature 5 C to 35 C 41 F to 95 F Relative humidity 20 to 85 noncond...

Page 14: ... and assign a DNS configuration to the DGX 1 If DHCP is not available then you will need to set up a static IP for each Ethernet port NVIDIA recommends that customers follow best security practices for BMC management IPMI port These include but are not limited to such measures as Restricting the DGX 1 IPMI port to an isolated dedicated management network Using a separate firewalled subnet Configur...

Page 15: ...X 1 Bezel Rail hardware kit Accessory Box AC Power Cables qty 4 IEC 60320 C13 14 compatible with data center PDUs IMPORTANT Use only the supplied power cables and do not use the cables with any other product or for any other purpose Hard disk bay screws Toxic Substance Notice Safety Instructions Quick Start Guide DVD containing source files for open source software The four power cables included i...

Page 16: ...er rail or its outer rail mate to determine the proper orientation and positioning to connect to the chassis then secure to the chassis IMPORTANT Make sure that the reinforced hole at the front end of the rail is positioned on the bottom side of the rail and that it aligns with the thumbscrew on the front of the DGX 1 If the hole is positioned on the top side then the rail is on the wrong side of ...

Page 17: ...ner rails into the outer rails keeping the pressure even on both sides you may have to depress the locking tabs when inserting When the DGX 1 has been pushed completely into the rack you should hear the locking tabs click into the locked position 3 Lock the unit in place using the thumb screws located on the front of the unit 2 7 Attaching the Bezel The bezel is designed to attach easily to the fr...

Page 18: ...ins near the corners of the DGX 1 with the holes in back of the bezel then gently press the bezel against the DGX 1 CAUTION Be careful not to accidentally press the power button that is on the right edge of the DGX 1 when removing or installing the bezel The bezel is held in place magnetically 2 8 Connecting the Power Cables 1 Open the accessory box and remove the four C13 C14 power cables 2 Use t...

Page 19: ...firmly inserted into the PDU There is usually a click to indicate full insertion 2 9 Connecting the Network Cables 1 Using an Ethernet cable connect one of the dual Ethernet ports em1 or em2 to your LAN for internet access to the NVIDIA Cloud Portal remote access to launched application containers on the DGX 1 or to connect to the DGX 1 using SSH The left side right side ethernet port designation ...

Page 20: ...et interfaces on the same network. 2. Using an Ethernet cable, connect the IPMI (BMC) port to your LAN for remote access to the baseboard management controller (BMC). Verify that all network cables are firmly inserted into the DGX-1 and the associated network switch. 2.10 Setting Up the DGX-1. These instructions describe the setup process that occurs the first time the DGX-1 is powered on after delivery. Be prepa...

Page 21: ...twork interface the system attempts to configure the interface for DHCP and then asks you to enter a hostname for the system If DHCP is not available you will have the option to configure the network manually If you need to configure a static IP address on a network interface connected to a DHCP network select Cancel at the Network configuration Please enter the hostname for the system screen The ...

Page 22: ...tc/resolv.conf -> ../run/resolvconf/resolv.conf If the expected output appears, then skip to step 2. If this does not appear, then enable dynamic DNS updates as follows: a. Launch the Resolvconf Reconfigure package: sudo dpkg-reconfigure resolvconf The Configuring resolvconf screen appears. b. Select Yes when asked whether to prepare /etc/resolv.conf for dynamic updates. c. Select No when asked whether to append o...
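The symlink check described above can be scripted. The following is a minimal sketch, not part of the guide: the function name is_resolvconf_managed is hypothetical, and the target suffix run/resolvconf/resolv.conf is taken from the expected output shown in the guide.

```shell
#!/usr/bin/env bash
# Sketch: report whether a resolv.conf path is a symlink managed by resolvconf.
# is_resolvconf_managed is a hypothetical helper name, not an NVIDIA tool.
is_resolvconf_managed() {
    local path="${1:-/etc/resolv.conf}"
    local target
    # readlink exits non-zero when the path is not a symbolic link
    target="$(readlink "$path")" || return 1
    case "$target" in
        */run/resolvconf/resolv.conf) return 0 ;;
        *) return 1 ;;
    esac
}

# Example call on a real system:
# is_resolvconf_managed /etc/resolv.conf && echo "dynamic DNS updates enabled"
```

If the function returns non-zero, the guide's dpkg-reconfigure step is the next thing to try.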

Page 23: ...43360 13 rdma_cm,ib_cm,ib_sa,iw_cm,nv_peer_mem,mlx4_ib,mlx5_ib,ib_mad,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib 3. If there is no output to the lsmod command, then build and install the nvidia-peer-memory module. a. Get and install the module: sudo apt-get update sudo apt-get install --reinstall mlnx-ofed-kernel-dkms nvidia-peer-memory-dkms Expected output: DKMS: install completed. Processing triggers for ...

Page 24: ...g a System Proxy Configuring NFS Mount and Cache 3 1 Installing Docker and NVIDIA Docker on DGX OS Server Software 2 x or Earlier To enable portability in Docker images that leverage GPUs NVIDIA developed nvidia docker an open source project that provides a command line tool to mount the user mode components of the NVIDIA driver and the GPUs into the Docker container at launch As of DGX OS Server ...

Page 25: ...r file when done. d. Restart Docker with the new configuration: sudo service docker restart 3. Install NVIDIA Docker. The following example installs both nvidia-docker and the nvidia-docker-plugin: wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb sudo dpkg -i /tmp/nvidia-docker*.deb && rm /tmp/nvidia-docker*.deb 3.2 Configuring Docker IP Addresses. To ensu...

Page 26: ...ng the correct bridge IP address and IP address ranges for your network. Consult your IT administrator for the correct addresses. For example, if your DNS server exists at IP address 10.10.254.254, and the 192.168.0.0/24 subnet is not otherwise needed by the DGX-1, you can add the following line to the /etc/default/docker file: DOCKER_OPTS="--dns 10.10.254.254 --bip=192.168.0.1/24 --fixed-cidr=192.168.0.0/24" If ...

Page 27: ... sudo systemctl restart docker 3 3 Letting Users Issue Docker Commands To prevent the docker daemon from running without protection against escalation of privileges the NVIDIA Docker software requires sudo privileges to run containers You can grant the required privileges to users who will run containers on the DGX 1 in one of the following ways Add each user as an administrator user with sudo pri...

Page 28: ...ew user in order to add them to the docker group perform the following 1 Add the user sudo useradd username 2 Set up the password sudo passwd username Enter a password at the prompts Enter new UNIX password Retype new UNIX password passwd password updated successfully 3 3 3 Adding a User to the Docker Group For each user you want to add to the docker group enter the following command sudo usermod ...

Page 29: ...ng to use the DGX 1 in cloud managed mode The DGX 1 Cloud Services software will set up the NFS cache for you as part of the cloud managed mode configuration Similarly in cloud managed mode the person setting up the job will specify any NFS mount requirements for the job at that time 1 Check if the cache daemon is installed and configured service cachefilesd status If the output indicates that cac...

Page 30: ...s used here as an example mount point Consult your Network Administrator for the correct values for nfs_server and export_path The nfs arguments presented here are a list of recommended values based on typical use cases However fsc must always be included as that argument specifies use of FS Cache c Save the changes 8 Verify the NFS server is reachable ping nfs_server Use the server IP address or ...

Page 31: ...e and monitor the DGX 1 independently of the CPU or operating system You can access the BMC remotely through the Ethernet connection to the IPMI port This section describes how to access the BMC and describes a few common tasks that you can accomplish through the BMC It is not meant to be a comprehensive description of all the BMC capabilities To access the BMC remotely 1 Make sure you have connec...

Page 32: ...for the first time you set up a username and password for the system These credentials are also used to log in to the BMC remotely except that the BMC password is the username It is strongly recommended that you create a unique password as soon as possible Create a unique BMC password as follows 1 Open a Java enabled web browser within your LAN and go to http IPMI IP address Use Firefox or Interne...

Page 33: ...ch as temperatures and voltages 4 1 3 Submitting BMC Log Files The BMC provides automatic logging of system activities and status The NVIDIA Enterprise Support team uses the log files to assist in troubleshooting Follow these instructions to obtain the log files to send to NVIDIA Enterprise Support 1 Log into the BMC then click Server Health from the top menu and select Event Log 2 Make sure that ...

Page 34: ...sole to open the popup window The window provides interactive control of the DGX 1 console 4 1 6 Powering Off Power Cycling the System Remotely 4 1 6 1 From the DGX 1 Console Window If you have opened the Java Viewer Remote Control Console Redirection to view the console window then you can power cycle reset or shutdown the DGX 1 as follows 1 From the JViewer top menu click Power and then select f...

Page 35: ...ring a BMC Static IP Address Using the System BIOS Configuring the BMC Static IP Address Using ipmitool Configuring the BMC Static IP Address Using the BMC User Interface 4 2 1 Configuring a BMC Static IP Address Using ipmitool This section describes how to set a static IP address for the BMC from the Ubuntu command line If you cannot access the DGX 1 remotely then connect a display 1024x768 or lo...

Page 36: ...tic: sudo ipmitool lan set 1 ipsrc static 2. Set the appropriate address information. To set the IP address ("Station IP address" in the BIOS settings), enter the following and replace the italicized text with your information: sudo ipmitool lan set 1 ipaddr 10.31.241.190 To set the subnet mask, enter the following and replace the italicized text with your information: sudo ipmitool lan set 1 netmask 255.255...
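The ipmitool sequence above can be collected into one helper that prints the commands for review before they are run. This is a sketch under stated assumptions: bmc_static_ip_commands is a hypothetical name, the netmask and gateway values are placeholders (the guide's own values are truncated here), and the defgw ipaddr line is the standard ipmitool way to set a default gateway rather than something shown in this excerpt.

```shell
#!/usr/bin/env bash
# Sketch: print (do not run) the ipmitool commands for a static BMC address.
# LAN channel 1 matches the commands in the guide; addresses are placeholders.
bmc_static_ip_commands() {
    local addr="$1" mask="$2" gateway="$3"
    echo "sudo ipmitool lan set 1 ipsrc static"
    echo "sudo ipmitool lan set 1 ipaddr ${addr}"
    echo "sudo ipmitool lan set 1 netmask ${mask}"
    # Assumption: the gateway is set with the standard "defgw ipaddr" parameter
    echo "sudo ipmitool lan set 1 defgw ipaddr ${gateway}"
}

# Placeholder values; use the addresses assigned by your network administrator.
bmc_static_ip_commands 10.31.241.190 255.255.255.0 10.31.241.1
```

Printing rather than executing keeps the sketch safe to try on any machine; pipe the output to a shell only after checking each value.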

Page 37: ...3 At the BIOS Setup Utility screen navigate to the Server Mgmt tab on the top menu then scroll to BMC network configuration and press Enter ...

Page 38: ...4 Scroll to Configuration Address Source and press Enter then at the Configuration Address source pop up select Static on next reset and then press Enter ...

Page 39: ...5 Set the addresses for the Station IP address Subnet mask and Router IP address as needed by performing the following for each a Scroll to the specific item and press Enter b Enter the appropriate information at the pop up then press Enter ...

Page 40: ...ess Enter You can now access the BMC over the network 4 2 3 Configuring a BMC Static IP Address Using the BMC Dashboard 1 Log into the BMC then click Configuration from the top menu and select Network Settings 2 In the IPv4 Configuration section of the Network Settings page clear the Use DHCP check box and then enter the appropriate values for the IPv4 Address Subnet Mask and Default Gateway field...

Page 41: ... static IP addresses for the network ports If you did not set this up at that time you can configure the static IP addresses from the Ubuntu command line according to the following instructions If you cannot access the DGX 1 remotely then connect a display 1024x768 or lower resolution and keyboard directly to the DGX 1 1 Determine the port designation that you want to configure based on the physic...

Page 42: ...m1 iface em1 inet static address 192.168.1.14 gateway 192.168.1.1 netmask 255.255.255.0 network 192.168.1.0 broadcast 192.168.1.255 Consult your network administrator for the appropriate addresses for your network, and use the port designations that you determined in step 1. 3. When finished with your edits, press ESC to switch to command mode, then save the file to the disk and exit the editor: :wq 4. Re...
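The network and broadcast values in the stanza above follow from the address and netmask, so they can be derived instead of typed by hand. The sketch below is illustrative only: static_stanza, ip_to_int, and int_to_ip are hypothetical helper names, and the output covers only the iface block, not whatever precedes it in the truncated excerpt.

```shell
#!/usr/bin/env bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<< "$1"
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Convert a 32-bit integer back to dotted-quad form.
int_to_ip() {
    local n="$1"
    echo "$(( (n >> 24) & 255 )).$(( (n >> 16) & 255 )).$(( (n >> 8) & 255 )).$(( n & 255 ))"
}

# Print an "inet static" block for /etc/network/interfaces.
# network = address AND netmask; broadcast = network OR inverted netmask.
static_stanza() {
    local iface="$1" addr="$2" mask="$3" gw="$4"
    local a m net bcast
    a="$(ip_to_int "$addr")"
    m="$(ip_to_int "$mask")"
    net=$(( a & m ))
    bcast=$(( net | (~m & 0xFFFFFFFF) ))
    cat <<EOF
iface ${iface} inet static
    address ${addr}
    gateway ${gw}
    netmask ${mask}
    network $(int_to_ip "$net")
    broadcast $(int_to_ip "$bcast")
EOF
}

# Values from the guide's example
static_stanza em1 192.168.1.14 255.255.255.0 192.168.1.1
```

With the guide's example values this reproduces network 192.168.1.0 and broadcast 192.168.1.255.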

Page 43: ...or enp1s0f1 em1 or enp1s0f0 1 Connect a display 1024x768 or lower resolution and keyboard to the DGX 1 2 Turn the DGX 1 on or reboot 3 At the NVIDIA logo boot screen press F2 or Del to enter the BIOS setup screen 4 Select the Advanced tab from the top menu then scroll down to view the two Quanta Dual Port 10G BASE T Mezzanine items ...

Page 44: ...The first item shows the MAC address for ethernet port em1 and the second item shows the MAC address for em2 5 Navigate to and select Server Mgmt from the top menu then scroll down to and select BMC network configuration 6 Scroll down to view the Station MAC address ...

Page 45: ...This shows the MAC address for the BMC ...

Page 46: ...t chapter for additional contact information Refer to Submitting BMC Log Files for instructions on how to obtain the BMC log files to assist in troubleshooting 5 2 Restoring the DGX 1 Software Image If the DGX 1 software image becomes corrupted or the OS SSD was replaced after a failure restore the DGX 1 software image to its original factory condition from a pristine copy of the image The process...

Page 47: ...hecksum file and save them to your local disk The ISO image is also available in an archive file If you download the archive file be sure to extract the ISO image before proceeding 5 2 2 Re Imaging the System Remotely These instructions describe how to re image the system remotely through the BMC For information about how to restore the system locally see Re Imaging the System From a USB Flash Dri...

Page 48: ...led for this site c From the JViewer top menu bar click Media and then select Virtual Media Wizard d From the CD DVD Media I section of the Virtual Media dialog click Browse and then locate the re image ISO file and click Open You can ignore the device redirection warning at the bottom of the Virtual Media wizard as it does not affect the ability to re image the system e Click Connect CD DVD then ...

Page 49: ...nd complete the DGX 1 setup a From the top menu click Power and then select Reset Server b Click Yes and then OK at the Power Control dialogs then wait for the system to power down and then come back online c At the boot selection screen select Install DGX 1 OS and then press Enter If you are an advanced user who is not using the RAID disks as cache and want to keep data on the RAID disks then sel...

Page 50: ...estoring the software image remotely through the BMC you do not need a bootable installation medium and you can omit this task If you are creating a bootable USB flash drive follow the instructions for the platform that you are using On Linux see Creating a Bootable USB Flash Drive by Using the dd Command On Windows see Creating a Bootable USB Flash Drive by Using Akeo Rufus If you are creating a ...

Page 51: ...UNTPOINT; sda 8:0 0 1.8T 0 disk; sda1 8:1 0 121M 0 part /boot/efi; sda2 8:2 0 1.8T 0 part; sdb 8:16 0 1.8T 0 disk; sdb1 8:17 0 1.8T 0 part; sdc 8:32 0 1.8T 0 disk; sdd 8:48 0 1.8T 0 disk; sde 8:64 1 7.6G 0 disk; sde1 8:65 1 7.6G 0 part /media/deeplearner/DGXSTATION. 3. As root, convert and copy the image to the USB flash drive: sudo dd if=path-to-software-image bs=2048 of=usb-drive-device-name Caution: The dd...
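Because dd overwrites its target without asking, it helps to confirm which device is the USB drive before running it. The sketch below filters lsblk-style output for whole disks flagged as removable (RM column = 1) and prints, without running, the dd command from the guide. The function names removable_disks and dd_command are hypothetical, and the column order is assumed to match the default lsblk layout shown in the guide.

```shell
#!/usr/bin/env bash
# List whole disks that lsblk flags as removable (RM column = 1).
# Reads lsblk-style text on stdin: NAME MAJ:MIN RM SIZE RO TYPE [MOUNTPOINT]
removable_disks() {
    awk '$3 == "1" && $6 == "disk" { print "/dev/" $1 }'
}

# Print (do not run) the dd command for a chosen image and device.
dd_command() {
    local image="$1" device="$2"
    echo "sudo dd if=${image} bs=2048 of=${device}"
}

# On a real system: lsblk | removable_disks
```

In the guide's sample listing only sde has RM set to 1, so it is the only candidate the filter would report.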

Page 52: ...ition scheme and target system type select GPT partition scheme for UEFI 4 Select the Create a bootable disk using option and from the dropdown menu select ISO image 5 Click the optical drive icon and open the DGX 1 software ISO image 6 Click Start Because the image is a hybrid ISO file you are prompted to select whether to write the image in ISO Image file copy mode or DD Image disk image mode ...

Page 53: ... as cache and want to keep data on the RAID disks then select Install DGX Server without formatting RAID See the section Retaining the RAID Partition While Installing the OS for more information The DGX 1 will reboot and proceed to install the image This can take more than 15 minutes The Mellanox InfiniBand driver installation may take up to 10 minutes After the installation is completed the syste...

Page 54: ...to use the RAID disks as other than cache disks. You can always choose to use the RAID disks as cache disks at a later time by enabling cachefilesd and adding /raid to the file system table as follows: 1. Uncomment the RUN=yes line in /etc/default/cachefilesd. 2. Uncomment the /raid line in /etc/fstab. 3. Run the following: a. Mount /raid: sudo mount /raid b. Reload the systemd manager configuration: systemctl daemo...

Page 55: ... Shutdown option then click Perform Action You can verify that the DGX 1 is shut down by noting that all the Power Control and Status options are grayed out except for the Power On Server option 3 Update the system BIOS a From the top menu click Firmware Update select BIOS Update and then click Enter Update Mode b Click OK at the Are you sure to enter update mode dialog c From the BIOS Upload scre...

Page 56: ...to start the actual upgrade process The BIOS Flash Status screen shows the upgrade progress which should take a couple of minutes to complete Do not interrupt the upgrade process once it has started 4 After the upgrade process has completed you can use the top menu to turn the system back on a From the top menu click Remote Control and then select Server Power Control b Select the Power On Server ...

Page 57: ...ternet Explorer Google Chrome is not officially supported by the BMC 3 If you re using DHCP and choose not to preserve the network configuration then obtain the MAC address for the BMC If the BMC is connected to a network via DHCP the IP address could change after the update Follow these substeps to obtain the MAC address in order to connect to the BMC after the update in case the IP address chang...

Page 58: ...ect Firmware Update from the drop down menu to return to the Firmware Update page 7 Click Enter Update Mode then click OK at the confirmation dialog After entering Update Mode aborting the operation or even resizing the browser windows will terminate the session and reset the BMC If this happens you will need to close and then reopen the browser to initiate a new session You may need to wait sever...

Page 59: ...eset itself During this time the BMC will be unresponsive 5 5 Replacing the System and Components Be sure to familiarize yourself with the NVIDIA Terms Conditions documents before attempting to perform any modification or repair to the DGX 1 These Terms Conditions for the DGX 1 can be found through the NVIDIA DGX Systems Support http www nvidia com object dgxsystems support html page Contact NVIDI...

Page 60: ...ing an SSD. Access the SSDs from the front of the DGX-1. You can hot swap the SSDs as follows: 1. If not already removed, remove the bezel by grasping the bezel by the side handles and then pulling the bezel straight off the front of the DGX-1. CAUTION: Be careful not to accidentally press the power button that is on the right edge of the DGX-1 when removing or installing the bezel. 2. Locate the SSD that ...

Page 61: ...identally press the power button that is on the right edge of the DGX 1 when removing or installing the bezel 5 5 3 Recreating the Virtual Drives After you have replaced the OS SSD with or without any of the cache SSDs you need to recreate the virtual drives and then re image the system in order to recreate the partitions on all the virtual drives The following is an overview of the process 1 Clea...

Page 62: ...3 At the NVIDIA logo boot screen press F2 or Del to enter the BIOS setup screen 4 Select the Advanced tab from the top menu and then Scroll down and select the MegaRAID Configuration Utility ...

Page 63: ...The RAID Configuration menu appears ...

Page 64: ...If you replaced the OS drive follow the instructions in the section Clear the Drive Group Configuration 5 5 3 2 Clear the Drive Group Configuration These instructions apply when you have replaced the OS drive 1 Select Main Menu then select Configuration Management ...

Page 65: ...2 Select Clear Configuration ...

Page 66: ...3 Select Confirm Disabled and then select Enabled at the confirmation popup ...

Page 67: ...4 Select Yes then select OK at the success screen ...

Page 68: ...ual Drive and then Recreate the RAID0 Virtual Drive 5 5 3 3 Recreate the OS Virtual Drive These instructions apply when you have replaced the OS drive Be sure to first complete the instructions in the section Clear the Drive Group Configuration 1 Navigate to the RAID Utility Main Menu then under Actions select Configure then select Configuration Management ...

Page 69: ...2 Select Create Virtual Drive then select Select Drives at the next screen Leave all other options at their default settings as shown below ...

Page 70: ...The list of drives under CHOOSE UNCONFIGURED DRIVES will initially be empty 3 To view the available drives select Select Media Type HDD then change to SSD ...

Page 71: ...4 Under CHOOSE UNCONFIGURED DRIVES select the 446 GB drive then change to Enabled at the pop up dialog ...

Page 72: ...5 Confirm that only the first drive at Drive Port 0 3 01 00 displays as Enabled ...

Page 73: ...6 Scroll up and select Apply Changes ...

Page 74: ...7 Select OK at the success screen ...

Page 75: ...The virtual drive creation page now displays a summary of your selection The Virtual Drive Size should be approximately 446 GB 8 Select Save Configuration at the top of the menu 9 Change the Confirm Disabled field to Enabled and then select Yes ...

Page 76: ... Drive 0 where the OS will be installed 11 Follow the instructions in the section Recreate the RAID0 Virtual Drive 5 5 3 4 Recreate the RAID0 Virtual Drive These instructions apply when you have replaced the OS drive and cleared the drive group configuration 1 Navigate to the RAID Utility Main Menu then under Action select Configure then select Configuration Management ...

Page 77: ...2 Select Create Virtual Drive ...

Page 78: ...3 Scroll to Select RAID Level and switch to RAID0 if not already set ...

Page 79: ...4 Scroll to Select Media Type and switch to SSD ...

Page 80: ...5 Select Select Drives ...

Page 81: ...6 Switch all unconfigured 1TB drives to Enabled ...

Page 82: ...7 Select Apply Changes ...

Page 83: ...8 Change Confirm to Enabled then select Yes 9 Select OK at the success screen The Create Virtual Drive screen displays a summary of your selection 10 Verify that the summary matches your selection then select Save Configuration ...

Page 84: ...11 Make sure Confirm is set to Enabled then select Yes to confirm the change ...

Page 85: ...12 Select OK at the success screen 13 Confirm and exit a Select View Drive Group Properties to confirm the configuration ...

Page 86: ...b Verify that your configuration screen shows that you have two virtual drives with the following properties Virtual Drive 0 of size 446 GB or very similar Virtual Drive 1 of size 7 TB or very similar ...

Page 87: ...c If your Drive Groups match the above press F10 to save these settings and reset the system d Select Save Changes and Reset then select Yes at the confirmation prompt ...

Page 88: ...pt you need to get and install the StorCLI utility For instructions see the document Using StorCLI to Recreate the NVIDIA DGX 1 RAID 0 Array available from the Enterprise Services site Connect a display 1024x768 or lower resolution and keyboard to the DGX 1 when booting the DGX 1 before recreating the RAID array This is because the system may halt at the BIOS screen alerting you that the RAID arra...

Page 89: ...ntally press the power button that is on the right edge of the DGX 1 when removing or installing the bezel 2 Unplug the power cord from the power connector on the fan assembly 3 Flip the power supply handle out 4 Push the green release lever to the left and simultaneously use the power supply handle to pull out the power supply 5 Slide the replacement power supply into the bay and push until seate...

Page 90: ...t the front of the DGX 1 then slide the DGX 1 about half way out from the rack 2 Squeeze together the latches at the square access openings on the top of the chassis then flip open the top of the chassis to expose the fan modules 3 Squeeze the release tabs on the outer edge of the fan module you want to replace then pull up to lift the fan module out of the unit 4 Replace with a new fan module usi...

Page 91: ...which is accessible from the rear of the DGX 1 1 Turn off the DGX 1 and disconnect all network and power cabling 2 Remove the motherboard tray a Locate the locking levers for the motherboard tray at the rear of the DGX 1 There are two sets of locking levers The locking levers for the motherboard are the bottom set b Rotate the retention clasps inward towards the center of the unit The retention cl...

Page 92: ...oard tray on a clean work surface and position it so that the locking levers are at the top as you look down on the tray The DIMMs are on a printed circuit board on the left side of the tray 3 Using the figure below as a guide locate the DIMM corresponding to the ID of the faulty DIMM as reported in the BMC log ...

Page 93: ...4 Remove the DIMM ...

Page 94: ...a Press down on the side latches at both ends of the DIMM socket to push them away from the DIMM This should unseat the DIMM from the socket b Pull the DIMM straight up to remove it from the socket 5 Carefully insert the replacement DIMM ...

Page 95: ...s click in place c Make sure that the latches are up and locked in place 6 Carefully insert the motherboard tray back into the unit then swing the locking levers flat against the tray and secure them in place with the retention clasps 5 5 8 Replacing the InfiniBand Cards The InfiniBand cards are located on the GPU tray which is accessible from the rear of the DGX 1 Be sure you have identified the ...

Page 96: ... strap connected to the chassis ground and placing components on static-free work surfaces. 1. Turn off the DGX-1 and disconnect all network and power cabling. 2. Remove the GPU tray. a. Locate the locking levers for the GPU tray at the rear of the DGX-1. There are two sets of locking levers. The locking levers for the GPU tray are the top set. b. Rotate the retention clasps inward towards the center of the...

Page 97: ...n clasps they may break. 3. Set the GPU tray on a clean work surface. WARNING: Do not attempt to move or lift the GPU tray by grabbing the U-bolts. To properly move the GPU tray, grab the tray by the outer edges of the assembly and support it from underneath, taking care not to damage any components. 4. At the top edge of the bracket for the InfiniBand card that you want to replace, rotate the retention cla...

Page 98: ...5. Firmly grasp the InfiniBand card and lift it straight up out of the PCIe slot. 6. Position the replacement InfiniBand card over the empty PCIe slot and insert it into the slot. 7. Swing the retention clasp over the bracket to secure the bracket in place. ...

Page 99: ... that the card was installed correctly and is recognized by the system: lspci | grep -i mellanox
The output should show all four InfiniBand cards. Example:
05:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
0c:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
84:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
8b:00.0 Infiniba...
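The lspci check above can also be scripted when several systems need the same verification. The following is a minimal Python sketch, assuming the usual lspci line layout (`bus:device.function class: vendor device`). The function name `find_mellanox_ib`, the constant `EXPECTED_IB_CARDS`, and the sample text are illustrative and not part of the DGX-1 software.

```python
import re
from typing import List

# The guide states that the output should show all four InfiniBand cards.
EXPECTED_IB_CARDS = 4

def find_mellanox_ib(lspci_output: str) -> List[str]:
    """Return the PCI addresses of Mellanox InfiniBand controllers."""
    addresses = []
    for line in lspci_output.splitlines():
        lowered = line.lower()
        # Case-insensitive match, mirroring `grep -i mellanox`.
        if "mellanox" not in lowered or "infiniband controller" not in lowered:
            continue
        # Address format without a PCI domain prefix, e.g. "05:00.0".
        match = re.match(r"^([0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s", line)
        if match:
            addresses.append(match.group(1))
    return addresses

# Sample text modeled on the example output in the guide.
SAMPLE = """\
05:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
0c:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
84:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
8b:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
"""

if __name__ == "__main__":
    found = find_mellanox_ib(SAMPLE)
    print("all cards present:", len(found) == EXPECTED_IB_CARDS)
```

On a live system, the sample text could be replaced with the output of `subprocess.run(["lspci"], capture_output=True, text=True).stdout`.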

Page 100: ... OFED software was installed correctly: modinfo mlx5_core | grep -i version | head -1
Example output: Version: 3.4-1.0.0
DGX-1 OS release 1.0 should have OFED software 3.2. DGX-1 OS release 2.0 should have OFED software 3.4.
4. Restart the InfiniBand services so that the new card is recognized.
a. Restart the InfiniBand service: sudo service openibd restart
b. Restart the Service Manager service: sudo service open...
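The OFED version check above compares a reported version against the release the system runs. As a rough illustration, the Python sketch below parses the first version line and compares its major and minor numbers with the values the guide lists (3.2 for OS release 1.0, 3.4 for OS release 2.0). The names `EXPECTED_OFED`, `parse_ofed_major_minor`, and `ofed_matches_release` are hypothetical, and the exact layout of the version string is an assumption.

```python
import re
from typing import Optional, Tuple

# From the guide: OS release 1.0 -> OFED 3.2, OS release 2.0 -> OFED 3.4.
EXPECTED_OFED = {"1.0": (3, 2), "2.0": (3, 4)}

def parse_ofed_major_minor(modinfo_line: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from a line such as 'version: 3.4-1.0.0'."""
    match = re.search(r"version:?\s+(\d+)\.(\d+)", modinfo_line, re.IGNORECASE)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def ofed_matches_release(modinfo_line: str, os_release: str) -> bool:
    """True if the parsed OFED version is the one expected for os_release."""
    expected = EXPECTED_OFED.get(os_release)
    if expected is None:
        return False  # Unknown release: nothing to compare against.
    return parse_ofed_major_minor(modinfo_line) == expected

if __name__ == "__main__":
    print(ofed_matches_release("version: 3.4-1.0.0", "2.0"))
```

Releases not listed in `EXPECTED_OFED` are reported as a mismatch rather than guessed, so the table would need extending for other DGX OS versions.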

Page 101: ...are: cat /sys/class/infiniband/mlx5*/fw_ver
12.17.1010
12.17.1010
12.17.1010
12.17.1010
7. Verify the physical port state for the InfiniBand cards: ibstat
In the output text, verify that the Physical State for each card with a cable connection is LinkUp and that the port for the card is configured with a GUID. The following example output shows one card in a non-connected state and three cards in a conne...

Page 102: ...48a0703001effde
    System image GUID: 0x248a0703001effde
    Port 1:
        State: Initializing
        Physical state: LinkUp
        Rate: 100
        Base lid: 65535
        LMC: 0
        SM lid: 0
        Capability mask: 0x2651e848
        Port GUID: 0x248a0703001effde
        Link layer: InfiniBand
CA 'mlx5_3'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.17.1010
    Hardware version: 0
    Node GUID: 0x7cfe900300118f22
    System image GUID: 0x7cfe900300118f22
    Port 1:
        State: Initializing ...
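Reading the ibstat output by eye is workable for four cards, but the same check (Physical state is LinkUp and a Port GUID is present) can be expressed in a few lines of code. The Python sketch below assumes single-port adapters, as in the example output, and treats an all-zero GUID as not configured. Both are assumptions, and the function names and sample values are illustrative only.

```python
import re
from typing import Dict, List, Tuple

def parse_ibstat(text: str) -> List[Dict[str, str]]:
    """Split ibstat output into one flat dict per channel adapter (CA)."""
    adapters: List[Dict[str, str]] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        ca = re.match(r"^CA '([^']+)'$", line)
        if ca:
            current = {"name": ca.group(1)}
            adapters.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            if value.strip():  # Skip section headers such as "Port 1:".
                current[key.strip()] = value.strip()
    return adapters

def port_summary(adapters: List[Dict[str, str]]) -> Dict[str, Tuple[str, bool]]:
    """Map CA name -> (physical state, whether a non-zero Port GUID is set)."""
    summary = {}
    for ca in adapters:
        guid = ca.get("Port GUID", "")
        has_guid = guid not in ("", "0x0000000000000000")
        summary[ca["name"]] = (ca.get("Physical state", "unknown"), has_guid)
    return summary

# Illustrative sample: the mlx5_0 entry and all port values are made up.
SAMPLE = """\
CA 'mlx5_0'
    CA type: MT4115
    Number of ports: 1
    Port 1:
        State: Down
        Physical state: Polling
        Port GUID: 0x248a0703001effd6
CA 'mlx5_3'
    CA type: MT4115
    Number of ports: 1
    Port 1:
        State: Initializing
        Physical state: LinkUp
        Port GUID: 0x7cfe900300118f22
"""

if __name__ == "__main__":
    for name, (state, has_guid) in port_summary(parse_ibstat(SAMPLE)).items():
        print(name, state, "GUID set" if has_guid else "no GUID")
```

A card without a cable is expected to report a state other than LinkUp, so the summary is meant for comparison against the known cabling rather than as a pass/fail result.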

Page 103: ...is method is available only for software versions that are available as ISO images for download. Alternatively, you can update the DGX-1 software by performing a network update from a local repository. This method is available only for software versions that are available for over-the-network updates. 6.1.1 Re-Imaging the System. WARNING: This process destroys all data and software customizations that you...

Page 104: ...s and Upgrade Guide, which you can obtain from the Enterprise Support site. 6.2 Installing Docker Containers. This method applies to Docker containers hosted on the NVIDIA DGX Container Registry and requires that you have an active DGX Cloud Services account. 1. On a system with internet access, log in to the DGX Container Registry by entering the following command and credentials: docker login nvcr.io U...

Page 105: ...7. Verify the image is on your system: docker images ...
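The final verification step above relies on spotting the image in the `docker images` listing. As a small illustration, the Python sketch below checks that listing for a given repository and tag, assuming the default column layout (REPOSITORY, TAG, IMAGE ID, CREATED, SIZE). The repository name, tag, and image ID in the sample are hypothetical placeholders, not values from the guide.

```python
from typing import List, Tuple

def list_images(docker_images_output: str) -> List[Tuple[str, str]]:
    """Return (repository, tag) pairs from default `docker images` output."""
    rows = docker_images_output.strip().splitlines()
    images = []
    for row in rows[1:]:  # The first row is the column header.
        fields = row.split()
        if len(fields) >= 2:
            images.append((fields[0], fields[1]))
    return images

def image_present(docker_images_output: str, repository: str, tag: str) -> bool:
    """True if repository:tag appears in the listing."""
    return (repository, tag) in list_images(docker_images_output)

# Hypothetical listing used only to exercise the parser.
SAMPLE = """\
REPOSITORY                 TAG       IMAGE ID       CREATED        SIZE
nvcr.io/example/framework  17.12     0123456789ab   2 weeks ago    3.2GB
"""

if __name__ == "__main__":
    print(image_present(SAMPLE, "nvcr.io/example/framework", "17.12"))
```

Parsing by whitespace works here because the repository and tag columns never contain spaces; the later columns (such as CREATED) do, which is why only the first two fields are used.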

Page 106: ...vices: https://nvid.nvidia.com/enterpriselogin
NVIDIA Enterprise Support Email: enterprisesupport@nvidia.com
NVIDIA Enterprise Support Local Language Phone Numbers:
EMEA, Monday to Saturday, 9 AM to 7 PM CET: DE +49 3030806888; ES +34 911880035; FR +33 975186797; IT +39 0294757110; PL +48 223975013; RU +7 499 6092527, 11 AM to 9 PM MSK
JP: 0120 706 170, 9 AM to 6 PM JST
US: +1 408 486 2500, Monday to Friday, 8 AM to 5 PM PST
After hours eme...

Page 107: ... specified in this guide. Use of other products/components will void the UL Listing and other regulatory approvals of the product and may result in noncompliance with product regulations in the region(s) in which the product is sold. 8.1 Safety Warnings and Cautions. To avoid personal injury or property damage, before you begin installing the product, read, observe, and adhere to all of the following sa...

Page 108: ...nd similar commercial-type locations. The suitability of this product for other product categories and environments (such as medical, industrial, residential, alarm systems, and test equipment), other than an ITE application, may require further evaluation. 8.3 Site Selection. Choose a site that is: Clean, dry, and free of airborne particles (other than normal room dust). Well ventilated and away from sources of h...

Page 109: ...ower supply. Some power supplies in servers use Neutral Pole Fusing. To avoid risk of shock, use caution when working with power supplies that use Neutral Pole Fusing. The power supply in this product contains no user-serviceable parts. Do not open the power supply. Hazardous voltage, current, and energy levels are present inside the power supply. Return to manufacturer for servicing. When replacing a hot p...

Page 110: ... the power supply. There are no serviceable parts in the power supply. Return to manufacturer for servicing. Power down the server and disconnect all power cords before adding or replacing any non-hot-plug component. When replacing a hot-plug power supply, unplug the power cord to the power supply being replaced before removing the power supply from the server. Caution: If the server has been running any...

Page 111: ...quipment in the rack should be such that a hazardous condition is not achieved due to uneven mechanical loading. Circuit Overloading: Consideration should be given to the connection of the equipment to the supply circuit and the effect that overloading of the circuits might have on overcurrent protection and supply wiring. Appropriate consideration of equipment nameplate ratings should be used when a...

Page 112: ...a problem, you should be aware of the possibility in case you're susceptible to nickel-related reactions. Battery Replacement. Caution: There is the danger of explosion if the battery is incorrectly replaced. When replacing the battery, use only the battery recommended by the equipment manufacturer. Dispose of batteries according to local ordinances and regulations. Do not attempt to recharge a battery. Do...

Page 113: ...Attach the covers to the chassis according to the product instructions. ...

Page 114: ...ice. NOTE: This equipment has been tested and found to comply with the limits for a Class A digital device, pursuant to part 15 of the FCC Rules. These limits are designed to provide reasonable protection against harmful interference when the equipment is operated in a commercial environment. This equipment generates, uses, and can radiate radio frequency energy and, if not installed and used in accordanc...

Page 115: ...stic environment this product may cause radio frequency interference, in which case the user may be required to take adequate measures. The product has been marked with the CE Mark to illustrate its compliance. This device complies with the following Directives: EMC Directive 2014/30/EU for Class A I.T.E. equipment; Low Voltage Directive 2014/35/EU for electrical safety; RoHS Directive 2011/65/EU for haz...

Page 116: ...This is a Class A product. In a domestic environment this product may cause radio interference, in which case the user may be required to take corrective actions. VCCI-A. 9.6 Australia. RCM. 9.7 China. RoHS Material Content ...

Page 118: ...9.8 Israel. SII. 9.9 South Korea. KC. Class A Equipment (Industrial Broadcasting and Communication Equipment). This equipment is Industrial (Class A) electromagnetic wave suitability equipment, and seller ...

Page 119: ...or user should take notice of it, and this equipment is to be used in the places except for home. 9.10 India. BIS ...

Page 120: ...oduct described in this guide will be suitable for any specified use without further testing or modification. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer's sole responsibility to ensure the product is suitable and fit for the application planned by customer and to do the necessary testing for the application in order to avoid a default of the appl...
