Urika®-GX System Administration Guide
(2.2.UP00)
S-3016
Page 1: ...Urika-GX System Administration Guide (2.2.UP00) S-3016...
Page 2: ...tation (SMW) 22
3.4 Hardware Supervisory System (HSS) 22
3.4.1 Hardware Supervisory System (HSS) Architecture Overview 24
3.4.2 The xtdiscover Command 25
3.4.3 Hardware Supervisory System (HSS) Component Loca...
Page 3: ...formation Using the xtdumpsys Command 43
3.5 Dual Aries Network Card (dANC) Management 43
3.6 Analyze Node Memory Dump Using the kdump and crash Utilities on a Node 44
3.7 Cray Lightweight Log Managemen...
Page 4: ...05
5 Resource Management 124
5.1 Manage Resources on Urika-GX 124
5.2 Use Apache Mesos on Urika-GX 126
5.2.1 Access the Apache Mesos Web UI 128
5.3 Use mrun to Retrieve Information About Marathon and...
Page 5: ...Default System Management Workstation (SMW) Passwords 223
7.9.5 Change LDAP Password on Urika-GX 224
7.9.6 Reset a Forgotten Password for the Cray Application Management UI 224
7.9.7 Reset an Administr...
Page 6: ...Framework 256
8.6 Clean Up Log Data 257
8.7 Diagnose and Troubleshoot Orphaned Mesos Tasks 258
8.8 Troubleshoot Common Analytic and System Management Issues 259
8.9 Troubleshoot mrun Issues 268
8.10 T...
Page 7: ...sh: At the end of a command line, indicates the Linux shell line continuation character (lines joined by a backslash are parsed as a single line). Do not type anything after the backslash or the continuati...
Page 8: ...RAYPAT, CRAYPORT, DATAWARP, ECOPHLEX, LIBSCI, NODEKARE. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINU...
Page 9: ...rvisory System (HSS): HSS is an integrated system of hardware and software components that are used for managing and monitoring the system. Cobbler: Cobbler is used on Urika-GX for provisioning and deploym...
Page 10: ...t network. The operational Ethernet network is used for ingesting user data. This network consists of a single-unit, 48-port GigE switch that provides dual 1GigE and/or dual 10GigE interfaces to the...
Page 11: ...ice versa is routed through the corresponding RC For additional information see the Urika GX Hardware Guide 2 3 File Systems Supported file system types on Urika GX include Internal file systems Hadoo...
Page 12: ...a GX hardware High speed network management network switches must not be modified as this network is internal to Urika GX Moving the system from the rack Cray supplies to customer provided racks is no...
Page 13: ...w Docker images is currently not supported on Urika-GX. For more information, contact Cray Support. Before installing any additional software on the Urika-GX system, a ticket should be opened with Cray Su...
Page 14: ...Tenancy on page 183 Tenant NameNode configuration is managed automatically by the Urika GX tenant management scripts Manually altering the configurations of the tenant NameNode is not supported Tenant...
Page 15: ...to the SMW as root: ssh root@hostname-smw 2. Display the current service mode by using one of the following options: Execute the urika-state command. This displays the current service mode as well as the...
Page 16: ...gent Subrack Control Board iSCB rRsSiI I 0 to 1 Dual Aries Network Card dANC There are up to 2 dANCs per sub rack accommodating up to 16 nodes rRsScC C 0 to 1 High Speed Network HSN cable The j name i...
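The component naming patterns listed above (rRsSiI for an iSCB, rRsScC for a dANC, each with an index of 0 to 1) can be checked mechanically. The following is a minimal sketch, not an HSS tool: the function name, the returned dictionary layout, and the error messages are illustrative assumptions, and only the two patterns visible in this excerpt are handled.

```python
import re

# Patterns taken from the naming table in this guide:
#   rRsSiI -> Intelligent Subrack Control Board (iSCB), I = 0 to 1
#   rRsScC -> Dual Aries Network Card (dANC),           C = 0 to 1
# The function name and the result layout are illustrative assumptions.
_COMPONENT_RE = re.compile(
    r"^r(?P<rack>\d+)s(?P<subrack>\d+)(?P<kind>[ic])(?P<index>\d+)$"
)
_KINDS = {"i": "iSCB", "c": "dANC"}


def parse_component(name):
    """Split an HSS component name into rack, sub-rack, type, and index."""
    match = _COMPONENT_RE.match(name)
    if match is None:
        raise ValueError(f"not an iSCB or dANC component name: {name!r}")
    index = int(match.group("index"))
    if index not in (0, 1):
        raise ValueError(f"index must be 0 or 1, got {index}")
    return {
        "type": _KINDS[match.group("kind")],
        "rack": int(match.group("rack")),
        "subrack": int(match.group("subrack")),
        "index": index,
    }


print(parse_component("r0s1i0"))
```

Other component types (nodes, HSN cables) follow their own patterns and would need additional expressions.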
Page 17: ...s in this publication be followed before making any changes/reconfigurations to the SMW, as well as before restarting the SMW. 3.3.1 Power On the System Management Workstation (SMW) The SMW can be turned...
Page 18: ...shut down the System Management Workstation SMW Procedure 1 Point a browser to the site specific iDRAC IP address such as https system smw ras The iDRAC console s login screen appears 2 Enter root and...
Page 19: ...Figure 1 iDRAC Login Screen 3 On the Quick Launch Tasks section of the iDRAC UI click on Power ON OFF link to control the SMW s power System Management S3016 19...
Page 20: ...leges. About this task: The components of the Cray system synchronize time with the System Management Workstation (SMW) through Network Time Protocol (NTP). By default, the NTP configuration of the SMW is con...
Page 21: ...out NTP, refer to the Linux documentation. 6. Sync the hardware clock: smw# hwclock --systohc 7. Verify that the SMW has jitter from the NTP server: smw# ntpq -p 3.3.5 Synchronize Time of Day on System Nodes Pre...
Page 22: ...estarts establishes communications with all external interfaces restores the proper state in the state manager and continues normal operation without user intervention For a scheduled or unscheduled s...
Page 23: ...itors and controls the rack power and communicates with all dANC controllers in the rack It sends a periodic heartbeat to the SMW to indicate rack health The rack controller connects to the dANC contr...
Page 24: ...rview HSS hardware on the Urika GX system consists of a System Management Workstation SMW which is a rack mounted Intel based server running CentOS along with an Ethernet network that connects the SMW...
Page 25: ...r provided block of IP address space. This information is used to create the /etc/hosts file and DHCP entries for the HSS network. This setup typically only needs to be done once, unless the address block...
Page 26: ...the database when needed. Thus, the dynamic system state persists between SMW boots. The state manager uses the Lightweight Log Manager (LLM). The log data from the state manager is written to /var/opt/cray/log...
Page 27: ...not typically result in a new mapping Since the operating system always uses NIDs HSS converts these to NIC IDs when sending them on to the HSS network and converts them to NIDs when forwarding event...
Page 28: ...ies the authorized_keys file for Rack Controllers RCs and dANCCs xtchecklink Checks HSN and PCIe link health xtclass Displays the network topology class for this system xtclear Clears component flags...
Page 29: ...ble interrupt to target nodes See the xtnmi 8 man page for more information xtpcimon Monitors health of PCIe channels for Urika GX systems rackfw Flashes all devices in the Urika GX system via out of...
Page 30: ...ate with the Cray network application specific integrated circuit ASIC xtbte_perf Determines the one hop connections between nodes It then performs BTE transfers over these one hop connections and det...
Page 31: ...d Monitors ANCC heartbeats emits heartbeat for the State Manager controls dANC power operations and monitors the iSCB s health Controller Vitality Check Daemon cvcd Monitors memory consumption CPU uti...
Page 32: ...dwidth low latency communication between all the compute processing elements of the system CAUTION xtbounce should never be executed when nodes are up as this command will not gracefully shut nodes do...
Page 33: ...etwork Card (dANC) will fail if any nodes under the component are in the ready state, unless the force option (-f) is used. An error message will indicate the reason for the failure. If the system is currentl...
Page 34: ...state HSS managers and the xtcli command ignore empty or disabled components Setting a selected component to the EMPTY state is typically done when a component usually a blade is physically removed B...
Page 35: ...ponent as follows: crayadm@smw> xtcli lock -u lock_number where lock_number is the value given when initiating the lock; it is also indicated in the xtcli lock show query. Unlocking does nothing to the sta...
Page 36: ...rrectly and debugging information is needed or to stop a node that is running incorrectly The sole purpose of the xtnmi command is to collect debug information from unresponsive nodes As soon as that...
Page 37: ...ormation request For more information see the rtr 8 man page Display routing information The system map option to rtr writes the current routing information to stdout or to a specified file This comma...
Page 38: ...e following circumstances On any single group system at any time even those listed above During a warmswap operation See the rtr 8 man page for additional information 3 4 19 Power Up a Rack or Dual Ar...
Page 39: ...7 ping nid00015 PING nid00015 10 128 0 16 56 84 bytes of data 64 bytes from nid00015 10 128 0 16 icmp_seq 1 ttl 64 time 0 032 ms 64 bytes from nid00015 10 128 0 16 icmp_seq 2 ttl 64 time 0 010 ms 3 4...
Page 40: ...he xtshow alert command for a cabinet does not display an alert for a node Similarly checking the status of a node does not detect an alert on a cabinet Show all alerts on the system crayadm smw xtsho...
Page 41: ...hardware components For more information see the xtshow 8 man page 3 4 28 Clear Component Flags Use the xtclear command to clear system information for selected components Type commands as xtclear opt...
Page 42: ...te the authorized_keys file for the Intelligent Subrack Control Boards (iSCBs). It must be run as root. The xtcc-ssh-keys command takes no options. When run, it invokes the user-selected text editor specifi...
Page 43: ...d name the components from which to collect data NOTE Only the crayadm account can execute the xtdumpsys command For more information see the xtdumpsys 8 man page 3 5 Dual Aries Network Card dANC Mana...
Page 44: ...and determining the cause of a crash Dumped image of the main memory exported as an Executable and Linkable Format ELF object can be accessed either directly during the handling of a kernel crash thro...
Page 45: ...ansports those messages to the SMW and places the messages into log files By default LLM has a log trimming mechanism enabled called xttrim 3 8 Urika GX Node Power Management A number of CLI scripts c...
Page 46: ...rocedure assumes that the instructions are being carried out on a 3 sub rack 48 node system About this task The instructions documented in the procedure can be used for powering up the Urika GX system...
Page 47: ...lts when the preceding command is executed use the xtclear_alert or xtclear_reserve command to clear those statuses 6 Ensure that the ANCCs are powered on by executing the command root smw xtalive All...
Page 48: ...le d smw_deploy_profile sh 18 Confirm that the required nodes have a mount point by executing the following command root smw pall df k grep lustre sort 19 Start up the analytic applications using the...
Page 49: ...smw urika stop For more information see the urika stop man page 3 Execute the urika state command to ensure that analytic applications have shut down The HA Proxy service may still be running root sm...
Page 50: ...displayed in the Flags column of the results when the following commands are executed root smw xtcli status t node s0 Network topology class 0 Network type Aries Nodeid Service Core Arch Comp state F...
Page 51: ...specify the node IDs separated by commas The list of services that are started using this command depends on the service mode default or secure that the system is running in This command starts all th...
Page 52: ...ode IDs separated by commas To stop a specific service use the s or service option to specify the name of that service This command stops all the running analytic services if used without any options...
Page 53: ...page. Display status of services: urika-state. Sequence of Execution of Scripts: The urika-state command can be used to view the state of services running on Urika-GX. Use this command to ensure that the foll...
Page 54: ...erequisites This procedure requires root privileges Before performing this procedure use the urika state command to ensure that the system is operating in the service mode that supports using InfluxDB...
Page 55: ...ration is changed from 0 (forever) to 2 weeks (336 hours): alter retention policy default on Cray Urika GX Duration 2w show retention policies on Cray Urika GX name duration shardGroupDuration replicaN def...
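When choosing a retention duration, it helps to confirm what a duration literal such as 2w amounts to in hours (2 weeks is 14 × 24 = 336 hours). The helper below is a small sketch for that arithmetic only; the function name is hypothetical, it does not talk to InfluxDB, and it handles only the s, m, h, d, and w units (sub-second units are deliberately rejected).

```python
import re

# Seconds per unit for the duration suffixes handled by this sketch.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def duration_to_hours(literal):
    """Convert a duration literal such as '2w' or '1h30m' to hours."""
    total_seconds = 0
    position = 0
    for match in re.finditer(r"(\d+)([smhdw])", literal):
        if match.start() != position:
            raise ValueError(f"unsupported duration literal: {literal!r}")
        total_seconds += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        position = match.end()
    # Reject empty input and any trailing characters that were not consumed.
    if position == 0 or position != len(literal):
        raise ValueError(f"unsupported duration literal: {literal!r}")
    return total_seconds / 3600


print(duration_to_hours("2w"))
```

A value of 0 for the retention duration is treated by the guide as "forever" and is not a case this helper needs to convert.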
Page 56: ...t nid00008 ZooKeeper Secondary HDFS NameNode Mesos Master Oozie Hive Server2 Spark Thrift Server Hive Metastore WebHCat Postgres database Marathon YARN Resource Manager Collectl nrpe kubelet nid00014...
Page 57: ...pe kubelet nid00030 Login node 2 HUE HA Proxy Collectl Service for flexing a YARN cluster Grafana InfluxDB nrpe kubelet Table 10 Urika GX Service to Node Mapping 3 Sub rack System Node ID s Service s...
Page 58: ...1 HUE HA Proxy Collectl Urika GX Applications Interface UI Jupyter Notebook Service for flexing a YARN cluster Documentation and Learning Resources UI nrpe kubelet Kubernetes Controller nid00031 nid00...
Page 59: ...iven host Because of the lightweight nature of containers users can run more containers on a given hardware combination than if using virtual machines Images shipped with the system are managed by Doc...
Page 60: ...name as the Spark job's name: spark-submit --class org.apache.spark.examples.SparkPi --conf spark.app.name=spark-pi local:///opt/spark/examples/target/scala-2.11/jars/spark-examples_2.11-2.2.0-k8s-0.5.0.jar...
Page 61: ...yspark pi jars local opt spark examples target scala 2 11 jars spark examples_2 11 2 2 0 k8s 0 5 0 jar local opt spark examples src main python pi py Execute the kubectl logs pod_name command and grep...
Page 62: ...s 1: Number of cores requested for the driver. This should be increased if a job does a lot of work in the driver (e.g., aggregations, result collection). spark.driver.memory 16g: Amount of memory requested pe...
Page 63: ...r Configures the tenant specific Spark Thrift Server and launches the Spark Thrift Server spark job in the tenant s Kubernetes namespace stop thriftserver Stops the tenant specific Spark Thrift Server...
Page 64: ...ervice To see the state of the Kubernetes POD containing the Metastore server root login1 kubectl get pods n TENANT VM metastore TENANT VM show all The status should be Running and the Status 2 2 indi...
Page 65: ...CAUTION iSCB CLI commands other than the status command should NOT be executed on the Urika GX system unless advised by Cray Support For more information contact Cray support The capmc ux nid cobbler...
Page 66: ...ate and send alert notifications For more information refer to Tenant Management on page 189 Procedure 1 Access the Nagios UI by pointing a browser at http machine smw nagios 2 Enter crayadm as the us...
Page 67: ...his procedure requires root privileges. Ensure that the mod_ssl package is installed on the system. If it is not, install it by executing the following as root on the SMW: yum install -y mod_ssl About this...
Page 68: ...called a Distinguished Name or a DN There are quite a few fields but you can leave some blank For some fields there will be a default value If you enter the field will be left blank Country Name 2 le...
Page 69: ...l x509 -req -days 365 -in certrequest.csr -signkey keyfile.key -out certfile.crt This should produce output saying that the signature was OK and that it retrieved the private key. Self-signing will result i...
Page 70: ...is returned add a security exception The Apache test web page will be displayed upon success The Nagios Core server can now be accessed by directing a web browser at https serverName nagios More detai...
Page 71: ...cations can be sent out to this contact Valid options are a combination of one or more of the following w Notify on WARNING service states u Notify on UNKNOWN service states c Notify on CRITICAL servi...
Page 72: ...ontact2 define service name gluster service use generic service notifications_enabled 1 notification_period 24x7 notification_options w u c r f s notification_interval 120 register 0 _gluster_entity S...
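The notification_options value in a service definition is a list of single-letter flags. As a reading aid, the sketch below expands such a value into plain descriptions. The w, u, and c meanings come from the text above; the r, f, and s meanings follow the standard Nagios Core object definitions. The function name is hypothetical and the special value n (no notifications) is not handled.

```python
# Meanings of the service notification flags. w, u, and c are described in
# this guide; r, f, and s follow the Nagios Core object definitions.
SERVICE_NOTIFICATION_OPTIONS = {
    "w": "WARNING service states",
    "u": "UNKNOWN service states",
    "c": "CRITICAL service states",
    "r": "service recoveries (OK states)",
    "f": "flapping start/stop",
    "s": "scheduled downtime start/end",
}


def describe_options(value):
    """Expand a comma-separated notification_options value into descriptions."""
    flags = [flag.strip() for flag in value.split(",") if flag.strip()]
    unknown = [flag for flag in flags if flag not in SERVICE_NOTIFICATION_OPTIONS]
    if unknown:
        raise ValueError(f"unrecognized notification option(s): {unknown}")
    return [SERVICE_NOTIFICATION_OPTIONS[flag] for flag in flags]


for line in describe_options("w,u,c,r,f,s"):
    print("Notify on", line)
```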
Page 73: ...gios server sends notifications during status changes to the mail addresses specified in the file It is important to note that By default the system ensures three occurrences of the event before sendi...
Page 74: ...g file 6 Restart the Nagios service service nagios restart 4 2 4 Configure Email Alerts Prerequisites This procedure requires root privileges About this task This procedure provides instructions for u...
Page 75: ...vice that alerting needs to be set up for The following example shows how to set up the service for aggregate CPU usage define service use local service host_name localhost service_description Aggrega...
Page 76: ...e warning and critical levels are digits up to 4 decimal points between 0 0 and 1 0 which represent the percentage of the metric The following plugins are configurable on the SMW at the default path u...
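Because warning and critical levels must lie between 0.0 and 1.0 and carry at most four decimal places, a value can be sanity-checked before it is written into a plugin definition. This is a minimal sketch of that rule only; the function name is an assumption and no Nagios or NRPE file is read or written.

```python
from decimal import Decimal, InvalidOperation


def parse_level(text):
    """Validate a warning or critical level: 0.0 to 1.0, up to 4 decimals."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        raise ValueError(f"not a number: {text!r}")
    if not value.is_finite():
        raise ValueError(f"not a finite number: {text!r}")
    if not (Decimal("0") <= value <= Decimal("1")):
        raise ValueError(f"level must be between 0.0 and 1.0: {text!r}")
    # A negative exponent counts the digits after the decimal point.
    if -value.as_tuple().exponent > 4:
        raise ValueError(f"at most 4 decimal places are allowed: {text!r}")
    return value


print(parse_level("0.8525"))
```

Using Decimal rather than float keeps the decimal-place check exact for values typed as text.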
Page 77: ...cfg file 7 Restart the Nagios service service nagios restart 8 Switch to the usr local nagios etc directory 9 Modify the nrpe conf configuration file as needed a Define the command name and the path...
Page 78: ...ources All Dashboards are owned by a particular Organization User A User is a named account in Grafana A user can belong to one or more Organizations and can be assigned different levels of privileges...
Page 79: ...oop and Spark dashboards to display data refer to Update the InfluxDB Data Retention Policy on page 54 Reducing the amount of data retained makes Grafana dashboards display faster 4 4 Default Grafana...
Page 80: ...st traffic generated by CGE is not shown here Operational network Bytes sec In Out Displays the overall operational network traffic information for all nodes Management network Bytes sec In Out Displa...
Page 81: ...sec In Out Displays the overall management network traffic information for each node FILE SYSTEM UTILIZATION Root File System Hard Drive dev sda Reads Writes Bytes Sec 200MB sec max Displays informati...
Page 82: ...d by all the compute nodes FILE SYSTEM READS WRITE BYTES SEC Root File System Hard Drive dev sda Reads Writes Bytes Sec 200MB sec max Displays information about the usage of memory on the root file sy...
Page 83: ...Management network Bytes sec In Out Displays the overall management network traffic information for compute nodes Figure 7 Compute Node Performance Statistics Hadoop Application Metrics This section c...
Page 84: ...ime The Y axis represents a linear number AllocatedContainers Displays the number of allocated YARN containers by all the jobs in the hadoop cluster The Y axis represents a linear number DataNode Byte...
Page 85: ...or GB used by all the I O and login nodes FILE SYSTEM READS WRITE BYTES SEC Root File System Hard Drive dev sda Reads Writes Bytes Sec 200MB sec max Displays information about the usage of memory on...
Page 86: ...Displays graphs representing statistical data related to network CPU and I O utilization for individual Urika GX nodes The node s hostname can be selected using the hostname drop down provided on the...
Page 87: ...elected node NFS HDFS Lustre Percentage Used Displays the percentage of NFS HDFS and Lustre used by the selected node File System Percent Used Displays the percentage of file system used by the select...
Page 88: ...ys the management network s data rate NETWORK PACKET DROPS SEC AND ERRORS SEC Operational Network Dropped and Errors Per Sec Displays the number of dropped packets and errors per second for the operat...
Page 89: ...ber of tasks and X axis displays the start stop time of the task for a particular Spark Job This section contains the following graphs Completed Tasks Per Job The approximate total number of tasks tha...
Page 90: ...ted will always be greater than or equal to used JVM Memory Usage Max Represents the maximum amount of memory in bytes that can be used for memory management Its value may be undefined The maximum amo...
Page 91: ...r admin admin true 3. Edit the /etc/influxdb/influxdb.conf file as root: vi /etc/influxdb/influxdb.conf (change auth to be true, save the file, and exit), then systemctl restart influxdb. At this point, Grafana dashboards wi...
Page 92: ...ikaGXHadoop 906M CrayUrikaGXSpark 21M _internal 4 Connect to InfluxDB to view the current data retention policy bin influx Visit https enterprise influxdata com to register for updates InfluxDB server...
Page 93: ...iguration file contains instructions for making such updates 4 8 Change the Default Timezone Displayed on Grafana Prerequisites This procedure requires root privileges Before performing this procedure...
Page 94: ...Figure 14 Grafana Login Screen 3 Select Home admin Preferences Figure 15 Preferences Interface System Monitoring S3016 94...
Page 95: ...rvice mode that supports using Grafana For more information see the urika state man page and refer to Urika GX Service Modes on page 177 About this task In addition to the default set of dashboards ad...
Page 96: ...y of access or at http://hostname-login2:3000. 2. Log on to Grafana by entering LDAP credentials or the credentials of the default Grafana account (username: admin, password: admin). Figure 17...
Page 97: ...vigation menu 5 Add panels graphs rows to the new dashboard and customize as needed 6 Optional Update the dashboard s default settings using Settings from the configuration gear at the top of the inte...
Page 98: ...o the default set of panels administrators can add additional panels to the Dashboard of the Grafana UI as described in the following instructions Procedure 1 Access the Grafana UI using the Urika GX...
Page 99: ...4 Select Add Panel graph from the green menu on the left side of the Dashboard Figure 22 Add a New Graph to Grafana Dashboard 5 Click on the title of the new panel and select edit as shown in the fol...
Page 100: ...ites This procedure requires root privileges Before performing this procedure use the urika state command to ensure that the system is operating in the service mode that supports using InfluxDB For mo...
Page 101: ...Prerequisites This procedure requires root access About this task Access to the iSCB module is made through an SSH password less login Once logged on the iSCB status command can be used to monitor the...
Page 102: ...Executable and Linkable Format ELF object can be accessed either directly during the handling of a kernel crash through proc vmcore or it can be automatically saved to a locally accessible file system...
Page 103: ...can be used to retrieve various pieces of information Procedure 1 Log on to the SMW as root 2 Execute the urika check platform command as shown in the following examples Various HSS checks Rack contro...
Page 104: ...Ethernet port and a dedicated logical serial port assigned for command and power management access The iSCB module is typically managed over a secure private management network Cray recommends the iSC...
Page 105: ...s files /etc/hosts
192.168.2.10 r0s0i0 rx0y0s0i0 iscb_r0s0i0
192.168.2.11 r0s1i0 rx0y0s1i0 iscb_r0s1i0
192.168.2.12 r0s2i0 rx0y0s2i0 iscb_r0s2i0
192.168.2.13 r0s3i0 rx0y0s3i0 iscb_r0s3i0
iSCB Password...
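The four iSCB host entries shown above follow a regular pattern: the last octet of the address and the sub-rack number increase together. The sketch below regenerates those lines for rack 0 so they can be compared against an existing hosts file. The function name and its parameters are illustrative assumptions, and the pattern is inferred only from the four entries in this excerpt.

```python
def iscb_hosts_lines(subracks=4, first_host_octet=10):
    """Build the iSCB host entries for rack 0, one line per sub-rack.

    The 192.168.2.x addressing and the three aliases per entry are copied
    from the example entries in this guide; other racks are not covered.
    """
    lines = []
    for subrack in range(subracks):
        address = f"192.168.2.{first_host_octet + subrack}"
        aliases = f"r0s{subrack}i0 rx0y0s{subrack}i0 iscb_r0s{subrack}i0"
        lines.append(f"{address} {aliases}")
    return lines


print("\n".join(iscb_hosts_lines()))
```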
Page 106: ...ID LED log Display event logs by specified log level all info critical and warning node Display status of all or specified nodes nvram Display or update the configuration parameters stored in NVRAM p...
Page 107: ...7 0 0 0 0 5 r0s0n5 on 0070 1 08 00 1e 67 b5 2b f4 10 10 128 8 0 0 0 0 6 r0s0n6 on 0070 1 08 00 1e 67 b5 28 f7 10 10 128 9 0 0 0 0 7 r0s0n7 on 0070 1 08 00 1e 67 b5 28 bb 10 10 128 10 0 0 0 0 8 r0s0n8...
Page 108: ...nection The asterisk indicates the current connection Syntax console kill slot_number Parameters kill slot_number Kill the specified CLI connection Example r0s0i0 iscb console No Auth Remote Timeout I...
Page 109: ...0 PM Dev LM25066 AA power on 12V 12 1 V Watt 5 0 W Ver 2 stat 0103h htbeat 59 dANC 22 C 55 Aries0 29 C 85 Aries1 30 C 85 AOC0 0 C 1 AOC1 0 C 1 AOC2 0 C 1 AOC3 0 C 1 AOC4 0 C 1 ok r0s0i0 iscb danc cycl...
Page 110: ...M1276 3 Power 12 0 A 153 W Peak 558 W Average 477 W Energy 13520 6 Wh in last 101865secs 28h 17m 45s ok r0s0i0 help Display available commands and usage Syntax help command Parameters command Displays...
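The energy figures in the example output above are mutually consistent, which gives a quick way to cross-check a reading: 101865 seconds is 28h 17m 45s, and the energy divided by that interval gives the average power. The sketch below reproduces the arithmetic. It assumes the energy value reads 13520.6 Wh (the decimal point is not visible in this excerpt), and the function names are hypothetical.

```python
def format_duration(seconds):
    """Render a number of seconds as hours, minutes, and seconds."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


def average_watts(energy_wh, seconds):
    """Average power in watts for an energy total over an interval."""
    return energy_wh / (seconds / 3600)


# Values from the example output; 13520.6 Wh is an assumed reading.
print(format_duration(101865))
print(int(average_watts(13520.6, 101865)))
```

The truncated result matches the Average value of 477 W shown in the example.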
Page 111: ...lasterr ok r0s0i0 led Turn Off the green red LED and the node s ID LED Syntax led on off node all Parameters on Turn on the specific nodes idled If node is already turned on there will be no action of...
Page 112: ...through 16 all Display status for all nodes Example r0s0i0 iscb node No Name Pwr LED GPU BMC PID Sns GPUMap Watt Tmrg Console 0 r0s0n0 on off off 1 08 0070 4 0 0000 108 63 idle 1 r0s0n1 on off off 1...
Page 113: ...Addr 10 11 0 20 255 255 0 0 gw 10 11 0 1 Time Date TZ CST6CDT TimeServer 192 168 0 1 Event Log Level info Remote Log Level info LogServer 192 168 0 1 Port 514 SNMP Trap Level off Community public Rece...
Page 114: ...urn on the specific nodes If node is already turned on there will be no action off Turn off the specific nodes If node is already turned off there will be no action cycle Turn off then turn on the spe...
Page 115: ...t dur min max count err last_access blade0 0 5 3 5 10775 0 48781 pm0 0 5 4 5 10774 0 48781 blade1 0 4 3 5 10775 0 48782 pm0 0 5 4 5 10774 0 48782 blade2 0 4 4 5 10775 0 48783 pm0 0 5 4 5 10774 0 48783...
Page 116: ...4 16167 0 48782 dANC1 0 3 3 4 16156 0 48782 dANC1 pm 0 3 3 4 16167 0 48782 ok r0s0i0 psu Display status power consumption and energy data of all or specified PSUs Syntax psu psuno Parameters psuno PSU...
Page 117: ...p sensor data from specified node Syntax pump node Parameters node Specified node 1 through 16 Example r0s0i0 iscb pump 1 slot 1 name r0s0n1 present on power on pumps no pump sensor ok r0s0i0 reboot R...
Page 118: ...E cluster Not implemented for Urika GX Default is off acelog level If ace is set generate ace related log of defined level info warning critical activefan on off Dynamic Fan Control by node s health s...
Page 119: ...lade exceeds 85 C the blade health status is set to warning gpu type slotmap GPU Co proc type and PCI slot map Type is one of auto m20xx k10 k20 k40 k80 and phi gputemp temp The critical temperature o...
Page 120: ...abinet subrack ID Type is one of SR5110 SR6110 SR8104 SR8204 SR8202 SR10216 Cabinet ID is 0 127 Subrack ID is 0 7 shelfname name Subrack alias name 14 chars max shutdown level Event level of shutdown...
Page 121: ...arning and higher level event logs Example r0s0i0 iscb status iSCB Status Node 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 Total Power ID LED Console BMC Temp C 63 61 66 61 63 60 64 62 67 68 67 1...
Page 122: ...dle 0 0 0 0 0 7003 telnet ttyUSB2 idle 0 0 0 0 0 7004 telnet ttyUSB3 idle 0 0 0 0 0 7005 telnet ttyUSB4 idle 0 0 0 0 0 7006 telnet ttyUSB5 idle 0 0 0 0 0 7007 telnet ttyUSB6 idle 0 0 0 0 0 7008 telnet...
Page 123: ...iscb 1 0 bin 100 17MB 8 8MB s 00 02 r0s0i0 ssh root r0s0i0 iscb updatefw ok r0s0i0 ver Display firmware version and build date Syntax ver Parameters None Example r0s0i0 iscb ver iSCB Ver 1 1 11 53 51...
Page 124: ...obs to a Mesos cluster frameworks register with Mesos Mesos offers resources to registered frameworks Frameworks can either choose to accept or reject the resource offer from Mesos If the resource off...
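The offer cycle described above (Mesos offers resources, and a framework accepts or rejects each offer) can be pictured with a toy model. The sketch below is not the Mesos API: the Offer and Requirement classes, the decide function, and the numbers are all hypothetical, and real frameworks weigh many more factors than CPU and memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """A hypothetical resource offer for one node."""
    node: str
    cpus: int
    mem_mb: int


@dataclass(frozen=True)
class Requirement:
    """What a hypothetical framework needs before it will launch work."""
    cpus: int
    mem_mb: int


def decide(offer, need):
    """Accept the offer only if it satisfies the whole requirement."""
    if offer.cpus >= need.cpus and offer.mem_mb >= need.mem_mb:
        return "accept"
    return "reject"


need = Requirement(cpus=16, mem_mb=65536)
for offer in (Offer("nodeA", 32, 131072), Offer("nodeB", 8, 131072)):
    print(offer.node, decide(offer, need))
```

In this model a rejected offer simply remains available to be offered to another framework.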
Page 125: ...e Urika GX scripts for flexing clusters and the mrun command do not submit their jobs unless they know the resource requirement is satisfied Once the flex up request is successful YARN uses its own qu...
Page 126: ...plications and or frameworks It lies between the application layer and the operating system and simplifies the process of managing applications in large scale cluster environments while optimizing res...
Page 127: ...ompute nodes YARN Resource Manager HDFS NameNode Secondary HDFS NameNode Hadoop Application Timeline Server Hadoop Job History Server Spark History Server Oozie For services like Mesos Masters and Mar...
Page 128: ...to verify that the Mesos service is running Check the service mode by executing urika service mode to ensure that the system is running in the service mode required for Mesos to run About this task T...
Page 129: ...md system services must be running for mrun CGE to work If either service is stopped or disabled mrun will no longer be able to function The mrun command needs to be executed from a login node Some ex...
Page 130: ...port 2016 133 03 50 28 235572 Nodes 20 CPUs 320 480 user user cmd cge server nid00032 urika com startedAt 2016 05 12T17 05 53 573Z nid00028 urika com startedAt 2016 05 12T17 05 53 360Z nid00010 urika...
Page 131: ...0x05 nid00005 32 515758 idle 6 0x06 nid00006 32 515758 idle 7 0x07 nid00007 32 515758 busy 8 0x08 nid00008 32 515758 busy 9 0x09 nid00009 32 515758 busy 10 0x0a nid00010 32 515758 busy 11 0x0b nid0001...
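The node listing above pairs each node ID with a hexadecimal form and a hostname (for example 10, 0x0a, nid00010). The helpers below sketch that formatting for use in scripts that build node lists. The function names are assumptions, and the five-digit, zero-padded hostname format is inferred from the hostnames shown in this guide.

```python
def nid_hostname(nid):
    """Format a node ID as a hostname such as nid00010."""
    if not 0 <= nid <= 99999:
        raise ValueError(f"node ID out of range: {nid}")
    return f"nid{nid:05d}"


def nid_hex(nid):
    """Format a node ID in the two-digit hexadecimal form, such as 0x0a."""
    return f"0x{nid:02x}"


def hostname_to_nid(hostname):
    """Recover the numeric node ID from a hostname such as nid00015."""
    if not (hostname.startswith("nid") and hostname[3:].isdigit()):
        raise ValueError(f"not a node hostname: {hostname!r}")
    return int(hostname[3:])


for nid in (5, 10):
    print(nid, nid_hex(nid), nid_hostname(nid))
```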
Page 132: ...lines are ignored NCMDServer nid00000 MesosServer localhost same as host MesosPort 5050 MarathonServer localhost MarathonPort 8080 DebugFLAG False same as debug VerboseFLAG False same as verbose JobT...
Page 133: ...cation instances as needed In addition Cray developed scripts for starting a cluster of YARN Node Managers are also launched through Marathon Before using Marathon ensure that the system is running in...
Page 134: ...accessed at the port it runs on, i.e., at http://hostname-login1:8080 or http://hostname-login2:8080. Figure 24 Urika-GX Applications Interface...
Page 135: ...UI Marathon also enables creating applications from the UI via the Create Application button which displays the New Application pop up Figure 26 Create an Application Using the Marathon UI Resource Ma...
Page 136: ...cedure requires version 10 11 of the operating system. About this task: Cray recommends that administrators start the Spark Thrift server; however, users can use the following instructions if th...
Page 137: ...tem is running in the service mode that allows use of the Cray Application Management UI Execute the urika state or urika service mode commands to check the service mode For more information refer to...
Page 138: ...uick Filters UI elements are used at the same time only jobs that match the selected quick filter will be displayed The table displayed on the UI contains information about submitted jobs and contains...
Page 139: ...Finished which can be used to view and download logs that help identify whether or not the Spark job succeeded Selecting this link will present a login screen where users will need to enter their LDAP...
Page 140: ...See the mount 8 and dvs 5 man pages for more information Cray DVS uses the Linux supplied VFS interface to process file system access operations This allows DVS to project any POSIX compliant file sy...
Page 141: ...failover and failback for parallel modes The topic describes how it works and includes example console messages DVS periodic sync promotes data and application resiliency and is more efficient than t...
Page 142: ...16 bit must be 0 or 1 for SET Retrieves sets the deferopens value for a file on a DVS mount DVS_GET_FILE_KILLPROCESS DVS_SET_FILE_KILLPROCESS signed 16 bit must be 0 or 1 for SET Retrieves sets the ki...
Page 143: ...t setting noatomic or 0 Associated environment variable DVS_ATOMIC Additional notes none attrcache_timeout attrcache_timeout enables client side attribute caching File attributes and dentries for geta...
Page 144: ...he final close of a file descriptor in addition to forwarding the close to the DVS server the DVS server node waits until data has been written to the underlying media before indicating that the close...
Page 145: ...ibute_create_ops caused DVS to change its hashing algorithm so that create and lookup requests are distributed across all of the servers, as opposed to being distributed to a single server. This applies...
Page 146: ...servers Default setting fnv 1a Associated environment variable none Additional notes Except in cases of extremely advanced administrators or specific advice from DVS developers do not use the hash mou...
Page 147: ...nce For a description of loadbalance mode see About DVS Loadbalance Mode on page 152 noloadbalance automatically sets the following mount options maxnodes 1 cache 1 and hash_on_nid 0 Default setting n...
Page 148: ...required unless nodename is used Associated environment variable none Additional notes none nodename nodename is equivalent to nodefile but the administrator specifies a list of server nodes on the mo...
Page 149: ...nv or 1 Associated environment variable none Additional notes none 6 1 4 DVS Environment Variables By default user environment variables allow client override of options specified in the etc fstab ent...
Page 150: ...m administrator s choice of DVS mount point options A DVS mode is really just the name given to a collection of mount options chosen to achieve a particular goal Users cannot choose among DVS modes un...
Page 151: ...s that all bytes associated with a read or write are not interleaved with bytes from other read or write operations 6 1 5 2 About DVS Serial Mode Serial mode is the simplest implementation of DVS wher...
Page 152: ...bar1 foo bar2 foo bar3 foo bar3 foo bar1 foo bar2 foo bar3 foo bar1 foo bar2 foo bar1 foo bar2 foo bar3 6 1 5 4 About DVS Loadbalance Mode Loadbalance mode is a client access mode for DVS used to mor...
Page 153: ...handled by three different DVS servers thus distributing the load at a more granular level than that achieved by cluster parallel mode Note that while file I O is distributed at the block level file m...
Page 154: ...e that server is selected the entire read or write request is handled by that server only This ensures that all I O requests are atomic while allowing DVS clients to access different servers for subse...
Page 155: ...anner When failback occurs files are restriped to their original pattern Client System Console Message DVS file_node_down removing from list of available servers for 2 mount points The following messa...
Page 156: ...300 On DVS clients specifies the number of seconds between checks for dirty files that need to request the last sync time from the server The default value is 600 A fourth proc file procsfs dvs sync_s...
Page 157: ...was not spawned through aprun the request apid is 0 An example output of the file looks like cat proc fs dvs ipc requests server r0s1c1n3 request RQ_LOOKUP path dsl ufs home user 12795 time 0 000 sec...
Page 158: ...le in the next release Write access is available when the MAP_PRIVATE flag is specified to mmap because the file data is private to the process that performed the mmap and therefore coherency with oth...
Page 159: ...rop_caches proc file The following example uses the second mount point on the client and uses the cat command first to confirm that is the desired mount point To specify a different mount point replac...
Page 160: ...0000040b517a2609 doesn t contain a valid partition table Disk dev disk by id dm uuid mpath 360080e50002ff41a0000040e517a261e 7999 4 GB 7999443697664 bytes 255 heads 63 sectors track 972543 cylinders t...
Page 161: ...uuid mpath 360080e50002ff41a0000040b517a2609 successfully created nid00006 pvcreate dev disk by id dm uuid mpath 360080e50002ff41a0000040e517a261e Physical volume dev disk by id dm uuid mpath 360080e5...
Page 162: ...o add the following dev mapper fs1 fs1 scratch xfs defaults 1 0 If there is a secondary XFS DVS node used as a high availability manual backup and failover device for the primary XFS DVS node do not a...
Page 163: ...S request log options dvsproc dvs_request_log_enabled 1 6 1 8 5 Change Kernel Module Parameters Dynamically Using Proc Files Some of the kernel module parameters in the following list can be changed d...
Page 164: ...dvsipc_msg_thread_limit and dvsipc_single_msg_queue however this module parameter has priority over the individual ones and if set will override them Table 16 dvs_instance_info Field Definitions Fiel...
Page 165: ...to nid_of_choice etc modprobe d dvs local conf comment out either the compute or service line depending on the type of node s being configured Set parameters for DVS thread pool in DVS IPC layer Defa...
Page 166: ...er to 1 disables fairness of service by forcing DVS to use a single message queue instead of a list of queues Default value 0 fairness of service enabled To view read only cat sys module dvsipc parame...
Page 167: ...module dvs parameters dvsof_concurrent_writes To change prior to boot add these lines to nid_of_choice etc modprobe d dvs local conf Disable concurrent writes options dvs dvsof_concurrent_writes 1 Set...
Page 168: ...ons dvsproc dvsproc_stat_defaults enable legacy brief plain notest To change dynamically This is root writable at sys module dvsproc parameters dvsproc_stat_defaults but changes should be made only th...
Page 169: ...he DVS IPC layer Table 17 kdwfs_instance_info Field Definitions Field Definition thread_min Number of threads created at startup thread_max Maximum number of persistent threads thread_limit Maximum nu...
Page 170: ...defaults Set parameters for DataWarp thread pool in DVS IPC layer options dvsipc kdwfs_instance_info 256 256 1024 4 10 1 1 1 This translates to kdwfs_instance_info thread_min 256 thread_max 256 threa...
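The positional values in the kdwfs_instance_info example above can be paired with the field names from Table 17. The following Python sketch is illustrative only and is not part of DVS. It labels just the three field names that are visible in this text (thread_min, thread_max, thread_limit) and deliberately leaves the remaining values unnamed, because their field names are not shown here. The helper name label_instance_info is hypothetical.

```python
# Only the field names visible in the text above; the rest are not guessed.
FIELD_NAMES = ["thread_min", "thread_max", "thread_limit"]


def label_instance_info(values):
    """Pair leading positional values with known field names.

    Returns (named, unnamed): a dict for the labelled values and a list
    holding the trailing values whose names are not shown in this text.
    """
    named = dict(zip(FIELD_NAMES, values))
    unnamed = list(values[len(FIELD_NAMES):])
    return named, unnamed


# Values from the modprobe example above.
values = [256, 256, 1024, 4, 10, 1, 1, 1]
named, unnamed = label_instance_info(values)
print(named)
print(unnamed)
```

Consult the full Table 17 before interpreting the unnamed trailing values.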
Page 171: ...meout_secs To change prior to boot add these lines to nid_of_choice etc modprobe d dvs local conf Set the timeout seconds for syncing dirty files on a DVS server or client options dvs sync_dirty_timeo...
Page 172: ...DVS client DVS quiesce can be used when a file system needs to be repaired or to safely take a DVS server node out of service CAUTION Because it may cause data corruption do not use DVS quiesce to qui...
Page 173: ...rent server in a round robin fashion until it finds a server that allows the operation to complete successfully open file If the request is for an open file read write lseek etc then DVS attempts the...
Page 174: ...requires administrative privileges About this task The example provided in this procedure is for a scenario in which an admin wants to remove a DVS server from service but wants to let any outstanding...
Page 175: ...rent aspects of the Urika GX system such as deciding whether a user who has been identified by the Linux login mechanism is actually permitted to use Urika GX whether a user is restricted to tenant ac...
Page 176: ...have no authorization to log on to Urika GX in any way even if the user can pass the authentication stage of logging into the system When the crayLoginShell attribute is set to a regular shell such as...
Page 177: ...s password Until the user has a keytab however the user will not have access to HDFS Kerberos is only used in the Urika GX secure service mode Both of these secrets are user specific and both of them...
Page 178: ...Kubernetes No Yes HDFS NameNode Yes Yes HDFS Secondary NameNode Yes Yes HDFS DataNode Yes Yes CGE Yes No Spark History Server Yes No Spark Thrift Server Yes No YARN Yes No YARN Resource Manager Yes No...
Page 179: ...roxy Yes Yes Connectivity to Tableau Yes No Docker No Yes Any additional services installed on the system will use their own security mechanisms and will not be affected by Urika GX s default and secu...
Page 180: ...in place and the security of data that was protected by secure mode may be compromised while running in the default mode Cray cannot extend the secure mode security assurances to any system that has...
Page 181: ...ecure service mode the system will return the following message Requested service unavailable under current security mode Accessing an Application URL If the user attempts to access an application spe...
Page 182: ...l ssh Enter passphrase empty for no passphrase Enter same passphrase again Your identification has been saved in home users erl ssh id_rsa Your public key has been saved in home users erl ssh id_rsa p...
Page 183: ...be an underlying issue The second and subsequent times this is run the user will not be prompted for host authenticity This is shown below utp launch true echo 0 7 5 Tenancy On Urika GX tenancy refer...
Page 184: ...e hsm_restore hsm_release hsm_remove hsm_cancel swap_layouts migrate mv help version Two convenient commands for diagnosing problems using Urika GX s tenant proxy mechanism env true false IMPORTANT Th...
Page 185: ...procedure should be used as guidance only Site specific customizations requirements are beyond the scope of this publication About Bridge Port 0 Management Network The Urika GX management network int...
Page 186: ...Xs0f1 files which are required for setting up br1 s configuration The syntax of the command is utm host net opns enable force opns enable dryrun opns enable ip_addr hwaddr not used netmask default_gat...
Page 187: ...CONFIGURATION mgmt enable NETINFO Enables the management network for use with UTM Attempts to detect and only make changes where needed mgmt enable dryrun NETINFO Show changes that would be made by mg...
Page 188: ...k Missing bridge br1 Found operation network ethernet device enp7s0f1 state br1 NO_IP_LINK enp7s0f1 UP using bridge NO IP 172 30 51 152 20 Action The Operations network on this host will need to be be...
Page 189: ...checking the status of tenants In addition tenant management also involves managing users associated with tenants These tasks are performed by a number of tenant management commands as described in th...
Page 190: ...NT_IP_ADDR 172 30 48 7 UXTENANT_TENANT_IP_ADDR_NETMASK 255 255 240 0 UXTENANT_TENANT_IP_ONBOOT no UXTENANT_TENANT_IP_GATEWAY 172 30 48 1 UXTENANT_TENANT_IP_DNS1 172 30 84 40 UXTENANT_TENANT_IP_DNS2 17...
Page 191: ...configurations home and lustre provided Site specific ones can be added as needed The sample mount points can be found by looking in the configuration files etc sysconfig uxtenant mounts home and etc...
Page 192: ...ag on a locally controlled mount because it allows tenant creation to fully manage tenant isolation within a directory This will be examined more closely in the lustre example below In this case the f...
Page 193: ...but it is required to be tenant isolated create the directory and export it to the tenant manually or by some site defined procedure prior to creating the tenant VM In that case set this setting to NO...
Page 194: ...N A None 10 142 150 1 cb_tenant_00 N A N A notfound N A N A None 10 142 166 100 cb_tenant_01 N A N A notfound N A N A None 10 142 166 101 cb_tenant_02 N A N A notfound N A N A None 10 142 166 102 cb_...
Page 195: ...nant VM s state can be retrieved via the urika tenant status command The list of states that a tenant VM can be in at a given time include Table 20 Tenant States State Description notfound The tenant...
Page 196: ...ot robust to node redeployment or replacement Therefore it is important to note that tenant VMs will go into the notfound state if a node is redeployed or replaced These state transitions are depict...
Page 197: ...ing items need to be kept under consideration while using this command Number of CPUs At least 2 CPUs need to remain available when the number of CPUs is changed by this script That is if there are N...
Page 198: ...xecute the usm sync users command if the deleted user accounts are added back again and usm sync users was not run since they were deleted If the ux tenant remove user command is executed on a deleted...
Page 199: ...wing for additional help information replacing sub command name with the name of the actual sub command help sub command name The system will return the following error if a user attempts to view help...
Page 200: ...re 1 Log on to the SMW as root 2 Configure a tenant VM For information about tenant configuration steps and files refer to Tenant Management on page 189 3 Ensure that the system is functioning in the...
Page 201: ...p s mesos_cluster 5 Start the Mesos service urika start s mesos_cluster If the system is running in the default service mode perform the following set of steps to remove a user 1 Execute the ux tenant...
Page 202: ...er a tenant can only access their own data and have no visibility into the existence of data belonging to other tenants This includes local file systems used for temporary working space In addition us...
Page 203: ...ectory the mode 750 rwx r x may be helpful since this only allows the owner of the directory to create remove rename files and directories in it Automatic Assignment of Tenant Group Ownership and mode...
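As an illustration of what mode 750 grants, the following Python sketch decodes the octal mode into its permission string and checks the individual write and read bits. It uses only the standard library and is not part of the Urika GX tooling.

```python
import stat

# Mode 750 applied to a directory: owner rwx, group r-x, others none.
mode = stat.S_IFDIR | 0o750
perm_string = stat.filemode(mode)
print(perm_string)

# Only the owner holds the write bit, so only the owner can create,
# remove, or rename entries in the directory.
owner_can_write = bool(mode & stat.S_IWUSR)
group_can_write = bool(mode & stat.S_IWGRP)
other_can_read = bool(mode & stat.S_IROTH)
```

Group members can list and traverse the directory but cannot change its contents, and all other users have no access.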
Page 204: ...a warning indicating that it detected an argument that is not allowed for restricted users and that the argument is being removed When the cluster is switched to the secure service mode Kerberos and...
Page 205: ...o the system For example a user with Kerberos principal johnsmith local would be mapped to user johnsmith in HDFS HDFS provides a number of ways to map users to groups via group mapping providers The...
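The principal-to-user mapping described above can be sketched as follows. This is a simplified illustration, not the actual HDFS implementation: it assumes the principal is written as johnsmith@local (the separator is not visible in this text) and simply drops the realm and any instance component. The function name principal_to_short_name is hypothetical.

```python
def principal_to_short_name(principal: str) -> str:
    """Return the short user name for a Kerberos principal.

    Simplification: drop the realm (after '@') and any instance
    component (after '/'). Real deployments may apply site-specific
    mapping rules instead.
    """
    primary = principal.split("@", 1)[0]
    return primary.split("/", 1)[0]


short_name = principal_to_short_name("johnsmith@local")
print(short_name)
```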
Page 206: ...ant add user command to add a user to the authorized list of users ux tenant add user u bob In this example bob is added to the authorized user list but is not assigned tenant membership which would p...
Page 207: ...man page View List of Authorized Users Execute the ux tenant list users to view the list of authorized users For more information refer to the ux tenant list users man page Urika GX Mesos and Kerbero...
Page 208: ...ng Mesos jobs so it is done as a manual step After running usm sync users or usm recreate secret wait for an appropriate maintenance window and run urika stop s mesos_cluster then urika start s mesos_c...
Page 209: ...t cannot account for that variation The actual details of the site s implementation will vary according to their authentication data source and policies There is no suggestion that the procedures outl...
Page 210: ...ldapsearch host port x search_dn query grep grep shell_field space sed sed e s shell_field space space if out search grep sed then return 1 fi echo out return 0 usage error 1 if z error then echo erro...
Page 211: ...e command will import users in dry run mode from echo the LDAP server at cfdcg02 us cray com using the table with the DN echo ou people dc datacenter dc cray dc com assuming that the field name echo f...
Page 212: ...hell shell dry run fi else if z dry_run then echo Removing user u if ux tenant remove user u u then echo WARNING removing user u failed skipping this user continue fi else echo Not really removing use...
Page 213: ...cess and then install them differently the tenant users with tenant membership and restricted access i e not specifying a crayLoginShell value in the ux tenant add user command the users with relaxed...
Page 214: ...hell dry run fi else if z dry_run then echo Removing user u if ux tenant remove user u u then echo WARNING removing user u failed skipping this user continue fi else echo Not really removing user u dr...
Page 215: ...be achieved by using the ux tenant list users command in the raw R mode The following shell function can help do this get_user_list ux tenant list users F name R while read expr do eval expr echo nam...
Page 216: ...ocumentation and Learning Resources UI Not available Grafana UI LDAP The system also ships with a default account that can be used to log on to Grafana The credentials of this account are username adm...
Page 217: ...ystem Administration Guide Nagios UI Default credentials User name crayadm Password initial0 7 9 Change Default Passwords Change the SMW s Password Follow the instructions documented in Change the Def...
Page 218: ...e asked to provide and confirm the old password the new password and will be asked to supply the old password again for the actual bind to take place After that the password will change If a new user...
Page 219: ...7 9 1 Default Urika GX System Accounts Default System Management Workstation SMW Accounts Table 25 Default SMW Accounts Account Name Password root initial0 crayadm crayadm The SMW account should not...
Page 220: ...systemd network systemd bus proxy sshd smmsp setroubleshoot saslauth rtkit rpcuser rpc radvd rabbitmq qemu pulse postgres postfix polkitd pcp oprofile ntp nginx nfsnobody mysql memcached mailnull libs...
Page 221: ...ce if it is running service nagios stop 4 Change the default Nagios password htpasswd c usr local nagios etc htpasswd users nagiosadmin 5 Start the HTTPD service service httpd start 6 Start the Nagios...
Page 222: ...name and password on the iDRAC s log in screen 4 Select the Submit button 5 Select iDRAC settings from the left navigation menu bar 6 Select User Authentication 7 Select the User ID for the user that...
Page 223: ...ge 7 9 4 Change the Default System Management Workstation SMW Passwords Prerequisites Ensure that the SMW is accessible This procedure requires root access About this task After logging on to the SMW...
Page 224: ...password on Urika GX Procedure 1 Log on to nid00030 which is a login node 2 Edit the slapd conf file and add in ACLs to allow users to modify the LDAP password root nid00030 vim usr local openldap et...
Page 225: ...password for user admin Password 7 9 7 Reset an Administrator LDAP Password on Systems Using Urika GX 1 2UP01 and Earlier Releases Prerequisites This procedure requires root privileges assumes that t...
Page 226: ...Va24 cachesize 50000 dirtyread dbnosync checkpoint 128 15 idlcachesize 50000 index objectClass eq database meta COMBINES the LDAP DATABASES database meta suffix dc local rootdn cn crayadm dc local roo...
Page 227: ...perform the necessary ldif operations however this is not configured in the default Urika GX installation therefore setting this up for existing systems requires knowing the OLC admin password Below a...
Page 228: ...ing entry SLAPD_CONF_DIR usr local openldap etc openldap slapd d Procedure 1 Log on to the LDAP host server as root 2 Generate a new hashed password root nid00030 slappasswd New password Re enter new...
Page 229: ...UE To resolve this issue please follow the instructions documented at http gethue com SSL authentication for Tableau can be set up using instructions documented in Enable SSL on Urika GX Urika GX ship...
Page 230: ...g the settings to enable SSL 3 Edit the settings for the Urika Applications Interface by uncommenting some settings to enable SSL In the following instructions it is assumed that the SSL certificate i...
Page 231: ...kthrift_ssl_backend mode tcp balance source server server1 192 168 0 33 10015 frontend grafana_ssl bind 29201 ssl crt etc ssl certs filename pem mode http option forwardfor option http server close op...
Page 232: ...d be available at https hostname login1 29202 HUE would still be available at http hostname login1 8888 but this URL is not secure It is recommended to use https hostname login1 29202 The preceding examp...
Page 233: ...anagement default application_management settings py as shown in the following example This allows the Urika GX Applications Interface page to load secure URLs configured in the preceding steps when t...
Page 234: ...name smw 13 Restart Apache on the SMW service httpd restart 14 Verify that all the URLs of services are accessible 7 12 Enable SSL for Spark Thrift Server of a Tenant Prerequisites This procedure requ...
Page 235: ...xml file to change the value of hive server2 use SSL to false 2 Stop and then start the Spark Thrift Server if it is currently running 7 13 Install a Trusted SSL Certificate on Urika GX Prerequisites T...
Page 236: ...ertificate 7 14 Enable LDAP Authentication on Urika GX Prerequisites This procedure requires root access Ensure that the storage LDAP client points at login node 1 which is the LDAP server on Urika GX...
Page 237: ...and changes Procedure 1 Log on to login node 1 as root ssh root login 1 2 Edit the usr local openldap etc openldap slapd conf file to uncomment the following lines and replace myldap with the actual...
Page 238: ...he following example system smw is used as an example for the SMW s hostname and username is used as an example of an LDAP user s username root system smw pdsh w nid000 00 47 id username c Log on to l...
Page 239: ...of HUE as shown in the following figure To resolve this issue please follow the instructions documented at http gethue com Figure 43 Error Message 5 Log on to the SMW 6 Restart HiveServer2 service ur...
Page 240: ...to modify other parameters of the hive site xml file that are documented in the Hive documentation as they are already configured on the Urika GX system 4 Start the Hive service urika start s hive 5...
Page 241: ...DAP based authentication System administrators can use LDAP servers to set up user authentication Add a user to a group Use the usermod command to add a user to an existing group Add a user to LDAP Use...
Page 242: ...75 DataNode use for data transfers 50010 DataNode used for metadata operations 8010 Table 33 Services Accessible via the Login Nodes via the Hostname Service Default Port Mesos Master UI 5050 This UI...
Page 243: ...chever login node 1 or 2 that the user executes spark submit spark shell spark sql pyspark on This UI is user visible InfluxDB 8086 on login2 InfluxDB runs on nid00046 on three sub rack and on nid0003...
Page 244: ...Service Port Kafka not configured by default 9092 Flume not configured by default 41414 Port for SSH 22...
Page 245: ...directory console date Contains node console logs consumer date Contains ERD event collector xtconsumer logs netwatch date Contains logs generated by the xtnetwatch command nlrd date Contains logs gen...
Page 246: ...ss often Nagios Default location of logs is var log nagios nagios log on the SMW This location can be configured if desired N A Cobbler Located in the Cobbler KVM at var log cobbler N A Grafana Locate...
Page 247: ...by editing the etc grafana grafana ini file Yes Jupyter Notebook Log levels are controlled by the Application log_level configuration parameter in etc jupyterhub jupyterhub_config py It is set to 30 b...
Page 248: ...InfluxDB var log influxdb influxd log collectl collectl does not produce any logging information It uses logging as a mechanism for storing metrics These metrics are exported to InfluxDB If collectl...
Page 249: ...spark test 1522834122688 driver STOP Application Stopped Wed Apr 04 04 30 09 CDT 2018 username spark test 1522834207396 driver START Application Started with 1 driver plus 2 0 executors using 3 0 core...
Page 250: ...ode e g a login node add the user to the user authorization list as follows smw ux tenant add user u username smw ux tenant relax username If this was seen on a tenant VM and the user is supposed to b...
Page 251: ...13 34 30 nid00030 sshd 143134 pam_cray_keytab kinit failed for user username see previous logs There may be other errors reported by kinit as well If the logs contain such information first check whe...
Page 252: ...ux tenant list users error configuration is out of date with the current version The following configuration files need to be brought up to date with the current version of Urika Tenant Management The...
Page 253: ...pts Look first in ifcfg br1 It should look something like this DEVICE br1 TYPE Bridge BOOTPROTO static ONBOOT yes NM_CONTROLLED no IPADDR 172 30 51 237 NETMASK 255 255 240 0 GATEWAY 172 30 48 1 DNS1...
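When checking the br1 settings shown above, it can help to confirm that the gateway falls inside the network implied by the address and netmask. The following Python sketch is an illustration that uses only the standard ipaddress module and is not part of the Urika GX tooling. It takes the IPADDR, NETMASK, and GATEWAY values from the example above, assuming the space-separated octets represent dotted-quad addresses.

```python
import ipaddress

# Values from the br1 example above, written as dotted-quad addresses.
bridge = ipaddress.ip_interface("172.30.51.237/255.255.240.0")
gateway = ipaddress.ip_address("172.30.48.1")

# The network the bridge address belongs to, in prefix notation.
network = bridge.network
gateway_reachable = gateway in network
print(network, gateway_reachable)
```

If the gateway is not inside the computed network, one of the three values in the configuration file is likely wrong.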
Page 254: ...erated for the tenant VM login session A ticket granting ticket is generated each time a command is run on a physical node through the tenant proxy mechanism Logs Urika GX Tenant Management Logs utp s...
Page 255: ...ing environment for any given tenant VM resides in a disk image file found on the host node the node named in the etc sysconfig uxtenant hosts host name file named in the tenant configuration at qemu...
Page 256: ...from this situation as well since we do not know all of the paths that could result in network timeouts during LDAP start up One consequence of this can be that the slapd LDAP server on login1 will t...
Page 257: ...cified or if the secret is not entered in plain text 4 Start the Mesos cluster CAUTION The Mesos cluster needs to be restarted after running the urika mesos change secret command otherwise the framewo...
Page 258: ...r var log oozie HUE HUE logs are generated under var log hue Flex scripts urika yam status urika yam flexup urika yam flexdown urika yam flexdown all These scripts generate log files under var log uri...
Page 259: ...on all the nodes with Mesos slave processes running The following example can be used on a 3 sub rack system pdsh w nid000 00 47 x nid000 00 16 30 31 32 46 47 rm vf var log mesos agent meta slaves lat...
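To see which nodes the pdsh example above would target on a 3 sub rack system, the following Python sketch expands the node range and removes the excluded nodes. It assumes the command is read as an include range of nid00000 through nid00047 with nodes 00, 16, 30, 31, 32, 46, and 47 excluded; that reading is an interpretation of the text above, whose punctuation is not preserved here. The helper name node_names is hypothetical.

```python
def node_names(ids):
    """Format numeric node IDs as five-digit nid host names."""
    return [f"nid{n:05d}" for n in ids]


all_ids = range(0, 48)  # nid00000 .. nid00047
excluded_ids = {0, 16, 30, 31, 32, 46, 47}

# Nodes that remain after applying the exclusion list.
targets = node_names(n for n in all_ids if n not in excluded_ids)
print(len(targets), targets[0], targets[-1])
```

Reviewing the expanded list before running a destructive command such as rm across many nodes is a cheap way to catch a mistyped range.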
Page 260: ...Error Message Description Resolution INFO mesos CoarseMesosSchedulerBac kend Blacklisting Mesos slave 20151120 121737 1560611850 505 0 20795 S0 due to too many failures is Spark installed on it INFO...
Page 261: ...erwise please specify the desired value when using the urika yam flexup command ID names can only contain alphanumeric dot and dash characters not allowed in jhoole Usage urika yam flexup nodes nodes...
Page 262: ...t be set to the same as the default value Ensure that this parameter is set and the value is the same as default value Marathon port is not set in the etc urika yam conf file please re check the confi...
Page 263: ...DFS is in a healthy state Marathon and Mesos services are up and running The status of the aforementioned services can be checked using the urika state command For more information see the urika state...
Page 264: ...ing too many jobs a user waiting for their job to start but previous jobs have not freed up nodes etc Additionally if a user set a job timeout s to 1 hour and the job lasted longer than 1 hour the use...
Page 265: ...nplugs an Ethernet cable while a CGE job was running or a node died etc Ensure that the Ethernet cable is plugged in while jobs are running NCMD Error leasing cookies MUNGE Munge authentication failure s...
Page 266: ...wn option of the mrun command The first remote compute node did not exit successfully Thus this error message can only occur if the application is killed or if someone used the Marathon REST API in an...
Page 267: ..._files do echo Fixing hdfsfile hadoop fs setrep 3 hdfsfile done Too many corrupt blocks in name node UI The NameNode might not have access to at least one replication of the block Check if any of the...
Page 268: ...rk jobs Kill the job using the Spark UI Click on the text kill in the Description column of the Stages tab Kill the job using the Linux kill command Kill the job using the Ctrl C keyboard keys The sys...
Page 269: ...ror N nodes must be 1 parser error n images must be N nodes parser error No command specified to launch error Not enough CPUs error Not enough CPUs for exclusive access error Not enough nodes parser e...
Page 270: ...run 1 man page 8 10 Troubleshoot Application Hangs as a Result of NFS File Locking About this task Applications may hang when NFS file systems are projected through DVS and file locking is used To avo...
Page 271: ...o define a large number of user environment variables Cray recommends that users include those definitions in the user s shell so that they are available at startup and stored where DVS can always loc...
Page 272: ...too many times or with very large temporary files the SSDs may begin to fill up This can cause Spark jobs to fail or slow down Urika GX checks for any idle nodes once per hour and cleans up any left o...