Cray Urika-GX User Manual, page 231

2. Uncomment the following lines in the /etc/haproxy/haproxy.cfg configuration file and replace all occurrences of /etc/ssl/certs/filename.pem with the full path to the SSL certificate.

frontend hive_ssl
  bind *:29207 ssl crt /etc/hive/conf/server.pem
  mode tcp
  option forwardfor
  reqadd X-Forwarded-Proto:\ https
  default_backend hive_ssl_backend

backend hive_ssl_backend
  mode tcp
  balance source
  server server1 192.168.0.33:10000

frontend sparkthrift_ssl
  bind *:29208 ssl crt /etc/hive/conf/server.pem
  mode tcp
  option forwardfor
  reqadd X-Forwarded-Proto:\ https
  default_backend sparkthrift_ssl_backend

backend sparkthrift_ssl_backend
  mode tcp
  balance source
  server server1 192.168.0.33:10015

frontend grafana_ssl
  bind *:29201 ssl crt /etc/ssl/certs/filename.pem
  mode http
  option forwardfor
  option http-server-close
  option httpclose
  reqadd X-Forwarded-Proto:\ https
  default_backend grafana_ssl_backend

backend grafana_ssl_backend
  mode http
  balance source
  server server1 192.168.0.47:3000

frontend hue_ssl
  bind *:29202 ssl crt /etc/ssl/certs/filename.pem
  mode http
  option forwardfor
  option http-server-close
  option httpclose
  reqadd X-Forwarded-Proto:\ https
  reqadd X-Forwarded-Protocol:\ https
  default_backend hue_ssl_backend

backend hue_ssl_backend
  mode http
  balance source
  server server3 192.168.0.31:8888

frontend jupyter_ssl
  bind *:29204 ssl crt /etc/ssl/certs/filename.pem
  mode http
  option forwardfor
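
The placeholder substitution in step 2 can also be scripted rather than done by hand. The following is a minimal Python sketch, not part of the Urika-GX tooling: the function name set_certificate_path and the certificate path /etc/ssl/certs/urika.pem are hypothetical, and the sample text is a shortened copy of two stanzas from the listing above. It only replaces the placeholder; uncommenting the lines is still done manually. In the listing, the hive_ssl and sparkthrift_ssl front ends already point at /etc/hive/conf/server.pem, so only the entries that use the placeholder are changed.

```python
# Placeholder path that step 2 asks the administrator to replace.
PLACEHOLDER = "/etc/ssl/certs/filename.pem"


def set_certificate_path(cfg_text, cert_path):
    """Return (updated_text, number_of_replacements).

    Replaces every occurrence of the placeholder with cert_path.
    Hypothetical helper for illustration only.
    """
    count = cfg_text.count(PLACEHOLDER)
    return cfg_text.replace(PLACEHOLDER, cert_path), count


# Shortened sample based on the grafana_ssl and hue_ssl stanzas.
sample = """frontend grafana_ssl
  bind *:29201 ssl crt /etc/ssl/certs/filename.pem
  mode http
frontend hue_ssl
  bind *:29202 ssl crt /etc/ssl/certs/filename.pem
  mode http
"""

# /etc/ssl/certs/urika.pem is an assumed example path, not a documented one.
updated, replaced = set_certificate_path(sample, "/etc/ssl/certs/urika.pem")
print(replaced)
print(updated)
```

To apply this to a real system, the text would be read from and written back to /etc/haproxy/haproxy.cfg as root, ideally after saving a backup copy of the file.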

Security   S3016   231

Содержание Urika-GX

Страница 1: ...Urika GX System Administration Guide 2 2 UP00 S 3016...

Страница 2: ...tation SMW 22 3 4 Hardware Supervisory System HSS 22 3 4 1 Hardware Supervisory System HSS Architecture Overview 24 3 4 2 The xtdiscover Command 25 3 4 3 Hardware Supervisory System HSS Component Loca...

Страница 3: ...formation Using the xtdumpsys Command 43 3 5 Dual Aries Network Card dANC Management 43 3 6 Analyze Node Memory Dump Using the kdump and crash Utilities on a Node 44 3 7 Cray Lightweight Log Managemen...

Страница 4: ...05 5 Resource Management 124 5 1 Manage Resources on Urika GX 124 5 2 Use Apache Mesos on Urika GX 126 5 2 1 Access the Apache Mesos Web UI 128 5 3 Use mrun to Retrieve Information About Marathon and...

Страница 5: ...Default System Management Workstation SMW Passwords 223 7 9 5 Change LDAP Password on Urika GX 224 7 9 6 Reset a Forgotten Password for the Cray Application Management UI 224 7 9 7 Reset an Administr...

Страница 6: ...Framework 256 8 6 Clean Up Log Data 257 8 7 Diagnose and Troubleshoot Orphaned Mesos Tasks 258 8 8 Troubleshoot Common Analytic and System Management Issues 259 8 9 Troubleshoot mrun Issues 268 8 10 T...

Страница 7: ...sh At the end of a command line indicates the Linux shell line continuation character lines joined by a backslash are parsed as a single line Do not type anything after the backslash or the continuati...

Страница 8: ...RAYPAT CRAYPORT DATAWARP ECOPHLEX LIBSCI NODEKARE The following system family marks and associated model number marks are trademarks of Cray Inc CS CX XC XE XK XMT and XT The registered trademark LINU...

Страница 9: ...rvisory System HSS HSS is an integrated system of hardware and software components that are used for managing and monitoring the system Cobbler Cobbler is used on Urika GX for provisioning and deploym...

Страница 10: ...t network The operational Ethernet network is used for ingesting user data This network is comprised of a single unit 48 port GigE switch that provides dual 1GigE and or dual 10GigE interfaces to the...

Страница 11: ...ice versa is routed through the corresponding RC For additional information see the Urika GX Hardware Guide 2 3 File Systems Supported file system types on Urika GX include Internal file systems Hadoo...

Страница 12: ...a GX hardware High speed network management network switches must not be modified as this network is internal to Urika GX Moving the system from the rack Cray supplies to customer provided racks is no...

Страница 13: ...w Docker images is currently not supported on Urika GX For more information contact Cray Support Before installing any additional software on the Urika GX system a ticket should be opened with Cray Su...

Страница 14: ...Tenancy on page 183 Tenant NameNode configuration is managed automatically by the Urika GX tenant management scripts Manually altering the configurations of the tenant NameNode is not supported Tenant...

Страница 15: ...to the SMW as root ssh root hostname smw 2 Display the current service mode by using one of the following options Execute the urika state command This displays the current service mode as well as the...

Страница 16: ...gent Subrack Control Board iSCB rRsSiI I 0 to 1 Dual Aries Network Card dANC There are up to 2 dANCs per sub rack accommodating up to 16 nodes rRsScC C 0 to 1 High Speed Network HSN cable The j name i...

Страница 17: ...s in this publication be followed before making any changes reconfigurations to the SWM as well as before restarting the SMW 3 3 1 Power On the System Management Workstation SMW The SMW can be turned...

Страница 18: ...shut down the System Management Workstation SMW Procedure 1 Point a browser to the site specific iDRAC IP address such as https system smw ras The iDRAC console s login screen appears 2 Enter root and...

Страница 19: ...Figure 1 iDRAC Login Screen 3 On the Quick Launch Tasks section of the iDRAC UI click on Power ON OFF link to control the SMW s power System Management S3016 19...

Страница 20: ...leges About this task The components of the Cray system synchronize time with the System Management Workstation SMW through Network Time Protocol NTP By default the NTP configuration of the SMW is con...

Страница 21: ...out NTP refer to the Linux documentation 6 Sync the hardware clock smw hwclock systohc 7 Verify that the SMW has jitter from the NTP server smw ntpq p 3 3 5 Synchronize Time of Day on System Nodes Pre...

Страница 22: ...estarts establishes communications with all external interfaces restores the proper state in the state manager and continues normal operation without user intervention For a scheduled or unscheduled s...

Страница 23: ...itors and controls the rack power and communicates with all dANC controllers in the rack It sends a periodic heartbeat to the SMW to indicate rack health The rack controller connects to the dANC contr...

Страница 24: ...rview HSS hardware on the Urika GX system consists of a System Management Workstation SMW which is a rack mounted Intel based server running CentOS along with an Ethernet network that connects the SMW...

Страница 25: ...r provided block of IP address space This information is used to create the etc hosts file and DHCP entries for the HSS network This setup typically only needs to be done once unless the address block...

Страница 26: ...the database when needed Thus the dynamic system state persists between SMW boots The state manager uses the Lightweight Log Manager LLM The log data from state manager is written to var opt cray log...

Страница 27: ...not typically result in a new mapping Since the operating system always uses NIDs HSS converts these to NIC IDs when sending them on to the HSS network and converts them to NIDs when forwarding event...

Страница 28: ...ies the authorized_keys file for Rack Controllers RCs and dANCCs xtchecklink Checks HSN and PCIe link health xtclass Displays the network topology class for this system xtclear Clears component flags...

Страница 29: ...ble interrupt to target nodes See the xtnmi 8 man page for more information xtpcimon Monitors health of PCIe channels for Urika GX systems rackfw Flashes all devices in the Urika GX system via out of...

Страница 30: ...ate with the Cray network application specific integrated circuit ASIC xtbte_perf Determines the one hop connections between nodes It then performs BTE transfers over these one hop connections and det...

Страница 31: ...d Monitors ANCC heartbeats emits heartbeat for the State Manager controls dANC power operations and monitors the iSCB s health Controller Vitality Check Daemon cvcd Monitors memory consumption CPU uti...

Страница 32: ...dwidth low latency communication between all the compute processing elements of the system CAUTION xtbounce should never be executed when nodes are up as this command will not gracefully shut nodes do...

Страница 33: ...etwork Card dANC will fail if any nodes under the component are in the ready state unless the force option f is used An error message will indicate the reason for the failure If the system is currentl...

Страница 34: ...state HSS managers and the xtcli command ignore empty or disabled components Setting a selected component to the EMPTY state is typically done when a component usually a blade is physically removed B...

Страница 35: ...ponent as follows crayadm smw xtcli lock u lock_number Where lock_number is the value given when initiating the lock it is also indicated in the xtcli lock show query Unlocking does nothing to the sta...

Страница 36: ...rrectly and debugging information is needed or to stop a node that is running incorrectly The sole purpose of the xtnmi command is to collect debug information from unresponsive nodes As soon as that...

Страница 37: ...ormation request For more information see the rtr 8 man page Display routing information The system map option to rtr writes the current routing information to stdout or to a specified file This comma...

Страница 38: ...e following circumstances On any single group system at any time even those listed above During a warmswap operation See the rtr 8 man page for additional information 3 4 19 Power Up a Rack or Dual Ar...

Страница 39: ...7 ping nid00015 PING nid00015 10 128 0 16 56 84 bytes of data 64 bytes from nid00015 10 128 0 16 icmp_seq 1 ttl 64 time 0 032 ms 64 bytes from nid00015 10 128 0 16 icmp_seq 2 ttl 64 time 0 010 ms 3 4...

Страница 40: ...he xtshow alert command for a cabinet does not display an alert for a node Similarly checking the status of a node does not detect an alert on a cabinet Show all alerts on the system crayadm smw xtsho...

Страница 41: ...hardware components For more information see the xtshow 8 man page 3 4 28 Clear Component Flags Use the xtclear command to clear system information for selected components Type commands as xtclear opt...

Страница 42: ...te the authorized_keys file for the Intelligent Subrack Control Boards iSCBs It must be run as root The xtcc ssh keys command takes no options When run it invokes the user selected text editor specifi...

Страница 43: ...d name the components from which to collect data NOTE Only the crayadm account can execute the xtdumpsys command For more information see the xtdumpsys 8 man page 3 5 Dual Aries Network Card dANC Mana...

Страница 44: ...and determining the cause of a crash Dumped image of the main memory exported as an Executable and Linkable Format ELF object can be accessed either directly during the handling of a kernel crash thro...

Страница 45: ...ansports those messages to the SMW and places the messages into log files By default LLM has a log trimming mechanism enabled called xttrim 3 8 Urika GX Node Power Management A number of CLI scripts c...

Страница 46: ...rocedure assumes that the instructions are being carried out on a 3 sub rack 48 node system About this task The instructions documented in the procedure can be used for powering up the Urika GX system...

Страница 47: ...lts when the preceding command is executed use the xtclear_alert or xtclear_reserve command to clear those statuses 6 Ensure that the ANCCs are powered on by executing the command root smw xtalive All...

Страница 48: ...le d smw_deploy_profile sh 18 Confirm that the required nodes have a mount point by executing the following command root smw pall df k grep lustre sort 19 Start up the analytic applications using the...

Страница 49: ...smw urika stop For more information see the urika stop man page 3 Execute the urika state command to ensure that analytic applications have shut down The HA Proxy service may still be running root sm...

Страница 50: ...displayed in the Flags column of the results when the following commands are executed root smw xtcli status t node s0 Network topology class 0 Network type Aries Nodeid Service Core Arch Comp state F...

Страница 51: ...specify the node IDs separated by commas The list of services that are started using this command depends on the service mode default or secure that the system is running in This command starts all th...

Страница 52: ...ode IDs separated by commas To stop a specific service use the s or service option to specify the name of that service This command stops all the running analytic services if used without any options...

Страница 53: ...page Display status of services urika state Sequence of Execution of Scripts The urika state command can be used to view the state services running on Urika GX Use this command to ensure that the foll...

Страница 54: ...erequisites This procedure requires root privileges Before performing this procedure use the urika state command to ensure that the system is operating in the service mode that supports using InfluxDB...

Страница 55: ...ration is changed from 0 forever to 2 weeks 504 hours alter retention policy default on Cray Urika GX Duration 2w show retention policies on Cray Urika GX name duration shardGroupDuration replicaN def...

Страница 56: ...t nid00008 ZooKeeper Secondary HDFS NameNode Mesos Master Oozie Hive Server2 Spark Thrift Server Hive Metastore WebHCat Postgres database Marathon YARN Resource Manager Collectl nrpe kubelet nid00014...

Страница 57: ...pe kubelet nid00030 Login node 2 HUE HA Proxy Collectl Service for flexing a YARN cluster Grafana InfluxDB nrpe kubelet Table 10 Urika GX Service to Node Mapping 3 Sub rack System Node ID s Service s...

Страница 58: ...1 HUE HA Proxy Collectl Urika GX Applications Interface UI Jupyter Notebook Service for flexing a YARN cluster Documentation and Learning Resources UI nrpe kubelet Kubernetes Controller nid00031 nid00...

Страница 59: ...iven host Because of the lightweight nature of containers users can run more containers on a given hardware combination than if using virtual machines Images shipped with the system are managed by Doc...

Страница 60: ...name as the Spark job s name spark submit class org apache spark examples SparkPi conf spark app name spark pi local opt spark examples target scala 2 11 jars spark examples_2 11 2 2 0 k8s 0 5 0 jar...

Страница 61: ...yspark pi jars local opt spark examples target scala 2 11 jars spark examples_2 11 2 2 0 k8s 0 5 0 jar local opt spark examples src main python pi py Execute the kubectl logs pod_name command and grep...

Страница 62: ...s 1 Number of cores requested for the driver This should be increased if a job does a lot of work in the driver e g aggregations result collection spark driver memory 16g Amount of memory requested pe...

Страница 63: ...r Configures the tenant specific Spark Thrift Server and launches the Spark Thrift Server spark job in the tenant s Kubernetes namespace stop thriftserver Stops the tenant specific Spark Thrift Server...

Страница 64: ...ervice To see the state of the Kubernetes POD containing the Metastore server root login1 kubectl get pods n TENANT VM metastore TENANT VM show all The status should be Running and the Status 2 2 indi...

Страница 65: ...CAUTION iSCB CLI commands other than the status command should NOT be executed on the Urika GX system unless advised by Cray Support For more information contact Cray support The capmc ux nid cobbler...

Страница 66: ...ate and send alert notifications For more information refer to Tenant Management on page 189 Procedure 1 Access the Nagios UI by pointing a browser at http machine smw nagios 2 Enter crayadm as the us...

Страница 67: ...his procedure requires root privileges Ensure that the mod_ssl package is installed on the system If it is not install it by executing the following as root on the SMW yum install y mod_ssl About this...

Страница 68: ...called a Distinguished Name or a DN There are quite a few fields but you can leave some blank For some fields there will be a default value If you enter the field will be left blank Country Name 2 le...

Страница 69: ...l x509 req days 365 in certrequest csr signkey keyfile key out certfile crt This should produce output saying that the signature was OK and that it retrieved the private key Self signing will result i...

Страница 70: ...is returned add a security exception The Apache test web page will be displayed upon success The Nagios Core server can now be accessed by directing a web browser at https serverName nagios More detai...

Страница 71: ...cations can be sent out to this contact Valid options are a combination of one or more of the following w Notify on WARNING service states u Notify on UNKNOWN service states c Notify on CRITICAL servi...

Страница 72: ...ontact2 define service name gluster service use generic service notifications_enabled 1 notification_period 24x7 notification_options w u c r f s notification_interval 120 register 0 _gluster_entity S...

Страница 73: ...gios server sends notifications during status changes to the mail addresses specified in the file It is important to note that By default the system ensures three occurrences of the event before sendi...

Страница 74: ...g file 6 Restart the Nagios service service nagios restart 4 2 4 Configure Email Alerts Prerequisites This procedure requires root privileges About this task This procedure provides instructions for u...

Страница 75: ...vice that alerting needs to be set up for The following example shows how to set up the service for aggregate CPU usage define service use local service host_name localhost service_description Aggrega...

Страница 76: ...e warning and critical levels are digits up to 4 decimal points between 0 0 and 1 0 which represent the percentage of the metric The following plugins are configurable on the SMW at the default path u...

Страница 77: ...cfg file 7 Restart the Nagios service service nagios restart 8 Switch to the usr local nagios etc directory 9 Modify the nrpe conf configuration file as needed a Define the command name and the path...

Страница 78: ...ources All Dashboards are owned by a particular Organization User A User is a named account in Grafana A user can belong to one or more Organizations and can be assigned different levels of privileges...

Страница 79: ...oop and Spark dashboards to display data refer to Update the InfluxDB Data Retention Policy on page 54 Reducing the amount of data retained makes Grafana dashboards display faster 4 4 Default Grafana...

Страница 80: ...st traffic generated by CGE is not shown here Operational network Bytes sec In Out Displays the overall operational network traffic information for all nodes Management network Bytes sec In Out Displa...

Страница 81: ...sec In Out Displays the overall management network traffic information for each node FILE SYSTEM UTILIZATION Root File System Hard Drive dev sda Reads Writes Bytes Sec 200MB sec max Displays informati...

Страница 82: ...d by all the compute nodes FILE SYSTEM READS WRITE BYTES SEC Root File System Hard Drive dev sda Reads Writes Bytes Sec 200MB sec max Displays information about the usage of memory on the root file sy...

Страница 83: ...Management network Bytes sec In Out Displays the overall management network traffic information for compute nodes Figure 7 Compute Node Performance Statistics Hadoop Application Metrics This section c...

Страница 84: ...ime The Y axis represents a linear number AllocatedContainers Displays the number of allocated YARN containers by all the jobs in the hadoop cluster The Y axis represents a linear number DataNode Byte...

Страница 85: ...or GB used by all the I O and login nodes FILE SYSTEM READS WRITE BYTES SEC Root File System Hard Drive dev sda Reads Writes Bytes Sec 200MB sec max Displays information about the usage of memory on...

Страница 86: ...Displays graphs representing statistical data related to network CPU and I O utilization for individual Urika GX nodes The node s hostname can be selected using the hostname drop down provided on the...

Страница 87: ...elected node NFS HDFS Lustre Percentage Used Displays the percentage of NFS HDFS and Lustre used by the selected node File System Percent Used Displays the percentage of file system used by the select...

Страница 88: ...ys the management network s data rate NETWORK PACKET DROPS SEC AND ERRORS SEC Operational Network Dropped and Errors Per Sec Displays the number of dropped packets and errors per second for the operat...

Страница 89: ...ber of tasks and X axis displays the start stop time of the task for a particular Spark Job This section contains the following graphs Completed Tasks Per Job The approximate total number of tasks tha...

Страница 90: ...ted will always be greater than or equal to used JVM Memory Usage Max Represents the maximum amount of memory in bytes that can be used for memory management Its value may be undefined The maximum amo...

Страница 91: ...r admin admin true 3 Edit the etc influxdb influxdb conf file as root vi etc influxdb influxdb conf change auth to be true save file exit systemctl restart influxdb At this point Grafana dashboards wi...

Страница 92: ...ikaGXHadoop 906M CrayUrikaGXSpark 21M _internal 4 Connect to InfluxDB to view the current data retention policy bin influx Visit https enterprise influxdata com to register for updates InfluxDB server...

Страница 93: ...iguration file contains instructions for making such updates 4 8 Change the Default Timezone Displayed on Grafana Prerequisites This procedure requires root privileges Before performing this procedure...

Страница 94: ...Figure 14 Grafana Login Screen 3 Select Home admin Preferences Figure 15 Preferences Interface System Monitoring S3016 94...

Страница 95: ...rvice mode that supports using Grafana For more information see the urika state man page and refer to Urika GX Service Modes on page 177 About this task In addition to the default set of dashboards ad...

Страница 96: ...y of access or at http hostname login2 3000 2 Log on to Grafana by entering LDAP credentials or credentials of the default Grafana account username admin password admin to log on to Grafana Figure 17...

Страница 97: ...vigation menu 5 Add panels graphs rows to the new dashboard and customize as needed 6 Optional Update the dashboard s default settings using Settings from the configuration gear at the top of the inte...

Страница 98: ...o the default set of panels administrators can add additional panels to the Dashboard of the Grafana UI as described in the following instructions Procedure 1 Access the Grafana UI using the Urika GX...

Страница 99: ...4 Select Add Panel graph from the green menu on the left side of the Dashboard Figure 22 Add a New Graph to Grafana Dashboard 5 Click on the title of the new panel and select edit as shown in the fol...

Страница 100: ...ites This procedure requires root privileges Before performing this procedure use the urika state command to ensure that the system is operating in the service mode that supports using InfluxDB For mo...

Страница 101: ...Prerequisites This procedure requires root access About this task Access to the iSCB module is made through an SSH password less login Once logged on the iSCB status command can be used to monitor the...

Страница 102: ...Executable and Linkable Format ELF object can be accessed either directly during the handling of a kernel crash through proc vmcore or it can be automatically saved to a locally accessible file system...

Страница 103: ...can be used to retrieve various pieces of information Procedure 1 Log on to the SMW as root 2 Execute the urika check platform command as shown in the following examples Various HSS checks Rack contro...

Страница 104: ...Ethernet port and a dedicated logical serial port assigned for command and power management access The iSCB module is typically managed over a secure private management network Cray recommends the iSC...

Страница 105: ...s files etc hosts 192 168 2 10 r0s0i0 rx0y0s0i0 iscb_r0s0i0 192 168 2 11 r0s1i0 rx0y0s1i0 iscb_r0s1i0 192 168 2 12 r0s2i0 rx0y0s2i0 iscb_r0s2i0 192 168 2 13 r0s3i0 rx0y0s3i0 iscb_r0s3i0 iSCB Password...

Страница 106: ...ID LED log Display event logs by specified log level all info critical and warning node Display status of all or specified nodes nvram Display or update the configuration parameters stored in NVRAM p...

Страница 107: ...7 0 0 0 0 5 r0s0n5 on 0070 1 08 00 1e 67 b5 2b f4 10 10 128 8 0 0 0 0 6 r0s0n6 on 0070 1 08 00 1e 67 b5 28 f7 10 10 128 9 0 0 0 0 7 r0s0n7 on 0070 1 08 00 1e 67 b5 28 bb 10 10 128 10 0 0 0 0 8 r0s0n8...

Страница 108: ...nection The asterisk indicates the current connection Syntax console kill slot_number Parameters kill slot_number Kill the specified CLI connection Example r0s0i0 iscb console No Auth Remote Timeout I...

Страница 109: ...0 PM Dev LM25066 AA power on 12V 12 1 V Watt 5 0 W Ver 2 stat 0103h htbeat 59 dANC 22 C 55 Aries0 29 C 85 Aries1 30 C 85 AOC0 0 C 1 AOC1 0 C 1 AOC2 0 C 1 AOC3 0 C 1 AOC4 0 C 1 ok r0s0i0 iscb danc cycl...

Страница 110: ...M1276 3 Power 12 0 A 153 W Peak 558 W Average 477 W Energy 13520 6 Wh in last 101865secs 28h 17m 45s ok r0s0i0 help Display available commands and usage Syntax help command Parameters command Displays...

Страница 111: ...lasterr ok r0s0i0 led Turn Off the green red LED and the node s ID LED Syntax led on off node all Parameters on Turn on the specific nodes idled If node is already turned on there will be no action of...

Страница 112: ...through 16 all Display status for all nodes Example r0s0i0 iscb node No Name Pwr LED GPU BMC PID Sns GPUMap Watt Tmrg Console 0 r0s0n0 on off off 1 08 0070 4 0 0000 108 63 idle 1 r0s0n1 on off off 1...

Страница 113: ...Addr 10 11 0 20 255 255 0 0 gw 10 11 0 1 Time Date TZ CST6CDT TimeServer 192 168 0 1 Event Log Level info Remote Log Level info LogServer 192 168 0 1 Port 514 SNMP Trap Level off Community public Rece...

Страница 114: ...urn on the specific nodes If node is already turned on there will be no action off Turn off the specific nodes If node is already turned off there will be no action cycle Turn off then turn on the spe...

Страница 115: ...t dur min max count err last_access blade0 0 5 3 5 10775 0 48781 pm0 0 5 4 5 10774 0 48781 blade1 0 4 3 5 10775 0 48782 pm0 0 5 4 5 10774 0 48782 blade2 0 4 4 5 10775 0 48783 pm0 0 5 4 5 10774 0 48783...

Страница 116: ...4 16167 0 48782 dANC1 0 3 3 4 16156 0 48782 dANC1 pm 0 3 3 4 16167 0 48782 ok r0s0i0 psu Display status power consumption and energy data of all or specified PSUs Syntax psu psuno Parameters psuno PSU...

Страница 117: ...p sensor data from specified node Syntax pump node Parameters node Specified node 1 through 16 Example r0s0i0 iscb pump 1 slot 1 name r0s0n1 present on power on pumps no pump sensor ok r0s0i0 reboot R...

Страница 118: ...E cluster Not implemented for Urika GX Default is off acelog level If ace is set generate ace related log of defined level info warning critical activefan on off Dynamic Fan Control by node s health s...

Страница 119: ...lade exceeds 85 C the blade health status is set to warning gpu type slotmap GPU Co proc type and PCI slot map Type is one of auto m20xx k10 k20 k40 k80 and phi gputemp temp The critical temperature o...

Страница 120: ...abinet subrack ID Type is one of SR5110 SR6110 SR8104 SR8204 SR8202 SR10216 Cabinet ID is 0 127 Subrack ID is 0 7 shelfname name Subrack alias name 14 chars max shutdown level Event level of shutdown...

Страница 121: ...arning and higher level event logs Example r0s0i0 iscb status iSCB Status Node 00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15 Total Power ID LED Console BMC Temp C 63 61 66 61 63 60 64 62 67 68 67 1...

Страница 122: ...dle 0 0 0 0 0 7003 telnet ttyUSB2 idle 0 0 0 0 0 7004 telnet ttyUSB3 idle 0 0 0 0 0 7005 telnet ttyUSB4 idle 0 0 0 0 0 7006 telnet ttyUSB5 idle 0 0 0 0 0 7007 telnet ttyUSB6 idle 0 0 0 0 0 7008 telnet...

Страница 123: ...iscb 1 0 bin 100 17MB 8 8MB s 00 02 r0s0i0 ssh root r0s0i0 iscb updatefw ok r0s0i0 ver Display firmware version and build date Syntax ver Parameters None Example r0s0i0 iscb ver iSCB Ver 1 1 11 53 51...

Страница 124: ...obs to a Mesos cluster frameworks register with Mesos Mesos offers resources to registered frameworks Frameworks can either choose to accept or reject the resource offer from Mesos If the resource off...

Страница 125: ...e Urika GX scripts for flexing clusters and the mrun command do not submit their jobs unless they know the resource requirement is satisfied Once the flex up request is successful YARN uses its own qu...

Страница 126: ...plications and or frameworks It lies between the application layer and the operating system and simplifies the process of managing applications in large scale cluster environments while optimizing res...

Страница 127: ...ompute nodes YARN Resource Manager HDFS NameNode Secondary HDFS NameNode Hadoop Application Timeline Server Hadoop Job History Server Spark History Server Oozie For services like Mesos Masters and Mar...

Страница 128: ...to verify that the Mesos service is running Check the service mode by executing urika service mode to ensure that the system is running in the service mode required for Mesos to run About this task T...

Страница 129: ...md system services must be running for mrun CGE to work If either service is stopped or disabled mrun will no longer be able to function The mrun command needs to be executed from a login node Some ex...

Страница 130: ...port 2016 133 03 50 28 235572 Nodes 20 CPUs 320 480 user user cmd cge server nid00032 urika com startedAt 2016 05 12T17 05 53 573Z nid00028 urika com startedAt 2016 05 12T17 05 53 360Z nid00010 urika...

Страница 131: ...0x05 nid00005 32 515758 idle 6 0x06 nid00006 32 515758 idle 7 0x07 nid00007 32 515758 busy 8 0x08 nid00008 32 515758 busy 9 0x09 nid00009 32 515758 busy 10 0x0a nid00010 32 515758 busy 11 0x0b nid0001...

Страница 132: ...lines are ignored NCMDServer nid00000 MesosServer localhost same as host MesosPort 5050 MarathonServer localhost MarathonPort 8080 DebugFLAG False same as debug VerboseFLAG False same as verbose JobT...

Страница 133: ...cation instances as needed In addition Cray developed scripts for starting a cluster of YARN Node Managers are also launched through Marathon Before using Marathon ensure that the system is running in...

Страница 134: ...accessed at the port it runs on i e at http hostname login1 8080 or http hostname login2 8080 Figure 24 Urika GX Applications Interface Resource Management S3016 134...

Страница 135: ...UI Marathon also enables creating applications from the UI via the Create Application button which displays the New Application pop up Figure 26 Create an Application Using the Marathon UI Resource Ma...

Страница 136: ...cedure requires version 10 11 of the operating system About this task Cray recommends to have the Spark Thrift to be started up by administrators however users can use the following instructions if th...

Page 137: ...tem is running in the service mode that allows use of the Cray Application Management UI Execute the urika state or urika service mode commands to check the service mode For more information refer to...

Page 138: ...uick Filters UI elements are used at the same time only jobs that match the selected quick filter will be displayed The table displayed on the UI contains information about submitted jobs and contains...

Page 139: ...Finished which can be used to view and download logs that help identify whether or not the Spark job succeeded Selecting this link will present a login screen where users will need to enter their LDAP...

Page 140: ...See the mount 8 and dvs 5 man pages for more information Cray DVS uses the Linux supplied VFS interface to process file system access operations This allows DVS to project any POSIX compliant file sy...

Page 141: ...failover and failback for parallel modes The topic describes how it works and includes example console messages DVS periodic sync promotes data and application resiliency and is more efficient than t...

Page 142: ...16 bit must be 0 or 1 for SET Retrieves sets the deferopens value for a file on a DVS mount DVS_GET_FILE_KILLPROCESS DVS_SET_FILE_KILLPROCESS signed 16 bit must be 0 or 1 for SET Retrieves sets the ki...

Page 143: ...t setting noatomic or 0 Associated environment variable DVS_ATOMIC Additional notes none attrcache_timeout attrcache_timeout enables client side attribute caching File attributes and dentries for geta...

Page 144: ...he final close of a file descriptor in addition to forwarding the close to the DVS server the DVS server node waits until data has been written to the underlying media before indicating that the close...

Page 145: ...ibute_create_ops caused DVS to change its hashing algorithm so that create and lookup requests are distributed across all of the servers as opposed to being distributed to a single server This applies...

Page 146: ...servers Default setting fnv 1a Associated environment variable none Additional notes Except in cases of extremely advanced administrators or specific advice from DVS developers do not use the hash mou...

Page 147: ...nce For a description of loadbalance mode see About DVS Loadbalance Mode on page 152 noloadbalance automatically sets the following mount options maxnodes 1 cache 1 and hash_on_nid 0 Default setting n...

Page 148: ...required unless nodename is used Associated environment variable none Additional notes none nodename nodename is equivalent to nodefile but the administrator specifies a list of server nodes on the mo...

Page 149: ...nv or 1 Associated environment variable none Additional notes none 6 1 4 DVS Environment Variables By default user environment variables allow client override of options specified in the etc fstab ent...

Page 150: ...m administrator s choice of DVS mount point options A DVS mode is really just the name given to a collection of mount options chosen to achieve a particular goal Users cannot choose among DVS modes un...

Page 151: ...s that all bytes associated with a read or write are not interleaved with bytes from other read or write operations 6 1 5 2 About DVS Serial Mode Serial mode is the simplest implementation of DVS wher...

Page 152: ...bar1 foo bar2 foo bar3 foo bar3 foo bar1 foo bar2 foo bar3 foo bar1 foo bar2 foo bar1 foo bar2 foo bar3 6 1 5 4 About DVS Loadbalance Mode Loadbalance mode is a client access mode for DVS used to mor...

Page 153: ...handled by three different DVS servers thus distributing the load at a more granular level than that achieved by cluster parallel mode Note that while file I O is distributed at the block level file m...

Page 154: ...e that server is selected the entire read or write request is handled by that server only This ensures that all I O requests are atomic while allowing DVS clients to access different servers for subse...

Page 155: ...anner When failback occurs files are restriped to their original pattern Client System Console Message DVS file_node_down removing from list of available servers for 2 mount points The following messa...

Page 156: ...300 On DVS clients specifies the number of seconds between checks for dirty files that need to request the last sync time from the server The default value is 600 A fourth proc file procsfs dvs sync_s...

Page 157: ...was not spawned through aprun the request apid is 0 An example output of the file looks like cat proc fs dvs ipc requests server r0s1c1n3 request RQ_LOOKUP path dsl ufs home user 12795 time 0 000 sec...

Page 158: ...le in the next release Write access is available when the MAP_PRIVATE flag is specified to mmap because the file data is private to the process that performed the mmap and therefore coherency with oth...

Page 159: ...rop_caches proc file The following example uses the second mount point on the client and uses the cat command first to confirm that is the desired mount point To specify a different mount point replac...

Page 160: ...0000040b517a2609 doesn t contain a valid partition table Disk dev disk by id dm uuid mpath 360080e50002ff41a0000040e517a261e 7999 4 GB 7999443697664 bytes 255 heads 63 sectors track 972543 cylinders t...

Page 161: ...uuid mpath 360080e50002ff41a0000040b517a2609 successfully created nid00006 pvcreate dev disk by id dm uuid mpath 360080e50002ff41a0000040e517a261e Physical volume dev disk by id dm uuid mpath 360080e5...

Page 162: ...o add the following dev mapper fs1 fs1 scratch xfs defaults 1 0 If there is a secondary XFS DVS node used as a high availability manual backup and failover device for the primary XFS DVS node do not a...

Page 163: ...S request log options dvsproc dvs_request_log_enabled 1 6 1 8 5 Change Kernel Module Parameters Dynamically Using Proc Files Some of the kernel module parameters in the following list can be changed d...

Page 164: ...dvsipc_msg_thread_limit and dvsipc_single_msg_queue however this module parameter has priority over the individual ones and if set will override them Table 16 dvs_instance_info Field Definitions Fiel...

Page 165: ...to nid_of_choice etc modprobe d dvs local conf comment out either the compute or service line depending on the type of node s being configured Set parameters for DVS thread pool in DVS IPC layer Defa...

Page 166: ...er to 1 disables fairness of service by forcing DVS to use a single message queue instead of a list of queues Default value 0 fairness of service enabled To view read only cat sys module dvsipc parame...

Page 167: ...module dvs parameters dvsof_concurrent_writes To change prior to boot add these lines to nid_of_choice etc modprobe d dvs local conf Disable concurrent writes options dvs dvsof_concurrent_writes 1 Set...

Page 168: ...ons dvsproc dvsproc_stat_defaults enable legacy brief plain notest To change dynamically This is root writable at sys module dvsproc parameters dvsproc_stat_defaults but changes should be made only th...

Page 169: ...he DVS IPC layer Table 17 kdwfs_instance_info Field Definitions Field Definition thread_min Number of threads created at startup thread_max Maximum number of persistent threads thread_limit Maximum nu...

Page 170: ...defaults Set parameters for DataWarp thread pool in DVS IPC layer options dvsipc kdwfs_instance_info 256 256 1024 4 10 1 1 1 This translates to kdwfs_instance_info thread_min 256 thread_max 256 threa...

Page 171: ...meout_secs To change prior to boot add these lines to nid_of_choice etc modprobe d dvs local conf Set the timeout seconds for syncing dirty files on a DVS server or client options dvs sync_dirty_timeo...

Page 172: ...DVS client DVS quiesce can be used when a file system needs to be repaired or to safely take a DVS server node out of service CAUTION Because it may cause data corruption do not use DVS quiesce to qui...

Page 173: ...rent server in a round robin fashion until it finds a server that allows the operation to complete successfully open file If the request is for an open file read write lseek etc then DVS attempts the...

Page 174: ...requires administrative privileges About this task The example provided in this procedure is for a scenario in which an admin wants to remove a DVS server from service but wants to let any outstanding...

Page 175: ...rent aspects of the Urika GX system such as deciding whether a user who has been identified by the Linux login mechanism is actually permitted to use Urika GX whether a user is restricted to tenant ac...

Page 176: ...have no authorization to log in to the Urika GX in any way even if the user can pass the authentication stage of logging into the system When the crayLoginShell attribute is set to a regular shell such as...

Page 177: ...s password Until the user has a keytab however the user will not have access to HDFS Kerberos is only used in the Urika GX secure service mode Both of these secrets are user specific and both of them...

Page 178: ...Kubernetes No Yes HDFS NameNode Yes Yes HDFS Secondary NameNode Yes Yes HDFS DataNode Yes Yes CGE Yes No Spark History Server Yes No Spark Thrift Server Yes No YARN Yes No YARN Resource Manager Yes No...

Page 179: ...roxy Yes Yes Connectivity to Tableau Yes No Docker No Yes Any additional services installed on the system will use their own security mechanisms and will not be affected by Urika GX s default and secu...

Page 180: ...in place and the security of data that was protected by secure mode may be compromised while running in the default mode Cray cannot extend the secure mode security assurances to any system that has...

Page 181: ...ecure service mode the system will return the following message Requested service unavailable under current security mode Accessing an Application URL If the user attempts to access an application spe...

Page 182: ...l ssh Enter passphrase empty for no passphrase Enter same passphrase again Your identification has been saved in home users erl ssh id_rsa Your public key has been saved in home users erl ssh id_rsa p...

Page 183: ...be an underlying issue The second and subsequent times this is run the user will not be prompted for host authenticity This is shown below utp launch true echo 0 7 5 Tenancy On Urika GX tenancy refer...

Page 184: ...e hsm_restore hsm_release hsm_remove hsm_cancel swap_layouts migrate mv help version Two convenient commands for diagnosing problems using Urika GX s tenant proxy mechanism env true false IMPORTANT Th...

Page 185: ...procedure should be used as guidance only Site specific customizations requirements are beyond the scope of this publication About Bridge Port 0 Management Network The Urika GX management network int...

Page 186: ...Xs0f1 files which are required for setting up br1 s configuration The syntax of the command is utm host net opns enable force opns enable dryrun opns enable ip_addr hwaddr not used netmask default_gat...

Page 187: ...CONFIGURATION mgmt enable NETINFO Enables the management network for use with UTM Attempts to detect and only make changes where needed mgmt enable dryrun NETINFO Show changes that would be made by mg...

Page 188: ...k Missing bridge br1 Found operation network ethernet device enp7s0f1 state br1 NO_IP_LINK enp7s0f1 UP using bridge NO IP 172 30 51 152 20 Action The Operations network on this host will need to be be...

Page 189: ...checking the status of tenants In addition tenant management also involves managing users associated with tenants These tasks are performed by a number of tenant management commands as described in th...

Page 190: ...NT_IP_ADDR 172 30 48 7 UXTENANT_TENANT_IP_ADDR_NETMASK 255 255 240 0 UXTENANT_TENANT_IP_ONBOOT no UXTENANT_TENANT_IP_GATEWAY 172 30 48 1 UXTENANT_TENANT_IP_DNS1 172 30 84 40 UXTENANT_TENANT_IP_DNS2 17...

Page 191: ...configurations home and lustre provided Site specific ones can be added as needed The sample mount points can be found by looking in the configuration files etc sysconfig uxtenant mounts home and etc...

Page 192: ...ag on a locally controlled mount because it allows tenant creation to fully manage tenant isolation within a directory This will be examined more closely in the lustre example below In this case the f...

Page 193: ...but it is required to be tenant isolated create the directory and export it to the tenant manually or by some site defined procedure prior to creating the tenant VM In that case set this setting to NO...

Page 194: ...N A None 10 142 150 1 cb_tenant_00 N A N A notfound N A N A None 10 142 166 100 cb_tenant_01 N A N A notfound N A N A None 10 142 166 101 cb_tenant_02 N A N A notfound N A N A None 10 142 166 102 cb_...

Page 195: ...nant VM s state can be retrieved via the urika tenant status command The list of states that a tenant VM can be in at a given time include Table 20 Tenant States State Description notfound The tenant...

Page 196: ...ot robust to node redeployment or replacement Therefore it is important to note that tenant VMs will go into the notfound state if a node is redeployed or replaced These state transitions are depict...

Page 197: ...ing items need to be kept under consideration while using this command Number of CPUs At least 2 CPUs need to remain available when the number of CPUs is changed by this script That is if there are N...

Page 198: ...xecute the usm sync users command if the deleted user accounts are added back again and usm sync users was not run since they were deleted If the ux tenant remove user command is executed on a deleted...

Page 199: ...wing for additional help information replacing sub command name with the name of the actual sub command help sub command name The system will return the following error if a user attempts to view help...

Page 200: ...re 1 Log on to the SMW as root 2 Configure a tenant VM For information about tenant configuration steps and files refer to Tenant Management on page 189 3 Ensure that the system is functioning in the...

Page 201: ...p s mesos_cluster 5 Start the Mesos service urika start s mesos_cluster If the system is running in the default service mode perform the following set of steps to remove a user 1 Execute the ux tenant...

Page 202: ...er a tenant can only access their own data and have no visibility into the existence of data belonging to other tenants This includes local file systems used for temporary working space In addition us...

Page 203: ...ectory the mode 750 rwx r x may be helpful since this only allows the owner of the directory to create remove rename files and directories in it Automatic Assignment of Tenant Group Ownership and mode...

Page 204: ...a warning indicating that it detected an argument that is not allowed for restricted users and that the argument is being removed When the cluster is switched to the secure service mode Kerberos and...

Page 205: ...o the system For example a user with Kerberos principal johnsmith local would be mapped to user johnsmith in HDFS HDFS provides a number of ways to map users to groups via group mapping providers The...

Page 206: ...ant add user command to add a user to the authorized list of users ux tenant add user u bob In this example bob is added to the authorized user list but is not assigned tenant membership which would p...

Page 207: ...man page View List of Authorized Users Execute the ux tenant list users to view the list of authorized users For more information refer to the ux tenant list users man page Urika GX Mesos and Kerbero...

Page 208: ...ng Mesos jobs so it is done as a manual step After running usm sync users or usm recreate secret wait for an appropriate maintenance window and run urika stop s mesos_cluster then urika start s mesos_c...

Page 209: ...t cannot account for that variation The actual details of the site s implementation will vary according to their authentication data source and policies There is no suggestion that the procedures outl...

Page 210: ...ldapsearch host port x search_dn query grep grep shell_field space sed sed e s shell_field space space if out search grep sed then return 1 fi echo out return 0 usage error 1 if z error then echo erro...

Page 211: ...e command will import users in dry run mode from echo the LDAP server at cfdcg02 us cray com using the table with the DN echo ou people dc datacenter dc cray dc com assuming that the field name echo f...

Page 212: ...hell shell dry run fi else if z dry_run then echo Removing user u if ux tenant remove user u u then echo WARNING removing user u failed skipping this user continue fi else echo Not really removing use...

Page 213: ...cess and then install them differently the tenant users with tenant membership and restricted access i e not specifying a crayLoginShell value in the ux tenant add user command the users with relaxed...

Page 214: ...hell dry run fi else if z dry_run then echo Removing user u if ux tenant remove user u u then echo WARNING removing user u failed skipping this user continue fi else echo Not really removing user u dr...

Page 215: ...be achieved by using the ux tenant list users command in the raw R mode The following shell function can help do this get_user_list ux tenant list users F name R while read expr do eval expr echo nam...

Page 216: ...ocumentation and Learning Resources UI Not available Grafana UI LDAP The system also ships with a default account that can be used to log on to Grafana The credentials of this account are username adm...

Page 217: ...ystem Administration Guide Nagios UI Default credentials User name crayadm Password initial0 7 9 Change Default Passwords Change the SMW s Password Follow the instructions documented in Change the Def...

Page 218: ...e asked to provide and confirm the old password the new password and will be asked to supply the old password again for the actual bind to take place After that the password will change If a new user...

Page 219: ...7 9 1 Default Urika GX System Accounts Default System Management Workstation SMW Accounts Table 25 Default SMW Accounts Account Name Password root initial0 crayadm crayadm The SMW account should not...

Page 220: ...systemd network systemd bus proxy sshd smmsp setroubleshoot saslauth rtkit rpcuser rpc radvd rabbitmq qemu pulse postgres postfix polkitd pcp oprofile ntp nginx nfsnobody mysql memcached mailnull libs...

Page 221: ...ce if it is running service nagios stop 4 Change the default Nagios password htpasswd c usr local nagios etc htpasswd users nagiosadmin 5 Start the HTTPD service service httpd start 6 Start the Nagios...

Page 222: ...name and password on the iDRAC s log in screen 4 Select the Submit button 5 Select iDRAC settings from the left navigation menu bar 6 Select User Authentication 7 Select the User ID for the user that...

Page 223: ...ge 7 9 4 Change the Default System Management Workstation SMW Passwords Prerequisites Ensure that the SMW is accessible This procedure requires root access About this task After logging on to the SMW...

Page 224: ...password on Urika GX Procedure 1 Log on to nid00030 which is a login node 2 Edit the slapd conf file and add in ACLs to allow users to modify the LDAP password root nid00030 vim usr local openldap et...

Page 225: ...password for user admin Password 7 9 7 Reset an Administrator LDAP Password on Systems Using Urika GX 1 2UP01 and Earlier Releases Prerequisites This procedure requires root privileges assumes that t...

Page 226: ...Va24 cachesize 50000 dirtyread dbnosync checkpoint 128 15 idlcachesize 50000 index objectClass eq database meta COMBINES the LDAP DATABASES database meta suffix dc local rootdn cn crayadm dc local roo...

Page 227: ...perform the necessary ldif operations however this is not configured in the default Urika GX installation therefore setting this up for existing systems requires knowing the OLC admin password Below a...

Page 228: ...ing entry SLAPD_CONF_DIR usr local openldap etc openldap slapd d Procedure 1 Log on to the LDAP host server as root 2 Generate a new hashed password root nid00030 slappasswd New password Re enter new...

Page 229: ...UE To resolve this issue please follow the instructions documented at http gethue com SSL authentication for Tableau can be set up using instructions documented in Enable SSL on Urika GX Urika GX ship...

Page 230: ...g the settings to enable SSL 3 Edit the settings for the Urika Applications Interface by uncommenting some settings to enable SSL In the following instructions it is assumed that the SSL certificate i...

Page 231: ...kthrift_ssl_backend mode tcp balance source server server1 192 168 0 33 10015 frontend grafana_ssl bind 29201 ssl crt etc ssl certs filename pem mode http option forwardfor option http server close op...

Page 232: ...d be available at https hostname login1 29202 HUE would still be available at http hostname login1 8888 but this URL is not secure It is recommended to use https hostname login1 29202 The preceding examp...

Page 233: ...anagement default application_management settings py as shown in the following example This allows the Urika GX Applications Interface page to load secure URLs configured in the preceding steps when t...

Page 234: ...name smw 13 Restart Apache on the SMW service httpd restart 14 Verify that all the URLs of services are accessible 7 12 Enable SSL for Spark Thrift Server of a Tenant Prerequisites This procedure requ...

Page 235: ...xml file to change the value of hive server2 use SSL to false 2 Stop and then start the Spark Thrift Server if it is currently running 7 13 Install a Trusted SSL Certificate on Urika GX Prerequisites T...

Page 236: ...ertificate 7 14 Enable LDAP Authentication on Urika GX Prerequisites This procedure requires root access Ensure that the storage LDAP client points at login node 1 which is the LDAP server on Urika GX...

Page 237: ...and changes Procedure 1 Log on to login node 1 as root ssh root login 1 2 Edit the usr local openldap etc openldap slapd conf file to uncomment the following lines and replace myldap with the actual...

Page 238: ...he following example system smw is used as an example for the SMW s hostname and username is used as an example of an LDAP user s username root system smw pdsh w nid000 00 47 id username c Log on to l...

Page 239: ...of HUE as shown in the following figure To resolve this issue please follow the instructions documented at http gethue com Figure 43 Error Message 5 Log on to the SMW 6 Restart HiveServer2 service ur...

Page 240: ...to modify other parameters of the hive site xml file that are documented in the Hive documentation as they are already configured on the Urika GX system 4 Start the Hive service urika start s hive 5...

Page 241: ...DAP based authentication System administrators can use LDAP servers to set up user authentication Add a user to a group Use the usermod command to add a user to an existing group Add a user to LDAP Use...

Page 242: ...75 DataNode used for data transfers 50010 DataNode used for metadata operations 8010 Table 33 Services Accessible via the Login Nodes via the Hostname Service Default Port Mesos Master UI 5050 This UI...

Page 243: ...chever login node 1 or 2 that the user executes spark submit spark shell spark sql pyspark on This UI is user visible InfluxDB 8086 on login2 InfluxDB runs on nid00046 on three sub rack and on nid0003...

Page 244: ...Service Port Kafka not configured by default 9092 Flume not configured by default 41414 Port for SSH 22 Security S3016 244...

Page 245: ...directory console date Contains node console logs consumer date Contains ERD event collector xtconsumer logs netwatch date Contains logs generated by the xtnetwatch command nlrd date Contains logs gen...

Page 246: ...ss often Nagios Default location of logs is var log nagios nagios log on the SMW This location can be configured if desired N A Cobbler Located in the Cobbler KVM at var log cobbler N A Grafana Locate...

Page 247: ...by editing the etc grafana grafana ini file Yes Jupyter Notebook Log levels are controlled by the Application log_level configuration parameter in etc jupyterhub jupyterhub_config py It is set to 30 b...

Page 248: ...InfluxDB var log influxdb influxd log collectl collectl does not produce any logging information It uses logging as a mechanism for storing metrics These metrics are exported to InfluxDB If collectl...

Page 249: ...spark test 1522834122688 driver STOP Application Stopped Wed Apr 04 04 30 09 CDT 2018 username spark test 1522834207396 driver START Application Started with 1 driver plus 2 0 executors using 3 0 core...

Page 250: ...ode e g a login node add the user to the user authorization list as follows smw ux tenant add user u username smw ux tenant relax username If this was seen on a tenant VM and the user is supposed to b...

Page 251: ...13 34 30 nid00030 sshd 143134 pam_cray_keytab kinit failed for user username see previous logs There may be other errors reported by kinit as well If the logs contain such information first check whe...

Page 252: ...ux tenant list users error configuration is out of date with the current version The following configuration files need to be brought up to date with the current version of Urika Tenant Management The...

Page 253: ...pts Look first in ifcfg br1 It should look something like this DEVICE br1 TYPE Bridge BOOTPROTO static ONBOOT yes NM_CONTROLLED no IPADDR 172 30 51 237 NETMASK 255 255 240 0 GATEWAY 172 30 48 1 DNS1...

Page 254: ...erated for the tenant VM login session A ticket granting ticket is generated each time a command is run on a physical node through the tenant proxy mechanism Logs Urika GX Tenant Management Logs utp s...

Page 255: ...ing environment for any given tenant VM resides in a disk image file found on the host node the node named in the etc sysconfig uxtenant hosts host name file named in the tenant configuration at qemu...

Page 256: ...from this situation as well since we do not know all of the paths that could result in network timeouts during LDAP start up One consequence of this can be that the slapd LDAP server on login1 will t...

Page 257: ...cified or if the secret is not entered in plain text 4 Start the Mesos cluster CAUTION The Mesos cluster needs to be restarted after running the urika mesos change secret command otherwise the framewo...

Page 258: ...r var log oozie HUE HUE logs are generated under var log hue Flex scripts urika yam status urika yam flexup urika yam flexdown urika yam flexdown all These scripts generate log files under var log uri...

Page 259: ...on all the nodes with Mesos slave processes running The following example can be used on a 3 sub rack system pdsh w nid000 00 47 x nid000 00 16 30 31 32 46 47 rm vf var log mesos agent meta slaves lat...

Page 260: ...Error Message Description Resolution INFO mesos CoarseMesosSchedulerBac kend Blacklisting Mesos slave 20151120 121737 1560611850 505 0 20795 S0 due to too many failures is Spark installed on it INFO...

Page 261: ...erwise please specify the desired value when using the urika yam flexup command ID names can only contain alphanumeric dot and dash characters not allowed in jhoole Usage urika yam flexup nodes nodes...

Page 262: ...t be set to the same as the default value Ensure that this parameter is set and the value is the same as default value Marathon port is not set in the etc urika yam conf file please re check the confi...

Page 263: ...DFS is in a healthy state Marathon and Mesos services are up and running The status of the aforementioned services can be checked using the urika state command For more information see the urika state...

Page 264: ...ing too many jobs a user waiting for their job to start but previous jobs have not freed up nodes etc Additionally if a user set a job timeout s to 1 hour and the job lasted longer than 1 hour the use...

Page 265: ...nplugs an Ethernet cable while a CGE job was running or a node died etc Ensure that the Ethernet cable is plugged while jobs are running NCMD Error leasing cookies MUNGE Munge authentication failure s...

Page 266: ...wn option of the mrun command The first remote compute node did not exit successfully Thus this error message can only occur if the application is killed or if someone used the Marathon REST API in an...

Page 267: ..._files do echo Fixing hdfsfile hadoop fs setrep 3 hdfsfile done Too many corrupt blocks in name node UI The NameNode might not have access to at least one replication of the block Check if any of the...

Page 268: ...rk jobs Kill the job using the Spark UI Click on the text kill in the Description column of the Stages tab Kill the job using the Linux kill command Kill the job using the Ctrl C keyboard keys The sys...

Page 269: ...ror N nodes must be 1 parser error n images must be N nodes parser error No command specified to launch error Not enough CPUs error Not enough CPUs for exclusive access error Not enough nodes parser e...

Page 270: ...run 1 man page 8 10 Troubleshoot Application Hangs as a Result of NFS File Locking About this task Applications may hang when NFS file systems are projected through DVS and file locking is used To avo...

Page 271: ...o define a large number of user environment variables Cray recommends that users include those definitions in the user s shell so that they are available at startup and stored where DVS can always loc...

Page 272: ...too many times or with very large temporary files the SSDs may begin to fill up This can cause Spark jobs to fail or slow down Urika GX checks for any idle nodes once per hour and cleans up any left o...
