
Cause - These errors may be encountered when, for example, an admin physically unplugs an Ethernet cable
while a CGE job is running, or when a node dies.
● Examples:
○ error("mrun: select(): Exception %s" % str(e))
○ error("mrun: error socket")
○ error %r:%s died\n" % (err,args[0]))
○ error("mrund: select(): Exception %s\n" % str(e))
System service errors
Cause - These errors occur only if the specific system services have failed. The cause of the issue may be
identified by looking at the log messages in /var/log/messages on the node where the message was
encountered.
● Examples:
○ NCMD: Error leasing cookies MUNGE:
○ Munge authentication failure [%s] (%s).\n
For more information, see the mrun(1) man page.
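When investigating a system service error, it can help to filter the log on the affected node for entries related to the failing service. The following is a minimal sketch, not a documented procedure: the function name find_service_errors and the search pattern are assumptions chosen to match the example messages above, and should be adjusted to the service in question.

```shell
# Hypothetical helper: print the most recent log lines that mention
# MUNGE or NCMD. The pattern is an illustrative assumption.
# Usage: find_service_errors [logfile]   (defaults to /var/log/messages)
find_service_errors() {
    logfile="${1:-/var/log/messages}"
    # Case-insensitive match, limited to the last 20 matching lines.
    grep -iE 'munge|ncmd' "$logfile" | tail -n 20
}
```

Run the function on the node where the message was encountered. Reading /var/log/messages typically requires root privileges.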
8.10 Troubleshoot: Application Hangs as a Result of NFS File Locking
About this task
Applications may hang when NFS file systems are projected through DVS and file locking is used. To avoid this
issue:
Procedure
Specify the nolock option in the NFS mount point on DVS servers. See the nfs(5) man page for more
information on the nolock option.
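As an illustration, an NFS entry with this option might look like the commented line in the sketch below, and the helper lists NFS entries that still lack nolock. This assumes the DVS server defines its NFS mounts in an fstab-style file. The server name, export path, mount point, and the function name nfs_mounts_without_nolock are all hypothetical.

```shell
# Example fstab entry with nolock (hypothetical server and paths):
#   nfs-server:/export  /mnt/export  nfs  rw,nolock  0 0

# Hypothetical helper: print the mount point of every NFS entry in an
# fstab-style file whose option field does not include "nolock".
# Usage: nfs_mounts_without_nolock [fstab]   (defaults to /etc/fstab)
nfs_mounts_without_nolock() {
    fstab="${1:-/etc/fstab}"
    # Fields: 1=device 2=mount point 3=fs type 4=options. Skip comments.
    awk '$1 !~ /^#/ && $3 == "nfs" && $4 !~ /(^|,)nolock(,|$)/ { print $2 }' "$fstab"
}
```

An empty result means every NFS entry in the file already specifies nolock.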
8.11 Troubleshoot: DVS does not Start after Data Store Move
About this task
If DVS fails after the Cray system's data store is moved to a shared external Lustre file system, verify that DVS
has the correct lnd_name that uniquely identifies the Cray system to the LNet router. The default value for
lnd_name on a single-user Lustre file system is gni. Each system sharing an external Lustre file system must
have a unique gni* identifier, such as gni0, gni1.