○ error("Unexpected 'frameworks' data from Mesos")
○ error("mrun: Getting mrund state threw exception - %s" % )
○ error("getting marathon controller state threw exception - %s" %)
○ error("Unexpected 'apps' data from Marathon")
○ error("mrun: Launching mrund threw exception - %s" % (str(e)))
○ error("mrun: unexpected 'app' data from Marathon: exception - %s" % (str(e)))
○ error("mrun: startMrund failed")
○ error("mrun: Exception received while waiting for ")
Command-line options errors
Potential cause - These errors are typically caused by user errors, such as typos, or by too few nodes being
available to run a job.
● Format:
Mon Jul 11 2016 11:47:22.281972 UTC[][mrun]:ERROR:Not enough CPUs for exclusive access. Available: 0 Needed: 1
● Examples:
○ parser.error("Only --mem_bind=local supported")
○ parser.error("Only --cpu-freq=high supported")
○ parser.error("Only --kill-on-bad-exit=1 supported")
○ parser.error("-n should equal (-N * --ntasks-per-node)")
○ parser.error("-N nodes must be >= 1")
○ parser.error("-n images must be >= -N nodes")
○ parser.error("No command specified to launch");
○ error("Not enough CPUs. "
○ error("Not enough CPUs for exclusive access. " )
○ error("Not enough nodes. " )
○ parser.error("name [%s] must only contain 'a-z','0-9','-' and '.'" )
○ parser.error("[%s] is not executable file" % args[0])
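The option checks behind several of these messages can be illustrated with a short sketch. It is not mrun's actual source. The parser name mrun-sketch, the --name option, the defaults, and the helper names build_parser and check_args are assumptions made for illustration; only the message strings come from the examples above.

```python
import argparse
import re

# Assumed pattern for the job-name rule quoted in the message above:
# only 'a-z', '0-9', '-' and '.' are accepted.
NAME_RE = re.compile(r"[a-z0-9.-]+")


def build_parser():
    """Build a hypothetical parser with a few mrun-like options."""
    parser = argparse.ArgumentParser(prog="mrun-sketch")
    parser.add_argument("-n", dest="images", type=int, default=1)
    parser.add_argument("-N", dest="nodes", type=int, default=1)
    parser.add_argument("--ntasks-per-node", dest="ntasks_per_node", type=int)
    parser.add_argument("--name", default="job")  # option name is an assumption
    parser.add_argument("command", nargs="*")
    return parser


def check_args(parser, argv):
    """Parse argv and apply consistency checks; parser.error() exits with status 2."""
    args = parser.parse_args(argv)
    if args.nodes < 1:
        parser.error("-N nodes must be >= 1")
    if args.images < args.nodes:
        parser.error("-n images must be >= -N nodes")
    if (args.ntasks_per_node is not None
            and args.images != args.nodes * args.ntasks_per_node):
        parser.error("-n should equal (-N * --ntasks-per-node)")
    if not NAME_RE.fullmatch(args.name):
        parser.error("name [%s] must only contain 'a-z','0-9','-' and '.'" % args.name)
    if not args.command:
        parser.error("No command specified to launch")
    return args
```

For example, check_args(build_parser(), ["-n", "4", "-N", "2", "--ntasks-per-node", "2", "hostname"]) passes every check, while asking for fewer images than nodes exits through parser.error. The check that the command is an executable file is left out of the sketch because it depends on the local file system.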
Timeout errors
Cause - These errors indicate timeouts and resource contention. For example, the job timed out, the machine is
busy, too many users are running too many jobs, or a user is waiting for a job to start while previous jobs have
not yet freed up nodes. Additionally, if a user sets a job's timeout to 1 hour and the job runs longer than 1 hour,
the user gets a Job Cancelled timeout error.
● Format:
Mon Jul 11 2016 12:13:08.269371 UTC[][mrun]:ERROR:mrun: Force Terminated job /mrun/2016-193-12-13-03.174056 Cancelled due to Timeout
● Examples:
○ error("mrun: --immediate timed out while waiting")
○ error("mrun: Timed out waiting for mrund : %s" % appID)
○ error("mrun: Force Terminated job %s Cancelled due to Timeout" %)
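When scanning logs for these messages, it can help to split each line into its fields first. The following is a minimal sketch that assumes every line follows the layout of the Format samples above: a timestamp, the literal UTC, two bracketed fields, a level, and the message. The helper names parse_line and is_timeout and the field name "context" (for the first, empty bracket) are hypothetical; the manual does not name these fields.

```python
import re
from datetime import datetime

# Layout assumed from the Format samples:
#   <timestamp> UTC[<context>][<program>]:<LEVEL>:<message>
LOG_RE = re.compile(
    r"^(?P<stamp>\w{3} \w{3} \d{1,2} \d{4} \d{2}:\d{2}:\d{2}\.\d+) UTC"
    r"\[(?P<context>[^\]]*)\]\[(?P<program>[^\]]*)\]:"
    r"(?P<level>[A-Z]+):(?P<message>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # %a and %b expect English day and month names (C or en_US locale).
    fields["time"] = datetime.strptime(
        fields.pop("stamp"), "%a %b %d %Y %H:%M:%S.%f"
    )
    return fields


def is_timeout(message):
    """Heuristic: treat a message as a timeout error if it mentions a timeout."""
    text = message.lower()
    return "timed out" in text or "timeout" in text
```

Applied to the Format sample above, parse_line yields program "mrun" and level "ERROR", and is_timeout returns True for its message. The substring test is a simple heuristic based on the three example messages, not an exhaustive classifier.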
Network errors, such as socket, switch, TCP, node failure
Troubleshooting
S3016
269