
Mesos tracks the frameworks that have registered with it. If jobs are submitted and there are no resources
available, frameworks will not be dropped. Instead, frameworks remain registered (unless manually killed by
the user) and continue to receive updated resource offers from Mesos at regular intervals. As an example,
consider a scenario in which four Spark jobs are running and using all of the resources on the cluster. A user
submits a job with framework Y, and framework Y waits for resources. As each Spark job completes
its execution, it releases its resources and Mesos updates its resource availability. Mesos will continue to make
resource offers to framework Y, which can choose either to accept or reject them. If Y accepts the
resources, it schedules its tasks. If Y rejects the resources, it remains registered with Mesos and
continues to receive resource offers from Mesos.
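The offer cycle described above can be sketched as a small simulation. All names here are hypothetical; this is an illustration of the accept/reject protocol, not the Mesos API.

```python
# Illustrative simulation of the Mesos offer cycle: a framework that
# rejects an offer stays registered and is offered resources again
# on the next cycle. Hypothetical names; not the real Mesos API.

def run_offer_cycle(free_cores, frameworks):
    """Repeatedly offer freed resources to waiting frameworks.

    'frameworks' maps a framework name to the cores it needs.
    Returns the frameworks that scheduled tasks and the cores left over.
    """
    scheduled = []
    waiting = dict(frameworks)
    while waiting:
        progress = False
        for name, needed in list(waiting.items()):
            # Mesos offers the currently free resources to the framework.
            if free_cores >= needed:
                # Framework accepts the offer and schedules its tasks.
                free_cores -= needed
                scheduled.append(name)
                del waiting[name]
                progress = True
            # else: framework rejects the offer but remains registered.
        if not progress:
            break  # no waiting framework can be satisfied yet
    return scheduled, free_cores

# Framework Y's demand fits in the freed cores; Z keeps waiting.
done, left = run_offer_cycle(free_cores=64, frameworks={"Y": 40, "Z": 30})
```

In this run, Y accepts an offer (40 of 64 free cores) while Z rejects the remaining 24 and stays registered, mirroring the scenario in the text.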
Allocation of resources to Spark by Mesos
Let us say that the spark-submit command is executed with parameters --total-executor-cores 100
--executor-memory 80G. Each node has 32 cores. Mesos tries to use as few nodes as possible for the 100
cores requested, so in this case it starts Spark executors on 4 nodes (roundup(100 / 32)). Each executor has
been requested to have 80G of memory. The default value for spark.mesos.executor.memoryOverhead
is 10%, so Mesos allocates 88G to each executor. In Mesos, it can be seen that 88 * 4 = 352 GB is allocated to the 4
Spark executors. For more information, see the latest Spark documentation.
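The allocation arithmetic above can be checked with a few lines, using the values from the example:

```python
import math

# Worked version of the allocation arithmetic in the text.
total_cores = 100        # --total-executor-cores 100
cores_per_node = 32      # cores available on each node
executor_memory_g = 80   # --executor-memory 80G
overhead_ratio = 0.10    # spark.mesos.executor.memoryOverhead default of 10%

# Mesos uses as few nodes as possible: roundup(100 / 32) = 4 nodes.
nodes = math.ceil(total_cores / cores_per_node)

# Each executor gets its requested memory plus the 10% overhead: 80G + 8G = 88G.
overhead_g = int(executor_memory_g * overhead_ratio)
per_executor_g = executor_memory_g + overhead_g

# Total shown in Mesos for the 4 executors: 88 * 4 = 352 GB.
total_allocated_g = per_executor_g * nodes
```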
Additional points to note:
● On Urika-GX, the Mesos cluster runs in High Availability mode, with 3 Mesos Masters and Marathon
instances configured with Zookeeper.
● Unlike Marathon, Mesos does not offer any queue. Urika-GX scripts for flexing clusters and the mrun
command do not submit their jobs unless they know the resource requirement is satisfied. Once the flex-up
request is successful, YARN uses its own queue for all the Hadoop workloads.
● Spark accepts resource offers with fewer resources than it requested. For example, if a Spark job wants
1000 cores but only 800 cores are available, Mesos will offer those 800 cores to the Spark job. Spark then
chooses to accept or reject the offer. If Spark accepts the offer, the job is executed on the cluster. By
default, Spark accepts any resource offer, even if the resources in the offer are much less than what
the job requested. However, Spark users can control this behavior by specifying a minimum
acceptable resource ratio; for example, they could require that Spark only accept offers of at least 90% of the
requested cores. The configuration parameter that sets this ratio is
spark.scheduler.minRegisteredResourcesRatio. It can be set on the command line with
--conf spark.scheduler.minRegisteredResourcesRatio=N, where N is between 0.0 and 1.0.
● mrun and the flex scripts do not behave the way Spark behaves (as described in the previous bullet). mrun
accepts two command-line options:
○ --immediate=XXX (default 30 seconds)
○ --wait (default False)
When a user submits an mrun job, if more resources are needed than Mesos currently has available, the
command returns immediately, showing system usage and how many nodes are available versus how many
nodes are needed. If the user supplies the --wait option, mrun does not return, but instead continues to
poll Mesos until enough nodes are available. mrun will continue to poll Mesos for up to --immediate
seconds before timing out. Finally, once Mesos informs mrun that there are enough resources available, mrun
posts the job to Marathon.
When the requested resources are not available, flex scripts will display the current resource availability and
exit.
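The minimum-resource-ratio rule from the Spark bullet above reduces to a simple comparison. The helper below is a hypothetical sketch of the decision that spark.scheduler.minRegisteredResourcesRatio controls, not Spark's implementation:

```python
# Sketch of the minimum-acceptable-resource-ratio check. The parameter name
# spark.scheduler.minRegisteredResourcesRatio is Spark's; this helper is
# only an illustration of the rule it configures.

def accept_offer(requested_cores, offered_cores, min_ratio=0.0):
    """Return True if the offer meets the configured minimum ratio.

    min_ratio=0.0 mirrors the default described in the text:
    accept any offer, however small.
    """
    return offered_cores >= requested_cores * min_ratio

# Default behavior: 800 of 1000 requested cores is still accepted.
accept_offer(1000, 800)        # True
# With a 90% floor, the same offer is rejected.
accept_offer(1000, 800, 0.9)   # False
```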
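The mrun --wait/--immediate behavior described above can be sketched as a polling loop. This is an illustration only; poll_fn stands in for a query to Mesos, and none of this is the real mrun implementation:

```python
# Illustrative sketch of mrun's --wait / --immediate polling behavior.
# poll_fn is a hypothetical stand-in for asking Mesos how many nodes
# are currently free; not the real mrun code.

def wait_for_nodes(poll_fn, nodes_needed, wait=False, immediate=30):
    """Return True once enough nodes are free, False on timeout.

    Without --wait, a shortfall causes an immediate return (mrun reports
    availability versus need). With --wait, keep polling Mesos, here once
    per simulated second, for up to --immediate seconds before timing out.
    """
    attempts = immediate if wait else 1
    for _ in range(attempts):
        if poll_fn() >= nodes_needed:
            return True  # enough resources: post the job to Marathon
    return False  # report current availability vs. need and exit

# Free nodes grow from 2 to 6 as running jobs finish; with --wait the
# request succeeds once 5 nodes are free.
free = iter([2, 3, 4, 5, 6])
ok_with_wait = wait_for_nodes(lambda: next(free), nodes_needed=5,
                              wait=True, immediate=30)
```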
Resource Management
S3016
125