head node throttling under heavy load




From: Jan Staunton <jps_at_cs.york.ac.uk>
Date: Fri, 18 Feb 2011 16:27:58 +0000
Message-Id: <F33493F4-23E7-467E-A8A7-AB4FC169082E@cs.york.ac.uk>
Hello all,

I have implemented measures to prevent the head node from being overloaded by SGE.  Jobs will be suspended if the load reaches a certain level, so that head always remains accessible.  A full load of 8 can still be reached, but if the load reaches 10 then SGE will suspend jobs on head one by one until the load is acceptable again.
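
For illustration only, the suspend behaviour described above can be sketched in Python. This is not SGE's implementation or configuration; the names `Host`, `throttle` and `load_per_job` are hypothetical, each job is assumed to contribute a fixed amount of load, and "suitable" is assumed to mean "below the suspend level of 10".

```python
from dataclasses import dataclass, field

# From the message above: suspension on head starts when the load reaches 10.
SUSPEND_LOAD = 10.0


@dataclass
class Host:
    """Toy model of a machine; not an SGE data structure."""
    running: list                     # names of jobs currently running
    load_per_job: float = 1.0         # assumption: every job adds the same load
    suspended: list = field(default_factory=list)

    def load(self) -> float:
        # Assumption: load is simply the number of running jobs times a constant.
        return len(self.running) * self.load_per_job


def throttle(host: Host, suspend_load: float = SUSPEND_LOAD) -> list:
    """Suspend jobs one at a time while the load is at or above the threshold.

    Returns the list of suspended job names, most recently started first.
    """
    while host.running and host.load() >= suspend_load:
        # Suspend a single job, then re-check the load before suspending another.
        host.suspended.append(host.running.pop())
    return host.suspended
```

With twelve single-load jobs on the toy host, `throttle` suspends jobs one by one until nine remain running; with eight jobs (the full load mentioned above) nothing is suspended.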

During this process, I discovered that any given machine (apart from head) can be overloaded to 3 times the recommended amount before SGE will start suspending jobs on it.  We may wish to lower this threshold to stop errant jobs from bringing a machine down.  We should do this if errant jobs become a problem (admittedly most of the load problems are caused by me :))

In addition to this, I have determined that an instant-access fair share scheme cannot be implemented without significant effort.  If a 24 hour wait for access becomes a problem in the future, a shorter maximum wall clock limit could be implemented.  This doesn't seem to be a problem yet.

TL;DR, grid improved.

Cheers

Jan
Received on Fri 18 Feb 2011 - 16:27:59 GMT