Hello all,
I have implemented measures to prevent SGE from overloading the head node. Jobs on head will be suspended if the load reaches a certain level, so that head always remains accessible. A full load of 8 can still be reached, but if the load reaches 10 then SGE will suspend jobs on head one by one until the load is back at a suitable level.
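For anyone who wants a picture of the suspend-one-by-one behaviour, here is a rough Python sketch. It is only an illustration, not SGE's own logic or configuration: the function name, the idea that each job contributes a fixed amount of load (`load_per_job`), and reading "suitable" as "below the suspend level" are all my assumptions. Only the suspend level of 10 comes from the description above.

```python
def suspend_until_ok(running_jobs, load, suspend_at=10.0, load_per_job=1.0):
    """Suspend jobs one at a time while the load is at or above suspend_at.

    Hypothetical model: every job is assumed to add load_per_job to the
    machine load. Returns (still_running, suspended, final_load).
    """
    running = list(running_jobs)  # copy so the caller's list is untouched
    suspended = []
    while load >= suspend_at and running:
        job = running.pop()       # which job gets picked is arbitrary here
        suspended.append(job)
        load -= load_per_job      # assumed effect of suspending one job
    return running, suspended, load


if __name__ == "__main__":
    jobs = [f"job{i}" for i in range(1, 13)]
    running, suspended, load = suspend_until_ok(jobs, load=12.0)
    print(len(running), "running,", len(suspended), "suspended, load", load)
```

In this toy model a load of 8 leaves every job alone, while a load of 12 suspends jobs one after another until the load has dropped below 10.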
During this process, I discovered that any given machine (apart from head) can be overloaded to 3 times the recommended load before SGE starts suspending jobs on it. We may wish to lower this threshold to stop errant jobs from bringing down a machine, and we should do so if errant jobs become a problem (admittedly, most of the load problems are caused by me :))
In addition to this, I have determined that an instant-access fair-share scheme cannot be implemented without significant effort. If a 24-hour wait for access becomes a problem in the future, a shorter maximum wall-clock limit could be put in place. This doesn't seem to be a problem as yet.
TL;DR, grid improved.
Cheers
Jan
Received on Fri 18 Feb 2011 - 16:27:59 GMT