2016-01-21
|
20:38 |
<YuviPanda> |
failed over tools-checker to tools-checker-01 |
[tools] |
20:32 |
<YuviPanda> |
depooled exec nodes on 1007 |
[tools] |
20:32 |
<YuviPanda> |
repooled exec nodes on 1006 |
[tools] |
20:14 |
<YuviPanda> |
depooled all exec nodes in labvirt1006 |
[tools] |
20:11 |
<YuviPanda> |
repooled exec nodes on 1005 |
[tools] |
19:53 |
<YuviPanda> |
depooled exec nodes on labvirt1005 |
[tools] |
19:49 |
<YuviPanda> |
repooled exec nodes from labvirt1004 |
[tools] |
19:48 |
<YuviPanda> |
failed over proxy to tools-proxy-01 again |
[tools] |
19:31 |
<YuviPanda> |
depooled exec nodes from labvirt1004 |
[tools] |
19:29 |
<YuviPanda> |
repooled exec nodes from labvirt1003 |
[tools] |
19:13 |
<YuviPanda> |
depooled instances on labvirt1003 |
[tools] |
19:06 |
<YuviPanda> |
re-enabled queues on exec nodes that were on labvirt1002 |
[tools] |
19:02 |
<YuviPanda> |
failed over tools proxy to tools-proxy-02 |
[tools] |
18:46 |
<YuviPanda> |
drained and disabled queues on all nodes on labvirt1002 |
[tools] |
18:38 |
<YuviPanda> |
restarted all restartable jobs in instances on labvirt1001 and deleted all non-restartable ghost jobs, which were already dead |
[tools] |
2016-01-11
|
22:19 |
<valhallasw`cloud> |
reset maxujobs 0->128, job_load_adjustments none->np_load_avg=0.50, load_ad... -> 0:7:30 |
[tools] |
22:12 |
<YuviPanda> |
restarted gridengine master again |
[tools] |
22:07 |
<valhallasw`cloud> |
set job_load_adjustments from np_load_avg=0.50 to none and load_adjustment_decay_time to 0:0:0 |
[tools] |
22:05 |
<valhallasw`cloud> |
set maxujobs back to 0, but that doesn't help |
[tools] |
21:57 |
<valhallasw`cloud> |
reset to 7:30 |
[tools] |
21:57 |
<valhallasw`cloud> |
that cleared the measure, but jobs still not starting. Ugh! |
[tools] |
21:55 |
<valhallasw`cloud> |
set job_load_adjustments_decay_time = 0:0:0 |
[tools] |
21:45 |
<YuviPanda> |
restarted gridengine master |
[tools] |
21:43 |
<valhallasw`cloud> |
qstat -j <jobid> shows all queues overloaded; seems to have started just after a load test for the new maxujobs setting |
[tools] |
21:42 |
<valhallasw`cloud> |
resetting to 0:7:30, as it's not having the intended effect |
[tools] |
21:41 |
<valhallasw`cloud> |
currently 353 jobs in qw state |
[tools] |
21:40 |
<valhallasw`cloud> |
that's load_adjustment_decay_time |
[tools] |
21:40 |
<valhallasw`cloud> |
temporarily sudo qconf -msconf to 0:0:1 |
[tools] |
19:59 |
<YuviPanda> |
Set maxujobs (max concurrent jobs per user) on gridengine to 128 |
[tools] |
17:51 |
<YuviPanda> |
kill all queries running on labsdb1003 |
[tools] |
17:20 |
<YuviPanda> |
stopped webservice for quentinv57-tools |
[tools] |