2016-01-27
§
|
19:11 |
<YuviPanda> |
depooled tools-webgrid-1405 to prep for restart, lots of stuck processes |
[tools] |
18:29 |
<valhallasw`cloud> |
job 2551539 is ifttt, which is also running as 2700629. Killing 2551539. |
[tools] |
18:26 |
<valhallasw`cloud> |
messages repeatedly reports "01/27/2016 18:26:17|worker|tools-grid-master|E|execd@tools-webgrid-generic-1405.tools.eqiad.wmflabs reports running job (2551539.1/master) in queue "webgrid-generic@tools-webgrid-generic-1405.tools.eqiad.wmflabs" that was not supposed to be there - killing". SSH'ing there to investigate |
[tools] |
18:24 |
<valhallasw`cloud> |
'sleep' test job also seems to work without issues |
[tools] |
18:23 |
<valhallasw`cloud> |
no errors in log file, qstat works |
[tools] |
18:23 |
<chasemp> |
master sge restarted post dump and restart for jobs db |
[tools] |
18:22 |
<valhallasw`cloud> |
messages file reports 'Wed Jan 27 18:21:39 UTC 2016 db_load_sge_maint_pre_jobs_dump_01272016' |
[tools] |
18:20 |
<chasemp> |
master db_load -f /root/sge_maint_pre_jobs_dump_01272016 sge_job |
[tools] |
18:19 |
<valhallasw`cloud> |
dumped jobs database to /root/sge_maint_pre_jobs_dump_01272016, 4.6M |
[tools] |
18:17 |
<valhallasw`cloud> |
SGE Configuration successfully saved to /root/sge_maint_01272016 directory. |
[tools] |
18:14 |
<chasemp> |
grid master stopped |
[tools] |
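The 18:26 entry above quotes a line from the grid master's messages file. As a rough aid for scanning such lines, here is a minimal Python sketch that splits one into its pipe-separated fields and pulls out the job id. The field names (`component`, `host`, `level`) are assumptions inferred from that single quoted line, not a documented format, and `parse_messages_line` and `ghost_job_id` are hypothetical helpers.

```python
import re
from typing import NamedTuple, Optional


class MessagesEntry(NamedTuple):
    # Field names are guesses based on the one line quoted in this log.
    timestamp: str
    component: str
    host: str
    level: str
    text: str


def parse_messages_line(line: str) -> MessagesEntry:
    """Split a 'timestamp|component|host|level|text' line into fields."""
    # maxsplit=4 keeps any '|' inside the free-text part intact.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        raise ValueError(f"expected 5 pipe-separated fields, got {len(parts)}")
    return MessagesEntry(*parts)


def ghost_job_id(entry: MessagesEntry) -> Optional[int]:
    """Return the job id from a 'reports running job (ID.TASK/...)' message."""
    match = re.search(r"running job \((\d+)\.(\d+)/", entry.text)
    return int(match.group(1)) if match else None


# The line quoted in the 18:26 entry, copied verbatim.
SAMPLE = (
    '01/27/2016 18:26:17|worker|tools-grid-master|E|'
    'execd@tools-webgrid-generic-1405.tools.eqiad.wmflabs reports running job '
    '(2551539.1/master) in queue '
    '"webgrid-generic@tools-webgrid-generic-1405.tools.eqiad.wmflabs" '
    'that was not supposed to be there - killing'
)

entry = parse_messages_line(SAMPLE)
job_id = ghost_job_id(entry)
```

With the sample line, `entry.level` is `"E"` and `job_id` is `2551539`, the job that was subsequently killed by hand at 18:29.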
2016-01-21
§
|
22:24 |
<YuviPanda> |
deleted tools-redis-01 and -02 (are on 1001 and 1002 now) |
[tools] |
21:13 |
<YuviPanda> |
repooled exec nodes on labvirt1010 |
[tools] |
21:08 |
<YuviPanda> |
gridengine-master started, verified shadow hasn't started |
[tools] |
21:00 |
<YuviPanda> |
stop gridengine master |
[tools] |
20:51 |
<YuviPanda> |
correction: the previous message should read "repooled exec nodes on labvirt1007" |
[tools] |
20:51 |
<YuviPanda> |
repooled exec nodes on labvirt1006 |
[tools] |
20:39 |
<YuviPanda> |
failover tools-static to tools-web-static-01 |
[tools] |
20:38 |
<YuviPanda> |
failover tools-checker to tools-checker-01 |
[tools] |
20:32 |
<YuviPanda> |
depooled exec nodes on 1007 |
[tools] |
20:32 |
<YuviPanda> |
repooled exec nodes on 1006 |
[tools] |
20:14 |
<YuviPanda> |
depooled all exec nodes in labvirt1006 |
[tools] |
20:11 |
<YuviPanda> |
repooled exec nodes on 1005 |
[tools] |
19:53 |
<YuviPanda> |
depooled exec nodes on labvirt1005 |
[tools] |
19:49 |
<YuviPanda> |
repooled exec nodes from labvirt1004 |
[tools] |
19:48 |
<YuviPanda> |
failed over proxy to tools-proxy-01 again |
[tools] |
19:31 |
<YuviPanda> |
depooled exec nodes from labvirt1004 |
[tools] |
19:29 |
<YuviPanda> |
repooled exec nodes from labvirt1003 |
[tools] |
19:13 |
<YuviPanda> |
depooled instances on labvirt1003 |
[tools] |
19:06 |
<YuviPanda> |
re-enabled queues on exec nodes that were on labvirt1002 |
[tools] |
19:02 |
<YuviPanda> |
failed over tools proxy to tools-proxy-02 |
[tools] |
18:46 |
<YuviPanda> |
drained and disabled queues on all nodes on labvirt1002 |
[tools] |
18:38 |
<YuviPanda> |
restarted all restartable jobs in instances on labvirt1001 and deleted all non-restartable ghost jobs. these were already dead |
[tools] |
2016-01-11
§
|
22:19 |
<valhallasw`cloud> |
reset maxujobs 0->128, job_load_adjustments none->np_load_avg=0.50, load_ad... -> 0:7:30 |
[tools] |
22:12 |
<YuviPanda> |
restarted gridengine master again |
[tools] |
22:07 |
<valhallasw`cloud> |
set job_load_adjustments from np_load_avg=0.50 to none and load_adjustment_decay_time to 0:0:0 |
[tools] |
22:05 |
<valhallasw`cloud> |
set maxujobs back to 0, but doesn't help |
[tools] |
21:57 |
<valhallasw`cloud> |
reset load_adjustment_decay_time to 0:7:30 |
[tools] |
21:57 |
<valhallasw`cloud> |
that cleared the measure, but jobs still not starting. Ugh! |
[tools] |
21:55 |
<valhallasw`cloud> |
set load_adjustment_decay_time = 0:0:0 |
[tools] |
21:45 |
<YuviPanda> |
restarted gridengine master |
[tools] |
21:43 |
<valhallasw`cloud> |
qstat -j <jobid> shows all queues overloaded; seems to have started just after a load test for the new maxujobs setting |
[tools] |