2016-01-27
18:26 <valhallasw`cloud> messages repeatedly reports "01/27/2016 18:26:17|worker|tools-grid-master|E|execd@tools-webgrid-generic-1405.tools.eqiad.wmflabs reports running job (2551539.1/master) in queue "webgrid-generic@tools-webgrid-generic-1405.tools.eqiad.wmflabs" that was not supposed to be there - killing". SSH'ing there to investigate [tools]
18:24 <valhallasw`cloud> 'sleep' test job also seems to work without issues [tools]
18:23 <valhallasw`cloud> no errors in log file, qstat works [tools]
18:23 <chasemp> master sge restarted post dump and restart for jobs db [tools]
18:22 <valhallasw`cloud> messages file reports 'Wed Jan 27 18:21:39 UTC 2016 db_load_sge_maint_pre_jobs_dump_01272016' [tools]
18:20 <chasemp> master db_load -f /root/sge_maint_pre_jobs_dump_01272016 sge_job [tools]
18:19 <valhallasw`cloud> dumped jobs database to /root/sge_maint_pre_jobs_dump_01272016, 4.6M [tools]
18:17 <valhallasw`cloud> SGE Configuration successfully saved to /root/sge_maint_01272016 directory. [tools]
18:14 <chasemp> grid master stopped [tools]
2016-01-26
21:28 <YuviPanda> qstat -u '*' | grep E | awk '{print $1}' | xargs -L1 qmod -cj [tools]
21:16 <chasemp> reboot tools-exec-1217.tools.eqiad.wmflabs [tools]
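
The 21:28 pipeline selects jobs with `grep E`, which matches any qstat line containing a capital E (for example in a job name), not only jobs in an error state. A minimal sketch of a stricter filter, assuming the default `qstat` column layout in which the state is the fifth whitespace-separated field; the function name and the sample rows are hypothetical and not taken from the log:

```python
from typing import List


def error_job_ids(qstat_output: str) -> List[str]:
    """Return IDs of jobs whose state column contains 'E'.

    Assumes the default qstat layout: job-ID, prior, name, user, state, ...
    """
    job_ids = []
    for line in qstat_output.splitlines():
        fields = line.split()
        # Skip the header, the dashed separator, and blank lines:
        # only data rows start with a numeric job ID.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        if "E" in fields[4]:
            job_ids.append(fields[0])
    return job_ids


# Hypothetical sample output, for illustration only.
SAMPLE = """\
job-ID  prior   name       user         state submit/start at     queue   slots
--------------------------------------------------------------------------------
1234567 0.30000 brokenjob  tools.demo   Eqw   01/26/2016 21:00:00             1
1234568 0.30000 EXAMPLE    tools.demo   r     01/26/2016 21:01:00 task@x      1
"""

if __name__ == "__main__":
    # Only the first job is in an error state; the second merely has
    # a capital E in its name and would also be matched by `grep E`.
    print(error_job_ids(SAMPLE))
```

The resulting IDs could then be passed to `qmod -cj` one at a time, as the original pipeline does with `xargs -L1`.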
2016-01-25
20:30 <YuviPanda> switched over cron host to tools-cron-01, manually copied all old cron files from tools-submit to tools-cron-01 [tools]
19:06 <chasemp> kill python merge/merge-unique.py on tools-exec-1213 as it seemed to be overwhelming nfs [tools]
2016-01-21
22:24 <YuviPanda> deleted tools-redis-01 and -02 (are on 1001 and 1002 now) [tools]
21:13 <YuviPanda> repooled exec nodes on labvirt1010 [tools]
21:08 <YuviPanda> gridengine-master started, verified shadow hasn't started [tools]
21:00 <YuviPanda> stop gridengine master [tools]
20:51 <YuviPanda> correction: the last message should read "repooled exec nodes on labvirt1007" [tools]
20:51 <YuviPanda> repooled exec nodes on labvirt1006 [tools]
20:39 <YuviPanda> failover tools-static to tools-web-static-01 [tools]
20:38 <YuviPanda> failover tools-checker to tools-checker-01 [tools]
20:32 <YuviPanda> depooled exec nodes on 1007 [tools]
20:32 <YuviPanda> repooled exec nodes on 1006 [tools]
20:14 <YuviPanda> depooled all exec nodes in labvirt1006 [tools]
20:11 <YuviPanda> repooled exec nodes on 1005 [tools]
19:53 <YuviPanda> depooled exec nodes on labvirt1005 [tools]
19:49 <YuviPanda> repooled exec nodes from labvirt1004 [tools]
19:48 <YuviPanda> failed over proxy to tools-proxy-01 again [tools]
19:31 <YuviPanda> depooled exec nodes from labvirt1004 [tools]
19:29 <YuviPanda> repooled exec nodes from labvirt1003 [tools]
19:13 <YuviPanda> depooled instances on labvirt1003 [tools]
19:06 <YuviPanda> re-enabled queues on exec nodes that were on labvirt1002 [tools]
19:02 <YuviPanda> failed over tools proxy to tools-proxy-02 [tools]
18:46 <YuviPanda> drained and disabled queues on all nodes on labvirt1002 [tools]
18:38 <YuviPanda> restarted all restartable jobs in instances on labvirt1001 and deleted all non-restartable ghost jobs. these were already dead [tools]
2016-01-20
14:50 <chasemp> reboot tools-webgrid-lighttpd-1209 as frozen [tools]
2016-01-15
18:34 <chasemp> tools-mail-01 is locked up; I am rebooting it [tools]
2016-01-14
01:56 <YuviPanda> rm service.manifest for wikiviewstats to prevent it from constantly trying to start up and fail webservice [tools]
01:32 <YuviPanda> stopped erwin85's tools since it was causing replag on labsdb1002 [tools]
2016-01-11
22:19 <valhallasw`cloud> reset maxujobs 0->128, job_load_adjustments none->np_load_avg=0.50, load_ad... -> 0:7:30 [tools]
22:12 <YuviPanda> restarted gridengine master again [tools]
22:07 <valhallasw`cloud> set job_load_adjustments from np_load_avg=0.50 to none and load_adjustment_decay_time to 0:0:0 [tools]
22:05 <valhallasw`cloud> set maxujobs back to 0, but doesn't help [tools]
21:57 <valhallasw`cloud> reset to 7:30 [tools]
21:57 <valhallasw`cloud> that cleared the measure, but jobs still not starting. Ugh! [tools]
21:55 <valhallasw`cloud> set load_adjustment_decay_time = 0:0:0 [tools]
21:45 <YuviPanda> restarted gridengine master [tools]
21:43 <valhallasw`cloud> qstat -j <jobid> shows all queues overloaded; seems to have started just after a load test for the new maxujobs setting [tools]
21:42 <valhallasw`cloud> resetting to 0:7:30, as it's not having the intended effect [tools]
21:41 <valhallasw`cloud> currently 353 jobs in qw state [tools]
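
A figure like the 353 jobs in qw state in the 21:41 entry can be obtained by tallying the state column of `qstat` output. A minimal sketch, assuming the default column layout in which the state is the fifth whitespace-separated field; the function name and the sample rows are hypothetical and not taken from the log:

```python
from collections import Counter


def count_job_states(qstat_output: str) -> Counter:
    """Tally jobs per state from qstat output (state assumed in column 5)."""
    states = Counter()
    for line in qstat_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric job ID; header and separator do not.
        if len(fields) >= 5 and fields[0].isdigit():
            states[fields[4]] += 1
    return states


# Hypothetical sample output, for illustration only.
SAMPLE = """\
job-ID  prior   name     user        state submit/start at     queue   slots
-----------------------------------------------------------------------------
2000001 0.25000 job-a    tools.demo  qw    01/11/2016 21:30:00             1
2000002 0.25000 job-b    tools.demo  qw    01/11/2016 21:31:00             1
2000003 0.30000 job-c    tools.demo  r     01/11/2016 21:00:00 task@x      1
"""

if __name__ == "__main__":
    counts = count_job_states(SAMPLE)
    print(counts["qw"], "jobs in qw state")
```

Tracking this count before and after each scheduler change is one way to check whether a setting had the intended effect.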