2015-08-31
§
|
20:23 |
<valhallasw`cloud> |
doh. accidentally used the wrong file, causing restarts for another few uwsgi hosts. Three more jobs dead *sigh* |
[tools] |
20:21 |
<valhallasw`cloud> |
now doing more rescheduling, with 5 sec intervals, on a sorted list to spread load between queues |
[tools] |
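The rate-limited rescheduling in this entry can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the commands that were actually run: the helper name `throttled_each` is hypothetical, the log does not say which command performed the rescheduling, and sorting numerically by job ID is only a guess at what "sorted list" meant.

```shell
# Run a command once per job ID read from stdin, pausing between calls.
# Hypothetical helper: the name and arguments are illustrative only.
# usage: throttled_each INTERVAL_SECONDS COMMAND [ARGS...] < job_ids
throttled_each() {
    interval=$1
    shift
    # sort -n -u: numeric order with duplicates dropped (sort key is an assumption)
    sort -n -u | while read -r job_id; do
        [ -n "$job_id" ] || continue        # skip blank lines
        "$@" "$job_id" </dev/null           # keep the command from eating the ID list
        sleep "$interval"
    done
}
```

Called as `throttled_each 5 echo would-reschedule < job_ids.txt`, it only prints each ID, which makes a dry run easy. Grid Engine's `qmod -rj` is one command that reschedules a job and could take the place of `echo`, though the log does not confirm that it was the one used.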
19:36 |
<valhallasw`cloud> |
last restarted job is 1423661, rest of them are still in /home/valhallaw/webgrid_jobs |
[tools] |
19:35 |
<valhallasw`cloud> |
one per second still seems to make SGE unhappy; there's a whole set of jobs dying, mostly uwsgi? |
[tools] |
19:31 |
<valhallasw`cloud> |
https://phabricator.wikimedia.org/T110861 : rescheduling 521 webgrid jobs, at a rate of one per second, while watching the accounting log for issues |
[tools] |
07:31 |
<valhallasw`cloud> |
removed paniclog on tools-submit; probably related to the NFS outage yesterday (although I'm not sure why that would give OOMs) |
[tools] |
2015-08-18
§
|
13:57 |
<valhallasw`cloud> |
same issue seems to happen with the other hosts: tools-exec-1401.tools.eqiad.wmflabs vs tools-exec-1401.eqiad.wmflabs and tools-exec-catscan.tools.eqiad.wmflabs vs tools-exec-catscan.eqiad.wmflabs. |
[tools] |
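Mismatches like the ones listed in this entry could be spotted mechanically by grouping host names on their short name. The following is a minimal sketch: the helper name `find_fqdn_mismatches` is hypothetical, and where the list of names comes from is left to the caller.

```shell
# Print short host names that occur under more than one fully qualified name.
# Reads one host name per line on stdin.
# Hypothetical helper; the source of the names is up to the caller.
find_fqdn_mismatches() {
    sort -u | awk -F. '
        { count[$1]++; names[$1] = names[$1] " " $0 }
        END { for (h in count) if (count[h] > 1) print h ":" names[h] }
    ' | sort
}
```

Feeding it `tools-exec-1401.tools.eqiad.wmflabs`, `tools-exec-1401.eqiad.wmflabs` and `tools-exec-1403.eqiad.wmflabs` reports only `tools-exec-1401`, followed by both of its name forms. Input could come from, for example, `qconf -sel` combined with the hosts of a host group; that pairing is an assumption, not something the log describes.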
13:55 |
<valhallasw`cloud> |
no, wait, that's ''tools-webgrid-lighttpd-1411.eqiad.wmflabs'', not the actual host ''tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs''. We should fix that dns mess as well. |
[tools] |
13:54 |
<valhallasw`cloud> |
tried to restart gridengine-exec on tools-exec-1401, no effect. tools-webgrid-lighttpd-1411 also just went into 'au' state. |
[tools] |
13:47 |
<valhallasw`cloud> |
that brought tools-exec-1403, tools-exec-1406 and tools-webgrid-generic-1402 back up, tools-exec-1401 and tools-exec-catscan are still in 'au' state |
[tools] |
13:46 |
<valhallasw`cloud> |
starting gridengine-exec on hosts with queues in 'au' (=alarm, unknown) state using <code>for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do echo $i; ssh $i sudo service gridengine-exec start; done</code> |
[tools] |
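The two `cut` steps at the end of that one-liner turn a queue instance name into a short host name for `ssh`. A minimal sketch of just that step, with a hypothetical helper name `short_host` and an illustrative sample name rather than real `qstat` output:

```shell
# Reduce a Grid Engine queue instance name ("queue@host.domain") to the
# short host name, using the same two cut steps as the one-liner.
# Hypothetical helper for illustration only.
short_host() {
    printf '%s\n' "$1" | cut -d'@' -f2 | cut -d. -f1
}

# The sample queue instance below is illustrative, not taken from real output.
short_host 'webgrid-lighttpd@tools-webgrid-lighttpd-1411.eqiad.wmflabs'
```

Because everything after the first dot is dropped, a host registered as `.eqiad.wmflabs` and one registered as `.tools.eqiad.wmflabs` reduce to the same short name.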
08:37 |
<valhallasw`cloud> |
sudo service gridengine-exec start on tools-webgrid-lighttpd-1404.eqiad.wmflabs, tools-webgrid-lighttpd-1406.eqiad.wmflabs, tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs |
[tools] |
08:33 |
<valhallasw`cloud> |
tools-webgrid-lighttpd-1403.eqiad.wmflabs, tools-webgrid-lighttpd-1404.eqiad.wmflabs and tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs are all broken (queue dropped because it is temporarily not available) |
[tools] |
08:30 |
<valhallasw`cloud> |
hostname mismatch: host is called tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs in config, but it was named tools-webgrid-lighttpd-1411.eqiad.wmflabs in the hostgroup config |
[tools] |
08:21 |
<valhallasw`cloud> |
still sudo qmod -e "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" -> invalid queue "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" |
[tools] |
08:20 |
<valhallasw`cloud> |
sudo qconf -mhgrp "@webgrid", added tools-webgrid-lighttpd-1411.eqiad.wmflabs |
[tools] |
08:14 |
<valhallasw`cloud> |
and the hostgroup @webgrid doesn't even exist? (╯°□°)╯︵ ┻━┻ |
[tools] |
08:10 |
<valhallasw`cloud> |
/var/lib/gridengine/etc/queues/webgrid-lighttpd does not seem to be the correct configuration as the current config refers to '@webgrid' as host list. |
[tools] |
08:07 |
<valhallasw`cloud> |
sudo qconf -Ae /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs -> root@tools-bastion-01.eqiad.wmflabs added "tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" to exechost list |
[tools] |
08:06 |
<valhallasw`cloud> |
ok, success. /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs now exists. Do I still have to add it manually to the grid? I suppose so. |
[tools] |
08:04 |
<valhallasw`cloud> |
installing packages from /data/project/.system/deb-trusty seems to fail. sudo apt-get update helps. |
[tools] |
08:00 |
<valhallasw`cloud> |
running puppet agent -tv again |
[tools] |
07:55 |
<valhallasw`cloud> |
argh. Disabling toollabs::node::web::generic again and enabling toollabs::node::web::lighttpd |
[tools] |
07:54 |
<valhallasw`cloud> |
various issues such as Error: /Stage[main]/Gridengine::Submit_host/File[/var/lib/gridengine/default/common/accounting]/ensure: change from absent to link failed: Could not set 'link' on ensure: No such file or directory - /var/lib/gridengine/default/common at 17:/etc/puppet/modules/gridengine/manifests/submit_host.pp; probably an ordering issue in |
[tools] |
07:53 |
<valhallasw`cloud> |
Setting up adminbot (1.7.8) ... chmod: cannot access '/usr/lib/adminbot/README': No such file or directory --- ran sudo touch /usr/lib/adminbot/README |
[tools] |
07:37 |
<valhallasw`cloud> |
applying role::labs::tools::compute and toollabs::node::web::generic to tools-webgrid-lighttpd-1411 |
[tools] |