2015-08-19 §
10:45 <valhallasw`cloud> ran `for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do echo $i; ssh $i sudo service gridengine-exec start; done`; this fixed queues on tools-exec-1404 tools-exec-1409 tools-exec-1410 tools-webgrid-lighttpd-1406 [tools]
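Note on the loop above: the `grep "<state>au" -B 6` step depends on how many lines separate `<name>` from `<state>` in the XML. A minimal Python sketch of the same selection follows. It assumes each queue instance is an element with direct `name` and `state` children, as the grep pattern suggests. The sample XML and the function name `hosts_in_alarm` are hypothetical; only the hostnames come from the log.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like `qstat -f -xml` output. The element layout
# (Queue-List with name/state children) is an assumption, not verified here.
SAMPLE = """<job_info>
  <queue_info>
    <Queue-List>
      <name>task@tools-exec-1404.eqiad.wmflabs</name>
      <slots_total>50</slots_total>
      <state>au</state>
    </Queue-List>
    <Queue-List>
      <name>task@tools-exec-1405.eqiad.wmflabs</name>
      <slots_total>50</slots_total>
    </Queue-List>
    <Queue-List>
      <name>webgrid-lighttpd@tools-webgrid-lighttpd-1406.eqiad.wmflabs</name>
      <slots_total>256</slots_total>
      <state>au</state>
    </Queue-List>
  </queue_info>
</job_info>"""


def hosts_in_alarm(xml_text, wanted="au"):
    """Return sorted short hostnames of queues whose state starts with `wanted`."""
    hosts = set()
    for elem in ET.fromstring(xml_text).iter():
        # Only direct children are inspected, so the parent tag name is irrelevant.
        name = elem.findtext("name")
        state = elem.findtext("state")
        if name is None or state is None or "@" not in name:
            continue
        if state.startswith(wanted):  # mirrors grep "<state>au"
            fqdn = name.split("@", 1)[1]         # like cut -d'@' -f2
            hosts.add(fqdn.split(".", 1)[0])     # like cut -d. -f1
    return sorted(hosts)


if __name__ == "__main__":
    for host in hosts_in_alarm(SAMPLE):
        print(host)
```

As in the shell loop, the domain part is dropped, so the result is only as reliable as short-name resolution on the host that runs the ssh step.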
2015-08-18 §
13:57 <valhallasw`cloud> same issue seems to happen with the other hosts: tools-exec-1401.tools.eqiad.wmflabs vs tools-exec-1401.eqiad.wmflabs and tools-exec-catscan.tools.eqiad.wmflabs vs tools-exec-catscan.eqiad.wmflabs. [tools]
13:55 <valhallasw`cloud> no, wait, that's ''tools-webgrid-lighttpd-1411.eqiad.wmflabs'', not the actual host ''tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs''. We should fix that dns mess as well. [tools]
13:54 <valhallasw`cloud> tried to restart gridengine-exec on tools-exec-1401, no effect. tools-webgrid-lighttpd-1411 also just went into 'au' state. [tools]
13:47 <valhallasw`cloud> that brought tools-exec-1403, tools-exec-1406 and tools-webgrid-generic-1402 back up, tools-exec-1401 and tools-exec-catscan are still in 'au' state [tools]
13:46 <valhallasw`cloud> starting gridengine-exec on hosts with queues in 'au' (=alarm, unknown) state using <code>for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do echo $i; ssh $i sudo service gridengine-exec start; done</code> [tools]
08:37 <valhallasw`cloud> sudo service gridengine-exec start on tools-webgrid-lighttpd-1404.eqiad.wmflabs tools-webgrid-lighttpd-1406.eqiad.wmflabs tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs [tools]
08:33 <valhallasw`cloud> tools-webgrid-lighttpd-1403.eqiad.wmflabs, tools-webgrid-lighttpd-1404.eqiad.wmflabs and tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs are all broken (queue dropped because it is temporarily not available) [tools]
08:30 <valhallasw`cloud> hostname mismatch: host is called tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs in config, but it was named tools-webgrid-lighttpd-1411.eqiad.wmflabs in the hostgroup config [tools]
08:21 <valhallasw`cloud> still sudo qmod -e "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" -> invalid queue "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" [tools]
08:20 <valhallasw`cloud> sudo qconf -mhgrp "@webgrid", added tools-webgrid-lighttpd-1411.eqiad.wmflabs [tools]
08:14 <valhallasw`cloud> and the hostgroup @webgrid doesn't even exist? (╯°□°)╯︵ ┻━┻ [tools]
08:10 <valhallasw`cloud> /var/lib/gridengine/etc/queues/webgrid-lighttpd does not seem to be the correct configuration as the current config refers to '@webgrid' as host list. [tools]
08:07 <valhallasw`cloud> sudo qconf -Ae /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs -> root@tools-bastion-01.eqiad.wmflabs added "tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" to exechost list [tools]
08:06 <valhallasw`cloud> ok, success. /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs now exists. Do I still have to add it manually to the grid? I suppose so. [tools]
08:04 <valhallasw`cloud> installing packages from /data/project/.system/deb-trusty seems to fail. sudo apt-get update helps. [tools]
08:00 <valhallasw`cloud> running puppet agent -tv again [tools]
07:55 <valhallasw`cloud> argh. Disabling toollabs::node::web::generic again and enabling toollabs::node::web::lighttpd [tools]
07:54 <valhallasw`cloud> various issues such as Error: /Stage[main]/Gridengine::Submit_host/File[/var/lib/gridengine/default/common/accounting]/ensure: change from absent to link failed: Could not set 'link' on ensure: No such file or directory - /var/lib/gridengine/default/common at 17:/etc/puppet/modules/gridengine/manifests/submit_host.pp; probably an ordering issue in [tools]
07:53 <valhallasw`cloud> Setting up adminbot (1.7.8) ... chmod: cannot access '/usr/lib/adminbot/README': No such file or directory --- ran sudo touch /usr/lib/adminbot/README [tools]
07:37 <valhallasw`cloud> applying role::labs::tools::compute and toollabs::node::web::generic to tools-webgrid-lighttpd-1411 [tools]
07:31 <valhallasw`cloud> reading puppet suggests I should qconf -ah /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs but that file is missing? [tools]
07:26 <valhallasw`cloud> andrewbogott built tools-webgrid-lighttpd-1411 yesterday but it's not actually added as exec host. Trying to figure out how to do that... [tools]
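Note on the hostname mismatch in this day's entries: the same machine appears under two FQDNs (`….tools.eqiad.wmflabs` and `….eqiad.wmflabs`). A minimal Python sketch of a check for this follows. It assumes the two name lists have already been collected, for example from `qconf -sel` and `qconf -shgrp @webgrid`. The function name and the sample lists are hypothetical; only the hostnames come from the log.

```python
def find_fqdn_mismatches(exec_hosts, hostgroup_hosts):
    """Map short hostname -> (exec host FQDN, hostgroup FQDN) where the two differ."""
    # Index the registered exec hosts by their short name (text before the first dot).
    by_short = {fqdn.split(".", 1)[0]: fqdn for fqdn in exec_hosts}
    mismatches = {}
    for fqdn in hostgroup_hosts:
        short = fqdn.split(".", 1)[0]
        registered = by_short.get(short)
        # Same short name but a different full name is the case seen in the log.
        if registered is not None and registered != fqdn:
            mismatches[short] = (registered, fqdn)
    return mismatches


# Illustrative input only; not a dump of the real grid configuration.
EXEC_HOSTS = [
    "tools-webgrid-lighttpd-1404.eqiad.wmflabs",
    "tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs",
]
HOSTGROUP_HOSTS = [
    "tools-webgrid-lighttpd-1404.eqiad.wmflabs",
    "tools-webgrid-lighttpd-1411.eqiad.wmflabs",
]

if __name__ == "__main__":
    for short, (exec_name, group_name) in find_fqdn_mismatches(
        EXEC_HOSTS, HOSTGROUP_HOSTS
    ).items():
        print(f"{short}: exec host {exec_name} vs hostgroup {group_name}")
```

The check only flags hosts present in both lists. A host missing from one list entirely would need a separate comparison.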
2015-08-17 §
16:17 <andrewbogott> disable queues for tools-exec-1205 tools-exec-1207 tools-exec-1208 tools-exec-140 tools-exec-1404 tools-exec-1409 tools-exec-1410 tools-exec-catscan tools-web-static-01 tools-webgrid-lighttpd-1201 tools-webgrid-lighttpd-1205 tools-webgrid-lighttpd-1206 tools-webgrid-lighttpd-1406 tools-webproxy-02 [tools]
15:33 <andrewbogott> re-enabling the queue on tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01 [tools]
14:50 <andrewbogott> killing remaining jobs on tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01 [tools]
2015-08-15 §
05:14 <andrewbogott> resumed tools-exec-gift, seems not to have been the culprit [tools]
05:08 <andrewbogott> suspending tools-exec-gift, just for a moment... [tools]
2015-08-14 §
17:21 <andrewbogott> disabling grid jobqueue for tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01 in anticipation of monday reboot of labvirt1004 [tools]
15:20 <andrewbogott> Adding back to the grid engine queue: tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407 [tools]
14:43 <andrewbogott> killing remaining jobs on tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407 [tools]
2015-08-13 §
18:51 <valhallasw`cloud> which was resolved by scfc earlier [tools]
18:50 <valhallasw`cloud> tools-exec-1201/Puppet staleness was critical due to an agent lock (Ignoring stale puppet agent lock for pid <br> Run of Puppet configuration client already in progress; skipping (/var/lib/puppet/state/agent_catalog_run.lock exists)) [tools]
16:44 <andrewbogott> disabling job queue for tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407 [tools]
14:48 <andrewbogott> and tools-webgrid-lighttpd-1408 [tools]
14:48 <andrewbogott> rescheduling (and in some cases killing) jobs on tools-exec-1203 tools-exec-1210 tools-exec-1214 tools-exec-1402 tools-exec-1405 tools-exec-gift tools-services-01 tools-web-static-02 tools-webgrid-generic-1403 tools-webgrid-lighttpd-1204 tools-webgrid-lighttpd-1209 tools-webgrid-lighttpd-1401 tools-webgrid-lighttpd-1405 [tools]
2015-08-12 §
16:05 <andrewbogott> depooling tools-exec-1203 tools-exec-1210 tools-exec-1214 tools-exec-1402 tools-exec-1405 tools-exec-gift tools-services-01 tools-web-static-02 tools-webgrid-generic-1403 tools-webgrid-lighttpd-1204 tools-webgrid-lighttpd-1209 tools-webgrid-lighttpd-1401 tools-webgrid-lighttpd-1405 tools-webgrid-lighttpd-1408 [tools]
14:41 <andrewbogott> forcing reschedule of jobs on tools-exec-1201 tools-exec-1202 tools-exec-1204 tools-exec-1206 tools-exec-1209 tools-exec-1213 tools-exec-1217 tools-exec-1218 tools-exec-1408 tools-webgrid-generic-1404 tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1410 [tools]
2015-08-11 §
18:17 <andrewbogott> depooling tools-exec-1201 tools-exec-1202 tools-exec-1204 tools-exec-1206 tools-exec-1209 tools-exec-1213 tools-exec-1217 tools-exec-1218 tools-exec-1408 tools-webgrid-generic-1404 tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1410 in anticipation of labvirt1001 reboot tomorrow [tools]
2015-08-03 §
19:13 <andrewbogott> deleted tools-static-01 [tools]
2015-08-01 §
18:09 <andrewbogott> depooling/rebooting tools-webgrid-lighttpd-1407 because it’s unable to fork [tools]
2015-07-30 §
15:00 <andrewbogott> rebooting tools-bastion-01 aka tools-login [tools]
2015-07-29 §
23:43 <andrewbogott> draining, rebooting tools-webgrid-lighttpd-1408 [tools]
20:11 <andrewbogott> rebooting tools-webgrid-lighttpd-1404 [tools]
2015-07-28 §
17:49 <valhallasw`cloud> Jobs were drained at 19:43, but this did not decrease the rate, which is still at ~50k/minute. Now running "sysctl -w sunrpc.nfs_debug=1023 && sleep 2 && sysctl -w sunrpc.nfs_debug=0" which hopefully doesn't kill the server [tools]
17:43 <valhallasw`cloud> rescheduled all webservice jobs on tools-webgrid-lighttpd-1401.eqiad.wmflabs, server is now empty [tools]
17:16 <valhallasw`cloud> disabled queue "webgrid-lighttpd@tools-webgrid-lighttpd-1401.eqiad.wmflabs" [tools]
02:06 <YuviPanda> removed pacct files from tools-bastion-01 [tools]
2015-07-27 §
21:27 <valhallasw`cloud> turned off process accounting on tools-login while we try to find the root cause of [[phab:T107052]]: <pre>accton off</pre> [tools]