2020-11-17 §
08:33 <elukey@cumin1001> START - Cookbook sre.hosts.decommission [production]
08:31 <elukey@cumin1001> END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [production]
08:24 <elukey@cumin1001> START - Cookbook sre.hosts.decommission [production]
2020-11-16 §
17:48 <elukey> enable and run puppet on kafka-main2003 (it will start kafka services) - T267865 [production]
16:46 <elukey@cumin1001> END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [production]
16:40 <elukey@cumin1001> START - Cookbook sre.hosts.decommission [production]
16:01 <elukey@cumin1001> END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [production]
15:50 <elukey@cumin1001> START - Cookbook sre.hosts.decommission [production]
00:19 <elukey> run 'systemctl mask kafka' and 'systemctl mask kafka-mirror-main-eqiad_to_main-codfw@0' on kafka-main2003 (for the brief moment when it was up) to avoid purged issues - T267865 [production]
00:09 <elukey> sudo cumin 'cp2028* or cp2036* or cp2039* or cp4022* or cp4025* or cp4028* or cp4031*' 'systemctl restart purged' -b 3 - T267865 [production]
2020-11-15 §
10:00 <elukey> cumin 'cp2042* or cp2036* or cp2039*' 'systemctl restart purged' -b 1 [production]
09:57 <elukey> restart purged on cp4028 (consumer stuck due to kafka-main2003 down) [production]
09:55 <elukey> restart purged on cp4025 (consumer stuck due to kafka-main2003 down) [production]
09:53 <elukey> restart purged on cp4031 (consumer stuck due to kafka-main2003 down) [production]
09:50 <elukey> restart purged on cp4022 (consumer stuck due to kafka-main2003 down) [production]
09:42 <elukey> restart purged on cp2028 (kafka-main2003 is down and there are connect timeout errors) [production]
08:27 <elukey> truncate -s 10g /var/lib/hadoop/data/n/yarn/logs/application_1601916545561_173219/container_e25_1601916545561_173219_01_000177/stderr on an-worker1100 [production]
08:24 <elukey> sudo truncate -s 10g /var/lib/hadoop/data/c/yarn/logs/application_1601916545561_173219/container_e25_1601916545561_173219_01_000019/stderr on an-worker1098 [production]
2020-11-10 §
07:40 <elukey> import hue_4.8.0-2 to buster-wikimedia [production]
2020-11-09 §
07:17 <elukey> restart gerrit on gerrit2001 (OOM registered two days ago, but systemctl reports uptime since a month ago, probably in a weird state) [production]
2020-11-06 §
14:05 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [production]
14:03 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [production]
14:01 <elukey@cumin1001> START - Cookbook sre.hosts.downtime [production]
14:01 <elukey@cumin1001> START - Cookbook sre.hosts.downtime [production]
11:52 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [production]
11:50 <elukey@cumin1001> START - Cookbook sre.hosts.downtime [production]
11:49 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [production]
11:47 <elukey@cumin1001> START - Cookbook sre.hosts.downtime [production]
11:24 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [production]
11:22 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [production]
11:20 <elukey@cumin1001> START - Cookbook sre.hosts.downtime [production]
11:19 <elukey@cumin1001> START - Cookbook sre.hosts.downtime [production]
10:54 <elukey@cumin1001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [production]
10:52 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [production]
10:49 <elukey@cumin1001> START - Cookbook sre.hosts.downtime [production]
10:49 <elukey@cumin1001> START - Cookbook sre.hosts.downtime [production]
09:15 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [production]
09:13 <elukey@cumin1001> START - Cookbook sre.hosts.downtime [production]
09:12 <elukey@cumin1001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [production]
09:12 <elukey@cumin1001> START - Cookbook sre.hosts.downtime [production]
2020-11-05 §
15:53 <elukey@cumin1001> END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) [production]
15:50 <elukey@cumin1001> START - Cookbook sre.aqs.roll-restart [production]
14:55 <elukey> shutdown kafka-jumbo1001 to swap NICs (1g -> 10g) [production]
06:34 <elukey> truncate application_1601916545561_129457's taskmanager.log (~600G) on an-worker1113 due to partition 'e' full [production]
2020-11-04 §
18:20 <elukey> restart memcached on mc1036 to pick up new settings (see https://gerrit.wikimedia.org/r/639099) [production]
14:17 <elukey> upload 4.8.0-1+deb10u1 to buster-wikimedia [production]
07:09 <elukey> manual cleanup of mcelog and its wmf-auto-restart (failing) on mw1381 (kernel 4.19, doesn't support mcelog) [production]
06:52 <elukey> force start of rasdaemon.service on dumpsdata1002 (its auto-restart unit was failing for it) [production]
06:47 <elukey> set an-presto1004's netbox status as "active" (was: failed) after hw maintenance - T253438 [production]
06:44 <elukey> force restart of uwsgi-ores on ores1005 - daemon down after reload, max client reached error messages in the logs [production]