7001-7050 of 10000 results (23ms)
2021-01-11 §
09:02 <elukey> force puppet on logstash1007 after ES OOM [production]
2021-01-08 §
10:01 <elukey> restart varnishkafka-webrequest on cp5001 - timeouts to kafka-jumbo1001, librdkafka seems not recovering very well [production]
2021-01-05 §
18:13 <elukey> run homer on cr1/cr2-eqiad to update the analytics-in4 filter (https://gerrit.wikimedia.org/r/c/operations/homer/public/+/654469) [production]
12:48 <elukey> add PXE d-i rescue bootable image config for jessie/stretch/buster to tftp [production]
08:32 <elukey> reboot sretest1001 to test some new PXE rescue settings [production]
07:14 <elukey> execute 'apt-get clean' on an-airflow1001 to recover disk space (root partition almost saturated) [production]
2021-01-03 §
11:38 <elukey> powercycle an-worker1114 (kernel errors in the serial console) [production]
09:07 <elukey> reboot ms-be2050 as attempt to recover/fix its broken networking state (started from Dec 30th) - T271041 [production]
2020-12-28 §
09:54 <elukey> reboot an-coord1002 (puppet in D state after issues with broken disk - host in standby, no traffic) [production]
2020-12-24 §
12:41 <elukey@cumin1001> END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [production]
12:37 <elukey@cumin1001> START - Cookbook sre.dns.netbox [production]
12:34 <elukey@cumin1001> END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [production]
12:23 <elukey@cumin1001> START - Cookbook sre.dns.netbox [production]
2020-12-22 §
07:27 <elukey> reboot stat100[4-8] (analytics hadoop clients) for kernel upgrades [production]
2020-12-18 §
07:54 <elukey> on kafka-test10[08-10] - "ip addr flush dev ens5; systemctl restart ifup@ens5.service" [production]
2020-12-17 §
12:23 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-coord1001.eqiad.wmnet with reason: REIMAGE [production]
12:21 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-coord1001.eqiad.wmnet with reason: REIMAGE [production]
11:26 <elukey@cumin1001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-test-worker1003.eqiad.wmnet with reason: REIMAGE [production]
11:24 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-worker1001.eqiad.wmnet with reason: REIMAGE [production]
11:22 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-worker1002.eqiad.wmnet with reason: REIMAGE [production]
11:21 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1001.eqiad.wmnet with reason: REIMAGE [production]
11:21 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1003.eqiad.wmnet with reason: REIMAGE [production]
11:20 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1002.eqiad.wmnet with reason: REIMAGE [production]
11:20 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1001.eqiad.wmnet with reason: REIMAGE [production]
11:18 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1002.eqiad.wmnet with reason: REIMAGE [production]
11:16 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1001.eqiad.wmnet with reason: REIMAGE [production]
11:16 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1002.eqiad.wmnet with reason: REIMAGE [production]
10:50 <elukey@cumin1001> END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) for Hadoop test cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001 [production]
10:45 <elukey@cumin1001> START - Cookbook sre.hadoop.stop-cluster for Hadoop test cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001 [production]
07:18 <elukey> reboot an-airflow1001 for kernel upgrades [production]
07:08 <elukey> update analytics-in4 filter on cr1/cr2-eqiad for https://gerrit.wikimedia.org/r/c/operations/homer/public/+/649706 [production]
2020-12-16 §
09:29 <elukey> force execution of cumin-check-aliases.service on cumin[12]001 hosts to clear alarms [production]
2020-12-11 §
18:30 <elukey@cumin1001> END (PASS) - Cookbook sre.hadoop.upgrade-bigtop-distro (exit_code=0) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [production]
18:19 <elukey@cumin1001> START - Cookbook sre.hadoop.upgrade-bigtop-distro for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [production]
18:19 <elukey@cumin1001> END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) for Hadoop test cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001 [production]
18:13 <elukey@cumin1001> START - Cookbook sre.hadoop.stop-cluster for Hadoop test cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001 [production]
17:48 <elukey@cumin1001> END (FAIL) - Cookbook sre.hadoop.upgrade-bigtop-distro (exit_code=99) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [production]
17:41 <elukey@cumin1001> START - Cookbook sre.hadoop.upgrade-bigtop-distro for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [production]
16:40 <elukey@cumin1001> END (FAIL) - Cookbook sre.hadoop.upgrade-bigtop-distro (exit_code=99) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [production]
16:28 <elukey@cumin1001> START - Cookbook sre.hadoop.upgrade-bigtop-distro for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [production]
16:15 <elukey@cumin1001> END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) for Hadoop test cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001 [production]
16:10 <elukey@cumin1001> START - Cookbook sre.hadoop.stop-cluster for Hadoop test cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001 [production]
15:20 <elukey@cumin1001> END (FAIL) - Cookbook sre.hadoop.upgrade-bigtop-distro (exit_code=99) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [production]
15:06 <elukey@cumin1001> START - Cookbook sre.hadoop.upgrade-bigtop-distro for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [production]
14:26 <elukey@cumin1001> END (FAIL) - Cookbook sre.hadoop.upgrade-bigtop-distro (exit_code=99) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [production]
14:23 <elukey@cumin1001> START - Cookbook sre.hadoop.upgrade-bigtop-distro for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [production]
14:16 <elukey@cumin1001> END (FAIL) - Cookbook sre.hadoop.upgrade-bigtop-distro (exit_code=99) for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [production]
14:04 <elukey@cumin1001> START - Cookbook sre.hadoop.upgrade-bigtop-distro for Hadoop test cluster: Change Hadoop distribution - elukey@cumin1001 [production]
14:03 <elukey@cumin1001> END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) for Hadoop test cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001 [production]
13:57 <elukey@cumin1001> START - Cookbook sre.hadoop.stop-cluster for Hadoop test cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001 [production]