2022-01-19 §
15:54 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [production]
08:10 <elukey@deploy1002> helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [production]
08:10 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [production]
07:57 <elukey@deploy1002> helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [production]
07:56 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [production]
07:55 <elukey@deploy1002> helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [production]
07:55 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [production]
2022-01-18 §
16:10 <elukey@deploy1002> helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [production]
16:09 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [production]
2022-01-17 §
10:56 <elukey@cumin1001> END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1002.eqiad.wmnet [production]
10:47 <elukey@cumin1001> START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1002.eqiad.wmnet [production]
10:44 <elukey@cumin1001> END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl1001.eqiad.wmnet [production]
10:42 <elukey@cumin1001> START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl1001.eqiad.wmnet [production]
06:59 <elukey> `systemctl reset-failed ifup@ens5.service` on an-test-client1001 and kafka-test1010 [production]
2022-01-13 §
11:03 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1001.eqiad.wmnet with OS buster [production]
10:29 <elukey@cumin1001> START - Cookbook sre.hosts.reimage for host kafka-main1001.eqiad.wmnet with OS buster [production]
10:02 <elukey> run kafka preferred-replica-election on kafka-main1001 to force a rebalance of partition leaders (after kafka-main1002's reimage) [production]
09:59 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1002.eqiad.wmnet with OS buster [production]
09:26 <elukey@cumin1001> START - Cookbook sre.hosts.reimage for host kafka-main1002.eqiad.wmnet with OS buster [production]
08:42 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1003.eqiad.wmnet with OS buster [production]
08:39 <elukey> ipmi mc reset cold for kafka-main1002, mgmt interface not reachable via ssh [production]
08:08 <elukey@cumin1001> START - Cookbook sre.hosts.reimage for host kafka-main1003.eqiad.wmnet with OS buster [production]
08:02 <elukey> ipmi mc reset cold for kafka-main1003, mgmt interface not reachable via ssh [production]
07:57 <elukey> stop kafka* on kafka-main1003 as prep-step for reimage to buster [production]
2022-01-12 §
16:45 <elukey> elukey@prometheus2004:~$ sudo apt-get remove linux-image-4.9.0-8-amd64 linux-image-4.9.0-9-amd64 linux-image-4.9.0-11-amd64 linux-image-4.9.0-12-amd64 linux-image-4.9.0-13-amd64 [production]
16:44 <elukey> elukey@prometheus2003:~$ sudo apt-get remove linux-image-4.9.0-8-amd64 linux-image-4.9.0-9-amd64 linux-image-4.9.0-11-amd64 linux-image-4.9.0-12-amd64 linux-image-4.9.0-13-amd64 [production]
16:40 <elukey> elukey@prometheus1004:~$ sudo apt-get remove linux-image-4.9.0-8-amd64 linux-image-4.9.0-9-amd64 linux-image-4.9.0-11-amd64 linux-image-4.9.0-12-amd64 linux-image-4.9.0-13-amd64 [production]
16:39 <elukey> elukey@prometheus1003:~$ sudo apt-get remove linux-image-4.9.0-11-amd64 linux-image-4.9.0-12-amd64 linux-image-4.9.0-13-amd64 linux-image-4.9.0-8-amd64 linux-image-4.9.0-9-amd64 [production]
16:25 <elukey> stop kafka* on kafka-main1003 to allow dcops maintenance (nic/bios upgrades) - T298867 [production]
16:02 <elukey> stop kafka* on kafka-main1002 to allow dcops maintenance (nic/bios upgrades) - T298867 [production]
15:14 <elukey> stop kafka* on kafka-main1001 to allow dcops maintenance (nic/bios upgrades) - T298867 [production]
13:23 <elukey@cumin1001> END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM orespoolcounter1004.eqiad.wmnet [production]
13:18 <elukey@cumin1001> START - Cookbook sre.ganeti.reboot-vm for VM orespoolcounter1004.eqiad.wmnet [production]
13:11 <elukey@cumin1001> END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM orespoolcounter1003.eqiad.wmnet [production]
13:08 <elukey@cumin1001> START - Cookbook sre.ganeti.reboot-vm for VM orespoolcounter1003.eqiad.wmnet [production]
11:21 <elukey> move kafka-jumbo nodes to fixed kafka uid/gid - T296990 [production]
2022-01-11 §
18:34 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-test-coord1002.eqiad.wmnet with OS buster [production]
18:08 <elukey@cumin1001> START - Cookbook sre.hosts.reimage for host an-test-coord1002.eqiad.wmnet with OS buster [production]
2022-01-10 §
10:38 <elukey> stop/start kafka daemons on kafka-main1* nodes to move the kafka user to fixed uid/gid - T296641 [production]
2022-01-08 §
10:51 <elukey> restart hive daemons on an-coord1002 (after my last upgrade/rollback of packages the prometheus agent settings were not picked up, so no metrics) [production]
2022-01-03 §
11:29 <elukey> restart cassandra-b on aqs1010 and aqs1015 (instances stuck / thrashing, new cluster, not serving live traffic atm) [production]
10:22 <elukey> powercycle an-worker1114 (CPU soft lockup errors in mgmt console) [production]
10:20 <elukey> powercycle an-worker1120 (CPU soft lockup errors in mgmt console) [production]
2021-12-29 §
10:30 <elukey> kill tcpdump process on kubestagemaster1001 (kept a big pcap file opened that kept growing) [production]
2021-12-17 §
19:26 <elukey@deploy1002> helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [production]
17:57 <elukey@deploy1002> helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [production]
17:07 <elukey@deploy1002> helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [production]
16:59 <elukey@deploy1002> helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [production]
16:56 <elukey@deploy1002> helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [production]
16:39 <elukey@deploy1002> helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [production]