2021-12-09 §
17:15 <elukey@cumin1001> START - Cookbook sre.hosts.reimage for host kafka-main2003.codfw.wmnet with OS buster [production]
17:00 <elukey> stop kafka* on kafka-main2003 as pre-step before reimaging [production]
15:44 <elukey> run `ipmitool -I lanplus -H "kafka-main2003.mgmt.codfw.wmnet" -U root -E mc reset cold` from cumin2001 [production]
15:44 <elukey> run `ipmitool -I lanplus -H "kafka-main2003.mgmt.codfw.wmnet" -U root -E mc reset cold` [production]
15:42 <elukey> run `racadm racreset` on kafka-main2003 - mgmt console not reachable via ssh (but pingable) [production]
15:42 <elukey> run `racadm racreset` [production]
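A minimal sketch of the recovery path implied by the entries above, for a management controller that answers ping but not ssh: attempt a reset from the iDRAC's own CLI, then fall back to a cold BMC reset over IPMI from a cumin host. The interactive ssh step and the IPMI_PASSWORD variable are assumptions; the ipmitool command is quoted verbatim from the log.

    # ssh to the iDRAC and run `racadm racreset` from its CLI (not possible here, ssh was down)
    ssh root@kafka-main2003.mgmt.codfw.wmnet
    # fall back to IPMI from cumin2001; -E reads the password from the IPMI_PASSWORD environment variable
    ipmitool -I lanplus -H "kafka-main2003.mgmt.codfw.wmnet" -U root -E mc reset cold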
11:13 <elukey> reboot ores2001 (lost connectivity, we suspect some weird problem with the NIC, but no traces in the kernel logs) [production]
2021-12-06 §
14:45 <elukey> roll restart of nfacctd on netflow* nodes to pick up the new CA bundle for librdkafka [production]
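A roll restart like the one above is normally driven in small batches from a cumin host; a minimal sketch, assuming a one-host batch with a 30s pause between hosts (the host glob and sleep value are assumptions):

    sudo cumin -b 1 -s 30 'netflow*' 'systemctl restart nfacctd.service'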
09:09 <elukey> move kafka main codfw to fixed uid/gid for the kafka user (requires a stop/start of all daemons) - T296982 [production]
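A minimal sketch of the kind of uid/gid pinning described above, with the broker stopped first; the numeric ids, unit name and data paths are placeholders, not the values actually used for T296982:

    systemctl stop kafka.service
    usermod -u 916 kafka
    groupmod -g 916 kafka
    chown -R kafka:kafka /srv/kafka /var/log/kafka
    systemctl start kafka.service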
2021-12-01 §
15:53 <elukey@deploy1002> helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [production]
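Outside the deployment wrapper, a sync limited to one namespace and release corresponds roughly to the helmfile call below; the helmfile.d path and the selector label are assumptions:

    helmfile -e ml-serve-eqiad -f helmfile.d/services/revscoring-editquality/helmfile.yaml --selector name=main sync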
2021-11-30 §
13:45 <elukey@cumin1001> END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. [production]
13:25 <elukey@cumin1001> START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. [production]
10:39 <elukey> rollout wmf-certificates 0~20211129-1 fleet wide (add group/others permissions to the cert bundle) [production]
2021-11-29 §
08:07 <elukey@deploy1002> Finished deploy [ores/deploy@69ed061]: Upgrade of mwparserfromhell - T296563 (duration: 07m 01s) [production]
08:00 <elukey@deploy1002> Started deploy [ores/deploy@69ed061]: Upgrade of mwparserfromhell - T296563 [production]
07:31 <elukey@deploy1002> Finished deploy [ores/deploy@69ed061]: Canary upgrade of mwparserfromhell - T296563 - (second attempt, no git update submodules the first time) (duration: 00m 04s) [production]
07:31 <elukey@deploy1002> Started deploy [ores/deploy@69ed061]: Canary upgrade of mwparserfromhell - T296563 - (second attempt, no git update submodules the first time) [production]
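The "(second attempt, no git update submodules the first time)" note above points at a stale submodule in the deploy checkout; a hedged sketch of the corrective sequence on the deployment host, where the checkout path is an assumption:

    cd /srv/deployment/ores/deploy          # deploy checkout path is an assumption
    git submodule update --init --recursive # align submodules with the pinned commits
    scap deploy 'Canary upgrade of mwparserfromhell - T296563'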
2021-11-28 §
17:14 <elukey@deploy1002> Finished deploy [ores/deploy@69ed061]: Canary upgrade of mwparserfromhell - T296563 (duration: 02m 11s) [production]
17:12 <elukey@deploy1002> Started deploy [ores/deploy@69ed061]: Canary upgrade of mwparserfromhell - T296563 [production]
2021-11-27 §
12:22 <elukey> drop /var/tmp/core files from ores100[2,4], root partition full [production]
12:10 <elukey> drop /var/tmp/core files from ores1009, root partition full [production]
11:55 <elukey> disable coredumps for ORES celery units (will cause a roll restart of all celeries) - T296563 [production]
11:46 <elukey> drop ores coredumps from ores1008 [production]
09:56 <elukey> powercycle analytics1071, soft lockup stacktraces in the tty [production]
09:51 <elukey> move ores coredump files from /var/cache/tmp to /srv/coredumps on ores100[6,7,8] and ores2003 to free space on the root partition [production]
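Two sketches for the coredump work in this section: relocating existing cores off the root partition, and disabling new cores for the celery workers via a systemd drop-in. The unit name, drop-in filename and target directory are assumptions:

    # relocate existing cores to /srv to free the root partition
    mkdir -p /srv/coredumps
    mv /var/tmp/core/* /srv/coredumps/
    df -h /

    # disable coredumps for the celery workers (requires a restart of the unit)
    install -d /etc/systemd/system/celery-ores-worker.service.d
    printf '[Service]\nLimitCORE=0\n' > /etc/systemd/system/celery-ores-worker.service.d/no-coredump.conf
    systemctl daemon-reload
    systemctl restart celery-ores-worker.service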
2021-11-26 §
15:46 <elukey> move /var/tmp/core/* to /srv/coredumps on ores1008 to free root space [production]
2021-11-25 §
17:12 <elukey@puppetmaster1001> conftool action : set/pooled=true; selector: dnsdisc=inference [production]
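The conftool action above pools the 'inference' discovery record; the equivalent manual call is roughly the one below, where the --object-type value and the extra name= selector are assumptions:

    sudo confctl --object-type discovery select 'dnsdisc=inference,name=eqiad' set/pooled=true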
08:22 <elukey@deploy1002> helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [production]
08:22 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [production]
08:21 <elukey@deploy1002> helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [production]
08:21 <elukey@deploy1002> helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [production]
07:29 <elukey_> elukey@mwdebug2002:~$ sudo systemctl reset-failed ifup@ens5.service [production]
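systemctl reset-failed clears a unit's failed state without starting it, which is what silences the lingering alert for a one-shot unit like ifup@ens5.service; a minimal check-then-clear sequence (host and unit taken from the entry above):

    systemctl list-units --state=failed --no-legend   # confirm ifup@ens5.service is the offender
    sudo systemctl reset-failed ifup@ens5.service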
2021-11-24 §
15:08 <elukey@deploy1002> helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [production]
15:08 <elukey@deploy1002> helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [production]
15:06 <elukey@deploy1002> helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [production]
15:06 <elukey@deploy1002> helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [production]
14:30 <elukey@deploy1002> helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [production]
14:30 <elukey@deploy1002> helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [production]
10:02 <elukey@cumin1001> END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2003.codfw.wmnet [production]
10:00 <elukey@cumin1001> START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2003.codfw.wmnet [production]
09:56 <elukey@cumin1001> END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2002.codfw.wmnet [production]
09:53 <elukey@cumin1001> START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2002.codfw.wmnet [production]
09:53 <elukey@cumin1001> END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2001.codfw.wmnet [production]
09:49 <elukey@cumin1001> START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2001.codfw.wmnet [production]
09:46 <elukey@cumin1001> END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2002.codfw.wmnet [production]
09:43 <elukey@cumin1001> START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2002.codfw.wmnet [production]
09:19 <elukey@cumin1001> END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2001.codfw.wmnet [production]
09:16 <elukey@cumin1001> START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2001.codfw.wmnet [production]
07:23 <elukey> reboot kubernetes1018 (role::insetup) to verify negotiated speed of eth interface [production]
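After the reboot, the negotiated link speed can be confirmed with ethtool; the interface name below is an assumption:

    sudo ethtool eno1 | grep -Ei 'speed|duplex'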
07:12 <elukey> drop /tmp/blockmgr-20fe4b2b-31fb-4a85-b5b1-bebe254120f8 and other blockmgr-* dirs on stat1006 to free space on the root partition [production]
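The blockmgr-* directories are Spark block-manager scratch space; a hedged cleanup matching the entry above, with an age guard added as a precaution against removing directories of still-running jobs:

    find /tmp -maxdepth 1 -type d -name 'blockmgr-*' -mtime +1 -exec rm -rf {} +
    df -h /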