2021-12-09 §
17:15 <elukey@cumin1001> START - Cookbook sre.hosts.reimage for host kafka-main2003.codfw.wmnet with OS buster [production]
17:00 <elukey> stop kafka* on kafka-main2003 as pre-step before reimaging [production]
15:44 <elukey> run `ipmitool -I lanplus -H "kafka-main2003.mgmt.codfw.wmnet" -U root -E mc reset cold` from cumin2001 [production]
15:44 <elukey> run `ipmitool -I lanplus -H "kafka-main2003.mgmt.codfw.wmnet" -U root -E mc reset cold` [production]
15:42 <elukey> run `racadm racreset` on kafka-main2003 - mgmt console not reachable via ssh (but pingable) [production]
15:42 <elukey> run `racadm racreset` [production]
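A minimal sketch of the recovery path implied by the entries above, for a management controller that answers ping but not ssh: attempt a reset from the iDRAC's own CLI, then fall back to a cold BMC reset over IPMI from a cumin host. The interactive ssh step and the IPMI_PASSWORD variable are assumptions; the ipmitool command is quoted verbatim from the log.

    # ssh to the iDRAC and run `racadm racreset` from its CLI (not possible here, ssh was down)
    ssh root@kafka-main2003.mgmt.codfw.wmnet
    # fall back to IPMI from cumin2001; -E reads the password from the IPMI_PASSWORD environment variable
    ipmitool -I lanplus -H "kafka-main2003.mgmt.codfw.wmnet" -U root -E mc reset cold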
11:13 <elukey> reboot ores2001 (lost connectivity, we suspect some weird problem with the NIC, but no traces in the kernel logs) [production]
2021-12-06 §
14:45 <elukey> roll restart of nfacctd on netflow* nodes to pick up the new CA bundle for librdkafka [production]
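A roll restart like the one above is normally driven in small batches from a cumin host; a minimal sketch, assuming a one-host batch with a 30s pause between hosts (the host glob and sleep value are assumptions):

    sudo cumin -b 1 -s 30 'netflow*' 'systemctl restart nfacctd.service'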
09:09 <elukey> move kafka main codfw to fixed uid/gid for the kafka user (requires a stop/start of all daemons) - T296982 [production]
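A minimal sketch of the kind of uid/gid pinning described above, with the broker stopped first; the numeric ids, unit name and data paths are placeholders, not the values actually used for T296982:

    systemctl stop kafka.service
    usermod -u 916 kafka
    groupmod -g 916 kafka
    chown -R kafka:kafka /srv/kafka /var/log/kafka
    systemctl start kafka.service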
2021-12-01 §
15:53 <elukey@deploy1002> helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [production]
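Outside the deployment wrapper, a sync limited to one namespace and release corresponds roughly to the helmfile call below; the helmfile.d path and the selector label are assumptions:

    helmfile -e ml-serve-eqiad -f helmfile.d/services/revscoring-editquality/helmfile.yaml --selector name=main sync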
2021-11-30 §
13:45 <elukey@cumin1001> END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. [production]
13:25 <elukey@cumin1001> START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-test-eqiad cluster: Roll restart of jvm daemons for openjdk upgrade. [production]
10:39 <elukey> rollout wmf-certificates 0~20211129-1 fleet wide (add group/others permissions to the cert bundle) [production]
2021-11-29 §
08:07 <elukey@deploy1002> Finished deploy [ores/deploy@69ed061]: Upgrade of mwparserfromhell - T296563 (duration: 07m 01s) [production]
08:00 <elukey@deploy1002> Started deploy [ores/deploy@69ed061]: Upgrade of mwparserfromhell - T296563 [production]
07:31 <elukey@deploy1002> Finished deploy [ores/deploy@69ed061]: Canary upgrade of mwparserfromhell - T296563 - (second attempt, no git update submodules the first time) (duration: 00m 04s) [production]
07:31 <elukey@deploy1002> Started deploy [ores/deploy@69ed061]: Canary upgrade of mwparserfromhell - T296563 - (second attempt, no git update submodules the first time) [production]
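The "(second attempt, no git update submodules the first time)" note above points at a stale submodule in the deploy checkout; a hedged sketch of the corrective sequence on the deployment host, where the checkout path is an assumption:

    cd /srv/deployment/ores/deploy          # deploy checkout path is an assumption
    git submodule update --init --recursive # align submodules with the pinned commits
    scap deploy 'Canary upgrade of mwparserfromhell - T296563'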
2021-11-28 §
17:14 <elukey@deploy1002> Finished deploy [ores/deploy@69ed061]: Canary upgrade of mwparserfromhell - T296563 (duration: 02m 11s) [production]
17:12 <elukey@deploy1002> Started deploy [ores/deploy@69ed061]: Canary upgrade of mwparserfromhell - T296563 [production]
2021-11-27 §
12:22 <elukey> drop /var/tmp/core files from ores100[2,4], root partition full [production]
12:10 <elukey> drop /var/tmp/core files from ores1009, root partition full [production]
11:55 <elukey> disable coredumps for ORES celery units (will cause a roll restart of all celeries) - T296563 [production]
11:46 <elukey> drop ores coredumps from ores1008 [production]
09:56 <elukey> powercycle analytics1071, soft lockup stacktraces in the tty [production]
09:51 <elukey> move ores coredump files from /var/cache/tmp to /srv/coredumps on ores100[6,7,8] and ores2003 to free space on the root partition [production]
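Two sketches for the coredump work in this section: relocating existing cores off the root partition, and disabling new cores for the celery workers via a systemd drop-in. The unit name, drop-in filename and target directory are assumptions:

    # relocate existing cores to /srv to free the root partition
    mkdir -p /srv/coredumps
    mv /var/tmp/core/* /srv/coredumps/
    df -h /

    # disable coredumps for the celery workers (requires a restart of the unit)
    install -d /etc/systemd/system/celery-ores-worker.service.d
    printf '[Service]\nLimitCORE=0\n' > /etc/systemd/system/celery-ores-worker.service.d/no-coredump.conf
    systemctl daemon-reload
    systemctl restart celery-ores-worker.service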
2021-11-26 §
15:46 <elukey> move /var/tmp/core/* to /srv/coredumps on ores1008 to free root space [production]
2021-11-25 §
17:12 <elukey@puppetmaster1001> conftool action : set/pooled=true; selector: dnsdisc=inference [production]
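The conftool action above pools the 'inference' discovery record; the equivalent manual call is roughly the one below, where the --object-type value and the extra name= selector are assumptions:

    sudo confctl --object-type discovery select 'dnsdisc=inference,name=eqiad' set/pooled=true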
08:22 <elukey@deploy1002> helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [production]
08:22 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [production]
08:21 <elukey@deploy1002> helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [production]
08:21 <elukey@deploy1002> helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [production]
07:29 <elukey_> elukey@mwdebug2002:~$ sudo systemctl reset-failed ifup@ens5.service [production]
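systemctl reset-failed clears a unit's failed state without starting it, which is what silences the lingering alert for a one-shot unit like ifup@ens5.service; a minimal check-then-clear sequence (host and unit taken from the entry above):

    systemctl list-units --state=failed --no-legend   # confirm ifup@ens5.service is the offender
    sudo systemctl reset-failed ifup@ens5.service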
2021-11-24 §
15:08 <elukey@deploy1002> helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [production]
15:08 <elukey@deploy1002> helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [production]
15:06 <elukey@deploy1002> helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [production]
15:06 <elukey@deploy1002> helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [production]
14:30 <elukey@deploy1002> helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [production]
14:30 <elukey@deploy1002> helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [production]
10:02 <elukey@cumin1001> END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2003.codfw.wmnet [production]
10:00 <elukey@cumin1001> START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2003.codfw.wmnet [production]
09:56 <elukey@cumin1001> END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2002.codfw.wmnet [production]
09:53 <elukey@cumin1001> START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2002.codfw.wmnet [production]
09:53 <elukey@cumin1001> END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2001.codfw.wmnet [production]
09:49 <elukey@cumin1001> START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2001.codfw.wmnet [production]
09:46 <elukey@cumin1001> END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2002.codfw.wmnet [production]
09:43 <elukey@cumin1001> START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2002.codfw.wmnet [production]
09:19 <elukey@cumin1001> END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-serve-ctrl2001.codfw.wmnet [production]
09:16 <elukey@cumin1001> START - Cookbook sre.ganeti.reboot-vm for VM ml-serve-ctrl2001.codfw.wmnet [production]
07:23 <elukey> reboot kubernetes1018 (role::insetup) to verify negotiated speed of eth interface [production]
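After the reboot, the negotiated link speed can be confirmed with ethtool; the interface name below is an assumption:

    sudo ethtool eno1 | grep -Ei 'speed|duplex'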
07:12 <elukey> drop /tmp/blockmgr-20fe4b2b-31fb-4a85-b5b1-bebe254120f8 and other blockmgr-* dirs on stat1006 to free space on the root partition [production]
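The blockmgr-* directories are Spark block-manager scratch space; a hedged cleanup matching the entry above, with an age guard added as a precaution against removing directories of still-running jobs:

    find /tmp -maxdepth 1 -type d -name 'blockmgr-*' -mtime +1 -exec rm -rf {} +
    df -h /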