production SAL

2021-04-21 §
06:42	<elukey>	upload hue_4.9.0-2+deb10u1 to buster-wikimedia	[production]
2021-04-16 §
07:53	<elukey>	run reprepro --delete clearvanished on apt1001 to clear all cloudera packages	[production]
05:54	<elukey@cumin1001>	END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics-tool1001.eqiad.wmnet	[production]
05:42	<elukey@cumin1001>	START - Cookbook sre.hosts.decommission for hosts analytics-tool1001.eqiad.wmnet	[production]
2021-04-15 §
15:09	<elukey@deploy1002>	Finished deploy [analytics/refinery@497f6a5]: Regular analytics weekly train (duration: 13m 12s)	[production]
14:56	<elukey@deploy1002>	Started deploy [analytics/refinery@497f6a5]: Regular analytics weekly train	[production]
10:21	<elukey>	Add kafka-logging100{2,3} to the kafka term in the analytics filters on cr1/cr2 eqiad - ref: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/679740	[production]
06:32	<elukey>	move hue.wikimedia.org to an-tool1009 (from analytics-tool1001)	[production]
2021-04-14 §
12:39	<elukey>	update kafka term for analytics-in{4,6} on cr{1,2}-eqiad to include kafka-logging1001 - ref: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/679296	[production]
2021-04-08 §
16:33	<elukey>	reboot an-worker1100 again to check if all the disks come up correctly	[production]
15:36	<elukey>	reboot an-worker1100 to see if it helps with the strange BBU behavior	[production]
06:44	<elukey@deploy1002>	Finished deploy [analytics/refinery@1dbbd3d] (hadoop-test): (no justification provided) (duration: 02m 20s)	[production]
06:41	<elukey@deploy1002>	Started deploy [analytics/refinery@1dbbd3d] (hadoop-test): (no justification provided)	[production]
2021-04-07 §
15:39	<elukey@cumin1001>	END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0)	[production]
15:30	<elukey@cumin1001>	START - Cookbook sre.aqs.roll-restart	[production]
2021-04-06 §
09:52	<elukey@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-coord1002.eqiad.wmnet with reason: REIMAGE	[production]
09:50	<elukey@cumin1001>	START - Cookbook sre.hosts.downtime for 2:00:00 on an-coord1002.eqiad.wmnet with reason: REIMAGE	[production]
2021-04-03 §
16:44	<elukey>	power reset for ms-be2028 - not reachable via ssh, no tty available via mgmt console, NMI unrecoverable errors logged in iLO's system logs	[production]
2021-04-02 §
14:30	<elukey@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-coord1001.eqiad.wmnet with reason: REIMAGE	[production]
14:28	<elukey@cumin1001>	START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-coord1001.eqiad.wmnet with reason: REIMAGE	[production]
13:14	<elukey@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1001.eqiad.wmnet with reason: REIMAGE	[production]
13:12	<elukey@cumin1001>	START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1001.eqiad.wmnet with reason: REIMAGE	[production]
10:54	<elukey@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1002.eqiad.wmnet with reason: REIMAGE	[production]
10:52	<elukey@cumin1001>	START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1002.eqiad.wmnet with reason: REIMAGE	[production]
07:28	<elukey>	manual fix for an-worker1080's interface in netbox (xe-4/0/11), moved by mistake to public-1b	[production]
2021-04-01 §
14:06	<elukey@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-worker1001.eqiad.wmnet with reason: REIMAGE	[production]
14:04	<elukey@cumin1001>	START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1001.eqiad.wmnet with reason: REIMAGE	[production]
06:37	<elukey>	powercycle cp1087 (no ssh, no tty via serial console) - T278729	[production]
06:35	<elukey@puppetmaster1001>	conftool action : set/pooled=no; selector: name=cp1087.eqiad.wmnet	[production]
2021-03-30 §
07:37	<elukey>	restart-php7.2-fpm on mw1304, jobrunner completely overwhelmed by ffmpeg/transcode jobs (not publishing metrics, erroring out on memcached timeouts) - T278734	[production]
06:06	<elukey>	powercycle cp1087 (no ssh, no mgmt console tty)	[production]
06:04	<elukey@puppetmaster1001>	conftool action : set/pooled=no; selector: name=cp1087.eqiad.wmnet	[production]
2021-03-27 §
19:25	<elukey>	powercycle elastic1060 - T278630	[production]
2021-03-25 §
08:12	<elukey>	upgrade hive packages in thirdparty/bigtop15 to 2.3.6-2 for buster-wikimedia	[production]
08:11	<elukey>	upgrade hive packages in thirdparty/bigtop15 to 2.3.6-2	[production]
2021-03-24 §
07:41	<elukey@cumin1001>	END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-etcd2002.codfw.wmnet	[production]
07:27	<elukey@cumin1001>	START - Cookbook sre.ganeti.makevm for new host ml-etcd2002.codfw.wmnet	[production]
07:20	<elukey@cumin1001>	END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ml-etcd2002.codfw.wmnet	[production]
07:10	<elukey@cumin1001>	START - Cookbook sre.hosts.decommission for hosts ml-etcd2002.codfw.wmnet	[production]
2021-03-23 §
13:54	<elukey>	sudo systemctl reload apache2 on prometheus[12]00[34] to pick up new k8s-mlserve instance settings	[production]
07:36	<elukey>	create a 50g lvm volume on prometheus[12]00[34] for the k8s-mlserve cluster - T272918	[production]
2021-03-22 §
14:14	<elukey@deploy1002>	helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.	[production]
14:14	<elukey@deploy1002>	helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.	[production]
11:15	<elukey@deploy1002>	helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.	[production]
11:15	<elukey@deploy1002>	helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.	[production]
11:15	<elukey@deploy1002>	helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.	[production]
11:14	<elukey@deploy1002>	helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.	[production]
10:48	<elukey@deploy1002>	helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.	[production]
10:48	<elukey@deploy1002>	helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.	[production]
10:47	<elukey@deploy1002>	helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.	[production]