production SAL

2023-07-11
09:08	<btullis@cumin1001>	END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host kafkamon1003.eqiad.wmnet	[production]
09:06	<elukey@deploy1002>	helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync	[production]
09:06	<elukey@deploy1002>	helmfile [eqiad] START helmfile.d/services/eventgate-main: sync	[production]
09:06	<jayme>	enabled puppet on 'P{R:Package = envoyproxy}'	[production]
09:01	<elukey@deploy1002>	helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync	[production]
09:01	<elukey@deploy1002>	helmfile [codfw] START helmfile.d/services/eventgate-main: sync	[production]
08:59	<elukey@deploy1002>	helmfile [staging] DONE helmfile.d/services/eventgate-main: sync	[production]
08:59	<btullis@cumin1001>	START - Cookbook sre.hosts.reboot-single for host kafkamon1003.eqiad.wmnet	[production]
08:59	<elukey@deploy1002>	helmfile [staging] START helmfile.d/services/eventgate-main: sync	[production]
08:43	<volans>	previous downtiming completed	[production]
08:40	<volans>	downtiming service 'Check no envoy runtime configuration is left persistent' on envoy hosts	[production]
08:39	<jayme>	disabled puppet on 'P{R:Package = envoyproxy}'	[production]
08:18	<godog>	upgrade prometheus to 2.24.1+ds-1+wmf2 on cloudmetrics*	[production]
08:03	<hashar>	Stopping Jenkins and Zuul for server switch over	[production]
08:01	<jelto@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on contint2002.wikimedia.org with reason: Switch contint hosts for hardware replacement	[production]
08:01	<jelto@cumin1001>	START - Cookbook sre.hosts.downtime for 1:00:00 on contint2002.wikimedia.org with reason: Switch contint hosts for hardware replacement	[production]
08:01	<jelto@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on contint2001.wikimedia.org with reason: Switch contint hosts for hardware replacement	[production]
08:01	<jelto@cumin1001>	START - Cookbook sre.hosts.downtime for 1:00:00 on contint2001.wikimedia.org with reason: Switch contint hosts for hardware replacement	[production]
07:55	<kart_>	Updated MinT to 2023-07-10-051738-production (T341335, T333969)	[production]
07:54	<kartik@deploy1002>	helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply	[production]
07:49	<kartik@deploy1002>	helmfile [codfw] START helmfile.d/services/machinetranslation: apply	[production]
07:47	<kartik@deploy1002>	helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply	[production]
07:42	<kartik@deploy1002>	helmfile [eqiad] START helmfile.d/services/machinetranslation: apply	[production]
07:38	<kartik@deploy1002>	helmfile [staging] DONE helmfile.d/services/machinetranslation: apply	[production]
07:36	<kartik@deploy1002>	helmfile [staging] START helmfile.d/services/machinetranslation: apply	[production]
07:36	<moritzm>	failover broken ganeti2014 node	[production]
07:28	<moritzm>	powercycle ganeti2014	[production]
07:22	<moritzm>	installing libxpm security updates	[production]
07:08	<moritzm>	rebalance ganeti in drmrs after reboots	[production]
06:59	<elukey>	restart kube-apiserver on ml-serve-ctrl1* as attempt to resolve spikes in latencies	[production]
06:36	<moritzm>	rebalance ganeti group eqiad/B after reboots	[production]
05:24	<rzl>	imported otelcol-contrib 0.81.0 to buster-wikimedia and bullseye-wikimedia in component thirdparty/otelcol-contrib	[production]
04:34	<rzl@deploy1002>	helmfile [staging] START helmfile.d/services/opentelemetry-collector: apply	[production]
02:05	<mutante>	LDAP - added urbanecm to wmf group, removed from nda group (conversion volunteer to staff) T341443	[production]
2023-07-10
23:11	<Krinkle>	krinkle@xhgui1001$ Define new `xhgui.watches` table via xhguiadmin@m2-master.eqiad.wmnet database, ref T341499	[production]
22:12	<maryum>	Deployed security patch for T340200	[production]
21:42	<btullis@deploy1002>	helmfile [staging] DONE helmfile.d/services/datahub: sync on main	[production]
21:39	<btullis@deploy1002>	helmfile [staging] START helmfile.d/services/datahub: apply on main	[production]
21:37	<bking@deploy1002>	Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 52s)	[production]
21:36	<bking@deploy1002>	Started deploy [wdqs/wdqs@dff41b7]: 0.3.124	[production]
20:46	<btullis@deploy1002>	helmfile [staging] DONE helmfile.d/services/datahub: sync on main	[production]
20:43	<TheresNoTime>	close UTC late backport window	[production]
20:42	<samtar@deploy1002>	Finished scap: Backport for [[gerrit:936735|Revert "log additional events on Special:Diff|MobileDiff"]] (duration: 07m 27s)	[production]
20:42	<btullis@deploy1002>	helmfile [staging] START helmfile.d/services/datahub: apply on main	[production]
20:36	<samtar@deploy1002>	samtar: Backport for [[gerrit:936735|Revert "log additional events on Special:Diff|MobileDiff"]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet	[production]
20:34	<samtar@deploy1002>	Started scap: Backport for [[gerrit:936735|Revert "log additional events on Special:Diff|MobileDiff"]]	[production]
20:25	<ladsgroup@cumin1001>	dbctl commit (dc=all): 'db2112 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P49544 and previous config saved to /var/cache/conftool/dbconfig/20230710-202536-ladsgroup.json	[production]
20:23	<samtar@deploy1002>	Finished scap: Backport for [[gerrit:936748|log additional events on Special:Diff|MobileDiff (T326212)]] (duration: 21m 42s)	[production]
20:23	<inflatador>	bking@wdqs1006 Restart wdqs-blazegraph to hopefully clear the free allocators alerts	[production]
20:19	<TheresNoTime>	syncing https://gerrit.wikimedia.org/r/c/936748 untested (T326212) for test after sync	[production]