production SAL

2022-07-08
10:16	<jmm@cumin2002>	START - Cookbook sre.dns.netbox	[production]
10:12	<jmm@cumin2002>	START - Cookbook sre.hosts.decommission for hosts deneb.codfw.wmnet	[production]
09:40	<jmm@cumin2002>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ganeti2027.codfw.wmnet with reason: Temporarily remove from Ganeti cluster for reimage	[production]
09:40	<jmm@cumin2002>	START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ganeti2027.codfw.wmnet with reason: Temporarily remove from Ganeti cluster for reimage	[production]
09:25	<jmm@cumin2002>	END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti2016.codfw.wmnet to cluster codfw and group D	[production]
07:33	<akosiaris>	reboot rdb1009 for kernel upgrades	[production]
07:29	<vgutierrez>	restart pybal on lvs6002	[production]
07:22	<akosiaris>	reboot rdb1010 for kernel upgrades	[production]
06:52	<jmm@cumin2002>	START - Cookbook sre.ganeti.addnode for new host ganeti2016.codfw.wmnet to cluster codfw and group D	[production]
06:49	<jmm@cumin2002>	END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2016.codfw.wmnet	[production]
06:47	<TimStarling>	on mwmaint2002: using iptables to simulate cross-DC memcached traffic loss	[production]
06:39	<jmm@cumin2002>	START - Cookbook sre.hosts.reboot-single for host ganeti2016.codfw.wmnet	[production]
06:05	<tstarling@deploy1002>	Synchronized wmf-config/InitialiseSettings.php: Switch $wgCentralAuthTokenCacheType to mcrouter-primary-dc (duration: 03m 18s)	[production]
06:05	<jmm@cumin2002>	END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2016.codfw.wmnet with OS bullseye	[production]
06:05	<mwdebug-deploy@deploy1002>	helmfile [codfw] DONE helmfile.d/services/mwdebug: apply	[production]
06:04	<mwdebug-deploy@deploy1002>	helmfile [codfw] START helmfile.d/services/mwdebug: apply	[production]
06:04	<mwdebug-deploy@deploy1002>	helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply	[production]
06:03	<mwdebug-deploy@deploy1002>	helmfile [eqiad] START helmfile.d/services/mwdebug: apply	[production]
05:53	<marostegui@cumin1001>	dbctl commit (dc=all): 'Remove db2077 from dbctl T312191', diff saved to https://phabricator.wikimedia.org/P30963 and previous config saved to /var/cache/conftool/dbconfig/20220708-055334-marostegui.json	[production]
05:49	<jmm@cumin2002>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti2016.codfw.wmnet with reason: host reimage	[production]
05:46	<jmm@cumin2002>	START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti2016.codfw.wmnet with reason: host reimage	[production]
05:44	<marostegui@cumin1001>	END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2076.codfw.wmnet	[production]
05:42	<marostegui@cumin1001>	END (PASS) - Cookbook sre.dns.netbox (exit_code=0)	[production]
05:38	<marostegui@cumin1001>	START - Cookbook sre.dns.netbox	[production]
05:34	<marostegui@cumin1001>	START - Cookbook sre.hosts.decommission for hosts db2076.codfw.wmnet	[production]
05:31	<moritzm>	draining ganeti2027 T311686	[production]
05:29	<marostegui@cumin1001>	dbctl commit (dc=all): 'Remove db2076 from dbctl T312190', diff saved to https://phabricator.wikimedia.org/P30962 and previous config saved to /var/cache/conftool/dbconfig/20220708-052926-marostegui.json	[production]
05:26	<jmm@cumin2002>	START - Cookbook sre.hosts.reimage for host ganeti2016.codfw.wmnet with OS bullseye	[production]
05:23	<marostegui>	dbmaint s3@eqiad T312574	[production]
04:08	<ebernhardson@deploy1002>	Finished deploy [wikimedia/discovery/analytics@b5d49fe]: use mode=reschedule on all airflow sensors (duration: 02m 03s)	[production]
04:06	<ebernhardson@deploy1002>	Started deploy [wikimedia/discovery/analytics@b5d49fe]: use mode=reschedule on all airflow sensors	[production]
03:32	<bking@cumin1001>	END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster reimage to bullseye - bking@cumin1001 - T309343	[production]
03:22	<bking@cumin1001>	END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudelastic1004.wikimedia.org with OS bullseye	[production]
02:27	<ebernhardson@deploy1002>	Finished deploy [wikimedia/discovery/analytics@c271774]: Update rdf-spark-tools to 0.3.112 (duration: 02m 13s)	[production]
02:26	<bking@cumin1001>	START - Cookbook sre.hosts.reimage for host cloudelastic1004.wikimedia.org with OS bullseye	[production]
02:25	<bking@cumin1001>	START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic cluster reimage to bullseye - bking@cumin1001 - T309343	[production]
02:25	<ebernhardson@deploy1002>	Started deploy [wikimedia/discovery/analytics@c271774]: Update rdf-spark-tools to 0.3.112	[production]
02:12	<krinkle@deploy1002>	Synchronized wmf-config/InitialiseSettings.php: RL use MainStash on dewiki I1c120d64d226 (duration: 03m 21s)	[production]
01:55	<mwdebug-deploy@deploy1002>	helmfile [codfw] DONE helmfile.d/services/mwdebug: apply	[production]
01:54	<mwdebug-deploy@deploy1002>	helmfile [codfw] START helmfile.d/services/mwdebug: apply	[production]
01:54	<mwdebug-deploy@deploy1002>	helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply	[production]
01:53	<mwdebug-deploy@deploy1002>	helmfile [eqiad] START helmfile.d/services/mwdebug: apply	[production]
01:49	<pt1979@cumin2002>	END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2182.codfw.wmnet with OS bullseye	[production]
01:35	<pt1979@cumin2002>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2182.codfw.wmnet with reason: host reimage	[production]
01:31	<pt1979@cumin2002>	START - Cookbook sre.hosts.downtime for 2:00:00 on db2182.codfw.wmnet with reason: host reimage	[production]
01:12	<pt1979@cumin2002>	START - Cookbook sre.hosts.reimage for host db2182.codfw.wmnet with OS bullseye	[production]
01:12	<mutante>	gitlab1004 - _still_ icinga alerts about rsync to decom'ed host. 'systemctl daemon-reload' to teach it about deleted units, then systemctl reset failed ..then RECOVERY T307142	[production]
00:02	<pt1979@cumin2002>	END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2181.codfw.wmnet with OS bullseye	[production]
2022-07-07
23:49	<pt1979@cumin2002>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2181.codfw.wmnet with reason: host reimage	[production]
23:45	<pt1979@cumin2002>	START - Cookbook sre.hosts.downtime for 2:00:00 on db2181.codfw.wmnet with reason: host reimage	[production]