production SAL

2024-08-20
05:16	<marostegui@cumin1002>	START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db1184.eqiad.wmnet with reason: Long schema change	[production]
04:52	<marostegui@cumin1002>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 35 hosts with reason: Primary switchover s1 T372524	[production]
04:52	<marostegui@cumin1002>	dbctl commit (dc=all): 'Set db1163 with weight 0 T372524', diff saved to https://phabricator.wikimedia.org/P67391 and previous config saved to /var/cache/conftool/dbconfig/20240820-045212-root.json	[production]
04:52	<marostegui@cumin1002>	START - Cookbook sre.hosts.downtime for 1:00:00 on 35 hosts with reason: Primary switchover s1 T372524	[production]
04:00	<mwpresync@deploy1003>	Pruned MediaWiki: 1.43.0-wmf.16 (duration: 00m 56s)	[production]
03:48	<mwpresync@deploy1003>	Finished scap sync-world: testwikis to 1.43.0-wmf.19 refs T366964 (duration: 46m 32s)	[production]
03:02	<mwpresync@deploy1003>	Started scap sync-world: testwikis to 1.43.0-wmf.19 refs T366964	[production]
00:21	<mutante>	previous message about prometheus can be ignored - race condition that solved itself on next puppet run	[production]
00:04	<mutante>	prometheus3003/prometheus1006 - are trying to use puppetserver1002 but get connection refused from puppetserver1001.eqiad.wmnet port 8140 - causing other puppet errors	[production]
2024-08-19
23:59	<mutante>	prometheus - puppet on prometheus hosts very slow - reason appears to be that /srv/prometheus is recursively managed by puppet but has ~ 20x more files than the default soft limit of 1000	[production]
23:55	<mutante>	prometheus - switched ferm::service to firewall::service (gerrit:1057952) - NOOP except /etc/ferm/conf.d/10_prometheus-web becomes /etc/ferm/conf.d/10_prometheus_web with identical rules	[production]
23:15	<ejegg>	fundraising civicrm upgraded from fd01c939 to 1022abf1	[production]
22:30	<andrew@cumin1002>	END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1041.eqiad.wmnet with OS bullseye	[production]
22:12	<andrew@cumin1002>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1041.eqiad.wmnet with reason: host reimage	[production]
22:09	<andrew@cumin1002>	START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1041.eqiad.wmnet with reason: host reimage	[production]
21:50	<andrew@cumin1002>	START - Cookbook sre.hosts.reimage for host cloudcephosd1041.eqiad.wmnet with OS bullseye	[production]
21:48	<andrew@cumin1002>	END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1040.eqiad.wmnet with OS bullseye	[production]
21:30	<andrew@cumin1002>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1040.eqiad.wmnet with reason: host reimage	[production]
21:26	<andrew@cumin1002>	START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1040.eqiad.wmnet with reason: host reimage	[production]
21:07	<andrew@cumin1002>	START - Cookbook sre.hosts.reimage for host cloudcephosd1040.eqiad.wmnet with OS bullseye	[production]
21:06	<andrew@cumin1002>	END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1039.eqiad.wmnet with OS bullseye	[production]
20:57	<eevans@deploy1003>	Finished deploy [restbase/deploy@b504108] (beta): Dry run beta deployment test (duration: 00m 06s)	[production]
20:57	<eevans@deploy1003>	Started deploy [restbase/deploy@b504108] (beta): Dry run beta deployment test	[production]
20:52	<sbassett>	Deployed changes from T372570 to security.wikimedia.org (miscweb)	[production]
20:49	<sbassett@deploy1003>	helmfile [eqiad] DONE helmfile.d/services/miscweb: apply	[production]
20:49	<sbassett@deploy1003>	helmfile [eqiad] START helmfile.d/services/miscweb: apply	[production]
20:49	<sbassett@deploy1003>	helmfile [codfw] DONE helmfile.d/services/miscweb: apply	[production]
20:49	<sbassett@deploy1003>	helmfile [codfw] START helmfile.d/services/miscweb: apply	[production]
20:49	<sbassett@deploy1003>	helmfile [eqiad] DONE helmfile.d/services/miscweb: apply	[production]
20:48	<andrew@cumin1002>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1039.eqiad.wmnet with reason: host reimage	[production]
20:46	<sbassett@deploy1003>	helmfile [eqiad] START helmfile.d/services/miscweb: apply	[production]
20:45	<eevans@deploy1003>	Finished deploy [restbase/deploy@b504108] (beta): Dry run beta deployment test (duration: 00m 32s)	[production]
20:45	<sbassett@deploy1003>	helmfile [codfw] DONE helmfile.d/services/miscweb: apply	[production]
20:45	<eevans@deploy1003>	Started deploy [restbase/deploy@b504108] (beta): Dry run beta deployment test	[production]
20:44	<mforns@deploy1003>	Finished deploy [airflow-dags/analytics_test@3ec5119]: (no justification provided) (duration: 00m 11s)	[production]
20:44	<andrew@cumin1002>	START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1039.eqiad.wmnet with reason: host reimage	[production]
20:44	<mforns@deploy1003>	Started deploy [airflow-dags/analytics_test@3ec5119]: (no justification provided)	[production]
20:42	<sbassett@deploy1003>	helmfile [codfw] START helmfile.d/services/miscweb: apply	[production]
20:26	<andrew@cumin1002>	START - Cookbook sre.hosts.reimage for host cloudcephosd1039.eqiad.wmnet with OS bullseye	[production]
20:26	<andrew@cumin1002>	END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1039.eqiad.wmnet with OS bullseye	[production]
20:00	<andrew@cumin1002>	START - Cookbook sre.hosts.reimage for host cloudcephosd1039.eqiad.wmnet with OS bullseye	[production]
19:59	<andrew@cumin1002>	END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1039.eqiad.wmnet with OS bullseye	[production]
19:54	<ryankemper@cumin2002>	END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2024.codfw.wmnet with OS bullseye	[production]
19:53	<dancy@deploy1003>	Started scap sync-world: testing T371904	[production]
19:52	<dancy@deploy1003>	Installation of scap version "4.98.0" completed for 207 hosts	[production]
19:52	<dancy@deploy1003>	Installing scap version "4.98.0" for 207 hosts	[production]
19:51	<jhancock@cumin2002>	END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudlb2004-dev.codfw.wmnet with OS bookworm	[production]
19:45	<andrew@cumin1002>	START - Cookbook sre.hosts.reimage for host cloudcephosd1039.eqiad.wmnet with OS bullseye	[production]
19:45	<andrew@cumin1002>	END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1039.eqiad.wmnet with OS bullseye	[production]
19:29	<andrew@cumin1002>	START - Cookbook sre.hosts.reimage for host cloudcephosd1039.eqiad.wmnet with OS bullseye	[production]