production SAL

2022-04-05
02:29	<mwdebug-deploy@deploy1002>	helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply	[production]
02:29	<mwdebug-deploy@deploy1002>	helmfile [eqiad] START helmfile.d/services/mwdebug: apply	[production]
02:21	<ladsgroup@cumin1001>	dbctl commit (dc=all): 'Depooling db1169 (T298565)', diff saved to https://phabricator.wikimedia.org/P24081 and previous config saved to /var/cache/conftool/dbconfig/20220405-022132-ladsgroup.json	[production]
02:21	<ladsgroup@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance	[production]
02:21	<ladsgroup@cumin1001>	START - Cookbook sre.hosts.downtime for 6:00:00 on db1169.eqiad.wmnet with reason: Maintenance	[production]
02:21	<ladsgroup@cumin1001>	dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24080 and previous config saved to /var/cache/conftool/dbconfig/20220405-022124-ladsgroup.json	[production]
02:08	<mwdebug-deploy@deploy1002>	helmfile [codfw] DONE helmfile.d/services/mwdebug: apply	[production]
02:07	<mwdebug-deploy@deploy1002>	helmfile [codfw] START helmfile.d/services/mwdebug: apply	[production]
02:07	<mwdebug-deploy@deploy1002>	helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply	[production]
02:06	<ladsgroup@cumin1001>	dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P24079 and previous config saved to /var/cache/conftool/dbconfig/20220405-020619-ladsgroup.json	[production]
02:05	<mwdebug-deploy@deploy1002>	helmfile [eqiad] START helmfile.d/services/mwdebug: apply	[production]
01:59	<sukhe@cumin2002>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on cp5002.eqsin.wmnet with reason: downtimed because of hardware failure: T305423	[production]
01:59	<sukhe@cumin2002>	START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on cp5002.eqsin.wmnet with reason: downtimed because of hardware failure: T305423	[production]
01:57	<eileen>	process control config revision changed from 06379640 to 25728a0e	[production]
01:51	<ladsgroup@cumin1001>	dbctl commit (dc=all): 'Repooling after maintenance db1099:3311', diff saved to https://phabricator.wikimedia.org/P24078 and previous config saved to /var/cache/conftool/dbconfig/20220405-015114-ladsgroup.json	[production]
01:47	<sukhe@cumin2002>	END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cp5002.eqsin.wmnet	[production]
01:42	<eileen>	civicrm revision changed from 84c737b6 to 87bc3114	[production]
01:37	<eileen>	config revision changed from bb0e1af3 to 06379640	[production]
01:36	<ladsgroup@cumin1001>	dbctl commit (dc=all): 'Repooling after maintenance db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24077 and previous config saved to /var/cache/conftool/dbconfig/20220405-013609-ladsgroup.json	[production]
01:15	<sukhe@cumin2002>	END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3053.esams.wmnet	[production]
01:07	<sukhe@cumin2002>	START - Cookbook sre.hosts.reboot-single for host cp3053.esams.wmnet	[production]
01:06	<sukhe@cumin2002>	START - Cookbook sre.hosts.reboot-single for host cp5002.eqsin.wmnet	[production]
01:02	<sukhe@cumin2002>	END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3063.esams.wmnet	[production]
00:58	<sukhe@cumin2002>	END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4034.ulsfo.wmnet	[production]
00:53	<sukhe@cumin2002>	END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5016.eqsin.wmnet	[production]
00:53	<sukhe@cumin2002>	START - Cookbook sre.hosts.reboot-single for host cp3063.esams.wmnet	[production]
00:51	<sukhe@cumin2002>	END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1084.eqiad.wmnet	[production]
00:51	<sukhe@cumin2002>	START - Cookbook sre.hosts.reboot-single for host cp4034.ulsfo.wmnet	[production]
00:50	<sukhe@cumin2002>	END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2042.codfw.wmnet	[production]
00:43	<sukhe@cumin2002>	START - Cookbook sre.hosts.reboot-single for host cp5016.eqsin.wmnet	[production]
00:42	<sukhe@cumin2002>	START - Cookbook sre.hosts.reboot-single for host cp1084.eqiad.wmnet	[production]
00:42	<sukhe@cumin2002>	START - Cookbook sre.hosts.reboot-single for host cp2042.codfw.wmnet	[production]
00:40	<sukhe@cumin2002>	END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4032.ulsfo.wmnet	[production]
00:39	<mutante>	gitlab1001 - mv 1648814678_2022_04_01_14.9.1_gitlab_backup.tar and other files from April 2nd/April 3rd over from /srv/gitlab-backup to /mnt/gitlab-backup to prevent another outage due to disk space T274463	[production]
00:36	<mutante>	gitlab2001 - apt-get clean to prevent disk space issues	[production]
00:34	<ladsgroup@cumin1001>	dbctl commit (dc=all): 'Depooling db1099:3311 (T298565)', diff saved to https://phabricator.wikimedia.org/P24076 and previous config saved to /var/cache/conftool/dbconfig/20220405-003419-ladsgroup.json	[production]
00:34	<ladsgroup@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance	[production]
00:34	<ladsgroup@cumin1001>	START - Cookbook sre.hosts.downtime for 6:00:00 on db1099.eqiad.wmnet with reason: Maintenance	[production]
00:34	<ladsgroup@cumin1001>	dbctl commit (dc=all): 'Repooling after maintenance db1163 (T298565)', diff saved to https://phabricator.wikimedia.org/P24075 and previous config saved to /var/cache/conftool/dbconfig/20220405-003405-ladsgroup.json	[production]
00:33	<sukhe@cumin2002>	START - Cookbook sre.hosts.reboot-single for host cp4032.ulsfo.wmnet	[production]
00:33	<dzahn@cumin2002>	conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1046.eqiad.wmnet	[production]
00:33	<dzahn@cumin2002>	conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1047.eqiad.wmnet	[production]
00:32	<mutante>	gitlab.wikimedia.org was down because gitlab1001 ran out of disk space. ran 'apt-get clean' to free 13G which made it recover... T274463 - <+icinga-wm> RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK	[production]
00:30	<mutante>	gitlab.wikimedia.org was down because gitlab1001 ran out of disk space. ran 'apt-get clean' to free 13G which made it recover...	[production]
00:27	<dzahn@cumin2002>	conftool action : set/pooled=yes; selector: dc=eqiad,name=wtp1048.eqiad.wmnet	[production]
00:23	<mutante>	wtp1046, wtp1047, wtp1048 - rebooting, one at a time	[production]
00:21	<dzahn@cumin2002>	conftool action : set/pooled=no; selector: dc=eqiad,name=wtp104[6-8].eqiad.wmnet	[production]
00:19	<ladsgroup@cumin1001>	dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P24074 and previous config saved to /var/cache/conftool/dbconfig/20220405-001900-ladsgroup.json	[production]
00:18	<sukhe@cumin2002>	END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5012.eqsin.wmnet	[production]
00:17	<sukhe@cumin2002>	END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3062.esams.wmnet	[production]