production SAL


2021-12-02
11:01	<marostegui@cumin1001>	START - Cookbook sre.hosts.downtime for 1:00:00 on db2089.codfw.wmnet with reason: Maintenance T277354	[production]
11:01	<marostegui@cumin1001>	dbctl commit (dc=all): 'After maintenance db2075 (T277354)', diff saved to https://phabricator.wikimedia.org/P17970 and previous config saved to /var/cache/conftool/dbconfig/20211202-110110-marostegui.json	[production]
10:46	<marostegui@cumin1001>	dbctl commit (dc=all): 'After maintenance db2075 (T277354)', diff saved to https://phabricator.wikimedia.org/P17969 and previous config saved to /var/cache/conftool/dbconfig/20211202-104606-marostegui.json	[production]
10:31	<marostegui@cumin1001>	dbctl commit (dc=all): 'After maintenance db2075 (T277354)', diff saved to https://phabricator.wikimedia.org/P17968 and previous config saved to /var/cache/conftool/dbconfig/20211202-103100-marostegui.json	[production]
10:15	<marostegui@cumin1001>	dbctl commit (dc=all): 'After maintenance db2075 (T277354)', diff saved to https://phabricator.wikimedia.org/P17967 and previous config saved to /var/cache/conftool/dbconfig/20211202-101555-marostegui.json	[production]
10:15	<marostegui@cumin1001>	dbctl commit (dc=all): 'Depooling db2075 (T277354)', diff saved to https://phabricator.wikimedia.org/P17966 and previous config saved to /var/cache/conftool/dbconfig/20211202-101522-marostegui.json	[production]
10:15	<marostegui@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2075.codfw.wmnet with reason: Maintenance T277354	[production]
10:15	<marostegui@cumin1001>	START - Cookbook sre.hosts.downtime for 1:00:00 on db2075.codfw.wmnet with reason: Maintenance T277354	[production]
10:05	<marostegui@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 9 hosts with reason: Maintenance T277354	[production]
10:05	<marostegui@cumin1001>	START - Cookbook sre.hosts.downtime for 1:00:00 on 9 hosts with reason: Maintenance T277354	[production]
10:03	<marostegui@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance T277354	[production]
10:03	<marostegui@cumin1001>	START - Cookbook sre.hosts.downtime for 1:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance T277354	[production]
10:03	<marostegui@cumin1001>	dbctl commit (dc=all): 'After maintenance db1096:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P17964 and previous config saved to /var/cache/conftool/dbconfig/20211202-100307-marostegui.json	[production]
09:52	<moritzm>	draining primary/secondary instances off ganeti2009 T296622	[production]
09:48	<marostegui@cumin1001>	dbctl commit (dc=all): 'After maintenance db1096:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P17963 and previous config saved to /var/cache/conftool/dbconfig/20211202-094802-marostegui.json	[production]
09:32	<marostegui@cumin1001>	dbctl commit (dc=all): 'After maintenance db1096:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P17962 and previous config saved to /var/cache/conftool/dbconfig/20211202-093257-marostegui.json	[production]
09:27	<jmm@cumin2002>	END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti2010.codfw.wmnet to ganeti01.svc.codfw.wmnet	[production]
09:27	<jmm@cumin2002>	START - Cookbook sre.ganeti.addnode for new host ganeti2010.codfw.wmnet to ganeti01.svc.codfw.wmnet	[production]
09:17	<marostegui@cumin1001>	dbctl commit (dc=all): 'After maintenance db1096:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P17961 and previous config saved to /var/cache/conftool/dbconfig/20211202-091753-marostegui.json	[production]
09:16	<marostegui@cumin1001>	dbctl commit (dc=all): 'Depooling db1096:3315 (T277354)', diff saved to https://phabricator.wikimedia.org/P17960 and previous config saved to /var/cache/conftool/dbconfig/20211202-091629-marostegui.json	[production]
09:16	<marostegui@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1096.eqiad.wmnet with reason: Maintenance T277354	[production]
09:16	<marostegui@cumin1001>	START - Cookbook sre.hosts.downtime for 1:00:00 on db1096.eqiad.wmnet with reason: Maintenance T277354	[production]
08:51	<jmm@cumin2002>	END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2010.codfw.wmnet	[production]
08:45	<jmm@cumin2002>	START - Cookbook sre.hosts.reboot-single for host ganeti2010.codfw.wmnet	[production]
08:34	<jmm@cumin2002>	END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti2010.codfw.wmnet with OS buster	[production]
08:29	<dcausse>	restarting blazegraph on wdqs1007 (jvm stuck for 4h)	[production]
08:03	<jmm@cumin2002>	START - Cookbook sre.hosts.reimage for host ganeti2010.codfw.wmnet with OS buster	[production]
02:50	<andrew@cumin1001>	END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1028.eqiad.wmnet with OS buster	[production]
02:43	<andrew@cumin1001>	START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster	[production]
02:40	<andrew@cumin1001>	END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1028.eqiad.wmnet with OS buster	[production]
02:15	<andrew@cumin1001>	START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster	[production]
02:14	<andrew@cumin1001>	END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1028.eqiad.wmnet with OS buster	[production]
01:52	<andrew@cumin1001>	START - Cookbook sre.hosts.reimage for host cloudvirt1028.eqiad.wmnet with OS buster	[production]
01:21	<ryankemper>	T280001 Rolling restart of low-traffic pybal hosts complete. All of `wcqs` is pooled and the pybal / ipvs related alerts have cleared	[production]
01:16	<ryankemper>	T280001 Pooled `wcqs200[1-3]` (had been left unpooled from when we last removed wcqs from production)	[production]
01:12	<ryankemper>	T280001 Restarting pybal on low-traffic primaries `lvs2009` and `lvs1015`: `ryankemper@cumin1001:~$ sudo cumin 'P{lvs2009,lvs1015}' 'sudo systemctl restart pybal'`	[production]
01:11	<ryankemper>	T280001 Waited 120s and checked https://icinga.wikimedia.org/alerts, proceeding to primary low-traffic hosts `lvs2009` and `lvs1015`	[production]
01:08	<ryankemper>	T280001 Sanity check of `sudo ipvsadm -L -n` on backup `lvs2010` and `lvs1016` looks good (for ex `lvs1016` has `TCP 10.2.2.67:443 wrr`)	[production]
01:07	<ryankemper>	T280001 Restarting pybal on low-traffic backups: `ryankemper@cumin1001:~$ sudo cumin 'P{lvs2010,lvs1016}' 'sudo systemctl restart pybal'`	[production]
01:02	<ryankemper>	T280001 `ryankemper@cumin1001:~$ sudo cumin 'O:lvs::balancer' 'sudo run-puppet-agent'`	[production]
01:01	<ryankemper>	T280001 Merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/742841	[production]
01:00	<ryankemper>	T280001 About to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/742841 to bring `wcqs` into state `lvs_setup`, after which I'll perform a rolling restart of pybal	[production]
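The T280001 entries above (01:00 to 01:21, read bottom to top) record a rolling restart of pybal: backups first, a 120s wait with checks, then primaries. A minimal Python sketch of that ordering follows. The function names and the injected `run` and `wait` callables are hypothetical and not part of cumin or any Wikimedia tooling; only the command string shape is taken from the logged entries.

```python
from typing import Callable, List, Sequence


def cumin_restart_command(hosts: Sequence[str], service: str = "pybal") -> str:
    """Build a command string shaped like the ones in the log entries.

    Hypothetical helper: it only formats text and does not call cumin.
    """
    query = "P{" + ",".join(hosts) + "}"
    return f"sudo cumin '{query}' 'sudo systemctl restart {service}'"


def rolling_restart(
    backups: Sequence[str],
    primaries: Sequence[str],
    run: Callable[[str], None],
    wait: Callable[[int], None],
    settle_seconds: int = 120,
) -> List[str]:
    """Restart backups, pause for checks, then restart primaries.

    `run` executes a command string and `wait` pauses for N seconds;
    both are supplied by the caller (assumption, for illustration only).
    """
    executed = []

    # Step 1: restart the backup hosts first, as in the 01:07 entry.
    backup_cmd = cumin_restart_command(backups)
    run(backup_cmd)
    executed.append(backup_cmd)

    # Step 2: give the operator time to check ipvsadm output and alerts
    # (the log mentions a 120s wait before moving on).
    wait(settle_seconds)

    # Step 3: restart the primary hosts, as in the 01:12 entry.
    primary_cmd = cumin_restart_command(primaries)
    run(primary_cmd)
    executed.append(primary_cmd)

    return executed


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    rolling_restart(
        ["lvs2010", "lvs1016"],
        ["lvs2009", "lvs1015"],
        run=print,
        wait=lambda seconds: print(f"(would wait {seconds}s and check alerts)"),
    )
```

Passing `print` as `run` gives a dry run that shows the order of operations without touching any host.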
00:24	<urbanecm@deploy1002>	Synchronized php-1.38.0-wmf.9/skins/Vector/: a7586cd4a2559248ea1fd29cf74de535de016501: Update scroll observer to allow event logging (T292586) (duration: 00m 57s)	[production]
2021-12-01
22:15	<otto@deploy1002>	Finished deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) (duration: 00m 07s)	[production]
22:15	<otto@deploy1002>	Started deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided)	[production]
22:13	<otto@deploy1002>	Finished deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) (duration: 00m 07s)	[production]
22:13	<otto@deploy1002>	Started deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided)	[production]
22:12	<otto@deploy1002>	Finished deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided) (duration: 00m 07s)	[production]
22:12	<otto@deploy1002>	Started deploy [airflow-dags/analytics@2f59257] (hadoop-test): (no justification provided)	[production]