production SAL

2021-07-27 §
15:22	<elukey@cumin1001>	END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2004.codfw.wmnet	[production]
15:16	<elukey@cumin1001>	START - Cookbook sre.hosts.reboot-single for host ml-serve2004.codfw.wmnet	[production]
15:16	<elukey@cumin1001>	END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2003.codfw.wmnet	[production]
15:10	<elukey@cumin1001>	START - Cookbook sre.hosts.reboot-single for host ml-serve2003.codfw.wmnet	[production]
14:57	<elukey@cumin1001>	END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2002.codfw.wmnet	[production]
14:52	<elukey@cumin1001>	START - Cookbook sre.hosts.reboot-single for host ml-serve2002.codfw.wmnet	[production]
14:52	<elukey@cumin1001>	END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2001.codfw.wmnet	[production]
14:47	<elukey@cumin1001>	START - Cookbook sre.hosts.reboot-single for host ml-serve2001.codfw.wmnet	[production]
14:40	<elukey@cumin1001>	END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ml-serve-ctrl2002.codfw.wmnet	[production]
14:34	<elukey@cumin1001>	START - Cookbook sre.hosts.reboot-single for host ml-serve-ctrl2002.codfw.wmnet	[production]
14:33	<elukey@cumin1001>	END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve-ctrl2001.codfw.wmnet	[production]
14:29	<elukey>	reduce vcores for ml-serve-ctrl[12]00[12] after performance testing - T287238	[production]
14:28	<elukey@cumin1001>	START - Cookbook sre.hosts.reboot-single for host ml-serve-ctrl2001.codfw.wmnet	[production]
12:56	<elukey>	created component/iptables185 for buster-wikimedia + imported packages from buster-backports	[production]
06:50	<elukey>	install iptables from buster-backports (manually) on ml-serve-ctrl200[1,2] as test (+ reboot the nodes for a clean start) - T287238	[production]
2021-07-23 §
15:45	<elukey@deploy1002>	helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.	[production]
15:44	<elukey@deploy1002>	helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.	[production]
15:11	<elukey>	stop ml-serve-ctrl1001 + gnt-instance modify -t plain ml-serve-ctrl1001.eqiad.wmnet on ganeti1009 + start instance back - T287238	[production]
08:24	<elukey>	run 'gnt-instance modify -t plain ml-serve-ctrl1002.eqiad.wmnet' on ganeti1009 as test to track down latency/perf issues with kubelets	[production]
2021-07-20 §
12:23	<elukey>	reboot ml-serve-ctrl vms to pick up new vcores settings	[production]
12:22	<elukey>	bump vcpus from 2 to 4 on ml-serve-ctrl VMs on Ganeti (load/cpu usage increased steadily since we deployed kubelets on them)	[production]
2021-07-19 §
07:11	<elukey>	roll restart kafka mirror maker on kafka-main200* hosts - stuck after Friday's events/incident	[production]
2021-07-15 §
08:29	<elukey>	sudo rm /etc/rawdog/en/feeds/847a7185.state* on planet1002 (corrupted file) - backup in /home/elukey + restart planet-update-en.service	[production]
07:23	<elukey>	restart planet-update-en.service on planet1002	[production]
07:17	<elukey>	remove /etc/rawdog/en/{state,state.lock} on planet1002 (following what rawdog suggested) due to corrupted files (backups available in /home/elukey/en)	[production]
06:51	<elukey>	restart phabricator_clean_tmp_files.service on phab1001 - transient error (tmp files already cleaned up)	[production]
2021-07-14 §
14:13	<elukey>	restart php-fpm on mw2370	[production]
2021-07-13 §
06:53	<elukey>	systemctl reset-failed ifup@ens5 on gitlab2001 - T273026	[production]
2021-07-12 §
15:24	<elukey>	expand ML k8s iBGP neighbors to include the master nodes (ref: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/704104)	[production]
10:11	<elukey>	add 10g disk to ml-serve-ctrl[12]00[12] for T285927	[production]
2021-07-04 §
08:02	<elukey>	repool eqsin after equinix maintenance - T286113	[production]
2021-07-03 §
17:46	<elukey>	depool eqsin due to loss of power redundancy (equinix maintenance) - T286113	[production]
2021-07-01 §
11:35	<elukey>	reboot ml-serve-ctrl200[1,2] to increase vcpus/memory (1->2 vcores, 2->4g of memory)	[production]
11:33	<elukey>	reboot ml-serve-ctrl100[1,2] to increase vcpus/memory (1->2 vcores, 2->4g of memory)	[production]
2021-06-29 §
08:47	<elukey>	repool mw13[55,84] after debugging - T285634	[production]
08:46	<elukey@puppetmaster1001>	conftool action : set/pooled=yes; selector: name=mw1384.eqiad.wmnet	[production]
08:46	<elukey@puppetmaster1001>	conftool action : set/pooled=yes; selector: name=mw1355.eqiad.wmnet	[production]
08:25	<elukey>	cumin 'A:mw-eqiad' '/usr/local/sbin/restart-php7.2-fpm' -b 2 -s 30 - T285634	[production]
08:21	<elukey>	depool mw1355 (mw appserver) for debugging - T285634	[production]
08:21	<elukey@puppetmaster1001>	conftool action : set/pooled=no; selector: name=mw1355.eqiad.wmnet	[production]
2021-06-27 §
09:10	<elukey>	cumin 'A:mw-eqiad and not P{mw13[67,54,55,72,33,50,51,73,52,49,53,65,71,84,68,70,66,91,89,97,95,99,85,93,87]} and not P{mw14[09,03,11,07,05,01]} and not P{mw12[61-69]} and not P{mwdebug}' '/usr/local/sbin/restart-php7.2-fpm' -b 1 -s 30	[production]
09:10	<elukey>	roll restart the remaining mw appservers to clear out apcu fragmentation (cumin command to follow)	[production]
08:37	<elukey>	restart php-fpm on mw1268 mw1269 - low busy workers	[production]
08:23	<elukey>	restart php-fpm on mw1401	[production]
2021-06-26 §
16:37	<elukey>	restart php-fpm on mw1387	[production]
15:43	<elukey>	restart php-fpm on mw1393	[production]
15:39	<elukey>	restart php-fpm on mw1405 mw1399 mw1385	[production]
15:37	<elukey>	restart php-fpm on mw1397 mw1395 mw1411 mw1407	[production]
15:31	<elukey>	restart php-fpm on mw1391 mw1389 mw1403	[production]
13:49	<elukey>	restart php-fpm on mw1368 mw1370 mw1366 mw1409	[production]