2021-07-27 §
15:22 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2004.codfw.wmnet [production]
15:16 <elukey@cumin1001> START - Cookbook sre.hosts.reboot-single for host ml-serve2004.codfw.wmnet [production]
15:16 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2003.codfw.wmnet [production]
15:10 <elukey@cumin1001> START - Cookbook sre.hosts.reboot-single for host ml-serve2003.codfw.wmnet [production]
14:57 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2002.codfw.wmnet [production]
14:52 <elukey@cumin1001> START - Cookbook sre.hosts.reboot-single for host ml-serve2002.codfw.wmnet [production]
14:52 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve2001.codfw.wmnet [production]
14:47 <elukey@cumin1001> START - Cookbook sre.hosts.reboot-single for host ml-serve2001.codfw.wmnet [production]
14:40 <elukey@cumin1001> END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host ml-serve-ctrl2002.codfw.wmnet [production]
14:34 <elukey@cumin1001> START - Cookbook sre.hosts.reboot-single for host ml-serve-ctrl2002.codfw.wmnet [production]
14:33 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-serve-ctrl2001.codfw.wmnet [production]
14:29 <elukey> reduce vcores for ml-serve-ctrl[12]00[12] after performance testing - T287238 [production]
14:28 <elukey@cumin1001> START - Cookbook sre.hosts.reboot-single for host ml-serve-ctrl2001.codfw.wmnet [production]
12:56 <elukey> created component/iptables185 for buster-wikimedia + imported packages from buster-backports [production]
06:50 <elukey> install iptables from buster-backports (manually) on ml-serve-ctrl200[1,2] as test (+ reboot the nodes for a clean start) - T287238 [production]
2021-07-23 §
15:45 <elukey@deploy1002> helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [production]
15:44 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [production]
15:11 <elukey> stop ml-serve-ctrl1001 + gnt-instance modify -t plain ml-serve-ctrl1001.eqiad.wmnet on ganeti1009 + start instance back - T287238 [production]
08:24 <elukey> run 'gnt-instance modify -t plain ml-serve-ctrl1002.eqiad.wmnet' on ganeti1009 as test to track down latency/perf issues with kubelets [production]
2021-07-20 §
12:23 <elukey> reboot ml-serve-ctrl vms to pick up new vcores settings [production]
12:22 <elukey> bump vcpus from 2 to 4 on ml-serve-ctrl VMs on Ganeti (load/cpu usage increased steadily since we deployed kubelets on them) [production]
2021-07-19 §
07:11 <elukey> roll restart kafka mirror maker on kafka-main200* hosts - stuck after Friday's events/incident [production]
2021-07-15 §
08:29 <elukey> sudo rm /etc/rawdog/en/feeds/847a7185.state* on planet1002 (corrupted file) - backup in /home/elukey + restart planet-update-en.service [production]
07:23 <elukey> restart planet-update-en.service on planet1002 [production]
07:17 <elukey> remove /etc/rawdog/en/{state,state.lock} on planet1002 (following what rawdog suggested) due to corrupted files (backups available in /home/elukey/en) [production]
06:51 <elukey> restart phabricator_clean_tmp_files.service on phab1001 - transient error (tmp files already cleaned up) [production]
2021-07-14 §
14:13 <elukey> restart php-fpm on mw2370 [production]
2021-07-13 §
06:53 <elukey> systemctl reset-failed ifup@ens5 on gitlab2001 - T273026 [production]
2021-07-12 §
15:24 <elukey> expand ML k8s iBGP neighbors to include the master nodes (ref: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/704104) [production]
10:11 <elukey> add 10g disk to ml-serve-ctrl[12]00[12] for T285927 [production]
2021-07-04 §
08:02 <elukey> repool eqsin after equinix maintenance - T286113 [production]
2021-07-03 §
17:46 <elukey> depool eqsin due to loss of power redundancy (equinix maintenance) - T286113 [production]
2021-07-01 §
11:35 <elukey> reboot ml-serve-ctrl200[1,2] to increase vcpus/memory (1->2 vcores, 2->4g of memory) [production]
11:33 <elukey> reboot ml-serve-ctrl100[1,2] to increase vcpus/memory (1->2 vcores, 2->4g of memory) [production]
2021-06-29 §
08:47 <elukey> repool mw13[55,84] after debugging - T285634 [production]
08:46 <elukey@puppetmaster1001> conftool action : set/pooled=yes; selector: name=mw1384.eqiad.wmnet [production]
08:46 <elukey@puppetmaster1001> conftool action : set/pooled=yes; selector: name=mw1355.eqiad.wmnet [production]
08:25 <elukey> cumin 'A:mw-eqiad' '/usr/local/sbin/restart-php7.2-fpm' -b 2 -s 30 - T285634 [production]
08:21 <elukey> depool mw1355 (mw appserver) for debugging - T285634 [production]
08:21 <elukey@puppetmaster1001> conftool action : set/pooled=no; selector: name=mw1355.eqiad.wmnet [production]
2021-06-27 §
09:10 <elukey> cumin 'A:mw-eqiad and not P{mw13[67,54,55,72,33,50,51,73,52,49,53,65,71,84,68,70,66,91,89,97,95,99,85,93,87]*} and not P{mw14[09,03,11,07,05,01]*} and not P{mw12[61-69]*} and not P{mwdebug*}' '/usr/local/sbin/restart-php7.2-fpm' -b 1 -s 30 [production]
09:10 <elukey> roll restart the remaining mw appservers to clear out apcu fragmentation (cumin command to follow) [production]
08:37 <elukey> restart php-fpm on mw1268 mw1269 - low busy workers [production]
08:23 <elukey> restart php-fpm on mw1401 [production]
2021-06-26 §
16:37 <elukey> restart php-fpm on mw1387 [production]
15:43 <elukey> restart php-fpm on mw1393 [production]
15:39 <elukey> restart php-fpm on mw1405 mw1399 mw1385 [production]
15:37 <elukey> restart php-fpm on mw1397 mw1395 mw1411 mw1407 [production]
15:31 <elukey> restart php-fpm on mw1391 mw1389 mw1403 [production]
13:49 <elukey> restart php-fpm on mw1368 mw1370 mw1366 mw1409 [production]