2021-04-21 §
06:42 <elukey> upload hue_4.9.0-2+deb10u1 to buster-wikimedia [production]
2021-04-16 §
07:53 <elukey> run reprepro --delete clearvanished on apt1001 to clear all cloudera packages [production]
05:54 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics-tool1001.eqiad.wmnet [production]
05:42 <elukey@cumin1001> START - Cookbook sre.hosts.decommission for hosts analytics-tool1001.eqiad.wmnet [production]
2021-04-15 §
15:09 <elukey@deploy1002> Finished deploy [analytics/refinery@497f6a5]: Regular analytics weekly train (duration: 13m 12s) [production]
14:56 <elukey@deploy1002> Started deploy [analytics/refinery@497f6a5]: Regular analytics weekly train [production]
10:21 <elukey> Add kafka-logging100{2,3} to the kafka term in the analytics filters on cr1/cr2 eqiad - ref: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/679740 [production]
06:32 <elukey> move hue.wikimedia.org to an-tool1009 (from analytics-tool1001) [production]
2021-04-14 §
12:39 <elukey> update kafka term for analytics-in{4,6} on cr{1,2}-eqiad to include kafka-logging1001 - ref: https://gerrit.wikimedia.org/r/c/operations/homer/public/+/679296 [production]
2021-04-08 §
16:33 <elukey> reboot an-worker1100 again to check if all the disks come up correctly [production]
15:36 <elukey> reboot an-worker1100 to see if it helps with the strange BBU behavior [production]
06:44 <elukey@deploy1002> Finished deploy [analytics/refinery@1dbbd3d] (hadoop-test): (no justification provided) (duration: 02m 20s) [production]
06:41 <elukey@deploy1002> Started deploy [analytics/refinery@1dbbd3d] (hadoop-test): (no justification provided) [production]
2021-04-07 §
15:39 <elukey@cumin1001> END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) [production]
15:30 <elukey@cumin1001> START - Cookbook sre.aqs.roll-restart [production]
2021-04-06 §
09:52 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-coord1002.eqiad.wmnet with reason: REIMAGE [production]
09:50 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-coord1002.eqiad.wmnet with reason: REIMAGE [production]
2021-04-03 §
16:44 <elukey> power reset for ms-be2028 - not reachable via ssh, no tty available via mgmt console, NMI unrecoverable errors logged in iLo's system logs [production]
2021-04-02 §
14:30 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-coord1001.eqiad.wmnet with reason: REIMAGE [production]
14:28 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-coord1001.eqiad.wmnet with reason: REIMAGE [production]
13:14 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1001.eqiad.wmnet with reason: REIMAGE [production]
13:12 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1001.eqiad.wmnet with reason: REIMAGE [production]
10:54 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1002.eqiad.wmnet with reason: REIMAGE [production]
10:52 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1002.eqiad.wmnet with reason: REIMAGE [production]
07:28 <elukey> manual fix for an-worker1080's interface in netbox (xe-4/0/11), moved by mistake to public-1b [production]
2021-04-01 §
14:06 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-worker1001.eqiad.wmnet with reason: REIMAGE [production]
14:04 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1001.eqiad.wmnet with reason: REIMAGE [production]
06:37 <elukey> powercycle cp1087 (no ssh, no tty via serial console) - T278729 [production]
06:35 <elukey@puppetmaster1001> conftool action : set/pooled=no; selector: name=cp1087.eqiad.wmnet [production]
2021-03-30 §
07:37 <elukey> restart-php7.2-fpm on mw1304, jobrunner completely overwhelmed by ffmpeg/transcode jobs (not publishing metrics, erroring out for memcached timeouts) - T278734 [production]
06:06 <elukey> powercycle cp1087 (no ssh, no mgmt console tty) [production]
06:04 <elukey@puppetmaster1001> conftool action : set/pooled=no; selector: name=cp1087.eqiad.wmnet [production]
2021-03-27 §
19:25 <elukey> powercycle elastic1060 - T278630 [production]
2021-03-25 §
08:12 <elukey> upgrade hive packages in thirdparty/bigtop15 to 2.3.6-2 for buster-wikimedia [production]
08:11 <elukey> upgrade hive packages in thirdparty/bigtop15 to 2.3.6-2 [production]
2021-03-24 §
07:41 <elukey@cumin1001> END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host ml-etcd2002.codfw.wmnet [production]
07:27 <elukey@cumin1001> START - Cookbook sre.ganeti.makevm for new host ml-etcd2002.codfw.wmnet [production]
07:20 <elukey@cumin1001> END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts ml-etcd2002.codfw.wmnet [production]
07:10 <elukey@cumin1001> START - Cookbook sre.hosts.decommission for hosts ml-etcd2002.codfw.wmnet [production]
2021-03-23 §
13:54 <elukey> sudo systemctl reload apache2 on prometheus[12]00[34] to pick up new k8s-mlserve instance settings [production]
07:36 <elukey> create a 50g lvm volume on prometheus[12]00[34] for the k8s-mlserve cluster - T272918 [production]
2021-03-22 §
14:14 <elukey@deploy1002> helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [production]
14:14 <elukey@deploy1002> helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [production]
11:15 <elukey@deploy1002> helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [production]
11:15 <elukey@deploy1002> helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [production]
11:15 <elukey@deploy1002> helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [production]
11:14 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [production]
10:48 <elukey@deploy1002> helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [production]
10:48 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [production]
10:47 <elukey@deploy1002> helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [production]