2020-04-12 §
10:18 <elukey> restart wdqs-updater on wdqs1004 (logs show no reports from the past hours, last ones were stack traces related to a json decode failure) [production]
06:35 <elukey@puppetmaster1001> conftool action : set/pooled=no; selector: name=restbase1025.eqiad.wmnet [production]
06:32 <elukey> powerdown restbase1025 - T250027 [production]
06:20 <elukey> powercycle restbase1025 (not reachable, serial console shows blank, racadm getsel reports errors with DIMM_B2) [production]
2020-04-11 §
09:30 <elukey@cumin1001> END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) [production]
09:20 <elukey@cumin1001> START - Cookbook sre.presto.roll-restart-workers [production]
2020-04-07 §
06:55 <elukey@cumin1001> END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [production]
06:53 <elukey@cumin1001> START - Cookbook sre.wdqs.data-transfer [production]
06:52 <elukey@cumin1001> END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [production]
05:26 <elukey@cumin1001> START - Cookbook sre.wdqs.data-transfer [production]
2020-04-06 §
19:05 <elukey@cumin1001> END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [production]
19:03 <elukey@cumin1001> START - Cookbook sre.wdqs.data-transfer [production]
19:00 <elukey@cumin1001> END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [production]
18:58 <elukey@cumin1001> START - Cookbook sre.wdqs.data-transfer [production]
18:57 <elukey@cumin1001> END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [production]
18:51 <elukey@cumin1001> START - Cookbook sre.wdqs.data-transfer [production]
18:42 <elukey@cumin1001> END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [production]
16:54 <elukey@cumin1001> START - Cookbook sre.wdqs.data-transfer [production]
15:04 <elukey@cumin1001> END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [production]
14:09 <elukey@cumin1001> START - Cookbook sre.wdqs.data-transfer [production]
14:07 <elukey@cumin1001> END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [production]
14:07 <elukey@cumin1001> START - Cookbook sre.wdqs.data-transfer [production]
13:26 <elukey> reboot stat1008 as test to verify ROCm 3.3 upgrades [production]
13:22 <elukey> stat1008 upgraded to ROCm 3.3 (enables Tensorflow 2.x) [production]
11:52 <elukey@cumin1001> END (PASS) - Cookbook sre.aqs.roll-restart (exit_code=0) [production]
11:48 <elukey@cumin1001> START - Cookbook sre.aqs.roll-restart [production]
11:18 <elukey> import AMD ROCm 3.3 packages in buster-wikimedia (component thirdparty/rocm33) - T247082 [production]
08:54 <elukey> bootstrap wdqs200[7,8] - T246343 [production]
07:35 <elukey> restart elasticsearch_6@cloudelastic-chi-eqiad on cloudelastic1003 as attempt to fix heavy GC runs (old gen) - T231517 [production]
2020-04-02 §
10:17 <elukey> set up TLS encryption for all pmacct instances on netflow* to Kafka Jumbo [production]
05:29 <elukey> powercycle analytics1045 (host not responsive to ssh, weird chars showed in mgmt serial console) [production]
2020-03-31 §
17:38 <elukey> restart elasticsearch_6@cloudelastic-chi-eqiad.service on cloudelastic1001 to see if it recovers from a thrashing/gc state - T231517 [production]
2020-03-29 §
08:24 <elukey> powercycle elastic1059 - mgmt/serial console stuck, no ssh - racadm getsel shows a lot of OEM errors occurred, nothing specific [production]
2020-03-28 §
16:54 <elukey> restart yarn on analytics1071 [production]
2020-03-27 §
07:36 <elukey> execute 'rm /etc/logrotate.d/ceph-common' on cloudvirt[1,2]* and cloudcontrol* to stop daily cronspam (file not in the puppet catalog anymore) [production]
2020-03-26 §
09:50 <elukey> reboot stat1008 - gpu + drivers in a weird state after multiple tests [production]
2020-03-24 §
07:33 <elukey> restart update-openstack-mirror.service on sodium [production]
2020-03-23 §
14:28 <elukey@deploy1001> helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' . [production]
14:28 <elukey@deploy1001> helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' . [production]
14:25 <elukey@deploy1001> helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' . [production]
14:25 <elukey@deploy1001> helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' . [production]
14:13 <elukey@deploy1001> helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'canary' . [production]
14:13 <elukey@deploy1001> helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'production' . [production]
11:27 <elukey> upload oozie 4.3.0-3 to thirdparty/bigtop14 on wikimedia-stretch - T244499 [production]
2020-03-20 §
11:10 <elukey> upload oozie 4.3.0-2 packages to thirdparty/bigtop14 on wikimedia-stretch [production]
07:46 <elukey> upload hadoop_2.8.5-2 (and related debs) to thirdparty/bigtop14 on wikimedia-stretch (manually rebuilt via docker after patch backports from upstream) [production]
2020-03-19 §
06:49 <elukey> execute 'sudo rm /etc/logrotate.d/ceph-common' on cloudvirt-dev and cloudcontrol-dev to stop daily cronspam [production]
2020-03-17 §
17:24 <elukey@deploy1001> Finished deploy [analytics/superset/deploy@3f3ddcb]: Upgrade PyHive to 0.6.2 (duration: 00m 43s) [production]
17:24 <elukey@deploy1001> Started deploy [analytics/superset/deploy@3f3ddcb]: Upgrade PyHive to 0.6.2 [production]
09:57 <elukey@cumin1001> END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) [production]