2017-06-20 §
17:16 <elukey> restart redis-instance-tcp_6380.service on rdb2004 to force sync with its master [production]
16:05 <elukey> reboot kafka1013 for kernel upgrade [production]
14:47 <elukey> rolling restart of druid100[123] for kernel upgrades [production]
14:05 <elukey> reboot kafka2001 for kernel upgrade [production]
12:00 <elukey> reboot analytics1029 -> analytics1069 for kernel upgrades (Hadoop worker nodes) [production]
10:03 <elukey> reboot kafka1012, analytics1028, aqs1004 for kernel upgrades (canary hosts) [production]
2017-06-19 §
12:04 <elukey> run 'echo "autoLearnMode=1" > /tmp/disable_learn && megacli -AdpBbuCmd -SetBbuProperties -f /tmp/disable_learn -a0' on all the analytics workers to disable BBU Auto learn - T167809 [production]
2017-06-14 §
07:04 <elukey> restart pdfrender on scb200[2,4] (xpra race condition) [production]
07:03 <elukey> restart pdfrender on scb1004 (xpra race condition) [production]
2017-06-13 §
10:11 <elukey> completed rollout of https://gerrit.wikimedia.org/r/354449 [production]
09:27 <elukey> puppet disabled on kafka*, analytics*, druid*, conf* for https://gerrit.wikimedia.org/r/354449 - incremental rollout [production]
06:55 <elukey> executed "cumin 'mw2*.codfw.wmnet' 'find /var/log/hhvm/* -user root -exec chown www-data:www-data {} \;'" to fix the last occurrences of wrongly-owned (root:adm) hhvm log files [production]
2017-06-12 §
08:22 <elukey> powercycle scb2005 (console frozen, host unresponsive) [production]
07:40 <elukey> restarted citoid on scb1001 (kept failing health checks for Error: write EPIPE) [production]
07:26 <elukey> ran restart-pdfrender on scb1001 (OOM errors in the dmesg from hours ago) [production]
07:22 <elukey> ran restart-pdfrender on scb1002 (OOM errors in the dmesg from hours ago) [production]
2017-06-11 §
14:14 <elukey> executed cumin 'mw22[51-60].codfw.wmnet' 'find /var/log/hhvm/* -user root -exec chown www-data:www-data {} \;' to reduce cron-spam (new hosts added in March) - T146464 [production]
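The chown cleanups above rely on find's `-user` test plus `-exec`. A minimal local sketch of the same plumbing, under stated assumptions: the paths are throwaway examples, and it re-chowns files to the current user so it runs unprivileged, whereas the production command targeted root-owned hhvm logs and re-owned them to www-data:

```shell
# Sketch of the find/-exec chown pattern from the log entries above.
# The production command selected root-owned hhvm logs and handed them
# to chown; here a temporary directory and the current user stand in
# so the same plumbing runs without root.
tmpdir=$(mktemp -d)
touch "$tmpdir/error.log" "$tmpdir/access.log"
# -user selects files by owner; -exec runs chown once per match ({}).
find "$tmpdir" -type f -user "$(id -un)" -exec chown "$(id -un)" {} \;
ls "$tmpdir"
rm -rf "$tmpdir"
```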
2017-06-09 §
07:51 <elukey> run megacli -LDSetProp -Direct -LALL -aALL on analytics[1058-1068] - T166140 [production]
07:26 <elukey> run megacli -LDSetProp ADRA -LALL -aALL on analytics[1058-1068] - T166140 [production]
07:15 <elukey> deleted /etc/logrotate.d/nova-manage from labtestvirt2003 to reduce cronspam (same solution used in T132422#2679434) [production]
2017-06-08 §
09:05 <elukey> upgrade zookeeper packages to 3.4.5+dfsg-2+deb8u2 on conf100[123], conf200[23] and druid100[123] [production]
2017-06-07 §
17:14 <elukey> restart nutcracker on thumbor1002 (too many connections approaching the 1024 ulimit) [production]
12:40 <elukey> upgrade zookeeper packages on conf2002 to 3.4.5+dfsg-2+deb8u2 [production]
2017-06-06 §
13:39 <elukey> shutdown analytics1033 and analytics1039 to replace their BBU - T166140 [production]
2017-06-02 §
04:42 <elukey> removed some old scap revs for the Analytics refinery on stat1002 to free space (git fat jars replicating after each deployment, known issue) [production]
2017-06-01 §
17:02 <elukey> stop mysql, eventlogging_sync and shutdown db1047 (analytics-store) for maintenance - T159266 [production]
15:03 <elukey> restart kafka100[23] for jvm upgrades [production]
05:58 <elukey> powercycle cp3032 - T166758 [production]
05:43 <elukey@puppetmaster1001> conftool action : set/pooled=no; selector: name=cp3032.esams.wmnet [production]
2017-05-31 §
07:47 <elukey> restart kafka on kafka10[14,22,20] for jvm upgrades [production]
2017-05-30 §
13:44 <elukey> restart kafka on kafka1013 for jvm upgrades [production]
13:21 <elukey> restart kafka on kafka1001 for jvm upgrades [production]
12:43 <elukey> restart kafka on kafka200[123] for jvm upgrades (main-codfw, eventbus) [production]
12:07 <elukey> restart kafka on kafka1012 for jvm upgrades [production]
08:23 <elukey> restart jmxtrans on all the kafka brokers (analytics+main-codfw/eqiad) for jvm upgrades [production]
08:17 <elukey> restart kafka on kafka1018 for jvm upgrades [production]
2017-05-26 §
12:44 <elukey> Restart Hadoop daemons on analytics100[12] (Hadoop master nodes) for jvm upgrades [production]
2017-05-25 §
13:04 <elukey> restart cassandra-a on aqs1004 to test https://gerrit.wikimedia.org/r/354107 [production]
10:01 <elukey> restart HDFS datanode daemons on all the hadoop worker nodes for jvm upgrades [production]
09:39 <elukey> reimage analytics1030 to Debian Jessie - T165529 [production]
09:35 <elukey> restart Yarn nodemanager daemons on all the hadoop worker nodes for jvm upgrades [production]
2017-05-24 §
13:54 <elukey> upgrade Druid daemons on druid100[123] to 0.10 - T164008 [production]
2017-05-23 §
12:47 <elukey@tin> Finished deploy [analytics/refinery@679aeea]: Updated stat1002 with the last refinery deployment (duration: 00m 42s) [production]
12:46 <elukey@tin> Started deploy [analytics/refinery@679aeea]: Updated stat1002 with the last refinery deployment [production]
12:46 <elukey@tin> Finished deploy [analytics/refinery@679aeea]: (no justification provided) (duration: 00m 01s) [production]
12:45 <elukey@tin> Started deploy [analytics/refinery@679aeea]: (no justification provided) [production]
11:56 <elukey> set vm.dirty_background_bytes=25165824 on aqs1004 as part of testing for https://gerrit.wikimedia.org/r/#/c/354107 (Rollback: set vm.dirty_background_ratio=10) [production]
09:15 <elukey> reverted manual hack on mw1161 with scap pull [production]
08:15 <elukey> apply manually https://gerrit.wikimedia.org/r/#/c/351854/2/wmf-config/jobqueue.php (persistent connections between hhvm and redis) to mw1161 as production test [production]
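The dirty-bytes tuning on aqs1004 logged at 11:56 above touches the kernel's writeback knobs: the sysctl keys are vm.dirty_background_bytes and vm.dirty_background_ratio, and setting the *_bytes form to a non-zero value makes the kernel ignore the *_ratio form. A hedged sketch (the sysctl -w lines need root on the target host, so they are left commented; only the size arithmetic runs):

```shell
# The value used in the log entry is exactly 24 MiB:
echo $((24 * 1024 * 1024))   # 25165824
# Apply on the target host (requires root); a non-zero *_bytes value
# overrides vm.dirty_background_ratio:
#   sysctl -w vm.dirty_background_bytes=25165824
# Rollback, as noted in the log entry: restore the ratio-based default.
#   sysctl -w vm.dirty_background_ratio=10
```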
2017-05-18 §
16:11 <elukey> upgraded cassandra-tools-wmf on aqs hosts [production]