2017-11-16 §
14:50 <elukey> updating puppet compiler's facts (following https://wikitech.wikimedia.org/w/index.php?title=Nova_Resource:Puppet3-diffs#FAQ) [production]
13:07 <elukey> restart aqs on aqs100[5-9] to apply localQuorum (https://gerrit.wikimedia.org/r/391765) - T164348 [production]
09:44 <elukey> restart aqs on aqs1004 to apply localQuorum (https://gerrit.wikimedia.org/r/391765) - T164348 [production]
2017-11-15 §
12:41 <elukey> re-enable eventlogging after maintenance [production]
12:09 <elukey> executed sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=65 on all jobrunners [production]
09:08 <elukey> stop eventlogging on eventlog1001, eventlogging replication on db1108/db1047/dbstore1002 as preparation steps to migrate the log db from db1046 to db1107 [production]
08:51 <elukey> reboot thorium (hosting all analytics websites) for kernel updates [production]
2017-11-14 §
10:32 <elukey> removed old target configs from /srv/prometheus/analytics/targets on prometheus100[34] after https://gerrit.wikimedia.org/r/391179 [production]
2017-11-13 §
18:20 <elukey> drain + shutdown analytics1029 as prep step to replace the BBU - T178742 [production]
11:18 <elukey> restart of all the druid daemons on druid100[1-6] to apply the new prometheus jmx jvm exporters - T177459 [production]
09:02 <elukey> restart of druid brokers on druid100[1-6] to apply https://gerrit.wikimedia.org/r/390419 [production]
2017-11-08 §
10:28 <elukey@puppetmaster1001> conftool action : set/pooled=no; selector: name=aqs1005.eqiad.wmnet [production]
10:18 <elukey@puppetmaster1001> conftool action : set/pooled=no; selector: name=aqs1004.eqiad.wmnet [production]
10:07 <elukey> reboot aqs100[4-9] for jvm and kernel updates [production]
2017-11-07 §
17:55 <elukey> stop ircecho on einsteinium (puppet shower from nitrogen) [production]
14:26 <elukey> rolling restart of kafka on kafka-jumbo* for jvm security updates [production]
10:24 <elukey> create staging database on db1108 (researchers scratch pad) - T177405 [production]
2017-11-06 §
13:03 <elukey> rolling restart of druid historical daemons on druid100[1-6] to apply https://gerrit.wikimedia.org/r/#/c/389429 [production]
2017-11-04 §
13:34 <elukey@puppetmaster1001> conftool action : set/pooled=yes; selector: name=scb1002.eqiad.wmnet [production]
13:19 <elukey@puppetmaster1001> conftool action : set/pooled=no; selector: name=scb1002.eqiad.wmnet [production]
2017-11-03 §
08:16 <elukey> drop CommandInvocation_15243810 and CommandInvocation_15243810_15423246 from analytics dbs (db1046/db1047/db1108/dbstore1002) - data archived on HDFS - T166712 [production]
07:29 <elukey> depooled mw1191 and powercycle - host down with 'CPU 1 check errors' in racadm getsel [production]
2017-11-02 §
11:43 <elukey> drop table log.PageContentSaveComplete_5588433 from db1046,db1047,db1108,dbstore1002, archived on hdfs - T177101 [production]
11:26 <elukey> drop log.MediaViewer_10867062_15423246 from db1047,db1108 since already archived in hdfs - T168303 [production]
2017-11-01 §
08:35 <elukey> forced umount/mount for /mnt/hdfs on stat1005 (not working after repeated oom kill actions) [production]
2017-10-30 §
08:42 <elukey> raised priority of refreshlink and htmlcacheupdate job execution on jobrunners (https://gerrit.wikimedia.org/r/#/c/386636/) - T173710 [production]
2017-10-28 §
16:51 <elukey> restart varnish backend on cp1055 - mailbox lag + T179156 [production]
12:14 <elukey@puppetmaster1001> conftool action : set/pooled=yes; selector: name=mw1313.eqiad.wmnet [production]
12:10 <elukey> manually killed (SIGTERM) hhvm on mw1313 - high load, hhvm-dump-debug not responsive [production]
12:01 <elukey@puppetmaster1001> conftool action : set/pooled=no; selector: name=mw1313.eqiad.wmnet [production]
11:53 <elukey> restart hhvm on mw1285 - hhvm-dump-debug in /tmp/hhvm.17700.bt [production]
2017-10-27 §
12:03 <elukey> execute systemctl reset-failed kafka-mirror-main-eqiad_to_jumbo-eqiad.service on kafka-jumbo hosts (old unit not deployed anymore) [production]
2017-10-25 §
13:30 <elukey> restart yarn nodemanager and hdfs datanode on analytics1030 to apply new JVM settings - T178876 [production]
2017-10-24 §
15:58 <elukey> drop MediaViewer_10867062_15423246 and MobileWebUIClickTracking_10742159_15423246 from the log database on db1046 (archived on hadoop) - T168303 [production]
15:48 <elukey> drop table Edit_13457736_15423246 from the log database (Eventlogging) on db104[6,7], dbstore1002 [production]
15:39 <elukey> set net.netfilter.nf_conntrack_tcp_timeout_time_wait=65 to mw[1308-1311] - T136094 [production]
2017-10-23 §
15:59 <elukey> forced BBU learn cycle on db1046 - T166141 [production]
2017-10-18 §
16:29 <elukey> stop ircecho on einsteinium for puppet shower [production]
09:30 <elukey> drop MobileWikiAppToCInteraction_10375484_15423246 from the log database on dbstore1002,db1047,db1046 - T177960 [production]
2017-10-12 §
13:48 <elukey> deployed the new Analytics Public Druid cluster - T176223 [production]
2017-10-11 §
15:39 <elukey> drop tables listed in https://phabricator.wikimedia.org/T171629#3674250 from db1046, db1047, dbstore1002 [production]
08:38 <elukey> reboot kafka-jumbo hosts for kernel updates [production]
2017-10-10 §
14:22 <elukey> add druid public cluster's IPs to analytics-in4 on cr1/cr2 - T177511 [production]
2017-10-09 §
10:15 <elukey> rebooting snapshot1001 due to stuck NFS - T169680 [production]
2017-10-08 §
08:19 <elukey> restart varnish backend on cp4026 to stop 503s [production]
2017-10-06 §
11:10 <elukey> rolling restart of all the druid daemons on druid100[1-6] to pick up new logging changes [production]
10:43 <elukey> restart replication (mysql + eventlogging_sync) on dbstore1002 after mysql restart [production]
09:53 <elukey> stop replication (eventlogging_sync + mysql) on dbstore1002 as prep step for mysql restart [production]
2017-10-05 §
05:57 <elukey> restart varnish backend on cp3040 [production]
05:42 <elukey> restart varnish backend on cp3030 [production]