2017-05-18
13:14 <elukey> AMEND prev: reloaded kafkatee on oxygen [production]
13:14 <elukey> reloaded kafkatee to test T151748 [production]
2017-05-17
13:42 <elukey> shutdown analytics1030 for T165529 [production]
2017-05-11
12:19 <elukey> reboot kafka100[23] for kernel upgrades (kafka main-eqiad, eventbus eqiad) [production]
2017-05-10
20:55 <elukey> restart hhvm on mw1268 (HHVM 3.12, HPHP::Treadmill::getAgeOldestRequest issue) [production]
14:57 <elukey> reboot kafka1001 for kernel upgrades (kafka main-eqiad, eventbus eqiad) [production]
13:53 <elukey> reboot kafka200[23] for kernel upgrades (kafka main-codfw cluster, eventbus codfw) [production]
2017-05-09
17:29 <elukey> executing varnish-backend-restart on cp1072 as attempt to mitigate "FetchError Could not get storage" and "ExpKill LRU_Fail" - T145661 [production]
17:25 <elukey> executing varnish-backend-restart on cp1074 as attempt to mitigate "FetchError Could not get storage" and "ExpKill LRU_Fail" - T145661 [production]
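For context: varnish-backend-restart is a WMF helper script, presumably run on the cache host itself; a minimal sketch of the assumed invocation (flags, if any, are not recorded in the log):
    # restart the backend varnishd instance to free its storage (frontend stays up)
    sudo varnish-backend-restart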
16:08 <elukey> playing with mw2146 for T163674 [production]
16:00 <elukey> stopping Hadoop daemons and shutting down analytics[1032-1033,1040].eqiad.wmnet - T132256 [production]
14:16 <elukey> correction: reboot kafka2001 for kernel upgrades (eventbus codfw) [production]
14:16 <elukey> reboot kafka1001 for kernel upgrades (eventbus codfw) [production]
11:03 <elukey> forced net.netfilter.nf_conntrack_tcp_timeout_time_wait = 65 on all the kafka brokers [production]
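The entry above corresponds to a one-line sysctl per broker; a sketch, assuming ad-hoc application rather than a puppet change:
    # shorten the conntrack TIME_WAIT timeout from the 120s default to 65s
    sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=65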
10:34 <elukey> reboot kafka1022 for kernel upgrades [production]
10:09 <elukey> reboot kafka1020 for kernel upgrades [production]
07:11 <elukey> reboot kafka1014 for kernel upgrades [production]
2017-05-08
09:25 <elukey> rolling restart of cassandra on aqs* hosts to pick up the new JVM upgrade [production]
08:55 <elukey> restart Kafka mirror maker on kafka101[24] [production]
08:47 <elukey> reboot kafka1013 for kernel upgrades [production]
2017-05-07
21:09 <elukey> depooled cp4016.ulsfo.wmnet (sudo -i depool from localhost) due to issues with vhtcpd (segfaults in dmesg). [production]
08:43 <elukey> depooled cp4018.ulsfo.wmnet (sudo -i depool from localhost) due to issues with HTCP [production]
2017-05-05
15:18 <elukey> increase nginx error log verbosity on mw2146 as test for T163674 (correct task) [production]
15:13 <elukey> increase nginx error log verbosity on mw2146 as test for T164586 [production]
12:16 <elukey> reboot kafka1018 for kernel upgrades [production]
09:00 <elukey> re-arm keyholder on mira (new scap key added for librenms) [production]
08:48 <elukey> re-arming keyholder on naos [production]
2017-05-04
10:22 <elukey> executed DEL ocg_job_status on rdb1007:6379 (new ocg_job_status hash is stored on the ocg* hosts) - T159850 [production]
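The DEL above was presumably issued via redis-cli; a minimal sketch (hostname and port as logged, auth options omitted):
    # drop the stale ocg_job_status hash; the live copy is now kept on the ocg* hosts (T159850)
    redis-cli -h rdb1007 -p 6379 DEL ocg_job_status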
09:40 <elukey> stop kafka on kafka1012 and reboot the host for kernel upgrade [production]
2017-05-03
14:12 <elukey> restart kafka-mirror-main-eqiad_to_analytics.service on kafka1012 [production]
09:19 <elukey> reboot mc[1019-1036].eqiad.wmnet for kernel upgrades [production]
2017-05-02
16:16 <elukey@naos> Synchronized wmf-config/ProductionServices.php: Replace Redis lock IPs after hw refresh (duration: 01m 16s) [production]
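The 'Synchronized' line above is scap output; the underlying command was presumably of this shape (message as logged):
    # push a single config file to the MediaWiki fleet and log the sync
    scap sync-file wmf-config/ProductionServices.php 'Replace Redis lock IPs after hw refresh'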
15:01 <elukey> stopped and masked memcached on mc10[01-18].eqiad.wmnet [production]
10:20 <elukey> restart ocg on ocg1002 (localhost:8000 - frontend - not reachable) [production]
08:40 <elukey> run puppet and restart nutcracker on eqiad hosts with profile::mediawiki::nutcracker [production]
08:32 <elukey> stop and mask redis on mc1001-mc1018 - T137345 [production]
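Stopping and masking maps onto two systemctl calls per host; a sketch assuming a plain redis-server unit (multi-instance hosts may use per-port unit names instead):
    # stop redis and prevent it from being started again, even as a dependency
    sudo systemctl stop redis-server
    sudo systemctl mask redis-server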
07:59 <elukey> Swap mc1001->mc1012 with mc1019->mc2030 - T137345 (more informative :) [production]
07:58 <elukey> Swap mc1001->mc1012 with mc1019->mc2030 [production]
2017-04-30
15:31 <elukey> set tombstone_failure_threshold=1000 on restbase1009-a with P5165 - T160759 [production]
15:24 <elukey> set tombstone_failure_threshold=10000 on restbase1009-a with P5165 - T160759 [production]
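tombstone_failure_threshold lives in cassandra.yaml (default 100000); P5165 itself is not reproduced in the log, but the change presumably reduces to something like this sketch (instance config path and unit name are assumptions):
    # lower the tombstone abort threshold for reads, then restart the instance to pick it up
    sudo sed -i 's/^tombstone_failure_threshold:.*/tombstone_failure_threshold: 1000/' /etc/cassandra-a/cassandra.yaml
    sudo systemctl restart cassandra-a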
07:45 <elukey> deleted /srv/cassandra-a/commitlog/CommitLog-5-1490738321543.log from restbase1009-a (empty commit log file created before OOM - backup in /home/elukey) [production]
2017-04-29
10:50 <elukey> set sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=65 on kafka[1018,1020,1022].eqiad.wmnet (was 120 - maybe related to T136094?) [production]
10:39 <elukey> start ferm on kafka1020/18 (nodes were previously down for maintenance, not sure why ferm wasn't started) [production]
2017-04-27
15:56 <elukey> restart of jmxtrans on all the hadoop worker nodes [production]
15:50 <elukey> forced 'service ferm start' on the failed analytics hosts [production]
07:56 <elukey> aqs100[69] back serving AQS traffic [production]
06:50 <elukey> executed kafka preferred-replica-election to rebalance topic leaders in the analytics cluster after maintenance [production]
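'kafka preferred-replica-election' is the WMF wrapper around the stock Kafka tool; the upstream equivalent is roughly as follows (ZooKeeper connection string is an assumption):
    # hand partition leadership back to each partition's preferred replica after brokers rejoin
    kafka-preferred-replica-election.sh --zookeeper localhost:2181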
2017-04-26
20:31 <elukey> restart zookeeper on conf1003 after network maintenance [production]
19:50 <elukey> restart kafka nodes (kafka1018 and kafka1020) after network maintenance [production]
17:46 <elukey> restart nutcracker on the eqiad mw hosts to pick up the new shard config (spamming elasticsearch memcached and triggering alarms) [production]