2017-04-26 §
17:44 <elukey> unmasking and starting daemons on restbase-dev1003 [production]
16:14 <elukey> stop and mask cassandra and restbase on restbase-dev1003 for row-d maintenance [production]
14:26 <elukey> depooling aqs100[69] from AQS for network maintenance [production]
14:20 <elukey> stop zookeeper on conf1003 for row-d maintenance (Hadoop, Kafka related) [production]
13:53 <elukey> stop kafka on kafka1020 and kafka1018 for row-d extended maintenance (D2) [production]
13:22 <elukey> restart HDFS on analytics100[12] (Hadoop master nodes) to pick up recent topology changes for the cluster [production]
08:32 <elukey> Gracefully stopping hadoop daemons on Hadoop nodes affected by Row-D maintenance [production]
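A minimal sketch of the stop/mask and unmask/start sequence used around the row-D maintenance, assuming the systemd unit names cassandra and restbase on restbase-dev1003 (the real unit names may differ):
    # before the maintenance window: stop the daemons and mask them so nothing restarts them
    sudo systemctl stop cassandra restbase
    sudo systemctl mask cassandra restbase
    # after the maintenance window: unmask and start them again
    sudo systemctl unmask cassandra restbase
    sudo systemctl start cassandra restbase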
2017-04-24 §
13:47 <elukey> reimage analytics1003 to Jessie (Oozie/Hive/Camus not available during this timeframe in the Analytics Hadoop cluster) [production]
2017-04-21 §
08:20 <elukey> rolling restart of aqs (nodejs) on aqs* to pick up upgrades [production]
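A sketch of one step of the rolling restart, assuming the host-local depool/pool conftool wrappers and a systemd unit named aqs (both assumptions, not confirmed by the log):
    # on each aqs host, one at a time
    depool                       # drop the host from the load balancer pool
    sudo systemctl restart aqs   # restart the nodejs service to pick up the upgrades
    pool                         # re-pool once the service answers again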
2017-04-20 §
16:08 <elukey> uploaded piwik 2.17.1-1 to jessie-wikimedia main [production]
13:11 <elukey> upgrading Piwik to 2.17.1 (brief downtime during the announced maintenance window) [production]
12:12 <elukey> restart Yarn Resource manager on analytics1001 (hadoop master) to pick up new JVM settings [production]
10:07 <elukey> restart Yarn Resource manager on analytics1002 (hadoop master standby) to pick up new JVM settings [production]
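A minimal sketch of the ResourceManager restart, assuming the CDH service name hadoop-yarn-resourcemanager on the Hadoop masters:
    sudo systemctl restart hadoop-yarn-resourcemanager
    # confirm the new JVM settings (e.g. heap flags) are in the running process
    ps -o args= -C java | grep -i resourcemanager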
2017-04-19 §
09:11 <elukey> cleaning up ocg1003's /srv/deployment/ocg/postmortem dir (root partition filled up) [production]
2017-04-18 §
17:09 <elukey> restart nutcracker in codfw (profile::mediawiki::nutcracker) to make sure that all the daemons are running with the latest config [production]
15:20 <elukey> restored default output-buffer config for rdb2005:6479 [production]
14:22 <elukey> executed CONFIG SET client-output-buffer-limit "normal 0 0 0 slave 2147483648 2147483648 300 pubsub 33554432 8388608 60" on rdb2005:6479 as an attempt to solve slave lagging - T159850 [production]
14:11 <elukey> executed CONFIG SET appendfsync everysec (default) to restore defaults on rdb2005:6479 - T159850 [production]
14:04 <elukey> executed CONFIG SET appendfsync no on rdb2005:6479 to test if fsync stalls affect replication - T159850 [production]
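The Redis tuning above in redis-cli form (values copied from the entries; authentication omitted, port as logged):
    # raise the slave output buffer limits so the lagging replica is not disconnected (T159850)
    redis-cli -h rdb2005 -p 6479 CONFIG SET client-output-buffer-limit "normal 0 0 0 slave 2147483648 2147483648 300 pubsub 33554432 8388608 60"
    # temporarily disable AOF fsync to test whether fsync stalls affect replication
    redis-cli -h rdb2005 -p 6479 CONFIG SET appendfsync no
    # restore the default afterwards
    redis-cli -h rdb2005 -p 6479 CONFIG SET appendfsync everysec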
2017-04-16 §
15:44 <elukey> restart ocg on ocg1003 to release deleted files still held open (visible via lsof) [production]
15:35 <elukey> executing sudo find -name *.pdf -mtime +3 -exec rm {} \; on ocg1003's /srv/deployment/ocg/output to clean up some disk space - T162780 [production]
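The cleanup above in shell form, with the glob quoted so find expands it rather than the shell; the dry-run listing is an added precaution, not part of the logged command:
    cd /srv/deployment/ocg/output
    # list PDFs older than three days that would be removed (T162780)
    sudo find . -name '*.pdf' -mtime +3 -print
    # then delete them
    sudo find . -name '*.pdf' -mtime +3 -exec rm {} \;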
2017-04-14 §
10:29 <elukey> rollback sysctl settings on mw1306 after experiment (stop jobchron/runner, stop hhvm, restore sysctl settings, restart hhvm and job* daemons) [production]
09:50 <elukey> temporarily set sysctl -w net.netfilter.nf_conntrack_max=524288 on mw1306 (jobrunner) as test - (rollback: sysctl -w net.netfilter.nf_conntrack_max=262144) [production]
09:43 <elukey> temporarily set sysctl -w net.ipv4.ip_local_port_range="15000 64000" on mw1306 (jobrunner) as test - (rollback: sysctl -w net.ipv4.ip_local_port_range="32768 60999") - T157968 [production]
08:32 <elukey> restored appendfsync to 'everysec' on Redis rdb2005:6380 (end of performance experiment) [production]
07:23 <elukey> executed CONFIG SET appendfsync no on rdb2005:6380 as performance test [production]
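A sketch of the sysctl experiment and its rollback on the jobrunner, with the values taken from the entries above:
    # widen the ephemeral port range and raise the conntrack table size for the test
    sudo sysctl -w net.ipv4.ip_local_port_range="15000 64000"
    sudo sysctl -w net.netfilter.nf_conntrack_max=524288
    # rollback to the previous values once the experiment is over
    sudo sysctl -w net.ipv4.ip_local_port_range="32768 60999"
    sudo sysctl -w net.netfilter.nf_conntrack_max=262144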
2017-04-13 §
16:55 <elukey> restored default value of client-output-buffer-limit on rdb1007:6379 - T159850 [production]
12:52 <elukey> temporarily executed CONFIG SET client-output-buffer-limit "slave 5368709120 5368709120 180" on rdb1007:6379 [production]
12:34 <elukey> temporarily executed CONFIG SET client-output-buffer-limit "slave 3221225472 3221225472 180" on rdb1007:6379 [production]
11:59 <elukey> temporarily executed CONFIG SET client-output-buffer-limit "slave 2536870912 2536870912 60" on rdb1007:6379 [production]
11:37 <elukey> temporarily executed CONFIG SET client-output-buffer-limit "slave 2147483648 2147483648 60" on rdb1007:6379 to give time to rdb2005's replication to catch up - T159850 [production]
10:47 <elukey> reverted previous config for Redis rdb2005 [production]
10:22 <elukey> executed CONFIG SET appendfsync no (prev value: "everysec") to Redis instance 6380 on rdb2005 - T125735 [production]
06:29 <elukey> re-arm keyholder on mira after reboot [production]
06:14 <elukey> powercycle mira - eth0 errors in the dmesg, CPU system utilization skyrocketed [production]
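One way to check whether rdb2005's replication caught up after the buffer-limit bumps; a generic redis-cli check, not taken from the log, with authentication omitted:
    # on the master: the slave0 line shows the replica's offset and lag
    redis-cli -h rdb1007 -p 6379 INFO replication
    # on the replica: master_link_status should be "up" and master_last_io_seconds_ago low
    redis-cli -h rdb2005 -p 6379 INFO replication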
2017-04-12 §
13:41 <elukey> apply SLOWLOG RESET and CONFIG SET slowlog-max-len 100000 (prev value 10000, 10ms) to rdb1005:6380 to track down slow reqs - T125735 [production]
13:33 <elukey> restored slowlog-log-slower-than 10000 on rdb2005 [production]
13:25 <elukey> applied CONFIG SET slowlog-log-slower-than 300000 to Redis 6379 on rdb2005 and reset slowlog history to play with the stats [production]
12:23 <elukey> restart HDFS datanode daemons on all the Hadoop worker nodes to pick up the new JVM settings [production]
11:57 <elukey> restart Yarn nodemanager daemons on all the Hadoop worker nodes to pick up the new JVM settings [production]
06:37 <elukey> reimage mw2246.codfw.wmnet mw2152.codfw.wmnet to remove the /tmp partition (codfw videoscalers, switchover prep) [production]
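A sketch of the slowlog sampling referenced above (instance address and threshold values from the entries; authentication omitted):
    # keep a longer slowlog history and clear it before sampling
    redis-cli -h rdb1005 -p 6380 CONFIG SET slowlog-max-len 100000
    redis-cli -h rdb1005 -p 6380 SLOWLOG RESET
    # later, inspect the slowest commands collected since the reset
    redis-cli -h rdb1005 -p 6380 SLOWLOG GET 25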
2017-04-11 §
18:34 <elukey> restart hhvm on mw1165 (debug dump in /tmp/hhvm.5384.bt) [production]
12:47 <elukey> reimage mw2246 (Debian codfw videoscaler) to Trusty [production]
11:33 <elukey> resume reboot of analytics1040->1050 for kernel upgrades [production]
06:30 <elukey> restart hhvm on mw1299 - dump debug in /tmp/hhvm.84379.bt [production]
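The production hosts likely used a dedicated helper to produce the /tmp/hhvm.<pid>.bt dumps; as a generic sketch only, a gdb batch run can capture the same kind of backtrace before the restart:
    pid=$(pgrep -o hhvm)
    # dump all thread backtraces of the running HHVM process, then restart it
    sudo gdb -p "$pid" -batch -ex 'thread apply all bt' > /tmp/hhvm.$pid.bt
    sudo systemctl restart hhvm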
2017-04-10 §
17:51 <elukey> restore Hadoop masters to analytics1001 [production]
14:05 <elukey> reimage analytics1001 to Debian Jessie [production]
13:19 <elukey> reboot analytics1040->1050 to pick up the new kernel [production]
08:39 <elukey> manual failover of Hadoop master daemons from analytics1001 to analytics1002 (T160333) [production]
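A sketch of the manual HDFS master failover (T160333), assuming the NameNode HA service IDs are named after the two hosts; the real IDs in the cluster configuration may differ:
    # check which NameNode is currently active
    sudo -u hdfs hdfs haadmin -getServiceState analytics1001-eqiad-wmnet
    sudo -u hdfs hdfs haadmin -getServiceState analytics1002-eqiad-wmnet
    # move the active role to analytics1002 before reimaging analytics1001
    sudo -u hdfs hdfs haadmin -failover analytics1001-eqiad-wmnet analytics1002-eqiad-wmnet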
2017-04-07 §
14:13 <elukey> restart hadoop-hdfs-namenode on analytics1002 (Hadoop Master standby) to pick up new JVM settings [production]
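A minimal sketch of restarting the standby NameNode safely; the service ID and the CDH unit name hadoop-hdfs-namenode are assumptions:
    # confirm the node is the standby before touching it
    sudo -u hdfs hdfs haadmin -getServiceState analytics1002-eqiad-wmnet
    # restart the NameNode so it picks up the new JVM settings
    sudo systemctl restart hadoop-hdfs-namenode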