2019-06-28 §
11:36 <elukey> roll restart eventstreams on all scb1* nodes [production]
11:33 <elukey> restart eventstreams on scb1001 [production]
09:16 <elukey> systemctl reset-failed kafka* units on kafka2002 (role spare, failed units, already masked) [production]
08:43 <elukey> roll restart of eventstreams on all scb2* nodes, service now working (kafka transport failures logged) [production]
2019-06-27 §
13:15 <elukey> start druid drop datasource test - might affect AQS - T226035 [production]
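The 13:15 datasource-drop test is the kind of operation Druid exposes through its coordinator API; a minimal sketch follows, assuming the standard coordinator port, with the host and datasource names as illustrative placeholders rather than values from the log entry.

    # Hedged sketch: disable (drop) a Druid datasource via the coordinator API.
    # Host, port and datasource name are placeholders, not taken from this log.
    curl -X DELETE "http://druid1004.eqiad.wmnet:8081/druid/coordinator/v1/datasources/test_datasource"
    # Confirm it no longer appears among the served datasources:
    curl "http://druid1004.eqiad.wmnet:8081/druid/coordinator/v1/datasources"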
2019-06-26 §
09:04 <elukey> reboot druid100[4-6] for kernel and openjdk upgrades [production]
07:09 <elukey> reboot of druid100[1-3] hosts for kernel + openjdk upgrades [production]
05:59 <elukey> systemctl mask + reset-failed kafka on kafka10[12-23] - T226517 [production]
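The 05:59 entry (like the 09:16 reset-failed on kafka2002 above) is the usual systemd cleanup for leftover units; a minimal per-host sketch, assuming the unit is named kafka and the glob matches whatever kafka* units linger:

    # Hedged sketch of the mask + reset-failed sequence for stale kafka units on one host.
    sudo systemctl mask kafka               # keep the unit from being started again
    sudo systemctl reset-failed 'kafka*'    # clear the "failed" state so it stops alerting
    systemctl list-units --all 'kafka*'     # verify nothing is left in a failed state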
2019-06-24 §
19:32 <elukey> restart yarn/hdfs on analytics1072 to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/518767/ (broken disk) [production]
09:23 <elukey> reboot of kafka-jumbo100[1-6] for kernel + openjdk upgrades [production]
08:56 <elukey> re-enable eventlogging mysql consumers after maintenance on eventlog1002 [production]
08:42 <elukey> reboot an-master100[1,2] for kernel + openjdk upgrades [production]
07:51 <elukey> stop mysql consumer on eventlog1002 (so traffic to db1107 will be stopped, to allow maintenance to happen) [production]
06:16 <elukey> powercycle analytics1060 (stuck, no ssh, no console com2 available) [production]
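The 06:16 powercycle of analytics1060 (no ssh, no console) is the sort of action done over the host's management (IPMI) interface; a hedged sketch, with the management hostname and credential handling as placeholders rather than WMF's actual tooling:

    # Power-cycle a wedged host via IPMI; expects the password in the IPMI_PASSWORD env var.
    ipmitool -I lanplus -H analytics1060.mgmt.eqiad.wmnet -U root -E chassis power cycle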
2019-06-18 §
09:08 <elukey> reboot analytics-tool1004 a second time to pick up the new kernel upgrades [production]
07:45 <elukey> roll restart of cassandra on aqs* to pick up new openjdk upgrades [production]
07:39 <elukey> reboot matomo1001 for kernel upgrades [production]
07:36 <elukey> reboot archiva1001 for kernel upgrades [production]
07:32 <elukey> reboot analytics-tool100* and an-tool100* for kernel upgrades [production]
07:21 <elukey> upload matomo_3.9.1-3 to stretch-wikimedia and upgrade matomo1001 [production]
2019-06-17 §
14:45 <elukey> stop eventlogging on eventlog1002 and reboot for kernel upgrades [production]
13:34 <elukey> reboot of an-worker* (Hadoop worker nodes) for kernel + openjdk upgrades [production]
09:31 <elukey> set cpu governor to performance (was powersave) on analytics1070 (hadoop worker node) [production]
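For the 09:31 governor change on analytics1070, the switch from powersave to performance is normally done through the cpufreq sysfs interface (or the cpupower tool); a short sketch, assuming the scaling_governor files are exposed on the host:

    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor        # currently "powersave"
    echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    # Equivalent with the cpupower utility, if installed:
    #   sudo cpupower frequency-set -g performance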
2019-06-16 §
08:21 <elukey> roll restart of druid brokers on druid100[4-6], stuck after regular data drop maintenance [production]
2019-06-15 §
17:35 <elukey> restart hadoop-yarn-resourcemanager on an-masters as an attempt to fix yarn.w.o [production]
2019-06-07 §
15:09 <elukey> restart thorium for kernel upgrades [production]
08:46 <elukey> start the reboot of the Analytics Hadoop's worker nodes for kernel+openjdk upgrades [production]
2019-06-06 §
17:30 <elukey> restart mcrouter on mw2271 (codfw proxy) to pick up new config changes [production]
14:43 <elukey> restart mcrouter on mw2255 (codfw proxy) to pick up new config changes [production]
12:10 <elukey> restart mcrouter on mw2235 [production]
10:55 <elukey> restart mcrouter on mw2163 (codfw mcrouter proxy) [production]
10:19 <elukey> rolling restart of mcrouter on mw1* hosts to pick up config change (batch of 5 hosts, depool/run-puppet/pool) [production]
10:12 <elukey> disable puppet on mw1* and mw[2163,2235,2255,2271] as prep step for mcrouter config deploy [production]
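The 10:19 rolling restart describes the per-host cycle depool / run puppet / pool, batched five hosts at a time across mw1*. A sketch of the single-host step follows; depool, pool and run-puppet-agent are assumed to be the conftool/puppet wrapper scripts available on the app servers, and the batching itself (e.g. via cumin) is not shown:

    # Hedged sketch of one host's step in the rolling restart; run on the app server.
    depool                           # take the host out of its load-balancer pools
    sudo puppet agent --enable       # puppet was disabled in the 10:12 prep step
    run-puppet-agent                 # render the new mcrouter configuration
    sudo systemctl restart mcrouter  # make the running proxy pick it up
    pool                             # put the host back in service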
2019-06-05 §
13:57 <elukey> restart mcrouter on MediaWiki app/api canaries to pick up new config change (timeouts before marking a memcached shard as TKO from 3 to 10) - T203786 [production]
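The 13:57 entry raises the number of timeouts mcrouter tolerates before declaring a memcached shard TKO from 3 to 10 (T203786). In upstream mcrouter this corresponds to the timeouts-until-tko option; an illustrative invocation only, with config path and port as placeholders and the flag spelling to be checked against the deployed mcrouter version:

    # Hedged sketch: a shard is marked TKO after 10 consecutive timeouts instead of 3.
    mcrouter --config file:/etc/mcrouter/config.json \
             --timeouts-until-tko 10 \
             -p 11213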
2019-06-04 §
08:32 <elukey> remove memcached nutcracker config from mw1* hosts (not used). Changes will be picked up when nutcracker is restarted (after reboots, etc.) - T214275 [production]
08:03 <elukey> restart hive-server2 on an-coord1001 to pick up new GC/Heap settings [production]
06:57 <elukey> restart hive metastore on an-coord1001 to apply new GC/heap settings [production]
06:21 <elukey> restart pdfrender on scb1002 (flapping) [production]
2019-06-03 §
08:17 <elukey> manually removed phab_clean_tmp from www-data's crontab on phab1001 to reduce cronspam [production]
07:58 <elukey> refresh field list for logstash (via kibana Management -> Index patterns -> etc..) [production]
06:50 <elukey> roll restart varnishkafka (via puppet) for a config change - T224236 [production]
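For the 08:17 crontab cleanup on phab1001, one way to drop a single entry from www-data's crontab without hand-editing the file (an illustrative sketch, not necessarily how it was done):

    # Remove the phab_clean_tmp line from www-data's crontab and reinstall the rest.
    sudo crontab -u www-data -l | grep -v phab_clean_tmp | sudo crontab -u www-data -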
2019-05-31 §
14:32 <elukey> powercycle notebook1003 - host stuck due to user processes, no ssh available, OOM didn't trigger [production]
2019-05-21 §
14:25 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [production]
14:25 <elukey@cumin1001> START - Cookbook sre.hosts.decommission [production]
06:55 <elukey> reboot of stat100[4,5,6,7] and notebook100[3,4] for kernel upgrades [production]
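The 14:25 START/END pair records a run of the sre.hosts.decommission cookbook from cumin1001. Cookbooks are launched through the cookbook entry point; the host argument below is a placeholder, since the log elides which host was decommissioned, and the exact flags should be checked against the cookbook's --help:

    # Hedged sketch of a decommission cookbook run from a cumin host.
    sudo cookbook sre.hosts.decommission '<host-fqdn>'   # a Phabricator task id is usually passed as well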
2019-05-20 §
06:25 <elukey> rebuild and upload memkeys 20181031-1 to stretch-wikimedia [production]
06:20 <elukey> upgrade memkeys to version 20181031-1 on all the mc* hosts (was deployed only on a few of them) - T208376 [production]
06:00 <elukey> powercycle analytics1071 - soft lockups error messages in the dmesg [production]
2019-05-16 §
08:32 <elukey> depool/restart-nutcracker-pool mw1293/1313 - T214275 [production]
08:22 <elukey> depool/restart-nutcracker-pool mw1238 - T214275 [production]
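The two entries above follow the same per-host depool / restart nutcracker / repool pattern; a minimal sketch, again assuming the conftool wrapper scripts on the app server:

    depool                             # take the host out of its pools
    sudo systemctl restart nutcracker  # restart with the updated configuration
    pool                               # put the host back in service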