2018-11-03 §
09:35 <elukey> run tcpdump on mc1035 to grab memcache traffic (rotating pcaps, ~30G maximum) [production]
2018-11-01 §
09:10 <elukey> added a tmux session on mw1314, mw1344, mw1316 that checks mcrouter stats every 10s [production]
2018-10-31 §
09:24 <elukey> upgraded memkeys to 20181031-1 on all the mc* - T208376 [production]
09:16 <elukey> upload memkeys 20181031-1 to jessie-wikimedia thirdparty [production]
2018-10-30 §
12:13 <elukey> start memkeys on mc1035 to periodically dump the status of the most used keys - memkeys will use a bit of resources, please stop it if needed (root tmux) - T203786 [production]
2018-10-29 §
08:56 <elukey> restart yarn on an-master100[1,2] to pick up new zookeeper timeout settings (10s -> 20s) - T206943 [production]
2018-10-28 §
17:30 <elukey> restart yarn resource manager on an-master1002 to force failover to an-master1001 - T206943 [production]
2018-10-26 §
15:32 <elukey> rolling restart of all prometheus-mcrouter-exporters on app/api servers - metrics not reported after the last mcrouter restart [production]
2018-10-25 §
15:36 <elukey> shutdown aqs1006 to replace one broken disk - T206915 [production]
14:28 <elukey> upgrade druid on druid100[4-6] to Druid 0.12.3 [production]
10:11 <elukey> upgrade druid100[1-3] to Druid 0.12.3 [production]
09:15 <elukey@deploy1001> Finished deploy [analytics/turnilo/deploy@84bf1ad]: Upgrade to 1.8.1 (duration: 00m 10s) [production]
09:15 <elukey@deploy1001> Started deploy [analytics/turnilo/deploy@84bf1ad]: Upgrade to 1.8.1 [production]
06:06 <elukey> upload druid 0.12.3-1 debs to stretch-wikimedia [production]
2018-10-24 §
07:04 <elukey> powercycle wdqs1008 [production]
06:59 <elukey> powercycle wdqs1007 [production]
06:55 <elukey> powercycle wdqs1006 (depool first) [production]
06:46 <elukey> powercycle wdqs1005 [production]
06:33 <elukey> powercycle wdqs1004 [production]
2018-10-23 §
06:50 <elukey> powercycle ms-be2017 (frozen since ~8hrs ago) [production]
06:42 <elukey> restart yarn and hdfs daemons on analytics1068 to pick up correct config (the host had been down due to hw failure since before we swapped the Hadoop masters) [production]
2018-10-22 §
17:19 <elukey@deploy1001> Finished deploy [analytics/refinery@1de5f44]: Deploy new version of Camus and pageview whitelist (duration: 07m 05s) [production]
17:12 <elukey@deploy1001> Started deploy [analytics/refinery@1de5f44]: Deploy new version of Camus and pageview whitelist [production]
2018-10-16 §
13:08 <elukey> restart memcached on mc1035 with -R 200 (will wipe the object cache shard as a consequence) - T203786 [production]
2018-10-15 §
12:44 <elukey> complete rolling restart of eventbus on kafka[12]00[1-3] for python security upgrades (only codfw was done) [production]
12:41 <elukey> upgrade prometheus-memcached-exporter on swift and thumbor [production]
08:50 <elukey> restart hadoop yarn resource managers on an-master* to pick up new jvm settings [production]
2018-10-14 §
08:54 <elukey> restart Yarn resource manager on an-master1002 to force an-master1001 to take the leadership back - T206943 [production]
08:34 <elukey> powercycle restbase1015 (frozen, no ssh, no metrics, no root console via serial available) [production]
2018-10-12 §
09:01 <elukey> rolling restart of eventbus on kafka[1,2]00[1-3] to pick up python security upgrades [production]
2018-10-11 §
14:15 <elukey> reboot eventlog1002 for kernel upgrades [production]
12:43 <elukey> upgrade prometheus-memcached-exporter on mc1* [production]
12:38 <elukey> upgrade prometheus-memcached-exporter on mc2* [production]
12:15 <elukey> upgrade prometheus-memcached-exporter on mc2035 [production]
12:14 <elukey> upload prometheus-memcached-exporter_0.4.1+git20181010.2fa99eb-1 to (jessie|stretch)-wikimedia [production]
07:36 <elukey> roll restart of aqs on aqs100[4-9] to pick up new Druid settings [production]
2018-10-10 §
07:51 <elukey> cleaned up some log files from eventlog1002 [production]
2018-10-09 §
09:25 <elukey> swapped Hadoop's hive/oozie from analytics1003 to an-coord1001 [production]
08:16 <elukey> update puppet compiler facts [production]
2018-10-08 §
16:28 <elukey> restart eventlogging on eventlog1002 for python security upgrades [production]
13:43 <elukey> restart confd on esams nodes to pick up new srv settings [production]
13:41 <elukey> restart navtiming.service on webperf1001 to pick up the dns change for etcd [production]
13:37 <elukey> restart confd on all the other eqiad nodes to pick up new srv records [production]
13:32 <elukey> restart confd on cp1* to pick up new srv records [production]
10:41 <elukey> restart mcrouter on mw2201 with more verbose logging settings as test [production]
2018-10-07 §
16:35 <elukey> run a script in tmux (my username) on mw2201 to poll the status of a mcrouter key/route every 10s using its admin api (very lightweight but kill if needed) [production]
2018-10-06 §
18:09 <elukey> restart Yarn Resource Manager on an-master1002 to force an-master1001 to take the active role back (failed over due to a zk conn issue) [production]
2018-10-05 §
17:12 <elukey> set etcd in codfw as read/write (was readonly) and eqiad as readonly (was read/write) [production]
11:59 <elukey> deleted bohrium from ganeti via gnt-instance [production]
10:10 <elukey> restart confd on labs-puppetmaster to pick up new etcd settings (eqiad -> codfw) [production]