2019-04-17 §
13:48 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [production]
13:48 <elukey@cumin1001> START - Cookbook sre.hosts.downtime [production]
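The downtime above comes from the sre.hosts.downtime cookbook run on a cumin host; a minimal sketch of such an invocation, with duration, reason and target host as placeholder assumptions (exact flags depend on the cookbook version):
  sudo cookbook sre.hosts.downtime --hours 2 -r "eventlogging maintenance" 'eventlog1002.eqiad.wmnet'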
09:09 <elukey> restart eventlogging on eventlog1002 due to errors in processors and consumer lag accumulated after the last Kafka Jumbo roll restart [production]
2019-04-16 §
15:16 <elukey> roll restart kafka on kafka-jumbo100[1-6] to pick up openjdk upgrades [production]
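A roll restart like the one above usually means restarting one broker at a time and letting replication settle before moving on; a minimal sketch, assuming the systemd unit is called kafka and that a fixed pause between brokers is enough:
  for host in kafka-jumbo100{1..6}; do
    sudo cumin "${host}.eqiad.wmnet" 'systemctl restart kafka'  # restart a single broker
    sleep 300  # let under-replicated partitions drain before touching the next broker
  done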
14:20 <elukey> roll restart of all the druid daemons on druid100[1-6] to pick up new openjdk updates [production]
2019-04-12 §
16:14 <elukey> install ifstat on all the mc1* hosts for network bandwidth investigation [production]
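A fleet-wide install like this is typically pushed through cumin; a minimal sketch, assuming the mc1* hosts are all in eqiad and resolvable by the default query backend:
  sudo cumin 'mc1*.eqiad.wmnet' 'apt-get install -y ifstat'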
10:13 <elukey> matomo updated to 3.9.1 on matomo1001 + deb upload to wikimedia-stretch - T218037 [production]
2019-04-11 §
09:57 <elukey> roll restart druid-coordinator/overlord on druid100[4-6] to pick up new jvm settings [production]
2019-04-10 §
16:51 <elukey@puppetmaster1001> conftool action : set/pooled=no; selector: name=druid1004.eqiad.wmnet [production]
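The conftool action recorded above maps roughly to this confctl invocation (a sketch; only the selector and the set action are taken from the log line):
  sudo confctl select 'name=druid1004.eqiad.wmnet' set/pooled=no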
16:01 <elukey> restart brokers on druid100[3-6] - locking up after segments get deleted [production]
08:56 <elukey> restart druid-broker on druid100[4-6] - stuck after an attempted datasource delete action [production]
08:36 <elukey> update thirdparty/cloudera packages to cdh 5.16.1 for jessie/stretch-wikimedia - T218343 [production]
2019-04-09 §
17:56 <elukey> restart keyholder-agent on deploy1001 to pick up new settings for analytics (+ arm all the keys) [production]
17:42 <elukey> restart keyholder-proxy.service on deploy1001 as an attempt to reload perms for the analytics_deploy key [production]
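Re-arming keyholder after restarting its services is normally done with the keyholder CLI; a minimal sketch, assuming passphrases are entered interactively:
  sudo systemctl restart keyholder-proxy.service
  sudo keyholder arm     # prompts for the key passphrases
  sudo keyholder status  # check that analytics_deploy shows up as armed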
12:13 <elukey> powercycle logstash1012 - no ssh, no mgmt console available, seems completely stuck [production]
2019-04-05 §
10:42 <elukey> restart druid broker on druid100[5,6] - exceptions in the logs after old datasource removal [production]
10:41 <elukey> restart druid broker on druid1004 - exceptions in the logs after old datasource removal [production]
08:32 <elukey> roll restart of aqs on aqs100* to pick up new druid settings [production]
07:51 <elukey> restart gerrit on cobalt (timeouts and general slowdown) [production]
2019-04-03 §
21:32 <elukey> start hadoop-hdfs-namenode on an-master1002 after outage due to big job hitting HDFS [production]
17:57 <elukey> restart hadoop-hdfs-namenode on an-master1001 as precautionary measure after the outage (currently standby) [production]
17:19 <elukey> restart hadoop-hdfs-namenode on an-master1002 after forced shutdown due to errors [production]
2019-04-02 §
10:08 <elukey> manually purge varnishkafka graphite alert's URL as an attempt to avoid a flapping alert - T219842 [production]
2019-03-28 §
08:33 <elukey> move hadoop yarn configuration from hdfs back to zookeeper - T218758 [production]
2019-03-27 §
18:10 <elukey> interface::rps applied to all the mc10XX hosts - T203786 [production]
16:38 <elukey> mc20XX and mc1022 have interface::rps enabled - T203786 [production]
15:19 <elukey> slowly rolling out interface::rps to all the mcXXXX nodes - T203786 [production]
2019-03-26 §
11:21 <elukey> temporarily install ifstat on mc1022 + tmux session to log in/out bandwidth usage every 1s for T203786 [production]
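The per-second logging described above can be reproduced with something like the following inside the tmux session (a sketch; the output path is an assumption):
  ifstat -t 1 >> /home/elukey/mc1022-ifstat.log  # timestamped in/out KB/s, sampled every second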
2019-03-25 §
17:24 <elukey> restart pdfrender on scb1004 [production]
2019-03-22 §
09:04 <elukey> start tcpdump on mc1022 to gather traffic for analysis [production]
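A capture like this is usually limited to the memcached port and written to a pcap for offline analysis; a minimal sketch, with the interface name and output path as assumptions:
  sudo tcpdump -i eth0 -s 0 -w /srv/tmp/mc1022-memcached.pcap 'port 11211'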
2019-03-21 §
13:37 <elukey> upgrade openjdk-8 on an-worker1080 and restart hadoop daemons [production]
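On a Hadoop worker this kind of upgrade amounts to updating the JDK package and bouncing the daemons so they pick up the new JVM; a minimal sketch, assuming the standard Debian package and service names:
  sudo apt-get install --only-upgrade openjdk-8-jdk-headless
  sudo systemctl restart hadoop-hdfs-datanode hadoop-yarn-nodemanager
  ps -o lstart,cmd -C java | head  # confirm the daemons are running on the freshly started JVM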
11:54 <elukey> restart yarn node managers on an-worker10[82,89,92] - they shut down after a long yarn failover and only now has the downtime expired [production]
10:46 <elukey> restart hadoop yarn resource managers on an-master100[1,2] to pick up new settings [production]
07:03 <elukey> restart pdfrender on scb1002 [production]
2019-03-20 §
07:32 <elukey> pool kafka1001 in pybal's eventbus service after yesterday's network maintenance [production]
2019-03-19 §
16:30 <elukey> stop eventlogging's mysql kafka consumers on eventlog1002, eventlogging's db replication on db1108 to ease db1107's maintenance [production]
16:29 <elukey> stop eventlogging's mysql kafka consumers on eventlog1002, eventlogging's db replication on db1108 to ease db1107's maintenance [production]
2019-03-17 §
08:49 <elukey> restart pdfrender on scb1004 [production]
2019-03-12 §
11:39 <elukey> raise mysql's max_user_connection to 1000 for the Analytics user on labsdb1012 [production]
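In MySQL/MariaDB a per-user connection cap is attached to the account rather than set globally; a minimal sketch of the change, with the account name as an assumption:
  sudo mysql -e "GRANT USAGE ON *.* TO 'analytics'@'%' WITH MAX_USER_CONNECTIONS 1000;"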
08:57 <elukey> restart memcached on mc1019 to apply new settings - T217731 [production]
2019-03-11 §
09:44 <elukey> roll restart of aqs on aqs100* to pick up new druid settings [production]
2019-03-08 §
07:59 <elukey@deploy1001> Finished deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster (duration: 00m 40s) [production]
07:58 <elukey@deploy1001> Started deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster [production]
07:57 <elukey@deploy1001> Finished deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster (duration: 00m 02s) [production]
07:57 <elukey@deploy1001> Started deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster [production]
07:52 <elukey@deploy1001> Finished deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster (duration: 01m 18s) [production]
07:51 <elukey@deploy1001> Started deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster [production]
07:35 <elukey@deploy1001> Finished deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster (duration: 00m 30s) [production]
07:34 <elukey@deploy1001> Started deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster [production]
2019-03-07 §
09:15 <elukey> fixed vlan-analytics1-d-eqiad members on asw2-d-eqiad - T205507 [production]