2019-04-17 §
13:48 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [production]
13:48 <elukey@cumin1001> START - Cookbook sre.hosts.downtime [production]
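The downtime above comes from the sre.hosts.downtime cookbook run on a cumin host; a minimal sketch of such an invocation, with duration, reason and target host as placeholder assumptions (exact flags depend on the cookbook version):
  sudo cookbook sre.hosts.downtime --hours 2 -r "eventlogging maintenance" 'eventlog1002.eqiad.wmnet'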
09:09 <elukey> restart eventlogging on eventlog1002 due to errors in processors and consumer lag accumulated after the last Kafka Jumbo roll restart [production]
2019-04-16 §
15:16 <elukey> roll restart kafka on kafka-jumbo100[1-6] to pick up openjdk upgrades [production]
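A roll restart like the one above usually means restarting one broker at a time and letting replication settle before moving on; a minimal sketch, assuming the systemd unit is called kafka and that a fixed pause between brokers is enough:
  for host in kafka-jumbo100{1..6}; do
    sudo cumin "${host}.eqiad.wmnet" 'systemctl restart kafka'  # restart a single broker
    sleep 300  # let under-replicated partitions drain before touching the next broker
  done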
14:20 <elukey> roll restart of all the druid daemons on druid100[1-6] to pick up new openjdk updates [production]
2019-04-12 §
16:14 <elukey> install ifstat on all the mc1* hosts for network bandwidth investigation [production]
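A fleet-wide install like this is typically pushed through cumin; a minimal sketch, assuming the mc1* hosts are all in eqiad and resolvable by the default query backend:
  sudo cumin 'mc1*.eqiad.wmnet' 'apt-get install -y ifstat'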
10:13 <elukey> matomo updated to 3.9.1 on matomo1001 + deb upload to wikimedia-stretch - T218037 [production]
2019-04-11 §
09:57 <elukey> roll restart druid-coordinator/overlord on druid100[4-6] to pick up new jvm settings [production]
2019-04-10 §
16:51 <elukey@puppetmaster1001> conftool action : set/pooled=no; selector: name=druid1004.eqiad.wmnet [production]
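The conftool action recorded above maps roughly to this confctl invocation (a sketch; only the selector and the set action are taken from the log line):
  sudo confctl select 'name=druid1004.eqiad.wmnet' set/pooled=no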
16:01 <elukey> restart brokers on druid100[3-6] - locking up after segments get deleted [production]
08:56 <elukey> restart druid-broker on druid100[4-6] - stuck after an attempted datasource delete action [production]
08:36 <elukey> update thirdparty/cloudera packages to cdh 5.16.1 for jessie/stretch-wikimedia - T218343 [production]
2019-04-09 §
17:56 <elukey> restart keyholder-agent on deploy1001 to pick up new settings for analytics (+ arm all the keys) [production]
17:42 <elukey> restart keyholder-proxy.service on deploy1001 as an attempt to reload perms for the analytics_deploy key [production]
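Re-arming keyholder after restarting its services is normally done with the keyholder CLI; a minimal sketch, assuming passphrases are entered interactively:
  sudo systemctl restart keyholder-proxy.service
  sudo keyholder arm     # prompts for the key passphrases
  sudo keyholder status  # check that analytics_deploy shows up as armed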
12:13 <elukey> powercycle logstash1012 - no ssh, no mgmt console available, seems completely stuck [production]
2019-04-05 §
10:42 <elukey> restart druid broker on druid100[5,6] - exceptions in the logs after old datasource removal [production]
10:41 <elukey> restart druid broker on druid1004 - exceptions in the logs after old datasource removal [production]
08:32 <elukey> roll restart of aqs on aqs100* to pick up new druid settings [production]
07:51 <elukey> restart gerrit on cobalt (timeouts and general slowdown) [production]
2019-04-03 §
21:32 <elukey> start hadoop-hdfs-namenode on an-master1002 after outage due to big job hitting HDFS [production]
17:57 <elukey> restart hadoop-hdfs-namenode on an-master1001 as precautionary measure after the outage (currently standby) [production]
17:19 <elukey> restart hadoop-hdfs-namenode on an-master1002 after forced shutdown due to errors [production]
2019-04-02 §
10:08 <elukey> manually purge varnishkafka graphite alert's URL as an attempt to avoid a flapping alert - T219842 [production]
2019-03-28 §
08:33 <elukey> move hadoop yarn configuration from hdfs back to zookeeper - T218758 [production]
2019-03-27 §
18:10 <elukey> interface::rps applied to all the mc10XX hosts - T203786 [production]
16:38 <elukey> mc20XX and mc1022 have interface::rps enabled - T203786 [production]
15:19 <elukey> slowly rolling out interface::rps to all the mcXXXX nodes - T203786 [production]
2019-03-26 §
11:21 <elukey> temporarily install ifstat on mc1022 + tmux session to log in/out bandwidth usage every 1s for T203786 [production]
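The per-second logging described above can be reproduced with something like the following inside the tmux session (a sketch; the output path is an assumption):
  ifstat -t 1 >> /home/elukey/mc1022-ifstat.log  # timestamped in/out KB/s, sampled every second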
2019-03-25 §
17:24 <elukey> restart pdfrender on scb1004 [production]
2019-03-22 §
09:04 <elukey> start tcpdump on mc1022 to gather traffic for analysis [production]
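A capture like this is usually limited to the memcached port and written to a pcap for offline analysis; a minimal sketch, with the interface name and output path as assumptions:
  sudo tcpdump -i eth0 -s 0 -w /srv/tmp/mc1022-memcached.pcap 'port 11211'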
2019-03-21 §
13:37 <elukey> upgrade openjdk-8 on an-worker1080 and restart hadoop daemons [production]
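On a Hadoop worker this kind of upgrade amounts to updating the JDK package and bouncing the daemons so they pick up the new JVM; a minimal sketch, assuming the standard Debian package and service names:
  sudo apt-get install --only-upgrade openjdk-8-jdk-headless
  sudo systemctl restart hadoop-hdfs-datanode hadoop-yarn-nodemanager
  ps -o lstart,cmd -C java | head  # confirm the daemons are running on the freshly started JVM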
11:54 <elukey> restart yarn node managers on an-worker10[82,89,92] - they shut down after a long yarn failover and only now has the downtime expired [production]
10:46 <elukey> restart hadoop yarn resource managers on an-master100[1,2] to pick up new settings [production]
07:03 <elukey> restart pdfrender on scb1002 [production]
2019-03-20 §
07:32 <elukey> pool kafka1001 in pybal's eventbus service after yesterday's network maintenance [production]
2019-03-19 §
16:30 <elukey> stop eventlogging's mysql kafka consumers on eventlog1002, eventlogging's db replication on db1108 to ease db1107's maintenance [production]
16:29 <elukey> stop eventlogging's mysql kafka consumers on eventlog1002, eventlogging's db replication on db1108 to ease db1107's maintenance [production]
2019-03-17 §
08:49 <elukey> restart pdfrender on scb1004 [production]
2019-03-12 §
11:39 <elukey> raise mysql's max_user_connection to 1000 for the Analytics user on labsdb1012 [production]
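In MySQL/MariaDB a per-user connection cap is attached to the account rather than set globally; a minimal sketch of the change, with the account name as an assumption:
  sudo mysql -e "GRANT USAGE ON *.* TO 'analytics'@'%' WITH MAX_USER_CONNECTIONS 1000;"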
08:57 <elukey> restart memcached on mc1019 to apply new settings - T217731 [production]
2019-03-11 §
09:44 <elukey> roll restart of aqs on aqs100* to pick up new druid settings [production]
2019-03-08 §
07:59 <elukey@deploy1001> Finished deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster (duration: 00m 40s) [production]
07:58 <elukey@deploy1001> Started deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster [production]
07:57 <elukey@deploy1001> Finished deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster (duration: 00m 02s) [production]
07:57 <elukey@deploy1001> Started deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster [production]
07:52 <elukey@deploy1001> Finished deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster (duration: 01m 18s) [production]
07:51 <elukey@deploy1001> Started deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster [production]
07:35 <elukey@deploy1001> Finished deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster (duration: 00m 30s) [production]
07:34 <elukey@deploy1001> Started deploy [analytics/superset/deploy@UNKNOWN]: Test deployment for Buster [production]
2019-03-07 §
09:15 <elukey> fixed vlan-analytics1-d-eqiad members on asw2-d-eqiad - T205507 [production]