2020-05-14
09:29 <elukey> upload matomo-3.13.3 to thirdparty/matomo on stretch|buster-wikimedia [production]
08:57 <elukey> imported gpg key 1FD752571FE36FF23F78F91B81E2E78B66FED89E in apt1001 (Matomo public debian repo) [production]
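(For reference, importing a repo signing key like this typically goes into the keyring reprepro uses to verify upstream releases; a sketch, with the keyserver and keyring path as assumptions.)

    # sketch: fetch the Matomo repo key into reprepro's trusted keyring
    gpg --keyserver keyserver.ubuntu.com \
        --no-default-keyring --keyring ~/.gnupg/trustedkeys.gpg \
        --recv-keys 1FD752571FE36FF23F78F91B81E2E78B66FED89E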
2020-05-13
21:30 <elukey> powercycle analytics1055 [production]
07:14 <elukey> upload spark2_2.4.4-bin-hadoop2.6-2 for buster/stretch on apt1001 [production]
2020-05-11
17:51 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [production]
17:49 <elukey@cumin1001> START - Cookbook sre.hosts.downtime [production]
17:16 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [production]
17:14 <elukey@cumin1001> START - Cookbook sre.hosts.downtime [production]
06:04 <elukey> restart wikimedia-discovery-golden on stat1007 - apparently killed after the system ran out of memory [production]
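(A sketch of the triage implied here, assuming wikimedia-discovery-golden is a systemd unit, which the restart suggests.)

    # check whether the kernel OOM killer terminated the process
    dmesg -T | grep -iE 'out of memory|oom-killer'
    # bring the service back up
    sudo systemctl restart wikimedia-discovery-golden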
2020-05-10
08:44 <elukey> Power cycle analytics1052 after eno1 issue [production]
2020-05-07
09:11 <elukey> roll restart cassandra on aqs1005 to pick up new openjdk upgrades (canary) [production]
05:33 <elukey> restart hadoop yarn nodemanager on analytics1071 [production]
2020-05-06
06:00 <elukey> powercycle analytics1060 - host stuck - T251973 [production]
2020-05-05
15:26 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [production]
15:24 <elukey@cumin1001> START - Cookbook sre.hosts.downtime [production]
15:03 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [production]
15:00 <elukey@cumin1001> START - Cookbook sre.hosts.downtime [production]
2020-05-04
07:07 <elukey> execute ifdown eno1; ifup eno1 on analytics1052 - interface negotiated speed flapping [production]
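(The interface bounce spelled out; the ethtool check afterwards is an illustrative addition, assuming eno1 is managed by ifupdown.)

    ifdown eno1; ifup eno1
    ethtool eno1 | grep -i speed   # confirm the link re-negotiated at the expected speed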
06:41 <elukey> upload prometheus-druid-exporter 0.8-1 to stretch-wikimedia [production]
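(Uploads to the apt repo like this one are typically done with reprepro on apt1001; a sketch, with the component and the changes file name as assumptions.)

    reprepro -C main include stretch-wikimedia \
        prometheus-druid-exporter_0.8-1_amd64.changes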
2020-04-29
17:54 <elukey@cumin1001> END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) [production]
17:44 <elukey@cumin1001> START - Cookbook sre.presto.roll-restart-workers [production]
08:52 <elukey@cumin1001> END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) [production]
08:45 <elukey@cumin1001> START - Cookbook sre.zookeeper.roll-restart-zookeeper [production]
2020-04-28
09:22 <elukey@cumin1001> END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) [production]
09:12 <elukey@cumin1001> START - Cookbook sre.presto.roll-restart-workers [production]
09:12 <elukey@cumin1001> END (FAIL) - Cookbook sre.presto.roll-restart-workers (exit_code=99) [production]
09:12 <elukey@cumin1001> START - Cookbook sre.presto.roll-restart-workers [production]
2020-04-27
13:10 <elukey> roll restart elastic on cloudelastic-chi again to pick up new JVM settings - T231517 [production]
07:25 <elukey> roll restart elastic-chi on cloudelastic100[1-4] to pick up the latest JVM GC settings - T231517 [production]
07:14 <elukey> powercycle an-worker1089 - unreachable via ssh, mgmt serial available, soft cpu lock events registered in dmesg [production]
06:59 <elukey> force ifdown/ifup eno1 on analytics1052 - interface negotiated speed flapping [production]
06:30 <elukey@puppetmaster1001> conftool action : set/pooled=inactive; selector: name=mw1280.eqiad.wmnet [production]
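(The conftool entry above corresponds to an invocation along these lines; a sketch of confctl usage.)

    confctl select 'name=mw1280.eqiad.wmnet' set/pooled=inactive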
2020-04-26
18:08 <elukey> powercycle puppetmaster1001 - mgmt serial console not usable, no ssh, racadm getsel doesn't show anything [production]
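(A sketch of the out-of-band recovery path used here, via the Dell management controller; the mgmt hostname pattern is an assumption.)

    ssh root@puppetmaster1001.mgmt.eqiad.wmnet
    racadm getsel                    # system event log; showed nothing in this case
    racadm serveraction powercycle   # hard power cycle the host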
2020-04-22
05:50 <elukey@deploy1001> Finished deploy [analytics/refinery@30facc4]: Test of new scap settings (duration: 04m 42s) [production]
05:45 <elukey@deploy1001> Started deploy [analytics/refinery@30facc4]: Test of new scap settings [production]
05:25 <elukey@deploy1001> deploy aborted: log (duration: 00m 02s) [production]
05:24 <elukey@deploy1001> Started deploy [analytics/refinery@30facc4]: log [production]
2020-04-20
10:37 <elukey> apt-get purge rsync on mwlog* after https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/589600/ [production]
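(Fleet-wide, a purge like this is usually driven through Cumin; a sketch, with the host selector and flags as assumptions.)

    sudo cumin 'mwlog*' 'apt-get purge -y rsync'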
06:41 <elukey> execute find -mtime +30 -delete in /var/log/airflow/scheduler on an-airflow1001 to free space [production]
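(The cleanup as a one-liner with the path made explicit; -mtime +30 matches files last modified more than 30 days ago, and the -type f restriction is an added safety assumption.)

    find /var/log/airflow/scheduler -type f -mtime +30 -delete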
2020-04-16
15:54 <elukey> restart chi on cloudelastic1001 with -XX:NewRatio=3 - T231517 [production]
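(The GC experiments in this entry and the ones below translate to Elasticsearch jvm.options lines roughly like the following; a sketch, with the exact file and surrounding flags as assumptions.)

    ## jvm.options fragment (sketch)
    -XX:+UseConcMarkSweepGC    # the CMS collector applied on 2020-04-14
    -XX:NewRatio=3             # size the young generation relative to the old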
11:29 <elukey> restart atskafka on cp3050 after maintenance [production]
11:17 <elukey> stop atskafka on cp3050 to re-create the topic atskafka_test_webrequest_text on Kafka Jumbo - T250347 [production]
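(Re-creating the topic on Kafka Jumbo would look roughly like this; the broker address and partition/replication counts are assumptions, and older Kafka releases take --zookeeper instead of --bootstrap-server.)

    kafka-topics.sh --bootstrap-server kafka-jumbo1001.eqiad.wmnet:9092 \
        --delete --topic atskafka_test_webrequest_text
    kafka-topics.sh --bootstrap-server kafka-jumbo1001.eqiad.wmnet:9092 \
        --create --topic atskafka_test_webrequest_text \
        --partitions 3 --replication-factor 3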
09:33 <elukey> restart atskafka on cp3050 to pick up snappy compression - T250347 [production]
05:33 <elukey> restart hadoop-yarn-nodemanager on an-worker108[4,5] - failed after GC OOM events (heavy spark jobs) [production]
2020-04-15
09:08 <elukey> restart druid brokers on druid100[4-6] - stuck after datasource deletion [production]
07:35 <elukey> restart cloudelastic-chi on cloudelastic1002 to apply new jvm settings - T231517 [production]
2020-04-14
14:15 <elukey> enable TLS between weblog1001, mwlog2001.codfw.wmnet, mwlog1001 and Kafka Jumbo/Logging - T250147 [production]
08:49 <elukey> restart elastic-chi on cloudelastic1001 with -XX:NewSize=10G - T231517 [production]
07:33 <elukey> apply CMS GC settings to chi on cloudelastic1001 - T231517 [production]
2020-04-13
06:36 <elukey> temporarily stopped puppet on restbase2014 to avoid attempts to start cassandra on each run - T250050 [production]