2019-06-28 §
11:36 <elukey> roll restart eventstreams on all scb1* nodes [production]
11:33 <elukey> restart eventstreams on scb1001 [production]
09:16 <elukey> systemctl reset-failed kafka* units on kafka2002 (role spare, failed units, already masked) [production]
08:43 <elukey> roll restart of eventstreams on all scb2* nodes, service now working (kafka transport failures logged) [production]
2019-06-27 §
13:15 <elukey> start druid drop datasource test - might affect AQS - T226035 [production]
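The 13:15 datasource-drop test is the kind of operation Druid exposes through its coordinator API; a minimal sketch follows, assuming the standard coordinator port, with the host and datasource names as illustrative placeholders rather than values from the log entry.

    # Hedged sketch: disable (drop) a Druid datasource via the coordinator API.
    # Host, port and datasource name are placeholders, not taken from this log.
    curl -X DELETE "http://druid1004.eqiad.wmnet:8081/druid/coordinator/v1/datasources/test_datasource"
    # Confirm it no longer appears among the served datasources:
    curl "http://druid1004.eqiad.wmnet:8081/druid/coordinator/v1/datasources"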
2019-06-26 §
09:04 <elukey> reboot druid100[4-6] for kernel and openjdk upgrades [production]
07:09 <elukey> reboot of druid100[1-3] hosts for kernel + openjdk upgrades [production]
05:59 <elukey> systemctl mask + reset-failed kafka on kafka10[12-23] - T226517 [production]
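The 05:59 entry (like the 09:16 reset-failed on kafka2002 above) is the usual systemd cleanup for leftover units; a minimal per-host sketch, assuming the unit is named kafka and the glob matches whatever kafka* units linger:

    # Hedged sketch of the mask + reset-failed sequence for stale kafka units on one host.
    sudo systemctl mask kafka               # keep the unit from being started again
    sudo systemctl reset-failed 'kafka*'    # clear the "failed" state so it stops alerting
    systemctl list-units --all 'kafka*'     # verify nothing is left in a failed state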
2019-06-24 §
19:32 <elukey> restart yarn/hdfs on analytics1072 to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/518767/ (broken disk) [production]
09:23 <elukey> reboot of kafka-jumbo100[1-6] for kernel + openjdk upgrades [production]
08:56 <elukey> re-enable eventlogging mysql consumers after maintenance on eventlog1002 [production]
08:42 <elukey> reboot an-master100[1,2] for kernel + openjdk upgrades [production]
07:51 <elukey> stop mysql consumer on eventlog1002 (so traffic to db1107 will be stopped, to allow maintenance to happen) [production]
06:16 <elukey> powercycle analytics1060 (stuck, no ssh, no console com2 available) [production]
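The 06:16 powercycle of analytics1060 (no ssh, no console) is the sort of action done over the host's management (IPMI) interface; a hedged sketch, with the management hostname and credential handling as placeholders rather than WMF's actual tooling:

    # Power-cycle a wedged host via IPMI; expects the password in the IPMI_PASSWORD env var.
    ipmitool -I lanplus -H analytics1060.mgmt.eqiad.wmnet -U root -E chassis power cycle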
2019-06-18 §
09:08 <elukey> reboot analytics-tool1004 a second time to pick up the new kernel upgrades [production]
07:45 <elukey> roll restart of cassandra on aqs* to pick up new openjdk upgrades [production]
07:39 <elukey> reboot matomo1001 for kernel upgrades [production]
07:36 <elukey> reboot archiva1001 for kernel upgrades [production]
07:32 <elukey> reboot analytics-tool100* and an-tool100* for kernel upgrades [production]
07:21 <elukey> upload matomo_3.9.1-3 to stretch-wikimedia and upgrade matomo1001 [production]
2019-06-17 §
14:45 <elukey> stop eventlogging on eventlog1002 and reboot for kernel upgrades [production]
13:34 <elukey> reboot of an-worker* (Hadoop worker nodes) for kernel + openjdk upgrades [production]
09:31 <elukey> set cpu governor to performance (was powersave) on analytics1070 (hadoop worker node) [production]
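For the 09:31 governor change on analytics1070, the switch from powersave to performance is normally done through the cpufreq sysfs interface (or the cpupower tool); a short sketch, assuming the scaling_governor files are exposed on the host:

    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor        # currently "powersave"
    echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    # Equivalent with the cpupower utility, if installed:
    #   sudo cpupower frequency-set -g performance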
2019-06-16 §
08:21 <elukey> roll restart of druid brokers on druid100[4-6], stuck after regular data drop maintenance [production]
2019-06-15 §
17:35 <elukey> restart hadoop-yarn-resourcemanager on an-masters as an attempt to fix yarn.w.o [production]
2019-06-07 §
15:09 <elukey> restart thorium for kernel upgrades [production]
08:46 <elukey> start the reboot of the Analytics Hadoop's worker nodes for kernel+openjdk upgrades [production]
2019-06-06 §
17:30 <elukey> restart mcrouter on mw2271 (codfw proxy) to pick up new config changes [production]
14:43 <elukey> restart mcrouter on mw2255 (codfw proxy) to pick up new config changes [production]
12:10 <elukey> restart mcrouter on mw2235 [production]
10:55 <elukey> restart mcrouter on mw2163 (codfw mcrouter proxy) [production]
10:19 <elukey> rolling restart of mcrouter on mw1* hosts to pick up config change (batch of 5 hosts, depool/run-puppet/pool) [production]
10:12 <elukey> disable puppet on mw1* and mw[2163,2235,2255,2271] as prep step for mcrouter config deploy [production]
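The 10:19 rolling restart describes the per-host cycle depool / run puppet / pool, batched five hosts at a time across mw1*. A sketch of the single-host step follows; depool, pool and run-puppet-agent are assumed to be the conftool/puppet wrapper scripts available on the app servers, and the batching itself (e.g. via cumin) is not shown:

    # Hedged sketch of one host's step in the rolling restart; run on the app server.
    depool                           # take the host out of its load-balancer pools
    sudo puppet agent --enable       # puppet was disabled in the 10:12 prep step
    run-puppet-agent                 # render the new mcrouter configuration
    sudo systemctl restart mcrouter  # make the running proxy pick it up
    pool                             # put the host back in service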
2019-06-05 §
13:57 <elukey> restart mcrouter on MediaWiki app/api canaries to pick up new config change (timeouts before marking a memcached shard as TKO from 3 to 10) - T203786 [production]
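The 13:57 entry raises the number of timeouts mcrouter tolerates before declaring a memcached shard TKO from 3 to 10 (T203786). In upstream mcrouter this corresponds to the timeouts-until-tko option; an illustrative invocation only, with config path and port as placeholders and the flag spelling to be checked against the deployed mcrouter version:

    # Hedged sketch: a shard is marked TKO after 10 consecutive timeouts instead of 3.
    mcrouter --config file:/etc/mcrouter/config.json \
             --timeouts-until-tko 10 \
             -p 11213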
2019-06-04 §
08:32 <elukey> remove memcached nutcracker config from mw1* hosts (not used). Changes will be picked up when nutcracker is restarted (after reboots, etc.) - T214275 [production]
08:03 <elukey> restart hive-server2 on an-coord1001 to pick up new GC/Heap settings [production]
06:57 <elukey> restart hive metastore on an-coord1001 to apply new GC/heap settings [production]
06:21 <elukey> restart pdfrender on scb1002 (flapping) [production]
2019-06-03 §
08:17 <elukey> manually removed phab_clean_tmp from www-data's crontab on phab1001 to reduce cronspam [production]
07:58 <elukey> refresh field list for logstash (via kibana Management -> Index patterns -> etc..) [production]
06:50 <elukey> roll restart varnishkafka (via puppet) for a config change - T224236 [production]
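For the 08:17 crontab cleanup on phab1001, one way to drop a single entry from www-data's crontab without hand-editing the file (an illustrative sketch, not necessarily how it was done):

    # Remove the phab_clean_tmp line from www-data's crontab and reinstall the rest.
    sudo crontab -u www-data -l | grep -v phab_clean_tmp | sudo crontab -u www-data -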
2019-05-31 §
14:32 <elukey> powercycle notebook1003 - host stuck due to user processes, no ssh available, OOM didn't trigger [production]
2019-05-21 §
14:25 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [production]
14:25 <elukey@cumin1001> START - Cookbook sre.hosts.decommission [production]
06:55 <elukey> reboot of stat100[4,5,6,7] and notebook100[3,4] for kernel upgrades [production]
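The 14:25 START/END pair records a run of the sre.hosts.decommission cookbook from cumin1001. Cookbooks are launched through the cookbook entry point; the host argument below is a placeholder, since the log elides which host was decommissioned, and the exact flags should be checked against the cookbook's --help:

    # Hedged sketch of a decommission cookbook run from a cumin host.
    sudo cookbook sre.hosts.decommission '<host-fqdn>'   # a Phabricator task id is usually passed as well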
2019-05-20 §
06:25 <elukey> rebuild and upload memkeys 20181031-1 to stretch-wikimedia [production]
06:20 <elukey> upgrade memkeys to version 20181031-1 on all the mc* hosts (was deployed only on a few of them) - T208376 [production]
06:00 <elukey> powercycle analytics1071 - soft lockups error messages in the dmesg [production]
2019-05-16 §
08:32 <elukey> depool/restart-nutcracker-pool mw1293/1313 - T214275 [production]
08:22 <elukey> depool/restart-nutcracker-pool mw1238 - T214275 [production]
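The two entries above follow the same per-host depool / restart nutcracker / repool pattern; a minimal sketch, again assuming the conftool wrapper scripts on the app server:

    depool                             # take the host out of its pools
    sudo systemctl restart nutcracker  # restart with the updated configuration
    pool                               # put the host back in service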