2017-12-18 §
14:13 <elukey> temporarily stopped mysql consumers on eventlog1001 to ease a mysql backup on db1107 - T183123 [production]
08:57 <elukey> rolling restart of the Yarn nodemanagers (hadoop) on analytics10[456]* to pick up new settings - T182276 [production]
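The T182276 nodemanager restarts (here and in the later entries for an103* and analytics102[8,9]) are rolling restarts of the YARN nodemanager unit, one host at a time. A minimal sketch with cumin, assuming the CDH systemd unit name hadoop-yarn-nodemanager; the host glob is illustrative and would need to match the selector grammar of whichever cumin backend is in use:

  # hypothetical rolling restart, one host per batch with a pause in between
  sudo cumin --batch-size 1 --batch-sleep 60 'analytics104*' \
      'systemctl restart hadoop-yarn-nodemanager'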
2017-12-15 §
16:10 <elukey> re-enable piwik on bohrium after mysql backup restore [production]
10:31 <elukey> rolling restart of yarn nodemanagers on an103* to apply new config - T182276 [production]
09:50 <elukey> restore piwik database on bohrium after mysql corruption - piwik disabled [production]
2017-12-14 §
18:24 <elukey> replace kafka1018 with kafka1023 (Analytics Kafka cluster) [production]
13:41 <elukey> update facts for puppet compiler to pick up new hosts [production]
2017-12-13 §
14:01 <elukey> restart Yarn nodemanagers on analytics102[8,9] to apply new settings - T182276 [production]
11:59 <elukey> forced remount of /mnt/hdfs after OOM event on stat1005 [production]
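/mnt/hdfs is a FUSE (fuse_dfs) mount of HDFS, and a forced remount of a wedged FUSE mountpoint is essentially a lazy unmount followed by a remount. A minimal sketch, assuming the mountpoint is defined in /etc/fstab:

  # hypothetical recovery of a hung fuse_dfs mountpoint
  sudo umount -l /mnt/hdfs   # lazy unmount: detach even while the mount is busy
  sudo mount /mnt/hdfs       # remount from the fstab entry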
2017-12-12 §
15:24 <elukey> rename notebook1002 -> kafka1023 - step 3, replace notebook1002 with kafka1023 in the puppet config [production]
15:02 <elukey> clear recdns records related to notebook1002/kafka1023 (rec_control wipe-cache kafka1023.eqiad.wmnet kafka1023.mgmt.eqiad.wmnet notebook1002.eqiad.wmnet 14.5.64.10.in-addr.arpa 104.3.65.10.in-addr.arpa) - T181518 [production]
14:46 <elukey> start rename notebook1002 -> kafka1023 - step 2, dns config (host already shutdown) - T181518 [production]
2017-12-11 §
09:05 <elukey> set notebook1002 as role::spare as prep step to reimage it to kafka1023 [production]
08:12 <elukey> powercycle ganeti1008 - all vms stuck, console com2 showed a ton of printks without a clear indicator of the root cause [production]
2017-12-10 §
20:33 <elukey> execute restart-hhvm on mw1312 - hhvm stuck multiple times queueing requests [production]
20:01 <elukey> ran kafka preferred-replica-election for the kafka analytics cluster (1012->1022) to re-add kafka1012 to the kafka brokers acting as partition leaders (will spread the load in a better way) [production]
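The preferred-replica election is the stock Kafka admin tool pointed at the cluster's ZooKeeper ensemble (the 20:01 entry likely used a local kafka wrapper for it). A minimal sketch for Kafka of that era; the ZooKeeper connection string is a placeholder:

  # hypothetical invocation of the upstream tool
  kafka-preferred-replica-election.sh --zookeeper zk-host:2181/kafka/analytics-eqiad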
2017-12-08 §
11:45 <elukey> updated prometheus-druid-exporter on druid* to 0.6 [production]
11:39 <elukey> upload prometheus-druid-exporter 0.6 to stretch/jessie wikimedia [production]
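Uploading a package to the jessie/stretch-wikimedia suites amounts to importing the built .changes file with reprepro on the apt host. A minimal sketch; the filename is a placeholder:

  # hypothetical reprepro import into both distributions
  sudo reprepro include jessie-wikimedia prometheus-druid-exporter_0.6_amd64.changes
  sudo reprepro include stretch-wikimedia prometheus-druid-exporter_0.6_amd64.changes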
2017-12-07 §
20:35 <elukey> restart hhvm on mw1235 - hhvm-dump-debug hanging, no stacktrace available [production]
20:31 <elukey> restart hhvm on mw1281 - hhvm stuck (hhvm-dump-debug timing out) [production]
17:25 <elukey@puppetmaster1001> conftool action : set/pooled=yes; selector: name=mw1314.eqiad.wmnet [production]
15:42 <elukey> hhvm-dump-debug for mw1314 saved to /tmp/hhvm.17991.bt. [production]
15:30 <elukey@puppetmaster1001> conftool action : set/pooled=no; selector: name=mw1314.eqiad.wmnet [production]
10:50 <elukey> powercycle analytics1003 - no serial console, ssh stuck in "System is booting up. See pam_nologin(8)" [production]
10:12 <elukey> reboot analytics1003 for kernel+jvm updates - T179943 [production]
08:28 <elukey> install prometheus-druid-exporter 0.5 on druid* [production]
08:26 <elukey> upload prometheus-druid-exporter 0.5-1 to jessie/stretch-wikimedia [production]
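The two conftool entries above (15:30 depool, 17:25 repool of mw1314) are the logged form of confctl runs on the puppetmaster; roughly, assuming the standard conftool CLI:

  # hypothetical depool/repool; the selector matches the logged action
  sudo confctl select 'name=mw1314.eqiad.wmnet' set/pooled=no
  sudo confctl select 'name=mw1314.eqiad.wmnet' set/pooled=yes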
2017-12-05 §
10:45 <elukey> reboot druid1003 for kernel+jvm updates - T179943 [production]
09:42 <elukey> reboot analytics100[12] for kernel+jvm updates (Hadoop Master nodes) - T179943 [production]
2017-12-04 §
14:30 <elukey> reboot druid100[23] for kernel updates [production]
14:01 <elukey> reboot analytics106* (hadoop worker nodes) for kernel+jvm updates - T179943 [production]
09:24 <elukey> reboot analytics104* (hadoop worker nodes) for kernel+jvm updates - T179943 [production]
2017-12-01 §
12:44 <elukey> reboot druid1001 for kernel+jvm updates - T179943 [production]
10:57 <elukey> reboot analytics1028 for kernel + jvm updates (Hadoop HDFS journalnode) - T179943 [production]
09:23 <elukey> reboot analytics104* for kernel+jvm updates - T179943 [production]
08:40 <elukey> reboot the remaining analytics103* hadoop workers to pick up kernel+jvm updates - T179943 [production]
2017-11-30 §
16:12 <elukey> drain and reboot analytics1031->39 to pick up jvm+kernel updates - T179943 [production]
09:14 <elukey> drain and reboot analytics1029/1030 for jvm+kernel updates (Hadoop worker canaries) [production]
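"Drain and reboot" here means stopping the Hadoop worker daemons so the host holds no running YARN containers or actively served HDFS blocks before going down. A minimal per-host sketch, assuming the CDH systemd unit names (the real procedure may also wait for running containers to finish first):

  # hypothetical per-worker drain before reboot
  sudo systemctl stop hadoop-yarn-nodemanager   # stop running/accepting YARN containers
  sudo systemctl stop hadoop-hdfs-datanode      # block replicas stay available on other workers
  sudo reboot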
2017-11-29 §
14:36 <elukey> reboot druid100[456] for jvm+kernel updates - T179943 [production]
13:18 <elukey> reboot kafka100[23] for jvm+kernel updates - T179943 [production]
11:30 <elukey> reboot kafka1001 for kernel + jvm updates - T179943 [production]
2017-11-28 §
14:17 <elukey> reboot kafka10[12-22] for kernel + jvm updates - T179943 [production]
14:03 <elukey> reboot kafka200[123] for kernel + jvm updates - T179943 [production]
2017-11-27 §
13:22 <elukey> remove eventlogging replication support (log database) from dbstore1002 - T156844 [production]
2017-11-24 §
08:07 <elukey> re-enabling piwik on bohrium (only VM running on ganeti1006 atm) after mysql tables restore completed [production]
2017-11-22 §
16:16 <elukey> restart druid broker,coordinator,historical daemons on druid100[123] to pick up new logging settings [production]
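The restart covers three Druid services per host; assuming Debian-style unit names (druid-broker, druid-coordinator, druid-historical), the per-host step is just:

  # hypothetical per-host restart to pick up the new logging settings
  sudo systemctl restart druid-broker druid-coordinator druid-historical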
2017-11-21 §
09:39 <elukey> upload prometheus-druid-exporter 0.4 to jessie/stretch-wikimedia [production]
2017-11-20 §
14:05 <elukey> upload prometheus-druid-exporter 0.3 to jessie-wikimedia [production]
13:30 <elukey> upload prometheus-druid-exporter 0.3 to stretch-wikimedia [production]
2017-11-17 §
08:04 <elukey> reboot stat100[456] for kernel updates [production]