2017-06-20 §
17:16 <elukey> restart redis-instance-tcp_6380.service on rdb2004 to force sync with its master [production]
16:05 <elukey> reboot kafka1013 for kernel upgrade [production]
14:47 <elukey> rolling restart of druid100[123] for kernel upgrades [production]
14:05 <elukey> reboot kafka2001 for kernel upgrade [production]
12:00 <elukey> reboot analytics1029 -> analytics1069 for kernel upgrades (Hadoop worker nodes) [production]
10:03 <elukey> reboot kafka1012, analytics1028, aqs1004 for kernel upgrades (canary hosts) [production]
2017-06-19 §
12:04 <elukey> run 'echo "autoLearnMode=1" > /tmp/disable_learn && megacli -AdpBbuCmd -SetBbuProperties -f /tmp/disable_learn -a0' on all the analytics workers to disable BBU Auto learn - T167809 [production]
2017-06-14 §
07:04 <elukey> restart pdfrender on scb200[2,4] (xpra race condition) [production]
07:03 <elukey> restart pdfrender on scb1004 (xpra race condition) [production]
2017-06-13 §
10:11 <elukey> completed rollout of https://gerrit.wikimedia.org/r/354449 [production]
09:27 <elukey> puppet disabled on kafka*, analytics*, druid*, conf* for https://gerrit.wikimedia.org/r/354449 - incremental rollout [production]
06:55 <elukey> executed "cumin 'mw2*.codfw.wmnet' 'find /var/log/hhvm/* -user root -exec chown www-data:www-data {} \;'" to fix the last occurrences of wrongly-owned (root:adm) hhvm log files [production]
2017-06-12 §
08:22 <elukey> powercycle scb2005 (console frozen, host unresponsive) [production]
07:40 <elukey> restarted citoid on scb1001 (kept failing health checks for Error: write EPIPE) [production]
07:26 <elukey> ran restart-pdfrender on scb1001 (OOM errors in the dmesg from hours ago) [production]
07:22 <elukey> ran restart-pdfrender on scb1002 (OOM errors in the dmesg from hours ago) [production]
2017-06-11 §
14:14 <elukey> executed cumin 'mw22[51-60].codfw.wmnet' 'find /var/log/hhvm/* -user root -exec chown www-data:www-data {} \;' to reduce cron-spam (new hosts added in March) - T146464 [production]
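The chown cleanups above rely on find's `-user` test plus `-exec`. A minimal local sketch of the same plumbing, under stated assumptions: the paths are throwaway examples, and it re-chowns files to the current user so it runs unprivileged, whereas the production command targeted root-owned hhvm logs and re-owned them to www-data:

```shell
# Sketch of the find/-exec chown pattern from the log entries above.
# The production command selected root-owned hhvm logs and handed them
# to chown; here a temporary directory and the current user stand in
# so the same plumbing runs without root.
tmpdir=$(mktemp -d)
touch "$tmpdir/error.log" "$tmpdir/access.log"
# -user selects files by owner; -exec runs chown once per match ({}).
find "$tmpdir" -type f -user "$(id -un)" -exec chown "$(id -un)" {} \;
ls "$tmpdir"
rm -rf "$tmpdir"
```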
2017-06-09 §
07:51 <elukey> run megacli -LDSetProp -Direct -LALL -aALL on analytics[1058-1068] - T166140 [production]
07:26 <elukey> run megacli -LDSetProp ADRA -LALL -aALL on analytics[1058-1068] - T166140 [production]
07:15 <elukey> deleted /etc/logrotate.d/nova-manage from labtestvirt2003 to reduce cronspam (same solution used in T132422#2679434) [production]
2017-06-08 §
09:05 <elukey> upgrade zookeeper packages to 3.4.5+dfsg-2+deb8u2 on conf100[123], conf200[23] and druid100[123] [production]
2017-06-07 §
17:14 <elukey> restart nutcracker on thumbor1002 (too many connections approaching the 1024 ulimit) [production]
12:40 <elukey> upgrade zookeeper packages on conf2002 to 3.4.5+dfsg-2+deb8u2 [production]
2017-06-06 §
13:39 <elukey> shutdown analytics1033 and analytics1039 to replace their BBU - T166140 [production]
2017-06-02 §
04:42 <elukey> removed some old scap revs for the Analytics refinery on stat1002 to free space (git fat jars replicating after each deployment, known issue) [production]
2017-06-01 §
17:02 <elukey> stop mysql, eventlogging_sync and shutdown db1047 (analytics-store) for maintenance - T159266 [production]
15:03 <elukey> restart kafka100[23] for jvm upgrades [production]
05:58 <elukey> powercycle cp3032 - T166758 [production]
05:43 <elukey@puppetmaster1001> conftool action : set/pooled=no; selector: name=cp3032.esams.wmnet [production]
2017-05-31 §
07:47 <elukey> restart kafka on kafka10[14,22,20] for jvm upgrades [production]
2017-05-30 §
13:44 <elukey> restart kafka on kafka1013 for jvm upgrades [production]
13:21 <elukey> restart kafka on kafka1001 for jvm upgrades [production]
12:43 <elukey> restart kafka on kafka200[123] for jvm upgrades (main-codfw, eventbus) [production]
12:07 <elukey> restart kafka on kafka1012 for jvm upgrades [production]
08:23 <elukey> restart jmxtrans on all the kafka brokers (analytics+main-codfw/eqiad) for jvm upgrades [production]
08:17 <elukey> restart kafka on kafka1018 for jvm upgrades [production]
2017-05-26 §
12:44 <elukey> Restart Hadoop daemons on analytics100[12] (Hadoop master nodes) for jvm upgrades [production]
2017-05-25 §
13:04 <elukey> restart cassandra-a on aqs1004 to test https://gerrit.wikimedia.org/r/354107 [production]
10:01 <elukey> restart HDFS datanode daemons on all the hadoop worker nodes for jvm upgrades [production]
09:39 <elukey> reimage analytics1030 to Debian Jessie - T165529 [production]
09:35 <elukey> restart Yarn nodemanager daemons on all the hadoop worker nodes for jvm upgrades [production]
2017-05-24 §
13:54 <elukey> upgrade Druid daemons on druid100[123] to 0.10 - T164008 [production]
2017-05-23 §
12:47 <elukey@tin> Finished deploy [analytics/refinery@679aeea]: Updated stat1002 with the last refinery deployment (duration: 00m 42s) [production]
12:46 <elukey@tin> Started deploy [analytics/refinery@679aeea]: Updated stat1002 with the last refinery deployment [production]
12:46 <elukey@tin> Finished deploy [analytics/refinery@679aeea]: (no justification provided) (duration: 00m 01s) [production]
12:45 <elukey@tin> Started deploy [analytics/refinery@679aeea]: (no justification provided) [production]
11:56 <elukey> set vm.dirty_background_bytes=25165824 on aqs1004 as part of testing for https://gerrit.wikimedia.org/r/#/c/354107 (Rollback: set vm.dirty_background_ratio=10) [production]
09:15 <elukey> reverted manual hack on mw1161 with scap pull [production]
08:15 <elukey> apply manually https://gerrit.wikimedia.org/r/#/c/351854/2/wmf-config/jobqueue.php (persistent connections between hhvm and redis) to mw1161 as production test [production]
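The dirty-bytes tuning on aqs1004 logged at 11:56 above touches the kernel's writeback knobs: the sysctl keys are vm.dirty_background_bytes and vm.dirty_background_ratio, and setting the *_bytes form to a non-zero value makes the kernel ignore the *_ratio form. A hedged sketch (the sysctl -w lines need root on the target host, so they are left commented; only the size arithmetic runs):

```shell
# The value used in the log entry is exactly 24 MiB:
echo $((24 * 1024 * 1024))   # 25165824
# Apply on the target host (requires root); a non-zero *_bytes value
# overrides vm.dirty_background_ratio:
#   sysctl -w vm.dirty_background_bytes=25165824
# Rollback, as noted in the log entry: restore the ratio-based default.
#   sysctl -w vm.dirty_background_ratio=10
```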
2017-05-18 §
16:11 <elukey> upgraded cassandra-tools-wmf on aqs hosts [production]