production SAL

2017-02-02 §
15:01	<elukey>	Replace Redis/Memcached shards mc200[4567] with mc202[2345]	[production]
11:40	<elukey>	Swap mc2002 with mc2020, mc2003 with mc2021 (Redis codfw replicas) - T155755	[production]
10:53	<elukey>	Swap mc2001 with mc2019 (Redis codfw replicas) - T155755	[production]
2017-02-01 §
16:20	<elukey>	restarting Yarn Node Manager daemons on all the Hadoop nodes to bandaid a memory leak causing OOMs	[production]
12:09	<elukey@tin>	Finished deploy [analytics/refinery@e6254a4]: (no justification provided) (duration: 04m 41s)	[production]
12:04	<elukey@tin>	Started deploy [analytics/refinery@e6254a4]: (no justification provided)	[production]
07:41	<elukey>	bootstrapping aqs1008-a on aqs1008 (new AQS cassandra node)	[production]
2017-01-31 §
16:11	<elukey>	started Cassandra nodetool cleanup for aqs1007-a	[production]
16:03	<elukey>	started Cassandra nodetool cleanup for aqs1004-b	[production]
14:12	<elukey>	restarting hhvm on mw1204 (dump debug in /tmp/hhvm.29120.bt)	[production]
13:58	<elukey>	rebooted analytics1039 to pick up uuids in fstab - T147879	[production]
11:14	<elukey>	updating the puppet compiler's facts	[production]
08:44	<elukey@puppetmaster1001>	conftool action : set/pooled=yes; selector: name=aqs1007.eqiad.wmnet	[production]
08:26	<elukey>	started Cassandra nodetool cleanup for aqs1004-a	[production]
2017-01-30 §
09:25	<elukey>	bootstrapping new cassandra instance (aqs1007-b) on AQS - https://gerrit.wikimedia.org/r/#/c/334753/	[production]
08:45	<elukey>	restarting aqs on aqs100[4567] to pick up NSS updates	[production]
08:19	<elukey>	set mw1236.eqiad.wmnet pooled=inactive because powered off (no mentions on the SAL, still trying to find why)	[production]
2017-01-26 §
19:13	<elukey>	restore analytics1001 as RM and HDFS masters	[production]
18:36	<elukey>	restarting Yarn node managers on an102[89] and an103[01], impacted by the switch restart	[production]
17:57	<elukey>	bootstrapping aqs1007-a cassandra instance	[production]
17:34	<elukey@tin>	Finished deploy [analytics/aqs/deploy@5917fd4]: (no message) (duration: 02m 25s)	[production]
17:31	<elukey@tin>	Starting deploy [analytics/aqs/deploy@5917fd4]: (no message)	[production]
13:53	<elukey>	restarting cassandra on aqs100[56] to complete the openjdk update	[production]
12:54	<elukey>	restarting the aqs1004-b cassandra instance to pick up the new openjdk (last test before complete rollout)	[production]
12:28	<elukey>	restarting the aqs1004-a cassandra instance to pick up the new openjdk	[production]
2017-01-25 §
18:02	<elukey>	running authdns-update on ns0.w.o to pick up changes made in https://gerrit.wikimedia.org/r/334040	[production]
09:25	<elukey>	updating puppet-compiler facts	[production]
07:28	<elukey>	upgrading aqs100[56] to node6	[production]
2017-01-24 §
16:37	<elukey>	upgrading aqs1004 to node6	[production]
2017-01-23 §
15:19	<elukey>	whitelisted dbproxy1011 on cr1/cr2 for analytics-in4 input filter	[production]
11:54	<elukey>	whitelisted dbproxy1010 on cr1/cr2 for analytics-in4 input filter	[production]
2017-01-20 §
10:39	<elukey>	manually forcing a /etc/init.d/apache2 reload on mw1259 (videoscaler) to replicate the effects of a logrotate run and test why alarms go off.	[production]
2017-01-16 §
15:01	<elukey>	restarting hhvm on mw1167 - hhvm-dump-debug in /tmp/hhvm.20360.bt	[production]
2017-01-11 §
22:26	<elukey>	added mw1239.eqiad.wmnet back to service - T148421	[production]
22:20	<elukey>	restarting hhvm on mw1198 (dump-debug in /tmp/hhvm.9737.bt)	[production]
2017-01-05 §
07:54	<elukey>	chown www-data:www-data all the root:adm hhvm log files on mw eqiad hosts (T132324)	[production]
2017-01-03 §
07:58	<elukey>	chown www-data:www-data all the root:adm hhvm log files on mw codfw hosts (T132324)	[production]
2017-01-02 §
13:24	<elukey>	powercycled mw1280, not pingable and mgmt console frozen	[production]
2016-12-22 §
14:51	<elukey>	restarting the yarn node manager java daemons on all the Hadoop worker nodes due to a suspected memory leak	[production]
14:14	<elukey>	the previous entry is missing: "on analytics1032"	[production]
14:13	<elukey>	manually starting the yarn nodemanager after OOM	[production]
07:26	<elukey>	created /var/log/squid3/access.log.1.gz on aluminum to fix cronspam - T132324	[production]
2016-12-21 §
15:04	<elukey>	removed mongodb* packages from stat1003 after https://gerrit.wikimedia.org/r/328519	[production]
08:42	<elukey>	restarted hhvm/jobrunner (and killed ffmpeg processes) on mw116[89]	[production]
2016-12-20 §
08:27	<elukey>	renamed some log files ($something.1.gz to $something.1a.gz) on cp1008 and rutherium to unblock logrotation and reduce cronspam - T132324	[production]
2016-12-19 §
13:39	<elukey>	Manually raise hhvm.server.connection_timeout_seconds on mw1259 to one day	[production]
10:16	<elukey>	reimaging mw1168 and mw1169 to Trusty - T153488	[production]
09:38	<elukey>	stopping jobrunner/jobchron daemons on mw116[89] as prep step for repurpose to videoscalers - T153488	[production]
09:20	<elukey>	killing irc-echo	[production]
2016-12-18 §
16:45	<elukey>	starting cassandra instances on restbase1009, restbase1011 and restbase1013 (one at a time) - T153588	[production]