2017-05-18
13:14 <elukey> AMEND prev: reloaded kafkatee on oxygen [production]
13:14 <elukey> reloaded kafkatee to test T151748 [production]
2017-05-17
13:42 <elukey> shutdown analytics1030 for T165529 [production]
2017-05-11
12:19 <elukey> reboot kafka100[23] for kernel upgrades (kafka main-eqiad, eventbus eqiad) [production]
2017-05-10
20:55 <elukey> restart hhvm on mw1268 (HHVM 3.12, HPHP::Treadmill::getAgeOldestRequest issue) [production]
14:57 <elukey> reboot kafka1001 for kernel upgrades (kafka main-eqiad, eventbus eqiad) [production]
13:53 <elukey> reboot kafka200[23] for kernel upgrades (kafka main-codfw cluster, eventbus codfw) [production]
2017-05-09
17:29 <elukey> executing varnish-backend-restart on cp1072 as attempt to mitigate "FetchError Could not get storage" and "ExpKill LRU_Fail" - T145661 [production]
17:25 <elukey> executing varnish-backend-restart on cp1074 as attempt to mitigate "FetchError Could not get storage" and "ExpKill LRU_Fail" - T145661 [production]
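For context: varnish-backend-restart is a WMF helper script, presumably run on the cache host itself; a minimal sketch of the assumed invocation (flags, if any, are not recorded in the log):
    # restart the backend varnishd instance to free its storage (frontend stays up)
    sudo varnish-backend-restart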
16:08 <elukey> playing with mw2146 for T163674 [production]
16:00 <elukey> stopping Hadoop daemons and shutting down analytics[1032-1033,1040].eqiad.wmnet - T132256 [production]
14:16 <elukey> correction: reboot kafka2001 for kernel upgrades (eventbus codfw) [production]
14:16 <elukey> reboot kafka1001 for kernel upgrades (eventbus codfw) [production]
11:03 <elukey> forced net.netfilter.nf_conntrack_tcp_timeout_time_wait = 65 on all the kafka brokers [production]
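The entry above corresponds to a one-line sysctl per broker; a sketch, assuming ad-hoc application rather than a puppet change:
    # shorten the conntrack TIME_WAIT timeout from the 120s default to 65s
    sudo sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=65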
10:34 <elukey> reboot kafka1022 for kernel upgrades [production]
10:09 <elukey> reboot kafka1020 for kernel upgrades [production]
07:11 <elukey> reboot kafka1014 for kernel upgrades [production]
2017-05-08
09:25 <elukey> rolling restart of cassandra on aqs* hosts to pick up the new JVM upgrade [production]
08:55 <elukey> restart Kafka mirror maker on kafka101[24] [production]
08:47 <elukey> reboot kafka1013 for kernel upgrades [production]
2017-05-07
21:09 <elukey> depooled cp4016.ulsfo.wmnet (sudo -i depool from localhost) due to issues with vhtcpd (segfaults in dmesg). [production]
08:43 <elukey> depooled cp4018.ulsfo.wmnet (sudo -i depool from localhost) due to issues with HTCP [production]
2017-05-05
15:18 <elukey> increase nginx error log verbosity on mw2146 as test for T163674 (correct task) [production]
15:13 <elukey> increase nginx error log verbosity on mw2146 as test for T164586 [production]
12:16 <elukey> reboot kafka1018 for kernel upgrades [production]
09:00 <elukey> re-arm keyholder on mira (new scap key added for librenms) [production]
08:48 <elukey> re-arming keyholder on naos [production]
2017-05-04
10:22 <elukey> executed DEL ocg_job_status on rdb1007:6379 (new ocg_job_status hash is stored on the ocg* hosts) - T159850 [production]
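The DEL above was presumably issued via redis-cli; a minimal sketch (hostname and port as logged, auth options omitted):
    # drop the stale ocg_job_status hash; the live copy is now kept on the ocg* hosts (T159850)
    redis-cli -h rdb1007 -p 6379 DEL ocg_job_status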
09:40 <elukey> stop kafka on kafka1012 and reboot the host for kernel upgrade [production]
2017-05-03
14:12 <elukey> restart kafka-mirror-main-eqiad_to_analytics.service on kafka1012 [production]
09:19 <elukey> reboot mc[1019-1036].eqiad.wmnet for kernel upgrades [production]
2017-05-02
16:16 <elukey@naos> Synchronized wmf-config/ProductionServices.php: Replace Redis lock IPs after hw refresh (duration: 01m 16s) [production]
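The 'Synchronized' line above is scap output; the underlying command was presumably of this shape (message as logged):
    # push a single config file to the MediaWiki fleet and log the sync
    scap sync-file wmf-config/ProductionServices.php 'Replace Redis lock IPs after hw refresh'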
15:01 <elukey> stopped and masked memcached on mc10[01-18].eqiad.wmnet [production]
10:20 <elukey> restart ocg on ocg1002 (localhost:8000 - frontend - not reachable) [production]
08:40 <elukey> run puppet and restart nutcracker on eqiad hosts with profile::mediawiki::nutcracker [production]
08:32 <elukey> stop and mask redis on mc1001-mc1018 - T137345 [production]
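Stopping and masking maps onto two systemctl calls per host; a sketch assuming a plain redis-server unit (multi-instance hosts may use per-port unit names instead):
    # stop redis and prevent it from being started again, even as a dependency
    sudo systemctl stop redis-server
    sudo systemctl mask redis-server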
07:59 <elukey> Swap mc1001->mc1012 with mc1019->mc2030 - T137345 (more informative :) [production]
07:58 <elukey> Swap mc1001->mc1012 with mc1019->mc2030 [production]
2017-04-30
15:31 <elukey> set tombstone_failure_threshold=1000 on restbase1009-a with P5165 - T160759 [production]
15:24 <elukey> set tombstone_failure_threshold=10000 on restbase1009-a with P5165 - T160759 [production]
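tombstone_failure_threshold lives in cassandra.yaml (default 100000); P5165 itself is not reproduced in the log, but the change presumably reduces to something like this sketch (instance config path and unit name are assumptions):
    # lower the tombstone abort threshold for reads, then restart the instance to pick it up
    sudo sed -i 's/^tombstone_failure_threshold:.*/tombstone_failure_threshold: 1000/' /etc/cassandra-a/cassandra.yaml
    sudo systemctl restart cassandra-a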
07:45 <elukey> deleted /srv/cassandra-a/commitlog/CommitLog-5-1490738321543.log from restbase1009-a (empty commit log file created before OOM - backup in /home/elukey) [production]
2017-04-29
10:50 <elukey> set sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=65 on kafka[1018,1020,1022].eqiad.wmnet (was 120 - maybe related to T136094?) [production]
10:39 <elukey> start ferm on kafka1020/18 (nodes were previously down for maintenance, not sure why ferm wasn't started) [production]
2017-04-27
15:56 <elukey> restart of jmxtrans on all the hadoop worker nodes [production]
15:50 <elukey> forced 'service ferm start' on the failed analytics hosts [production]
07:56 <elukey> aqs100[69] back serving AQS traffic [production]
06:50 <elukey> executed kafka preferred-replica-election to rebalance topic leaders in the analytics cluster after maintenance [production]
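'kafka preferred-replica-election' is the WMF wrapper around the stock Kafka tool; the upstream equivalent is roughly as follows (ZooKeeper connection string is an assumption):
    # hand partition leadership back to each partition's preferred replica after brokers rejoin
    kafka-preferred-replica-election.sh --zookeeper localhost:2181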
2017-04-26
20:31 <elukey> restart zookeeper on conf1003 after network maintenance [production]
19:50 <elukey> restart kafka nodes (kafka1018 and kafka1020) after network maintenance [production]
17:46 <elukey> restart nutcracker on the eqiad mw hosts to pick up the new shard config (spamming elasticsearch memcached and triggering alarms) [production]