production SAL


2018-08-29 §
06:03	<elukey>	migrate archiva.wikimedia.org to archiva1001 (upgrading archiva to its latest upstream version + Debian Stretch + Java 8)	[production]
2018-08-27 §
07:44	<elukey>	force remount of /mnt/hdfs on stat1005 (transport not connected errors)	[production]
2018-08-24 §
11:58	<elukey>	upgrade nodejs packages on aqs* for security upgrade (rolling restart of aqs daemon included)	[production]
2018-08-23 §
10:15	<elukey>	restart druid-broker on druid100[4-5] due to unresponsiveness (still unclear why)	[production]
2018-08-22 §
14:57	<elukey>	upload archiva 2.2.3-2 to stretch-wikimedia	[production]
14:56	<elukey>	update pcc's facts	[production]
10:43	<elukey>	upload archiva 2.2.3-1 to stretch-wikimedia/main - T192639	[production]
2018-08-01 §
14:52	<elukey>	created user 'research' on db1108 (eventlogging slave - only select grant for the log database)	[production]
08:38	<elukey>	restart hadoop-yarn-nodemanager on analytics10[31-77] to apply the new memory settings	[production]
07:59	<elukey>	restart hadoop-yarn-nodemanager on analytics10[28-30] to test new memory settings	[production]
06:59	<elukey>	restart eventlogging on eventlog1002 to pick up new logging settings	[production]
06:58	<elukey@deploy1001>	Finished deploy [eventlogging/analytics@762ca2b]: Deploy https://gerrit.wikimedia.org/r/#/c/eventlogging/+/449422/ (duration: 00m 07s)	[production]
06:58	<elukey@deploy1001>	Started deploy [eventlogging/analytics@762ca2b]: Deploy https://gerrit.wikimedia.org/r/#/c/eventlogging/+/449422/	[production]
2018-07-31 §
16:40	<elukey>	repool druid1005 after network maintenance	[production]
16:25	<elukey>	pool eventbus on kafka1002 after network maintenance	[production]
16:16	<elukey>	precautionary restart of eventbus on kafka1002 after network downtime (DNS name resolution errors, Kafka broker connection issues, etc.)	[production]
2018-07-30 §
08:07	<elukey@deploy1001>	Finished deploy [eventlogging/analytics@54d43e4]: Band aid for T200630 (duration: 00m 05s)	[production]
08:07	<elukey@deploy1001>	Started deploy [eventlogging/analytics@54d43e4]: Band aid for T200630	[production]
2018-07-28 §
17:35	<elukey>	restart eventlogging on eventlog1002 after tons of kafka disconnects (still not clear what happened)	[production]
2018-07-27 §
14:18	<elukey>	execute echo 'https://wikimania.wikimedia.org' | mwscript purgeList.php on mwmain1001	[production]
2018-07-24 §
17:17	<elukey>	restart eventstreams on scb2* nodes (hopefully last time before deploying the fix) to avoid mem leaks issues during the EU night	[production]
08:21	<elukey>	rolling restart of kafka jumbo/main-(eqiad|codfw) clusters to pick up the new max open files limit (infinity -> 128k)	[production]
2018-07-23 §
14:53	<elukey>	delete empty/not-used/wrongly-created topics in Kafka main-eqiad - T199510	[production]
2018-07-22 §
16:15	<elukey>	rolling restart of eventstreams on scb2* nodes to reduce the memory pressure before the weekend (still waiting for a permanent fix)	[production]
2018-07-21 §
17:35	<elukey>	rolling restart of eventstreams on scb2* nodes to reduce the memory pressure before the weekend (still waiting for a permanent fix)	[production]
2018-07-20 §
15:33	<elukey>	rolling restart of eventstreams on scb2* nodes to reduce the memory pressure before the weekend (still waiting for a permanent fix)	[production]
15:19	<elukey>	powercycle ms-be1016 - RAID errors, no ssh available, I/O errors in com2 console	[production]
2018-07-19 §
19:04	<elukey>	roll restart eventstreams on scb2* hosts to prevent OOM issues over the EU night - T199813	[production]
14:17	<elukey>	roll restart kafka on kafka-jumbo* and kafka main-codfw (kafka2*) to pick up new Xmx/Xms settings (1g -> 2g)	[production]
13:57	<elukey>	restart kafka on kafka-jumbo1001 to raise Xmx/Xms jvm settings (1g -> 2g)	[production]
08:08	<elukey>	restart eventstreams on scb2* hosts to pick up new Kafka settings (pointing it to main-codfw) - T199813	[production]
2018-07-18 §
12:57	<elukey>	restart eventstreams on scb200[5,6] as precautionary after mem consumption too high	[production]
12:57	<elukey>	restart eventstreams on scb2003 as precautionary after mem consumption too high	[production]
12:56	<elukey>	restart eventstreams on scb2001 as precautionary after mem consumption too high (still investigating a fix)	[production]
12:24	<elukey>	manually added queued.max.messages.kbytes: 65535 to eventstreams on scb2002 as test for T199813	[production]
08:31	<elukey>	drain + reboot analytics1030 for kernel updates	[production]
2018-07-17 §
18:33	<elukey>	rolling restart eventstreams on scb2* nodes to avoid OOMs during the EU night	[production]
2018-07-13 §
06:50	<elukey>	powercycle ms-be1041 after diagnostic tests	[production]
06:42	<elukey>	unblocked stuck dpkg processes on an107[2,5] that broke puppet	[production]
2018-07-12 §
06:26	<elukey>	restart rsyslog on wezen - T199406	[production]
2018-07-11 §
21:53	<elukey>	restart rsyslog on lithium - in:imtcp stuck in EAGAIN (Resource temporarily unavailable) due to an old socket to tegmen.wikimedia.org	[production]
21:47	<elukey>	re-enable kafka mirror maker on kafka100[1-3]	[production]
21:16	<elukey>	starting kafka on kafka100[1-3] after zk cleanup	[production]
19:31	<elukey>	cleaned up change-prop.retry.change-prop.retry in /srv/kafka/data on kafka100[1-3]	[production]
18:34	<elukey>	restarted topic nuke script for kafka main	[production]
18:14	<elukey>	start kafka on kafka1002	[production]
18:14	<elukey>	stop mirror makers on kafka100[1-3]	[production]
17:31	<elukey>	restart kafka on kafka1001 (oom registered)	[production]
17:12	<elukey>	restart kafka on kafka1003 with 2G heap settings	[production]
17:10	<elukey>	restart kafka on kafka1002 with 2G heap settings	[production]