2019-05-16 §
05:34 <elukey> roll restart of nutcracker on mw2* to pick up new config changes (no more memcached config) - T214275 [production]
2019-05-15 §
17:09 <elukey> powerup elastic2038 (was down for maintenance) [production]
16:50 <elukey> restart Hadoop HDFS namenodes on an-master100[1,2] to pick up new settings [production]
16:28 <elukey> restart nutcracker on mw2240 to pick up the new config (no more memcached settings) [production]
10:31 <elukey> superset.wikimedia.org moved to analytics-tool1004 (Buster + python 3.7 + Superset 0.32 upgrade) [production]
10:04 <elukey@deploy1001> Finished deploy [analytics/superset/deploy@9cdb9c5]: Superset 0.32 - update pyhive dependency (duration: 00m 26s) [production]
10:04 <elukey@deploy1001> Started deploy [analytics/superset/deploy@9cdb9c5]: Superset 0.32 - update pyhive dependency [production]
08:45 <elukey@deploy1001> Finished deploy [analytics/superset/deploy@31c2c30]: Superset 0.32 (duration: 00m 26s) [production]
08:44 <elukey@deploy1001> Started deploy [analytics/superset/deploy@31c2c30]: Superset 0.32 [production]
08:36 <elukey> stop superset on analytics-tool1003 as prep step for the migration to the new host - T212243 [production]
07:33 <elukey> restart nutcracker on mw2245 to pick up config changes (removal of memcached config) [production]
07:29 <elukey> powercycle an-worker1094 (OEM event occurred, checking if temporary) [production]
06:24 <elukey> force remount of /mnt/hdfs on stat1007 - fuse hdfs stuck [production]
2019-05-13 §
14:00 <elukey> roll restart of aqs on aqs1* to pick up new druid settings [production]
07:08 <elukey> slow roll restart of celery on ores* nodes to allow cores to be generated upon segfault - T222866 [production]
2019-05-12 §
15:32 <elukey> rollback python-kafka on eventlog1002 to 1.4.1-1~stretch1 - T222941 [production]
12:14 <elukey> restart eventlogging on eventlog1002 - all processors stuck due to kafka-python (T222941) [production]
2019-05-11 §
06:37 <elukey> restart eventlogging on eventlog1002 - huge kafka consumer lag accumulated (T222941) [production]
2019-05-10 §
05:40 <elukey> execute kafka preferred-replica-election on kafka-jumbo1001 as an attempt to rebalance traffic (1002 seems to be handling way more than the others for some days) [production]
05:32 <elukey> restart eventlogging daemons on eventlog1002 - kafka consumer errors in the logs, some lag built over time [production]
2019-05-09 §
08:23 <elukey> upload uwsgi 2.0.14+20161117-3+deb9u2+wmf1 packages to stretch-wikimedia - T212697 [production]
07:50 <elukey> roll restart HDFS masters on an-master100[1,2] to pick up new logging settings [production]
2019-05-08 §
09:24 <elukey> install uwsgi-core_2.0.14+20161117-3+deb9u2+wmf1 on netmon2001 to test a uwsgi bug fix - T212697 [production]
07:45 <elukey> install uwsgi-core_2.0.14+20161117-3+deb9u2+wmf1 on netmon1002 to test a uwsgi bug fix - T212697 [production]
06:29 <elukey> restart uwsgi-netbox on netmon1002 after the daily segfault (upon restart) [production]
2019-05-07 §
06:44 <elukey> restart uwsgi-netbox on netmon1002 after segfault [production]
2019-05-06 §
17:19 <elukey> restart netbox on netmon1002 as test [production]
09:35 <elukey> restart netbox on netmon1002 (trying to reproduce the segfault) - T212697 [production]
2019-05-05 §
14:42 <elukey> restart pdfrender on scb1004 [production]
2019-05-01 §
17:59 <elukey> force remount of /mnt/hdfs on notebook1003 (fuse hdfs got stuck) [production]
2019-04-30 §
15:45 <elukey> restart hadoop hdfs namenodes on an-master100[1,2] to pick up new logging settings - T220702 [production]
12:34 <elukey> moved /home to /srv/home (more space in a dedicated partition) on stat1005 [production]
09:02 <elukey> roll restart hdfs namenodes on an-master100[1,2] to pick up new settings - T220702 [production]
2019-04-29 §
08:33 <elukey> restart keyholder on deploy1001 + rearm keys [production]
08:28 <elukey> restart keyholder-proxy on deploy1001 (attempt to see if new analytics scap settings got applied) [production]
2019-04-27 §
17:44 <elukey> restart pdfrender on scb1002 (alert flapping) [production]
2019-04-26 §
08:42 <elukey> restart pdfrender on scb1003 (alert flapping) [production]
2019-04-24 §
06:38 <elukey> restart pdfrender on scb1003 [production]
2019-04-23 §
09:19 <elukey> dumping Kafka consumer offset history on logstash1012 for T221202 [production]
05:52 <elukey> powercycle wtp2019 - no ssh, mgmt console stuck [production]
2019-04-19 §
06:39 <elukey> roll restart of druid daemons on druid100[1-3] to pick up new jvm settings [production]
2019-04-18 §
13:08 <elukey> roll restart of cassandra on aqs* to pick up new openjdk upgrades [production]
08:54 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [production]
08:54 <elukey@cumin1001> START - Cookbook sre.hosts.downtime [production]
08:54 <elukey@cumin1001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [production]
08:54 <elukey@cumin1001> START - Cookbook sre.hosts.downtime [production]
08:53 <elukey> reboot kafka10[12-23] (old Analytics cluster) for kernel + openjdk upgrades [production]
2019-04-17 §
14:13 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [production]
14:12 <elukey@cumin1001> START - Cookbook sre.hosts.downtime [production]
13:52 <elukey> upgrading the Hadoop CDH distribution to 5.16.1 on all the Hadoop-related nodes - T218343 [production]