production SAL

2020-11-21
09:18	<joal>	Drop historical logs of 'Wikidata Concepts Monitor ETL' on HDFS keeping one example - freeing 60Tb	[production]
09:17	<joal>	Drop historical logs of '	[production]
08:28	<ariel@deploy1001>	Finished deploy [dumps/dumps@1a76a9a]: revinfo updates (duration: 00m 05s)	[production]
08:28	<ariel@deploy1001>	Started deploy [dumps/dumps@1a76a9a]: revinfo updates	[production]
08:10	<elukey>	remove big stderrlog file in /var/lib/hadoop/data/d/yarn/logs/application_1605880843685_1450 on an-worker1110	[production]
08:05	<elukey>	remove big stderrlog file in /var/lib/hadoop/data/e/yarn/logs/application_1605880843685_1450 on an-worker1105	[production]
2020-11-20
23:38	<mutante>	synced puppet-compiler facts - new hosts should be usable in compiler	[production]
22:30	<mutante>	cumin1001 - sudo systemctl start cumin-check-aliases -> <+icinga-wm> RECOVERY - Check systemd state on cumin1001 is OK T268369	[production]
21:30	<razzi@cumin1001>	END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)	[production]
20:26	<razzi@cumin1001>	START - Cookbook sre.ganeti.makevm	[production]
20:09	<razzi@cumin1001>	END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)	[production]
19:52	<mutante>	releases2002 - systemctl disable wmf_auto_restart_rsync; rm /usr/lib/systemd/system/wmf_auto_restart_rsync.* ; systemctl daemon-reload ; systemctl reset-failed - clear up systemd unit that was not absented and fix Icinga alerts	[production]
19:45	<mutante>	releases2002 systemctl reset-failed (wmf_auto_restart_rsync.service failed but hopefully fixed)	[production]
19:39	<mutante>	Icinga: ACKing all the "unhandled CRIT" alerts on clouddb* and an-coord* that have disabled notifications, to remove monitoring noise. Active alerts went from 72 to 25	[production]
19:14	<razzi@cumin1001>	START - Cookbook sre.ganeti.makevm	[production]
18:47	<elukey@cumin1001>	END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)	[production]
18:42	<elukey@cumin1001>	START - Cookbook sre.hosts.decommission	[production]
18:37	<elukey@cumin1001>	END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)	[production]
18:36	<razzi@cumin1001>	END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)	[production]
18:31	<elukey@cumin1001>	START - Cookbook sre.hosts.decommission	[production]
18:31	<elukey@cumin1001>	END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)	[production]
18:18	<elukey@cumin1001>	START - Cookbook sre.hosts.decommission	[production]
18:14	<dwisehaupt>	shifting 100% of thank_you mail through frmxs ahead of tomorrow's banner test - T267259	[production]
17:37	<pt1979@cumin2001>	END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)	[production]
17:35	<pt1979@cumin2001>	START - Cookbook sre.hosts.downtime	[production]
17:32	<razzi@cumin1001>	START - Cookbook sre.ganeti.makevm	[production]
17:24	<razzi@cumin1001>	END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)	[production]
16:48	<volans@cumin1001>	END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)	[production]
16:40	<volans@cumin1001>	START - Cookbook sre.hosts.decommission	[production]
16:29	<razzi@cumin1001>	START - Cookbook sre.ganeti.makevm	[production]
16:29	<razzi@cumin1001>	END (ERROR) - Cookbook sre.ganeti.makevm (exit_code=97)	[production]
16:28	<razzi>	removed canceled ip address records for kafka-test1002 from netbox	[production]
16:11	<pt1979@cumin2001>	END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)	[production]
16:09	<pt1979@cumin2001>	START - Cookbook sre.hosts.downtime	[production]
16:01	<razzi@cumin1001>	START - Cookbook sre.ganeti.makevm	[production]
16:01	<razzi@cumin1001>	END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)	[production]
15:42	<razzi@cumin1001>	START - Cookbook sre.ganeti.makevm	[production]
15:09	<andrew@cumin1001>	END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)	[production]
15:01	<andrew@cumin1001>	START - Cookbook sre.hosts.decommission	[production]
14:59	<andrew@cumin1001>	END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97)	[production]
14:58	<andrew@cumin1001>	START - Cookbook sre.hosts.decommission	[production]
14:30	<elukey>	force umount/mount for /mnt/hdfs on all stat1* nodes to pick up new openjdk settings	[production]
14:28	<elukey@cumin1001>	END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0)	[production]
14:00	<elukey>	restart hadoop daemons on an-master[1001-1002] (Hadoop masters) to pick up new rack settings and openjdk upgrades	[production]
13:59	<elukey@cumin1001>	START - Cookbook sre.hadoop.roll-restart-masters	[production]
13:34	<liw>	finished trying to test scap on beta cluster	[production]
13:24	<bblack>	cp*: remove remnants of expiring globalsign-2019 unified cert, including ocsp config+outputs	[production]
13:12	<liw>	testing upcoming Scap release on beta	[production]
13:00	<bblack>	dns*: upgrade remainder of fleet to gdnsd 3.4.1	[production]
12:54	<elukey@cumin1001>	END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0)	[production]