2020-11-21 §
09:18 <joal> Drop historical logs of 'Wikidata Concepts Monitor ETL' on HDFS keeping one example - freeing 60TB [production]
09:17 <joal> Drop historical logs of ' [production]
08:28 <ariel@deploy1001> Finished deploy [dumps/dumps@1a76a9a]: revinfo updates (duration: 00m 05s) [production]
08:28 <ariel@deploy1001> Started deploy [dumps/dumps@1a76a9a]: revinfo updates [production]
08:10 <elukey> remove big stderrlog file in /var/lib/hadoop/data/d/yarn/logs/application_1605880843685_1450 on an-worker1110 [production]
08:05 <elukey> remove big stderrlog file in /var/lib/hadoop/data/e/yarn/logs/application_1605880843685_1450 on an-worker1105 [production]
2020-11-20 §
23:38 <mutante> synced puppet-compiler facts - new hosts should be usable in compiler [production]
22:30 <mutante> cumin1001 - sudo systemctl start cumin-check-aliases -> <+icinga-wm> RECOVERY - Check systemd state on cumin1001 is OK T268369 [production]
21:30 <razzi@cumin1001> END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [production]
20:26 <razzi@cumin1001> START - Cookbook sre.ganeti.makevm [production]
20:09 <razzi@cumin1001> END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [production]
19:52 <mutante> releases2002 - systemctl disable wmf_auto_restart_rsync; rm /usr/lib/systemd/system/wmf_auto_restart_rsync.* ; systemctl daemon-reload ; systemctl reset-failed - clear up systemd unit that was not absented and fix Icinga alerts [production]
19:45 <mutante> releases2002 systemctl reset-failed (wmf_auto_restart_rsync.service failed but hopefully fixed) [production]
19:39 <mutante> Icinga: ACKing all the "unhandled CRIT" alerts on clouddb* and an-coord* that have disabled notifications to remove monitoring noise. From 72 to 25 active alerts [production]
19:14 <razzi@cumin1001> START - Cookbook sre.ganeti.makevm [production]
18:47 <elukey@cumin1001> END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [production]
18:42 <elukey@cumin1001> START - Cookbook sre.hosts.decommission [production]
18:37 <elukey@cumin1001> END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [production]
18:36 <razzi@cumin1001> END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [production]
18:31 <elukey@cumin1001> START - Cookbook sre.hosts.decommission [production]
18:31 <elukey@cumin1001> END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [production]
18:18 <elukey@cumin1001> START - Cookbook sre.hosts.decommission [production]
18:14 <dwisehaupt> shifting 100% of thank_you mail through frmxs ahead of tomorrow's banner test - T267259 [production]
17:37 <pt1979@cumin2001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [production]
17:35 <pt1979@cumin2001> START - Cookbook sre.hosts.downtime [production]
17:32 <razzi@cumin1001> START - Cookbook sre.ganeti.makevm [production]
17:24 <razzi@cumin1001> END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [production]
16:48 <volans@cumin1001> END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [production]
16:40 <volans@cumin1001> START - Cookbook sre.hosts.decommission [production]
16:29 <razzi@cumin1001> START - Cookbook sre.ganeti.makevm [production]
16:29 <razzi@cumin1001> END (ERROR) - Cookbook sre.ganeti.makevm (exit_code=97) [production]
16:28 <razzi> removed canceled ip address records for kafka-test1002 from netbox [production]
16:11 <pt1979@cumin2001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [production]
16:09 <pt1979@cumin2001> START - Cookbook sre.hosts.downtime [production]
16:01 <razzi@cumin1001> START - Cookbook sre.ganeti.makevm [production]
16:01 <razzi@cumin1001> END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [production]
15:42 <razzi@cumin1001> START - Cookbook sre.ganeti.makevm [production]
15:09 <andrew@cumin1001> END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [production]
15:01 <andrew@cumin1001> START - Cookbook sre.hosts.decommission [production]
14:59 <andrew@cumin1001> END (ERROR) - Cookbook sre.hosts.decommission (exit_code=97) [production]
14:58 <andrew@cumin1001> START - Cookbook sre.hosts.decommission [production]
14:30 <elukey> force umount/mount for /mnt/hdfs on all stat1* nodes to pick up new openjdk settings [production]
14:28 <elukey@cumin1001> END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) [production]
14:00 <elukey> restart hadoop daemons on an-master[1001-1002] (Hadoop masters) to pick up new rack settings and openjdk upgrades [production]
13:59 <elukey@cumin1001> START - Cookbook sre.hadoop.roll-restart-masters [production]
13:34 <liw> finished trying to test scap on beta cluster [production]
13:24 <bblack> cp*: remove remnants of expiring globalsign-2019 unified cert, including ocsp config+outputs [production]
13:12 <liw> testing upcoming Scap release on beta [production]
13:00 <bblack> dns*: upgrade remainder of fleet to gdnsd 3.4.1 [production]
12:54 <elukey@cumin1001> END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) [production]