2023-07-07
11:02 <aborrero@cumin1001> START - Cookbook sre.dns.netbox [production]
10:28 <taavi> backfilling {project}.wmcloud.org and other currently-named DNS zones to projects that don't have them [admin]
10:13 <moritzm> rebooting puppetdb1003 [production]
10:09 <moritzm> rebooting puppetserver1001 [production]
10:07 <wm-bot> <sebastian-berlin-wmse> Deploy code with reverted M2C changes (a7fb483) in order to debug errors on tools.isa. Started from scratch using python3.11 and kubernetes, and a copy of the database on tools.isa. [tools.isa-dev]
10:06 <jmm@cumin2002> END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host puppetdb2003.codfw.wmnet [production]
10:05 <moritzm> rebooting puppetserver2001 [production]
10:05 <jiji@deploy1002> helmfile [staging] DONE helmfile.d/services/ipoid: apply [production]
10:03 <jiji@deploy1002> helmfile [staging] START helmfile.d/services/ipoid: apply [production]
09:59 <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow1002.eqiad.wmnet [production]
09:56 <btullis> `sudo systemctl start hadoop-hdfs-namenode.service` on an-master1001 [analytics]
09:55 <jmm@cumin2002> START - Cookbook sre.hosts.reboot-single for host puppetdb2003.codfw.wmnet [production]
09:55 <jmm@cumin2002> START - Cookbook sre.hosts.reboot-single for host netflow1002.eqiad.wmnet [production]
09:52 <jmm@cumin2002> END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host debmonitor2003.codfw.wmnet [production]
09:52 <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow2003.codfw.wmnet [production]
09:46 <jmm@cumin2002> START - Cookbook sre.hosts.reboot-single for host netflow2003.codfw.wmnet [production]
09:46 <jmm@cumin2002> START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet [production]
09:45 <stevemunene@cumin1001> END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) restart masters for Hadoop analytics cluster: Restart of jvm daemons. [production]
09:39 <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet [production]
09:37 <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow6001.drmrs.wmnet [production]
09:35 <jmm@cumin2002> START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet [production]
09:34 <jmm@cumin2002> END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host lists1003.wikimedia.org [production]
09:33 <jmm@cumin2002> START - Cookbook sre.hosts.reboot-single for host netflow6001.drmrs.wmnet [production]
09:29 <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow6001.drmrs.wmnet [production]
09:29 <stevemunene@cumin1001> START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop analytics cluster: Restart of jvm daemons. [production]
09:28 <stevemunene> running sre.hadoop.roll-restart-masters to restart the masters and completely remove any reference to analytics[1058-1069] T317861 [analytics]
09:26 <jmm@cumin2002> START - Cookbook sre.hosts.reboot-single for host netflow6001.drmrs.wmnet [production]
09:24 <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow3002.esams.wmnet [production]
09:24 <jmm@cumin2002> START - Cookbook sre.hosts.reboot-single for host lists1003.wikimedia.org [production]
09:20 <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people1004.eqiad.wmnet [production]
09:19 <jmm@cumin2002> START - Cookbook sre.hosts.reboot-single for host people1004.eqiad.wmnet [production]
09:19 <jmm@cumin2002> START - Cookbook sre.hosts.reboot-single for host netflow3002.esams.wmnet [production]
09:18 <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow5002.eqsin.wmnet [production]
09:17 <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people2003.codfw.wmnet [production]
09:15 <stevemunene> run puppet on hadoop masters to pick up changes from recently decommissioned hosts [analytics]
09:13 <jmm@cumin2002> START - Cookbook sre.hosts.reboot-single for host people2003.codfw.wmnet [production]
09:12 <jmm@cumin2002> START - Cookbook sre.hosts.reboot-single for host netflow5002.eqsin.wmnet [production]
08:53 <btullis@deploy1002> helmfile [staging] DONE helmfile.d/services/datahub: sync on main [production]
08:50 <btullis@deploy1002> helmfile [staging] START helmfile.d/services/datahub: apply on main [production]
08:48 <moritzm> installing bookworm kernel updates [production]
08:47 <jmm@cumin2002> END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: xhgui2002.codfw.wmnet [production]
08:47 <jmm@cumin2002> START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: xhgui2002.codfw.wmnet [production]
08:46 <jmm@cumin2002> END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: xhgui1002.eqiad.wmnet [production]
08:46 <jmm@cumin2002> START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: xhgui1002.eqiad.wmnet [production]
08:12 <elukey> wipe kafka-test cluster (data + zookeeper config) to start clean after yesterday's issue [analytics]
08:05 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on kafka-test[1006-1010].eqiad.wmnet with reason: resetting cluster [production]
08:05 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 0:30:00 on kafka-test[1006-1010].eqiad.wmnet with reason: resetting cluster [production]
01:55 <bking@cumin1001> END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [production]
00:28 <bking@cumin1001> START - Cookbook sre.wdqs.data-transfer [production]
2023-07-06
23:14 <mutante> mx1001 - rm /usr/local/bin/otrs_aliases ; rm /lib/systemd/system/generate_otrs_aliases.* after deploying gerrit:932316 which renamed script and timer without absenting them [production]