2024-07-30
14:33 |
<elukey@cumin1002> |
START - Cookbook sre.hosts.provision for host pc1017.mgmt.eqiad.wmnet with reboot policy GRACEFUL |
[production] |
13:30 |
<elukey> |
deprecate the sre-admins posix group fleetwide (replaced by ops-limited) - T360356 |
[production] |
10:08 |
<elukey@cumin1002> |
END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker1240.mgmt.eqiad.wmnet with reboot policy FORCED |
[production] |
10:02 |
<elukey@cumin1002> |
START - Cookbook sre.hosts.provision for host wikikube-worker1240.mgmt.eqiad.wmnet with reboot policy FORCED |
[production] |
08:11 |
<elukey@cumin1002> |
END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1240.mgmt.eqiad.wmnet with reboot policy FORCED |
[production] |
08:05 |
<elukey@cumin1002> |
START - Cookbook sre.hosts.provision for host wikikube-worker1240.mgmt.eqiad.wmnet with reboot policy FORCED |
[production] |
08:03 |
<elukey@cumin1002> |
END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host wikikube-worker1240.mgmt.eqiad.wmnet with reboot policy FORCED |
[production] |
08:02 |
<elukey@cumin1002> |
START - Cookbook sre.hosts.provision for host wikikube-worker1240.mgmt.eqiad.wmnet with reboot policy FORCED |
[production] |
2024-07-26
13:42 |
<elukey> |
move dump_cloud_ip_ranges's write to /srv/private capabilities back to puppetmaster1001 - T368023 |
[production] |
13:19 |
<elukey@cumin1002> |
END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye |
[production] |
13:02 |
<elukey@cumin1002> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage |
[production] |
12:58 |
<elukey@cumin1002> |
START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage |
[production] |
12:42 |
<elukey@cumin1002> |
START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye |
[production] |
10:03 |
<elukey@cumin1002> |
END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1001.eqiad.wmnet with OS bullseye |
[production] |
08:35 |
<elukey@cumin1002> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage |
[production] |
08:32 |
<elukey@cumin1002> |
START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage |
[production] |
08:16 |
<elukey@cumin1002> |
START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye |
[production] |
2024-07-22
16:02 |
<elukey> |
remove /srv/kafka/data/eqiad.resource-purge-3 on kafka-main2001 to force a refetch of data from good replicas and circumvent data corruption - T370574 |
[production] |
15:58 |
<elukey@cumin1002> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kafka-main2001.codfw.wmnet with reason: attempt to remove a data dir on disk |
[production] |
15:57 |
<elukey@cumin1002> |
START - Cookbook sre.hosts.downtime for 1:00:00 on kafka-main2001.codfw.wmnet with reason: attempt to remove a data dir on disk |
[production] |
15:49 |
<elukey@cumin1002> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on kafka-test1006.eqiad.wmnet with reason: attempt to remove a data dir on disk |
[production] |
15:49 |
<elukey@cumin1002> |
START - Cookbook sre.hosts.downtime for 0:30:00 on kafka-test1006.eqiad.wmnet with reason: attempt to remove a data dir on disk |
[production] |
10:24 |
<elukey> |
kafka preferred-replica-election on kafka-main - T370574 |
[production] |
08:32 |
<elukey> |
restart kafka on kafka-main2005 - T370574 |
[production] |
08:31 |
<elukey@cumin1002> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on kafka-main2005.codfw.wmnet with reason: restart attempt |
[production] |
08:30 |
<elukey@cumin1002> |
START - Cookbook sre.hosts.downtime for 0:30:00 on kafka-main2005.codfw.wmnet with reason: restart attempt |
[production] |
08:07 |
<elukey> |
restart kafka on kafka-main2001 - T370574 |
[production] |
08:06 |
<elukey> |
restart kafka on kafka-main2001 - sre.hosts.downtime |
[production] |
08:06 |
<elukey@cumin1002> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on kafka-main2001.codfw.wmnet with reason: restart attempt |
[production] |
08:05 |
<elukey@cumin1002> |
START - Cookbook sre.hosts.downtime for 0:30:00 on kafka-main2001.codfw.wmnet with reason: restart attempt |
[production] |
2024-07-17
09:02 |
<elukey@puppetserver1001> |
conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet |
[production] |
08:57 |
<elukey@cumin1002> |
END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4037.ulsfo.wmnet |
[production] |
08:48 |
<elukey@cumin1002> |
START - Cookbook sre.hosts.reboot-single for host cp4037.ulsfo.wmnet |
[production] |
08:47 |
<elukey@puppetserver1001> |
conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet |
[production] |
07:49 |
<elukey> |
restart hadoop-mapreduce-historyserver.service on an-master1003 - failed for Java OOM |
[production] |
07:38 |
<elukey@cumin1002> |
END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-d1-codfw |
[production] |
07:36 |
<elukey@cumin1002> |
START - Cookbook sre.network.tls for network device lsw1-d1-codfw |
[production] |