2024-03-22
|
14:40 |
<eoghan@cumin1002> |
START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org |
[production] |
14:37 |
<eoghan@cumin1002> |
END (FAIL) - Cookbook sre.gitlab.failover (exit_code=93) Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org |
[production] |
14:37 |
<eoghan@cumin1002> |
START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org |
[production] |
14:35 |
<eoghan@cumin1002> |
END (FAIL) - Cookbook sre.gitlab.failover (exit_code=93) Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org |
[production] |
14:35 |
<eoghan@cumin1002> |
START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org |
[production] |
14:35 |
<eoghan@cumin1002> |
END (ERROR) - Cookbook sre.gitlab.failover (exit_code=93) Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org |
[production] |
14:20 |
<urandom> |
restarting Cassandra decommission of restbase1024-{b,c} — T360548 |
[production] |
14:11 |
<topranks> |
disabling LAG from asw-b-codfw to ssw-aX-codfw T360776 |
[production] |
14:07 |
<cmooney@cumin1002> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on asw-b-codfw with reason: prepping to decom switch stack |
[production] |
14:07 |
<cmooney@cumin1002> |
START - Cookbook sre.hosts.downtime for 4:00:00 on asw-b-codfw with reason: prepping to decom switch stack |
[production] |
13:31 |
<brouberol@deploy1002> |
helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. |
[production] |
13:31 |
<brouberol@deploy1002> |
helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. |
[production] |
13:29 |
<logmsgbot> |
@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply |
[production] |
13:29 |
<logmsgbot> |
@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply |
[production] |
13:28 |
<brouberol@deploy1002> |
helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. |
[production] |
13:28 |
<brouberol@deploy1002> |
helmfile [staging-codfw] START helmfile.d/admin 'apply'. |
[production] |
13:23 |
<logmsgbot> |
@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply |
[production] |
13:23 |
<logmsgbot> |
@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply |
[production] |
13:17 |
<logmsgbot> |
@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply |
[production] |
13:17 |
<elukey> |
`elukey@cumin1002:~$ sudo cumin 'stat100[4,5,8,9]*' 'kill `pgrep -u kcv-wikimf`'` to unblock puppet on various stat nodes |
[production] |
13:17 |
<logmsgbot> |
@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply |
[production] |
13:07 |
<logmsgbot> |
@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply |
[production] |
13:07 |
<logmsgbot> |
@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply |
[production] |
13:06 |
<brouberol@deploy1002> |
helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. |
[production] |
13:06 |
<brouberol@deploy1002> |
helmfile [staging-codfw] START helmfile.d/admin 'apply'. |
[production] |
12:44 |
<eoghan@cumin1002> |
START - Cookbook sre.gitlab.failover Failover of gitlab from gitlab1004.wikimedia.org to gitlab1003.wikimedia.org |
[production] |
12:35 |
<logmsgbot> |
@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply |
[production] |
12:35 |
<logmsgbot> |
@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply |
[production] |
12:17 |
<logmsgbot> |
@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply |
[production] |
12:17 |
<logmsgbot> |
@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply |
[production] |
12:03 |
<reedy@deploy1002> |
Synchronized php-1.42.0-wmf.23/includes/htmlform/fields/HTMLHiddenField.php: T360717 (duration: 13m 06s) |
[production] |
11:55 |
<logmsgbot> |
@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply |
[production] |
11:55 |
<logmsgbot> |
@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply |
[production] |
11:52 |
<logmsgbot> |
@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply |
[production] |
11:52 |
<logmsgbot> |
@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply |
[production] |
11:39 |
<logmsgbot> |
@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply |
[production] |
11:39 |
<logmsgbot> |
@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply |
[production] |
10:59 |
<btullis@cumin1002> |
END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-worker1168.eqiad.wmnet |
[production] |
10:59 |
<btullis@cumin1002> |
START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-worker1168.eqiad.wmnet |
[production] |
10:56 |
<klausman@deploy1002> |
helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . |
[production] |
10:55 |
<logmsgbot> |
@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply |
[production] |
10:54 |
<logmsgbot> |
@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply |
[production] |
10:47 |
<logmsgbot> |
@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply |
[production] |
10:47 |
<logmsgbot> |
@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply |
[production] |
10:39 |
<btullis@cumin1002> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-worker1168.eqiad.wmnet with reason: Investigating disk errors |
[production] |
10:38 |
<btullis@cumin1002> |
START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-worker1168.eqiad.wmnet with reason: Investigating disk errors |
[production] |
10:36 |
<btullis@cumin1002> |
END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-worker1168.eqiad.wmnet |
[production] |
10:36 |
<btullis@cumin1002> |
START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-worker1168.eqiad.wmnet |
[production] |
10:34 |
<btullis@cumin1002> |
END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-worker1168.eqiad.wmnet |
[production] |
10:34 |
<btullis@cumin1002> |
START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-worker1168.eqiad.wmnet |
[production] |