2021-04-29
§
|
04:27 |
<marostegui@cumin1001> |
dbctl commit (dc=all): 'Depool db1118 for reimage', diff saved to https://phabricator.wikimedia.org/P15623 and previous config saved to /var/cache/conftool/dbconfig/20210429-042757-marostegui.json |
[production] |
02:59 |
<milimetric@deploy1002> |
Finished deploy [analytics/refinery@740226b] (thin): Hotfix for referrer job (duration: 00m 06s) |
[production] |
02:59 |
<milimetric@deploy1002> |
Started deploy [analytics/refinery@740226b] (thin): Hotfix for referrer job |
[production] |
02:58 |
<milimetric@deploy1002> |
Finished deploy [analytics/refinery@740226b]: Hotfix for referrer job (duration: 14m 40s) |
[production] |
02:44 |
<milimetric@deploy1002> |
Started deploy [analytics/refinery@740226b]: Hotfix for referrer job |
[production] |
01:44 |
<krinkle@deploy1002> |
Synchronized wmf-config/mc.php: I5869b3c3ba4a (duration: 01m 08s) |
[production] |
01:23 |
<ryankemper> |
T280382 `sudo -i wmf-auto-reimage-host -p T280382 --new wdqs1004.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` |
[production] |
01:21 |
<ryankemper@cumin1001> |
END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) |
[production] |
01:21 |
<ryankemper@cumin1001> |
START - Cookbook sre.wdqs.data-transfer |
[production] |
01:20 |
<ryankemper@cumin1001> |
END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) |
[production] |
01:20 |
<ryankemper@cumin1001> |
START - Cookbook sre.wdqs.data-transfer |
[production] |
01:19 |
<ryankemper@cumin1001> |
END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) |
[production] |
01:19 |
<ryankemper@cumin1001> |
START - Cookbook sre.wdqs.data-transfer |
[production] |
01:19 |
<ryankemper> |
T280382 Aborted data transfer; `wdqs2007` is hosed (see https://phabricator.wikimedia.org/T281437) |
[production] |
01:18 |
<ryankemper@cumin1001> |
END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) |
[production] |
00:40 |
<tstarling@deploy1002> |
Synchronized php-1.37.0-wmf.3/includes/specials/pagers/ImageListPager.php: T281405 (duration: 01m 08s) |
[production] |
00:11 |
<ryankemper> |
T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs1004.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` |
[production] |
00:06 |
<ryankemper> |
T280382 `wdqs1013.eqiad.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred. `df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/mapper/vg0-srv 2.7T 998G 1.6T 39% /srv` |
[production] |
2021-04-28
§
|
23:42 |
<ryankemper@cumin1001> |
END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) |
[production] |
23:38 |
<robh@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5016.eqsin.wmnet with reason: REIMAGE |
[production] |
23:36 |
<robh@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on cp5016.eqsin.wmnet with reason: REIMAGE |
[production] |
23:36 |
<robh@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5015.eqsin.wmnet with reason: REIMAGE |
[production] |
23:34 |
<robh@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5014.eqsin.wmnet with reason: REIMAGE |
[production] |
23:33 |
<robh@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on cp5015.eqsin.wmnet with reason: REIMAGE |
[production] |
23:32 |
<robh@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on cp5014.eqsin.wmnet with reason: REIMAGE |
[production] |
23:06 |
<dpifke@deploy1002> |
Finished deploy [performance/navtiming@cf8b2e9]: Deploying https://gerrit.wikimedia.org/r/c/performance/navtiming/+/682886 (duration: 00m 05s) |
[production] |
23:06 |
<dpifke@deploy1002> |
Started deploy [performance/navtiming@cf8b2e9]: Deploying https://gerrit.wikimedia.org/r/c/performance/navtiming/+/682886 |
[production] |
22:44 |
<dwisehaupt> |
civiproxy revision changed to 99cecb924a - initial rollout of code for testing |
[production] |
22:26 |
<ryankemper> |
T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1013.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `reimage` |
[production] |
22:26 |
<ryankemper@cumin1001> |
START - Cookbook sre.wdqs.data-transfer |
[production] |
22:23 |
<ryankemper@cumin1001> |
END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) |
[production] |
22:18 |
<ryankemper> |
T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1013.eqiad.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `reimage` |
[production] |
22:18 |
<ryankemper@cumin1001> |
START - Cookbook sre.wdqs.data-transfer |
[production] |
22:18 |
<robh@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE |
[production] |
22:15 |
<robh@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE |
[production] |
21:49 |
<legoktm@deploy1002> |
helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. |
[production] |
21:49 |
<legoktm@deploy1002> |
helmfile [staging-eqiad] START helmfile.d/admin 'apply'. |
[production] |
21:47 |
<legoktm@deploy1002> |
helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. |
[production] |
21:46 |
<robh@cumin1001> |
END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE |
[production] |
21:44 |
<legoktm@deploy1002> |
helmfile [staging-codfw] START helmfile.d/admin 'apply'. |
[production] |
21:44 |
<robh@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE |
[production] |
21:41 |
<ryankemper@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1013.eqiad.wmnet with reason: REIMAGE |
[production] |
21:39 |
<ryankemper@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1013.eqiad.wmnet with reason: REIMAGE |
[production] |
21:39 |
<ryankemper@cumin1001> |
START - Cookbook sre.wdqs.data-transfer |
[production] |
21:39 |
<ryankemper@cumin1001> |
END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) |
[production] |
21:38 |
<ryankemper> |
T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `reimage` |
[production] |
21:37 |
<ryankemper> |
T280382 `wdqs2007` is reachable again; glancing at `/srv/wdqs` its `wikidata.jnl` is `839G` when it should be `975G` so I'll re-do the wikidata journal transfer |
[production] |
21:32 |
<ryankemper> |
T280382 [WDQS] `wdqs2007` ssh is unreachable; power cycling via `racadm>>racadm serveraction powercycle` |
[production] |
21:24 |
<ryankemper> |
T280382 `sudo -i wmf-auto-reimage-host -p T280382 --new wdqs1013.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` (previous reimage timed out, instance appears to have rebooted) |
[production] |
21:07 |
<robh@cumin1001> |
END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5016.eqsin.wmnet with reason: REIMAGE |
[production] |