2021-04-29 §
01:19 <ryankemper@cumin1001> END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [production]
01:19 <ryankemper@cumin1001> START - Cookbook sre.wdqs.data-transfer [production]
01:19 <ryankemper> T280382 Aborted data transfer; `wdqs2007` is hosed (see https://phabricator.wikimedia.org/T281437) [production]
01:18 <ryankemper@cumin1001> END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [production]
00:40 <tstarling@deploy1002> Synchronized php-1.37.0-wmf.3/includes/specials/pagers/ImageListPager.php: T281405 (duration: 01m 08s) [production]
00:11 <ryankemper> T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs1004.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` [production]
00:06 <ryankemper> T280382 `wdqs1013.eqiad.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred. `df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/mapper/vg0-srv 2.7T 998G 1.6T 39% /srv` [production]
2021-04-28 §
23:42 <ryankemper@cumin1001> END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [production]
23:38 <robh@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5016.eqsin.wmnet with reason: REIMAGE [production]
23:36 <robh@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on cp5016.eqsin.wmnet with reason: REIMAGE [production]
23:36 <robh@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5015.eqsin.wmnet with reason: REIMAGE [production]
23:34 <robh@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5014.eqsin.wmnet with reason: REIMAGE [production]
23:33 <robh@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on cp5015.eqsin.wmnet with reason: REIMAGE [production]
23:32 <robh@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on cp5014.eqsin.wmnet with reason: REIMAGE [production]
23:06 <dpifke@deploy1002> Finished deploy [performance/navtiming@cf8b2e9]: Deploying https://gerrit.wikimedia.org/r/c/performance/navtiming/+/682886 (duration: 00m 05s) [production]
23:06 <dpifke@deploy1002> Started deploy [performance/navtiming@cf8b2e9]: Deploying https://gerrit.wikimedia.org/r/c/performance/navtiming/+/682886 [production]
22:44 <dwisehaupt> civiproxy revision changed to 99cecb924a - initial rollout of code for testing [production]
22:26 <ryankemper> T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1013.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `reimage` [production]
22:26 <ryankemper@cumin1001> START - Cookbook sre.wdqs.data-transfer [production]
22:23 <ryankemper@cumin1001> END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [production]
22:18 <ryankemper> T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1013.eqiad.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `reimage` [production]
22:18 <ryankemper@cumin1001> START - Cookbook sre.wdqs.data-transfer [production]
22:18 <robh@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE [production]
22:15 <robh@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE [production]
21:49 <legoktm@deploy1002> helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [production]
21:49 <legoktm@deploy1002> helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [production]
21:47 <legoktm@deploy1002> helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [production]
21:46 <robh@cumin1001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE [production]
21:44 <legoktm@deploy1002> helmfile [staging-codfw] START helmfile.d/admin 'apply'. [production]
21:44 <robh@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE [production]
21:41 <ryankemper@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1013.eqiad.wmnet with reason: REIMAGE [production]
21:39 <ryankemper@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1013.eqiad.wmnet with reason: REIMAGE [production]
21:39 <ryankemper@cumin1001> START - Cookbook sre.wdqs.data-transfer [production]
21:39 <ryankemper@cumin1001> END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [production]
21:38 <ryankemper> T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `reimage` [production]
21:37 <ryankemper> T280382 `wdqs2007` is reachable again; glancing at `/srv/wdqs` its `wikidata.jnl` is `839G` when it should be `975G` so I'll re-do the wikidata journal transfer [production]
21:32 <ryankemper> T280382 [WDQS] `wdqs2007` ssh is unreachable; power cycling via `racadm>>racadm serveraction powercycle` [production]
21:24 <ryankemper> T280382 `sudo -i wmf-auto-reimage-host -p T280382 --new wdqs1013.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` (previous reimage timed out, instance appears to have rebooted) [production]
21:07 <robh@cumin1001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5016.eqsin.wmnet with reason: REIMAGE [production]
21:05 <robh@cumin1001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5015.eqsin.wmnet with reason: REIMAGE [production]
21:04 <robh@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on cp5016.eqsin.wmnet with reason: REIMAGE [production]
21:03 <robh@cumin1001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE [production]
21:03 <robh@cumin1001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5014.eqsin.wmnet with reason: REIMAGE [production]
21:01 <robh@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE [production]
21:01 <robh@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on cp5015.eqsin.wmnet with reason: REIMAGE [production]
21:01 <robh@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on cp5014.eqsin.wmnet with reason: REIMAGE [production]
20:00 <robh@cumin1001> END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [production]
19:57 <jhuneidi@deploy1002> rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.37.0-wmf.1" [production]
19:56 <robh@cumin1001> START - Cookbook sre.dns.netbox [production]
19:13 <jhuneidi@deploy1002> Synchronized php: group1 wikis to 1.37.0-wmf.3 refs T278347 (duration: 01m 07s) [production]