2021-06-04
§
|
02:33 |
<ryankemper> |
[WDQS] `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1006.eqiad.wmnet --dest wdqs1013.eqiad.wmnet --reason "repair overinflated wikidata jnl" --blazegraph_instance blazegraph` |
[production] |
02:32 |
<ryankemper@cumin1001> |
START - Cookbook sre.wdqs.data-transfer |
[production] |
02:30 |
<ryankemper> |
T280382 `wdqs1005.eqiad.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred. `df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/md2 2.9T 998G 1.8T 36% /srv` |
[production] |
02:25 |
<ryankemper> |
[WDQS] `ryankemper@wdqs1012:~$ sudo pool` (caught up on lag) |
[production] |
02:09 |
<ryankemper> |
T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2007.codfw.wmnet --dest wdqs2001.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin2002` tmux session `wdqs_reimage` |
[production] |
02:06 |
<ebernhardson> |
post-deploy restart airflow-(webserver|scheduler) on an-airflow1001 |
[production] |
02:05 |
<ebernhardson@deploy1002> |
Finished deploy [wikimedia/discovery/analytics@500179f]: Stop overwriting uploads in swift (duration: 04m 40s) |
[production] |
02:00 |
<ebernhardson@deploy1002> |
Started deploy [wikimedia/discovery/analytics@500179f]: Stop overwriting uploads in swift |
[production] |
01:38 |
<ryankemper@cumin1001> |
END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) |
[production] |
01:24 |
<ryankemper@cumin2002> |
START - Cookbook sre.wdqs.data-transfer |
[production] |
00:12 |
<ryankemper@cumin2002> |
END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) |
[production] |
00:08 |
<reedy@deploy1002> |
Synchronized wmf-config/CommonSettings.php: T280886 (duration: 00m 57s) |
[production] |
00:07 |
<ryankemper> |
T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2007.codfw.wmnet --dest wdqs2001.codfw.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin2002` tmux session `wdqs_reimage` |
[production] |
00:06 |
<ryankemper@cumin2002> |
START - Cookbook sre.wdqs.data-transfer |
[production] |
00:05 |
<ryankemper> |
T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1008.eqiad.wmnet --dest wdqs1005.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `wdqs_reimage` |
[production] |
00:05 |
<ryankemper@cumin1001> |
START - Cookbook sre.wdqs.data-transfer |
[production] |
00:05 |
<ryankemper@cumin1001> |
END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) |
[production] |
2021-06-03
§
|
23:41 |
<reedy@deploy1002> |
Synchronized wmf-config/CommonSettings.php: T280886 (duration: 00m 56s) |
[production] |
23:40 |
<reedy@deploy1002> |
Synchronized wmf-config/InitialiseSettings.php: T280886 (duration: 00m 57s) |
[production] |
23:33 |
<mutante> |
installing OS on fresh VM doh5001 |
[production] |
23:30 |
<ryankemper@cumin2002> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2001.codfw.wmnet with reason: REIMAGE |
[production] |
23:28 |
<ryankemper@cumin2002> |
START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2001.codfw.wmnet with reason: REIMAGE |
[production] |
23:09 |
<thcipriani@deploy1002> |
Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:694686|Restrict changetags to sysops and bots on meta]] T283625 (duration: 00m 58s) |
[production] |
22:41 |
<ryankemper> |
T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs2001.codfw.wmnet` on `ryankemper@cumin2002` tmux session `wdqs_reimage` |
[production] |
22:39 |
<ryankemper> |
T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1008.eqiad.wmnet --dest wdqs1005.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `wdqs_reimage` |
[production] |
22:39 |
<ryankemper@cumin1001> |
START - Cookbook sre.wdqs.data-transfer |
[production] |
22:36 |
<ryankemper> |
T280382 Cancelled transfer to `wdqs1005`; the source host `wdqs1013` has a `wikidata.jnl` that is 80% too big; will transfer from a different node to `wdqs1005` and then fix the journal on `wdqs1013` afterwards |
[production] |
22:36 |
<ryankemper@cumin1001> |
END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) |
[production] |
22:35 |
<ryankemper> |
T280382 `wdqs2005.codfw.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred. `df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/md2 2.6T 998G 1.5T 40% /srv` |
[production] |
22:28 |
<robh@cumin1001> |
END (PASS) - Cookbook sre.dns.netbox (exit_code=0) |
[production] |
22:15 |
<robh@cumin1001> |
START - Cookbook sre.dns.netbox |
[production] |
21:55 |
<ryankemper@cumin2002> |
END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) |
[production] |
20:54 |
<shdubsh> |
restart kafka on kafka-logging to take new retention config |
[production] |
20:47 |
<sbassett> |
Deployed security patch for T282932 |
[production] |
20:37 |
<ebernhardson> |
restart mjolnir-kafka-bulk-daemon on search-loader[12]001 |
[production] |
20:35 |
<ebernhardson@deploy1002> |
Finished deploy [search/mjolnir/deploy@1c40c83]: bulk daemon: accept events for search_updates swift container (duration: 01m 00s) |
[production] |
20:34 |
<ryankemper> |
T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1013.eqiad.wmnet --dest wdqs1005.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `wdqs_reimage` |
[production] |
20:34 |
<ryankemper@cumin1001> |
START - Cookbook sre.wdqs.data-transfer |
[production] |
20:34 |
<ebernhardson@deploy1002> |
Started deploy [search/mjolnir/deploy@1c40c83]: bulk daemon: accept events for search_updates swift container |
[production] |
20:34 |
<ryankemper> |
T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2005.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin2002` tmux session `wdqs_reimage` |
[production] |
20:34 |
<ryankemper@cumin2002> |
START - Cookbook sre.wdqs.data-transfer |
[production] |
19:58 |
<mutante> |
[mwmaint1002:~] $ /usr/local/bin/systemd-timer-mail-wrapper -T root@mwmaint1002.eqiad.wmnet --only-on-error /usr/local/bin/cross-validate-accounts |
[production] |
19:56 |
<mutante> |
[mwmaint1002:~] $ sudo systemctl start daily_account_consistency_check.service |
[production] |
19:41 |
<dzahn@cumin1001> |
END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host doh5002.wikimedia.org |
[production] |
19:41 |
<dzahn@cumin1001> |
START - Cookbook sre.ganeti.makevm for new host doh5002.wikimedia.org |
[production] |
19:39 |
<ebernhardson@deploy1002> |
Finished deploy [wikimedia/discovery/analytics@339d402]: ship pip and wheel packages for virtualenvs (duration: 04m 27s) |
[production] |
19:37 |
<dzahn@cumin1001> |
END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host doh5001.wikimedia.org |
[production] |
19:34 |
<ebernhardson@deploy1002> |
Started deploy [wikimedia/discovery/analytics@339d402]: ship pip and wheel packages for virtualenvs |
[production] |
19:33 |
<mutante> |
[deneb:~] $ sudo systemctl start docker-reporter-releng-images - T251918 - icinga-wm> RECOVERY - Check systemd state on deneb is OK |
[production] |
19:33 |
<ryankemper@cumin2002> |
END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) |
[production] |