2021-04-30
05:30 <marostegui@cumin1001> dbctl commit (dc=all): 'db1114 (re)pooling @ 25%: Repool db1114', diff saved to https://phabricator.wikimedia.org/P15664 and previous config saved to /var/cache/conftool/dbconfig/20210430-053038-root.json [production]
05:16 <marostegui> Upgrade kernel on db1114 [production]
05:15 <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1114 to enable report_host T266483', diff saved to https://phabricator.wikimedia.org/P15663 and previous config saved to /var/cache/conftool/dbconfig/20210430-051558-marostegui.json [production]
05:08 <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1080.eqiad.wmnet [production]
04:57 <marostegui@cumin1001> START - Cookbook sre.hosts.decommission for hosts db1080.eqiad.wmnet [production]
04:56 <ryankemper> [WDQS] `ryankemper@wdqs1006:~$ sudo systemctl restart wdqs-blazegraph` [production]
04:43 <ryankemper> T280563 `sudo -i cookbook sre.elasticsearch.rolling-operation search_eqiad "eqiad reboot to apply sec updates" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563` on `ryankemper@cumin1001` tmux session `elastic_restarts` [production]
04:43 <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563 [production]
04:42 <ryankemper> T261239 `elastic2033`, which is known to be in a state of hardware failure (we have a ticket open), is holding up the reboot of codfw. I don't think we currently have a good way to exclude a node. Going to just proceed to `eqiad` for now [production]
04:41 <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563 [production]
04:39 <ryankemper> T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1003.eqiad.wmnet --dest wdqs1010.eqiad.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `reimage` [production]
04:39 <ryankemper@cumin1001> END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [production]
04:39 <ryankemper@cumin1001> START - Cookbook sre.wdqs.data-transfer [production]
04:05 <ryankemper@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1010.eqiad.wmnet with reason: REIMAGE [production]
04:03 <ryankemper@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1010.eqiad.wmnet with reason: REIMAGE [production]
03:50 <ryankemper> T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs1010.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` [production]
03:47 <ryankemper> T280563 about half of the codfw nodes had been rebooted before the failure caused by the write queue not emptying fast enough; kicking it off again: `sudo -i cookbook sre.elasticsearch.rolling-operation search_codfw "codfw reboot" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563` on `ryankemper@cumin1001` tmux session `elastic_restarts` [production]
03:45 <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563 [production]
01:08 <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563 [production]
2021-04-29
23:36 <thcipriani@deploy1002> Synchronized README: Config: [[gerrit:683749|Revert "DEMO: Add newline to README"]] (duration: 00m 56s) [production]
23:18 <ryankemper> T280563 successful reboot of `relforge100[3,4]`; `relforge` cluster is back to green status [production]
23:16 <thcipriani@deploy1002> Synchronized README: Config: [[gerrit:683747|DEMO: Add newline to README]] (duration: 00m 56s) [production]
23:08 <ryankemper> T280563 `sudo -i cookbook sre.elasticsearch.rolling-operation search_codfw "codfw reboot" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563` on `ryankemper@cumin1001` tmux session `elastic_restarts` (amended command) [production]
23:06 <ryankemper> T280563 `sudo -i cookbook sre.elasticsearch.rolling-operation codfw "codfw reboot" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563` on `ryankemper@cumin1001` tmux session `elastic_restarts` [production]
23:05 <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563 [production]
22:46 <ryankemper> T280563 Current master is `relforge1003-relforge-eqiad`, will reboot `1004` first, then `1003` after [production]
22:44 <ryankemper> T280563 Bleh, we never moved the new config into spicerack, so it's trying to talk to the old relforge hosts, which no longer exist. Will reboot relforge manually and use the cookbook for codfw/eqiad, and circle back later for the spicerack change [production]
22:37 <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563 [production]
22:36 <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563 [production]
22:32 <ryankemper> T280563 Spotted the issue; forgot to set `--without-lvs` for the relforge reboot [production]
22:27 <ryankemper> T280563 `urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7fbe4bb8a518>: Failed to establish a new connection: [Errno -2] Name or service not known` [production]
22:26 <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart - ryankemper@cumin1001 - T280563 [production]
22:26 <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart - ryankemper@cumin1001 - T280563 [production]
22:21 <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563 [production]
22:21 <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563 [production]
22:21 <ryankemper@cumin1001> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563 [production]
22:20 <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563 [production]
21:36 <mutante> icinga - enabling disabled notifications for random an-worker nodes where the mgmt interface had alerts enabled but the actual host didn't [production]
21:32 <mutante> icinga - enabled notifications for checks on ms-backup1001 - they had all been manually disabled, but none of the checks has had any status change in 50 days, which indicates turning them back on was forgotten; a common issue with disabling notifications [production]
21:16 <mutante> backup1001 - sudo check_bacula.py --icinga [production]
20:54 <marostegui> Stop mysql on tendril for the UTC night; dbtree and tendril will remain down for a few hours T281486 [production]
20:16 <marostegui> Restart tendril database - T281486 [production]
20:00 <jhuneidi@deploy1002> rebuilt and synchronized wikiversions files: all wikis to 1.37.0-wmf.3 refs T278347 [production]
19:46 <jhuneidi@deploy1002> Synchronized php: group1 wikis to 1.37.0-wmf.3 refs T278347 (duration: 01m 08s) [production]
19:45 <jhuneidi@deploy1002> rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.3 refs T278347 [production]
19:32 <dpifke@deploy1002> Finished deploy [performance/navtiming@e7ad939]: Deploy https://gerrit.wikimedia.org/r/c/performance/navtiming/+/683484 (duration: 00m 05s) [production]
19:32 <dpifke@deploy1002> Started deploy [performance/navtiming@e7ad939]: Deploy https://gerrit.wikimedia.org/r/c/performance/navtiming/+/683484 [production]
19:01 <Krinkle> graphite1004/2003: prune /var/lib/carbon/whisper/MediaWiki/wanobjectcache/revision_row_1/ (bad data from Sep 2019) [production]
18:59 <Krinkle> graphite1004/2003: prune /var/lib/carbon/whisper/rl-minify-* (bad data from Aug 2018) [production]
18:58 <Krinkle> graphite1004/2003: prune /var/lib/carbon/whisper/MediaWiki_ExternalGuidance_init_Google_tr_fr (bad data from Nov 2019) [production]