2021-05-04
|
05:45 |
<marostegui> |
Stop mysql on db1158 to clone db1178 |
[production] |
05:45 |
<marostegui@cumin1001> |
dbctl commit (dc=all): 'Depool db1158 to clone db1178 T275633', diff saved to https://phabricator.wikimedia.org/P15680 and previous config saved to /var/cache/conftool/dbconfig/20210504-054539-marostegui.json |
[production] |
05:36 |
<marostegui> |
Deploy schema change on s6 codfw, lag will appear - T266486 T268392 T273360 |
[production] |
05:31 |
<marostegui@cumin1001> |
dbctl commit (dc=all): 'db1118 (re)pooling @ 25%: Repool db1118', diff saved to https://phabricator.wikimedia.org/P15678 and previous config saved to /var/cache/conftool/dbconfig/20210504-053149-root.json |
[production] |
05:26 |
<marostegui@cumin1001> |
dbctl commit (dc=all): 'db1121 (re)pooling @ 100%: Repool db1121', diff saved to https://phabricator.wikimedia.org/P15677 and previous config saved to /var/cache/conftool/dbconfig/20210504-052612-root.json |
[production] |
05:11 |
<marostegui@cumin1001> |
dbctl commit (dc=all): 'db1121 (re)pooling @ 75%: Repool db1121', diff saved to https://phabricator.wikimedia.org/P15676 and previous config saved to /var/cache/conftool/dbconfig/20210504-051108-root.json |
[production] |
05:07 |
<marostegui> |
Restart sanitarium hosts to pick up new filters T263817 |
[production] |
04:56 |
<marostegui@cumin1001> |
dbctl commit (dc=all): 'db1121 (re)pooling @ 50%: Repool db1121', diff saved to https://phabricator.wikimedia.org/P15675 and previous config saved to /var/cache/conftool/dbconfig/20210504-045605-root.json |
[production] |
04:41 |
<marostegui@cumin1001> |
dbctl commit (dc=all): 'db1121 (re)pooling @ 25%: Repool db1121', diff saved to https://phabricator.wikimedia.org/P15674 and previous config saved to /var/cache/conftool/dbconfig/20210504-044101-root.json |
[production] |
04:06 |
<ryankemper@cumin1001> |
END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563 |
[production] |
03:38 |
<ryankemper@cumin1001> |
START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563 |
[production] |
03:38 |
<ryankemper@cumin1001> |
END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563 |
[production] |
03:36 |
<ryankemper> |
T280563 `sudo -i cookbook sre.elasticsearch.rolling-operation search_eqiad "eqiad reboot to apply sec updates" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563` |
[production] |
03:35 |
<ryankemper@cumin1001> |
START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563 |
[production] |
02:09 |
<pt1979@cumin2001> |
END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on phab2002.codfw.wmnet with reason: REIMAGE |
[production] |
02:07 |
<pt1979@cumin2001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on phab2002.codfw.wmnet with reason: REIMAGE |
[production] |
01:41 |
<ryankemper@cumin1001> |
END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563 |
[production] |
2021-05-03
|
23:18 |
<urbanecm@deploy1002> |
Synchronized wmf-config/InitialiseSettings.php: 230ef5716b34ca83348667f289180313b76ce8a3: Prepare for new configuration option (T277951) (duration: 00m 57s) |
[production] |
23:15 |
<urbanecm@deploy1002> |
Synchronized wmf-config/InitialiseSettings.php: 7c47ee17b3936fb1f79590187a9e0028276e4a9d: Replace $wgRelatedArticlesFooterWhitelistedSkins (T277958) (duration: 00m 57s) |
[production] |
23:14 |
<urbanecm@deploy1002> |
sync-file aborted: 7c47ee17b3936fb1f79590187a9e0028276e4a9d: Replace $wgRelatedArticlesFooterWhitelistedSkins (T277958) (duration: 00m 01s) |
[production] |
22:17 |
<legoktm> |
ran disable_list for: iegcom wikien-l fundraiser spcommittee-private-l spcommittee-l mediation-en-l test-second wikifr-colloque-l |
[production] |
22:14 |
<mutante> |
[backup1001:~] $ sudo check_bacula.py --icinga |
[production] |
21:56 |
<ryankemper> |
T280563 `sudo -i cookbook sre.elasticsearch.rolling-operation search_eqiad "eqiad reboot to apply sec updates" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563` |
[production] |
21:55 |
<ryankemper@cumin1001> |
START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563 |
[production] |
21:54 |
<ryankemper> |
T280563 eqiad reboot failed with: `curator.exceptions.FailedExecution: Exception encountered. Rerun with loglevel DEBUG and/or check Elasticsearch logs for more information. Exception: ConnectionTimeout caused by - ReadTimeoutError(HTTPSConnectionPool(host='search.svc.eqiad.wmnet', port=9243): Read timed out. (read timeout=10))` |
[production] |
21:52 |
<ryankemper@cumin1001> |
END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563 |
[production] |
21:47 |
<ryankemper> |
T280563 `sudo -i cookbook sre.elasticsearch.rolling-operation search_eqiad "eqiad reboot to apply sec updates" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563` |
[production] |
21:46 |
<ryankemper@cumin1001> |
START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563 |
[production] |
21:32 |
<krinkle@deploy1002> |
Synchronized wmf-config/InitialiseSettings.php: d95b91648 (duration: 00m 58s) |
[production] |
21:27 |
<ryankemper@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1011.eqiad.wmnet with reason: REIMAGE |
[production] |
21:25 |
<ryankemper@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1011.eqiad.wmnet with reason: REIMAGE |
[production] |
21:22 |
<ryankemper> |
[WDQS] `ryankemper@wdqs1003:~$ sudo pool` |
[production] |
21:20 |
<ryankemper> |
T280382 [WDQS] `ryankemper@puppetmaster1001:~$ sudo confctl select 'name=wdqs1011.eqiad.wmnet' set/pooled=no` |
[production] |
21:19 |
<ryankemper@puppetmaster1001> |
conftool action : set/pooled=no; selector: name=wdqs1011.eqiad.wmnet |
[production] |
21:09 |
<ryankemper> |
T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs1011.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` |
[production] |
21:06 |
<ryankemper> |
T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `reimage` |
[production] |
21:05 |
<ryankemper@cumin1001> |
START - Cookbook sre.wdqs.data-transfer |
[production] |
21:02 |
<ryankemper> |
T280382 `wdqs1010.eqiad.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred. `df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/md2 2.6T 975G 1.5T 39% /srv` |
[production] |
20:56 |
<ryankemper> |
T280382 [WDQS] `ryankemper@wdqs2001:~$ sudo run-puppet-agent --force` |
[production] |
20:44 |
<ryankemper@cumin1001> |
END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) |
[production] |
20:42 |
<ryankemper@cumin1001> |
END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) |
[production] |
20:37 |
<ryankemper> |
T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `reimage` |
[production] |
20:37 |
<ryankemper@cumin1001> |
START - Cookbook sre.wdqs.data-transfer |
[production] |
19:40 |
<ryankemper@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs2007.codfw.wmnet with reason: REIMAGE |
[production] |
19:39 |
<ryankemper@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs2007.codfw.wmnet with reason: REIMAGE |
[production] |
19:24 |
<ryankemper> |
T280382 `sudo -i cookbook sre.wdqs.data-transfer --without-lvs --source wdqs1003.eqiad.wmnet --dest wdqs1010.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `reimage` |
[production] |
19:24 |
<ryankemper@cumin1001> |
START - Cookbook sre.wdqs.data-transfer |
[production] |
19:21 |
<ryankemper@puppetmaster1001> |
conftool action : set/pooled=no; selector: name=wdqs1004.eqiad.wmnet |
[production] |
19:21 |
<ryankemper> |
T280382 [WDQS] `sudo confctl select 'name=wdqs1004.eqiad.wmnet' set/pooled=no` (`wdqs1004` failed re-image [not sure why yet] and won't let me ssh in to depool so using conftool instead) |
[production] |
18:20 |
<Urbanecm> |
Morning B&C window done |
[production] |