2021-03-03
09:42 <marostegui@cumin1001> dbctl commit (dc=all): 'db1164 (re)pooling @ 30%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14598 and previous config saved to /var/cache/conftool/dbconfig/20210303-094208-root.json [production]
09:41 <elukey@cumin1001> END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1132,1135-1138].eqiad.wmnet [production]
09:39 <elukey@cumin1001> START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1132,1135-1138].eqiad.wmnet [production]
09:38 <marostegui@cumin1001> dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P14597 and previous config saved to /var/cache/conftool/dbconfig/20210303-093847-root.json [production]
09:31 <aborrero@cumin1001> START - Cookbook sre.hosts.reboot-single for host cloudnet1003.eqiad.wmnet [production]
09:30 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1138.eqiad.wmnet with reason: REIMAGE [production]
09:28 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1138.eqiad.wmnet with reason: REIMAGE [production]
09:28 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1137.eqiad.wmnet with reason: REIMAGE [production]
09:27 <marostegui@cumin1001> dbctl commit (dc=all): 'db1164 (re)pooling @ 25%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14596 and previous config saved to /var/cache/conftool/dbconfig/20210303-092705-root.json [production]
09:25 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1137.eqiad.wmnet with reason: REIMAGE [production]
09:23 <marostegui@cumin1001> dbctl commit (dc=all): 'db1157 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P14595 and previous config saved to /var/cache/conftool/dbconfig/20210303-092343-root.json [production]
09:16 <jayme@deploy1002> helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [production]
09:16 <jayme@deploy1002> helmfile [staging-codfw] START helmfile.d/admin 'sync'. [production]
09:12 <marostegui@cumin1001> dbctl commit (dc=all): 'db1164 (re)pooling @ 15%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14594 and previous config saved to /var/cache/conftool/dbconfig/20210303-091201-root.json [production]
09:08 <marostegui@cumin1001> dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P14593 and previous config saved to /var/cache/conftool/dbconfig/20210303-090840-root.json [production]
09:02 <zpapierski@deploy1002> Finished deploy [wdqs/wdqs@dbfd1f6]: Deploying emergency fix - WDQS 0.3.66 (duration: 08m 17s) [production]
09:00 <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1157 for schema change', diff saved to https://phabricator.wikimedia.org/P14592 and previous config saved to /var/cache/conftool/dbconfig/20210303-090030-marostegui.json [production]
08:56 <marostegui@cumin1001> dbctl commit (dc=all): 'db1164 (re)pooling @ 10%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14591 and previous config saved to /var/cache/conftool/dbconfig/20210303-085658-root.json [production]
08:54 <zpapierski@deploy1002> Started deploy [wdqs/wdqs@dbfd1f6]: Deploying emergency fix - WDQS 0.3.66 [production]
08:50 <marostegui@cumin1001> dbctl commit (dc=all): 'Increase weight for db1164 in s1 T258361', diff saved to https://phabricator.wikimedia.org/P14590 and previous config saved to /var/cache/conftool/dbconfig/20210303-085014-marostegui.json [production]
08:48 <test> tcpircbot --joe [production]
08:40 <elukey@cumin1001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-worker1136.eqiad.wmnet with reason: REIMAGE [production]
08:40 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1136.eqiad.wmnet with reason: REIMAGE [production]
08:32 <godog> stop/mask tcpircbot-logmsgbot on pontoon-icinga-01 - T276299 [production]
07:30 <_joe_> test [production]
07:17 <_joe_> test log [production]
06:41 <marostegui> Testing log [production]
06:27 <ryankemper> T275345 T274555 `sudo confctl select 'name=elastic2054.codfw.wmnet' set/pooled=yes` on `ryankemper@puppetmaster1001` [production]
06:26 <ryankemper> T275345 T274555 `sudo confctl select 'name=elastic2045.codfw.wmnet' set/pooled=yes` on `ryankemper@puppetmaster1001` [production]
06:21 <ryankemper> T275345 T274555 Re-pooling `elastic2045` and `elastic2054` (commands follow) [production]
06:20 <ryankemper> T275345 T274555 `curl -H 'Content-Type: application/json' -XPUT http://localhost:9400/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_name": null,"_ip": null}}}'` => `{"acknowledged":true,"persistent":{},"transient":{}}` [production]
06:18 <ryankemper> T275345 T274555 `curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_name": null,"_ip": null}}}'` => `{"acknowledged":true,"persistent":{},"transient":{}}` [production]
06:17 <ryankemper> T275345 T274555 Unbanning `elastic2045` and `elastic2054` from our cluster now that both hosts have been re-imaged and are running without errors (commands follow) [production]
06:15 <ryankemper> T274555 Removed downtime for `elastic2054` [production]
05:32 <ryankemper> T274555 `sudo -i wmf-auto-reimage-host --conftool -p T274555 elastic2054.codfw.wmnet` on `ryankemper@cumin2001` tmux session `elastic_reimage_elastic2054` [production]
05:31 <ryankemper> T274555 `sudo -i wmf-auto-reimage-host --conftool -p T274555 elastic2054.codfw.wmnet` [production]
05:27 <ryankemper> Downtime `wdqs1012` until `2021-03-03 19:25:40` (~14 hours from now). Its `wdqs-updater` is failing; ultimately its blazegraph journal is probably in a bad state, meaning we'd have to copy one over from a healthy node, but we're not kicking that off right now so that we can investigate a little bit first [production]
05:16 <ryankemper> T275345 `ryankemper@elastic2045:~$ sudo apt-get upgrade wmf-elasticsearch-search-plugins` [production]
03:50 <ryankemper> Depooled `wdqs1012` until I've got its updater back online [production]
03:24 <ryankemper> `ryankemper@wdqs1012:~$ sudo systemctl restart wdqs-blazegraph` ~2 mins ago [production]
02:45 <ejegg> updated fundraising CiviCRM from e1dacbe348 to b13e70d968 [production]
02:09 <ejegg> updated payments-wiki from 365bf54393 to 65dbf0ed9d [production]
00:42 <Urbanecm> Finished deployment in Evening B&C window; logmsgbot is currently down, and a simple restart did not bring it back up [production]
00:41 <Urbanecm> 00:40:16 Synchronized wmf-config/config/idwiki.yaml: 80edca8a385870a0e60a98198c99c9839fc01d80: Enable Growth features in idwiki in stealth mode (T259024; 3/3) (duration: 01m 09s) [production]
00:38 <Urbanecm> 00:38:12 Synchronized dblists/growthexperiments.dblist: 80edca8a385870a0e60a98198c99c9839fc01d80: Enable Growth features in idwiki in stealth mode (T259024; 2/3) (duration: 01m 10s) [production]
00:31 <Urbanecm> 00:31:26 Synchronized wmf-config/InitialiseSettings.php: 80edca8a385870a0e60a98198c99c9839fc01d80: Enable Growth features in idwiki in stealth mode (T259024; 1/3) (duration: 01m 11s) [production]
00:21 <dwisehaupt> replication restarted on frdb2001 after utf8mb4 conversion completed. [production]
00:21 <mutante> alert1001 systemctl restart tcpircbot-logmsgbot [production]
00:08 <urbanecm@deploy1002> sync-file aborted: 80edca8a385870a0e60a98198c99c9839fc01d80: Enable Growth features in idwiki in stealth mode (T259024; 1/3) (duration: 06m 45s) [production]
2021-03-02
23:52 <mutante> mwmaint2002 - find /home -nouser -delete [production]