2021-03-03 §
10:12 <marostegui@cumin1001> dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P14605 and previous config saved to /var/cache/conftool/dbconfig/20210303-101255-root.json [production]
10:12 <marostegui@cumin1001> dbctl commit (dc=all): 'db1164 (re)pooling @ 60%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14604 and previous config saved to /var/cache/conftool/dbconfig/20210303-101215-root.json [production]
10:05 <aborrero@cumin1001> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet1003.eqiad.wmnet [production]
10:00 <aborrero@cumin1001> START - Cookbook sre.hosts.reboot-single for host cloudnet1003.eqiad.wmnet [production]
10:00 <aborrero@cumin1001> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet1003.eqiad.wmnet [production]
09:57 <marostegui@cumin1001> dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P14602 and previous config saved to /var/cache/conftool/dbconfig/20210303-095751-root.json [production]
09:57 <marostegui@cumin1001> dbctl commit (dc=all): 'db1164 (re)pooling @ 50%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14601 and previous config saved to /var/cache/conftool/dbconfig/20210303-095712-root.json [production]
09:55 <aborrero@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on cloudnet1003.eqiad.wmnet with reason: HW issue [production]
09:54 <aborrero@cumin1001> START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on cloudnet1003.eqiad.wmnet with reason: HW issue [production]
09:54 <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1166 for schema change', diff saved to https://phabricator.wikimedia.org/P14600 and previous config saved to /var/cache/conftool/dbconfig/20210303-095417-marostegui.json [production]
09:53 <marostegui@cumin1001> dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14599 and previous config saved to /var/cache/conftool/dbconfig/20210303-095351-root.json [production]
09:42 <marostegui@cumin1001> dbctl commit (dc=all): 'db1164 (re)pooling @ 30%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14598 and previous config saved to /var/cache/conftool/dbconfig/20210303-094208-root.json [production]
09:41 <elukey@cumin1001> END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1132,1135-1138].eqiad.wmnet [production]
09:39 <elukey@cumin1001> START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1132,1135-1138].eqiad.wmnet [production]
09:38 <marostegui@cumin1001> dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P14597 and previous config saved to /var/cache/conftool/dbconfig/20210303-093847-root.json [production]
09:31 <aborrero@cumin1001> START - Cookbook sre.hosts.reboot-single for host cloudnet1003.eqiad.wmnet [production]
09:30 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1138.eqiad.wmnet with reason: REIMAGE [production]
09:28 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1138.eqiad.wmnet with reason: REIMAGE [production]
09:28 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1137.eqiad.wmnet with reason: REIMAGE [production]
09:27 <marostegui@cumin1001> dbctl commit (dc=all): 'db1164 (re)pooling @ 25%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14596 and previous config saved to /var/cache/conftool/dbconfig/20210303-092705-root.json [production]
09:25 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1137.eqiad.wmnet with reason: REIMAGE [production]
09:23 <marostegui@cumin1001> dbctl commit (dc=all): 'db1157 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P14595 and previous config saved to /var/cache/conftool/dbconfig/20210303-092343-root.json [production]
09:16 <jayme@deploy1002> helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [production]
09:16 <jayme@deploy1002> helmfile [staging-codfw] START helmfile.d/admin 'sync'. [production]
09:12 <marostegui@cumin1001> dbctl commit (dc=all): 'db1164 (re)pooling @ 15%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14594 and previous config saved to /var/cache/conftool/dbconfig/20210303-091201-root.json [production]
09:08 <marostegui@cumin1001> dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P14593 and previous config saved to /var/cache/conftool/dbconfig/20210303-090840-root.json [production]
09:02 <zpapierski@deploy1002> Finished deploy [wdqs/wdqs@dbfd1f6]: Deploying emergency fix - WDQS 0.3.66 (duration: 08m 17s) [production]
09:00 <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1157 for schema change', diff saved to https://phabricator.wikimedia.org/P14592 and previous config saved to /var/cache/conftool/dbconfig/20210303-090030-marostegui.json [production]
08:56 <marostegui@cumin1001> dbctl commit (dc=all): 'db1164 (re)pooling @ 10%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14591 and previous config saved to /var/cache/conftool/dbconfig/20210303-085658-root.json [production]
08:54 <zpapierski@deploy1002> Started deploy [wdqs/wdqs@dbfd1f6]: Deploying emergency fix - WDQS 0.3.66 [production]
08:50 <marostegui@cumin1001> dbctl commit (dc=all): 'Increase weight for db1164 in s1 T258361', diff saved to https://phabricator.wikimedia.org/P14590 and previous config saved to /var/cache/conftool/dbconfig/20210303-085014-marostegui.json [production]
08:48 <test> tcpircbot --joe [production]
08:40 <elukey@cumin1001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-worker1136.eqiad.wmnet with reason: REIMAGE [production]
08:40 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1136.eqiad.wmnet with reason: REIMAGE [production]
08:32 <godog> stop/mask tcpircbot-logmsgbot on pontoon-icinga-01 - T276299 [production]
07:30 <_joe_> test [production]
07:17 <_joe_> test log [production]
06:41 <marostegui> Testing log [production]
06:27 <ryankemper> T275345 T274555 `sudo confctl select 'name=elastic2054.codfw.wmnet' set/pooled=yes` on `ryankemper@puppetmaster1001` [production]
06:26 <ryankemper> T275345 T274555 `sudo confctl select 'name=elastic2045.codfw.wmnet' set/pooled=yes` on `ryankemper@puppetmaster1001` [production]
06:21 <ryankemper> T275345 T274555 Re-pooling `elastic2045` and `elastic2054` (commands follow) [production]
06:20 <ryankemper> T275345 T274555 `curl -H 'Content-Type: application/json' -XPUT http://localhost:9400/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_name": null,"_ip": null}}}'` => `{"acknowledged":true,"persistent":{},"transient":{}}` [production]
06:18 <ryankemper> T275345 T274555 `curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_name": null,"_ip": null}}}'` => `{"acknowledged":true,"persistent":{},"transient":{}}` [production]
06:17 <ryankemper> T275345 T274555 Unbanning `elastic2045` and `elastic2054` from our cluster now that both hosts have been re-imaged and are running without errors (commands follow) [production]
06:15 <ryankemper> T274555 Removed downtime for `elastic2054` [production]
05:32 <ryankemper> T274555 `sudo -i wmf-auto-reimage-host --conftool -p T274555 elastic2054.codfw.wmnet` on `ryankemper@cumin2001` tmux session `elastic_reimage_elastic2054` [production]
05:31 <ryankemper> T274555 `sudo -i wmf-auto-reimage-host --conftool -p T274555 elastic2054.codfw.wmnet` [production]
05:27 <ryankemper> Downtime `wdqs1012` until `2021-03-03 19:25:40` (~14 hours from now). Its `wdqs-updater` is failing; ultimately its blazegraph journal is probably in a bad state, meaning we'd have to copy one over from a healthy node, but not kicking that off right now so that we can investigate a little bit first [production]
05:16 <ryankemper> T275345 `ryankemper@elastic2045:~$ sudo apt-get upgrade wmf-elasticsearch-search-plugins` [production]
03:50 <ryankemper> Depooled `wdqs1012` until I've got its updater back online [production]