2022-08-02
09:28 <btullis@cumin1001> END (PASS) - Cookbook sre.zookeeper.roll-restart-zookeeper (exit_code=0) for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. [production]
09:26 <btullis@puppetmaster1001> conftool action : set/pooled=inactive; selector: cluster=wikireplicas-a,name=dbproxy1019.eqiad.wmnet [production]
09:25 <btullis@puppetmaster1001> conftool action : set/pooled=yes; selector: cluster=wikireplicas-a,name=dbproxy1018.eqiad.wmnet [production]
09:22 <btullis@cumin1001> START - Cookbook sre.zookeeper.roll-restart-zookeeper for Zookeeper A:zookeeper-analytics cluster: Roll restart of jvm daemons. [production]
09:17 <marostegui@cumin1001> dbctl commit (dc=all): 'db1181 (re)pooling @ 5%: After restart', diff saved to https://phabricator.wikimedia.org/P32136 and previous config saved to /var/cache/conftool/dbconfig/20220802-091754-root.json [production]
09:17 <marostegui@cumin1001> dbctl commit (dc=all): 'db1174 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32135 and previous config saved to /var/cache/conftool/dbconfig/20220802-091749-root.json [production]
09:15 <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db2143', diff saved to https://phabricator.wikimedia.org/P32134 and previous config saved to /var/cache/conftool/dbconfig/20220802-091518-root.json [production]
09:02 <marostegui@cumin1001> dbctl commit (dc=all): 'db1181 (re)pooling @ 2%: After restart', diff saved to https://phabricator.wikimedia.org/P32133 and previous config saved to /var/cache/conftool/dbconfig/20220802-090250-root.json [production]
09:02 <marostegui@cumin1001> dbctl commit (dc=all): 'db1174 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32132 and previous config saved to /var/cache/conftool/dbconfig/20220802-090245-root.json [production]
08:47 <marostegui@cumin1001> dbctl commit (dc=all): 'db1181 (re)pooling @ 1%: After restart', diff saved to https://phabricator.wikimedia.org/P32131 and previous config saved to /var/cache/conftool/dbconfig/20220802-084745-root.json [production]
08:47 <marostegui@cumin1001> dbctl commit (dc=all): 'db1174 (re)pooling @ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P32130 and previous config saved to /var/cache/conftool/dbconfig/20220802-084740-root.json [production]
08:46 <marostegui> stop mysql on db2095 db2107 db2109 db2137 db2147 db2159 db2160 pc2012 for pdu maintenance on codfw b5 T310070 [production]
07:49 <moritzm> upgrading drmrs ganeti clusters to 3.0.2 T312637 [production]
07:33 <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to plain disks, T311686 [production]
07:33 <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to plain disks, T311686 [production]
07:22 <godog> bounce icinga on alert2001 - T314353 [production]
07:18 <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to DRBD, T311686 [production]
07:18 <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 1:00:00 on kubetcd2005.codfw.wmnet with reason: Switch instance to DRBD, T311686 [production]
06:58 <elukey> restart rsyslog on ml-serve2006 [production]
06:56 <ladsgroup@deploy1002> Synchronized php-1.39.0-wmf.22/extensions/FlaggedRevs/maintenance/pruneRevData.php: Backport: [[gerrit:819077|pruneRevData: Make cleaning in larger batches (T296380)]] (duration: 03m 26s) [production]
06:56 <mwdebug-deploy@deploy1002> helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [production]
06:55 <mwdebug-deploy@deploy1002> helmfile [codfw] START helmfile.d/services/mwdebug: apply [production]
06:55 <mwdebug-deploy@deploy1002> helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [production]
06:54 <mwdebug-deploy@deploy1002> helmfile [eqiad] START helmfile.d/services/mwdebug: apply [production]
06:46 <godog> bounce icinga on alert1001 - T314353 [production]
05:48 <marostegui@cumin1001> END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts db2088.codfw.wmnet [production]
05:48 <marostegui@cumin1001> END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [production]
05:44 <marostegui@cumin1001> START - Cookbook sre.dns.netbox [production]
05:35 <marostegui@cumin1001> START - Cookbook sre.hosts.decommission for hosts db2088.codfw.wmnet [production]
05:29 <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1181', diff saved to https://phabricator.wikimedia.org/P32127 and previous config saved to /var/cache/conftool/dbconfig/20220802-052923-root.json [production]
05:24 <marostegui> dbmaint x1@eqiad T314087 [production]
04:17 <ryankemper> [Elastic] Small amendment to my earlier statement; based off epoch time `be_x_oldwiki_titlesuggest_1659407912` was not an old index hanging around after a reindex operation, but rather the new one that the reindex operation was trying to create, but had not yet finished (therefore didn't switch over the aliases). It presumably got interrupted by the reimage of `elastic2059`. [production]
04:15 <ryankemper> [Elastic] Blew away red index like so: `ryankemper@cumin1001:~$ curl -XDELETE https://search.svc.codfw.wmnet:9243/be_x_oldwiki_titlesuggest_1659407912`. Cluster is back to `green` status. [production]
04:07 <ryankemper> [Elastic] Per `curl -s https://search.svc.codfw.wmnet:9243/_cat/aliases | grep -i be_x` I see the `be_x_oldwiki_titlesuggest` alias points to `be_x_oldwiki_titlesuggest_1658396688`. I think this means the red index is an old index from an in-progress reindex operation. I likely just need to delete `be_x_oldwiki_titlesuggest_1659407912`, but I am doing some quick digging first [production]
04:04 <ryankemper> [Elastic] Red cluster status in main codfw elasticsearch cluster (`https://search.svc.codfw.wmnet:9243`); culprit appears to be index `be_x_oldwiki_titlesuggest_1659407912`. Confusingly it has 2 replicas set so it's not clear to me how we got into this state starting from green (in the past we've gone into red status from indices that erroneously had 0 replicas in production) [production]
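The [Elastic] entries above reason from the numeric suffix of the index name: if that suffix is a creation timestamp, an index whose suffix is later than the alias target's is the new, unfinished index rather than a stale one. A minimal Python sketch of that comparison follows. It assumes the suffix is Unix epoch seconds, as the 04:17 entry suggests; the function names `index_created_at` and `is_newer_than_alias_target` are hypothetical and not part of any Elasticsearch or Wikimedia tooling.

```python
from datetime import datetime, timezone


def index_created_at(index_name: str) -> datetime:
    """Read the trailing numeric suffix of an index name as epoch seconds.

    Assumption: names look like `<alias>_<epoch seconds>`, as in the log.
    """
    _, _, suffix = index_name.rpartition("_")
    if not suffix.isdigit():
        raise ValueError(f"no numeric suffix in index name: {index_name!r}")
    return datetime.fromtimestamp(int(suffix), tz=timezone.utc)


def is_newer_than_alias_target(candidate: str, alias_target: str) -> bool:
    """True if the candidate index was created after the index the alias points to."""
    return index_created_at(candidate) > index_created_at(alias_target)


if __name__ == "__main__":
    red_index = "be_x_oldwiki_titlesuggest_1659407912"
    alias_target = "be_x_oldwiki_titlesuggest_1658396688"
    print(index_created_at(alias_target).isoformat())  # 2022-07-21T09:44:48+00:00
    print(index_created_at(red_index).isoformat())     # 2022-08-02T02:38:32+00:00
    print(is_newer_than_alias_target(red_index, alias_target))  # True
```

Under that assumption the red index dates from 02:38 UTC on 2022-08-02, about twelve days after the alias target, which is consistent with the 04:17 correction that it was the index the reindex was still building.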
03:47 <mwdebug-deploy@deploy1002> helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [production]
03:46 <mwdebug-deploy@deploy1002> helmfile [codfw] START helmfile.d/services/mwdebug: apply [production]
03:46 <mwdebug-deploy@deploy1002> helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [production]
03:45 <mwdebug-deploy@deploy1002> helmfile [eqiad] START helmfile.d/services/mwdebug: apply [production]
03:40 <krinkle@deploy1002> Synchronized multiversion/: I0802db272695 (duration: 03m 10s) [production]
03:40 <mwdebug-deploy@deploy1002> helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [production]
03:39 <mwdebug-deploy@deploy1002> helmfile [codfw] START helmfile.d/services/mwdebug: apply [production]
03:39 <mwdebug-deploy@deploy1002> helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [production]
03:38 <mwdebug-deploy@deploy1002> helmfile [eqiad] START helmfile.d/services/mwdebug: apply [production]
03:34 <krinkle@deploy1002> Synchronized wmf-config/: I9b89c0ff5c2 (duration: 03m 32s) [production]
03:33 <mwdebug-deploy@deploy1002> helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [production]
03:32 <mwdebug-deploy@deploy1002> helmfile [codfw] START helmfile.d/services/mwdebug: apply [production]
03:32 <mwdebug-deploy@deploy1002> helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [production]
03:31 <mwdebug-deploy@deploy1002> helmfile [eqiad] START helmfile.d/services/mwdebug: apply [production]
03:27 <krinkle@deploy1002> Synchronized multiversion/: I6e97d39a3, Ib843ebced31 (duration: 03m 30s) [production]