2021-03-10
09:12 <jmm@cumin2001> START - Cookbook sre.hosts.reboot-single for host ms-be2028.codfw.wmnet [production]
08:39 <marostegui> Upgrade mysql and kernel on db2132 [production]
08:25 <marostegui> Upgrade mysql and kernel on db2078 [production]
08:21 <jmm@cumin2001> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thorium.eqiad.wmnet [production]
08:20 <moritzm> pruning obsolete kernels from ganeti hosts in eqiad/codfw [production]
08:17 <moritzm> powercycling thorium, stuck on reboot [production]
08:16 <marostegui@cumin1001> dbctl commit (dc=all): 'db1085 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14719 and previous config saved to /var/cache/conftool/dbconfig/20210310-081627-root.json [production]
08:11 <marostegui> Check tables on db1150:3315 - T276742 [production]
08:09 <jmm@cumin2001> START - Cookbook sre.hosts.reboot-single for host thorium.eqiad.wmnet [production]
08:05 <jmm@cumin2001> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host analytics-tool1001.eqiad.wmnet [production]
08:03 <jmm@cumin2001> START - Cookbook sre.hosts.reboot-single for host analytics-tool1001.eqiad.wmnet [production]
08:01 <marostegui@cumin1001> dbctl commit (dc=all): 'db1085 (re)pooling @ 60%: 10', diff saved to https://phabricator.wikimedia.org/P14718 and previous config saved to /var/cache/conftool/dbconfig/20210310-080123-root.json [production]
07:52 <marostegui> Deploy schema change on s7 codfw (lag will appear) T276150 T276156 [production]
07:46 <marostegui@cumin1001> dbctl commit (dc=all): 'db1085 (re)pooling @ 30%: 10', diff saved to https://phabricator.wikimedia.org/P14717 and previous config saved to /var/cache/conftool/dbconfig/20210310-074618-root.json [production]
07:33 <filippo@cumin1001> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host graphite1004.eqiad.wmnet [production]
07:29 <filippo@cumin1001> START - Cookbook sre.hosts.reboot-single for host graphite1004.eqiad.wmnet [production]
07:26 <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1085 for schema change', diff saved to https://phabricator.wikimedia.org/P14716 and previous config saved to /var/cache/conftool/dbconfig/20210310-072642-marostegui.json [production]
07:25 <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db1113:3316', diff saved to https://phabricator.wikimedia.org/P14715 and previous config saved to /var/cache/conftool/dbconfig/20210310-072508-marostegui.json [production]
07:07 <elukey> sudo apt-get remove linux-image-4.9.0-9-amd64 on sodium to free space for /boot [production]
07:06 <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db2145', diff saved to https://phabricator.wikimedia.org/P14714 and previous config saved to /var/cache/conftool/dbconfig/20210310-070642-marostegui.json [production]
07:03 <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1113:3316 for schema change', diff saved to https://phabricator.wikimedia.org/P14713 and previous config saved to /var/cache/conftool/dbconfig/20210310-070312-marostegui.json [production]
07:01 <elukey> remove the oldest kernel on ganeti nodes to free space for /boot [production]
07:00 <marostegui> Depool clouddb1016 [production]
06:45 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1111.eqiad.wmnet with reason: REIMAGE [production]
06:43 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1111.eqiad.wmnet with reason: REIMAGE [production]
06:17 <elukey> reimage an-worker1111 to buster [production]
05:27 <ryankemper> T266470 Rollout of updated certificate complete. We're now ready to implement envoy for `wdqs-test` which will allow `wdqs1009` to be reachable via port 443 and thereby allow us to go live with `query-preview.wikidata.org` when the time comes [production]
05:26 <ryankemper> T266470 `ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-all' 'sudo enable-puppet "revoking old cert and generating new one with new alt_names - T266470 - root"'` and `ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-all' 'sudo run-puppet-agent'` [production]
05:24 <ryankemper> T266470 Test queries passing on `wdqs1004`, and `https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&var-cluster_name=wdqs&from=now-1h&to=now` looks as expected. Proceeding to rest of fleet [production]
05:20 <ryankemper> T266470 Enabled puppet on single public wdqs host to verify certificate update is without issue: `ryankemper@wdqs1004:~$ sudo enable-puppet "revoking old cert and generating new one with new alt_names - T266470 - root"` followed by `ryankemper@wdqs1004:~$ sudo run-puppet-agent` [production]
05:18 <ryankemper> Enabling puppet on single public wdqs host to verify certificate update is without issue: `ryankemper@wdqs1004:~$ sudo enable-puppet "revoking old cert and generating new one with new alt_names - T266470 - root"` followed by `ryankemper@wdqs1004:~$ sudo run-puppet-agent` [production]
05:15 <ryankemper> T266470 [`/srv/private`] All changes committed to private git repo, commit SHA `ec1d6cfae8c72e4f807b343cdb9f25c27817d98d` [production]
05:13 <ryankemper> T266470 [`/srv/private`] `chown gitpuppet:gitpuppet` on all modified files (were owned by root, probably because I sudo'd - may be that a git commit hook would have caught that but explicitly chowning just to be safe) [production]
05:06 <ryankemper> T266470 New `wdqs.discovery.wmnet.crt` added to public `operations/puppet` repo: https://gerrit.wikimedia.org/r/c/operations/puppet/+/670337/ [production]
04:58 <ryankemper> T266470 The above two actions mean that we're ready to generate the new certificate files. Proceeding: `sudo cergen -c 'wdqs.*' --generate --base-path /srv/private/modules/secret/secrets/certificates /srv/private/modules/secret/secrets/certificates/certificate.manifests.d` on `ryankemper@puppetmaster1001:/srv/private` [production]
04:57 <ryankemper> T266470 `sudo rm -fv certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.crt.pem certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.csr.pem certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.keystore.jks certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.keystore.p12 certificates/wdqs.discovery.wmnet/truststore.jks` (full paths not provided to fit the IRC line) [production]
04:56 <ryankemper> T266470 In the `/srv/private` repo, `/srv/private/modules/secret/secrets/certificates/certificate.manifests.d/wdqs.certs.yaml` has been edited to add the relevant `alt_names` [production]
04:55 <ryankemper> T266470 Certificate revoked: `ryankemper@puppetmaster1001:/srv/private$ sudo puppet cert clean wdqs.discovery.wmnet` [production]
04:53 <ryankemper> T266470 `ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-all' 'sudo disable-puppet "revoking old cert and generating new one with new alt_names - T266470"'` [production]
04:53 <ryankemper> T266470 ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-all' 'sudo disable-puppet "revoking old cert and generating new one with new alt_names - T266470"' [production]
04:52 <ryankemper> T266470 Temporarily disabling puppet on all `wdqs*` hosts in preparation for `wdqs.discovery.wmnet` certificate revocation [production]
01:08 <krinkle@deploy1002> Synchronized php-1.36.0-wmf.34/extensions/NavigationTiming/modules/ext.navigationTiming.js: T276826 Ibd9ddf14d64 (duration: 01m 14s) [production]
00:02 <robh@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-backup1002.eqiad.wmnet with reason: REIMAGE [production]
00:00 <robh@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-backup1001.eqiad.wmnet with reason: REIMAGE [production]
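The entries above share a regular shape: a time, a user (optionally with the host the command ran on), a free-text message, and a trailing project tag. A minimal sketch of how such a line could be split into fields follows. The regular expression and the names `ENTRY_RE`, `Entry` and `parse_entry` are assumptions inferred from the entries on this page, not part of any official log tooling.

```python
import re
from typing import NamedTuple, Optional

# Assumed entry shape, inferred from the lines on this page:
#   HH:MM <user[@host]> message [project]
ENTRY_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}) "
    r"<(?P<user>[^@>]+)(?:@(?P<host>[^>]+))?> "
    r"(?P<message>.*) "
    r"\[(?P<project>[^\]]+)\]$"
)


class Entry(NamedTuple):
    time: str
    user: str
    host: Optional[str]  # None when the entry was logged without a host
    message: str
    project: str


def parse_entry(line: str) -> Optional[Entry]:
    """Return the parsed entry, or None for lines that do not match
    (for example date headers)."""
    m = ENTRY_RE.match(line.strip())
    if m is None:
        return None
    return Entry(**m.groupdict())


if __name__ == "__main__":
    sample = "08:17 <moritzm> powercycling thorium, stuck on reboot [production]"
    print(parse_entry(sample))
```

Because the message group is greedy and the project tag is anchored to the end of the line, square brackets inside the message (such as a bracketed path) stay part of the message and only the final bracketed token is read as the project.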
2021-03-09
23:59 <robh@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on ms-backup1002.eqiad.wmnet with reason: REIMAGE [production]
23:58 <robh@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on ms-backup1001.eqiad.wmnet with reason: REIMAGE [production]
22:04 <mutante> phab1001 - manually running phab public task dump script after making changes to redirect stdout [production]
20:42 <elukey> reimaged an-worker1091 to buster [production]
20:41 <bstorm> depooled labsdb1009 T276980 [production]
20:25 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1091.eqiad.wmnet with reason: REIMAGE [production]