| 2021-03-10 |
  | 05:27 | <ryankemper> | T266470 Rollout of updated certificate complete. We're now ready to implement envoy for `wdqs-test` which will allow `wdqs1009` to be reachable via port 443 and thereby allow us to go live with `query-preview.wikidata.org` when the time comes | [production] | 
            
  | 05:26 | <ryankemper> | T266470 `ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-all' 'sudo enable-puppet "revoking old cert and generating new one with new alt_names - T266470 - root"'` and `ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-all' 'sudo run-puppet-agent'` | [production] | 
            
  | 05:24 | <ryankemper> | T266470 Test queries passing on `wdqs1004`, and the Grafana dashboard `https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&var-cluster_name=wdqs&from=now-1h&to=now` looks as expected. Proceeding to rest of fleet | [production] | 
            
  | 05:20 | <ryankemper> | T266470 Enabled puppet on single public wdqs host to verify certificate update is without issue: `ryankemper@wdqs1004:~$ sudo enable-puppet "revoking old cert and generating new one with new alt_names - T266470 - root"` followed by `ryankemper@wdqs1004:~$ sudo run-puppet-agent` | [production] | 
            
  | 05:18 | <ryankemper> | Enabling puppet on single public wdqs host to verify certificate update is without issue: `ryankemper@wdqs1004:~$ sudo enable-puppet "revoking old cert and generating new one with new alt_names - T266470 - root"` followed by `ryankemper@wdqs1004:~$ sudo run-puppet-agent` | [production] | 
            
  | 05:15 | <ryankemper> | T266470 [`/srv/private`] All changes committed to private git repo, commit SHA `ec1d6cfae8c72e4f807b343cdb9f25c27817d98d` | [production] | 
            
  | 05:13 | <ryankemper> | T266470 [`/srv/private`] `chown gitpuppet:gitpuppet` on all modified files (were owned by root, probably because I sudo'd - may be that a git commit hook would have caught that but explicitly chowning just to be safe) | [production] | 
            
  | 05:06 | <ryankemper> | T266470 New `wdqs.discovery.wmnet.crt` added to public `operations/puppet` repo: https://gerrit.wikimedia.org/r/c/operations/puppet/+/670337/ | [production] | 
            
  | 04:58 | <ryankemper> | T266470 The above two actions mean that we're ready to generate the new certificate files. Proceeding: `sudo cergen -c 'wdqs.*' --generate --base-path /srv/private/modules/secret/secrets/certificates /srv/private/modules/secret/secrets/certificates/certificate.manifests.d` on `ryankemper@puppetmaster1001:/srv/private` | [production] | 
            
  | 04:57 | <ryankemper> | T266470 `sudo rm -fv certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.crt.pem certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.csr.pem certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.keystore.jks certificates/wdqs.discovery.wmnet/wdqs.discovery.wmnet.keystore.p12 certificates/wdqs.discovery.wmnet/truststore.jks` (full paths not provided to fit the IRC line) | [production] | 
            
  | 04:56 | <ryankemper> | T266470 In the `/srv/private` repo, `/srv/private/modules/secret/secrets/certificates/certificate.manifests.d/wdqs.certs.yaml` has been edited to add the relevant `alt_names` | [production] | 
            
  | 04:55 | <ryankemper> | T266470 Certificate revoked: `ryankemper@puppetmaster1001:/srv/private$ sudo puppet cert clean wdqs.discovery.wmnet` | [production] | 
            
  | 04:53 | <ryankemper> | T266470 `ryankemper@cumin1001:~$ sudo -E cumin 'A:wdqs-all' 'sudo disable-puppet "revoking old cert and generating new one with new alt_names - T266470"'` | [production] | 
            
  | 04:52 | <ryankemper> | T266470 Temporarily disabling puppet on all `wdqs*` hosts in preparation for `wdqs.discovery.wmnet` certificate revocation | [production] | 
            
  | 01:08 | <krinkle@deploy1002> | Synchronized php-1.36.0-wmf.34/extensions/NavigationTiming/modules/ext.navigationTiming.js: T276826 Ibd9ddf14d64 (duration: 01m 14s) | [production] | 
            
  | 00:02 | <robh@cumin1001> | END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-backup1002.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 00:00 | <robh@cumin1001> | END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-backup1001.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  
    | 2021-03-09 |
  | 23:59 | <robh@cumin1001> | START - Cookbook sre.hosts.downtime for 2:00:00 on ms-backup1002.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 23:58 | <robh@cumin1001> | START - Cookbook sre.hosts.downtime for 2:00:00 on ms-backup1001.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 22:04 | <mutante> | phab1001 - manually running phab public task dump script after making changes to redirect stdout | [production] | 
            
  | 20:42 | <elukey> | reimaged an-worker1091 to buster | [production] | 
            
  | 20:41 | <bstorm> | depooled labsdb1009 T276980 | [production] | 
            
  | 20:25 | <elukey@cumin1001> | END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1091.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 20:25 | <bstorm> | downtimed labsdb1009 so it doesn't keep paging T276980 | [production] | 
            
  | 20:23 | <elukey@cumin1001> | START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1091.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 20:09 | <brennen> | train status: 1.36.0-wmf.34 (T274938) on group0 at 20:06:32 UTC; logs initially quiet. | [production] | 
            
  | 20:06 | <brennen@deploy1002> | rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.34 | [production] | 
            
  | 19:05 | <brennen@deploy1002> | Pruned MediaWiki: 1.36.0-wmf.31 (duration: 03m 34s) | [production] | 
            
  | 19:04 | <pt1979@cumin2001> | END (PASS) - Cookbook sre.dns.netbox (exit_code=0) | [production] | 
            
  | 18:59 | <pt1979@cumin2001> | START - Cookbook sre.dns.netbox | [production] | 
            
  | 18:54 | <brennen@deploy1002> | Finished scap: testwikis wikis to 1.36.0-wmf.34 (duration: 47m 25s) | [production] | 
            
  | 18:52 | <elukey@cumin1001> | END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1087.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 18:49 | <elukey@cumin1001> | START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1087.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 18:47 | <dcausse> | re-pool wdqs1004 | [production] | 
            
  | 18:37 | <mbsantos@deploy1002> | helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . | [production] | 
            
  | 18:35 | <mbsantos@deploy1002> | helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . | [production] | 
            
  | 18:34 | <pt1979@cumin2001> | END (PASS) - Cookbook sre.dns.netbox (exit_code=0) | [production] | 
            
  | 18:29 | <pt1979@cumin2001> | START - Cookbook sre.dns.netbox | [production] | 
            
  | 18:26 | <elukey> | reimage an-worker1087 to buster | [production] | 
            
  | 18:16 | <mbsantos@deploy1002> | helmfile [codfw] Ran 'sync' command on namespace 'proton' for release 'production' . | [production] | 
            
  | 18:13 | <mbsantos@deploy1002> | helmfile [staging] Ran 'sync' command on namespace 'proton' for release 'production' . | [production] | 
            
  | 18:12 | <brennen@deploy1002> | Started scap: testwikis wikis to 1.36.0-wmf.34 | [production] | 
            
  | 18:10 | <mbsantos@deploy1002> | helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . | [production] | 
            
  | 18:05 | <mbsantos@deploy1002> | helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . | [production] | 
            
  | 18:03 | <mbsantos@deploy1002> | helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . | [production] | 
            
  | 18:02 | <marxarelli> | deleting shut down memc* deployment-prep instances to free up quota for replacement db instances (T276968) | [production] | 
            
  | 18:02 | <elukey@cumin1001> | END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1085.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 18:00 | <elukey@cumin1001> | START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1085.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 17:50 | <papaul> | rebooting db2073 for firmware upgrade | [production] |