2021-01-27
ยง
|
09:04 <jbond42> deploy fix to enable-puppet [production]
09:03 <godog> swift codfw-prod decrease SSD weight for ms-be20[16-27] - T272837 [production]
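(Context: a Swift weight change like godog's is made against the ring builder files and followed by a rebalance; the builder file, device search value, and target weight below are illustrative, not the actual values used.)
`swift-ring-builder object.builder set_weight d100 1500  # lower one device's weight; device ID and weight are hypothetical`
`swift-ring-builder object.builder rebalance  # recompute partition placement after the weight change`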
08:36 <marostegui@cumin1001> dbctl commit (dc=all): 'Pool db1160 with more weight T258361', diff saved to https://phabricator.wikimedia.org/P13978 and previous config saved to /var/cache/conftool/dbconfig/20210127-083618-marostegui.json [production]
08:29 <marostegui> Stop mysql on db1089 to clone db1169 T258361 [production]
08:28 <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1089 to clone db1169 T258361', diff saved to https://phabricator.wikimedia.org/P13976 and previous config saved to /var/cache/conftool/dbconfig/20210127-082826-marostegui.json [production]
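(Context: each "dbctl commit" entry pairs a staged change with a commit. A minimal sketch of the db1089 depool above, assuming dbctl's standard instance/config subcommands:)
`dbctl instance db1089 depool  # stage the depool in conftool`
`dbctl config commit -m 'Depool db1089 to clone db1169 T258361'  # apply the staged change and record the diff`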
08:11 <marostegui@cumin1001> dbctl commit (dc=all): 'Repool db1121', diff saved to https://phabricator.wikimedia.org/P13975 and previous config saved to /var/cache/conftool/dbconfig/20210127-081150-marostegui.json [production]
08:07 <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1121', diff saved to https://phabricator.wikimedia.org/P13974 and previous config saved to /var/cache/conftool/dbconfig/20210127-080753-marostegui.json [production]
08:06 <marostegui@cumin1001> dbctl commit (dc=all): 'db1085 (re)pooling @ 100%: After moving clouddb replicas', diff saved to https://phabricator.wikimedia.org/P13973 and previous config saved to /var/cache/conftool/dbconfig/20210127-080645-root.json [production]
07:57 <marostegui@cumin1001> dbctl commit (dc=all): 'Give db1160 some more small weight T258361', diff saved to https://phabricator.wikimedia.org/P13972 and previous config saved to /var/cache/conftool/dbconfig/20210127-075715-marostegui.json [production]
07:51 <marostegui@cumin1001> dbctl commit (dc=all): 'db1085 (re)pooling @ 75%: After moving clouddb replicas', diff saved to https://phabricator.wikimedia.org/P13971 and previous config saved to /var/cache/conftool/dbconfig/20210127-075142-root.json [production]
07:36 <marostegui@cumin1001> dbctl commit (dc=all): 'db1085 (re)pooling @ 50%: After moving clouddb replicas', diff saved to https://phabricator.wikimedia.org/P13970 and previous config saved to /var/cache/conftool/dbconfig/20210127-073638-root.json [production]
07:26 <elukey> powercycle analytics1073 - kernel soft-lockup bug registered, OS needs a reboot [production]
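(Context: one common way to power-cycle a host with a soft-locked kernel is out-of-band via its management interface; the mgmt hostname and ipmitool invocation below are assumptions, not necessarily what was run here.)
`ipmitool -I lanplus -H analytics1073.mgmt.eqiad.wmnet -U root -E chassis power cycle  # -E reads the password from the IPMI_PASSWORD env var`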
07:21 <marostegui@cumin1001> dbctl commit (dc=all): 'db1085 (re)pooling @ 25%: After moving clouddb replicas', diff saved to https://phabricator.wikimedia.org/P13969 and previous config saved to /var/cache/conftool/dbconfig/20210127-072135-root.json [production]
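(Context: the "(re)pooling @ 25/50/75/100%" commits above are a staged repool, ramping traffic back in steps. A sketch of one step, assuming dbctl's percentage pooling; the step sizes and pacing are illustrative:)
`dbctl instance db1085 pool -p 25 && dbctl config commit -m 'db1085 (re)pooling @ 25%'  # then 50, 75, 100, pausing between steps to watch load and replication lag`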
07:05 <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1085 T272008', diff saved to https://phabricator.wikimedia.org/P13968 and previous config saved to /var/cache/conftool/dbconfig/20210127-070502-marostegui.json [production]
06:57 <marostegui@cumin1001> dbctl commit (dc=all): 'Give db1160 some more small weight T258361', diff saved to https://phabricator.wikimedia.org/P13967 and previous config saved to /var/cache/conftool/dbconfig/20210127-065715-marostegui.json [production]
06:39 <marostegui@cumin1001> dbctl commit (dc=all): 'Give db1160 some more small weight T258361', diff saved to https://phabricator.wikimedia.org/P13966 and previous config saved to /var/cache/conftool/dbconfig/20210127-063930-marostegui.json [production]
06:13 <marostegui@cumin1001> dbctl commit (dc=all): 'Pool db1160 with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P13965 and previous config saved to /var/cache/conftool/dbconfig/20210127-061336-marostegui.json [production]
06:03 <twentyafterfour> phabricator appears to be up and running fine [production]
06:03 <twentyafterfour> phabricator is read-write [production]
06:01 <twentyafterfour> phabricator is read-only [production]
06:00 <marostegui> m3 master restart, phabricator will go read-only - T272596 [production]
05:50 <marostegui> Deploy schema change on s3 T270055 [production]
03:48 <ryankemper> (Restarted `wdqs-blazegraph` on `wdqs1012`) [production]
02:24 <ebernhardson@deploy1001> Finished deploy [wikimedia/discovery/analytics@9c85a21]: transfer_to_es: start date 2020 -> 2021 (duration: 02m 59s) [production]
02:21 <ebernhardson@deploy1001> Started deploy [wikimedia/discovery/analytics@9c85a21]: transfer_to_es: start date 2020 -> 2021 [production]
01:58 <ryankemper> [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [production]
01:57 <ryankemper> [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [production]
01:57 <ryankemper> [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [production]
01:56 <ryankemper@deploy1001> Finished deploy [wdqs/wdqs@6c6b2cb]: 0.3.61 (duration: 07m 50s) [production]
01:50 <ryankemper> [WDQS Deploy] Tests passing following deploy of `0.3.61` on canary `wdqs1003`; proceeding to rest of fleet [production]
01:48 <ryankemper@deploy1001> Started deploy [wdqs/wdqs@6c6b2cb]: 0.3.61 [production]
01:48 <ryankemper> [WDQS Deploy] Gearing up for deploy of wdqs `0.3.61`. Pre-deploy tests passing on canary `wdqs1003` [production]
01:39 <ebernhardson@deploy1001> Finished deploy [wikimedia/discovery/analytics@ee948e0]: transfer_to_es: Enable catchup (duration: 01m 11s) [production]
01:38 <ebernhardson@deploy1001> Started deploy [wikimedia/discovery/analytics@ee948e0]: transfer_to_es: Enable catchup [production]
01:25 <legoktm@cumin1001> conftool action : set/pooled=yes; selector: name=mw2296.codfw.wmnet [production]
01:25 <legoktm@cumin1001> conftool action : set/pooled=yes; selector: name=mw2295.codfw.wmnet [production]
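(Context: the "conftool action" lines record confctl runs that depool and later repool the mw hosts around their reimage; the equivalent commands look roughly like:)
`confctl select 'name=mw2295.codfw.wmnet' set/pooled=no  # depool ahead of the reimage`
`confctl select 'name=mw2295.codfw.wmnet' set/pooled=yes  # repool once the host is back and healthy`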
01:23 <ryankemper> T272713 [Deploy envoy for `wdqs-internal`] Roll-out complete. Will monitor `wdqs-internal` for any issues. All the remaining `WDQS SPARQL` alerts should clear shortly [production]
01:21 <ryankemper> T272713 [Deploy envoy for `wdqs-internal`] Test queries to `wdqs1003.eqiad.wmnet` passed, and metrics in Grafana (https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs-internal&from=1611706751381&to=1611710190405) look good. Rolling out to rest of fleet [production]
01:21 <legoktm@cumin1001> conftool action : set/pooled=no; selector: name=mw2296.codfw.wmnet [production]
01:20 <legoktm@cumin1001> conftool action : set/pooled=no; selector: name=mw2295.codfw.wmnet [production]
01:14 <ebernhardson@deploy1001> Finished deploy [wikimedia/discovery/analytics@246b640]: remove link recommendations from hourly transfer deps (duration: 03m 31s) [production]
01:10 <ebernhardson@deploy1001> Started deploy [wikimedia/discovery/analytics@246b640]: remove link recommendations from hourly transfer deps [production]
00:54 <legoktm@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2296.codfw.wmnet with reason: REIMAGE [production]
00:52 <legoktm@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2295.codfw.wmnet with reason: REIMAGE [production]
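(Context: the START/END pairs are emitted by the sre.hosts.downtime cookbook, which silences monitoring for the reimage window; the exact flags below are assumed from the logged duration and reason:)
`sudo cookbook sre.hosts.downtime --hours 2 --reason REIMAGE 'mw2295.codfw.wmnet'  # downtimes the host in Icinga for 2:00:00`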
00:51 <ryankemper> T272713 [Deploy envoy for `wdqs-internal`] Fixed typo in private key in commit `ea152df802b55e939d34494a4965ed83a80a24f2`. Puppet run on `wdqs1003` was successful as a result. Monitoring... [production]
00:49 <legoktm@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on mw2295.codfw.wmnet with reason: REIMAGE [production]
00:49 <legoktm@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on mw2296.codfw.wmnet with reason: REIMAGE [production]
00:45 <ryankemper> T272713 [Deploy envoy for `wdqs-internal`] Discovered source of the above failure; the secret key in the puppetmaster `/srv/private` repo has a typo in its name (my error): it had `wqds` instead of `wdqs`. Opening up a patch now [production]
00:44 <ryankemper> T272713 [Deploy envoy for `wdqs-internal`] `...Error while evaluating a Function Call, secret(): invalid secret ssl/wdqs-internal.discovery.wmnet.key (file: /etc/puppet/modules/sslcert/manifests/certificate.pp, line: 91, column: 26) (file: /etc/puppet/modules/profile/manifests/tlsproxy/envoy.pp, line: 129) on node wdqs1003.eqiad.wmnet` [production]
00:36 <ryankemper> [Deploy envoy for `wdqs-internal`] `...Error while evaluating a Function Call, secret(): invalid secret ssl/wdqs-internal.discovery.wmnet.key (file: /etc/puppet/modules/sslcert/manifests/certificate.pp, line: 91, column: 26) (file: /etc/puppet/modules/profile/manifests/tlsproxy/envoy.pp, line: 129) on node wdqs1003.eqiad.wmnet` [production]
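(Context: a secret() failure like the two above means the requested key file is missing from the puppet private repo, here because of the wqds/wdqs misspelling; a quick check on the puppetmaster might look like this, with the private-repo layout assumed:)
`ls /srv/private/modules/secret/secrets/ssl/ | grep -iE 'wdqs|wqds'  # catches both the correct and the misspelled key file name`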