2020-12-17
| 11:32 | <kartik@deploy1001> | helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . | [production] |
| 11:27 | <godog> | bounce apache2 on grafana1002 | [production] |
| 11:26 | <elukey@cumin1001> | END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-test-worker1003.eqiad.wmnet with reason: REIMAGE | [production] |
| 11:24 | <elukey@cumin1001> | END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-worker1001.eqiad.wmnet with reason: REIMAGE | [production] |
| 11:22 | <elukey@cumin1001> | END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-worker1002.eqiad.wmnet with reason: REIMAGE | [production] |
| 11:21 | <elukey@cumin1001> | START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1001.eqiad.wmnet with reason: REIMAGE | [production] |
| 11:21 | <elukey@cumin1001> | START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1003.eqiad.wmnet with reason: REIMAGE | [production] |
| 11:20 | <elukey@cumin1001> | START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1002.eqiad.wmnet with reason: REIMAGE | [production] |
| 11:20 | <elukey@cumin1001> | END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1001.eqiad.wmnet with reason: REIMAGE | [production] |
| 11:18 | <elukey@cumin1001> | END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1002.eqiad.wmnet with reason: REIMAGE | [production] |
| 11:16 | <elukey@cumin1001> | START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1001.eqiad.wmnet with reason: REIMAGE | [production] |
| 11:16 | <elukey@cumin1001> | START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1002.eqiad.wmnet with reason: REIMAGE | [production] |
| 11:10 | <jbond@cumin1001> | END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) | [production] |
| 11:08 | <jbond@cumin1001> | START - Cookbook sre.hosts.reboot-single | [production] |
| 10:50 | <elukey@cumin1001> | END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) for Hadoop test cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001 | [production] |
| 10:45 | <elukey@cumin1001> | START - Cookbook sre.hadoop.stop-cluster for Hadoop test cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001 | [production] |
| 10:21 | <jbond42> | updating RemoteIP on phabricator https://gerrit.wikimedia.org/r/c/operations/puppet/+/649872 | [production] |
| 09:57 | <vgutierrez> | repool ats-tls on cp5011 | [production] |
| 09:00 | <marostegui> | Sanitize s1 and s5 on db1154 T268742 | [production] |
| 08:30 | <godog> | swift codfw-prod: more weight to ms-be20[58-61] - T269337 | [production] |
| 07:49 | <ryankemper> | [wdqs deploy] (wdqs deploy complete) | [production] |
| 07:19 | <marostegui> | Stop mysql on db1082 to clone db1154 | [production] |
| 07:19 | <marostegui@cumin1001> | dbctl commit (dc=all): 'Depool db1082 for cloning db1154:3315 T268742 ', diff saved to https://phabricator.wikimedia.org/P13563 and previous config saved to /var/cache/conftool/dbconfig/20201217-071903-marostegui.json | [production] |
| 07:18 | <elukey> | reboot an-airflow1001 for kernel upgrades | [production] |
| 07:08 | <elukey> | update analytics-in4 filter on cr1/cr2-eqiad for https://gerrit.wikimedia.org/r/c/operations/homer/public/+/649706 | [production] |
| 07:08 | <ryankemper> | [wdqs] depooled `wdqs1013` while it catches up on lag | [production] |
| 07:06 | <ryankemper> | [wdqs deploy] Restarting `wdqs-categories` across all wdqs instances, one host at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` | [production] |
| 07:05 | <ryankemper> | [wdqs deploy] Restarting `wdqs-categories` across all test instances: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` | [production] |
| 07:05 | <ryankemper> | [wdqs-deploy] Restarting `wdqs-updater` across all instances, 4 instances at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` | [production] |
| 07:04 | <ryankemper@deploy1001> | Finished deploy [wdqs/wdqs@90f9bdd]: 0.3.56 (duration: 10m 39s) | [production] |
| 06:54 | <ryankemper> | [wdqs deploy] Tests passing on canary instance `wdqs1003` following canary deploy, proceeding to rest of fleet | [production] |
| 06:53 | <ryankemper@deploy1001> | Started deploy [wdqs/wdqs@90f9bdd]: 0.3.56 | [production] |
| 06:53 | <ryankemper> | [wdqs deploy] All tests passing on canary instance `wdqs1003` prior to deploy | [production] |
| 06:52 | <kart_> | Updated cxserver to 2020-12-16-164911-production (T234220, T269437) | [production] |
| 06:52 | <kart_> | Updated cxserver to 2020-12-16-164911-production (T234220, T234220) | [production] |
| 06:22 | <marostegui@cumin1001> | dbctl commit (dc=all): 'Depool es1013 for decommissioning T268436', diff saved to https://phabricator.wikimedia.org/P13562 and previous config saved to /var/cache/conftool/dbconfig/20201217-062249-marostegui.json | [production] |
| 06:22 | <kartik@deploy1001> | helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' . | [production] |
| 06:19 | <kartik@deploy1001> | helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' . | [production] |
| 06:17 | <kartik@deploy1001> | helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . | [production] |
| 06:13 | <marostegui@cumin1001> | END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) | [production] |
| 06:05 | <marostegui@cumin1001> | START - Cookbook sre.hosts.decommission | [production] |
| 05:56 | <marostegui> | Stop mysql on db1106 to clone db1154 | [production] |
| 05:55 | <marostegui@cumin1001> | dbctl commit (dc=all): 'Depool db1106 for cloning db1154:3311 T268742 ', diff saved to https://phabricator.wikimedia.org/P13560 and previous config saved to /var/cache/conftool/dbconfig/20201217-055556-marostegui.json | [production] |
| 01:35 | <andrew@cumin1001> | END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1019.eqiad.wmnet with reason: REIMAGE | [production] |
| 01:33 | <andrew@cumin1001> | START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1019.eqiad.wmnet with reason: REIMAGE | [production] |
| 01:01 | <twentyafterfour> | preparing to update phabricator translations | [production] |
| 00:22 | <mutante> | running puppet on mw2266, mw2370, mw2354 | [production] |