2020-12-17 §
11:27 <godog> bounce apache2 on grafana1002 [production]
11:26 <elukey@cumin1001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-test-worker1003.eqiad.wmnet with reason: REIMAGE [production]
11:24 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-worker1001.eqiad.wmnet with reason: REIMAGE [production]
11:22 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-worker1002.eqiad.wmnet with reason: REIMAGE [production]
11:21 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1001.eqiad.wmnet with reason: REIMAGE [production]
11:21 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1003.eqiad.wmnet with reason: REIMAGE [production]
11:20 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-worker1002.eqiad.wmnet with reason: REIMAGE [production]
11:20 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1001.eqiad.wmnet with reason: REIMAGE [production]
11:18 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1002.eqiad.wmnet with reason: REIMAGE [production]
11:16 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1001.eqiad.wmnet with reason: REIMAGE [production]
11:16 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1002.eqiad.wmnet with reason: REIMAGE [production]
11:10 <jbond@cumin1001> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [production]
11:08 <jbond@cumin1001> START - Cookbook sre.hosts.reboot-single [production]
10:50 <elukey@cumin1001> END (PASS) - Cookbook sre.hadoop.stop-cluster (exit_code=0) for Hadoop test cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001 [production]
10:45 <elukey@cumin1001> START - Cookbook sre.hadoop.stop-cluster for Hadoop test cluster: Stop the Hadoop cluster before maintenance. - elukey@cumin1001 [production]
10:21 <jbond42> updating RemoteIP on phabricator https://gerrit.wikimedia.org/r/c/operations/puppet/+/649872 [production]
09:57 <vgutierrez> repool ats-tls on cp5011 [production]
09:00 <marostegui> Sanitize s1 and s5 on db1154 T268742 [production]
08:30 <godog> swift codfw-prod: more weight to ms-be20[58-61] - T269337 [production]
07:49 <ryankemper> [wdqs deploy] (wdqs deploy complete) [production]
07:19 <marostegui> Stop mysql on db1082 to clone db1154 [production]
07:19 <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1082 for cloning db1154:3315 T268742 ', diff saved to https://phabricator.wikimedia.org/P13563 and previous config saved to /var/cache/conftool/dbconfig/20201217-071903-marostegui.json [production]
07:18 <elukey> reboot an-airflow1001 for kernel upgrades [production]
07:08 <elukey> update analytics-in4 filter on cr1/cr2-eqiad for https://gerrit.wikimedia.org/r/c/operations/homer/public/+/649706 [production]
07:08 <ryankemper> [wdqs] depooled `wdqs1013` while it catches up on lag [production]
07:06 <ryankemper> [wdqs deploy] Restarting `wdqs-categories` across all wdqs instances, one host at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [production]
07:05 <ryankemper> [wdqs deploy] Restarting `wdqs-categories` across all test instances: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [production]
07:05 <ryankemper> [wdqs-deploy] Restarting `wdqs-updater` across all instances, 4 instances at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [production]
07:04 <ryankemper@deploy1001> Finished deploy [wdqs/wdqs@90f9bdd]: 0.3.56 (duration: 10m 39s) [production]
06:54 <ryankemper> [wdqs deploy] Tests passing on canary instance `wdqs1003` following canary deploy, proceeding to rest of fleet [production]
06:53 <ryankemper@deploy1001> Started deploy [wdqs/wdqs@90f9bdd]: 0.3.56 [production]
06:53 <ryankemper> [wdqs deploy] All tests passing on canary instance `wdqs1003` prior to deploy [production]
06:52 <kart_> Updated cxserver to 2020-12-16-164911-production (T234220, T269437) [production]
06:52 <kart_> Updated cxserver to 2020-12-16-164911-production (T234220, T234220) [production]
06:22 <marostegui@cumin1001> dbctl commit (dc=all): 'Depool es1013 for decommissioning T268436', diff saved to https://phabricator.wikimedia.org/P13562 and previous config saved to /var/cache/conftool/dbconfig/20201217-062249-marostegui.json [production]
06:22 <kartik@deploy1001> helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' . [production]
06:19 <kartik@deploy1001> helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' . [production]
06:17 <kartik@deploy1001> helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [production]
06:13 <marostegui@cumin1001> END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [production]
06:05 <marostegui@cumin1001> START - Cookbook sre.hosts.decommission [production]
05:56 <marostegui> Stop mysql on db1106 to clone db1154 [production]
05:55 <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1106 for cloning db1154:3311 T268742 ', diff saved to https://phabricator.wikimedia.org/P13560 and previous config saved to /var/cache/conftool/dbconfig/20201217-055556-marostegui.json [production]
01:35 <andrew@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1019.eqiad.wmnet with reason: REIMAGE [production]
01:33 <andrew@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1019.eqiad.wmnet with reason: REIMAGE [production]
01:01 <twentyafterfour> preparing to update phabricator translations [production]
00:22 <mutante> running puppet on mw2266, mw2370, mw2354 [production]
2020-12-16 §
23:56 <bstorm> bootstrapped meta_p database for the new s7 replicas T269427 [production]
20:12 <marxarelli> group1 to 1.36.0-wmf.22 complete. no new errors or concerning rates (refs T267415) [production]
20:06 <dduvall@deploy1001> Synchronized php: group1 wikis to 1.36.0-wmf.22 (duration: 01m 01s) [production]
20:05 <legoktm> added myself to the ops LDAP group [production]