2021-03-18
|
07:32 |
<marostegui@cumin1001> |
dbctl commit (dc=all): 'db1126 (re)pooling @ 50%: Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P14944 and previous config saved to /var/cache/conftool/dbconfig/20210318-073250-root.json |
[production] |
07:20 |
<dcausse> |
depooling & restarting blazegraph on wdqs1005 |
[production] |
07:19 |
<marostegui> |
Deploy schema change on s4 codfw master, lag will appear - T276150 T276156 |
[production] |
07:17 |
<marostegui@cumin1001> |
dbctl commit (dc=all): 'db1126 (re)pooling @ 25%: Slowly repool db1126', diff saved to https://phabricator.wikimedia.org/P14943 and previous config saved to /var/cache/conftool/dbconfig/20210318-071747-root.json |
[production] |
07:15 |
<marostegui@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1156.eqiad.wmnet with reason: REIMAGE |
[production] |
07:13 |
<marostegui@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on db1156.eqiad.wmnet with reason: REIMAGE |
[production] |
06:32 |
<marostegui@cumin1001> |
dbctl commit (dc=all): 'Add db1161 to dbctl, depooled T258361', diff saved to https://phabricator.wikimedia.org/P14942 and previous config saved to /var/cache/conftool/dbconfig/20210318-063241-marostegui.json |
[production] |
06:32 |
<elukey> |
force a manual run of create_virtualenv.sh on an-tool1010 - superset down |
[analytics] |
06:22 |
<marostegui@cumin1001> |
dbctl commit (dc=all): 'Repool db2120', diff saved to https://phabricator.wikimedia.org/P14941 and previous config saved to /var/cache/conftool/dbconfig/20210318-062201-marostegui.json |
[production] |
06:04 |
<marostegui@cumin1001> |
dbctl commit (dc=all): 'Depool db1126 for schema change', diff saved to https://phabricator.wikimedia.org/P14940 and previous config saved to /var/cache/conftool/dbconfig/20210318-060445-marostegui.json |
[production] |
04:12 |
<bstorm> |
rebooted tools-sgeexec-0935.tools.eqiad.wmflabs because it forgot how to LDAP...likely root cause of the issues tonight |
[tools] |
03:59 |
<bstorm> |
rebooting grid master. sorry for the cron spam |
[tools] |
03:49 |
<bstorm> |
restarting sssd on tools-sgegrid-master |
[tools] |
03:46 |
<andrewbogott> |
restarting slapd on seaborgium, serpens, and r-o ldap replicas (we're getting irregular connection failures) |
[production] |
03:37 |
<bstorm> |
deleted a massive number of stuck jobs that misfired from the cron server |
[tools] |
03:35 |
<bstorm> |
rebooting tools-sgecron-01 to try to clear up the ldap-related errors coming out of it |
[tools] |
01:46 |
<bstorm> |
killed the toolschecker cron job, which had an LDAP error, and ran it again by hand |
[tools] |
00:05 |
<eileen> |
tools revision changed from b7b4060c30 to ef54260b0d |
[production] |
2021-03-17
|
23:42 |
<urbanecm@deploy1002> |
Synchronized wmf-config/InitialiseSettings.php: c730dd5feb865a8325279cd4e76c133512f14251: idwiki: Deploy Growth features to newcomers (T259024) (duration: 01m 08s) |
[production] |
23:40 |
<urbanecm@deploy1002> |
Synchronized wmf-config/CommonSettings.php: 5c14e7d2045f0905f7e85b249e821bbe8d69c600: Define confirmed group in MediaWikiServices hook (T275334, T277704, T275310, T275333) (duration: 01m 08s) |
[production] |
23:30 |
<ebernhardson@deploy1002> |
Synchronized php-1.36.0-wmf.35/extensions/CirrusSearch/profiles/FallbackProfiles.config.php: Add fallback profile including glent m1 (duration: 01m 42s) |
[production] |
22:27 |
<robh@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1038.eqiad.wmnet with reason: REIMAGE |
[production] |
22:25 |
<robh@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1037.eqiad.wmnet with reason: REIMAGE |
[production] |
22:25 |
<robh@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on mc1038.eqiad.wmnet with reason: REIMAGE |
[production] |
22:23 |
<robh@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on mc1037.eqiad.wmnet with reason: REIMAGE |
[production] |
20:57 |
<bstorm> |
deployed changes to rbac for kubernetes to add kubectl top access for tools |
[tools] |
20:52 |
<robh@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1184.eqiad.wmnet with reason: REIMAGE |
[production] |
20:50 |
<robh@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1183.eqiad.wmnet with reason: REIMAGE |
[production] |
20:48 |
<robh@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on db1184.eqiad.wmnet with reason: REIMAGE |
[production] |
20:48 |
<robh@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1182.eqiad.wmnet with reason: REIMAGE |
[production] |
20:47 |
<robh@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on db1183.eqiad.wmnet with reason: REIMAGE |
[production] |
20:46 |
<robh@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1181.eqiad.wmnet with reason: REIMAGE |
[production] |
20:45 |
<razzi> |
release wikistats 2.9.0 |
[analytics] |
20:45 |
<robh@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on db1182.eqiad.wmnet with reason: REIMAGE |
[production] |
20:44 |
<robh@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1180.eqiad.wmnet with reason: REIMAGE |
[production] |
20:43 |
<robh@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on db1181.eqiad.wmnet with reason: REIMAGE |
[production] |
20:42 |
<andrew@deploy1002> |
Finished deploy [horizon/deploy@17ea780]: display volume usage summaries (duration: 03m 34s) |
[production] |
20:42 |
<robh@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1179.eqiad.wmnet with reason: REIMAGE |
[production] |
20:41 |
<robh@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on db1180.eqiad.wmnet with reason: REIMAGE |
[production] |
20:40 |
<robh@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1178.eqiad.wmnet with reason: REIMAGE |
[production] |
20:39 |
<robh@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on db1179.eqiad.wmnet with reason: REIMAGE |
[production] |
20:39 |
<andrew@deploy1002> |
Started deploy [horizon/deploy@17ea780]: display volume usage summaries |
[production] |
20:38 |
<robh@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1177.eqiad.wmnet with reason: REIMAGE |
[production] |
20:37 |
<robh@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on db1178.eqiad.wmnet with reason: REIMAGE |
[production] |
20:35 |
<robh@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on db1177.eqiad.wmnet with reason: REIMAGE |
[production] |
20:30 |
<hashar> |
Reloaded Zuul for I2368478e4c4ab8752581f55a7c5ab493fafdeb41 |
[releng] |
20:26 |
<andrewbogott> |
moving tools-elastic-3 to cloudvirt1034; two elastic nodes shouldn't be on the same hv |
[tools] |
20:19 |
<dzahn@cumin1001> |
END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw2238.codfw.wmnet |
[production] |
20:15 |
<ottomata> |
install anaconda-wmf 2020.02~wmf3 on analytics cluster clients and workers - T262847 |
[analytics] |
20:08 |
<dzahn@cumin1001> |
START - Cookbook sre.hosts.decommission for hosts mw2238.codfw.wmnet |
[production] |