2021-03-03
09:42 <marostegui@cumin1001> dbctl commit (dc=all): 'db1164 (re)pooling @ 30%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14598 and previous config saved to /var/cache/conftool/dbconfig/20210303-094208-root.json [production]
09:41 <elukey@cumin1001> END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1132,1135-1138].eqiad.wmnet [production]
09:39 <elukey@cumin1001> START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1132,1135-1138].eqiad.wmnet [production]
09:38 <marostegui@cumin1001> dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P14597 and previous config saved to /var/cache/conftool/dbconfig/20210303-093847-root.json [production]
09:31 <aborrero@cumin1001> START - Cookbook sre.hosts.reboot-single for host cloudnet1003.eqiad.wmnet [production]
09:30 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1138.eqiad.wmnet with reason: REIMAGE [production]
09:28 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1138.eqiad.wmnet with reason: REIMAGE [production]
09:28 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1137.eqiad.wmnet with reason: REIMAGE [production]
09:27 <marostegui@cumin1001> dbctl commit (dc=all): 'db1164 (re)pooling @ 25%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14596 and previous config saved to /var/cache/conftool/dbconfig/20210303-092705-root.json [production]
09:25 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1137.eqiad.wmnet with reason: REIMAGE [production]
09:23 <marostegui@cumin1001> dbctl commit (dc=all): 'db1157 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P14595 and previous config saved to /var/cache/conftool/dbconfig/20210303-092343-root.json [production]
09:16 <jayme@deploy1002> helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [production]
09:16 <jayme@deploy1002> helmfile [staging-codfw] START helmfile.d/admin 'sync'. [production]
09:12 <marostegui@cumin1001> dbctl commit (dc=all): 'db1164 (re)pooling @ 15%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14594 and previous config saved to /var/cache/conftool/dbconfig/20210303-091201-root.json [production]
09:08 <marostegui@cumin1001> dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P14593 and previous config saved to /var/cache/conftool/dbconfig/20210303-090840-root.json [production]
09:02 <zpapierski@deploy1002> Finished deploy [wdqs/wdqs@dbfd1f6]: Deploying emergency fix - WDQS 0.3.66 (duration: 08m 17s) [production]
09:00 <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1157 for schema change', diff saved to https://phabricator.wikimedia.org/P14592 and previous config saved to /var/cache/conftool/dbconfig/20210303-090030-marostegui.json [production]
08:56 <marostegui@cumin1001> dbctl commit (dc=all): 'db1164 (re)pooling @ 10%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14591 and previous config saved to /var/cache/conftool/dbconfig/20210303-085658-root.json [production]
08:54 <zpapierski@deploy1002> Started deploy [wdqs/wdqs@dbfd1f6]: Deploying emergency fix - WDQS 0.3.66 [production]
08:50 <marostegui@cumin1001> dbctl commit (dc=all): 'Increase weight for db1164 in s1 T258361', diff saved to https://phabricator.wikimedia.org/P14590 and previous config saved to /var/cache/conftool/dbconfig/20210303-085014-marostegui.json [production]
08:48 <test> tcpircbot --joe [production]
08:40 <elukey@cumin1001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-worker1136.eqiad.wmnet with reason: REIMAGE [production]
08:40 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1136.eqiad.wmnet with reason: REIMAGE [production]
08:32 <godog> stop/mask tcpircbot-logmsgbot on pontoon-icinga-01 - T276299 [production]
07:30 <_joe_> test [production]
07:17 <_joe_> test log [production]
06:41 <marostegui> Testing log [production]
06:27 <ryankemper> T275345 T274555 `sudo confctl select 'name=elastic2054.codfw.wmnet' set/pooled=yes` on `ryankemper@puppetmaster1001` [production]
06:26 <ryankemper> T275345 T274555 `sudo confctl select 'name=elastic2045.codfw.wmnet' set/pooled=yes` on `ryankemper@puppetmaster1001` [production]
06:21 <ryankemper> T275345 T274555 Re-pooling `elastic2045` and `elastic2054` (commands follow) [production]
06:20 <ryankemper> T275345 T274555 `curl -H 'Content-Type: application/json' -XPUT http://localhost:9400/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_name": null,"_ip": null}}}'` => `{"acknowledged":true,"persistent":{},"transient":{}}` [production]
06:18 <ryankemper> T275345 T274555 `curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_name": null,"_ip": null}}}'` => `{"acknowledged":true,"persistent":{},"transient":{}}` [production]
06:17 <ryankemper> T275345 T274555 Unbanning `elastic2045` and `elastic2054` from our cluster now that both hosts have been re-imaged and are running without errors (commands follow) [production]
06:15 <ryankemper> T274555 Removed downtime for `elastic2054` [production]
05:32 <ryankemper> T274555 `sudo -i wmf-auto-reimage-host --conftool -p T274555 elastic2054.codfw.wmnet` on `ryankemper@cumin2001` tmux session `elastic_reimage_elastic2054` [production]
05:31 <ryankemper> T274555 `sudo -i wmf-auto-reimage-host --conftool -p T274555 elastic2054.codfw.wmnet` [production]
05:27 <ryankemper> Downtime `wdqs1012` until `2021-03-03 19:25:40` (~14 hours from now). Its `wdqs-updater` is failing; ultimately its blazegraph journal is probably in a bad state, meaning we'd have to copy one over from a healthy node, but we're not kicking that off right now so that we can investigate a little bit first [production]
05:16 <ryankemper> T275345 `ryankemper@elastic2045:~$ sudo apt-get upgrade wmf-elasticsearch-search-plugins` [production]
03:50 <ryankemper> Depooled `wdqs1012` until I've got its updater back online [production]
03:24 <ryankemper> `ryankemper@wdqs1012:~$ sudo systemctl restart wdqs-blazegraph` ~2 mins ago [production]
02:45 <ejegg> updated fundraising CiviCRM from e1dacbe348 to b13e70d968 [production]
02:09 <ejegg> updated payments-wiki from 365bf54393 to 65dbf0ed9d [production]
00:42 <Urbanecm> Finished deployment in Evening B&C window; logmsgbot is currently down, and a simple restart did not bring it back up [production]
00:41 <Urbanecm> 00:40:16 Synchronized wmf-config/config/idwiki.yaml: 80edca8a385870a0e60a98198c99c9839fc01d80: Enable Growth features in idwiki in stealth mode (T259024; 3/3) (duration: 01m 09s) [production]
00:38 <Urbanecm> 00:38:12 Synchronized dblists/growthexperiments.dblist: 80edca8a385870a0e60a98198c99c9839fc01d80: Enable Growth features in idwiki in stealth mode (T259024; 2/3) (duration: 01m 10s) [production]
00:31 <Urbanecm> 00:31:26 Synchronized wmf-config/InitialiseSettings.php: 80edca8a385870a0e60a98198c99c9839fc01d80: Enable Growth features in idwiki in stealth mode (T259024; 1/3) (duration: 01m 11s) [production]
00:21 <dwisehaupt> replication restarted on frdb2001 after utf8mb4 conversion completed. [production]
00:21 <mutante> alert1001 systemctl restart tcpircbot-logmsgbot [production]
00:08 <urbanecm@deploy1002> sync-file aborted: 80edca8a385870a0e60a98198c99c9839fc01d80: Enable Growth features in idwiki in stealth mode (T259024; 1/3) (duration: 06m 45s) [production]
2021-03-02
23:52 <mutante> mwmaint2002 - find /home -nouser -delete [production]