production SAL

2021-03-03
10:25	<jayme@deploy1002>	helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.	[production]
10:25	<jayme@deploy1002>	helmfile [staging-codfw] START helmfile.d/admin 'sync'.	[production]
10:12	<marostegui@cumin1001>	dbctl commit (dc=all): 'db1166 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P14605 and previous config saved to /var/cache/conftool/dbconfig/20210303-101255-root.json	[production]
10:12	<marostegui@cumin1001>	dbctl commit (dc=all): 'db1164 (re)pooling @ 60%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14604 and previous config saved to /var/cache/conftool/dbconfig/20210303-101215-root.json	[production]
10:05	<aborrero@cumin1001>	END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet1003.eqiad.wmnet	[production]
10:00	<aborrero@cumin1001>	START - Cookbook sre.hosts.reboot-single for host cloudnet1003.eqiad.wmnet	[production]
10:00	<aborrero@cumin1001>	END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet1003.eqiad.wmnet	[production]
09:57	<marostegui@cumin1001>	dbctl commit (dc=all): 'db1166 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P14602 and previous config saved to /var/cache/conftool/dbconfig/20210303-095751-root.json	[production]
09:57	<marostegui@cumin1001>	dbctl commit (dc=all): 'db1164 (re)pooling @ 50%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14601 and previous config saved to /var/cache/conftool/dbconfig/20210303-095712-root.json	[production]
09:55	<aborrero@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on cloudnet1003.eqiad.wmnet with reason: HW issue	[production]
09:54	<aborrero@cumin1001>	START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on cloudnet1003.eqiad.wmnet with reason: HW issue	[production]
09:54	<marostegui@cumin1001>	dbctl commit (dc=all): 'Depool db1166 for schema change', diff saved to https://phabricator.wikimedia.org/P14600 and previous config saved to /var/cache/conftool/dbconfig/20210303-095417-marostegui.json	[production]
09:53	<marostegui@cumin1001>	dbctl commit (dc=all): 'db1157 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P14599 and previous config saved to /var/cache/conftool/dbconfig/20210303-095351-root.json	[production]
09:42	<marostegui@cumin1001>	dbctl commit (dc=all): 'db1164 (re)pooling @ 30%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14598 and previous config saved to /var/cache/conftool/dbconfig/20210303-094208-root.json	[production]
09:41	<elukey@cumin1001>	END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker[1132,1135-1138].eqiad.wmnet	[production]
09:39	<elukey@cumin1001>	START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker[1132,1135-1138].eqiad.wmnet	[production]
09:38	<marostegui@cumin1001>	dbctl commit (dc=all): 'db1157 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P14597 and previous config saved to /var/cache/conftool/dbconfig/20210303-093847-root.json	[production]
09:31	<aborrero@cumin1001>	START - Cookbook sre.hosts.reboot-single for host cloudnet1003.eqiad.wmnet	[production]
09:30	<elukey@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1138.eqiad.wmnet with reason: REIMAGE	[production]
09:28	<elukey@cumin1001>	START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1138.eqiad.wmnet with reason: REIMAGE	[production]
09:28	<elukey@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1137.eqiad.wmnet with reason: REIMAGE	[production]
09:27	<marostegui@cumin1001>	dbctl commit (dc=all): 'db1164 (re)pooling @ 25%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14596 and previous config saved to /var/cache/conftool/dbconfig/20210303-092705-root.json	[production]
09:25	<elukey@cumin1001>	START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1137.eqiad.wmnet with reason: REIMAGE	[production]
09:23	<marostegui@cumin1001>	dbctl commit (dc=all): 'db1157 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P14595 and previous config saved to /var/cache/conftool/dbconfig/20210303-092343-root.json	[production]
09:16	<jayme@deploy1002>	helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.	[production]
09:16	<jayme@deploy1002>	helmfile [staging-codfw] START helmfile.d/admin 'sync'.	[production]
09:12	<marostegui@cumin1001>	dbctl commit (dc=all): 'db1164 (re)pooling @ 15%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14594 and previous config saved to /var/cache/conftool/dbconfig/20210303-091201-root.json	[production]
09:08	<marostegui@cumin1001>	dbctl commit (dc=all): 'db1157 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P14593 and previous config saved to /var/cache/conftool/dbconfig/20210303-090840-root.json	[production]
09:02	<zpapierski@deploy1002>	Finished deploy [wdqs/wdqs@dbfd1f6]: Deploying emergency fix - WDQS 0.3.66 (duration: 08m 17s)	[production]
09:00	<marostegui@cumin1001>	dbctl commit (dc=all): 'Depool db1157 for schema change', diff saved to https://phabricator.wikimedia.org/P14592 and previous config saved to /var/cache/conftool/dbconfig/20210303-090030-marostegui.json	[production]
08:56	<marostegui@cumin1001>	dbctl commit (dc=all): 'db1164 (re)pooling @ 10%: Slowly repool db1164 in s1 for the first time', diff saved to https://phabricator.wikimedia.org/P14591 and previous config saved to /var/cache/conftool/dbconfig/20210303-085658-root.json	[production]
08:54	<zpapierski@deploy1002>	Started deploy [wdqs/wdqs@dbfd1f6]: Deploying emergency fix - WDQS 0.3.66	[production]
08:50	<marostegui@cumin1001>	dbctl commit (dc=all): 'Increase weight for db1164 in s1 T258361', diff saved to https://phabricator.wikimedia.org/P14590 and previous config saved to /var/cache/conftool/dbconfig/20210303-085014-marostegui.json	[production]
08:48	<test>	tcpircbot --joe	[production]
08:40	<elukey@cumin1001>	END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on an-worker1136.eqiad.wmnet with reason: REIMAGE	[production]
08:40	<elukey@cumin1001>	START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1136.eqiad.wmnet with reason: REIMAGE	[production]
08:32	<godog>	stop/mask tcpircbot-logmsgbot on pontoon-icinga-01 - T276299	[production]
07:30	<_joe_>	test	[production]
07:17	<_joe_>	test log	[production]
06:41	<marostegui>	Testing log	[production]
06:27	<ryankemper>	T275345 T274555 `sudo confctl select 'name=elastic2054.codfw.wmnet' set/pooled=yes` on `ryankemper@puppetmaster1001`	[production]
06:26	<ryankemper>	T275345 T274555 `sudo confctl select 'name=elastic2045.codfw.wmnet' set/pooled=yes` on `ryankemper@puppetmaster1001`	[production]
06:21	<ryankemper>	T275345 T274555 Re-pooling `elastic2045` and `elastic2054` (commands follow)	[production]
06:20	<ryankemper>	T275345 T274555 `curl -H 'Content-Type: application/json' -XPUT http://localhost:9400/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_name": null,"_ip": null}}}'` => `{"acknowledged":true,"persistent":{},"transient":{}}`	[production]
06:18	<ryankemper>	T275345 T274555 `curl -H 'Content-Type: application/json' -XPUT http://localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.exclude":{"_name": null,"_ip": null}}}'` => `{"acknowledged":true,"persistent":{},"transient":{}}`	[production]
06:17	<ryankemper>	T275345 T274555 Unbanning `elastic2045` and `elastic2054` from our cluster now that both hosts have been re-imaged and are running without errors (commands follow)	[production]
06:15	<ryankemper>	T274555 Removed downtime for `elastic2054`	[production]
05:32	<ryankemper>	T274555 `sudo -i wmf-auto-reimage-host --conftool -p T274555 elastic2054.codfw.wmnet` on `ryankemper@cumin2001` tmux session `elastic_reimage_elastic2054`	[production]
05:31	<ryankemper>	T274555 `sudo -i wmf-auto-reimage-host --conftool -p T274555 elastic2054.codfw.wmnet`	[production]
05:27	<ryankemper>	Downtime `wdqs1012` until `2021-03-03 19:25:40` (~14 hours from now). Its `wdqs-updater` is failing; ultimately its blazegraph journal is probably in a bad state, meaning we'd have to copy one over from a healthy node, but not kicking that off right now so that we can investigate a little bit first	[production]
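
The entries above appear to share one layout: a `HH:MM` time, an actor in angle brackets (either `user@host` or a bare nick), a free-text message, and a trailing `[project]` tag. The following is a minimal sketch of splitting a line into those fields, assuming that layout holds and that fields are separated by whitespace (tabs in this dump). The names `Entry`, `LINE_RE`, and `parse_line` are hypothetical and not part of any SAL tooling.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: "HH:MM <actor> message [project]".
# The message itself may contain bracketed text (e.g. "[staging-codfw]"),
# so only the final bracketed token on the line is treated as the project.
LINE_RE = re.compile(r"^(\d{2}:\d{2})\s+<([^>]+)>\s+(.*?)\s+\[([^\]]+)\]$")


class Entry(NamedTuple):
    time: str
    user: str
    host: Optional[str]  # None when the actor is a bare nick such as "godog"
    message: str
    project: str


def parse_line(line: str) -> Optional[Entry]:
    """Return an Entry for a log line, or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    time, actor, message, project = m.groups()
    # "jayme@deploy1002" -> ("jayme", "deploy1002"); "godog" -> ("godog", "")
    user, _, host = actor.partition("@")
    return Entry(time, user, host or None, message, project)


if __name__ == "__main__":
    sample = (
        "10:25\t<jayme@deploy1002>\t"
        "helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.\t[production]"
    )
    entry = parse_line(sample)
    print(entry.time, entry.user, entry.host, entry.project)
```

Date headers such as `2021-03-03` do not match the pattern and come back as `None`, so a caller could use them to track the current day while iterating over the page.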