production SAL

2021-05-02 §
13:40	<dcaro@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cloudmetrics1002.eqiad.wmnet with reason: Flaky host	[production]
13:40	<dcaro@cumin1001>	START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on cloudmetrics1002.eqiad.wmnet with reason: Flaky host	[production]
2021-05-01 §
19:12	<Urbanecm>	Invalidate password for MaraBot@SUL (T281586)	[production]
16:58	<legoktm@deploy1002>	Synchronized logos/config.yaml: Add eswiki 20th anniversary logos (duration: 00m 57s)	[production]
16:56	<legoktm@deploy1002>	Synchronized wmf-config/logos.php: Use eswiki 20th anniversary logos (T280908) (duration: 00m 56s)	[production]
16:50	<legoktm@deploy1002>	Synchronized static/images/project-logos/: Add eswiki 20th anniversary logos (duration: 00m 57s)	[production]
07:22	<elukey>	powercycle elastic2033 - no ssh, no tty available via mgmt	[production]
2021-04-30 §
21:54	<mutante>	people1003 - rsyncing /home from people1002	[production]
15:30	<dcaro@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudmetrics1002.eqiad.wmnet with reason: Flaky host	[production]
15:29	<dcaro@cumin1001>	START - Cookbook sre.hosts.downtime for 2:00:00 on cloudmetrics1002.eqiad.wmnet with reason: Flaky host	[production]
15:25	<bstorm>	hard rebooting cloudmetrics1002 T275605	[production]
11:40	<ladsgroup@deploy1002>	Synchronized static/favicon/wikitech.ico: Config: [[gerrit:683835|Update wikitech logo]] (duration: 00m 56s)	[production]
11:36	<ladsgroup@deploy1002>	Synchronized static/images/project-logos/wikitech-1.5x.png: Config: [[gerrit:683835|Update wikitech logo]] (duration: 00m 56s)	[production]
11:34	<ladsgroup@deploy1002>	Synchronized static/images/project-logos/wikitech-2x.png: Config: [[gerrit:683835|Update wikitech logo]] (duration: 00m 57s)	[production]
11:33	<ladsgroup@deploy1002>	Synchronized static/images/project-logos/wikitech.png: Config: [[gerrit:683835|Update wikitech logo]] (duration: 00m 57s)	[production]
11:31	<ladsgroup@deploy1002>	Synchronized logos/config.yaml: Config: [[gerrit:683835|Update wikitech logo]] (duration: 00m 57s)	[production]
09:04	<dcaro@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: primary nic disconnected	[production]
09:03	<dcaro@cumin1001>	START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: primary nic disconnected	[production]
08:11	<moritzm>	remove mc1027 from debmonitor, server is broken and won't return (T276415)	[production]
07:38	<moritzm>	installing iputils updates from Buster point release	[production]
06:15	<marostegui@cumin1001>	dbctl commit (dc=all): 'db1114 (re)pooling @ 100%: Repool db1114', diff saved to https://phabricator.wikimedia.org/P15667 and previous config saved to /var/cache/conftool/dbconfig/20210430-061549-root.json	[production]
06:00	<marostegui@cumin1001>	dbctl commit (dc=all): 'db1114 (re)pooling @ 75%: Repool db1114', diff saved to https://phabricator.wikimedia.org/P15666 and previous config saved to /var/cache/conftool/dbconfig/20210430-060046-root.json	[production]
05:51	<ryankemper@cumin1001>	END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563	[production]
05:45	<marostegui@cumin1001>	dbctl commit (dc=all): 'db1114 (re)pooling @ 50%: Repool db1114', diff saved to https://phabricator.wikimedia.org/P15665 and previous config saved to /var/cache/conftool/dbconfig/20210430-054542-root.json	[production]
05:30	<marostegui@cumin1001>	dbctl commit (dc=all): 'db1114 (re)pooling @ 25%: Repool db1114', diff saved to https://phabricator.wikimedia.org/P15664 and previous config saved to /var/cache/conftool/dbconfig/20210430-053038-root.json	[production]
05:16	<marostegui>	Upgrade kernel on db1114	[production]
05:15	<marostegui@cumin1001>	dbctl commit (dc=all): 'Depool db1114 to enable report_host T266483', diff saved to https://phabricator.wikimedia.org/P15663 and previous config saved to /var/cache/conftool/dbconfig/20210430-051558-marostegui.json	[production]
05:08	<marostegui@cumin1001>	END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1080.eqiad.wmnet	[production]
04:57	<marostegui@cumin1001>	START - Cookbook sre.hosts.decommission for hosts db1080.eqiad.wmnet	[production]
04:56	<ryankemper>	[WDQS] `ryankemper@wdqs1006:~$ sudo systemctl restart wdqs-blazegraph`	[production]
04:43	<ryankemper>	T280563 `sudo -i cookbook sre.elasticsearch.rolling-operation search_eqiad "eqiad reboot to apply sec updates" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563` on `ryankemper@cumin1001` tmux session `elastic_restarts`	[production]
04:43	<ryankemper@cumin1001>	START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad reboot to apply sec updates - ryankemper@cumin1001 - T280563	[production]
04:42	<ryankemper>	T261239 `elastic2033`, which is known to be in a state of hardware failure (we have a ticket open), is holding up the reboot of codfw. I don't think we have a good way to exclude a node currently. Going to just proceed to `eqiad` for now	[production]
04:41	<ryankemper@cumin1001>	END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563	[production]
04:39	<ryankemper>	T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1003.eqiad.wmnet --dest wdqs1010.eqiad.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `reimage`	[production]
04:39	<ryankemper@cumin1001>	END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99)	[production]
04:39	<ryankemper@cumin1001>	START - Cookbook sre.wdqs.data-transfer	[production]
04:05	<ryankemper@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1010.eqiad.wmnet with reason: REIMAGE	[production]
04:03	<ryankemper@cumin1001>	START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1010.eqiad.wmnet with reason: REIMAGE	[production]
03:50	<ryankemper>	T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs1010.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage`	[production]
03:47	<ryankemper>	T280563 about half of codfw nodes were rebooted before the failure caused by the write queue not emptying fast enough, kicking it off again: `sudo -i cookbook sre.elasticsearch.rolling-operation search_codfw "codfw reboot" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563` on `ryankemper@cumin1001` tmux session `elastic_restarts`	[production]
03:45	<ryankemper@cumin1001>	START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563	[production]
01:08	<ryankemper@cumin1001>	END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563	[production]
2021-04-29 §
23:36	<thcipriani@deploy1002>	Synchronized README: Config: [[gerrit:683749|Revert "DEMO: Add newline to README"]] (duration: 00m 56s)	[production]
23:18	<ryankemper>	T280563 successful reboot of `relforge100[3,4]`; `relforge` cluster is back to green status.	[production]
23:16	<thcipriani@deploy1002>	Synchronized README: Config: [[gerrit:683747|DEMO: Add newline to README]] (duration: 00m 56s)	[production]
23:08	<ryankemper>	T280563 `sudo -i cookbook sre.elasticsearch.rolling-operation search_codfw "codfw reboot" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563` on `ryankemper@cumin1001` tmux session `elastic_restarts` (amended command)	[production]
23:06	<ryankemper>	T280563 `sudo -i cookbook sre.elasticsearch.rolling-operation codfw "codfw reboot" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563` on `ryankemper@cumin1001` tmux session `elastic_restarts`	[production]
23:05	<ryankemper@cumin1001>	START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563	[production]
22:46	<ryankemper>	T280563 Current master is `relforge1003-relforge-eqiad`, will reboot `1004` first then `1003` after	[production]