2021-02-24
09:56 <jayme@deploy1001> helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [production]
09:51 <marostegui@cumin1001> dbctl commit (dc=all): 'db1157 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P14455 and previous config saved to /var/cache/conftool/dbconfig/20210224-095150-root.json [production]
09:45 <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1157 for schema change', diff saved to https://phabricator.wikimedia.org/P14454 and previous config saved to /var/cache/conftool/dbconfig/20210224-094523-marostegui.json [production]
09:34 <marostegui> Update pc2007, pc2010, db2071 [production]
09:31 <marostegui> Update db1077 [production]
09:27 <jiji@cumin1001> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1033.eqiad.wmnet [production]
09:20 <jiji@cumin1001> START - Cookbook sre.hosts.reboot-single for host mc1033.eqiad.wmnet [production]
09:19 <effie> upgrade memcached on mc1033, mc2033 [production]
09:07 <jmm@cumin2001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast1002.wikimedia.org with reason: REIMAGE [production]
09:06 <volans> run "sudo find . -user root -exec chown netbox. '{}' \;" in /srv/deployment/netbox/deploy-cache/revs on netbox* hosts to prevent scap failures on cleanup - T265084 [production]
09:05 <jmm@cumin2001> START - Cookbook sre.hosts.downtime for 2:00:00 on bast1002.wikimedia.org with reason: REIMAGE [production]
09:01 <elukey> roll restart druid brokers on druid public [production]
08:58 <jayme@deploy1001> helmfile [codfw] Ran 'sync' command on namespace 'mathoid' for release 'production' . [production]
08:53 <jayme@deploy1001> helmfile [codfw] Ran 'sync' command on namespace 'eventstreams-internal' for release 'main' . [production]
08:52 <jayme@deploy1001> helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'production' . [production]
08:52 <jayme@deploy1001> helmfile [codfw] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [production]
08:50 <jayme@deploy1001> helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [production]
08:50 <jayme@deploy1001> helmfile [codfw] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [production]
08:48 <jayme@deploy1001> helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [production]
08:48 <jayme@deploy1001> helmfile [codfw] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [production]
08:35 <moritzm> reimaging bast1002 to Buster [production]
08:33 <jayme@deploy1001> helmfile [codfw] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [production]
08:32 <jayme@deploy1001> helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' . [production]
08:30 <jayme@deploy1001> helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . [production]
08:26 <jayme@deploy1001> helmfile [codfw] Ran 'sync' command on namespace 'apertium' for release 'production' . [production]
08:04 <jynus> restarting db2101, db2139, db2141 T271913 [production]
07:56 <moritzm> installing remaining openldap updates for buster [production]
06:24 <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1090.eqiad.wmnet [production]
06:18 <marostegui@cumin1001> START - Cookbook sre.hosts.decommission for hosts db1090.eqiad.wmnet [production]
04:10 <ryankemper> T267927 [WDQS Data Reload] Running `/srv/deployment/wdqs/wdqs/loadData.sh -n wdq -d /srv/wdqs/munged/ -s 864` on `ryankemper@wdqs2008` tmux session `data_reload` [production]
04:04 <ryankemper> [WDQS] Depooled `wdqs2008` [production]
03:16 <pt1979@cumin2001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db2149.codfw.wmnet with reason: REIMAGE [production]
03:13 <pt1979@cumin2001> START - Cookbook sre.hosts.downtime for 2:00:00 on db2149.codfw.wmnet with reason: REIMAGE [production]
03:03 <pt1979@cumin2001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db2148.codfw.wmnet with reason: REIMAGE [production]
03:01 <pt1979@cumin2001> START - Cookbook sre.hosts.downtime for 2:00:00 on db2148.codfw.wmnet with reason: REIMAGE [production]
02:58 <ryankemper> [WDQS Data Reload] Restarting reload on test node `wdqs1009` from where it last left off: `/srv/deployment/wdqs/wdqs/loadData.sh -n wdq -d /srv/wdqs/munged/ -s 947` [production]
02:57 <ryankemper> [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there are no relevant criticals in Icinga, and Grafana looks good [production]
02:39 <pt1979@cumin2001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db2147.codfw.wmnet with reason: REIMAGE [production]
02:37 <pt1979@cumin2001> START - Cookbook sre.hosts.downtime for 2:00:00 on db2147.codfw.wmnet with reason: REIMAGE [production]
02:35 <pt1979@cumin2001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on db2146.codfw.wmnet with reason: REIMAGE [production]
02:33 <pt1979@cumin2001> START - Cookbook sre.hosts.downtime for 2:00:00 on db2146.codfw.wmnet with reason: REIMAGE [production]
02:30 <ryankemper> [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [production]
02:29 <ryankemper> [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [production]
02:29 <ryankemper> [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [production]
02:27 <ryankemper@deploy1001> Finished deploy [wdqs/wdqs@b5fc9d5]: 0.3.64 (duration: 06m 24s) [production]
02:24 <ebernhardson@deploy1001> Finished deploy [wikimedia/discovery/analytics@25549e7]: ores_bulk_ingest: use backoffs starting at 30sec (duration: 01m 37s) [production]
02:22 <gehel@cumin2001> END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [production]
02:22 <ebernhardson@deploy1001> Started deploy [wikimedia/discovery/analytics@25549e7]: ores_bulk_ingest: use backoffs starting at 30sec [production]
02:20 <ryankemper@deploy1001> Started deploy [wdqs/wdqs@b5fc9d5]: 0.3.64 [production]
02:18 <ryankemper@deploy1001> Finished deploy [wdqs/wdqs@b5fc9d5]: 0.3.64 (duration: 11m 22s) [production]