production SAL

2020-09-04
21:37	<ryankemper>	Tests on canary `wdqs1003` passing, beginning full wdqs deploy	[production]
21:36	<ryankemper@deploy1001>	Started deploy [wdqs/wdqs@c7e6b35]: 0.3.47	[production]
21:31	<ryankemper>	`ryankemper@wdqs2002:~$ sudo systemctl restart wdqs-blazegraph`	[production]
21:06	<mutante>	apt1001 - removed all libnginx-mod* packages except libnginx-mod-http-echo ; sudo apt-get autoremove ; run puppet ; restarted nginx - apt.wikimedia.org switched to nginx-light (T261962)	[production]
21:02	<mutante>	apt1001 - remove all libnginx-mod* packages except libnginx-mod-http-echo	[production]
20:59	<mutante>	apt2001 - sudo apt-get autoremove	[production]
20:51	<mutante>	apt2001 - apt-get remove --purge libnginx* and run puppet to replace nginx-full with nginx-light (T261962)	[production]
20:43	<cmjohnson@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)	[production]
20:41	<cmjohnson@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)	[production]
20:39	<cmjohnson@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)	[production]
20:38	<cmjohnson@cumin1001>	START - Cookbook sre.hosts.downtime	[production]
20:38	<cmjohnson@cumin1001>	START - Cookbook sre.hosts.downtime	[production]
20:36	<cmjohnson@cumin1001>	START - Cookbook sre.hosts.downtime	[production]
20:36	<cmjohnson@cumin1001>	END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)	[production]
20:35	<cmjohnson@cumin1001>	END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)	[production]
20:34	<cmjohnson@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)	[production]
20:32	<cmjohnson@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)	[production]
20:31	<cmjohnson@cumin1001>	START - Cookbook sre.hosts.downtime	[production]
20:31	<cmjohnson@cumin1001>	START - Cookbook sre.hosts.downtime	[production]
20:30	<cmjohnson@cumin1001>	START - Cookbook sre.hosts.downtime	[production]
20:30	<cmjohnson@cumin1001>	START - Cookbook sre.hosts.downtime	[production]
20:05	<cmjohnson@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)	[production]
20:04	<cmjohnson@cumin1001>	END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)	[production]
20:03	<cmjohnson@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)	[production]
20:01	<cmjohnson@cumin1001>	START - Cookbook sre.hosts.downtime	[production]
20:01	<cmjohnson@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)	[production]
20:00	<cmjohnson@cumin1001>	START - Cookbook sre.hosts.downtime	[production]
19:59	<cmjohnson@cumin1001>	START - Cookbook sre.hosts.downtime	[production]
19:59	<cmjohnson@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)	[production]
19:57	<cmjohnson@cumin1001>	START - Cookbook sre.hosts.downtime	[production]
19:57	<cmjohnson@cumin1001>	START - Cookbook sre.hosts.downtime	[production]
19:22	<mutante>	Icinga - ACKing with sticky - alerts on test and dev hosts	[production]
18:10	<milimetric@deploy1001>	Finished deploy [analytics/aqs/deploy@95d6432]: AQS: new editors by country endpoint, low risk so trying on a Friday with SRE blessing (duration: 07m 35s)	[production]
18:02	<milimetric@deploy1001>	Started deploy [analytics/aqs/deploy@95d6432]: AQS: new editors by country endpoint, low risk so trying on a Friday with SRE blessing	[production]
10:31	<elukey@cumin1001>	END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0)	[production]
10:29	<marostegui@cumin1001>	dbctl commit (dc=all): 'Depool db1087 for MCR schema change', diff saved to https://phabricator.wikimedia.org/P12492 and previous config saved to /var/cache/conftool/dbconfig/20200904-102955-marostegui.json	[production]
10:28	<marostegui>	Deploy MCR schema change on db1087 (sanitarium master), this will generate lag (probably a few days) on s8 labsdb hosts T238966	[production]
09:48	<marostegui>	Restart prometheus-mysqld-exporter on db2125	[production]
09:11	<elukey@cumin1001>	START - Cookbook sre.hadoop.roll-restart-workers	[production]
08:58	<elukey@cumin1001>	END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0)	[production]
08:31	<elukey@cumin1001>	START - Cookbook sre.hadoop.roll-restart-workers	[production]
08:29	<elukey>	roll restart of the hadoop workers (test and analytics cluster) for openjdk upgrades	[production]
08:08	<moritzm>	installing 4.19.132 kernel on buster systems (only installing the deb, reboots separately)	[production]
07:30	<moritzm>	installing 4.9.228 kernel on stretch systems (only installing the deb, reboots separately)	[production]
05:13	<marostegui>	Deploy MCR schema change on s4 eqiad master T238966	[production]
01:51	<milimetric@deploy1001>	Finished deploy [analytics/aqs/deploy@95d6432]: AQS: Deploying new geoeditors endpoints (duration: 63m 18s)	[production]
01:35	<pt1979@cumin2001>	END (PASS) - Cookbook sre.dns.netbox (exit_code=0)	[production]
01:30	<pt1979@cumin2001>	START - Cookbook sre.dns.netbox	[production]
01:23	<ryankemper>	(Following the restart of blazegraph, service has been restored to `wdqs2003`. See https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs&from=1599182219699&to=1599182547699)	[production]
01:16	<ryankemper>	Glancing at https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs&from=1599170628749&to=1599182011243, looks like `wdqs2003`'s blazegraph isn't happy based off the null data entries. Restarting blazegraph: `ryankemper@wdqs2003:~$ sudo systemctl restart wdqs-blazegraph`	[production]