production SAL


2020-01-15
01:32	<mutante>	lvs1015 powercycling, crashed, nothing on console, lots of unknowns in icinga	[production]
01:17	<mutante>	dbproxy1017 and dbproxy1021 were showing "haproxy failover" icinga alerts. did the check described on https://wikitech.wikimedia.org/wiki/HAProxy#Failover and it claimed on both that db1133 was DOWN, but checking db1133 itself showed it was up and working normally. in that case the docs say to 'systemctl reload haproxy'. done on both and things recovered	[production]
01:13	<mutante>	dbproxy1017 - systemctl reload haproxy	[production]
00:22	<bstorm_>	restarted maintain-dbusers on labstore1004 after recovering the m5 DB's connection issue	[production]
00:12	<bstorm_>	set max_connections to 600 temporarily while troubleshooting on m5 (db1133)	[production]
2020-01-14
20:11	<milimetric@deploy1001>	Finished deploy [analytics/aqs/deploy@1cf0530]: Increment service-runner to latest version (duration: 04m 48s)	[production]
20:07	<milimetric@deploy1001>	Started deploy [analytics/aqs/deploy@1cf0530]: Increment service-runner to latest version	[production]
19:22	<urbanecm@deploy1001>	Synchronized wmf-config/CommonSettings.php: SWAT: e400916: [wikitech] Restore contentadmin ability to manage abuse filters (duration: 01m 05s)	[production]
18:11	<vgutierrez>	repooling cp5012	[production]
18:06	<vgutierrez>	depool cp5012 for some ats parent select debugging	[production]
17:43	<vgutierrez>	repooling cp4027	[production]
17:39	<vgutierrez>	depooling cp4027 for some ats-tls parent balancing tests	[production]
17:21	<_joe_>	upload docker-report 0.0.2 to {buster,stretch}-wikimedia T242604	[production]
16:53	<liw@deploy1001>	rebuilt and synchronized wikiversions files: group0 to 1.35.0-wmf.15	[production]
16:46	<marostegui@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)	[production]
16:44	<liw>	branch is cut for 1.35.0-wmf.15; train window is closed, but I'll continue train since the next time slot seems to not have anything	[production]
16:44	<marostegui@cumin1001>	START - Cookbook sre.hosts.downtime	[production]
16:41	<marostegui>	Enable puppet back on install1002 and install2002 - T242481	[production]
16:31	<liw@deploy1001>	Finished scap: testwiki to php-1.35.0-wmf.15 and rebuild l10n cache (try 2) (duration: 43m 29s)	[production]
16:26	<marostegui>	Disable temporarily puppet on install1002 and install2002 - T242481	[production]
16:08	<volans@deploy1001>	Finished deploy [debmonitor/deploy@e72911c]: Release v0.2.4 (duration: 01m 09s)	[production]
16:07	<volans@deploy1001>	Started deploy [debmonitor/deploy@e72911c]: Release v0.2.4	[production]
15:47	<liw@deploy1001>	Started scap: testwiki to php-1.35.0-wmf.15 and rebuild l10n cache (try 2)	[production]
15:02	<marostegui@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)	[production]
15:02	<marostegui>	Copy data from db1080 to db1107 T242702	[production]
15:02	<marostegui@cumin1001>	dbctl commit (dc=all): 'Depool db1080 for tranfer', diff saved to https://phabricator.wikimedia.org/P10144 and previous config saved to /var/cache/conftool/dbconfig/20200114-150223-marostegui.json	[production]
15:00	<marostegui@cumin1001>	START - Cookbook sre.hosts.downtime	[production]
14:51	<liw@deploy1001>	scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="testwiki" --outdir="/tmp/scap_l10n_44869219" --threads=30 --lang en --quiet' returned non-zero exit status 1 (duration: 03m 55s)	[production]
14:47	<liw@deploy1001>	Started scap: testwiki to php-1.35.0-wmf.15 and rebuild l10n cache	[production]
14:43	<marostegui@cumin1001>	dbctl commit (dc=all): 'Slowly repool db1080', diff saved to https://phabricator.wikimedia.org/P10143 and previous config saved to /var/cache/conftool/dbconfig/20200114-144341-marostegui.json	[production]
14:26	<marostegui>	Move db1114 under db1080	[production]
14:24	<marostegui>	Stop db1080 and db1107 replication in sync	[production]
14:21	<XioNoX>	push firewall policies to pfw3-eqiad - T242681	[production]
14:15	<XioNoX>	push firewall policies to pfw3-codfw - T242681	[production]
14:12	<liw>	branch cut for 1.35.0-wmf.15	[production]
14:09	<vgutierrez>	upgrade ats to 8.0.5-1wm12 in cp5006 and cp5012 - T242620	[production]
14:03	<aborrero@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)	[production]
14:03	<aborrero@cumin1001>	START - Cookbook sre.hosts.downtime	[production]
13:54	<marostegui>	Upgrade db1080	[production]
13:52	<marostegui@cumin1001>	dbctl commit (dc=all): 'Depool db1080 for upgrade', diff saved to https://phabricator.wikimedia.org/P10142 and previous config saved to /var/cache/conftool/dbconfig/20200114-135238-marostegui.json	[production]
12:16	<vgutierrez@puppetmaster1001>	conftool action : set/weight=1; selector: service=nginx,name=ncredir3002.esams.wmnet	[production]
12:16	<vgutierrez@puppetmaster1001>	conftool action : set/weight=1; selector: service=nginx,name=ncredir3001.esams.wmnet	[production]
12:14	<vgutierrez@puppetmaster1001>	conftool action : set/weight=1; selector: service=nginx,name=ncredir4001.ulsfo.wmnet	[production]
12:14	<vgutierrez@puppetmaster1001>	conftool action : set/weight=1; selector: service=nginx,name=ncredir4002.ulsfo.wmnet	[production]
12:02	<aborrero@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)	[production]
12:02	<aborrero@cumin1001>	START - Cookbook sre.hosts.downtime	[production]
12:02	<aborrero@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)	[production]
12:01	<aborrero@cumin1001>	START - Cookbook sre.hosts.downtime	[production]
11:51	<vgutierrez>	restarting pybal on lvs4005 (high-traffic1 LVS) - T242321	[production]
11:49	<vgutierrez>	restarting pybal on lvs4007 (secondary LVS) - T242321	[production]