production SAL

2020-10-29
09:52	<elukey>	add gdnsd.service to all gdnsd hosts (with LimitNOFILE=infinity as override) - no daemon restart done - T266746	[production]
09:41	<marostegui>	Deploy schema change on s8 wikidata codfw master (db2079) T264109	[production]
09:33	<elukey>	clean up 10.64.21.7/24 and 2620:0:861:105:10:64:21:7/64 from netbox (an-test-ui1001 already has IPs previously allocated by makevm)	[production]
09:32	<elukey@cumin1001>	END (ERROR) - Cookbook sre.ganeti.makevm (exit_code=97)	[production]
09:23	<elukey@cumin1001>	START - Cookbook sre.ganeti.makevm	[production]
08:54	<vgutierrez>	turn off ECDHE-ECDSA-AES128-SHA support on the main caching cluster - T258405	[production]
08:54	<moritzm>	fixing up stray jenkins auto restart timers on secondary releases server	[production]
08:53	<vgutierrez>	A:cp (except cp3052, running varnish 5) upgrade libvmod-netmapper to 1.9-1 T266567 T264398	[production]
08:48	<moritzm>	fixing up stray mcelog auto restart timers on kubestage*	[production]
08:38	<moritzm>	fixing up stray cas auto restart timers on secondary IDP servers	[production]
08:19	<moritzm>	fixing up stray pmacctd auto restart timers on netflow*	[production]
08:19	<moritzm>	fixing up stray pcacctd auto restart timers on netflow*	[production]
08:02	<marostegui>	Disconnect replication codfw -> eqiad on s1 T266663	[production]
07:56	<vgutierrez>	set LimitNOFILE=500000 for gdnsd on authdns1001	[production]
07:54	<marostegui>	Disconnect replication codfw -> eqiad on s4 T266663	[production]
07:50	<vgutierrez>	restart haproxy on authdns2001	[production]
07:49	<marostegui>	Disconnect replication codfw -> eqiad on s8 T266663	[production]
07:48	<godog>	swift codfw-prod: bump object weight for ms-be2057 - T261633	[production]
07:46	<marostegui>	Disconnect replication codfw -> eqiad on s3 T266663	[production]
07:43	<vgutierrez>	restart anycast-healthchecker on authdns2001	[production]
07:34	<vgutierrez>	set LimitNOFILE=500000 for gdnsd on authdns2001	[production]
07:27	<elukey>	"sudo truncate -s 10g /var/log/daemon.log" on authdns2001	[production]
06:52	<marostegui>	Disconnect replication codfw -> eqiad on s2 T266663	[production]
06:38	<marostegui>	Disconnect replication codfw -> eqiad on s7 T266663	[production]
06:36	<marostegui>	Disconnect replication codfw -> eqiad on s6 T266663	[production]
06:25	<elukey>	execute 'truncate -s 10g /var/log/syslog.1' on authdns2001 - root partition full	[production]
06:23	<marostegui>	Disconnect replication codfw -> eqiad on s5 T266663	[production]
06:10	<marostegui>	Disconnect replication codfw -> eqiad on es4 and es5 T266663	[production]
06:07	<marostegui>	Disconnect replication codfw -> eqiad on x1 T266663	[production]
05:58	<marostegui>	Disconnect replication codfw -> eqiad on pc1, pc2 and pc3 T266663	[production]
04:06	<ryankemper@cumin1001>	END (PASS) - Cookbook sre.elasticsearch.rolling-restart (exit_code=0)	[production]
01:41	<mutante>	scandium reimaged a second time after making puppet changes to ensure nodejs/npm is NOT installed anymore (T257906)	[production]
01:17	<ryankemper>	T266492 Beginning rolling restart of eqiad cirrus cluster, 3 nodes at a time, on `ryankemper@cumin1001` tmux session `elasticsearch_restart_eqiad`	[production]
01:16	<ryankemper@cumin1001>	START - Cookbook sre.elasticsearch.rolling-restart	[production]
00:51	<ryankemper>	Finished restart of wdqs categories across production hosts; wdqs deploy is complete and the service is healthy	[production]
00:14	<Amir1>	rolling restart of ores	[production]
00:12	<dzahn@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)	[production]
00:10	<dzahn@cumin1001>	START - Cookbook sre.hosts.downtime	[production]
00:04	<ryankemper>	Beginning restart of wdqs categories across production hosts, one at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 60 && systemctl restart wdqs-categories && sleep 30 && pool'`	[production]
00:03	<ryankemper>	Restarted wdqs categories across test hosts: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'`	[production]
00:03	<ryankemper>	Restarted wdqs updater across all hosts: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'`	[production]
00:02	<ryankemper>	Following wdqs deploy, https://query.wikidata.org successfully responds to an example query	[production]
00:01	<ryankemper@deploy1001>	Finished deploy [wdqs/wdqs@8c97b17]: 0.3.53 (duration: 09m 29s)	[production]
2020-10-28
23:54	<ryankemper>	Canary `wdqs1003` tests pass, proceeding with wdqs deploy to rest of fleet	[production]
23:52	<ryankemper@deploy1001>	Started deploy [wdqs/wdqs@8c97b17]: 0.3.53	[production]
23:52	<ryankemper@deploy1001>	deploy aborted: 0.3.53 (duration: 00m 00s)	[production]
23:52	<ryankemper@deploy1001>	Started deploy [wdqs/wdqs@8c97b17]: 0.3.53	[production]
22:54	<mutante>	scandium - scap pull after reinstalling OS	[production]
22:14	<dzahn@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)	[production]
22:12	<dzahn@cumin1001>	START - Cookbook sre.hosts.downtime	[production]
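
Entries in this log follow a regular shape: a HH:MM timestamp, the author in angle brackets (optionally with @host), the free-text message, and a bracketed project tag, with many messages citing Phabricator task IDs such as T266663. A minimal sketch of splitting one entry into those fields is below. It assumes the fields are tab-separated as on this page; `Entry`, `parse_entry`, and both regular expressions are hypothetical names made up for illustration and are not part of any SAL tooling.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed layout: "HH:MM<TAB><author><TAB>message<TAB>[project]"
ENTRY_RE = re.compile(r"^(\d{2}:\d{2})\t<([^>]+)>\t(.*)\t\[(\w+)\]$")
# Phabricator task references look like "T" followed by digits.
TASK_RE = re.compile(r"\bT\d+\b")


class Entry(NamedTuple):
    time: str
    author: str
    message: str
    project: str
    tasks: List[str]


def parse_entry(line: str) -> Optional[Entry]:
    """Return the parsed entry, or None for date headers and other lines."""
    match = ENTRY_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    time, author, message, project = match.groups()
    return Entry(time, author, message, project, TASK_RE.findall(message))


if __name__ == "__main__":
    sample = ("08:02\t<marostegui>\tDisconnect replication codfw -> eqiad "
              "on s1 T266663\t[production]")
    print(parse_entry(sample))
```

Lines that do not match, such as the date headers, return None, so a caller can track the current date separately and attach it to the entries that follow.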