production SAL

2020-09-08
19:12	<jhuneidi@deploy1001>	Finished scap: testwikis wikis to 1.36.0-wmf.8 (duration: 71m 45s)	[production]
18:22	<elukey>	rm /srv/prometheus/ops/targets/mjolnir_msearch_eqiad.yaml on prometheus100[3,4] as cleanup after https://gerrit.wikimedia.org/r/621988 - T260305	[production]
18:00	<jhuneidi@deploy1001>	Started scap: testwikis wikis to 1.36.0-wmf.8	[production]
17:58	<ryankemper@cumin1001>	START - Cookbook sre.wdqs.data-reload	[production]
17:57	<ryankemper@cumin1001>	END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)	[production]
17:54	<Amir1>	Deployed patch for T262240	[production]
17:53	<ryankemper@cumin1001>	START - Cookbook sre.wdqs.data-reload	[production]
17:23	<andrewbogott>	rebooting cloudvirt1033	[production]
17:03	<klausman>	attempted to add rock-dkms_3.3-19_all.deb to thirdparty/amd-rocm33 for use on analytics servers with GPUs	[production]
16:35	<otto@deploy1001>	Synchronized wmf-config/InitialiseSettings.php: wgEventStreams: Set canary_events_enabled: true for eventgate test streams and eventlogging_Test - T251609 (duration: 00m 58s)	[production]
16:34	<herron>	increased elk5 logstash JVM heaps to 2g (to help decrease kafka-logging consumer lag)	[production]
16:12	<longma>	1.36.0-wmf.8 was branched at e81e81e91473cc8259c473165863aca8ecea2784 for T257976	[production]
16:03	<akosiaris@deploy1001>	helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .	[production]
16:03	<akosiaris@deploy1001>	helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .	[production]
16:02	<akosiaris@deploy1001>	helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .	[production]
15:34	<jayme@cumin1001>	conftool action : set/pooled=yes; selector: name=kubernetes1004.*	[production]
15:32	<jayme@cumin1001>	conftool action : set/pooled=yes; selector: service=kubesvc,name=kubernetes1013.*	[production]
15:30	<elukey>	roll restart of hadoop master daemons on an-master100[1,2] after the cookbook failed	[production]
15:26	<elukey@cumin1001>	END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99)	[production]
15:20	<_joe_>	restarted celery-ores-worker.service on ores1007	[production]
15:19	<_joe_>	restarted ferm on wdqs1011	[production]
15:18	<elukey@cumin1001>	START - Cookbook sre.hadoop.roll-restart-masters	[production]
15:16	<_joe_>	starting wdqs-updater on wdqs1005	[production]
15:15	<bblack@cumin1001>	conftool action : set/pooled=yes; selector: name=cp1090.eqiad.wmnet	[production]
15:14	<bblack@cumin1001>	conftool action : set/pooled=yes; selector: name=cp108[789].eqiad.wmnet	[production]
15:14	<bblack>	repool cp1087-90 (eqiad row D)	[production]
15:13	<herron>	rolling restart of elk5 logstashes	[production]
15:10	<marostegui>	Start mysql on db1106 after PDU maintenance is done	[production]
15:03	<jayme@cumin1001>	conftool action : set/pooled=inactive; selector: service=kubesvc,name=kubernetes1013.*	[production]
15:03	<jayme@cumin1001>	conftool action : set/pooled=inactive; selector: name=kubernetes1004.*	[production]
15:03	<XioNoX>	request virtual-chassis vc-port set pic-slot 1 member 4 port 0	[production]
15:03	<XioNoX>	request virtual-chassis vc-port set pic-slot 0 member 2 port 50	[production]
15:02	<XioNoX>	request virtual-chassis vc-port set pic-slot 1 member 1 port 1	[production]
14:53	<marostegui>	Reload dbproxy1016 to recover the alert	[production]
14:45	<jynus>	restarting bacula-dir @ backup1001	[production]
14:44	<XioNoX>	reboot asw2-d3-eqiad	[production]
14:33	<moritzm>	bouncing ferm on hosts where ferm.service failed due to DNS resolution issues for prometheus hosts	[production]
14:31	<volans>	restarted ssh on mc1033 from console	[production]
14:16	<XioNoX>	request virtual-chassis vc-port delete pic-slot 1 member 4 port 0	[production]
14:16	<XioNoX>	request virtual-chassis vc-port delete pic-slot 0 member 2 port 50	[production]
14:14	<XioNoX>	request virtual-chassis vc-port delete pic-slot 1 member 1 port 1	[production]
14:13	<akosiaris>	drain kubernetes1013, kubernetes1004. They are on row D	[production]
14:13	<bblack>	dns1002 - disable puppet + bird service (stop advertising recdns from row D)	[production]
14:03	<kormat@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)	[production]
14:03	<kormat@cumin1001>	START - Cookbook sre.hosts.downtime	[production]
13:59	<bblack@cumin1001>	conftool action : set/pooled=no; selector: name=cp1090.eqiad.wmnet	[production]
13:59	<bblack>	depooling cp1087-1090	[production]
13:59	<bblack@cumin1001>	conftool action : set/pooled=no; selector: name=cp108[789].eqiad.wmnet	[production]
13:57	<XioNoX>	asw2-d-eqiad> request system reboot member 3	[production]
13:35	<cmjohnson1>	the power cable was not properly seated and lost power to asw2-d3-eqiad	[production]