2022-12-06 §
14:32 <elukey@deploy1002> helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [production]
14:32 <elukey@deploy1002> helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [production]
14:31 <elukey@deploy1002> helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [production]
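The admin 'sync' lines above are START/DONE markers around a helmfile run against helmfile.d/admin, with the target cluster selected as the helmfile environment. A minimal sketch of the equivalent commands on the deploy host, assuming the deployment-charts checkout lives under /srv/deployment-charts (the path is an assumption; the environment names and the 'sync' action come from the log):
`cd /srv/deployment-charts/helmfile.d/admin`
`helmfile -e ml-staging-codfw sync  # bracketed by the 14:31/14:32 ml-staging-codfw START/DONE lines`
`helmfile -e ml-serve-codfw sync    # the 14:32 ml-serve-codfw START line`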
2022-12-02 §
07:49 <elukey@deploy1002> helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [production]
07:49 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [production]
07:49 <elukey@deploy1002> helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [production]
07:49 <elukey@deploy1002> helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [production]
07:49 <elukey@deploy1002> helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [production]
07:49 <elukey@deploy1002> helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [production]
07:43 <elukey@deploy1002> helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [production]
07:43 <elukey@deploy1002> helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [production]
07:41 <elukey@deploy1002> helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [production]
07:41 <elukey@deploy1002> helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [production]
2022-12-01 §
10:56 <elukey> deleted knative controller + net-istio controllers on ml-serve-eqiad to clear out some weird state (causing high latencies for the k8s api) [production]
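One common way to perform this kind of controller reset is to delete the controller pods and let their Deployments recreate them. A hedged sketch, assuming the upstream default namespace and pod labels for Knative Serving and net-istio (the namespace and label selectors are assumptions, not taken from the log):
`kubectl -n knative-serving delete pod -l app=controller            # knative-serving controller`
`kubectl -n knative-serving delete pod -l app=net-istio-controller  # net-istio controller`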
2022-11-30 §
15:54 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on ores2009.codfw.wmnet with reason: DCOps maintenance [production]
15:54 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 0:30:00 on ores2009.codfw.wmnet with reason: DCOps maintenance [production]
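The downtime pair above is the sre.hosts.downtime spicerack cookbook run from the cumin host; it schedules monitoring downtime for the given host and duration. A sketch of the invocation, where the option names are assumptions (the 30-minute duration, reason and target host come from the log):
`sudo cookbook sre.hosts.downtime --minutes 30 -r "DCOps maintenance" ores2009.codfw.wmnet`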
2022-11-29 §
10:26 <elukey> restart kube-apiserver on ml-serve-ctrl* to clear out some knative controller issue [production]
2022-11-25 §
11:24 <elukey> restart turnilo on an-tool1007 to pick up new settings for webrequest_sampled_live [production]
2022-11-23 §
09:19 <elukey> restart kube-apiserver on ml-staging-ctrl2001 in an attempt to mitigate weird LIST latencies [production]
09:14 <elukey> restart kube-apiserver on ml-serve-ctrl1001 in an attempt to mitigate weird LIST latencies [production]
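These are plain systemd unit restarts on the control-plane hosts, and can be driven with the same cumin pattern used elsewhere in this log (see the 2022-11-04 swift-proxy entries). A sketch, assuming the unit is simply called kube-apiserver on those hosts:
`sudo cumin 'ml-serve-ctrl1001*' 'systemctl restart kube-apiserver'`
`sudo cumin 'ml-staging-ctrl2001*' 'systemctl restart kube-apiserver'`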
2022-11-21 §
09:31 <elukey@deploy1002> helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [production]
09:31 <elukey@deploy1002> helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [production]
09:29 <elukey@deploy1002> helmfile [staging] DONE helmfile.d/services/eventgate-main: sync [production]
09:28 <elukey@deploy1002> helmfile [staging] START helmfile.d/services/eventgate-main: sync [production]
09:15 <elukey> restart ml-serve-codfw's kube-apiserver to clear out some knative LIST certificate workload (still not sure what it is, but it seems to be a bug related to our ancient version) [production]
2022-11-19 §
08:10 <elukey> re-created misbehaving knative pods on ml-serve-codfw (they were causing latency alerts) [production]
2022-11-18 §
09:16 <elukey> push the 'k8s_116' tag for docker-registry.discovery.wmnet/pause - T322920 [production]
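Publishing the new tag is a standard pull/tag/push sequence against the internal registry. A sketch where <existing-tag> is a placeholder for whichever pause tag was used as the source (only the image path and the k8s_116 tag come from the log entry and T322920):
`docker pull docker-registry.discovery.wmnet/pause:<existing-tag>`
`docker tag docker-registry.discovery.wmnet/pause:<existing-tag> docker-registry.discovery.wmnet/pause:k8s_116`
`docker push docker-registry.discovery.wmnet/pause:k8s_116`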
2022-11-17 §
07:47 <elukey> restart kube-apiserver on ml-serve-ctrl2002 - high LIST latencies for knative, attempting to clear them out [production]
2022-11-11 §
10:15 <elukey@cumin1001> END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES eqiad cluster: Roll restart of ORES's daemons. [production]
09:55 <elukey@cumin1001> START - Cookbook sre.ores.roll-restart-workers for ORES eqiad cluster: Roll restart of ORES's daemons. [production]
09:54 <elukey@cumin1001> END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES codfw cluster: Roll restart of ORES's daemons. [production]
09:35 <elukey@cumin1001> START - Cookbook sre.ores.roll-restart-workers for ORES codfw cluster: Roll restart of ORES's daemons. [production]
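Both roll restarts above come from the sre.ores.roll-restart-workers cookbook, one run per cluster, each taking roughly 20 minutes end to end. A sketch of the invocation, assuming the cluster name is passed as a positional argument (that detail is an assumption; the cookbook name and clusters come from the log):
`sudo cookbook sre.ores.roll-restart-workers codfw`
`sudo cookbook sre.ores.roll-restart-workers eqiad`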
2022-11-07 §
15:55 <elukey> upgrade istioctl to 1.15.3 on apt1001 for {buster,bullseye}-wikimedia - T322193 [production]
09:38 <elukey> restart rsyslog on ml-serve2001 [production]
07:37 <elukey> `elukey@aux-k8s-worker1002:~$ sudo systemctl reset-failed ifup@ens13.service` [production]
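The 15:55 istioctl entry above corresponds to importing rebuilt istioctl 1.15.3 packages into the internal apt repository on apt1001 for both distributions. A hedged sketch using reprepro (the component and the .deb filename are assumptions; the distribution names come from the log and T322193):
`sudo reprepro -C main includedeb buster-wikimedia istioctl_1.15.3-1_amd64.deb`
`sudo reprepro -C main includedeb bullseye-wikimedia istioctl_1.15.3-1_amd64.deb`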
2022-11-06 §
08:23 <elukey> restart rsyslog on centrallog2002 [production]
08:19 <elukey@deploy1002> helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [production]
08:19 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [production]
08:17 <elukey@deploy1002> helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [production]
08:17 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [production]
07:50 <elukey> restart kube-apiserver on ml-serve-ctrl1001 [production]
07:48 <elukey> restart kube-apiserver on ml-serve-ctrl1002 - high HTTP 409 rate registered over the past few days [production]
2022-11-05 §
09:39 <elukey> reinstall kubernetes-node on ml-staging200[12] to allow puppet to run (cleanup after yesterday's issue, the worker nodes had the master role applied) [production]
09:32 <elukey> restart kube-apiserver on ml-staging-ctrl2001 [production]
09:31 <elukey> restart kube-apiserver on ml-staging-ctrl2002 [production]
2022-11-04 §
15:00 <elukey> `elukey@cumin1001:~$ sudo cumin 'ms-fe2*' 'systemctl restart swift-proxy' -b 1 -s 20` [production]
14:48 <elukey> restart swift-proxy on ms-fe1011 [production]
11:27 <elukey> restart kube-apiserver on ml-serve-ctrl2002 - high latencies for LIST (knative resources) [production]
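The 15:00 swift-proxy entry shows the usual cumin rolling-restart shape: `-b 1` limits the batch to one host at a time and `-s 20` sleeps 20 seconds between batches, so the proxies restart one by one. The same pattern would apply to the eqiad frontends, e.g. (host selector shown only as an illustration):
`sudo cumin 'ms-fe1*' 'systemctl restart swift-proxy' -b 1 -s 20`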
2022-11-03 §
17:39 <elukey> `sudo truncate -s 20G /var/log/nginx/etcd_access.log.1` on conf100[7-9], root partition full [production]
09:26 <elukey@cumin1001> END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES codfw cluster: Roll restart of ORES's daemons. [production]