2022-12-06 §
14:32 <elukey@deploy1002> helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [production]
14:32 <elukey@deploy1002> helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [production]
14:31 <elukey@deploy1002> helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [production]
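The admin 'sync' lines above are START/DONE markers around a helmfile run against helmfile.d/admin, with the target cluster selected as the helmfile environment. A minimal sketch of the equivalent commands on the deploy host, assuming the deployment-charts checkout lives under /srv/deployment-charts (the path is an assumption; the environment names and the 'sync' action come from the log):
`cd /srv/deployment-charts/helmfile.d/admin`
`helmfile -e ml-staging-codfw sync  # bracketed by the 14:31/14:32 ml-staging-codfw START/DONE lines`
`helmfile -e ml-serve-codfw sync    # the 14:32 ml-serve-codfw START line`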
2022-12-02 §
07:49 <elukey@deploy1002> helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [production]
07:49 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [production]
07:49 <elukey@deploy1002> helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [production]
07:49 <elukey@deploy1002> helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [production]
07:49 <elukey@deploy1002> helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [production]
07:49 <elukey@deploy1002> helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [production]
07:43 <elukey@deploy1002> helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [production]
07:43 <elukey@deploy1002> helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [production]
07:41 <elukey@deploy1002> helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [production]
07:41 <elukey@deploy1002> helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [production]
2022-12-01 §
10:56 <elukey> deleted knative controller + net-istio controllers on ml-serve-eqiad to clear out some weird state (causing high latencies for the k8s api) [production]
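One common way to perform this kind of controller reset is to delete the controller pods and let their Deployments recreate them. A hedged sketch, assuming the upstream default namespace and pod labels for Knative Serving and net-istio (the namespace and label selectors are assumptions, not taken from the log):
`kubectl -n knative-serving delete pod -l app=controller            # knative-serving controller`
`kubectl -n knative-serving delete pod -l app=net-istio-controller  # net-istio controller`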
2022-11-30 §
15:54 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on ores2009.codfw.wmnet with reason: DCOps maintenance [production]
15:54 <elukey@cumin1001> START - Cookbook sre.hosts.downtime for 0:30:00 on ores2009.codfw.wmnet with reason: DCOps maintenance [production]
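The downtime pair above is the sre.hosts.downtime spicerack cookbook run from the cumin host; it schedules monitoring downtime for the given host and duration. A sketch of the invocation, where the option names are assumptions (the 30-minute duration, reason and target host come from the log):
`sudo cookbook sre.hosts.downtime --minutes 30 -r "DCOps maintenance" ores2009.codfw.wmnet`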
2022-11-29 §
10:26 <elukey> restart kube-apiserver on ml-serve-ctrl* to clear out some knative controller issue [production]
2022-11-25 §
11:24 <elukey> restart turnilo on an-tool1007 to pick up new settings for webrequest_sampled_live [production]
2022-11-23 §
09:19 <elukey> restart kube-apiserver on ml-staging-ctrl2001 in an attempt to mitigate weird LIST latencies [production]
09:14 <elukey> restart kube-apiserver on ml-serve-ctrl1001 in an attempt to mitigate weird LIST latencies [production]
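These are plain systemd unit restarts on the control-plane hosts, and can be driven with the same cumin pattern used elsewhere in this log (see the 2022-11-04 swift-proxy entries). A sketch, assuming the unit is simply called kube-apiserver on those hosts:
`sudo cumin 'ml-serve-ctrl1001*' 'systemctl restart kube-apiserver'`
`sudo cumin 'ml-staging-ctrl2001*' 'systemctl restart kube-apiserver'`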
2022-11-21 §
09:31 <elukey@deploy1002> helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [production]
09:31 <elukey@deploy1002> helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [production]
09:29 <elukey@deploy1002> helmfile [staging] DONE helmfile.d/services/eventgate-main: sync [production]
09:28 <elukey@deploy1002> helmfile [staging] START helmfile.d/services/eventgate-main: sync [production]
09:15 <elukey> restart ml-serve-codfw's kube-apiserver to clear out some knative LIST certificate workload (still not sure what it is, but it seems to be a bug related to our ancient version) [production]
2022-11-19 §
08:10 <elukey> re-created misbehaving knative pods on ml-serve-codfw (they were causing latency alerts) [production]
2022-11-18 §
09:16 <elukey> push the 'k8s_116' tag for docker-registry.discovery.wmnet/pause - T322920 [production]
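Publishing the new tag is a standard pull/tag/push sequence against the internal registry. A sketch where <existing-tag> is a placeholder for whichever pause tag was used as the source (only the image path and the k8s_116 tag come from the log entry and T322920):
`docker pull docker-registry.discovery.wmnet/pause:<existing-tag>`
`docker tag docker-registry.discovery.wmnet/pause:<existing-tag> docker-registry.discovery.wmnet/pause:k8s_116`
`docker push docker-registry.discovery.wmnet/pause:k8s_116`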
2022-11-17 §
07:47 <elukey> restart kube-apiserver on ml-serve-ctrl2002 - high LIST latencies for knative, attempting to clear them out [production]
2022-11-11 §
10:15 <elukey@cumin1001> END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES eqiad cluster: Roll restart of ORES's daemons. [production]
09:55 <elukey@cumin1001> START - Cookbook sre.ores.roll-restart-workers for ORES eqiad cluster: Roll restart of ORES's daemons. [production]
09:54 <elukey@cumin1001> END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES codfw cluster: Roll restart of ORES's daemons. [production]
09:35 <elukey@cumin1001> START - Cookbook sre.ores.roll-restart-workers for ORES codfw cluster: Roll restart of ORES's daemons. [production]
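Both roll restarts above come from the sre.ores.roll-restart-workers cookbook, one run per cluster, each taking roughly 20 minutes end to end. A sketch of the invocation, assuming the cluster name is passed as a positional argument (that detail is an assumption; the cookbook name and clusters come from the log):
`sudo cookbook sre.ores.roll-restart-workers codfw`
`sudo cookbook sre.ores.roll-restart-workers eqiad`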
2022-11-07 §
15:55 <elukey> upgrade istioctl to 1.15.3 on apt1001 for {buster,bullseye}-wikimedia - T322193 [production]
09:38 <elukey> restart rsyslog on ml-serve2001 [production]
07:37 <elukey> `elukey@aux-k8s-worker1002:~$ sudo systemctl reset-failed ifup@ens13.service` [production]
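The 15:55 istioctl entry above corresponds to importing rebuilt istioctl 1.15.3 packages into the internal apt repository on apt1001 for both distributions. A hedged sketch using reprepro (the component and the .deb filename are assumptions; the distribution names come from the log and T322193):
`sudo reprepro -C main includedeb buster-wikimedia istioctl_1.15.3-1_amd64.deb`
`sudo reprepro -C main includedeb bullseye-wikimedia istioctl_1.15.3-1_amd64.deb`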
2022-11-06 §
08:23 <elukey> restart rsyslog on centrallog2002 [production]
08:19 <elukey@deploy1002> helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [production]
08:19 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [production]
08:17 <elukey@deploy1002> helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [production]
08:17 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [production]
07:50 <elukey> restart kube-apiserver on ml-serve-ctrl1001 [production]
07:48 <elukey> restart kube-apiserver on ml-serve-ctrl1002 - high HTTP 409 rate registered over the past few days [production]
2022-11-05 §
09:39 <elukey> reinstall kubernetes-node on ml-staging200[12] to allow puppet to run (cleanup after yesterday's issue, the worker nodes had the master role applied) [production]
09:32 <elukey> restart kube-apiserver on ml-staging-ctrl2001 [production]
09:31 <elukey> restart kube-apiserver on ml-staging-ctrl2002 [production]
2022-11-04 §
15:00 <elukey> `elukey@cumin1001:~$ sudo cumin 'ms-fe2*' 'systemctl restart swift-proxy' -b 1 -s 20` [production]
14:48 <elukey> restart swift-proxy on ms-fe1011 [production]
11:27 <elukey> restart kube-apiserver on ml-serve-ctrl2002 - high latencies for LIST (knative resources) [production]
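The 15:00 swift-proxy entry shows the usual cumin rolling-restart shape: `-b 1` limits the batch to one host at a time and `-s 20` sleeps 20 seconds between batches, so the proxies restart one by one. The same pattern would apply to the eqiad frontends, e.g. (host selector shown only as an illustration):
`sudo cumin 'ms-fe1*' 'systemctl restart swift-proxy' -b 1 -s 20`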
2022-11-03 §
17:39 <elukey> `sudo truncate -s 20G /var/log/nginx/etcd_access.log.1` on conf100[7-9], root partition full [production]
09:26 <elukey@cumin1001> END (PASS) - Cookbook sre.ores.roll-restart-workers (exit_code=0) for ORES codfw cluster: Roll restart of ORES's daemons. [production]