2021-10-21
16:03 <elukey@deploy1002> helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [production]
06:35 <elukey> `systemctl reload nginx` on cloudelastic100[5,6] to pick up the new TLS certificate and clear alerts - T293826 [production]
2021-10-20
14:44 <elukey@deploy1002> helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [production]
14:44 <elukey@deploy1002> helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [production]
08:01 <elukey@deploy1002> helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [production]
08:01 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [production]
06:28 <elukey> reboot analytics1066 - OS showing CPU soft lockups, tons of defunct processes (including node manager) and high CPU usage [production]
2021-10-19
10:29 <elukey@deploy1002> helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [production]
10:28 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [production]
2021-10-18
07:34 <elukey> depool + restart blazegraph on wdqs1013 [production]
2021-10-15
13:30 <elukey> start topic rebalancing for kafka main-eqiad (long maintenance, it will last a couple of days) [production]
2021-10-14
16:37 <elukey> drop kubeflow-kfserving* docker images from deneb [production]
14:19 <elukey@deploy1002> helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [production]
14:06 <elukey@deploy1002> helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [production]
14:05 <elukey@deploy1002> helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [production]
14:05 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [production]
13:56 <elukey@deploy1002> helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' . [production]
13:55 <elukey@deploy1002> helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [production]
13:54 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [production]
13:54 <elukey@deploy1002> helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [production]
13:54 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [production]
13:52 <elukey@deploy1002> helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [production]
13:52 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [production]
2021-10-13
14:50 <elukey> restart pybal on lvs1015 (low-traffic primary) to pick up new config for inference.discovery.wmnet - T289835 [production]
14:44 <elukey@puppetmaster1001> conftool action : ge; selector: cluster=ml_serve,service=inference [production]
14:36 <elukey> restart pybal on lvs1016 (low-traffic secondary) to pick up new config for inference.discovery.wmnet - T289835 [production]
08:21 <elukey> run kafka preferred-replica-election on kafka-main1001 to rebalance partition leaders - T288825 [production]
07:33 <elukey> increase kafka topic partition size of the top 4 high traffic topics of main-eqiad as described in https://phabricator.wikimedia.org/T288825#7422726 [production]
06:26 <elukey> `kafka topics --alter --topic {eqiad,codfw}.change-prop.transcludes.resource-change --partitions 3` on kafka-main2001 - T288825 [production]
2021-10-12
12:15 <elukey> `kafka topics --alter --topic codfw.mediawiki.job.cirrusSearchElasticaWrite --partitions 5` - T288825 [production]
12:15 <elukey> `kafka topics --alter --topic eqiad.mediawiki.job.cirrusSearchElasticaWrite --partitions 5` - T288825 [production]
12:10 <elukey> `kafka topics --alter --topic codfw.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite --partitions 5` - T288825 [production]
12:09 <elukey> `kafka topics --alter --topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite --partitions 5` - T288825 [production]
11:58 <elukey> `kafka topics --alter --topic codfw.resource-purge --partitions 5` on kafka-main2001 - T288825 [production]
11:49 <elukey> `kafka topics --alter --topic eqiad.resource-purge --partitions 5` on kafka-main2001 - T288825 [production]
07:40 <elukey> run kafka preferred-replica-election on kafka-main2001 to rebalance partition leaders after the last topic moves - T288825 [production]
2021-10-11
17:08 <elukey> force kafka preferred-replica-election on kafka-main2001 after another batch of topic partitions moves - T288825 [production]
13:42 <elukey> force kafka preferred-replica-election on kafka-main2001 after another batch of topic partitions moves - T288825 [production]
09:37 <elukey> force kafka preferred-replica-election on kafka-main2001 after another batch of topic partitions moves - T288825 [production]
09:09 <elukey> force kafka preferred-replica-election on kafka-main2001 after the first 50 topic partitions moves - T288825 [production]
07:57 <elukey> start kafka topics rebalancing for main-codfw (long running maintenance) - T288825 [production]
2021-10-08
15:48 <elukey@deploy1002> helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [production]
15:48 <elukey@deploy1002> helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [production]
2021-10-05
13:39 <elukey@cumin1001> END (FAIL) - Cookbook sre.aqs.roll-restart (exit_code=99) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. - elukey@cumin1001 [production]
13:39 <elukey@cumin1001> START - Cookbook sre.aqs.roll-restart for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. - elukey@cumin1001 [production]
12:43 <elukey> import AMD ROCm 4.2 to buster-wikimedia's thirdparty/amd-rocm42 - T287267 [production]
07:57 <elukey> upgrade GPU drivers (AMD ROCm 4.3.1) on an-worker1[096-101] [production]
07:26 <elukey@puppetmaster1001> conftool action : set/pooled=yes; selector: name=wdqs1004.wmnet [production]
06:38 <elukey> reboot an-worker1096 after installing new GPU drivers [production]
2021-10-04
14:19 <elukey> import AMD ROCm 4.3.1 packages in buster-wikimedia's thirdparty/amd-rocm431 - T287267 [production]