2024-08-12
15:06 <isaranto@deploy1003> helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [production]
14:46 <jgiannelos@deploy1003> helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [production]
14:45 <jgiannelos@deploy1003> helmfile [eqiad] START helmfile.d/services/mobileapps: apply [production]
14:44 <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: security update - bking@cumin2002 - T371874 [production]
14:42 <elukey> powercycle ms-be1078 - causing frontend errors in swift-eqiad, network link is down (if down/up didn't work, nothing in the dmesg/syslog) [production]
14:42 <jgiannelos@deploy1003> helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [production]
14:41 <jgiannelos@deploy1003> helmfile [codfw] START helmfile.d/services/mobileapps: apply [production]
14:38 <jgiannelos@deploy1003> helmfile [eqiad] START helmfile.d/services/mobileapps: apply [production]
14:38 <jgiannelos@deploy1003> helmfile [eqiad] START helmfile.d/services/mobileapps: apply [production]
14:34 <jgiannelos@deploy1003> helmfile [eqiad] START helmfile.d/services/mobileapps: apply [production]
14:23 <zabe@deploy1003> Finished scap: Backport for [[gerrit:1061152|Further configuration for bdrwiki (T371760)]] (duration: 21m 07s) [production]
14:01 <zabe@deploy1003> Started scap sync-world: Backport for [[gerrit:1061152|Further configuration for bdrwiki (T371760)]] [production]
13:46 <hnowlan@deploy1003> helmfile [eqiad] DONE helmfile.d/services/shellbox-video: apply [production]
13:46 <hnowlan@deploy1003> helmfile [eqiad] START helmfile.d/services/shellbox-video: apply [production]
13:33 <klausman@deploy1003> helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [production]
13:33 <klausman@deploy1003> helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [production]
13:25 <jgiannelos@deploy1003> helmfile [staging] DONE helmfile.d/services/mobileapps: apply [production]
13:24 <jgiannelos@deploy1003> helmfile [staging] START helmfile.d/services/mobileapps: apply [production]
13:24 <jgiannelos@deploy1003> helmfile [staging] START helmfile.d/services/mobileapps: apply [production]
12:37 <elukey> restart exim4 on list2001 to pick up the new TLS material [production]
12:35 <elukey> restart exim4 on list1004 to pick up the new TLS material [production]
12:32 <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance [production]
12:32 <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance [production]
12:11 <elukey@cumin1002> START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Openjdk upgrade - elukey@cumin1002 [production]
12:04 <kevinbazira@deploy1003> helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . [production]
12:03 <kevinbazira@deploy1003> helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [production]
11:59 <kevinbazira@deploy1003> helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [production]
11:26 <hnowlan> rebuilding php7.4-fpm and php7.4-fpm-multiversion-base to pick up healthz worker awareness change (r/1060867) [production]
11:22 <ladsgroup@cumin1002> conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s1 [production]
11:10 <kevinbazira@deploy1003> helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [production]
11:06 <isaranto@deploy1003> helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [production]
11:04 <isaranto@deploy1003> helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . [production]
11:03 <isaranto@deploy1003> helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [production]
10:19 <vgutierrez> restarting apache on puppetmaster1003 [production]
09:54 <kamila_> rebooting puppetmaster1001 due to intermittent network failures [production]
09:46 <ayounsi@cumin1002> END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 54994 [production]
09:43 <ayounsi@cumin1002> START - Cookbook sre.network.peering with action 'email' for AS: 54994 [production]
09:17 <urbanecm@deploy1003> Finished scap: Backport for [[gerrit:1061148|MenteeOverviewApi: Do not apply undefined/null params (T372164)]] (duration: 19m 54s) [production]
09:11 <urbanecm@deploy1003> urbanecm: Continuing with sync [production]
09:11 <godog> bounce grafana after https://gerrit.wikimedia.org/r/c/operations/puppet/+/1061955 [production]
09:10 <urbanecm@deploy1003> urbanecm: Backport for [[gerrit:1061148|MenteeOverviewApi: Do not apply undefined/null params (T372164)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [production]
08:57 <urbanecm@deploy1003> Started scap sync-world: Backport for [[gerrit:1061148|MenteeOverviewApi: Do not apply undefined/null params (T372164)]] [production]
07:39 <arnaudb@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: index corruption [production]
07:39 <arnaudb@cumin1002> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: index corruption [production]
07:38 <arnaudb@cumin1002> dbctl commit (dc=all): 'db2189 - s2', diff saved to https://phabricator.wikimedia.org/P67270 and previous config saved to /var/cache/conftool/dbconfig/20240812-073846-arnaudb.json [production]
2024-08-11
07:58 <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance [production]
07:58 <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance [production]
07:58 <marostegui@cumin1002> dbctl commit (dc=all): 'Repooling after maintenance db1235 (T367856)', diff saved to https://phabricator.wikimedia.org/P67269 and previous config saved to /var/cache/conftool/dbconfig/20240811-075839-marostegui.json [production]
07:43 <marostegui@cumin1002> dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P67268 and previous config saved to /var/cache/conftool/dbconfig/20240811-074332-marostegui.json [production]
07:28 <marostegui@cumin1002> dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P67267 and previous config saved to /var/cache/conftool/dbconfig/20240811-072825-marostegui.json [production]