2021-11-25
07:47 <jayme> elevated MediaWiki exceptions and fatals (from ~07:35) due to a mistake during re-deploy of eventgate-main [production]
07:45 <jelto@deploy1002> helmfile [eqiad] Ran 'sync' command on namespace 'echostore' for release 'production' . [production]
07:35 <jelto@deploy1002> helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [production]
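The helmfile lines above come from deployments run on the deploy host. A minimal sketch of one such sync, assuming the standard deployment-charts layout (the path and selector here are illustrative assumptions, not recorded in this log):

    cd /srv/deployment-charts/helmfile.d/services/eventgate-main
    helmfile -e eqiad --selector name=production sync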
07:32 <jelto@deploy1002> helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [production]
07:32 <jelto@deploy1002> helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [production]
07:29 <elukey_> elukey@mwdebug2002:~$ sudo systemctl reset-failed ifup@ens5.service [production]
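reset-failed clears a unit's "failed" state so it stops tripping monitoring; it does not start or repair the unit. A generic check-then-clear sequence (standard systemd usage, not taken from this log):

    sudo systemctl list-units --state=failed        # list units currently in the failed state
    sudo systemctl reset-failed ifup@ens5.service   # clear the failed state for one unit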
07:27 <marostegui@cumin1001> START - Cookbook sre.hosts.reimage for host db1128.eqiad.wmnet with OS bullseye [production]
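The reimage START/END pairs throughout this log are cookbook runs; reconstructed from the log text, the invocation would look roughly like this (the flag name is an assumption, not the verbatim command):

    sudo cookbook sre.hosts.reimage --os bullseye db1128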
07:23 <ladsgroup@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1145.eqiad.wmnet with reason: Maintenance T296143 [production]
07:23 <ladsgroup@cumin1001> START - Cookbook sre.hosts.downtime for 4:00:00 on db1145.eqiad.wmnet with reason: Maintenance T296143 [production]
07:20 <jelto@cumin1001> conftool action : set/pooled=false; selector: name=eqiad,dnsdisc=(apertium|api-gateway|apple-search|blubberoid|citoid|cxserver|echostore|eventgate-analytics|eventgate-analytics-external|eventgate-logging-external|eventstreams|eventstreams-internal|linkrecommendation|mathoid|mobileapps|proton|push-notifications|recommendation-api|sessionstore|shellbox|shellbox-constraints|shellbox-media|shellbox-syntax [production]
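conftool actions like the depool above are issued with confctl; depooling and later repooling one service's discovery record looks roughly like this (a sketch based on typical conftool usage; echostore is just one example taken from the selector):

    confctl --object-type discovery select 'dnsdisc=echostore,name=eqiad' set/pooled=false   # depool from eqiad
    confctl --object-type discovery select 'dnsdisc=echostore,name=eqiad' set/pooled=true    # repool when done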
07:17 <jelto@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 32 hosts with reason: helm3 de-deploy T251305 [production]
07:17 <jelto@cumin1001> START - Cookbook sre.hosts.downtime for 3:00:00 on 32 hosts with reason: helm3 de-deploy T251305 [production]
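Reconstructed from the log line, the downtime cookbook invocation would be along these lines (the flags and the Cumin host alias are assumptions for illustration, not the verbatim command):

    sudo cookbook sre.hosts.downtime --hours 3 -r 'helm3 de-deploy T251305' 'A:eqiad-kubernetes'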
07:10 <jelto> downtime PyBal backends health check on lvs1015 and lvs1016 for helm3 de-deploy T251305. I'm keeping an eye on Icinga and will remove the downtime as soon as I'm finished [production]
07:09 <jelto> start re-deploy procedure in eqiad Kubernetes T251305 [production]
06:31 <marostegui> Restart tendril's DB [production]
05:51 <ryankemper> [WDQS Deploy] Deploy complete. Successful test query placed on query.wikidata.org, there are no relevant criticals in Icinga, and Grafana looks good [production]
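A smoke-test query like the one mentioned can be placed against the public endpoint with curl (illustrative; the actual test query used is not recorded here):

    curl -sG https://query.wikidata.org/sparql \
        -H 'Accept: application/sparql-results+json' \
        --data-urlencode 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 1'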
04:45 <ryankemper@deploy1002> Finished deploy [wdqs/wdqs@29c5cd7] (wcqs): Deploy 0.3.93 to WCQS (duration: 05m 27s) [production]
04:43 <ryankemper> [WCQS Deploy] Tests look good following deploy of `0.3.93` to canary `wcqs1002.eqiad.wmnet`, proceeding to rest of fleet [production]
04:40 <ryankemper@deploy1002> Started deploy [wdqs/wdqs@29c5cd7] (wcqs): Deploy 0.3.93 to WCQS [production]
04:39 <ryankemper> [WDQS Deploy] Restarting `wdqs-categories` across lvs-managed hosts, one node at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 45 && systemctl restart wdqs-categories && sleep 45 && pool'` [production]
04:38 <ryankemper> [WDQS Deploy] Restarted `wdqs-categories` across all test hosts simultaneously: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` [production]
04:38 <ryankemper> [WDQS Deploy] Restarted `wdqs-updater` across all hosts, 4 hosts at a time: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [production]
04:35 <ryankemper@deploy1002> Finished deploy [wdqs/wdqs@29c5cd7]: 0.3.93 (duration: 09m 23s) [production]
04:30 <ryankemper> [Elastic] Cleaning up dangling apt packages: `ryankemper@cumin1001:~$ sudo cumin -b 4 'elastic*' 'sudo apt autoremove -y'` [production]
04:27 <ryankemper> [WDQS Deploy] Tests passing following deploy of `0.3.93` on canary `wdqs1003`; proceeding to rest of fleet [production]
04:25 <ryankemper@deploy1002> Started deploy [wdqs/wdqs@29c5cd7]: 0.3.93 [production]
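The Started/Finished deploy lines are emitted by scap; from the deploy host the step is roughly this (a sketch assuming the standard scap workflow for this repo):

    cd /srv/deployment/wdqs/wdqs
    scap deploy '0.3.93'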
04:25 <ryankemper> [WDQS Deploy] Gearing up for deploy of wdqs `0.3.93`. Pre-deploy tests passing on canary `wdqs1003` [production]
03:12 <pt1979@cumin2002> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2072.codfw.wmnet with OS buster [production]
02:42 <pt1979@cumin2002> START - Cookbook sre.hosts.reimage for host elastic2072.codfw.wmnet with OS buster [production]
02:34 <pt1979@cumin2002> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2071.codfw.wmnet with OS buster [production]
02:23 <pt1979@cumin2002> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2070.codfw.wmnet with OS buster [production]
02:04 <pt1979@cumin2002> START - Cookbook sre.hosts.reimage for host elastic2071.codfw.wmnet with OS buster [production]
01:54 <pt1979@cumin2002> START - Cookbook sre.hosts.reimage for host elastic2070.codfw.wmnet with OS buster [production]
01:49 <pt1979@cumin2002> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2068.codfw.wmnet with OS buster [production]
01:34 <pt1979@cumin2002> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2067.codfw.wmnet with OS buster [production]
01:19 <pt1979@cumin2002> START - Cookbook sre.hosts.reimage for host elastic2068.codfw.wmnet with OS buster [production]
01:04 <pt1979@cumin2002> START - Cookbook sre.hosts.reimage for host elastic2067.codfw.wmnet with OS buster [production]
00:37 <pt1979@cumin2002> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2066.codfw.wmnet with OS buster [production]
2021-11-24
23:59 <pt1979@cumin2002> START - Cookbook sre.hosts.reimage for host elastic2066.codfw.wmnet with OS buster [production]
23:52 <pt1979@cumin2002> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2065.codfw.wmnet with OS buster [production]
23:44 <mutante> [puppetmaster1001:~] $ sudo puppet cert sign gitlab-runner1001.eqiad.wmnet | sudo install_console gitlab-runner1001.eqiad.wmnet (T295481) [production]
23:26 <mutante> ganeti - bringing up new VM - sudo gnt-instance start gitlab-runner1001.eqiad.wmnet ; ran puppet on install1003; installing OS T295481 [production]
23:22 <pt1979@cumin2002> START - Cookbook sre.hosts.reimage for host elastic2065.codfw.wmnet with OS buster [production]
23:11 <pt1979@cumin2002> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2064.codfw.wmnet with OS buster [production]
23:09 <mutante> mwmaint1002 - sudo /usr/bin/find /var/lib/puppet/clientbucket/ -type f -size 1M -delete - to fix Icinga alert about large files in client bucket [production]
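Note that find's -size 1M matches files whose size rounds up to exactly 1 MiB; if the intent is "larger than 1 MiB", the usual form is +1M (generic find semantics, not what was run here):

    sudo find /var/lib/puppet/clientbucket/ -type f -size +1M -delete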
23:08 <dzahn@cumin1001> END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host gitlab-runner1001.eqiad.wmnet [production]
23:03 <mutante> wcqs1001 - sudo systemctl restart wcqs-blazegraph - after <+jinxer-wm> (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wcqs1001:9195 is burning free allocators [production]
22:52 <dzahn@cumin1001> START - Cookbook sre.ganeti.makevm for new host gitlab-runner1001.eqiad.wmnet [production]
22:50 <mutante> Creating a new Ganeti VM and wondering which row to put it? [ganeti1009:~] $ for row in A B C D; do echo "row ${row}: $(sudo gnt-instance list -o name -F "pnode.group == 'row_${row}'" | wc -l) VMs"; done [production]
22:43 <dzahn@cumin1001> END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts gitlab-runner1001.wikimedia.org [production]