2026-03-10
01:37 <ryankemper> [WDQS] T410573 repooled wdqs1011.eqiad.wmnet - erroneously depooled since `2025-11-19` by failed `sre.wdqs.reboot` cookbook [production]
00:45 <wm-bot2> Deployment completed: https://github.com/cluebotng/component-configs/actions/runs/22881598726 (https://github.com/cluebotng/component-configs/commits/bc32d8044077ff83db8b985b87df029ff564ad29) [tools.cluebotng-review]
00:42 <vriley@cumin1003> START - Cookbook sre.hosts.reimage for host contint1003.wikimedia.org with OS trixie [production]
00:39 <vriley@cumin1003> END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host contint1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [production]
00:29 <vriley@cumin1003> START - Cookbook sre.hosts.provision for host contint1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [production]
2026-03-09
22:51 <rzl> root@apt1002:~# reprepro --noskipold --restrict vopsbot update bookworm-wikimedia [production]
22:34 <bking@cumin2002> END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM dse-k8s-ctrl1001.eqiad.wmnet [production]
22:32 <bking@cumin2002> START - Cookbook sre.ganeti.reboot-vm for VM dse-k8s-ctrl1001.eqiad.wmnet [production]
22:30 <bking@cumin2002> END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM dse-k8s-ctrl1002.eqiad.wmnet [production]
22:29 <bking@cumin2002> START - Cookbook sre.ganeti.reboot-vm for VM dse-k8s-ctrl1002.eqiad.wmnet [production]
22:28 <bking@cumin2002> END (FAIL) - Cookbook sre.ganeti.reboot-vm (exit_code=99) for VM dse-k8s-ctrl1002.eqiad.wmnet [production]
22:28 <bking@cumin2002> START - Cookbook sre.ganeti.reboot-vm for VM dse-k8s-ctrl1002.eqiad.wmnet [production]
22:28 <bking@cumin2002> END (FAIL) - Cookbook sre.ganeti.reboot-vm (exit_code=99) for VM dse-k8s-ctrl1002.eqiad.wmnet [production]
22:28 <bking@cumin2002> START - Cookbook sre.ganeti.reboot-vm for VM dse-k8s-ctrl1002.eqiad.wmnet [production]
22:03 <andrew@cumin2002> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudgw2004-dev.codfw.wmnet with OS trixie [production]
22:02 <alexsanford> Redeployed security fix for T419186 [production]
21:53 <bd808> Reboot deployment-shellbox01 on the off chance that it makes the new permissions error go away (T419440) [releng]
21:44 <andrew@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2004-dev.codfw.wmnet with reason: host reimage [production]
21:40 <andrew@cumin2002> START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2004-dev.codfw.wmnet with reason: host reimage [production]
21:38 <Reedy> rm -rf /var/log/extdist T418469 T253588 [extdist]
21:37 <cdobbins@puppetserver1001> conftool action : set/pooled=yes; selector: name=cp7002.magru.wmnet [production]
21:34 <cdobbins@cumin2002> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp7002.magru.wmnet with OS trixie [production]
21:29 <alexsanford> Deployed security fix for T419186 [production]
21:22 <andrew@cumin2002> START - Cookbook sre.hosts.reimage for host cloudgw2004-dev.codfw.wmnet with OS trixie [production]
21:21 <andrew@cumin2002> END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudgw2004-dev.codfw.wmnet with OS trixie [production]
21:17 <dani@deploy2002> Finished scap sync-world: Backport for [[gerrit:1249370|Pre-deploy participant recruitment survey on ptwiki and trwiki (T419275)]] (duration: 08m 15s) [production]
21:13 <dani@deploy2002> dani: Continuing with sync [production]
21:11 <dani@deploy2002> dani: Backport for [[gerrit:1249370|Pre-deploy participant recruitment survey on ptwiki and trwiki (T419275)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [production]
21:09 <dani@deploy2002> Started scap sync-world: Backport for [[gerrit:1249370|Pre-deploy participant recruitment survey on ptwiki and trwiki (T419275)]] [production]
21:08 <andrew@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2004-dev.codfw.wmnet with reason: host reimage [production]
21:05 <cdobbins@cumin2002> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp7002.magru.wmnet with reason: host reimage [production]
21:02 <cdobbins@cumin2002> START - Cookbook sre.hosts.downtime for 2:00:00 on cp7002.magru.wmnet with reason: host reimage [production]
21:01 <andrew@cumin2002> START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2004-dev.codfw.wmnet with reason: host reimage [production]
21:01 <tgr_> removed private code for T397244 [production]
21:01 <ryankemper> [WDQS] Alright, these are re-entering a failed state soon enough that we will need to identify the offender if we want to restore proper service. We could put in a temporary hack to restart every few minutes so we at least maintain some uptime, but the root cause is the usual 'we need a requestctl rule to block whoever's killing us' scenario [production]
21:00 <cdobbins@puppetserver1001> conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [reason: Trixie reimaging] [production]
20:57 <ryankemper> [WDQS] Auto-remediation would have eventually restarted these, but some of them were staying below our current threshold of `threads > 1200`. May want to lower the threshold, or look at an additional metric type in the future [production]
20:56 <ryankemper> [WDQS] `ryankemper@cumin2002:~$ sudo -E cumin 'A:wdqs-main AND P{wdqs1*}' 'systemctl restart wdqs-blazegraph'` [production]
20:54 <ryankemper> [WDQS] `ryankemper@cumin2002:~$ sudo -E cumin 'A:wdqs-main AND P{wdqs2*}' 'systemctl restart wdqs-blazegraph'` [production]
20:44 <andrew@cumin2002> START - Cookbook sre.hosts.reimage for host cloudgw2004-dev.codfw.wmnet with OS trixie [production]
20:43 <tgr@deploy2002> Unlocked for deployment [MediaWiki]: working on private change (duration: 10m 10s) [production]
20:36 <cdobbins@cumin2002> START - Cookbook sre.hosts.reimage for host cp7002.magru.wmnet with OS trixie [production]
20:33 <tgr@deploy2002> Locking from deployment [MediaWiki]: working on private change [production]
20:31 <tgr@deploy2002> Finished scap sync-world: Backport for [[gerrit:1247119|Enable parser survey for opted-out users on German/French/Polish wikis (T414852)]], [[gerrit:1249316|lift IP cap for womens month editathon (T419109)]] (duration: 13m 36s) [production]
20:27 <tgr@deploy2002> cscott, tgr, anzx: Continuing with sync [production]
20:19 <tgr@deploy2002> cscott, tgr, anzx: Backport for [[gerrit:1247119|Enable parser survey for opted-out users on German/French/Polish wikis (T414852)]], [[gerrit:1249316|lift IP cap for womens month editathon (T419109)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [production]
20:17 <tgr@deploy2002> Started scap sync-world: Backport for [[gerrit:1247119|Enable parser survey for opted-out users on German/French/Polish wikis (T414852)]], [[gerrit:1249316|lift IP cap for womens month editathon (T419109)]] [production]
20:13 <aaron@deploy2002> Finished scap sync-world: Backport for [[gerrit:1249363|Remove redundant math spec file from wwwportal (T418188)]] (duration: 06m 56s) [production]
20:09 <aaron@deploy2002> aaron: Continuing with sync [production]
20:08 <aaron@deploy2002> aaron: Backport for [[gerrit:1249363|Remove redundant math spec file from wwwportal (T418188)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [production]