2021-06-08
09:36 <dcaro> actually, there's several different errors, will open tasks for each of them [admin-monitoring]
09:31 <dcaro> there's a bunch of novafullstack vms in error because it timed out when trying to allocate network, though there's a "successfully plugged vif" message from neutron, cleaning up for now [admin-monitoring]
09:04 <jayme> removing docker-images from registry: releng/ci-jessie, releng/ci-src-setup, releng/composer-php56, releng/composer-test-php56, releng/npm, releng/npm-test, releng/npm-test-3d2png, releng/npm-test-graphoid, releng/npm-test-librdkafka, releng/npm-test-maps-service, releng/php56, releng/quibble-jessie, releng/quibble-jessie-hhvm, releng/quibble-jessie-php56 - T251918 [production]
08:31 <dcausse> depooling wdqs1006 (lag) [production]
08:29 <dcausse> restarting blazegraph on wdqs1006 [production]
08:19 <elukey@cumin1001> END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [production]
08:13 <oblivian@deploy1002> helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [production]
08:13 <elukey@cumin1001> START - Cookbook sre.dns.netbox [production]
07:49 <jmm@cumin1001> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin2002.codfw.wmnet [production]
07:41 <jmm@cumin1001> START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet [production]
07:40 <oblivian@deploy1002> helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [production]
07:37 <oblivian@deploy1002> helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [production]
07:35 <oblivian@deploy1002> helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [production]
07:29 <marostegui@cumin1001> dbctl commit (dc=all): 'db1161 (re)pooling @ 100%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P16324 and previous config saved to /var/cache/conftool/dbconfig/20210608-072937-root.json [production]
07:14 <marostegui@cumin1001> dbctl commit (dc=all): 'db1161 (re)pooling @ 75%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P16323 and previous config saved to /var/cache/conftool/dbconfig/20210608-071433-root.json [production]
06:59 <marostegui@cumin1001> dbctl commit (dc=all): 'db1161 (re)pooling @ 50%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P16322 and previous config saved to /var/cache/conftool/dbconfig/20210608-065930-root.json [production]
06:52 <tgr> T283606: running mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki={ar,bn,cs,vi}wiki --verbose --search-index with gerrit:696307 applied [production]
06:44 <marostegui@cumin1001> dbctl commit (dc=all): 'db1161 (re)pooling @ 25%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P16321 and previous config saved to /var/cache/conftool/dbconfig/20210608-064426-root.json [production]
06:40 <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1161 for upgrade', diff saved to https://phabricator.wikimedia.org/P16320 and previous config saved to /var/cache/conftool/dbconfig/20210608-064055-marostegui.json [production]
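The db1161 entries above record the usual depool / staged-repool pattern. A minimal sketch of the equivalent commands, assuming dbctl's `instance` and `config commit` subcommands; the loop and the sleep are illustrative, not the operator's actual shell history:

    # Take the replica out of rotation and commit the change (messages mirror the logged ones)
    dbctl instance db1161 depool
    dbctl config commit -m 'Depool db1161 for upgrade'

    # ...upgrade db1161, then ramp traffic back up in stages as logged above...
    for pct in 25 50 75 100; do
        dbctl instance db1161 pool -p "$pct"
        dbctl config commit -m "db1161 (re)pooling @ ${pct}%: Repool after upgrade"
        sleep 900   # roughly 15 minutes between stages, matching the logged timestamps
    done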
06:27 <elukey> clean some airflow logs on an-airflow1001 as a one-off to free space (had a chat with the Search team first) [production]
06:08 <elukey> restart yarn nodemanager on analytics1075 to clear the unhealthy state after some days of downtime (one-off issue but let's keep an eye on it) [analytics]
05:46 <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2123.codfw.wmnet with reason: REIMAGE [production]
05:44 <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on db2123.codfw.wmnet with reason: REIMAGE [production]
05:17 <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2123.codfw.wmnet with reason: REIMAGE [production]
05:15 <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on db2123.codfw.wmnet with reason: REIMAGE [production]
04:54 <marostegui> Repool clouddb1019:3314 [production]
04:07 <ryankemper@cumin1001> END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [production]
02:38 <ryankemper@cumin1001> START - Cookbook sre.wdqs.data-transfer [production]
02:38 <ryankemper> T284445 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1011.eqiad.wmnet --dest wdqs1012.eqiad.wmnet --reason "repairing overinflated blazegraph journal" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `wdqs` [production]
02:37 <ryankemper> T284445 after manually stopping blazegraph/wdqs-updater, `sudo rm -fv /srv/wdqs/wikidata.jnl` on `wdqs1012` (clearing old overinflated journal file away before xferring new one) [production]
02:34 <ryankemper> [WDQS] `ryankemper@wdqs1005:~$ sudo depool` (catching up on ~7h of lag) [production]
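The T284445 entries above record replacing an overinflated Blazegraph journal on wdqs1012 by transferring a fresh one from wdqs1011. A minimal sketch assembled from the logged commands; the systemd unit names are an assumption (the log only says "manually stopping blazegraph/wdqs-updater"):

    # On wdqs1012: stop the updater and blazegraph before touching the journal
    sudo systemctl stop wdqs-updater blazegraph
    # Clear the overinflated journal file, as logged at 02:37
    sudo rm -fv /srv/wdqs/wikidata.jnl

    # On cumin1001: copy a fresh journal from a healthy host (command as logged at 02:38)
    sudo -i cookbook sre.wdqs.data-transfer \
        --source wdqs1011.eqiad.wmnet \
        --dest wdqs1012.eqiad.wmnet \
        --reason "repairing overinflated blazegraph journal" \
        --blazegraph_instance blazegraph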
2021-06-07
22:02 <urbanecm> urbanecm@deployment-sessionstore04:~$ sudo service cassandra start # T263617 [releng]
22:02 <urbanecm> urbanecm@deployment-sessionstore04:~$ sudo touch /etc/cassandra/service-enabled #T263617 [releng]
21:40 <James_F> Docker: Pushing node12-test and node12-test-browser 0.0.2 for T284492 [releng]
21:35 <wm-bot> <lucaswerkmeister> deployed 547231388b (add create link for duplicates in bulk mode) [tools.lexeme-forms]
21:26 <otto@cumin1001> END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) [production]
21:12 <sbassett> Deployed security patch for T284364 [production]
20:04 <wm-bot> <lucaswerkmeister> deployed daf88503e0 (l10n updates) [tools.lexeme-forms]
19:30 <ryankemper> T284479 [Cirrussearch] We'll keep monitoring; for now this incident is resolved. Current request volume matches what we'd expect, so if we're accidentally banning any innocent requests they must be an incredibly small percentage of the total, otherwise we'd see significantly lower volume than expected [production]
19:25 <ryankemper> T284479 [Cirrussearch] Seeing the expected drop in `entity_full_text` requests here: https://grafana-rw.wikimedia.org/d/000000455/elasticsearch-percentiles?viewPanel=47&orgId=1&from=now-12h&to=now As a result we're no longer rejecting any requests [production]
19:21 <ryankemper> T284479 [Cirrussearch] We're working on rolling out https://gerrit.wikimedia.org/r/698607, which will ban search API requests that match the Google App Engine IP range `2600:1900::0/28` AND whose user agent includes `HeadlessChrome` [production]
19:19 <cdanis> T284479 ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕞🍵 sudo cumin -b16 'A:cp-text' "run-puppet-agent" [production]
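A minimal sketch of the ban predicate described in the 19:21 entry (source address inside 2600:1900::0/28 AND a user agent containing HeadlessChrome), assuming grepcidr with IPv6 support is available; the sample address and user agent are hypothetical, and this is not the content of the gerrit change itself:

    # Hypothetical request attributes, for illustration only
    ip='2600:1900:4000::1'
    ua='Mozilla/5.0 (X11; Linux x86_64) HeadlessChrome/90.0.4430.93'

    # Reject only when BOTH criteria from the 19:21 entry hold
    if echo "$ip" | grepcidr '2600:1900::0/28' >/dev/null && [[ "$ua" == *HeadlessChrome* ]]; then
        echo "request matches the ban criteria"
    fi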
19:07 <andrew@deploy1002> Finished deploy [horizon/deploy@6199b67]: disable shelve/unshelve T284462 (duration: 04m 53s) [production]
19:02 <andrew@deploy1002> Started deploy [horizon/deploy@6199b67]: disable shelve/unshelve T284462 [production]
19:01 <andrew@deploy1002> Finished deploy [horizon/deploy@6199b67]: disable shelve/unshelve (duration: 02m 01s) [production]
18:59 <andrew@deploy1002> Started deploy [horizon/deploy@6199b67]: disable shelve/unshelve [production]
18:57 <herron> prometheus3001: moved /srv back to vda1 filesystem T243057 [production]
18:39 <bstorm> cleaning up more error conditions on grid queues [tools]
18:25 <urbanecm> [urbanecm@mwmaint1002 /srv/mediawiki/php-1.37.0-wmf.7]$ mwscript extensions/GrowthExperiments/maintenance/initWikiConfig.php --wiki=skwiki --phab=T284149 [production]
18:24 <urbanecm@deploy1002> Synchronized php-1.37.0-wmf.7/extensions/GrowthExperiments/includes/WelcomeSurvey.php: 368b5d9: 0e79aee: WelcomeSurvey backports (T284127, T284257; 2/2) (duration: 00m 57s) [production]