2021-06-08
08:13 <elukey@cumin1001> START - Cookbook sre.dns.netbox [production]
07:49 <jmm@cumin1001> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cumin2002.codfw.wmnet [production]
07:41 <jmm@cumin1001> START - Cookbook sre.hosts.reboot-single for host cumin2002.codfw.wmnet [production]
07:40 <oblivian@deploy1002> helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn'. [production]
07:37 <oblivian@deploy1002> helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn'. [production]
07:35 <oblivian@deploy1002> helmfile [staging] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn'. [production]
07:29 <marostegui@cumin1001> dbctl commit (dc=all): 'db1161 (re)pooling @ 100%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P16324 and previous config saved to /var/cache/conftool/dbconfig/20210608-072937-root.json [production]
07:14 <marostegui@cumin1001> dbctl commit (dc=all): 'db1161 (re)pooling @ 75%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P16323 and previous config saved to /var/cache/conftool/dbconfig/20210608-071433-root.json [production]
06:59 <marostegui@cumin1001> dbctl commit (dc=all): 'db1161 (re)pooling @ 50%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P16322 and previous config saved to /var/cache/conftool/dbconfig/20210608-065930-root.json [production]
06:52 <tgr> T283606: running mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php --wiki={ar,bn,cs,vi}wiki --verbose --search-index with gerrit:696307 applied [production]
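The `--wiki={ar,bn,cs,vi}wiki` notation above is shorthand for one run per wiki; a minimal sketch of the equivalent shell loop (the loop itself is an assumption, and the handling of the gerrit:696307 patch is omitted):

    # one fixLinkRecommendationData.php run per wiki, as recorded in the 06:52 entry
    for wiki in arwiki bnwiki cswiki viwiki; do
        mwscript extensions/GrowthExperiments/maintenance/fixLinkRecommendationData.php \
            --wiki="$wiki" --verbose --search-index
    done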
06:44 <marostegui@cumin1001> dbctl commit (dc=all): 'db1161 (re)pooling @ 25%: Repool after upgrade', diff saved to https://phabricator.wikimedia.org/P16321 and previous config saved to /var/cache/conftool/dbconfig/20210608-064426-root.json [production]
06:40 <marostegui@cumin1001> dbctl commit (dc=all): 'Depool db1161 for upgrade', diff saved to https://phabricator.wikimedia.org/P16320 and previous config saved to /var/cache/conftool/dbconfig/20210608-064055-marostegui.json [production]
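Read bottom-up, the db1161 entries above record the standard depool / upgrade / gradual-repool cycle. A minimal sketch of the underlying dbctl commands, assuming the usual conftool workflow (the exact flags and the automation driving the 15-minute steps may differ):

    # depool db1161 for the upgrade and commit the config change
    sudo dbctl instance db1161 depool
    sudo dbctl config commit -m "Depool db1161 for upgrade"

    # after the upgrade, repool gradually, committing at each step
    for pct in 25 50 75 100; do
        sudo dbctl instance db1161 pool -p "$pct"
        sudo dbctl config commit -m "db1161 (re)pooling @ ${pct}%: Repool after upgrade"
        sleep 900   # ~15 minutes between steps, matching the timestamps above
    done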
06:27 <elukey> cleaned some airflow logs on an-airflow1001 as a one-off to free space (had a chat with the Search team first) [production]
05:46 <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2123.codfw.wmnet with reason: REIMAGE [production]
05:44 <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on db2123.codfw.wmnet with reason: REIMAGE [production]
05:17 <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2123.codfw.wmnet with reason: REIMAGE [production]
05:15 <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on db2123.codfw.wmnet with reason: REIMAGE [production]
04:54 <marostegui> Repool clouddb1019:3314 [production]
04:07 <ryankemper@cumin1001> END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) [production]
02:38 <ryankemper@cumin1001> START - Cookbook sre.wdqs.data-transfer [production]
02:38 <ryankemper> T284445 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1011.eqiad.wmnet --dest wdqs1012.eqiad.wmnet --reason "repairing overinflated blazegraph journal" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `wdqs` [production]
02:37 <ryankemper> T284445 after manually stopping blazegraph/wdqs-updater, `sudo rm -fv /srv/wdqs/wikidata.jnl` on `wdqs1012` (clearing old overinflated journal file away before xferring new one) [production]
02:34 <ryankemper> [WDQS] `ryankemper@wdqs1005:~$ sudo depool` (catching up on ~7h of lag) [production]
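A minimal sketch of the wdqs1012 journal repair recorded in the 02:37-04:07 entries, in chronological order; the systemd unit names are assumptions based on other entries in this log, while the cookbook invocation is verbatim from the 02:38 entry:

    # on wdqs1012: stop the updater and blazegraph, then remove the overinflated journal (T284445)
    sudo systemctl stop wdqs-updater
    sudo systemctl stop wdqs-blazegraph
    sudo rm -fv /srv/wdqs/wikidata.jnl

    # from cumin1001: transfer a healthy journal from wdqs1011 to wdqs1012
    sudo -i cookbook sre.wdqs.data-transfer \
        --source wdqs1011.eqiad.wmnet --dest wdqs1012.eqiad.wmnet \
        --reason "repairing overinflated blazegraph journal" \
        --blazegraph_instance blazegraph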
2021-06-07
21:26 <otto@cumin1001> END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) [production]
21:12 <sbassett> Deployed security patch for T284364 [production]
19:30 <ryankemper> T284479 [Cirrussearch] We'll keep monitoring; for now this incident is resolved. Current volume matches what we'd expect, so if we're accidentally banning any innocent requests they must be an incredibly small percentage of the total; otherwise we'd see a significantly lower volume than expected [production]
19:25 <ryankemper> T284479 [Cirrussearch] Seeing the expected drop in `entity_full_text` requests here: https://grafana-rw.wikimedia.org/d/000000455/elasticsearch-percentiles?viewPanel=47&orgId=1&from=now-12h&to=now As a result we're no longer rejecting any requests [production]
19:21 <ryankemper> T284479 [Cirrussearch] We're working on rolling out https://gerrit.wikimedia.org/r/698607, which will ban search API requests that match the Google App Engine IP range `2600:1900::0/28` AND whose user agent includes `HeadlessChrome` [production]
19:19 <cdanis> T284479 ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕞🍵 sudo cumin -b16 'A:cp-text' "run-puppet-agent" [production]
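For readers unfamiliar with cumin, the 19:19 rollout command above, annotated (same invocation, comments added):

    # run puppet on all cache-text hosts, 16 at a time (-b16 is the batch size, A:cp-text a host alias)
    sudo cumin -b16 'A:cp-text' "run-puppet-agent"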
19:07 <andrew@deploy1002> Finished deploy [horizon/deploy@6199b67]: disable shelve/unshelve T284462 (duration: 04m 53s) [production]
19:02 <andrew@deploy1002> Started deploy [horizon/deploy@6199b67]: disable shelve/unshelve T284462 [production]
19:01 <andrew@deploy1002> Finished deploy [horizon/deploy@6199b67]: disable shelve/unshelve (duration: 02m 01s) [production]
18:59 <andrew@deploy1002> Started deploy [horizon/deploy@6199b67]: disable shelve/unshelve [production]
18:57 <herron> prometheus3001: moved /srv back to vda1 filesystem T243057 [production]
18:25 <urbanecm> [urbanecm@mwmaint1002 /srv/mediawiki/php-1.37.0-wmf.7]$ mwscript extensions/GrowthExperiments/maintenance/initWikiConfig.php --wiki=skwiki --phab=T284149 [production]
18:24 <urbanecm@deploy1002> Synchronized php-1.37.0-wmf.7/extensions/GrowthExperiments/includes/WelcomeSurvey.php: 368b5d9: 0e79aee: WelcomeSurvey backports (T284127, T284257; 2/2) (duration: 00m 57s) [production]
18:22 <urbanecm@deploy1002> Synchronized php-1.37.0-wmf.7/extensions/GrowthExperiments/extension.json: 368b5d9: 0e79aee: WelcomeSurvey backports (T284127, T284257; 1/2) (duration: 00m 56s) [production]
18:20 <urbanecm@deploy1002> Synchronized php-1.37.0-wmf.7/extensions/GrowthExperiments/maintenance/initWikiConfig.php: 7089728: b2482fb: initWikiConfig GE backports (T284072) (duration: 00m 58s) [production]
18:16 <urbanecm@deploy1002> Synchronized wmf-config/InitialiseSettings.php: 15e09109b7c45de967a496a0eb58ad267dbc5079: skwiki: Make Growth features available in dark mode (T284149; 3/3) (duration: 00m 56s) [production]
18:14 <urbanecm@deploy1002> Synchronized dblists/growthexperiments.dblist: 15e09109b7c45de967a496a0eb58ad267dbc5079: skwiki: Make Growth features available in dark mode (T284149; 2/3) (duration: 00m 56s) [production]
18:14 <otto@cumin1001> START - Cookbook sre.kafka.roll-restart-brokers [production]
18:14 <ottomata> rolling restart of kafka jumbo brokers - T283067 [production]
18:13 <urbanecm@deploy1002> Synchronized wmf-config/config/skwiki.yaml: 15e09109b7c45de967a496a0eb58ad267dbc5079: skwiki: Make Growth features available in dark mode (T284149; 1/3) (duration: 00m 59s) [production]
18:12 <otto@cumin1001> END (PASS) - Cookbook sre.kafka.roll-restart-mirror-maker (exit_code=0) [production]
18:04 <urbanecm> [urbanecm@mwmaint1002 ~]$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=skwiki growthexperiments # T284149 [production]
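Read bottom-up, the T284149 entries above follow the usual sequence for enabling an extension's features on a new wiki. A minimal sketch, assuming `scap sync-file` is what produced the 'Synchronized' lines (the mwscript runs happen on mwmaint1002 and the syncs on deploy1002, per the hosts in the log; commit hashes dropped, messages as logged):

    # 1. create the GrowthExperiments tables on skwiki (18:04)
    mwscript extensions/WikimediaMaintenance/createExtensionTables.php --wiki=skwiki growthexperiments

    # 2. sync the config changes one file at a time (18:13-18:16)
    scap sync-file wmf-config/config/skwiki.yaml 'skwiki: Make Growth features available in dark mode (T284149; 1/3)'
    scap sync-file dblists/growthexperiments.dblist 'skwiki: Make Growth features available in dark mode (T284149; 2/3)'
    scap sync-file wmf-config/InitialiseSettings.php 'skwiki: Make Growth features available in dark mode (T284149; 3/3)'

    # 3. seed the on-wiki Growth configuration (18:25)
    mwscript extensions/GrowthExperiments/maintenance/initWikiConfig.php --wiki=skwiki --phab=T284149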
18:04 <urbanecm@deploy1002> Synchronized wmf-config/InitialiseSettings.php: 5de2f8b27b016a2cd8f424d8e40318edde5e5704: Set WelcomeSurveyEnableWithHomepage (T281896, T284257) (duration: 00m 59s) [production]
17:53 <otto@cumin1001> START - Cookbook sre.kafka.roll-restart-mirror-maker [production]
17:53 <ottomata> rolling restart of kafka jumbo mirror makers - T283067 [production]
17:17 <ryankemper> [Cirrussearch] We're seeing ~10% of current requests being rejected by poolcounter, due to ~2x expected `eqiad.full_text` query volume and ~30x expected `eqiad.entity_full_text` query volume [production]
16:56 <ryankemper> [WDQS] `ryankemper@wdqs1005:~$ sudo systemctl restart wdqs-blazegraph` (blazegraph locked up) [production]