2022-08-03
ยง
|
20:00 <rzl@cumin1001> END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes2012.codfw.wmnet [production]
20:00 <rzl@cumin1001> START - Cookbook sre.hosts.remove-downtime for kubernetes2012.codfw.wmnet [production]
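The remove-downtime runs above are Spicerack cookbook invocations from a cumin host. A minimal sketch of an equivalent manual run, assuming the cookbook takes a host query as its positional argument (the host shown is just the one from this entry):

```
# Run from a cumin host (e.g. cumin1001); clears the active downtime
# for the matching host so alerting resumes.
sudo cookbook sre.hosts.remove-downtime 'kubernetes2012.codfw.wmnet'
```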
20:00 <rzl@deploy1002> conftool action : set/pooled=yes; selector: name=kubernetes2012.codfw.wmnet [production]
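The `conftool action` lines are logged automatically when a host is pooled or depooled. A minimal sketch of the underlying confctl call, assuming the standard select/set syntax:

```
# Repool the node by name; conftool records the resulting
# "conftool action : set/pooled=yes; selector: ..." line seen above.
sudo confctl select 'name=kubernetes2012.codfw.wmnet' set/pooled=yes

# Inspect the current state for the same selector.
sudo confctl select 'name=kubernetes2012.codfw.wmnet' get
```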
19:51 <marostegui@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P32251 and previous config saved to /var/cache/conftool/dbconfig/20220803-195113-marostegui.json [production]
19:40 <ryankemper> T314078 Forgot to mention, restart is at `ryankemper@cumin1001` tmux session `codfw_restarts` [production]
19:39 <ryankemper> T314078 Rolling upgrade of codfw hosts; after this all of eqiad/codfw will have the new plugin version and we can resume the `search-loader` instances: `sudo -E cookbook sre.elasticsearch.rolling-operation search_codfw "codfw cluster plugin upgrade" --upgrade --nodes-per-run 3 --start-datetime 2022-08-03T19:38:10 --task-id T314078` [production]
19:38 <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster plugin upgrade - ryankemper@cumin1001 - T314078 [production]
19:36 <marostegui@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1182 (T312972)', diff saved to https://phabricator.wikimedia.org/P32250 and previous config saved to /var/cache/conftool/dbconfig/20220803-193607-marostegui.json [production]
19:33 <marostegui@cumin1001> dbctl commit (dc=all): 'Depooling db1182 (T312972)', diff saved to https://phabricator.wikimedia.org/P32249 and previous config saved to /var/cache/conftool/dbconfig/20220803-193354-marostegui.json [production]
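The dbctl entries follow the usual depool → maintenance → gradual repool cycle. A minimal sketch of that workflow, assuming the standard instance/commit subcommands and reusing the commit messages from this log:

```
# Depool the replica, then commit and distribute the new dbconfig;
# each commit produces the Phabricator paste and cached JSON seen above.
sudo dbctl instance db1182 depool
sudo dbctl config commit -m 'Depooling db1182 (T312972)'

# ...run the maintenance on db1182...

# Repool (in practice often done in stages with increasing weight) and commit again.
sudo dbctl instance db1182 pool
sudo dbctl config commit -m 'Repooling after maintenance db1182 (T312972)'
```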
19:33 <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [production]
19:33 <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [production]
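The 6:00:00 window above comes from the sre.hosts.downtime cookbook. A minimal sketch of an equivalent invocation, assuming duration and reason are passed as options matching the values in the log:

```
# Silence alerting for db1182 for 6 hours before maintenance starts.
sudo cookbook sre.hosts.downtime --hours 6 --reason "Maintenance" 'db1182.eqiad.wmnet'
```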
19:33 <marostegui@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1129 (T312972)', diff saved to https://phabricator.wikimedia.org/P32248 and previous config saved to /var/cache/conftool/dbconfig/20220803-193334-marostegui.json [production]
19:25 <mutante> gerrit1001 - rsyncing /var/lib/gerrit/review_site/ over to gerrit2002 815401 [production]
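The gerrit1001 → gerrit2002 copy is a plain rsync of the review_site data. A minimal sketch, assuming an archive-mode copy to the same path on the new host (the actual transport and destination are defined in change 815401 and may differ):

```
# Copy the Gerrit site data to the replacement host; the trailing slash
# copies the directory contents. Destination path is an assumption.
rsync -a --delete /var/lib/gerrit/review_site/ gerrit2002.wikimedia.org:/var/lib/gerrit/review_site/
```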
19:18 <marostegui@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P32247 and previous config saved to /var/cache/conftool/dbconfig/20220803-191828-marostegui.json [production]
19:03 <marostegui@cumin1001> dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P32246 and previous config saved to /var/cache/conftool/dbconfig/20220803-190321-marostegui.json [production]
18:56 <rzl@cumin1001> END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubernetes2011.codfw.wmnet [production]
18:56 <rzl@cumin1001> START - Cookbook sre.hosts.remove-downtime for kubernetes2011.codfw.wmnet [production]
18:56 <rzl@deploy1002> conftool action : set/pooled=yes; selector: name=kubernetes2011.codfw.wmnet [production]
18:33 <rzl@cumin1001> END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2027,2037].codfw.wmnet [production]
18:33 <rzl@cumin1001> START - Cookbook sre.hosts.remove-downtime for mc[2027,2037].codfw.wmnet [production]
18:23 <mwdebug-deploy@deploy1002> helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [production]
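The mwdebug START/DONE pairs are standard helmfile deploy output, one apply per datacenter. A minimal sketch of what the deployment tooling effectively runs, assuming the usual deployment-charts layout under /srv/deployment-charts on the deploy host:

```
# Apply the mwdebug release one datacenter at a time;
# the START/DONE lines above bracket each apply.
cd /srv/deployment-charts/helmfile.d/services/mwdebug
helmfile -e eqiad apply
helmfile -e codfw apply
```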
18:16 <mwdebug-deploy@deploy1002> helmfile [codfw] START helmfile.d/services/mwdebug: apply [production]
18:16 <mwdebug-deploy@deploy1002> helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [production]
18:16 <dancy@deploy1002> Synchronized php: group1 wikis to 1.39.0-wmf.23 refs T308076 (duration: 03m 37s) [production]
18:15 <mwdebug-deploy@deploy1002> helmfile [eqiad] START helmfile.d/services/mwdebug: apply [production]
18:12 <dancy@deploy1002> rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.23 refs T308076 [production]
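The two dancy entries are the MediaWiki train step that moves group1 wikis to the new branch. A minimal sketch of the scap side, assuming the usual sync-wikiversions workflow on the deployment host (the "Synchronized php:" line above comes from the accompanying sync step that follows it):

```
# Rebuild wikiversions so group1 wikis run 1.39.0-wmf.23 and sync to the fleet;
# scap logs the "rebuilt and synchronized wikiversions files: ..." line seen above.
scap sync-wikiversions 'group1 wikis to 1.39.0-wmf.23 refs T308076'
```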
17:58 <rzl@cumin1001> END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for kubestage2002.codfw.wmnet [production]
17:58 <rzl@cumin1001> START - Cookbook sre.hosts.remove-downtime for kubestage2002.codfw.wmnet [production]
17:57 <rzl@cumin1001> END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc[2025-2026].codfw.wmnet [production]
17:57 <rzl@cumin1001> START - Cookbook sre.hosts.remove-downtime for mc[2025-2026].codfw.wmnet [production]
17:57 <bking@cumin1001> END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for elastic2044.codfw.wmnet [production]
17:57 <bking@cumin1001> START - Cookbook sre.hosts.remove-downtime for elastic2044.codfw.wmnet [production]
17:56 <bking@cumin1001> END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for elastic2043.codfw.wmnet [production]
17:56 <bking@cumin1001> START - Cookbook sre.hosts.remove-downtime for elastic2043.codfw.wmnet [production]
17:55 <ottomata> increasing partitions from 5 to 6 for *.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite topics in Kafka main-eqiad and main-codfw - T314426 [production]
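The ottomata entry records a partition increase on the cirrusSearchElasticaWrite job topics. A minimal sketch of how such a change is typically made with the stock Kafka CLI; the broker address and the concrete topic name below are illustrative expansions of the `*.` pattern, and the change must be repeated per topic and per cluster (main-eqiad, main-codfw):

```
# Raise the partition count for one concrete topic on one cluster.
# Partition counts can only grow, never shrink. Broker host is an assumption.
kafka-topics.sh --bootstrap-server kafka-main1001.eqiad.wmnet:9092 \
  --alter \
  --topic eqiad.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite \
  --partitions 6
```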
17:55 <mvernon@cumin1001> END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2055.codfw.wmnet [production]
17:55 <mvernon@cumin1001> START - Cookbook sre.hosts.remove-downtime for ms-be2055.codfw.wmnet [production]
17:50 <rzl@cumin1001> conftool action : set/pooled=yes; selector: name=kubestage2002.codfw.wmnet [production]
17:38 <rzl@cumin1001> END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse[2008-2010].codfw.wmnet [production]
17:38 <rzl@cumin1001> START - Cookbook sre.hosts.remove-downtime for parse[2008-2010].codfw.wmnet [production]
17:23 <hnowlan@puppetmaster1001> conftool action : set/pooled=yes; selector: name=restbase20[12]4.codfw.wmnet [production]
17:14 <mvernon@cumin1001> END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for 6 hosts [production]
17:14 <mvernon@cumin1001> START - Cookbook sre.hosts.remove-downtime for 6 hosts [production]
17:08 <ryankemper> T310145 `elastic2031` and `wcqs2002` powered off in preparation for C1 maintenance [production]
17:06 <jayme@cumin1001> conftool action : set/pooled=yes; selector: name=(kubernetes2020.codfw.wmnet|kubernetes2009.codfw.wmnet|kubernetes2010.codfw.wmnet) [production]
17:00 <btullis@cumin1001> END (FAIL) - Cookbook sre.aqs.roll-restart (exit_code=99) for AQS aqs cluster: Roll restart of all AQS's nodejs daemons. [production]
16:48 <Emperor> shutdown moss-fe2001.codfw.wmnet,ms-fe2011.codfw.wmnet,ms-be20[34,35,42,48,55,68].codfw.wmnet PDU work T310145 [production]
16:47 <mvernon@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 8 hosts with reason: PDU work [production]
16:47 <dzahn@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on gerrit2002.wikimedia.org with reason: in setup / flapping [production]
16:47 <mvernon@cumin1001> START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 8 hosts with reason: PDU work [production]