2021-04-29
23:36
<thcipriani@deploy1002>
Synchronized README: Config: [[gerrit:683749|Revert "DEMO: Add newline to README"]] (duration: 00m 56s)
[production]
23:18 |
<ryankemper> |
T280563 successful reboot of `relforge100[3,4]`; `relforge` cluster is back to green status. |
[production] |
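(For context: "green" is the standard Elasticsearch cluster health status, meaning all primary and replica shards are allocated. A minimal way to confirm it from any cluster node, assuming the default HTTP port 9200:)

```
curl -s 'http://localhost:9200/_cluster/health?pretty'
# look for:  "status" : "green"
```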
23:16 |
<thcipriani@deploy1002> |
Synchronized README: Config: [[gerrit:683747|DEMO: Add newline to README]] (duration: 00m 56s) |
[production] |
23:08 |
<ryankemper> |
T280563 `sudo -i cookbook sre.elasticsearch.rolling-operation search_codfw "codfw reboot" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563` on `ryankemper@cumin1001` tmux session `elastic_restarts` (amended command: cluster argument corrected from `codfw` to `search_codfw`)
[production] |
23:06 |
<ryankemper> |
T280563 `sudo -i cookbook sre.elasticsearch.rolling-operation codfw "codfw reboot" --reboot --nodes-per-run 3 --start-datetime 2021-04-29T23:04:29 --task-id T280563` on `ryankemper@cumin1001` tmux session `elastic_restarts` |
[production] |
23:05 |
<ryankemper@cumin1001> |
START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw reboot - ryankemper@cumin1001 - T280563 |
[production] |
22:46 |
<ryankemper> |
T280563 Current master is `relforge1003-relforge-eqiad`; will reboot `relforge1004` first, then `relforge1003`
[production] |
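(Rebooting the non-master node first avoids an extra master election. Which node holds the elected master can be confirmed with the Elasticsearch cat API; a minimal sketch, assuming the default port 9200 on a cluster node:)

```
curl -s 'http://localhost:9200/_cat/master?v'
# the node column should show relforge1003-relforge-eqiad
```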
22:44 |
<ryankemper> |
T280563 Bleh, we never moved the new config into spicerack, so it's trying to talk to the old relforge hosts which no longer exist. Will reboot relforge manually and use the cookbook for codfw/eqiad, and circle back later for the spicerack change |
[production] |
22:37 |
<ryankemper@cumin1001> |
END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563 |
[production] |
22:36 |
<ryankemper@cumin1001> |
START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563 |
[production] |
22:32 |
<ryankemper> |
T280563 Spotted the issue; forgot to set `--without-lvs` for relforge reboot |
[production] |
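(The relforge hosts are not behind LVS, so without `--without-lvs` the cookbook attempts a depool step that cannot work there. The exact amended relforge invocation was not logged; a sketch of what it would presumably look like, by analogy with the codfw commands above, where the cluster argument and start time are assumptions:)

```
sudo -i cookbook sre.elasticsearch.rolling-operation relforge "relforge reboot" \
    --reboot --nodes-per-run 2 --without-lvs \
    --start-datetime 2021-04-29T22:35:00 --task-id T280563
```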
22:27 |
<ryankemper> |
T280563 `urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7fbe4bb8a518>: Failed to establish a new connection: [Errno -2] Name or service not known` |
[production] |
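(`[Errno -2] Name or service not known` is getaddrinfo's error for a hostname that does not resolve, consistent with the decommissioned relforge hosts noted at 22:44. A quick check from the cumin host; the hostname below is hypothetical, since the old names are not in the log:)

```
host relforge1001.eqiad.wmnet
# Host relforge1001.eqiad.wmnet not found: 3(NXDOMAIN)
```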
22:26 |
<ryankemper@cumin1001> |
END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart without plugin upgrade (1 node at a time) for ElasticSearch cluster relforge: relforge restart - ryankemper@cumin1001 - T280563
[production] |
22:26 |
<ryankemper@cumin1001> |
START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 node at a time) for ElasticSearch cluster relforge: relforge restart - ryankemper@cumin1001 - T280563
[production] |
22:21 |
<ryankemper@cumin1001> |
END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563 |
[production] |
22:21 |
<ryankemper@cumin1001> |
START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563 |
[production] |
22:21 |
<ryankemper@cumin1001> |
END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563 |
[production] |
22:20 |
<ryankemper@cumin1001> |
START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563 |
[production] |
21:36 |
<mutante> |
icinga - enabling disabled notifications for assorted an-worker nodes where the mgmt interface had alerts enabled but the actual host didn't
[production] |
21:32 |
<mutante> |
icinga - enabled notifications for checks on ms-backup1001 - they had all been manually disabled, but none of the checks had a status change in 50 days, which suggests turning them back on was forgotten (a common issue with disabling notifications)
[production] |
21:16 |
<mutante> |
backup1001 - sudo check_bacula.py --icinga |
[production] |
20:54 |
<marostegui> |
Stop mysql on tendril for the UTC night; dbtree and tendril will remain down for a few hours T281486
[production] |
20:16 |
<marostegui> |
Restart tendril database - T281486 |
[production] |
20:00 |
<jhuneidi@deploy1002> |
rebuilt and synchronized wikiversions files: all wikis to 1.37.0-wmf.3 refs T278347 |
[production] |
19:46 |
<jhuneidi@deploy1002> |
Synchronized php: group1 wikis to 1.37.0-wmf.3 refs T278347 (duration: 01m 08s) |
[production] |
19:45 |
<jhuneidi@deploy1002> |
rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.3 refs T278347 |
[production] |
19:32 |
<dpifke@deploy1002> |
Finished deploy [performance/navtiming@e7ad939]: Deploy https://gerrit.wikimedia.org/r/c/performance/navtiming/+/683484 (duration: 00m 05s) |
[production] |
19:32 |
<dpifke@deploy1002> |
Started deploy [performance/navtiming@e7ad939]: Deploy https://gerrit.wikimedia.org/r/c/performance/navtiming/+/683484 |
[production] |
19:01 |
<Krinkle> |
graphite1004/2003: prune /var/lib/carbon/whisper/MediaWiki/wanobjectcache/revision_row_1/ (bad data from Sep 2019) |
[production] |
18:59 |
<Krinkle> |
graphite1004/2003: prune /var/lib/carbon/whisper/rl-minify-* (bad data from Aug 2018) |
[production] |
18:58 |
<Krinkle> |
graphite1004/2003: prune /var/lib/carbon/whisper/MediaWiki_ExternalGuidance_init_Google_tr_fr (bad data from Nov 2019) |
[production] |
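(In a stock Graphite setup each metric lives as a `.wsp` whisper file under /var/lib/carbon/whisper, so the three prune operations above presumably amount to deleting those subtrees on each graphite host; a hypothetical sketch:)

```
# remove a bad metric subtree on a graphite host
sudo rm -rv /var/lib/carbon/whisper/MediaWiki/wanobjectcache/revision_row_1/
```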
18:38 |
<krinkle@deploy1002> |
Synchronized php-1.37.0-wmf.1/includes/libs/objectcache/MemcachedBagOStuff.php: I926797a9d494a31, T281480 (duration: 01m 08s) |
[production] |
18:33 |
<mutante> |
LDAP - added mmandere to wmf group (T281344) |
[production] |
18:10 |
<krinkle@deploy1002> |
Synchronized php-1.37.0-wmf.3/includes/libs/objectcache/MemcachedBagOStuff.php: I926797a9d494a31, T281480 (duration: 01m 09s) |
[production] |
17:13 |
<pt1979@cumin2001> |
END (PASS) - Cookbook sre.dns.netbox (exit_code=0) |
[production] |
17:10 |
<pt1979@cumin2001> |
START - Cookbook sre.dns.netbox |
[production] |
17:01 |
<pt1979@cumin2001> |
START - Cookbook sre.dns.netbox |
[production] |
16:29 |
<ryankemper> |
T281498 `sudo -E cumin 'C:role::lvs::balancer' 'sudo run-puppet-agent'` |
[production] |
16:28 |
<liw@deploy1002> |
rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.37.0-wmf.1" |
[production] |
16:27 |
<liw@deploy1002> |
sync-wikiversions aborted: Revert "group[0|1] wikis to [VERSION]" (duration: 00m 01s) |
[production] |
16:22 |
<ryankemper> |
T281498 `ryankemper@wdqs2004:~$ sudo depool` |
[production] |
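(On WMF load-balanced hosts, `depool` is a wrapper that marks the host as not pooled in conftool so pybal stops routing traffic to it; roughly equivalent to something like the following, where the exact object selector is an assumption:)

```
sudo confctl select 'name=wdqs2004.codfw.wmnet' set/pooled=no
```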
16:20 |
<ryankemper> |
T281498 `ryankemper@wdqs2004:~$ sudo run-puppet-agent` |
[production] |
16:18 |
<otto@deploy1002> |
Finished deploy [analytics/refinery@b3c5820] (hadoop-test): update event_sanitized_main allowlist on an-launcher1002 - T273789 (duration: 02m 39s)
[production] |
16:15 |
<otto@deploy1002> |
Started deploy [analytics/refinery@b3c5820] (hadoop-test): update event_sanitized_main allowlist on an-launcher1002 - T273789
[production] |
16:12 |
<papaul> |
powerdown thanos-fe2001 for memory swap |
[production] |
15:44 |
<ryankemper> |
T280382 `sudo -i wmf-auto-reimage-host -p T280382 --new wdqs1004.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` (trying reimaging this host one final time, if this fails again will need to do a deeper investigation into what's going wrong here) |
[production] |
15:43 |
<ryankemper> |
[WDQS] `wdqs2001` is high on update lag but otherwise functioning; will repool when lag is caught up |
[production] |