production SAL

2021-04-29
22:27	<ryankemper>	T280563 `urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7fbe4bb8a518>: Failed to establish a new connection: [Errno -2] Name or service not known`	[production]
22:26	<ryankemper@cumin1001>	END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart - ryankemper@cumin1001 - T280563	[production]
22:26	<ryankemper@cumin1001>	START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge restart - ryankemper@cumin1001 - T280563	[production]
22:21	<ryankemper@cumin1001>	END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563	[production]
22:21	<ryankemper@cumin1001>	START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563	[production]
22:21	<ryankemper@cumin1001>	END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563	[production]
22:20	<ryankemper@cumin1001>	START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563	[production]
21:36	<mutante>	icinga - enabling disabled notifications for random an-worker nodes where the mgmt interface had enabled alerts but the actual host didn't	[production]
21:32	<mutante>	icinga - enabled notifications for checks on ms-backup1001 - they were all manually disabled, but none of the checks had any status change in 50 days, which indicates someone forgot to turn them back on (a common issue with disabling notifications)	[production]
21:16	<mutante>	backup1001 - sudo check_bacula.py --icinga	[production]
20:54	<marostegui>	Stop mysql on tendril for the UTC night; dbtree and tendril will remain down for a few hours - T281486	[production]
20:16	<marostegui>	Restart tendril database - T281486	[production]
20:00	<jhuneidi@deploy1002>	rebuilt and synchronized wikiversions files: all wikis to 1.37.0-wmf.3 refs T278347	[production]
19:46	<jhuneidi@deploy1002>	Synchronized php: group1 wikis to 1.37.0-wmf.3 refs T278347 (duration: 01m 08s)	[production]
19:45	<jhuneidi@deploy1002>	rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.3 refs T278347	[production]
19:32	<dpifke@deploy1002>	Finished deploy [performance/navtiming@e7ad939]: Deploy https://gerrit.wikimedia.org/r/c/performance/navtiming/+/683484 (duration: 00m 05s)	[production]
19:32	<dpifke@deploy1002>	Started deploy [performance/navtiming@e7ad939]: Deploy https://gerrit.wikimedia.org/r/c/performance/navtiming/+/683484	[production]
19:01	<Krinkle>	graphite1004/2003: prune /var/lib/carbon/whisper/MediaWiki/wanobjectcache/revision_row_1/ (bad data from Sep 2019)	[production]
18:59	<Krinkle>	graphite1004/2003: prune /var/lib/carbon/whisper/rl-minify-* (bad data from Aug 2018)	[production]
18:58	<Krinkle>	graphite1004/2003: prune /var/lib/carbon/whisper/MediaWiki_ExternalGuidance_init_Google_tr_fr (bad data from Nov 2019)	[production]
18:38	<krinkle@deploy1002>	Synchronized php-1.37.0-wmf.1/includes/libs/objectcache/MemcachedBagOStuff.php: I926797a9d494a31, T281480 (duration: 01m 08s)	[production]
18:33	<mutante>	LDAP - added mmandere to wmf group (T281344)	[production]
18:10	<krinkle@deploy1002>	Synchronized php-1.37.0-wmf.3/includes/libs/objectcache/MemcachedBagOStuff.php: I926797a9d494a31, T281480 (duration: 01m 09s)	[production]
17:13	<pt1979@cumin2001>	END (PASS) - Cookbook sre.dns.netbox (exit_code=0)	[production]
17:10	<pt1979@cumin2001>	START - Cookbook sre.dns.netbox	[production]
17:01	<pt1979@cumin2001>	START - Cookbook sre.dns.netbox	[production]
16:29	<ryankemper>	T281498 `sudo -E cumin 'C:role::lvs::balancer' 'sudo run-puppet-agent'`	[production]
16:28	<liw@deploy1002>	rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.37.0-wmf.1"	[production]
16:27	<liw@deploy1002>	sync-wikiversions aborted: Revert "group[0|1] wikis to [VERSION]" (duration: 00m 01s)	[production]
16:22	<ryankemper>	T281498 `ryankemper@wdqs2004:~$ sudo depool`	[production]
16:20	<ryankemper>	T281498 `ryankemper@wdqs2004:~$ sudo run-puppet-agent`	[production]
16:18	<otto@deploy1002>	Finished deploy [analytics/refinery@b3c5820] (hadoop-test): update event_sanitized_main allowlist on an-launcher1002 - T273789 (duration: 02m 39s)	[production]
16:15	<otto@deploy1002>	Started deploy [analytics/refinery@b3c5820] (hadoop-test): update event_sanitized_main allowlist on an-launcher1002 - T273789	[production]
16:12	<papaul>	powerdown thanos-fe2001 for memory swap	[production]
15:44	<ryankemper>	T280382 `sudo -i wmf-auto-reimage-host -p T280382 --new wdqs1004.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` (trying reimaging this host one final time, if this fails again will need to do a deeper investigation into what's going wrong here)	[production]
15:43	<ryankemper>	[WDQS] `wdqs2001` is high on update lag but otherwise functioning; will repool when lag is caught up	[production]
15:37	<ryankemper>	[WDQS] `sudo systemctl restart wdqs-blazegraph` && `sudo systemctl restart wdqs-updater` on `wdqs2001`	[production]
15:35	<ryankemper>	[WDQS] ^ scratch that, depooled `wdqs2001`	[production]
15:34	<ryankemper>	[WDQS] pooled `wdqs2001`	[production]
14:35	<hnowlan@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on eventlog[1002-1003].eqiad.wmnet with reason: eventlog1003 migration	[production]
14:35	<hnowlan@cumin1001>	START - Cookbook sre.hosts.downtime for 1:00:00 on eventlog[1002-1003].eqiad.wmnet with reason: eventlog1003 migration	[production]
13:44	<moritzm>	installing Java security updates on stat* hosts	[production]
13:43	<hnowlan@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on eventlog1003.eqiad.wmnet with reason: eventlog1003 migration	[production]
13:43	<hnowlan@cumin1001>	START - Cookbook sre.hosts.downtime for 1:00:00 on eventlog1003.eqiad.wmnet with reason: eventlog1003 migration	[production]
13:42	<hnowlan@cumin1001>	END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on eventlog1002.eqiad.wmnet with reason: eventlog1003 migration	[production]
13:42	<hnowlan@cumin1001>	START - Cookbook sre.hosts.downtime for 1:00:00 on eventlog1002.eqiad.wmnet with reason: eventlog1003 migration	[production]
13:40	<otto@deploy1002>	Finished deploy [analytics/refinery@b3c5820]: update event_sanitized_main allowlist on an-launcher1002 - T273789 (duration: 02m 59s)	[production]
13:37	<otto@deploy1002>	Started deploy [analytics/refinery@b3c5820]: update event_sanitized_main allowlist on an-launcher1002 - T273789	[production]
13:11	<moritzm>	installing postgresql-11 security updates	[production]
13:08	<jbond42>	merge netbase change to manage /etc/services	[production]