2021-04-29
22:37 <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563 [production]
22:36 <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563 [production]
22:32 <ryankemper> T280563 Spotted the issue; forgot to set `--without-lvs` for relforge reboot [production]
22:27 <ryankemper> T280563 `urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7fbe4bb8a518>: Failed to establish a new connection: [Errno -2] Name or service not known` [production]
22:26 <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart without plugin upgrade (1 node at a time) for ElasticSearch cluster relforge: relforge restart - ryankemper@cumin1001 - T280563 [production]
22:26 <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 node at a time) for ElasticSearch cluster relforge: relforge restart - ryankemper@cumin1001 - T280563 [production]
22:21 <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563 [production]
22:21 <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563 [production]
22:21 <ryankemper@cumin1001> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563 [production]
22:20 <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation reboot without plugin upgrade (2 nodes at a time) for ElasticSearch cluster relforge: relforge reboot - ryankemper@cumin1001 - T280563 [production]
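[The 22:20–22:37 relforge entries above show the rolling-operation cookbook failing until `--without-lvs` is added: relforge is not behind LVS, so the cookbook's attempt to resolve an LVS endpoint fails with `Name or service not known`. A minimal sketch of the corrected invocation follows; everything except `--without-lvs`, the cluster name, and the task ID is an assumption about the cookbook's CLI, not taken from the log.]

    # Sketch only: rerun the rolling reboot while skipping the LVS
    # depool/repool steps, since relforge has no LVS service.
    # Flag spellings besides --without-lvs are assumptions, not verified
    # against the cookbook's argument parser.
    sudo cookbook sre.elasticsearch.rolling-operation \
        --without-lvs \
        --task-id T280563 \
        relforge 'relforge reboot'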
21:36 <mutante> icinga - enabling disabled notifications for random an-worker nodes where the mgmt interface had alerts enabled but the actual host didn't [production]
21:32 <mutante> icinga - enabled notifications for checks on ms-backup1001 - they were all manually disabled, but none of the checks had had a status change in 50 days, which suggests they were forgotten and never re-enabled; a common issue with disabling notifications [production]
21:16 <mutante> backup1001 - sudo check_bacula.py --icinga [production]
20:54 <marostegui> Stop mysql on tendril for the UTC night; dbtree and tendril will remain down for a few hours T281486 [production]
20:16 <marostegui> Restart tendril database - T281486 [production]
20:00 <jhuneidi@deploy1002> rebuilt and synchronized wikiversions files: all wikis to 1.37.0-wmf.3 refs T278347 [production]
19:46 <jhuneidi@deploy1002> Synchronized php: group1 wikis to 1.37.0-wmf.3 refs T278347 (duration: 01m 08s) [production]
19:45 <jhuneidi@deploy1002> rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.3 refs T278347 [production]
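[The 19:45–20:00 entries above are the weekly train promotion: group1 wikis first, then all wikis, each via a wikiversions rebuild and sync. A minimal sketch of the group1 step follows; the `sync-wikiversions` subcommand and the log message are grounded in the entries above, while the preceding edit of wikiversions.json is assumed.]

    # Sketch: after pointing group1 wikis at the new branch in
    # wikiversions.json (step assumed, not in the log), push the
    # regenerated wikiversions files to the cluster.
    scap sync-wikiversions 'group1 wikis to 1.37.0-wmf.3 refs T278347'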
19:32 <dpifke@deploy1002> Finished deploy [performance/navtiming@e7ad939]: Deploy https://gerrit.wikimedia.org/r/c/performance/navtiming/+/683484 (duration: 00m 05s) [production]
19:32 <dpifke@deploy1002> Started deploy [performance/navtiming@e7ad939]: Deploy https://gerrit.wikimedia.org/r/c/performance/navtiming/+/683484 [production]
19:01 <Krinkle> graphite1004/2003: prune /var/lib/carbon/whisper/MediaWiki/wanobjectcache/revision_row_1/ (bad data from Sep 2019) [production]
18:59 <Krinkle> graphite1004/2003: prune /var/lib/carbon/whisper/rl-minify-* (bad data from Aug 2018) [production]
18:58 <Krinkle> graphite1004/2003: prune /var/lib/carbon/whisper/MediaWiki_ExternalGuidance_init_Google_tr_fr (bad data from Nov 2019) [production]
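[The three prune entries above delete whisper metric trees that had accumulated bad data; the log records only the paths, not the command, so this removal sketch is entirely an assumption.]

    # Sketch (command assumed; only the path is from the log): remove an
    # abandoned metric tree on graphite1004/2003 so the bad series stops
    # rendering and the disk space is reclaimed.
    sudo rm -rf /var/lib/carbon/whisper/MediaWiki/wanobjectcache/revision_row_1/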
18:38 <krinkle@deploy1002> Synchronized php-1.37.0-wmf.1/includes/libs/objectcache/MemcachedBagOStuff.php: I926797a9d494a31, T281480 (duration: 01m 08s) [production]
18:33 <mutante> LDAP - added mmandere to wmf group (T281344) [production]
18:10 <krinkle@deploy1002> Synchronized php-1.37.0-wmf.3/includes/libs/objectcache/MemcachedBagOStuff.php: I926797a9d494a31, T281480 (duration: 01m 09s) [production]
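[The two `MemcachedBagOStuff.php` entries (18:10 and 18:38) are single-file backport syncs, one per active train branch. A minimal sketch of one of them follows; the path and the commit/task message are from the log, and the invocation uses scap's standard `sync-file` form.]

    # Sketch: sync one backported file to the cluster with a log message
    # naming the change (I926797a9d494a31) and the task (T281480).
    scap sync-file php-1.37.0-wmf.3/includes/libs/objectcache/MemcachedBagOStuff.php \
        'I926797a9d494a31, T281480'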
17:13 <pt1979@cumin2001> END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [production]
17:10 <pt1979@cumin2001> START - Cookbook sre.dns.netbox [production]
17:01 <pt1979@cumin2001> START - Cookbook sre.dns.netbox [production]
16:29 <ryankemper> T281498 `sudo -E cumin 'C:role::lvs::balancer' 'sudo run-puppet-agent'` [production]
16:28 <liw@deploy1002> rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.37.0-wmf.1" [production]
16:27 <liw@deploy1002> sync-wikiversions aborted: Revert "group[0|1] wikis to [VERSION]" (duration: 00m 01s) [production]
16:22 <ryankemper> T281498 `ryankemper@wdqs2004:~$ sudo depool` [production]
16:20 <ryankemper> T281498 `ryankemper@wdqs2004:~$ sudo run-puppet-agent` [production]
16:18 <otto@deploy1002> Finished deploy [analytics/refinery@b3c5820] (hadoop-test): update event_sanitized_main allowlist on an-launcher1002 - T273789 (duration: 02m 39s) [production]
16:15 <otto@deploy1002> Started deploy [analytics/refinery@b3c5820] (hadoop-test): update event_sanitized_main allowlist on an-launcher1002 - T273789 [production]
16:12 <papaul> powerdown thanos-fe2001 for memory swap [production]
15:44 <ryankemper> T280382 `sudo -i wmf-auto-reimage-host -p T280382 --new wdqs1004.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` (trying to reimage this host one final time; if this fails again, a deeper investigation into what's going wrong will be needed) [production]
15:43 <ryankemper> [WDQS] `wdqs2001` is high on update lag but otherwise functioning; will repool when lag is caught up [production]
15:37 <ryankemper> [WDQS] `sudo systemctl restart wdqs-blazegraph` && `sudo systemctl restart wdqs-updater` on `wdqs2001` [production]
15:35 <ryankemper> [WDQS] ^ scratch that, depooled `wdqs2001` [production]
15:34 <ryankemper> [WDQS] pooled `wdqs2001` [production]
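[The 15:34–15:44 `wdqs2001` entries above form one depool/restart/repool cycle, read bottom-up: pool, immediately depool instead, restart both services, then wait out the update lag. As a sequence run on the host, per the commands logged above; only the lag-monitoring step is assumed, since the log just says the host will be repooled once lag catches up.]

    # Sketch of the wdqs2001 cycle from the log:
    sudo depool                             # take the host out of rotation
    sudo systemctl restart wdqs-blazegraph  # restart the Blazegraph backend
    sudo systemctl restart wdqs-updater     # restart the updater
    # ...wait for update lag to recover (monitoring step assumed)...
    sudo pool                               # return the host to rotation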
14:35 <hnowlan@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on eventlog[1002-1003].eqiad.wmnet with reason: eventlog1003 migration [production]
14:35 <hnowlan@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on eventlog[1002-1003].eqiad.wmnet with reason: eventlog1003 migration [production]
13:44 <moritzm> installing Java security updates on stat* hosts [production]
13:43 <hnowlan@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on eventlog1003.eqiad.wmnet with reason: eventlog1003 migration [production]
13:43 <hnowlan@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on eventlog1003.eqiad.wmnet with reason: eventlog1003 migration [production]
13:42 <hnowlan@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on eventlog1002.eqiad.wmnet with reason: eventlog1003 migration [production]
13:42 <hnowlan@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on eventlog1002.eqiad.wmnet with reason: eventlog1003 migration [production]
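[The downtime entries above are the `sre.hosts.downtime` cookbook silencing Icinga for the eventlog hosts during the migration, first one host at a time and then both together. A minimal sketch of the combined invocation follows; the duration, reason, and host query are from the log, while the flag spellings are assumptions.]

    # Sketch (flag names assumed): downtime both eventlog hosts for one hour.
    sudo cookbook sre.hosts.downtime --hours 1 \
        --reason 'eventlog1003 migration' \
        'eventlog[1002-1003].eqiad.wmnet'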
13:40 <otto@deploy1002> Finished deploy [analytics/refinery@b3c5820]: update event_sanitized_main allowlist on an-launcher1002 - T273789 (duration: 02m 59s) [production]