2021-04-28
22:15 <robh@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE [production]
21:49 <legoktm@deploy1002> helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [production]
21:49 <legoktm@deploy1002> helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [production]
21:47 <legoktm@deploy1002> helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [production]
21:46 <robh@cumin1001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE [production]
21:44 <legoktm@deploy1002> helmfile [staging-codfw] START helmfile.d/admin 'apply'. [production]
21:44 <robh@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE [production]
21:41 <ryankemper@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1013.eqiad.wmnet with reason: REIMAGE [production]
21:39 <ryankemper@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1013.eqiad.wmnet with reason: REIMAGE [production]
21:39 <ryankemper@cumin1001> START - Cookbook sre.wdqs.data-transfer [production]
21:39 <ryankemper@cumin1001> END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) [production]
21:38 <ryankemper> T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `reimage` [production]
21:37 <ryankemper> T280382 `wdqs2007` is reachable again; glancing at `/srv/wdqs` its `wikidata.jnl` is `839G` when it should be `975G` so I'll re-do the wikidata journal transfer [production]
21:32 <ryankemper> T280382 [WDQS] `wdqs2007` ssh is unreachable; power cycling via `racadm>>racadm serveraction powercycle` [production]
21:24 <ryankemper> T280382 `sudo -i wmf-auto-reimage-host -p T280382 --new wdqs1013.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` (previous reimage timed out, instance appears to have rebooted) [production]
21:11 <andrewbogott> cleaning up more references to deleted hypervisors with delete from services where topic='compute' and version != 53; [admin]
21:07 <robh@cumin1001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5016.eqsin.wmnet with reason: REIMAGE [production]
21:05 <robh@cumin1001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5015.eqsin.wmnet with reason: REIMAGE [production]
21:04 <robh@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on cp5016.eqsin.wmnet with reason: REIMAGE [production]
21:03 <robh@cumin1001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE [production]
21:03 <robh@cumin1001> END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5014.eqsin.wmnet with reason: REIMAGE [production]
21:01 <robh@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE [production]
21:01 <robh@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on cp5015.eqsin.wmnet with reason: REIMAGE [production]
21:01 <robh@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on cp5014.eqsin.wmnet with reason: REIMAGE [production]
20:48 <andrewbogott> cleaning up references to deleted hypervisors with mysql:root@localhost [nova_eqiad1]> delete from compute_nodes where hypervisor_version != '5002000'; [admin]
20:00 <robh@cumin1001> END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [production]
19:57 <jhuneidi@deploy1002> rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.37.0-wmf.1" [production]
19:56 <robh@cumin1001> START - Cookbook sre.dns.netbox [production]
19:40 <andrewbogott> putting cloudvirt1040 into the maintenance aggregate pending more info about T281399 [admin]
19:13 <jhuneidi@deploy1002> Synchronized php: group1 wikis to 1.37.0-wmf.3 refs T278347 (duration: 01m 07s) [production]
19:12 <jhuneidi@deploy1002> rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.3 refs T278347 [production]
18:21 <legoktm> added mvolz as listadmin for services@ and reset admin pw (T278516) [production]
18:11 <andrewbogott> adding cloudvirt1040, 1041 and 1042 to the 'ceph' host aggregate -- T275081 [admin]
17:46 <hnowlan> eventlog1003 joined to groups successfully [analytics]
17:36 <razzi> sudo mkdir /srv/log/eventlogging and sudo chown eventlogging:eventlogging /srv/log/eventlogging to workaround missing directory puppet error (to be puppetized later) [analytics]
17:31 <razzi> remove deployment cache on eventlogging1003: sudo rm -fr /srv/deployment/eventlogging/analytics-cache/ [analytics]
17:26 <razzi> manually change /srv/deployment/eventlogging/analytics/.git/DEPLOY_HEAD to deployment1002 on deployment1002 to fix puppet scap error [analytics]
17:11 <urbanecm@deploy1002> Synchronized php-1.37.0-wmf.3/extensions/Wikibase/client/includes/DataAccess/Scribunto/WikibaseLanguageIndependentLuaBindings.php: b392dba0d77904d7de819043e51d8c3fbf003873: Fix incorrect ItemId typehint in Lua bindings (T281361) (duration: 01m 09s) [production]
16:53 <hnowlan> stopping deployment-eventlog05 in deployment-prep [analytics]
16:52 <papaul> powerdown logstash2034 for relocation [production]
16:32 <andrew@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1046.eqiad.wmnet with reason: REIMAGE [production]
16:30 <andrew@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1045.eqiad.wmnet with reason: REIMAGE [production]
16:29 <pt1979@cumin2001> END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [production]
16:29 <andrew@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1046.eqiad.wmnet with reason: REIMAGE [production]
16:28 <andrew@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1044.eqiad.wmnet with reason: REIMAGE [production]
16:27 <andrew@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1045.eqiad.wmnet with reason: REIMAGE [production]
16:27 <pt1979@cumin2001> START - Cookbook sre.dns.netbox [production]
16:26 <andrew@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1043.eqiad.wmnet with reason: REIMAGE [production]
16:25 <andrew@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1044.eqiad.wmnet with reason: REIMAGE [production]
16:24 <andrew@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1042.eqiad.wmnet with reason: REIMAGE [production]