2021-04-28
§
|
22:26 |
<ryankemper> |
T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1013.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `reimage` |
[production] |
22:26 |
<ryankemper@cumin1001> |
START - Cookbook sre.wdqs.data-transfer |
[production] |
22:23 |
<ryankemper@cumin1001> |
END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) |
[production] |
22:18 |
<ryankemper> |
T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1013.eqiad.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `reimage` |
[production] |
22:18 |
<ryankemper@cumin1001> |
START - Cookbook sre.wdqs.data-transfer |
[production] |
22:18 |
<robh@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE |
[production] |
22:15 |
<robh@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE |
[production] |
21:49 |
<legoktm@deploy1002> |
helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. |
[production] |
21:49 |
<legoktm@deploy1002> |
helmfile [staging-eqiad] START helmfile.d/admin 'apply'. |
[production] |
21:47 |
<legoktm@deploy1002> |
helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. |
[production] |
21:46 |
<robh@cumin1001> |
END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE |
[production] |
21:44 |
<legoktm@deploy1002> |
helmfile [staging-codfw] START helmfile.d/admin 'apply'. |
[production] |
21:44 |
<robh@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE |
[production] |
21:41 |
<ryankemper@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1013.eqiad.wmnet with reason: REIMAGE |
[production] |
21:39 |
<ryankemper@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1013.eqiad.wmnet with reason: REIMAGE |
[production] |
21:39 |
<ryankemper@cumin1001> |
START - Cookbook sre.wdqs.data-transfer |
[production] |
21:39 |
<ryankemper@cumin1001> |
END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) |
[production] |
21:38 |
<ryankemper> |
T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `reimage` |
[production] |
21:37 |
<ryankemper> |
T280382 `wdqs2007` is reachable again; glancing at `/srv/wdqs` its `wikidata.jnl` is `839G` when it should be `975G` so I'll re-do the wikidata journal transfer |
[production] |
21:32 |
<ryankemper> |
T280382 [WDQS] `wdqs2007` ssh is unreachable; power cycling via `racadm>>racadm serveraction powercycle` |
[production] |
21:24 |
<ryankemper> |
T280382 `sudo -i wmf-auto-reimage-host -p T280382 --new wdqs1013.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` (previous reimage timed out, instance appears to have rebooted) |
[production] |
21:11 |
<andrewbogott> |
cleaning up more references to deleted hypervisors with delete from services where topic='compute' and version != 53; |
[admin] |
21:07 |
<robh@cumin1001> |
END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5016.eqsin.wmnet with reason: REIMAGE |
[production] |
21:05 |
<robh@cumin1001> |
END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5015.eqsin.wmnet with reason: REIMAGE |
[production] |
21:04 |
<robh@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on cp5016.eqsin.wmnet with reason: REIMAGE |
[production] |
21:03 |
<robh@cumin1001> |
END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE |
[production] |
21:03 |
<robh@cumin1001> |
END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5014.eqsin.wmnet with reason: REIMAGE |
[production] |
21:01 |
<robh@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE |
[production] |
21:01 |
<robh@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on cp5015.eqsin.wmnet with reason: REIMAGE |
[production] |
21:01 |
<robh@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on cp5014.eqsin.wmnet with reason: REIMAGE |
[production] |
20:48 |
<andrewbogott> |
cleaning up references to deleted hypervisors with mysql:root@localhost [nova_eqiad1]> delete from compute_nodes where hypervisor_version != '5002000'; |
[admin] |
20:00 |
<robh@cumin1001> |
END (PASS) - Cookbook sre.dns.netbox (exit_code=0) |
[production] |
19:57 |
<jhuneidi@deploy1002> |
rebuilt and synchronized wikiversions files: Revert "group1 wikis to 1.37.0-wmf.1" |
[production] |
19:56 |
<robh@cumin1001> |
START - Cookbook sre.dns.netbox |
[production] |
19:40 |
<andrewbogott> |
putting cloudvirt1040 into the maintenance aggregate pending more info about T281399 |
[admin] |
19:13 |
<jhuneidi@deploy1002> |
Synchronized php: group1 wikis to 1.37.0-wmf.3 refs T278347 (duration: 01m 07s) |
[production] |
19:12 |
<jhuneidi@deploy1002> |
rebuilt and synchronized wikiversions files: group1 wikis to 1.37.0-wmf.3 refs T278347 |
[production] |
18:21 |
<legoktm> |
added mvolz as listadmin for services@ and reset admin pw (T278516) |
[production] |
18:11 |
<andrewbogott> |
adding cloudvirt1040, 1041 and 1042 to the 'ceph' host aggregate -- T275081 |
[admin] |
17:46 |
<hnowlan> |
eventlog1003 joined to groups successfully |
[analytics] |
17:36 |
<razzi> |
sudo mkdir /srv/log/eventlogging and sudo chown eventlogging:eventlogging /srv/log/eventlogging to workaround missing directory puppet error (to be puppetized later) |
[analytics] |
17:31 |
<razzi> |
remove deployment cache on eventlogging1003: sudo rm -fr /srv/deployment/eventlogging/analytics-cache/ |
[analytics] |
17:26 |
<razzi> |
manually change /srv/deployment/eventlogging/analytics/.git/DEPLOY_HEAD to deployment1002 on deployment1002 to fix puppet scap error |
[analytics] |
17:11 |
<urbanecm@deploy1002> |
Synchronized php-1.37.0-wmf.3/extensions/Wikibase/client/includes/DataAccess/Scribunto/WikibaseLanguageIndependentLuaBindings.php: b392dba0d77904d7de819043e51d8c3fbf003873: Fix incorrect ItemId typehint in Lua bindings (T281361) (duration: 01m 09s) |
[production] |
16:53 |
<hnowlan> |
stopping deployment-eventlog05 in deployment-prep |
[analytics] |
16:52 |
<papaul> |
powerdown logstash2034 for relocation |
[production] |
16:32 |
<andrew@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1046.eqiad.wmnet with reason: REIMAGE |
[production] |
16:30 |
<andrew@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1045.eqiad.wmnet with reason: REIMAGE |
[production] |
16:29 |
<pt1979@cumin2001> |
END (PASS) - Cookbook sre.dns.netbox (exit_code=0) |
[production] |
16:29 |
<andrew@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1046.eqiad.wmnet with reason: REIMAGE |
[production] |