2021-08-10
|
19:04 |
<cmjohnson@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on dumpsdata1004.eqiad.wmnet with reason: REIMAGE |
[production] |
19:03 |
<cmjohnson@cumin1001> |
START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1023.eqiad.wmnet with reason: REIMAGE |
[production] |
18:49 |
<cmjohnson@cumin1001> |
END (PASS) - Cookbook sre.dns.netbox (exit_code=0) |
[production] |
18:47 |
<ryankemper> |
[WDQS] `ryankemper@wdqs2005:~$ sudo depool` (~1.26 hours of lag) |
[production] |
18:46 |
<cmjohnson@cumin1001> |
START - Cookbook sre.dns.netbox |
[production] |
18:46 |
<ryankemper> |
T288501 (Misread grafana graph, `wdqs2003` only has 1.33 hours to catch up on) |
[production] |
18:45 |
<ryankemper> |
T288501 `data-transfer` of `wikidata.jnl` completed successfully. Host needs to catch up on ~22 hours of WDQS lag before being re-pooled |
[production] |
18:42 |
<ryankemper@cumin2001> |
END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) |
[production] |
17:23 |
<jhuneidi@deploy1002> |
Finished scap: testwikis wikis to 1.37.0-wmf.18 (duration: 36m 35s) |
[production] |
17:19 |
<ryankemper> |
T288501 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2005.codfw.wmnet --dest wdqs2003.codfw.wmnet --reason "transferring fresh wikidata journal to resolve disk issue" --blazegraph_instance blazegraph` on `cumin2001` tmux session `wdqs_data_xfer` |
[production] |
17:19 |
<ryankemper@cumin2001> |
START - Cookbook sre.wdqs.data-transfer |
[production] |
17:18 |
<mbsantos@deploy1002> |
helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . |
[production] |
17:13 |
<ryankemper> |
T288501 [WDQS] `ryankemper@wdqs2003:~$ sudo rm -fv /srv/wdqs/wikidata.jnl` |
[production] |
17:09 |
<razzi@cumin1001> |
END (FAIL) - Cookbook sre.druid.roll-restart-workers (exit_code=99) for Druid analytics cluster: Roll restart of Druid's jvm daemons. - razzi@cumin1001 |
[production] |
17:09 |
<razzi@cumin1001> |
START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid's jvm daemons. - razzi@cumin1001 |
[production] |
17:06 |
<mbsantos@deploy1002> |
helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . |
[production] |
17:02 |
<btullis@cumin1001> |
END (FAIL) - Cookbook sre.druid.roll-restart-workers (exit_code=99) for Druid analytics cluster: Roll restart of Druid's jvm daemons. - btullis@cumin1001 |
[production] |
17:02 |
<btullis@cumin1001> |
START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid's jvm daemons. - btullis@cumin1001 |
[production] |
17:01 |
<mbsantos@deploy1002> |
helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . |
[production] |
16:49 |
<btullis@cumin1001> |
END (FAIL) - Cookbook sre.druid.roll-restart-workers (exit_code=99) for Druid analytics cluster: Roll restart of Druid's jvm daemons. - btullis@cumin1001 |
[production] |
16:49 |
<btullis@cumin1001> |
START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid's jvm daemons. - btullis@cumin1001 |
[production] |
16:47 |
<jhuneidi@deploy1002> |
Started scap: testwikis wikis to 1.37.0-wmf.18 |
[production] |
16:36 |
<ebernhardson@deploy1002> |
Finished deploy [wikimedia/discovery/analytics@d3c5363]: T287225: Bump rdf-spark-tools to 0.3.81 (duration: 02m 10s) |
[production] |
16:34 |
<ebernhardson@deploy1002> |
Started deploy [wikimedia/discovery/analytics@d3c5363]: T287225: Bump rdf-spark-tools to 0.3.81 |
[production] |
16:33 |
<btullis@cumin1001> |
END (FAIL) - Cookbook sre.druid.roll-restart-workers (exit_code=99) for Druid analytics cluster: Roll restart of Druid's jvm daemons. - btullis@cumin1001 |
[production] |
16:33 |
<btullis@cumin1001> |
START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid's jvm daemons. - btullis@cumin1001 |
[production] |
16:25 |
<brennen> |
gitlab: run ansible to apply [[gerrit:710676|fix shell for backup cronjob]] (T288324) |
[production] |
16:01 |
<moritzm> |
installing c-ares security updates on buster |
[production] |
14:48 |
<ladsgroup@deploy1002> |
Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:710515|Reduce ten seconds from dispatch max time (T288175)]] (duration: 00m 58s) |
[production] |
13:32 |
<moritzm> |
updating bullseye installations to the latest state of testing |
[production] |
13:19 |
<moritzm> |
installing perl security updates on Bullseye (older distros not affected) |
[production] |
13:00 |
<jayme@deploy1002> |
helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . |
[production] |
12:54 |
<ppchelko@deploy1002> |
Finished deploy [restbase/deploy@5791a7a]: Add count parameter to recommendations API T287227 (duration: 37m 18s) |
[production] |
12:42 |
<lucaswerkmeister-wmde@deploy1002> |
Synchronized tests/multiversion/StaticSettingsTest.php: Config: [[gerrit:709504|Remove wmgWBRepoConceptBaseUri (T257260)]] (3/3, test) (duration: 00m 57s) |
[production] |
12:41 |
<lucaswerkmeister-wmde@deploy1002> |
Synchronized wmf-config/InitialiseSettings-labs.php: Config: [[gerrit:709504|Remove wmgWBRepoConceptBaseUri (T257260)]] (2/3, beta) (duration: 00m 57s) |
[production] |
12:39 |
<lucaswerkmeister-wmde@deploy1002> |
Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:709504|Remove wmgWBRepoConceptBaseUri (T257260)]] (1/3, prod) (duration: 00m 57s) |
[production] |
12:36 |
<lucaswerkmeister-wmde@deploy1002> |
Synchronized wmf-config/Wikibase.php: Config: [[gerrit:709503|Stop setting $wgWBRepoSettings['conceptBaseUri'] (T257260)]] (duration: 00m 58s) |
[production] |
12:23 |
<kormat> |
non-destructive (🤞) testing of db-switchover against s2/eqiad T288500 |
[production] |
12:17 |
<ppchelko@deploy1002> |
Started deploy [restbase/deploy@5791a7a]: Add count parameter to recommendations API T287227 |
[production] |
11:27 |
<dzahn@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue |
[production] |
11:27 |
<dzahn@cumin1001> |
START - Cookbook sre.hosts.downtime for 5 days, 8:00:00 on planet1002.eqiad.wmnet with reason: known issue |
[production] |
10:56 |
<marostegui> |
Install 10.4.21 on db1169 (s1) |
[production] |
10:54 |
<jayme@deploy1002> |
helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . |
[production] |
10:53 |
<mutante> |
etherpad deleting 2 pads as requested in T288328 |
[production] |
10:52 |
<marostegui> |
Install 10.4.21 on db1096 (s5 and s6) |
[production] |
10:34 |
<elukey@deploy1002> |
helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. |
[production] |
10:34 |
<elukey@deploy1002> |
helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. |
[production] |
10:33 |
<elukey@deploy1002> |
helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. |
[production] |
10:33 |
<elukey@deploy1002> |
helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. |
[production] |
10:28 |
<oblivian@deploy1002> |
helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . |
[production] |