2022-12-13
§
|
14:03 |
<btullis@cumin1001> |
START - Cookbook sre.hosts.downtime for 3:00:00 on kafka-stretch2001.codfw.wmnet with reason: Accessing BIOS on kafka-stretch2001 |
[production] |
13:59 |
<jayme@deploy1002> |
helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply |
[production] |
13:59 |
<jayme@deploy1002> |
helmfile [eqiad] START helmfile.d/services/sessionstore: apply |
[production] |
13:57 |
<jayme@deploy1002> |
helmfile [codfw] DONE helmfile.d/services/sessionstore: apply |
[production] |
13:57 |
<jayme@deploy1002> |
helmfile [codfw] START helmfile.d/services/sessionstore: apply |
[production] |
13:57 |
<jayme@deploy1002> |
helmfile [staging] DONE helmfile.d/services/sessionstore: apply |
[production] |
13:49 |
<jayme@deploy1002> |
helmfile [staging] START helmfile.d/services/sessionstore: apply |
[production] |
13:49 |
<jayme@deploy1002> |
helmfile [codfw] DONE helmfile.d/services/sessionstore: apply |
[production] |
13:48 |
<jayme@deploy1002> |
helmfile [codfw] START helmfile.d/services/sessionstore: apply |
[production] |
13:28 |
<btullis@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on kafka-stretch2002.codfw.wmnet with reason: Accessing BIOS on kafka-stretch2002 |
[production] |
13:28 |
<btullis@cumin1001> |
START - Cookbook sre.hosts.downtime for 3:00:00 on kafka-stretch2002.codfw.wmnet with reason: Accessing BIOS on kafka-stretch2002 |
[production] |
12:31 |
<claime> |
sessionstore outage being monitored |
[production] |
12:23 |
<claime> |
sessionstore outage, login functions severely impacted |
[production] |
12:07 |
<hashar> |
Gerrit now has CI job results represented in the Checks tab which should be a little nicer. The old HTML result table is gone and replaced by little bubbles representing the state of the builds for the latest patchset. Ref: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/3ULF5NPVC4MSVABZBSXAMDODLZUKFXHS/ |
[production] |
12:00 |
<jynus@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: restart |
[production] |
12:00 |
<jynus@cumin1001> |
START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: restart |
[production] |
11:57 |
<hashar> |
Restarted Gerrit on gerrit1001 |
[production] |
11:55 |
<hashar@deploy1002> |
Finished deploy [gerrit/gerrit@9ef1a16]: Replace CI result table by Checks API plugin - T214068 (duration: 00m 09s) |
[production] |
11:55 |
<hashar@deploy1002> |
Started deploy [gerrit/gerrit@9ef1a16]: Replace CI result table by Checks API plugin - T214068 |
[production] |
11:54 |
<hashar> |
Restarted Gerrit on gerrit2002 (replica) |
[production] |
11:52 |
<hashar@deploy1002> |
Finished deploy [gerrit/gerrit@9ef1a16]: Replace CI result table by Checks API plugin - T214068 (duration: 00m 11s) |
[production] |
11:52 |
<hashar@deploy1002> |
Started deploy [gerrit/gerrit@9ef1a16]: Replace CI result table by Checks API plugin - T214068 |
[production] |
11:42 |
<jmm@cumin2002> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on idp-test1002.wikimedia.org with reason: Various tests which may cause temporary breakage on idp-test.w.o |
[production] |
11:42 |
<jmm@cumin2002> |
START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on idp-test1002.wikimedia.org with reason: Various tests which may cause temporary breakage on idp-test.w.o |
[production] |
11:22 |
<moritzm> |
installing paramiko security updates |
[production] |
11:17 |
<claime> |
Puppet re-enabled on cp::text nodes - T290536 |
[production] |
10:58 |
<jgiannelos@deploy1002> |
Finished deploy [kartotherian/deploy@27ac6d3] (codfw): Increase codfw mirrored traffic to 100% (duration: 01m 40s) |
[production] |
10:57 |
<jgiannelos@deploy1002> |
Started deploy [kartotherian/deploy@27ac6d3] (codfw): Increase codfw mirrored traffic to 100% |
[production] |
10:54 |
<dcausse@deploy1002> |
Finished deploy [wikimedia/discovery/analytics@e988b5e]: Relax sla for the weekly es transfer and subgraph_and_query_metrics (duration: 02m 25s) |
[production] |
10:51 |
<dcausse@deploy1002> |
Started deploy [wikimedia/discovery/analytics@e988b5e]: Relax sla for the weekly es transfer and subgraph_and_query_metrics |
[production] |
10:36 |
<vgutierrez> |
clean up stale prometheus target files in prometheus5001 |
[production] |
10:22 |
<claime> |
puppet run on cp4037 - T290536 |
[production] |
10:21 |
<claime> |
puppet disabled on cp hosts for T290536 |
[production] |
10:01 |
<oblivian@deploy1002> |
helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply |
[production] |
10:00 |
<oblivian@deploy1002> |
helmfile [eqiad] START helmfile.d/services/mw-debug: apply |
[production] |
09:54 |
<moritzm> |
installing libhttp-daemon-perl security updates |
[production] |
09:17 |
<hashar@deploy1002> |
rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.14 refs T320519 |
[production] |
09:07 |
<claime> |
Repooled parse1002.eqiad.wmnet in parsoid service - T324949 |
[production] |
09:05 |
<cgoubert@cumin1001> |
END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for parse1002.eqiad.wmnet |
[production] |
09:05 |
<cgoubert@cumin1001> |
START - Cookbook sre.hosts.remove-downtime for parse1002.eqiad.wmnet |
[production] |
09:01 |
<cgoubert@cumin1001> |
conftool action : set/pooled=no; selector: name=parse1002.eqiad.wmnet |
[production] |
08:58 |
<moritzm> |
installing libpgjava security updates |
[production] |
08:55 |
<moritzm> |
installing xen security updates |
[production] |
08:30 |
<marostegui@cumin1001> |
dbctl commit (dc=all): 'db1206 (re)pooling @ 100%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42683 and previous config saved to /var/cache/conftool/dbconfig/20221213-083019-root.json |
[production] |
08:15 |
<marostegui@cumin1001> |
dbctl commit (dc=all): 'db1206 (re)pooling @ 75%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42682 and previous config saved to /var/cache/conftool/dbconfig/20221213-081514-root.json |
[production] |
08:13 |
<kartik@deploy1002> |
Finished scap: Backport for [[gerrit:867002|Enable Section Translation in Chuvash Wikipedia (T319176)]] (duration: 10m 01s) |
[production] |
08:05 |
<kartik@deploy1002> |
kartik and kartik: Backport for [[gerrit:867002|Enable Section Translation in Chuvash Wikipedia (T319176)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet |
[production] |
08:03 |
<kartik@deploy1002> |
Started scap: Backport for [[gerrit:867002|Enable Section Translation in Chuvash Wikipedia (T319176)]] |
[production] |
08:00 |
<marostegui@cumin1001> |
dbctl commit (dc=all): 'db1206 (re)pooling @ 50%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42681 and previous config saved to /var/cache/conftool/dbconfig/20221213-080009-root.json |
[production] |
07:52 |
<ladsgroup@deploy1002> |
Finished scap: Backport for [[gerrit:867262|Reduce PC writes from parsoid API to 1%]] (duration: 09m 35s) |
[production] |