2021-10-11
12:45 <ema> cp4027: upgrade varnish to 6.0.8 T292290 [production]
12:14 <wm-bot> <lucaswerkmeister> deployed fb32d04132 (l10n updates) [tools.lexeme-forms]
12:04 <moritzm> install apache security updates on bullseye [production]
10:32 <wm-bot> Adding a new grid webgrid generic node (T292465) - cookbook ran by dcaro@vulcanus [toolsbeta]
10:23 <filippo@cumin1001> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host graphite2003.codfw.wmnet [production]
09:50 <filippo@cumin1001> START - Cookbook sre.hosts.reimage for host graphite2003.codfw.wmnet [production]
09:45 <filippo@cumin1001> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host graphite2003.codfw.wmnet [production]
09:37 <elukey> force kafka preferred-replica-election on kafka-main2001 after another batch of topic partitions moves - T288825 [production]
09:13 <filippo@cumin1001> START - Cookbook sre.hosts.reimage for host graphite2003.codfw.wmnet [production]
09:09 <elukey> force kafka preferred-replica-election on kafka-main2001 after the first 50 topic partitions moves - T288825 [production]
09:05 <volans@cumin2002> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1002.eqiad.wmnet [production]
09:01 <godog> bounce swift-object-replicator on ms-be2036 [production]
08:52 <godog> bounce statsite on graphite1004 to apply unit config changes [production]
08:48 <volans@cumin1001> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet [production]
08:41 <volans@cumin2002> START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet [production]
08:38 <moritzm> updated buster d-i image for Bullseye 11.1 point release T292844 [production]
08:38 <moritzm> updated buster d-i image for Buster 10.11 point release T292838 [production]
08:26 <godog> swift eqiad-prod: final weight to ms-be10[64-67] - T290546 [production]
08:25 <moritzm> updated buster d-i image for Buster 10.11 point release T292838 [production]
08:24 <volans@cumin1001> START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet [production]
08:06 <godog> bounce uwsgi on graphite hosts to bump request size limit - T292877 [production]
07:58 <volans> migrating physical hosts DHCP to the new reimage process - T269855 [production]
07:57 <elukey> start kafka topics rebalancing for main-codfw (long running maintenance) - T288825 [production]
07:37 <joal> rerun refine_event for `event`.`mediawiki_content_translation_event` year=2021/month=10/day=10/hour=16 [analytics]
02:08 <Krinkle> Browsers fail to connect with https://upload.wikimedia.beta.wmflabs.org/ (Certificate expired two days ago) [releng]
2021-10-09
05:01 <jiji@deploy1002> helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [production]
04:28 <jiji@deploy1002> helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [production]
01:32 <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - ryankemper@cumin1001 - T292814 [production]
00:46 <mutante> ms-be2045 - started systemd-timedated which had been killed by something [production]
00:28 <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - ryankemper@cumin1001 - T292814 [production]
00:24 <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.force-unfreeze (exit_code=99) [production]
00:23 <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.force-unfreeze [production]
00:18 <DeusExMachina> Deleted mars-01 per T292884 [wikisp]
00:13 <ryankemper> T292814 Write queue stuck at 133 events in partition 1 of topic `codfw.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite`, will try again at another time [production]
00:12 <ryankemper@cumin1001> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - ryankemper@cumin1001 - T292814 [production]
2021-10-08
23:16 <legoktm> sudo cumin -b 10 C:mediawiki::packages 'apt-get purge lilypond-data -y' [production]
23:10 <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - ryankemper@cumin1001 - T292814 [production]
23:00 <wm-bot> <root> root forced the deletion of job 913897 [tools.gerrit-reviewer-bot]
22:55 <wm-bot> <root> Restarting in the hope of fixing LE cert issues [tools.gerrit-reviewer-bot]
21:38 <mutante> mwmaint2002 - disable-puppet, stop bacula-fd, recovery in progress [production]
21:34 <mutante> disabling puppet on bacula - going through a restore https://wikitech.wikimedia.org/wiki/Bacula#Restore_from_a_non-existent_host_(missing_private_key) [production]
21:30 <legoktm> running puppet across C:mediawiki::packages to uninstall lilypond and ploticus: legoktm@cumin1001:~$ sudo cumin -b 4 C:mediawiki::packages 'run-puppet-agent' [production]
21:17 <hashar> Purging Docker images on all CI agents [releng]
20:12 <cmjohnson@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1018.eqiad.wmnet with reason: REIMAGE [production]
20:10 <cmjohnson@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage1004.eqiad.wmnet with reason: REIMAGE [production]
20:08 <cmjohnson@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1018.eqiad.wmnet with reason: REIMAGE [production]
20:08 <cmjohnson@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage1003.eqiad.wmnet with reason: REIMAGE [production]
20:06 <cmjohnson@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage1004.eqiad.wmnet with reason: REIMAGE [production]