2021-10-11 §
17:08 <elukey> force kafka preferred-replica-election on kafka-main2001 after another batch of topic partitions moves - T288825 [production]
15:40 <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet [production]
15:34 <jmm@cumin2002> START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet [production]
15:31 <jgleeson> smashpig updated from 3607b16f83 to dd3a81c7c2 [production]
14:59 <jmm@cumin2002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm[2001-2002,2005].codfw.wmnet with reason: Ganeti tests [production]
14:59 <jmm@cumin2002> START - Cookbook sre.hosts.downtime for 2:00:00 on testvm[2001-2002,2005].codfw.wmnet with reason: Ganeti tests [production]
14:36 <Emperor> start restoring weight to ms-be2045 T290881 [production]
13:42 <elukey> force kafka preferred-replica-election on kafka-main2001 after another batch of topic partitions moves - T288825 [production]
12:53 <moritzm> install apache security updates on buster [production]
12:49 <topranks> Setting up BGP peering to AS12552 (GlobalConnect Group) at AMS-IX on cr2-esams [production]
12:45 <ema> cp4027: upgrade varnish to 6.0.8 T292290 [production]
12:04 <moritzm> install apache security updates on bullseye [production]
10:23 <filippo@cumin1001> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host graphite2003.codfw.wmnet [production]
09:50 <filippo@cumin1001> START - Cookbook sre.hosts.reimage for host graphite2003.codfw.wmnet [production]
09:45 <filippo@cumin1001> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host graphite2003.codfw.wmnet [production]
09:37 <elukey> force kafka preferred-replica-election on kafka-main2001 after another batch of topic partitions moves - T288825 [production]
09:13 <filippo@cumin1001> START - Cookbook sre.hosts.reimage for host graphite2003.codfw.wmnet [production]
09:09 <elukey> force kafka preferred-replica-election on kafka-main2001 after the first 50 topic partitions moves - T288825 [production]
09:05 <volans@cumin2002> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1002.eqiad.wmnet [production]
09:01 <godog> bounce swift-object-replicator on ms-be2036 [production]
08:52 <godog> bounce statsite on graphite1004 to apply unit config changes [production]
08:48 <volans@cumin1001> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet [production]
08:41 <volans@cumin2002> START - Cookbook sre.hosts.reimage for host sretest1002.eqiad.wmnet [production]
08:38 <moritzm> updated buster d-i image for Bullseye 11.1 point release T292844 [production]
08:38 <moritzm> updated buster d-i image for Buster 10.11 point release T292838 [production]
08:26 <godog> swift eqiad-prod: final weight to ms-be10[64-67] - T290546 [production]
08:25 <moritzm> updated buster d-i image for Buster 10.11 point release T292838 [production]
08:24 <volans@cumin1001> START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet [production]
08:06 <godog> bounce uwsgi on graphite hosts to bump request size limit - T292877 [production]
07:58 <volans> migrating physical hosts DHCP to the new reimage process - T269855 [production]
07:57 <elukey> start kafka topics rebalancing for main-codfw (long running maintenance) - T288825 [production]
2021-10-09 §
05:01 <jiji@deploy1002> helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [production]
04:28 <jiji@deploy1002> helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [production]
01:32 <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - ryankemper@cumin1001 - T292814 [production]
00:46 <mutante> ms-be2045 - started systemd-timedated which had been killed by something [production]
00:28 <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - ryankemper@cumin1001 - T292814 [production]
00:24 <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.force-unfreeze (exit_code=99) [production]
00:23 <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.force-unfreeze [production]
00:13 <ryankemper> T292814 Write queue stuck at 133 events in partition 1 of topic `codfw.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite`, will try again at another time [production]
00:12 <ryankemper@cumin1001> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - ryankemper@cumin1001 - T292814 [production]
2021-10-08 §
23:16 <legoktm> sudo cumin -b 10 C:mediawiki::packages 'apt-get purge lilypond-data -y' [production]
23:10 <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic restart - ryankemper@cumin1001 - T292814 [production]
21:38 <mutante> mwmaint2002 - disable-puppet, stop bacula-fd, recovery in progress [production]
21:34 <mutante> disabling puppet on bacula - going through a restore https://wikitech.wikimedia.org/wiki/Bacula#Restore_from_a_non-existent_host_(missing_private_key) [production]
21:30 <legoktm> running puppet across C:mediawiki::packages to uninstall lilypond and ploticus: legoktm@cumin1001:~$ sudo cumin -b 4 C:mediawiki::packages 'run-puppet-agent' [production]
20:12 <cmjohnson@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1018.eqiad.wmnet with reason: REIMAGE [production]
20:10 <cmjohnson@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage1004.eqiad.wmnet with reason: REIMAGE [production]
20:08 <cmjohnson@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1018.eqiad.wmnet with reason: REIMAGE [production]
20:08 <cmjohnson@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage1003.eqiad.wmnet with reason: REIMAGE [production]
20:06 <cmjohnson@cumin1001> START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage1004.eqiad.wmnet with reason: REIMAGE [production]