2024-08-20
06:36 <ryankemper@deploy1003> Started deploy [wdqs/wdqs@316bf7f]: deploy to freshly reimaged host [production]
05:22 <marostegui> Deploy schema change on s1 eqiad old master db1184 dbmaint T367856 [production]
05:19 <marostegui@cumin1002> dbctl commit (dc=all): 'Depool db1184 T372524', diff saved to https://phabricator.wikimedia.org/P67395 and previous config saved to /var/cache/conftool/dbconfig/20240820-051948-marostegui.json [production]
05:18 <marostegui@cumin1002> dbctl commit (dc=all): 'Promote db1163 to s1 primary and set section read-write T372524', diff saved to https://phabricator.wikimedia.org/P67394 and previous config saved to /var/cache/conftool/dbconfig/20240820-051843-marostegui.json [production]
05:18 <marostegui@cumin1002> dbctl commit (dc=all): 'Set s1 eqiad as read-only for maintenance - T372524', diff saved to https://phabricator.wikimedia.org/P67393 and previous config saved to /var/cache/conftool/dbconfig/20240820-051821-root.json [production]
05:18 <marostegui> Starting s1 eqiad failover from db1184 to db1163 - T372524 [production]
05:17 <marostegui@cumin1002> dbctl commit (dc=all): 'Set db1163 with weight 0 T372524', diff saved to https://phabricator.wikimedia.org/P67392 and previous config saved to /var/cache/conftool/dbconfig/20240820-051726-marostegui.json [production]
05:16 <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 6:00:00 on db1184.eqiad.wmnet with reason: Long schema change [production]
05:16 <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db1184.eqiad.wmnet with reason: Long schema change [production]
04:52 <marostegui@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 35 hosts with reason: Primary switchover s1 T372524 [production]
04:52 <marostegui@cumin1002> dbctl commit (dc=all): 'Set db1163 with weight 0 T372524', diff saved to https://phabricator.wikimedia.org/P67391 and previous config saved to /var/cache/conftool/dbconfig/20240820-045212-root.json [production]
04:52 <marostegui@cumin1002> START - Cookbook sre.hosts.downtime for 1:00:00 on 35 hosts with reason: Primary switchover s1 T372524 [production]
04:00 <mwpresync@deploy1003> Pruned MediaWiki: 1.43.0-wmf.16 (duration: 00m 56s) [production]
03:48 <mwpresync@deploy1003> Finished scap sync-world: testwikis to 1.43.0-wmf.19 refs T366964 (duration: 46m 32s) [production]
03:18 <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) [admin]
03:02 <mwpresync@deploy1003> Started scap sync-world: testwikis to 1.43.0-wmf.19 refs T366964 [production]
00:21 <mutante> previous message about prometheus can be ignored - race condition that solved itself on next puppet run [production]
00:04 <mutante> prometheus3003/prometheus1006 - are trying to use puppetserver1002 but get connection refused from puppetserver1001.eqiad.wmnet port 8140 - causing other puppet errors [production]
2024-08-19
23:59 <mutante> prometheus - puppet on prometheus hosts very slow - reason appears to be that /srv/prometheus is recursively managed by puppet but has ~ 20x more files than the default soft limit of 1000 [production]
23:55 <mutante> prometheus - switched ferm::service to firewall::service (gerrit:1057952) - NOOP except /etc/ferm/conf.d/10_prometheus-web becomes /etc/ferm/conf.d/10_prometheus_web with identical rules [production]
23:28 <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) [admin]
23:28 <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.undrain_node [admin]
23:17 <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.undrain_node [admin]
23:17 <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.undrain_node (exit_code=0) [admin]
23:17 <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.undrain_node [admin]
23:16 <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=0) [admin]
23:16 <andrew@cloudcumin1001> START - Cookbook wmcs.ceph.osd.bootstrap_and_add [admin]
23:16 <andrew@cloudcumin1001> END (ERROR) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=97) [admin]
23:15 <ejegg> fundraising civicrm upgraded from fd01c939 to 1022abf1 [production]
22:30 <andrew@cumin1002> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1041.eqiad.wmnet with OS bullseye [production]
22:12 <andrew@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1041.eqiad.wmnet with reason: host reimage [production]
22:09 <andrew@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1041.eqiad.wmnet with reason: host reimage [production]
22:02 <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-24 [tools]
21:56 <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-24 [tools]
21:52 <andrew@cloudcumin1001> END (PASS) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=0) for tools-k8s-worker-nfs-17 [tools]
21:50 <andrew@cumin1002> START - Cookbook sre.hosts.reimage for host cloudcephosd1041.eqiad.wmnet with OS bullseye [production]
21:48 <andrew@cumin1002> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1040.eqiad.wmnet with OS bullseye [production]
21:46 <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-17 [tools]
21:46 <andrew@cloudcumin1001> END (FAIL) - Cookbook wmcs.toolforge.k8s.reboot (exit_code=99) for tools-k8s-worker-nfs-17,tools-k8s-worker-nfs-24 [tools]
21:46 <andrew@cloudcumin1001> START - Cookbook wmcs.toolforge.k8s.reboot for tools-k8s-worker-nfs-17,tools-k8s-worker-nfs-24 [tools]
21:30 <andrew@cumin1002> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1040.eqiad.wmnet with reason: host reimage [production]
21:26 <andrew@cumin1002> START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1040.eqiad.wmnet with reason: host reimage [production]
21:07 <andrew@cumin1002> START - Cookbook sre.hosts.reimage for host cloudcephosd1040.eqiad.wmnet with OS bullseye [production]
21:06 <andrew@cumin1002> END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1039.eqiad.wmnet with OS bullseye [production]
20:57 <eevans@deploy1003> Finished deploy [restbase/deploy@b504108] (beta): Dry run beta deployment test (duration: 00m 06s) [production]
20:57 <eevans@deploy1003> Started deploy [restbase/deploy@b504108] (beta): Dry run beta deployment test [production]
20:52 <sbassett> Deployed changes from T372570 to security.wikimedia.org (miscweb) [production]
20:49 <sbassett@deploy1003> helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [production]
20:49 <sbassett@deploy1003> helmfile [eqiad] START helmfile.d/services/miscweb: apply [production]
20:49 <sbassett@deploy1003> helmfile [codfw] DONE helmfile.d/services/miscweb: apply [production]