2024-08-20
05:16 |
<marostegui@cumin1002> |
START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db1184.eqiad.wmnet with reason: Long schema change |
[production] |
04:52 |
<marostegui@cumin1002> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 35 hosts with reason: Primary switchover s1 T372524 |
[production] |
04:52 |
<marostegui@cumin1002> |
dbctl commit (dc=all): 'Set db1163 with weight 0 T372524', diff saved to https://phabricator.wikimedia.org/P67391 and previous config saved to /var/cache/conftool/dbconfig/20240820-045212-root.json |
[production] |
04:52 |
<marostegui@cumin1002> |
START - Cookbook sre.hosts.downtime for 1:00:00 on 35 hosts with reason: Primary switchover s1 T372524 |
[production] |
04:00 |
<mwpresync@deploy1003> |
Pruned MediaWiki: 1.43.0-wmf.16 (duration: 00m 56s) |
[production] |
03:48 |
<mwpresync@deploy1003> |
Finished scap sync-world: testwikis to 1.43.0-wmf.19 refs T366964 (duration: 46m 32s) |
[production] |
03:02 |
<mwpresync@deploy1003> |
Started scap sync-world: testwikis to 1.43.0-wmf.19 refs T366964 |
[production] |
00:21 |
<mutante> |
previous message about prometheus can be ignored - race condition that resolved itself on the next puppet run |
[production] |
00:04 |
<mutante> |
prometheus3003/prometheus1006 - are trying to use puppetserver1002 but get connection refused from puppetserver1001.eqiad.wmnet port 8140 - causing other puppet errors |
[production] |
2024-08-19
23:59 |
<mutante> |
prometheus - puppet on prometheus hosts is very slow - the reason appears to be that /srv/prometheus is recursively managed by puppet but has ~20x more files than the default soft limit of 1000 |
[production] |
23:55 |
<mutante> |
prometheus - switched ferm::service to firewall::service (gerrit:1057952) - NOOP except /etc/ferm/conf.d/10_prometheus-web becomes /etc/ferm/conf.d/10_prometheus_web with identical rules |
[production] |
23:15 |
<ejegg> |
fundraising civicrm upgraded from fd01c939 to 1022abf1 |
[production] |
22:30 |
<andrew@cumin1002> |
END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1041.eqiad.wmnet with OS bullseye |
[production] |
22:12 |
<andrew@cumin1002> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1041.eqiad.wmnet with reason: host reimage |
[production] |
22:09 |
<andrew@cumin1002> |
START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1041.eqiad.wmnet with reason: host reimage |
[production] |
21:50 |
<andrew@cumin1002> |
START - Cookbook sre.hosts.reimage for host cloudcephosd1041.eqiad.wmnet with OS bullseye |
[production] |
21:48 |
<andrew@cumin1002> |
END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1040.eqiad.wmnet with OS bullseye |
[production] |
21:30 |
<andrew@cumin1002> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1040.eqiad.wmnet with reason: host reimage |
[production] |
21:26 |
<andrew@cumin1002> |
START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1040.eqiad.wmnet with reason: host reimage |
[production] |
21:07 |
<andrew@cumin1002> |
START - Cookbook sre.hosts.reimage for host cloudcephosd1040.eqiad.wmnet with OS bullseye |
[production] |
21:06 |
<andrew@cumin1002> |
END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1039.eqiad.wmnet with OS bullseye |
[production] |
20:57 |
<eevans@deploy1003> |
Finished deploy [restbase/deploy@b504108] (beta): Dry run beta deployment test (duration: 00m 06s) |
[production] |
20:57 |
<eevans@deploy1003> |
Started deploy [restbase/deploy@b504108] (beta): Dry run beta deployment test |
[production] |
20:52 |
<sbassett> |
Deployed changes from T372570 to security.wikimedia.org (miscweb) |
[production] |
20:49 |
<sbassett@deploy1003> |
helmfile [eqiad] DONE helmfile.d/services/miscweb: apply |
[production] |
20:49 |
<sbassett@deploy1003> |
helmfile [eqiad] START helmfile.d/services/miscweb: apply |
[production] |
20:49 |
<sbassett@deploy1003> |
helmfile [codfw] DONE helmfile.d/services/miscweb: apply |
[production] |
20:49 |
<sbassett@deploy1003> |
helmfile [codfw] START helmfile.d/services/miscweb: apply |
[production] |
20:49 |
<sbassett@deploy1003> |
helmfile [eqiad] DONE helmfile.d/services/miscweb: apply |
[production] |
20:48 |
<andrew@cumin1002> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1039.eqiad.wmnet with reason: host reimage |
[production] |
20:46 |
<sbassett@deploy1003> |
helmfile [eqiad] START helmfile.d/services/miscweb: apply |
[production] |
20:45 |
<eevans@deploy1003> |
Finished deploy [restbase/deploy@b504108] (beta): Dry run beta deployment test (duration: 00m 32s) |
[production] |
20:45 |
<sbassett@deploy1003> |
helmfile [codfw] DONE helmfile.d/services/miscweb: apply |
[production] |
20:45 |
<eevans@deploy1003> |
Started deploy [restbase/deploy@b504108] (beta): Dry run beta deployment test |
[production] |
20:44 |
<mforns@deploy1003> |
Finished deploy [airflow-dags/analytics_test@3ec5119]: (no justification provided) (duration: 00m 11s) |
[production] |
20:44 |
<andrew@cumin1002> |
START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1039.eqiad.wmnet with reason: host reimage |
[production] |
20:44 |
<mforns@deploy1003> |
Started deploy [airflow-dags/analytics_test@3ec5119]: (no justification provided) |
[production] |
20:42 |
<sbassett@deploy1003> |
helmfile [codfw] START helmfile.d/services/miscweb: apply |
[production] |
20:26 |
<andrew@cumin1002> |
START - Cookbook sre.hosts.reimage for host cloudcephosd1039.eqiad.wmnet with OS bullseye |
[production] |
20:26 |
<andrew@cumin1002> |
END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1039.eqiad.wmnet with OS bullseye |
[production] |
20:00 |
<andrew@cumin1002> |
START - Cookbook sre.hosts.reimage for host cloudcephosd1039.eqiad.wmnet with OS bullseye |
[production] |
19:59 |
<andrew@cumin1002> |
END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1039.eqiad.wmnet with OS bullseye |
[production] |
19:54 |
<ryankemper@cumin2002> |
END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2024.codfw.wmnet with OS bullseye |
[production] |
19:53 |
<dancy@deploy1003> |
Started scap sync-world: testing T371904 |
[production] |
19:52 |
<dancy@deploy1003> |
Installation of scap version "4.98.0" completed for 207 hosts |
[production] |
19:52 |
<dancy@deploy1003> |
Installing scap version "4.98.0" for 207 hosts |
[production] |
19:51 |
<jhancock@cumin2002> |
END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudlb2004-dev.codfw.wmnet with OS bookworm |
[production] |
19:45 |
<andrew@cumin1002> |
START - Cookbook sre.hosts.reimage for host cloudcephosd1039.eqiad.wmnet with OS bullseye |
[production] |
19:45 |
<andrew@cumin1002> |
END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1039.eqiad.wmnet with OS bullseye |
[production] |
19:29 |
<andrew@cumin1002> |
START - Cookbook sre.hosts.reimage for host cloudcephosd1039.eqiad.wmnet with OS bullseye |
[production] |