| 
      
        2024-07-11
      
     | 
  
    
  | 14:42 | 
  <hnowlan@deploy1002> | 
  helmfile [codfw] START helmfile.d/services/changeprop: apply | 
  [production] | 
            
  | 14:41 | 
  <hnowlan@deploy1002> | 
  helmfile [staging] DONE helmfile.d/services/changeprop: apply | 
  [production] | 
            
  | 14:40 | 
  <hnowlan@deploy1002> | 
  helmfile [staging] START helmfile.d/services/changeprop: apply | 
  [production] | 
            
  | 14:38 | 
  <arnaudb@cumin1002> | 
  dbctl commit (dc=all): 'Repooling after maintenance db1185 (T367781)', diff saved to https://phabricator.wikimedia.org/P66296 and previous config saved to /var/cache/conftool/dbconfig/20240711-143829-arnaudb.json | 
  [production] | 
            
  | 14:36 | 
  <arnaudb@cumin1002> | 
  dbctl commit (dc=all): 'Depooling db1185 (T367781)', diff saved to https://phabricator.wikimedia.org/P66295 and previous config saved to /var/cache/conftool/dbconfig/20240711-143606-arnaudb.json | 
  [production] | 
            
  | 14:35 | 
  <arnaudb@cumin1002> | 
  END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1185.eqiad.wmnet with reason: Maintenance | 
  [production] | 
            
  | 14:35 | 
  <arnaudb@cumin1002> | 
  START - Cookbook sre.hosts.downtime for 4:00:00 on db1185.eqiad.wmnet with reason: Maintenance | 
  [production] | 
            
  | 14:35 | 
  <arnaudb@cumin1002> | 
  dbctl commit (dc=all): 'db1193 (re)pooling @ 5%: post T365996 repool', diff saved to https://phabricator.wikimedia.org/P66294 and previous config saved to /var/cache/conftool/dbconfig/20240711-143541-arnaudb.json | 
  [production] | 
            
  | 14:35 | 
  <godog> | 
  pool titan1001 for switch work T365996 | 
  [production] | 
            
  | 14:25 | 
  <arnaudb@cumin1002> | 
  END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on backup1011.eqiad.wmnet,db1193.eqiad.wmnet,dbproxy1027.eqiad.wmnet with reason: T365996 | 
  [production] | 
            
  | 14:25 | 
  <arnaudb@cumin1002> | 
  START - Cookbook sre.hosts.downtime for 1:30:00 on backup1011.eqiad.wmnet,db1193.eqiad.wmnet,dbproxy1027.eqiad.wmnet with reason: T365996 | 
  [production] | 
            
  | 14:25 | 
  <arnaudb@cumin1002> | 
  dbctl commit (dc=all): 'T365996 - depool db1193 - s8', diff saved to https://phabricator.wikimedia.org/P66293 and previous config saved to /var/cache/conftool/dbconfig/20240711-142544-arnaudb.json | 
  [production] | 
            
  | 14:20 | 
  <arnaudb@cumin1002> | 
  dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P66292 and previous config saved to /var/cache/conftool/dbconfig/20240711-142037-arnaudb.json | 
  [production] | 
            
  | 14:19 | 
  <cmooney@cumin1002> | 
  END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 23 hosts with reason: JunOS upgrade lsw1-f1-eqiad | 
  [production] | 
            
  | 14:19 | 
  <cmooney@cumin1002> | 
  START - Cookbook sre.hosts.downtime for 0:30:00 on 23 hosts with reason: JunOS upgrade lsw1-f1-eqiad | 
  [production] | 
            
  | 14:15 | 
  <topranks> | 
  rebooting lsw1-f1-eqiad to install updated JunOS version T365996 | 
  [production] | 
            
  | 14:12 | 
  <godog> | 
  depool titan1001 for switch work T365996 | 
  [production] | 
            
  | 14:12 | 
  <cmooney@cumin1002> | 
  END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 23 hosts with reason: JunOS upgrade lsw1-f1-eqiad | 
  [production] | 
            
  | 14:12 | 
  <cmooney@cumin1002> | 
  START - Cookbook sre.hosts.downtime for 0:30:00 on 23 hosts with reason: JunOS upgrade lsw1-f1-eqiad | 
  [production] | 
            
  | 14:09 | 
  <cmooney@cumin1002> | 
  END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lsw1-f1-eqiad,lsw1-f1-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-f1-eqiad | 
  [production] | 
            
  | 14:08 | 
  <cmooney@cumin1002> | 
  START - Cookbook sre.hosts.downtime for 0:30:00 on lsw1-f1-eqiad,lsw1-f1-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-f1-eqiad | 
  [production] | 
            
  | 14:08 | 
  <cmooney@cumin1002> | 
  END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:50:00 on lsw1-f1-eqiad.mgmt with reason: prep JunOS upgrade lsw1-f1-eqiad | 
  [production] | 
            
  | 14:08 | 
  <cmooney@cumin1002> | 
  START - Cookbook sre.hosts.downtime for 0:50:00 on lsw1-f1-eqiad.mgmt with reason: prep JunOS upgrade lsw1-f1-eqiad | 
  [production] | 
            
  | 14:05 | 
  <arnaudb@cumin1002> | 
  dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P66291 and previous config saved to /var/cache/conftool/dbconfig/20240711-140530-arnaudb.json | 
  [production] | 
            
  | 13:56 | 
  <klausman@deploy1002> | 
  helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. | 
  [production] | 
            
  | 13:52 | 
  <klausman@deploy1002> | 
  helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. | 
  [production] | 
            
  | 13:50 | 
  <arnaudb@cumin1002> | 
  dbctl commit (dc=all): 'Repooling after maintenance db1183 (T367781)', diff saved to https://phabricator.wikimedia.org/P66290 and previous config saved to /var/cache/conftool/dbconfig/20240711-135023-arnaudb.json | 
  [production] | 
            
  | 13:50 | 
  <Emperor> | 
  depool ms-fe1014 and thanos-fe1004 before switch work T365996 | 
  [production] | 
            
  | 13:49 | 
  <dcaro> | 
  deploy toolforge-jobs-framework 16.0.13 (T369573) | 
  [tools] | 
            
  | 13:47 | 
  <arnaudb@cumin1002> | 
  dbctl commit (dc=all): 'Depooling db1183 (T367781)', diff saved to https://phabricator.wikimedia.org/P66289 and previous config saved to /var/cache/conftool/dbconfig/20240711-134759-arnaudb.json | 
  [production] | 
            
  | 13:47 | 
  <arnaudb@cumin1002> | 
  END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1183.eqiad.wmnet with reason: Maintenance | 
  [production] | 
            
  | 13:47 | 
  <arnaudb@cumin1002> | 
  START - Cookbook sre.hosts.downtime for 4:00:00 on db1183.eqiad.wmnet with reason: Maintenance | 
  [production] | 
            
  | 13:47 | 
  <arnaudb@cumin1002> | 
  dbctl commit (dc=all): 'Repooling after maintenance db1161 (T367781)', diff saved to https://phabricator.wikimedia.org/P66288 and previous config saved to /var/cache/conftool/dbconfig/20240711-134737-arnaudb.json | 
  [production] | 
            
  | 13:44 | 
  <btullis@cumin1002> | 
  END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto an-presto cluster: Roll restart of all Presto's jvm daemons. | 
  [production] | 
            
  | 13:42 | 
  <wmbot~dcaro@urcuchillay> | 
  END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) | 
  [admin] | 
            
  | 13:41 | 
  <wmbot~dcaro@urcuchillay> | 
  START - Cookbook wmcs.ceph.osd.bootstrap_and_add | 
  [admin] | 
            
  | 13:32 | 
  <klausman@deploy1002> | 
  helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. | 
  [production] | 
            
  | 13:32 | 
  <arnaudb@cumin1002> | 
  dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P66287 and previous config saved to /var/cache/conftool/dbconfig/20240711-133229-arnaudb.json | 
  [production] | 
            
  | 13:29 | 
  <klausman@deploy1002> | 
  helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. | 
  [production] | 
            
  | 13:28 | 
  <btullis@cumin1002> | 
  END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1090.eqiad.wmnet | 
  [production] | 
            
  | 13:26 | 
  <klausman@deploy1002> | 
  helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. | 
  [production] | 
            
  | 13:22 | 
  <klausman@deploy1002> | 
  helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. | 
  [production] | 
            
  | 13:20 | 
  <btullis@cumin1002> | 
  START - Cookbook sre.hosts.reboot-single for host an-worker1090.eqiad.wmnet | 
  [production] | 
            
  | 13:18 | 
  <btullis> | 
  setting cephosd cluster to noout mode for T365996 | 
  [analytics] | 
            
  | 13:17 | 
  <btullis> | 
  draining dse-k8s-worker1007 ready for T365996 | 
  [analytics] | 
            
  | 13:17 | 
  <arnaudb@cumin1002> | 
  dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P66286 and previous config saved to /var/cache/conftool/dbconfig/20240711-131721-arnaudb.json | 
  [production] | 
            
  | 13:14 | 
  <btullis> | 
  failed back hive and presto services to an-coord1003 | 
  [analytics] | 
            
  | 13:14 | 
  <cgoubert@cumin1002> | 
  conftool action : set/pooled=yes; selector: name=(kubernetes1062.eqiad.wmnet|mw1494.eqiad.wmnet|mw1495.eqiad.wmnet),cluster=kubernetes,service=kubesvc | 
  [production] | 
            
  | 13:14 | 
  <claime> | 
  Uncordoning and depooling kubernetes1062.eqiad.wmnet mw1494.eqiad.wmnet mw1495.eqiad.wmnet that were actually not concerned by T365996 | 
  [production] | 
            
  | 13:13 | 
  <klausman@deploy1002> | 
  helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. | 
  [production] |