| 2024-07-11 |
    
  | 14:35 | <godog> | pool titan1001 for switch work T365996 | [production] | 
            
  | 14:25 | <arnaudb@cumin1002> | END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on backup1011.eqiad.wmnet,db1193.eqiad.wmnet,dbproxy1027.eqiad.wmnet with reason: T365996 | [production] | 
            
  | 14:25 | <arnaudb@cumin1002> | START - Cookbook sre.hosts.downtime for 1:30:00 on backup1011.eqiad.wmnet,db1193.eqiad.wmnet,dbproxy1027.eqiad.wmnet with reason: T365996 | [production] | 
            
  | 14:25 | <arnaudb@cumin1002> | dbctl commit (dc=all): 'T365996 - depool db1193 - s8', diff saved to https://phabricator.wikimedia.org/P66293 and previous config saved to /var/cache/conftool/dbconfig/20240711-142544-arnaudb.json | [production] | 
            
  | 14:20 | <arnaudb@cumin1002> | dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P66292 and previous config saved to /var/cache/conftool/dbconfig/20240711-142037-arnaudb.json | [production] | 
            
  | 14:19 | <cmooney@cumin1002> | END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 23 hosts with reason: JunOS upgrade lsw1-f1-eqiad | [production] | 
            
  | 14:19 | <cmooney@cumin1002> | START - Cookbook sre.hosts.downtime for 0:30:00 on 23 hosts with reason: JunOS upgrade lsw1-f1-eqiad | [production] | 
            
  | 14:15 | <topranks> | rebooting lsw1-f1-eqiad to install updated JunOS version T365996 | [production] | 
            
  | 14:12 | <godog> | depool titan1001 for switch work T365996 | [production] | 
            
  | 14:12 | <cmooney@cumin1002> | END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 23 hosts with reason: JunOS upgrade lsw1-f1-eqiad | [production] | 
            
  | 14:12 | <cmooney@cumin1002> | START - Cookbook sre.hosts.downtime for 0:30:00 on 23 hosts with reason: JunOS upgrade lsw1-f1-eqiad | [production] | 
            
  | 14:09 | <cmooney@cumin1002> | END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on lsw1-f1-eqiad,lsw1-f1-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-f1-eqiad | [production] | 
            
  | 14:08 | <cmooney@cumin1002> | START - Cookbook sre.hosts.downtime for 0:30:00 on lsw1-f1-eqiad,lsw1-f1-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-f1-eqiad | [production] | 
            
  | 14:08 | <cmooney@cumin1002> | END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:50:00 on lsw1-f1-eqiad.mgmt with reason: prep JunOS upgrade lsw1-f1-eqiad | [production] | 
            
  | 14:08 | <cmooney@cumin1002> | START - Cookbook sre.hosts.downtime for 0:50:00 on lsw1-f1-eqiad.mgmt with reason: prep JunOS upgrade lsw1-f1-eqiad | [production] | 
            
  | 14:05 | <arnaudb@cumin1002> | dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P66291 and previous config saved to /var/cache/conftool/dbconfig/20240711-140530-arnaudb.json | [production] | 
            
  | 13:56 | <klausman@deploy1002> | helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. | [production] | 
            
  | 13:52 | <klausman@deploy1002> | helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. | [production] | 
            
  | 13:50 | <arnaudb@cumin1002> | dbctl commit (dc=all): 'Repooling after maintenance db1183 (T367781)', diff saved to https://phabricator.wikimedia.org/P66290 and previous config saved to /var/cache/conftool/dbconfig/20240711-135023-arnaudb.json | [production] | 
            
  | 13:50 | <Emperor> | depool ms-fe1014 and thanos-fe1004 before switch work T365996 | [production] | 
            
  | 13:49 | <dcaro> | deploy toolforge-jobs-framework 16.0.13 (T369573) | [tools] | 
            
  | 13:47 | <arnaudb@cumin1002> | dbctl commit (dc=all): 'Depooling db1183 (T367781)', diff saved to https://phabricator.wikimedia.org/P66289 and previous config saved to /var/cache/conftool/dbconfig/20240711-134759-arnaudb.json | [production] | 
            
  | 13:47 | <arnaudb@cumin1002> | END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1183.eqiad.wmnet with reason: Maintenance | [production] | 
            
  | 13:47 | <arnaudb@cumin1002> | START - Cookbook sre.hosts.downtime for 4:00:00 on db1183.eqiad.wmnet with reason: Maintenance | [production] | 
            
  | 13:47 | <arnaudb@cumin1002> | dbctl commit (dc=all): 'Repooling after maintenance db1161 (T367781)', diff saved to https://phabricator.wikimedia.org/P66288 and previous config saved to /var/cache/conftool/dbconfig/20240711-134737-arnaudb.json | [production] | 
            
  | 13:44 | <btullis@cumin1002> | END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto an-presto cluster: Roll restart of all Presto's jvm daemons. | [production] | 
            
  | 13:42 | <wmbot~dcaro@urcuchillay> | END (FAIL) - Cookbook wmcs.ceph.osd.bootstrap_and_add (exit_code=99) | [admin] | 
            
  | 13:41 | <wmbot~dcaro@urcuchillay> | START - Cookbook wmcs.ceph.osd.bootstrap_and_add | [admin] | 
            
  | 13:32 | <klausman@deploy1002> | helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. | [production] | 
            
  | 13:32 | <arnaudb@cumin1002> | dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P66287 and previous config saved to /var/cache/conftool/dbconfig/20240711-133229-arnaudb.json | [production] | 
            
  | 13:29 | <klausman@deploy1002> | helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. | [production] | 
            
  | 13:28 | <btullis@cumin1002> | END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1090.eqiad.wmnet | [production] | 
            
  | 13:26 | <klausman@deploy1002> | helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. | [production] | 
            
  | 13:22 | <klausman@deploy1002> | helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. | [production] | 
            
  | 13:20 | <btullis@cumin1002> | START - Cookbook sre.hosts.reboot-single for host an-worker1090.eqiad.wmnet | [production] | 
            
  | 13:18 | <btullis> | setting cephosd cluster to noout mode for T365996 | [analytics] | 
            
  | 13:17 | <btullis> | draining dse-k8s-worker1007 ready for T365996 | [analytics] | 
            
  | 13:17 | <arnaudb@cumin1002> | dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P66286 and previous config saved to /var/cache/conftool/dbconfig/20240711-131721-arnaudb.json | [production] | 
            
  | 13:14 | <btullis> | failed back hive and presto services to an-coord1003 | [analytics] | 
            
  | 13:14 | <cgoubert@cumin1002> | conftool action : set/pooled=yes; selector: name=(kubernetes1062.eqiad.wmnet|mw1494.eqiad.wmnet|mw1495.eqiad.wmnet),cluster=kubernetes,service=kubesvc | [production] | 
            
  | 13:14 | <claime> | Uncordoning and repooling kubernetes1062.eqiad.wmnet mw1494.eqiad.wmnet mw1495.eqiad.wmnet, which were not actually affected by T365996 | [production] | 
            
  | 13:13 | <klausman@deploy1002> | helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. | [production] | 
            
  | 13:12 | <btullis@cumin1002> | START - Cookbook sre.presto.roll-restart-workers for Presto an-presto cluster: Roll restart of all Presto's jvm daemons. | [production] | 
            
  | 13:10 | <klausman@deploy1002> | helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. | [production] | 
            
  | 13:09 | <klausman@deploy1002> | helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. | [production] | 
            
  | 13:08 | <cgoubert@cumin1002> | conftool action : set/pooled=inactive; selector: name=(kubernetes1062.eqiad.wmnet|mw1494.eqiad.wmnet|mw1495.eqiad.wmnet),cluster=kubernetes,service=kubesvc | [production] | 
            
  | 13:05 | <klausman@deploy1002> | helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. | [production] | 
            
  | 13:04 | <claime> | Cordoning and depooling kubernetes1062.eqiad.wmnet mw1494.eqiad.wmnet mw1495.eqiad.wmnet for T365996 | [production] | 
            
  | 13:04 | <bking@cumin2002> | END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6 days, 0:00:00 on relforge[1003-1004].eqiad.wmnet with reason: T368950 | [production] | 
            
  | 13:04 | <klausman@deploy1002> | helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. | [production] |