| 
      
        2021-04-29
      
     | 
  
    
  | 01:23 | 
  <ryankemper> | 
  T280382 `sudo -i wmf-auto-reimage-host -p T280382 --new wdqs1004.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` | 
  [production] | 
            
  | 01:21 | 
  <ryankemper@cumin1001> | 
  END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) | 
  [production] | 
            
  | 01:21 | 
  <ryankemper@cumin1001> | 
  START - Cookbook sre.wdqs.data-transfer | 
  [production] | 
            
  | 01:20 | 
  <ryankemper@cumin1001> | 
  END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) | 
  [production] | 
            
  | 01:20 | 
  <ryankemper@cumin1001> | 
  START - Cookbook sre.wdqs.data-transfer | 
  [production] | 
            
  | 01:19 | 
  <ryankemper@cumin1001> | 
  END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) | 
  [production] | 
            
  | 01:19 | 
  <ryankemper@cumin1001> | 
  START - Cookbook sre.wdqs.data-transfer | 
  [production] | 
            
  | 01:19 | 
  <ryankemper> | 
  T280382 Aborted data transfer; `wdqs2007` is hosed (see https://phabricator.wikimedia.org/T281437) | 
  [production] | 
            
  | 01:18 | 
  <ryankemper@cumin1001> | 
  END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) | 
  [production] | 
            
  | 00:40 | 
  <tstarling@deploy1002> | 
  Synchronized php-1.37.0-wmf.3/includes/specials/pagers/ImageListPager.php: T281405 (duration: 01m 08s) | 
  [production] | 
            
  | 00:11 | 
  <ryankemper> | 
  T280382 `sudo -i wmf-auto-reimage-host -p T280382 wdqs1004.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` | 
  [production] | 
            
  | 00:06 | 
  <ryankemper> | 
  T280382 `wdqs1013.eqiad.wmnet` has been re-imaged and had the appropriate wikidata/categories journal files transferred. `df -h` shows disk space is no longer an issue following the switch to `raid0`: `/dev/mapper/vg0-srv   2.7T  998G  1.6T  39% /srv` | 
  [production] | 
            
  
    | 
      
        2021-04-28
      
     | 
  
    
  | 23:42 | 
  <ryankemper@cumin1001> | 
  END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) | 
  [production] | 
            
  | 23:38 | 
  <robh@cumin1001> | 
  END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5016.eqsin.wmnet with reason: REIMAGE | 
  [production] | 
            
  | 23:36 | 
  <robh@cumin1001> | 
  START - Cookbook sre.hosts.downtime for 2:00:00 on cp5016.eqsin.wmnet with reason: REIMAGE | 
  [production] | 
            
  | 23:36 | 
  <robh@cumin1001> | 
  END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5015.eqsin.wmnet with reason: REIMAGE | 
  [production] | 
            
  | 23:34 | 
  <robh@cumin1001> | 
  END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5014.eqsin.wmnet with reason: REIMAGE | 
  [production] | 
            
  | 23:33 | 
  <robh@cumin1001> | 
  START - Cookbook sre.hosts.downtime for 2:00:00 on cp5015.eqsin.wmnet with reason: REIMAGE | 
  [production] | 
            
  | 23:32 | 
  <robh@cumin1001> | 
  START - Cookbook sre.hosts.downtime for 2:00:00 on cp5014.eqsin.wmnet with reason: REIMAGE | 
  [production] | 
            
  | 23:06 | 
  <dpifke@deploy1002> | 
  Finished deploy [performance/navtiming@cf8b2e9]: Deploying https://gerrit.wikimedia.org/r/c/performance/navtiming/+/682886 (duration: 00m 05s) | 
  [production] | 
            
  | 23:06 | 
  <dpifke@deploy1002> | 
  Started deploy [performance/navtiming@cf8b2e9]: Deploying https://gerrit.wikimedia.org/r/c/performance/navtiming/+/682886 | 
  [production] | 
            
  | 22:44 | 
  <dwisehaupt> | 
  civiproxy revision changed to 99cecb924a - initial rollout of code for testing | 
  [production] | 
            
  | 22:26 | 
  <ryankemper> | 
  T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1013.eqiad.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `reimage` | 
  [production] | 
            
  | 22:26 | 
  <ryankemper@cumin1001> | 
  START - Cookbook sre.wdqs.data-transfer | 
  [production] | 
            
  | 22:23 | 
  <ryankemper@cumin1001> | 
  END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) | 
  [production] | 
            
  | 22:18 | 
  <ryankemper> | 
  T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs1004.eqiad.wmnet --dest wdqs1013.eqiad.wmnet --reason "transferring fresh categories journal following reimage" --blazegraph_instance categories` on `ryankemper@cumin1001` tmux session `reimage` | 
  [production] | 
            
  | 22:18 | 
  <ryankemper@cumin1001> | 
  START - Cookbook sre.wdqs.data-transfer | 
  [production] | 
            
  | 22:18 | 
  <robh@cumin1001> | 
  END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE | 
  [production] | 
            
  | 22:15 | 
  <robh@cumin1001> | 
  START - Cookbook sre.hosts.downtime for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE | 
  [production] | 
            
  | 21:49 | 
  <legoktm@deploy1002> | 
  helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. | 
  [production] | 
            
  | 21:49 | 
  <legoktm@deploy1002> | 
  helmfile [staging-eqiad] START helmfile.d/admin 'apply'. | 
  [production] | 
            
  | 21:47 | 
  <legoktm@deploy1002> | 
  helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. | 
  [production] | 
            
  | 21:46 | 
  <robh@cumin1001> | 
  END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE | 
  [production] | 
            
  | 21:44 | 
  <legoktm@deploy1002> | 
  helmfile [staging-codfw] START helmfile.d/admin 'apply'. | 
  [production] | 
            
  | 21:44 | 
  <robh@cumin1001> | 
  START - Cookbook sre.hosts.downtime for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE | 
  [production] | 
            
  | 21:41 | 
  <ryankemper@cumin1001> | 
  END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1013.eqiad.wmnet with reason: REIMAGE | 
  [production] | 
            
  | 21:39 | 
  <ryankemper@cumin1001> | 
  START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1013.eqiad.wmnet with reason: REIMAGE | 
  [production] | 
            
  | 21:39 | 
  <ryankemper@cumin1001> | 
  START - Cookbook sre.wdqs.data-transfer | 
  [production] | 
            
  | 21:39 | 
  <ryankemper@cumin1001> | 
  END (ERROR) - Cookbook sre.wdqs.data-transfer (exit_code=97) | 
  [production] | 
            
  | 21:38 | 
  <ryankemper> | 
  T280382 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2001.codfw.wmnet --dest wdqs2007.codfw.wmnet --reason "transferring fresh wikidata journal following reimage" --blazegraph_instance blazegraph` on `ryankemper@cumin1001` tmux session `reimage` | 
  [production] | 
            
  | 21:37 | 
  <ryankemper> | 
  T280382 `wdqs2007` is reachable again; glancing at `/srv/wdqs` its `wikidata.jnl` is `839G` when it should be `975G` so I'll re-do the wikidata journal transfer | 
  [production] | 
            
  | 21:32 | 
  <ryankemper> | 
  T280382 [WDQS] `wdqs2007` ssh is unreachable; power cycling via `racadm>>racadm serveraction powercycle` | 
  [production] | 
            
  | 21:24 | 
  <ryankemper> | 
  T280382 `sudo -i wmf-auto-reimage-host -p T280382 --new wdqs1013.eqiad.wmnet` on `ryankemper@cumin1001` tmux session `reimage` (previous reimage timed out, instance appears to have rebooted) | 
  [production] | 
            
  | 21:07 | 
  <robh@cumin1001> | 
  END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5016.eqsin.wmnet with reason: REIMAGE | 
  [production] | 
            
  | 21:05 | 
  <robh@cumin1001> | 
  END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5015.eqsin.wmnet with reason: REIMAGE | 
  [production] | 
            
  | 21:04 | 
  <robh@cumin1001> | 
  START - Cookbook sre.hosts.downtime for 2:00:00 on cp5016.eqsin.wmnet with reason: REIMAGE | 
  [production] | 
            
  | 21:03 | 
  <robh@cumin1001> | 
  END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE | 
  [production] | 
            
  | 21:03 | 
  <robh@cumin1001> | 
  END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cp5014.eqsin.wmnet with reason: REIMAGE | 
  [production] | 
            
  | 21:01 | 
  <robh@cumin1001> | 
  START - Cookbook sre.hosts.downtime for 2:00:00 on cp5013.eqsin.wmnet with reason: REIMAGE | 
  [production] | 
            
  | 21:01 | 
  <robh@cumin1001> | 
  START - Cookbook sre.hosts.downtime for 2:00:00 on cp5015.eqsin.wmnet with reason: REIMAGE | 
  [production] |