| 2021-08-10
      
      § | 
    
  | 20:31 | <robh@cumin1001> | END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mc1042.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 20:30 | <robh@cumin1001> | START - Cookbook sre.hosts.downtime for 2:00:00 on mc1043.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 20:29 | <robh@cumin1001> | END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mc1041.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 20:28 | <robh@cumin1001> | START - Cookbook sre.hosts.downtime for 2:00:00 on mc1042.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 20:26 | <robh@cumin1001> | START - Cookbook sre.hosts.downtime for 2:00:00 on mc1041.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 19:29 | <robh@cumin1001> | END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mc1040.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 19:27 | <robh@cumin1001> | END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mc1039.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 19:27 | <robh@cumin1001> | START - Cookbook sre.hosts.downtime for 2:00:00 on mc1040.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 19:25 | <robh@cumin1001> | START - Cookbook sre.hosts.downtime for 2:00:00 on mc1039.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 19:16 | <cmjohnson@cumin1001> | END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) | [production] | 
            
  | 19:15 | <cmjohnson@cumin1001> | START - Cookbook sre.dns.netbox | [production] | 
            
  | 19:09 | <cmjohnson@cumin1001> | END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on dumpsdata1005.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 19:09 | <cmjohnson@cumin1001> | END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ganeti1024.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 19:07 | <cmjohnson@cumin1001> | END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dumpsdata1004.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 19:05 | <cmjohnson@cumin1001> | END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ganeti1023.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 19:04 | <cmjohnson@cumin1001> | START - Cookbook sre.hosts.downtime for 2:00:00 on dumpsdata1005.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 19:04 | <cmjohnson@cumin1001> | START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1024.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 19:04 | <jhuneidi@deploy1002> | rebuilt and synchronized wikiversions files: group0 wikis to 1.37.0-wmf.18  refs T281159 | [production] | 
            
  | 19:04 | <cmjohnson@cumin1001> | START - Cookbook sre.hosts.downtime for 2:00:00 on dumpsdata1004.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 19:03 | <cmjohnson@cumin1001> | START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti1023.eqiad.wmnet with reason: REIMAGE | [production] | 
            
  | 18:49 | <cmjohnson@cumin1001> | END (PASS) - Cookbook sre.dns.netbox (exit_code=0) | [production] | 
            
  | 18:47 | <ryankemper> | [WDQS] `ryankemper@wdqs2005:~$ sudo depool` (~1.26 hours of lag) | [production] | 
            
  | 18:46 | <cmjohnson@cumin1001> | START - Cookbook sre.dns.netbox | [production] | 
            
  | 18:46 | <ryankemper> | T288501 (Misread grafana graph, `wdqs2003` only has 1.33 hours to catch up on) | [production] | 
            
  | 18:45 | <ryankemper> | T288501 `data-transfer` of `wikidata.jnl` completed successfully. Host needs to catch up on ~22 hours of WDQS lag before being re-pooled | [production] | 
            
  | 18:42 | <ryankemper@cumin2001> | END (PASS) - Cookbook sre.wdqs.data-transfer (exit_code=0) | [production] | 
            
  | 17:23 | <jhuneidi@deploy1002> | Finished scap: testwikis wikis to 1.37.0-wmf.18 (duration: 36m 35s) | [production] | 
            
  | 17:19 | <ryankemper> | T288501 `sudo -i cookbook sre.wdqs.data-transfer --source wdqs2005.codfw.wmnet --dest wdqs2003.codfw.wmnet --reason "transferring fresh wikidata journal to resolve disk issue" --blazegraph_instance blazegraph` on `cumin2001` tmux session `wdqs_data_xfer` | [production] | 
            
  | 17:19 | <ryankemper@cumin2001> | START - Cookbook sre.wdqs.data-transfer | [production] | 
            
  | 17:18 | <mbsantos@deploy1002> | helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . | [production] | 
            
  | 17:13 | <ryankemper> | T288501 [WDQS] `ryankemper@wdqs2003:~$ sudo rm -fv /srv/wdqs/wikidata.jnl` | [production] | 
            
  | 17:09 | <razzi@cumin1001> | END (FAIL) - Cookbook sre.druid.roll-restart-workers (exit_code=99) for Druid analytics cluster: Roll restart of Druid's jvm daemons. - razzi@cumin1001 | [production] | 
            
  | 17:09 | <razzi@cumin1001> | START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid's jvm daemons. - razzi@cumin1001 | [production] | 
            
  | 17:06 | <mbsantos@deploy1002> | helmfile [codfw] Ran 'sync' command on namespace 'mobileapps' for release 'production' . | [production] | 
            
  | 17:02 | <btullis@cumin1001> | END (FAIL) - Cookbook sre.druid.roll-restart-workers (exit_code=99) for Druid analytics cluster: Roll restart of Druid's jvm daemons. - btullis@cumin1001 | [production] | 
            
  | 17:02 | <btullis@cumin1001> | START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid's jvm daemons. - btullis@cumin1001 | [production] | 
            
  | 17:01 | <mbsantos@deploy1002> | helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . | [production] | 
            
  | 16:49 | <btullis@cumin1001> | END (FAIL) - Cookbook sre.druid.roll-restart-workers (exit_code=99) for Druid analytics cluster: Roll restart of Druid's jvm daemons. - btullis@cumin1001 | [production] | 
            
  | 16:49 | <btullis@cumin1001> | START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid's jvm daemons. - btullis@cumin1001 | [production] | 
            
  | 16:47 | <jhuneidi@deploy1002> | Started scap: testwikis wikis to 1.37.0-wmf.18 | [production] | 
            
  | 16:36 | <ebernhardson@deploy1002> | Finished deploy [wikimedia/discovery/analytics@d3c5363]: T287225: Bump rdf-spark-tools to 0.3.81 (duration: 02m 10s) | [production] | 
            
  | 16:34 | <ebernhardson@deploy1002> | Started deploy [wikimedia/discovery/analytics@d3c5363]: T287225: Bump rdf-spark-tools to 0.3.81 | [production] | 
            
  | 16:33 | <btullis@cumin1001> | END (FAIL) - Cookbook sre.druid.roll-restart-workers (exit_code=99) for Druid analytics cluster: Roll restart of Druid's jvm daemons. - btullis@cumin1001 | [production] | 
            
  | 16:33 | <btullis@cumin1001> | START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid's jvm daemons. - btullis@cumin1001 | [production] | 
            
  | 16:25 | <brennen> | gitlab: run ansible to apply [[gerrit:710676|fix shell for backup cronjob]] (T288324) | [production] | 
            
  | 16:01 | <moritzm> | installing c-ares security updates on buster | [production] | 
            
  | 14:48 | <ladsgroup@deploy1002> | Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:710515|Reduce ten seconds from dispatch max time (T288175)]] (duration: 00m 58s) | [production] | 
            
  | 13:32 | <moritzm> | updating bullseye installations to the latest state of testing | [production] | 
            
  | 13:19 | <moritzm> | installing perl security updates on Bullseye (older distros not affected) | [production] | 
            
  | 13:00 | <jayme@deploy1002> | helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . | [production] |