| 2015-06-16
      
    
  | 05:56 | <godog> | bump ES replication throttling to 60mb/s | [production] | 
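A sketch of the kind of change this entry describes: raising the Elasticsearch recovery/replication throttle through the cluster settings API. The host and the ES 1.x setting name are assumptions; the log does not show the actual command.

```python
# Hypothetical sketch: raise the ES recovery/replication throttle via
# the cluster settings API. Host and setting name (ES 1.x) assumed.
import json
import urllib.request

req = urllib.request.Request(
    "http://elastic1001:9200/_cluster/settings",  # placeholder host
    data=json.dumps({
        # transient settings revert on a full cluster restart
        "transient": {"indices.recovery.max_bytes_per_sec": "60mb"}
    }).encode(),
    method="PUT",
)
print(urllib.request.urlopen(req).read().decode())
```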
            
  | 05:50 | <manybubbles> | ok - we're yellow and recovering. ops can take this from here. We have a root cause and we have things I can complain about to the elastic folks I plan to meet with today anyway. I'm going to finish waking up now. | [production] | 
            
  | 05:49 | <manybubbles> | reenabling puppet agent on elasticsearch machines | [production] | 
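Re-enabling the agent is a one-liner per host; a minimal sketch, assuming SSH access and an illustrative host list:

```python
# Hypothetical sketch: re-enable the puppet agent across the
# elasticsearch fleet. Host list is illustrative, not the real one.
import subprocess

for host in ["elastic1001.eqiad.wmnet", "elastic1002.eqiad.wmnet"]:
    subprocess.run(["ssh", host, "sudo", "puppet", "agent", "--enable"],
                   check=True)
```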
            
  | 05:46 | <manybubbles> | I expect them to be red for another few minutes during the initial master recovery | [production] | 
            
  | 05:46 | <manybubbles> | started all elasticsearch nodes and now they are recovering. | [production] | 
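Recovery progress can be watched with the cluster health API; a minimal sketch, with a placeholder host, that blocks until the cluster reaches yellow:

```python
# Hypothetical sketch: wait for the restarted cluster to leave red.
import urllib.request

url = ("http://elastic1001:9200/_cluster/health"  # placeholder host
       "?wait_for_status=yellow&timeout=120s")
print(urllib.request.urlopen(url).read().decode())
```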
            
  | 05:41 | <godog> | restart gmond on elastic1007 | [production] | 
            
  | 05:39 | <filippo> | Synchronized wmf-config/PoolCounterSettings-common.php: throttle ES (duration: 00m 13s) | [production] | 
            
| 05:25 | <manybubbles> | shutting down elasticsearch on all the elasticsearch nodes again - another full cluster restart should fix it like it did last time... | [production] | 
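A common precaution around full-cluster restarts (not recorded in this log) is to disable shard allocation before the nodes go down and re-enable it afterwards, so the cluster does not try to rebalance in between. A sketch, with the ES 1.x setting name and host assumed:

```python
# Hypothetical sketch: toggle shard allocation around a full restart.
import json
import urllib.request

def set_allocation(mode):  # "none" before the restart, "all" after
    req = urllib.request.Request(
        "http://elastic1001:9200/_cluster/settings",  # placeholder host
        data=json.dumps(
            {"transient": {"cluster.routing.allocation.enable": mode}}
        ).encode(),
        method="PUT",
    )
    return urllib.request.urlopen(req).read().decode()

print(set_allocation("none"))
```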
            
  | 05:11 | <godog> | restart elasticsearch on elastic1031 | [production] | 
            
  | 03:06 | <springle> | Synchronized wmf-config/db-eqiad.php: depool db1073 (duration: 00m 12s) | [production] | 
            
  | 02:27 | <LocalisationUpdate> | completed (1.26wmf9) at 2015-06-16 02:27:51+00:00 | [production] | 
            
  | 02:24 | <l10nupdate> | Synchronized php-1.26wmf9/cache/l10n: (no message) (duration: 05m 52s) | [production] | 
            
  | 00:55 | <tgr> | running extensions/Gather/maintenance/updateCounts.php for gather wikis - https://phabricator.wikimedia.org/T101460 | [production] | 
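Maintenance scripts like this are run once per wiki; a minimal sketch using the mwscript wrapper, with an illustrative wiki list (the real set of Gather wikis is not shown in the log):

```python
# Hypothetical sketch: run the Gather maintenance script on each wiki.
import subprocess

GATHER_WIKIS = ["enwiki", "hewiki"]  # illustrative, not the real list
for wiki in GATHER_WIKIS:
    subprocess.run(["mwscript",
                    "extensions/Gather/maintenance/updateCounts.php",
                    "--wiki=" + wiki], check=True)
```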
            
  | 00:52 | <springle> | Synchronized wmf-config/db-eqiad.php: repool db1057, warm up (duration: 00m 13s) | [production] | 
            
  | 00:46 | <godog> | killed bacula-fd on graphite1001, shouldn't be running and consuming bandwidth (cc akosiaris) | [production] | 
            
  | 00:27 | <godog> | kill python stats on cp1052, filling /tmp | [production] | 
            
  
    | 2015-06-15
      
    
  | 23:42 | <ori> | Cleaning up renamed jobqueue metrics on graphite{1,2}001 | [production] | 
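After a metric rename (see the 19:07 jobqueue change below), the whisper files under the old keys linger on disk until deleted by hand. A sketch of that cleanup, with the carbon data path and old key name both assumed:

```python
# Hypothetical sketch: drop whisper data left under a renamed key.
import shutil
from pathlib import Path

# Debian carbon default path and a placeholder metric directory.
old = Path("/var/lib/carbon/whisper/MediaWiki/jobqueue-old")
if old.exists():
    shutil.rmtree(old)
```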
            
  | 23:01 | <godog> | killed bacula-fd on graphite2001, shouldn't be running and consuming bandwidth (cc akosiaris) | [production] | 
            
  | 22:54 | <hoo> | Synchronized wmf-config/filebackend.php: Fix commons image inclusion after commons went https only (duration: 00m 14s) | [production] | 
            
  | 22:18 | <godog> | run disk stress-test on restbase1007 / restbase1009 | [production] | 
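One way to run such a stress test; the tool choice (fio) and every parameter here are assumptions, since the log does not say what was run:

```python
# Hypothetical sketch: sustained random-write disk stress with fio.
import subprocess

subprocess.run(["fio", "--name=stress", "--rw=randwrite", "--direct=1",
                "--bs=4k", "--size=4g", "--numjobs=4",
                "--runtime=600", "--time_based"], check=True)
```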
            
  | 22:06 | <twentyafterfour> | Synchronized hhvm-fatal-error.php: deploy: Guard header() call in error page (duration: 00m 15s) | [production] | 
            
  | 22:05 | <twentyafterfour> | Synchronized wmf-config/InitialiseSettings-labs.php: deploy: Never use wgServer/wgCanonicalServer values from production in labs (duration: 00m 12s) | [production] | 
            
  | 20:37 | <yurik> | Synchronized docroot/bits/WikipediaMobileFirefoxOS: Bumping FirefoxOS app to latest (duration: 00m 14s) | [production] | 
            
  | 20:30 | <godog> | bounce cassandra on restbase1003 | [production] | 
            
  | 20:18 | <godog> | start cassandra on restbase1008, bootstrapping | [production] | 
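A bootstrapping node shows up in nodetool status as UJ (Up/Joining) and flips to UN (Up/Normal) once it has streamed its data. A minimal sketch that polls for that:

```python
# Hypothetical sketch: wait for the bootstrapping node to join.
import subprocess
import time

while True:
    out = subprocess.run(["nodetool", "status"],
                         capture_output=True, text=True).stdout
    if "UJ" not in out:  # no node still joining
        break
    time.sleep(60)
```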
            
  | 20:04 | <godog> | sign restbase1008 key, run puppet | [production] | 
            
  | 20:00 | <godog> | powercycle restbase1007, investigate disk issue | [production] | 
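Powercycling an unresponsive host goes through its management interface; a sketch with ipmitool, hostname and credentials assumed (the same applies to the graphite1002 powercycle in the 16:48 entry below):

```python
# Hypothetical sketch: power-cycle a wedged host over IPMI.
import subprocess

subprocess.run(["ipmitool", "-I", "lanplus",
                "-H", "restbase1007.mgmt.eqiad.wmnet",  # assumed mgmt name
                "-U", "root", "-E",  # password read from IPMI_PASSWORD
                "chassis", "power", "cycle"], check=True)
```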
            
  | 19:07 | <ori> | Synchronized php-1.26wmf9/includes/jobqueue: 0a32aa3be4: jobqueue: use more sensible metric key names (duration: 00m 13s) | [production] | 
            
  | 16:57 | <thcipriani> | Synchronized wmf-config/InitialiseSettings.php: SWAT:  Grant cloudadmins the 'editallhiera' right [[gerrit:218115]] (duration: 00m 14s) | [production] | 
            
  | 16:49 | <thcipriani> | Synchronized php-1.26wmf9/extensions/OpenStackManager/OpenStackManagerHooks.php: SWAT: refer to user the right way (duration: 00m 13s) | [production] | 
            
  | 16:48 | <godog> | powercycle graphite1002, no ssh, unresponsive console | [production] | 
            
  | 16:19 | <jynus> | upgrading es1005 mysql service while depooled | [production] | 
            
  | 16:12 | <thcipriani> | Synchronized wmf-config/InitialiseSettings.php: SWAT:  Grant cloudadmins the 'editallhiera' right [[gerrit:218115]] (duration: 00m 12s) | [production] | 
            
  | 16:10 | <bblack> | pybal restarts complete, all ok | [production] | 
            
  | 16:09 | <thcipriani> | Finished scap: SWAT: Openstack manager and language updates (duration: 21m 27s) | [production] | 
            
  | 15:47 | <thcipriani> | Started scap: SWAT: Openstack manager and language updates | [production] | 
            
  | 15:46 | <bblack> | starting pybal restart process for config changes ( https://gerrit.wikimedia.org/r/#/c/218285/ ), inactives first w/ manual verification of ok-ness | [production] | 
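A sketch of that restart procedure: do the inactive balancers first, pausing for manual verification between hosts. The host split and service invocation are assumptions:

```python
# Hypothetical sketch: rolling pybal restart, inactives first.
import subprocess

INACTIVE = ["lvs1007.eqiad.wmnet", "lvs1008.eqiad.wmnet"]  # assumed
ACTIVE = ["lvs1001.eqiad.wmnet", "lvs1002.eqiad.wmnet"]    # assumed

for host in INACTIVE + ACTIVE:
    subprocess.run(["ssh", host, "sudo", "service", "pybal", "restart"],
                   check=True)
    input("verify %s is ok, then press enter " % host)
```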
            
  | 15:11 | <bblack> | rebooting cp3041 (downtimed) | [production] | 
            
  | 15:00 | <_joe_> | ES is green | [production] | 
            
  | 14:38 | <aude> | Synchronized php-1.26wmf9/extensions/Wikidata: Fix property label constraints bug (duration: 00m 24s) | [production] | 
            
  | 14:27 | <aude> | Synchronized arbitraryaccess.dblist: Enable arbitrary access on s7 wikis (duration: 00m 13s) | [production] | 
            
| 13:47 | <jynus> | enabling puppet on all elastic* nodes, which should also enable ganglia | [production] | 
            
  | 13:11 | <demon> | Synchronized wmf-config/PoolCounterSettings-common.php: all the search (duration: 00m 12s) | [production] | 
            
| 13:04 | <_joe_> | scaling the ES recovery index bandwidth back down to 20mb/s | [production] | 
            
  | 12:52 | <demon> | Synchronized wmf-config/PoolCounterSettings-common.php: partially turn search back on (duration: 00m 13s) | [production] | 
            
| 11:54 | <_joe_> | raised the ES index replica bandwidth limit to 60mb/s | [production] | 
            
  | 11:31 | <akosiaris> | migrating etherpad.wikimedia.org to etherpad1001.eqiad.wmnet | [production] | 
            
| 11:15 | <_joe_> | raised the max bytes per second for ES recovery to 40mb/s | [production] | 
            
  | 10:49 | <manybubbles> | and we're yellow right now. | [production] | 
            
  | 10:49 | <manybubbles> | the initial primaries stage - the red stage of the rolling restart - recovers quick-ish | [production] |