| 2014-07-15 |
  | 19:55 | <Reedy> | Applied extensions/UploadWizard/UploadWizard.sql to rowiki (re bug 59242) | [production] | 
            
  | 18:53 | <manybubbles> | bouncing elastic1018 to pick up new merge policy.  hopefully that'll help with io thrashing | [production] | 
            
  | 17:58 | <ori> | _joe_ deployed jobrunner to all job runners | [production] | 
            
  | 17:40 | <manybubbles> | my last attempt to lower the concurrent traffic for recovery was a failure - tried again and succeeded.  that seems to have fixed the echo service disruption from taking elastic1017 out of service | [production] | 
            
  | 17:37 | <ori> | updated jobrunner to bef32b9120 | [production] | 
            
  | 17:29 | <manybubbles> | elastic1017 went nuts again.  just shutting elasticsearch off on it for now | [production] | 
            
  | 16:25 | <_joe_> | all mw servers updated | [production] | 
            
  | 16:10 | <_joe_> | mw1100 and onwards updated | [production] | 
            
  | 16:00 | <_joe_> | mw1060-mw1099 updated | [production] | 
            
  | 15:58 | <manybubbles> | restarting Elasticsearch on elastic1017 - it's thrashing the disk again.  I'm still not 100% sure why | [production] | 
            
  | 15:57 | <_joe_> | mw1020-mw1059 updated | [production] | 
            
  | 15:53 | <_joe_> | mw101[0-9] updated | [production] | 
            
  | 15:47 | <_joe_> | starting rolling update of all appservers to apache2 2.2.22-1ubuntu1.6, half of them are on 2.2.22-1ubuntu1.5 now | [production] | 
            
  | 15:42 | <manybubbles> | setting the filter cache on one node in the cluster set it on all.  yay, I guess.  Anyway, I'm going to let it soak for a while. | [production] | 
            
  | 15:32 | <manybubbles> | setting filter cache size to 20% on elastic1001 to see if it takes/helps us | [production] | 
            
  | 15:19 | <anomie> | Synchronized wmf-config/: SWAT: Remove dead ULS variable [[gerrit:145861]] (duration: 00m 10s) | [production] | 
            
  | 15:18 | <anomie> | anomie actually committed a live hack someone left on tin (removing db1035) | [production] | 
            
  | 15:16 | <anomie> | updated /a/common to {{Gerrit|I7ca6a16d5}}: Switch jawiki back to lsearchd | [production] | 
            
  | 13:42 | <manybubbles> | Synchronized wmf-config/InitialiseSettings.php: jawiki back to lsearchd (duration: 00m 05s) | [production] | 
            
  | 13:38 | <manybubbles> | elastic1017 had a load average of 60 - was thrashing in io.  bounced Elasticsearch.  let's see if it recovers on its own | [production] | 
            
  | 09:09 | <_joe_> | restarting mailman on sodium, again, for testing | [production] | 
            
  | 08:50 | <godog> | restart mailman on sodium after inodes freed | [production] | 
            
  | 07:27 | <_joe_> | restarted mailman on sodium | [production] | 
            
  | 07:22 | <_joe_> | stopping mailman on sodium for repairing | [production] | 
            
  | 06:54 | <_joe_> | killed jenkins stale process on gallium, stuck in a futex while shutting down | [production] | 
            
  | 04:48 | <springle> | db1035 crash cycle. down for memtest and stuff | [production] | 
            
  | 03:34 | <LocalisationUpdate> | ResourceLoader cache refresh completed at Tue Jul 15 03:33:38 UTC 2014 (duration 33m 37s) | [production] | 
            
  | 03:01 | <LocalisationUpdate> | completed (1.24wmf13) at 2014-07-15 03:00:03+00:00 | [production] | 
            
  | 02:34 | <springle> | Synchronized wmf-config/db-eqiad.php: depool db1035, crashed (duration: 00m 13s) | [production] | 
            
  | 02:30 | <LocalisationUpdate> | completed (1.24wmf12) at 2014-07-15 02:29:02+00:00 | [production] | 
            
  | 02:27 | <springle> | powercycle db1035 unresponsive | [production] | 
            
  
| 2014-07-14 |
  | 23:32 | <mwalker> | Started scap: Updating for SWAT {{gerrit|146304}}, {{gerrit|146306}}, {{gerrit|146149}}, {{gerrit|146165}}, {{gerrit|146166}}, {{gerrit|146282}}, and {{gerrit|146281}}. Also finishing awight's deploy of FundraisingTranslateWorkflow. | [production] | 
            
  | 20:22 | <cscott> | updated Parsoid to version d51e64097bb1b18e356584d4f3ddcfd90a6071ba | [production] | 
            
  | 19:57 | <ori> | postponing jobrunner deployment to tomorrow; ran over time | [production] | 
            
  | 19:45 | <_joe_> | doing the same on mw1064, segfaulted for the same reason | [production] | 
            
  | 19:44 | <_joe_> | killed a lone apache2 child on mw1152, stuck in a futex, after a segfault of another apache process. Restarted apache, now working correctly | [production] | 
            
  | 19:04 | <godog> | re-enabling mailman on sodium, missing list config restored | [production] | 
            
  | 18:49 | <awight> | Synchronized wmf-config: Deploying FundraisingTranslateWorkflow on metawiki (t | [production] | 
            
  | 18:45 | <awight> | Synchronized php-1.24wmf13/extensions/FundraisingTranslateWorkflow: Update FundraisingTranslateWorkflow extension (wmf13) (duration: 00m 05s) | [production] | 
            
  | 18:44 | <awight> | Synchronized php-1.24wmf12/extensions/FundraisingTranslateWorkflow: Update FundraisingTranslateWorkflow extension (duration: 00m 05s) | [production] | 
            
  | 18:15 | <awight> | Synchronized wmf-config: Revert: Deploying FundraisingTranslateWorkflow on metawiki (duration: 00m 04s) | [production] | 
            
  | 18:03 | <awight> | Synchronized wmf-config: Deploying FundraisingTranslateWorkflow on metawiki (duration: 00m 05s) | [production] | 
            
  | 18:03 | <awight> | updated /a/common to {{Gerrit|Ie7599fb6e}}: jawiki gets Cirrus as primary search | [production] | 
            
  | 17:43 | <Krinkle> | npm-cache for integration slaves got corrupted again. Depooling/Repooling integration-slave100{1,2,3} one by one to clear cache and let it warm up again. | [production] | 
            
  | 17:35 | <Krinkle> | Jenkins slaves in labs are unable to reach zuul.eqiad.wmnet | [production] | 
            
  | 17:10 | <andrewbogott> | purging old local-* service group entries from labs ldap (via purgeOldServiceGroups.php) | [production] | 
            
  | 17:05 | <godog> | started mailman on sodium post-reboot | [production] | 
            
  | 17:04 | <demon> | Synchronized wmf-config/InitialiseSettings.php: nlwiki getting cirrus as primary (duration: 00m 04s) | [production] | 
            
  | 15:11 | <manybubbles> | Synchronized wmf-config: SWAT update cirrus settings for commons (duration: 00m 04s) | [production] | 
            
  | 15:04 | <manybubbles> | Synchronized wmf-config: SWAT update cirrus settings for commons (duration: 00m 04s) | [production] |