| 2009-07-13 |

  | 01:03 | <brion> | load on ms1 has fallen hugely; outgoing network is way up. looks like we're serving out http images fine... of course scaling's dead :P | [production] | 
            
  | 00:59 | <brion> | stopping apache on image scaler boxes, see what that does | [production] | 
            
  | 00:49 | <brion> | attempting to replicate domas's earlier temp success dropping oldest snapshot (last was 4/13): zfs destroy export/upload@weekly-2009-04-20_03:30:00 | [production] | 
            
  | 00:45 | <brion> | restarting nfs server | [production] | 
            
  | 00:44 | <brion> | stopping nfs server, restarting web server | [production] | 
            
  | 00:40 | <brion> | restarting nfs server on ms1 | [production] | 
            
  | 00:36 | <brion> | doesn't seem so far to have changed the NFS access delays on image scalers. | [production] | 
            
  | 00:31 | <brion> | shutting down webserver7 on ms1 | [production] | 
            
  | 00:23 | <brion> | investigating site problem reports. image server stack seems overloaded, so intermittent timeouts on nfs to apaches or http/squid to outside | [production] | 
            
  
  | 2009-07-12 |

  | 20:30 | <domas> | dropped few snapshots on ms1, observed sharp %sys decrease and much better nfs properties immediately | [production] | 
            
  | 20:05 | <domas> | we seem to be hitting issue similar to http://www.opensolaris.org/jive/thread.jspa?messageID=64379 on ms1 | [production] | 
            
  | 18:55 | <domas> | zil_disable=1 on ms1 | [production] | 
            
  | 18:34 | <mark> | Upgraded pybal on lvs3 | [production] | 
            
  | 18:16 | <mark> | Hacked in configurable timeout support for the ProxyFetch monitor of PyBal, set the renderers timeout at 60s | [production] | 
            
  | 17:58 | <domas> | scaler stampedes caused scalers to be depooled by pybal, thus directing stampede to other server in round-robin fashion, all blocking and consuming ms1 SJSWS slots. of course, high I/O load contributed to this. | [production] | 
            
  | 17:55 | <domas> | investigating LVS-based rolling scaler overload issue, Mark and Tim heading the effort now ;-) | [production] | 
            
  | 17:54 | <domas> | bumped up ms1 SJSWS thread count | [production] | 
            
  
  | 2009-07-11 |

  | 15:45 | <mark> | Rebooting sq1 | [production] | 
            
  | 15:31 | <Tim> | rebooting ms1 | [production] | 
            
  | 14:54 | <Tim> | disabled CentralNotice temporarily | [production] | 
            
  | 14:54 | <tstarling> | synchronized php-1.5/InitialiseSettings.php  'disabling CentralNotice' | [production] | 
            
  | 14:53 | <tstarling> | synchronized php-1.5/InitialiseSettings.php  'disabling CentralAuth' | [production] | 
            
  | 14:36 | <Tim> | restarted webserver7 on ms1 | [production] | 
            
  | 14:22 | <Tim> | some kind of overload, seems to be image related | [production] | 
            
  | 10:09 | <midom> | synchronized php-1.5/db.php  'db8 doing commons read load, full write though' | [production] | 
            
  | 09:22 | <domas> | restarted job queue with externallinks purging code, <3 | [production] | 
            
  | 09:22 | <domas> | installed nrpe on db2 :) | [production] | 
            
  | 09:22 | <midom> | synchronized php-1.5/db.php  'giving db24 just negligible load for now' | [production] | 
            
  | 08:38 | <midom> | synchronized php-1.5/includes/parser/ParserOutput.php  'livemerging r53103:53105' | [production] | 
            
  | 08:37 | <midom> | synchronized php-1.5/includes/DefaultSettings.php | [production] | 
            
  
  | 2009-07-10 |

  | 21:21 | <Fred> | added ganglia to db20 | [production] | 
            
  | 19:58 | <azafred> | synchronized php-1.5/CommonSettings.php  'removed border=0 from wgCopyrightIcon' | [production] | 
            
  | 18:58 | <Fred> | synched nagios config to reflect cleanup. | [production] | 
            
  | 18:52 | <Fred> | cleaned up the node_files for dsh and removed all decommissioned hosts. | [production] | 
            
  | 18:36 | <mark> | Added DNS entries for srv251-500 | [production] | 
            
  | 18:18 | <fvassard> | synchronized php-1.5/mc-pmtpa.php  'Added a couple spare memcache hosts.' | [production] | 
            
  | 18:16 | <RobH_DC> | moved test to srv66 instead. | [production] | 
            
  | 18:08 | <RobH_DC> | turning srv210 into test.wikipedia.org | [production] | 
            
  | 17:57 | <Andrew> | Reactivating UsabilityInitiative globally, too. | [production] | 
            
  | 17:55 | <Andrew> | Scapping, back-out diff is in /home/andrew/usability-diff | [production] | 
            
  | 17:43 | <Andrew> | Apply r52926, r52930, and update Resources and EditToolbar/images | [production] | 
            
  | 16:44 | <Fred> | reinstalled and configured gmond on storage1. | [production] | 
            
  | 15:08 | <Rob> | upgraded blog and techblog to wordpress 2.8.1 | [production] | 
            
  | 13:58 | <midom> | synchronized php-1.5/includes/api/ApiQueryCategoryMembers.php  'hello, fix\\!' | [production] | 
            
  | 12:40 | <Tim> | prototype.wikimedia.org is in OOM death, nagios reports down 3 hours, still responsive on shell so I will try a light touch | [production] | 
            
  | 11:08 | <tstarling> | synchronized php-1.5/mc-pmtpa.php  'more' | [production] | 
            
  | 10:58 | <Tim> | installed memcached on srv200-srv209 | [production] | 
            
  | 10:51 | <tstarling> | synchronized php-1.5/mc-pmtpa.php  'deployed the 11 available spares, will make some more' | [production] | 
            
  | 10:48 | <Tim> | mctest.php reports 17 servers down out of 78, most from the range that Rob decommissioned | [production] | 
            
  | 10:37 | <Tim> | installed memcached on srv120, srv121, srv122, srv123 | [production] |