| 2009-07-13 |
    
  | 08:39 | <tstarling> | synchronized php-1.5/includes/Math.php  'statless render hack' | [production] | 
            
  | 08:05 | <Tim> | killed all image scalers to see if that helps with ms1 load | [production] | 
            
  | 08:00 | <Tim> | killed waiting apache processes | [production] | 
            
  | 07:35 | <midom> | synchronized php-1.5/mc-pmtpa.php | [production] | 
            
  | 07:24 | <midom> | synchronized php-1.5/mc-pmtpa.php  'swapping out srv81' | [production] | 
            
  | 04:11 | <Tim> | fixed /opt/local/bin/zfs-replicate on ms1 to write the snapshot number before starting replication, to avoid permanent error "dataset already exists" after failure | [production] | 
            
  | 02:16 | <brion> | -> https://bugzilla.wikimedia.org/show_bug.cgi?id=19683 | [production] | 
            
  | 02:12 | <brion> | sync-common script doesn't work on nfs-free apaches; language lists etc not being updated. Deployment scripts need to be fixed? | [production] | 
            
  | 02:03 | <brion> | srv159 is absurdly loaded/lagged wtf? | [production] | 
            
  | 01:58 | <brion> | reports of servers with old config, seeing "doesn't exist" for new mhr.wikipedia. checking... | [production] | 
            
  | 01:16 | <brion> | so far so good; CPU graphs on image scalers and ms1 look clean, and I can purge thumbs on commons ok | [production] | 
            
  | 01:10 | <brion> | trying switching image scalers back in for a few, see if they go right back to old pattern or not | [production] | 
            
  | 01:03 | <brion> | load on ms1 has fallen hugely; outgoing network is way up. looks like we're serving out http images fine... of course scaling's dead :P | [production] | 
            
  | 00:59 | <brion> | stopping apache on image scaler boxes, see what that does | [production] | 
            
  | 00:49 | <brion> | attempting to replicate domas's earlier temp success dropping oldest snapshot (last was 4/13): zfs destroy export/upload@weekly-2009-04-20_03:30:00 | [production] | 
            
  | 00:45 | <brion> | restarting nfs server | [production] | 
            
  | 00:44 | <brion> | stopping nfs server, restarting web server | [production] | 
            
  | 00:40 | <brion> | restarting nfs server on ms1 | [production] | 
            
  | 00:36 | <brion> | doesn't seem so far to have changed the NFS access delays on image scalers. | [production] | 
            
  | 00:31 | <brion> | shutting down webserver7 on ms1 | [production] | 
            
  | 00:23 | <brion> | investigating site problem reports. image server stack seems overloaded, so intermittent timeouts on nfs to apaches or http/squid to outside | [production] | 
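The 04:11 entry describes reordering zfs-replicate on ms1 so that the snapshot number is written before replication starts. The script itself is not shown in this log, so what follows is only a minimal Python sketch of that ordering. The names `next_snapshot_number` and `replicate_once`, the `repl-N` naming scheme, and the state file are all hypothetical stand-ins, not the real script's interface.

```python
import os


def next_snapshot_number(state_path):
    """Reserve the next snapshot number and persist it before any other work."""
    try:
        with open(state_path) as f:
            last = int(f.read().strip() or 0)
    except FileNotFoundError:
        last = 0
    number = last + 1
    # Write to a temp file and rename, so the counter is replaced in one step.
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(str(number))
    os.replace(tmp_path, state_path)
    return number


def replicate_once(state_path, send_snapshot):
    """Run one replication attempt under a freshly reserved snapshot name.

    `send_snapshot` is a hypothetical callable standing in for the real
    snapshot-and-send step; it may raise on failure.
    """
    number = next_snapshot_number(state_path)  # persisted first
    name = f"repl-{number}"
    send_snapshot(name)  # if this fails, the counter has already advanced
    return name
```

Because the counter advances even when the send fails, the next run picks a new name instead of colliding with a half-created dataset, which is the kind of permanent "dataset already exists" error the entry mentions.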
            
  
| 2009-07-12 |
    
  | 20:30 | <domas> | dropped few snapshots on ms1, observed sharp %sys decrease and much better nfs properties immediately | [production] | 
            
  | 20:05 | <domas> | we seem to be hitting issue similar to http://www.opensolaris.org/jive/thread.jspa?messageID=64379 on ms1 | [production] | 
            
  | 18:55 | <domas> | zil_disable=1 on ms1 | [production] | 
            
  | 18:34 | <mark> | Upgraded pybal on lvs3 | [production] | 
            
  | 18:16 | <mark> | Hacked in configurable timeout support for the ProxyFetch monitor of PyBal, set the renderers timeout at 60s | [production] | 
            
  | 17:58 | <domas> | scaler stampedes caused scalers to be depooled by pybal, thus directing stampede to other server in round-robin fashion, all blocking and consuming ms1 SJSWS slots. of course, high I/O load contributed to this. | [production] | 
            
  | 17:55 | <domas> | investigating LVS-based rolling scaler overload issue, Mark and Tim heading the effort now ;-) | [production] | 
            
  | 17:54 | <domas> | bumped up ms1 SJSWS thread count | [production] | 
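The 17:58 and 18:16 entries describe a cascade: an overloaded scaler fails its health check and is depooled, its share of requests moves to the remaining scalers, and they then fail in turn. The following is a toy model of that feedback loop, not PyBal's code. The function name, the linear latency formula, and every number except the 60s timeout are assumptions chosen for illustration.

```python
def simulate_depooling(num_scalers, total_load, base_latency,
                       per_unit_latency, monitor_timeout):
    """Return how many scalers stay pooled once the pool stops shrinking.

    Assumed model: load is spread evenly over pooled scalers, and each
    scaler's response time grows linearly with its share of the load.
    """
    pooled = num_scalers
    while pooled > 0:
        load_each = total_load / pooled
        latency = base_latency + per_unit_latency * load_each
        if latency <= monitor_timeout:
            break  # health checks pass, pool is stable
        # The monitor depools a slow scaler; its load shifts to the rest,
        # which makes every remaining scaler slower on the next pass.
        pooled -= 1
    return pooled


# With a short timeout, one slow response is enough to empty the pool.
short = simulate_depooling(8, 80, 1.0, 0.5, monitor_timeout=5.0)
# With a generous timeout (60s, as set in the 18:16 entry), the pool holds.
generous = simulate_depooling(8, 80, 1.0, 0.5, monitor_timeout=60.0)
```

In this model a longer monitor timeout keeps slow but working scalers pooled, so the load is not concentrated onto fewer machines.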
            
  
| 2009-07-11 |
    
  | 15:45 | <mark> | Rebooting sq1 | [production] | 
            
  | 15:31 | <Tim> | rebooting ms1 | [production] | 
            
  | 14:54 | <Tim> | disabled CentralNotice temporarily | [production] | 
            
  | 14:54 | <tstarling> | synchronized php-1.5/InitialiseSettings.php  'disabling CentralNotice' | [production] | 
            
  | 14:53 | <tstarling> | synchronized php-1.5/InitialiseSettings.php  'disabling CentralAuth' | [production] | 
            
  | 14:36 | <Tim> | restarted webserver7 on ms1 | [production] | 
            
  | 14:22 | <Tim> | some kind of overload, seems to be image related | [production] | 
            
  | 10:09 | <midom> | synchronized php-1.5/db.php  'db8 doing commons read load, full write though' | [production] | 
            
  | 09:22 | <domas> | restarted job queue with externallinks purging code, <3 | [production] | 
            
  | 09:22 | <domas> | installed nrpe on db2 :) | [production] | 
            
  | 09:22 | <midom> | synchronized php-1.5/db.php  'giving db24 just negligible load for now' | [production] | 
            
  | 08:38 | <midom> | synchronized php-1.5/includes/parser/ParserOutput.php  'livemerging r53103:53105' | [production] | 
            
  | 08:37 | <midom> | synchronized php-1.5/includes/DefaultSettings.php | [production] |