2009-07-13

07:35 <midom> synchronized php-1.5/mc-pmtpa.php [production]
07:24 <midom> synchronized php-1.5/mc-pmtpa.php 'swapping out srv81' [production]
04:11 <Tim> fixed /opt/local/bin/zfs-replicate on ms1 to write the snapshot number before starting replication, avoiding the permanent "dataset already exists" error after a failure [production]
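A minimal sketch of the ordering fix described above, assuming a numeric state file and a plain incremental send/receive loop; the actual /opt/local/bin/zfs-replicate, its state-file path, and the replication target are not shown in the log:

    #!/bin/sh
    # Hedged sketch (not the real zfs-replicate): persist the new snapshot
    # number *before* sending, so a crash mid-replication doesn't leave the
    # script re-sending a snapshot the receiver already holds, which
    # zfs recv rejects with "dataset already exists".
    STATE=/var/run/zfs-replicate.last     # assumed state-file location
    DATASET=export/upload                 # dataset name taken from the log
    PREV=$(cat "$STATE")
    NEXT=$((PREV + 1))
    zfs snapshot "$DATASET@repl-$NEXT"
    echo "$NEXT" > "$STATE"               # record the number first...
    zfs send -i "$DATASET@repl-$PREV" "$DATASET@repl-$NEXT" \
      | ssh backup-host zfs recv "$DATASET"   # ...then replicate (host hypothetical)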
            
02:16 <brion> -> https://bugzilla.wikimedia.org/show_bug.cgi?id=19683 [production]
02:12 <brion> sync-common script doesn't work on NFS-free apaches; language lists etc. are not being updated. Deployment scripts need to be fixed? [production]
02:03 <brion> srv159 is absurdly loaded/lagged, wtf? [production]
01:58 <brion> reports of servers with old config, seeing "doesn't exist" for the new mhr.wikipedia. checking... [production]
01:16 <brion> so far so good; CPU graphs on image scalers and ms1 look clean, and I can purge thumbs on Commons OK [production]
01:10 <brion> switching image scalers back in for a bit, to see if they go right back to the old pattern or not [production]
01:03 <brion> load on ms1 has fallen hugely; outgoing network is way up. looks like we're serving HTTP images fine... of course scaling's dead :P [production]
00:59 <brion> stopping apache on the image scaler boxes to see what that does [production]
00:49 <brion> attempting to replicate domas's earlier temporary success dropping the oldest snapshot (the last one dropped was 4/13): zfs destroy export/upload@weekly-2009-04-20_03:30:00 [production]
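For context, the oldest remaining snapshot can be picked out by sorting on creation time; a small sketch (dataset and snapshot names follow the entry above):

    # List snapshots oldest-first to find the next removal candidate:
    zfs list -t snapshot -o name,used -s creation | head
    # Then drop the oldest one, as in the entry above:
    zfs destroy export/upload@weekly-2009-04-20_03:30:00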
            
00:45 <brion> restarting nfs server [production]
00:44 <brion> stopping nfs server, restarting web server [production]
00:40 <brion> restarting nfs server on ms1 [production]
00:36 <brion> so far this doesn't seem to have changed the NFS access delays on the image scalers [production]
00:31 <brion> shutting down webserver7 on ms1 [production]
00:23 <brion> investigating site problem reports. image server stack seems overloaded, causing intermittent timeouts on NFS to the apaches and on HTTP/squid to the outside [production]
  
2009-07-12

20:30 <domas> dropped a few snapshots on ms1; observed a sharp %sys decrease and much better NFS behavior immediately [production]
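The %sys drop would have been visible with the stock Solaris observability tools; a quick sketch of how one might watch it (the log doesn't say which tool was actually used):

    mpstat 5    # per-CPU view; watch the 'sys' column
    vmstat 5    # system-wide view; 'sy' under the cpu columns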
            
20:05 <domas> we seem to be hitting an issue similar to http://www.opensolaris.org/jive/thread.jspa?messageID=64379 on ms1 [production]
18:55 <domas> zil_disable=1 on ms1 [production]
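On OpenSolaris of that era the ZIL could be disabled either live via mdb or persistently in /etc/system; a sketch of both, since the entry doesn't say which method was used on ms1:

    echo "zil_disable/W0t1" | mdb -kw              # live; applies to filesystems mounted afterwards
    echo "set zfs:zil_disable = 1" >> /etc/system  # persistent; takes effect on reboot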
            
18:34 <mark> Upgraded pybal on lvs3 [production]
18:16 <mark> Hacked in configurable timeout support for the ProxyFetch monitor of PyBal, and set the renderers' timeout to 60s [production]
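A hypothetical sketch of where such a per-monitor timeout would plug in; PyBal's ProxyFetch monitor is configured via proxyfetch.url keys, but the config path, URL, and especially the timeout key name here are assumptions, since the support was just hacked in:

    # Append the new setting to the renderers service stanza (all names assumed):
    cat >> /etc/pybal/pybal.conf <<'EOF'
    [renderers]
    monitors = [ "ProxyFetch" ]
    proxyfetch.url = [ "http://localhost/w/thumb.php" ]
    proxyfetch.timeout = 60
    EOF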
            
17:58 <domas> scaler stampedes caused scalers to be depooled by PyBal, directing the stampede at the remaining servers in round-robin fashion, all of them blocking and consuming ms1 SJSWS slots. of course, the high I/O load contributed to this. [production]
17:55 <domas> investigating the LVS-based rolling scaler overload issue; Mark and Tim are heading the effort now ;-) [production]
17:54 <domas> bumped up the ms1 SJSWS thread count [production]
  
2009-07-11

15:45 <mark> Rebooting sq1 [production]
15:31 <Tim> rebooting ms1 [production]
14:54 <Tim> disabled CentralNotice temporarily [production]
14:54 <tstarling> synchronized php-1.5/InitialiseSettings.php 'disabling CentralNotice' [production]
14:53 <tstarling> synchronized php-1.5/InitialiseSettings.php 'disabling CentralAuth' [production]
14:36 <Tim> restarted webserver7 on ms1 [production]
14:22 <Tim> some kind of overload, seems to be image-related [production]
10:09 <midom> synchronized php-1.5/db.php 'db8 doing commons read load, full write though' [production]
09:22 <domas> restarted job queue with externallinks purging code, <3 [production]
09:22 <domas> installed nrpe on db2 :) [production]
09:22 <midom> synchronized php-1.5/db.php 'giving db24 just negligible load for now' [production]
08:38 <midom> synchronized php-1.5/includes/parser/ParserOutput.php 'livemerging r53103:53105' [production]
08:37 <midom> synchronized php-1.5/includes/DefaultSettings.php [production]
  
2009-07-10

21:21 <Fred> added ganglia to db20 [production]
19:58 <azafred> synchronized php-1.5/CommonSettings.php 'removed border=0 from wgCopyrightIcon' [production]
18:58 <Fred> synced the nagios config to reflect the cleanup [production]
18:52 <Fred> cleaned up the node_files for dsh and removed all decommissioned hosts [production]
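dsh node groups are plain one-hostname-per-line files, so this cleanup amounts to filtering; a sketch assuming the stock /etc/dsh/group layout and a hypothetical decommissioned.list:

    # Drop every decommissioned host from a group file (exact whole-line matches):
    grep -Fvxf decommissioned.list /etc/dsh/group/apaches > /tmp/apaches &&
      mv /tmp/apaches /etc/dsh/group/apaches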
            
18:36 <mark> Added DNS entries for srv251-500 [production]
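Generating that many A records is easily scripted; a sketch with a hypothetical zone file and address plan (the real allocation isn't in the log):

    for i in $(seq 251 500); do
      printf 'srv%d\tIN\tA\t10.0.%d.%d\n' "$i" $((i / 256)) $((i % 256))
    done >> pmtpa.wmnet.zone   # zone file name and subnet are assumptions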
            
18:18 <fvassard> synchronized php-1.5/mc-pmtpa.php 'Added a couple spare memcache hosts.' [production]
18:16 <RobH_DC> moved test to srv66 instead [production]
18:08 <RobH_DC> turning srv210 into test.wikipedia.org [production]
17:57 <Andrew> Reactivating UsabilityInitiative globally, too [production]
17:55 <Andrew> Scapping; the back-out diff is in /home/andrew/usability-diff [production]
17:43 <Andrew> Applied r52926 and r52930, and updated Resources and EditToolbar/images [production]