| 2009-07-13 |

  | 21:31 | <Rob> | pushing dns update to fix management ips for new apaches | [production] | 
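A quick way to verify a push like this is to query the authoritative nameserver directly and compare it with what the resolvers return; the hostname and nameserver below are illustrative, not taken from the log.

    # Illustrative check: ask an authoritative server for the management record
    # of one of the new apaches (names below are assumptions).
    dig +short srv100.mgmt.pmtpa.wmnet @ns0.wikimedia.org
    # Compare against what the resolvers hand out once the update has propagated:
    dig +short srv100.mgmt.pmtpa.wmnet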
            
  | 19:05 | <Fred> | added storage3 to ganglia monitor. | [production] | 
            
  | 18:50 | <brion> | synchronized php-1.5/abusefilter.php  'Disable dewiki missingsummary, mysteriously in abusefilter section. Per bug 19208' | [production] | 
            
  | 16:30 | <Fred> | installed wikimedia-nis-client on srv66 and mounted /home. | [production] |
            
  | 16:28 | <brion> | synchronized php-1.5/InitialiseSettings.php  'fixing wikispecies RC-IRC prefix to species.wikimedia' | [production] | 
            
  | 16:27 | <brion> | test wiki was apparently moved from dead srv35 to srv66, which has the new NFS-less config; thus it fails, since test runs from NFS | [production] |
            
  | 16:24 | <brion> | test wiki borked; reported down for several days now :) investigating | [production] | 
            
  | 15:12 | <midom> | synchronized php-1.5/db.php  'db26 raid issues' | [production] | 
            
  | 14:55 | <midom> | synchronized php-1.5/db.php  'db3 and db5 coming live as commons servers' | [production] | 
            
  | 14:13 | <domas> | dropped a few more snapshots, as %sys was increasing on ms1... | [production] |
            
  | 11:16 | <domas> | manually restarted a plethora of failing apaches (direct segfaults and other possible APC corruption, leading to PHP OOM errors) | [production] |
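A rough sketch of that kind of sweep, assuming a simple host list and standard Apache log paths (none of these names come from the log):

    # Hypothetical sweep: restart apache on hosts whose error log shows recent
    # segfaults or PHP out-of-memory errors (a common symptom of APC corruption).
    for host in $(cat /etc/apache-hosts); do        # assumed host list
      hits=$(ssh "$host" "egrep -c 'Segmentation fault|Allowed memory size' /var/log/apache2/error.log")
      if [ "${hits:-0}" -gt 0 ]; then
        echo "restarting apache on $host ($hits hits)"
        ssh "$host" 'apache2ctl restart'            # or /etc/init.d/apache2 restart
      fi
    done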
            
  | 09:50 | <tstarling> | synchronized php-1.5/includes/specials/SpecialBlockip.php | [production] | 
            
  | 09:00 | <Tim> | restarted apache2 on image scalers | [production] | 
            
  | 08:39 | <tstarling> | synchronized php-1.5/includes/Math.php  'statless render hack' | [production] | 
            
  | 08:05 | <Tim> | killed all image scalers to see if that helps with ms1 load | [production] | 
            
  | 08:00 | <Tim> | killed waiting apache processes | [production] | 
            
  | 07:35 | <midom> | synchronized php-1.5/mc-pmtpa.php | [production] | 
            
  | 07:24 | <midom> | synchronized php-1.5/mc-pmtpa.php  'swapping out srv81' | [production] | 
            
  | 04:11 | <Tim> | fixed /opt/local/bin/zfs-replicate on ms1 to write the snapshot number before starting replication, to avoid permanent error "dataset already exists" after failure | [production] | 
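The essence of the fix: persist the snapshot counter before the transfer starts, so a failed run is retried under a fresh snapshot name instead of re-sending one that already exists on the receiver. A minimal sketch of that ordering, with assumed paths, snapshot naming, and target (the real /opt/local/bin/zfs-replicate is not reproduced here):

    #!/bin/sh
    # Sketch only: record the next snapshot number *before* replicating.
    SNAPFILE=/var/run/zfs-replicate.snapnum     # assumed state file
    DATASET=export/upload                       # dataset name from the log
    PREV=$(cat "$SNAPFILE" 2>/dev/null || echo 0)
    NUM=$((PREV + 1))
    echo "$NUM" > "$SNAPFILE"                   # write first -- the fix described above
    zfs snapshot "${DATASET}@repl-${NUM}"
    zfs send -i "${DATASET}@repl-${PREV}" "${DATASET}@repl-${NUM}" \
      | ssh backup-host zfs receive -F tank/upload   # assumed replication target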
            
  | 02:16 | <brion> | -> https://bugzilla.wikimedia.org/show_bug.cgi?id=19683 | [production] | 
            
  | 02:12 | <brion> | sync-common script doesn't work on NFS-free apaches; language lists etc. are not being updated. Deployment scripts need to be fixed? | [production] |
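For background: sync-common historically copied the shared MediaWiki configuration tree onto the local apache; with NFS gone, that copy has to be pulled over the network instead. A hedged sketch of what an NFS-free pull could look like (the deploy host, rsync module, and paths are assumptions):

    # Hypothetical NFS-free sync: pull the common tree from a deployment host
    # with rsync instead of reading it from the NFS mount.
    rsync -a --delete deploy-host::common/php-1.5/ /usr/local/apache/common-local/php-1.5/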
            
  | 02:03 | <brion> | srv159 is absurdly loaded/lagged wtf? | [production] | 
            
  | 01:58 | <brion> | reports of servers with old config, seeing "doesn't exist" for new mhr.wikipedia. checking... | [production] | 
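One hedged way to find apaches still serving the old config is to compare checksums of InitialiseSettings.php across the pool; the host list and path prefixes below are assumptions:

    # Compare the deployed config checksum against a reference copy; any host
    # whose hash differs is still running the old configuration.
    REF=$(md5sum /home/wikipedia/common/php-1.5/InitialiseSettings.php | awk '{print $1}')
    for host in $(cat /etc/apache-hosts); do        # assumed host list
      h=$(ssh "$host" "md5sum /usr/local/apache/common/php-1.5/InitialiseSettings.php" | awk '{print $1}')
      [ "$h" = "$REF" ] || echo "$host: stale config"
    done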
            
  | 01:16 | <brion> | so far so good; CPU graphs on image scalers and ms1 look clean, and I can purge thumbs on commons ok | [production] | 
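The purge check mentioned here can also be exercised from a shell; a hedged example using MediaWiki's action=purge on a file description page (the file title is made up, and the URL forms may differ from what was actually used):

    # Ask MediaWiki to purge a file page (which also invalidates its thumbnails),
    # then request a thumbnail again to confirm it regenerates cleanly.
    curl -s -o /dev/null -w '%{http_code}\n' -X POST \
      'http://commons.wikimedia.org/w/index.php?title=File:Example.jpg&action=purge'
    curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' \
      'http://commons.wikimedia.org/w/thumb.php?f=Example.jpg&w=120'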
            
  | 01:10 | <brion> | trying switching image scalers back in for a few, see if they go right back to old pattern or not | [production] | 
            
  | 01:03 | <brion> | load on ms1 has fallen hugely; outgoing network is way up. looks like we're serving out http images fine... of course scaling's dead :P | [production] | 
            
  | 00:59 | <brion> | stopping apache on image scaler boxes, see what that does | [production] | 
            
  | 00:49 | <brion> | attempting to replicate domas's earlier temporary success by dropping the oldest snapshot (last was 4/13): zfs destroy export/upload@weekly-2009-04-20_03:30:00 | [production] |
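For reference, a way to confirm which weekly snapshot is currently the oldest before destroying it (the dataset name is taken from the log; the listing flags are standard zfs):

    # List snapshots of the dataset oldest-first and show the oldest weekly one.
    zfs list -t snapshot -o name,used,creation -s creation -r export/upload \
      | grep '@weekly-' | head -1
    # Then drop it explicitly, as in the log entry above:
    zfs destroy export/upload@weekly-2009-04-20_03:30:00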
            
  | 00:45 | <brion> | restarting nfs server | [production] | 
            
  | 00:44 | <brion> | stopping nfs server, restarting web server | [production] | 
            
  | 00:40 | <brion> | restarting nfs server on ms1 | [production] | 
            
  | 00:36 | <brion> | so far this doesn't seem to have changed the NFS access delays on image scalers. | [production] |
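A quick, hedged way to gauge whether NFS access from a scaler is still slow (the mount point is an assumption):

    # Time a simple directory listing over the NFS mount on an image scaler,
    # and check client-side RPC counters for retransmissions and timeouts.
    time ls /mnt/upload > /dev/null       # assumed mount point for the ms1 export
    nfsstat -c | head -20                 # look for growing retrans counts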
            
  | 00:31 | <brion> | shutting down webserver7 on ms1 | [production] | 
            
  | 00:23 | <brion> | investigating site problem reports. image server stack seems overloaded, causing intermittent timeouts both on NFS to the apaches and on HTTP/Squid to the outside | [production] |
            
  
| 2009-07-12 |

  | 20:30 | <domas> | dropped a few snapshots on ms1; observed a sharp %sys decrease and much better NFS behavior immediately | [production] |
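The %sys figure on an OpenSolaris box like ms1 is typically watched with mpstat or vmstat; a minimal example:

    # Watch system CPU time settle after the snapshots are destroyed.
    mpstat 5          # per-CPU view; the "sys" column is the %sys referenced above
    vmstat 5          # system-wide view; see the us/sy/id columns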
            
  | 20:05 | <domas> | we seem to be hitting issue similar to http://www.opensolaris.org/jive/thread.jspa?messageID=64379 on ms1 | [production] | 
            
  | 18:55 | <domas> | zil_disable=1 on ms1 | [production] | 
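On OpenSolaris of that era zil_disable was a live kernel tunable; the two usual ways to set it are sketched below (exact behavior depends on the build, and disabling the ZIL trades synchronous-write durability for latency):

    # Set at runtime via the kernel debugger (affects datasets mounted afterwards):
    echo 'zil_disable/W0t1' | mdb -kw
    # Or persist the setting across reboots in /etc/system:
    echo 'set zfs:zil_disable = 1' >> /etc/system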
            
  | 18:34 | <mark> | Upgraded pybal on lvs3 | [production] | 
            
  | 18:16 | <mark> | Hacked in configurable timeout support for the ProxyFetch monitor of PyBal, set the renderers timeout at 60s | [production] | 
            
  | 17:58 | <domas> | scaler stampedes caused scalers to be depooled by pybal, thus directing the stampede to the other servers in round-robin fashion, all blocking and consuming ms1 SJSWS slots. of course, high I/O load contributed to this. | [production] |
            
  | 17:55 | <domas> | investigating LVS-based rolling scaler overload issue, Mark and Tim heading the effort now ;-) | [production] | 
            
  | 17:54 | <domas> | bumped up ms1 SJSWS thread count | [production] |