2009-01-20

  | 21:45 | <RobH> | killed some runaway processes on db9 that were killing bugzilla | [production] | 
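
(For context, a rough sketch of that kind of cleanup, assuming direct MySQL access to the Bugzilla database host; the hostname, credentials and the 300-second cutoff below are placeholders, not the real values.)

    <?php
    // Hedged sketch: walk the process list on the Bugzilla DB host and
    // kill queries that have been running far too long.
    $db = new mysqli('db9.example.internal', 'admin', 'secret');
    $res = $db->query('SHOW FULL PROCESSLIST');
    while ($row = $res->fetch_assoc()) {
        // Only target long-running queries, not idle or replication threads.
        if ($row['Command'] === 'Query' && (int)$row['Time'] > 300) {
            $db->query('KILL ' . (int)$row['Id']);  // terminate the runaway thread
        }
    }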
            
  | 21:44 | <brion> | stuck long queries on bz again. got rob poking 'em | [production] | 
            
  | 20:31 | <brion> | putting $wgEnotifUseJobQ back for now. the change postdates some of the spikes i'm seeing, but it'll be easier not to have to consider it | [production] | 
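
($wgEnotifUseJobQ is a real MediaWiki switch that moves e-mail notification delivery onto the job queue instead of sending it during the web request. A minimal sketch of the toggle being flipped here, with the file placement assumed:)

    <?php
    // LocalSettings.php / wmf-config style snippet (exact file is an assumption).
    // false = send enotif mail inline (the 19:57 state, while job queue lag was bad)
    // true  = queue enotif mail as jobs again (restored at 20:31)
    $wgEnotifUseJobQ = true;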
            
  | 20:19 | <mark> | Upgraded kernel to 2.6.24-22 on sq22 | [production] | 
            
  | 19:57 | <brion> | disabling $wgEnotifUseJobQ since the lag is ungodly | [production] | 
            
  | 17:58 | <JeLuF> | db2 overloaded, error messages about unreachable DB server have been reported. Nearly all connections on db2 are in status "Sleep" | [production] | 
            
  | 17:21 | <JeLuF> | srv154 is reachable again, current load average is 25, no obvious CPU consuming processes visible | [production] | 
            
  | 17:10 | <JeLuF> | srv154 went down. Replaced its memcached with srv144's memcached | [production] | 
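
(A sketch of what that swap looks like in MediaWiki's memcached pool, $wgMemCachedServers; the ports and neighbouring hosts are assumptions. Substituting the replacement host into the dead node's slot, rather than deleting the entry, keeps the key-to-server mapping for the other slots intact, at least with simple modulo-style hashing.)

    <?php
    // wmf-config style memcached pool (hostnames/ports are illustrative).
    $wgMemCachedServers = array(
        'srv141:11000',
        'srv142:11000',
        'srv143:11000',
        // 'srv154:11000',  // went down at 17:10
        'srv144:11000',     // stand-in occupying srv154's slot
    );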
            
  | 03:02 | <brion> | syncing InitialiseSettings -- re-enabling CentralNotice, which we'd temporarily taken out during the upload breakage | [production] | 
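
(Roughly what "re-enabling" means here: flip the per-wiki flag in the settings file, then sync it out to the apaches. The setting name below is a placeholder, not necessarily the real 2009 key; $wgConf->settings is where InitialiseSettings-style per-wiki values live.)

    <?php
    // InitialiseSettings.php-style sketch; 'wmgUseCentralNotice' is an
    // illustrative name for whatever flag actually gated CentralNotice.
    // Assumes the usual $wgConf SiteConfiguration object is in scope.
    $wgConf->settings['wmgUseCentralNotice'] = array(
        'default' => true,   // was switched off during the upload breakage
    );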
            
  | 01:50 | <Tim> | exim4 on lily died while I examined reports of breakage, restarted it | [production] | 
            
  
2009-01-15

  | 21:16 | <brion> | seems magically better now | [production] | 
            
  | 20:48 | <brion> | ok webserver7 started | [production] | 
            
  | 20:43 | <brion> | per mark's recommendation, retrying webserver7 now that we've reduced hit rate and are past peak... | [production] | 
            
  | 20:28 | <brion> | bumping styles back to apaches | [production] | 
            
  | 20:25 | <brion> | restarted w/ some old server config bits commented out | [production] | 
            
  | 20:24 | <brion> | tom recompiled lighty w/ the solaris bug patch. may or may not be workin' better, but still not throwing a lot of reqs through. checking config... | [production] | 
            
  | 19:48 | <brion> | trying webserver7 again to see if it's still doing the funk and if we can measure something useful | [production] | 
            
  | 19:47 | <brion> | we're gonna poke around http://redmine.lighttpd.net/issues/show/673 but we're really not sure what the original problem was to begin with yet | [production] | 
            
  | 19:39 | <brion> | turning lighty back on, gonna poke it some more | [production] | 
            
  | 19:31 | <brion> | stopping lighty again. not sure what the hell is going on, but it seems not to respond to most requests | [production] | 
            
  | 19:27 | <brion> | image scalers are still doing wayyy under what they're supposed to, but they are churning some stuff out. not overloaded that i can see... | [production] | 
            
  | 19:20 | <brion> | seems to spawn its php-cgi's ok | [production] | 
            
  | 19:19 | <brion> | trying to stop lighty to poke at fastcgi again | [production] | 
            
  | 19:15 | <brion> | looks like ms1+lighty is successfully serving images, but failing to hit the scaling backends. possible fastcgi buggage | [production] | 
            
  | 19:12 | <brion> | started lighty on ms1 a bit ago. not really sure if it's configured right | [production] | 
            
  | 19:00 | <brion> | stopping it again. confirmed load spike still going on | [production] | 
            
  | 18:58 | <brion> | restarting webserver on ms1, see what happens | [production] | 
            
  | 18:56 | <brion> | apache load seems to have dropped back to normal | [production] | 
            
  | 18:48 | <brion> | switching stylepath back to upload (should be cached), seeing if that affects apache load | [production] |
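
(The knob being flipped back and forth through the afternoon, at 18:48 and again at 20:28, is MediaWiki's $wgStylePath, i.e. where skin CSS/JS is loaded from; the exact URL and path below are assumptions.)

    <?php
    // 18:48 state: point skins at the upload/cache host so style requests are cached
    $wgStylePath = 'http://upload.wikimedia.org/skins';
    // 20:28 state: bump styles back to the apaches
    // $wgStylePath = '/skins-1.5';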