2009-01-20
20:19 <mark> Upgraded kernel to 2.6.24-22 on sq22 [production]
19:57 <brion> disabling $wgEnotifUseJobQ since the lag is ungodly [production]
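
For context, a minimal sketch of the MediaWiki configuration change implied here -- the setting is real, but its file placement and surrounding config are assumed:

    # $wgEnotifUseJobQ defers email notifications to the job queue when true;
    # setting it false makes enotif mail go out synchronously on save,
    # bypassing the badly lagged queue.
    $wgEnotifUseJobQ = false;
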
17:58 <JeLuF> db2 overloaded, error messages about unreachable DB server have been reported. Nearly all connections on db2 are in status "Sleep" [production]
17:21 <JeLuF> srv154 is reachable again; current load average is 25, no obvious CPU-consuming processes visible [production]
17:10 <JeLuF> srv154 went down. Replaced its memcached with srv144's memcached [production]
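
A hypothetical sketch of the memcached pool edit this describes -- $wgMemCachedServers is MediaWiki's real setting, but the actual pool file, port number, and slot layout are assumptions:

    $wgMemCachedServers = array(
        # ...
        # 'srv154:11000',  # srv154 is down; its slot is taken over by
        'srv144:11000',    # srv144's memcached, so the key-to-server
        # ...              # mapping for the rest of the pool stays stable
    );

Swapping the replacement into the dead server's slot, rather than appending it, keeps the other servers' key assignments from shifting.
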
03:02 <brion> syncing InitialiseSettings -- reenabling CentralNotice, which we'd temporarily taken out during the upload breakage [production]
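
A hypothetical sketch of the InitialiseSettings.php toggle being synced -- the per-wiki default array is how that file is structured, but the exact setting name for CentralNotice is assumed:

    'wmgUseCentralNotice' => array(
        'default' => true,  # re-enabled now that the upload breakage is resolved
    ),
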
01:50 <Tim> exim4 on lily died while I was examining reports of breakage; restarted it [production]
2009-01-15
21:16 <brion> seems magically better now [production]
20:48 <brion> ok webserver7 started [production]
20:43 <brion> per mark's recommendation, retrying webserver7 now that we've reduced the hit rate and are past peak... [production]
20:28 <brion> bumping styles back to apaches [production]
20:25 <brion> restarted w/ some old server config bits commented out [production]
20:24 <brion> tom recompiled lighty w/ the solaris bug patch. may or may not be workin' better, but still not pushing a lot of reqs through. checking config... [production]
19:48 <brion> trying webserver7 again to see if it's still doing the funk and if we can measure something useful [production]
19:47 <brion> we're gonna poke around http://redmine.lighttpd.net/issues/show/673 but we're really not sure yet what the original problem was to begin with [production]
19:39 <brion> turning lighty back on, gonna poke it some more [production]
19:31 <brion> stopping lighty again. not sure what the hell is going on, but it seems not to respond to most requests [production]
19:27 <brion> image scalers are still doing wayyy under what they're supposed to, but they are churning some stuff out. not overloaded that i can see... [production]
19:20 <brion> seems to spawn its php-cgi's ok [production]
19:19 <brion> trying to stop lighty to poke at fastcgi again [production]
19:15 <brion> looks like ms1+lighty is successfully serving images, but failing to hit the scaling backends. possible fastcgi buggage [production]
19:12 <brion> started lighty on ms1 a bit ago. not really sure if it's configured right [production]
19:00 <brion> stopping it again. confirmed load spike still going on [production]
18:58 <brion> restarting webserver on ms1, see what happens [production]
18:56 <brion> apache load seems to have dropped back to normal [production]
18:48 <brion> switching stylepath back to upload (should be cached), seeing if that affects apache load [production]
18:40 <brion> switching $wgStylePath to apaches for the moment [production]
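
A sketch of the two states being toggled in this entry and the 18:48 one above -- $wgStylePath is the real MediaWiki setting, but the hostnames and skin paths here are assumptions:

    # Normal state: styles served from the upload cluster, so the squids cache them.
    $wgStylePath = 'http://upload.wikimedia.org/skins-1.5';
    # Temporary fallback: serve styles directly from the apaches instead,
    # taking the upload cluster out of the style-serving path.
    # $wgStylePath = "{$wgScriptPath}/skins-1.5";
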
18:39 <brion> load dropping on ms1; ping time stabilizing also [production]
18:38 <RobH> sq14, sq15, sq16 back up and serving requests [production]
18:38 <brion> trying stopping/starting webserver on ms1 [production]
18:27 <brion> nfs upload5 is not happy :( [production]
18:27 <brion> some sort of issue w/ the media fileserver, we think; perhaps pressure due to some upload squid cache clearing? [production]
18:23 <RobH> sq14-sq16 offline, rebooting and cleaning cache [production]