2011-04-04
18:50 <RobH> updating bugzilla per rt#718 bz#28409 bz#28402 [production]
18:42 <notpeter> added cname etherpad for hooper.wikimedia.org [production]
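The 18:42 entry adds a DNS alias: etherpad as a CNAME pointing at hooper.wikimedia.org. A minimal zone-file sketch of such a record, assuming a BIND-style zone (the record below is illustrative, not copied from the actual wikimedia.org zone):

    ; illustrative BIND-style record: etherpad becomes an alias for hooper
    etherpad    IN    CNAME    hooper.wikimedia.org.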
18:00 <Ryan_Lane> added the wikimedia-fonts package to lucid-wikimedia repo [production]
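Assuming the lucid-wikimedia repo is a reprepro-managed apt repository (not stated in the log), a sketch of how such a package import can look; the base directory and .deb filename are hypothetical:

    # illustrative: import a built package into the lucid-wikimedia distribution
    reprepro -b /srv/wikimedia includedeb lucid-wikimedia wikimedia-fonts_1.0_all.deb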
17:29 <notpeter> adding self to nagios group. rebooterizing nagios. [production]
05:58 <apergos> cleaned up perms on commons/thumb/a/af, left over from interrupted rsync test last night [production]
05:50 <tstarling> synchronized php-1.17/wmf-config/InitialiseSettings.php 'enabling pool counter on all wikis' [production]
04:12 <tstarling> synchronized php-1.17/wmf-config/InitialiseSettings.php 'enabling PoolCounter on testwiki and test2wiki' [production]
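The two InitialiseSettings.php syncs above enable MediaWiki's PoolCounter, which caps how many requests may render the same page at once. A minimal sketch of what such a configuration can look like; the variable names and the PoolCounter_Client class are standard MediaWiki, but the numbers and the daemon address are illustrative, not the actual wmf-config values:

    // illustrative PoolCounter settings (values are not the production ones)
    $wgPoolCounterConf = array(
        'ArticleView' => array(
            'class'    => 'PoolCounter_Client',
            'timeout'  => 15,   // seconds to wait for a work slot
            'workers'  => 2,    // concurrent renders of the same page
            'maxqueue' => 100,  // queued clients beyond which requests fail fast
        ),
    );
    $wgPoolCountClientConf = array(
        'servers' => array( '10.0.0.1:7531' ),  // hypothetical poolcounterd address
        'timeout' => 0.5,
    );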
01:22 <Tim> apache CPU overload lasted ~10 mins, v. high backend request rate, don't know cause, seems to have stopped now [production]
2011-04-03
18:42 <apergos> 8 rsyncs of ms4 thumbs restarted with better perms so scalers can write... in screen as root on ms5. If we start seeing NFS timeouts in the scaler logs please shoot a couple [production]
17:14 <mark> Deployed max-connections on all cache peers for esams.upload squids to their florida parents (current limit 200) [production]
17:00 <mark> Removed the carp weights on the esams backends again, as the weighting was completely screwed up [production]
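Both squid changes above are cache_peer tuning: max-conn caps concurrent connections to a parent cache, and weight (used with the carp option) biases the CARP hash toward a peer. A minimal squid.conf sketch of these options; the hostnames and ports are illustrative, not the actual esams/pmtpa peer list:

    # illustrative cache_peer lines for an esams.upload backend squid
    cache_peer sq41.pmtpa.wmnet parent 3128 3130 carp max-conn=200
    cache_peer sq42.pmtpa.wmnet parent 3128 3130 carp max-conn=200 weight=2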
16:59 <mark> Started knsq13 backend [production]
14:27 <catrope> ran sync-common-all [production]
14:26 <RoanKattouw> Running sync-common-all to deploy r85256 [production]
13:03 <apergos> shot rsyncs on ms5, setting 777 dir perms on all thumbnail dirs (eg e/ef/blablah.jpg) so scalers can write into them [production]
12:53 <apergos> did same for rest of projects and subdirs (777 on hash dirs) [production]
12:47 <apergos> chmod 777 on commons/thumb/*/* on ms5 so that scalers can create directories in there (mismatch of uid apache vs www-data) [production]
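The three entries above loosen permissions on the two-level hash directories (e.g. e/ef/) so the image scalers, running under a different uid than the files' owner, can create and write thumbnails. A sketch of the kind of commands involved, assuming the thumb tree lives under /export/thumbs on ms5 (the exact layout is an assumption):

    # illustrative: open up the hash directories for commons...
    chmod 777 /export/thumbs/wikipedia/commons/thumb/*/*
    # ...and repeat for the remaining projects' thumb trees (hypothetical layout)
    for t in /export/thumbs/*/*/thumb; do chmod 777 "$t"/*/*; done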
11:12 <mark> Raised per-squid connection limit to ms5 from 200 to 400 connections [production]
11:05 <mark> Raised per-squid connection limit to ms5 from 100 to 200 connections [production]
10:55 <mark> Fixed squid loop, the pmtpa.upload squids were using the esams squids as "CARP parents for distant content" [production]
10:29 <mark> Fixed puppet on sq42/43 [production]
09:44 <mark> Lowered FCGI thumb handlers from 90 to 60 again, to reduce concurrency [production]
08:08 <mark> Started 4 more rsyncs (8 total now) [production]
07:49 <mark> Removed mlocate from ms5, puppetising [production]
07:42 <mark> Started 4 rsyncs from ms4 to ms5 (--ignore-existing) [production]
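The rsyncs above copy the existing thumbnail tree from ms4 to ms5 without touching anything ms5 already has. A sketch of what one such invocation can look like; the source host name and paths are assumptions (only ms5's /export/thumbs export is named elsewhere in this log):

    # illustrative: pull thumbs from ms4, skipping files that already exist on ms5
    rsync -a --ignore-existing ms4.pmtpa.wmnet:/export/thumbs/ /export/thumbs/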
07:32 <mark> increased thumb handler count from 60 to 90 [production]
07:11 <mark> Doubled the number of fcgi thumb handlers [production]
07:08 <mark> Turned off logging of 404s to nginx error.log [production]
06:50 <mark> Restarted Apache on the image scalers [production]
06:49 <mark> Reconfigured ms5 to use the 404 thumb handler [production]
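Two of the entries above concern nginx on ms5: handing requests for missing thumbnails (404s) to the thumb handler, and keeping those misses out of error.log. A minimal sketch of that kind of configuration; the location, socket path, and handler name are assumptions, not the actual ms5 config:

    # illustrative nginx snippet: route missing thumbs to an FCGI handler
    location /thumbs/ {
        log_not_found  off;            # don't write missing-file 404s to error.log
        error_page     404 = @thumb;   # hand the miss to the generator instead
    }
    location @thumb {
        include        fastcgi_params;
        fastcgi_pass   unix:/tmp/thumb-handler.sock;   # hypothetical FCGI socket
    }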
06:48 <Ryan_Lane> disabling nfs on ms4 [production]
06:33 <mark> Running puppet on all apaches to fix fstab and mount ms5.pmtpa.wmnet:/export/thumbs [production]
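The puppet run above points the apaches' thumbs mount at ms5. A sketch of the corresponding fstab line; the /mnt/thumbs mount point and the ms5.pmtpa.wmnet:/export/thumbs export come from the log, while the mount options are illustrative:

    # illustrative /etc/fstab entry on the apaches and scalers
    ms5.pmtpa.wmnet:/export/thumbs  /mnt/thumbs  nfs  rw,hard,intr,nfsvers=3  0  0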
06:32 <mark> Unmounting /mnt/thumbs on all mediawiki-installation servers [production]
06:30 <mark> Remounted NFS /mnt/thumbs on the scalers to ms5 [production]
06:28 <Ryan_Lane> bringing nfs back up [production]
06:28 <Ryan_Lane> brought ms4 back up. stopping the web server service and nfs [production]
06:20 <mark> Set up NFS kernel server on ms5 [production]
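For the apaches and scalers to mount ms5, the new NFS kernel server needs /export/thumbs exported to them. A sketch of an /etc/exports line of this kind; the client network and options are assumptions:

    # illustrative /etc/exports entry on ms5
    /export/thumbs  10.0.0.0/16(rw,async,no_subtree_check,no_root_squash)
    # reload the export table after editing:
    exportfs -ra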
06:18 <Ryan_Lane> powercycling ms4 [production]
05:29 <Ryan_Lane> rebooting ms4 with -d to get a coredump [production]
05:14 <apergos> re-enabling webserver on ms4 for testing [production]
04:45 <apergos> stopping web service on ms4 for the moment [production]
04:29 <apergos> shot webserver again [production]
04:26 <apergos> turned off hourly snaps on ms4, turned back on webserver and nfs [production]
04:09 <apergos> rebooted ms4, shut down webserver and nfsd temporarily for testing [production]
02:58 <apergos> still looking at kernel memory issues, still rebooting, ryan should be here in a few minutes to help out [production]
02:03 <apergos> a solaris advisor... also have zfs arc cache max to 2g which is ridiculously low but wtf right? [production]
02:02 <apergos> set tcp_time_wait_interval to 10000 at suggestion of [production]
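ms4 appears to be a Solaris/ZFS host (the log mentions a Solaris advisor and the ZFS ARC), and tcp_time_wait_interval is the Solaris TCP tunable set at runtime with ndd. A sketch of the command behind the 02:02 entry; the 10000 ms value comes from the log, the rest is standard Solaris usage:

    # illustrative: shorten TIME_WAIT to 10 seconds on Solaris (runtime-only change)
    ndd -set /dev/tcp tcp_time_wait_interval 10000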
01:37 <apergos> lowered zfs arc max to 2g (someone should reset this later)... will take effect on next reboot [production]
00:29 <apergos> rebooting with the new zfs arc cache max value, which will reduce the min value as well... dunno if this will give us enough breathing room or not [production]
00:24 <apergos> set zfs arc cache to ridiculously low value of 4gb, since when it's healthy it's using much less than that (1gb), this will take effect on reboot [production]
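On Solaris the ZFS ARC ceiling is capped persistently in /etc/system, which matches the "will take effect on next reboot" notes above. A sketch of such an entry; the 2 GB value mirrors the 01:37 entry, and the syntax is standard /etc/system usage (0x80000000 bytes = 2 GB):

    * illustrative /etc/system line capping the ZFS ARC, applied at boot
    set zfs:zfs_arc_max = 0x80000000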