2011-04-04
18:42 <notpeter> added cname etherpad for hooper.wikimedia.org [production]
18:00 <Ryan_Lane> added the wikimedia-fonts package to lucid-wikimedia repo [production]
17:29 <notpeter> adding self to nagios group. rebooterizing nagios. [production]
05:58 <apergos> cleaned up perms on commons/thumb/a/af, left over from interrupted rsync test last night [production]
05:50 <tstarling> synchronized php-1.17/wmf-config/InitialiseSettings.php 'enabling pool counter on all wikis' [production]
04:12 <tstarling> synchronized php-1.17/wmf-config/InitialiseSettings.php 'enabling PoolCounter on testwiki and test2wiki' [production]
01:22 <Tim> apache CPU overload lasted ~10 mins, v. high backend request rate, don't know cause, seems to have stopped now [production]
2011-04-03
18:42 <apergos> 8 rsyncs of ms4 thumbs restarted with better perms so scalers can write... in screen as root on ms5. If we start seeing NFS timeouts in the scaler logs please shoot a couple [production]
17:14 <mark> Deployed max-connections on all cache peers for esams.upload squids to their florida parents (current limit 200) [production]
17:00 <mark> Removed the carp weights on the esams backends again, as the weighting was completely screwed up [production]
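(Editor's sketch, not part of the log: the two squid changes above, a per-peer connection cap and dropping the explicit CARP weights, would look roughly like this in squid.conf. The hostname and ports are illustrative, not the actual esams/pmtpa peer list.)

```
# Hypothetical cache_peer line: a CARP parent capped at 200 concurrent
# connections via max-conn. Omitting weight=N falls back to squid's
# default equal CARP weighting, which is what removing the broken
# weights achieved.
cache_peer sq41.pmtpa.wmnet parent 3128 0 carp max-conn=200
```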
16:59 <mark> Started knsq13 backend [production]
14:27 <catrope> ran sync-common-all [production]
14:26 <RoanKattouw> Running sync-common-all to deploy r85256 [production]
13:03 <apergos> shot rsyncs on ms5, setting 777 dir perms on all thumbnail dirs (eg e/ef/blablah.jpg) so scalers can write into them [production]
12:53 <apergos> did same for rest of projects and subdirs (777 on hash dirs) [production]
12:47 <apergos> chmod 777 on commons/thumb/*/* on ms5 so that scalers can create directories in there (mismatch of uid apache vs www-data) [production]
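(Editor's sketch, not part of the log: the perms fix in the entries above, replayed safely against a throwaway tree. The real target was the thumb export on ms5; the path below is illustrative.)

```shell
#!/bin/sh
# Open up the two-level hash directories (e.g. a/af) so a scaler
# running as a different uid (apache vs www-data) can create files in
# them. Run against a temp dir, not ms5.
root=$(mktemp -d)
mkdir -p "$root/commons/thumb/a/af"
chmod 777 "$root"/commons/thumb/*/*
stat -c '%a' "$root/commons/thumb/a/af"   # 777
rm -rf "$root"
```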
11:12 <mark> Raised per-squid connection limit to ms5 from 200 to 400 connections [production]
11:05 <mark> Raised per-squid connection limit to ms5 from 100 to 200 connections [production]
10:55 <mark> Fixed squid loop, the pmtpa.upload squids were using the esams squids as "CARP parents for distant content" [production]
10:29 <mark> Fixed puppet on sq42/43 [production]
09:44 <mark> Lowered FCGI thumb handlers from 90 to 60 again, to reduce concurrency [production]
08:08 <mark> Started 4 more rsyncs (8 total now) [production]
07:49 <mark> Removed mlocate from ms5, puppetising [production]
07:42 <mark> Started 4 rsyncs from ms4 to ms5 (--ignore-existing) [production]
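(Editor's sketch, not part of the log: --ignore-existing is what made these restarts cheap, since rsync skips any file already present on the destination. A minimal local demonstration on throwaway directories, not ms4/ms5.)

```shell
#!/bin/sh
# Show that rsync --ignore-existing never overwrites a file that is
# already on the destination, so an interrupted copy can be restarted
# without re-transferring or clobbering anything.
src=$(mktemp -d); dst=$(mktemp -d)
echo new > "$src/f"
echo old > "$dst/f"
rsync -a --ignore-existing "$src/" "$dst/"
cat "$dst/f"   # old
rm -rf "$src" "$dst"
```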
07:32 <mark> increased thumb handler count from 60 to 90 [production]
07:11 <mark> Doubled the amount of fcgi thumb handlers [production]
07:08 <mark> Turned off logging of 404s to nginx error.log [production]
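(Editor's sketch, not part of the log: nginx has a directive for exactly the change above. The location path is illustrative.)

```
# Stop "file not found" errors for missing thumbs from being written
# to error.log; the 404 responses themselves are unaffected.
location /thumbs/ {
    log_not_found off;
}
```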
06:50 <mark> Restarted Apache on the image scalers [production]
06:49 <mark> Reconfigured ms5 to use the 404 thumb handler [production]
06:48 <Ryan_Lane> disabling nfs on ms4 [production]
06:33 <mark> Running puppet on all apaches to fix fstab and mount ms5.pmtpa.wmnet:/export/thumbs [production]
06:32 <mark> Unmounting /mnt/thumbs on all mediawiki-installation servers [production]
06:30 <mark> Remounted NFS /mnt/thumbs on the scalers to ms5 [production]
06:28 <Ryan_Lane> bring nfs back up [production]
06:28 <Ryan_Lane> brought ms4 back up. stopping the web server service and nfs [production]
06:20 <mark> Setup NFS kernel server on ms5 [production]
06:18 <Ryan_Lane> powercycling ms4 [production]
05:29 <Ryan_Lane> rebooting ms4 with -d to get a coredump [production]
05:14 <apergos> re-enabling webserver on ms4 for testing [production]
04:45 <apergos> stopping web service on ms4 for the moment [production]
04:29 <apergos> shot webserver again [production]
04:26 <apergos> turned off hourly snaps on ms4, turned back on webserver and nfs [production]
04:09 <apergos> rebooted ms4, shut down webserver and nfsd temporarily for testing [production]
02:58 <apergos> still looking at kernel memory issues, still rebooting, ryan should be here in a few minutes to help out [production]
02:03 <apergos> a solaris advisor... also have zfs arc cache max to 2g which is ridiculously low but wtf right? [production]
02:02 <apergos> set tcp_time_wait_interval to 10000 at suggestion of [production]
01:37 <apergos> lowered zfs arc max to 2g (someone should reset this later)... will take effect on next reboot [production]
00:29 <apergos> rebooting with the new zfs arc cache max value, which will reduce the min value as well... dunno if this will give us enough breathing room or not [production]
00:24 <apergos> set zfs arc cache to ridiculously low value of 4gb, since when it's healthy it's using much less than that (1gb), this will take effect on reboot [production]
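(Editor's sketch, not part of the log: on Solaris the ARC cap in the surrounding entries maps to a one-line /etc/system setting, which only takes effect on reboot, hence the reboots logged above. The value shown is the 2 GB cap mentioned at 01:37; this is not the actual file from ms4.)

```
* /etc/system fragment: cap the ZFS ARC at 2 GB (2147483648 bytes).
* Requires a reboot to take effect.
set zfs:zfs_arc_max = 2147483648
```

By contrast, the tcp_time_wait_interval change at 02:02 can be applied live with ndd, roughly `ndd -set /dev/tcp tcp_time_wait_interval 10000`.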
00:22 <Reedy> Still experiencing MS4 issues, thumb service is likely to be problematic for most users [production]