2010-11-03
00:30 <apergos1> cleaning up space on other squids with a full /: sq42 [production]
2010-11-02
23:22 <apergos> same story on sq50, cleared out some space, tried upping that to 300 but started seeing "TCP connection to 208.80.152.156 (208.80.152.156:80) failed" in the logs so backed off to 200 [production]
23:13 <apergos> trying to adjust max-conn on sq49 for conns to ms4... tried 200, it maxed out. trying 300 now... [production]
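For context: squid caps concurrent connections to a parent with the max-conn= option on its cache_peer line. A rough sketch of the kind of directive being tuned here; the address is taken from the entry above, everything else is an assumption rather than the live config:

    cache_peer 208.80.152.156 parent 80 0 no-query originserver max-conn=200 name=ms4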
23:08 <apergos> hupped squid on sq49, restarted syslog, / was full from "Failed to select source" errors, cleared out some space [production]
23:08 <tfinc> synchronized php-1.5/wmf-config/CommonSettings.php 'Updating sidebar links' [production]
22:40 <apergos> added in the amssq47 through amssq62 to /etc/squid/cachemgr.conf on fenari [production]
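A sketch of how those entries could be appended; cachemgr.conf takes one "host:port" entry per line, but the domain suffix and port here are assumptions:

    for i in $(seq 47 62); do
        echo "amssq${i}.esams.wikimedia.org:80" >> /etc/squid/cachemgr.conf
    done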
19:48 <RobH> torrus back online [production]
19:44 <RobH> following procedure on wikitech to fix torrus [production]
16:46 <RobH> sq42 & sq44 behaving normally now, cleaning cache on sq48 and killing squid for restart as it is flapping and at high load, due to earlier nfs issue [production]
16:38 <RobH> restarting and cleaning backend squid on sq44 and sq42 which were complaining in lvs [production]
16:35 <RobH> sq43 was flapping since the nfs mount on ms4 was borked. restarted it [production]
16:07 <apergos> NFSD_SERVERS=2048 in /etc/default on ms4 [production]
16:06 <apergos> note that the variable rpcmod:cotsmaxdupreqs has been changed to 2048 in /etc/system, and [production]
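Spelled out, assuming ms4 is a Solaris host and that "/etc/default" above means /etc/default/nfs:

    /etc/system:       set rpcmod:cotsmaxdupreqs = 2048    (takes effect at the next boot)
    /etc/default/nfs:  NFSD_SERVERS=2048                   (maximum number of nfsd threads)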
15:54 <apergos> hard reset on ms4, reboot was not getting the job done [production]
15:47 <apergos> rebooting ms4, nfsd hung and couldn't be restarted or killed. [production]
14:04 <RobH> restarted pdns on linne due to crash from authdns update [production]
14:02 <RobH> updated dns with new mgmt entries for payments, owasrvs, and owadbs [production]
03:45 <domas> added srv193 back to apaches pool on lvs [production]
2010-11-01
23:55 <tfinc> synchronized php-1.5/extensions/CentralNotice/SpecialBannerController.php 'Picking up fixes for Bug #25564' [production]
23:54 <tfinc> synchronized php-1.5/extensions/CentralNotice/CentralNotice.php 'Picking up fixes for r25564' [production]
20:43 <domas> ms4 mildly loaded (disks go to >100 IO/s each) throwing nfs timeouts, I bumped up NFSD_SERVERS to 2048 [production]
19:05 <Ryan_Lane> powercycling srv207 [production]
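Power cycles of an unresponsive host like this typically go through the machine's out-of-band management controller rather than the OS; a hedged example assuming an IPMI-capable controller, with the management address left as a placeholder:

    ipmitool -I lanplus -H <srv207 mgmt address> -U root -a chassis power cycle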
16:18 <RoanKattouw> Something weird's going on with srv207: Nagios says its SSH is up but it times out on SSH from fenari [production]
16:15 <catrope> synchronized php-1.5/includes/api/ApiBase.php 'r75798' [production]
2010-10-31
17:21 <catrope> synchronized php-1.5/wmf-config/InitialiseSettings.php 'bug 25719 - Add missing slash in timezone' [production]
2010-10-30
23:05 <apergos> test of logging (sorry) [production]
21:22 <mark> Deploying a sudoers file for NRPE using Puppet [production]
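A minimal sketch of the sort of Puppet resource involved; the path, source location and file contents are assumptions, not the real manifest:

    file { "/etc/sudoers.d/nrpe":
        owner  => "root",
        group  => "root",
        mode   => "0440",
        source => "puppet:///files/sudo/sudoers.nrpe",
    }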
20:48 <mark> Running apt-get upgrade on db17 [production]
20:48 <mark> Pushed updated wikimedia-raid-utils package into the APT repository, with a newer arcconf that should work on Lucid [production]
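A hedged sketch of what pushing the .deb can look like, assuming a reprepro-managed repository; the base directory, distribution name and filename are placeholders:

    reprepro -b /srv/aptrepo includedeb lucid-wikimedia wikimedia-raid-utils_*_all.deb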
15:53 <atglenn> powercycled mobile2, it was unresponsive to ssh and pings, ganglia showed no activity [production]
03:05 <domas> ms1 can't snapshot either, I suspect kernel bugs. we either have to roll back to 2.6.28 or move forward, or actually try rebuilding filesystems from scratch with new kernels... [production]
2010-10-29
23:21 <domas> lol repaired myisam tables on db9, call if data has been lost, hehe [production]
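For reference, a crashed MyISAM table is repaired either per table from the mysql client or in bulk with mysqlcheck; the database and table names below are placeholders:

    mysql -e "REPAIR TABLE <database>.<table>;"
    mysqlcheck --auto-repair --databases <database>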
22:58 <domas> resynced srv154, was running with months old configuration/code. [production]
22:58 <domas> was db22 disabled silently by someone? or not reenabled? :) reenabled now... [production]
22:55 <midom> synchronized php-1.5/wmf-config/db.php [production]
18:33 <apergos> restarted torrus on streber, after reports that it was not responding [production]
17:46 <apergos> domas ran "reset-mysql-slave db18" (from fenari) which clears out *all* old relay logs, and restarts the slaves. [production]
17:34 <apergos> removed some old relay logs from /a/sqldata on db18 to get space back, it was at 95% [production]
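reset-mysql-slave is a local wrapper script, so only its effect is logged above; a hedged manual equivalent that drops all old relay logs without losing the replication position (CHANGE MASTER TO with an explicit position deletes the existing relay logs and starts a new one):

    mysql -h db18 -e "STOP SLAVE"
    mysql -h db18 -e "SHOW SLAVE STATUS\G"    # note Relay_Master_Log_File and Exec_Master_Log_Pos
    mysql -h db18 -e "CHANGE MASTER TO MASTER_LOG_FILE='<file>', MASTER_LOG_POS=<pos>"
    mysql -h db18 -e "START SLAVE"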
15:22 <RoanKattouw> Twitter followers: the entries missing between Sep 2 and today can be viewed at http://identi.ca/wikimediatech [production]
15:22 <RoanKattouw> Re-established identi.ca->Twitter bridge for wikimediatech, broken since September 2 [production]
15:21 <RobH> repaired the sessions table, rt is now happy [production]
15:09 <RobH> rt is being odd, looking into it [production]
14:43 <phuzion> test [production]
2010-10-28
21:34 <RobH> powercycled sq69, ran puppet, it's back online [production]
21:24 <RobH> sq69 is borked, powercycling [production]
17:51 <Ryan_Lane> running checksetup.pl on kaulen for bugzilla [production]
17:50 <Ryan_Lane> running mysqlcheck --auto-repair on the bugzilla database on db9 for the bug_fulltext table [production]
15:23 <atglenn> reenabled logging for fundraising on locke [production]
14:50 <atglenn> I see a lot of ERROR 1045 (28000): Access denied for user 'root'@'localhost' (using password: NO) after the reboot of db9... not awake enough to try to look at it; services seem to be running ok [production]
14:46 <atglenn> powercycled db9, it was unreachable by ssh, ganglia showed load and wait_cpu through the roof [production]