production SAL

2009-07-13 §
00:40	<brion>	restarting nfs server on ms1	[production]
00:36	<brion>	doesn't seem so far to have changed the NFS access delays on image scalers.	[production]
00:31	<brion>	shutting down webserver7 on ms1	[production]
00:23	<brion>	investigating site problem reports. image server stack seems overloaded, so intermittent timeouts on nfs to apaches or http/squid to outside	[production]
2009-07-12 §
20:30	<domas>	dropped a few snapshots on ms1, observed sharp %sys decrease and much better nfs properties immediately	[production]
20:05	<domas>	we seem to be hitting issue similar to http://www.opensolaris.org/jive/thread.jspa?messageID=64379 on ms1	[production]
18:55	<domas>	zil_disable=1 on ms1	[production]
18:34	<mark>	Upgraded pybal on lvs3	[production]
18:16	<mark>	Hacked in configurable timeout support for the ProxyFetch monitor of PyBal, set the renderers timeout at 60s	[production]
17:58	<domas>	scaler stampedes caused scalers to be depooled by pybal, thus directing stampede to other server in round-robin fashion, all blocking and consuming ms1 SJSWS slots. of course, high I/O load contributed to this.	[production]
17:55	<domas>	investigating LVS-based rolling scaler overload issue, Mark and Tim heading the effort now ;-)	[production]
17:54	<domas>	bumped up ms1 SJSWS thread count	[production]
2009-07-11 §
15:45	<mark>	Rebooting sq1	[production]
15:31	<Tim>	rebooting ms1	[production]
14:54	<Tim>	disabled CentralNotice temporarily	[production]
14:54	<tstarling>	synchronized php-1.5/InitialiseSettings.php 'disabling CentralNotice'	[production]
14:53	<tstarling>	synchronized php-1.5/InitialiseSettings.php 'disabling CentralAuth'	[production]
14:36	<Tim>	restarted webserver7 on ms1	[production]
14:22	<Tim>	some kind of overload, seems to be image related	[production]
10:09	<midom>	synchronized php-1.5/db.php 'db8 doing commons read load, full write though'	[production]
09:22	<domas>	restarted job queue with externallinks purging code, <3	[production]
09:22	<domas>	installed nrpe on db2 :)	[production]
09:22	<midom>	synchronized php-1.5/db.php 'giving db24 just negligible load for now'	[production]
08:38	<midom>	synchronized php-1.5/includes/parser/ParserOutput.php 'livemerging r53103:53105'	[production]
08:37	<midom>	synchronized php-1.5/includes/DefaultSettings.php	[production]
2009-07-10 §
21:21	<Fred>	added ganglia to db20	[production]
19:58	<azafred>	synchronized php-1.5/CommonSettings.php 'removed border=0 from wgCopyrightIcon'	[production]
18:58	<Fred>	synched nagios config to reflect cleanup.	[production]
18:52	<Fred>	cleaned up the node_files for dsh and removed all decommissioned hosts.	[production]
18:36	<mark>	Added DNS entries for srv251-500	[production]
18:18	<fvassard>	synchronized php-1.5/mc-pmtpa.php 'Added a couple spare memcache hosts.'	[production]
18:16	<RobH_DC>	moved test to srv66 instead.	[production]
18:08	<RobH_DC>	turning srv210 into test.wikipedia.org	[production]
17:57	<Andrew>	Reactivating UsabilityInitiative globally, too.	[production]
17:55	<Andrew>	Scapping, back-out diff is in /home/andrew/usability-diff	[production]
17:43	<Andrew>	Apply r52926, r52930, and update Resources and EditToolbar/images	[production]
16:44	<Fred>	reinstalled and configured gmond on storage1.	[production]
15:08	<Rob>	upgraded blog and techblog to wordpress 2.8.1	[production]
13:58	<midom>	synchronized php-1.5/includes/api/ApiQueryCategoryMembers.php 'hello, fix\\!'	[production]
12:40	<Tim>	prototype.wikimedia.org is in OOM death, nagios reports down 3 hours, still responsive on shell so I will try a light touch	[production]
11:08	<tstarling>	synchronized php-1.5/mc-pmtpa.php 'more'	[production]
10:58	<Tim>	installed memcached on srv200-srv209	[production]
10:51	<tstarling>	synchronized php-1.5/mc-pmtpa.php 'deployed the 11 available spares, will make some more'	[production]
10:48	<Tim>	mctest.php reports 17 servers down out of 78, most from the range that Rob decommissioned	[production]
10:37	<Tim>	installed memcached on srv120, srv121, srv122, srv123	[production]
10:32	<Tim>	found rogue server srv101, missing puppet configuration and so skipping syncs. Uninstalled apache on it.	[production]
2009-07-09 §
23:56	<RoanKattouw>	Rebooted prototype around 16:30, got stuck around 15:30	[production]
21:43	<Rob>	srv35 (test.wikipedia.org) is not posting, i think it's dead jim.	[production]
21:35	<Rob>	decommissioned srv55 and put srv35 in its place in C4, test.wikipedia.org should be back online shortly	[production]
20:04	<Rob>	removed decommissioned servers from node groups, getting error on syncing up nagios.	[production]