production SAL

2009-07-13
16:28	<brion>	synchronized php-1.5/InitialiseSettings.php 'fixing wikispecies RC-IRC prefix to species.wikimedia'	[production]
16:27	<brion>	test wiki was apparently moved from dead srv35 to srv66, which has new NFS-less config. thus fail since test runs from nfs	[production]
16:24	<brion>	test wiki borked; reported down for several days now :) investigating	[production]
15:12	<midom>	synchronized php-1.5/db.php 'db26 raid issues'	[production]
14:55	<midom>	synchronized php-1.5/db.php 'db3 and db5 coming live as commons servers'	[production]
14:13	<domas>	dropped few more snapshots, as %sys was increasing on ms1...	[production]
11:16	<domas>	manually restarted plethora of failing apaches (direct segfaults and other possible APC corruptions, leading to php OOM errors)	[production]
09:50	<tstarling>	synchronized php-1.5/includes/specials/SpecialBlockip.php	[production]
09:00	<Tim>	restarted apache2 on image scalers	[production]
08:39	<tstarling>	synchronized php-1.5/includes/Math.php 'stateless render hack'	[production]
08:05	<Tim>	killed all image scalers to see if that helps with ms1 load	[production]
08:00	<Tim>	killed waiting apache processes	[production]
07:35	<midom>	synchronized php-1.5/mc-pmtpa.php	[production]
07:24	<midom>	synchronized php-1.5/mc-pmtpa.php 'swapping out srv81'	[production]
04:11	<Tim>	fixed /opt/local/bin/zfs-replicate on ms1 to write the snapshot number before starting replication, to avoid permanent error "dataset already exists" after failure	[production]
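Note on the 04:11 entry: the fix is an ordering change, recording the snapshot number before the step that can fail, so a retry picks a fresh name instead of colliding with the half-finished one. The following is only a minimal sketch of that idea, not the actual zfs-replicate script; the state file, the `snap-N` naming, and the `send` callback are all hypothetical stand-ins.

```python
import json


class ReplicationError(Exception):
    """Raised when a (simulated) replication step fails."""


def next_snapshot_number(state_path):
    # Hypothetical state file holding the last snapshot number handed out.
    try:
        with open(state_path) as f:
            return json.load(f)["last"] + 1
    except FileNotFoundError:
        return 1


def replicate(state_path, send, existing):
    """Create snapshot snap-N and send it; `existing` models datasets on disk."""
    number = next_snapshot_number(state_path)
    # Persist the number BEFORE the risky work, so a failure below
    # never causes the same number to be reused on the next run.
    with open(state_path, "w") as f:
        json.dump({"last": number}, f)
    name = f"snap-{number}"
    if name in existing:
        raise ReplicationError("dataset already exists")
    existing.add(name)  # snapshot is created
    send(name)          # may raise; the number is already recorded
    return name
```

If the number were written only after `send` succeeded, a failed run would leave `snap-1` behind and every later run would retry `snap-1` and hit "dataset already exists" permanently, which is the failure mode the entry describes.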
02:16	<brion>	-> https://bugzilla.wikimedia.org/show_bug.cgi?id=19683	[production]
02:12	<brion>	sync-common script doesn't work on nfs-free apaches; language lists etc not being updated. Deployment scripts need to be fixed?	[production]
02:03	<brion>	srv159 is absurdly loaded/lagged wtf?	[production]
01:58	<brion>	reports of servers with old config, seeing "doesn't exist" for new mhr.wikipedia. checking...	[production]
01:16	<brion>	so far so good; CPU graphs on image scalers and ms1 look clean, and I can purge thumbs on commons ok	[production]
01:10	<brion>	trying switching image scalers back in for a few, see if they go right back to old pattern or not	[production]
01:03	<brion>	load on ms1 has fallen hugely; outgoing network is way up. looks like we're serving out http images fine... of course scaling's dead :P	[production]
00:59	<brion>	stopping apache on image scaler boxes, see what that does	[production]
00:49	<brion>	attempting to replicate domas's earlier temp success dropping oldest snapshot (last was 4/13): zfs destroy export/upload@weekly-2009-04-20_03:30:00	[production]
00:45	<brion>	restarting nfs server	[production]
00:44	<brion>	stopping nfs server, restarting web server	[production]
00:40	<brion>	restarting nfs server on ms1	[production]
00:36	<brion>	doesn't seem so far to have changed the NFS access delays on image scalers.	[production]
00:31	<brion>	shutting down webserver7 on ms1	[production]
00:23	<brion>	investigating site problem reports. image server stack seems overloaded, so intermittent timeouts on nfs to apaches or http/squid to outside	[production]
2009-07-12
20:30	<domas>	dropped few snapshots on ms1, observed sharp %sys decrease and much better nfs properties immediately	[production]
20:05	<domas>	we seem to be hitting issue similar to http://www.opensolaris.org/jive/thread.jspa?messageID=64379 on ms1	[production]
18:55	<domas>	zil_disable=1 on ms1	[production]
18:34	<mark>	Upgraded pybal on lvs3	[production]
18:16	<mark>	Hacked in configurable timeout support for the ProxyFetch monitor of PyBal, set the renderers timeout at 60s	[production]
17:58	<domas>	scaler stampedes caused scalers to be depooled by pybal, thus directing stampede to other server in round-robin fashion, all blocking and consuming ms1 SJSWS slots. of course, high I/O load contributed to this.	[production]
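Note on the 17:58 entry: the rolling overload can be pictured with a toy model in which a health monitor depools any server whose share of the load exceeds what it can handle, which pushes that share onto the remaining servers. This is a rough sketch under simplifying assumptions (even load spreading, a fixed per-server capacity, one depool per monitor pass); it is not PyBal code, and `cascade` and its parameters are hypothetical.

```python
def cascade(capacities, load):
    """Return (pooled, depooled) server indices after the monitor settles.

    capacities: requests each server can serve before its health check times out
    load: total concurrent requests, spread evenly over pooled servers
    """
    pooled = list(range(len(capacities)))
    depooled = []
    while pooled:
        share = load / len(pooled)
        overloaded = [i for i in pooled if share > capacities[i]]
        if not overloaded:
            break  # every pooled server can carry its share
        # Depool one overloaded server per pass; the next pass spreads
        # the same total load over fewer servers.
        victim = overloaded[0]
        pooled.remove(victim)
        depooled.append(victim)
    return pooled, depooled


print(cascade([10, 10, 10, 10], 36))  # load fits: nothing is depooled
print(cascade([10, 10, 10, 10], 44))  # slight overload: servers fall one by one
```

In this model a load only slightly above total capacity is enough to empty the pool, since each depool raises the share on the survivors. Tolerating slow responses for longer before depooling corresponds to raising the effective capacity.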
17:55	<domas>	investigating LVS-based rolling scaler overload issue, Mark and Tim heading the effort now ;-)	[production]
17:54	<domas>	bumped up ms1 SJSWS thread count	[production]
2009-07-11
15:45	<mark>	Rebooting sq1	[production]
15:31	<Tim>	rebooting ms1	[production]
14:54	<Tim>	disabled CentralNotice temporarily	[production]
14:54	<tstarling>	synchronized php-1.5/InitialiseSettings.php 'disabling CentralNotice'	[production]
14:53	<tstarling>	synchronized php-1.5/InitialiseSettings.php 'disabling CentralAuth'	[production]
14:36	<Tim>	restarted webserver7 on ms1	[production]
14:22	<Tim>	some kind of overload, seems to be image related	[production]
10:09	<midom>	synchronized php-1.5/db.php 'db8 doing commons read load, full write though'	[production]
09:22	<domas>	restarted job queue with externallinks purging code, <3	[production]
09:22	<domas>	installed nrpe on db2 :)	[production]
09:22	<midom>	synchronized php-1.5/db.php 'giving db24 just negligible load for now'	[production]
08:38	<midom>	synchronized php-1.5/includes/parser/ParserOutput.php 'livemerging r53103:53105'	[production]