2009-07-13
16:28 <brion> synchronized php-1.5/InitialiseSettings.php 'fixing wikispecies RC-IRC prefix to species.wikimedia' [production]
16:27 <brion> test wiki was apparently moved from dead srv35 to srv66, which has new NFS-less config. thus fail since test runs from nfs [production]
16:24 <brion> test wiki borked; reported down for several days now :) investigating [production]
15:12 <midom> synchronized php-1.5/db.php 'db26 raid issues' [production]
14:55 <midom> synchronized php-1.5/db.php 'db3 and db5 coming live as commons servers' [production]
14:13 <domas> dropped few more snapshots, as %sys was increasing on ms1... [production]
11:16 <domas> manually restarted plethora of failing apaches (direct segfaults and other possible APC corruptions, leading to php OOM errors) [production]
09:50 <tstarling> synchronized php-1.5/includes/specials/SpecialBlockip.php [production]
09:00 <Tim> restarted apache2 on image scalers [production]
08:39 <tstarling> synchronized php-1.5/includes/Math.php 'statless render hack' [production]
08:05 <Tim> killed all image scalers to see if that helps with ms1 load [production]
08:00 <Tim> killed waiting apache processes [production]
07:35 <midom> synchronized php-1.5/mc-pmtpa.php [production]
07:24 <midom> synchronized php-1.5/mc-pmtpa.php 'swapping out srv81' [production]
04:11 <Tim> fixed /opt/local/bin/zfs-replicate on ms1 to write the snapshot number before starting replication, to avoid permanent error "dataset already exists" after failure [production]
02:16 <brion> -> https://bugzilla.wikimedia.org/show_bug.cgi?id=19683 [production]
02:12 <brion> sync-common script doesn't work on nfs-free apaches; language lists etc not being updated. Deployment scripts need to be fixed? [production]
02:03 <brion> srv159 is absurdly loaded/lagged wtf? [production]
01:58 <brion> reports of servers with old config, seeing "doesn't exist" for new mhr.wikipedia. checking... [production]
01:16 <brion> so far so good; CPU graphs on image scalers and ms1 look clean, and I can purge thumbs on commons ok [production]
01:10 <brion> trying switching image scalers back in for a few, see if they go right back to old pattern or not [production]
01:03 <brion> load on ms1 has fallen hugely; outgoing network is way up. looks like we're serving out http images fine... of course scaling's dead :P [production]
00:59 <brion> stopping apache on image scaler boxes, see what that does [production]
00:49 <brion> attempting to replicate domas's earlier temp success dropping oldest snapshot (last was 4/13): zfs destroy export/upload@weekly-2009-04-20_03:30:00 [production]
00:45 <brion> restarting nfs server [production]
00:44 <brion> stopping nfs server, restarting web server [production]
00:40 <brion> restarting nfs server on ms1 [production]
00:36 <brion> doesn't seem so far to have changed the NFS access delays on image scalers. [production]
00:31 <brion> shutting down webserver7 on ms1 [production]
00:23 <brion> investigating site problem reports. image server stack seems overloaded, so intermittent timeouts on nfs to apaches or http/squid to outside [production]
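Note on the 04:11 entry: the fix records the snapshot number before replication starts, so a run that fails partway does not retry the same snapshot and hit "dataset already exists" forever. A minimal Python sketch of that ordering follows. The real /opt/local/bin/zfs-replicate script is not shown in this log and may work differently; `replicate_next`, `state_path`, and `send` are hypothetical names used only for illustration.

```python
import os

def replicate_next(state_path, send):
    """Record the next snapshot number first, then replicate it.

    state_path and send are hypothetical stand-ins; they are not taken
    from the real zfs-replicate script.
    """
    # Read the last recorded snapshot number (0 if there is no state yet).
    try:
        with open(state_path) as f:
            last = int(f.read().strip() or 0)
    except FileNotFoundError:
        last = 0

    current = last + 1

    # Persist the new number BEFORE replicating, via an atomic rename,
    # so a failed send still leaves the counter advanced.
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(str(current))
    os.replace(tmp_path, state_path)

    # If this raises, the next run will use current + 1, not current.
    send(current)
    return current
```

The trade-off in this sketch is that a failed run skips a snapshot number rather than reusing it, which is the behaviour the log entry appears to prefer over a permanently stuck replication job.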
2009-07-12
20:30 <domas> dropped few snapshots on ms1, observed sharp %sys decrease and much better nfs properties immediately [production]
20:05 <domas> we seem to be hitting issue similar to http://www.opensolaris.org/jive/thread.jspa?messageID=64379 on ms1 [production]
18:55 <domas> zil_disable=1 on ms1 [production]
18:34 <mark> Upgraded pybal on lvs3 [production]
18:16 <mark> Hacked in configurable timeout support for the ProxyFetch monitor of PyBal, set the renderers timeout at 60s [production]
17:58 <domas> scaler stampedes caused scalers to be depooled by pybal, thus directing stampede to other server in round-robin fashion, all blocking and consuming ms1 SJSWS slots. of course, high I/O load contributed to this. [production]
17:55 <domas> investigating LVS-based rolling scaler overload issue, Mark and Tim heading the effort now ;-) [production]
17:54 <domas> bumped up ms1 SJSWS thread count [production]
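Note on the 17:58 and 18:16 entries: a health check with a short timeout depools scalers that are slow but still working, which pushes their share of requests onto the remaining servers. The toy Python model below illustrates why raising the timeout to 60s helps. It is not PyBal code; the function names, the response times, and the 5s "short" timeout are assumptions made for illustration (the log only states the 60s value).

```python
def pooled_after_check(response_times, timeout):
    """Return the servers whose health-check response beat the timeout.

    response_times maps a server name to its check latency in seconds.
    """
    return [name for name, t in response_times.items() if t <= timeout]

def per_server_load(total_requests, pooled):
    """Evenly split requests across pooled servers (round-robin ideal)."""
    if not pooled:
        return float("inf")  # nothing left to serve the traffic
    return total_requests / len(pooled)

# Hypothetical latencies: scaler1 is busy but still rendering.
times = {"scaler1": 12.0, "scaler2": 3.0, "scaler3": 4.0}

short = pooled_after_check(times, timeout=5.0)     # scaler1 is depooled
generous = pooled_after_check(times, timeout=60.0)  # all three stay pooled

load_short = per_server_load(90, short)        # 45.0 requests per server
load_generous = per_server_load(90, generous)  # 30.0 requests per server
```

In this model the short timeout raises the load on each surviving scaler by half, which makes the next depool more likely; that is the rolling pattern the 17:58 entry describes.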
2009-07-11
15:45 <mark> Rebooting sq1 [production]
15:31 <Tim> rebooting ms1 [production]
14:54 <Tim> disabled CentralNotice temporarily [production]
14:54 <tstarling> synchronized php-1.5/InitialiseSettings.php 'disabling CentralNotice' [production]
14:53 <tstarling> synchronized php-1.5/InitialiseSettings.php 'disabling CentralAuth' [production]
14:36 <Tim> restarted webserver7 on ms1 [production]
14:22 <Tim> some kind of overload, seems to be image related [production]
10:09 <midom> synchronized php-1.5/db.php 'db8 doing commons read load, full write though' [production]
09:22 <domas> restarted job queue with externallinks purging code, <3 [production]
09:22 <domas> installed nrpe on db2 :) [production]
09:22 <midom> synchronized php-1.5/db.php 'giving db24 just negligible load for now' [production]
08:38 <midom> synchronized php-1.5/includes/parser/ParserOutput.php 'livemerging r53103:53105' [production]