2009-04-17
22:49 <brion> regenerated centralnotice output again... this time ok [production]
22:48 <brion> srv93 and srv107 memcached nodes are running but broken. Restarting them... [production]
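A memcached node that is "running but broken" typically still accepts TCP connections but no longer answers the protocol. A minimal Python sketch of a probe that separates the two failure modes (the host names come from the log; the port and timeout are the usual defaults, assumed here):

    import socket

    def probe_memcached(host, port=11211, timeout=2):
        """Return True if the node answers a protocol-level 'version' request."""
        try:
            s = socket.create_connection((host, port), timeout=timeout)
        except OSError:
            return False  # not even listening: plainly down
        try:
            s.sendall(b"version\r\n")
            return s.recv(64).startswith(b"VERSION")  # healthy nodes reply at once
        except OSError:
            return False  # accepts connections but hangs: "running but broken"
        finally:
            s.close()

    for node in ("srv93", "srv107"):
        print(node, "ok" if probe_memcached(node) else "broken")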
22:43 <brion> restarted srv82 memcache node. attempting to rebuild centralnotices... [production]
22:41 <brion> bad memcached node srv82 [production]
22:05 <mark> Set up 3 new pywikipedia mailing lists, redirected svn commit output to one of them [production]
19:38 <robh> synchronized php-1.5/InitialiseSettings.php 'Bug 18494 Logo for ln.wiki' [production]
17:22 <Rob> removed wikimedia.se from our nameservers as they are using their own. [production]
16:48 <azafred> updated spamassassin rules on lily to include the SARE rules and mirror the settings on McHenry. [production]
10:25 <tstarling> synchronized robots.txt [production]
08:19 <tstarling> synchronized php-1.5/InitialiseSettings.php [production]
07:13 <Tim> temporarily killed apache on overloaded ES masters [production]
07:11 <tstarling> synchronized php-1.5/db.php 'zeroing read load on ES masters' [production]
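In MediaWiki's load balancer each database server carries an integer read weight, so "zeroing read load" on the ES masters means no read queries are routed to them while writes continue as normal. A sketch of the weighted-selection idea only; the host names and weights are illustrative, not the actual contents of db.php:

    import random

    # Illustrative weights: a master at 0 receives no read queries at all.
    read_loads = {"es-master": 0, "es-replica1": 100, "es-replica2": 100}

    def pick_reader(loads):
        """Weighted random choice; zero-weight servers are never picked."""
        candidates = [(host, w) for host, w in loads.items() if w > 0]
        r = random.uniform(0, sum(w for _, w in candidates))
        for host, w in candidates:
            r -= w
            if r <= 0:
                return host
        return candidates[-1][0]

    print(pick_reader(read_loads))  # always one of the replicas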
06:04 <Tim> brief site-wide outage while db20 rebooted, reason unknown. All good now. Resuming logrotate. [production]
05:55 <Tim> db20 h/w reboot [production]
05:48 <Tim> shutting down daemons on db20 for pre-emptive reboot. Serial console shows "BUG: soft lockup - CPU#4 stuck for 11s! [rsync:27854]" etc. [production]
05:10 <Tim> on db20: killed a half-finished logrotate -f because of alarmingly high kswapd CPU usage (linked to deadlocked rsync processes). May need a reboot. [production]
05:00 <Tim> fixed logrotate on db20, broken since March 10 by a corrupted status file, most likely caused by non-ASCII filenames generated by demux.py. Patched demux.py. Removed everything.log. [production]
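logrotate records every rotated file by name in a plain-text status file, so non-ASCII bytes in generated filenames can render that file unparseable. The actual demux.py patch is not shown in the log; this is only a hedged sketch of the kind of sanitization that prevents the problem (sanitize_filename is a hypothetical helper):

    def sanitize_filename(name, fallback="_"):
        """Replace anything outside printable non-space ASCII with a
        placeholder so downstream tools like logrotate stay parseable."""
        return "".join(c if " " < c <= "~" else fallback for c in name)

    print(sanitize_filename("wiki\xe9.log"))  # -> wiki_.log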
02:14 <river> set up ms6.esams, copying /export/upload from ms1 [production]
00:24 <Tim> blocked lots of uci.edu IPs that were collectively doing 20 req/s of expensive API queries, overloading ES [production]
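Deciding which clients to block usually starts with tallying request rates per source IP in the API access logs. A minimal sketch, assuming a log slice covering roughly a minute with the client IP as the first whitespace-separated field (the file name and format are assumptions):

    import collections
    import sys

    WINDOW_SECONDS = 60  # assumed span of the log slice being analysed
    counts = collections.Counter()

    with open(sys.argv[1]) as log:
        for line in log:
            fields = line.split(None, 1)
            if fields:
                counts[fields[0]] += 1  # first field assumed to be the client IP

    for ip, n in counts.most_common(10):
        print(f"{ip:15s} {n / WINDOW_SECONDS:6.1f} req/s")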
00:15 <brion> techblog post on Phorm opt-out is linked from slashdot; load on singer seems fairly stable. [production]
2009-04-16
23:06 <tfinc> synchronized php-1.5/extensions/ContributionReporting/ContributionHistory_body.php [production]
22:48 <azafred> bounced apache on srv217. All threads were dead. [production]
22:16 <tfinc> synchronized php-1.5/extensions/ContributionReporting/ContributionHistory_body.php [production]
22:08 <tfinc> synchronized php-1.5/extensions/ContributionReporting/ContributionHistory_body.php [production]
17:41 <domas> fantastic. I start _looking_ at stuff and it fixes itself. [production]
17:35 <midom> synchronized php-1.5/includes/Revision.php 'live profiling hook' [production]
17:28 <domas> db20 has a kswapd deadlock, needs a reboot soonish [production]
17:18 <midom> synchronized php-1.5/InitialiseSettings.php 'disabled stats' [production]
17:15 <midom> synchronized php-1.5/InitialiseSettings.php 'enabling udp stats' [production]
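UDP profiling works by emitting each measured section's wall time as a fire-and-forget datagram to a collector, so enabling it adds no blocking I/O to page rendering. A minimal Python sketch of the idea; the collector host, port, and packet format are assumptions, not MediaWiki's actual wire format:

    import socket
    import time

    SOCK = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    COLLECTOR = ("profiler.example.org", 3811)  # hypothetical collector endpoint

    def profiled(name):
        """Decorator reporting a function's wall time over UDP.
        Fire-and-forget: a lost packet never slows the caller down."""
        def wrap(fn):
            def inner(*args, **kwargs):
                start = time.time()
                try:
                    return fn(*args, **kwargs)
                finally:
                    ms = (time.time() - start) * 1000
                    SOCK.sendto(f"{name} {ms:.1f}\n".encode(), COLLECTOR)
            return inner
        return wrap

    @profiled("Revision::getText")
    def get_text():
        time.sleep(0.01)  # stand-in for real work

    get_text()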
16:18 <azafred> bounced apache on srv217 (no pid file, so the previous restart did not include this one) [production]
15:57 <brion> network borkage between Florida and Amsterdam. Visitors through AMS proxies can't reach sites. [production]
15:55 <azafred> bounced apache on srv[73,86,88,93,108,114,139,141,154,181,194,204,213,99] [production]
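Bouncing a service across a bracketed host list like the one above is typically a small loop over ssh. A sketch under the assumption that passwordless ssh works and each host uses a standard apache2ctl; the graceful restart command is an assumption, not necessarily what was actually run:

    import subprocess

    HOSTS = [f"srv{n}" for n in
             (73, 86, 88, 93, 108, 114, 139, 141, 154, 181, 194, 204, 213, 99)]

    for host in HOSTS:
        # 'graceful' restarts workers without dropping in-flight requests.
        rc = subprocess.call(["ssh", host, "apache2ctl", "graceful"])
        print(host, "ok" if rc == 0 else f"failed (rc={rc})")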
15:52 <Tim-away> started mysqld on srv98, srv122, srv124, srv142, srv106, srv107: done with them for now. srv102 still going. [production]
15:30 <mark> Set up ms6 with SP management at ms6.ipmi.esams.wikimedia.org [production]
14:13 <mark> Restoring traffic to Amsterdam cluster [production]
14:06 <mark> Reloading csw1-esams [production]
13:55 <mark> Reloading csw1-esams [production]
13:53 <JeLuF> ms1 NFS issues again. Might be load related [production]
13:49 <Tim> copying fedora ES data from ms3 to ms2 [production]
13:44 <JeLuF> ms1 is reachable, no errors logged, NFS daemons running fine. After some minutes, NFS clients were able to access the server again. Root cause unknown. [production]
13:38 <JeLuF> ms1 issues. On NFS slaves: "ls: cannot access /mnt/upload5/: Input/output error" [production]
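A hard-mounted NFS export that stops responding hangs any process that touches it, so a health check has to enforce its own timeout from outside. A minimal sketch that runs the listing in a child process and treats a timeout as a hang (the mount point is from the log; the probe itself is an assumption):

    import subprocess

    def nfs_alive(path, timeout=10):
        """True if the mount answers a directory listing within `timeout` s.
        A child process is used because a hung hard mount would block us."""
        try:
            subprocess.run(["ls", path], timeout=timeout, check=True,
                           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
            return True
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
            return False

    print("ok" if nfs_alive("/mnt/upload5") else "hung or erroring")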
13:24 <mark> DNS scenario knams-down for upcoming core switch reboot [production]
08:23 <river> pdns on bayle crashed; the bindbackend parser seems rather fragile [production]