2009-04-17

22:49 <brion> regenerated centralnotice output again... this time ok [production]
22:48 <brion> srv93 and srv107 memcached nodes are running but broken. restarting them... [production]
22:43 <brion> restarted srv82 memcache node. attempting to rebuild centralnotices... [production]
22:41 <brion> bad memcached node srv82 [production]
22:05 <mark> Set up 3 new pywikipedia mailing lists, redirected svn commit output to one of them [production]
19:38 <robh> synchronized php-1.5/InitialiseSettings.php 'Bug 18494 Logo for ln.wiki' [production]
17:22 <Rob> removed wikimedia.se from our nameservers as they are using their own. [production]
16:48 <azafred> updated spamassassin rules on lily to include the SARE rules and mirror the settings on McHenry. [production]
10:25 <tstarling> synchronized robots.txt [production]
08:19 <tstarling> synchronized php-1.5/InitialiseSettings.php [production]
07:13 <Tim> temporarily killed apache on overloaded ES masters [production]
07:11 <tstarling> synchronized php-1.5/db.php 'zeroing read load on ES masters' [production]
06:04 <Tim> brief site-wide outage while db20 rebooted, reason unknown. All good now. Resuming logrotate. [production]
05:55 <Tim> db20 h/w reboot [production]
05:48 <Tim> shutting down daemons on db20 for pre-emptive reboot. Serial console shows "BUG: soft lockup - CPU#4 stuck for 11s! [rsync:27854]" etc. [production]
05:10 <Tim> on db20: killed a half-finished 'logrotate -f' run due to alarming kswapd CPU usage (linked to deadlocked rsync processes). May need a reboot. [production]
05:00 <Tim> fixed logrotate on db20, broken since March 10 due to broken status file, most likely due to non-ASCII filenames generated by demux.py. Patched demux.py. Removed everything.log. [production]
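The 05:00 entry pins the logrotate breakage on non-ASCII filenames reaching logrotate's plain-text status file. Below is a minimal sketch of that kind of guard, in which a log demultiplexer forces channel names to safe ASCII before creating files; the function name and the whitelist are assumptions for illustration, not the actual demux.py patch.

```python
# Hypothetical sketch (not the actual demux.py patch): force log filenames
# onto a conservative ASCII whitelist before the files are created, so that
# tools like logrotate, which track filenames in a plain-text status file,
# never see non-ASCII paths.
import re

def safe_log_name(channel):
    """Map an arbitrary channel name to an ASCII-only log filename."""
    cleaned = re.sub(r'[^A-Za-z0-9._-]', '_', channel)
    return cleaned or 'unknown'

# A channel name containing non-ASCII characters becomes a plain ASCII name:
print(safe_log_name('wiki\u00e9 debug.log'))  # -> 'wiki__debug.log'
```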
            
02:14 <river> set up ms6.esams, copying /export/upload from ms1 [production]
00:24 <Tim> blocked lots of uci.edu IPs that were collectively doing 20 req/s of expensive API queries, overloading ES [production]
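The 00:24 entry does not say how the offending uci.edu addresses were identified; what follows is a rough sketch, assuming a per-request API log with the client IP in the first field, of the kind of aggregation that surfaces a group of IPs collectively exceeding a request-rate threshold. The log path, field layout, and window length are assumptions, not the tooling actually used.

```python
# Hypothetical sketch: rank client IPs by API request rate over a log slice.
# Log path, field positions and window length are assumptions.
from collections import Counter

LOG = 'api.log'        # assumed format: "<ip> <timestamp> <url> ..." per line
WINDOW_SECONDS = 600   # assume the file covers a 10-minute slice

hits = Counter()
with open(LOG) as f:
    for line in f:
        fields = line.split()
        if len(fields) >= 3 and 'api.php' in fields[2]:
            hits[fields[0]] += 1

total_rate = sum(hits.values()) / float(WINDOW_SECONDS)
print('overall API rate: %.1f req/s' % total_rate)
for ip, count in hits.most_common(20):
    print('%-15s %6d requests (%.2f req/s)' % (ip, count, count / float(WINDOW_SECONDS)))
```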
            
00:15 <brion> techblog post on Phorm opt-out is linked from slashdot; load on singer seems fairly stable. [production]
  
2009-04-16

23:06 <tfinc> synchronized php-1.5/extensions/ContributionReporting/ContributionHistory_body.php [production]
22:48 <azafred> bounced apache on srv217. All threads were DED - dead [production]
22:16 <tfinc> synchronized php-1.5/extensions/ContributionReporting/ContributionHistory_body.php [production]
22:08 <tfinc> synchronized php-1.5/extensions/ContributionReporting/ContributionHistory_body.php [production]
17:41 <domas> fantastic. I start _looking_ at stuff and it fixes itself. [production]
17:35 <midom> synchronized php-1.5/includes/Revision.php 'live profiling hook' [production]
17:28 <domas> db20 has kswapd deadlock, needs reboot soonish [production]
17:18 <midom> synchronized php-1.5/InitialiseSettings.php 'disabled stats' [production]
17:15 <midom> synchronized php-1.5/InitialiseSettings.php 'enabling udp stats' [production]
16:18 <azafred> bounced apache on srv217 (no pid file so previous restart did not include this one) [production]
15:57 <brion> network borkage between Florida and Amsterdam. Visitors through AMS proxies can't reach sites. [production]
15:55 <azafred> bounced apache on srv[73,86,88,93,108,114,139,141,154,181,194,204,213,99] [production]
15:52 <Tim-away> started mysqld on srv98,srv122,srv124,srv142,srv106,srv107: done with them for now. srv102 still going. [production]
15:30 <mark> Set up ms6 with SP management at ms6.ipmi.esams.wikimedia.org [production]
14:13 <mark> Restoring traffic to Amsterdam cluster [production]
14:06 <mark> Reloading csw1-esams [production]
13:55 <mark> Reloading csw1-esams [production]
13:53 <JeLuF> ms1 NFS issues again. Might be load related [production]
13:49 <Tim> copying fedora ES data from ms3 to ms2 [production]
13:44 <JeLuF> ms1 is reachable, no errors logged, NFS daemons running fine. After some minutes, NFS clients were able to access the server again. Root cause unknown. [production]
13:38 <JeLuF> ms1 issues. On NFS slaves: "ls: cannot access /mnt/upload5/: Input/output error" [production]
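The exact client-side symptom is quoted in the 13:38 entry; below is a small probe of the kind that could watch a mount from an NFS client and log when it starts returning I/O errors. Only the /mnt/upload5 path comes from the log; the interval and output format are assumptions.

```python
# Hypothetical sketch: poll an NFS mount point from a client and report when
# it starts returning errors such as the EIO ("Input/output error") logged
# above. The 30-second interval is an assumption.
import os
import time

MOUNT = '/mnt/upload5'
INTERVAL = 30  # seconds between checks

while True:
    try:
        os.listdir(MOUNT)
        print(time.strftime('%H:%M:%S'), MOUNT, 'ok')
    except OSError as err:
        print(time.strftime('%H:%M:%S'), MOUNT, 'FAILED:', err)
    time.sleep(INTERVAL)
```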
            
13:24 <mark> DNS scenario knams-down for upcoming core switch reboot [production]
08:23 <river> pdns on bayle crashed, bindbackend parser seems rather fragile [production]
03:01 <andrew> synchronized php-1.5/InitialiseSettings.php 'Deployed AbuseFilter to ptwiki' [production]