| 2009-04-17 |
| 22:48 | <brion> | srv93 and srv107 memcached nodes are running but broken. restarting them... | [production] |
| 22:43 | <brion> | restarted srv82 memcache node. attempting to rebuild centralnotices... | [production] |
| 22:41 | <brion> | bad memcached node srv82 | [production] |
| 22:05 | <mark> | Set up 3 new pywikipedia mailing lists, redirected svn commit output to one of them | [production] |
| 19:38 | <robh> | synchronized php-1.5/InitialiseSettings.php 'Bug 18494 Logo for ln.wiki' | [production] |
| 17:22 | <Rob> | removed wikimedia.se from our nameservers as they are using their own. | [production] |
| 16:48 | <azafred> | updated spamassassin rules on lily to include the SARE rules and mirror the settings on McHenry. | [production] |
| 10:25 | <tstarling> | synchronized robots.txt | [production] |
| 08:19 | <tstarling> | synchronized php-1.5/InitialiseSettings.php | [production] |
| 07:13 | <Tim> | temporarily killed apache on overloaded ES masters | [production] |
| 07:11 | <tstarling> | synchronized php-1.5/db.php 'zeroing read load on ES masters' | [production] |
| 06:04 | <Tim> | brief site-wide outage while it rebooted, reason unknown. All good now. Resuming logrotate. | [production] |
| 05:55 | <Tim> | db20 h/w reboot | [production] |
| 05:48 | <Tim> | shutting down daemons on db20 for pre-emptive reboot. Serial console shows "BUG: soft lockup - CPU#4 stuck for 11s! [rsync:27854]" etc. | [production] |
| 05:10 | <Tim> | on db20: killed logrotate -f half done due to alarming kswapd CPU (linked to deadlocked rsync processes). May need a reboot. | [production] |
| 05:00 | <Tim> | fixed logrotate on db20, broken since March 10 due to broken status file, most likely due to non-ASCII filenames generated by demux.py. Patched demux.py. Removed everything.log. | [production] |
| 02:14 | <river> | set up ms6.esams, copying /export/upload from ms1 | [production] |
| 00:24 | <Tim> | blocked lots of uci.edu IPs that were collectively doing 20 req/s of expensive API queries, overloading ES | [production] |
| 00:15 | <brion> | techblog post on Phorm opt-out is linked from slashdot; load on singer seems fairly stable. | [production] |
  
| 2009-04-16 |
| 23:06 | <tfinc> | synchronized php-1.5/extensions/ContributionReporting/ContributionHistory_body.php | [production] |
| 22:48 | <azafred> | bounced apache on srv217. All threads were DED - dead | [production] |
| 22:16 | <tfinc> | synchronized php-1.5/extensions/ContributionReporting/ContributionHistory_body.php | [production] |
| 22:08 | <tfinc> | synchronized php-1.5/extensions/ContributionReporting/ContributionHistory_body.php | [production] |
| 17:41 | <domas> | fantastic. I start _looking_ at stuff and it fixes itself. | [production] |
| 17:35 | <midom> | synchronized php-1.5/includes/Revision.php 'live profiling hook' | [production] |
| 17:28 | <domas> | db20 has kswapd deadlock, needs reboot soonish | [production] |
| 17:18 | <midom> | synchronized php-1.5/InitialiseSettings.php 'disabled stats' | [production] |
| 17:15 | <midom> | synchronized php-1.5/InitialiseSettings.php 'enabling udp stats' | [production] |
| 16:18 | <azafred> | bounced apache on srv217 (no pid file so previous restart did not include this one) | [production] |
| 15:57 | <brion> | network borkage between Florida and Amsterdam. Visitors through AMS proxies can't reach sites. | [production] |
| 15:55 | <azafred> | bounced apache on srv[73,86,88,93,108,114,139,141,154,181,194,204,213,99] | [production] |
| 15:52 | <Tim-away> | started mysqld on srv98,srv122,srv124,srv142,srv106,srv107: done with them for now. srv102 still going. | [production] |
| 15:30 | <mark> | Set up ms6 with SP management at ms6.ipmi.esams.wikimedia.org | [production] |
| 14:13 | <mark> | Restoring traffic to Amsterdam cluster | [production] |
| 14:06 | <mark> | Reloading csw1-esams | [production] |
| 13:55 | <mark> | Reloading csw1-esams | [production] |
| 13:53 | <JeLuF> | ms1 NFS issues again. Might be load related | [production] |
| 13:49 | <Tim> | copying fedora ES data from ms3 to ms2 | [production] |
| 13:44 | <JeLuF> | ms1 is reachable, no errors logged, NFS daemons running fine. After some minutes, NFS clients were able to access the server again. Root cause unknown. | [production] |
| 13:38 | <JeLuF> | ms1 issues. On NFS slaves: "ls: cannot access /mnt/upload5/: Input/output error" | [production] |
| 13:24 | <mark> | DNS scenario knams-down for upcoming core switch reboot | [production] |
| 08:23 | <river> | pdns on bayle crashed, bindbackend parser seems rather fragile | [production] |
| 03:01 | <andrew> | synchronized php-1.5/InitialiseSettings.php 'Deployed AbuseFilter to ptwiki' | [production] |