| 
      
        2015-08-31
      
     | 
  
    
  | 21:21 | 
  <valhallasw`cloud> | 
  webservice: error: argument server: invalid choice: 'generic' (choose from 'lighttpd', 'tomcat', 'uwsgi-python', 'nodejs', 'uwsgi-plain') (for tools.javatest) | 
  [tools] | 
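The wording of this error, and the "non-zero exit status 2" reported at 21:19, are what Python's `argparse` produces when a positional argument falls outside its `choices` list. The sketch below reproduces that behaviour under stated assumptions: the parser layout, the `--release` default, and the `action` choices are hypothetical and are not taken from the real `webservice` script. Only the server list is copied from the error message.

```python
import argparse

# Server types copied from the error message in the log entry above.
SERVER_TYPES = ['lighttpd', 'tomcat', 'uwsgi-python', 'nodejs', 'uwsgi-plain']


def build_parser():
    """Hypothetical stand-in for the webservice command-line parser."""
    parser = argparse.ArgumentParser(prog='webservice')
    # Assumption: --release takes a free-form value such as 'trusty'.
    parser.add_argument('--release', default='trusty')
    parser.add_argument('server', choices=SERVER_TYPES)
    # Assumption: the action names are illustrative only.
    parser.add_argument('action', choices=['start', 'stop', 'restart'])
    return parser


def exit_status(argv):
    """Return the exit status the command line would produce."""
    try:
        build_parser().parse_args(argv)
    except SystemExit as exc:
        # argparse reports usage errors on stderr and exits with status 2.
        return exc.code
    return 0
```

With this sketch, `exit_status(['--release', 'trusty', 'generic', 'restart'])` yields 2 because `generic` is not an accepted server type, while the same call with `lighttpd` yields 0.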
            
  | 21:20 | 
  <valhallasw`cloud> | 
  restarted webservicemonitor | 
  [tools] | 
            
  | 21:19 | 
  <valhallasw`cloud> | 
  seems to have some errors in restarting: subprocess.CalledProcessError: Command '['/usr/bin/sudo', '-i', '-u', 'tools.javatest', '/usr/local/bin/webservice', '--release', 'trusty', 'generic', 'restart']' returned non-zero exit status 2 | 
  [tools] | 
            
  | 21:18 | 
  <valhallasw`cloud> | 
  running puppet agent -tv on tools-services-02 to make sure webservicemonitor is running | 
  [tools] | 
            
  | 21:15 | 
  <valhallasw`cloud> | 
  several webservices seem not to have actually come back online?! what on earth is going on. | 
  [tools] | 
            
  | 21:10 | 
  <valhallasw`cloud> | 
  some jobs still died (including tools.admin). I'm assuming service.manifest will make sure they start again | 
  [tools] | 
            
  | 20:29 | 
  <valhallasw`cloud> | 
  the sorted list (| sort) is not very spread out in terms of affected hosts, because a lot of jobs were started on lighttpd-1409 and -1410 around the same time. | 
  [tools] | 
            
  | 20:25 | 
  <valhallasw`cloud> | 
  ca 500 jobs @ 5s/job = approx 40 minutes | 
  [tools] | 
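The estimate in this entry is a simple product of job count and per-job interval. A minimal sketch of the arithmetic follows; the function name is hypothetical.

```python
def estimated_minutes(job_count, seconds_per_job):
    """Total wall-clock time, in minutes, to process jobs one at a time."""
    return job_count * seconds_per_job / 60.0
```

For 500 jobs at 5 seconds each this gives 2500 seconds, a little under 42 minutes, which matches the rough figure of 40 minutes in the log.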
            
  | 20:23 | 
  <valhallasw`cloud> | 
  doh. accidentally used the wrong file, causing restarts for another few uwsgi hosts. Three more jobs dead *sigh* | 
  [tools] | 
            
  | 20:21 | 
  <valhallasw`cloud> | 
  now doing more rescheduling, with 5 sec intervals, on a sorted list to spread load between queues | 
  [tools] | 
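A throttled rescheduling pass like the one described in this entry could be sketched as follows. The log does not say which command or sort key was used, so two things here are assumptions: the jobs are sorted by job ID, and each job is rescheduled with the grid engine command `qmod -rj <job id>`. The `run` and `sleep` parameters are hypothetical hooks that allow the loop to be exercised without a grid.

```python
import subprocess
import time


def reschedule_throttled(job_ids, interval=5.0,
                         run=subprocess.check_call, sleep=time.sleep):
    """Reschedule jobs one at a time, pausing `interval` seconds in between.

    Assumption: 'qmod -rj' is the rescheduling command; the log does not
    name the command that was actually used.
    """
    ordered = sorted(job_ids)  # assumption: sort by job ID
    done = []
    for index, job_id in enumerate(ordered):
        run(['qmod', '-rj', str(job_id)])
        done.append(job_id)
        # No need to wait after the final job.
        if index < len(ordered) - 1:
            sleep(interval)
    return done
```

Because `check_call` raises `subprocess.CalledProcessError` on a non-zero exit status, the loop stops at the first failed job, and the caller can see which job that was from the jobs already processed.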
            
  | 19:36 | 
  <valhallasw`cloud> | 
  last restarted job is 1423661, rest of them are still in /home/valhallaw/webgrid_jobs | 
  [tools] | 
            
  | 19:35 | 
  <valhallasw`cloud> | 
  one per second still seems to make SGE unhappy; there's a whole set of jobs dying, mostly uwsgi? | 
  [tools] | 
            
  | 19:31 | 
  <valhallasw`cloud> | 
  https://phabricator.wikimedia.org/T110861 : rescheduling 521 webgrid jobs, at a rate of one per second, while watching the accounting log for issues | 
  [tools] | 
            
  | 07:31 | 
  <valhallasw`cloud> | 
  removed paniclog on tools-submit; probably related to the NFS outage yesterday (although I'm not sure why that would give OOMs) | 
  [tools] |