| 
      
        2016-01-21
      
     | 
  
    
  | 20:11 | 
  <YuviPanda> | 
  repooled exec nodes on 1005 | 
  [tools] | 
            
  | 19:53 | 
  <YuviPanda> | 
  depooled exec nodes on labvirt1005 | 
  [tools] | 
            
  | 19:49 | 
  <YuviPanda> | 
  repooled exec nodes from labvirt1004 | 
  [tools] | 
            
  | 19:48 | 
  <YuviPanda> | 
  failed over proxy to tools-proxy-01 again | 
  [tools] | 
            
  | 19:31 | 
  <YuviPanda> | 
  depooled exec nodes from labvirt1004 | 
  [tools] | 
            
  | 19:29 | 
  <YuviPanda> | 
  repooled exec nodes from labvirt1003 | 
  [tools] | 
            
  | 19:13 | 
  <YuviPanda> | 
  depooled instances on labvirt1003 | 
  [tools] | 
            
  | 19:06 | 
  <YuviPanda> | 
  re-enabled queues on exec nodes that were on labvirt1002 | 
  [tools] | 
            
  | 19:02 | 
  <YuviPanda> | 
  failed over tools proxy to tools-proxy-02 | 
  [tools] | 
            
  | 18:46 | 
  <YuviPanda> | 
  drained and disabled queues on all nodes on labvirt1002 | 
  [tools] | 
            
  | 18:38 | 
  <YuviPanda> | 
  restarted all restartable jobs in instances on labvirt1001 and deleted all non-restartable ghost jobs. These were already dead. | 
  [tools] | 
            
  
    | 
      
        2016-01-11
      
     | 
  
    
  | 22:19 | 
  <valhallasw`cloud> | 
  reset maxujobs 0->128, job_load_adjustments none->np_load_avg=0.50, load_ad... -> 0:7:30 | 
  [tools] | 
            
  | 22:12 | 
  <YuviPanda> | 
  restarted gridengine master again | 
  [tools] | 
            
  | 22:07 | 
  <valhallasw`cloud> | 
  set job_load_adjustments from np_load_avg=0.50 to none and load_adjustment_decay_time to 0:0:0 | 
  [tools] | 
            
  | 22:05 | 
  <valhallasw`cloud> | 
  set maxujobs back to 0, but doesn't help | 
  [tools] | 
            
  | 21:57 | 
  <valhallasw`cloud> | 
  reset to 7:30 | 
  [tools] | 
            
  | 21:57 | 
  <valhallasw`cloud> | 
  that cleared the measure, but jobs still not starting. Ugh! | 
  [tools] | 
            
  | 21:55 | 
  <valhallasw`cloud> | 
  set load_adjustment_decay_time = 0:0:0 | 
  [tools] | 
            
  | 21:45 | 
  <YuviPanda> | 
  restarted gridengine master | 
  [tools] | 
            
  | 21:43 | 
  <valhallasw`cloud> | 
  qstat -j <jobid> shows all queues overloaded; seems to have started just after a load test for the new maxujobs setting | 
  [tools] | 
            
  | 21:42 | 
  <valhallasw`cloud> | 
  resetting to 0:7:30, as it's not having the intended effect | 
  [tools] | 
            
  | 21:41 | 
  <valhallasw`cloud> | 
  currently 353 jobs in qw state | 
  [tools] | 
            
  | 21:40 | 
  <valhallasw`cloud> | 
  that's load_adjustment_decay_time | 
  [tools] | 
            
  | 21:40 | 
  <valhallasw`cloud> | 
  temporarily sudo qconf -msconf to 0:0:1 | 
  [tools] | 
            
  | 19:59 | 
  <YuviPanda> | 
  Set maxujobs (max concurrent jobs per user) on gridengine to 128 | 
  [tools] | 
            
  | 17:51 | 
  <YuviPanda> | 
  killed all queries running on labsdb1003 | 
  [tools] | 
            
  | 17:20 | 
  <YuviPanda> | 
  stopped webservice for quentinv57-tools | 
  [tools] |