tools SAL


2015-09-01 §
06:17	<valhallasw`cloud>	going to restart sge_qmaster, hoping this solves the issue :/	[tools]
06:07	<valhallasw`cloud>	e.g. "queue instance "task@tools-exec-1211.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=1.820000 (= 0.070000 + 0.50 * 14.000000 with nproc=4) >= 1.75" but the actual load is only 0.3?!	[tools]
06:06	<valhallasw`cloud>	test job does not get submitted because all queues are overloaded?!	[tools]
06:06	<valhallasw`cloud>	investigating SGE issues reported on irc/email	[tools]
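
Note on the 06:07 entry: the numbers in the quoted message are consistent with the scheduler adding a per-job correction to the reported load and dividing that correction by the core count. That reading is an assumption inferred from the figures in the message, not something the log states, and the function and parameter names below are illustrative only. A minimal sketch of the arithmetic:

```python
def adjusted_np_load(reported, adjustment, recent_jobs, nproc):
    """Reported load plus a per-job correction, scaled by the core count.

    Assumed reading of the message: 0.07 + 0.50 * 14 / 4.
    """
    return reported + adjustment * recent_jobs / nproc


load = adjusted_np_load(reported=0.07, adjustment=0.50, recent_jobs=14, nproc=4)
threshold = 1.75
overloaded = load >= threshold
print(f"{load:.2f} >= {threshold}: {overloaded}")
```

Under this reading the correction term alone (0.50 * 14 / 4 = 1.75) already reaches the threshold, which would explain a queue being dropped while the measured load is low.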
2015-08-31 §
21:21	<valhallasw`cloud>	webservice: error: argument server: invalid choice: 'generic' (choose from 'lighttpd', 'tomcat', 'uwsgi-python', 'nodejs', 'uwsgi-plain') (for tools.javatest)	[tools]
21:20	<valhallasw`cloud>	restarted webservicemonitor	[tools]
21:19	<valhallasw`cloud>	seems to have some errors in restarting: subprocess.CalledProcessError: Command '['/usr/bin/sudo', '-i', '-u', 'tools.javatest', '/usr/local/bin/webservice', '--release', 'trusty', 'generic', 'restart']' returned non-zero exit status 2	[tools]
21:18	<valhallasw`cloud>	running puppet agent -tv on tools-services-02 to make sure webservicemonitor is running	[tools]
21:15	<valhallasw`cloud>	several webservices seem to actually have not gotten back online?! what on earth is going on.	[tools]
21:10	<valhallasw`cloud>	some jobs still died (including tools.admin). I'm assuming service.manifest will make sure they start again	[tools]
20:29	<valhallasw`cloud>	|sort is not so spread out in terms of affected hosts because a lot of jobs were started on lighttpd-1409 and -1410 around the same time.	[tools]
20:25	<valhallasw`cloud>	ca 500 jobs @ 5s/job = approx 40 minutes	[tools]
20:23	<valhallasw`cloud>	doh. accidentally used the wrong file, causing restarts for another few uwsgi hosts. Three more jobs dead, sigh.	[tools]
20:21	<valhallasw`cloud>	now doing more rescheduling, with 5 sec intervals, on a sorted list to spread load between queues	[tools]
19:36	<valhallasw`cloud>	last restarted job is 1423661, rest of them are still in /home/valhallaw/webgrid_jobs	[tools]
19:35	<valhallasw`cloud>	one per second still seems to make SGE unhappy; there's a whole set of jobs dying, mostly uwsgi?	[tools]
19:31	<valhallasw`cloud>	https://phabricator.wikimedia.org/T110861 : rescheduling 521 webgrid jobs, at a rate of one per second, while watching the accounting log for issues	[tools]
07:31	<valhallasw`cloud>	removed paniclog on tools-submit; probably related to the NFS outage yesterday (although I'm not sure why that would give OOMs)	[tools]
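
Note on the 20:21 and 20:25 entries: the time estimate follows directly from the job count and the pause between jobs. The sketch below reproduces that estimate and shows one way a fixed-interval loop could be written. The `action` callback is a hypothetical placeholder; the log does not say which command was used to reschedule each job.

```python
import time


def estimated_minutes(job_count, seconds_per_job):
    """Total wall-clock time in minutes for a fixed per-job interval."""
    return job_count * seconds_per_job / 60


def throttled(job_ids, action, interval_s, sleep=time.sleep):
    """Call action(job_id) for each job, pausing interval_s between calls."""
    for index, job_id in enumerate(job_ids):
        if index:  # no pause before the first job
            sleep(interval_s)
        action(job_id)


# "ca 500 jobs @ 5s/job": a little under 42 minutes
print(round(estimated_minutes(500, 5), 1))
```

Passing `sleep` as a parameter keeps the loop testable without actually waiting.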
2015-08-30 §
13:23	<valhallasw`cloud>	killed wikibugs-backup and grrrit-wm on tools-webproxy-01	[tools]
13:20	<valhallasw`cloud>	disabling 503 error page	[tools]
13:01	<YuviPanda>	rebooted tools-bastion-01 to see if that remounts NFS	[tools]
10:57	<valhallasw`cloud>	started wikibugs from tools-webproxy-01 as well, still need to check if the phab<->redis part is still alive	[tools]
10:55	<valhallasw`cloud>	restarted grrrit-wm from tools-webproxy-01	[tools]
10:53	<valhallasw`cloud>	Set error page on tools webserver via Hiera + some manual hacking (https://wikitech.wikimedia.org/wiki/Hiera:Tools)	[tools]
2015-08-27 §
15:00	<valhallasw`cloud>	killed multiple kmlexport processes on tools-webgrid-lighttpd-1401 again	[tools]
2015-08-25 §
14:58	<YuviPanda>	pooled in two new instances for the precise exec pool	[tools]
14:45	<YuviPanda>	reboot tools-exec-1221	[tools]
14:26	<YuviPanda>	rebooting tools-exec-1220 because NFS wedge...	[tools]
14:18	<YuviPanda>	pooled in tools-webgrid-generic-1405	[tools]
10:16	<YuviPanda>	created tools-webgrid-generic-1405	[tools]
10:04	<YuviPanda>	apply exec node puppet roles to tools-exec-1220 and -1221	[tools]
09:59	<YuviPanda>	created tools-exec-1220 and -1221	[tools]
2015-08-24 §
16:37	<valhallasw`cloud>	more processes were started, so added a talk page message on [[User:Coet]] (who was starting the processes according to /var/log/auth.log) and used 'write coet' on tools-bastion-01	[tools]
16:15	<valhallasw`cloud>	kill -9'ing because normal killing doesn't work	[tools]
16:13	<valhallasw`cloud>	killing all processes of tools.cobain which are flooding tools-bastion-01	[tools]
2015-08-20 §
18:44	<valhallasw`cloud>	both are now at 3dbbc87	[tools]
18:43	<valhallasw`cloud>	running git reset --hard origin/master on both checkouts. Old HEAD is 86ec36677bea85c28f9a796f7e57f93b1b928fa7 (-01) / c4abeabd3acf614285a40e36538f50655e53b47d (-02).	[tools]
18:42	<valhallasw`cloud>	tools-web-static-01 has the same issue, but with different commit ids (because different hostname). No local changes on static-01. The initial merge commit on -01 is 57994c, merging 1e392ab and fc918b8; on -02 it's 511617f, merging a90818c and fc918b8.	[tools]
18:39	<valhallasw`cloud>	cdnjs on tools-web-static-02 can't pull because it has a dirty working tree, and there's a bunch of weird merge commits. Old commit is c4abeabd3acf614285a40e36538f50655e53b47d, the dirty working tree is changes from http to https in various files	[tools]
17:06	<valhallasw`cloud>	wait, what timezone is this?!	[tools]
17:05	<valhallasw`cloud>	wait, what timezone is this?!	[tools]
2015-08-19 §
10:45	<valhallasw`cloud>	ran `for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do echo $i; ssh $i sudo service gridengine-exec start; done`; this fixed queues on tools-exec-1404 tools-exec-1409 tools-exec-1410 tools-webgrid-lighttpd-1406	[tools]
2015-08-18 §
13:57	<valhallasw`cloud>	same issue seems to happen with the other hosts: tools-exec-1401.tools.eqiad.wmflabs vs tools-exec-1401.eqiad.wmflabs and tools-exec-catscan.tools.eqiad.wmflabs vs tools-exec-catscan.eqiad.wmflabs.	[tools]
13:55	<valhallasw`cloud>	no, wait, that's ''tools-webgrid-lighttpd-1411.eqiad.wmflabs'', not the actual host ''tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs''. We should fix that dns mess as well.	[tools]
13:54	<valhallasw`cloud>	tried to restart gridengine-exec on tools-exec-1401, no effect. tools-webgrid-lighttpd-1411 also just went into 'au' state.	[tools]
13:47	<valhallasw`cloud>	that brought tools-exec-1403, tools-exec-1406 and tools-webgrid-generic-1402 back up, tools-exec-1401 and tools-exec-catscan are still in 'au' state	[tools]
13:46	<valhallasw`cloud>	starting gridengine-exec on hosts with queues in 'au' (=alarm, unknown) state using `for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do echo $i; ssh $i sudo service gridengine-exec start; done`	[tools]
08:37	<valhallasw`cloud>	sudo service gridengine-exec start on tools-webgrid-lighttpd-1404.eqiad.wmflabs, tools-webgrid-lighttpd-1406.eqiad.wmflabs and tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs	[tools]
08:33	<valhallasw`cloud>	tools-webgrid-lighttpd-1403.eqiad.wmflabs, tools-webgrid-lighttpd-1404.eqiad.wmflabs and tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs are all broken (queue dropped because it is temporarily not available)	[tools]
08:30	<valhallasw`cloud>	hostname mismatch: host is called tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs in config, but it was named tools-webgrid-lighttpd-1411.eqiad.wmflabs in the hostgroup config	[tools]
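
Note on the 13:46 one-liner: the grep/cut pipeline depends on `<name>` appearing within six lines before `<state>` in the XML output. A rough equivalent that parses the XML instead is sketched below. The sample document is a hand-written assumption about the output shape (a `<name>` of the form queue@host and a sibling `<state>`), not captured output, and `hosts_in_state` is an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; the real output has more fields per queue instance.
SAMPLE = """<job_info>
  <queue_info>
    <Queue-List>
      <name>task@tools-exec-1401.eqiad.wmflabs</name>
      <state>au</state>
    </Queue-List>
    <Queue-List>
      <name>task@tools-exec-1403.eqiad.wmflabs</name>
    </Queue-List>
  </queue_info>
</job_info>"""


def hosts_in_state(xml_text, wanted="au"):
    """Short hostnames of queue instances whose state starts with `wanted`."""
    root = ET.fromstring(xml_text)
    hosts = []
    for element in root.iter():
        name = element.findtext("name")
        state = element.findtext("state")
        if name and state and state.startswith(wanted):
            # Same trimming as cut -d'@' -f2 | cut -d. -f1
            short = name.split("@", 1)[-1].split(".", 1)[0]
            if short not in hosts:
                hosts.append(short)
    return hosts


print(hosts_in_state(SAMPLE))
```

Because the short name is cut at the first dot, this has the same limitation as the original pipeline: it cannot distinguish tools-exec-1401.eqiad.wmflabs from tools-exec-1401.tools.eqiad.wmflabs.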