2014-07-15
17:58 <ori> _joe_ deployed jobrunner to all job runners [production]
17:40 <manybubbles> my last attempt to lower the concurrent recovery traffic was a failure - tried again and succeeded. That seems to have fixed the service disruption echoing from taking elastic1017 out of service [production]
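(For context, shard recoveries after a node leaves and rejoins the cluster can be watched through Elasticsearch's cat API. A minimal Python sketch; the host name is an assumption, _cat/recovery is a standard endpoint in the ES 1.x line used here:)

    # Illustrative: list shard recoveries on the search cluster.
    # The host is a placeholder for any node in the elastic10xx pool.
    import urllib.request

    url = "http://elastic1001.eqiad.wmnet:9200/_cat/recovery?v"
    with urllib.request.urlopen(url) as resp:
        print(resp.read().decode())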
17:37 <ori> updated jobrunner to bef32b9120 [production]
17:29 <manybubbles> elastic1017 went nuts again. just shutting elasticsearch off on it for now [production]
16:25 <_joe_> all mw servers updated [production]
16:10 <_joe_> mw1100 and onwards updated [production]
16:00 <_joe_> mw1060-mw1099 updated [production]
15:58 <manybubbles> restarting Elasticsearch on elastic1017 - it's thrashing the disk again. I'm still not 100% sure why [production]
15:57 <_joe_> mw1020-mw1059 updated [production]
15:53 <_joe_> mw101[0-9] updated [production]
15:47 <_joe_> starting rolling update of all appservers to apache2 2.2.22-1ubuntu1.6, half of them are on 2.2.22-1ubuntu1.5 now [production]
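(The rolling pattern above - upgrade one block of appservers, verify, move on to the next - can be sketched roughly as below. The host list, batch boundaries, and install command are placeholders, not the actual deployment tooling used:)

    # Rough sketch of a batched rolling upgrade; hosts and the install
    # command are placeholders for whatever orchestration is actually used.
    import subprocess

    hosts = [f"mw{n}" for n in range(1010, 1020)]  # e.g. the mw101[0-9] block
    for host in hosts:
        subprocess.run(
            ["ssh", host, "sudo apt-get install -y apache2"],
            check=True,  # stop the roll if any host fails to upgrade
        )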
15:42 <manybubbles> setting the filter cache on one node in the cluster set it on all. yay, I guess. Anyway, I'm going to let it soak for a while. [production]
15:32 <manybubbles> setting filter cache size to 20% on elastic1001 to see if it takes/helps us [production]
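(The two entries above describe a filter cache change that turned out to apply cluster-wide. A minimal sketch of pushing such a setting through the cluster settings API follows; the host, the exact setting key, and whether the change was actually made this way are assumptions:)

    # Illustrative: apply a transient cluster-wide setting via the
    # Elasticsearch cluster settings API. Host and setting key are assumptions.
    import json
    import urllib.request

    body = json.dumps({"transient": {"indices.cache.filter.size": "20%"}}).encode()
    req = urllib.request.Request(
        "http://elastic1001.eqiad.wmnet:9200/_cluster/settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.read().decode())  # {"acknowledged":true,...} on success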
15:19 <anomie> Synchronized wmf-config/: SWAT: Remove dead ULS variable [[gerrit:145861]] (duration: 00m 10s) [production]
15:18 <anomie> anomie actually committed a live hack someone left on tin (removing db1035) [production]
15:16 <anomie> updated /a/common to {{Gerrit|I7ca6a16d5}}: Switch jawiki back to lsearchd [production]
13:42 <manybubbles> Synchronized wmf-config/InitialiseSettings.php: jawiki back to lsearchd (duration: 00m 05s) [production]
13:38 <manybubbles> elastic1017 had a load average of 60 - was thrashing in io. Bounced Elasticsearch; let's see if it recovers on its own [production]
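(A load average of 60 on an I/O-thrashing node is the kind of figure a one-line check surfaces; illustrative only, run on the host in question:)

    # Illustrative: print the 1/5/15-minute load averages on the local host.
    import os

    one, five, fifteen = os.getloadavg()
    print(f"load averages: {one:.2f} {five:.2f} {fifteen:.2f}")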
09:09 <_joe_> restarting mailman on sodium, again, for testing [production]
08:50 <godog> restart mailman on sodium after inodes freed [production]
07:27 <_joe_> restarted mailman on sodium [production]
07:22 <_joe_> stopping mailman on sodium for repairs [production]
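(The mailman entries above stem from the filesystem running out of inodes rather than disk space. An illustrative inode-usage check; "/" is a placeholder for whichever mount held the mailman data on sodium:)

    # Illustrative: report inode usage for a filesystem.
    import os

    st = os.statvfs("/")
    used = st.f_files - st.f_ffree
    print(f"inodes: {used}/{st.f_files} used ({used / st.f_files:.1%})")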
06:54 <_joe_> killed jenkins stale process on gallium, stuck in a futex while shutting down [production]
04:48 <springle> db1035 crash cycle. down for memtest and stuff [production]
03:34 <LocalisationUpdate> ResourceLoader cache refresh completed at Tue Jul 15 03:33:38 UTC 2014 (duration 33m 37s) [production]
03:01 <LocalisationUpdate> completed (1.24wmf13) at 2014-07-15 03:00:03+00:00 [production]
02:34 <springle> Synchronized wmf-config/db-eqiad.php: depool db1035, crashed (duration: 00m 13s) [production]
02:30 <LocalisationUpdate> completed (1.24wmf12) at 2014-07-15 02:29:02+00:00 [production]
02:27 <springle> powercycle db1035 unresponsive [production]
2014-07-14
23:32 <mwalker> Started scap: Updating for SWAT {{gerrit|146304}}, {{gerrit|146306}}, {{gerrit|146149}}, {{gerrit|146165}}, {{gerrit|146166}}, {{gerrit|146282}}, and {{gerrit|146281}}. Also finishing awight's deploy of FundraisingTranslateWorkflow. [production]
20:22 <cscott> updated Parsoid to version d51e64097bb1b18e356584d4f3ddcfd90a6071ba [production]
19:57 <ori> postponing jobrunner deployment to tomorrow; ran over time [production]
19:45 <_joe_> doing the same on mw1064, segfaulted for the same reason [production]
19:44 <_joe_> killed a lone apache2 child on mw1152, stuck in a futex, after a segfault of another apache process. Restarted apache, now working correctly [production]
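(For the stuck-in-a-futex diagnosis above, the kernel wait channel of a process is one quick thing to check; the PID below is a placeholder:)

    # Illustrative: show what a process is blocked on in the kernel.
    # A child parked in futex() typically reports something like
    # "futex_wait_queue_me" here.
    pid = 12345  # placeholder for the stuck apache2 child
    with open(f"/proc/{pid}/wchan") as f:
        print(f.read())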
19:04 <godog> re-enabling mailman on sodium, missing list config restored [production]
18:49 <awight> Synchronized wmf-config: Deploying FundraisingTranslateWorkflow on metawiki (t [production]
18:45 <awight> Synchronized php-1.24wmf13/extensions/FundraisingTranslateWorkflow: Update FundraisingTranslateWorkflow extension (wmf13) (duration: 00m 05s) [production]
18:44 <awight> Synchronized php-1.24wmf12/extensions/FundraisingTranslateWorkflow: Update FundraisingTranslateWorkflow extension (duration: 00m 05s) [production]
18:15 <awight> Synchronized wmf-config: Revert: Deploying FundraisingTranslateWorkflow on metawiki (duration: 00m 04s) [production]
18:03 <awight> Synchronized wmf-config: Deploying FundraisingTranslateWorkflow on metawiki (duration: 00m 05s) [production]
18:03 <awight> updated /a/common to {{Gerrit|Ie7599fb6e}}: jawiki gets Cirrus as primary search [production]
17:43 <Krinkle> npm-cache for integration slaves got corrupted again. Depooling/Repooling integration-slave100{1,2,3} one by one to clear cache and let it warm up again. [production]
17:35 <Krinkle> Jenkins slaves in labs are unable to reach zuul.eqiad.wmnet [production]
17:10 <andrewbogott> purging old local-* service group entries from labs ldap (via purgeOldServiceGroups.php) [production]
17:05 <godog> started mailman on sodium post-reboot [production]
17:04 <demon> Synchronized wmf-config/InitialiseSettings.php: nlwiki getting cirrus as primary (duration: 00m 04s) [production]
15:11 <manybubbles> Synchronized wmf-config: SWAT update cirrus settings for commons (duration: 00m 04s) [production]
15:04 <manybubbles> Synchronized wmf-config: SWAT update cirrus settings for commons (duration: 00m 04s) [production]
15:02 <manybubbles> Synchronized wmf-config: SWAT update cirrus settings for commons (duration: 00m 05s) [production]
14:39 <_joe_> rebooted nescio, stuck and with console showing just a truncated log (timestamp only) [production]