2014-07-15
17:40 |
<manybubbles> |
my last attempt to lower the concurrent recovery traffic was a failure; tried again and succeeded. That seems to have fixed the echo service disruption caused by taking elastic1017 out of service |
[production] |
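(For context: lowering recovery traffic on an Elasticsearch 1.x cluster is normally done through the transient cluster-settings API. The sketch below is an assumption about the mechanism, not the command actually run; the setting names and values are illustrative, and elastic1001 is a stand-in coordinating node.)

```python
# Hedged sketch: throttle shard-recovery traffic via the transient
# cluster-settings API. Setting names and values are illustrative only.
import json
import urllib.request

ES_URL = "http://elastic1001:9200"  # stand-in coordinating node

payload = {
    "transient": {
        "indices.recovery.max_bytes_per_sec": "20mb",
        "cluster.routing.allocation.node_concurrent_recoveries": 1,
    }
}

req = urllib.request.Request(
    ES_URL + "/_cluster/settings",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))
```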
17:37 |
<ori> |
updated jobrunner to bef32b9120 |
[production] |
17:29 |
<manybubbles> |
elastic1017 went nuts again. just shutting elasticsearch off on it for now |
[production] |
16:25 |
<_joe_> |
all mw servers updated |
[production] |
16:10 |
<_joe_> |
mw1100 and onwards updated |
[production] |
16:00 |
<_joe_> |
mw1060-mw1099 updated |
[production] |
15:58 |
<manybubbles> |
restarting Elasticsearch on elastic1017 - it's thrashing the disk again. I'm still not 100% sure why |
[production] |
15:57 |
<_joe_> |
mw1020-mw1059 updated |
[production] |
15:53 |
<_joe_> |
mw101[0-9] updated |
[production] |
15:47 |
<_joe_> |
starting rolling update of all appservers to apache2 2.2.22-1ubuntu1.6, half of them are on 2.2.22-1ubuntu1.5 now |
[production] |
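(The batches logged above at 15:53-16:25 walked through the appservers in blocks. As a rough illustration only, not the tooling actually used, a batched rolling upgrade could look like the sketch below; the host range endpoint, batch size, and ssh/apt-get invocation are all assumptions.)

```python
# Hedged sketch of a batched rolling package upgrade across appservers.
import subprocess

HOSTS = ["mw%d" % n for n in range(1010, 1150)]  # range endpoint assumed
BATCH_SIZE = 40

for i in range(0, len(HOSTS), BATCH_SIZE):
    batch = HOSTS[i:i + BATCH_SIZE]
    for host in batch:
        # pin the exact package version named in the log entry
        subprocess.run(
            ["ssh", host,
             "sudo apt-get install -y apache2=2.2.22-1ubuntu1.6"],
            check=True,
        )
    print("updated %s through %s" % (batch[0], batch[-1]))
```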
15:42 |
<manybubbles> |
setting the filter cache on one node in the cluster set it on all. yay, I guess. Anyway, I'm going to let it soak for a while. |
[production] |
15:32 |
<manybubbles> |
setting filter cache size to 20% on elastic1001 to see if it takes/helps us |
[production] |
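(The two entries above describe setting the filter cache size to 20% and finding that the change applied cluster-wide. A minimal sketch of how that could happen, assuming the indices.cache.filter.size setting was accepted through the transient cluster-settings API on this Elasticsearch 1.x cluster, which would make any node's update take effect everywhere:)

```python
# Hedged sketch: apply a filter-cache size via the transient
# cluster-settings API; dynamic updatability of this setting is assumed.
import json
import urllib.request

payload = {"transient": {"indices.cache.filter.size": "20%"}}

req = urllib.request.Request(
    "http://elastic1001:9200/_cluster/settings",  # node named in the entry
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))
```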
15:19 |
<anomie> |
Synchronized wmf-config/: SWAT: Remove dead ULS variable [[gerrit:145861]] (duration: 00m 10s) |
[production] |
15:18 |
<anomie> |
anomie actually committed a live hack someone left on tin (removing db1035) |
[production] |
15:16 |
<anomie> |
updated /a/common to {{Gerrit|I7ca6a16d5}}: Switch jawiki back to lsearchd |
[production] |
13:42 |
<manybubbles> |
Synchronized wmf-config/InitialiseSettings.php: jawiki back to lsearchd (duration: 00m 05s) |
[production] |
13:38 |
<manybubbles> |
elastic1017 had a load average of 60 - it was thrashing in IO. Bounced Elasticsearch; let's see if it recovers on its own |
[production] |
09:09 |
<_joe_> |
restarting mailman on sodium, again, for testing |
[production] |
08:50 |
<godog> |
restart mailman on sodium after inodes freed |
[production] |
07:27 |
<_joe_> |
restarted mailman on sodium |
[production] |
07:22 |
<_joe_> |
stopping mailman on sodium for repairs |
[production] |
06:54 |
<_joe_> |
killed a stale Jenkins process on gallium, stuck in a futex while shutting down |
[production] |
04:48 |
<springle> |
db1035 is in a crash cycle; down for memtest and further diagnostics |
[production] |
03:34 |
<LocalisationUpdate> |
ResourceLoader cache refresh completed at Tue Jul 15 03:33:38 UTC 2014 (duration 33m 37s) |
[production] |
03:01 |
<LocalisationUpdate> |
completed (1.24wmf13) at 2014-07-15 03:00:03+00:00 |
[production] |
02:34 |
<springle> |
Synchronized wmf-config/db-eqiad.php: depool db1035, crashed (duration: 00m 13s) |
[production] |
02:30 |
<LocalisationUpdate> |
completed (1.24wmf12) at 2014-07-15 02:29:02+00:00 |
[production] |
02:27 |
<springle> |
powercycled db1035, which was unresponsive |
[production] |
2014-07-14
23:32 |
<mwalker> |
Started scap: Updating for SWAT {{gerrit|146304}}, {{gerrit|146306}}, {{gerrit|146149}}, {{gerrit|146165}}, {{gerrit|146166}}, {{gerrit|146282}}, and {{gerrit|146281}}. Also finishing awight's deploy of FundraisingTranslateWorkflow. |
[production] |
20:22 |
<cscott> |
updated Parsoid to version d51e64097bb1b18e356584d4f3ddcfd90a6071ba |
[production] |
19:57 |
<ori> |
postponing jobrunner deployment to tomorrow; ran over time |
[production] |
19:45 |
<_joe_> |
doing the same on mw1064, segfaulted for the same reason |
[production] |
19:44 |
<_joe_> |
killed a lone apache2 child on mw1152, stuck in a futex, after a segfault of another apache process. Restarted apache, now working correctly |
[production] |
19:04 |
<godog> |
re-enabling mailman on sodium, missing list config restored |
[production] |
18:49 |
<awight> |
Synchronized wmf-config: Deploying FundraisingTranslateWorkflow on metawiki (t |
[production] |
18:45 |
<awight> |
Synchronized php-1.24wmf13/extensions/FundraisingTranslateWorkflow: Update FundraisingTranslateWorkflow extension (wmf13) (duration: 00m 05s) |
[production] |
18:44 |
<awight> |
Synchronized php-1.24wmf12/extensions/FundraisingTranslateWorkflow: Update FundraisingTranslateWorkflow extension (duration: 00m 05s) |
[production] |
18:15 |
<awight> |
Synchronized wmf-config: Revert: Deploying FundraisingTranslateWorkflow on metawiki (duration: 00m 04s) |
[production] |
18:03 |
<awight> |
Synchronized wmf-config: Deploying FundraisingTranslateWorkflow on metawiki (duration: 00m 05s) |
[production] |
18:03 |
<awight> |
updated /a/common to {{Gerrit|Ie7599fb6e}}: jawiki gets Cirrus as primary search |
[production] |
17:43 |
<Krinkle> |
npm-cache for integration slaves got corrupted again. Depooling/repooling integration-slave100{1,2,3} one by one to clear the cache and let it warm up again. |
[production] |
17:35 |
<Krinkle> |
Jenkins slaves in labs are unable to reach zuul.eqiad.wmnet |
[production] |
17:10 |
<andrewbogott> |
purging old local-* service group entries from labs ldap (via purgeOldServiceGroups.php) |
[production] |
17:05 |
<godog> |
started mailman on sodium post-reboot |
[production] |
17:04 |
<demon> |
Synchronized wmf-config/InitialiseSettings.php: nlwiki getting cirrus as primary (duration: 00m 04s) |
[production] |
15:11 |
<manybubbles> |
Synchronized wmf-config: SWAT update cirrus settings for commons (duration: 00m 04s) |
[production] |
15:04 |
<manybubbles> |
Synchronized wmf-config: SWAT update cirrus settings for commons (duration: 00m 04s) |
[production] |
15:02 |
<manybubbles> |
Synchronized wmf-config: SWAT update cirrus settings for commons (duration: 00m 05s) |
[production] |
14:39 |
<_joe_> |
rebooted nescio; it was stuck, with the console showing only a truncated log line (timestamp only) |
[production] |
14:33 |
<mutante> |
powercycling sodium |
[production] |