2012-07-02
14:37 <mark> Shutdown PyBal BGP sessions on cr1-sdtpa [production]
14:34 <mark> Shutdown BGP session to 2828 on cr1-sdtpa [production]
13:36 <hashar> db12 suffering some 1400sec (and growing) replag. mysqldump in progress on that host. [production]
12:35 <mutante> installing upgrades on fenari (linux-firmware linux-libc-dev..) [production]
12:27 <mutante> rebooting gallium one more time to install kernel [production]
12:26 <mutante> upgrading kernel on gallium [production]
12:23 <hashar> synchronized live-1.5/CREDITS [production]
11:31 <mark> Now we have packet loss within pmtpa/sdtpa... reverting change [production]
10:57 <mark> Problems on one of two pmtpa-eqiad waves; raised OSPF metric to 60 to failover traffic to the other link [production]
10:50 <Tim> fixing leap second issue on bastion1 by rebooting it [production]
10:47 <Tim> fixed leap second issue on bastion-restricted [production]
09:57 <Tim> fixing leap second issue on virt1,virt2,virt3,virt4,virt5 [production]
09:53 <Tim> fixing leap second issue on aluminium,gallium,manganese [production]
09:47 <Tim> fixing leap second issue on formey,grosley,hooper,sanger,sockpuppet [production]
09:43 <Tim> on fenari: fixed leap second issue with the mozilla method [production]
09:39 <apergos> rebooting gallium, it's pretty unhappy (maybe related to leap second issue) [production]
08:14 <hashar> srv190 srv266 srv281 timeouts on sync-file [production]
08:14 <hashar> synchronized wmf-config/InitialiseSettings.php 'Bug 37457 - fix import sources for viwikibooks' [production]
08:11 <hashar> Stopped Jenkins on gallium. It is not doing anything anyway. Asked to reboot box {{rt|3208}} [production]
02:53 <LocalisationUpdate> completed (1.20wmf5) at Mon Jul 2 02:53:51 UTC 2012 [production]
02:28 <LocalisationUpdate> completed (1.20wmf6) at Mon Jul 2 02:28:48 UTC 2012 [production]
01:48 <Tim> kill -CONT on populateRevisionSha1.php processes [production]
00:47 <Tim> on nfs1: trying leap second fix suggested at https://bugzilla.mozilla.org/show_bug.cgi?id=769972#c5 [production]
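For reference, the workaround discussed in that bug report amounts to resetting the system clock, which clears the kernel's stuck leap-second state. A rough sketch of the commonly cited form, with assumed sysvinit-style service names (not the exact commands run on nfs1):

    # Assumed sketch of the 2012 leap-second clock-reset workaround; not the
    # literal commands from the bug comment or the ones run on nfs1.
    /etc/init.d/ntp stop       # keep ntpd from immediately re-arming the leap flag
    date -s "$(date)"          # settimeofday() call clears the kernel leap-second state
    /etc/init.d/ntp start      # resume normal time synchronisation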
00:26 <tstarling> synchronized wmf-config/db.php 'reduce db32 read load to zero due to persistent lag' [production]
00:12 <Tim> switched enwiki back to r/w [production]
00:12 <tstarling> synchronized wmf-config/db.php [production]
00:06 <Tim> on hume: stopped all populateRevisionSha1.php processes with kill -STOP [production]
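For reference, pausing and later resuming those maintenance scripts can be done by command-line pattern rather than by listing PIDs; the pkill form below is only an illustration, the log records just that kill -STOP (and kill -CONT at 01:48 above) was used:

    # Freeze the maintenance scripts without killing them, then resume later.
    pkill -STOP -f populateRevisionSha1.php   # -f matches the full command line
    # ... once replication lag has recovered ...
    pkill -CONT -f populateRevisionSha1.php   # continue where they left off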
00:03 <reedy> synchronized wmf-config/db.php 's1/enwiki into readonly' [production]
2012-07-01
19:12 <reedy> synchronized php-1.20wmf6/extensions/WikimediaMaintenance/ 'Update to master for hashar' [production]
17:55 <aaron> synchronized php-1.20wmf5/includes/WikiPage.php 'more logging' [production]
17:45 <aaron> synchronized php-1.20wmf5/includes/WikiPage.php 'more logging' [production]
17:43 <aaron> synchronized php-1.20wmf5/includes/WikiPage.php 'more logging' [production]
17:32 <aaron> synchronized php-1.20wmf5/includes/WikiPage.php [production]
17:30 <aaron> synchronized php-1.20wmf5/includes/WikiPage.php [production]
16:53 <aaron> synchronized php-1.20wmf5/includes/WikiPage.php [production]
16:48 <aaron> synchronized php-1.20wmf5/includes/WikiPage.php 'logging' [production]
12:54 <notpeter> also going to reboot all pmtpa search nodes. not in prod, but are still freaking out from leap second bug. [production]
05:33 <aaron> synchronized php-1.20wmf5/includes/WikiPage.php 'logging' [production]
04:25 <LocalisationUpdate> completed (1.20wmf5) at Sun Jul 1 04:25:25 UTC 2012 [production]
04:06 <Ryan_Lane> virt1000 is back up, rebooting virt0 [production]
04:02 <Ryan_Lane> rebooting virt1000 [production]
03:16 <LocalisationUpdate> completed (1.20wmf6) at Sun Jul 1 03:16:39 UTC 2012 [production]
01:43 <notpeter> that worked. restarting all remaining search nodes. [production]
01:39 <notpeter> problem with lucene persisting through service restart, but not node restart. restarting en pool nodes. [production]
01:20 <paravoid> restarting opendj (nfs1/nfs2), load spike, possibly related to leap second [production]
00:51 <notpeter> search1004 dead. powercycling. [production]
00:50 <notpeter> based on ganglia evidence, lucene seems to have been affected by leap second bug. restarting each instance, one minute wait in between [production]
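A minimal sketch of that kind of rolling restart, assuming SSH access to the search nodes; the host list and init script name below are illustrative assumptions, not taken from the log:

    # Restart each search instance in turn, waiting a minute between nodes.
    # Hostnames and service name are assumptions for illustration only.
    for host in search1001 search1002 search1003 search1004; do
        ssh "$host" 'sudo /etc/init.d/lucene-search-2 restart'
        sleep 60    # one minute wait in between, as noted in the log entry
    done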