2012-01-04
18:22 <catrope> synchronized wmf-config/CommonSettings.php 'Enable tracking for AFTv5 bucketing' [production]
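The "synchronized ..." lines in this log are written automatically by the deployment tooling on the deployment host. A minimal sketch of the kind of invocation that produces the entry above, assuming the sync-file wrapper in use at the time (the tool name and log format are assumptions; the message is taken from the entry):

    # Push one config file to the cluster; the wrapper logs the message to this log.
    sync-file wmf-config/CommonSettings.php 'Enable tracking for AFTv5 bucketing'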
18:06 <mutante> duplicate nagios-wm instances on spence (/home/wikipedia/bin/ircecho vs. /usr/ircecho/bin/ircecho) killed them both, restarted with init.d/ircecho [production]
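A sketch of the cleanup described above, assuming the duplicate ircecho copies can be matched by path and that the init script lives at /etc/init.d/ircecho:

    pgrep -fl ircecho                            # list both running copies
    sudo pkill -f /home/wikipedia/bin/ircecho    # kill the stray instance
    sudo pkill -f /usr/ircecho/bin/ircecho       # kill the packaged instance
    sudo /etc/init.d/ircecho start               # bring the bot back up cleanly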
18:00 <catrope> synchronized php-1.18/resources/mediawiki/mediawiki.user.js 'Live hack for tracking a percentage of bucketing events' [production]
17:52 <mutante> knsq11 is broken. boots into installer, then "Dazed and confused" at hardware detection (NMI received for unknown reason 21 on CPU 0). -> RT 2206 [production]
17:38 <mutante> powercycling knsq11 [production]
15:52 <mutante> added project deployment-prep for hexmode and petan [production]
11:31 <catrope> synchronized php-1.18/extensions/ClickTracking/ClickTracking.hooks.php '[[rev:108017|r108017]]' [production]
08:44 <nikerabbit> synchronized php-1.18/includes/specials/SpecialAllmessages.php '[[rev:107998|r107998]]' [production]
07:40 <Tim> fixed puppet by re-running the post-merge hook with key forwarding enabled, and then started puppet on ms6 [production]
07:32 <Tim> on ms6.esams: fixed proxy IP address and stopped puppet while I figure out how to fix it [production]
03:25 <Tim> experimentally raised max_concurrent_checks to 128 [production]
03:17 <Tim> on spence in nagios.cfg, reduced service_reaper_frequency from 10 to 1, to avoid having a massive process count spike every 10 seconds as checks are started. Locally only as a test. [production]
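A sketch of the local-only nagios.cfg change described in the two entries above; the setting names and values come from the log, while the config file path and the reload step are assumptions:

    # Reap check results every second instead of every 10, and allow more
    # checks in flight, to smooth out the periodic process count spike.
    sudo sed -i -e 's/^service_reaper_frequency=.*/service_reaper_frequency=1/' \
                -e 's/^max_concurrent_checks=.*/max_concurrent_checks=128/' \
                /etc/nagios/nagios.cfg
    sudo /etc/init.d/nagios reload               # apply without a full restart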
02:27 <Ryan_Lane> I should clarify that I removed 10.2.1.13 from /etc/network/interfaces, it's still properly bound to lo [production]
02:24 <Tim> on spence: setting up logrotate for nagios.log and removing nagios-bloated-log.log [production]
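A minimal sketch of the logrotate setup described above; only the log file name comes from the entry, while the drop-in path, the log directory, and the rotation policy are assumptions:

    # Create a logrotate drop-in for the Nagios log on spence.
    printf '%s\n' \
      '/var/log/nagios/nagios.log {' \
      '    weekly' \
      '    rotate 4' \
      '    compress' \
      '    missingok' \
      '    notifempty' \
      '}' | sudo tee /etc/logrotate.d/nagios >/dev/null
    sudo rm /var/log/nagios/nagios-bloated-log.log   # drop the old oversized log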
02:22 <Ryan_Lane> removing manually added 10.2.1.13 address from lvs4 [production]
02:01 <LocalisationUpdate> completed (1.18) at Wed Jan 4 02:04:57 UTC 2012 [production]
01:43 <Nemo_bis> Last week's slowness: job queue backlog now cleared on !Wikimedia Commons and (almost) English !Wikipedia http://ur1.ca/77q9b [production]
01:02 <reedy> synchronized php-1.18/includes/ '[[rev:107978|r107978]]' [production]
00:45 <reedy> synchronized php-1.18/extensions '[[rev:107977|r107977]], [[rev:107976|r107976]]' [production]
00:39 <Tim> running purgeParserCache.php on hume, deleting objects older than 3 months [production]
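A hedged sketch of the parser cache purge described above; purgeParserCache.php is a standard MediaWiki maintenance script, but the install path on hume and the exact age option are assumptions (3 months expressed as seconds):

    cd /home/wikipedia/common/php-1.18               # install path assumed
    # Delete parser cache objects older than ~3 months (90 days = 7776000 s).
    php maintenance/purgeParserCache.php --age 7776000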
00:38 <reedy> synchronized php-1.18/includes/specials/ '[[rev:107975|r107975]]' [production]
00:29 <tstarling> synchronizing Wikimedia installation... : [production]
00:27 <reedy> synchronized php-1.18/extensions/Nuke/ '[[rev:107974|r107974]]' [production]
00:25 <reedy> synchronized php-1.18/extensions/ '[[rev:107970|r107970]]' [production]
2012-01-03
23:00 <Tim> on spence: restarting gmetad [production]
22:58 <reedy> synchronizing Wikimedia installation... : Pushing [[rev:107953|r107953]], [[rev:107955|r107955]], [[rev:107956|r107956]], [[rev:107957|r107957]] [production]
22:47 <LeslieCarr> stopping and then starting apache2 on spence to try and lower load [production]
22:29 <RobH> added the lo address to lvs4; now it's working and generating thumbnails [production]
22:09 <reedy> synchronizing Wikimedia installation... : Push [[rev:107938|r107938]] [[rev:107948|r107948]] [production]
21:45 <RobH> ganglia graphs will have missing data for past 30 to 40 minutes [production]
21:45 <RobH> spence back online, ganglia and nagios confirmed operational [production]
21:38 <RobH> resetting spence and dropping to serial to try to fix it [production]
21:25 <RobH> nagios and ganglia down due to spence reboot, system still coming back online [production]
21:21 <RobH> spence is unresponsive to ssh and serial console, rebooting [production]
21:14 <LeslieCarr> resetting DRAC 5 on spence for management connectivity [production]
21:05 <binasher> that fixed it. but how did that happen? [production]
21:05 <binasher> ran ip addr add 10.2.1.22/32 label "lo:LVS" dev lo on lvs4 [production]
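The command in the entry above binds an LVS service IP on the loopback interface so the host will accept traffic routed to that address; the 22:29 entry notes thumbnails started working again once the address was in place. A sketch of the same fix plus a persistent variant (the interfaces stanza is an assumption about how it would be made permanent, not something taken from the log):

    # One-off fix, as run on lvs4 (command taken from the entry above):
    sudo ip addr add 10.2.1.22/32 label "lo:LVS" dev lo
    ip addr show dev lo                              # verify the address is bound

    # Possible /etc/network/interfaces stanza to survive a reboot (layout assumed;
    # the 2012-01-04 entries about 10.2.1.13 show how a manually added address can
    # drift out of sync with this file):
    #   auto lo:LVS
    #   iface lo:LVS inet static
    #       address 10.2.1.22
    #       netmask 255.255.255.255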
19:36 <reedy> synchronized php-1.18/skins/common/images/ '[[rev:107930|r107930]]' [production]
17:36 <mutante> killing more runJobs.php / nextJobDB.php processes on a bunch of servers (/home/catrope/badjobrunners) [production]
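A minimal sketch of the cleanup described in the entry above, assuming /home/catrope/badjobrunners holds one hostname per line and that ssh access to the affected servers is available:

    # Kill stray job runner processes on every host listed in the file.
    while read -r host; do
        echo "== $host =="
        ssh "$host" 'sudo pkill -f runJobs.php; sudo pkill -f nextJobDB.php'
    done < /home/catrope/badjobrunners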
17:26 <RoanKattouw> Stopping job runners on the following DECOMMISSIONED servers: srv151 srv152 srv153 srv158 srv160 srv164 srv165 srv166 srv167 srv168 srv170 srv176 srv177 srv178 srv181 srv184 srv185 [production]
15:55 <RobH> torrus back, took forever to recompile [production]
15:53 <reedy> synchronized wmf-config/InitialiseSettings.php 'Bug 33485 - Enable WikiLove in si.wikipedia' [production]
15:52 <Reedy> Created wikilove tables on siwiki [production]
15:46 <RobH> torrus deadlocked, kicking [production]
14:00 <RoanKattouw> Restarting job runners on srv242 and mw25, those are the last ones that are stuck [production]
13:57 <RoanKattouw> Restarting all job runners that are stuck [production]
13:48 <RoanKattouw> Restarting job runner on srv236, seems to be stuck [production]
02:02 <LocalisationUpdate> completed (1.18) at Tue Jan 3 02:05:21 UTC 2012 [production]