2012-01-05
18:00 |
<mutante> |
tarin - added "#includedir /etc/sudoers.d" to sudo config, needs to read /etc/sudoers.d/nrpe for Nagios RAID check |
[production] |
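`#includedir` only pulls in some of the files in the named directory. Per the sudoers(5) manual, sudo skips file names that end in `~` or contain a `.`. So `/etc/sudoers.d/nrpe` is read, but a file named `nrpe.conf` would be silently ignored. The leading `#` of `#includedir` is also not a comment marker. The following is a minimal Python sketch of that filtering rule. The helper names are hypothetical, and it does not reproduce sudo's other checks, such as ownership and permissions.

```python
import os
from typing import List


def is_included(name: str) -> bool:
    """sudoers(5): #includedir skips names ending in '~' or containing '.'."""
    return not (name.endswith("~") or "." in name)


def included_sudoers_files(directory: str = "/etc/sudoers.d") -> List[str]:
    """Return the file names sudo would read from an #includedir directory, sorted."""
    return [
        name
        for name in sorted(os.listdir(directory))
        if os.path.isfile(os.path.join(directory, name)) and is_included(name)
    ]
```

For example, given `nrpe`, `nrpe.dpkg-old` and `nrpe~`, only `nrpe` passes the filter.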
17:49 |
<logmsgbot_> |
hashar: gallium: cleaned /tmp. Our test suites leak a large number of files :D |
[production] |
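The log doesn't say how /tmp was cleaned. One common approach is to remove plain files older than some age. The following is a sketch under that assumption. The 7-day default and the function names are hypothetical, it only looks at top-level files, and it defaults to a dry run.

```python
import os
import time
from typing import Iterable, List, Tuple


def stale_paths(entries: Iterable[Tuple[str, float]], now: float, max_age_s: float) -> List[str]:
    """Return paths whose modification time is more than max_age_s seconds before now."""
    return [path for path, mtime in entries if now - mtime > max_age_s]


def clean_tmp(directory: str = "/tmp", max_age_s: float = 7 * 86400, dry_run: bool = True) -> List[str]:
    """Find (and, unless dry_run, delete) top-level regular files older than max_age_s."""
    entries = []
    with os.scandir(directory) as it:
        for entry in it:
            # Only plain files: skip directories, sockets and symlinks.
            if entry.is_file(follow_symlinks=False):
                entries.append((entry.path, entry.stat(follow_symlinks=False).st_mtime))
    victims = stale_paths(entries, time.time(), max_age_s)
    if not dry_run:
        for path in victims:
            try:
                os.remove(path)
            except FileNotFoundError:
                pass  # already removed by another process
    return victims
```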
17:49 |
<^demon> |
removed chuck norris plugin from jenkins, restarted |
[production] |
16:48 |
<mutante> |
payments4 - 25 running nginx procs trigger a warning, but that's normal; just raise the limit? |
[production] |
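"Raise the limit" here means widening the warning threshold of a process-count check. The following Python sketch shows how a count maps onto the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL). It is simplified to upper bounds only; real check_procs accepts full ranges such as `-w 1:20`. The thresholds in the example are hypothetical, because the log doesn't state the configured ones.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def proc_count_state(count: int, warn_max: int, crit_max: int) -> int:
    """Map a process count to a Nagios state using upper-bound thresholds only."""
    if count > crit_max:
        return CRITICAL
    if count > warn_max:
        return WARNING
    return OK
```

With a hypothetical `warn_max=20`, 25 nginx processes yield WARNING. Raising it to 30 yields OK, while `crit_max` still catches a real runaway.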
16:15 |
<mutante> |
people claim it was "completely resolved with 2.6.38-10 backport from PPA" (add-apt-repository ppa:kernel-ppa/ppa ...). wanna try that? (or just reboot ms1002 pls) |
[production] |
15:45 |
<mutante> |
ms1002 - kswapd 100% CPU - but no swap used and free memory left - this looks like https://bugs.launchpad.net/ubuntu/+bug/721896 again |
[production] |
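The "no swap used and free memory left" part of this diagnosis comes from /proc/meminfo. The following sketch parses that format, where lines look like `SwapFree:  1024 kB` and a few counters have no unit. It also computes swap in use. The function names are hypothetical.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style lines into {name: value}; values are kB where a unit is given."""
    info: Dict[str, int] = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[name.strip()] = int(fields[0])
    return info


def swap_used_kb(info: Dict[str, int]) -> int:
    """Swap in use = SwapTotal - SwapFree (0 if swap is absent)."""
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


# Usage on a live host:
# with open("/proc/meminfo") as f:
#     print(swap_used_kb(parse_meminfo(f.read())))
```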
15:39 |
<mutante> |
Nagios check_ntp produces output like: overall average offset: 0 -> NTP OK: Offset unknown| -> NTP CRITICAL: Offset unknown (even though this bug was supposed to be fixed in a version before the one we use)... sigh |
[production] |
15:14 |
<mutante> |
lvs1004 - puppet hadn't run for 12 hours and looked stuck, reporting "already in progress" on every run. rm /var/lib/puppet/state/puppetdlock, restarted puppet agent, and the run finished fine in a few seconds. maybe puppet bug 2888, 5246 or related |
[production] |
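The fix was manual: delete the stale lock file and restart the agent. The following sketch shows how a stale lock might be spotted by age before anyone deletes it. The 12-hour threshold mirrors this entry but is otherwise arbitrary, and the function names are hypothetical. The sketch checks only file age. A real cleanup should first confirm that no puppet process still holds the lock.

```python
import os
import time
from typing import Optional

LOCK_PATH = "/var/lib/puppet/state/puppetdlock"


def lock_age_seconds(path: str = LOCK_PATH, now: Optional[float] = None) -> Optional[float]:
    """Seconds since the lock file was last modified, or None if it doesn't exist."""
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return None
    return (time.time() if now is None else now) - mtime


def is_stale(age_s: Optional[float], max_age_s: float = 12 * 3600) -> bool:
    """Treat the lock as stale if it exists and is older than max_age_s."""
    return age_s is not None and age_s > max_age_s
```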
14:57 |
<mutante> |
magnesium - memcached runs on the default port 11211, but we run all the others on 11000, which causes a Nagios CRIT. Is it supposed to run here? (it was also bound to 127.0.0.1 only via -l, but the init script starts it listening on all interfaces) |
[production] |
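This entry and the srv290 one below describe the same kind of drift: memcached running on its default port 11211 instead of the fleet's 11000, or bound to 127.0.0.1 only. The following sketch flags both by reading a memcached command line. It assumes space-separated flags (`-p 11000`, not `-p11000`) and takes 11000 as the expected port from this log. The function names are hypothetical.

```python
import shlex
from typing import Dict, List

EXPECTED_PORT = "11000"  # fleet-wide port per this log; memcached's own default is 11211


def parse_memcached_args(cmdline: str) -> Dict[str, str]:
    """Extract -p (port), -l (listen address) and -m (memory, MB) from a command line."""
    tokens = shlex.split(cmdline)
    opts: Dict[str, str] = {}
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-p", "-l", "-m"):
            opts[flag] = value
    return opts


def config_problems(cmdline: str) -> List[str]:
    """List deviations from the expected port and listen address."""
    opts = parse_memcached_args(cmdline)
    problems = []
    port = opts.get("-p", "11211")  # memcached uses 11211 when -p is omitted
    if port != EXPECTED_PORT:
        problems.append(f"port {port}, expected {EXPECTED_PORT}")
    if opts.get("-l") == "127.0.0.1":
        problems.append("listening on 127.0.0.1 only")
    return problems
```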
14:55 |
<Jeff_Green> |
searchidx1 /a reached 100%, did the "space issues" maintenance procedure from wikitech search documentation |
[production] |
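The "space issues" procedure lives in the wikitech search documentation and isn't reproduced here. The following sketch shows only the check that would flag /a before it reaches 100%. It computes Use% roughly the way df does, as used / (used + space available to non-root). Python's `shutil.disk_usage().free` is that non-root available figure. Any alerting threshold you pair it with (say 90%) is a hypothetical choice.

```python
import shutil


def df_use_percent(used: int, avail: int) -> float:
    """Use% like df: used / (used + avail), ignoring root-reserved blocks."""
    if used + avail == 0:
        return 0.0
    return 100.0 * used / (used + avail)


def mount_use_percent(path: str) -> float:
    """Use% for the filesystem containing path."""
    usage = shutil.disk_usage(path)  # usage.free = space available to unprivileged users
    return df_use_percent(usage.used, usage.free)
```

df rounds its percentage up, so its figure can read one point higher than this float.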
14:39 |
<mutante> |
same on srv193 |
[production] |
14:35 |
<mutante> |
srv290 - before restart memcached was running with -m 64 and -l 127.0.0.1 for some reason, causing Nagios CRIT, now it looks like others and recovered |
[production] |
14:32 |
<mutante> |
restarting memcached on srv290 |
[production] |
02:01 |
<LocalisationUpdate> |
completed (1.18) at Thu Jan 5 02:05:03 UTC 2012 |
[production] |
2012-01-04
23:27 |
<catrope> |
synchronizing Wikimedia installation... : Deploying MoodBar and MarkAsHelpful changes |
[production] |
22:39 |
<Tim> |
taking srv280 for action=purge slowness investigation |
[production] |
21:20 |
<Ryan_Lane> |
deploying LdapAuthentication 2.0a and OpenStackManager 1.3 to virt1 |
[production] |
21:13 |
<RoanKattouw> |
Applying schema changes to moodbar_feedback_response on all wikis (drop index, create index, add column) |
[production] |
19:36 |
<notpeter> |
restarting dhcpd on brewster |
[production] |
19:13 |
<RobH> |
dns update successful and none of them fell over |
[production] |
19:12 |
<Reedy> |
r108070 even |
[production] |
19:12 |
<reedy> |
synchronized php-1.18/extensions/CentralAuth/specials/ 'r107070' |
[production] |
19:11 |
<RobH> |
updating dns for mgmt of ms-fe1/2 and other new servers in tampa, as well as search boxen in eqiad |
[production] |
19:04 |
<mutante> |
srv199 boots but without eth0, NIC1 is Enabled in BIOS but MAC Address "Not Present" - creating hardware ticket |
[production] |
18:55 |
<catrope> |
synchronized php-1.18/extensions/ArticleFeedbackv5/modules/jquery.articleFeedbackv5/jquery.articleFeedbackv5.js 'r108064' |
[production] |
18:43 |
<catrope> |
synchronized wmf-config/CommonSettings.php 'Disable AFTv5 bucketing tracking again' |
[production] |
18:38 |
<mutante> |
powercycling srv199 |
[production] |
18:33 |
<catrope> |
synchronized php-1.18/resources/startup.js 'touch' |
[production] |
18:30 |
<catrope> |
synchronized wmf-config/CommonSettings.php 'Actually bump version number' |
[production] |
18:28 |
<catrope> |
synchronized php-1.18/resources/mediawiki/mediawiki.user.js 'Revert live hack' |
[production] |
18:24 |
<catrope> |
synchronized wmf-config/CommonSettings.php 'and bump the version number too' |
[production] |
18:22 |
<catrope> |
synchronized wmf-config/CommonSettings.php 'Enable tracking for AFTv5 bucketing' |
[production] |
18:06 |
<mutante> |
duplicate nagios-wm instances on spence (/home/wikipedia/bin/ircecho vs. /usr/ircecho/bin/ircecho) killed them both, restarted with init.d/ircecho |
[production] |
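The duplicate bots ran from two different paths, so comparing full paths would miss them. The following sketch groups processes by script basename instead. It assumes you already have a (pid, path) list, for example pulled from `ps -eo pid,args`. The function name and the PIDs in the usage example are hypothetical.

```python
import os
from collections import defaultdict
from typing import Dict, List, Tuple


def duplicate_instances(procs: List[Tuple[int, str]]) -> Dict[str, List[Tuple[int, str]]]:
    """Group processes by program basename; return only groups with more than one member."""
    groups: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for pid, path in procs:
        groups[os.path.basename(path)].append((pid, path))
    return {name: members for name, members in groups.items() if len(members) > 1}


procs = [
    (101, "/home/wikipedia/bin/ircecho"),
    (202, "/usr/ircecho/bin/ircecho"),
    (303, "/usr/sbin/nagios3"),
]
print(duplicate_instances(procs))  # only the two ircecho entries are reported
```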
18:00 |
<catrope> |
synchronized php-1.18/resources/mediawiki/mediawiki.user.js 'Live hack for tracking a percentage of bucketing events' |
[production] |
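The live hack itself was JavaScript in mediawiki.user.js, and its code isn't part of this log. The usual idea behind "tracking a percentage of events" is per-event random sampling. The following Python sketch shows that technique, not the deployed code. The rate is a hypothetical parameter, and the injectable `rng` exists only to make the behaviour testable.

```python
import random
from typing import Callable


def should_track(rate: float, rng: Callable[[], float] = random.random) -> bool:
    """Log this event with probability `rate` (0.0 to 1.0); random.random() is in [0, 1)."""
    return rng() < rate
```

At `rate=0.1`, roughly one bucketing event in ten would be logged, which keeps tracking volume bounded on high-traffic pages.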
17:52 |
<mutante> |
knsq11 is broken. boots into installer, then "Dazed and confused" at hardware detection (NMI received for unknown reason 21 on CPU 0). -> RT 2206 |
[production] |
17:38 |
<mutante> |
powercycling knsq11 |
[production] |
15:52 |
<mutante> |
added project deployment-prep for hexmode and petan |
[production] |
11:31 |
<catrope> |
synchronized php-1.18/extensions/ClickTracking/ClickTracking.hooks.php 'r108017' |
[production] |
08:44 |
<nikerabbit> |
synchronized php-1.18/includes/specials/SpecialAllmessages.php 'r107998' |
[production] |
07:40 |
<Tim> |
fixed puppet by re-running the post-merge hook with key forwarding enabled, and then started puppet on ms6 |
[production] |
07:32 |
<Tim> |
on ms6.esams: fixed proxy IP address and stopped puppet while I figure out how to fix it |
[production] |
03:25 |
<Tim> |
experimentally raised max_concurrent_checks to 128 |
[production] |
03:17 |
<Tim> |
on spence in nagios.cfg, reduced service_reaper_frequency from 10 to 1, to avoid having a massive process count spike every 10 seconds as checks are started. Locally only as a test. |
[production] |
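Both nagios.cfg tweaks on spence (service_reaper_frequency 10 -> 1, max_concurrent_checks -> 128) were hand edits made "locally only as a test". If the same change had to be scripted, a minimal sketch could rewrite `key=value` directives in place and leave comments alone. The helper `set_directives` is hypothetical. It only edits text; Nagios still needs a reload or restart to pick the change up.

```python
import re
from typing import Dict

DIRECTIVE = re.compile(r"^\s*([A-Za-z0-9_]+)\s*=")


def set_directives(cfg_text: str, updates: Dict[str, str]) -> str:
    """Update key=value lines in nagios.cfg-style text; append keys that are missing.

    Lines starting with '#' never match the directive pattern, so comments are kept.
    """
    seen = set()
    out = []
    for line in cfg_text.splitlines():
        m = DIRECTIVE.match(line)
        if m and m.group(1) in updates:
            key = m.group(1)
            line = f"{key}={updates[key]}"
            seen.add(key)
        out.append(line)
    for key, value in updates.items():
        if key not in seen:
            out.append(f"{key}={value}")
    return "\n".join(out) + "\n"
```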
02:27 |
<Ryan_Lane> |
I should clarify that I removed 10.2.1.13 from /etc/network/interfaces, it's still properly bound to lo |
[production] |
02:24 |
<Tim> |
on spence: setting up logrotate for nagios.log and removing nagios-bloated-log.log |
[production] |
02:22 |
<Ryan_Lane> |
removing manually added 10.2.1.13 address from lvs4 |
[production] |
02:01 |
<LocalisationUpdate> |
completed (1.18) at Wed Jan 4 02:04:57 UTC 2012 |
[production] |
01:43 |
<Nemo_bis> |
Last week slowness: job queue backlog now cleared on !Wikimedia Commons and (almost) English !Wikipedia http://ur1.ca/77q9b |
[production] |
01:02 |
<reedy> |
synchronized php-1.18/includes/ 'r107978' |
[production] |
00:45 |
<reedy> |
synchronized php-1.18/extensions 'r107977, r107976' |
[production] |