2011-12-19
|
19:41 |
<Jeff_Green> |
dropping several db's from db9 which have already been migrated to fundraisingdb cluster |
[production] |
19:40 |
<notpeter> |
powercycling maerlant |
[production] |
19:05 |
<nikerabbit> |
synchronized wmf-config/InitialiseSettings.php 'Translate with tables' |
[production] |
18:34 |
<nikerabbit> |
synchronized wmf-config/InitialiseSettings.php 'Translate needs tables' |
[production] |
18:28 |
<nikerabbit> |
synchronizing Wikimedia installation... : I18ndeploy [[rev:106667|r106667]] and new extensions on mediawiki.org |
[production] |
16:42 |
<RobH> |
dataset1 new data partition ready and set up to automount |
[production] |
15:49 |
<RobH> |
dataset1 reinstalled and has had puppet run. Now to see if it can keep time |
[production] |
15:46 |
<RoanKattouw> |
maerlant is fried, load avg is 500+, linearly increasing since Friday. Rejects SSH login attempts |
[production] |
15:45 |
<notpeter> |
restarting indexer on searchidx2 |
[production] |
14:16 |
<apergos> |
thumb cleaner to bed for the night... for the last time? |
[production] |
13:15 |
<mutante> |
truncated spence.cfg in ./puppet_checks.d/ - it had multiple dupe service definitions for all checks on spence |
[production] |
13:11 |
<mutante> |
commented check_job_queue stuff from non-puppetized files on spence (hosts.cfg, conf.php) to get rid of "duplicate definition" now that it's been puppetized |
[production] |
12:35 |
<mutante> |
deleted snapshot4 files from /var/lib/puppet/yaml/node and ./yaml/facts on sockpuppet and stafford, they got recreated and fixed puppet run on sn4 |
[production] |
10:08 |
<apergos> |
a few more binlogs on db9 gone. eking out another 12 hours or so |
[production] |
06:57 |
<apergos> |
thumb cleaner awake for the day. poor thing, slaving away but soon it will be able to retire |
[production] |
01:57 |
<LocalisationUpdate> |
failed (1.18) at Mon Dec 19 02:00:11 UTC 2011 |
[production] |
2011-12-17
|
22:49 |
<RobH> |
Anytime db9 hits 98 or 99% someone needs to remove binlogs to bring it back down to 94 or 95% |
[production] |
22:48 |
<RobH> |
removed older binlogs on db9 again to kick it back to a bit more free space to last the weekend. |
[production] |
17:53 |
<catrope> |
synchronized wmf-config/CommonSettings.php 'Remove SVN dir setting, this is now passed in on the command line' |
[production] |
16:43 |
<RoanKattouw> |
Found out why LocalisationUpdate was failing. Would have been fixed already if puppet had been running on fenari, but it's throwing errors. See [[rev:1617|r1617]] and my comment on [[rev:1558|r1558]] |
[production] |
14:32 |
<apergos> |
thumb cleaner to bed for the night... about 2 days left I think |
[production] |
07:25 |
<apergos> |
thumb cleaner started up for the day |
[production] |
01:57 |
<LocalisationUpdate> |
failed (1.18) at Sat Dec 17 02:00:18 UTC 2011 |
[production] |
2011-12-16
|
22:30 |
<RobH> |
reclaimed space on db9, restarted mysql, services seem to be recovering |
[production] |
22:24 |
<maplebed> |
restarting mysql on db9; brief downtime for a number of apps (bugzilla, blog, etc.) expected. |
[production] |
22:03 |
<RobH> |
db9 space reclaimed back to 94% full, related services should start recovering |
[production] |
21:57 |
<RobH> |
db9 disk full, related services are messing up, fixing |
[production] |
21:56 |
<RobH> |
kicking apache for bz related issues on kaulen |
[production] |
19:14 |
<catrope> |
synchronized php-1.18/resources/startup.js 'touch' |
[production] |
19:07 |
<catrope> |
synchronized wmf-config/InitialiseSettings.php 'Set AFTv4 lottery odds to 100% on en_labswikimedia' |
[production] |
18:48 |
<LeslieCarr> |
removed the ssl* yaml logs on stafford to fix the puppet not running error |
[production] |
16:13 |
<apergos> |
thumb cleaner to bed for the night. definitely need an alarm clock for this... good thing it's only got about 4 days of backlog left |
[production] |
15:41 |
<RobH> |
es1002 being actively worked on for hdd controller testing |
[production] |
15:39 |
<RobH> |
lvs1003 disk dead per RT 1549, will troubleshoot on site later today or Monday |
[production] |
15:32 |
<RobH> |
lvs1003 unresponsive to serial console, rebooting |
[production] |
15:18 |
<RobH> |
reinstalling dataset1 |
[production] |
14:45 |
<mutante> |
puppet was broken on all servers including "nrpe" due to package conflict with nagios-plugins-basic i added to base, revert+fix |
[production] |
13:29 |
<RoanKattouw> |
Dropping and recreating AFTv5 tables on en_labswikimedia and enwiki |
[production] |
13:26 |
<catrope> |
synchronized php-1.18/extensions/ArticleFeedbackv5/ 'Updating to trunk state' |
[production] |
13:25 |
<mutante> |
tweaked Nagios earlier today: external command_check_interval & event_broker_options (see comments in gerrit Id3b4a458) |
[production] |
13:01 |
<mark> |
Found lvs5 and lvs6 with offload-gro enabled, even though it's set disabled in /etc/network/interfaces... corrected |
[production] |
09:21 |
<apergos> |
restarted lighttpd on ds2, it had stopped (and why didn't nagios tell us?) |
[production] |
08:38 |
<mutante> |
spence - had killed additional notifications.cgi and history.cgi procs, waited 5 minutes, load went down a lot, restarting nagios |
[production] |
08:23 |
<mutante> |
spence - almost unusable, Nagios notifications.cgi and history.cgi use a lot of memory, stopping Nagios, watching swap |
[production] |
08:15 |
<mutante> |
spence slow again, side-note: tried to use "sar" to investigate but "Please check if data collecting is enabled in /etc/default/sysstat" (want to?) |
[production] |
07:54 |
<nikerabbit> |
synchronized php-1.18/extensions/WebFonts/resources/ext.webfonts.js 'JS fix [[rev:106418|r106418]]' |
[production] |