2015-09-01
13:09 <moritzm> enabled ferm on labsdb100[467] [production]
12:01 <YuviPanda> disable puppet on labsdb1006 [production]
08:58 <moritzm> enabled ferm on labsdb1001 [production]
08:58 <godog> fixup current graphite retention for metrics under "servers" hierarchy T96662 [production]
08:51 <moritzm> enabled ferm on labsdb1002 [production]
08:31 <moritzm> enabled ferm on labsdb1003 [production]
08:29 <godog> repool mw1125 mw1142 after nutcracker failures [production]
07:45 <jynus> cloning mysql data from es1010 to es1017 [ETA: 6h] [production]
07:23 <jynus@tin> Synchronized wmf-config/db-eqiad.php: Depool es1010 (duration: 00m 12s) [production]
07:13 <jynus@tin> Synchronized wmf-config/db-eqiad.php: Repool es1007, pool es1013 (duration: 00m 13s) [production]
06:36 <mutante> uploaded survey2012 to dumps/dataset1001; ownership as it is for survey2011; - T110746 in time for midnight PST [production]
06:23 <valhallasw`cloud> seems to have worked. SGE :( [tools]
06:17 <valhallasw`cloud> going to restart sge_qmaster, hoping this solves the issue :/ [tools]
06:07 <valhallasw`cloud> e.g. "queue instance "task@tools-exec-1211.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=1.820000 (= 0.070000 + 0.50 * 14.000000 with nproc=4) >= 1.75" but the actual load is only 0.3?! [tools]
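The queue-drop message above can be decoded: gridengine compares a load-adjusted per-processor load average against the queue's threshold (1.75 here), inflating the measured load with an adjustment for recently started jobs. A minimal sketch of the apparent arithmetic, assuming the per-job adjustment is scaled by a decay factor (0.25 reproduces the logged 1.82; the function name and decay handling are illustrative, not SGE's actual implementation):

```python
def np_load_avg(base_load, per_job_adjust, recent_jobs, decay=0.25):
    """Per-processor load as SGE appears to report it: the measured
    normalized load plus a decayed adjustment for recently started jobs."""
    return base_load + per_job_adjust * recent_jobs * decay

# Values from the logged message: 0.070000 + 0.50 * 14.000000, decayed.
load = np_load_avg(0.07, 0.50, 14)
print(round(load, 2))   # 1.82
print(load >= 1.75)     # True -> queue instance dropped as overloaded
```

This is consistent with the complaint in the log: the actual load was only 0.3, but the adjustment term for the 14 recently scheduled jobs dominated the reported value and pushed every queue over its threshold.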
06:06 <valhallasw`cloud> test job does not get submitted because all queues are overloaded?! [tools]
06:06 <valhallasw`cloud> investigating SGE issues reported on irc/email [tools]
05:18 <l10nupdate@tin> ResourceLoader cache refresh completed at Tue Sep 1 05:18:09 UTC 2015 (duration 18m 8s) [production]
02:28 <l10nupdate@tin> LocalisationUpdate completed (1.26wmf20) at 2015-09-01 02:28:30+00:00 [production]
02:25 <l10nupdate@tin> Synchronized php-1.26wmf20/cache/l10n: l10nupdate for 1.26wmf20 (duration: 06m 00s) [production]
01:12 <James_F> Re-restarting grrrit-wm rolled back to 2f5de55ff75c3c268decfda7442dcdd62df0a42d [tools.lolrrit-wm]
01:12 <James_F> Re-restarting grrrit-wm rolled back to 2f5de55ff75c3c268decfda7442dcdd62df0a42d [releng]
00:54 <James_F> Restarted grrrit-wm with I7eb67e3482 as well as I48ed549dc2b. [releng]
00:32 <James_F> Didn't work, rolled back grrrit-wm to 2f5de55ff75c3c268decfda7442dcdd62df0a42d. [releng]
00:32 <James_F> Didn't work, r [releng]
00:29 <James_F> Restarted grrrit-wm for I48ed549dc2b. [releng]
2015-08-31
23:56 <krenair@tin> Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/233665/ (duration: 00m 11s) [production]
23:49 <ebernhardson@tin> Synchronized wmf-config/InitialiseSettings.php: reenable config changes for cirrus experimental completion api (duration: 00m 12s) [production]
23:40 <ori@tin> Synchronized php-1.26wmf20/extensions/EducationProgram: 97ab82eab2: Updated mediawiki/core Project: mediawiki/extensions/EducationProgram 85a7d3932c1a4ad28f1a8dd05704f4e524152349 (duration: 00m 14s) [production]
23:27 <ebernhardson@tin> Synchronized php-1.26wmf20/extensions/CirrusSearch/: (no message) (duration: 00m 12s) [production]
23:25 <ebernhardson@tin> Synchronized wmf-config/InitialiseSettings.php: revert update for cirrussearch experimental suggestions api (duration: 00m 12s) [production]
23:21 <ebernhardson@tin> Synchronized wmf-config/InitialiseSettings.php: update config of cirrussearch experimental suggestions api (duration: 00m 12s) [production]
22:45 <chasemp> disabled puppet on elastic hosts temporarily to safely roll out fw change. elastic seems to have not taken it well and I'm holding for green cluster state. [production]
21:21 <valhallasw`cloud> webservice: error: argument server: invalid choice: 'generic' (choose from 'lighttpd', 'tomcat', 'uwsgi-python', 'nodejs', 'uwsgi-plain') (for tools.javatest) [tools]
21:20 <mutante> installing package upgrades on argon [production]
21:20 <valhallasw`cloud> restarted webservicemonitor [tools]
21:19 <valhallasw`cloud> seems to have some errors in restarting: subprocess.CalledProcessError: Command '['/usr/bin/sudo', '-i', '-u', 'tools.javatest', '/usr/local/bin/webservice', '--release', 'trusty', 'generic', 'restart']' returned non-zero exit status 2 [tools]
21:18 <valhallasw`cloud> running puppet agent -tv on tools-services-02 to make sure webservicemonitor is running [tools]
21:15 <valhallasw`cloud> several webservices seem to actually have not gotten back online?! what on earth is going on. [tools]
21:10 <valhallasw`cloud> some jobs still died (including tools.admin). I'm assuming service.manifest will make sure they start again [tools]
20:58 <ori> imported pybal_1.08_amd64.changes to jessie-wikimedia [production]
20:44 <chasemp> ferm for elastic100[4-7] and adjust ferm to include wikitech source [production]
20:29 <valhallasw`cloud> |sort is not so spread out in terms of affected hosts because a lot of jobs were started on lighttpd-1409 and -1410 around the same time. [tools]
20:25 <valhallasw`cloud> ca 500 jobs @ 5s/job = approx 40 minutes [tools]
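The back-of-the-envelope ETA in the entry above checks out: a serial run over ~500 jobs with a 5-second pause per job takes about 42 minutes. A trivial sketch (the helper is hypothetical, not a tool used on the cluster):

```python
def restart_eta_minutes(jobs, seconds_per_job):
    """Wall-clock time for a serial, throttled restart run."""
    return jobs * seconds_per_job / 60

print(restart_eta_minutes(500, 5))  # ~41.7, i.e. "approx 40 minutes"
```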
20:23 <valhallasw`cloud> doh. accidentally used the wrong file, causing restarts for another few uwsgi hosts. Three more jobs dead *sigh* [tools]
20:21 <valhallasw`cloud> now doing more rescheduling, with 5 sec intervals, on a sorted list to spread load between queues [tools]
20:21 <subbu> deployed parsoid version c3e4df5e [production]
19:36 <valhallasw`cloud> last restarted job is 1423661, rest of them are still in /home/valhallaw/webgrid_jobs [tools]
19:35 <valhallasw`cloud> one per second still seems to make SGE unhappy; there's a whole set of jobs dying, mostly uwsgi? [tools]
19:31 <valhallasw`cloud> https://phabricator.wikimedia.org/T110861 : rescheduling 521 webgrid jobs, at a rate of one per second, while watching the accounting log for issues [tools]
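The throttled rescheduling described in the entries above (a list of webgrid job IDs, force-rescheduled one at a time with a pause in between, from a sorted list once one-per-second proved too aggressive) could look roughly like this. This is an illustrative sketch, not the script actually used; `qmod -rj` is gridengine's force-reschedule command, and the interval and sorting are taken from the log:

```python
import subprocess
import time

def reschedule_jobs(job_ids, interval=5.0, runner=subprocess.run):
    """Force-reschedule each gridengine job with `qmod -rj`, sleeping
    between calls so the qmaster isn't flooded (1s proved too fast)."""
    for job_id in sorted(job_ids):  # sorted list spreads load across queues
        runner(["qmod", "-rj", str(job_id)], check=False)
        if interval:
            time.sleep(interval)

# Usage (IDs would come from a file such as ~/webgrid_jobs):
# reschedule_jobs([1423660, 1423661], interval=5)
```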
16:22 <godog> depool mw1125 + mw1142 from api, nutcracker client connections exceeded [production]