2020-06-04

07:31 <elukey> stop netflow hive2druid timers to do some experiments [analytics]
06:52 <mutante> mwmaint1002 started mediawiki_job_cirrus_build_completion_indices_eqiad.service [production]
06:13 <elukey> kill application_1589903254658_75731 (druid indexation for netflow still running after 12h) [analytics]
06:06 <oblivian@puppetmaster1001> conftool action : set/weight=10; selector: name=logstash200.* [production]
06:05 <oblivian@puppetmaster1001> conftool action : set/weight=10; selector: name=logstash100.* [production]
06:04 <oblivian@puppetmaster1001> conftool action : set/weight=10; selector: cluster=eventschemas,service=eventschemas [production]
06:02 <oblivian@puppetmaster1001> conftool action : set/weight=10; selector: dc=codfw,cluster=elasticsearch,service=elasticsearch.* [production]
06:01 <oblivian@puppetmaster1001> conftool action : set/weight=10; selector: dc=codfw,cluster=elasticsearch,service=elasticsearch [production]
05:59 <_joe_> fixing weights of cp2040 T245594 [production]
05:36 <elukey> restart druid middlemanager on druid1002 - strange protobuf warnings, netflow hive2druid indexation job stuck for hours [analytics]
05:31 <elukey@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [production]
05:28 <elukey@cumin1001> START - Cookbook sre.hosts.downtime [production]
05:13 <elukey> reimage druid1003 to Buster [analytics]
00:36 <reedy@deploy1001> Synchronized php-1.35.0-wmf.35/includes/specials/SpecialUserrights.php: T254417 T251534 (duration: 01m 06s) [production]
00:20 <MacFan4000> restarting for code and config changes [tools.zppixbot]
2020-06-03

23:24 <MacFan4000> restart (again) [tools.zppixbot-test]
23:21 <MacFan4000> restart (again) [tools.zppixbot-test]
23:15 <MacFan4000> restart (again) [tools.zppixbot-test]
23:09 <MacFan4000> restart for code changes [tools.zppixbot-test]
23:08 <reedy@deploy1001> Synchronized wmf-config/CommonSettings-labs.php: T249834 (duration: 01m 06s) [production]
23:06 <reedy@deploy1001> Synchronized wmf-config/InitialiseSettings-labs.php: T249834 (duration: 01m 06s) [production]
22:22 <ryankemper@cumin2001> END (PASS) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=0) [production]
22:21 <Texas> kubectl delete pods sopeltest.bot-6876f8c6b4-svmw5 [tools.zppixbot]
22:19 <Texas> git pull [tools.zppixbot]
22:17 <Texas> kubectl delete pods sopeltest.bot-6876f8c6b4-pj7pq [tools.zppixbot-test]
22:16 <Texas> git pull [tools.zppixbot-test]
21:54 <jforrester@deploy1001> rebuilt and synchronized wikiversions files: Re-rolling group1 to 1.35.0-wmf.35 for T253023 [production]
21:49 <jforrester@deploy1001> Synchronized php-1.35.0-wmf.35/extensions/EventStreamConfig/includes/ApiStreamConfigs.php: T254390 ApiStreamConfigs: If the 'constraints' parameter is unset, don't explode (duration: 01m 06s) [production]
21:43 <cstone> civicrm revision changed from 63508b01b9 to 11b0e7c7e5 [production]
21:16 <ryankemper@cumin2001> START - Cookbook sre.elasticsearch.rolling-upgrade [production]
21:15 <ryankemper> The previously run `_cluster/reroute?retry_failed=true` command worked as intended; the two shards in question have recovered and we're back to green cluster status. We're now in a known state and ready to proceed with the eqiad rolling upgrade [production]
21:13 <ryankemper> Ran `curl -X POST "https://localhost:9243/_cluster/reroute?pretty&retry_failed=true&explain=true" -H 'Content-Type: application/json' -d '{}' --insecure` via the ssh tunnel `ssh bast4002.wikimedia.org -L 9243:search.svc.eqiad.wmnet:9243 -L 9443:search.svc.eqiad.wmnet:9443 -L 9643:search.svc.eqiad.wmnet:9643`; two unassigned shards are now initializing [production]
21:05 <wm-bot> <root> Hard restart with --canonical after clearing cached gridengine state in front proxy. (T254361) [tools.most-wanted]
21:05 <ryankemper> Elasticsearch Eqiad was in yellow cluster status before starting the above cookbook run (therefore the run was a no-op until I ctrl+C'd); going to try unsticking the two unassigned shards via `/_cluster/reroute?retry_failed=true` [production]
21:03 <ryankemper@cumin2001> END (ERROR) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=97) [production]
20:58 <wm-bot> <root> Hard restart (T254361) [tools.most-wanted]
20:58 <ryankemper@cumin2001> START - Cookbook sre.elasticsearch.rolling-upgrade [production]
20:52 <ryankemper@cumin2001> END (PASS) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=0) [production]
20:49 <eileen> civicrm revision changed from eb156dffa4 to 63508b01b9, config revision is 95dcdb0a8a [production]
20:47 <ryankemper@cumin2001> START - Cookbook sre.elasticsearch.rolling-upgrade [production]
20:38 <MacFan4000> restart for config changes [tools.zppixbot]
20:37 <MacFan4000> restart for config changes [tools.zppixbot-test]
20:19 <gehel> elasticsearch cluster restart stopped [production]
20:18 <ryankemper@cumin2001> END (ERROR) - Cookbook sre.elasticsearch.rolling-upgrade (exit_code=97) [production]
19:47 <James_F> Zuul: [mediawiki/extensions/MachineVision] Add codehealth. [releng]
19:35 <ppchelko@deploy1001> helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'canary'. [production]
19:35 <ppchelko@deploy1001> helmfile [EQIAD] Ran 'sync' command on namespace 'eventgate-main' for release 'production'. [production]
19:33 <ppchelko@deploy1001> helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'production'. [production]
19:32 <ppchelko@deploy1001> helmfile [CODFW] Ran 'sync' command on namespace 'eventgate-main' for release 'canary'. [production]
19:30 <ryankemper@cumin2001> START - Cookbook sre.elasticsearch.rolling-upgrade [production]