2020-10-29
|
08:54 |
<vgutierrez> |
turn off ECDHE-ECDSA-AES128-SHA support on the main caching cluster - T258405 |
[production] |
08:54 |
<moritzm> |
fixing up stray jenkins auto restart timers on secondary releases server |
[production] |
08:53 |
<vgutierrez> |
A:cp (except cp3052, running varnish 5) upgrade libvmod-netmapper to 1.9-1 T266567 T264398 |
[production] |
08:48 |
<moritzm> |
fixing up stray mcelog auto restart timers on kubestage* |
[production] |
08:38 |
<moritzm> |
fixing up stray cas auto restart timers on secondary IDP servers |
[production] |
08:19 |
<moritzm> |
fixing up stray pmacctd auto restart timers on netflow* |
[production] |
08:19 |
<moritzm> |
fixing up stray pcacctd auto restart timers on netflow* |
[production] |
08:02 |
<marostegui> |
Disconnect replication codfw -> eqiad on s1 T266663 |
[production] |
07:56 |
<vgutierrez> |
set LimitNOFILE=500000 for gdnsd on authdns1001 |
[production] |
07:54 |
<marostegui> |
Disconnect replication codfw -> eqiad on s4 T266663 |
[production] |
07:50 |
<vgutierrez> |
restart haproxy on authdns2001 |
[production] |
07:49 |
<marostegui> |
Disconnect replication codfw -> eqiad on s8 T266663 |
[production] |
07:48 |
<godog> |
swift codfw-prod: bump object weight for ms-be2057 - T261633 |
[production] |
07:46 |
<marostegui> |
Disconnect replication codfw -> eqiad on s3 T266663 |
[production] |
07:43 |
<vgutierrez> |
restart anycast-healthchecker on authdns2001 |
[production] |
07:34 |
<vgutierrez> |
set LimitNOFILE=500000 for gdnsd on authdns2001 |
[production] |
07:27 |
<elukey> |
"sudo truncate -s 10g /var/log/daemon.log" on authdns2001 |
[production] |
06:52 |
<marostegui> |
Disconnect replication codfw -> eqiad on s2 T266663 |
[production] |
06:38 |
<marostegui> |
Disconnect replication codfw -> eqiad on s7 T266663 |
[production] |
06:36 |
<marostegui> |
Disconnect replication codfw -> eqiad on s6 T266663 |
[production] |
06:25 |
<elukey> |
execute 'truncate -s 10g /var/log/syslog.1' on authdns2001 - root partition full |
[production] |
06:23 |
<marostegui> |
Disconnect replication codfw -> eqiad on s5 T266663 |
[production] |
06:10 |
<marostegui> |
Disconnect replication codfw -> eqiad on es4 and es5 T266663 |
[production] |
06:07 |
<marostegui> |
Disconnect replication codfw -> eqiad on x1 T266663 |
[production] |
05:58 |
<marostegui> |
Disconnect replication codfw -> eqiad on pc1, pc2 and pc3 T266663 |
[production] |
04:06 |
<ryankemper@cumin1001> |
END (PASS) - Cookbook sre.elasticsearch.rolling-restart (exit_code=0) |
[production] |
01:41 |
<mutante> |
scandium reimaged a second time after making puppet changes to ensure nodejs/npm is NOT installed anymore (T257906) |
[production] |
01:17 |
<ryankemper> |
T266492 Beginning rolling restart of eqiad cirrus cluster, 3 nodes at a time, on `ryankemper@cumin1001` tmux session `elasticsearch_restart_eqiad` |
[production] |
01:16 |
<ryankemper@cumin1001> |
START - Cookbook sre.elasticsearch.rolling-restart |
[production] |
00:51 |
<ryankemper> |
Finished restart of wdqs categories across production hosts; wdqs deploy is complete and the service is healthy |
[production] |
00:14 |
<Amir1> |
rolling restart of ores |
[production] |
00:12 |
<dzahn@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) |
[production] |
00:10 |
<dzahn@cumin1001> |
START - Cookbook sre.hosts.downtime |
[production] |
00:04 |
<ryankemper> |
Beginning restart of wdqs categories across production hosts, one at a time: `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 60 && systemctl restart wdqs-categories && sleep 30 && pool'` |
[production] |
00:03 |
<ryankemper> |
Restarted wdqs categories across test hosts: `sudo -E cumin 'A:wdqs-test' 'systemctl restart wdqs-categories'` |
[production] |
00:03 |
<ryankemper> |
Restarted wdqs updater across all hosts: `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` |
[production] |
00:02 |
<ryankemper> |
Following wdqs deploy, https://query.wikidata.org successfully responds to an example query |
[production] |
00:01 |
<ryankemper@deploy1001> |
Finished deploy [wdqs/wdqs@8c97b17]: 0.3.53 (duration: 09m 29s) |
[production] |
2020-10-28
|
23:54 |
<ryankemper> |
Canary `wdqs1003` tests pass, proceeding with wdqs deploy to rest of fleet |
[production] |
23:52 |
<ryankemper@deploy1001> |
Started deploy [wdqs/wdqs@8c97b17]: 0.3.53 |
[production] |
23:52 |
<ryankemper@deploy1001> |
deploy aborted: 0.3.53 (duration: 00m 00s) |
[production] |
23:52 |
<ryankemper@deploy1001> |
Started deploy [wdqs/wdqs@8c97b17]: 0.3.53 |
[production] |
22:54 |
<mutante> |
scandium - scap pull after reinstalling OS |
[production] |
22:14 |
<dzahn@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) |
[production] |
22:12 |
<dzahn@cumin1001> |
START - Cookbook sre.hosts.downtime |
[production] |
21:41 |
<ryankemper> |
Disabled elasticsearch "saneitizer" systemd timer in eqiad due to checker jobs falling behind: `sudo systemctl disable mediawiki_job_cirrus_sanitize_jobs.timer` on `mwmaint1002` |
[production] |
21:22 |
<herron@cumin1001> |
END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) |
[production] |
21:05 |
<hnowlan@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) |
[production] |
21:05 |
<hnowlan@cumin1001> |
START - Cookbook sre.hosts.downtime |
[production] |
20:50 |
<herron@cumin1001> |
START - Cookbook sre.ganeti.makevm |
[production] |