2020-09-08
|
17:58 |
<ryankemper@cumin1001> |
START - Cookbook sre.wdqs.data-reload |
[production] |
17:57 |
<ryankemper@cumin1001> |
END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97) |
[production] |
17:54 |
<Amir1> |
Deployed patch for T262240 |
[production] |
17:53 |
<ryankemper@cumin1001> |
START - Cookbook sre.wdqs.data-reload |
[production] |
17:23 |
<andrewbogott> |
rebooting cloudvirt1033 |
[production] |
17:03 |
<klausman> |
attempted to add rock-dkms_3.3-19_all.deb to thirdparty/amd-rocm33 for use on analytics servers with GPUs |
[production] |
16:35 |
<otto@deploy1001> |
Synchronized wmf-config/InitialiseSettings.php: wgEventStreams: Set canary_events_enabled: true for eventgate test streams and eventlogging_Test - T251609 (duration: 00m 58s) |
[production] |
16:34 |
<herron> |
increased elk5 logstash JVM heaps to 2g (to help decrease kafka-logging consumer lag) |
[production] |
16:12 |
<longma> |
1.36.0-wmf.8 was branched at e81e81e91473cc8259c473165863aca8ecea2784 for T257976 |
[production] |
16:03 |
<akosiaris@deploy1001> |
helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . |
[production] |
16:03 |
<akosiaris@deploy1001> |
helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . |
[production] |
16:02 |
<akosiaris@deploy1001> |
helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . |
[production] |
15:34 |
<jayme@cumin1001> |
conftool action : set/pooled=yes; selector: name=kubernetes1004.* |
[production] |
15:32 |
<jayme@cumin1001> |
conftool action : set/pooled=yes; selector: service=kubesvc,name=kubernetes1013.* |
[production] |
15:30 |
<elukey> |
roll restart of hadoop master daemons on an-master100[1,2] after the cookbook failed |
[production] |
15:26 |
<elukey@cumin1001> |
END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) |
[production] |
15:20 |
<_joe_> |
restarted celery-ores-worker.service on ores1007 |
[production] |
15:19 |
<_joe_> |
restarted ferm on wdqs1011 |
[production] |
15:18 |
<elukey@cumin1001> |
START - Cookbook sre.hadoop.roll-restart-masters |
[production] |
15:16 |
<_joe_> |
starting wdqs-updater on wdqs1005 |
[production] |
15:15 |
<bblack@cumin1001> |
conftool action : set/pooled=yes; selector: name=cp1090.eqiad.wmnet |
[production] |
15:14 |
<bblack@cumin1001> |
conftool action : set/pooled=yes; selector: name=cp108[789].eqiad.wmnet |
[production] |
15:14 |
<bblack> |
repool cp1087-90 (eqiad row D) |
[production] |
15:13 |
<herron> |
rolling restart of elk5 logstashes |
[production] |
15:10 |
<marostegui> |
Start mysql on db1106 after PDU maintenance is done |
[production] |
15:03 |
<jayme@cumin1001> |
conftool action : set/pooled=inactive; selector: service=kubesvc,name=kubernetes1013.* |
[production] |
15:03 |
<jayme@cumin1001> |
conftool action : set/pooled=inactive; selector: name=kubernetes1004.* |
[production] |
15:03 |
<XioNoX> |
request virtual-chassis vc-port set pic-slot 1 member 4 port 0 |
[production] |
15:03 |
<XioNoX> |
request virtual-chassis vc-port set pic-slot 0 member 2 port 50 |
[production] |
15:02 |
<XioNoX> |
request virtual-chassis vc-port set pic-slot 1 member 1 port 1 |
[production] |
14:53 |
<marostegui> |
Reload dbproxy1016 to recover the alert |
[production] |
14:45 |
<jynus> |
restarting bacula-dir @ backup1001 |
[production] |
14:44 |
<XioNoX> |
reboot asw2-d3-eqiad |
[production] |
14:33 |
<moritzm> |
bouncing ferm on hosts where ferm.service failed due to DNS resolution issues for prometheus hosts |
[production] |
14:31 |
<volans> |
restarted ssh on mc1033 from console |
[production] |
14:16 |
<XioNoX> |
request virtual-chassis vc-port delete pic-slot 1 member 4 port 0 |
[production] |
14:16 |
<XioNoX> |
request virtual-chassis vc-port delete pic-slot 0 member 2 port 50 |
[production] |
14:14 |
<XioNoX> |
request virtual-chassis vc-port delete pic-slot 1 member 1 port 1 |
[production] |
14:13 |
<akosiaris> |
drain kubernetes1013, kubernetes1004. They are on row D |
[production] |
14:13 |
<bblack> |
dns1002 - disable puppet + bird service (stop advertising recdns from row D) |
[production] |
14:03 |
<kormat@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) |
[production] |
14:03 |
<kormat@cumin1001> |
START - Cookbook sre.hosts.downtime |
[production] |
13:59 |
<bblack@cumin1001> |
conftool action : set/pooled=no; selector: name=cp1090.eqiad.wmnet |
[production] |
13:59 |
<bblack> |
depooling cp1087-1090 |
[production] |
13:59 |
<bblack@cumin1001> |
conftool action : set/pooled=no; selector: name=cp108[789].eqiad.wmnet |
[production] |
13:57 |
<XioNoX> |
asw2-d-eqiad> request system reboot member 3 |
[production] |
13:35 |
<cmjohnson1> |
the power cable was not properly seated, so asw2-d3-eqiad lost power |
[production] |
13:34 |
<elukey@cumin1001> |
END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) |
[production] |
13:30 |
<akosiaris@deploy1001> |
helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . |
[production] |
13:28 |
<akosiaris@deploy1001> |
helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' . |
[production] |