2018-01-09
23:21 <yuvipanda> paws new cluster master is up, re-adding nodes by executing the same sequence of commands used for the upgrade [tools]
23:08 <yuvipanda> turns out the version of k8s we had wasn't recent enough to support easy upgrades, so destroying the entire cluster again and installing 1.9.1 [tools]
23:01 <yuvipanda> kill paws master and reboot it [tools]
22:57 <bd808> Deployed be6109b (add s8 slice) [tools.replag]
22:54 <yuvipanda> kill all kube-system pods in paws cluster [tools]
22:54 <yuvipanda> kill all PAWS pods [tools]
22:53 <yuvipanda> redo tools-paws-worker-1006 manually, since clush seems to have missed it for some reason [tools]
22:52 <godog> ms-be1033 truncate unrotated and big server.log [production]
22:49 <yuvipanda> run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/init-worker.bash' to bring paws workers back up again, but as 1.8 [tools]
22:48 <yuvipanda> run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/install-kubeadm.bash' to set up kubeadm on all paws worker nodes [tools]
22:46 <yuvipanda> reboot all paws-worker nodes [tools]
22:46 <yuvipanda> run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster [tools]
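The three yuvipanda entries from 22:46 to 22:49 (with a reboot of all workers in between) form a teardown-and-rebuild sequence for the paws worker pool. A dry-run sketch of that sequence, oldest step first, follows; the host range and script paths are copied from the log, while the `phase` helper is illustrative — it echoes each clush invocation instead of executing it (drop the `echo` to run for real):

```shell
# Dry-run sketch of the 22:46-22:49 paws worker rebuild sequence.
# Node range and script paths come from the log entries above; clush
# and the kubeadm-bootstrap scripts are assumed to exist on the host.
NODES='tools-paws-worker-10[01-20]'
BOOTSTRAP=/home/yuvipanda/kubeadm-bootstrap

# Print the clush command for one phase instead of running it.
phase() {
  echo clush -w "$NODES" "sudo bash $BOOTSTRAP/$1"
}

phase remove-worker.bash     # 22:46: completely destroy the old cluster
phase install-kubeadm.bash   # 22:48: install kubeadm on every worker
phase init-worker.bash       # 22:49: re-join workers to the master
```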
22:46 <madhuvishy> run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster [tools]
22:22 <aaron@tin> Synchronized php-1.31.0-wmf.16/includes/Setup.php: 68b4bbfbc12c626 (duration: 01m 15s) [production]
22:20 <mutante> netmon2001 - arming keyholder for rancid [production]
21:17 <chasemp> ...rush@tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test" [tools]
21:17 <chasemp> tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable --test" [tools]
21:10 <mepps> updated SmashPig from 45aa62650c to 778e8f87b4 [production]
21:10 <chasemp> tools-k8s-master-01:~# for n in `kubectl get nodes | awk '{print $1}' | grep -v -e tools-worker-1001 -e tools-worker-1016 -e tools-worker-1028 -e tools-worker-1029 `; do kubectl uncordon $n; done [tools]
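The uncordon loop above (21:10) and the matching cordon loop logged at 20:55 both filter `kubectl get nodes` output through a chain of `grep -v -e` exclusions. A sketch of that pattern with the exclusion step factored into a reusable helper; kubectl access is assumed, so the live invocation is left commented out, and the node names are illustrative:

```shell
# Filter a list of node names (one per line on stdin), dropping any
# names passed as arguments. Mirrors the `grep -v -e ... -e ...` chain
# used in the logged cordon/uncordon loops.
filter_nodes() {
  out=$(cat)
  for ex in "$@"; do
    out=$(printf '%s\n' "$out" | grep -v -e "$ex")
  done
  printf '%s\n' "$out"
}

# Live usage (commented out; needs cluster access):
# kubectl get nodes --no-headers | awk '{print $1}' \
#   | filter_nodes tools-worker-1001 tools-worker-1016 \
#   | xargs -n1 kubectl uncordon
```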
20:57 <twentyafterfour@tin> Finished scap: Deploy 1.31.0-wmf.16 to test wikis and rebuild l10n. refs T180749 (attempt 2) (duration: 36m 34s) [production]
20:55 <chasemp> for n in `kubectl get nodes | awk '{print $1}' | grep -v -e tools-worker-1001 -e tools-worker-1016`; do kubectl cordon $n; done [tools]
20:51 <chasemp> kubectl cordon tools-worker-1001.tools.eqiad.wmflabs [tools]
20:34 <mutante> wikibase-vue can't start Apache because docker-proxy is already using port 80 [wikidata-dev]
20:32 <mutante> fixed puppet runs on wikibase-stretch, wikibase-vue, wikibase with https://gerrit.wikimedia.org/r/#/c/403232/ [wikidata-dev]
20:21 <twentyafterfour@tin> Started scap: Deploy 1.31.0-wmf.16 to test wikis and rebuild l10n. refs T180749 (attempt 2) [production]
20:15 <chasemp> disable puppet on proxies and k8s workers [tools]
20:14 <twentyafterfour@tin> scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="test2wiki" --outdir="/tmp/scap_l10n_3984299293" --threads=10 --lang en --quiet' returned non-zero exit status 1 (duration: 02m 44s) [production]
20:13 <mutante> netmon2001 - rebooting [production]
20:12 <twentyafterfour@tin> Started scap: Deploy 1.31.0-wmf.16 to test wikis and rebuild l10n. refs T180749 [production]
20:04 <mutante> gerrit2001 - rebooting [production]
20:00 <mutante> phab2001 - reboot for upgrade [production]
19:50 <chasemp> clush -w @all 'sudo puppet agent --test' [tools]
19:42 <chasemp> reboot tools-worker-1010 [tools]
19:20 <mepps> rolled back SmashPig from 0c45b1a684 to 45aa62650c [production]
19:07 <mepps> updated SmashPig from 45aa62650c to 0c45b1a684 [production]
18:42 <mutante> ms-fe3002, ms-fe3001 - powering down, removing from puppet and icinga; ms-be* removing from puppet/icinga (T169518) [production]
18:38 <mutante> ms-fe3001 - shutting down for decom, removed from puppet [production]
18:38 <mutante> mw1227 still not showing recovery, using restart-hhvm [production]
18:29 <mutante> mw1227 killed it one more time and also restarted apache; load now going down [production]
18:26 <mutante> mw1227 hhvm-dump-debug > /root/hhvm-dump-debug-20170109-1024PST.log ; then killed hhvm and restarted it with systemctl [production]
17:56 <twentyafterfour> MediaWiki Train: Branching 1.31.0-wmf.16 [production]
17:41 <moritzm> rebooting image scalers in codfw for kernel security update (along with HHVM update) [production]
17:30 <volans> re-enabled Icinga event handlers on RAID checks for lvs3001 [production]
17:17 <ema> failing traffic back over to lvs3001, RAID rebuilt [production]
17:15 <godog> depooling 2 restbase cassandra nodes - T184100 [production]
16:53 <joal> Rerun pageview-druid-hourly-wf-2018-1-9-13 [analytics]
16:35 <cmjohnson1> disabling puppet for decom on mw1180-1200 [production]
16:28 <volans> disabled Icinga event handlers on RAID checks for lvs3001, WIP on the host [production]
16:18 <gehel> starting cluster reboot for elasticsearch / cirrus codfw [production]
16:09 <bd808> data-services: added s8.{analytics,web}.db.svc.eqiad.wmflabs and aliases (T181643, T184179) [production]