2021-05-25
21:58 |
<razzi@cumin1001> |
END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) |
[production] |
21:58 |
<razzi@cumin1001> |
START - Cookbook sre.hadoop.roll-restart-masters |
[production] |
21:13 |
<razzi@cumin1001> |
END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) |
[production] |
21:13 |
<razzi@cumin1001> |
START - Cookbook sre.hadoop.roll-restart-masters |
[production] |
21:13 |
<razzi@cumin1001> |
END (ERROR) - Cookbook sre.hadoop.roll-restart-workers (exit_code=97) |
[production] |
21:13 |
<razzi@cumin1001> |
START - Cookbook sre.hadoop.roll-restart-workers |
[production] |
20:40 |
<razzi@cumin1001> |
END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) |
[production] |
20:28 |
<razzi@cumin1001> |
START - Cookbook sre.hadoop.roll-restart-workers |
[production] |
20:00 |
<twentyafterfour@deploy1002> |
rebuilt and synchronized wikiversions files: group0 wikis to 1.37.0-wmf.7 |
[production] |
19:20 |
<cmjohnson@cumin1001> |
END (PASS) - Cookbook sre.dns.netbox (exit_code=0) |
[production] |
19:17 |
<cmjohnson@cumin1001> |
START - Cookbook sre.dns.netbox |
[production] |
19:17 |
<cmjohnson@cumin1001> |
END (PASS) - Cookbook sre.dns.netbox (exit_code=0) |
[production] |
19:12 |
<twentyafterfour@deploy1002> |
Finished scap: testwikis wikis to 1.37.0-wmf.7 (duration: 33m 29s) |
[production] |
19:12 |
<cmjohnson@cumin1001> |
START - Cookbook sre.dns.netbox |
[production] |
18:38 |
<twentyafterfour@deploy1002> |
Started scap: testwikis wikis to 1.37.0-wmf.7 |
[production] |
18:16 |
<razzi> |
sudo systemctl start all failed units from `systemctl list-units --state=failed` on an-launcher1002 |
[analytics] |
18:14 |
<razzi> |
sudo systemctl start eventlogging_to_druid_navigationtiming_hourly.service |
[analytics] |
18:08 |
<krinkle@deploy1002> |
Synchronized wmf-config/CommonSettings.php: I2ebe9674fb109f (duration: 00m 56s) |
[production] |
18:01 |
<razzi> |
manually edit /etc/hadoop/conf/capacity-scheduler.xml to set the queues to running, then sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues |
[analytics] |
17:52 |
<razzi> |
sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues on an-master1001 and an-master1002 |
[analytics] |
17:34 |
<Krinkle> |
mwmaint1002: Running purge-parsercache-now.php on server 2/4 (pc1007, depooled spare). Ref P16060, T280605, T282761. |
[production] |
17:30 |
<marostegui@cumin1001> |
dbctl commit (dc=all): 'db1164 (re)pooling @ 100%: Repool db1164', diff saved to https://phabricator.wikimedia.org/P16207 and previous config saved to /var/cache/conftool/dbconfig/20210525-173031-root.json |
[production] |
17:28 |
<razzi> |
sudo systemctl restart refine_eventlogging_legacy |
[analytics] |
17:28 |
<razzi> |
sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues to enable submitting jobs once again |
[analytics] |
17:22 |
<effie> |
disable puppet on mc2019 (for tests) |
[production] |
17:15 |
<marostegui@cumin1001> |
dbctl commit (dc=all): 'db1164 (re)pooling @ 75%: Repool db1164', diff saved to https://phabricator.wikimedia.org/P16206 and previous config saved to /var/cache/conftool/dbconfig/20210525-171527-root.json |
[production] |
17:14 |
<andrewbogott> |
deleting old ingress controllers toolsbeta-test-k8s-ingress-1 and toolsbeta-test-k8s-ingress-2 |
[toolsbeta] |
17:13 |
<andrewbogott> |
created two new ingress nodes, toolsbeta-test-k8s-ingress-4 and toolsbeta-test-k8s-ingress-5 |
[toolsbeta] |
17:07 |
<razzi> |
re-enabled puppet on an-masters and an-launcher |
[analytics] |
17:04 |
<razzi> |
sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave |
[analytics] |
17:03 |
<razzi> |
sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet |
[analytics] |
17:00 |
<marostegui@cumin1001> |
dbctl commit (dc=all): 'db1164 (re)pooling @ 50%: Repool db1164', diff saved to https://phabricator.wikimedia.org/P16205 and previous config saved to /var/cache/conftool/dbconfig/20210525-170024-root.json |
[production] |
16:45 |
<marostegui@cumin1001> |
dbctl commit (dc=all): 'db1164 (re)pooling @ 25%: Repool db1164', diff saved to https://phabricator.wikimedia.org/P16203 and previous config saved to /var/cache/conftool/dbconfig/20210525-164520-root.json |
[production] |
16:43 |
<razzi> |
sudo systemctl restart hadoop-hdfs-namenode on an-master1001 |
[analytics] |
16:38 |
<razzi> |
sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace |
[analytics] |
16:35 |
<razzi> |
sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter |
[analytics] |
16:28 |
<razzi> |
sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet |
[analytics] |
16:23 |
<razzi> |
sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode leave |
[analytics] |
16:14 |
<bd808> |
Closed #wikimedia-cloud-admin on f***node |
[admin] |
16:11 |
<bd808> |
Closed #wikimedia-cloud-feed on f***node |
[admin] |
16:06 |
<razzi> |
sudo systemctl restart hadoop-hdfs-namenode |
[analytics] |
15:52 |
<razzi> |
checkpoint hdfs with sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace |
[analytics] |
15:51 |
<razzi> |
enable safe mode on an-master1001 with sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter |
[analytics] |
15:36 |
<razzi> |
disable puppet on an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet again |
[analytics] |
15:35 |
<razzi> |
re-enable puppet on an-masters, run puppet, and sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues |
[analytics] |
15:32 |
<razzi> |
disable puppet on an-master1001.eqiad.wmnet and an-master1002.eqiad.wmnet |
[analytics] |
15:19 |
<dcaro> |
rebooted cloudvirt1020, starting VMs (T275893) |
[admin] |
15:13 |
<dcaro> |
rebooting cloudvirt1020 (T275893) |
[admin] |
15:09 |
<dcaro> |
turning off VM toolsbeta-test-k8s-etcd-14 to be able to reboot cloudvirt1020 |
[toolsbeta] |
14:42 |
<dcaro> |
taking cloudvirt1020 out for maintenance (OpenStack-wise) so no new VMs are scheduled on it (T275893) |
[admin] |