2019-02-25
|
23:20 |
<bstorm_> |
Depooled tools-sgeexec-0914 and tools-sgeexec-0915 for T217066 |
[tools] |
21:41 |
<andrewbogott> |
depooling tools-sgeexec-0911, tools-sgeexec-0912, tools-sgeexec-0913 to test T217066 |
[tools] |
13:11 |
<chicocvenancio> |
PAWS: Stopped AABot notebook pod T217010 |
[tools] |
12:54 |
<chicocvenancio> |
PAWS: Restarted Criscod notebook pod T217010 |
[tools] |
12:21 |
<chicocvenancio> |
PAWS: killed proxy and hub pods to try to get it to see routes to open notebook servers, to no avail. Restarted BernhardHumm's notebook pod T217010 |
[tools] |
09:50 |
<gtirloni> |
rebooted tools-sgeexec-09{16,22,40} (T216988) |
[tools] |
09:41 |
<gtirloni> |
rebooted tools-sgeexec-09{16,22,40} |
[tools] |
08:37 |
<zhuyifei1999_> |
uncordon tools-worker-1015.tools.eqiad.wmflabs |
[tools] |
08:34 |
<legoktm> |
hard rebooted tools-worker-1015 via horizon |
[tools] |
07:48 |
<zhuyifei1999_> |
systemd stuck in D state. :( |
[tools] |
07:44 |
<zhuyifei1999_> |
I saved dmesg and process list to a few files in /root if that helps debugging |
[tools] |
07:43 |
<zhuyifei1999_> |
D states are not responding to SIGKILL. Will reboot. |
[tools] |
07:37 |
<zhuyifei1999_> |
tools-worker-1015.tools.eqiad.wmflabs having severe NFS issues (all NFS accessing processes are stuck in D state). Draining. |
[tools] |
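Note on the 07:37 to 07:48 entries: processes in uninterruptible sleep (state `D`) can be spotted by reading the state field of `/proc/<pid>/stat`. The following is a minimal sketch of such a check, not the commands actually run on tools-worker-1015; the function names `parse_stat_state` and `uninterruptible_processes` are hypothetical, and a Linux-style `/proc` layout is assumed.

```python
import os


def parse_stat_state(stat_text):
    """Return (comm, state) parsed from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' instead of on whitespace.
    left, right = stat_text.rsplit(")", 1)
    comm = left.split("(", 1)[1]
    state = right.split()[0]
    return comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                comm, state = parse_stat_state(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or stat was unreadable
        if state == "D":
            found.append((int(entry), comm))
    return sorted(found)
```

Calling `uninterruptible_processes()` on a node returns a sorted list of `(pid, comm)` pairs. As the 07:43 entry observes, such processes do not respond to SIGKILL, so a long list here usually points toward a drain and reboot rather than killing processes.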
2019-02-20
|
23:30 |
<zhuyifei1999_> |
begin rebuilding all docker images T178601 T193646 T215683 |
[tools] |
23:25 |
<zhuyifei1999_> |
upgraded toollabs-webservice on tools-bastion-02 to 0.44 (newly-built version) |
[tools] |
23:19 |
<zhuyifei1999_> |
this was built for stretch. hopefully it works for all distros |
[tools] |
23:17 |
<zhuyifei1999_> |
begin building new tools-webservice package T178601 T193646 T215683 |
[tools] |
21:57 |
<andrewbogott> |
moving tools-static-13 to a new virt host |
[tools] |
21:34 |
<andrewbogott> |
moving the tools-static IP from tools-static-13 to tools-static-12 |
[tools] |
19:17 |
<andrewbogott> |
moving tools-bastion-02 to labvirt1004 |
[tools] |
16:56 |
<andrewbogott> |
moving tools-paws-worker-1003 |
[tools] |
15:53 |
<andrewbogott> |
moving tools-worker-1017, tools-worker-1027, tools-worker-1028 |
[tools] |
15:03 |
<andrewbogott> |
moving tools-exec-1413 and tools-exec-1442 |
[tools] |
2019-02-16
|
05:00 |
<zhuyifei1999_> |
fixed by restarting flannel. another puppet run simply started kubelet |
[tools] |
04:58 |
<zhuyifei1999_> |
puppet logs: https://phabricator.wikimedia.org/P8097. Docker is failing with 'Failed to load environment files: No such file or directory' |
[tools] |
04:52 |
<zhuyifei1999_> |
copied the resolv.conf from tools-k8s-master-01, removing secondary DNS to make sure puppet fixes that, and starting puppet |
[tools] |
04:48 |
<zhuyifei1999_> |
that host's resolv.conf is badly broken https://phabricator.wikimedia.org/P8096. The last Puppet run was at Thu Feb 14 15:21:09 UTC 2019 (2247 minutes ago) |
[tools] |
04:44 |
<zhuyifei1999_> |
puppet is also failing badly here 'Error: Could not request certificate: getaddrinfo: Name or service not known' |
[tools] |
04:43 |
<zhuyifei1999_> |
this one has logs full of 'Can't contact LDAP server' |
[tools] |
04:41 |
<zhuyifei1999_> |
nslcd also broken on tools-worker-1005 |
[tools] |
04:34 |
<zhuyifei1999_> |
uncordon tools-worker-1014.tools.eqiad.wmflabs |
[tools] |
04:33 |
<zhuyifei1999_> |
the issue was, /var/run/nslcd/socket was somehow a directory, AFAICT |
[tools] |
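Note on the 04:33 entry: a path that should be a Unix socket but is actually a directory can be detected with `lstat`. This is a minimal sketch of such a check, not the diagnosis actually performed on tools-worker-1014; the helper name `describe_path` is hypothetical.

```python
import os
import stat


def describe_path(path):
    """Classify a path as 'socket', 'directory', 'missing', or 'other'."""
    try:
        # lstat does not follow symlinks, so a symlink reports as 'other'.
        mode = os.lstat(path).st_mode
    except FileNotFoundError:
        return "missing"
    if stat.S_ISSOCK(mode):
        return "socket"
    if stat.S_ISDIR(mode):
        return "directory"
    return "other"
```

With a healthy nslcd, `describe_path("/var/run/nslcd/socket")` would be expected to return `"socket"`; the broken state described above would show up as `"directory"`.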
04:31 |
<zhuyifei1999_> |
then started nslcd via systemctl and `id zhuyifei1999` returns correct results |
[tools] |