2019-02-16
04:58 <zhuyifei1999_> puppet logs: https://phabricator.wikimedia.org/P8097. Docker is failing with 'Failed to load environment files: No such file or directory' [tools]
04:52 <zhuyifei1999_> copied the resolv.conf from tools-k8s-master-01, removing secondary DNS to make sure puppet fixes that, and starting puppet [tools]
04:48 <zhuyifei1999_> that host's resolv.conf is badly broken https://phabricator.wikimedia.org/P8096. The last Puppet run was at Thu Feb 14 15:21:09 UTC 2019 (2247 minutes ago) [tools]
04:44 <zhuyifei1999_> puppet is also failing badly here: 'Error: Could not request certificate: getaddrinfo: Name or service not known' [tools]
04:43 <zhuyifei1999_> this one has logs full of 'Can't contact LDAP server' [tools]
04:41 <zhuyifei1999_> nslcd also broken on tools-worker-1005 [tools]
04:34 <zhuyifei1999_> uncordon tools-worker-1014.tools.eqiad.wmflabs [tools]
04:33 <zhuyifei1999_> the issue was, /var/run/nslcd/socket was somehow a directory, AFAICT [tools]
04:31 <zhuyifei1999_> then started nslcd via systemctl and `id zhuyifei1999` returns the correct output [tools]
04:30 <zhuyifei1999_> `nslcd -nd` complains about 'nslcd: bind() to /var/run/nslcd/socket failed: Address already in use'. SIGTERMed a background nslcd, `rmdir /var/run/nslcd/socket`, and `nslcd -nd` seemingly starts to work [tools]
04:23 <zhuyifei1999_> drained tools-worker-1014.tools.eqiad.wmflabs [tools]
04:16 <zhuyifei1999_> logs: https://phabricator.wikimedia.org/P8095 [tools]
04:14 <zhuyifei1999_> restarting nslcd on tools-worker-1014 in an attempt to fix that, service failed to start, looking into logs [tools]
04:12 <zhuyifei1999_> restarting nscd on tools-worker-1014 in an attempt to fix seemingly-not-attached-to-LDAP [tools]
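A minimal sketch of how the condition noted at 04:33 (the nslcd socket path being a directory) could be checked before restarting the service. `socket_state` is a hypothetical helper name, not part of nslcd; the default path is the one quoted in the 04:30 entry.

```shell
#!/bin/sh
# Classify what sits at the path where nslcd expects its UNIX socket.
# socket_state is a hypothetical helper for illustration only.
socket_state() {
    path="$1"
    if [ -S "$path" ]; then
        echo "socket"       # expected state while nslcd is running
    elif [ -d "$path" ]; then
        echo "directory"    # the broken state described in the log
    elif [ -e "$path" ]; then
        echo "other"        # some other file type
    else
        echo "missing"      # nothing there; nslcd is probably not running
    fi
}

# Default to the path from the log; pass another path as the first argument.
socket_state "${1:-/var/run/nslcd/socket}"
```

If this reports `directory`, the log's fix was to stop any running nslcd, `rmdir` the path, and start nslcd again.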
2019-02-14
21:57 <bd808> Deleted old tools-proxy-02 instance [tools]
21:57 <bd808> Deleted old tools-proxy-01 instance [tools]
21:56 <bd808> Deleted old tools-package-builder-01 instance [tools]
20:57 <andrewbogott> rebooting tools-worker-1005 [tools]
20:34 <andrewbogott> moving tools-exec-1409, tools-exec-1410, tools-exec-1414, tools-exec-1419 [tools]
19:55 <andrewbogott> moving tools-webgrid-generic-1401 and tools-webgrid-lighttpd-1419 [tools]
19:33 <andrewbogott> moving tools-checker-01 to labvirt1003 [tools]
19:25 <andrewbogott> moving tools-elastic-02 to labvirt1003 [tools]
19:11 <andrewbogott> moving tools-k8s-etcd-01 to labvirt1002 [tools]
18:37 <andrewbogott> moving tools-exec-1418, tools-exec-1424 to labvirt1003 [tools]
18:34 <andrewbogott> moving tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1406, tools-webgrid-lighttpd-1410 to labvirt1002 [tools]
17:35 <arturo> T215154 tools-sgebastion-07 now running systemd 239 and starts enforcing user limits [tools]
15:33 <andrewbogott> moving tools-worker-1002, 1003, 1005, 1006, 1007, 1010, 1013, 1014 to different labvirts in order to move labvirt1012 to eqiad1-r [tools]
2019-02-13
19:16 <andrewbogott> deleting tools-sgewebgrid-generic-0901, tools-sgewebgrid-lighttpd-0901, tools-sgebastion-06 [tools]
15:16 <zhuyifei1999_> `sudo /usr/local/bin/grid-configurator --all-domains --observer-pass $(grep OS_PASSWORD /etc/novaobserver.yaml|awk '{gsub(/"/,"",$2);print $2}')` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 T216042 [tools]
15:06 <zhuyifei1999_> `sudo systemctl restart gridengine-master` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 T216042 [tools]
13:03 <arturo> T216030 switch login-stretch.tools.wmflabs.org floating IP to tools-sgebastion-07 [tools]
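A short sketch of what the `grep | awk` filter in the 15:16 entry does: it takes the second whitespace-separated field of the `OS_PASSWORD` line and strips the double quotes from it. The sample line below is made up for illustration; the real contents of /etc/novaobserver.yaml are not shown in this log.

```shell
#!/bin/sh
# Hypothetical sample line; assumes the YAML stores the value as a quoted string.
sample='OS_PASSWORD: "example-secret"'

# Same filter as in the log entry: select the line, strip the quotes
# from the second field, and print that field.
extracted=$(printf '%s\n' "$sample" | grep OS_PASSWORD | awk '{gsub(/"/,"",$2);print $2}')

echo "$extracted"
```

Note that this only works if the value contains no whitespace, since awk splits fields on whitespace.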
2019-02-12
01:24 <bd808> Stopped maintain-kubeusers, edited /etc/kubernetes/tokenauth, restarted maintain-kubeusers (T215704) [tools]
2019-02-11
22:57 <bd808> Shut off tools-webgrid-lighttpd-14{01,13,24,26,27,28} via Horizon UI [tools]
22:34 <bd808> Decommissioned tools-webgrid-lighttpd-14{01,13,24,26,27,28} [tools]
22:23 <bd808> sudo exec-manage depool tools-webgrid-lighttpd-1401.tools.eqiad.wmflabs [tools]
22:21 <bd808> sudo exec-manage depool tools-webgrid-lighttpd-1413.tools.eqiad.wmflabs [tools]
22:18 <bd808> sudo exec-manage depool tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs [tools]
22:07 <bd808> sudo exec-manage depool tools-webgrid-lighttpd-1427.tools.eqiad.wmflabs [tools]
22:06 <bd808> sudo exec-manage depool tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs [tools]
22:05 <bd808> sudo exec-manage depool tools-webgrid-lighttpd-1426.tools.eqiad.wmflabs [tools]
20:06 <bstorm_> Ran apt-get clean on tools-sgebastion-07 since it was running out of disk (and lots of it was the apt cache) [tools]
19:08 <bd808> Upgraded tools-manifest on tools-cron-01 to v0.19 (T107878) [tools]
18:57 <bd808> Upgraded tools-manifest on tools-sgecron-01 to v0.19 (T107878) [tools]
18:57 <bd808> Built tools-manifest_0.19_all.deb and published to aptly repos (T107878) [tools]
18:26 <bd808> Upgraded tools-manifest on tools-sgecron-01 to v0.18 (T107878) [tools]
18:25 <bd808> Built tools-manifest_0.18_all.deb and published to aptly repos (T107878) [tools]
18:12 <bd808> Upgraded tools-manifest on tools-sgecron-01 to v0.17 (T107878) [tools]
18:08 <bd808> Built tools-manifest_0.17_all.deb and published to aptly repos (T107878) [tools]
10:41 <godog> flip tools-prometheus proxy back to tools-prometheus-01 and upgrade to prometheus 2.7.1 [tools]
2019-02-08
19:17 <hauskatze> Stopped webservice of `tools.sulinfo`, which redirects to `tools.quentinv57-tools`, which is also unavailable [tools]