2021-03-26 §
12:21 <arturo> shutdown tools-package-builder-02 (stretch), we keep -03 which is buster (T275864) [tools]
2021-03-25 §
19:30 <bstorm> forced deletion of all jobs stuck in a deleting state T277653 [tools]
17:46 <arturo> rebooting tools-sgeexec-* nodes to account for new grid master (T277653) [tools]
16:20 <arturo> rebuilding tools-sgegrid-master VM as debian buster (T277653) [tools]
16:18 <arturo> icinga-downtime toolschecker for 2h [tools]
16:05 <bstorm> failed over the tools grid to the shadow master T277653 [tools]
13:36 <arturo> shutdown tools-sge-services-03 (T278354) [tools]
13:33 <arturo> shutdown tools-sge-services-04 (T278354) [tools]
13:31 <arturo> point aptly clients to `tools-services-05.tools.eqiad1.wikimedia.cloud` (hiera change) (T278354) [tools]
12:58 <arturo> created VM `tools-services-05` as Debian Buster (T278354) [tools]
12:51 <arturo> create cinder volume `tools-aptly-data` (T278354) [tools]
2021-03-24 §
12:46 <arturo> shut off the old stretch VMs `tools-docker-registry-03` and `tools-docker-registry-04` (T278303) [tools]
12:38 <arturo> associate floating IP with `tools-docker-registry-05` and refresh FQDN docker-registry.tools.wmflabs.org accordingly (T278303) [tools]
12:33 <arturo> attach cinder volume `tools-docker-registry-data` to VM `tools-docker-registry-05` (T278303) [tools]
12:32 <arturo> snapshot cinder volume `tools-docker-registry-data` into `tools-docker-registry-data-stretch-migration` (T278303) [tools]
12:32 <arturo> bump cinder storage quota from 80G to 400G (without quota request task) [tools]
12:11 <arturo> created VM `tools-docker-registry-06` as Debian Buster (T278303) [tools]
12:09 <arturo> detach cinder volume `tools-docker-registry-data` (T278303) [tools]
11:46 <arturo> attach cinder volume `tools-docker-registry-data` to VM `tools-docker-registry-03` to format it and pre-populate it with registry data (T278303) [tools]
11:20 <arturo> created 80G cinder volume tools-docker-registry-data (T278303) [tools]
11:10 <arturo> starting VM tools-docker-registry-04, which has probably been stopped since 2021-03-09 due to hypervisor draining [tools]
2021-03-23 §
12:46 <arturo> aborrero@tools-sgegrid-master:~$ sudo systemctl restart gridengine-master.service [tools]
12:15 <arturo> delete & re-create VM tools-sgegrid-shadow as Debian Buster (T277653) [tools]
12:14 <arturo> created puppet prefix 'tools-sgegrid-shadow' and migrated puppet configuration from VM-puppet [tools]
12:13 <arturo> created server group 'tools-grid-master-shadow' with anti-affinity policy [tools]
2021-03-18 §
19:24 <bstorm> set profile::toolforge::infrastructure across the entire project with login_server set on the bastion and exec node-related prefixes [tools]
16:21 <andrewbogott> enabling puppet tools-wide [tools]
16:20 <andrewbogott> disabling puppet tools-wide to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/672456 [tools]
16:19 <bstorm> added profile::toolforge::infrastructure class to puppetmaster T277756 [tools]
04:12 <bstorm> rebooted tools-sgeexec-0935.tools.eqiad.wmflabs because it forgot how to LDAP...likely root cause of the issues tonight [tools]
03:59 <bstorm> rebooting grid master. sorry for the cron spam [tools]
03:49 <bstorm> restarting sssd on tools-sgegrid-master [tools]
03:37 <bstorm> deleted a massive number of stuck jobs that misfired from the cron server [tools]
03:35 <bstorm> rebooting tools-sgecron-01 to try to clear up the ldap-related errors coming out of it [tools]
01:46 <bstorm> killed the toolschecker cron job, which had an LDAP error, and ran it again by hand [tools]
2021-03-17 §
20:57 <bstorm> deployed changes to rbac for kubernetes to add kubectl top access for tools [tools]
20:26 <andrewbogott> moving tools-elastic-3 to cloudvirt1034; two elastic nodes shouldn't be on the same hv [tools]
2021-03-16 §
16:31 <arturo> installing jobutils and misctools 1.41 [tools]
15:55 <bstorm> deleted a bunch of messed up grid jobs (9989481,8813,81682,86317,122602,122623,583621,606945,606999) [tools]
12:32 <arturo> add packages jobutils / misctools v1.41 to {stretch,buster}-tools aptly repository in tools-sge-services-03 [tools]
2021-03-12 §
23:13 <bstorm> cleared error state for all grid queues [tools]
2021-03-11 §
17:40 <bstorm> deployed metrics-server:0.4.1 to kubernetes [tools]
16:21 <bstorm> add jobutils 1.40 and misctools 1.40 to stretch-tools [tools]
13:11 <arturo> add misctools 1.37 to buster-tools|toolsbeta aptly repo for T275865 [tools]
13:10 <arturo> add jobutils 1.40 to buster-tools aptly repo for T275865 [tools]
2021-03-10 §
10:56 <arturo> briefly stopped VM tools-k8s-etcd-7 to disable VMX cpu flag [tools]
2021-03-09 §
13:31 <arturo> hard-reboot tools-docker-registry-04 because of issues related to T276922 [tools]
12:34 <arturo> briefly rebooting VM tools-docker-registry-04; we need to reboot the hypervisor cloudvirt1038 and the VM failed to migrate away [tools]
2021-03-05 §
12:30 <arturo> started tools-redis-1004 again [tools]
12:22 <arturo> stop tools-redis-1004 to ease draining of cloudvirt1035 [tools]