2021-01-18 §
09:00 <dcaro> Enabling custom application 'cinder' on pool codfw1dev-cinder to get rid of health warnings [admin]
2021-01-17 §
16:53 <arturo> icinga downtime labstore1004 /srv/tools space check for 3 days (T272247) [admin]
2021-01-15 §
13:41 <arturo> icinga downtime labstore1004 maintain-dbusers alert until 2021-01-19 (T272125) [admin]
09:47 <arturo> labstore1004 maintain-dbusers affected by T272127 and T272125 [admin]
09:22 <arturo> restart maintain-dbusers.service in labstore1004 [admin]
08:19 <dcaro> Merging the patch to disable write caches on ceph osds (T271527) [admin]
2021-01-13 §
17:03 <arturo> move cloudvirt1013 cloudvirt1032 cloudvirt1037 to the 'toobusy' host aggregate to prevent further CPU oversubscribing [admin]
12:40 <arturo> try increasing systemd watchdog timeout for conntrackd in cloudnet1004 (T268335) [admin]
11:45 <dcaro> https://gerrit.wikimedia.org/r/c/operations/puppet/+/654419 merged and deployed (and tested) (T268877) [admin]
11:40 <dcaro> merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/654419 that might affect the encapi service (puppet on cloud environment), no downtime expected though (T268877) [admin]
10:56 <arturo> trying to clean up dpkg package mess in cloudnet2002-dev [admin]
10:02 <arturo> prevent floating IP allocation from neutron transport subnet: root@cloudcontrol1005:~# neutron subnet-update --allocation-pool start=,end= cloud-instances-transport1-b-eqiad1 (T271867) [admin]
2021-01-12 §
10:33 <arturo> reboot cloudnet1004 [admin]
10:32 <arturo> update firmware-bnx2x from 20190114-2 to 20200918-1~bpo10+1 on cloudnet1004 (T271058) [admin]
2021-01-11 §
10:22 <arturo> doubling size of conntrack table in cloudnet servers https://gerrit.wikimedia.org/r/c/operations/puppet/+/655407 (T271058) [admin]
10:07 <arturo> manually clean up conntrack table in cloudnet1004 (T271058) [admin]
09:19 <dcaro> cleaned up ~1800 snapshots, only 109 remaining, one for each host x image combination (plus some ephemeral ones while doing backups), closing the task (T270478) [admin]
08:39 <dcaro> cleaning up dangling snapshots now that we have the new suffixed ones (T270478) [admin]
2021-01-10 §
16:02 <andrewbogott> restarting rabbitmq-server on all eqiad1 cloudcontrols [admin]
15:54 <andrewbogott> restarting neutron-metadata-agent on cloudnet1004 due to many syslog complaints [admin]
2021-01-08 §
11:25 <arturo> rebooting both cloudnet2002-dev/cloudnet2003-dev to make sure interfaces are set up correctly (T271517) [admin]
11:22 <arturo> connecting cloudnet2002-dev cloudnet2003-dev back to vlan 2120 (T271517) [admin]
11:06 <arturo> root@cloudcontrol2001-dev:~# openstack router set --external-gateway wan-transport-codfw --fixed-ip subnet=cloud-instances-transport1-b-codfw,ip-address= cloudinstances2b-gw (T271517) [admin]
11:02 <arturo> root@cloudcontrol2001-dev:~# openstack router set --enable-snat cloudinstances2b-gw --external-gateway wan-transport-codfw (T271517) [admin]
11:01 <arturo> enabling neutron hacks in codfw1dev (cloudnet2002-dev, cloudnet2003-dev) (T271517) [admin]
10:55 <arturo> aborrero@labtestvirt2003:~ $ sudo ifdown eno2.2107 (T271517) [admin]
10:55 <arturo> aborrero@labtestvirt2003:~ $ sudo ifdown eno2.2120 (T271517) [admin]
10:53 <arturo> root@cloudcontrol2001-dev:~# openstack subnet create --network wan-transport-codfw --gateway --ip-version 4 --network wan-transport-codfw --no-dhcp --subnet-range cloud-instances-transport1-b-codfw (T271517) [admin]
10:40 <dcaro> Finished tests, bringing osd online (osd.48) for eqiad ceph cluster (T271417) [admin]
09:59 <dcaro> Started performance tests on sdc (osd.48) for eqiad ceph cluster (T271417) [admin]
09:41 <dcaro> Taking osd.48 from eqiad ceph cluster out to do performance tests (T271417) [admin]
2021-01-07 §
15:19 <dcaro> Finished speed tests on cloudcephosd2001-dev, reprovisioning the osd.0 sdc (T271417) [admin]
14:39 <dcaro> Starting speed tests on cloudcephosd2001-dev sdc (T271417) [admin]
12:53 <dcaro> Taking osd.0 down on codfw ceph cluster to try the disk performance testing process (T271417) [admin]
11:35 <arturo> merging dmz_cidr change (T209082, T267779) [admin]
2021-01-05 §
10:40 <dcaro> removing dumps-[1..*] backups from cloudvirt1024 as they are not needed (T271094) [admin]
2021-01-03 §
07:06 <dcaro> Got a network hiccup on cloudnet1004, keeping track here T271058 [admin]
2020-12-28 §
12:32 <arturo> stop doing backups for the dumps project https://gerrit.wikimedia.org/r/c/operations/puppet/+/652182 (T260692) [admin]
12:32 <arturo> stop doing backups for the dumps project https://gerrit.wikimedia.org/r/c/operations/puppet/+/652182 (T260682) [admin]
12:23 <arturo> icinga downtime cloudvirt1026 disk space check until January 5 (T260692) [admin]
06:15 <andrewbogott> restarting designate-central on cloudservices1003/1004. I'm pretty sure they're distressed because of DB lag but it's worth a try [admin]
2020-12-23 §
15:38 <andrewbogott> restarting rabbitmq on cloudcontrol1004; suspected leaks [admin]
15:33 <andrewbogott> restarting each cloudcontrol galera node in turn to see if that quiets down the syncing warnings [admin]
12:08 <arturo> move memory out of the swap in cloudcontrol1004 by disabling/enabling it (1Gb swap was being used) [admin]
2020-12-22 §
15:30 <dcaro> cleaning up 6778 dangling snapshots for glance images in eqiad (T270478) [admin]
13:51 <dcaro> merged patch to move wikidumpparse backups to cloudvirt1025 to free space on cloudvirt1026 [admin]
2020-12-19 §
16:18 <dcaro> gzipped a bunch of logs on cloudvirt1004 due to / being out of space [admin]
00:14 <bstorm> truncated /var/log/debug.1 on cloudcontrol1003 which appears to be the exact same content as the user.log files anyway [admin]
00:10 <bstorm> truncated /var/log/daemon.log.1 and the haproxy log [admin]
00:02 <bstorm> truncated /var/log/messages.1 on cloudcontrol1003 [admin]