2022-03-17
07:53 <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [production]
07:53 <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [production]
07:53 <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 6:00:00 on db1106.eqiad.wmnet with reason: Maintenance [production]
07:52 <marostegui@cumin1001> dbctl commit (dc=all): 'db1163 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P22746 and previous config saved to /var/cache/conftool/dbconfig/20220317-075201-root.json [production]
07:36 <marostegui@cumin1001> dbctl commit (dc=all): 'db1163 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P22745 and previous config saved to /var/cache/conftool/dbconfig/20220317-073658-root.json [production]
07:31 <marostegui> dbmaint on s5@eqiad T297189 [production]
07:21 <marostegui@cumin1001> dbctl commit (dc=all): 'db1163 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P22744 and previous config saved to /var/cache/conftool/dbconfig/20220317-072154-root.json [production]
07:12 <marostegui@cumin1001> dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 100%: After buffer pool testing', diff saved to https://phabricator.wikimedia.org/P22743 and previous config saved to /var/cache/conftool/dbconfig/20220317-071200-root.json [production]
07:11 <ryankemper> [WDQS] Depooled `wdqs2003` (8 hours of lag to catch up on) [production]
07:06 <marostegui@cumin1001> dbctl commit (dc=all): 'db1163 (re)pooling @ 10%: After schema change', diff saved to https://phabricator.wikimedia.org/P22742 and previous config saved to /var/cache/conftool/dbconfig/20220317-070650-root.json [production]
07:04 <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 14 hosts with reason: Maintenance [production]
07:04 <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 12:00:00 on 14 hosts with reason: Maintenance [production]
07:04 <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [production]
07:04 <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance [production]
06:57 <ryankemper> [WDQS] Also of note is the spiking thread counts on the affected hosts: https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs&from=1647457172391&to=1647500081971&viewPanel=22 [production]
06:57 <ryankemper> [WDQS] Note that per https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs&from=1647457172391&to=1647500081971&viewPanel=7 `wdqs2003` has been offline for ~6 hours, `wdqs2001` for 1.5 hours and `wdqs2004` just recently. [production]
06:56 <marostegui@cumin1001> dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 75%: After buffer pool testing', diff saved to https://phabricator.wikimedia.org/P22741 and previous config saved to /var/cache/conftool/dbconfig/20220317-065656-root.json [production]
06:54 <ryankemper> [WDQS] `ryankemper@wdqs2003:~$ sudo systemctl restart wdqs-blazegraph.service` [production]
06:53 <ryankemper> [WDQS] `ryankemper@wdqs2001:~$ sudo systemctl restart wdqs-blazegraph.service` [production]
06:50 <elukey> restart blazegraph on wdqs2004 [production]
06:46 <elukey> kill remaining hanging processes for ppche*lko and accra*ze on an-test-client1001 to allow users offboard (puppet broken) [analytics]
06:46 <elukey> kill remaining hanging processes for ppche*lko and accra*ze on an-test-client1001 to allow users offboard (puppet broken) [production]
06:41 <marostegui@cumin1001> dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 50%: After buffer pool testing', diff saved to https://phabricator.wikimedia.org/P22740 and previous config saved to /var/cache/conftool/dbconfig/20220317-064152-root.json [production]
06:26 <marostegui@cumin1001> dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 25%: After buffer pool testing', diff saved to https://phabricator.wikimedia.org/P22739 and previous config saved to /var/cache/conftool/dbconfig/20220317-062648-root.json [production]
06:15 <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [production]
06:15 <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [production]
06:11 <marostegui@cumin1001> dbctl commit (dc=all): 'db1099:3318 (re)pooling @ 10%: After buffer pool testing', diff saved to https://phabricator.wikimedia.org/P22738 and previous config saved to /var/cache/conftool/dbconfig/20220317-061144-root.json [production]
04:06 <marostegui@cumin1001> dbctl commit (dc=all): 'Depooling db1146:3314 (T300775)', diff saved to https://phabricator.wikimedia.org/P22737 and previous config saved to /var/cache/conftool/dbconfig/20220317-040634-marostegui.json [production]
04:06 <marostegui@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [production]
04:06 <marostegui@cumin1001> START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [production]
02:57 <andrew@cumin1001> END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1016.eqiad.wmnet with OS bullseye [production]
02:07 <andrew@cumin1001> START - Cookbook sre.hosts.reimage for host cloudvirt1016.eqiad.wmnet with OS bullseye [production]
02:07 <andrew@cumin1001> END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1016.eqiad.wmnet with OS bullseye [production]
01:11 <andrew@cumin1001> START - Cookbook sre.hosts.reimage for host cloudvirt1016.eqiad.wmnet with OS bullseye [production]
01:09 <wm-bot> Drained 'cloudvirt1016.eqiad.wmnet'. (T281276) - cookbook ran by andrew@buster [admin]
00:53 <wm-bot> Set cloudvirt 'cloudvirt1016.eqiad.wmnet' maintenance. (T281276) - cookbook ran by andrew@buster [admin]
00:52 <wm-bot> Setting cloudvirt 'cloudvirt1016.eqiad.wmnet' maintenance. (T281276) - cookbook ran by andrew@buster [admin]
00:52 <wm-bot> Draining 'cloudvirt1016.eqiad.wmnet'. (T281276) - cookbook ran by andrew@buster [admin]
00:44 <andrewbogott> deleting remaining VMs and project, as per https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2021_Purge [pipelinelib-experimental]
00:43 <andrewbogott> deleting remaining VMs and project, as per https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2021_Purge [wikidata-realtime-dumps]
00:42 <andrewbogott> deleting remaining VMs and project, as per https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2021_Purge [wikidata-autodesc]
00:40 <andrewbogott> deleting remaining VMs and project, as per https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2021_Purge [thumbor]
00:39 <andrewbogott> deleting remaining VMs and project, as per https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2021_Purge [sentry]
00:38 <andrewbogott> deleting remaining VMs and project, as per https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2021_Purge [redwarn]
00:37 <andrewbogott> deleting remaining VMs and project, as per https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2021_Purge [push]
00:36 <andrewbogott> deleting remaining VMs and project, as per https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2021_Purge [privpol-captcha]
00:33 <andrewbogott> deleting remaining VMs and project, as per https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2021_Purge [openrefine]
00:30 <andrewbogott> deleting remaining VMs and project, as per https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2021_Purge [globalcu]
00:29 <andrewbogott> deleting remaining VMs and project, as per https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2021_Purge [glampipe]
00:28 <andrewbogott> deleting remaining VMs and project, as per https://wikitech.wikimedia.org/wiki/News/Cloud_VPS_2021_Purge [community-labs-monitoring]