2023-03-21
|
13:16 |
<elukey@cumin1001> |
START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main1005.eqiad.wmnet |
[production] |
13:11 |
<elukey@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on kafka-main1005.eqiad.wmnet with reason: Stop kafka, update idrac/bios/nic-firmware |
[production] |
13:11 |
<elukey@cumin1001> |
START - Cookbook sre.hosts.downtime for 3:00:00 on kafka-main1005.eqiad.wmnet with reason: Stop kafka, update idrac/bios/nic-firmware |
[production] |
13:05 |
<elukey> |
move kafka mirror maker instances to PKI migration settings (new truststores) - T319372 |
[production] |
12:21 |
<wm-bot> |
<lucaswerkmeister-wmde> deployed ba986e3595 (add .mailmap; pulled without webservice restart) |
[tools.wdmm] |
11:53 |
<wm-bot> |
<lucaswerkmeister-wmde> deployed 66765dcae6 (update plwiktionary override) |
[tools.wdmm] |
11:20 |
<aikochou@deploy2002> |
helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . |
[production] |
11:09 |
<joal> |
Unpause mediacounts_load airflow job with start_date set to 2023-03-21T10:00 |
[production] |
11:08 |
<joal> |
Kill mediacounts_load oozie job |
[production] |
11:07 |
<joal> |
Unpause mediawiki_history_denormalize airflow job |
[production] |
11:06 |
<joal> |
Kill mediawiki_denormalize oozie job |
[production] |
11:04 |
<joal@deploy2002> |
Finished deploy [airflow-dags/analytics@42e862b]: Regular analytics weekly train [airflow-dags/analytics@42e862b] (duration: 00m 11s) |
[production] |
11:04 |
<joal@deploy2002> |
Started deploy [airflow-dags/analytics@42e862b]: Regular analytics weekly train [airflow-dags/analytics@42e862b] |
[production] |
11:01 |
<joal> |
Deploy analytics airflow code |
[analytics] |
10:49 |
<nfraison_> |
deployment of last changes on k8s dse cluster failed: certificate secret creation failed due to a timeout contacting pki.discovery.wmnet |
[analytics] |
10:43 |
<nfraison@deploy2002> |
helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. |
[production] |
10:41 |
<joal> |
Unpause pageview_actor airflow dag |
[analytics] |
10:41 |
<joal> |
Alter wmf.pageview_actor table adding referer_data field |
[analytics] |
10:32 |
<nfraison@deploy2002> |
helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. |
[production] |
10:31 |
<nfraison_> |
deploy last changes on k8s dse cluster (dse-k8s-eqiad: flink-operator should watch rdf-streaming-updater, enable spark operator mutation webhook, Allow communication from spark pods to HDFS/Hive) |
[analytics] |
10:26 |
<joal> |
Deploy refinery onto HDFS |
[analytics] |
10:25 |
<joal> |
Pause pageview_actor airflow job during HDFS refinery deploy and alter table update |
[analytics] |
10:24 |
<joal@deploy2002> |
Finished deploy [analytics/refinery@0bb61e9] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@0bb61e9] (duration: 01m 30s) |
[production] |
10:22 |
<joal@deploy2002> |
Started deploy [analytics/refinery@0bb61e9] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@0bb61e9] |
[production] |
10:22 |
<joal@deploy2002> |
Finished deploy [analytics/refinery@0bb61e9] (thin): Regular analytics weekly train THIN [analytics/refinery@0bb61e9] (duration: 00m 09s) |
[production] |
10:22 |
<joal@deploy2002> |
Started deploy [analytics/refinery@0bb61e9] (thin): Regular analytics weekly train THIN [analytics/refinery@0bb61e9] |
[production] |
10:22 |
<joal@deploy2002> |
Finished deploy [analytics/refinery@0bb61e9]: Regular analytics weekly train [analytics/refinery@0bb61e9] (duration: 07m 48s) |
[production] |
10:14 |
<joal@deploy2002> |
Started deploy [analytics/refinery@0bb61e9]: Regular analytics weekly train [analytics/refinery@0bb61e9] |
[production] |
10:13 |
<joal> |
Deploy refinery with scap sorry |
[analytics] |
10:13 |
<joal> |
Deploy refinery with sqoop |
[analytics] |
09:43 |
<elukey@cumin1001> |
START - Cookbook sre.hosts.reimage for host kafka-main1005.eqiad.wmnet with OS bullseye |
[production] |
09:39 |
<elukey@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on kafka-main1005.eqiad.wmnet with reason: Stop kafka, attempt to reimage |
[production] |
09:39 |
<elukey@cumin1001> |
START - Cookbook sre.hosts.downtime for 3:00:00 on kafka-main1005.eqiad.wmnet with reason: Stop kafka, attempt to reimage |
[production] |
09:25 |
<phedenskog@deploy2002> |
Finished deploy [performance/navtiming@d2b97ad]: (no justification provided) (duration: 00m 06s) |
[production] |
09:25 |
<phedenskog@deploy2002> |
Started deploy [performance/navtiming@d2b97ad]: (no justification provided) |
[production] |
09:06 |
<elukey@cumin1001> |
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on cephosd[1001-1005].eqiad.wmnet with reason: Systemd units failing, puppet tries to bring them up periodically, spam on IRC |
[production] |
09:05 |
<elukey@cumin1001> |
START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on cephosd[1001-1005].eqiad.wmnet with reason: Systemd units failing, puppet tries to bring them up periodically, spam on IRC |
[production] |
08:31 |
<elukey> |
move purged daemons on cp nodes to a new CA bundle (to allow accepting kafka clients using PKI tls certs) - T319372 |
[production] |
08:11 |
<wm-bot2> |
cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko |
[tools] |
06:50 |
<ayounsi@cumin1001> |
END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 13150 |
[production] |
06:49 |
<ayounsi@cumin1001> |
START - Cookbook sre.network.peering with action 'configure' for AS: 13150 |
[production] |
03:57 |
<mwpresync@deploy2002> |
Pruned MediaWiki: 1.40.0-wmf.26 (duration: 02m 18s) |
[production] |
03:55 |
<mwpresync@deploy2002> |
Finished scap: testwikis wikis to 1.41.0-wmf.1 refs T330207 (duration: 52m 38s) |
[production] |
03:02 |
<mwpresync@deploy2002> |
Started scap: testwikis wikis to 1.41.0-wmf.1 refs T330207 |
[production] |