2023-03-30 §
12:11 <joal> Kill virtualpageview oozie job - migrated to airflow [analytics]
11:56 <joal> Kill oozie referer_daily job - migrated to airflow [analytics]
09:56 <btullis> re-running refine_event [analytics]
09:48 <joal> Deploy airflow analytics [analytics]
09:38 <joal> Deploying refinery onto HDFS [analytics]
09:27 <joal> Deploying refinery using scap [analytics]
2023-03-28 §
15:58 <btullis> deploying refinery to HDFS [analytics]
14:35 <btullis> re-enabling gobblin timers: https://gerrit.wikimedia.org/r/c/operations/puppet/+/903668 T330165 [analytics]
14:31 <btullis> re-enabling YARN queues: https://gerrit.wikimedia.org/r/c/operations/puppet/+/903565 T330165 [analytics]
14:25 <btullis> proceeding to take HDFS out of safe mode. [analytics]
14:25 <btullis> restarting hive-server2 and hive-metastore services on an-coord1001 [analytics]
13:54 <btullis> entering safe mode for analytics-hadoop cluster: T330165 [analytics]
13:37 <btullis> refreshed YARN queues with: `sudo kerberos-run-command yarn /usr/bin/yarn rmadmin -refreshQueues` on both an-master100[1-2] - T330165 [analytics]
13:31 <btullis> setting all four YARN queues to STOPPED https://gerrit.wikimedia.org/r/c/operations/puppet/+/903627 T330165 [analytics]
12:50 <btullis> merging the change to disable ingestion to HDFS https://gerrit.wikimedia.org/r/c/operations/puppet/+/903610 [analytics]
10:46 <btullis> failing over hive services to an-coord1002 prior to switch upgrade. [analytics]
2023-03-27 §
17:19 <milimetric> added 2023-03-14T11 and 2023-03-14T12 partitions for codfw on event.mediawiki_page_move with alter table mediawiki_page_move add partition (datacenter='codfw',year=2023,month=3,day=14,hour=[11,12]); [analytics]
2023-03-24 §
14:43 <topranks> merged alertmanager rules for eventlogging checks being migrated from Icinga T309007 [analytics]
2023-03-23 §
13:48 <joal> Restart virtualpageview-hourly-coord with pageview_allowlist fix - starting 2023-03-21T08:00 [analytics]
13:47 <joal> Kill oozie virtualpageview-hourly-coord job [analytics]
13:29 <joal> Hotfix deploy refinery [analytics]
11:37 <btullis> we changed the retention policy on an-test-druid to `{"period":"P1M","includeFuture":true,"tieredReplicants":{"_default_tier":1},"type":"loadByPeriod"},{"type":"dropForever"}` [analytics]
11:36 <btullis> reimaging an-test-druid1001 in place to upgrade to bullseye [analytics]
08:28 <joal> Rerun failed virtualpageview-druid-daily-wf-2023-3-22 [analytics]
2023-03-21 §
17:48 <joal> rerun failed airflow tasks [analytics]
17:39 <joal> Deploy airflow, hopefully fixing HDFSArchiver jobs [analytics]
13:21 <nfraison_> deploy last changes on k8s dse cluster (dse-k8s-eqiad: flink-operator should watch rdf-streaming-updater, enable spark operator mutation webhook, Allow communication from spark pods to HDFS/Hive) [analytics]
11:01 <joal> Deploy analytics airflow code [analytics]
10:49 <nfraison_> deployment of last changes on k8s dse cluster failed: certificate secret creation failed due to a timeout contacting pki.discovery.wmnet [analytics]
10:41 <joal> Unpause pageview_actor airflow dag [analytics]
10:41 <joal> Alter wmf.pageview_actor table adding referer_data field [analytics]
10:31 <nfraison_> deploy last changes on k8s dse cluster (dse-k8s-eqiad: flink-operator should watch rdf-streaming-updater, enable spark operator mutation webhook, Allow communication from spark pods to HDFS/Hive) [analytics]
10:26 <joal> Deploy refinery onto HDFS [analytics]
10:25 <joal> Pause pageview_actor airflow job during HDFS refinery deploy and alter table update [analytics]
10:13 <joal> Deploy refinery with scap sorry [analytics]
10:13 <joal> Deploy refinery with sqoop [analytics]
2023-03-17 §
07:45 <nfraison_> reset failed session-c624.scope as last issue was on March 14 on an-worker1132 [analytics]
07:42 <joal> Rerun failed refine_event job [analytics]
2023-03-16 §
17:00 <btullis> enabling puppet on an-airflow1004 to restart airflow services. [analytics]
16:51 <btullis> upgrading airflow package on an-airflow1004 [analytics]
16:29 <btullis> stopping puppet and airflow services on an-airflow1004 for the upgrade. [analytics]
2023-03-15 §
18:37 <joal> Manually creating partitions for event.mediawiki_client_session_tick (datacenter=eqiad/year=2023/month=3/day=7/hour=[10,11,12,13,14]) [analytics]
13:10 <btullis> rerunning eventlogging_legacy failed job [analytics]
11:18 <btullis> stopping the matomo database replica on db1108 [analytics]
2023-03-14 §
14:57 <btullis> deploying ceph mon and mgr daemons to cephosd100[1-5] T328123 [analytics]
11:48 <btullis> reran refine_event_sanitized_analytics_immediate for netflow year=2023/month=3/day=8/hour=6 [analytics]
10:23 <btullis> deploying airflow package version 2.5.1-py3.10-20230228 to stats hosts [analytics]
2023-03-13 §
17:14 <nfraison_> restart jobhistory in prod cluster to take into account https://gerrit.wikimedia.org/r/c/operations/puppet/+/896305 [analytics]
17:08 <nfraison_> restart jobhistory in test cluster to take into account https://gerrit.wikimedia.org/r/c/operations/puppet/+/896305 [analytics]
13:53 <milimetric> killing pageview-monthly_dump-coord, pageview-daily_dump-coord, and pageview-hourly-coord oozie jobs to migrate to airflow [analytics]