2023-03-01 §
10:25 <nfraison> rebooting an-worker1132, which is slower than other nodes (potential issue with raid card/disks) [analytics]
07:59 <nfraison> restarted hiveserver2 in analytics-test to take into account the -XX:MaxMetaspaceSize=512m JVM parameter [analytics]
2023-02-28 §
21:33 <xcollazo> Deploying section_image_recommendations DAG to platform_eng Airflow instance [analytics]
11:38 <btullis> cancelled merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/878128 [analytics]
11:32 <btullis> merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/878128 [analytics]
09:42 <nfraison> restart presto prod coordinator to take into account heap size change [analytics]
09:38 <nfraison> Failover hive servers to active server: an-coord1001 [analytics]
09:32 <nfraison> restarted hive-metastore and hiveserver2 on an-coord1001 (non-active hive server) [analytics]
08:22 <nfraison> Failover hive servers to standby server: https://gerrit.wikimedia.org/r/c/operations/dns/+/892460 [analytics]
2023-02-27 §
14:52 <nfraison> restarted hive-metastore and hiveserver2 on an-coord1002 (standby hive server) [analytics]
2023-02-22 §
19:39 <mforns> restarted the following an-launcher1002 timers, which seemed stuck (next run = n/a): gobblin-webrequest.timer, reportupdater-browser.timer, reportupdater-reference-previews.timer, refine_event.timer, refine_eventlogging_legacy.timer [analytics]
11:07 <nfraison> roll restart presto clusters to take into account the fix for the node.environment typo [analytics]
2023-02-21 §
19:01 <mforns> re airflow silent failure: the job was pageview_actor_hourly [analytics]
19:00 <mforns> we had another silent failure in airflow, a sensor that failed without sending an email. the logs are missing. [analytics]
09:33 <nfraison> adding last batch of 5 nodes to the presto prod cluster [analytics]
2023-02-20 §
13:11 <nfraison> Reimage an-presto1001 to upgrade to bullseye T329361 [analytics]
12:45 <nfraison> adding 5 nodes to the presto prod cluster [analytics]
12:32 <nfraison> roll-restart presto workers on an-presto100[1-5] to take into account new configs T329525 [analytics]
12:29 <nfraison> restart presto coordinator on an-coord1001 to take into account new configs T329525 [analytics]
2023-02-18 §
08:29 <elukey> kill leftover processes of user `mepps` (offboarded) from stat100[4,5] to unblock puppet [analytics]
2023-02-16 §
21:10 <SandraEbele> restarted oozie webrequest load bundle. [analytics]
21:09 <SandraEbele> Added new field referer_data to wmf.webrequest table using the alter table statement [analytics]
21:07 <SandraEbele> successfully deployed analytics refinery [analytics]
18:46 <SandraEbele> started deploying analytics refinery [analytics]
18:37 <SandraEbele> killed webrequest bundle oozie jobs to deploy refinery changes. [analytics]
16:55 <SandraEbele> Deployed refinery-source change to remove Github.io from Mediasites definition of referers. [analytics]
2023-02-13 §
21:40 <xcollazo> deploying section_topics v0.5.0 on platform_eng Airflow instance [analytics]
21:39 <ottomata> enabled rc1.mediawiki.page_change stream on group0 and group1 wikis [analytics]
14:15 <btullis> roll-restarting all eventgate pods [analytics]
14:06 <nfraison> Reimage an-test-presto1001 to upgrade to bullseye T329361 [analytics]
10:46 <nfraison> restarting presto-worker on an-presto[1001-1015].eqiad.wmnet to pick up new gc logging settings T329054 [analytics]
10:15 <btullis> Reimage an-test-worker1001 to upgrade to bullseye T329363 [analytics]
09:59 <nfraison> restarting presto-coordinator on an-coord1001 to pick up new gc logging settings T329054 [analytics]
09:57 <nfraison> re-enabled puppet agent on an-presto[1001-1015].eqiad.wmnet and an-coord1001.eqiad.wmnet [analytics]
09:08 <aqu> Rerun killed Oozie pageview-hourly-coord of 2023-02-11 with sudo -u analytics kerberos-run-command analytics oozie job --oozie $OOZIE_URL -rerun 0019103-210107075406929-oozie-oozi-C -date 2023-02-11T14:00Z::2023-02-11T16:00Z [analytics]
09:04 <nfraison> restarting presto-coordinator on an-test-coord1001 to pick up new gc logging settings T329054 [analytics]
08:59 <nfraison> restarting presto-worker on an-test-presto1001 to pick up new gc logging settings T329054 [analytics]
08:52 <nfraison> disabled puppet agent on an-presto[1001-1015].eqiad.wmnet and an-coord1001.eqiad.wmnet to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/888214 on test cluster first only [analytics]
2023-02-10 §
23:22 <mforns> unpaused all airflow dags and cleared all failed tasks after the incident [analytics]
22:30 <btullis> starting the hadoop-yarn-resourcemanager on an-master1001 and failing back to it. [analytics]
22:25 <btullis> stopping hadoop-yarn-resourcemanager service in an-master1001 to fail over automatically to an-master1002 [analytics]
21:21 <mforns> restarted airflow@analytics.service in an-launcher1002 [analytics]
2023-02-09 §
17:32 <mforns> deployed airflow [analytics]
12:01 <btullis> Shutting down an-worker109[89] and dse-k8s-worker1002 for another GPU move - T318696 [analytics]
10:36 <joal> Start airflow webrequest_actor jobs [analytics]
10:26 <joal> Deploy analytics-airflow [analytics]
10:25 <joal> Setup airflow start-date variables for new dags [analytics]
10:10 <joal> Merge airflow code for learning/actor -> webrequest_actor move [analytics]
10:01 <joal> Move data and update hive tables from learning/actor convention to webrequest_actor convention [analytics]
09:59 <joal> Kill oozie pageview-learning jobs [analytics]