2023-02-13 §
08:59 <nfraison> restarting presto-worker on an-test-presto1001 to pick up new gc logging settings T329054 [analytics]
08:52 <nfraison> disabled puppet agent on an-presto[1001-1015].eqiad.wmnet and an-coord1001.eqiad.wmnet to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/888214 on test cluster first only [analytics]
2023-02-10 §
23:22 <mforns> unpaused all airflow dags and cleared all failed tasks after the incident [analytics]
22:30 <btullis> starting the hadoop-yarn-resourcemanager on an-master1001 and failing back to it. [analytics]
22:25 <btullis> stopping hadoop-yarn-resourcemanager service in an-master1001 to fail over automatically to an-master1002 [analytics]
21:21 <mforns> restarted airflow@analytics.service in an-launcher1002 [analytics]
2023-02-09 §
17:32 <mforns> deployed airflow [analytics]
12:01 <btullis> Shutting down an-worker109[89] and dse-k8s-worker1002 for another GPU move - T318696 [analytics]
10:36 <joal> Start airflow webrequest_actor jobs [analytics]
10:26 <joal> Deploy analytics-airflow [analytics]
10:25 <joal> Setup airflow start-date variables for new dags [analytics]
10:10 <joal> Merge airflow code for learning/actor -> webrequest_actor move [analytics]
10:01 <joal> Move data and update hive tables from learning/actor convention to webrequest_actor convention [analytics]
09:59 <joal> Kill oozie pageview-learning jobs [analytics]
2023-02-08 §
19:26 <milimetric> finished deploying refinery-source 0.2.11, refinery, and synced to hdfs [analytics]
12:04 <btullis> shut down an-worker109[67] and dse-k8s-worker1001 ready for GPU swap. [analytics]
2023-02-03 §
15:23 <milimetric> deployed airflow-dags/analytics to disable skein log collection from the SparkSubmitOperator. [analytics]
10:11 <steve_munene> roll-restart aqs to update mediawiki_history_snapshot to 2023-01 [analytics]
2023-02-02 §
12:26 <btullis> deploying the updated build of superset to production T328047 [analytics]
09:56 <btullis> correction: beginning a rolling reboot of all aqs servers for T325132 [analytics]
09:52 <btullis> beginning a rolling reboot of all aqs servers for T326945 [analytics]
08:44 <steve_munene> Deployed refinery using scap, then deployed onto hdfs [analytics]
08:26 <steve_munene> refinery-deploy-to-hdfs run4 [analytics]
2023-02-01 §
10:51 <steve_munene> Deploying refinery for ops week [analytics]
2023-01-30 §
16:41 <btullis> started an-presto1006-1015 again, but disabled the presto service on them once again T323783 and T325809 [analytics]
2023-01-27 §
11:41 <steve_munene> datahub helmfile apply on main for T327884 [analytics]
11:17 <btullis> shut down an-worker1087 to await RAID BBU replacement [analytics]
11:03 <steve_munene> datahub: apply on main for T327884 [analytics]
2023-01-26 §
10:42 <joal> deploying airflow analytics for GDI dags [analytics]
10:36 <joal> drop/recreate wmf_raw.mediawiki_private_cu_changes hive table to have new fields [analytics]
10:01 <joal> deploy refinery onto hdfs [analytics]
09:48 <joal> deploying refinery using scap (no refinery-source deploy) [analytics]
09:43 <joal> Rerun failed 'cassandra_daily_load.load_mediarequest_per_file_to_cassandra 2023-01-25T00:00:00+00:00' task [analytics]
2023-01-25 §
16:54 <steve_munene> Restarting presto-server.service on presto coordinator an-coord1001 for T323783 [analytics]
16:53 <btullis> kicked off a rolling reboot of kafka-jumbo as part of T325132 [analytics]
15:14 <btullis> rebooting an-conf1003 for new kernel [analytics]
14:54 <btullis> started a rolling-reboot of the hadoop workers via `sre.hadoop.reboot-workers` cookbook. [analytics]
2023-01-23 §
13:06 <btullis> restarted webrequest_sampled_supervisor realtime druid indexation job [analytics]
10:04 <btullis> proceeding to upgrade an-tool1010 to bullseye for superset 1.5.3 upgrade T323458 [analytics]
2023-01-19 §
10:25 <btullis> enabled dashboard native filtering in superset https://gerrit.wikimedia.org/r/c/operations/puppet/+/881510 for T318299 [analytics]
2023-01-17 §
20:54 <xcollazo> dropping old partitions from image_suggestions Hive tables as per https://phabricator.wikimedia.org/T325837 [analytics]
16:50 <btullis> shutdown an-worker1086 for RAID BBU replacement [analytics]
2023-01-16 §
08:46 <elukey> powercycle an-worker1125 - soft lockup traces registered in the tty, host frozen [analytics]
2023-01-10 §
17:33 <btullis> chassis power reset on an-worker1032 (T326459) [analytics]
15:58 <SandraEbele> backfilling refine_event_sanitized_analytics_immediate on an-launcher1002 ‘sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_event_sanitized_analytics_immediate --ignore_failure_flag=true --since=2023-01-07T17:00:00 --until=2023-01-08T10:00:00’ [analytics]
15:55 <SandraEbele> reran failed pageview-druid-hourly-coord oozie job for 2023-1-10-10. [analytics]
11:36 <btullis> roll-rebooting the analytics druid cluster to pick up new kernel [analytics]
10:24 <btullis> roll-rebooting the druid-public cluster to pick up new kernel [analytics]
2023-01-09 §
17:09 <aqu> Relaunching refine_event after partial backfilling `sudo systemctl start refine_event.service` (an-launcher1002) [analytics]
14:48 <SandraEbele> reran webrequest failed jobs ‘sudo -u analytics kerberos-run-command analytics oozie job --oozie $OOZIE_URL -Dstart_time=2023-01-08T07:00Z -Dstop_time=2023-01-08T14:59Z -Dwebrequest_source=text -Derror_incomplete_data_threshold=100 -Dwarning_incomplete_data_threshold=100 -Derror_data_loss_threshold=100 -Dwarning_data_loss_threshold=100 -submit -config /home/ebysans/webrequest_text_coordinator.properties’ [analytics]