2023-01-25 §
16:53 <btullis> kicked off a rolling reboot of kafka-jumbo as part of T325132 [analytics]
15:14 <btullis> rebooting an-conf1003 for new kernel [analytics]
14:54 <btullis> started a rolling-reboot of the hadoop workers via `sre.hadoop.reboot-workers` cookbook. [analytics]
2023-01-23 §
13:06 <btullis> restarted webrequest_sampled_supervisor realtime druid indexation job [analytics]
10:04 <btullis> proceeding to upgrade an-tool1010 to bullseye for superset 1.5.3 upgrade T323458 [analytics]
2023-01-19 §
10:25 <btullis> enabled dashboard native filtering in superset https://gerrit.wikimedia.org/r/c/operations/puppet/+/881510 for T318299 [analytics]
2023-01-17 §
20:54 <xcollazo> dropping old partitions from image_suggestions Hive tables as per https://phabricator.wikimedia.org/T325837 [analytics]
16:50 <btullis> shutdown an-worker1086 for RAID BBU replacement [analytics]
2023-01-16 §
08:46 <elukey> powercycle an-worker1125 - soft lockup traces registered in the tty, host frozen [analytics]
2023-01-10 §
17:33 <btullis> chassis power reset on an-worker1032 (T326459) [analytics]
15:58 <SandraEbele> backfilling refine_event_sanitized_analytics_immediate on an-launcher1002 `sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_event_sanitized_analytics_immediate --ignore_failure_flag=true --since=2023-01-07T17:00:00 --until=2023-01-08T10:00:00` [analytics]
15:55 <SandraEbele> reran failed pageview-druid-hourly-coord oozie job for 2023-1-10-10. [analytics]
11:36 <btullis> roll-rebooting the analytics druid cluster to pick up new kernel [analytics]
10:24 <btullis> roll-rebooting the druid-public cluster to pick up new kernel [analytics]
2023-01-09 §
17:09 <aqu> Relaunching refine_event after partial backfilling `sudo systemctl start refine_event.service` (an-launcher1002) [analytics]
14:48 <SandraEbele> reran webrequest failed jobs `sudo -u analytics kerberos-run-command analytics oozie job --oozie $OOZIE_URL -Dstart_time=2023-01-08T07:00Z -Dstop_time=2023-01-08T14:59Z -Dwebrequest_source=text -Derror_incomplete_data_threshold=100 -Dwarning_incomplete_data_threshold=100 -Derror_data_loss_threshold=100 -Dwarning_data_loss_threshold=100 -submit -config /home/ebysans/webrequest_text_coordinator.properties` [analytics]
10:21 <aqu> backfilling with refine_event on an-launcher1002 `sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_event --ignore_failure_flag=true --since=2023-01-07T16:00:00 --until=2023-01-09T09:00:00 --verbose` [analytics]
09:48 <aqu> killing refine_event yarn application `sudo -u analytics yarn application -kill application_1663082229270_682638` [analytics]
09:39 <aqu> Manually kill the Spark process on an-launcher1002 `sudo -u analytics kill -9 28538` [analytics]
2023-01-06 §
12:29 <steve_munene> roll restarting aqs servers to bump up mediawiki_history_snapshot to 2022-12 [analytics]
2023-01-04 §
17:14 <xcollazo> Dropped all temporary differential privacy tables with the 'DROP DATABASE tumult_temp_*' pattern. [analytics]
2023-01-03 §
11:08 <btullis> restarted hive-server2 and hive-metastore services on an-coord1001 after failover to standby server [analytics]
10:39 <btullis> fail over hive services to an-coord1002 with change to the DNS CNAME for analytics-hive.eqiad.wmnet [analytics]
10:20 <btullis> restart hive-server2 and hive-metastore services on an-coord1002 prior to failover [analytics]
2022-12-25 §
19:52 <btullis> reran the `refine_eventlogging_legacy` job [analytics]
16:56 <btullis> restarted `monitor_refine_event` service on an-launcher1002 after successful refine run [analytics]
16:55 <btullis> reran refine_event for 'mediawiki_api_request|mediawiki_cirrussearch_request' at 16:40 [analytics]
2022-12-22 §
11:01 <btullis> powering up an-presto10[05-15] but presto-server will be disabled. [analytics]
2022-12-21 §
14:42 <elukey> `apt-get clean` on an-launcher1002 to free some space [analytics]
01:17 <xcollazo> Deleted unused tables analytics_platform_eng.imagerec and analytics_platform_eng.imagerec_prod. [analytics]
2022-12-19 §
13:45 <btullis> restart presto-server on an-coord1001 to increase heap from 4 GB to 16 GB T325331 [analytics]
12:11 <aqu> systemctl start hadoop-namenode-backup-hdfs.service on an-master1002 at 11am UTC [analytics]
09:36 <aqu> Deployed analytics/refinery using scap, then deployed onto HDFS. [analytics]
09:17 <aqu> About to deploy analytics/refinery (bug fix in HDFS usage pipeline) [analytics]
2022-12-16 §
15:36 <xcollazo> deploying 'Fix subtle bug on image_suggestions when resolving varprop.' on platform_eng Airflow instance. [analytics]
2022-12-15 §
22:28 <btullis> run `sudo apt clean` on an-coord1001 [analytics]
19:08 <xcollazo> Deploying Spark3 upgrade of image_suggestions job to the platform_eng Airflow instance. [analytics]
10:03 <joal> Restart failed airflow tasks [analytics]
2022-12-13 §
21:35 <aqu> Deploying analytics/refinery (HDFS FSImage conversion to XML script) [analytics]
2022-12-09 §
08:38 <joal> Kill refine_eventlogging_legacy stuck job (application_1663082229270_510052) [analytics]
2022-12-08 §
13:55 <joal> rerun webrequest failed jobs for hour 2022-12-08-T11:00Z with updated workflow (no dataloss checks) [analytics]
12:23 <joal> rerun webrequest failed jobs for hour 2022-12-08-T11:00Z [analytics]
2022-12-07 §
17:57 <aqu> Adding raw hdfs fsimage dir in HDFS (an-launcher1002) [analytics]
17:47 <aqu> Adding hdfs/usage folder dataset in HDFS [analytics]
16:24 <aqu> Deploying analytics/refinery (HDFS usage scripts) [analytics]
15:13 <btullis> roll-restarting AQS to pick up new mediawiki_history_reduce snapshot [analytics]
14:06 <btullis> rebuilding an-tool1005 as bullseye to test superset 1.5.2 upgrade [analytics]
09:10 <btullis> reboot an-worker1108 as it was spinning with soft CPU lockups [analytics]
2022-12-06 §
12:47 <btullis> sudo systemctl restart wmf_auto_restart_prometheus-mysqld-exporter.service on matomo1002 [analytics]
11:53 <btullis> attempting to unmount and remount `/mnt/hdfs` on stat1004 [analytics]