analytics SAL

2023-01-25 §
16:53	<btullis>	kicked off a rolling reboot of kafka-jumbo as part of T325132	[analytics]
15:14	<btullis>	rebooting an-conf1003 for new kernel	[analytics]
14:54	<btullis>	started a rolling-reboot of the hadoop workers via `sre.hadoop.reboot-workers` cookbook.	[analytics]
2023-01-23 §
13:06	<btullis>	restarted webrequest_sampled_supervisor realtime druid indexation job	[analytics]
10:04	<btullis>	proceeding to upgrade an-tool1010 to bullseye for superset 1.5.3 upgrade T323458	[analytics]
2023-01-19 §
10:25	<btullis>	enabled dashboard native filtering in superset https://gerrit.wikimedia.org/r/c/operations/puppet/+/881510 for T318299	[analytics]
2023-01-17 §
20:54	<xcollazo>	dropping old partitions from image_suggestions Hive tables as per https://phabricator.wikimedia.org/T325837	[analytics]
16:50	<btullis>	shutdown an-worker1086 for RAID BBU replacement	[analytics]
2023-01-16 §
08:46	<elukey>	powercycle an-worker1125 - soft lockup traces registered in the tty, host frozen	[analytics]
2023-01-10 §
17:33	<btullis>	chassis power reset on an-worker1032 (T326459)	[analytics]
15:58	<SandraEbele>	backfilling refine_event_sanitized_analytics_immediate on an-launcher1002 `sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_event_sanitized_analytics_immediate --ignore_failure_flag=true --since=2023-01-07T17:00:00 --until=2023-01-08T10:00:00`	[analytics]
15:55	<SandraEbele>	reran failed pageview-druid-hourly-coord oozie job for 2023-1-10-10.	[analytics]
11:36	<btullis>	roll-rebooting the analytics druid cluster to pick up new kernel	[analytics]
10:24	<btullis>	roll-rebooting the druid-public cluster to pick up new kernel	[analytics]
2023-01-09 §
17:09	<aqu>	Relaunching refine_event after partial backfilling `sudo systemctl start refine_event.service` (an-launcher1002)	[analytics]
14:48	<SandraEbele>	reran webrequest failed jobs `sudo -u analytics kerberos-run-command analytics oozie job --oozie $OOZIE_URL -Dstart_time=2023-01-08T07:00Z -Dstop_time=2023-01-08T14:59Z -Dwebrequest_source=text -Derror_incomplete_data_threshold=100 -Dwarning_incomplete_data_threshold=100 -Derror_data_loss_threshold=100 -Dwarning_data_loss_threshold=100 -submit -config /home/ebysans/webrequest_text_coordinator.properties`	[analytics]
10:21	<aqu>	backfilling with refine_event on an-launcher1002 `sudo -u analytics kerberos-run-command analytics /usr/local/bin/refine_event --ignore_failure_flag=true --since=2023-01-07T16:00:00 --until=2023-01-09T09:00:00 --verbose`	[analytics]
09:48	<aqu>	killing refine_event yarn application `sudo -u analytics yarn application -kill application_1663082229270_682638`	[analytics]
09:39	<aqu>	Manually kill the Spark process on an-launcher1002 `sudo -u analytics kill -9 28538`	[analytics]
2023-01-06 §
12:29	<steve_munene>	roll restarting aqs servers to bump up mediawiki_history_snapshot to 2022-12	[analytics]
2023-01-04 §
17:14	<xcollazo>	Dropped all temporary differential privacy tables with the 'DROP DATABASE tumult_temp_*' pattern.	[analytics]
2023-01-03 §
11:08	<btullis>	restarted hive-server2 and hive-metastore services on an-coord1001 after failover to standby server	[analytics]
10:39	<btullis>	fail over hive services to an-coord1002 with change to the DNS CNAME for analytics-hive.eqiad.wmnet	[analytics]
10:20	<btullis>	restart hive-server2 and hive-metastore services on an-coord1002 prior to failover	[analytics]
2022-12-25 §
19:52	<btullis>	reran the `refine_eventlogging_legacy` job	[analytics]
16:56	<btullis>	restarted `monitor_refine_event` service on an-launcher1002 after successful refine run	[analytics]
16:55	<btullis>	reran refine_event for 'mediawiki_api_request\|mediawiki_cirrussearch_request' at 16:40	[analytics]
2022-12-22 §
11:01	<btullis>	powering up an-presto10[05-15] but presto-server will be disabled.	[analytics]
2022-12-21 §
14:42	<elukey>	`apt-get clean` on an-launcher1002 to free some space	[analytics]
01:17	<xcollazo>	Deleted unused tables analytics_platform_eng.imagerec and analytics_platform_eng.imagerec_prod.	[analytics]
2022-12-19 §
13:45	<btullis>	restart presto-server on an-coord1001 to increase heap from 4 GB to 16 GB T325331	[analytics]
12:11	<aqu>	systemctl start hadoop-namenode-backup-hdfs.service on an-master1002 at 11am UTC	[analytics]
09:36	<aqu>	Deployed analytics/refinery using scap, then deployed onto HDFS.	[analytics]
09:17	<aqu>	About to deploy analytics/refinery (bug fix in HDFS usage pipeline)	[analytics]
2022-12-16 §
15:36	<xcollazo>	deploying 'Fix subtle bug on image_suggestions when resolving varprop.' on platform_eng Airflow instance.	[analytics]
2022-12-15 §
22:28	<btullis>	run `sudo apt clean` on an-coord1001	[analytics]
19:08	<xcollazo>	Deploying Spark3 upgrade of image_suggestions job to the platform_eng Airflow instance.	[analytics]
10:03	<joal>	Restart failed airflow tasks	[analytics]
2022-12-13 §
21:35	<aqu>	Deploying analytics/refinery (HDFS FSImage conversion to XML script)	[analytics]
2022-12-09 §
08:38	<joal>	Kill refine_eventlogging_legacy stuck job (application_1663082229270_510052)	[analytics]
2022-12-08 §
13:55	<joal>	rerun webrequest failed jobs for hour 2022-12-08-T11:00Z with updated workflow (no dataloss checks)	[analytics]
12:23	<joal>	rerun webrequest failed jobs for hour 2022-12-08-T11:00Z	[analytics]
2022-12-07 §
17:57	<aqu>	Adding raw hdfs fsimage dir in HDFS (an-launcher1002)	[analytics]
17:47	<aqu>	Adding hdfs/usage folder dataset in HDFS	[analytics]
16:24	<aqu>	Deploying analytics/refinery (HDFS usage scripts)	[analytics]
15:13	<btullis>	roll-restarting AQS to pick up new mediawiki_history_reduce snapshot	[analytics]
14:06	<btullis>	rebuilding an-tool1005 as bullseye to test superset 1.5.2 upgrade	[analytics]
09:10	<btullis>	reboot an-worker1108 as it was spinning with soft CPU lockups	[analytics]
2022-12-06 §
12:47	<btullis>	sudo systemctl restart wmf_auto_restart_prometheus-mysqld-exporter.service on matomo1002	[analytics]
11:53	<btullis>	attempting to unmount and remount `/mnt/hdfs` on stat1004	[analytics]