analytics SAL

2023-03-01 §
10:25	<nfraison>	rebooting an-worker1132, which is slower than the other nodes (potential issue with RAID card/disks)	[analytics]
07:59	<nfraison>	restarted hiveserver2 in analytics-test to take into account the -XX:MaxMetaspaceSize=512m JVM parameter	[analytics]
2023-02-28 §
21:33	<xcollazo>	Deploying section_image_recommendations DAG to platform_eng Airflow instance	[analytics]
11:38	<btullis>	cancelled merge of https://gerrit.wikimedia.org/r/c/operations/puppet/+/878128	[analytics]
11:32	<btullis>	merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/878128	[analytics]
09:42	<nfraison>	restart presto prod coordinator to take into account heap size change	[analytics]
09:38	<nfraison>	Failover hive servers to active server: an-coord1001	[analytics]
09:32	<nfraison>	restarted hive-metastore and hiveserver2 on an-coord1001 (non-active hive server)	[analytics]
08:22	<nfraison>	Failover hive servers to standby server: https://gerrit.wikimedia.org/r/c/operations/dns/+/892460	[analytics]
2023-02-27 §
14:52	<nfraison>	restarted hive-metastore and hiveserver2 on an-coord1002 (standby hive server)	[analytics]
2023-02-22 §
19:39	<mforns>	restarted the following an-launcher1002 timers, which seemed stuck (next run = n/a): gobblin-webrequest.timer, reportupdater-browser.timer, reportupdater-reference-previews.timer, refine_event.timer, refine_eventlogging_legacy.timer	[analytics]
11:07	<nfraison>	roll restart presto clusters to take into account fix for node.environment typo	[analytics]
2023-02-21 §
19:01	<mforns>	re airflow silent failure: the job was pageview_actor_hourly	[analytics]
19:00	<mforns>	we had another silent failure in airflow, a sensor that failed without sending an email. the logs are missing.	[analytics]
09:33	<nfraison>	adding last batch of 5 nodes to the presto prod cluster	[analytics]
2023-02-20 §
13:11	<nfraison>	Reimage an-presto1001 to upgrade to bullseye T329361	[analytics]
12:45	<nfraison>	adding 5 nodes to the presto prod cluster	[analytics]
12:32	<nfraison>	roll-restart presto workers on an-presto100[1-5] to take into account new configs T329525	[analytics]
12:29	<nfraison>	restart presto coordinator on an-coord1001 to take into account new configs T329525	[analytics]
2023-02-18 §
08:29	<elukey>	kill leftover processes of user `mepps` (offboarded) from stat100[4,5] to unblock puppet	[analytics]
2023-02-16 §
21:10	<SandraEbele>	restarted oozie webrequest load bundle.	[analytics]
21:09	<SandraEbele>	Added new field referer_data to wmf.webrequest table using the alter table statement	[analytics]
21:07	<SandraEbele>	successfully deployed analytics refinery	[analytics]
18:46	<SandraEbele>	started deploying analytics refinery	[analytics]
18:37	<SandraEbele>	killed webrequest bundle oozie jobs to deploy refinery changes.	[analytics]
16:55	<SandraEbele>	Deployed refinery-source change to remove Github.io from Mediasites definition of referers.	[analytics]
2023-02-13 §
21:40	<xcollazo>	deploying section_topics v0.5.0 on platform_eng Airflow instance	[analytics]
21:39	<ottomata>	enabled rc1.mediawiki.page_change stream on group0 and group1 wikis	[analytics]
14:15	<btullis>	roll-restarting all eventgate pods	[analytics]
14:06	<nfraison>	Reimage an-test-presto1001 to upgrade to bullseye T329361	[analytics]
10:46	<nfraison>	restarting presto-worker on an-presto[1001-1015].eqiad.wmnet to pick up new gc logging settings T329054	[analytics]
10:15	<btullis>	Reimage an-test-worker1001 to upgrade to bullseye T329363	[analytics]
09:59	<nfraison>	restarting presto-coordinator on an-coord1001 to pick up new gc logging settings T329054	[analytics]
09:57	<nfraison>	re-enabled puppet agent on an-presto[1001-1015].eqiad.wmnet and an-coord1001.eqiad.wmnet	[analytics]
09:08	<aqu>	Rerun killed Oozie pageview-hourly-coord of 2023-02-11 with sudo -u analytics kerberos-run-command analytics oozie job --oozie $OOZIE_URL -rerun 0019103-210107075406929-oozie-oozi-C -date 2023-02-11T14:00Z::2023-02-11T16:00Z	[analytics]
09:04	<nfraison>	restarting presto-coordinator on an-test-coord1001 to pick up new gc logging settings T329054	[analytics]
08:59	<nfraison>	restarting presto-worker on an-test-presto1001 to pick up new gc logging settings T329054	[analytics]
08:52	<nfraison>	disabled puppet agent on an-presto[1001-1015].eqiad.wmnet and an-coord1001.eqiad.wmnet to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/888214 on test cluster first only	[analytics]
2023-02-10 §
23:22	<mforns>	unpaused all airflow dags and cleared all failed tasks after the incident	[analytics]
22:30	<btullis>	starting the hadoop-yarn-resourcemanager on an-master1001 and failing back to it.	[analytics]
22:25	<btullis>	stopping hadoop-yarn-resourcemanager service on an-master1001 to fail over automatically to an-master1002	[analytics]
21:21	<mforns>	restarted airflow@analytics.service in an-launcher1002	[analytics]
2023-02-09 §
17:32	<mforns>	deployed airflow	[analytics]
12:01	<btullis>	Shutting down an-worker109[89] and dse-k8s-worker1002 for another GPU move - T318696	[analytics]
10:36	<joal>	Start airflow webrequest_actor jobs	[analytics]
10:26	<joal>	Deploy analytics-airflow	[analytics]
10:25	<joal>	Setup airflow start-date variables for new dags	[analytics]
10:10	<joal>	Merge airflow code for learning/actor -> webrequest_actor move	[analytics]
10:01	<joal>	Move data and update hive tables from learning/actor convention to webrequest_actor convention	[analytics]
09:59	<joal>	Kill oozie pageview-learning jobs	[analytics]