analytics SAL

2020-09-23 §
06:06	<elukey>	stop timers on an-launcher1002 as prep step before maintenance	[analytics]
2020-09-22 §
06:29	<elukey>	re-run webrequest-load-text 21/09T21 - failed due to sporadic hive/kerberos issue (SQLException: Could not open client transport with JDBC Uri: jdbc:hive2://an-coord1001.eqiad.wmnet:10000/default;principal=hive/an-coord1001.eqiad.wmnet@WIKIMEDIA: Peer indicated failure: Failure to initialize security context)	[analytics]
2020-09-21 §
18:00	<elukey>	execute sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/mgerlach/logs/* to free ~30TB of space on HDFS (Replicated)	[analytics]
17:44	<elukey>	restart yarn resource managers on an-master100[1,2] to pick up settings for https://gerrit.wikimedia.org/r/c/operations/puppet/+/628887	[analytics]
16:59	<joal>	Manually add _SUCCESS file to the hourly partition of page_move events so that the wikidata-item_page_link job starts	[analytics]
16:21	<joal>	Kill-restart wikidata-item_page_link-weekly-coord to not wait on missing data	[analytics]
15:45	<joal>	Restart wikidata-json_entity-weekly coordinator after wrong kill in new hue UI	[analytics]
15:42	<joal>	manually killing wikidata-json_entity-weekly-wf-2020-08-31 - Raw data is missing from dumps folder (json dumps)	[analytics]
2020-09-18 §
15:05	<elukey>	systemctl reset-failed monitor_refine_eventlogging_legacy_failure_flags.service on an-launcher1002 to clear icinga alarms	[analytics]
10:38	<elukey>	force ./create_virtualenv.sh in /srv/jupyterhub/deploy to update the jupyter's default venv	[analytics]
2020-09-17 §
10:12	<klausman>	started backup of stat1004's /srv to stat1008	[analytics]
2020-09-16 §
19:12	<joal>	Manually kill webrequest-hour oozie job that started before the restart could happen (waiting for previous hour to be finished)	[analytics]
19:00	<joal>	Kill-restart data-quality-hourly bundle after deploy	[analytics]
18:57	<joal>	Kill-restart webrequest after deploy	[analytics]
18:44	<joal>	Kill-restart mediawiki-history-reduced job after deploy	[analytics]
17:59	<joal>	Deploy refinery onto HDFS	[analytics]
17:46	<joal>	Deploy refinery using scap	[analytics]
15:27	<elukey>	update the TLS backend certificate for Analytics UIs (unified one) to include hue-next.w.o as SAN	[analytics]
12:11	<klausman>	stat1008 updated to use rock/rocm DKMS driver and back in operation	[analytics]
11:28	<klausman>	starting to upgrade to rock-dkms driver on stat1008	[analytics]
08:11	<elukey>	superset 0.37.1 deployed to an-tool1005 (staging env)	[analytics]
2020-09-15 §
13:43	<elukey>	re-enable timers on an-launcher1002 after maintenance to an-coord1001	[analytics]
13:43	<elukey>	restart of hive/oozie/presto daemons on an-coord1001	[analytics]
12:30	<elukey>	stop timers on an-launcher1002 to drain the cluster and restart an-coord1001's daemons (hive/oozie/presto)	[analytics]
06:48	<elukey>	run systemctl reset-failed monitor_refine_eventlogging_legacy_failure_flags.service on an-launcher1002	[analytics]
2020-09-14 §
14:36	<milimetric>	deployed eventstreams with new KafkaSSE version on staging, eqiad, codfw	[analytics]
2020-09-11 §
15:41	<milimetric>	restarted data quality stats bundles	[analytics]
01:32	<milimetric>	deployed small fix for hql of editors_bycountry load job	[analytics]
00:46	<milimetric>	deployed refinery source 0.0.136, refinery, and synced to HDFS	[analytics]
2020-09-09 §
10:11	<klausman>	Rebooting stat1005 for clearing GPU status and testing new DKMS driver (T260442)	[analytics]
07:25	<elukey>	restart varnishkafka-webrequest on cp5010 and cp5012, delivery reports errors happening since yesterday's network outage	[analytics]
2020-09-04 §
18:11	<milimetric>	aqs deploy went well! Geoeditors endpoint is live internally, data load job was successful, will submit pull request for public endpoint.	[analytics]
06:54	<joal>	Manually restart mediawiki-history-drop-snapshot after hive-partitions/hdfs-folders mismatch fix	[analytics]
06:08	<elukey>	reset-failed mediawiki-history-drop-snapshot on an-launcher1002 to clear icinga errors	[analytics]
01:52	<milimetric>	aborted aqs deploy due to cassandra error	[analytics]
2020-09-03 §
19:15	<milimetric>	finished deploying refinery and refinery-source, restarting jobs now	[analytics]
13:59	<milimetric>	edit-hourly-druid-wf-2020-08 fails consistently	[analytics]
13:56	<joal>	Kill-restart mediawiki-history-reduced oozie job into production queue	[analytics]
13:56	<joal>	rerun edit-hourly-druid-wf-2020-08 after failed attempt	[analytics]
2020-09-02 §
18:24	<milimetric>	restarting mediawiki history denormalize coordinator in production queue, due to failed 2020-08 run	[analytics]
08:37	<elukey>	run kafka preferred-replica-election on jumbo after jumbo1003's reimage to buster	[analytics]
2020-08-31 §
13:43	<elukey>	run kafka preferred-replica-election on Jumbo after jumbo1001's reimage	[analytics]
07:13	<elukey>	run kafka preferred-replica-election on Jumbo after jumbo1005's reimage	[analytics]
2020-08-28 §
14:25	<mforns>	deployed pageview whitelist with new wiki: ja.wikivoyage	[analytics]
14:18	<elukey>	run kafka preferred-replica-election on jumbo after the reimage of jumbo1006	[analytics]
07:21	<joal>	Manually add ja.wikivoyage to pageview allowlist to prevent alerts	[analytics]
2020-08-27 §
19:05	<mforns>	finished refinery deploy (ref v0.0.134)	[analytics]
18:41	<mforns>	starting refinery deploy (ref v0.0.134)	[analytics]
18:30	<mforns>	deployed refinery-source v0.0.134	[analytics]
13:29	<elukey>	restart jvm daemons on analytics1042, aqs1004, kafka-jumbo1001 to pick up new openjdk upgrades (canaries)	[analytics]