2020-09-09 §
10:11 <klausman> Rebooting stat1005 for clearing GPU status and testing new DKMS driver (T260442) [analytics]
07:25 <elukey> restart varnishkafka-webrequest on cp5010 and cp5012, delivery reports errors happening since yesterday's network outage [analytics]
2020-09-04 §
18:11 <milimetric> aqs deploy went well! Geoeditors endpoint is live internally, data load job was successful, will submit pull request for public endpoint. [analytics]
06:54 <joal> Manually restart mediawiki-history-drop-snapshot after hive-partitions/hdfs-folders mismatch fix [analytics]
06:08 <elukey> reset-failed mediawiki-history-drop-snapshot on an-launcher1002 to clear icinga errors [analytics]
01:52 <milimetric> aborted aqs deploy due to cassandra error [analytics]
2020-09-03 §
19:15 <milimetric> finished deploying refinery and refinery-source, restarting jobs now [analytics]
13:59 <milimetric> edit-hourly-druid-wf-2020-08 fails consistently [analytics]
13:56 <joal> Kill-restart mediawiki-history-reduced oozie job into production queue [analytics]
13:56 <joal> rerun edit-hourly-druid-wf-2020-08 after failed attempt [analytics]
2020-09-02 §
18:24 <milimetric> restarting mediawiki history denormalize coordinator in production queue, due to failed 2020-08 run [analytics]
08:37 <elukey> run kafka preferred-replica-election on jumbo after jumbo1003's reimage to buster [analytics]
2020-08-31 §
13:43 <elukey> run kafka preferred-replica-election on Jumbo after jumbo1001's reimage [analytics]
07:13 <elukey> run kafka preferred-replica-election on Jumbo after jumbo1005's reimage [analytics]
2020-08-28 §
14:25 <mforns> deployed pageview whitelist with new wiki: ja.wikivoyage [analytics]
14:18 <elukey> run kafka preferred-replica-election on jumbo after the reimage of jumbo1006 [analytics]
07:21 <joal> Manually add ja.wikivoyage to pageview allowlist to prevent alerts [analytics]
2020-08-27 §
19:05 <mforns> finished refinery deploy (ref v0.0.134) [analytics]
18:41 <mforns> starting refinery deploy (ref v0.0.134) [analytics]
18:30 <mforns> deployed refinery-source v0.0.134 [analytics]
13:29 <elukey> restart jvm daemons on analytics1042, aqs1004, kafka-jumbo1001 to pick up new openjdk upgrades (canaries) [analytics]
2020-08-25 §
15:47 <elukey> restart mariadb@analytics_meta on db1108 to apply a replication filter (exclude superset_staging database from replication) [analytics]
06:35 <elukey> restart mediawiki-history-drop-snapshot on an-launcher1002 to check that it works [analytics]
2020-08-24 §
06:50 <joal> Dropping wikitext-history snapshots 2020-04 and 2020-05, keeping two (2020-06 and 2020-07), to free space in hdfs [analytics]
2020-08-23 §
19:34 <nuria> deleted 1.2 TB from hdfs://analytics-hadoop/user/analytics/.Trash/200811000000 [analytics]
19:31 <nuria> deleted 1.2 TB from hdfs://analytics-hadoop/user/nuria/.Trash/* [analytics]
19:26 <nuria> deleted 300G from hdfs://analytics-hadoop/user/analytics/.Trash/200814000000 [analytics]
19:25 <nuria> deleted 1.2 TB from hdfs://analytics-hadoop/user/analytics/.Trash/200808000000 [analytics]
2020-08-20 §
16:49 <joal> Kill restart webrequest-load bundle to move it to production queue [analytics]
2020-08-14 §
09:13 <fdans> restarting refine to apply T257860 [analytics]
2020-08-13 §
16:13 <fdans> restarting webrequest bundle [analytics]
14:44 <fdans> deploying refinery [analytics]
14:13 <fdans> updating refinery source symlinks [analytics]
2020-08-11 §
17:36 <ottomata> refine with refinery-source 0.0.132 and merge_with_hive_schema_before_read=true - T255818 [analytics]
14:52 <ottomata> scap deploy refinery to an-launcher1002 to get camus wrapper script changes [analytics]
2020-08-06 §
14:47 <fdans> deploying refinery [analytics]
08:07 <elukey> roll restart druid-brokers (on both clusters) to pick up new changes for monitoring [analytics]
2020-08-05 §
13:04 <elukey> restart yarn resource managers on an-master100[12] to pick up new Yarn settings - https://gerrit.wikimedia.org/r/c/operations/puppet/+/618529 [analytics]
13:03 <elukey> set yarn_scheduler_minimum_allocation_mb = 1 (was zero) in Hadoop to work around a Flink 1.1 issue (namely it doesn't work if the value is <= 0) [analytics]
09:32 <elukey> set ticket max renewable lifetime to 7d on all kerberos clients (was zero, the default) [analytics]
2020-08-04 §
08:30 <elukey> resume druid-related oozie coordinator jobs via Hue (after druid upgrade) [analytics]
08:28 <elukey> started netflow kafka supervisor on Druid Analytics (after upgrade) [analytics]
08:19 <elukey> restore systemd timers for druid jobs on an-launcher1002 (after druid upgrade) [analytics]
07:33 <elukey> stop systemd timers related to druid on an-launcher1002 [analytics]
07:29 <elukey> stop kafka supervisor for netflow on Druid Analytics (prep step for druid upgrade) [analytics]
07:00 <elukey> suspend all druid-related coordinators in Hue as prep step for upgrade [analytics]
2020-08-03 §
09:53 <elukey> move all druid-related systemd timers to spark client mode - T254493 [analytics]
08:07 <elukey> roll restart aqs on aqs* to pick up new druid settings [analytics]
2020-08-01 §
13:22 <joal> Rerun cassandra-monthly-wf-local_group_default_T_unique_devices-2020-7 to load missing data (email with bug description sent to list) [analytics]
2020-07-31 §
14:46 <mforns> restarted webrequest oozie bundle [analytics]