2021-02-26
07:29 <elukey> added journalnode partition to all hadoop workers in the Analytics cluster that did not have it [analytics]
07:01 <elukey> reboot an-worker1099 to clear out kernel soft lockup errors [analytics]
06:59 <elukey> restart datanode on an-worker1099 - soft lockup kernel errors [analytics]
2021-02-25
17:04 <razzi> rebalance kafka partitions for webrequest_upload partition 3 [analytics]
13:36 <elukey> drop /srv/backup/wikistats from thorium [analytics]
13:35 <elukey> drop /srv/backup/backup_wikistats_1 from thorium [analytics]
11:14 <elukey> add an-worker111[7,8] to Analytics Hadoop (were previously backup worker nodes) [analytics]
08:50 <elukey> move analytics-privatedata/search/product to fixed gid/uid on all buster nodes (including airflow/stat100x/launcher) [analytics]
2021-02-24
19:16 <ottomata> service hadoop-yarn-nodemanager start on an-worker1112 [analytics]
16:03 <milimetric> deployed refinery [analytics]
14:09 <elukey> roll restart druid brokers on druid public to pick up caffeine cache settings [analytics]
14:03 <elukey> roll restart druid brokers on druid analytics to pick up caffeine cache settings [analytics]
11:08 <elukey> restart druid-broker on an-druid1001 (used by Turnilo) with caffeine cache [analytics]
09:01 <elukey> roll restart druid brokers on druid public - locked [analytics]
07:47 <elukey> change gid/uid for druid + roll restart of all druid nodes [analytics]
2021-02-23
21:20 <ottomata> started nodemanager on an-worker1112 [analytics]
21:15 <razzi> rebalance kafka partitions for webrequest_upload partition 2 [analytics]
19:31 <elukey> roll out new uid/gid for mapred/druid/analytics/yarn/hdfs for all buster nodes (no op for stretch) [analytics]
17:47 <elukey> change uid/gid for yarn/mapred/analytics/hdfs/druid on stat100x, an-presto100x [analytics]
15:57 <elukey> an-launcher1002's timers restored [analytics]
15:28 <elukey> stop timers on an-launcher1002 to change gid/uid for yarn/hdfs/mapred/analytics/druid and to reboot for kernel updates [analytics]
15:23 <elukey> deploy new uid/gid scheme for yarn/mapred/analytics/hdfs/druid on an-tool100[8,9] [analytics]
15:22 <elukey> deploy new uid/gid scheme for yarn/mapred/analytics/hdfs/druid on an-airflow1001, an-test* buster nodes [analytics]
15:05 <klausman> an-master1001 ~ $ sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp analytics-privatedata-users /wmf/data/raw/webrequest/webrequest_text/hourly/2021/02/22/01/webrequest* [analytics]
14:51 <elukey> drop /srv/backup-1007 on stat1008 to free space [analytics]
2021-02-22
19:27 <ottomata> restart oozie on an-coord1001 to pick up new spark share lib without hadoop jars - T274384 [analytics]
14:38 <ottomata> upgrade spark2 on analytics cluster to 2.4.4-bin-hadoop2.6-5~wmf0 (hadoop jars removed) - T274384 [analytics]
14:12 <ottomata> upgrade spark2 on an-coord1001 to 2.4.4-bin-hadoop2.6-5~wmf0 (hadoop jars removed), will remove and automatically re-add spark-2.4.4-assembly.zip in hdfs after running puppet here [analytics]
14:07 <ottomata> upgrade spark2 on stat1004 to 2.4.4-bin-hadoop2.6-5~wmf0 (hadoop jars removed) [analytics]
09:01 <elukey> reboot stat1005/stat1008 for kernel upgrades [analytics]
2021-02-19
15:53 <elukey> restart oozie again to test another setting for role/admins [analytics]
15:43 <ottomata> installing spark 2.4.4 without hadoop jars on analytics test cluster - T274384 [analytics]
15:31 <elukey> restart oozie to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/665352 [analytics]
14:34 <joal> rerun mobile_apps-uniques-daily-wf-2021-2-18 [analytics]
09:16 <elukey> stop and decom the hadoop backup cluster [analytics]
2021-02-18
18:38 <razzi> rebalance kafka partitions for webrequest_upload partition 1 [analytics]
17:27 <elukey> an-coord1002 back in service with raid1 configured [analytics]
15:48 <elukey> stop hive/mysql on an-coord1002 as precautionary step to rebuild the md array [analytics]
13:10 <elukey> failover analytics-hive to an-coord1001 after maintenance (DNS change) [analytics]
11:32 <elukey> restart hive daemons on an-coord1001 to pick up new parquet settings [analytics]
10:07 <elukey> hive failover to an-coord1002 to apply new hive settings to an-coord1001 [analytics]
10:00 <elukey> restart hive daemons on an-coord1002 (standby coord) to pick up new default parquet file format change [analytics]
09:46 <elukey> upgrade presto to 0.246-wmf on an-coord1001, an-presto*, stat100x [analytics]
2021-02-17
17:44 <razzi> rebalance kafka partitions for webrequest_upload partition 0 [analytics]
16:14 <razzi> rebalance kafka partitions for eqiad.mediawiki.api-request [analytics]
07:04 <elukey> reboot stat1004/stat1006/stat1007 for kernel upgrades [analytics]
2021-02-16
22:31 <razzi> rebalance kafka partitions for codfw.mediawiki.api-request [analytics]
17:44 <razzi> rebalance kafka partitions for netflow [analytics]
17:42 <razzi> rebalance kafka partitions for atskafka_test_webrequest_text [analytics]
07:32 <elukey> restart hadoop daemons on an-worker1099 after reconfiguring a new disk [analytics]