2021-07-06 §
17:33 <joal> Deploy refinery onto HDFS [analytics]
16:41 <joal> Deploy refinery for gobblin [analytics]
16:03 <joal> Kill webrequest_test oozie job [analytics]
15:55 <joal> Drop and recreate wmf_raw.webrequest table [analytics]
15:52 <joal> Moved camus and gobblin data for webrequest on analytics-test-hadoop [analytics]
15:48 <ottomata> deploying refinery to test cluster for webrequest_test gobblin job [analytics]
14:16 <ottomata> restarted aqs for July mw history snapshot deploy [analytics]
13:29 <joal> Run first manual empty job for webrequest_test on analytics-test-hadoop [analytics]
13:29 <joal> Clean gobblin state_store and data before starting webrequest_test on analytics-test-hadoop [analytics]
2021-07-03 §
19:57 <joal> rerun learning-features-actor-hourly-wf-2021-7-2-11 [analytics]
2021-07-02 §
13:47 <joal> Reset failed timer refinery-sqoop-mediawiki-private.service [analytics]
12:21 <joal> Replacing failed data with successful data generated when testing https://gerrit.wikimedia.org/r/702877 - wmf_raw.mediawiki_private_cu_changes [analytics]
00:04 <razzi> razzi@an-coord1002:~$ sudo mount -a [analytics]
00:04 <razzi> razzi@an-coord1002:~$ sudo umount /mnt/hdfs [analytics]
00:03 <razzi> razzi@an-coord1002:~$ sudo systemctl restart hive-metastore.service [analytics]
00:02 <razzi> razzi@an-coord1002:~$ sudo systemctl restart hive-server2.service [analytics]
2021-07-01 §
18:56 <razzi> razzi@authdns1001:~$ sudo authdns-update [analytics]
18:19 <razzi> razzi@an-coord1001:~$ sudo mount -a [analytics]
18:18 <razzi> razzi@an-coord1001:~$ sudo umount /mnt/hdfs [analytics]
18:17 <razzi> razzi@an-coord1001:~$ sudo systemctl restart presto-server.service [analytics]
18:16 <razzi> razzi@an-coord1001:~$ sudo systemctl restart hive-metastore.service [analytics]
18:16 <razzi> sudo systemctl restart hive-server2.service [analytics]
18:15 <razzi> sudo systemctl restart oozie on an-coord1001 for https://phabricator.wikimedia.org/T283067 [analytics]
16:38 <razzi> sudo authdns-update on ns0.wikimedia.org to apply https://gerrit.wikimedia.org/r/c/operations/dns/+/702689 [analytics]
2021-06-30 §
18:19 <razzi> unmount and remount /mnt/hdfs on an-test-client1001 for java security update [analytics]
2021-06-29 §
22:55 <razzi> sudo systemctl restart hive-server2 on an-test-coord1001.eqiad.wmnet for T283067 [analytics]
22:53 <razzi> sudo systemctl restart hive-metastore on an-test-coord1001.eqiad.wmnet for T283067 [analytics]
22:52 <razzi> sudo systemctl restart presto-server on an-test-coord1001.eqiad.wmnet for T283067 [analytics]
22:51 <razzi> sudo systemctl restart oozie on an-test-coord1001.eqiad.wmnet for T283067 [analytics]
13:31 <ottomata> deploying refinery for weekly train [analytics]
2021-06-28 §
17:00 <elukey> apt-get reinstall llvm-gpu on stat100[5-8] - T285495 [analytics]
2021-06-25 §
08:01 <elukey> reboot an-worker1101 to unblock stuck GPU [analytics]
07:57 <elukey> execute "sudo /opt/rocm/bin/rocm-smi --gpureset -d 1" on an-worker1101 as an attempt to unblock the GPU [analytics]
2021-06-24 §
06:38 <elukey> drop hieradata/role/common/analytics_cluster/superset.yaml from puppet private repo (unused config, all the values are duplicated in the new hiera config) [analytics]
06:34 <elukey> rename superset hiera role configs in puppet private repo (to match the role change done recently) + superset restart [analytics]
2021-06-23 §
17:56 <ottomata> enable canary events for NavigationTiming extension streams - https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/699789 [analytics]
15:30 <elukey> drop /reportupdater-queries on an-launcher1002 after https://gerrit.wikimedia.org/r/c/operations/puppet/+/701130 [analytics]
2021-06-22 §
14:46 <XioNoX> remove decom hosts from the analytics firewall filter on cr2-eqiad - T279429 [analytics]
14:37 <XioNoX> start updating analytics firewall rules to capirca generated ones on cr2-eqiad - T279429 [analytics]
14:28 <XioNoX> remove decom hosts from the analytics firewall filter on cr1-eqiad - T279429 [analytics]
14:12 <XioNoX> start updating analytics firewall rules to capirca generated ones on cr1-eqiad - T279429 [analytics]
2021-06-21 §
13:35 <razzi> sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet [analytics]
2021-06-18 §
06:37 <elukey> execute "sudo find -type f -name '*.log*' -mtime +30 -delete" on an-coord1001 to free space in the root partition [analytics]
2021-06-15 §
17:46 <razzi> remove hdfs namenode backup on stat1004 [analytics]
17:45 <razzi> enable puppet on an-launcher [analytics]
17:45 <razzi> sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues [analytics]
16:55 <razzi> sudo -i wmf-auto-reimage-host -p T278423 an-master1002.eqiad.wmnet [analytics]
16:53 <razzi> run uid script on an-master1002 [analytics]
16:33 <elukey> restart hadoop-yarn-resourcemanager on an-master1001 [analytics]
16:16 <razzi> sudo systemctl stop 'hadoop-*' on an-master1002 [analytics]