2021-07-20 §
20:30 <joal> rerun webrequest timed-out instances [analytics]
18:58 <mforns> starting refinery deployment [analytics]
18:40 <razzi> razzi@an-launcher1002:~$ sudo puppet agent --enable [analytics]
18:39 <razzi> razzi@an-master1001:/var/log/hadoop-hdfs$ sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues [analytics]
18:37 <razzi> razzi@an-master1002:~$ sudo -i puppet agent --enable [analytics]
18:34 <razzi> razzi@an-master1002:~$ sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues [analytics]
18:32 <razzi> razzi@an-master1002:~$ sudo systemctl start hadoop-yarn-resourcemanager.service [analytics]
18:31 <razzi> razzi@an-master1002:~$ sudo systemctl stop hadoop-yarn-resourcemanager.service [analytics]
18:22 <razzi> sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet [analytics]
18:21 <razzi> re-enable yarn queues by merging puppet patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/705732 [analytics]
17:27 <razzi> razzi@cumin1001:~$ sudo -i wmf-auto-reimage-host -p T278423 an-master1001.eqiad.wmnet [analytics]
17:17 <razzi> stop all hadoop processes on an-master1001 [analytics]
16:52 <razzi> starting hadoop processes on an-master1001 since they didn't fail over cleanly [analytics]
16:31 <razzi> sudo bash gid_script.bash on an-master1001 [analytics]
16:29 <razzi> razzi@alert1001:~$ sudo icinga-downtime -h an-master1001 -d 7200 -r "an-master1001 debian upgrade" [analytics]
16:25 <razzi> razzi@an-master1001:~$ sudo systemctl stop hadoop-mapreduce-historyserver [analytics]
16:25 <razzi> sudo systemctl stop hadoop-hdfs-zkfc.service on an-master1001 again [analytics]
16:25 <razzi> sudo systemctl stop hadoop-yarn-resourcemanager on an-master1001 again [analytics]
16:23 <razzi> sudo systemctl stop hadoop-hdfs-namenode on an-master1001 [analytics]
16:19 <razzi> razzi@an-master1001:~$ sudo systemctl stop hadoop-hdfs-zkfc [analytics]
16:19 <razzi> razzi@an-master1001:~$ sudo systemctl stop hadoop-yarn-resourcemanager [analytics]
16:18 <razzi> sudo systemctl stop hadoop-hdfs-namenode [analytics]
16:10 <razzi> razzi@cumin1001:~$ sudo transfer.py an-master1002.eqiad.wmnet:/home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz stat1004.eqiad.wmnet:/home/razzi/hdfs-namenode-fsimage [analytics]
16:03 <razzi> root@an-master1002:/srv/hadoop/name# tar -czf /home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz current [analytics]
15:57 <razzi> sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace [analytics]
15:52 <razzi> sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter [analytics]
15:37 <razzi> kill yarn applications: for jobId in $(yarn application -list | awk 'NR > 2 { print $1 }'); do yarn application -kill $jobId; done [analytics]
15:08 <razzi> sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues [analytics]
14:52 <razzi> sudo systemctl stop 'gobblin-*.timer' [analytics]
14:51 <razzi> sudo systemctl stop analytics-reportupdater-logs-rsync.timer [analytics]
14:47 <razzi> Disable jobs on an-launcher1002 (see https://phabricator.wikimedia.org/T278423#7190372) [analytics]
14:46 <razzi> razzi@an-launcher1002:~$ sudo puppet agent --disable 'razzi: upgrade hadoop masters to debian buster' [analytics]
08:32 <mforns> restarted webrequest bundle (messed up a coord when trying to rerun some failed hours) [analytics]
2021-07-17 §
08:54 <elukey> run "sudo find -type f -name '*.log*' -mtime +30 -delete" on an-coord1001:/var/log/hive to free space (root partition almost filled up) - T279304 [analytics]
2021-07-15 §
16:44 <ottomata> deploying refinery and refinery-source 0.1.15 for refine job fixes - T271232 [analytics]
13:39 <joal> Kill refine_event application_1623774792907_154469 to let manual run finish [analytics]
13:35 <joal> Kill currently running refine job (application_1623774792907_154014) [analytics]
11:20 <joal> Kill stuck refine application [analytics]
2021-07-14 §
17:39 <razzi> sudo cookbook sre.druid.roll-restart-workers public for https://phabricator.wikimedia.org/T283067 [analytics]
00:34 <razzi> razzi@an-test-druid1001:~$ sudo systemctl restart zookeeper [analytics]
00:33 <razzi> razzi@an-test-druid1001:~$ sudo systemctl restart druid-coordinator [analytics]
00:33 <razzi> razzi@an-test-druid1001:~$ sudo systemctl restart druid-broker [analytics]
00:28 <razzi> razzi@an-test-druid1001:~$ sudo systemctl restart druid-middlemanager [analytics]
00:24 <razzi> razzi@an-test-druid1001:~$ sudo systemctl restart druid-overlord [analytics]
00:24 <razzi> razzi@an-test-druid1001:~$ sudo systemctl restart druid-historical [analytics]
2021-07-13 §
19:29 <joal> move /wmf/data/raw/eventlogging --> /wmf/data/raw/eventlogging_camus and drop /wmf/data/raw/eventlogging_legacy/*/year=2021/month=07/day=13/hour=14 [analytics]
19:02 <razzi> razzi@cumin1001:~$ sudo cookbook sre.hadoop.roll-restart-workers analytics [analytics]
13:03 <joal> remove /wmf/gobblin/locks/event_default.lock to unlock gobblin event job [analytics]
2021-07-12 §
18:37 <joal> Move /wmf/data/raw/event to /wmf/data/raw/event_camus and /wmf/data/raw/event_gobblin to /wmf/data/raw/event [analytics]
18:36 <joal> Delete /year=2021/month=07/day=12/hour=14 of gobblin imported events [analytics]