2021-06-28 §
17:00 <elukey> apt-get reinstall llvm-gpu on stat100[5-8] - T285495 [analytics]
2021-06-25 §
08:01 <elukey> reboot an-worker1101 to unblock stuck GPU [analytics]
07:57 <elukey> execute "sudo /opt/rocm/bin/rocm-smi --gpureset -d 1" on an-worker1101 as an attempt to unblock the GPU [analytics]
2021-06-24 §
06:38 <elukey> drop hieradata/role/common/analytics_cluster/superset.yaml from puppet private repo (unused config, all the values duplicated in the new hiera config) [analytics]
06:34 <elukey> rename superset hiera role configs in puppet private repo (to match the role change done recently) + superset restart [analytics]
2021-06-23 §
17:56 <ottomata> enable canary events for NavigationTiming extension streams - https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/699789 [analytics]
15:30 <elukey> drop /reportupdater-queries on an-launcher1002 after https://gerrit.wikimedia.org/r/c/operations/puppet/+/701130 [analytics]
2021-06-22 §
14:46 <XioNoX> remove decom hosts from the analytics firewall filter on cr2-eqiad - T279429 [analytics]
14:37 <XioNoX> start updating analytics firewall rules to capirca generated ones on cr2-eqiad - T279429 [analytics]
14:28 <XioNoX> remove decom hosts from the analytics firewall filter on cr1-eqiad - T279429 [analytics]
14:12 <XioNoX> start updating analytics firewall rules to capirca generated ones on cr1-eqiad - T279429 [analytics]
2021-06-21 §
13:35 <razzi> sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1001-eqiad-wmnet an-master1002-eqiad-wmnet [analytics]
2021-06-18 §
06:37 <elukey> execute "sudo find -type f -name '*.log*' -mtime +30 -delete" on an-coord1001 to free space in the root partition [analytics]
2021-06-15 §
17:46 <razzi> remove hdfs namenode backup on stat1004 [analytics]
17:45 <razzi> enable puppet on an-launcher [analytics]
17:45 <razzi> sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues [analytics]
16:55 <razzi> sudo -i wmf-auto-reimage-host -p T278423 an-master1002.eqiad.wmnet [analytics]
16:53 <razzi> run uid script on an-master1002 [analytics]
16:33 <elukey> restart hadoop-yarn-resourcemanager on an-master1001 [analytics]
16:16 <razzi> sudo systemctl stop 'hadoop-*' on an-master1002 [analytics]
16:14 <razzi> sudo systemctl stop hadoop-* on an-master1001, then realize I meant to do this on an-master1002, so start hadoop-* [analytics]
16:11 <razzi> downtime an-master1002 [analytics]
15:55 <razzi> sudo transfer.py an-master1001.eqiad.wmnet:/srv/hadoop/backup/hdfs-namenode-snapshot-buster-reimage-2021-06-15.tar.gz stat1004.eqiad.wmnet:/home/razzi/hdfs-namenode-fsimage [analytics]
15:42 <razzi> tar -czf /srv/hadoop/backup/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz current on an-master1001 [analytics]
15:38 <razzi> backup /srv/hadoop/name/current to /home/razzi/hdfs-namenode-snapshot-buster-reimage-2021-06-15.tar.gz on an-master1001 [analytics]
15:33 <razzi> sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace [analytics]
15:27 <razzi> sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter [analytics]
15:25 <razzi> kill running yarn applications via for loop [analytics]
15:11 <razzi> sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues [analytics]
15:09 <razzi> disable puppet on an-masters [analytics]
15:08 <razzi> run puppet on an-masters to update capacity-scheduler.xml [analytics]
15:02 <razzi> disable puppet on an-masters [analytics]
15:01 <razzi> sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues to stop queues [analytics]
14:35 <razzi> disable jobs that use hadoop on an-launcher1002 following https://phabricator.wikimedia.org/T278423#7094641 [analytics]
2021-06-14 §
18:45 <ottomata> remove packages from hadoop common nodes: sudo cumin 'R:Class = profile::analytics::cluster::packages::common' 'apt-get -y remove python3-pandas python3-pycountry python3-numpy python3-tz' - T275786 [analytics]
18:43 <ottomata> remove packages from stat nodes: sudo cumin 'stat*' apt-get -y remove subversion mercurial tofrodos libwww-perl libcgi-pm-perl libjson-perl libtext-csv-xs-perl libproj-dev libboost-regex-dev libboost-system-dev libgoogle-glog-dev libboost-iostreams-dev libgdal-dev [analytics]
07:18 <joal> Rerun cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2021-6-11 [analytics]
2021-06-10 §
21:17 <razzi> sudo systemctl restart monitor_refine_eventlogging_analytics [analytics]
18:17 <razzi> sudo systemctl restart hadoop-mapreduce-historyserver [analytics]
17:24 <razzi> sudo systemctl restart hadoop-hdfs-namenode on an-master1002 [analytics]
17:24 <razzi> sudo systemctl restart hadoop-hdfs-zkfc on an-master1002 [analytics]
17:12 <razzi> sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet [analytics]
16:25 <razzi> rolling restart hadoop masters to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/698194 [analytics]
14:07 <ottomata> altered event.wmdebannerevent event.eventRate field to change type from BIGINT to DOUBLE - T282562 [analytics]
2021-06-08 §
16:56 <elukey> move away from dbstore1004 in favor of dbstore1007 in analytics CNAME/SRV records (will affect analytics-mysql and sqoop) [analytics]
13:42 <ottomata> roll restart an-conf zookeepers - T283067 [analytics]
13:22 <ottomata> roll restarting analytics presto-servers - T283067 [analytics]
06:08 <elukey> restart yarn nodemanager on analytics1075 to clear the un-healthy state after some days of downtime (one-off issue but let's keep an eye on it) [analytics]
2021-06-07 §
18:14 <ottomata> rolling restart of kafka jumbo brokers - T283067 [analytics]
17:53 <ottomata> rolling restart of kafka jumbo mirror makers - T283067 [analytics]