2023-07-10
14:02 <btullis> powered off an-worker1145 for T341481 [analytics]
10:55 <btullis> `sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet` on an-master1001 [analytics]
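As a reading aid for the failover entries in this log (not part of the log itself): with `hdfs haadmin -failover`, the first service ID is the NameNode to demote to standby and the second is the one to promote to active. A minimal sketch that assembles the command from the entry above:

```shell
# Hedged sketch of the HA failover pattern used above: the first service ID
# becomes standby, the second becomes active.
from=an-master1002-eqiad-wmnet   # NameNode to demote to standby
to=an-master1001-eqiad-wmnet     # NameNode to promote to active
echo "sudo -u hdfs /usr/bin/hdfs haadmin -failover ${from} ${to}"
```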
2023-07-07
09:56 <btullis> `sudo systemctl start hadoop-hdfs-namenode.service` on an-master1001 [analytics]
09:28 <stevemunene> running sre.hadoop.roll-restart-masters to restart the masters and completely remove any reference to analytics[1058-1069] T317861 [analytics]
09:15 <stevemunene> run puppet on hadoop masters to pick up changes from recently decommissioned hosts [analytics]
08:12 <elukey> wipe kafka-test cluster (data + zookeeper config) to start clean after the issue that happened yesterday [analytics]
2023-07-06
14:51 <elukey> upgraded zookeeper-test1002 to bookworm, but its metadata got re-initialized as well (my bad for this) [analytics]
14:30 <stevemunene> decommission analytics1069.eqiad.wmnet T341209 [analytics]
14:19 <stevemunene> decommission analytics1068.eqiad.wmnet T341208 [analytics]
14:06 <stevemunene> decommission analytics1067.eqiad.wmnet T341207 [analytics]
13:13 <stevemunene> decommission analytics1066.eqiad.wmnet T341206 [analytics]
13:02 <stevemunene> decommission analytics1065.eqiad.wmnet T341205 [analytics]
12:35 <stevemunene> decommission analytics1064.eqiad.wmnet T341204 [analytics]
11:18 <stevemunene> decommission analytics1063.eqiad.wmnet T339201 [analytics]
10:40 <stevemunene> decommission analytics1062.eqiad.wmnet T339200 [analytics]
09:57 <stevemunene> decommission analytics1061.eqiad.wmnet T339199 [analytics]
07:23 <stevemunene> run puppet agent on hadoop masters [analytics]
07:21 <stevemunene> Remove analytics1064_1069 from hdfs net_topology [analytics]
07:17 <stevemunene> stop hadoop-hdfs-datanode service on analytics[1061-1069] Preparing to decommission the hosts - T317861 [analytics]
07:11 <stevemunene> disable-puppet on analytics[1061-1069] Preparing to decommission the hosts - T317861 [analytics]
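The 07:11–07:17 entries above amount to a per-host preparation loop run before the decommission cookbook. A hypothetical sketch (host list and task ID from the log; the loop structure is assumed, not taken from the actual tooling):

```shell
# Hypothetical sketch of the prep sequence logged above (T317861):
# for each host slated for decommission, disable puppet, then stop the
# HDFS datanode, before running the decommission cookbook.
for i in $(seq 1061 1069); do
  host="analytics${i}.eqiad.wmnet"
  echo "disable-puppet on ${host}"
  echo "stop hadoop-hdfs-datanode on ${host}"
done
```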
2023-07-05
14:36 <stevemunene> enable puppet on analytics1069 to get the host back into puppetdb and hence allow the decommission cookbook run later [analytics]
11:47 <btullis> restarted archiva for T329716 [analytics]
11:45 <btullis> restarted hive-servers2 and hive-metastore service on an-coord1002 [analytics]
11:40 <btullis> roll-restarting kafka-jumbo brokers for T329716 [analytics]
11:01 <btullis> roll-restarting the presto workers for T329716 [analytics]
10:20 <btullis> deploying updated spark3 defaults to disable the `spark.shuffle.useOldFetchProtocol` option for T332765 [analytics]
09:45 <btullis> failing back namenode to an-master1001 with `sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1002-eqiad-wmnet an-master1001-eqiad-wmnet` on an-master1001 [analytics]
09:38 <btullis> re-enabled gobblin jobs on an-launcher1002 [analytics]
09:03 <btullis> switching yarn shuffler - running puppet on 87 worker nodes [analytics]
08:44 <btullis> disabled gobblin and spark jobs on an-launcher for T332765 [analytics]
08:33 <btullis> disabled gobblin jobs with https://gerrit.wikimedia.org/r/c/operations/puppet/+/935425 [analytics]
08:27 <btullis> roll-restarting hadoop workers in the test cluster [analytics]
2023-07-04
13:55 <btullis> roll-restarting the eventgate-analytics-external worker pods in eqiad with: `helmfile -e eqiad --state-values-set roll_restart=1 sync` [analytics]
10:31 <btullis> beginning hdfs datanode rolling restart with `sudo cumin -b 2 -p 80 -s 120 A:hadoop-worker 'systemctl restart hadoop-hdfs-datanode'` [analytics]
10:10 <btullis> `sudo systemctl start hadoop-hdfs-namenode` on an-master1001 [analytics]
10:00 <btullis> roll-restarting journal nodes with 30 seconds between each one: `sudo cumin -b 1 -p 100 -s 30 A:hadoop-hdfs-journal 'systemctl restart hadoop-hdfs-journalnode'` [analytics]
09:29 <btullis> restarting the yarn nodemanagers with `sudo cumin -b 5 -p 80 -s 30 A:hadoop-worker 'systemctl restart hadoop-yarn-nodemanager'` [analytics]
08:57 <btullis> executing `cookbook sre.hadoop.roll-restart-workers analytics` [analytics]
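The cumin invocations in this day's entries all use the same three throttling flags; as a reading aid (flag meanings from cumin's CLI, values from the 10:31 entry): `-b` is the batch size, `-p` the success-percentage threshold before aborting, and `-s` the sleep in seconds between batches. A sketch assembling that command:

```shell
# Sketch: assemble the datanode rolling-restart command from the 10:31 entry,
# with cumin's throttling flags spelled out (batch size 2, 80% success
# threshold, 120 s sleep between batches).
batch_size=2
success_pct=80
batch_sleep=120
echo "sudo cumin -b ${batch_size} -p ${success_pct} -s ${batch_sleep}" \
     "A:hadoop-worker 'systemctl restart hadoop-hdfs-datanode'"
```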
2023-07-03
12:52 <btullis> restarting the aqs service to pick up mediawiki history snapshot for June [analytics]
2023-06-29
13:44 <btullis> upgrading airflow on an-launcher1002 to version 2.6.1 [analytics]
2023-06-28
13:25 <btullis> upgrading an-test-worker1003 to bullseye, after upgrading firmware [analytics]
13:08 <btullis> upgrading idrac firmware of an-test-worker1003 via the cookbook for T329363 [analytics]
2023-06-27
14:53 <mforns> deployed airflow analytics to unbreak DataHub's Druid ingestion [analytics]
13:32 <joal> Rerun druid_load_pageviews_hourly_aggregated_daily after deploy [analytics]
13:25 <joal> Deploy Airflow [analytics]
11:10 <joal> Deploy refinery onto HDFS [analytics]
11:01 <stevemunene> upgrading an-test-worker1003 to bullseye, keeping `/srv/hadoop` intact [analytics]
10:55 <joal> Deploy refinery using scap [analytics]
09:42 <stevemunene> run puppet on hadoop-masters; this does a refresh of the hdfs nodes [analytics]