2023-11-06 §
18:38 <milimetric> deployed refinery-source, starting to deploy analytics airflow dags [analytics]
13:57 <stevemunene> roll-restart druid public workers to pick up a new zookeeper node druid1009. T336042 [analytics]
13:32 <stevemunene> restart zookeeper leader to pick up new host druid1009 T336042 [analytics]
13:25 <stevemunene> stop and disable zookeeper on druid1004 T336042 [analytics]
13:19 <stevemunene> disable puppet on druid1004 and druid10[09-11] to onboard new druid1009 to the ZooKeeper cluster for the `druid-public-eqiad` cluster [analytics]
2023-11-01 §
15:58 <stevemunene> powercycle stat1008, host is frozen/stuck in an unresponsive state [analytics]
2023-10-31 §
09:26 <brouberol> I replaced the self-signed skein certificate with one issued by our cfssl PKI on an-test1002 - T329398 [analytics]
2023-10-26 §
16:18 <stevemunene> roll-restart druid public workers to pick up new zookeeper hosts. T336042 [analytics]
15:29 <stevemunene> stop zookeeper on druid1005, current leader for the `druid-public-eqiad` cluster; this will trigger the election of a new leader T336042 [analytics]
10:18 <stevemunene> restart zookeeper leader to pick up new host druid1011 T336042 [analytics]
09:18 <stevemunene> stop zookeeper on druid1006 T336042 [analytics]
08:48 <brouberol> sudo cookbook sre.hosts.reimage --os bullseye -t T348495 kafka-jumbo1009 [analytics]
08:06 <brouberol> sudo cookbook sre.hosts.reimage --os bullseye -t T348495 kafka-jumbo1008 [analytics]
2023-10-24 §
16:46 <xcollazo> Deploying latest DAGs to analytics Airflow instance [analytics]
12:41 <joal> Drop wmf.referrer_daily hive table and data [analytics]
10:07 <btullis> transferring snapshot s2.2023-10-23--01-34-18 from dbprov1004 to dbstore1007:/srv/sqldata.s2 [analytics]
10:02 <btullis> stopping and deleting s2 on dbstore1007. [analytics]
2023-10-23 §
10:14 <brouberol> sudo cookbook sre.hosts.decommission -t T336044 kafka-jumbo1001.eqiad.wmnet [analytics]
10:11 <btullis> deploying multiple spark shufflers to the test cluster for T344910 [analytics]
09:58 <brouberol> sudo cookbook sre.hosts.decommission -t T336044 kafka-jumbo1002.eqiad.wmnet [analytics]
09:47 <btullis> restarting krb5-kdc.service and krb5-admin-server.service on krb1001 and re-enabling puppet for T346135 [analytics]
09:10 <btullis> root@krb1001:~# systemctl stop krb5-kdc.service krb5-admin-server.service [analytics]
09:09 <btullis> disabling puppet on krb1001 for T346135 [analytics]
08:53 <brouberol> sudo cookbook sre.hosts.decommission -t T336044 kafka-jumbo1004.eqiad.wmnet [analytics]
08:28 <brouberol> sudo cookbook sre.hosts.decommission -t T336044 kafka-jumbo1005.eqiad.wmnet - T336044 [analytics]
2023-10-19 §
19:58 <xcollazo> ran "sudo -u hdfs hdfs dfs -cp /user/xcollazo/artifacts/spark-3.3.2-assembly.zip /user/spark/share/lib/" and "sudo -u hdfs hdfs dfs -chmod o+r /user/spark/share/lib/spark-3.3.2-assembly.zip" to make the Spark 3.3.2 assembly available for other folks. [analytics]
19:54 <xcollazo> ran "sudo -u hdfs hdfs dfs -rm /user/spark/share/lib/spark-3.1.2-assembly.jar.backup" to remove old spark assembly backup from May 25 2023. [analytics]
19:52 <xcollazo> ran "sudo -u hdfs hdfs dfs -rm /user/spark/share/lib/spark-3.1.2-assembly.jar.bak" to remove old spark assembly backup from Jun 13 2023. [analytics]
15:22 <brouberol> The kafka service has been stopped on kafka-jumbo100[1-6] - T336044 [analytics]
15:04 <brouberol> sudo cumin --batch-size 1 --batch-sleep 60 'kafka-jumbo100[1-6].eqiad.wmnet' 'sudo systemctl stop kafka.service' - T336044 [analytics]
15:02 <brouberol> disabling puppet on kafka-jumbo100[1-6] to make sure kafka isn't restarted - T336044 [analytics]
12:13 <brouberol> disabling puppet on kafka-jumbo nodes so we can merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/966497 [analytics]
09:42 <btullis> re-running airflow jobs for missing webrequest data on hadoop-test [analytics]
2023-10-18 §
18:03 <stevemunene> revert Add analytics-wmde service user to the Yarn production queue T340648 [analytics]
17:43 <tchin> deploying mw-page-content-change-enrich [analytics]
16:53 <stevemunene> Add analytics-wmde service user to the Yarn production queue T340648 [analytics]
09:14 <btullis> rebooting stat100[6-7] [analytics]
09:07 <btullis> rebooting stat1004 [analytics]
07:01 <aqu> Started deploy [airflow-dags/analytics@5dcce3b]: Add missing MR in yesterday weekly train [analytics]
2023-10-17 §
16:17 <btullis> restarting hadoop-yarn-nodemanager on an-test-worker1001 [analytics]
14:01 <tchin> deploying airflow analytics [analytics]
13:39 <tchin> deploying refinery [analytics]
12:56 <btullis> deploying multiple spark shufflers to the test cluster [analytics]
09:51 <btullis> re-enabling all previously paused dags [analytics]
09:50 <btullis> restarting all airflow schedulers after rebooting an-db1001 [analytics]
09:10 <btullis> pausing both active dags on the analytics_product airflow instance [analytics]
09:09 <btullis> pausing all 7 active dags on airflow-platform_eng airflow instance [analytics]
09:07 <btullis> pausing all 3 active dags on airflow-research instance [analytics]
09:07 <btullis> pausing all 28 active airflow dags on airflow-search instance [analytics]
09:03 <btullis> pausing all airflow dags on analytics instance [analytics]