2022-03-05 §
10:03 <elukey> restart hadoop-yarn-nodemanager on an-worker1132 (unhealthy node, reason Linux Container Executor reached unrecoverable exception) [analytics]
2022-03-04 §
17:46 <mforns> deployed Airflow to analytics instance to fix skein logs problem [analytics]
15:50 <mforns> deployed airflow in an-test-client1001 to test skein log fix [analytics]
05:19 <milimetric> rerunning monthly edit hourly druid oozie coordinator [analytics]
2022-03-03 §
17:48 <ottomata> roll restart aqs to pick up new MW history snapshot [analytics]
2022-03-01 §
18:38 <SandraEbele> sandra testing [analytics]
18:34 <razzi> demo irc logging to data eng team members [analytics]
10:19 <btullis> btullis@an-coord1002:/srv$ sudo rm -rf an-coord1001-backup/ (#T302777) [analytics]
09:48 <elukey> elukey@stat1004:~$ sudo kill `pgrep -u zpapierski` (offboarded user, puppet broken on the host) [analytics]
2022-02-28 §
16:00 <milimetric> refinery done deploying and syncing, new sqoop list is up [analytics]
15:01 <milimetric> deploying new wikis to sqoop list ahead of sqoop job starting in a few hours [analytics]
2022-02-25 §
17:00 <milimetric> rerunning webrequest-load-wf-text-2022-2-25-15 after confirming all false positive loss [analytics]
2022-02-23 §
23:00 <razzi> sudo maintain-views --table flaggedrevs --databases fiwiki on clouddb1014.eqiad.wmnet and clouddb1018.eqiad.wmnet for T302233 [analytics]
2022-02-22 §
10:37 <btullis> re-enabled puppet on an-launcher1002, having absented the network_internal druid load job [analytics]
09:30 <aqu> Deploying analytics/refinery on hadoop-test only. [analytics]
07:38 <elukey> systemctl reset-failed mediawiki-history-drop-snapshot on an-launcher1002 (opened since a week ago) [analytics]
07:30 <elukey> kill remaining processes of rhuang-ctr on stat1004 and an-test-client1001 (user offboarded, but still holding jupyter notebooks etc.). Puppet was broken trying to remove the user. [analytics]
2022-02-21 §
17:55 <elukey> kill remaining processes of rhuang-ctr on various stat nodes (user offboarded, but still holding jupyter notebooks etc.). Puppet was broken trying to remove the user. [analytics]
16:58 <mforns> Deployed refinery using scap, then deployed onto hdfs (aqs hourly airflow queries) [analytics]
2022-02-19 §
12:21 <elukey> stop puppet on an-launcher1002, stop timers for eventlogging_to_druid_network_flows_internal_{hourly,daily} since no data is coming to the Kafka topic (expected due to some work for the Marseille DC) and it keeps alarming [analytics]
2022-02-17 §
16:18 <mforns> deployed wikistats2 [analytics]
2022-02-16 §
14:13 <mforns> deployed airflow-dags to analytics instance [analytics]
2022-02-15 §
17:20 <ottomata> split anaconda-wmf into 2 packages: anaconda-wmf-base and anaconda-wmf. anaconda-wmf-base is installed on workers, anaconda-wmf on clients. The size of the package on workers is now much smaller. Installing throughout the cluster. relevant: T292699 [analytics]
2022-02-14 §
17:38 <razzi> razzi@an-test-client1001:~$ sudo systemctl reset-failed airflow-scheduler@analytics-test.service [analytics]
16:08 <razzi> sudo cookbook sre.ganeti.makevm --vcpus 4 --memory 8 --disk 50 eqiad_B datahubsearch1002 for T301383 [analytics]
2022-02-12 §
08:50 <elukey> truncate /var/log/auth.log to 1g on krb1001 to free space on root partition (original log saved under /srv) [analytics]
2022-02-11 §
15:06 <ottomata> set hive.warehouse.subdir.inherit.perms = false - T291664 [analytics]
2022-02-10 §
18:54 <ottomata> setting up research airflow-dags scap deployment, recreating airflow database and starting from scratch (fab okayed this) - T295380 [analytics]
16:48 <ottomata> deploying airflow analytics with lots of recent changes to airflow-dags repository [analytics]
2022-02-09 §
17:41 <joal> Deploy refinery onto HDFS [analytics]
17:05 <joal> Deploying refinery with scap [analytics]
16:39 <joal> Release refinery-source v0.1.25 to archiva [analytics]
2022-02-08 §
07:27 <elukey> restart hadoop-yarn-nodemanager on an-worker1115 (container executor reached unrecoverable exception, doesn't talk with the Yarn RM anymore) [analytics]
2022-02-07 §
18:43 <ottomata> manually installing airflow_2.1.4-py3.7-2_amd64.deb on an-test-client1001 [analytics]
14:38 <ottomata> merged Set spark maxPartitionBytes to hadoop dfs block size - T300299 [analytics]
12:17 <btullis> depooled aqs1009 [analytics]
11:59 <btullis> depooled aqs1008 [analytics]
11:41 <btullis> depooled aqs1007 [analytics]
11:03 <btullis> depooled aqs1006 [analytics]
10:22 <btullis> depooling aqs1005 [analytics]
2022-02-04 §
16:05 <elukey> unmask prometheus-mysqld-exporter.service and clean up the old @analytics + wmf_auto_restart units (service+timer) not used anymore on an-coord100[12] [analytics]
12:55 <joal> Rerun cassandra-daily-wf-local_group_default_T_pageviews_per_article_flat-2022-2-3 [analytics]
07:12 <elukey> `GRANT PROCESS, REPLICATION CLIENT ON *.* TO `prometheus`@`localhost` IDENTIFIED VIA unix_socket WITH MAX_USER_CONNECTIONS 5` on an-test-coord1001 to allow the prometheus exporter to gather metrics [analytics]
07:09 <elukey> cleanup wmf_auto_restart_prometheus-mysqld-exporter@analytics-meta on an-test-coord1001 and unmasked wmf_auto_restart_prometheus-mysqld-exporter (now used) [analytics]
07:03 <elukey> clean up wmf_auto_restart_prometheus-mysqld-exporter@matomo on matomo1002 (not used anymore, listed as failed) [analytics]
2022-02-03 §
19:35 <joal> Rerun virtualpageview-druid-monthly-wf-2022-1 [analytics]
19:32 <btullis> re-running the failed refine_event job as per email. [analytics]
19:27 <joal> Rerun virtualpageview-druid-daily-wf-2022-1-16 [analytics]
19:12 <joal> Kill druid indexation stuck task on Druid (from 2022-01-17T02:31) [analytics]
19:09 <joal> Kill druid-loading stuck yarn applications (3 HiveToDruid, 2 oozie launchers) [analytics]