2022-03-29 §
19:16 <joal> Drop/recreate wmf_raw.webrequest for schema change (high-entropy CH-UA) [analytics]
19:13 <mforns> starting refinery deployment (regular weekly train) [analytics]
19:11 <joal> kill webrequest-load oozie bundle for webrequest schema change [analytics]
17:13 <razzi> razzi@cumin1001:~$ sudo cookbook sre.hosts.downtime an-tool1005.eqiad.wmnet -D 1 -r 'Testing deploy of superset 1.4.2 to staging' [analytics]
15:38 <ntsako> Stopped geoeditor Airflow DAGs to check on data quality [analytics]
14:13 <btullis> correction: restarted hadoop-yarn-nodemanager.service on an-worker1128 [analytics]
14:13 <btullis> restarted hadoop-yarn-nodemanager.service on an-worker1238 [analytics]
2022-03-24 §
11:15 <btullis> roll-restarting kafka-jumbo brokers T300626 [analytics]
2022-03-21 §
18:10 <razzi> sudo systemctl restart jupyter-bearloga-singleuser on stat1008 [analytics]
2022-03-17 §
17:10 <ottomata> restart webrequest and pageview_actor data purge - https://gerrit.wikimedia.org/r/c/operations/puppet/+/771389 [analytics]
14:07 <btullis> shutdown analytics1063 and analytics1067 with 120 minutes of downtime T303151 [analytics]
06:46 <elukey> kill remaining hanging processes for ppche*lko and accra*ze on an-test-client1001 to allow users offboard (puppet broken) [analytics]
2022-03-16 §
19:14 <ottomata> deploying refinery to hadoop-test cluster with new gobblin-wmf-core jar [analytics]
18:00 <razzi> sudo cookbook sre.hosts.downtime -D 3 -r 'Setting up karapace for the first time' karapace1001.eqiad.wmnet [analytics]
17:57 <btullis> restarted mediawiki-history-drop-snapshot service on an-launcher1002 [analytics]
16:03 <aqu> analytics/refinery - scap deploy "Migrate session_length/daily from Oozie to Airflow" [analytics]
10:26 <btullis> rerunning failed mediawiki_structured_task_article_link_suggestion_interaction refine job [analytics]
2022-03-15 §
22:16 <razzi> upload karapace_2.1.3-py3.7-1_amd64.deb to apt.wikimedia.org [analytics]
19:58 <razzi> upload karapace_2.1.3-py3.7-0_amd64.deb to apt.wikimedia.org [analytics]
17:24 <ottomata> also change stats uid and gid to 918 on an-web1001 - T291384 [analytics]
14:35 <ottomata> change stats uid and gid on all stat boxes to 918 - T291384 [analytics]
13:59 <ottomata> roll restarting kafka jumbo brokers to set max.incremental.fetch.session.cache.slots=2000 - T303324 [analytics]
2022-03-14 §
21:05 <razzi> `sudo kill -9 15674` to stop unresponsive hive query [analytics]
2022-03-09 §
21:05 <ottomata> fix group ownership of cchen.db/new_editors/cohort=2021-12 after reverting T291664 - sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /user/hive/warehouse/cchen.db/new_editors/cohort=2021-12 [analytics]
18:33 <ottomata> fix group ownership of wmf_product.db/new_editors/cohort=2021-12 after reverting T291664 - sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /user/hive/warehouse/wmf_product.db/new_editors/cohort=2021-12 [analytics]
18:32 <ottomata> fix group ownership of wmf_product.db/global_markets_pageviews/year=2022/month=2 after reverting T291664 - sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /user/hive/warehouse/wmf_product.db/global_markets_pageviews/year=2022/month=2 [analytics]
18:19 <btullis> btullis@ganeti1024:~$ sudo gnt-instance start karapace1001.eqiad.wmnet (T301562) [analytics]
16:16 <ottomata> fix group ownership of wmf_product.db/pageviews_corrected/year=2022/month=2 after reverting T291664 - sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /user/hive/warehouse/wmf_product.db/pageviews_corrected/year=2022/month=2 [analytics]
2022-03-08 §
13:31 <ottomata> restarted webrequest-load oozie bundle as 0073173-220113112502223-oozie-oozi-B starting at 2022-03-08T12:00Z [analytics]
13:09 <ottomata> killing and rerunning webrequest-load-text-wf for webrequest_source=text/year=2022/month=3/day=7/hour=17, it was stuck in add_partition task as SUSPENDED, not sure why. [analytics]
12:47 <btullis> roll-restarting druid-analytics T300626 [analytics]
12:08 <btullis> roll-restarting druid-public T300626 [analytics]
11:21 <btullis> roll-restarting druid-test T300626 [analytics]
11:00 <btullis> roll-restarting aqs T300626 [analytics]
10:57 <btullis> restarted archiva T300626 [analytics]
2022-03-07 §
19:14 <ottomata> sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /wmf/data/wmf/*/hourly/year=2022/month=3/day=7 to make sure perms are fixed after revert of T291664 [analytics]
19:13 <ottomata> sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /wmf/data/wmf/virtualpageview/hourly/year=2022/month=3/day=7 - revert of T291664 [analytics]
18:45 <ottomata> sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /wmf/data/wmf/mediacounts/year=2022/month=3/day=7 [analytics]
18:37 <ottomata> sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /wmf/data/wmf/webrequest/webrequest_source=text/year=2022/month=3/day=7 - after reverting - T291664 [analytics]
18:34 <ottomata> restarting hive-server2 on an-coord1001 to revert hive.warehouse.subdir.inherit.perms change - T291664 [analytics]
14:44 <btullis> failing back hive services to an-coord1001 [analytics]
13:09 <aqu_> About to deploy analytics/refinery - Migrate wikidata/item_page_link/weekly from Oozie to Airflow [analytics]
12:45 <aqu_> About to deploy airflow-dags/analytics - Migrates wikidata/item_page_link [analytics]
12:10 <btullis> restarted hive-server2 process on an-coord1001 [analytics]
11:52 <btullis> obtaining heap dump: `hive@an-coord1001:/srv/hive-tmp$ jmap -dump:format=b,file=hive_server2_heap_T303168.bin 16971` [analytics]
11:51 <btullis> obtaining summary of heap objects and sizes: `hive@an-coord1001:/srv/hive-tmp$ jmap -histo:live 16971 > hive-object-storage-and-sizes.T303168.txt` [analytics]
11:38 <btullis> failing over hive to an-coord1001 T303168 [analytics]
2022-03-05 §
10:03 <elukey> restart hadoop-yarn-nodemanager on an-worker1132 (unhealthy node, reason Linux Container Executor reached unrecoverable exception) [analytics]
2022-03-04 §
17:46 <mforns> deployed Airflow to analytics instance to fix skein logs problem [analytics]
15:50 <mforns> deployed airflow in an-test-client1001 to test skein log fix [analytics]