2022-01-13 §
11:59 <btullis> stopped eventlogging service on eventlog1003 with 1 hour's downtime. [analytics]
11:52 <btullis> Upgrading hive packages on stat1005 [analytics]
11:26 <btullis> restarted hive-metastore and hive-server2 on an-coord1001 after running puppet. [analytics]
11:23 <btullis> btullis@an-coord1001:~$ sudo apt install hive hive-hcatalog hive-jdbc hive-metastore hive-server2 oozie oozie-client [analytics]
11:18 <btullis> btullis@an-coord1002:~$ sudo systemctl restart hive-metastore hive-server2 [analytics]
09:53 <btullis> DNS change deployed, failing over hive to an-coord1002. [analytics]
09:42 <btullis> btullis@an-coord1002:~$ sudo apt install hive hive-hcatalog hive-jdbc hive-metastore hive-server2 oozie-client [analytics]
08:45 <joal> Kill-restart wikidata-json_entity-weekly-coord after deploy [analytics]
2022-01-12 §
21:13 <joal> Deploying refinery to HDFS [analytics]
20:46 <joal> Deploying refinery with scap [analytics]
20:35 <joal> refinery-source v0.1.24 released on archiva [analytics]
11:25 <elukey> move kafka-jumbo nodes to fixed kafka uid/gid [analytics]
07:46 <elukey> `systemctl reset-failed product-analytics-movement-metrics.service` on stat1007 [analytics]
2022-01-10 §
13:56 <btullis> Upgrading oozie packages on an-test-coord1001 to test new log4j versions [analytics]
2022-01-08 §
10:51 <elukey> start hive-server2 on an-coord1002 - failed to connect to the metastore due to restart [analytics]
10:41 <elukey> restart hive daemons on an-coord1002 (after my last upgrade/rollback of packages the prometheus agent settings were not picked up, so no metrics) [analytics]
2022-01-07 §
20:16 <ottomata> altering hive table MobileWikiAppiOSUserHistory field event.device_level_enabled to string - T298721 [analytics]
17:29 <btullis> deployed updated hive packages to an-test-worker100[1-3] and an-test-ui1001 [analytics]
14:52 <btullis> root@aqs1014:~# jmap -dump:live,format=b,file=/srv/cassandra-b/tmp/aqs1014-b-dump202201071450.hprof 4468 [analytics]
2022-01-06 §
18:02 <btullis> btullis@aqs1010:~$ sudo systemctl restart cassandra-a.service [analytics]
12:22 <btullis> restarting cassandra-a service on aqs1004.eqiad.wmnet in order to troubleshoot logging. [analytics]
11:24 <btullis> restarting cassandra-a service on aqs1010.eqiad.wmnet in order to troubleshoot logging. [analytics]
08:12 <joal> Rerun failed webrequest-load-wf-text-2022-1-6-7 [analytics]
07:58 <joal> Rerun refine_event_sanitized_analytics_immediate missing hours after errors from the past days [analytics]
07:39 <joal> Rerun failed refine_eventlogging_analytics for mobilewikiappiosuserhistory schema, hours 2022-01-05T2[123]:00:00 and 2022-01-06T00:00:00, dropping malformed rows as discussed with schema owner [analytics]
2022-01-05 §
19:16 <joal> Rerun failed refine_eventlogging_analytics for mobilewikiappiosuserhistory schema, hours 2022-01-04T1[5789]:00:00, dropping malformed rows as discussed with schema owner [analytics]
11:37 <btullis> Upgrading hive on an-test-client1001 in order to test log4j upgrade [analytics]
11:35 <btullis> Upgrading hive packages on an-test-coord1001 to test log4j changes. [analytics]
2022-01-04 §
10:39 <elukey> restart cassandra-a on aqs1010 (heap size used in full, high GC) [analytics]
10:20 <elukey> restart cassandra-a on aqs1015 (heap size used in full, high GC) [analytics]
2022-01-03 §
18:26 <joal> rerun cassandra-daily-wf-local_group_default_T_mediarequest_per_file-2022-1-1 [analytics]
16:08 <joal> Kill cassandra3-local_group_default_T_mediarequest_per_file-daily-2022-1-1 [analytics]
11:26 <elukey> restart cassandra-b on aqs1015 (instance not responding, probably thrashing) [analytics]
11:16 <elukey> restart cassandra-b on aqs1010 (stuck thrashing) [analytics]
10:34 <elukey> depool aqs1010 (`sudo -i depool` on the node) to allow investigation of the cassandra-b instance [analytics]
10:22 <elukey> powercycle an-worker1114 (CPU soft lockup errors in mgmt console) [analytics]
10:20 <elukey> powercycle an-worker1120 (CPU soft lockup errors in mgmt console) [analytics]
2021-12-22 §
19:13 <milimetric> Additional context on the last delete message: this is on an-launcher1002, which has filled up [analytics]
19:12 <milimetric> Marcel and I are deleting files from /tmp older than 60 days [analytics]
15:55 <mforns> finished refinery deployment for anomaly detection queries [analytics]
14:54 <mforns> starting refinery deployment for anomaly detection queries [analytics]
2021-12-20 §
18:59 <mforns> finished deployment of refinery, adding anomaly detection hql for airflow job [analytics]
18:39 <mforns> started to deploy refinery, adding anomaly detection hql for airflow job [analytics]
2021-12-17 §
12:32 <btullis> Upgraded druid packages, with pool/depool on druid1004 [analytics]
11:20 <btullis> btullis@an-test-druid1001:~$ sudo apt-get install druid-broker druid-common druid-coordinator druid-historical druid-middlemanager druid-overlord [analytics]
11:18 <btullis> updating reprepro with new druid packages for buster-wikimedia to pick up new log4j jar files [analytics]
2021-12-16 §
11:01 <btullis> btullis@an-test-druid1001:~$ sudo apt-get install druid-broker druid-common druid-coordinator druid-historical druid-middlemanager druid-overlord [analytics]
11:01 <btullis> upgrading druid on the test cluster with new packages to test log4j changes. [analytics]
2021-12-15 §
08:51 <joal> Rerun failed cassandra-daily-wf-local_group_default_T_mediarequest_per_file-2021-12-13 after cluster restart [analytics]
07:20 <elukey> elukey@stat1007:~$ sudo systemctl reset-failed product-analytics-movement-metrics [analytics]