2023-03-08
|
10:25 |
<nfraison> |
failover namenode in prod from an-master1002-eqiad-wmnet to an-master1001-eqiad-wmnet |
[analytics] |
09:59 |
<nfraison> |
restart namenode in an-master1001 (standby in prod) to take into account new quota init threads setting |
[analytics] |
09:53 |
<nfraison> |
restart namenode in an-test-master1002 to take into account new quota init threads setting |
[analytics] |
09:52 |
<nfraison> |
failover namenode in test from an-test-master1002-eqiad-wmnet to an-test-master1001-eqiad-wmnet |
[analytics] |
09:47 |
<nfraison> |
restart namenode in an-test-master1001 to take into account new quota init threads setting |
[analytics] |
09:36 |
<nfraison> |
restart test hiveserver2: T303168 |
[analytics] |
09:13 |
<nfraison> |
restart prod resourcemanager to take into account new dedicated exclude file |
[analytics] |
08:58 |
<nfraison> |
restart test resourcemanager to take into account new dedicated exclude file |
[analytics] |
07:56 |
<nfraison> |
restart prod jobhistory to take into account: https://gerrit.wikimedia.org/r/c/operations/puppet/+/894481 |
[analytics] |
07:47 |
<nfraison> |
restart test jobhistory to take into account: https://gerrit.wikimedia.org/r/c/operations/puppet/+/894481 |
[analytics] |
2023-03-07
|
22:03 |
<mforns> |
deployed airflow analytics again to try and fix druid_load_edit_hourly |
[analytics] |
16:55 |
<xcollazo> |
deployed image-suggestions hotfix to platform_eng Airflow instance. See https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/262. |
[analytics] |
15:23 |
<btullis> |
re-enabling ingestion via gobblin. |
[analytics] |
14:59 |
<nfraison> |
force startup of nodemanager on analytics_cluster |
[analytics] |
14:58 |
<btullis> |
pooled druid1004 |
[analytics] |
14:57 |
<btullis> |
pooling aqs1010 and aqs1016 |
[analytics] |
14:56 |
<btullis> |
pooling datahubsearch1001 |
[analytics] |
14:53 |
<btullis> |
leaving safe mode on hdfs |
[analytics] |
13:59 |
<btullis> |
disabled puppet temporarily on an-master100[1-2] to avoid an automatic restart of yarn |
[analytics] |
13:57 |
<btullis> |
stopped `hadoop-yarn-resourcemanager.service` on both an-master100[1-2] |
[analytics] |
13:54 |
<btullis> |
entering safe mode with `sudo -u hdfs kerberos-run-command hdfs hadoop dfsadmin -safemode enter` on an-master1002 |
[analytics] |
12:57 |
<btullis> |
depooled druid1004 for T329073 |
[analytics] |
12:56 |
<btullis> |
depooled datahubsearch1001 for T329073 |
[analytics] |
12:51 |
<btullis> |
disabled gobblin timers on an-launcher1002 |
[analytics] |
12:46 |
<btullis> |
depooling aqs1016 for T329073 |
[analytics] |
12:45 |
<btullis> |
depooling aqs1010 for T329073 |
[analytics] |
08:00 |
<nfraison> |
Reimage an-conf1003 to upgrade to bullseye T329362 |
[analytics] |
2023-03-01
|
22:45 |
<mforns> |
re-deployed airflow analytics with some forgotten changes |
[analytics] |
22:42 |
<mforns> |
deployed Airflow analytics |
[analytics] |
22:30 |
<mforns> |
finished refinery deployment, although didn't manage to run refinery-deploy-to-hdfs without warnings... |
[analytics] |
21:48 |
<mforns> |
kill edit-hourly-coord in Hue to migrate it to Airflow |
[analytics] |
21:26 |
<mforns> |
starting refinery deploy |
[analytics] |
19:38 |
<SandraEbele> |
rerunning webrequest load text for 2023-03-01-08 hour. |
[analytics] |
18:54 |
<joal> |
Create empty partitions in event.mediawiki_page_move table for codfw datacenter from beginning of week (2023-02-27T00 -> 2023-02-28T13) |
[analytics] |
10:25 |
<nfraison> |
rebooting an-worker1132, which is slower than other nodes (potential issue with RAID card/disks) |
[analytics] |
07:59 |
<nfraison> |
restarted hiveserver2 in analytics-test to take into account the -XX:MaxMetaspaceSize=512m JVM parameter |
[analytics] |