2020-10-05 §
18:20 <elukey> manual creation of /opt/rocm -> /opt/rocm-3.3.0 on stat1008 to avoid failures in finding the lib dir [analytics]
17:11 <elukey> bootstrap an-worker[1115-1117] as hadoop workers [analytics]
14:52 <milimetric> disabling drop-el-unsanitized-events timer until https://gerrit.wikimedia.org/r/c/analytics/refinery/+/631804/ is deployed [analytics]
14:41 <elukey> shutdown stat1005 and stat1008 for ram expansion (1005 again) [analytics]
14:25 <elukey> shutdown an-master1001 for ram expansion [analytics]
13:54 <elukey> shutdown stat1005 for ram upgrade [analytics]
13:31 <elukey> shutdown an-master1002 for ram expansion (64 -> 128G) [analytics]
12:35 <elukey> execute "PURGE BINARY LOGS BEFORE '2020-09-28 00:00:00';" on an-coord1001's mysql to free space - T264081 [analytics]
10:31 <elukey> bootstrap an-worker111[0,2] as hadoop workers [analytics]
06:33 <elukey> reboot stat1005 to resolve weird GPU state (scheduled last week) [analytics]
2020-10-03 §
10:35 <joal> Manually run mediawiki-history-denormalize after fail-rerun problem (second time) [analytics]
2020-10-02 §
16:43 <joal> Rerun mediawiki-history-denormalize-wf-2020-09 after failed instance [analytics]
14:23 <elukey> live patch refinery-drop-older-than on stat1007 to unblock timer (patch https://gerrit.wikimedia.org/r/6317800) [analytics]
13:00 <elukey> add an-worker110[6-9] to the Hadoop cluster [analytics]
06:49 <elukey> add an-worker110[0-2] to the hadoop cluster [analytics]
06:33 <joal> Manually sqoop page_props and user_properties to unlock mediawiki-history-load oozie job [analytics]
2020-10-01 §
19:07 <fdans> deploying wikistats [analytics]
19:06 <fdans> restarted banner_activity-druid-daily-coord from Sep 26 [analytics]
18:59 <fdans> restarting mediawiki-history-load-coord [analytics]
18:57 <fdans> creating hive table wmf_raw.mediawiki_page_props [analytics]
18:56 <fdans> creating hive table wmf_raw.mediawiki_user_properties [analytics]
17:40 <elukey> remove + re-create /srv/deployment/analytics/refinery* on stat100[46] (perm issues after reimage) [analytics]
17:32 <elukey> remove + re-create /srv/deployment/analytics/refinery on stat1007 (perm issues after reimage) [analytics]
17:18 <fdans> deploying refinery [analytics]
14:51 <elukey> bootstrap an-worker109[8-9] as hadoop workers (with GPU) [analytics]
13:35 <elukey> bootstrap an-worker1097 (GPU node) as hadoop worker [analytics]
13:15 <elukey> restart performance-asoranking on stat1007 [analytics]
13:15 <elukey> execute "sudo chown analytics-privatedata:analytics-privatedata-users /srv/published-datasets/performance/autonomoussystems/*" on stat1007 to fix a perm issue after reimage [analytics]
10:30 <elukey> add an-worker1103 to the hadoop cluster [analytics]
07:15 <elukey> restart hdfs namenodes on an-master100[1,2] to pick up new hadoop workers settings [analytics]
06:04 <elukey> execute "sudo chown -R analytics-privatedata:analytics-privatedata-users /srv/geoip/archive" on stat1007 - T264152 [analytics]
05:58 <elukey> execute "sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chown -R analytics-privatedata /wmf/data/archive/geoip" - T264152 [analytics]
2020-09-30 §
07:29 <elukey> execute "alter table superset_production.alerts drop key ix_alerts_active;" on db1108's analytics-meta instance to fix replication after Superset upgrade - T262162 [analytics]
07:04 <elukey> superset upgraded to 0.37.2 on analytics-tool1004 - T262162 [analytics]
05:47 <elukey> "PURGE BINARY LOGS BEFORE '2020-09-22 00:00:00';" on an-coord1001's mariadb - T264081 [analytics]
2020-09-28 §
18:37 <elukey> execute "PURGE BINARY LOGS BEFORE '2020-09-20 00:00:00';" on an-coord1001's mariadb as attempt to recover space [analytics]
18:37 <elukey> execute "PURGE BINARY LOGS BEFORE '2020-09-15 00:00:00';" on an-coord1001's mariadb as attempt to recover space [analytics]
15:09 <elukey> execute set global max_connections=200 on an-coord1001's mariadb (hue reporting too many conns, but in reality the fault is from superset) [analytics]
10:02 <elukey> force /srv/jupyterhub/deploy/create_virtual_env.sh on stat1007 after the reimage [analytics]
07:58 <elukey> starting the process to decom the old hadoop test cluster [analytics]
2020-09-27 §
06:53 <elukey> manually ran /usr/bin/find /srv/backup/hadoop/namenode -mtime +14 -delete on an-master1002 to free space on the /srv partition [analytics]
2020-09-25 §
16:25 <elukey> systemctl reset-failed monitor_refine_eventlogging_legacy_failure_flags.service on an-launcher1002 to clear alerts [analytics]
15:52 <elukey> restart hdfs namenodes to correct rack settings of the new host [analytics]
15:42 <elukey> add an-worker1096 (GPU worker) to the hadoop cluster [analytics]
08:57 <elukey> restart daemons on analytics1052 (journalnode) to verify new TLS setting simplification (no truststore config in ssl-server.xml, not needed) [analytics]
07:18 <elukey> restart datanode on analytics1044 after new datanode partition settings (one partition was missing, caught by https://gerrit.wikimedia.org/r/c/operations/puppet/+/629647) [analytics]
2020-09-24 §
13:24 <elukey> moved the hadoop cluster to puppet TLS certificates [analytics]
13:20 <elukey> re-enable timers on an-launcher1002 after maintenance [analytics]
09:51 <elukey> stop all timers on an-launcher1002 to ease maintenance [analytics]