2020-09-25 §
15:42 <elukey> add an-worker1096 (GPU worker) to the hadoop cluster [analytics]
08:57 <elukey> restart daemons on analytics1052 (journalnode) to verify new TLS setting simplification (no truststore config in ssl-server.xml, not needed) [analytics]
07:18 <elukey> restart datanode on analytics1044 after new datanode partition settings (one partition was missing, caught by https://gerrit.wikimedia.org/r/c/operations/puppet/+/629647) [analytics]
2020-09-24 §
13:24 <elukey> moved the hadoop cluster to puppet TLS certificates [analytics]
13:20 <elukey> re-enable timers on an-launcher1002 after maintenance [analytics]
09:51 <elukey> stop all timers on an-launcher1002 to ease maintenance [analytics]
09:41 <elukey> force re-creation of jupyterhub's default venv on stat1006 after reimage [analytics]
07:29 <klausman> Starting reimaging of stat1006 [analytics]
06:48 <elukey> on an-launcher1002: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/mirrys/logs/* [analytics]
06:45 <elukey> on an-launcher1002: sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/analytics-privatedata/logs/* [analytics]
06:39 <elukey> manually ran "/usr/bin/find /srv/backup/hadoop/namenode -mtime +15 -delete" on an-master1002 to free some space in the backup partition [analytics]
2020-09-23 §
07:29 <elukey> re-enable timers on an-launcher1002 - maintenance postponed [analytics]
06:06 <elukey> stop timers on an-launcher1002 as prep step before maintenance [analytics]
2020-09-22 §
06:29 <elukey> re-run webrequest-load-text 21/09T21 - failed due to sporadic hive/kerberos issue (SQLException: Could not open client transport with JDBC Uri: jdbc:hive2://an-coord1001.eqiad.wmnet:10000/default;principal=hive/an-coord1001.eqiad.wmnet@WIKIMEDIA: Peer indicated failure: Failure to initialize security context) [analytics]
2020-09-21 §
18:00 <elukey> execute sudo -u hdfs kerberos-run-command hdfs hdfs dfs -rm -r -skipTrash /var/log/hadoop-yarn/apps/mgerlach/logs/* to free ~30TB of space on HDFS (Replicated) [analytics]
17:44 <elukey> restart yarn resource managers on an-master100[1,2] to pick up settings for https://gerrit.wikimedia.org/r/c/operations/puppet/+/628887 [analytics]
16:59 <joal> Manually add _SUCCESS file to the hourly partition of page_move events so that the wikidata-item_page_link job starts [analytics]
16:21 <joal> Kill restart wikidata-item_page_link-weekly-coord to not wait on missing data [analytics]
15:45 <joal> Restart wikidata-json_entity-weekly coordinator after wrong kill in new hue UI [analytics]
15:42 <joal> manually killing wikidata-json_entity-weekly-wf-2020-08-31 - Raw data is missing from dumps folder (json dumps) [analytics]
2020-09-18 §
15:05 <elukey> systemctl reset-failed monitor_refine_eventlogging_legacy_failure_flags.service on an-launcher1002 to clear icinga alarms [analytics]
10:38 <elukey> force ./create_virtualenv.sh in /srv/jupyterhub/deploy to update the jupyter's default venv [analytics]
2020-09-17 §
10:12 <klausman> started backup of stat1004's /srv to stat1008 [analytics]
2020-09-16 §
19:12 <joal> Manually kill webrequest-hour oozie job that started before the restart could happen (waiting for previous hour to be finished) [analytics]
19:00 <joal> Kill-restart data-quality-hourly bundle after deploy [analytics]
18:57 <joal> Kill-restart webrequest after deploy [analytics]
18:44 <joal> Kill restart mediawiki-history-reduced job after deploy [analytics]
17:59 <joal> Deploy refinery onto HDFS [analytics]
17:46 <joal> Deploy refinery using scap [analytics]
15:27 <elukey> update the TLS backend certificate for Analytics UIs (unified one) to include hue-next.w.o as SAN [analytics]
12:11 <klausman> stat1008 updated to use rock/rocm DKMS driver and back in operation [analytics]
11:28 <klausman> starting to upgrade to rock-dkms driver on stat1008 [analytics]
08:11 <elukey> superset 0.37.1 deployed to an-tool1005 (staging env) [analytics]
2020-09-15 §
13:43 <elukey> re-enable timers on an-launcher1002 after maintenance to an-coord1001 [analytics]
13:43 <elukey> restart of hive/oozie/presto daemons on an-coord1001 [analytics]
12:30 <elukey> stop timers on an-launcher1002 to drain the cluster and restart an-coord1001's daemons (hive/oozie/presto) [analytics]
06:48 <elukey> run systemctl reset-failed monitor_refine_eventlogging_legacy_failure_flags.service on an-launcher1002 [analytics]
2020-09-14 §
14:36 <milimetric> deployed eventstreams with new KafkaSSE version on staging, eqiad, codfw [analytics]
2020-09-11 §
15:41 <milimetric> restarted data quality stats bundles [analytics]
01:32 <milimetric> deployed small fix for hql of editors_bycountry load job [analytics]
00:46 <milimetric> deployed refinery source 0.0.136, refinery, and synced to HDFS [analytics]
2020-09-09 §
10:11 <klausman> Rebooting stat1005 for clearing GPU status and testing new DKMS driver (T260442) [analytics]
07:25 <elukey> restart varnishkafka-webrequest on cp5010 and cp5012; delivery report errors have been happening since yesterday's network outage [analytics]
2020-09-04 §
18:11 <milimetric> aqs deploy went well! Geoeditors endpoint is live internally, data load job was successful, will submit pull request for public endpoint. [analytics]
06:54 <joal> Manually restart mediawiki-history-drop-snapshot after hive-partitions/hdfs-folders mismatch fix [analytics]
06:08 <elukey> reset-failed mediawiki-history-drop-snapshot on an-launcher1002 to clear icinga errors [analytics]
01:52 <milimetric> aborted aqs deploy due to cassandra error [analytics]
2020-09-03 §
19:15 <milimetric> finished deploying refinery and refinery-source, restarting jobs now [analytics]
13:59 <milimetric> edit-hourly-druid-wf-2020-08 fails consistently [analytics]
13:56 <joal> Kill-restart mediawiki-history-reduced oozie job into production queue [analytics]