2018-02-13 §
11:42 <elukey> force kill of yarn nodemanager + other containers on analytics1057 (node failed, unit masked, processes still around) [analytics]
2018-02-12 §
23:16 <elukey> re-run webrequest-load-wf-upload-2018-2-12-21 via Hue (node managers failure) [analytics]
23:13 <elukey> manual restart of Yarn Node Managers on analytics1058/31 [analytics]
23:09 <elukey> cleaned up tmp files on all analytics hadoop worker nodes, job filling up tmp [analytics]
17:18 <elukey> home dirs on stat1004 moved to /srv/home (/home symlinks to it) [analytics]
17:15 <ottomata> restarting eventlogging-processors to blacklist Print schema in eventlogging-valid-mixed (MySQL) [analytics]
14:46 <ottomata> deploying eventlogging for T186833 with EventCapsule in code and IP NO_DB_PROPERTIES [analytics]
2018-02-09 §
12:19 <joal> Rerun wikidata-articleplaceholder_metrics-wf-2018-2-8 [analytics]
2018-02-08 §
16:23 <elukey> stop archiva on meitnerium to swap /var/lib/archiva from the root partition to a new separate one [analytics]
2018-02-07 §
13:55 <joal> Manually restarted druid indexation after weird failure of mediawiki-history-reduced-wf-2018-01 [analytics]
13:49 <elukey> restart overlord/middlemanager on druid1005 [analytics]
2018-02-06 §
19:40 <joal> Manually restarted druid indexation after weird failure of mediawiki-history-reduced-wf-2018-01 [analytics]
15:36 <elukey> drain + shutdown of analytics1038 to replace faulty BBU [analytics]
09:58 <elukey> applied https://gerrit.wikimedia.org/r/c/405687/ manually on deployment-eventlog02 for testing [analytics]
2018-02-05 §
15:51 <elukey> live hacked deployment-eventlog02's /srv/deployment/eventlogging/analytics/eventlogging/handlers.py to add poll(0) to the confluent kafka producer - T185291 [analytics]
11:03 <elukey> restart eventlogging/forwarder legacy-zmq on eventlog1001 due to slow memory leak over time (cached memory down to zero) [analytics]
2018-02-02 §
17:09 <joal> Webrequest upload 2018-02-02 hours 9 and 11 dataloss warnings have been checked - They are false positives [analytics]
09:56 <joal> Rerun unique_devices-per_project_family-monthly-wf-2018-1 after failure [analytics]
2018-02-01 §
17:00 <ottomata> killing stuck JsonRefine eventlogging analytics job application_1515441536446_52892, not sure why this is stuck. [analytics]
14:06 <joal> Dataloss alerts for upload 2018-02-01 hours 1, 2, 3 and 5 were false positives [analytics]
12:17 <joal> Restart cassandra monthly bundle after January deploy [analytics]
2018-01-23 §
20:10 <ottomata> hdfs dfs -chmod 775 /wmf/data/archive/mediacounts/daily/2018 for T185419 [analytics]
09:26 <joal> Dataloss warning for upload and text 2018-01-23:06 is confirmed to be a false positive [analytics]
2018-01-22 §
17:36 <joal> Kill-Restart clickstream oozie job after deploy [analytics]
17:12 <joal> deploying refinery onto HDFS [analytics]
17:12 <joal> Refinery deployed from scap [analytics]
2018-01-18 §
19:11 <joal> Kill-Restart coord_pageviews_top_bycountry_monthly oozie job from 2015-05 [analytics]
19:10 <joal> Add fake data to cassandra to silence alarms (Thanks again ema) [analytics]
18:56 <joal> Truncating table "local_group_default_T_top_bycountry"."data" in cassandra before reload [analytics]
15:21 <mforns> refinery deployment using scap and then deploying onto hdfs finished [analytics]
15:07 <mforns> starting refinery deployment [analytics]
12:43 <elukey> piwik on bohrium re-enabled [analytics]
12:40 <elukey> set piwik in readonly mode and stopped mysql on bohrium (prep step for reboot) [analytics]
09:38 <elukey> reboot thorium (analytics webserver) for security upgrade - This maintenance will cause temporary unavailability of the Analytics websites [analytics]
09:37 <elukey> resumed druid hourly index jobs via hue and restored pivot's configuration [analytics]
09:21 <elukey> reboot druid1001 for kernel upgrades [analytics]
09:00 <elukey> suspended hourly druid batch index jobs via Hue [analytics]
08:58 <elukey> temporarily set druid1002 in superset's druid cluster config (via UI) [analytics]
08:53 <elukey> temporarily point pivot's configuration to druid1002 (druid1001 needs to be rebooted) [analytics]
08:52 <elukey> disable druid1001's middlemanager as prep step for reboot [analytics]
07:11 <elukey> re-run webrequest-load-wf-misc-2018-1-18-3 via Hue [analytics]
2018-01-17 §
17:33 <elukey> killed the banner impression spark job (application_1515441536446_27293) again to force it to respawn (real time indexers not present) [analytics]
17:29 <elukey> restarted all druid overlords on druid100[123] (weird race condition messages about who was the leader for some task) [analytics]
16:24 <elukey> re-run all the pageview-druid-hourly failed jobs via Hue [analytics]
14:42 <elukey> restart druid middlemanager on druid1003 as attempt to unblock realtime streaming [analytics]
14:21 <elukey> forced kill of banner impression data streaming job to get it restarted [analytics]
11:44 <elukey> re-run pageview-druid-hourly-wf-2018-1-17-9 and pageview-druid-hourly-wf-2018-1-17-8 (failed due to druid1002's middlemanager being in a weird state after reboot) [analytics]
11:44 <elukey> restart druid middlemanager on druid1002 [analytics]
10:38 <elukey> stopped all crons on hadoop-coordinator-1 [analytics]
10:37 <elukey> re-run webrequest-druid-hourly-wf-2018-1-17-8 (failed due to druid1002's reboot) [analytics]