2018-01-18
09:38 |
<elukey> |
reboot thorium (analytics webserver) for security upgrade - This maintenance will cause temporary unavailability of the Analytics websites |
[analytics] |
09:37 |
<elukey> |
resumed druid hourly index jobs via hue and restored pivot's configuration |
[analytics] |
09:21 |
<elukey> |
reboot druid1001 for kernel upgrades |
[analytics] |
09:00 |
<elukey> |
suspended hourly druid batch index jobs via Hue |
[analytics] |
08:58 |
<elukey> |
temporarily set druid1002 in superset's druid cluster config (via UI) |
[analytics] |
08:53 |
<elukey> |
temporarily point pivot's configuration to druid1002 (druid1001 needs to be rebooted) |
[analytics] |
08:52 |
<elukey> |
disable druid1001's middlemanager as prep step for reboot |
[analytics] |
07:11 |
<elukey> |
re-run webrequest-load-wf-misc-2018-1-18-3 via Hue |
[analytics] |
2018-01-17
17:33 |
<elukey> |
killed the banner impression spark job (application_1515441536446_27293) again to force it to respawn (real-time indexers not present) |
[analytics] |
17:29 |
<elukey> |
restarted all druid overlords on druid100[123] (weird race condition messages about who was the leader for some task) |
[analytics] |
16:24 |
<elukey> |
re-run all the pageview-druid-hourly failed jobs via Hue |
[analytics] |
14:42 |
<elukey> |
restart druid middlemanager on druid1003 as an attempt to unblock realtime streaming |
[analytics] |
14:21 |
<elukey> |
forced kill of banner impression data streaming job to get it restarted |
[analytics] |
11:44 |
<elukey> |
re-run pageview-druid-hourly-wf-2018-1-17-9 and pageview-druid-hourly-wf-2018-1-17-8 (failed due to druid1002's middlemanager being in a weird state after reboot) |
[analytics] |
11:44 |
<elukey> |
restart druid middlemanager on druid1002 |
[analytics] |
10:38 |
<elukey> |
stopped all crons on hadoop-coordinator-1 |
[analytics] |
10:37 |
<elukey> |
re-run webrequest-druid-hourly-wf-2018-1-17-8 (failed due to druid1002's reboot) |
[analytics] |
10:22 |
<elukey> |
reboot druid1002 for kernel upgrades |
[analytics] |
09:53 |
<elukey> |
disable druid middlemanager on druid1002 as prep step for reboot |
[analytics] |
09:46 |
<elukey> |
rebooted analytics1003 |
[analytics] |
09:46 |
<elukey> |
removed upstart config for brrd on eventlog1001 (failing and spamming syslog, old leftover?) |
[analytics] |
08:53 |
<elukey> |
disabled camus as prep step for analytics1003 reboot |
[analytics] |
2018-01-11
22:35 |
<ottomata> |
restarting kafka-jumbo brokers to apply https://gerrit.wikimedia.org/r/403774 |
[analytics] |
22:04 |
<ottomata> |
restarting kafka-jumbo brokers to apply https://gerrit.wikimedia.org/r/#/c/403762/ |
[analytics] |
20:57 |
<ottomata> |
restarting kafka-jumbo brokers to apply https://gerrit.wikimedia.org/r/#/c/403753/ |
[analytics] |
17:37 |
<joal> |
Kill manual banner-streaming job to see it restarted by cron |
[analytics] |
17:11 |
<ottomata> |
restart kafka on kafka-jumbo1003 |
[analytics] |
17:08 |
<ottomata> |
restart kafka on kafka-jumbo1001; something is not right with my certpath change from yesterday |
[analytics] |
14:46 |
<joal> |
Deploy refinery onto HDFS |
[analytics] |
14:33 |
<joal> |
Deploy refinery with Scap |
[analytics] |
14:07 |
<joal> |
Manually restarting banner streaming job to prevent alerting |
[analytics] |
13:23 |
<joal> |
Killing banner-streaming job to have it auto-restarted from cron |
[analytics] |
11:45 |
<elukey> |
re-run webrequest-load-wf-text-2018-1-11-8 (failed due to reboots) |
[analytics] |
11:39 |
<joal> |
rerun mediacounts-load-wf-2018-1-11-8 |
[analytics] |
10:48 |
<joal> |
Restarting banner-streaming job after hadoop nodes reboot |
[analytics] |
10:01 |
<elukey> |
reboot analytics1059-61 for kernel updates |
[analytics] |
09:34 |
<elukey> |
reboot analytics1055-1058 for kernel updates |
[analytics] |
09:04 |
<elukey> |
reboot analytics1051-1054 for kernel updates |
[analytics] |