2022-06-27 §
20:16 <btullis> checking and restarting prometheus-mysqld-exporter on an-coord1001 [analytics]
15:25 <btullis> upgraded conda-base-env on an-test-client1001 from 0.0.1 to 0.0.4 [analytics]
2022-06-24 §
15:14 <ottomata> backfilled eventlogging data lost during failed gobblin job - T311263 [analytics]
2022-06-23 §
13:48 <btullis> started the namenode service on an-master1001 after failback failure [analytics]
13:41 <btullis> The failback didn't work again. [analytics]
13:39 <btullis> attempting failback of namenode service from an-master1002 to an-master1001 [analytics]
13:07 <btullis> restarted hadoop-hdfs-namenode service on an-master1001 [analytics]
11:25 <joal> kill oozie mediawiki-geoeditors-monthly-coord in favor of airflow job [analytics]
08:52 <joal> Deploy airflow [analytics]
2022-06-22 §
20:55 <aqu> `scap deploy -f analytics/refinery` because of a crash during `git-fat pull` [analytics]
19:30 <aqu> Deploying analytics/refinery [analytics]
2022-06-21 §
14:56 <aqu> RefineSanitize from an-launcher1002: sudo -u analytics kerberos-run-command analytics spark2-submit --class org.wikimedia.analytics.refinery.job.refine.RefineSanitize --master yarn --deploy-mode client /srv/deployment/analytics/refinery/artifacts/org/wikimedia/analytics/refinery/refinery-job-0.1.15.jar --config_file /home/aqu/refine.properties --since "2022-06-19T09:52:00+0000" --until [analytics]
13:33 <aqu> sudo systemctl start monitor_refine_event_sanitized_main_immediate.service on an-launcher1002 [analytics]
10:47 <btullis> proceeding with the hadoop.roll-restart-masters cookbook [analytics]
2022-06-20 §
07:14 <SandraEbele> Started Airflow 3 Wikidata metrics jobs (Articleplaceholder, Reliability and SpecialEntityData metrics). [analytics]
07:12 <SandraEbele> Started Airflow3 Wikidata metrics jobs (Articleplaceholder, Relia) [analytics]
07:11 <SandraEbele> killed Oozie wikidata-articleplaceholder_metrics-coord, wikidata-reliability_metrics-coord, and wikidata-specialentitydata_metrics-coord jobs. [analytics]
2022-06-17 §
12:35 <SandraEbele> deployed daily airflow dag for 3 Wikidata metrics. [analytics]
08:36 <btullis> power cycled an-worker1109 as it was stuck with CPU soft lockups [analytics]
2022-06-16 §
06:49 <joal> Rerun webrequest-load-wf-upload-2022-6-15-22 after weird oozie failure [analytics]
2022-06-15 §
14:48 <btullis> deploying datahub 0.8.38 [analytics]
2022-06-14 §
10:48 <joal> unpause renamed dags [analytics]
10:44 <joal> Deploy Airflow [analytics]
10:12 <btullis> manually failing back hdfs-namenode to an-master1001 after fixing typo [analytics]
09:36 <joal> deploy refinery onto HDFS [analytics]
08:48 <btullis> roll-restarting hadoop masters T310293 [analytics]
08:40 <joal> Deploying using scap again after failure cleanup on an-launcher1002 [analytics]
07:45 <joal> deploy refinery using scap [analytics]
2022-06-13 §
14:00 <btullis> restarting presto service on an-coord1001 [analytics]
13:20 <btullis> btullis@datahubsearch1001:~$ sudo systemctl reset-failed ifup@ens13.service T273026 [analytics]
13:09 <btullis> restarting oozie service on an-coord1001 [analytics]
12:59 <btullis> having failed over hive to an-coord1002 10 minutes ago, I'm restarting hive services on an-coord1001 [analytics]
12:26 <btullis> restarting hive-server2 and hive-metastore on an-coord1002 [analytics]
09:54 <joal> rerun failed refine for network_flows_internal [analytics]
09:54 <joal> Rerun failed refine for mediawiki_talk_page_edit events [analytics]
09:51 <joal> Manually rerun webrequest_text load for hour 2022-06-13T03:00 [analytics]
07:18 <joal> Manually rerun webrequest_text load for hour 2022-06-12T08:00 [analytics]
2022-06-10 §
17:00 <ottomata> applied change to airflow instances to bump scheduler parsing_processes = # of cpu processors [analytics]
08:58 <btullis> cookbook sre.hadoop.roll-restart-workers analytics [analytics]
2022-06-09 §
17:17 <joal> Rerun refine for failed datasets [analytics]
14:15 <btullis> manually failing back HDFS namenode from an-master1002 to an-master1001 [analytics]
13:15 <btullis> roll-restarting the hadoop masters to pick up new JRE [analytics]
2022-06-08 §
18:06 <joal> Restart airflow after deploy for dag reprocessing [analytics]
18:02 <joal> deploying Airflow dags [analytics]
13:45 <btullis> deploying refinery [analytics]
2022-06-07 §
13:45 <btullis> deploying updated eventgate images to all remaining deployments. [analytics]
11:33 <btullis> deployed an updated version of eventgate to eventgate-analytics-external to address the timing mis-calculation. [analytics]
10:51 <btullis> restart the eventlogging_to_druid_netflow-sanitization_daily service on an-launcher1002 [analytics]
2022-06-06 §
13:45 <btullis> restarting archiva service for new JRE [analytics]
06:31 <elukey> restart memcached on an-tool1005 to pick up puppet settings and clear an alert in icinga [analytics]