2021-07-20
16:53 <btullis@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on 64 hosts with reason: dealing with an-master1001 rebuild issue [production]
16:52 <razzi> starting hadoop processes on an-master1001 since they didn't failover cleanly [analytics]
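(The exact restart commands are not recorded here; a minimal sketch, assuming razzi simply re-started the same units that were stopped at 16:18-16:25 below, in reverse order of how they were stopped:
    sudo systemctl start hadoop-hdfs-namenode
    sudo systemctl start hadoop-hdfs-zkfc
    sudo systemctl start hadoop-yarn-resourcemanager
    sudo systemctl start hadoop-mapreduce-historyserver )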
16:44 <dcausse@deploy1002> helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [production]
16:37 <dzahn@cumin1001> END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1297.eqiad.wmnet [production]
16:31 <razzi> sudo bash gid_script.bash on an-master1001 [analytics]
16:29 <razzi> razzi@alert1001:~$ sudo icinga-downtime -h an-master1001 -d 7200 -r "an-master1001 debian upgrade" [analytics]
16:25 <razzi> razzi@an-master1001:~$ sudo systemctl stop hadoop-mapreduce-historyserver [analytics]
16:25 <razzi> sudo systemctl stop hadoop-hdfs-zkfc.service on an-master1001 again [analytics]
16:25 <dcausse@deploy1002> helmfile [staging] Ran 'sync' command on namespace 'rdf-streaming-updater' for release 'main' . [production]
16:25 <razzi> sudo systemctl stop hadoop-yarn-resourcemanager on an-master1001 again [analytics]
16:24 <dzahn@cumin1001> START - Cookbook sre.hosts.decommission for hosts mw1297.eqiad.wmnet [production]
16:23 <razzi> sudo systemctl stop hadoop-hdfs-namenode on an-master1001 [analytics]
16:21 <dzahn@cumin1001> END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1290.eqiad.wmnet [production]
16:19 <razzi> razzi@an-master1001:~$ sudo systemctl stop hadoop-hdfs-zkfc [analytics]
16:19 <razzi> razzi@an-master1001:~$ sudo systemctl stop hadoop-yarn-resourcemanager [analytics]
16:18 <razzi> sudo systemctl stop hadoop-hdfs-namenode [analytics]
16:11 <dzahn@cumin1001> START - Cookbook sre.hosts.decommission for hosts mw1290.eqiad.wmnet [production]
16:10 <razzi> razzi@cumin1001:~$ sudo transfer.py an-master1002.eqiad.wmnet:/home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz stat1004.eqiad.wmnet:/home/razzi/hdfs-namenode-fsimage [analytics]
16:10 <dzahn@cumin1001> END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw1289.eqiad.wmnet [production]
16:03 <razzi> root@an-master1002:/srv/hadoop/name# tar -czf /home/razzi/hdfs-namenode-snapshot-buster-reimage-$(date --iso-8601).tar.gz current [analytics]
15:59 <dzahn@cumin1001> START - Cookbook sre.hosts.decommission for hosts mw1289.eqiad.wmnet [production]
15:57 <dzahn@cumin1001> conftool action : set/pooled=inactive; selector: name=mw129[07].eqiad.wmnet [production]
15:57 <dzahn@cumin1001> conftool action : set/pooled=inactive; selector: name=mw1289.eqiad.wmnet [production]
15:57 <razzi> sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -saveNamespace [analytics]
15:52 <razzi> sudo -u hdfs kerberos-run-command hdfs hdfs dfsadmin -safemode enter [analytics]
15:48 <oblivian@deploy1002> helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [production]
15:45 <arturo> failback from labstore1006 to labstore1007 (dumps NFS) https://gerrit.wikimedia.org/r/c/operations/puppet/+/705417 [admin]
15:37 <razzi> kill yarn applications: for jobId in $(yarn application -list | awk 'NR > 2 { print $1 }'); do yarn application -kill $jobId; done [analytics]
15:23 <vgutierrez> pool dns1002 - T286069 [production]
15:21 <vgutierrez> pool cp[1087-1090].eqiad.wmnet - T286069 [production]
15:19 <jmm@puppetmaster1001> conftool action : set/pooled=yes; selector: name=ldap-replica1004.wikimedia.org [production]
15:17 <wm-bot> <bd808> Restarting because the bot is not working on all channels. Logs are inconclusive as to why. [tools.bridgebot]
15:14 <dzahn@cumin1001> conftool action : set/pooled=no; selector: name=mw1297.eqiad.wmnet [production]
15:14 <dzahn@cumin1001> conftool action : set/pooled=no; selector: name=mw1290.eqiad.wmnet [production]
15:14 <dzahn@cumin1001> conftool action : set/pooled=no; selector: name=mw1289.eqiad.wmnet [production]
15:08 <razzi> sudo -u yarn kerberos-run-command yarn yarn rmadmin -refreshQueues [analytics]
15:06 <kormat@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 12 hosts with reason: Deploying schema change to s3 T281058 [production]
15:06 <kormat@cumin1001> START - Cookbook sre.hosts.downtime for 4:00:00 on 12 hosts with reason: Deploying schema change to s3 T281058 [production]
14:53 <urbanecm> Start server-side upload for 7 large PNG files (T285708) [production]
14:52 <razzi> sudo systemctl stop 'gobblin-*.timer' [analytics]
14:51 <herron> depooled and scheduled downtime for kafka-main100[45] [production]
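(The commands herron used are not recorded; one plausible sketch, assuming the standard depool wrapper on each broker and the icinga-downtime helper used elsewhere in this log, with an illustrative two-hour duration:
    sudo depool    # run on kafka-main1004 and kafka-main1005
    sudo icinga-downtime -h kafka-main1004 -d 7200 -r "eqiad row D maintenance"
    sudo icinga-downtime -h kafka-main1005 -d 7200 -r "eqiad row D maintenance" )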
14:51 <razzi> sudo systemctl stop analytics-reportupdater-logs-rsync.timer [analytics]
14:51 <vgutierrez@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lvs1016.eqiad.wmnet with reason: eqiad row D maintenance [production]
14:50 <vgutierrez@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on lvs1016.eqiad.wmnet with reason: eqiad row D maintenance [production]
14:48 <vgutierrez@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on dns1002.wikimedia.org with reason: eqiad row D maintenance [production]
14:48 <vgutierrez@cumin1001> START - Cookbook sre.hosts.downtime for 1:00:00 on dns1002.wikimedia.org with reason: eqiad row D maintenance [production]
14:47 <razzi> Disable jobs on an-launcher1002 (see https://phabricator.wikimedia.org/T278423#7190372) [analytics]
14:46 <razzi> razzi@an-launcher1002:~$ sudo puppet agent --disable 'razzi: upgrade hadoop masters to debian buster' [analytics]
14:46 <vgutierrez> depool dns1002 - T286069 [production]
14:40 <vgutierrez@cumin1001> END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cp[1087-1090].eqiad.wmnet with reason: eqiad row D maintenance [production]