2018-07-18 §
12:08 <arturo> upgrading packages from `trusty-wikimedia` T199905 [tools]
2018-06-30 §
18:15 <chicocvenancio> pushed new config to PAWS to fix dumps nfs mountpoint [tools]
16:40 <zhuyifei1999_> reboot was because tools-paws-master-01 had a loadavg of ~1000: NFS was having issues and processes were stuck in D state [tools]
16:39 <zhuyifei1999_> reboot tools-paws-master-01 [tools]
16:35 <zhuyifei1999_> `root@tools-paws-master-01:~# sed -i 's/^labstore1006.wikimedia.org/#labstore1006.wikimedia.org/' /etc/fstab` [tools]
16:34 <andrewbogott> "sed -i '/labstore1006/d' /etc/fstab" everywhere [tools]
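The two `sed` commands logged at 16:34 and 16:35 differ in effect: one deletes every line mentioning labstore1006, the other only comments out lines that begin with that hostname. A minimal sketch of the difference, run against a temporary file rather than a real `/etc/fstab`; the sample mount entries are hypothetical, since the log does not show the actual fstab contents:

```shell
#!/bin/sh
# Hypothetical two-line fstab sample (the real entries are not in the log).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
labstore1006.wikimedia.org:/dumps /mnt/nfs/dumps-labstore1006.wikimedia.org nfs ro 0 0
labstore1007.wikimedia.org:/dumps /mnt/nfs/dumps-labstore1007.wikimedia.org nfs ro 0 0
EOF

# 16:35 variant: prefix the matching line with '#', so it can be restored later.
commented=$(sed 's/^labstore1006.wikimedia.org/#labstore1006.wikimedia.org/' "$tmp")

# 16:34 variant: drop every line that mentions labstore1006.
deleted=$(sed '/labstore1006/d' "$tmp")

printf 'commented:\n%s\n' "$commented"
printf 'deleted:\n%s\n' "$deleted"
rm -f "$tmp"
```

Without `-i`, `sed` writes to stdout and leaves the file untouched, which makes this safe to try before editing a live fstab.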
2018-06-29 §
17:41 <bd808> Rescheduling continuous jobs away from tools-exec-1408 where load is high [tools]
17:11 <bd808> Rescheduled jobs away from tools-exec-1404 where linkwatcher is currently stealing most of the CPU (T123121) [tools]
16:46 <bd808> Killed orphan tool-owned processes running on the job grid. Mostly jembot and wsexport php-cgi processes stuck in deadlock following an OOM. T182070 [tools]
2018-06-28 §
19:50 <chasemp> tools-clushmaster-01:~$ clush -w @all 'sudo umount -fl /mnt/nfs/dumps-labstore1006.wikimedia.org' [tools]
18:02 <chasemp> tools-clushmaster-01:~$ clush -w @all "sudo umount -fl /mnt/nfs/dumps-labstore1007.wikimedia.org" [tools]
17:53 <chasemp> tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --disable 'labstore1007 outage'" [tools]
17:20 <chasemp> tools-worker-1007:~# /sbin/reboot [tools]
16:48 <arturo> rebooting tools-docker-registry-01 [tools]
16:42 <andrewbogott> rebooting tools-worker-<everything> to get NFS unstuck [tools]
16:40 <andrewbogott> rebooting tools-worker-1012 and tools-worker-1015 to get their nfs mounts unstuck [tools]
2018-06-21 §
13:18 <chasemp> tools-bastion-03:~# bash -x /data/project/paws/paws-userhomes-hack.bash [tools]
2018-06-20 §
15:09 <bd808> Killed orphan processes on webgrid nodes (T182070); most owned by jembot and croptool [tools]
2018-06-14 §
14:20 <chasemp> timeout 180s bash -x /data/project/paws/paws-userhomes-hack.bash [tools]
2018-06-11 §
10:11 <arturo> T196137 `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo wc -l /var/log/exim4/paniclog 2>/dev/null | grep -v ^0 && sudo rm -rf /var/log/exim4/paniclog && sudo service prometheus-node-exporter restart || true'` [tools]
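The `clush` one-liner above chains `wc -l ... | grep -v ^0 && ... || true` so that the cleanup only runs on nodes where the paniclog has content, and the overall exit status is always success. A minimal sketch of that check-then-clean pattern, assuming a temporary file in place of `/var/log/exim4/paniclog` and an `echo` in place of the service restart; the function name `clean_if_nonempty` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: remove the file only if it has at least one line.
clean_if_nonempty() {
    # wc -l prints "<count> <file>"; grep -v succeeds only when the count is not 0.
    # The pattern tolerates leading padding that some wc implementations emit.
    if wc -l "$1" 2>/dev/null | grep -v '^ *0 ' >/dev/null; then
        rm -f "$1"
        echo "cleaned"
    else
        # Empty or missing file: nothing to do, mirroring the trailing "|| true".
        echo "skipped"
    fi
}

f=$(mktemp)
first=$(clean_if_nonempty "$f")     # file exists but is empty
printf 'panic entry\n' > "$f"
second=$(clean_if_nonempty "$f")    # file now has one line
echo "$first $second"
```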
2018-06-08 §
07:46 <arturo> T196137 more rootspam today, restarting `prometheus-node-exporter` again and force-rotating the exim4 paniclog on 12 nodes [tools]
2018-06-07 §
11:01 <arturo> T196137 force rotate all exim paniclog files to avoid rootspam `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo logrotate /etc/logrotate.d/exim4-paniclog -f -v'` [tools]
2018-06-06 §
22:00 <bd808> Scripting a restart of webservice for tools that are still in CrashLoopBackOff state after 2nd attempt (T196589) [tools]
21:10 <bd808> Scripting a restart of webservice for 59 tools that are still in CrashLoopBackOff state after last attempt (P7220) [tools]
20:25 <bd808> Scripting a restart of webservice for 175 tools that are in CrashLoopBackOff state (P7220) [tools]
19:04 <chasemp> tools-bastion-03 is virtually unusable [tools]
09:49 <arturo> T196137 aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo service prometheus-node-exporter restart' <-- procs using the old uid [tools]
2018-06-05 §
18:02 <bd808> Forced puppet run on tools-bastion-03 to re-enable logins by debenben (T196486) [tools]
17:39 <arturo> T196137 clush: delete `prometheus` user and re-create it locally. Then, chown prometheus dirs [tools]
17:38 <bd808> Added grid engine quota to limit user debenben to 2 concurrent jobs (T196486) [tools]
2018-06-04 §
10:28 <arturo> T196006 installing sqlite3 package in exec nodes [tools]
2018-06-03 §
10:19 <zhuyifei1999_> Grid is full. qdel'ed all jobs belonging to tools.dibot except lighttpd, and all tools.mbh jobs whose names start with 'comm_delin' or 'delfilexcl' T195834 [tools]
2018-05-31 §
11:31 <zhuyifei1999_> building & pushing python/web docker image T174769 [tools]
11:13 <zhuyifei1999_> force puppet run on tools-worker-1001 to check the impact of https://gerrit.wikimedia.org/r/#/c/433101 [tools]
2018-05-30 §
10:52 <zhuyifei1999_> undid both changes to tools-bastion-05 [tools]
10:50 <zhuyifei1999_> also making /proc/sys/kernel/yama/ptrace_scope 0 temporarily on tools-bastion-05 [tools]
10:45 <zhuyifei1999_> installing mono-runtime-dbg on tools-bastion-05 to produce debugging information; was previously installed on tools-exec-1413 & 1441. Might be a good idea to uninstall them once we can close T195834 [tools]
2018-05-28 §
12:09 <arturo> T194665 adding mono packages to apt.wikimedia.org for jessie-wikimedia and stretch-wikimedia [tools]
12:06 <arturo> T194665 adding mono packages to apt.wikimedia.org for trusty-wikimedia [tools]
2018-05-25 §
05:31 <zhuyifei1999_> Edit /data/project/.system/gridengine/default/common/sge_request, h_vmem 256M -> 512M, release precise -> trusty T195558 [tools]
2018-05-22 §
11:53 <arturo> running puppet to deploy https://gerrit.wikimedia.org/r/#/c/433996/ for T194665 (mono framework update) [tools]
2018-05-18 §
16:36 <bd808> Restarted bigbrother on tools-services-02 [tools]
2018-05-16 §
21:01 <zhuyifei1999_> maintain-kubeusers is stuck in an infinite loop of 10-second sleeps [tools]
2018-05-15 §
04:28 <andrewbogott> depooling, rebooting, re-pooling tools-exec-1414. It's hanging for unknown reasons. [tools]
04:07 <zhuyifei1999_> Draining unresponsive tools-exec-1414 following Portal:Toolforge/Admin#Draining_a_node_of_Jobs [tools]
04:05 <zhuyifei1999_> Force deletion of grid job 5221417 (tools.giftbot sga), host tools-exec-1414 not responding [tools]
2018-05-12 §
10:09 <Hauskatze> tools.quentinv57-tools@tools-bastion-02:~$ webservice stop | T194343 [tools]
2018-05-11 §
14:34 <andrewbogott> repooling labvirt1001 tools instances [tools]
13:59 <andrewbogott> depooling a bunch of things before rebooting labvirt1001 for T194258: tools-exec-1401 tools-exec-1407 tools-exec-1408 tools-exec-1430 tools-exec-1431 tools-exec-1432 tools-exec-1435 tools-exec-1438 tools-exec-1439 tools-exec-1441 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407 [tools]
2018-05-10 §
18:55 <andrewbogott> depooling, rebooting, repooling tools-exec-1401 to test a kernel update [tools]