2018-04-06
08:07 <elukey> upgrade prometheus-burrow-exporter on kafkamon1001/2001 - T188719 [production]
08:07 <elukey> upload prometheus-burrow-exporter 0.0.5 to jessie/stretch-wikimedia - T188719 [production]
2018-04-04
15:06 <elukey> delete /srv/deployment/prometheus from restbase* as clean up step for T181728 [production]
14:20 <elukey> apply net.ipv4.tcp_tw_reuse=1 to restbase* via https://gerrit.wikimedia.org/r/#/c/421901 - T190213 [production]
12:02 <elukey> removing /srv/deployment/prometheus from restbase2001/1007 - T181728 [production]
09:16 <elukey> executed systemctl reset-failed kafka-mirror-main-eqiad_to_jumbo-eqiad.service on kafka1020 [production]
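The tcp_tw_reuse and reset-failed entries above correspond roughly to the following commands; hostnames, the sysctl key, and the unit name are from the log, while the exact invocation (the drop-in path in particular) is an assumption:

```shell
# Hedged sketch of the operations logged above, not the exact commands run.

# Apply net.ipv4.tcp_tw_reuse at runtime, as done on restbase* for T190213:
sudo sysctl -w net.ipv4.tcp_tw_reuse=1

# Persist it across reboots via a sysctl drop-in (hypothetical file name):
echo 'net.ipv4.tcp_tw_reuse = 1' | sudo tee /etc/sysctl.d/70-tcp-tw-reuse.conf

# Clear the "failed" state of a unit so systemd (and monitoring on top of it)
# stops reporting it, without restarting anything:
sudo systemctl reset-failed kafka-mirror-main-eqiad_to_jumbo-eqiad.service
```

In production the sysctl was rolled out via puppet (the gerrit change linked in the log); the manual `sysctl -w` calls on restbase1007/2001 were the one-host tests that preceded it.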
2018-04-03
17:40 <elukey> manually set net.ipv4.tcp_tw_reuse=1 on restbase1007 as test for T190213 [production]
17:08 <elukey> manually set net.ipv4.tcp_tw_reuse=1 on restbase2001 as test for T190213 [production]
15:39 <elukey> roll restart of zookeeper on conf100[123] to pick up prometheus monitoring [production]
13:18 <elukey> roll restart of zookeeper on conf200[123] to pick up prometheus monitoring settings [production]
08:01 <elukey> restart of druid-(overlord|middlemanager) on druid100[456] as precautionary measure after zk restart [production]
07:50 <elukey> roll restart zookeeper on druid100[456] to enable prometheus monitoring [production]
06:43 <elukey> execute systemctl reset-failed kafka-mirror-main-eqiad_to_jumbo-eqiad.service on kafka102[23] [production]
2018-03-30
10:17 <elukey> roll restart of zookeeper daemons on druid100[123] (Druid analytics cluster) to pick up the new prometheus jmx agent [production]
09:31 <elukey> restart oozie/hive daemons on an1003 for openjdk-8 upgrades [production]
08:38 <elukey> rolling restart of hadoop-hdfs-datanode on all the hadoop worker nodes after https://gerrit.wikimedia.org/r/423000 [production]
07:39 <elukey> rolling restart of yarn-hadoop-nodemanagers on all the hadoop worker nodes after https://gerrit.wikimedia.org/r/423000 [production]
2018-03-29
09:16 <elukey> roll restart aqs on aqs100* for icu/openssl upgrades [production]
08:07 <elukey> roll restart of cassandra on aqs* for openjdk-8 upgrades [production]
2018-03-28
13:51 <elukey> reduced number of jobrunner runners on the videoscalers after the last burst of jobs that maxed out the cluster [production]
2018-03-27
09:44 <elukey> reboot aqs1009 for kernel + cassandra upgrades [production]
09:28 <elukey> reboot aqs1008 for kernel + cassandra upgrades [production]
09:09 <elukey> reboot aqs1007 for kernel + cassandra upgrades [production]
08:33 <elukey> reboot aqs1006 for kernel + openjdk-8 + cassandra upgrade [production]
08:15 <elukey@puppetmaster1001> conftool action : set/pooled=no; selector: name=aqs1005.eqiad.wmnet [production]
08:11 <elukey> reboot aqs1005 for kernel + openjdk-8 + cassandra upgrade [production]
06:59 <elukey> powercycle restbase2007 (no ssh, vsp not available via mgmt console) [production]
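The aqs reboots above follow a depool → drain → reboot → repool cycle; the conftool action for aqs1005 is recorded verbatim in the log. A sketch of that cycle using the `confctl` CLI, where everything beyond the logged depool action is an assumption about the surrounding workflow:

```shell
# Sketch of the depool/reboot/repool cycle for one aqs host; the confctl
# selector mirrors the conftool action logged above, the rest is assumed.
confctl select 'name=aqs1005.eqiad.wmnet' set/pooled=no   # drain client traffic

sudo nodetool drain   # flush Cassandra memtables/commitlog before the reboot
sudo reboot

# ...once the host is back up and Cassandra has rejoined the ring:
confctl select 'name=aqs1005.eqiad.wmnet' set/pooled=yes
```

Draining Cassandra first (as the 2018-03-19 aqs1004 entry below also notes) keeps the restart clean: the node stops accepting writes and flushes to disk, so replay on startup is minimal.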
2018-03-26
07:33 <elukey> stop eventlogging zmq-forwarder on eventlog1001 as part of decom process - T189566 [production]
2018-03-24
15:00 <elukey> rm -rf /srv/mediawiki/core on stat100[456] and force puppet run (git pull returned fatal: protocol error: bad pack header) [production]
2018-03-23
11:09 <elukey> restarting jvm daemons on analytics100[12] (Hadoop Masters) for openjdk-8 upgrade [production]
10:36 <elukey> upload cassandra2.2.6-wmf3 to jessie/stretch-wikimedia -C component/cassandra22 - T189529 [production]
08:19 <elukey> reboot eventlog1001 for kernel upgrades [production]
2018-03-22
14:16 <elukey> rolling restart of the three hadoop hdfs journal nodes (an1028/35/52) for openjdk-8 upgrades [production]
11:20 <elukey> rolling restart of the hadoop hdfs datanode daemons on all the analytics hadoop workers for openjdk-8 upgrade [production]
10:42 <elukey> update puppet compiler's facts [production]
09:55 <elukey> rolling restart of yarn nodemanagers on the analytics hadoop workers for openjdk-8 upgrade [production]
07:58 <elukey> depool cp3010 + powercycle (no ssh access, mgmt console frozen) [production]
2018-03-20
17:29 <elukey> test a depool/repool action for kafka1001 (eventbus/jobqueue) - part of an investigation to figure out where timeouts come from [production]
2018-03-19
15:23 <elukey> reboot kafka1003 for kernel upgrades (jobqueues/eventbus) [production]
14:34 <elukey> reboot kafka1002 (eventbus/jobqueue) for kernel upgrades [production]
09:37 <elukey> restart hadoop daemons on analytics1070 for openjdk upgrades (canary) [production]
08:41 <elukey> reboot thorium for kernel security upgrades (hosts all analytics websites, they will go down temporarily) [production]
08:22 <elukey> revert previous state on aqs1004, the new pkg might need some more work - T189529 [production]
07:58 <elukey> manually installed cassandra-2.2.6-wmf3 on aqs1004 - T189529 [production]
07:47 <elukey> drain cassandra instances and reboot aqs1004 for kernel upgrades [production]
2018-03-17
18:41 <elukey> executed apt-get clean on scb1004 to free some space (root partition disk space warning) [production]
2018-03-16
14:25 <elukey> reboot druid1002 for kernel updates [production]
10:01 <elukey> restart eventlogging_sync on db1108 (eventlogging db slave) as precautions after the change of m4-master.eqiad.wmnet's CNAME [production]
09:57 <elukey> restart eventlogging-consumer@mysql-eventbus on eventlog1002 to force the DNS resolution of m4-master (changed from dbproxy1009 -> dbproxy1004) [production]
09:51 <elukey> restart eventlogging-consumer@mysql-m4 on eventlog1002 to force the DNS resolution of m4-master (changed from dbproxy1009 -> dbproxy1004) [production]