2016-12-18
08:57 <elukey> forced restart of cassandra-c on restbase1011 [production]
08:51 <elukey> forced restart of cassandra-b/c on restbase1013 (b not really needed, my error) [production]
08:49 <elukey> forced restart for cassandra-a on restbase1009 (still OOMs) [production]
08:43 <elukey> forced puppet on restbase1009 to bring up cassandra-a (stopped due to OOM issues) [production]
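A forced restart of one Cassandra instance on a multi-instance restbase host, as in the entries above, might look like the following sketch; the per-instance systemd unit name and the journal check are assumptions:

  # hypothetical: restart instance "a" and check for OOM messages
  sudo systemctl restart cassandra-a.service
  sudo journalctl -u cassandra-a --since '-10min' | grep -i 'OutOfMemory'
  # alternatively, a forced puppet run brings a stopped instance back up:
  sudo puppet agent --test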
2016-12-17
09:38 <elukey> ran apt-get clean and removed some /tmp files on stat1002 to free some space [production]
09:24 <elukey> restarted stuck hhvm on mw1168 (forgot to run hhvm-dump-debug) [production]
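The 09:38 cleanup is the routine low-disk remediation; a minimal sketch of that kind of pass (the exact files removed were not recorded):

  sudo apt-get clean                       # drop cached .deb archives
  sudo find /tmp -type f -mtime +30 -ls    # inspect old /tmp files before deleting
  df -h /                                  # verify the space recovered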
2016-12-16
15:13 <elukey> prometheus apache and hhvm exporters running on the eqiad MW appservers [production]
14:30 <elukey> disabling puppet on the eqiad appservers to gradually roll out the prometheus apache/hhvm exporters [production]
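The two entries above are the standard disable-then-batch pattern for a gradual rollout; a sketch in the salt syntax used elsewhere in this log (the targeting grains and batch size are assumptions):

  # stop puppet fleet-wide, then re-enable host by host in small batches
  sudo salt -C 'G@cluster:appserver and G@site:eqiad' cmd.run \
      'puppet agent --disable "prometheus exporter rollout"'
  sudo salt -C 'G@cluster:appserver and G@site:eqiad' -b 10% cmd.run \
      'puppet agent --enable && puppet agent --test'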
2016-12-15
07:40 <elukey> moved some home files on stat1002 to the data-tank partition to free some space [production]
2016-12-14
23:37 <elukey> sent an email to the owners of the biggest home directories on stat1002 [production]
2016-12-13
20:00 <elukey> uploaded prometheus-apache-exporter 0.3-1 to jessie-wikimedia main [production]
14:47 <elukey> testing prometheus-apache-exporter on mw2198 [production]
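Uploads to jessie-wikimedia, as in the 20:00 entry, normally go through reprepro on the apt host; a sketch with the repository path and .changes filename assumed:

  # hypothetical paths; distribution and version from the log entry
  sudo reprepro -b /srv/wikimedia include jessie-wikimedia \
      prometheus-apache-exporter_0.3-1_amd64.changes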
2016-12-07
17:25 <elukey> puppet run completed on mw1* hosts (10% batch-size) [production]
17:08 <elukey> Apache config changed on mw2*, tests look fine (apachectl -S does not show the vhost, apachectl -t is ok, apache-fast-test from tin is ok). Proceeding with eqiad [production]
16:54 <elukey> force puppet run on mw2* hosts (10% batch-size) [production]
16:47 <elukey> running puppet on some mw codfw appservers to check the new config [production]
16:41 <elukey> disabled puppet on mw1* hosts as prep step [production]
16:39 <elukey> removing bits.w.o VHost from mediawiki apache config (https://gerrit.wikimedia.org/r/#/c/305536) [production]
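The 16:39-17:25 sequence above is a full config rollout: disable puppet, land the change in one DC, validate, then batch the puppet runs. The validation named in the 17:08 entry amounts to:

  apachectl -t                  # syntax check of the merged configuration
  apachectl -S | grep -i bits   # confirm the bits vhost no longer appears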
2016-12-06
08:47 <elukey> restarting hhvm on mw1285 (hhvm debug in /tmp/hhvm.100918.bt) [production]
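hhvm-dump-debug is a local wrapper that grabs a backtrace before the restart destroys the evidence; the invocation below is an assumption about its usage:

  sudo /usr/local/bin/hhvm-dump-debug   # assumed to write /tmp/hhvm.<pid>.bt
  sudo systemctl restart hhvm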
2016-12-05
17:19 <elukey> restarting hhvm on mw1268 (hhvm-debug in /tmp/hhvm.16827.bt) [production]
17:16 <elukey> restarting hhvm on mw1285 (hhvm-debug in /tmp/hhvm.140129.bt) [production]
16:50 <elukey> added nagios process check alarms for varnishkafka-statsv and varnishkafka-eventlogging on cache::text hosts [production]
14:08 <elukey> depooling mw1239 for maintenance (T148421) [production]
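Depooling for maintenance, as in the 14:08 entry, goes through conftool; a sketch, with the selector syntax assumed:

  sudo confctl select 'name=mw1239.eqiad.wmnet' set/pooled=no
  # ... maintenance ...
  sudo confctl select 'name=mw1239.eqiad.wmnet' set/pooled=yes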
2016-12-02
08:37 <elukey> restarting hhvm (/usr/local/bin/restart-hhvm) on G@cluster:api_appserver and G@site:eqiad (batch 10%) [production]
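The entry above names its own targeting; reconstructed as a salt invocation (exact flag spelling assumed):

  sudo salt -C 'G@cluster:api_appserver and G@site:eqiad' -b 10% \
      cmd.run '/usr/local/bin/restart-hhvm'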
2016-12-01
14:46 <elukey> restarting kafka on kafka100[123] (EventBus) for openjdk upgrades [production]
14:19 <elukey> restarting kafka also on kafka2003 [production]
14:17 <elukey> restarting kafka on kafka200[12] for openjdk upgrades [production]
10:25 <elukey> removed the --debug flag from the puppet compiler options [production]
09:54 <elukey> added --debug to the puppet compiler options in Jenkins [production]
07:57 <elukey@tin> Finished deploy [analytics/pivot/deploy@0513a6e]: (no message) (duration: 00m 02s) [production]
07:57 <elukey@tin> Starting deploy [analytics/pivot/deploy@0513a6e]: (no message) [production]
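For the rolling kafka broker restarts above (14:17-14:46), each broker is typically restarted only after the previous one has caught back up; a sketch, with the ZooKeeper connection string a placeholder:

  sudo systemctl restart kafka
  # move to the next broker only once nothing is under-replicated
  kafka-topics.sh --zookeeper zk1.example.org:2181/kafka \
      --describe --under-replicated-partitions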
2016-11-29
12:05 <elukey> complete rolling restart of apache in eqiad [production]
11:48 <elukey> re-enable puppet on mw1* hosts and apply Apache config change (https://gerrit.wikimedia.org/r/#/c/314519) [production]
11:23 <elukey> disabled puppet on mw1* hosts as pre-step for https://gerrit.wikimedia.org/r/#/c/314519 [production]
2016-11-27
09:35 <elukey> removed all the unused files in /tmp on stat1002 after following up with the owner [production]
2016-11-26
15:35 <elukey> deleted tmp files on stat1002's /tmp partition because of disk space consumption. Will follow up with the owner. [production]
2016-11-25
08:52 <elukey> restarting Yarn and HDFS masters on analytics100[12] (Hadoop cluster) to complete the openjdk update [production]
2016-11-24
12:36 <elukey> launched preferred-replica-election to re-add kafka1022 among the topic partition leader brokers of the Analytics Kafka cluster (all metrics look good) [production]
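The election in the 12:36 entry hands partition leadership back to the preferred replicas on kafka1022; with the stock tooling of that Kafka generation (ZooKeeper address again a placeholder):

  kafka-preferred-replica-election.sh --zookeeper zk1.example.org:2181/kafka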
2016-11-21
17:29 <elukey> unmasked kafka* on kafka1022 after disk swap [production]
11:56 <elukey> restarted jobchron/runner on mw208[0-5] since systemd was reporting degradation (broken pipes in the journald logs) [production]
08:50 <elukey> rolling restart of hadoop-related java daemons on analytics* hosts due to openjdk update [production]
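Masking, as undone in the 17:29 entry, is what keeps systemd and puppet from starting a unit while hardware work is in progress:

  sudo systemctl unmask kafka.service
  sudo systemctl start kafka.service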
2016-11-18
08:33 <elukey> kafka1022 up and running with kafka* daemon masked and broken disk removed from fstab (we mount partitions in there using UUIDs) [production]
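Mounting by UUID, as noted above, keeps the surviving data partitions stable when the dead disk is pulled and device names shift; a representative /etc/fstab line (UUID and mount point are placeholders):

  UUID=0a1b2c3d-1111-2222-3333-444455556666  /var/spool/kafka/a  ext4  defaults,noatime  0  2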
2016-11-17
10:22 <elukey> cleanup on analytics1027 - Removed mysql-server-5.5 (not used) and ran apt autoremove (old kernels) [production]
09:19 <elukey> rebooting mc1019->mc1036 (memcached/redis servers, not taking any traffic) for kernel upgrades [production]
2016-11-11
10:51 <elukey> restored mw1284 to its normal settings [production]
10:05 <elukey> increasing apache log level on mw1284 (depooling, applying config manually, re-pooling with lower weight) for a 503 investigation [production]
2016-11-10
15:01 <elukey> restored mw1284 to its normal settings [production]
14:47 <elukey> de-pooling mw1284 to raise mod_proxy_fcgi log level manually (temporary for an ongoing investigation) [production]
09:43 <elukey> restarting druid daemons on druid100[123] for openjdk updates [production]
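The manual log-level bump in the 14:47 entry of 2016-11-10 uses Apache 2.4's per-module LogLevel; the temporary directive would look roughly like:

  # raise only mod_proxy_fcgi verbosity, leave everything else at the default
  LogLevel warn proxy_fcgi:trace2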
2016-11-09
14:13 <elukey> rebooting kafka1014.eqiad.wmnet for kernel and openjdk upgrades [production]