production SAL

2011-04-25 §
18:42	<RobH>	db19 and db20 back online (not in services as they have other issues)	[production]
18:39	<RobH>	db19 and db20 powering back up	[production]
18:25	<RobH>	virt4 experienced an accidental reboot when rebalancing power in the rack, my fault, not the hardware	[production]
18:12	<RobH>	rack b2 power rebalanced	[production]
18:01	<RobH>	db19 set to slave, depooled in db.php, no other services evident, shutting down (mysql stopped cleanly)	[production]
18:00	<RobH>	db20 shutdown	[production]
18:00	<RobH>	didn't log that i set up ports 11/38-40 for db19, db20, and snapshot4 on csw1-sdtpa. tested out fine and all my major configuration changes on network should be complete	[production]
17:56	<RobH>	ok, db20 and db19 are coming offline to relocate their rack location due to power distro issues	[production]
15:47	<RobH>	delay, not coming down yet, need more cables	[production]
15:46	<RobH>	db19 is coming down as well, it is depooled anyhow	[production]
15:46	<RobH>	db20 is coming down, ganglia aggregation for those hosts may be delayed until it is back online.	[production]
15:21	<RobH>	relocating snapshot4 into rack c2, it will be offline during this process	[production]
15:20	<RobH>	db43-db47 network setup, sites not down, yay me	[production]
15:10	<RobH>	being on csw1 makes robh nervous.	[production]
15:09	<RobH>	labeling and setting up ports on 11/33 through 11/37 on csw1-sdtpa for db43 through db47	[production]
14:47	<RobH>	fixed storage2 serial console (set it to higher rate, magically works, or it just fears me) and also confirmed its remote power control is functioning	[production]
14:42	<RobH>	stealing dataset1's known good scs connection to test storage2. dataset1 service will remain unaffected.	[production]
2011-04-24 §
21:30	<Ryan_Lane>	restarting apache on mobile1	[production]
15:35	<RobH>	swapping bad disk in db30, hotswap, should be fine	[production]
14:36	<RobH>	swapping out the management switch in c1-sdtpa. msw-c1-sdtpa will be offline, so the mgmt interfaces of servers in that rack will be offline. all normal services will remain unaffected.	[production]
2011-04-23 §
22:31	<RobH>	no drives display error leds, further investigation required	[production]
22:27	<RobH>	ms2 is having a bad drive investigated. if we do this right, it won't go down. if we don't, it will. it is a slave es server.	[production]
22:00	<RobH>	singer returned to operation, blog, techblog, survey, and secure returned to normal operation	[production]
21:52	<RobH>	singer is once again coming back down for drive replacement. This will take offline blog.wikimedia.org, techblog.wikimedia.org, survey.wikimedia.org, and secure.wikipedia.org. Service will be returned as soon as possible.	[production]
21:19	<RobH>	singer back online for a while, will come back down for further repair shortly.	[production]
21:05	<RobH>	singer going down, blogs will be offline, so will secure, system will return to service as soon as possible	[production]
21:00	<RobH>	preparing to fix the dead drive in singer, this will offline secure, blog, techblog, and survey during the drive replacement process	[production]
19:50	<mark>	Upgrading mr1-pmtpa to junos 10.4R3.4	[production]
17:49	<RobH>	migrating searchidx1 & search1-search10 to new ports in same rack. moving one at a time and ensuring link lights between moves. (already tested with search10)	[production]
14:11	<RobH>	db19 is back online, seems to not have any mysql setup done.	[production]
14:02	<RobH>	restarting db19	[production]
14:02	<RobH>	arcconf checks out all drives on db19 are indeed working as rich found earlier	[production]
12:47	<mark>	Added (x121Address=1) condition to the LDAP query of the ldap_aliases router on mchenry's exim	[production]
00:32	<hcatlin>	Mobile: Deploying fix to an issue that kept the standard-style Main_Page from displaying on mobile	[production]
00:25	<Ryan_Lane>	restarting memcached on all of the mobile servers	[production]
00:23	<Ryan_Lane>	repooling mobile3, since mobile will die without it (fun!!)	[production]
00:17	<Ryan_Lane>	depooling mobile3	[production]
00:13	<Ryan_Lane>	restarting apache on mobile3	[production]
00:10	<Ryan_Lane>	puppet was broken on mobile1, reinstalled it	[production]
2011-04-22 §
23:56	<domas>	detached gdb from srv193 apache, apparently it was used for something	[production]
23:14	<notpeter>	restarting nagios (again)	[production]
22:43	<notpeter>	restarting nagios	[production]
19:23	<apergos>	shot all stopped rsyncs on ms5 (that were copying from ms4 about two weeks ago), changed all perms on the directories they had reached so thumbs can be served/read from them.. oh. not me, someone else must have done it, I'm not here :-P	[production]
19:02	<RobH>	ms4 shutting down for memory troubleshooting	[production]
18:52	<RobH>	ms4 troubleshooting, disregard bounces	[production]
18:51	<notpeter>	restarting nagios	[production]
12:41	<hcatlin>	Restarting mobile cluster with April code update.	[production]
00:49	<notpeter>	restarting nagios. hopefully now with more sms!	[production]
2011-04-21 §
23:32	<midom>	synchronized php-1.17/includes/ImagePage.php	[production]