2011-04-25
§
|
18:25 |
<RobH> |
virt4 experienced an accidental reboot while power was being rebalanced in the rack; my fault, not the hardware |
[production] |
18:12 |
<RobH> |
rack b2 power rebalanced |
[production] |
18:01 |
<RobH> |
db19 set to slave, depooled in db.php, no other services evident, shutting down (mysql stopped cleanly) |
[production] |
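Depooling a DB slave in db.php usually amounts to zeroing out or commenting out its load entry so the MediaWiki load balancer stops sending it queries. A hypothetical sketch, since the actual production db.php layout may differ:

```php
// Hypothetical sketch of a db.php depool; the real array
// structure and server names in production may differ.
$sectionLoads = array(
    's1' => array(
        'db18' => 100,
        // 'db19' => 100,  // depooled for rack relocation
        'db20' => 100,
    ),
);
```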
18:00 |
<RobH> |
db20 shutdown |
[production] |
18:00 |
<RobH> |
didn't log that I set up ports 11/38-40 for db19, db20, and snapshot4 on csw1-sdtpa. tested out fine, and all my major network configuration changes should be complete |
[production] |
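On a Foundry/Brocade-class switch such as csw1-sdtpa, per-port setup of this kind is typically just naming and enabling each interface. A hypothetical sketch (the exact CLI syntax, and any VLAN assignment, are assumptions):

```
! Hypothetical FastIron-style port setup for 11/38-11/40;
! the real csw1-sdtpa config may differ.
interface ethernet 11/38
 port-name db19
 enable
interface ethernet 11/39
 port-name db20
 enable
interface ethernet 11/40
 port-name snapshot4
 enable
```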
17:56 |
<RobH> |
ok, db20 and db19 are coming offline to be relocated to a new rack location due to power distribution issues |
[production] |
15:47 |
<RobH> |
delay, not coming down yet, need more cables |
[production] |
15:46 |
<RobH> |
db19 is coming down as well, it is depooled anyhow |
[production] |
15:46 |
<RobH> |
db20 is coming down, ganglia aggregation for those hosts may be delayed until it is back online. |
[production] |
15:21 |
<RobH> |
relocating snapshot4 into rack c2, it will be offline during this process |
[production] |
15:20 |
<RobH> |
db43-db47 network setup, sites not down, yay me |
[production] |
15:10 |
<RobH> |
being on csw1 makes robh nervous. |
[production] |
15:09 |
<RobH> |
labeling and setting up ports on 11/33 through 11/37 on csw1-sdtpa for db43 through db47 |
[production] |
14:47 |
<RobH> |
fixed storage2 serial console (set it to a higher baud rate, magically works, or it just fears me) and also confirmed its remote power control is functioning |
[production] |
14:42 |
<RobH> |
stealing dataset1's known good scs connection to test storage2. dataset1 service will remain unaffected. |
[production] |
2011-04-23
§
|
22:31 |
<RobH> |
no drives display error LEDs, further investigation required |
[production]
22:27 |
<RobH> |
ms2 is having a bad drive investigated. if we do this right, it won't go down; if we don't, it will. it is a slave ES server. |
[production] |
22:00 |
<RobH> |
singer returned to operation; blog, techblog, survey, and secure are back to normal operation |
[production] |
21:52 |
<RobH> |
singer is once again coming back down for drive replacement. This will take blog.wikimedia.org, techblog.wikimedia.org, survey.wikimedia.org, and secure.wikimedia.org offline. Service will be restored as soon as possible. |
[production] |
21:19 |
<RobH> |
singer back online for a while; it will come back down for further repair shortly. |
[production] |
21:05 |
<RobH> |
singer going down; blogs will be offline, as will secure. system will return to service as soon as possible |
[production] |
21:00 |
<RobH> |
preparing to fix the dead drive in singer; this will take secure, blog, techblog, and survey offline during the drive replacement process |
[production] |
19:50 |
<mark> |
Upgrading mr1-pmtpa to junos 10.4R3.4 |
[production] |
17:49 |
<RobH> |
migrating searchidx1 & search1-search10 to new ports in same rack. moving one at a time and ensuring link lights between moves. (already tested with search10) |
[production] |
14:11 |
<RobH> |
db19 is back online, but seems not to have any MySQL setup done. |
[production] |
14:02 |
<RobH> |
restarting db19 |
[production] |
14:02 |
<RobH> |
arcconf checks out; all drives on db19 are indeed working, as Rich found earlier |
[production] |
12:47 |
<mark> |
Added (x121Address=1) condition to the LDAP query of the ldap_aliases router on mchenry's exim |
[production] |
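Adding a condition like (x121Address=1) to an LDAP-driven alias router means AND-ing it into the router's lookup filter. A hypothetical sketch of such a router, since the real exim configuration on mchenry is not shown here:

```
# Hypothetical exim redirect router; the base DN, attribute
# names, and everything other than (x121Address=1) are
# assumptions for illustration.
ldap_aliases:
  driver = redirect
  data = ${lookup ldap {ldap:///ou=aliases,o=example?mail?sub?\
         (&(uid=$local_part)(x121Address=1))}}
```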
00:32 |
<hcatlin> |
Mobile: Deploying fix to an issue that kept the standard-style Main_Page from displaying on mobile |
[production] |
00:25 |
<Ryan_Lane> |
restarting memcached on all of the mobile servers |
[production] |
00:23 |
<Ryan_Lane> |
repooling mobile3, since mobile will die without it (fun!!) |
[production] |
00:17 |
<Ryan_Lane> |
depooling mobile3 |
[production] |
00:13 |
<Ryan_Lane> |
restarting apache on mobile3 |
[production] |
00:10 |
<Ryan_Lane> |
puppet was broken on mobile1, reinstalled it |
[production] |
2011-04-22
§
|
23:56 |
<domas> |
detached gdb from srv193 apache, apparently it was used for something |
[production] |
23:14 |
<notpeter> |
restarting nagios (again) |
[production] |
22:43 |
<notpeter> |
restarting nagios |
[production] |
19:23 |
<apergos> |
shot all stopped rsyncs on ms5 (that were copying from ms4 about two weeks ago), changed all perms on the directories they had reached so thumbs can be served/read from them. oh, not me, someone else must have done it, I'm not here :-P |
[production] |
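The cleanup described above can be sketched in shell: kill rsync processes in the stopped (T) state, then open read/traverse permissions on the directory tree so thumbs can be served. Paths here are scratch-directory stand-ins, not the real ms5 layout:

```shell
#!/bin/sh
# Sketch of the ms5 cleanup; hypothetical stand-in paths.

# Kill rsyncs in process state T (stopped); on ms5 these were
# the stalled copies from ms4.
for pid in $(ps -eo pid=,stat=,comm= | awk '$2 ~ /^T/ && $3 == "rsync" {print $1}'); do
    kill -9 "$pid"
done

# Fix perms so thumbs can be read: a+rX makes directories
# traversable and files readable without marking files executable.
THUMB_ROOT=$(mktemp -d)           # stand-in for the real thumb tree
mkdir -p "$THUMB_ROOT/a/ab"
chmod -R u=rwx,go= "$THUMB_ROOT"  # simulate the restrictive perms
chmod -R a+rX "$THUMB_ROOT"
ls -ld "$THUMB_ROOT/a/ab"
```

The capital X in `a+rX` is the key detail: it grants execute (traverse) only on directories and on files already executable, so plain files stay non-executable.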
19:02 |
<RobH> |
ms4 shutting down for memory troubleshooting |
[production] |
18:52 |
<RobH> |
ms4 troubleshooting, disregard bounces |
[production] |
18:51 |
<notpeter> |
restarting nagios |
[production] |
12:41 |
<hcatlin> |
Restarting mobile cluster with April code update. |
[production] |
00:49 |
<notpeter> |
restarting nagios. hopefully now with more sms! |
[production] |