2017-04-19
§
|
06:52 |
<_joe_> |
artificially stopping slave replication on rdb2001 for a final test of the switchover redis stage |
[production] |
03:53 |
<urandom> |
T163292: Starting removal of Cassandra instance restbase1018-b.eqiad.wmnet |
[production] |
03:49 |
<mobrovac@tin> |
Started restart [restbase/deploy@1bfada4]: (no justification provided) |
[production] |
03:40 |
<mobrovac@tin> |
Started restart [restbase/deploy@1bfada4]: Kick RB to pick up that the restbase1018 instances are gone |
[production] |
03:32 |
<mobrovac@tin> |
Finished deploy [changeprop/deploy@a19ebf8]: Temp: Decrease the transclusion update from 400 to 200 for T163292 (duration: 00m 53s) |
[production] |
03:31 |
<mobrovac@tin> |
Started deploy [changeprop/deploy@a19ebf8]: Temp: Decrease the transclusion update from 400 to 200 for T163292 |
[production] |
01:58 |
<mutante> |
naos: rsyncd is of course legitimately running on a deployment server separate from this (unlike in other cases where we used it for syncing during migration), so this was just the one config fragment for /home and not removing the service or anything |
[production] |
01:56 |
<mutante> |
naos: manually deleting rsyncd config remnants (puppet wouldn't know to clean up after itself) |
[production] |
01:47 |
<mutante> |
rsyncing /home from mira to naos (T162900) |
[production] |
01:21 |
<urandom> |
T163292: Starting removal of Cassandra instance restbase1018-a.eqiad.wmnet |
[production] |
2017-04-18
§
|
23:04 |
<dzahn@puppetmaster1001> |
conftool action : set/pooled=no; selector: name=restbase1018.eqiad.wmnet |
[production] |
23:02 |
<mutante> |
ms1001 - deleting old GlobalCert SSL cert for dumps.wm that was about to expire and is replaced by Letsencrypt |
[production] |
22:30 |
<mutante> |
ocg1003 gzipping ocg.log for disk space |
[production] |
21:12 |
<bblack@neodymium> |
conftool action : set/pooled=yes; selector: name=cp2002.codfw.wmnet,service=varnish-be |
[production] |
20:36 |
<bblack@neodymium> |
conftool action : set/pooled=no; selector: name=cp2002.codfw.wmnet,service=varnish-be |
[production] |
17:26 |
<mobrovac@tin> |
Finished deploy [restbase/deploy@1bfada4]: Blacklist all user pages on commons (duration: 07m 12s) |
[production] |
17:26 |
<ssastry@tin> |
Finished deploy [parsoid/deploy@b067328]: Deploying Parsoid to bump heap limits to 900m (from 600m) (duration: 06m 25s) |
[production] |
17:19 |
<ssastry@tin> |
Started deploy [parsoid/deploy@b067328]: Deploying Parsoid to bump heap limits to 900m (from 600m) |
[production] |
17:19 |
<mobrovac@tin> |
Started deploy [restbase/deploy@1bfada4]: Blacklist all user pages on commons |
[production] |
17:12 |
<XenoRyet> |
updated tools from a8b8d7242799b61dd2a48ef4e804164cd1818bc9 to a1e9342e093a85032255fc1d9904db7df13680b7 |
[production] |
17:09 |
<elukey> |
restart nutcracker in codfw (profile::mediawiki::nutcracker) to make sure that all the daemons are running with the latest config |
[production] |
16:26 |
<bblack> |
completed Traffic-layer portions of codfw switchover ( https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Switchover_2 ) |
[production] |
16:21 |
<bblack> |
starting Traffic-layer portions of codfw switchover ( https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Switchover_2 ) |
[production] |
16:15 |
<jynus> |
reimporting some rows to dbstore1002 on jawiki and ruwiki T160509 |
[production] |
16:12 |
<godog> |
reboot tin to fix cpu mhz issue and check bios settings - T163158 |
[production] |
16:09 |
<mobrovac@tin> |
Finished deploy [restbase/deploy@960b468]: Blacklist an enwiki and a commons page (duration: 08m 16s) |
[production] |
16:01 |
<mobrovac@tin> |
Started deploy [restbase/deploy@960b468]: Blacklist an enwiki and a commons page |
[production] |
16:00 |
<mobrovac@tin> |
Finished deploy [restbase/deploy@960b468]: Dev Cluster: Blacklist an enwiki and a commons page (duration: 01m 42s) |
[production] |
15:58 |
<mobrovac@tin> |
Started deploy [restbase/deploy@960b468]: Dev Cluster: Blacklist an enwiki and a commons page |
[production] |
15:20 |
<elukey> |
restored default output-buffer config for rdb2005:6479 |
[production] |
15:08 |
<godog> |
puppet-run on cache_upload in codfw/eqiad to pick up swift a/p changes |
[production] |
15:02 |
<godog> |
puppet-run on cache_upload in codfw/eqiad to pick up switch a/a changes |
[production] |
15:02 |
<gehel> |
upgrading elastic2020 to elasticsearch 5.1.2 |
[production] |
14:55 |
<_joe_> |
switchover of services, misc things done |
[production] |
14:54 |
<oblivian:> |
Setting restbase-async in codfw DOWN |
[production] |
14:54 |
<oblivian:> |
Setting restbase-async in eqiad UP |
[production] |
14:43 |
<_joe_> |
switching traffic for all a/a services plus maps and restbase to codfw-only |
[production] |
14:38 |
<_joe_> |
forcing puppet run on caches for catching up with the a/a setting of maps and restbase |
[production] |
14:33 |
<oblivian:> |
Setting restbase in eqiad DOWN |
[production] |
14:33 |
<_joe_> |
starting switchover of services eqiad => codfw; external traffic will be switched over, as well as internal traffic to restbase |
[production] |
14:25 |
<gehel> |
un-ban elastic2020 to get ready for real-life test during switchover - T149006 |
[production] |
14:22 |
<elukey> |
executed config set client-output-buffer-limit "normal 0 0 0 slave 2147483648 2147483648 300 pubsub 33554432 8388608 60" on rdb2005:6479 as an attempt to solve slave lagging - T159850 |
[production] |
14:21 |
<oblivian:> |
Setting mobileapps in eqiad UP |
[production] |
14:14 |
<oblivian:> |
Setting mobileapps in eqiad DOWN |
[production] |
14:11 |
<elukey> |
executed CONFIG SET appendfsync everysec (default) to restore defaults on rdb2005:6479 - T159850 |
[production] |
14:08 |
<switchdc> |
(oblivian@sarin) END TASK - switchdc.stages.t09_restart_parsoid(codfw, eqiad) Successfully completed |
[production] |
14:04 |
<elukey> |
executed CONFIG SET appendfsync no on rdb2005:6479 to test if fsync stalls affect replication - T159850 |
[production] |
13:50 |
<switchdc> |
(oblivian@sarin) START TASK - switchdc.stages.t09_restart_parsoid(codfw, eqiad) Rolling restart parsoid in eqiad and codfw |
[production] |
13:35 |
<switchdc> |
(oblivian@sarin) END TASK - switchdc.stages.t01_stop_maintenance(codfw, eqiad) Failed to execute |
[production] |
13:35 |
<switchdc> |
(oblivian@sarin) START TASK - switchdc.stages.t01_stop_maintenance(codfw, eqiad) Stop MediaWiki maintenance in the old master DC |
[production] |