Upgrade plan, starting with codfw cluster. Feel free to edit as necessary to better reflect reality, lessons learned while performing initial beta cluster migration, etc.
25-27 May, after wmf.3 branch cut:
- Update warmers on current 1.7 cluster
- Record the output of curl search.svc.codfw.wmnet/_cluster/settings to be re-applied
- Update plugins repository. Do not sync out to prod
- Pull new plugins to beta cluster
- Install elasticsearch 2.3.3 deb to beta cluster
- Bring down beta cluster elasticsearch servers
- Bring full cluster back up. All servers are master capable.
- Re-apply transient cluster settings recorded above
- At this point we are running new elasticsearch on old CirrusSearch code. Ensure writes are still going through.
- Merge es2.x branch to master (CirrusSearch, Elastica, and vendor). Merge individual patches for ApiFeatureUsage, Translate and GeoData. Push to gerrit and merge
- Manual testing. Ensure daily browser tests still pass.
- TODO: Is it possible to do more testing with beta cluster this week?
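The record and re-apply steps in the list above could be scripted roughly as follows. This is a minimal sketch, not the procedure actually used: it assumes the cluster answers on the default HTTP port 9200, and the helper names (`fetch_settings`, `transient_payload`, `reapply_transient`) are hypothetical. It relies on the cluster settings API returning a JSON object with `persistent` and `transient` keys.

```python
import json
import urllib.request

# Assumed endpoint: hostname from the plan, default Elasticsearch HTTP port.
SETTINGS_URL = "http://search.svc.codfw.wmnet:9200/_cluster/settings"


def fetch_settings(url=SETTINGS_URL):
    """Return the current cluster settings as a dict."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def transient_payload(recorded):
    """Build the request body that re-applies only the transient settings."""
    return {"transient": recorded.get("transient", {})}


def reapply_transient(recorded, url=SETTINGS_URL):
    """PUT the recorded transient settings back after the full restart."""
    body = json.dumps(transient_payload(recorded)).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

The output of `fetch_settings()` would be saved to a file before the cluster is brought down, since transient settings do not survive a full cluster restart, and then passed to `reapply_transient()` once the cluster is back up.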
Monday, May 30:
- Prepare and merge patch disabling completion suggester index rebuilds (not needed: the elastic version check will prevent the script from running)
- Prepare mediawiki-config patch to send wmf.4 (expected to be cut May 31) search traffic to codfw cluster. https://gerrit.wikimedia.org/r/291257
- SWAT out the config patch sending wmf.4 searches to codfw with the Monday evening (SF) SWAT
- Pull plugins repository on terbium and sync out to all elasticsearch servers
- eqiad systems *must not* be restarted after this has been done
- Copy ttmserver index from eqiad to codfw (16 to 20 min with adhoc bash script on terbium)
- Install elasticsearch 2.3.3 deb to all elasticsearch servers in codfw
- Bring full codfw cluster down
- Start all codfw master nodes
- Note that once a node has been started under 2.x we will no longer be able to restart that node as 1.7 without losing all the data.
- After a master has been decided bring up all codfw data nodes
- Ensure codfw cluster is green and that writes are draining from the job queue into the cluster
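The "cluster is green" check could be automated with a small poller against the cluster health API. This is a sketch under assumptions: the port (9200), the polling interval, the attempt count, and the function names are all illustrative, and it does not cover the job queue draining check, which still needs to be verified separately.

```python
import json
import time
import urllib.request

# Assumed endpoint: hostname from the plan, default Elasticsearch HTTP port.
HEALTH_URL = "http://search.svc.codfw.wmnet:9200/_cluster/health"


def is_green(health):
    """True when the cluster health document reports status green."""
    return health.get("status") == "green"


def wait_for_green(url=HEALTH_URL, interval=30, attempts=120, fetch=None):
    """Poll cluster health until green; return False if attempts run out."""
    if fetch is None:
        def fetch():
            with urllib.request.urlopen(url) as resp:
                return json.loads(resp.read().decode("utf-8"))
    for attempt in range(attempts):
        try:
            if is_green(fetch()):
                return True
        except OSError:
            # Nodes may still be starting; treat connection errors as "not yet".
            pass
        if attempt < attempts - 1:
            time.sleep(interval)
    return False
```

The `fetch` parameter exists only so the polling logic can be exercised without a live cluster; in normal use it is left unset.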
Tues:
- Test on testwiki after train rolls forward. Monitor logs to ensure everything seems sane
Wed:
- Be available during train rollout
- Same as Tuesday, but on the non-Wikipedia sites: test after the train rolls forward and monitor logs to ensure everything seems sane
Thurs:
- Be available during train deploy for any potential issues.
- Verify via grafana that all search traffic is on codfw.
- Prepare and merge patch enabling completion suggester index rebuilds
Fri/Mon:
- Prepare mediawiki-config patch to remove wmf.4 hacks and point all search traffic back at eqiad
- Follow same process to upgrade eqiad cluster as was used for codfw
- Ensure eqiad cluster is green and that writes are draining from the job queue into the cluster
Alternative ideas rejected:
- Considered doing the upgrade outside the train, with a big "flip the switch" sending all traffic to codfw. Rejected as more dangerous than necessary.
Expected errors:
- If the 1.7 code is talking to a 2.x cluster (index updates), any exception such as the normal DocumentMissingException will cause an Array to string conversion notice. These are acceptable.
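While monitoring logs during the rollout, it may help to separate the expected notice from everything else. The following is only a sketch: the marker string is the PHP notice text named above, but the shape of the log lines and the function name are assumptions, and real log entries may need a stricter match.

```python
# Assumed marker: the text of the PHP notice described above.
EXPECTED_MARKER = "Array to string conversion"


def split_expected(lines, marker=EXPECTED_MARKER):
    """Separate expected notices from everything else that needs review."""
    expected, unexpected = [], []
    for line in lines:
        (expected if marker in line else unexpected).append(line)
    return expected, unexpected
```

Only the second list returned would need a closer look during the period when 1.7 code is writing to a 2.x cluster.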