Story
Resolution: Done
Major
OSSM 2.0.0
None
3
False
False
Undefined
After more consideration, I think we do need a section of our documentation on how to upgrade OSSM, as it differs slightly from the upstream guidance (https://istio.io/latest/docs/setup/upgrade/).
In particular, we don't support the use of canary deployments, for a variety of reasons (which Kevin/Rob articulate below). As this is a place where we differ from upstream Istio (but currently don't document), we should document it here, along with the reasons we don't support canary rollouts. This question comes up every now and then with customers.
This would cover minor upgrades (2.0 -> 2.1 -> 2.2, etc.), so for 2.0 we can simply point to this guide. For a major version upgrade, we wouldn't have backward compatibility and would provide a specific migration guide instead.
I could see this having the following sections:
- Upgrading the Control Plane (Service Mesh, Kiali, Jaeger)
  - In Place Upgrades
    - This section may be very short. The 1.0 -> 1.1 release notes content provides some guidance: https://docs.openshift.com/container-platform/4.5/service_mesh/v1x/servicemesh-release-notes.html#ossm-manual-updates-1.0-1.1_ossm-release-notes-v1x
  - Canary Rollouts and why we don't support them.
- Upgrading the Data Plane (Envoy Proxies)
  - Pods need to be bounced to upgrade Envoy (see the sketch after this list).
  - Considerations to avoid service disruptions (i.e. have multiple instances; K8s will provide a rolling upgrade by default).
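For the data plane section, the following is a rough sketch of the guidance I have in mind (the workload name, image, and namespace are placeholders, not from an existing doc): the Envoy sidecar is injected when a pod is created, so after an in-place control plane upgrade the pods have to be recreated to pick up the new proxy, and with at least two replicas and the default rolling update strategy that restart should not cause an outage.

    # Hypothetical Deployment, for illustration only. The old Envoy sidecars remain
    # until the pods are recreated, e.g. with `oc rollout restart deployment/myapp`.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: myapp
      namespace: bookinfo
    spec:
      replicas: 2                  # more than one instance so the restart causes no outage
      strategy:
        type: RollingUpdate        # Kubernetes default: pods are replaced a few at a time
        rollingUpdate:
          maxUnavailable: 0        # keep full capacity while pods with the new sidecar start
          maxSurge: 1
      selector:
        matchLabels:
          app: myapp
      template:
        metadata:
          labels:
            app: myapp
          annotations:
            sidecar.istio.io/inject: "true"   # opt the pod into sidecar injection
        spec:
          containers:
          - name: myapp
            image: quay.io/example/myapp:1.0  # placeholder image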
For the "Why we we don't support canary roll outs, this conversation with Rob/Kevin provides some context (needs to be word-smithed/cleansed - hence this is for internal eyes):
Kevin:
We don't support it for a number of reasons:
- IIRC the latest version of Istio we are using may not support it (or if it does, it was new and error-prone)
- we favour in-place upgrades (not canary)
- there are still conversations upstream as to whether canary works or even whether it's desirable
__
The short version is that certain engineers upstream do not believe in retaining backwards compatibility (it's hard) and would like to make whatever changes they feel like without having to address the consequences. I'm not yet sure where this will end up.
Rob:
I just want to reiterate what Kev said, and highlight that they're still fixing issues with revision-based upgrades in 1.10. Ironically, every member of the TOC _not working for Google_ advocated for in-place upgrades, and their own user survey showed over half of the respondents preferred in-place upgrades, while only about 10% preferred revision-based (canary) upgrades.
__
Some additional technical reasons are that upstream is unsure whether they will break their Envoy integration, so they worry whether new config will work with old proxies. I think they try to solve that with revisions by forcing you to update the revision label on your application pods, which causes a new rollout of the deployment.
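(For context on the mechanism Rob describes above, upstream's revision-based flow looks roughly like the following; the namespace and revision names are hypothetical, and this is exactly the flow we do not support.)

    # Upstream Istio's revision-based (canary) mechanism, sketched only for context.
    # The namespace is pointed at a specific control plane revision, and workloads
    # only pick up the new proxy/config once their pods are recreated.
    apiVersion: v1
    kind: Namespace
    metadata:
      name: bookinfo                 # hypothetical application namespace
      labels:
        istio.io/rev: "1-10-0"       # hypothetical revision name of the new control plane
        # the plain `istio-injection: enabled` label has to be removed first, since
        # it takes precedence over the revision label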
__
I think there's also a perception, both upstream and among users, that new releases may not work with existing configuration/applications or may just be too buggy, so doing a canary rollout may help limit the extent of the damage. I think this speaks to a low level of general quality, coupled with a lack of forethought regarding the consequences of any code changes. As Kev mentioned, there is little thought given when changing things. Ironically, they are actually pretty strict when it comes to changing any API. The only problem is that this only applies to configuration CRDs and not much else (e.g. behavior, environment variables, command line arguments), and some CRDs contain raw maps (specifically MeshConfig), so they can work around not changing the "API" that way (or just add more environment variables for config). Quite honestly, I think this is borne of a lack of experience delivering software to end users and a blatant disregard for impacts on end users. Revisions/canary upgrades appear to help mitigate the effects of this. In my opinion, this is just dumping the problem on the user.
__
That said, we may be running a bit loose with this, but I think in-place upgrades are much less work for users. My opinion is that we should focus our efforts on making Istio more robust and ensuring in-place upgrades work. (We may also want to consider rollback at some point, which we support through rolling back the operator, but I don't know how well that's been tested, specifically the effects on availability.) FWIW, we do attempt to scan the user's install, configuration, and applications to validate whether or not an upgrade is feasible. For example, we won't let you upgrade from 1.0 to 1.1 if you're using legacy Mixer resources or using port 443 for an HTTP or HTTP/2 endpoint, neither of which is supported in 1.1. Part of the issue with supporting this for 2.0, in addition to the major architectural differences, was that some resources were deprecated between 1.1 and 2.0 and the replacements were not available in 1.1 (i.e. users couldn't migrate existing apps to use non-deprecated features before upgrading).
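(To make the pre-upgrade validation example above concrete, this is the shape of a resource that the 1.0 -> 1.1 check rejects; the names are placeholders.)

    # Hypothetical Gateway, for illustration only: an HTTP (or HTTP2) endpoint on
    # port 443 is not supported in 1.1, so the in-place upgrade from 1.0 is refused
    # while a resource like this exists.
    apiVersion: networking.istio.io/v1alpha3
    kind: Gateway
    metadata:
      name: example-gateway
      namespace: bookinfo
    spec:
      selector:
        istio: ingressgateway
      servers:
      - port:
          number: 443
          name: http
          protocol: HTTP           # HTTP on 443 blocks the upgrade; use HTTPS or a different port
        hosts:
        - "*"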
- is related to:
  - OSSM-2717 OpenShift Service Mesh documentation is missing information on operator upgrades (Closed)
  - OSSM-2954 Document Operator version scheme (Closed)
  - OSSM-2805 Support Multiple Control Planes within a Single Mesh (Closed)
- relates to:
  - OSSM-2949 Updating the control plane version (Closed)
- links to