-
Bug
-
Resolution: Unresolved
-
Normal
-
None
-
4.18, 4.19
-
None
-
False
-
Description of problem:
The vertical scaling E2E test fails when etcd membership unexpectedly changes from 3 voting members to 2, so the test cannot confirm that scale-down did not occur before scale-up while cluster membership is healthy.
The following E2E test fails intermittently with the error "scale-down should not have happened before scale up when cluster membership is healthy":
[sig-etcd][Feature:EtcdVerticalScaling][Suite:openshift/etcd/scaling][Serial] etcd is able to vertically scale up and down when CPMS is disabled [apigroup:machine.openshift.io]
The test failed in the following four CI jobs:
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-vsphere-ovn-etcd-scaling/1861800921579655168
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-etcd-scaling/1861800663244083200
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-aws-ovn-etcd-scaling/1858992886654177280
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.19-e2e-gcp-ovn-etcd-scaling/1864066804607881216
This E2E test covers a basic vertical scaling scenario when CPMS is disabled to validate that the scale-down does not happen before the scale-up event when cluster membership is healthy. This test has the following steps:
- Disable the CPMS if it is active
- Delete a machine
- Wait for 2 minutes to ensure the voting member count remains at 3 after the machine is deleted and before a new machine is added, verifying that scale-down does not occur before scale-up while cluster membership is healthy
- Create a new master machine and ensure it is running
- Validate scale-down by confirming the member removal and the change in cluster membership
The test fails at step 3 (the 2-minute wait): the voting member count unexpectedly changes to 2 because the member is removed.
Reason
This case accidentally transitions into an unhealthy case, resulting in the removal of the member. The logs show that a new revision is triggered: 'StartingNewRevision' new revision 14 triggered by "required secret/etcd-all-certs has changed". If the revision is rolled out to the member being deleted while the health check is performed on it (while waiting in step 3), the member is reported as unhealthy, resulting in its unintended removal.
The secret/etcd-all-certs changes when a node is added to or removed from the cluster. In all four CI jobs mentioned above, the test "etcd is able to vertically scale up and down when CPMS is disabled" is run after the test "etcd is able to vertically scale up and down with a single node". During the first test, the removal of nodes updates "secret/etcd-all-certs", triggering a new revision at the end of the test. This revision rollout carries over into the second test. If the revision has not yet been rolled out to the member being deleted before step 3, but is rolled out while waiting during step 3, the member is reported as unhealthy and removed.
A new revision is triggered later in the test, after the node is removed, because "secret/etcd-all-certs" is updated. This is due to the following change: https://github.com/openshift/cluster-etcd-operator/commit/3eff7415334c8b7860f160a1f87cd2a16ad1a513#diff-273071b77ba329777b70cb3c4d3fb2e33bc8abf45cb3da28cbee512d591ab9eeR195-R197 which implements a gate to avoid triggering leaf cert generation in the same static pod revision as an update to the signer certificates (and their respective bundles). The revision therefore misses WaitForAPIServerToStabilizeOnTheSameRevision and is carried over to the next test.
The same issue can mostly be observed in runs where "etcd is able to vertically scale up and down with a single node" is run after "etcd is able to vertically scale up and down when CPMS is disabled", as in this CI job run: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.18-e2e-gcp-ovn-etcd-scaling/1859640342366654464. The issue does not surface as a test failure there, because the former test deletes a machine when CPMS is enabled and only validates a voting member count of three at the end of the test. However, the logs show that the member being deleted is removed before the new member is added.
- is related to
-
OCPBUGS-43379 etcd-scaling jobs failing ~60% of the time
- New