Details
-
Bug
-
Resolution: Done
-
Major
-
None
-
4.11
-
Moderate
-
1
-
Sprint 235
-
1
-
Rejected
-
Unspecified
Description
Description of problem: On a 4.11 cluster, set the liveness probe and readiness probe timeouts of the router deployment in the openshift-ingress namespace to 5s, then downgrade the cluster to 4.10. The timeouts are expected to revert to the default of 1s.
The timeouts do revert, but more than 5 hours later the downgrade is still stuck at "waiting on ingress".
OpenShift release version:
Cluster Platform:
cluster access info: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/96936/
How reproducible:
configure the timeouts of the liveness and readiness probes, then downgrade the cluster
Steps to Reproduce (in detail):
1. configure the timeouts of the liveness and readiness probes
% oc -n openshift-ingress patch deploy/router-default --type=strategic --patch='{"spec":{"template":{"spec":{"containers":[{"name":"router","livenessProbe":{"timeoutSeconds":5},"readinessProbe":{"timeoutSeconds":5}}]}}}}'
Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false (container "router" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "router" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "router" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "router" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
deployment.apps/router-default patched
%
2. check the configured timeouts of the liveness and readiness probes
% oc -n openshift-ingress get deploy/router-default -o yaml | grep -A8 nessProbe:
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /healthz
    port: 1936
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
--
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /healthz/ready
    port: 1936
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
%
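The grep check above can be reduced to one line per probe, which makes it easier to compare the values before and after the downgrade. The following is only a sketch: `probe_timeouts` is a hypothetical helper name, not part of `oc`, and it assumes the YAML layout shown above, where `timeoutSeconds` follows the probe key it belongs to.

```shell
# probe_timeouts (hypothetical helper): read deployment YAML on stdin and
# print "<probe name> <timeoutSeconds>" for each liveness/readiness probe.
# Assumes timeoutSeconds appears after the probe key, as in the output above.
probe_timeouts() {
  awk '
    # Remember which probe block we are in, stripping indentation and the colon.
    /^[ \t]*(liveness|readiness)Probe:/ { gsub(/[ \t:]/, "", $0); probe = $0; next }
    # The first timeoutSeconds after a probe key belongs to that probe.
    probe != "" && /^[ \t]*timeoutSeconds:/ { print probe, $2; probe = "" }
  '
}

# Intended usage against a live cluster (not executed here):
#   oc -n openshift-ingress get deploy/router-default -o yaml | probe_timeouts
```

After step 1 this would be expected to report 5 for both probes, and 1 for both once the 4.10 ingress operator has reconciled the deployment.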
3. downgrade the cluster to 4.10.0-0.nightly-2022-04-24-083512
% oc patch clusterversion/version --patch '{"spec":{"upstream":"https://amd64.ocp.releases.ci.openshift.org/graph"}}' --type=merge
clusterversion.config.openshift.io/version patched
%
% oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-04-24-083512 --allow-explicit-upgrade=true --force
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2022-04-24-083512
%
4. run oc get clusterversion from time to time; the downgrade appears to be stuck at "waiting on ingress"
% oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-04-24-135651 True True 3m39s Working towards 4.10.0-0.nightly-2022-04-24-083512: 95 of 771 done (12% complete)
%
% oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-04-24-135651 True True 31m Unable to apply 4.10.0-0.nightly-2022-04-24-083512: an unknown error has occurred: MultipleErrors
%
% oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-04-24-135651 True True 36m Working towards 4.10.0-0.nightly-2022-04-24-083512: 610 of 771 done (79% complete)
%
% oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-04-24-135651 True True 53m Working towards 4.10.0-0.nightly-2022-04-24-083512: 611 of 771 done (79% complete), waiting on ingress
%
% oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.11.0-0.nightly-2022-04-24-135651 True True 5h30m Working towards 4.10.0-0.nightly-2022-04-24-083512: 611 of 771 done (79% complete), waiting on ingress
%
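To avoid reading the STATUS column by eye on every poll, the operator the update is blocked on can be pulled out of the output. This is a minimal sketch, assuming the status text keeps the ", waiting on <name>" suffix seen above; `waiting_on` is a hypothetical helper name.

```shell
# waiting_on (hypothetical helper): read `oc get clusterversion` output on
# stdin and print whatever follows ", waiting on " in the status text.
# Prints nothing when the status does not contain that suffix.
waiting_on() {
  sed -n 's/.*, waiting on \(.*[^ ]\) *$/\1/p'
}

# Intended usage against a live cluster (not executed here):
#   oc get clusterversion version | waiting_on
#   oc get clusteroperator ingress    # then inspect the operator it names
```

For the last two polls above this would print `ingress`, which points at the ingress cluster operator as the next thing to inspect.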
5. check the timeouts; both have changed to 1s
% oc -n openshift-ingress get deploy/router-default -o yaml | grep -A8 nessProbe:
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /healthz
    port: 1936
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
--
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /healthz/ready
    port: 1936
    scheme: HTTP
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
%
Actual results:
After more than 5 hours, the downgrade has still not completed.
Expected results:
The downgrade completes successfully in about 1 hour.
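The expected duration can be compared against the SINCE column from `oc get clusterversion` by converting it to seconds. This is a rough sketch only: `since_to_seconds` is a hypothetical helper, and it assumes the compact duration format shown in the output above (for example 3m39s, 53m, 5h30m).

```shell
# since_to_seconds (hypothetical helper): convert a compact duration such as
# 3m39s, 53m or 5h30m into a number of seconds.
since_to_seconds() {
  printf '%s\n' "$1" | awk '{
    total = 0; s = $0
    # Consume one "<number><unit>" chunk at a time from the front.
    while (match(s, /^[0-9]+[dhms]/)) {
      n = substr(s, 1, RLENGTH - 1) + 0
      u = substr(s, RLENGTH, 1)
      if (u == "d") total += n * 86400
      else if (u == "h") total += n * 3600
      else if (u == "m") total += n * 60
      else total += n
      s = substr(s, RLENGTH + 1)
    }
    print total
  }'
}

# Example: flag a downgrade that has run longer than the expected ~1 hour.
#   [ "$(since_to_seconds 5h30m)" -gt 3600 ] && echo "downgrade overdue"
```

With the values from step 4, 53m is still within the expected hour while 5h30m is far beyond it.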
Impact of the problem:
Additional info: