Bug
Resolution: Won't Do
Major
None
4.14
Important
No
2
Sprint 248
1
Rejected
False
Description of problem:
TRT has recently noticed the test [sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable after finishing upgrade [Late][Suite:upgrade] failing with "cluster is reporting a failing condition: Cluster operator dns is degraded". Take this job as an example:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1656308394878832640

The test reports the failing condition at May 10 17:21:27.530 (blob:https://prow.ci.openshift.org/00fd5dc0-3a8b-4d95-b142-a6560a9ce898), yet the cluster version indicates the upgrade completed at 2023-05-10T17:11:54Z:

May 10 17:19:33.022: INFO: Cluster version:
{
  "metadata": {
    "name": "version",
    "uid": "7dbd4d9d-9c66-4ef4-97db-599cf7a63b7a",
    "resourceVersion": "97888",
    "generation": 3,
    "creationTimestamp": "2023-05-10T14:49:58Z",
  ......
  "status": {
    "desired": {
      "version": "4.14.0-0.nightly-2023-05-10-143825",
      "image": "registry.build05.ci.openshift.org/ci-op-x7zz0b2c/release@sha256:8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275"
    },
    "history": [
      {
        "state": "Completed",
        "startedTime": "2023-05-10T15:53:59Z",
        "completionTime": "2023-05-10T17:11:54Z",
        "version": "4.14.0-0.nightly-2023-05-10-143825",
        "image": "registry.build05.ci.openshift.org/ci-op-x7zz0b2c/release@sha256:8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275",
        "verified": false,
        "acceptedRisks": "Target release version=\"\" image=\"registry.build05.ci.openshift.org/ci-op-x7zz0b2c/release@sha256:8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275\" cannot be verified, but continuing anyway because the update was forced: unable to verify sha256:8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275 against keyrings: verifier-public-key-redhat\n[2023-05-10T15:53:42Z: prefix sha256-8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275 in config map signatures-managed: no more signatures to check, 2023-05-10T15:53:42Z: unable to retrieve signature from https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275/signature-1: no more signatures to check, 2023-05-10T15:53:42Z: unable to retrieve signature from https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275/signature-1: no more signatures to check, 2023-05-10T15:53:42Z: parallel signature store wrapping containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release, containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release: no more signatures to check, 2023-05-10T15:53:42Z: serial signature store wrapping config maps in openshift-config-managed with label \"release.openshift.io/verification-signatures\", parallel signature store wrapping containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release, containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release: no more signatures to check]\nForced through blocking failures: Multiple precondition checks failed:\n* Precondition \"EtcdRecentBackup\" failed because of \"ControllerStarted\": RecentBackup: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required\n* Precondition \"ClusterVersionRecommendedUpdate\" failed because of \"NoChannel\": Configured channel is unset, so the recommended status of updating from 4.13.0-0.nightly-2023-05-10-062807 to 4.14.0-0.nightly-2023-05-10-143825 is unknown."
      },
    ],
    "conditions": [
      { "type": "Failing", "status": "False", "lastTransitionTime": "2023-05-10T17:09:09Z" },
      { "type": "Progressing", "status": "False", "lastTransitionTime": "2023-05-10T17:11:54Z", "message": "Cluster version is 4.14.0-0.nightly-2023-05-10-143825" },
    ],
    "availableUpdates": null
  }
}

The above cluster version can be seen here:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1656308394878832640/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/clusterversion.json
Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"Enough DNS pods are available, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Progressing\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"All DNS and node-resolver pods are available, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Available\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 15, 0, 19, 0, time.Local), Reason:\"AsExpected\", Message:\"The DNS daemonset has available pods, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Upgradeable\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 14, 59, 57, 0, time.Local), Reason:\"AsExpected\", Message:\"DNS Operator can be upgraded\"}}}" But shortly after at 2023-05-10T17:17:24Z, dns operator changed Progressing to True again with message "Have 5 available node-resolver pods, want 6." Full log line: time="2023-05-10T17:17:24Z" level=info msg="updated DNS default status: old: v1.DNSStatus{ClusterIP:\"172.30.0.10\", ClusterDomain:\"cluster.local\", Conditions:[]v1.OperatorCondition{v1.OperatorCondition{Type:\"Degraded\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"Enough DNS pods are available, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Progressing\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"All DNS and node-resolver pods are available, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Available\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 15, 0, 19, 0, time.Local), Reason:\"AsExpected\", Message:\"The DNS daemonset has available pods, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Upgradeable\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 14, 59, 57, 0, time.Local), Reason:\"AsExpected\", Message:\"DNS Operator can be upgraded\"}}}, new: v1.DNSStatus{ClusterIP:\"172.30.0.10\", ClusterDomain:\"cluster.local\", Conditions:[]v1.OperatorCondition{v1.OperatorCondition{Type:\"Degraded\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"Enough DNS pods are available, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Progressing\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 17, 24, 0, time.Local), Reason:\"Reconciling\", Message:\"Have 5 available node-resolver pods, want 6.\"}, v1.OperatorCondition{Type:\"Available\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 15, 0, 19, 0, time.Local), Reason:\"AsExpected\", Message:\"The DNS daemonset has available pods, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Upgradeable\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 14, 59, 57, 0, time.Local), Reason:\"AsExpected\", Message:\"DNS Operator can be upgraded\"}}}" Now look at the status for the POD (in 
Now look at the status of the pod (in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1656308394878832640/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/pods.json): this particular pod did not turn Ready until 2023-05-10T17:18:21Z.

{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": {
    "creationTimestamp": "2023-05-10T16:44:36Z",
    "generateName": "node-resolver-",
    "labels": { "controller-revision-hash": "55589c997", "dns.operator.openshift.io/daemonset-node-resolver": "", "pod-template-generation": "2" },
    "name": "node-resolver-xz776",
    "namespace": "openshift-dns",
    "ownerReferences": [ { "apiVersion": "apps/v1", "blockOwnerDeletion": true, "controller": true, "kind": "DaemonSet", "name": "node-resolver", "uid": "14d3633f-98eb-421a-be72-3ed92c820fd0" } ],
    "resourceVersion": "100344",
    "uid": "ff559a23-8031-4692-a678-5e99bea74379"
  },
  "spec": {
    "affinity": { "nodeAffinity": { "requiredDuringSchedulingIgnoredDuringExecution": { "nodeSelectorTerms": [ { "matchFields": [ { "key": "metadata.name", "operator": "In", "values": [ "ci-op-x7zz0b2c-d5fb3-6q2zf-worker-c-f62dr" ] } ] } ] } } },
    "containers": [
      {
        "command": [ "/bin/bash", "-c", "#!/bin/bash\nset -uo pipefail\n\ntrap 'jobs -p | xargs kill || true; wait; exit 0' TERM\n\nOPENSHIFT_MARKER=\"openshift-generated-node-resolver\"\nHOSTS_FILE=\"/etc/hosts\"\nTEMP_FILE=\"/etc/hosts.tmp\"\n\nIFS=', ' read -r -a services \u003c\u003c\u003c \"${SERVICES}\"\n\n# Make a temporary file with the old hosts file's attributes.\nif ! cp -f --attributes-only \"${HOSTS_FILE}\" \"${TEMP_FILE}\"; then\n echo \"Failed to preserve hosts file. Exiting.\"\n exit 1\nfi\n\nwhile true; do\n declare -A svc_ips\n for svc in \"${services[@]}\"; do\n # Fetch service IP from cluster dns if present. We make several tries\n # to do it: IPv4, IPv6, IPv4 over TCP and IPv6 over TCP. The two last ones\n # are for deployments with Kuryr on older OpenStack (OSP13) - those do not\n # support UDP loadbalancers and require reaching DNS through TCP.\n cmds=('dig -t A @\"${NAMESERVER}\" +short \"${svc}.${CLUSTER_DOMAIN}\"|grep -v \"^;\"'\n 'dig -t AAAA @\"${NAMESERVER}\" +short \"${svc}.${CLUSTER_DOMAIN}\"|grep -v \"^;\"'\n 'dig -t A +tcp +retry=0 @\"${NAMESERVER}\" +short \"${svc}.${CLUSTER_DOMAIN}\"|grep -v \"^;\"'\n 'dig -t AAAA +tcp +retry=0 @\"${NAMESERVER}\" +short \"${svc}.${CLUSTER_DOMAIN}\"|grep -v \"^;\"')\n for i in ${!cmds[*]}\n do\n ips=($(eval \"${cmds[i]}\"))\n if [[ \"$?\" -eq 0 \u0026\u0026 \"${#ips[@]}\" -ne 0 ]]; then\n svc_ips[\"${svc}\"]=\"${ips[@]}\"\n break\n fi\n done\n done\n\n # Update /etc/hosts only if we get valid service IPs\n # We will not update /etc/hosts when there is coredns service outage or api unavailability\n # Stale entries could exist in /etc/hosts if the service is deleted\n if [[ -n \"${svc_ips[*]-}\" ]]; then\n # Build a new hosts file from /etc/hosts with our custom entries filtered out\n if ! sed --silent \"/# ${OPENSHIFT_MARKER}/d; w ${TEMP_FILE}\" \"${HOSTS_FILE}\"; then\n # Only continue rebuilding the hosts entries if its original content is preserved\n sleep 60 \u0026 wait\n continue\n fi\n\n # Append resolver entries for services\n rc=0\n for svc in \"${!svc_ips[@]}\"; do\n for ip in ${svc_ips[${svc}]}; do\n echo \"${ip} ${svc} ${svc}.${CLUSTER_DOMAIN} # ${OPENSHIFT_MARKER}\" \u003e\u003e \"${TEMP_FILE}\" || rc=$?\n done\n done\n if [[ $rc -ne 0 ]]; then\n sleep 60 \u0026 wait\n continue\n fi\n\n\n # TODO: Update /etc/hosts atomically to avoid any inconsistent behavior\n # Replace /etc/hosts with our modified version if needed\n cmp \"${TEMP_FILE}\" \"${HOSTS_FILE}\" || cp -f \"${TEMP_FILE}\" \"${HOSTS_FILE}\"\n # TEMP_FILE is not removed to avoid file create/delete and attributes copy churn\n fi\n sleep 60 \u0026 wait\n unset svc_ips\ndone\n" ],
        "env": [ { "name": "SERVICES", "value": "image-registry.openshift-image-registry.svc" }, { "name": "NAMESERVER", "value": "172.30.0.10" }, { "name": "CLUSTER_DOMAIN", "value": "cluster.local" } ],
        "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:af280e9ef6f3eedb47e3b06380c3ad766bc2de9a4f75a7cbb97185794e21f06e",
        "imagePullPolicy": "IfNotPresent",
        "name": "dns-node-resolver",
        "resources": { "requests": { "cpu": "5m", "memory": "21Mi" } },
        "securityContext": { "privileged": true },
        "terminationMessagePath": "/dev/termination-log",
        "terminationMessagePolicy": "FallbackToLogsOnError",
        "volumeMounts": [ { "mountPath": "/etc/hosts", "name": "hosts-file" }, { "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount", "name": "kube-api-access-pmn5k", "readOnly": true } ]
      }
    ],
    "dnsPolicy": "ClusterFirst",
    "enableServiceLinks": true,
    "hostNetwork": true,
    "imagePullSecrets": [ { "name": "node-resolver-dockercfg-rdxn6" } ],
    "nodeName": "ci-op-x7zz0b2c-d5fb3-6q2zf-worker-c-f62dr",
    "nodeSelector": { "kubernetes.io/os": "linux" },
    "preemptionPolicy": "PreemptLowerPriority",
    "priority": 2000001000,
    "priorityClassName": "system-node-critical",
    "restartPolicy": "Always",
    "schedulerName": "default-scheduler",
    "securityContext": {},
    "serviceAccount": "node-resolver",
    "serviceAccountName": "node-resolver",
    "terminationGracePeriodSeconds": 30,
    "tolerations": [ { "operator": "Exists" } ],
    "volumes": [ { "hostPath": { "path": "/etc/hosts", "type": "File" }, "name": "hosts-file" }, { "name": "kube-api-access-pmn5k", "projected": { "defaultMode": 420, "sources": [ { "serviceAccountToken": { "expirationSeconds": 3607, "path": "token" } }, { "configMap": { "items": [ { "key": "ca.crt", "path": "ca.crt" } ], "name": "kube-root-ca.crt" } }, { "downwardAPI": { "items": [ { "fieldRef": { "apiVersion": "v1", "fieldPath": "metadata.namespace" }, "path": "namespace" } ] } }, { "configMap": { "items": [ { "key": "service-ca.crt", "path": "service-ca.crt" } ], "name": "openshift-service-ca.crt" } } ] } } ]
  },
  "status": {
    "conditions": [
      { "lastProbeTime": null, "lastTransitionTime": "2023-05-10T16:44:36Z", "status": "True", "type": "Initialized" },
      { "lastProbeTime": null, "lastTransitionTime": "2023-05-10T17:18:21Z", "status": "True", "type": "Ready" },
      { "lastProbeTime": null, "lastTransitionTime": "2023-05-10T17:18:21Z", "status": "True", "type": "ContainersReady" },
      { "lastProbeTime": null, "lastTransitionTime": "2023-05-10T16:44:36Z", "status": "True", "type": "PodScheduled" }
    ],
    "containerStatuses": [ { "containerID": "cri-o://a98dc10659008451a6aecc079008c08abca3e18dc1c17be517275e62f1423d84", "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:af280e9ef6f3eedb47e3b06380c3ad766bc2de9a4f75a7cbb97185794e21f06e", "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:af280e9ef6f3eedb47e3b06380c3ad766bc2de9a4f75a7cbb97185794e21f06e", "lastState": {}, "name": "dns-node-resolver", "ready": true, "restartCount": 1, "started": true, "state": { "running": { "startedAt": "2023-05-10T17:18:20Z" } } } ],
    "hostIP": "10.0.128.4",
    "phase": "Running",
    "podIP": "10.0.128.4",
    "podIPs": [ { "ip": "10.0.128.4" } ],
    "qosClass": "Burstable",
    "startTime": "2023-05-10T16:44:36Z"
  }
}
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
- relates to: OCPBUGS-14346 "DNS operator reports degraded when simply progressing during a 4.14 upgrade" (POST)