Bug · Resolution: Won't Do · Major · 4.14 · Quality / Stability / Reliability · Important · Rejected · Sprint 248
Description of problem:
Recently TRT has noticed the test [sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable after finishing upgrade [Late][Suite:upgrade] failing with "cluster is reporting a failing condition: Cluster operator dns is degraded".
Take this job as an example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1656308394878832640
The test reported the failing condition at May 10 17:21:27.530, yet the cluster version indicates the upgrade completed at 2023-05-10T17:11:54Z.
May 10 17:19:33.022: INFO: Cluster version:
{
"metadata": {
"name": "version",
"uid": "7dbd4d9d-9c66-4ef4-97db-599cf7a63b7a",
"resourceVersion": "97888",
"generation": 3,
"creationTimestamp": "2023-05-10T14:49:58Z",
......
"status": {
"desired": {
"version": "4.14.0-0.nightly-2023-05-10-143825",
"image": "registry.build05.ci.openshift.org/ci-op-x7zz0b2c/release@sha256:8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275"
},
"history": [
{
"state": "Completed",
"startedTime": "2023-05-10T15:53:59Z",
"completionTime": "2023-05-10T17:11:54Z",
"version": "4.14.0-0.nightly-2023-05-10-143825",
"image": "registry.build05.ci.openshift.org/ci-op-x7zz0b2c/release@sha256:8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275",
"verified": false,
"acceptedRisks": "Target release version=\"\" image=\"registry.build05.ci.openshift.org/ci-op-x7zz0b2c/release@sha256:8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275\" cannot be verified, but continuing anyway because the update was forced: unable to verify sha256:8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275 against keyrings: verifier-public-key-redhat\n[2023-05-10T15:53:42Z: prefix sha256-8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275 in config map signatures-managed: no more signatures to check, 2023-05-10T15:53:42Z: unable to retrieve signature from https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275/signature-1: no more signatures to check, 2023-05-10T15:53:42Z: unable to retrieve signature from https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275/signature-1: no more signatures to check, 2023-05-10T15:53:42Z: parallel signature store wrapping containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release, containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release: no more signatures to check, 2023-05-10T15:53:42Z: serial signature store wrapping config maps in openshift-config-managed with label \"release.openshift.io/verification-signatures\", parallel signature store wrapping containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release, containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release: no more signatures to check]\nForced through blocking failures: Multiple precondition checks failed:\n* Precondition \"EtcdRecentBackup\" failed because of \"ControllerStarted\": RecentBackup: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required\n* Precondition \"ClusterVersionRecommendedUpdate\" failed because of \"NoChannel\": Configured channel is unset, so the recommended status of updating from 4.13.0-0.nightly-2023-05-10-062807 to 4.14.0-0.nightly-2023-05-10-143825 is unknown."
},
],
"conditions": [
{
"type": "Failing",
"status": "False",
"lastTransitionTime": "2023-05-10T17:09:09Z"
},
{
"type": "Progressing",
"status": "False",
"lastTransitionTime": "2023-05-10T17:11:54Z",
"message": "Cluster version is 4.14.0-0.nightly-2023-05-10-143825"
},
],
"availableUpdates": null
}
}
The above cluster version can be seen here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1656308394878832640/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/clusterversion.json
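For context on why the test flags this: the [Late] check effectively scans every ClusterOperator after the upgrade finishes and fails if any of them reports Degraded=True. A minimal sketch of that kind of scan against the OpenShift config API follows; it is a simplified illustration of the check's shape, not the actual origin test code.

package main

import (
	"context"
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := configclient.NewForConfigOrDie(cfg)

	operators, err := client.ConfigV1().ClusterOperators().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, co := range operators.Items {
		for _, cond := range co.Status.Conditions {
			// This is the failure mode seen here: dns reporting Degraded=True
			// even though the cluster version says the upgrade already completed.
			if cond.Type == configv1.OperatorDegraded && cond.Status == configv1.ConditionTrue {
				fmt.Printf("cluster is reporting a failing condition: Cluster operator %s is degraded\n", co.Name)
			}
		}
	}
}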
Looking at the DNS operator log: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1656308394878832640/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/pods/openshift-dns-operator_dns-operator-fc68b6df5-6gk5k_dns-operator.log
The DNS operator appears to report the update complete at 2023-05-10T17:11:04Z:
time="2023-05-10T17:11:04Z" level=info msg="updated DNS default status: old: v1.DNSStatus{ClusterIP:\"172.30.0.10\", ClusterDomain:\"cluster.local\", Conditions:[]v1.OperatorCondition{v1.OperatorCondition{Type:\"Degraded\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 9, 31, 0, time.Local), Reason:\"MaxUnavailableDNSPodsExceeded\", Message:\"Too many DNS pods are unavailable (1 > 0 max unavailable).\"}, v1.OperatorCondition{Type:\"Progressing\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 9, 52, 0, time.Local), Reason:\"Reconciling\", Message:\"Have 5 available DNS pods, want 6.\"}, v1.OperatorCondition{Type:\"Available\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 15, 0, 19, 0, time.Local), Reason:\"AsExpected\", Message:\"The DNS daemonset has available pods, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Upgradeable\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 14, 59, 57, 0, time.Local), Reason:\"AsExpected\", Message:\"DNS Operator can be upgraded\"}}}, new: v1.DNSStatus{ClusterIP:\"172.30.0.10\", ClusterDomain:\"cluster.local\", Conditions:[]v1.OperatorCondition{v1.OperatorCondition{Type:\"Degraded\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"Enough DNS pods are available, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Progressing\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"All DNS and node-resolver pods are available, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Available\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 15, 0, 19, 0, time.Local), Reason:\"AsExpected\", Message:\"The DNS daemonset has available pods, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Upgradeable\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 14, 59, 57, 0, time.Local), Reason:\"AsExpected\", Message:\"DNS Operator can be upgraded\"}}}"
The relevant conditions from the new state:
v1.OperatorCondition{Type:\"Degraded\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"Enough DNS pods are available, and the DNS service has a cluster IP address.\"},
v1.OperatorCondition{Type:\"Progressing\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"All DNS and node-resolver pods are available, and the DNS service has a cluster IP address.\"},
v1.OperatorCondition{Type:\"Available\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 15, 0, 19, 0, time.Local), Reason:\"AsExpected\", Message:\"The DNS daemonset has available pods, and the DNS service has a cluster IP address.\"},
v1.OperatorCondition{Type:\"Upgradeable\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 14, 59, 57, 0, time.Local), Reason:\"AsExpected\", Message:\"DNS Operator can be upgraded\"}}}"
But shortly afterward, at 2023-05-10T17:17:24Z, the DNS operator flipped Progressing back to True with the message "Have 5 available node-resolver pods, want 6."
Full log line:
time="2023-05-10T17:17:24Z" level=info msg="updated DNS default status: old: v1.DNSStatus{ClusterIP:\"172.30.0.10\", ClusterDomain:\"cluster.local\", Conditions:[]v1.OperatorCondition{v1.OperatorCondition{Type:\"Degraded\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"Enough DNS pods are available, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Progressing\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"All DNS and node-resolver pods are available, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Available\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 15, 0, 19, 0, time.Local), Reason:\"AsExpected\", Message:\"The DNS daemonset has available pods, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Upgradeable\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 14, 59, 57, 0, time.Local), Reason:\"AsExpected\", Message:\"DNS Operator can be upgraded\"}}}, new: v1.DNSStatus{ClusterIP:\"172.30.0.10\", ClusterDomain:\"cluster.local\", Conditions:[]v1.OperatorCondition{v1.OperatorCondition{Type:\"Degraded\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"Enough DNS pods are available, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Progressing\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 17, 24, 0, time.Local), Reason:\"Reconciling\", Message:\"Have 5 available node-resolver pods, want 6.\"}, v1.OperatorCondition{Type:\"Available\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 15, 0, 19, 0, time.Local), Reason:\"AsExpected\", Message:\"The DNS daemonset has available pods, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Upgradeable\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 14, 59, 57, 0, time.Local), Reason:\"AsExpected\", Message:\"DNS Operator can be upgraded\"}}}"
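Conditions like these are typically computed from the daemonset's status counts: Progressing flips to True whenever the number of available pods lags the desired number. A minimal sketch of that pattern, mirroring the messages quoted above (an illustration only; see openshift/cluster-dns-operator for the operator's actual logic):

package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// progressingCondition reports Progressing=True while the daemonset's
// available pod count lags the desired count, with a message shaped like
// the ones in the operator log above.
func progressingCondition(ds *appsv1.DaemonSet) (bool, string) {
	want := ds.Status.DesiredNumberScheduled
	have := ds.Status.NumberAvailable
	if have != want {
		return true, fmt.Sprintf("Have %d available %s pods, want %d.", have, ds.Name, want)
	}
	return false, fmt.Sprintf("All %s pods are available.", ds.Name)
}

func main() {
	// Reproduce the state from this job: 5 of 6 node-resolver pods available.
	ds := &appsv1.DaemonSet{}
	ds.Name = "node-resolver"
	ds.Status.DesiredNumberScheduled = 6
	ds.Status.NumberAvailable = 5
	_, msg := progressingCondition(ds)
	fmt.Println(msg) // Have 5 available node-resolver pods, want 6.
}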
Now look at the status of the pod (in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1656308394878832640/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/pods.json).
This particular pod did not become Ready until 2023-05-10T17:18:21Z:
{
"apiVersion": "v1",
"kind": "Pod",
"metadata": {
"creationTimestamp": "2023-05-10T16:44:36Z",
"generateName": "node-resolver-",
"labels": {
"controller-revision-hash": "55589c997",
"dns.operator.openshift.io/daemonset-node-resolver": "",
"pod-template-generation": "2"
},
"name": "node-resolver-xz776",
"namespace": "openshift-dns",
"ownerReferences": [
{
"apiVersion": "apps/v1",
"blockOwnerDeletion": true,
"controller": true,
"kind": "DaemonSet",
"name": "node-resolver",
"uid": "14d3633f-98eb-421a-be72-3ed92c820fd0"
}
],
"resourceVersion": "100344",
"uid": "ff559a23-8031-4692-a678-5e99bea74379"
},
"spec": {
"affinity": {
"nodeAffinity": {
"requiredDuringSchedulingIgnoredDuringExecution": {
"nodeSelectorTerms": [
{
"matchFields": [
{
"key": "metadata.name",
"operator": "In",
"values": [
"ci-op-x7zz0b2c-d5fb3-6q2zf-worker-c-f62dr"
]
}
]
}
]
}
}
},
"containers": [
{
"command": [
"/bin/bash",
"-c",
"#!/bin/bash\nset -uo pipefail\n\ntrap 'jobs -p | xargs kill || true; wait; exit 0' TERM\n\nOPENSHIFT_MARKER=\"openshift-generated-node-resolver\"\nHOSTS_FILE=\"/etc/hosts\"\nTEMP_FILE=\"/etc/hosts.tmp\"\n\nIFS=', ' read -r -a services \u003c\u003c\u003c \"${SERVICES}\"\n\n# Make a temporary file with the old hosts file's attributes.\nif ! cp -f --attributes-only \"${HOSTS_FILE}\" \"${TEMP_FILE}\"; then\n echo \"Failed to preserve hosts file. Exiting.\"\n exit 1\nfi\n\nwhile true; do\n declare -A svc_ips\n for svc in \"${services[@]}\"; do\n # Fetch service IP from cluster dns if present. We make several tries\n # to do it: IPv4, IPv6, IPv4 over TCP and IPv6 over TCP. The two last ones\n # are for deployments with Kuryr on older OpenStack (OSP13) - those do not\n # support UDP loadbalancers and require reaching DNS through TCP.\n cmds=('dig -t A @\"${NAMESERVER}\" +short \"${svc}.${CLUSTER_DOMAIN}\"|grep -v \"^;\"'\n 'dig -t AAAA @\"${NAMESERVER}\" +short \"${svc}.${CLUSTER_DOMAIN}\"|grep -v \"^;\"'\n 'dig -t A +tcp +retry=0 @\"${NAMESERVER}\" +short \"${svc}.${CLUSTER_DOMAIN}\"|grep -v \"^;\"'\n 'dig -t AAAA +tcp +retry=0 @\"${NAMESERVER}\" +short \"${svc}.${CLUSTER_DOMAIN}\"|grep -v \"^;\"')\n for i in ${!cmds[*]}\n do\n ips=($(eval \"${cmds[i]}\"))\n if [[ \"$?\" -eq 0 \u0026\u0026 \"${#ips[@]}\" -ne 0 ]]; then\n svc_ips[\"${svc}\"]=\"${ips[@]}\"\n break\n fi\n done\n done\n\n # Update /etc/hosts only if we get valid service IPs\n # We will not update /etc/hosts when there is coredns service outage or api unavailability\n # Stale entries could exist in /etc/hosts if the service is deleted\n if [[ -n \"${svc_ips[*]-}\" ]]; then\n # Build a new hosts file from /etc/hosts with our custom entries filtered out\n if ! sed --silent \"/# ${OPENSHIFT_MARKER}/d; w ${TEMP_FILE}\" \"${HOSTS_FILE}\"; then\n # Only continue rebuilding the hosts entries if its original content is preserved\n sleep 60 \u0026 wait\n continue\n fi\n\n # Append resolver entries for services\n rc=0\n for svc in \"${!svc_ips[@]}\"; do\n for ip in ${svc_ips[${svc}]}; do\n echo \"${ip} ${svc} ${svc}.${CLUSTER_DOMAIN} # ${OPENSHIFT_MARKER}\" \u003e\u003e \"${TEMP_FILE}\" || rc=$?\n done\n done\n if [[ $rc -ne 0 ]]; then\n sleep 60 \u0026 wait\n continue\n fi\n\n\n # TODO: Update /etc/hosts atomically to avoid any inconsistent behavior\n # Replace /etc/hosts with our modified version if needed\n cmp \"${TEMP_FILE}\" \"${HOSTS_FILE}\" || cp -f \"${TEMP_FILE}\" \"${HOSTS_FILE}\"\n # TEMP_FILE is not removed to avoid file create/delete and attributes copy churn\n fi\n sleep 60 \u0026 wait\n unset svc_ips\ndone\n"
],
"env": [
{
"name": "SERVICES",
"value": "image-registry.openshift-image-registry.svc"
},
{
"name": "NAMESERVER",
"value": "172.30.0.10"
},
{
"name": "CLUSTER_DOMAIN",
"value": "cluster.local"
}
],
"image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:af280e9ef6f3eedb47e3b06380c3ad766bc2de9a4f75a7cbb97185794e21f06e",
"imagePullPolicy": "IfNotPresent",
"name": "dns-node-resolver",
"resources": {
"requests": {
"cpu": "5m",
"memory": "21Mi"
}
},
"securityContext": {
"privileged": true
},
"terminationMessagePath": "/dev/termination-log",
"terminationMessagePolicy": "FallbackToLogsOnError",
"volumeMounts": [
{
"mountPath": "/etc/hosts",
"name": "hosts-file"
},
{
"mountPath": "/var/run/secrets/kubernetes.io/serviceaccount",
"name": "kube-api-access-pmn5k",
"readOnly": true
}
]
}
],
"dnsPolicy": "ClusterFirst",
"enableServiceLinks": true,
"hostNetwork": true,
"imagePullSecrets": [
{
"name": "node-resolver-dockercfg-rdxn6"
}
],
"nodeName": "ci-op-x7zz0b2c-d5fb3-6q2zf-worker-c-f62dr",
"nodeSelector": {
"kubernetes.io/os": "linux"
},
"preemptionPolicy": "PreemptLowerPriority",
"priority": 2000001000,
"priorityClassName": "system-node-critical",
"restartPolicy": "Always",
"schedulerName": "default-scheduler",
"securityContext": {},
"serviceAccount": "node-resolver",
"serviceAccountName": "node-resolver",
"terminationGracePeriodSeconds": 30,
"tolerations": [
{
"operator": "Exists"
}
],
"volumes": [
{
"hostPath": {
"path": "/etc/hosts",
"type": "File"
},
"name": "hosts-file"
},
{
"name": "kube-api-access-pmn5k",
"projected": {
"defaultMode": 420,
"sources": [
{
"serviceAccountToken": {
"expirationSeconds": 3607,
"path": "token"
}
},
{
"configMap": {
"items": [
{
"key": "ca.crt",
"path": "ca.crt"
}
],
"name": "kube-root-ca.crt"
}
},
{
"downwardAPI": {
"items": [
{
"fieldRef": {
"apiVersion": "v1",
"fieldPath": "metadata.namespace"
},
"path": "namespace"
}
]
}
},
{
"configMap": {
"items": [
{
"key": "service-ca.crt",
"path": "service-ca.crt"
}
],
"name": "openshift-service-ca.crt"
}
}
]
}
}
]
},
"status": {
"conditions": [
{
"lastProbeTime": null,
"lastTransitionTime": "2023-05-10T16:44:36Z",
"status": "True",
"type": "Initialized"
},
{
"lastProbeTime": null,
"lastTransitionTime": "2023-05-10T17:18:21Z",
"status": "True",
"type": "Ready"
},
{
"lastProbeTime": null,
"lastTransitionTime": "2023-05-10T17:18:21Z",
"status": "True",
"type": "ContainersReady"
},
{
"lastProbeTime": null,
"lastTransitionTime": "2023-05-10T16:44:36Z",
"status": "True",
"type": "PodScheduled"
}
],
"containerStatuses": [
{
"containerID": "cri-o://a98dc10659008451a6aecc079008c08abca3e18dc1c17be517275e62f1423d84",
"image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:af280e9ef6f3eedb47e3b06380c3ad766bc2de9a4f75a7cbb97185794e21f06e",
"imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:af280e9ef6f3eedb47e3b06380c3ad766bc2de9a4f75a7cbb97185794e21f06e",
"lastState": {},
"name": "dns-node-resolver",
"ready": true,
"restartCount": 1,
"started": true,
"state": {
"running": {
"startedAt": "2023-05-10T17:18:20Z"
}
}
}
],
"hostIP": "10.0.128.4",
"phase": "Running",
"podIP": "10.0.128.4",
"podIPs": [
{
"ip": "10.0.128.4"
}
],
"qosClass": "Burstable",
"startTime": "2023-05-10T16:44:36Z"
}
},
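To confirm where the gap came from, the Ready transitions can be pulled straight out of the gather-extra pods.json. A small throwaway helper for that is sketched below; the local "pods.json" path and the reduced field set are assumptions for illustration, but the field names match the Kubernetes PodList schema.

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// podList decodes only the fields we need from a PodList dump.
type podList struct {
	Items []struct {
		Metadata struct {
			Name      string `json:"name"`
			Namespace string `json:"namespace"`
		} `json:"metadata"`
		Status struct {
			Conditions []struct {
				Type               string `json:"type"`
				Status             string `json:"status"`
				LastTransitionTime string `json:"lastTransitionTime"`
			} `json:"conditions"`
		} `json:"status"`
	} `json:"items"`
}

func main() {
	raw, err := os.ReadFile("pods.json") // downloaded gather-extra artifact
	if err != nil {
		panic(err)
	}
	var pods podList
	if err := json.Unmarshal(raw, &pods); err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		if p.Metadata.Namespace != "openshift-dns" {
			continue
		}
		for _, c := range p.Status.Conditions {
			// node-resolver-xz776 shows Ready=True since 2023-05-10T17:18:21Z.
			if c.Type == "Ready" {
				fmt.Printf("%s Ready=%s since %s\n", p.Metadata.Name, c.Status, c.LastTransitionTime)
			}
		}
	}
}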
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
Relates to: OCPBUGS-14346 "DNS operator reports degraded when simply progressing during a 4.14 upgrade" (POST)