OCPBUGS-13555

Cluster operator dns is degraded after upgrade is complete

    Description

      Description of problem:

      Recently TRT has noticed the test [sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable after finishing upgrade [Late][Suite:upgrade] failing with "cluster is reporting a failing condition: Cluster operator dns is degraded".
      
      Take this job as an example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1656308394878832640
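
      For runs like this, the failing condition can be confirmed from the gather-extra artifacts. A minimal jq sketch, assuming a clusteroperators.json dump (an oc-style List) is present alongside the other gathered files:

      # dns ClusterOperator conditions from the gathered clusteroperators.json (assumed filename/layout)
      jq -r '.items[]
        | select(.metadata.name == "dns")
        | .status.conditions[]
        | select(.type == "Degraded" or .type == "Progressing")
        | "\(.type)\t\(.status)\t\(.lastTransitionTime)\t\(.reason)"' clusteroperators.json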
      
      The test is reporting a failing condition at May 10 17:21:27.530: blob:https://prow.ci.openshift.org/00fd5dc0-3a8b-4d95-b142-a6560a9ce898
      
      Yet the cluster version indicates the upgrade completed at 2023-05-10T17:11:54Z.
      
      May 10 17:19:33.022: INFO: Cluster version:
      {
        "metadata": {
          "name": "version",
          "uid": "7dbd4d9d-9c66-4ef4-97db-599cf7a63b7a",
          "resourceVersion": "97888",
          "generation": 3,
          "creationTimestamp": "2023-05-10T14:49:58Z",
      ......
      
        "status": {
          "desired": {
            "version": "4.14.0-0.nightly-2023-05-10-143825",
            "image": "registry.build05.ci.openshift.org/ci-op-x7zz0b2c/release@sha256:8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275"
          },
          "history": [
            {
              "state": "Completed",
              "startedTime": "2023-05-10T15:53:59Z",
              "completionTime": "2023-05-10T17:11:54Z",
              "version": "4.14.0-0.nightly-2023-05-10-143825",
              "image": "registry.build05.ci.openshift.org/ci-op-x7zz0b2c/release@sha256:8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275",
              "verified": false,
              "acceptedRisks": "Target release version=\"\" image=\"registry.build05.ci.openshift.org/ci-op-x7zz0b2c/release@sha256:8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275\" cannot be verified, but continuing anyway because the update was forced: unable to verify sha256:8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275 against keyrings: verifier-public-key-redhat\n[2023-05-10T15:53:42Z: prefix sha256-8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275 in config map signatures-managed: no more signatures to check, 2023-05-10T15:53:42Z: unable to retrieve signature from https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275/signature-1: no more signatures to check, 2023-05-10T15:53:42Z: unable to retrieve signature from https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275/signature-1: no more signatures to check, 2023-05-10T15:53:42Z: parallel signature store wrapping containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release, containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release: no more signatures to check, 2023-05-10T15:53:42Z: serial signature store wrapping config maps in openshift-config-managed with label \"release.openshift.io/verification-signatures\", parallel signature store wrapping containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release, containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release: no more signatures to check]\nForced through blocking failures: Multiple precondition checks failed:\n* Precondition \"EtcdRecentBackup\" failed because of \"ControllerStarted\": RecentBackup: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required\n* Precondition \"ClusterVersionRecommendedUpdate\" failed because of \"NoChannel\": Configured channel is unset, so the recommended status of updating from 4.13.0-0.nightly-2023-05-10-062807 to 4.14.0-0.nightly-2023-05-10-143825 is unknown."
            },
          ],
          "conditions": [
            {
              "type": "Failing",
              "status": "False",
              "lastTransitionTime": "2023-05-10T17:09:09Z"
            },
            {
              "type": "Progressing",
              "status": "False",
              "lastTransitionTime": "2023-05-10T17:11:54Z",
              "message": "Cluster version is 4.14.0-0.nightly-2023-05-10-143825"
            },
          ],
          "availableUpdates": null
        }
      }
      
      
      The above cluster version can be seen here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1656308394878832640/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/clusterversion.json
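
      For reference, the upgrade history (including completionTime) can be read straight from that artifact with a small jq sketch; (.items[0] // .) is only there to handle both a List-style dump and a single ClusterVersion object:

      # Print state, start time, completion time and version for each history entry
      jq -r '(.items[0] // .)
        | .status.history[]
        | "\(.state)\t\(.startedTime)\t\(.completionTime)\t\(.version)"' clusterversion.json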
      
      Looking at the dns operator log: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1656308394878832640/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/pods/openshift-dns-operator_dns-operator-fc68b6df5-6gk5k_dns-operator.log
      
      The DNS operator appears to report the update complete at 2023-05-10T17:11:04Z, flipping both Degraded and Progressing to False:
      
      time="2023-05-10T17:11:04Z" level=info msg="updated DNS default status: old: v1.DNSStatus{ClusterIP:\"172.30.0.10\", ClusterDomain:\"cluster.local\", Conditions:[]v1.OperatorCondition{v1.OperatorCondition{Type:\"Degraded\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 9, 31, 0, time.Local), Reason:\"MaxUnavailableDNSPodsExceeded\", Message:\"Too many DNS pods are unavailable (1 > 0 max unavailable).\"}, v1.OperatorCondition{Type:\"Progressing\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 9, 52, 0, time.Local), Reason:\"Reconciling\", Message:\"Have 5 available DNS pods, want 6.\"}, v1.OperatorCondition{Type:\"Available\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 15, 0, 19, 0, time.Local), Reason:\"AsExpected\", Message:\"The DNS daemonset has available pods, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Upgradeable\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 14, 59, 57, 0, time.Local), Reason:\"AsExpected\", Message:\"DNS Operator can be upgraded\"}}}, new: v1.DNSStatus{ClusterIP:\"172.30.0.10\", ClusterDomain:\"cluster.local\", Conditions:[]v1.OperatorCondition{v1.OperatorCondition{Type:\"Degraded\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"Enough DNS pods are available, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Progressing\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"All DNS and node-resolver pods are available, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Available\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 15, 0, 19, 0, time.Local), Reason:\"AsExpected\", Message:\"The DNS daemonset has available pods, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Upgradeable\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 14, 59, 57, 0, time.Local), Reason:\"AsExpected\", Message:\"DNS Operator can be upgraded\"}}}"
      
      The relevant section for the new state:
      
           v1.OperatorCondition{Type:\"Degraded\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"Enough DNS pods are available, and the DNS service has a cluster IP address.\"}, 
           v1.OperatorCondition{Type:\"Progressing\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"All DNS and node-resolver pods are available, and the DNS service has a cluster IP address.\"}, 
           v1.OperatorCondition{Type:\"Available\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 15, 0, 19, 0, time.Local), Reason:\"AsExpected\", Message:\"The DNS daemonset has available pods, and the DNS service has a cluster IP address.\"}, 
           v1.OperatorCondition{Type:\"Upgradeable\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 14, 59, 57, 0, time.Local), Reason:\"AsExpected\", Message:\"DNS Operator can be upgraded\"}}}"
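
      To reconstruct when the operator flipped these conditions over time, the status-update lines can be listed from the downloaded operator log; a rough sketch (dns-operator.log is a local copy of the log linked above):

      # Timestamps of every DNS status update written by the operator
      grep 'updated DNS default status' dns-operator.log \
        | grep -o 'time="[^"]*"' \
        | sort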
      
      
      But shortly after, at 2023-05-10T17:17:24Z, the DNS operator set Progressing back to True with the message "Have 5 available node-resolver pods, want 6."
      
      Full log line:
      time="2023-05-10T17:17:24Z" level=info msg="updated DNS default status: old: v1.DNSStatus{ClusterIP:\"172.30.0.10\", ClusterDomain:\"cluster.local\", Conditions:[]v1.OperatorCondition{v1.OperatorCondition{Type:\"Degraded\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"Enough DNS pods are available, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Progressing\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"All DNS and node-resolver pods are available, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Available\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 15, 0, 19, 0, time.Local), Reason:\"AsExpected\", Message:\"The DNS daemonset has available pods, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Upgradeable\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 14, 59, 57, 0, time.Local), Reason:\"AsExpected\", Message:\"DNS Operator can be upgraded\"}}}, new: v1.DNSStatus{ClusterIP:\"172.30.0.10\", ClusterDomain:\"cluster.local\", Conditions:[]v1.OperatorCondition{v1.OperatorCondition{Type:\"Degraded\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"Enough DNS pods are available, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Progressing\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 17, 24, 0, time.Local), Reason:\"Reconciling\", Message:\"Have 5 available node-resolver pods, want 6.\"}, v1.OperatorCondition{Type:\"Available\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 15, 0, 19, 0, time.Local), Reason:\"AsExpected\", Message:\"The DNS daemonset has available pods, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Upgradeable\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 14, 59, 57, 0, time.Local), Reason:\"AsExpected\", Message:\"DNS Operator can be upgraded\"}}}"
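
      That message appears to compare the node-resolver DaemonSet's available pod count against its desired count; on a live cluster the same numbers can be checked with something like:

      # Available vs. desired node-resolver pods, as reported by the DaemonSet status
      oc -n openshift-dns get daemonset node-resolver \
        -o jsonpath='{.status.numberAvailable}/{.status.desiredNumberScheduled}{"\n"}'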
      
      Now look at the status of the node-resolver pod node-resolver-xz776 (in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1656308394878832640/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/pods.json):
      
      We see that this particular pod did not become Ready until 2023-05-10T17:18:21Z.
      
      
      {
                  "apiVersion": "v1",
                  "kind": "Pod",
                  "metadata": {
                      "creationTimestamp": "2023-05-10T16:44:36Z",
                      "generateName": "node-resolver-",
                      "labels": {
                          "controller-revision-hash": "55589c997",
                          "dns.operator.openshift.io/daemonset-node-resolver": "",
                          "pod-template-generation": "2"
                      },
                      "name": "node-resolver-xz776",
                      "namespace": "openshift-dns",
                      "ownerReferences": [
                          {
                              "apiVersion": "apps/v1",
                              "blockOwnerDeletion": true,
                              "controller": true,
                              "kind": "DaemonSet",
                              "name": "node-resolver",
                              "uid": "14d3633f-98eb-421a-be72-3ed92c820fd0"
                          }
                      ],
                      "resourceVersion": "100344",
                      "uid": "ff559a23-8031-4692-a678-5e99bea74379"
                  },
                  "spec": {
                      "affinity": {
                          "nodeAffinity": {
                              "requiredDuringSchedulingIgnoredDuringExecution": {
                                  "nodeSelectorTerms": [
                                      {
                                          "matchFields": [
                                              {
                                                  "key": "metadata.name",
                                                  "operator": "In",
                                                  "values": [
                                                      "ci-op-x7zz0b2c-d5fb3-6q2zf-worker-c-f62dr"
                                                  ]
                                              }
                                          ]
                                      }
                                  ]
                              }
                          }
                      },
                      "containers": [
                          {
                              "command": [
                                  "/bin/bash",
                                  "-c",
                                  "#!/bin/bash\nset -uo pipefail\n\ntrap 'jobs -p | xargs kill || true; wait; exit 0' TERM\n\nOPENSHIFT_MARKER=\"openshift-generated-node-resolver\"\nHOSTS_FILE=\"/etc/hosts\"\nTEMP_FILE=\"/etc/hosts.tmp\"\n\nIFS=', ' read -r -a services \u003c\u003c\u003c \"${SERVICES}\"\n\n# Make a temporary file with the old hosts file's attributes.\nif ! cp -f --attributes-only \"${HOSTS_FILE}\" \"${TEMP_FILE}\"; then\n  echo \"Failed to preserve hosts file. Exiting.\"\n  exit 1\nfi\n\nwhile true; do\n  declare -A svc_ips\n  for svc in \"${services[@]}\"; do\n    # Fetch service IP from cluster dns if present. We make several tries\n    # to do it: IPv4, IPv6, IPv4 over TCP and IPv6 over TCP. The two last ones\n    # are for deployments with Kuryr on older OpenStack (OSP13) - those do not\n    # support UDP loadbalancers and require reaching DNS through TCP.\n    cmds=('dig -t A @\"${NAMESERVER}\" +short \"${svc}.${CLUSTER_DOMAIN}\"|grep -v \"^;\"'\n          'dig -t AAAA @\"${NAMESERVER}\" +short \"${svc}.${CLUSTER_DOMAIN}\"|grep -v \"^;\"'\n          'dig -t A +tcp +retry=0 @\"${NAMESERVER}\" +short \"${svc}.${CLUSTER_DOMAIN}\"|grep -v \"^;\"'\n          'dig -t AAAA +tcp +retry=0 @\"${NAMESERVER}\" +short \"${svc}.${CLUSTER_DOMAIN}\"|grep -v \"^;\"')\n    for i in ${!cmds[*]}\n    do\n      ips=($(eval \"${cmds[i]}\"))\n      if [[ \"$?\" -eq 0 \u0026\u0026 \"${#ips[@]}\" -ne 0 ]]; then\n        svc_ips[\"${svc}\"]=\"${ips[@]}\"\n        break\n      fi\n    done\n  done\n\n  # Update /etc/hosts only if we get valid service IPs\n  # We will not update /etc/hosts when there is coredns service outage or api unavailability\n  # Stale entries could exist in /etc/hosts if the service is deleted\n  if [[ -n \"${svc_ips[*]-}\" ]]; then\n    # Build a new hosts file from /etc/hosts with our custom entries filtered out\n    if ! sed --silent \"/# ${OPENSHIFT_MARKER}/d; w ${TEMP_FILE}\" \"${HOSTS_FILE}\"; then\n      # Only continue rebuilding the hosts entries if its original content is preserved\n      sleep 60 \u0026 wait\n      continue\n    fi\n\n    # Append resolver entries for services\n    rc=0\n    for svc in \"${!svc_ips[@]}\"; do\n      for ip in ${svc_ips[${svc}]}; do\n        echo \"${ip} ${svc} ${svc}.${CLUSTER_DOMAIN} # ${OPENSHIFT_MARKER}\" \u003e\u003e \"${TEMP_FILE}\" || rc=$?\n      done\n    done\n    if [[ $rc -ne 0 ]]; then\n      sleep 60 \u0026 wait\n      continue\n    fi\n\n\n    # TODO: Update /etc/hosts atomically to avoid any inconsistent behavior\n    # Replace /etc/hosts with our modified version if needed\n    cmp \"${TEMP_FILE}\" \"${HOSTS_FILE}\" || cp -f \"${TEMP_FILE}\" \"${HOSTS_FILE}\"\n    # TEMP_FILE is not removed to avoid file create/delete and attributes copy churn\n  fi\n  sleep 60 \u0026 wait\n  unset svc_ips\ndone\n"
                              ],
                              "env": [
                                  {
                                      "name": "SERVICES",
                                      "value": "image-registry.openshift-image-registry.svc"
                                  },
                                  {
                                      "name": "NAMESERVER",
                                      "value": "172.30.0.10"
                                  },
                                  {
                                      "name": "CLUSTER_DOMAIN",
                                      "value": "cluster.local"
                                  }
                              ],
                              "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:af280e9ef6f3eedb47e3b06380c3ad766bc2de9a4f75a7cbb97185794e21f06e",
                              "imagePullPolicy": "IfNotPresent",
                              "name": "dns-node-resolver",
                              "resources": {
                                  "requests": {
                                      "cpu": "5m",
                                      "memory": "21Mi"
                                  }
                              },
                              "securityContext": {
                                  "privileged": true
                              },
                              "terminationMessagePath": "/dev/termination-log",
                              "terminationMessagePolicy": "FallbackToLogsOnError",
                              "volumeMounts": [
                                  {
                                      "mountPath": "/etc/hosts",
                                      "name": "hosts-file"
                                  },
                                  {
                                      "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount",
                                      "name": "kube-api-access-pmn5k",
                                      "readOnly": true
                                  }
                              ]
                          }
                      ],
                      "dnsPolicy": "ClusterFirst",
                      "enableServiceLinks": true,
                      "hostNetwork": true,
                      "imagePullSecrets": [
                          {
                              "name": "node-resolver-dockercfg-rdxn6"
                          }
                      ],
                      "nodeName": "ci-op-x7zz0b2c-d5fb3-6q2zf-worker-c-f62dr",
                      "nodeSelector": {
                          "kubernetes.io/os": "linux"
                      },
                      "preemptionPolicy": "PreemptLowerPriority",
                      "priority": 2000001000,
                      "priorityClassName": "system-node-critical",
                      "restartPolicy": "Always",
                      "schedulerName": "default-scheduler",
                      "securityContext": {},
                      "serviceAccount": "node-resolver",
                      "serviceAccountName": "node-resolver",
                      "terminationGracePeriodSeconds": 30,
                      "tolerations": [
                          {
                              "operator": "Exists"
                          }
                      ],
                      "volumes": [
                          {
                              "hostPath": {
                                  "path": "/etc/hosts",
                                  "type": "File"
                              },
                              "name": "hosts-file"
                          },
                          {
                              "name": "kube-api-access-pmn5k",
                              "projected": {
                                  "defaultMode": 420,
                                  "sources": [
                                      {
                                          "serviceAccountToken": {
                                              "expirationSeconds": 3607,
                                              "path": "token"
                                          }
                                      },
                                      {
                                          "configMap": {
                                              "items": [
                                                  {
                                                      "key": "ca.crt",
                                                      "path": "ca.crt"
                                                  }
                                              ],
                                              "name": "kube-root-ca.crt"
                                          }
                                      },
                                      {
                                          "downwardAPI": {
                                              "items": [
                                                  {
                                                      "fieldRef": {
                                                          "apiVersion": "v1",
                                                          "fieldPath": "metadata.namespace"
                                                      },
                                                      "path": "namespace"
                                                  }
                                              ]
                                          }
                                      },
                                      {
                                          "configMap": {
                                              "items": [
                                                  {
                                                      "key": "service-ca.crt",
                                                      "path": "service-ca.crt"
                                                  }
                                              ],
                                              "name": "openshift-service-ca.crt"
                                          }
                                      }
                                  ]
                              }
                          }
                      ]
                  },
                  "status": {
                      "conditions": [
                          {
                              "lastProbeTime": null,
                              "lastTransitionTime": "2023-05-10T16:44:36Z",
                              "status": "True",
                              "type": "Initialized"
                          },
                          {
                              "lastProbeTime": null,
                              "lastTransitionTime": "2023-05-10T17:18:21Z",
                              "status": "True",
                              "type": "Ready"
                          },
                          {
                              "lastProbeTime": null,
                              "lastTransitionTime": "2023-05-10T17:18:21Z",
                              "status": "True",
                              "type": "ContainersReady"
                          },
                          {
                              "lastProbeTime": null,
                              "lastTransitionTime": "2023-05-10T16:44:36Z",
                              "status": "True",
                              "type": "PodScheduled"
                          }
                      ],
                      "containerStatuses": [
                          {
                              "containerID": "cri-o://a98dc10659008451a6aecc079008c08abca3e18dc1c17be517275e62f1423d84",
                              "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:af280e9ef6f3eedb47e3b06380c3ad766bc2de9a4f75a7cbb97185794e21f06e",
                              "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:af280e9ef6f3eedb47e3b06380c3ad766bc2de9a4f75a7cbb97185794e21f06e",
                              "lastState": {},
                              "name": "dns-node-resolver",
                              "ready": true,
                              "restartCount": 1,
                              "started": true,
                              "state": {
                                  "running": {
                                      "startedAt": "2023-05-10T17:18:20Z"
                                  }
                              }
                          }
                      ],
                      "hostIP": "10.0.128.4",
                      "phase": "Running",
                      "podIP": "10.0.128.4",
                      "podIPs": [
                          {
                              "ip": "10.0.128.4"
                          }
                      ],
                      "qosClass": "Burstable",
                      "startTime": "2023-05-10T16:44:36Z"
                  }
              },
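
      The Ready transition times can also be pulled for all node-resolver pods at once; a minimal jq sketch, assuming pods.json is the full cross-namespace pod list produced by gather-extra:

      # Pod name and Ready lastTransitionTime for every node-resolver pod
      jq -r '.items[]
        | select(.metadata.namespace == "openshift-dns"
                 and (.metadata.name | startswith("node-resolver")))
        | .metadata.name + "\t"
          + (.status.conditions[] | select(.type == "Ready") | .lastTransitionTime)' pods.json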
      
      
      

      Version-Release number of selected component (if applicable):

       

      How reproducible:

       

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

       

      Additional info:

       

            People

              Candace Holman (cholman@redhat.com)
              Ken Zhang (kenzhang@redhat.com)
              Melvin Joseph
