OCPBUGS-13555

Cluster operator dns is degraded after upgrade is complete

    Description

      Description of problem:

      Recently TRT has noticed the test [sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable after finishing upgrade [Late][Suite:upgrade] failing with "cluster is reporting a failing condition: Cluster operator dns is degraded".
      
      Take this job as an example: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1656308394878832640
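
      For runs like this, the failing condition can be confirmed from the gather-extra artifacts. A minimal jq sketch, assuming a clusteroperators.json dump (an oc-style List) is present alongside the other gathered files:

      # dns ClusterOperator conditions from the gathered clusteroperators.json (assumed filename/layout)
      jq -r '.items[]
        | select(.metadata.name == "dns")
        | .status.conditions[]
        | select(.type == "Degraded" or .type == "Progressing")
        | "\(.type)\t\(.status)\t\(.lastTransitionTime)\t\(.reason)"' clusteroperators.json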
      
      The test is reporting a failing condition at May 10 17:21:27.530: blob:https://prow.ci.openshift.org/00fd5dc0-3a8b-4d95-b142-a6560a9ce898
      
      Yet the cluster version indicates the upgrade completed at 2023-05-10T17:11:54Z.
      
      May 10 17:19:33.022: INFO: Cluster version:
      {
        "metadata": {
          "name": "version",
          "uid": "7dbd4d9d-9c66-4ef4-97db-599cf7a63b7a",
          "resourceVersion": "97888",
          "generation": 3,
          "creationTimestamp": "2023-05-10T14:49:58Z",
      ......
      
        "status": {
          "desired": {
            "version": "4.14.0-0.nightly-2023-05-10-143825",
            "image": "registry.build05.ci.openshift.org/ci-op-x7zz0b2c/release@sha256:8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275"
          },
          "history": [
            {
              "state": "Completed",
              "startedTime": "2023-05-10T15:53:59Z",
              "completionTime": "2023-05-10T17:11:54Z",
              "version": "4.14.0-0.nightly-2023-05-10-143825",
              "image": "registry.build05.ci.openshift.org/ci-op-x7zz0b2c/release@sha256:8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275",
              "verified": false,
              "acceptedRisks": "Target release version=\"\" image=\"registry.build05.ci.openshift.org/ci-op-x7zz0b2c/release@sha256:8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275\" cannot be verified, but continuing anyway because the update was forced: unable to verify sha256:8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275 against keyrings: verifier-public-key-redhat\n[2023-05-10T15:53:42Z: prefix sha256-8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275 in config map signatures-managed: no more signatures to check, 2023-05-10T15:53:42Z: unable to retrieve signature from https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275/signature-1: no more signatures to check, 2023-05-10T15:53:42Z: unable to retrieve signature from https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=8f8c4e518ecd2d5e64e2255a1d749a83bec629624beb507d2b67e98293d90275/signature-1: no more signatures to check, 2023-05-10T15:53:42Z: parallel signature store wrapping containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release, containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release: no more signatures to check, 2023-05-10T15:53:42Z: serial signature store wrapping config maps in openshift-config-managed with label \"release.openshift.io/verification-signatures\", parallel signature store wrapping containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release, containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release: no more signatures to check]\nForced through blocking failures: Multiple precondition checks failed:\n* Precondition \"EtcdRecentBackup\" failed because of \"ControllerStarted\": RecentBackup: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required\n* Precondition \"ClusterVersionRecommendedUpdate\" failed because of \"NoChannel\": Configured channel is unset, so the recommended status of updating from 4.13.0-0.nightly-2023-05-10-062807 to 4.14.0-0.nightly-2023-05-10-143825 is unknown."
            },
          ],
          "conditions": [
            {
              "type": "Failing",
              "status": "False",
              "lastTransitionTime": "2023-05-10T17:09:09Z"
            },
            {
              "type": "Progressing",
              "status": "False",
              "lastTransitionTime": "2023-05-10T17:11:54Z",
              "message": "Cluster version is 4.14.0-0.nightly-2023-05-10-143825"
            },
          ],
          "availableUpdates": null
        }
      }
      
      
      The above cluster version can be seen here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1656308394878832640/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/clusterversion.json
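
      For reference, the upgrade history (including completionTime) can be read straight from that artifact with a small jq sketch; (.items[0] // .) is only there to handle both a List-style dump and a single ClusterVersion object:

      # Print state, start time, completion time and version for each history entry
      jq -r '(.items[0] // .)
        | .status.history[]
        | "\(.state)\t\(.startedTime)\t\(.completionTime)\t\(.version)"' clusterversion.json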
      
      Looking at the dns operator log: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1656308394878832640/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/pods/openshift-dns-operator_dns-operator-fc68b6df5-6gk5k_dns-operator.log
      
      The DNS operator appears to report the update complete at 2023-05-10T17:11:04Z, flipping both Degraded and Progressing to False:
      
      time="2023-05-10T17:11:04Z" level=info msg="updated DNS default status: old: v1.DNSStatus{ClusterIP:\"172.30.0.10\", ClusterDomain:\"cluster.local\", Conditions:[]v1.OperatorCondition{v1.OperatorCondition{Type:\"Degraded\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 9, 31, 0, time.Local), Reason:\"MaxUnavailableDNSPodsExceeded\", Message:\"Too many DNS pods are unavailable (1 > 0 max unavailable).\"}, v1.OperatorCondition{Type:\"Progressing\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 9, 52, 0, time.Local), Reason:\"Reconciling\", Message:\"Have 5 available DNS pods, want 6.\"}, v1.OperatorCondition{Type:\"Available\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 15, 0, 19, 0, time.Local), Reason:\"AsExpected\", Message:\"The DNS daemonset has available pods, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Upgradeable\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 14, 59, 57, 0, time.Local), Reason:\"AsExpected\", Message:\"DNS Operator can be upgraded\"}}}, new: v1.DNSStatus{ClusterIP:\"172.30.0.10\", ClusterDomain:\"cluster.local\", Conditions:[]v1.OperatorCondition{v1.OperatorCondition{Type:\"Degraded\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"Enough DNS pods are available, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Progressing\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"All DNS and node-resolver pods are available, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Available\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 15, 0, 19, 0, time.Local), Reason:\"AsExpected\", Message:\"The DNS daemonset has available pods, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Upgradeable\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 14, 59, 57, 0, time.Local), Reason:\"AsExpected\", Message:\"DNS Operator can be upgraded\"}}}"
      
      The relevant section for the new state:
      
           v1.OperatorCondition{Type:\"Degraded\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"Enough DNS pods are available, and the DNS service has a cluster IP address.\"}, 
           v1.OperatorCondition{Type:\"Progressing\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"All DNS and node-resolver pods are available, and the DNS service has a cluster IP address.\"}, 
           v1.OperatorCondition{Type:\"Available\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 15, 0, 19, 0, time.Local), Reason:\"AsExpected\", Message:\"The DNS daemonset has available pods, and the DNS service has a cluster IP address.\"}, 
           v1.OperatorCondition{Type:\"Upgradeable\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 14, 59, 57, 0, time.Local), Reason:\"AsExpected\", Message:\"DNS Operator can be upgraded\"}}}"
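
      To reconstruct when the operator flipped these conditions over time, the status-update lines can be listed from the downloaded operator log; a rough sketch (dns-operator.log is a local copy of the log linked above):

      # Timestamps of every DNS status update written by the operator
      grep 'updated DNS default status' dns-operator.log \
        | grep -o 'time="[^"]*"' \
        | sort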
      
      
      But shortly after, at 2023-05-10T17:17:24Z, the DNS operator set Progressing back to True with the message "Have 5 available node-resolver pods, want 6."
      
      Full log line:
      time="2023-05-10T17:17:24Z" level=info msg="updated DNS default status: old: v1.DNSStatus{ClusterIP:\"172.30.0.10\", ClusterDomain:\"cluster.local\", Conditions:[]v1.OperatorCondition{v1.OperatorCondition{Type:\"Degraded\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"Enough DNS pods are available, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Progressing\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"All DNS and node-resolver pods are available, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Available\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 15, 0, 19, 0, time.Local), Reason:\"AsExpected\", Message:\"The DNS daemonset has available pods, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Upgradeable\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 14, 59, 57, 0, time.Local), Reason:\"AsExpected\", Message:\"DNS Operator can be upgraded\"}}}, new: v1.DNSStatus{ClusterIP:\"172.30.0.10\", ClusterDomain:\"cluster.local\", Conditions:[]v1.OperatorCondition{v1.OperatorCondition{Type:\"Degraded\", Status:\"False\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 11, 4, 0, time.Local), Reason:\"AsExpected\", Message:\"Enough DNS pods are available, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Progressing\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 17, 17, 24, 0, time.Local), Reason:\"Reconciling\", Message:\"Have 5 available node-resolver pods, want 6.\"}, v1.OperatorCondition{Type:\"Available\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 15, 0, 19, 0, time.Local), Reason:\"AsExpected\", Message:\"The DNS daemonset has available pods, and the DNS service has a cluster IP address.\"}, v1.OperatorCondition{Type:\"Upgradeable\", Status:\"True\", LastTransitionTime:time.Date(2023, time.May, 10, 14, 59, 57, 0, time.Local), Reason:\"AsExpected\", Message:\"DNS Operator can be upgraded\"}}}"
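
      That message appears to compare the node-resolver DaemonSet's available pod count against its desired count; on a live cluster the same numbers can be checked with something like:

      # Available vs. desired node-resolver pods, as reported by the DaemonSet status
      oc -n openshift-dns get daemonset node-resolver \
        -o jsonpath='{.status.numberAvailable}/{.status.desiredNumberScheduled}{"\n"}'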
      
      Now look at the status of the node-resolver pod node-resolver-xz776 (in https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1656308394878832640/artifacts/e2e-gcp-ovn-rt-upgrade/gather-extra/artifacts/pods.json):
      
      We see that this particular pod did not become Ready until 2023-05-10T17:18:21Z.
      
      
      {
                  "apiVersion": "v1",
                  "kind": "Pod",
                  "metadata": {
                      "creationTimestamp": "2023-05-10T16:44:36Z",
                      "generateName": "node-resolver-",
                      "labels": {
                          "controller-revision-hash": "55589c997",
                          "dns.operator.openshift.io/daemonset-node-resolver": "",
                          "pod-template-generation": "2"
                      },
                      "name": "node-resolver-xz776",
                      "namespace": "openshift-dns",
                      "ownerReferences": [
                          {
                              "apiVersion": "apps/v1",
                              "blockOwnerDeletion": true,
                              "controller": true,
                              "kind": "DaemonSet",
                              "name": "node-resolver",
                              "uid": "14d3633f-98eb-421a-be72-3ed92c820fd0"
                          }
                      ],
                      "resourceVersion": "100344",
                      "uid": "ff559a23-8031-4692-a678-5e99bea74379"
                  },
                  "spec": {
                      "affinity": {
                          "nodeAffinity": {
                              "requiredDuringSchedulingIgnoredDuringExecution": {
                                  "nodeSelectorTerms": [
                                      {
                                          "matchFields": [
                                              {
                                                  "key": "metadata.name",
                                                  "operator": "In",
                                                  "values": [
                                                      "ci-op-x7zz0b2c-d5fb3-6q2zf-worker-c-f62dr"
                                                  ]
                                              }
                                          ]
                                      }
                                  ]
                              }
                          }
                      },
                      "containers": [
                          {
                              "command": [
                                  "/bin/bash",
                                  "-c",
                                  "#!/bin/bash\nset -uo pipefail\n\ntrap 'jobs -p | xargs kill || true; wait; exit 0' TERM\n\nOPENSHIFT_MARKER=\"openshift-generated-node-resolver\"\nHOSTS_FILE=\"/etc/hosts\"\nTEMP_FILE=\"/etc/hosts.tmp\"\n\nIFS=', ' read -r -a services \u003c\u003c\u003c \"${SERVICES}\"\n\n# Make a temporary file with the old hosts file's attributes.\nif ! cp -f --attributes-only \"${HOSTS_FILE}\" \"${TEMP_FILE}\"; then\n  echo \"Failed to preserve hosts file. Exiting.\"\n  exit 1\nfi\n\nwhile true; do\n  declare -A svc_ips\n  for svc in \"${services[@]}\"; do\n    # Fetch service IP from cluster dns if present. We make several tries\n    # to do it: IPv4, IPv6, IPv4 over TCP and IPv6 over TCP. The two last ones\n    # are for deployments with Kuryr on older OpenStack (OSP13) - those do not\n    # support UDP loadbalancers and require reaching DNS through TCP.\n    cmds=('dig -t A @\"${NAMESERVER}\" +short \"${svc}.${CLUSTER_DOMAIN}\"|grep -v \"^;\"'\n          'dig -t AAAA @\"${NAMESERVER}\" +short \"${svc}.${CLUSTER_DOMAIN}\"|grep -v \"^;\"'\n          'dig -t A +tcp +retry=0 @\"${NAMESERVER}\" +short \"${svc}.${CLUSTER_DOMAIN}\"|grep -v \"^;\"'\n          'dig -t AAAA +tcp +retry=0 @\"${NAMESERVER}\" +short \"${svc}.${CLUSTER_DOMAIN}\"|grep -v \"^;\"')\n    for i in ${!cmds[*]}\n    do\n      ips=($(eval \"${cmds[i]}\"))\n      if [[ \"$?\" -eq 0 \u0026\u0026 \"${#ips[@]}\" -ne 0 ]]; then\n        svc_ips[\"${svc}\"]=\"${ips[@]}\"\n        break\n      fi\n    done\n  done\n\n  # Update /etc/hosts only if we get valid service IPs\n  # We will not update /etc/hosts when there is coredns service outage or api unavailability\n  # Stale entries could exist in /etc/hosts if the service is deleted\n  if [[ -n \"${svc_ips[*]-}\" ]]; then\n    # Build a new hosts file from /etc/hosts with our custom entries filtered out\n    if ! sed --silent \"/# ${OPENSHIFT_MARKER}/d; w ${TEMP_FILE}\" \"${HOSTS_FILE}\"; then\n      # Only continue rebuilding the hosts entries if its original content is preserved\n      sleep 60 \u0026 wait\n      continue\n    fi\n\n    # Append resolver entries for services\n    rc=0\n    for svc in \"${!svc_ips[@]}\"; do\n      for ip in ${svc_ips[${svc}]}; do\n        echo \"${ip} ${svc} ${svc}.${CLUSTER_DOMAIN} # ${OPENSHIFT_MARKER}\" \u003e\u003e \"${TEMP_FILE}\" || rc=$?\n      done\n    done\n    if [[ $rc -ne 0 ]]; then\n      sleep 60 \u0026 wait\n      continue\n    fi\n\n\n    # TODO: Update /etc/hosts atomically to avoid any inconsistent behavior\n    # Replace /etc/hosts with our modified version if needed\n    cmp \"${TEMP_FILE}\" \"${HOSTS_FILE}\" || cp -f \"${TEMP_FILE}\" \"${HOSTS_FILE}\"\n    # TEMP_FILE is not removed to avoid file create/delete and attributes copy churn\n  fi\n  sleep 60 \u0026 wait\n  unset svc_ips\ndone\n"
                              ],
                              "env": [
                                  {
                                      "name": "SERVICES",
                                      "value": "image-registry.openshift-image-registry.svc"
                                  },
                                  {
                                      "name": "NAMESERVER",
                                      "value": "172.30.0.10"
                                  },
                                  {
                                      "name": "CLUSTER_DOMAIN",
                                      "value": "cluster.local"
                                  }
                              ],
                              "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:af280e9ef6f3eedb47e3b06380c3ad766bc2de9a4f75a7cbb97185794e21f06e",
                              "imagePullPolicy": "IfNotPresent",
                              "name": "dns-node-resolver",
                              "resources": {
                                  "requests": {
                                      "cpu": "5m",
                                      "memory": "21Mi"
                                  }
                              },
                              "securityContext": {
                                  "privileged": true
                              },
                              "terminationMessagePath": "/dev/termination-log",
                              "terminationMessagePolicy": "FallbackToLogsOnError",
                              "volumeMounts": [
                                  {
                                      "mountPath": "/etc/hosts",
                                      "name": "hosts-file"
                                  },
                                  {
                                      "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount",
                                      "name": "kube-api-access-pmn5k",
                                      "readOnly": true
                                  }
                              ]
                          }
                      ],
                      "dnsPolicy": "ClusterFirst",
                      "enableServiceLinks": true,
                      "hostNetwork": true,
                      "imagePullSecrets": [
                          {
                              "name": "node-resolver-dockercfg-rdxn6"
                          }
                      ],
                      "nodeName": "ci-op-x7zz0b2c-d5fb3-6q2zf-worker-c-f62dr",
                      "nodeSelector": {
                          "kubernetes.io/os": "linux"
                      },
                      "preemptionPolicy": "PreemptLowerPriority",
                      "priority": 2000001000,
                      "priorityClassName": "system-node-critical",
                      "restartPolicy": "Always",
                      "schedulerName": "default-scheduler",
                      "securityContext": {},
                      "serviceAccount": "node-resolver",
                      "serviceAccountName": "node-resolver",
                      "terminationGracePeriodSeconds": 30,
                      "tolerations": [
                          {
                              "operator": "Exists"
                          }
                      ],
                      "volumes": [
                          {
                              "hostPath": {
                                  "path": "/etc/hosts",
                                  "type": "File"
                              },
                              "name": "hosts-file"
                          },
                          {
                              "name": "kube-api-access-pmn5k",
                              "projected": {
                                  "defaultMode": 420,
                                  "sources": [
                                      {
                                          "serviceAccountToken": {
                                              "expirationSeconds": 3607,
                                              "path": "token"
                                          }
                                      },
                                      {
                                          "configMap": {
                                              "items": [
                                                  {
                                                      "key": "ca.crt",
                                                      "path": "ca.crt"
                                                  }
                                              ],
                                              "name": "kube-root-ca.crt"
                                          }
                                      },
                                      {
                                          "downwardAPI": {
                                              "items": [
                                                  {
                                                      "fieldRef": {
                                                          "apiVersion": "v1",
                                                          "fieldPath": "metadata.namespace"
                                                      },
                                                      "path": "namespace"
                                                  }
                                              ]
                                          }
                                      },
                                      {
                                          "configMap": {
                                              "items": [
                                                  {
                                                      "key": "service-ca.crt",
                                                      "path": "service-ca.crt"
                                                  }
                                              ],
                                              "name": "openshift-service-ca.crt"
                                          }
                                      }
                                  ]
                              }
                          }
                      ]
                  },
                  "status": {
                      "conditions": [
                          {
                              "lastProbeTime": null,
                              "lastTransitionTime": "2023-05-10T16:44:36Z",
                              "status": "True",
                              "type": "Initialized"
                          },
                          {
                              "lastProbeTime": null,
                              "lastTransitionTime": "2023-05-10T17:18:21Z",
                              "status": "True",
                              "type": "Ready"
                          },
                          {
                              "lastProbeTime": null,
                              "lastTransitionTime": "2023-05-10T17:18:21Z",
                              "status": "True",
                              "type": "ContainersReady"
                          },
                          {
                              "lastProbeTime": null,
                              "lastTransitionTime": "2023-05-10T16:44:36Z",
                              "status": "True",
                              "type": "PodScheduled"
                          }
                      ],
                      "containerStatuses": [
                          {
                              "containerID": "cri-o://a98dc10659008451a6aecc079008c08abca3e18dc1c17be517275e62f1423d84",
                              "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:af280e9ef6f3eedb47e3b06380c3ad766bc2de9a4f75a7cbb97185794e21f06e",
                              "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:af280e9ef6f3eedb47e3b06380c3ad766bc2de9a4f75a7cbb97185794e21f06e",
                              "lastState": {},
                              "name": "dns-node-resolver",
                              "ready": true,
                              "restartCount": 1,
                              "started": true,
                              "state": {
                                  "running": {
                                      "startedAt": "2023-05-10T17:18:20Z"
                                  }
                              }
                          }
                      ],
                      "hostIP": "10.0.128.4",
                      "phase": "Running",
                      "podIP": "10.0.128.4",
                      "podIPs": [
                          {
                              "ip": "10.0.128.4"
                          }
                      ],
                      "qosClass": "Burstable",
                      "startTime": "2023-05-10T16:44:36Z"
                  }
              },
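
      The Ready transition times can also be pulled for all node-resolver pods at once; a minimal jq sketch, assuming pods.json is the full cross-namespace pod list produced by gather-extra:

      # Pod name and Ready lastTransitionTime for every node-resolver pod
      jq -r '.items[]
        | select(.metadata.namespace == "openshift-dns"
                 and (.metadata.name | startswith("node-resolver")))
        | .metadata.name + "\t"
          + (.status.conditions[] | select(.type == "Ready") | .lastTransitionTime)' pods.json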
      
      
      

      Version-Release number of selected component (if applicable):

       

      How reproducible:

       

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

       

      Additional info:

       

            People

              Candace Holman (cholman@redhat.com)
              Ken Zhang (kenzhang@redhat.com)
              Melvin Joseph
