OCPBUGS-21589

Cluster Version Operator in CrashLoopBackOff - no such host

Details

    • No
    • Sprint 246, Sprint 247, Sprint 248, Sprint 249, Sprint 250, Sprint 251, Sprint 252
    • 7
    • Rejected
    • False

    Description

      Description of problem:

      The cluster version operator (CVO) pod is in CrashLoopBackOff. Its log repeats the following DNS lookup failure for the internal API endpoint:
      
       $ oc logs pod/cluster-version-operator-fdd98d77c-899x7 -n openshift-cluster-version
         2023-10-02T13:03:53.956713295Z W1002 13:03:53.956647       1 start.go:157] Failed to get FeatureGate from cluster: Get "https://api-int.os-cluster-prod-01.ats-inc.com:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp: lookup api-int.os-cluster-prod-01.ats-inc.com on 10.136.0.10:53: no such host
         2023-10-02T13:03:55.956235251Z W1002 13:03:55.956166       1 start.go:157] Failed to get FeatureGate from cluster: Get "https://api-int.os-cluster-prod-01.ats-inc.com:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp: lookup api-int.os-cluster-prod-01.ats-inc.com on 10.136.0.10:53: no such host
         2023-10-02T13:03:57.956336455Z W1002 13:03:57.956273       1 start.go:157] Failed to get FeatureGate from cluster: Get "https://api-int.os-cluster-prod-01.ats-inc.com:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp: lookup api-int.os-cluster-prod-01.ats-inc.com on 10.136.0.10:53: no such host
         2023-10-02T13:03:59.956560629Z W1002 13:03:59.956518       1 start.go:157] Failed to get FeatureGate from cluster: Get "https://api-int.os-cluster-prod-01.ats-inc.com:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp: lookup api-int.os-cluster-prod-01.ats-inc.com on 10.136.0.10:53: no such host
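Each warning names both the hostname being resolved and the DNS server that was asked. A minimal sketch for pulling those two values out of a log line follows; `parse_lookup_failure` is a hypothetical helper written for this report, not part of `oc` or the CVO.

```shell
# Sample line copied from the CVO log above (leading timestamp trimmed).
line='W1002 13:03:53.956647       1 start.go:157] Failed to get FeatureGate from cluster: Get "https://api-int.os-cluster-prod-01.ats-inc.com:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp: lookup api-int.os-cluster-prod-01.ats-inc.com on 10.136.0.10:53: no such host'

# Hypothetical helper: prints "<hostname> <dns-server-ip>" for a
# "lookup ... on ...:53: no such host" message, and nothing otherwise.
parse_lookup_failure() {
  printf '%s\n' "$1" |
    sed -nE 's/.*lookup ([^ ]+) on ([0-9.]+):53: no such host.*/\1 \2/p'
}

parse_lookup_failure "$line"
```

For the sample line this prints `api-int.os-cluster-prod-01.ats-inc.com 10.136.0.10`, which shows that the failing lookups go to 10.136.0.10.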

      Version-Release number of selected component (if applicable):

      Cluster ID: 1c087e4c-ab97-442d-9eb2-dace2e252958
      Cluster Version: 4.12.4
      Desired Version: 4.12.4
      Channel: stable-4.12
      Previous Version(s): 4.12.3, 4.11.26, 4.10.51, 4.10.6(unverified)
      
      Infrastructure
      --------------
      Platform: VSphere
      Install Type: IPI
      apiServerInternalIP: 172.17.98.107
      apiServerInternalIPs: 172.17.98.107
      ingressIP: 172.17.98.108
      ingressIPs: 172.17.98.108
      
      Network
      -------
      Network Type: OpenShiftSDN
      httpProxy: None
      httpsProxy: None
      Cluster network: 10.132.0.0/14
              Host prefix: 23
              Max nodes: 512
              Max pods per node: 510
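The node and pod limits listed above appear to follow from the two prefix lengths. The arithmetic can be sketched as below, assuming that two addresses in each per-node subnet are reserved; that assumption is chosen because it reproduces the reported value of 510 and is not stated in the report itself.

```shell
cluster_prefix=14   # from "Cluster network: 10.132.0.0/14"
host_prefix=23      # from "Host prefix: 23"

# Number of /23 node subnets that fit inside the /14 cluster network.
max_nodes=$(( 1 << (host_prefix - cluster_prefix) ))

# Addresses in one /23 node subnet, minus two assumed reserved addresses.
max_pods_per_node=$(( (1 << (32 - host_prefix)) - 2 ))

echo "max nodes: $max_nodes"
echo "max pods per node: $max_pods_per_node"
```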

      How reproducible:

      The issue started without an obvious trigger; no significant changes had been made to the cluster for some time.

      Troubleshooting steps taken so far:

      1. On the master0 node, where the cluster version operator is running, 'dig' resolves the name successfully:
      sh-4.4# dig api-int.os-cluster-prod-01.ats-inc.com
      
        ; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8_6.1 <<>> api-int.os-cluster-prod-01.ats-inc.com
        ;; global options: +cmd
        ;; Got answer:
        ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59604
        ;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
        ;; WARNING: recursion requested but not available
      
        ;; OPT PSEUDOSECTION:
        ; EDNS: version: 0, flags:; udp: 512
        ; COOKIE: 1183add7ab786034 (echoed)
        ;; QUESTION SECTION:
        ;api-int.os-cluster-prod-01.ats-inc.com.        IN A
      
        ;; ANSWER SECTION:
        api-int.os-cluster-prod-01.ats-inc.com. 16 IN A 172.17.98.107
      
        ;; Query time: 0 msec
        ;; SERVER: 172.17.98.163#53(172.17.98.163)
        ;; WHEN: Mon Oct 02 17:05:31 UTC 2023
        ;; MSG SIZE  rcvd: 133
      
      
      2. The nameserver that answered, 172.17.98.163, is the IP address of the master0 node itself. Note that this is not the DNS server named in the CVO error (10.136.0.10):
      
      $ cat ip_addr
      2: ens192    inet 172.17.98.163/24 brd 172.17.98.255 scope global dynamic noprefixroute ens192\       valid_lft 80512sec preferred_lft 80512sec
      
      
      3. The dns-default service IP matches the server that the pod queries (10.136.0.10), which confirms that the pod is configured to use the cluster DNS server. The DNS Operator configures the kubelet to instruct pods to use the CoreDNS service IP address for name resolution:
      
      $ oc get svc -n openshift-dns
       NAME                 TYPE       CLUSTER-IP   EXTERNAL-IP  PORT(S)                 AGE
       service/dns-default  ClusterIP  10.136.0.10  <none>       53/UDP,53/TCP,9154/TCP  227d
      
      
      4. Restarting the node-resolver pod did not resolve the issue:
      $ oc delete pod/node-resolver-hmlvs -n openshift-dns
      
      
      5. Deleting the CVO pod did not resolve the issue:
      $ oc delete pod/cluster-version-operator-fdd98d77c-899x7 -n openshift-cluster-version
      
      
      6. I considered adding a forward nameserver to CoreDNS, but I was advised that this is not a recommended solution. Therefore, I am opening a Jira issue to seek assistance.
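Steps 1 to 3 involve two different DNS servers: the node's own resolver (172.17.98.163), which answered 'dig', and the cluster DNS service (10.136.0.10), which the CVO error names. One further check, sketched here and not yet run as part of this report, is to query both servers for the same name. `resolve_via` and `compare_resolvers` are hypothetical helpers, and the sketch assumes that `dig` is available and that the service IP is reachable from wherever it is run.

```shell
# Values taken from the report above.
HOST=api-int.os-cluster-prod-01.ats-inc.com
NODE_RESOLVER=172.17.98.163   # server that answered 'dig' in step 1
CLUSTER_DNS=10.136.0.10       # dns-default service IP from step 3

# Hypothetical helper: prints the A records that one server returns.
# Lines starting with ';' (dig diagnostics such as timeouts) are dropped.
resolve_via() {
  dig +short +time=2 +tries=1 "@$1" "$2" A | grep -v '^;' || true
}

# Hypothetical helper: queries both servers and reports each result.
compare_resolvers() {
  for server in "$NODE_RESOLVER" "$CLUSTER_DNS"; do
    answer="$(resolve_via "$server" "$HOST")"
    if [ -n "$answer" ]; then
      echo "$server: $answer"
    else
      echo "$server: no answer"
    fi
  done
}
```

Calling `compare_resolvers` on master0 would show whether the two servers disagree about the api-int record, which would narrow the problem to the cluster DNS path and rule out the node's resolver.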

          People

            alebedev@redhat.com Andrey Lebedev
            rhn-support-jaykim Jihoon Kim
            Jia Liu Jia Liu
            Jihoon Kim