Details

Type: Bug
Resolution: Unresolved
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.12.z
Component/s: Networking / DNS
Labels:
- groomed
- ne-triaged

Regression:
No
Sprint:
Sprint 246, Sprint 247, Sprint 248, Sprint 249, Sprint 250, Sprint 251, Sprint 252
sprint_count:
7
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:
None

SFDC Cases Counter:
SFDC Cases Links:

Description

Description of problem:

The cluster version operator (CVO) cannot resolve the internal API hostname (api-int) through the DNS server at 10.136.0.10 and logs the following warning every two seconds:

 $ oc logs pod/cluster-version-operator-fdd98d77c-899x7 -n openshift-cluster-version
   2023-10-02T13:03:53.956713295Z W1002 13:03:53.956647       1 start.go:157] Failed to get FeatureGate from cluster: Get "https://api-int.os-cluster-prod-01.ats-inc.com:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp: lookup api-int.os-cluster-prod-01.ats-inc.com on 10.136.0.10:53: no such host
   2023-10-02T13:03:55.956235251Z W1002 13:03:55.956166       1 start.go:157] Failed to get FeatureGate from cluster: Get "https://api-int.os-cluster-prod-01.ats-inc.com:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp: lookup api-int.os-cluster-prod-01.ats-inc.com on 10.136.0.10:53: no such host
   2023-10-02T13:03:57.956336455Z W1002 13:03:57.956273       1 start.go:157] Failed to get FeatureGate from cluster: Get "https://api-int.os-cluster-prod-01.ats-inc.com:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp: lookup api-int.os-cluster-prod-01.ats-inc.com on 10.136.0.10:53: no such host
   2023-10-02T13:03:59.956560629Z W1002 13:03:59.956518       1 start.go:157] Failed to get FeatureGate from cluster: Get "https://api-int.os-cluster-prod-01.ats-inc.com:6443/apis/config.openshift.io/v1/featuregates/cluster": dial tcp: lookup api-int.os-cluster-prod-01.ats-inc.com on 10.136.0.10:53: no such host
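
To check that every failure involves the same hostname and the same resolver, the log can be condensed to distinct (hostname, resolver) pairs. This is a minimal sketch: the function name summarize_lookup_failures is hypothetical, and the pattern assumes the message format shown in the excerpt above.

```shell
# Read CVO log lines on stdin and print "<count> <hostname> <resolver>"
# for each distinct lookup that failed with "no such host".
summarize_lookup_failures() {
  sed -n 's/.*lookup \([^ ]*\) on \([^ ]*\): no such host.*/\1 \2/p' \
    | sort | uniq -c | awk '{print $1, $2, $3}'
}

# Usage (pod name taken from this report):
#   oc logs pod/cluster-version-operator-fdd98d77c-899x7 -n openshift-cluster-version \
#     | summarize_lookup_failures
```

A single output line would indicate that all failures come from one resolver, which narrows the problem to that DNS path rather than to the CVO itself.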

Version-Release number of selected component (if applicable):

Cluster ID: 1c087e4c-ab97-442d-9eb2-dace2e252958
Cluster Version: 4.12.4
Desired Version: 4.12.4
Channel: stable-4.12
Previous Version(s): 4.12.3, 4.11.26, 4.10.51, 4.10.6(unverified)

Infrastructure
--------------
Platform: VSphere
Install Type: IPI
apiServerInternalIP: 172.17.98.107
apiServerInternalIPs: 172.17.98.107
ingressIP: 172.17.98.108
ingressIPs: 172.17.98.108

Network
-------
Network Type: OpenShiftSDN
httpProxy: None
httpsProxy: None
Cluster network: 10.132.0.0/14
        Host prefix: 23
        Max nodes: 512
        Max pods per node: 510
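
The node and pod limits reported above are consistent with the cluster network and host prefix. A short sketch of the arithmetic follows; the subtraction of two addresses per node subnet is an assumption made here to match the reported figure of 510.

```shell
# Derive capacity figures from the cluster network and host prefix.
cluster_prefix=14   # from "Cluster network: 10.132.0.0/14"
host_prefix=23      # from "Host prefix: 23"

# Number of /23 node subnets that fit in the /14 cluster network.
max_nodes=$(( 1 << (host_prefix - cluster_prefix) ))

# Addresses in one /23 subnet, minus two assumed reserved addresses.
max_pods_per_node=$(( (1 << (32 - host_prefix)) - 2 ))

echo "Max nodes: $max_nodes"
echo "Max pods per node: $max_pods_per_node"
```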

How reproducible:

Unknown. The issue started without an obvious trigger; no significant changes had been made to the cluster for some time beforehand.

Troubleshooting steps taken so far:

1. On the master0 node, where the cluster version operator is running, 'dig' resolves the name successfully:
sh-4.4# dig api-int.os-cluster-prod-01.ats-inc.com

  ; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8_6.1 <<>> api-int.os-cluster-prod-01.ats-inc.com
  ;; global options: +cmd
  ;; Got answer:
  ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 59604
  ;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
  ;; WARNING: recursion requested but not available

  ;; OPT PSEUDOSECTION:
  ; EDNS: version: 0, flags:; udp: 512
  ; COOKIE: 1183add7ab786034 (echoed)
  ;; QUESTION SECTION:
  ;api-int.os-cluster-prod-01.ats-inc.com.        IN A

  ;; ANSWER SECTION:
  api-int.os-cluster-prod-01.ats-inc.com. 16 IN A 172.17.98.107

  ;; Query time: 0 msec
  ;; SERVER: 172.17.98.163#53(172.17.98.163)
  ;; WHEN: Mon Oct 02 17:05:31 UTC 2023
  ;; MSG SIZE  rcvd: 133


2. The nameserver 172.17.98.163 corresponds to the IP address of the master0 node itself:

$ cat ip_addr
2: ens192    inet 172.17.98.163/24 brd 172.17.98.255 scope global dynamic noprefixroute ens192\       valid_lft 80512sec preferred_lft 80512sec


3. The DNS service listing confirms that 10.136.0.10, the server named in the CVO error, is the ClusterIP of the dns-default service, so the pod is using the expected cluster DNS server (the DNS Operator configures the kubelet to instruct pods to use the CoreDNS service IP address for name resolution):

$ oc get svc -n  openshift-dns
 NAME                 TYPE       CLUSTER-IP   EXTERNAL-IP  PORT(S)                 AGE
 service/dns-default  ClusterIP  10.136.0.10  <none>       53/UDP,53/TCP,9154/TCP  227d


4. Restarting the node-resolver pod did not resolve the issue:
$ oc delete pod/node-resolver-hmlvs -n openshift-dns


5. Deleting the CVO pod did not resolve the issue either:
$ oc delete pod/cluster-version-operator-fdd98d77c-899x7 -n openshift-cluster-version


6. I considered adding a forward nameserver to CoreDNS, but was advised that this is not recommended, so I am opening this Jira issue to request assistance.
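
One gap in the steps above: the dig in step 1 was answered by 172.17.98.163 (the node itself), while the failing lookups in the CVO log were sent to 10.136.0.10:53 (the dns-default service from step 3). The successful dig therefore does not show that the cluster DNS service can resolve the name. The following is a minimal sketch for making that comparison explicit. The helper name dig_summary is hypothetical, and the commented commands assume that dig is available on the node and that the service IP is reachable from it.

```shell
# Read `dig` output on stdin and print the response status and the
# address of the server that answered.
dig_summary() {
  awk '
    /->>HEADER<<-/ {
      for (i = 1; i <= NF; i++)
        if ($i == "status:") { s = $(i + 1); sub(/,$/, "", s) }
    }
    /;; SERVER:/ { srv = $3; sub(/#.*/, "", srv) }
    END { print "status=" s " server=" srv }
  '
}

# Usage: compare the node resolver with the cluster DNS service.
#   dig api-int.os-cluster-prod-01.ats-inc.com | dig_summary
#   dig @10.136.0.10 api-int.os-cluster-prod-01.ats-inc.com | dig_summary
#
# It may also help to check which DNS policy the CVO pod runs with:
#   oc get pod cluster-version-operator-fdd98d77c-899x7 \
#     -n openshift-cluster-version -o jsonpath='{.spec.dnsPolicy}'
```

If the second query returns a status other than NOERROR, the failure would be reproducible outside the CVO pod and would point at how the cluster DNS service resolves api-int rather than at the node resolver.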

People

Assignee: Andrey Lebedev

Reporter: Jihoon Kim

QA Contact: Jia Liu

Need Info From: Jihoon Kim

Votes: 0

Watchers: 7

Dates

Created: 2023/10/13 9:20 PM

Updated: 2024/04/10 6:00 AM