[OCPBUGS-17446] Wrong advertise address is used in hosted control plane etcd - Red Hat Issue Tracker

Type: Bug
Resolution: Done-Errata
Priority: Undefined
Fix Version/s: 4.14.0
Affects Version/s: 4.14
Component/s: HyperShift
Labels:
None

Severity:
Moderate
Regression:
No
Blocked:
False
Blocked Reason:
None
Target Version:

4.14.0

Description of problem:

The advertise address configured for members of our HCP etcd clusters (e.g. etcd-0.etcd-client.namespace.svc:2379) is not resolvable via DNS. This impacts some of the etcd tooling that expects to reach each member at its advertise address.

Version-Release number of selected component (if applicable):

4.14 (and earlier)

How reproducible:

Always

Steps to Reproduce:

1. Create a HostedCluster and wait for it to come up.
2. Exec into an etcd pod and query cluster endpoint health:
   $ oc rsh etcd-0
   $ etcdctl --cacert /etc/etcd/tls/etcd-ca/ca.crt \
             --cert /etc/etcd/tls/server/server.crt \
             --key /etc/etcd/tls/server/server.key \
             --endpoints https://localhost:2379 \
             endpoint health --cluster -w table

Actual results:

An error is returned similar to:
{"level":"warn","ts":"2023-08-07T20:40:49.890254Z","logger":"client","caller":"v3@v3.5.9/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000378fc0/etcd-0.etcd-client.clusters-test-cluster.svc:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp: lookup etcd-0.etcd-client.clusters-test-cluster.svc on 172.30.0.10:53: no such host\""}
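The hostname that failed to resolve is buried in the `error` field of that JSON log line. As a rough illustration (not part of etcd or HyperShift tooling), the sketch below pulls it out; the function name `unresolved_host` is hypothetical, and the sample line is the message above with the `ts`, `logger`, `caller`, and `target` fields dropped for brevity.

```python
import json
import re


def unresolved_host(log_line: str):
    """Return the hostname from a 'lookup <host> on <dns>: no such host' error, or None."""
    record = json.loads(log_line)
    match = re.search(r"lookup (\S+) on \S+: no such host", record.get("error", ""))
    return match.group(1) if match else None


# Abbreviated copy of the log line from this report (some fields omitted).
line = r'{"level":"warn","msg":"retrying of unary invoker failed","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp: lookup etcd-0.etcd-client.clusters-test-cluster.svc on 172.30.0.10:53: no such host\""}'

print(unresolved_host(line))
```

For this line the extracted name is etcd-0.etcd-client.clusters-test-cluster.svc, i.e. the member's advertise address under the etcd-client service.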

Expected results:

Actual cluster health is returned:
+--------------------------------------------------------------+--------+-------------+-------+
|                           ENDPOINT                           | HEALTH |    TOOK     | ERROR |
+--------------------------------------------------------------+--------+-------------+-------+
| https://etcd-0.etcd-discovery.clusters-cewong-guest.svc:2379 |   true |  9.372168ms |       |
| https://etcd-2.etcd-discovery.clusters-cewong-guest.svc:2379 |   true | 12.269226ms |       |
| https://etcd-1.etcd-discovery.clusters-cewong-guest.svc:2379 |   true | 12.291392ms |       |
+--------------------------------------------------------------+--------+-------------+-------+

Additional info:

The etcd statefulset is created with spec.serviceName set to `etcd-discovery`. This means that pods in the statefulset get their subdomain set to `etcd-discovery`, so names like etcd-0.etcd-discovery.[ns].svc are resolvable. The same is not true for the etcd-client service: etcd-0.etcd-client.[ns].svc is not resolvable. The fix would be to change the advertise address of each member to a resolvable name (e.g. etcd-0.etcd-discovery.[ns].svc) and adjust the server certificate to allow those names as well.
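The naming rule described above can be sketched in a few lines. This is only an illustration of the pattern, not the HyperShift implementation: the helper names (`member_dns_name`, `advertise_client_url`, `server_cert_sans`) and the default of three replicas are assumptions, while the `etcd-discovery` service name, the `etcd-N` pod names, and port 2379 come from this report.

```python
# Illustrative sketch only; helper names are hypothetical, not HyperShift code.


def member_dns_name(ordinal: int, namespace: str,
                    statefulset: str = "etcd",
                    service: str = "etcd-discovery") -> str:
    """Per-pod DNS name under the statefulset's governing service (spec.serviceName)."""
    return f"{statefulset}-{ordinal}.{service}.{namespace}.svc"


def advertise_client_url(ordinal: int, namespace: str, port: int = 2379) -> str:
    """Client URL a member would advertise if it used its resolvable per-pod name."""
    return f"https://{member_dns_name(ordinal, namespace)}:{port}"


def server_cert_sans(namespace: str, replicas: int = 3) -> list:
    """DNS names the server certificate would need to allow, one per member."""
    return [member_dns_name(i, namespace) for i in range(replicas)]


if __name__ == "__main__":
    ns = "clusters-test-cluster"
    for i in range(3):
        print(advertise_client_url(i, ns))
```

Under these assumptions, member 0 in namespace clusters-test-cluster would advertise https://etcd-0.etcd-discovery.clusters-test-cluster.svc:2379, which matches the endpoint format shown in the expected results.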

links to

openshift/hypershift#2884: OCPBUGS-17446: Set advertise-address in HCP etcd to resolvable name

RHEA-2023:5006 rpm

mentioned on

Merge request - Bump IBM integration to our latest prod image.

Assignee: Cesar Wong

Reporter: Cesar Wong

QA Contact: Jie Zhao

Votes: 0

Watchers: 6

Created: 2023/08/07 9:08 PM

Updated: 2024/01/23 9:09 PM

Resolved: 2023/10/31 1:39 PM
