Red Hat Advanced Cluster Management
ACM-30330

ManagedCluster fails to rejoin hub cluster after successful server replacement in IPv6 environment


    • Type: Bug
    • Resolution: Unresolved
    • Component: SiteConfig Operator
    • Severity: Important

      Description of problem:

      In an IPv6-only environment, when a server is replaced via IBBF/IBR, the spoke cluster fails to rejoin the hub because it cannot reach the hub cluster.

      The spoke is trying to reach the hub at:

      Hub API: https://api.kni-qe-81.telcoqe.eng.rdu2.dc.redhat.com:6443
      IPv6 address: [2620:52:9:1684:8c7:e6ff:fee3:5e21]:6443

      The klusterlet-agent logs show repeated timeouts trying to connect to the hub:

      dial tcp [2620:52:9:1684:8c7:e6ff:fee3:5e21]:6443: i/o timeout

      The klusterlet status also confirms this:

      type: HubConnectionDegraded
      status: "True"
      reason: BootstrapSecretError,HubKubeConfigSecretMissing
      message: |
        Failed to create &SelfSubjectAccessReview... Post "https://api.kni-qe-81.telcoqe.eng.rdu2.dc.redhat.com:6443/...":
        dial tcp [2620:52:9:1684:8c7:e6ff:fee3:5e21]:6443: i/o timeout

      This causes the reimport to loop indefinitely, because the spoke cluster is never reported as healthy.

      2026-02-23T13:53:44.247Z    INFO    ClusterInstanceController    controller/clusterinstance_controller.go:157    Starting reconcile ClusterInstance    {"name": "helix82", "namespace": "helix82"}
      2026-02-23T13:53:44.247Z    INFO    ClusterInstanceController    controller/clusterinstance_controller.go:178    Loaded ClusterInstance    {"name": "helix82", "namespace": "helix82", "version": "3829902"}
      2026-02-23T13:53:44.247Z    INFO    ClusterInstanceController.EnsureClusterIsReimported    reinstall/reimport.go:173    Checking cluster health after provisioning    {"name": "helix82", "namespace": "helix82", "version": "3829902", "timeSinceProvisioning": "6m32.247818148s", "gracePeriod": "5m0s", "gracePeriodRemaining": "-1m32.247818148s"}
      2026-02-23T13:53:44.247Z    INFO    ClusterInstanceController.EnsureClusterIsReimported    reinstall/reimport.go:256    Cluster lease update stopped after grace period - triggering reimport    {"name": "helix82", "namespace": "helix82", "version": "3829902", "timeSinceProvisioning": "6m32.247818148s", "AvailableReason": "ManagedClusterLeaseUpdateStopped", "AvailableMessage": "Registration agent stopped updating its lease."}
      2026-02-23T13:53:44.247Z    INFO    ClusterInstanceController.EnsureClusterIsReimported    reinstall/reimport.go:269    Grace period expired and cluster still unhealthy, triggering reimport    {"name": "helix82", "namespace": "helix82", "version": "3829902", "cluster": "helix82", "timeSinceProvisioning": "6m32.247818148s"}
      2026-02-23T13:53:44.247Z    INFO    ClusterInstanceController.EnsureClusterIsReimported    reinstall/reimport.go:278    Reimport already initiated, waiting for cluster to become healthy    {"name": "helix82", "namespace": "helix82", "version": "3829902", "conditionMessage": "Klusterlet manifests applied to spoke cluster, waiting for cluster to become healthy"}
      2026-02-23T13:53:44.247Z    INFO    ClusterInstanceController    controller/clusterinstance_controller.go:214    Cluster reimport still in progress, requeuing    {"name": "helix82", "namespace": "helix82", "version": "3829902"}
      2026-02-23T13:53:44.247Z    INFO    ClusterInstanceController    controller/clusterinstance_controller.go:154    Finished reconciling ClusterInstance    {"name": "helix82", "namespace": "helix82", "version": "3829902"}

      The root cause is a stale IPv6 neighbor cache entry on the hub cluster node holding the API VIP. The replacement server is reinstalled with the preserved IP address of the old server but uses the new server's MAC address, so traffic from the hub keeps being sent to the old MAC until the neighbor entry is refreshed. Manually flushing the IPv6 neighbor cache on the hub node holding the API VIP fixes the problem.
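A minimal sketch of the manual workaround described above. The stale address is the spoke's preserved IP from the logs; the bridge name (br-ex) and the dry-run wrapper are illustrative assumptions — use whichever interface on the hub node actually carries the API VIP, and run the commands for real (as root) on that node.

```shell
# Spoke's preserved IPv6 address (from the timeout logs above).
STALE_IP="2620:52:9:1684:8c7:e6ff:fee3:5e21"
# Assumed bridge carrying the API VIP; verify on your hub node.
BRIDGE="br-ex"

# Print commands instead of executing them unless DRY_RUN=0
# (the real commands need root on the hub node holding the VIP).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# 1. Confirm the stale entry: the old server's MAC still shows for STALE_IP.
run ip -6 neigh show dev "$BRIDGE"
# 2. Flush it; the next neighbor solicitation learns the new server's MAC.
run sudo ip -6 neigh del "$STALE_IP" dev "$BRIDGE"
```

With `DRY_RUN=0` this performs the same flush as step 6 of the reproduce steps below.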

      This has likely been an ongoing problem that was masked by ACM-23801; it was exposed when the reimport logic was added to IBBF/IBR.

       

      Version-Release number of selected component (if applicable): 2.16

      How reproducible: Always

      Steps to Reproduce:

      1. Reinstall the server via IBR/IBBF
      2. Monitor the managedcluster status; it reports available=unknown
      3. Monitor the siteconfig pod log, watching for reimport failures
      4. Monitor the klusterlet-agent logs on the spoke cluster, watching for the HubConnectionDegraded condition
      5. Attempt to ping the API VIP from the replaced spoke cluster (fails)
      6. Manually flush the IPv6 neighbor cache on the hub node running the API VIP (sudo ip -6 neigh del <IP> dev <bridge name>)
      7. Monitor the managedcluster status; it returns to available=true
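The monitoring in steps 2 and 7 can be scripted. A sketch, assuming the cluster name helix82 from the logs above, the standard ManagedClusterConditionAvailable condition type, and a logged-in oc session on the hub; here the script only builds and prints the command.

```shell
# Cluster name taken from the controller logs above; adjust as needed.
CLUSTER="helix82"
# JSONPath filter for the ManagedCluster availability condition.
JSONPATH='{.status.conditions[?(@.type=="ManagedClusterConditionAvailable")].status}'
check_cmd="oc get managedcluster $CLUSTER -o jsonpath='$JSONPATH'"

# On a hub with a logged-in oc session, running this command prints
# "Unknown" while the spoke is unreachable and "True" once it rejoins.
echo "$check_cmd"
```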

      Actual results:

      The reinstalled spoke cluster cannot communicate with the hub cluster API.

      Expected results:

      The spoke cluster rejoins the hub after a successful IBR/IBBF server replacement.

      Additional info:

              sakhoury@redhat.com Sharat Akhoury
              josclark@redhat.com Joshua Clark