Red Hat Advanced Cluster Management
ACM-30330

ManagedCluster fails to rejoin hub cluster after successful server replacement in IPv6 environment


    • Type: Bug
    • Resolution: Unresolved
    • Component: SiteConfig Operator
    • Severity: Important

      Description of problem:

      In an IPv6-only environment, when a server is replaced via IBBF/IBR, the spoke cluster fails to rejoin the hub because it cannot reach the hub cluster.

      The spoke is trying to reach the hub at:

      Hub API: https://api.kni-qe-81.telcoqe.eng.rdu2.dc.redhat.com:6443
      IPv6 address: [2620:52:9:1684:8c7:e6ff:fee3:5e21]:6443

      The klusterlet-agent logs show repeated timeouts trying to connect to the hub:

      dial tcp [2620:52:9:1684:8c7:e6ff:fee3:5e21]:6443: i/o timeout

      The klusterlet status also confirms this:

      type: HubConnectionDegraded
      status: "True"
      reason: BootstrapSecretError,HubKubeConfigSecretMissing
      message: |
        Failed to create &SelfSubjectAccessReview... Post "https://api.kni-qe-81.telcoqe.eng.rdu2.dc.redhat.com:6443/...":
        dial tcp [2620:52:9:1684:8c7:e6ff:fee3:5e21]:6443: i/o timeout

      This causes the reimport to loop indefinitely, because the spoke cluster is never reported as healthy.

      2026-02-23T13:53:44.247Z    INFO    ClusterInstanceController    controller/clusterinstance_controller.go:157    Starting reconcile ClusterInstance    {"name": "helix82", "namespace": "helix82"}
      2026-02-23T13:53:44.247Z    INFO    ClusterInstanceController    controller/clusterinstance_controller.go:178    Loaded ClusterInstance    {"name": "helix82", "namespace": "helix82", "version": "3829902"}
      2026-02-23T13:53:44.247Z    INFO    ClusterInstanceController.EnsureClusterIsReimported    reinstall/reimport.go:173    Checking cluster health after provisioning    {"name": "helix82", "namespace": "helix82", "version": "3829902", "timeSinceProvisioning": "6m32.247818148s", "gracePeriod": "5m0s", "gracePeriodRemaining": "-1m32.247818148s"}
      2026-02-23T13:53:44.247Z    INFO    ClusterInstanceController.EnsureClusterIsReimported    reinstall/reimport.go:256    Cluster lease update stopped after grace period - triggering reimport    {"name": "helix82", "namespace": "helix82", "version": "3829902", "timeSinceProvisioning": "6m32.247818148s", "AvailableReason": "ManagedClusterLeaseUpdateStopped", "AvailableMessage": "Registration agent stopped updating its lease."}
      2026-02-23T13:53:44.247Z    INFO    ClusterInstanceController.EnsureClusterIsReimported    reinstall/reimport.go:269    Grace period expired and cluster still unhealthy, triggering reimport    {"name": "helix82", "namespace": "helix82", "version": "3829902", "cluster": "helix82", "timeSinceProvisioning": "6m32.247818148s"}
      2026-02-23T13:53:44.247Z    INFO    ClusterInstanceController.EnsureClusterIsReimported    reinstall/reimport.go:278    Reimport already initiated, waiting for cluster to become healthy    {"name": "helix82", "namespace": "helix82", "version": "3829902", "conditionMessage": "Klusterlet manifests applied to spoke cluster, waiting for cluster to become healthy"}
      2026-02-23T13:53:44.247Z    INFO    ClusterInstanceController    controller/clusterinstance_controller.go:214    Cluster reimport still in progress, requeuing    {"name": "helix82", "namespace": "helix82", "version": "3829902"}
      2026-02-23T13:53:44.247Z    INFO    ClusterInstanceController    controller/clusterinstance_controller.go:154    Finished reconciling ClusterInstance    {"name": "helix82", "namespace": "helix82", "version": "3829902"}

      The root cause is a stale IPv6 neighbor cache entry on the hub cluster node holding the API VIP. The replacement server is reinstalled with the preserved IP address of the old server but uses the new server's MAC address, so traffic from the hub keeps being sent to the old MAC until the neighbor entry is refreshed. Manually flushing the IPv6 neighbor cache on the hub node holding the API VIP fixes the problem.
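A minimal sketch of the manual workaround described above. The stale address is the spoke's preserved IP from the logs; the bridge name (br-ex) and the dry-run wrapper are illustrative assumptions — use whichever interface on the hub node actually carries the API VIP, and run the commands for real (as root) on that node.

```shell
# Spoke's preserved IPv6 address (from the timeout logs above).
STALE_IP="2620:52:9:1684:8c7:e6ff:fee3:5e21"
# Assumed bridge carrying the API VIP; verify on your hub node.
BRIDGE="br-ex"

# Print commands instead of executing them unless DRY_RUN=0
# (the real commands need root on the hub node holding the VIP).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# 1. Confirm the stale entry: the old server's MAC still shows for STALE_IP.
run ip -6 neigh show dev "$BRIDGE"
# 2. Flush it; the next neighbor solicitation learns the new server's MAC.
run sudo ip -6 neigh del "$STALE_IP" dev "$BRIDGE"
```

With `DRY_RUN=0` this performs the same flush as step 6 of the reproduce steps below.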

      This has likely been an ongoing problem that was masked by ACM-23801; it was exposed when the reimport logic was added to IBBF/IBR.

       

      Version-Release number of selected component (if applicable): 2.16

      How reproducible: Always

      Steps to Reproduce:

      1. Reinstall the server via IBR/IBBF
      2. Monitor the managedcluster status; it reports available=unknown
      3. Monitor the siteconfig pod log, watching for reimport failures
      4. Monitor the klusterlet-agent logs on the spoke cluster, watching for the HubConnectionDegraded condition
      5. Attempt to ping the API VIP from the replaced spoke cluster (fails)
      6. Manually flush the IPv6 neighbor cache on the hub node running the API VIP (sudo ip -6 neigh del <IP> dev <bridge name>)
      7. Monitor the managedcluster status; it returns to available=true
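The monitoring in steps 2 and 7 can be scripted. A sketch, assuming the cluster name helix82 from the logs above, the standard ManagedClusterConditionAvailable condition type, and a logged-in oc session on the hub; here the script only builds and prints the command.

```shell
# Cluster name taken from the controller logs above; adjust as needed.
CLUSTER="helix82"
# JSONPath filter for the ManagedCluster availability condition.
JSONPATH='{.status.conditions[?(@.type=="ManagedClusterConditionAvailable")].status}'
check_cmd="oc get managedcluster $CLUSTER -o jsonpath='$JSONPATH'"

# On a hub with a logged-in oc session, running this command prints
# "Unknown" while the spoke is unreachable and "True" once it rejoins.
echo "$check_cmd"
```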

      Actual results:

      The reinstalled spoke cluster cannot communicate with the hub cluster API.

      Expected results:

      The spoke cluster rejoins the hub after a successful IBR/IBBF server replacement.

      Additional info:

              sakhoury@redhat.com Sharat Akhoury
              josclark@redhat.com Joshua Clark