OpenShift Bugs / OCPBUGS-20024

Excessive TopologyAwareHintsDisabled events due to service/dns-default with topology aware hints activated.


    • Moderate
    • 5
    • Sprint 243, Sprint 244, Sprint 245, Sprint 246, Sprint 247, Sprint 248, Sprint 249, Sprint 250, Sprint 251, Sprint 252, Sprint 253
    • 11
    • Rejected
    • False
      Cause: The DNS Operator configures the cluster DNS service to use topology-aware hints when it determines that the conditions for enabling topology-aware hints are met. Using topology-aware hints for the cluster DNS service improves the reliability and latency of DNS lookups, in that it causes DNS queries to be routed to a DNS pod in the same availability zone as the client pod when possible. However, the Operator did not properly check one of those conditions. Specifically, the Operator did not verify that the cluster had nodes in at least two availability zones. Also, the DNS daemon set did not use surge for rolling updates, and so a rolling update of DNS pods would delete the old DNS pod on each node before creating the new DNS pod on that node.

      Consequence: Clusters in which all nodes were in the same availability zone would repeatedly emit "TopologyAwareHintsDisabled" events for the cluster DNS service. In addition, during upgrades, there was an increased risk of disruption or latency for DNS lookups from client pods on a node between the time when the old DNS pod on that node was deleted and when the new DNS pod on that node became ready.

      Fix: The Operator now checks that the cluster has nodes in at least two availability zones before enabling topology-aware hints for the cluster DNS service. In addition, the Operator enables surge for the DNS daemon set, so that a rolling update of DNS pods creates new pods before deleting old pods.

      Result: The "TopologyAwareHintsDisabled" events are no longer emitted on clusters that do not have nodes in multiple availability zones. In addition, the cluster DNS service is more reliable during upgrades on clusters that have topology-aware hints or the equivalent functionality to prefer local DNS pods.
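      The surge part of the fix corresponds to a DaemonSet update strategy along these lines. This is an illustrative config fragment, not the exact manifest the DNS Operator manages; the 10% value is an assumption:

      ```yaml
      apiVersion: apps/v1
      kind: DaemonSet
      metadata:
        name: dns-default
        namespace: openshift-dns
      spec:
        updateStrategy:
          type: RollingUpdate
          rollingUpdate:
            # Create the replacement DNS pod on a node before deleting the
            # old one. Kubernetes requires maxUnavailable to be 0 when
            # maxSurge is set on a DaemonSet.
            maxSurge: 10%
            maxUnavailable: 0
      ```

      With maxSurge set, each node briefly runs both the old and new DNS pod during a rolling update, so client lookups on that node have a ready endpoint throughout.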
    • Bug Fix
    • In Progress

      Kube 1.26 introduced the warning level TopologyAwareHintsDisabled event. TopologyAwareHintsDisabled is fired by the EndpointSliceController whenever reconciling a service that has activated topology aware hints via the service.kubernetes.io/topology-aware-hints annotation, but there is not enough information in the existing cluster resources (typically nodes) to apply the topology aware hints.
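      For reference, a service opts in via the annotation named above. A minimal sketch (the selector and ports here are illustrative, not the exact dns-default manifest):

      ```yaml
      apiVersion: v1
      kind: Service
      metadata:
        name: dns-default
        namespace: openshift-dns
        annotations:
          # "Auto" asks the EndpointSlice controller to populate zone hints
          # when the cluster's node and zone data allow it; otherwise the
          # controller emits TopologyAwareHintsDisabled events.
          service.kubernetes.io/topology-aware-hints: "Auto"
      spec:
        selector:
          dns.operator.openshift.io/daemonset-dns: default
        ports:
          - name: dns
            port: 53
            protocol: UDP
      ```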

      When re-basing OpenShift onto Kube 1.26, our CI builds are failing (except on AWS), because these events are firing "pathologically", for example:

      : [sig-arch] events should not repeat pathologically
        events happened too frequently event happened 83 times, something is wrong: ns/openshift-dns service/dns-default - reason/TopologyAwareHintsDisabled Insufficient Node information: allocatable CPU or zone not specified on one or more nodes, addressType: IPv4 result=reject 

      AWS nodes seem to have the proper values. GCP nodes have the values as well, but they are not "right" for the purposes of the EndpointSliceController:

      event happened 38 times, something is wrong: ns/openshift-dns service/dns-default - reason/TopologyAwareHintsDisabled Unable to allocate minimum required endpoints to each zone without exceeding overload threshold (5 endpoints, 3 zones), addressType: IPv4 result=reject

      https://github.com/openshift/origin/pull/27666 will mask this problem (make it stop erroring in CI) but changes still need to be made in the product so end users are not subjected to these events.
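      The product-side fix amounts to a precondition check before enabling hints: every node must report a zone, and the nodes must span at least two distinct zones. A minimal sketch of that logic (the function name and shape are hypothetical, not the DNS Operator's actual code):

      ```go
      package main

      import "fmt"

      // zoneLabel is the well-known topology label Kubernetes sets on nodes.
      const zoneLabel = "topology.kubernetes.io/zone"

      // shouldEnableTopologyAwareHints reports whether topology-aware hints
      // are worthwhile: every node must carry a non-empty zone label, and
      // the nodes must span at least two distinct zones. (Hypothetical
      // helper for illustration only.)
      func shouldEnableTopologyAwareHints(nodeLabels []map[string]string) bool {
      	zones := map[string]struct{}{}
      	for _, labels := range nodeLabels {
      		zone, ok := labels[zoneLabel]
      		if !ok || zone == "" {
      			// A node without zone info makes hints unusable; this is
      			// the "zone not specified on one or more nodes" case above.
      			return false
      		}
      		zones[zone] = struct{}{}
      	}
      	return len(zones) >= 2
      }

      func main() {
      	singleZone := []map[string]string{
      		{zoneLabel: "us-east-1a"},
      		{zoneLabel: "us-east-1a"},
      	}
      	multiZone := []map[string]string{
      		{zoneLabel: "us-east-1a"},
      		{zoneLabel: "us-east-1b"},
      	}
      	fmt.Println(shouldEnableTopologyAwareHints(singleZone)) // false
      	fmt.Println(shouldEnableTopologyAwareHints(multiZone))  // true
      }
      ```

      A single-zone cluster fails the check, so the Operator never sets the annotation and the EndpointSliceController never has cause to emit the event.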

      Now links to:

      [sig-arch] events should not repeat pathologically for ns/openshift-dns

            mmasters1@redhat.com Miciah Masters
            lusanche@redhat.com Luis Sanchez
            Melvin Joseph
            Devan Goodwin, Miciah Masters
            Votes: 0
            Watchers: 9
