  1. OpenShift Bugs
  2. OCPBUGS-6095

[3.11] reference bug 1814804 / case 03416197 - feedback requested on possible mitigation submitted

Details

    • Bug
    • Resolution: Not a Bug
    • Critical
    • None
    • 3.11.z
    • Node / Kubelet
    • Critical
    • Rejected
    • False

    Description

      Description of problem:

      One of my customers has been facing an issue on 2 different nodes, and the two occurrences look quite similar.
      The OpenShift nodes were marked as Ready although they had never been marked as NotReady:
      
      ###
      Jan 19 23:35:44 ocpxxxx.local atomic-openshift-node[4090]: I0119 23:35:44.142960    4090 kubelet_node_status.go:441] Recording NodeReady event message for node ocpxxxx.local
      ###
      Jan 19 07:48:01 ocpxxxx.local atomic-openshift-node[195489]: I0119 07:48:01.713226  195489 kubelet_node_status.go:441] Recording NodeReady event message for node ocpxxxx.local
      ###
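
A minimal sketch of how the node journal could be scanned for these transitions, assuming the lines keep the shape shown above. The function names `node_events` and `ready_without_not_ready` are hypothetical, and the sketch assumes that a NotReady transition would be logged with the same "Recording ... event message" wording:

```python
import re

# Matches journal lines shaped like the excerpts above, e.g.
# "Jan 19 23:35:44 host atomic-openshift-node[4090]: I0119 ... Recording NodeReady event message for node host"
NODE_EVENT = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+"
    r"atomic-openshift-node\[\d+\]:\s+.*?"
    r"Recording (?P<event>Node\w+) event message for node (?P<node>\S+)"
)


def node_events(lines):
    """Return (timestamp, node, event) tuples for every matching line."""
    events = []
    for line in lines:
        match = NODE_EVENT.search(line.strip())
        if match:
            events.append(
                (match.group("stamp"), match.group("node"), match.group("event"))
            )
    return events


def ready_without_not_ready(events):
    """True if a NodeReady event was seen but no NodeNotReady event was (assumed event name)."""
    names = {event for _, _, event in events}
    return "NodeReady" in names and "NodeNotReady" not in names
```

Feeding it the journal of an affected node would show whether a NodeReady event appears without any preceding NodeNotReady event in the captured window.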
      
      The OpenShift nodes were never marked as NotReady, so the pods were not rescheduled, but the pod addresses in the Endpoints were listed under notReadyAddresses:
      
      ###
      [RED3QY9@popshba01c ~]$ oc get ep -n xxx genesysxxxx -o yaml
      apiVersion: v1
      kind: Endpoints
      metadata:
        creationTimestamp: 2022-03-11T12:57:34Z
        labels:
          app.kubernetes.io/instance: genesys-xxxx
          app.kubernetes.io/managed-by: Helm
          app.kubernetes.io/name: genesys-xxxx
          app.kubernetes.io/version: 1.0.0
          helm.sh/chart: genesys-xxxx
        name: genesys-services-gestione-esiti-ricontatto
        namespace: xxxx
        resourceVersion: "675954227"
        selfLink: /api/v1/namespaces/xxx/endpoints/genesys-sxxxx
        uid: d1e2f763-a13a-11ec-b2c6-5448107ce401
      subsets:
      - notReadyAddresses:
        - ip: 10.xxxx
          nodeName: ocpxxxx.local
          targetRef:
            kind: Pod
            name: genesys-xxxx
            namespace: xxxx
            resourceVersion: "675954219"
            uid: c4e5b7c8-8223-11ed-bcbe-5448107ce41b
        ports:
        - name: 8443-tcp
          port: 8443
          protocol: TCP
      ###
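
The same check can be sketched for many Endpoints objects at once. This assumes the object is fetched as JSON (for example with `-o json` instead of `-o yaml`) and has the structure shown above. The function name `not_ready_addresses` is hypothetical:

```python
import json


def not_ready_addresses(endpoints):
    """Return (ip, nodeName, pod name) for every notReadyAddresses entry."""
    found = []
    # "subsets" can be absent or null when the Endpoints object has no addresses.
    for subset in endpoints.get("subsets") or []:
        for address in subset.get("notReadyAddresses") or []:
            target = address.get("targetRef") or {}
            found.append(
                (address.get("ip"), address.get("nodeName"), target.get("name"))
            )
    return found


def not_ready_from_json(text):
    """Parse the JSON text of a single Endpoints object and list its not-ready addresses."""
    return not_ready_addresses(json.loads(text))
```

Grouping the result by `nodeName` would show whether all not-ready addresses belong to pods on the affected node.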
      
      This looks similar to bug https://bugzilla.redhat.com/show_bug.cgi?id=1814804.
      The customer would like to know whether the mitigation submitted in that bug is a valid one that avoids restarting all pods running on the affected OpenShift nodes, as described in [1].
      
      
      OpenShift version: 3.11.153
      RHEL version: 7.7
      
      They have ELS subscription.
      
      Thanks
      Angelo
      
      [1] https://access.redhat.com/solutions/5002781

          People

            nagrawal@redhat.com Neelesh Agrawal
            rhn-support-agabriel Angelo Gabrieli
            Sunil Choudhary Sunil Choudhary
            Votes: 1
            Watchers: 5
