  OpenShift Bugs / OCPBUGS-32203

EgressIP Healthcheck silently breaks 18 days after ovn-cert rotation


    • Important
    • No
    • SDN Sprint 253
    • 1
    • False
    • Release Note Not Required
    • In Progress
    • Customer Escalated

      After the cert expires, ovnkube-master starts to log the error below. At our gRPC log level only the timeout is shown, but the underlying failure is actually an x509 error. I also confirmed that a TCP connection was successfully established and torn down in under 100 ms. See https://github.com/grpc/grpc-go/issues/2561
      
      egressip_healthcheck.go:162] Could not connect to $hostname ($ip:9107):context deadline exceeded
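
To make the distinction above concrete (TCP connects fine, yet the probe still fails), here is a minimal Python sketch that reports which stage of a connection failed. It is not the ovn-kubernetes health check and does not use gRPC; the `probe` function, its parameters, and the stage names are hypothetical and chosen only for illustration.

```python
import socket
import ssl


def probe(host, port, cafile=None, timeout=2.0):
    """Return (stage, detail); stage is 'tcp', 'tls', or 'ok'."""
    # Stage 1: plain TCP connect. A failure here is a network-level problem.
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return ("tcp", repr(exc))

    # Stage 2: TLS handshake with certificate verification.
    ctx = ssl.create_default_context(cafile=cafile)
    ctx.check_hostname = False  # assumption: verify the chain and dates only
    try:
        with ctx.wrap_socket(sock) as tls:
            return ("ok", tls.version())
    except ssl.SSLCertVerificationError as exc:
        # Certificate problems (for example an expired cert) land here.
        return ("tls", exc.verify_message)
    except OSError as exc:
        # Other handshake failures, including timeouts and resets.
        return ("tls", repr(exc))
    finally:
        sock.close()
```

Logging the stage together with the detail would let a certificate failure show up as such instead of only as a deadline error.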

      Version-Release number of selected component (if applicable):

          

      How reproducible:

      Rotate the openshift-ovn-kubernetes/ovn-cert and wait for the old cert to expire.

      Steps to Reproduce:

          1. After ovn-cert rotation, wait 10% of the cert validity period (18 days) with no pod restarts.
          2. 18 days after cert rotation (10% of validity), egressIPs are removed from all nodes.
          3. All nodes start to fail egress health probes with "context deadline exceeded".
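
The timing in the steps above can be worked through with a short Python sketch. The 180-day validity is an assumption inferred from "18 days = 10% of validity", and the rotation date is hypothetical; neither value is taken from the cluster.

```python
from datetime import datetime, timedelta, timezone


def old_cert_expiry(rotated_at, validity, remaining_percent=10):
    """Time at which the pre-rotation certificate stops being valid."""
    return rotated_at + validity * remaining_percent / 100


validity = timedelta(days=180)  # assumption: 18 days / 10% = 180 days
rotated_at = datetime(2024, 1, 1, tzinfo=timezone.utc)  # hypothetical date
print(old_cert_expiry(rotated_at, validity))  # 2024-01-19 00:00:00+00:00
```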
          

      Actual results:

      Silent failure of egressIP healthchecks 

      Expected results:

          No noticeable impact: the new cert is loaded automatically or the pod is restarted.
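
One way to picture "automatic loading of new cert" is a poller that notices when the cert file changes and triggers a reload. The Python sketch below is only an illustration under that assumption: `CertWatcher` and its `on_change` callback are hypothetical names, and this is not how ovn-kubernetes handles certificates.

```python
import hashlib
from pathlib import Path


class CertWatcher:
    """Calls on_change(path) whenever the cert file's content changes."""

    def __init__(self, path, on_change):
        self.path = Path(path)
        self.on_change = on_change
        self._digest = self._read_digest()

    def _read_digest(self):
        # Hash the content so the check does not depend on file timestamps.
        return hashlib.sha256(self.path.read_bytes()).hexdigest()

    def poll(self):
        """Check once; return True if a change was detected and handled."""
        digest = self._read_digest()
        if digest == self._digest:
            return False
        self._digest = digest
        self.on_change(self.path)
        return True
```

A caller would invoke `poll()` on a timer and have `on_change` rebuild its TLS configuration or request a restart.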

      Additional info:

          Whoever rotates the cert should also restart the daemonsets, and we should log x509 errors from the gRPC library.

              pdiak@redhat.com Patryk Diak
              rhn-support-tidawson Tim Dawson
              Jean Chen Jean Chen
              Votes: 3
              Watchers: 15
