OpenShift Bugs / OCPBUGS-1486

Avoid re-metric'ing the pods that are already set up when ovnkube-master disrupts/reinitializes/restarts/goes through leader election

    • SDN Sprint 227
    • Rejected
    • N/A
    • Bug Fix
    • Done

      Description of problem:

      When the ovnkube-master leader container is disrupted, reinitializes, restarts, or goes through leader election, the pod annotation latency spikes from 1.5 seconds to ~27.5 seconds, because the new leader reprocesses all existing pods after it comes back. As Dan Williams pointed out, the creation latency for pods that are already configured and working fine gets re-recorded, measured from the time they were scheduled until the time the restart happened, which shouldn't be the case: https://coreos.slack.com/archives/C01G7T6SYSD/p1663607960049459.

      Version-Release number of selected component (if applicable):
      
       

      How reproducible:

      Always

      Steps to Reproduce:

      1. Install an OpenShift or HyperShift cluster using the latest nightly; 4.11.0-0.nightly-2022-09-15-164423 was used for this run.
      2. Use Kraken container scenarios to disrupt the ovnkube-master container: https://github.com/redhat-chaos/krkn/blob/main/docs/container_scenarios.md
      3. Observe the p99 of the ovnkube_master_pod_creation_latency_seconds_bucket metric, i.e. histogram_quantile(0.99, sum(rate(...))); see the query sketch below.
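
      As a rough sketch of that query run programmatically (the Prometheus address, the omitted authentication, and the 5m rate window are assumptions; the exact query used for this run is truncated in the report above):

      package main

      import (
      	"context"
      	"fmt"
      	"log"
      	"time"

      	"github.com/prometheus/client_golang/api"
      	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
      )

      func main() {
      	// Placeholder endpoint; a real OpenShift cluster also needs a bearer token.
      	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
      	if err != nil {
      		log.Fatal(err)
      	}
      	promAPI := promv1.NewAPI(client)

      	// p99 pod creation latency as seen by ovnkube-master.
      	query := `histogram_quantile(0.99, sum(rate(ovnkube_master_pod_creation_latency_seconds_bucket[5m])) by (le))`

      	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
      	defer cancel()
      	result, warnings, err := promAPI.Query(ctx, query, time.Now())
      	if err != nil {
      		log.Fatal(err)
      	}
      	if len(warnings) > 0 {
      		log.Println("warnings:", warnings)
      	}
      	// Jumps from ~1.5s to ~27.5s right after the ovnkube-master leader container is killed.
      	fmt.Println(result)
      }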
      
      

      Actual results:

       

      Expected results:

       

      Additional info:

      Logs: http://dell-r510-01.perf.lab.eng.rdu2.redhat.com/chaos/hypershift/ovn-disruption/


            Errata Tool added a comment -

            Since the problem described in this issue should be resolved in a recent advisory, it has been closed.

            For information on the advisory (Important: OpenShift Container Platform 4.13.0 security update), and where to find the updated files, follow the link below.

            If the solution does not work for you, open a new bug report.
            https://access.redhat.com/errata/RHSA-2023:1326

            Naga Ravi Chaitanya Elluri added a comment -

            Hi bpickard@redhat.com, is the fix part of 4.12 nightlies? Asking since the target release is marked as 4.13.

            Ben Pickard added a comment -

            https://github.com/openshift/ovn-kubernetes/pull/1364 pulls this downstream.

            Ben Pickard added a comment -

            https://github.com/ovn-org/ovn-kubernetes/pull/3259 is the upstream PR; it will be pulled downstream after it merges.

            Dan Williams (Inactive) added a comment - edited

            More context... addLogicalPort() records pod creation latency from ovnkube-master's perspective with:

            func RecordPodCreated(pod *kapi.Pod) {
            	t := time.Now()
            
            	// Find the scheduled timestamp
            	for _, cond := range pod.Status.Conditions {
            		if cond.Type != kapi.PodScheduled {
            			continue
            		}
            		if cond.Status != kapi.ConditionTrue {
            			return
            		}
            		creationLatency := t.Sub(cond.LastTransitionTime.Time).Seconds()
            		metricPodCreationLatency.Observe(creationLatency)
            		return
            	}
            }
            

            When a new ovnkube-master starts up, or gains leadership, it re-runs addLogicalPort() for all existing pods to ensure they are set up correctly. Unfortunately that means t above is the current time, but the pod's Scheduled condition's last transition time is still way in the past, so we end up setting the creation latency for the pod to a really, really long time, depending on how old the pod is at the point that ovnkube-master leadership changes.
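
            To make the skew concrete (a throwaway illustration, not ovn-kubernetes code; the 10-minute pod age is made up):

            package main

            import (
            	"fmt"
            	"time"
            )

            func main() {
            	// A pod scheduled 10 minutes before the leader change: re-running
            	// RecordPodCreated now observes ~600s instead of the real setup time.
            	scheduled := time.Now().Add(-10 * time.Minute) // PodScheduled LastTransitionTime
            	t := time.Now()                                // taken inside RecordPodCreated after the restart
            	fmt.Printf("observed latency: %.0fs\n", t.Sub(scheduled).Seconds()) // ~600s
            }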

            I think we should just skip calling RecordPodCreated() for pods that (a) are annotated and (b) have a logical switch port already when addLogicalPort runs. In other words, only record metrics for pods that are newly set up by addLogicalPort().
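
            A minimal sketch of that guard (not the merged patch; maybeRecordPodCreated, the two booleans, and the import paths are placeholders for state addLogicalPort() already has at hand):

            package ovn

            import (
            	"github.com/ovn-org/ovn-kubernetes/go-controller/pkg/metrics"

            	kapi "k8s.io/api/core/v1"
            )

            // maybeRecordPodCreated observes the creation-latency metric only for pods
            // that this addLogicalPort() call actually set up for the first time.
            func maybeRecordPodCreated(pod *kapi.Pod, podAnnotated, lspAlreadyExisted bool) {
            	// Pods that already carry the OVN pod annotation and already have a logical
            	// switch port were configured by a previous master; re-observing them after
            	// a restart or leader election would record "time since the pod was
            	// scheduled", not setup cost.
            	if podAnnotated && lspAlreadyExisted {
            		return
            	}
            	metrics.RecordPodCreated(pod)
            }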


              bpickard@redhat.com Ben Pickard
              nelluri Naga Ravi Chaitanya Elluri
              Anurag Saxena Anurag Saxena