OpenShift Bugs / OCPBUGS-22703

Monitor tests are failing in Local Zone jobs (edge nodes)



      Description of problem:

      The following presubmit jobs for Local Zones have been permanently failing since August:
      - e2e-aws-ovn-localzones: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-installer-master-e2e-aws-ovn-localzones?buildId=1716457254460329984
      - e2e-aws-ovn-shared-vpc-localzones: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-installer-master-e2e-aws-ovn-shared-vpc-localzones
      
      Investigating, we can see common failures in the tests '[sig-network] can collect <poller_name> poller pod logs', which cause most of the jobs to not complete correctly.
      
      Exploring the code, I can see these tests were added recently, around August, which matches when the failures started.
      
      Pods must tolerate the taint "node-role.kubernetes.io/edge" in order to run on instances located in Local Zones ("edge nodes"). I am not sure if I am looking in the correct place, but the poller deployment seems to tolerate only the master taint: https://github.com/openshift/origin/blob/master/pkg/monitortests/network/disruptionpodnetwork/host-network-target-deployment.yaml#L42
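
      A minimal sketch of the kind of toleration that would likely need to be added to the monitor pod deployments (assuming the standard Kubernetes tolerations format; the exact files and existing entries may differ):

        spec:
          template:
            spec:
              tolerations:
              # existing toleration for control-plane nodes
              - key: node-role.kubernetes.io/master
                operator: Exists
                effect: NoSchedule
              # assumed addition: allow scheduling on Local Zone (edge) nodes
              - key: node-role.kubernetes.io/edge
                operator: Exists
                effect: NoSchedule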
      
      

      Version-Release number of selected component (if applicable):

      4.15.0

      How reproducible:

      always

      Steps to Reproduce:

      Trigger the job:
      1. Open a PR on the installer repository
      2. Run the job
      3. Check the failed tests '[sig-network] can collect <poller_name> poller pod logs'
      
      Example of a 4.15 feature PR (Wavelength Zones) blocked by these failures: https://github.com/openshift/installer/pull/7369#issuecomment-1783699175
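
      For step 2, the presubmit jobs can usually be re-run on the PR with Prow trigger comments, something like (job names as listed in the description above):

        /test e2e-aws-ovn-localzones
        /test e2e-aws-ovn-shared-vpc-localzones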

      Actual results:

      https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/7590/pull-ci-openshift-installer-master-e2e-aws-ovn-localzones/1715075142427611136
      {  1 pods lacked sampler output: [pod-network-to-pod-network-disruption-poller-d94fb55db-9qfpz]}
      
      E1018 22:06:34.773866       1 disruption_backend_sampler.go:496] not finished writing all samples (1 remaining), but we're told to close
      E1018 22:06:34.774669       1 disruption_backend_sampler.go:496] not finished writing all samples (1 remaining), but we're told to close
      

      Expected results:

      Should monitor pods be scheduled on edge nodes?
      How can we track job failures for new monitor tests?

      Additional info:

      Edge nodes have a NoSchedule taint applied by default; to run monitor pods on those nodes, the pods need to tolerate the taint "node-role.kubernetes.io/edge".
      
      See the enhancement for more information: https://github.com/openshift/enhancements/blob/master/enhancements/installer/aws-custom-edge-machineset-local-zones.md#user-workload-deployments
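
      On a live cluster, the taint can be confirmed with something like the following (the node-role.kubernetes.io/edge label on edge nodes is described in the enhancement above):

        $ oc get nodes -l node-role.kubernetes.io/edge \
            -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'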
      
      Looking at the must-gather of job 1716457254460329984, you can see the monitor pods are not scheduled due to the missing toleration:
      
      $ grep -rni pod-network-to-pod-network-disruption-poller-7c97cd5d7-t2mn2 \
        1716457254460329984-must-gather/09abb0d6fc08ee340563e6e11f5ceafb42fb371e50ab6acee6764031062525b7/namespaces/openshift-kube-scheduler/pods/ \
        | awk -F'] "' '{print$2}' | sort | uniq -c
          215 Unable to schedule pod; no fit; waiting" pod="e2e-pod-network-disruption-test-59s5d/pod-network-to-pod-network-disruption-poller-7c97cd5d7-t2mn2" 
      err="0/7 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/edge: }, 
      6 node(s) didn't match pod anti-affinity rules. preemption: 0/7 nodes are available: 
      1 Preemption is not helpful for scheduling, 6 No preemption victims found for incoming pod.."
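
      On a live reproduction, the unscheduled poller pods should show up as Pending in the test namespace (the namespace name is generated per run; the one below is taken from the scheduler log above):

        $ oc -n e2e-pod-network-disruption-test-59s5d get pods \
            --field-selector=status.phase=Pending -o wide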

            Assignee: Marco Braga (rhn-support-mrbraga)
            Reporter: Marco Braga (rhn-support-mrbraga)