[OCPBUGS-34475] Disruption monitor failing when running conformance against hypershift cluster

      Description of problem:

      When running a conformance suite against a hypershift cluster (for example, CNI conformance), the MonitorTests step fails because files expected from the disruption monitor are missing.
          

      Version-Release number of selected component (if applicable):

      4.15.13
          

      How reproducible:

      Consistent
          

      Steps to Reproduce:

          1. Create a hypershift cluster
          2. Attempt to run an ose-tests suite. For example, the CNI conformance suite documented here: https://access.redhat.com/documentation/en-us/red_hat_software_certification/2024/html/red_hat_software_certification_workflow_guide/con_cni-certification_openshift-sw-cert-workflow-working-with-cloud-native-network-function#running-the-cni-tests_openshift-sw-cert-workflow-working-with-container-network-interface
          3. Note errors in logs
          

      Actual results:

      found errors fetching in-cluster data: [failed to list files in disruption event folder on node ip-10-0-130-177.us-west-2.compute.internal: the server could not find the requested resource failed to list files in disruption event folder on node ip-10-0-152-10.us-west-2.compute.internal: the server could not find the requested resource]
      Failed to write events from in-cluster monitors, err: open /tmp/artifacts/junit/AdditionalEvents__in_cluster_disruption.json: no such file or directory
          

      Expected results:

      No errors 
          

      Additional info:

      The first error can be avoided by creating the directory it's looking for on all nodes:
      for node in $(oc get nodes -oname); do oc debug -n default $node -- chroot /host mkdir -p /var/log/disruption-data/monitor-events; done
      However, I'm not sure whether the directory is missing because the disruption monitor is not working properly on hypershift, or whether this check should be skipped on hypershift entirely.
      
      The second error occurs because the ARTIFACT_DIR env var is not set locally. It can be avoided by creating a directory, setting ARTIFACT_DIR to that directory, and then creating an empty "junit" directory inside it.
      It looks like ARTIFACT_DIR defaults to a temporary directory if it's not set in the env, but the "junit" directory doesn't exist inside it, so creating a file in that nonexistent directory fails.
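
      The ARTIFACT_DIR workaround described above can be sketched as follows. This is a minimal sketch, assuming a bash-like shell; the use of mktemp for the directory location is an assumption, and any writable directory should work. The command that runs the suite is not given in this report, so it is left as a comment.

```shell
# Create a local artifact directory (location chosen by mktemp; an assumption,
# any writable directory should do).
ARTIFACT_DIR="$(mktemp -d)"
export ARTIFACT_DIR

# Pre-create the "junit" subdirectory that the in-cluster monitor output
# (AdditionalEvents__in_cluster_disruption.json) is written into.
mkdir -p "${ARTIFACT_DIR}/junit"

echo "ARTIFACT_DIR=${ARTIFACT_DIR}"

# Run the conformance suite from this same shell so it inherits ARTIFACT_DIR.
```

      Exporting the variable matters here: the suite only sees ARTIFACT_DIR if it is started from the shell (or a child of the shell) where the export happened.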
          


            Devan Goodwin added a comment - Discussion: https://redhat-internal.slack.com/archives/C02LM9FABFW/p1715974242700929

            Christoph Blecker added a comment -

            I see the file retrieval failure in hypershift CI e2e too:
            https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.16-periodics-e2e-aws-ovn-conformance/1794107938944585728/artifacts/e2e-aws-ovn-conformance/conformance-tests/build-log.txt

            found errors fetching in-cluster data: [failed to list files in disruption event folder on node ip-10-0-132-225.ec2.internal: the server could not find the requested resource failed to list files in disruption event folder on node ip-10-0-138-252.ec2.internal: the server could not find the requested resource failed to list files in disruption event folder on node ip-10-0-142-160.ec2.internal: the server could not find the requested resource]

            I don't see the `AdditionalEvents__in_cluster_disruption.json` error. I think that is because the `junit` directory is created by some other process.

            Devan Goodwin added a comment -

            The missing resource it's complaining about is probably coming from:

            // StreamNodeLogFile streams a file from a node's log directory through the
            // API server's node proxy (nodes/<nodeName>/proxy/logs/<filename>).
            func StreamNodeLogFile(ctx context.Context, client kubernetes.Interface, nodeName, filename string) (io.ReadCloser, error) {
            	// Build the request path for the node's proxy/logs subresource.
            	path := client.CoreV1().RESTClient().Get().
            		Namespace("").Name(nodeName).
            		Resource("nodes").SubResource("proxy", "logs").Suffix(filename).URL().Path

            	// Issue a GET against that path and accept plain text.
            	req := client.CoreV1().RESTClient().Get().RequestURI(path).
            		SetHeader("Accept", "text/plain, */*")

            	return req.Stream(ctx)
            }

            I'd be interested to see the pod logs for the in-cluster disruption monitor pods. The namespace should be something starting with e2e-disruption; if you could poke in there while the tests are running, we might learn a lot about what's going on here. From what I can see, this is working on hypershift e2e jobs.
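
            To check by hand what that request sees, the same node proxy path can be queried with oc. This is a rough sketch under stated assumptions: node_log_path and check_disruption_dirs are hypothetical helper names, and the relative path disruption-data/monitor-events/ is inferred from the /var/log/disruption-data/monitor-events directory in the workaround above (the kubelet logs endpoint serves files under /var/log), not confirmed from the test source.

```shell
# Build the API path that the node log proxy request resolves to.
# $1 = node name, $2 = path relative to /var/log on the node.
node_log_path() {
  printf '/api/v1/nodes/%s/proxy/logs/%s\n' "$1" "$2"
}

# Hypothetical helper: ask the API server for the monitor-events folder on
# every node and report the nodes where the request fails.
check_disruption_dirs() {
  local node
  for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do
    echo "== ${node}"
    oc get --raw "$(node_log_path "${node}" 'disruption-data/monitor-events/')" \
      || echo "request failed on ${node}"
  done
}

# Usage against a live cluster:
#   check_disruption_dirs
```

            If the request fails with "the server could not find the requested resource" on a node, that would match the error in the build log and suggest the folder was never created there.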

              stbenjam Stephen Benjamin
              cblecker.openshift Christoph Blecker