[OCPBUGS-34475] Disruption monitor failing when running conformance against hypershift cluster

Type: Bug
Resolution: Unresolved
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.15
Component/s: Test Framework
Labels:
- ServiceDeliveryImpact
- trt-standup

Regression: No
Blocked: False
Blocked Reason: None
Target Version: 4.17.z
Target Backport Versions: 4.14, 4.15, 4.16

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

When running a conformance suite against a hypershift cluster (for example, CNI conformance) the MonitorTests step fails because of missing files from the disruption monitor.

Version-Release number of selected component (if applicable):

4.15.13

How reproducible:

Consistent

Steps to Reproduce:

    1. Create a hypershift cluster
    2. Attempt to run an ose-tests suite. For example, the CNI conformance suite documented here: https://access.redhat.com/documentation/en-us/red_hat_software_certification/2024/html/red_hat_software_certification_workflow_guide/con_cni-certification_openshift-sw-cert-workflow-working-with-cloud-native-network-function#running-the-cni-tests_openshift-sw-cert-workflow-working-with-container-network-interface
    3. Note errors in logs

Actual results:

found errors fetching in-cluster data: [failed to list files in disruption event folder on node ip-10-0-130-177.us-west-2.compute.internal: the server could not find the requested resource failed to list files in disruption event folder on node ip-10-0-152-10.us-west-2.compute.internal: the server could not find the requested resource]
Failed to write events from in-cluster monitors, err: open /tmp/artifacts/junit/AdditionalEvents__in_cluster_disruption.json: no such file or directory

Expected results:

No errors

Additional info:

The first error can be avoided by creating the directory it's looking for on all nodes:
for node in $(oc get nodes -oname); do oc debug -n default $node -- chroot /host mkdir -p /var/log/disruption-data/monitor-events; done
However, I'm not sure whether the directory is missing because the disruption monitor is not working properly on hypershift, or whether this check should be skipped on hypershift entirely.

The second error is related to the ARTIFACT_DIR env var not being set locally, and can be avoided by creating a directory, setting that directory as the ARTIFACT_DIR, and then creating an empty "junit" dir inside of it.
It looks like ARTIFACT_DIR defaults to a temporary directory if it's not set in the env, but the "junit" directory doesn't exist inside of it, so file creation in that nonexistent directory fails.
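
The workaround for the second error can be sketched roughly as follows. This is a minimal sketch, assuming the test binary reads ARTIFACT_DIR from the environment as described above. The use of mktemp for the directory is my choice for illustration, not something the test suite requires.

```shell
# Use the caller's ARTIFACT_DIR if set; otherwise create a fresh temp directory.
# (mktemp -d is an illustrative choice; any writable directory should do.)
ARTIFACT_DIR="${ARTIFACT_DIR:-$(mktemp -d)}"
export ARTIFACT_DIR

# Pre-create the "junit" subdirectory so that writing
# AdditionalEvents__in_cluster_disruption.json does not fail.
mkdir -p "$ARTIFACT_DIR/junit"

echo "artifacts will be written under $ARTIFACT_DIR"
```

Run this in the same shell session before starting the test suite so the exported variable is inherited.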

blocks

OCPBUGS-36241 Disruption monitor failing when running conformance against hypershift cluster

Verified

is cloned by

OCPBUGS-37630 Disruption monitor failing when running conformance against hypershift cluster

MODIFIED

OCPBUGS-36241 Disruption monitor failing when running conformance against hypershift cluster

Verified

links to

openshift/origin#28908: OCPBUGS-34475: remove unused in-cluster monitoring code

openshift/origin#28956: [release-4.15] OCPBUGS-34475: remove unused in-cluster monitoring code

Devan Goodwin added a comment - 2024/06/03 1:46 PM

Discussion: https://redhat-internal.slack.com/archives/C02LM9FABFW/p1715974242700929

Christoph Blecker added a comment - 2024/05/30 8:03 PM

I see the file retrieval failure in hypershift CI e2e too:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-hypershift-release-4.16-periodics-e2e-aws-ovn-conformance/1794107938944585728/artifacts/e2e-aws-ovn-conformance/conformance-tests/build-log.txt

found errors fetching in-cluster data: [failed to list files in disruption event folder on node ip-10-0-132-225.ec2.internal: the server could not find the requested resource failed to list files in disruption event folder on node ip-10-0-138-252.ec2.internal: the server could not find the requested resource failed to list files in disruption event folder on node ip-10-0-142-160.ec2.internal: the server could not find the requested resource]

I don't see the `AdditionalEvents__in_cluster_disruption.json` error, I think because the `junit` directory is created by some other process in CI.

Devan Goodwin added a comment - 2024/05/30 5:14 PM

The missing resource it's complaining of looks like it's probably coming from:

func StreamNodeLogFile(ctx context.Context, client kubernetes.Interface, nodeName, filename string) (io.ReadCloser, error) {
	path := client.CoreV1().RESTClient().Get().
		Namespace("").Name(nodeName).
		Resource("nodes").SubResource("proxy", "logs").Suffix(filename).URL().Path

	req := client.CoreV1().RESTClient().Get().RequestURI(path).
		SetHeader("Accept", "text/plain, */*")

	return req.Stream(ctx)
}

I'd be interested to see the pod logs for the in-cluster disruption monitor pods. The namespace should be something starting with e2e-disruption; if you could poke in there while the tests are running, we might learn a lot about what's going on here. From what I can see, this is working on hypershift e2e jobs.
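
To reproduce the failing request outside the test binary, the path built by the function above can be requested directly. This is a rough sketch only: `node_log_path` is a hypothetical helper, and the `disruption-data/monitor-events/` suffix is an assumption inferred from the directory named in the description, not taken from the origin code.

```shell
# Hypothetical helper: builds the API path that StreamNodeLogFile requests,
# i.e. /api/v1/nodes/<node>/proxy/logs/<filename>.
node_log_path() {
  printf '/api/v1/nodes/%s/proxy/logs/%s\n' "$1" "$2"
}

# Example usage against a live cluster (not run here; the suffix is an assumption):
#   for node in $(oc get nodes -o jsonpath='{.items[*].metadata.name}'); do
#     oc get --raw "$(node_log_path "$node" disruption-data/monitor-events/)"
#   done
```

If the request returns "the server could not find the requested resource" for a node, that would match the error in the report.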
