
Type: Bug
Resolution: Won't Do
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.19, 4.20
Component/s: Networking / router
Labels:

Activity Type: Quality / Stability / Reliability
Blocked: False
Blocked Reason: None
Story Points: None
Severity: Low
Regression: None
Target Backport Versions: None
Target Version: 4.21
Release Blocker: Rejected
Sprint: None
SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:
Release Note Status: In Progress
Release Note Type: Release Note Not Required
Release Note Text: N/A
Escape Reason: None
Escape Impact: None
Corrective Measures: None
SDLC stage when should've been found: None

Description of problem

CI is flaky because of test failures such as the following:

=== RUN   TestAll/parallel/TestInternalLoadBalancerGlobalAccessGCP
  operator_test.go:1303: Expected conditions: map[Admitted:True Available:True DNSManaged:True DNSReady:True LoadBalancerManaged:True LoadBalancerReady:True]
         Current conditions: map[Admitted:True Available:False DNSManaged:True DNSReady:False Degraded:True DeploymentAvailable:True DeploymentReplicasAllAvailable:True DeploymentReplicasMinAvailable:True DeploymentRollingOut:False EvaluationConditionsDetected:False LoadBalancerManaged:True LoadBalancerProgressing:False LoadBalancerReady:False Progressing:False Upgradeable:True]
    operator_test.go:1303: Ingress Controller openshift-ingress-operator/test-gcp status: {
          "availableReplicas": 1,
          "selector": "ingresscontroller.operator.openshift.io/deployment-ingresscontroller=test-gcp",
          "domain": "test-gcp.ci-op-wh3wqrpm-76b3b.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
          "endpointPublishingStrategy": {
            "type": "LoadBalancerService",
            "loadBalancer": {
              "scope": "Internal",
              "providerParameters": {
                "type": "GCP",
                "gcp": {
                  "clientAccess": "Global"
                }
              },
              "dnsManagementPolicy": "Managed"
            }
          },
          "conditions": [
            {
              "type": "Admitted",
              "status": "True",
              "lastTransitionTime": "2025-09-18T16:43:22Z",
              "reason": "Valid"
            },
            {
              "type": "DeploymentAvailable",
              "status": "True",
              "lastTransitionTime": "2025-09-18T16:43:56Z",
              "reason": "DeploymentAvailable",
              "message": "The deployment has Available status condition set to True"
            },
            {
              "type": "DeploymentReplicasMinAvailable",
              "status": "True",
              "lastTransitionTime": "2025-09-18T16:43:56Z",
              "reason": "DeploymentMinimumReplicasMet",
              "message": "Minimum replicas requirement is met"
            },
            {
              "type": "DeploymentReplicasAllAvailable",
              "status": "True",
              "lastTransitionTime": "2025-09-18T16:43:56Z",
              "reason": "DeploymentReplicasAvailable",
              "message": "All replicas are available"
            },
            {
              "type": "DeploymentRollingOut",
              "status": "False",
              "lastTransitionTime": "2025-09-18T16:43:56Z",
              "reason": "DeploymentNotRollingOut",
              "message": "Deployment is not actively rolling out"
            },
            {
              "type": "LoadBalancerManaged",
              "status": "True",
              "lastTransitionTime": "2025-09-18T16:43:23Z",
              "reason": "WantedByEndpointPublishingStrategy",
              "message": "The endpoint publishing strategy supports a managed load balancer"
            },
            {
              "type": "LoadBalancerReady",
              "status": "False",
              "lastTransitionTime": "2025-09-18T16:43:23Z",
              "reason": "SyncLoadBalancerFailed",
              "message": "The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Resource 'projects/XXXXXXXXXXXXXXXXXXXXXX/zones/us-central1-b/instances/ci-op-wh3wqrpm-76b3b-xckjt-worker-b-dq7tp' is already a member of 'projects/XXXXXXXXXXXXXXXXXXXXXX/zones/us-central1-b/instanceGroups/k8s-ig--b9791267d22f5359'., memberAlreadyExists\nThe cloud-controller-manager logs may contain more details."
            },
            {
              "type": "LoadBalancerProgressing",
              "status": "False",
              "lastTransitionTime": "2025-09-18T16:43:23Z",
              "reason": "LoadBalancerNotProgressing",
              "message": "LoadBalancer is not progressing"
            },
            {
              "type": "DNSManaged",
              "status": "True",
              "lastTransitionTime": "2025-09-18T16:43:23Z",
              "reason": "Normal",
              "message": "DNS management is supported and zones are specified in the cluster DNS config."
            },
            {
              "type": "DNSReady",
              "status": "False",
              "lastTransitionTime": "2025-09-18T16:43:23Z",
              "reason": "RecordNotFound",
              "message": "The wildcard record resource was not found."
            },
            {
              "type": "Available",
              "status": "False",
              "lastTransitionTime": "2025-09-18T16:43:23Z",
              "reason": "IngressControllerUnavailable",
              "message": "One or more status conditions indicate unavailable: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Resource 'projects/XXXXXXXXXXXXXXXXXXXXXX/zones/us-central1-b/instances/ci-op-wh3wqrpm-76b3b-xckjt-worker-b-dq7tp' is already a member of 'projects/XXXXXXXXXXXXXXXXXXXXXX/zones/us-central1-b/instanceGroups/k8s-ig--b9791267d22f5359'., memberAlreadyExists\nThe cloud-controller-manager logs may contain more details.)"
            },
            {
              "type": "Progressing",
              "status": "False",
              "lastTransitionTime": "2025-09-18T16:43:56Z"
            },
            {
              "type": "Degraded",
              "status": "True",
              "lastTransitionTime": "2025-09-18T16:44:53Z",
              "reason": "DegradedConditions",
              "message": "One or more other status conditions indicate a degraded state: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Resource 'projects/XXXXXXXXXXXXXXXXXXXXXX/zones/us-central1-b/instances/ci-op-wh3wqrpm-76b3b-xckjt-worker-b-dq7tp' is already a member of 'projects/XXXXXXXXXXXXXXXXXXXXXX/zones/us-central1-b/instanceGroups/k8s-ig--b9791267d22f5359'., memberAlreadyExists\nThe cloud-controller-manager logs may contain more details.)"
            },
            {
              "type": "Upgradeable",
              "status": "True",
              "lastTransitionTime": "2025-09-18T16:43:23Z",
              "reason": "Upgradeable",
              "message": "IngressController is upgradeable."
            },
            {
              "type": "EvaluationConditionsDetected",
              "status": "False",
              "lastTransitionTime": "2025-09-18T16:43:23Z",
              "reason": "NoEvaluationCondition",
              "message": "No evaluation condition is detected."
            }
          ],
          "tlsProfile": {
            "ciphers": [
              "ECDHE-ECDSA-AES128-GCM-SHA256",
              "ECDHE-RSA-AES128-GCM-SHA256",
              "ECDHE-ECDSA-AES256-GCM-SHA384",
              "ECDHE-RSA-AES256-GCM-SHA384",
              "ECDHE-ECDSA-CHACHA20-POLY1305",
              "ECDHE-RSA-CHACHA20-POLY1305",
              "DHE-RSA-AES128-GCM-SHA256",
              "DHE-RSA-AES256-GCM-SHA384",
              "TLS_AES_128_GCM_SHA256",
              "TLS_AES_256_GCM_SHA384",
              "TLS_CHACHA20_POLY1305_SHA256"
            ],
            "minTLSVersion": "VersionTLS12"
          },
          "observedGeneration": 1
        }
    operator_test.go:1304: failed to observe expected conditions: timed out waiting for the condition

This particular failure comes from https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-ingress-operator/1189/pull-ci-openshift-cluster-ingress-operator-master-e2e-gcp-operator/1968698892950179840.

This failure could probably be fixed by adding proper retries and by analyzing why DNS becomes degraded, that is, why DNSReady stays False with reason RecordNotFound.
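A minimal sketch of the retry idea, assuming the memberAlreadyExists error quoted in the status above is transient (this report does not establish that). The names isTransientLBError, retryTransient, and errMemberExists are hypothetical and do not come from the operator or the cloud provider code; matching on the error text is only for illustration.

```go
package main

import (
	"errors"
	"strings"
	"time"
)

// errMemberExists is a shortened stand-in for the googleapi error quoted in
// the LoadBalancerReady condition. It exists only for demonstration.
var errMemberExists = errors.New("Error 400: instance is already a member of the instance group, memberAlreadyExists")

// isTransientLBError reports whether err looks like the memberAlreadyExists
// failure. Treating it as transient is an assumption, not a confirmed fact.
func isTransientLBError(err error) bool {
	return err != nil && strings.Contains(err.Error(), "memberAlreadyExists")
}

// retryTransient calls fn up to attempts times, doubling the delay between
// tries. It returns early on success or on an error that is not transient.
func retryTransient(attempts int, delay time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil || !isTransientLBError(err) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return err
}
```

A caller might wrap the load balancer sync or the test's condition check as retryTransient(5, time.Second, syncFn), where syncFn is whatever operation currently fails on the first attempt. Whether the retry belongs in the test, the operator, or the cloud provider is left open here.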

Version-Release number of selected component (if applicable)

I have seen this in 4.20 CI jobs.

How reproducible

Presently, search.ci shows the following stats for the past 14 days:

pull-ci-openshift-cluster-ingress-operator-master-e2e-gcp-operator (all) - 12 runs, 25% failed, 33% of failures match = 8% impact
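The impact figure can be checked by hand. The sketch below assumes that search.ci derives impact by multiplying the failure rate by the match rate and rounding to a whole percent; that reading fits the numbers above but is not confirmed by this report. The function name impactPercent is hypothetical.

```go
package main

import "math"

// impactPercent returns the percentage of all runs hit by a specific failure,
// given the percentage of runs that failed and the percentage of those
// failures that match the search. The result is rounded to a whole percent.
func impactPercent(failedPct, matchPct float64) float64 {
	return math.Round(failedPct * matchPct / 100)
}
```

With the values from the line above, impactPercent(25, 33) gives 8. In absolute terms, 25% of 12 runs is 3 failed runs, and 33% of those is about 1 matching run.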

Steps to Reproduce

1. Post a PR and have bad luck.
2. Check search.ci.

Actual results

CI fails.

Expected results

CI passes, or fails on some other test failure.

Additional info

In the search.ci results, all of the failures are in e2e-gcp-operator jobs.

The test output isn't very helpful in diagnosing the failures. The output shows DNSReady:False with the status condition message "The wildcard record resource was not found." Unfortunately, the must-gather archives did not capture any relevant ingress-operator logs or DNSRecord CRs. It would be useful if the test output included the DNSRecord CR manifest.
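One way the test could include the DNSRecord manifest in its failure output is sketched below. The dnsRecord struct is a hypothetical, trimmed-down stand-in for the real DNSRecord CR type, and manifestForLog is a hypothetical helper; neither is taken from the operator's code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// dnsRecord is a hypothetical stand-in for the DNSRecord CR. The real type
// has more fields; only a few are shown to keep the sketch self-contained.
type dnsRecord struct {
	Name    string   `json:"name"`
	DNSName string   `json:"dnsName"`
	Targets []string `json:"targets"`
}

// manifestForLog renders obj as indented JSON so it can be appended to a
// test failure message. A marshal error is reported inline, not dropped.
func manifestForLog(kind string, obj any) string {
	b, err := json.MarshalIndent(obj, "", "  ")
	if err != nil {
		return fmt.Sprintf("%s: failed to marshal: %v", kind, err)
	}
	return fmt.Sprintf("%s manifest:\n%s", kind, b)
}
```

In the test, the returned string could be passed to t.Logf just before the "failed to observe expected conditions" error is reported, next to the IngressController status that is already printed.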
