Type: Bug
Resolution: Cannot Reproduce
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.13.0
Component/s: AWS Load Balancer Operator, Networking / openshift-sdn
Labels:
- trt

Severity:
Moderate
Regression:
None
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

AWS installs are struggling lately and we've found at least one of the issues is due to the network operator being degraded at the end of install. The problem manifests with either a "connection refused" or a "read: connection reset by peer" while updating the ClusterOperator.

Examples:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-e2e-aws-sdn-serial/1594569311073079296

{Operator degraded (ApplyOperatorConfig): Error while updating operator configuration: could not apply (rbac.authorization.k8s.io/v1, Kind=RoleBinding) openshift-config-managed/openshift-network-public-role-binding: failed to apply / update (rbac.authorization.k8s.io/v1, Kind=RoleBinding) openshift-config-managed/openshift-network-public-role-binding: Patch "https://api-int.ci-op-8t6mvd1i-a157f.aws-2.ci.openshift.org:6443/apis/rbac.authorization.k8s.io/v1/namespaces/openshift-config-managed/rolebindings/openshift-network-public-role-binding?fieldManager=cluster-network-operator%2Foperconfig&force=true": dial tcp 10.0.170.103:6443: connect: connection refused  Operator degraded (ApplyOperatorConfig): Error while updating operator configuration: could not apply (rbac.authorization.k8s.io/v1, Kind=RoleBinding) openshift-config-managed/openshift-network-public-role-binding: failed to apply / update (rbac.authorization.k8s.io/v1, Kind=RoleBinding) openshift-config-managed/openshift-network-public-role-binding: Patch "https://api-int.ci-op-8t6mvd1i-a157f.aws-2.ci.openshift.org:6443/apis/rbac.authorization.k8s.io/v1/namespaces/openshift-config-managed/rolebindings/openshift-network-public-role-binding?fieldManager=cluster-network-operator%2Foperconfig&force=true": dial tcp 10.0.170.103:6443: connect: connection refused}

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-aws-sdn-serial/1594529754332008448

{Operator degraded (ApplyOperatorConfig): Error while updating operator configuration: could not apply (/v1, Kind=Namespace) /openshift-multus: failed to apply / update (/v1, Kind=Namespace) /openshift-multus: Patch "https://api-int.ci-op-brwgsfxm-2ac23.aws-2.ci.openshift.org:6443/api/v1/namespaces/openshift-multus?fieldManager=cluster-network-operator%2Foperconfig&force=true": read tcp 10.0.217.20:49064->10.0.196.167:6443: read: connection reset by peer  Operator degraded (ApplyOperatorConfig): Error while updating operator configuration: could not apply (/v1, Kind=Namespace) /openshift-multus: failed to apply / update (/v1, Kind=Namespace) /openshift-multus: Patch "https://api-int.ci-op-brwgsfxm-2ac23.aws-2.ci.openshift.org:6443/api/v1/namespaces/openshift-multus?fieldManager=cluster-network-operator%2Foperconfig&force=true": read tcp 10.0.217.20:49064->10.0.196.167:6443: read: connection reset by peer}

The problem appears to affect only AWS jobs on 4.13; there are no hits anywhere else.

https://search.ci.openshift.org/?search=Cluster+operator+network+Degraded+is+True+with+ApplyOperatorConfig&maxAge=168h&context=1&type=bug%2Bissue%2Bjunit&name=4.13.*aws&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

It seems to span both sdn and ovn. Apologies for the component choice; I can't find one for generic network operator problems.

Because this happens during install, we don't have the tools TRT would normally use to debug pod status or to see what was actually down at the time.

Focusing on why this is only appearing in 4.13 might be a good start.

Could the network operator be more resilient here by retrying, or is this a permanent failure?
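To make the retry question concrete, here is a minimal sketch of what "retry on transient connection errors" could look like. It is not the network operator's code (the operator is a separate codebase and nothing here is taken from it); `retry_transient`, its parameters, and the backoff values are all hypothetical, and the mapping of the two reported errors to Python's `ConnectionRefusedError` and `ConnectionResetError` is an assumption made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumption: these two built-in exception classes stand in for the
# "connection refused" and "connection reset by peer" errors in the report.
TRANSIENT_ERRORS = (ConnectionRefusedError, ConnectionResetError)


def retry_transient(
    operation: Callable[[], T],
    attempts: int = 5,
    initial_delay: float = 0.5,
    factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying with exponential backoff on transient errors.

    Any other exception propagates immediately. If the last attempt still
    hits a transient error, that error is re-raised so the caller can treat
    the failure as permanent.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TRANSIENT_ERRORS:
            if attempt == attempts:
                raise  # out of retries: surface the error to the caller
            sleep(delay)  # wait before the next try
            delay *= factor  # grow the wait each time
    raise AssertionError("unreachable")
```

With a wrapper like this, a brief API server outage during install would only degrade the operator if the endpoint stayed unreachable for the whole retry window, which would also help separate a transient blip from a permanent failure.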

It is very possible this ends up going to the API server team or something similar, but it's curious that only the network operator is complaining.

test=operator conditions network
incident=variants=aws

is related to

TRT-700 Investigate AWS install problems

Closed

Assignee: Riccardo Ravaioli

Reporter: Devan Goodwin

QA Contact: Zhanqi Zhao

Votes: 0

Watchers: 4

Created: 2022/11/22 1:18 PM

Updated: 2023/04/06 8:24 PM

Resolved: 2023/04/06 8:24 PM
