-
Bug
-
Resolution: Cannot Reproduce
-
Normal
-
None
-
4.13.0
-
Moderate
-
None
-
Rejected
-
False
-
AWS installs are struggling lately, and we've found that at least one of the issues is the network operator being Degraded at the end of install. The problem manifests as either a "connection refused" or a "read: connection reset by peer" error while updating the ClusterOperator.
Examples:
{Operator degraded (ApplyOperatorConfig): Error while updating operator configuration: could not apply (rbac.authorization.k8s.io/v1, Kind=RoleBinding) openshift-config-managed/openshift-network-public-role-binding: failed to apply / update (rbac.authorization.k8s.io/v1, Kind=RoleBinding) openshift-config-managed/openshift-network-public-role-binding: Patch "https://api-int.ci-op-8t6mvd1i-a157f.aws-2.ci.openshift.org:6443/apis/rbac.authorization.k8s.io/v1/namespaces/openshift-config-managed/rolebindings/openshift-network-public-role-binding?fieldManager=cluster-network-operator%2Foperconfig&force=true": dial tcp 10.0.170.103:6443: connect: connection refused Operator degraded (ApplyOperatorConfig): Error while updating operator configuration: could not apply (rbac.authorization.k8s.io/v1, Kind=RoleBinding) openshift-config-managed/openshift-network-public-role-binding: failed to apply / update (rbac.authorization.k8s.io/v1, Kind=RoleBinding) openshift-config-managed/openshift-network-public-role-binding: Patch "https://api-int.ci-op-8t6mvd1i-a157f.aws-2.ci.openshift.org:6443/apis/rbac.authorization.k8s.io/v1/namespaces/openshift-config-managed/rolebindings/openshift-network-public-role-binding?fieldManager=cluster-network-operator%2Foperconfig&force=true": dial tcp 10.0.170.103:6443: connect: connection refused}
{Operator degraded (ApplyOperatorConfig): Error while updating operator configuration: could not apply (/v1, Kind=Namespace) /openshift-multus: failed to apply / update (/v1, Kind=Namespace) /openshift-multus: Patch "https://api-int.ci-op-brwgsfxm-2ac23.aws-2.ci.openshift.org:6443/api/v1/namespaces/openshift-multus?fieldManager=cluster-network-operator%2Foperconfig&force=true": read tcp 10.0.217.20:49064->10.0.196.167:6443: read: connection reset by peer Operator degraded (ApplyOperatorConfig): Error while updating operator configuration: could not apply (/v1, Kind=Namespace) /openshift-multus: failed to apply / update (/v1, Kind=Namespace) /openshift-multus: Patch "https://api-int.ci-op-brwgsfxm-2ac23.aws-2.ci.openshift.org:6443/api/v1/namespaces/openshift-multus?fieldManager=cluster-network-operator%2Foperconfig&force=true": read tcp 10.0.217.20:49064->10.0.196.167:6443: read: connection reset by peer}
The problem appears to impact only AWS and 4.13; there are no hits anywhere else.
It seems to span both sdn and ovn. Apologies for the component choice; I can't find one for generic network operator problems.
Because this happens during install, we don't have the tools TRT would normally use to debug pod status or determine what was actually down at the time.
Focusing on why this only appears in 4.13 might be a good start.
Could the network operator be more resilient with retries here, or is this a permanent failure?
It's very possible this ends up going to the API server component or something similar, but it's curious that only the network operator is complaining.
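If this really is a transient apiserver blip rather than a permanent failure, one option would be to retry the apply before going Degraded. A minimal Go sketch, assuming client-go's retry/backoff helpers; applyObject, isTransient, and the backoff values below are hypothetical illustrations, not the actual cluster-network-operator code:
{code:go}
package main

import (
	"context"
	"fmt"
	"time"

	utilnet "k8s.io/apimachinery/pkg/util/net"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

// applyObject is a hypothetical stand-in for the operator's apply/Patch call
// against the apiserver; it is not the real CNO apply helper.
func applyObject(ctx context.Context) error {
	// ... server-side apply Patch(...) would go here ...
	return nil
}

// isTransient treats apiserver connection blips ("connection refused",
// "connection reset by peer") as retriable rather than immediately fatal.
func isTransient(err error) bool {
	return utilnet.IsConnectionRefused(err) || utilnet.IsConnectionReset(err)
}

// applyWithRetry retries the apply with exponential backoff while the error
// still looks transient, and only surfaces the error (and Degraded) afterwards.
func applyWithRetry(ctx context.Context) error {
	backoff := wait.Backoff{Steps: 5, Duration: 2 * time.Second, Factor: 2.0, Jitter: 0.1}
	return retry.OnError(backoff, isTransient, func() error {
		return applyObject(ctx)
	})
}

func main() {
	if err := applyWithRetry(context.Background()); err != nil {
		fmt.Printf("apply still failing after retries: %v\n", err)
	}
}
{code}
Whether that's the right trade-off depends on the answer to the question above: if the apiserver is genuinely down, retries would only delay reporting Degraded.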
test=operator conditions network
incident=variants=aws
- is related to
-
TRT-700 Investigate AWS install problems
- Closed