
AWS Install Problems due to network operator unable to update status


    • Moderate
    • None
    • Rejected
    • False

      AWS installs have been struggling lately, and we've found that at least one of the causes is the network operator being degraded at the end of install. The problem manifests as either a "connection refused" or a "read: connection reset by peer" error while updating the ClusterOperator.

      Examples:

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-e2e-aws-sdn-serial/1594569311073079296

      {Operator degraded (ApplyOperatorConfig): Error while updating operator configuration: could not apply (rbac.authorization.k8s.io/v1, Kind=RoleBinding) openshift-config-managed/openshift-network-public-role-binding: failed to apply / update (rbac.authorization.k8s.io/v1, Kind=RoleBinding) openshift-config-managed/openshift-network-public-role-binding: Patch "https://api-int.ci-op-8t6mvd1i-a157f.aws-2.ci.openshift.org:6443/apis/rbac.authorization.k8s.io/v1/namespaces/openshift-config-managed/rolebindings/openshift-network-public-role-binding?fieldManager=cluster-network-operator%2Foperconfig&force=true": dial tcp 10.0.170.103:6443: connect: connection refused  Operator degraded (ApplyOperatorConfig): Error while updating operator configuration: could not apply (rbac.authorization.k8s.io/v1, Kind=RoleBinding) openshift-config-managed/openshift-network-public-role-binding: failed to apply / update (rbac.authorization.k8s.io/v1, Kind=RoleBinding) openshift-config-managed/openshift-network-public-role-binding: Patch "https://api-int.ci-op-8t6mvd1i-a157f.aws-2.ci.openshift.org:6443/apis/rbac.authorization.k8s.io/v1/namespaces/openshift-config-managed/rolebindings/openshift-network-public-role-binding?fieldManager=cluster-network-operator%2Foperconfig&force=true": dial tcp 10.0.170.103:6443: connect: connection refused}
      

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-aws-sdn-serial/1594529754332008448

      {Operator degraded (ApplyOperatorConfig): Error while updating operator configuration: could not apply (/v1, Kind=Namespace) /openshift-multus: failed to apply / update (/v1, Kind=Namespace) /openshift-multus: Patch "https://api-int.ci-op-brwgsfxm-2ac23.aws-2.ci.openshift.org:6443/api/v1/namespaces/openshift-multus?fieldManager=cluster-network-operator%2Foperconfig&force=true": read tcp 10.0.217.20:49064->10.0.196.167:6443: read: connection reset by peer  Operator degraded (ApplyOperatorConfig): Error while updating operator configuration: could not apply (/v1, Kind=Namespace) /openshift-multus: failed to apply / update (/v1, Kind=Namespace) /openshift-multus: Patch "https://api-int.ci-op-brwgsfxm-2ac23.aws-2.ci.openshift.org:6443/api/v1/namespaces/openshift-multus?fieldManager=cluster-network-operator%2Foperconfig&force=true": read tcp 10.0.217.20:49064->10.0.196.167:6443: read: connection reset by peer}
      

      The problem appears to affect only AWS on 4.13; there are no hits anywhere else.

      https://search.ci.openshift.org/?search=Cluster+operator+network+Degraded+is+True+with+ApplyOperatorConfig&maxAge=168h&context=1&type=bug%2Bissue%2Bjunit&name=4.13.*aws&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
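For readers who want to see exactly what the search above is scoped to, here is a minimal sketch that decodes the query string of that URL with Python's standard library. The helper name `search_filters` is made up for illustration. The reading of the parameters (that `name` is a job-name pattern and `maxAge=168h` is a seven-day window) is an assumption based on how the values look, not on documentation quoted in this report.

```python
from urllib.parse import parse_qs, urlsplit

# The search URL quoted in this report, unchanged.
SEARCH_URL = (
    "https://search.ci.openshift.org/?search=Cluster+operator+network+Degraded"
    "+is+True+with+ApplyOperatorConfig&maxAge=168h&context=1"
    "&type=bug%2Bissue%2Bjunit&name=4.13.*aws&excludeName=&maxMatches=5"
    "&maxBytes=20971520&groupBy=job"
)


def search_filters(url):
    """Return the non-empty query parameters of a search URL as a dict.

    Hypothetical helper: parse_qs drops parameters with blank values
    (such as excludeName= here), and each key in this URL appears once,
    so taking the first value per key loses nothing.
    """
    query = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in query.items()}


filters = search_filters(SEARCH_URL)
for key in ("search", "name", "maxAge", "type"):
    print(f"{key}: {filters[key]}")
```

Widening the `name` value or the `maxAge` window in that URL is one way to double-check that there really are no hits outside AWS and 4.13.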

      It seems to span both sdn and ovn. Apologies for the component choice; I can't find one for generic network operator problems.

      Because this happens during install, we don't have the tools TRT would normally use to debug pod status or to see what was actually down at the time.

      Focusing on why this is only appearing in 4.13 might be a good start.

      Could the network operator be more resilient with retries here, or are we in a permanent failure?
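To make the retry question concrete, here is a minimal sketch of the kind of behavior being asked about: retry an apply call with exponential backoff when it fails with one of the two connection errors seen above, and give up on anything else. This is not the network operator's code (the operator is not written in Python). `apply_with_retry`, `is_transient`, and every default value are hypothetical, chosen only for illustration.

```python
import errno
import random
import time

# The two failure modes quoted in this report:
# "connect: connection refused" and "read: connection reset by peer".
TRANSIENT_ERRNOS = {errno.ECONNREFUSED, errno.ECONNRESET}


def is_transient(exc):
    """Return True if exc is one of the two connection errors above."""
    return isinstance(exc, OSError) and exc.errno in TRANSIENT_ERRNOS


def apply_with_retry(apply_fn, attempts=5, base_delay=0.5, max_delay=8.0,
                     sleep=time.sleep):
    """Call apply_fn, retrying transient connection errors with backoff.

    Hypothetical helper: attempts, base_delay and max_delay are
    illustrative defaults, not values taken from any real operator.
    Non-transient errors, and the final failed attempt, are re-raised.
    """
    for attempt in range(attempts):
        try:
            return apply_fn()
        except OSError as exc:
            if not is_transient(exc) or attempt == attempts - 1:
                raise
            # Exponential backoff capped at max_delay, plus some jitter
            # so that several clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

If failures like the ones above clear within a few attempts, the API endpoint was only briefly unreachable and retries would hide the problem from the install. If they never clear, the failure is permanent and the investigation belongs elsewhere.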

      It's very possible this ends up going to the API server team or something similar, but it's curious that only the network operator is complaining.

      test=operator conditions network
      incident=variants=aws

              rravaiol@redhat.com Riccardo Ravaioli
              rhn-engineering-dgoodwin Devan Goodwin
              Zhanqi Zhao Zhanqi Zhao