OCPBUGS-48673

The bootstrap node is removed too early, which can cause API disruption


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • None
    • 4.18.0, 4.19.0
    • kube-apiserver
    • None
    • None
    • Rejected
    • False

      During cluster bootstrap, API disruption can occur when a kube-apiserver instance cannot reach any live etcd endpoint. This happens in one very specific scenario:

      • A kube-apiserver instance is running on a node and is at revision 1. Its etcd-servers list contains the bootstrap node IP and localhost.
      • When the bootstrap node is deleted, the etcd instance that was running on it becomes unavailable.
      • When the etcd instance running on the same node as that kube-apiserver instance is rolled out to a new revision, it also becomes unavailable.

      When both of these events happen while a kube-apiserver instance is still on revision 1, none of its configured etcd endpoints is reachable and its readyz probe fails.
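The sequence described above can be modeled with a small sketch. It is an illustration only: the endpoint addresses and the `has_live_etcd` helper are hypothetical and are not taken from kube-apiserver or cluster-bootstrap.

```python
# Illustrative model only; addresses and helper names are hypothetical.
BOOTSTRAP = "https://10.0.0.5:2379"   # etcd on the bootstrap node (made-up IP)
LOCALHOST = "https://localhost:2379"  # etcd co-located with the kube-apiserver


def has_live_etcd(etcd_servers, unavailable):
    """Return True if at least one configured etcd endpoint is still reachable."""
    return any(endpoint not in unavailable for endpoint in etcd_servers)


# A revision 1 kube-apiserver is configured with bootstrap and localhost.
revision_1_servers = [BOOTSTRAP, LOCALHOST]

unavailable = set()
print(has_live_etcd(revision_1_servers, unavailable))  # True

unavailable.add(BOOTSTRAP)  # the bootstrap node is deleted
print(has_live_etcd(revision_1_servers, unavailable))  # True, localhost remains

unavailable.add(LOCALHOST)  # the local etcd is rolled out to a new revision
print(has_live_etcd(revision_1_servers, unavailable))  # False, readyz would fail
```

Either event alone leaves one reachable endpoint; only the combination leaves the instance with none.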

      The suggested fix is to add a check in cluster-bootstrap that ensures every kube-apiserver pod has at least 2 etcd-servers that are neither the bootstrap node nor localhost before the bootstrap node is removed.
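A minimal sketch of what such a check could look like is given below. It is not the cluster-bootstrap implementation (which is written in Go); the function names, pod names, and IP addresses are assumptions made for illustration, and the input is assumed to be a mapping from each kube-apiserver pod to its configured etcd-servers URLs.

```python
from urllib.parse import urlparse

# Hostnames treated as "localhost" for the purpose of this sketch.
LOCAL_HOSTS = {"localhost", "127.0.0.1", "::1"}


def count_safe_etcd_servers(etcd_servers, bootstrap_ip):
    """Count distinct etcd hosts that are neither the bootstrap node nor localhost."""
    hosts = {urlparse(url).hostname for url in etcd_servers}
    hosts.discard(None)  # ignore entries that could not be parsed as URLs
    return len(hosts - LOCAL_HOSTS - {bootstrap_ip})


def safe_to_remove_bootstrap(apiserver_etcd_servers, bootstrap_ip, minimum=2):
    """Return True only if every kube-apiserver pod has enough safe etcd-servers.

    apiserver_etcd_servers maps a pod name to its list of etcd-servers URLs
    (hypothetical input shape). An empty mapping is treated as unsafe.
    """
    return bool(apiserver_etcd_servers) and all(
        count_safe_etcd_servers(servers, bootstrap_ip) >= minimum
        for servers in apiserver_etcd_servers.values()
    )


if __name__ == "__main__":
    bootstrap_ip = "10.0.0.5"  # made-up address
    pods = {
        # Still on revision 1: only bootstrap and localhost are configured.
        "kube-apiserver-master-0": ["https://10.0.0.5:2379", "https://localhost:2379"],
        # Already rolled out: two non-bootstrap, non-localhost members.
        "kube-apiserver-master-1": [
            "https://10.0.0.11:2379",
            "https://10.0.0.12:2379",
            "https://localhost:2379",
        ],
    }
    print(safe_to_remove_bootstrap(pods, bootstrap_ip))  # False
```

In this sketch a single pod that is still on revision 1 is enough to block removal, which matches the "for each kube-apiserver pod" wording of the suggested fix.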

      Job where this is happening: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1387/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-serial/1880358740390055936

              dgrisonn@redhat.com Damien Grisonnet
              dgrisonn@redhat.com Damien Grisonnet
              Ke Wang Ke Wang