OCPBUGS-43631

Static pod controller pods sometimes fail to start [kube-controller-manager]

      deads reported in this thread that the static pod controller appears to sometimes deploy pods that do not show up in a reasonable timeframe, which occasionally triggers this test to fail (source job):

      [sig-node] static pods should start after being created 
      
      {  static pod lifecycle failure - static pod: "kube-controller-manager" in namespace: "openshift-kube-controller-manager" for revision: 6 on node: "ip-10-0-56-130.us-east-2.compute.internal" didn't show up, waited: 3m30s}
      
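      For context, the check that times out here boils down to waiting for the operand's mirror pod to show up with the expected revision. Below is a minimal client-go sketch of that wait; it is not the actual controller or test code. The pod name pattern and the "revision" label key are assumptions, and the 3m30s timeout is taken from the failure message above.

      // Minimal sketch, assuming the mirror pod carries its target revision as a "revision" label.
      package staticpodwait

      import (
          "context"
          "time"

          apierrors "k8s.io/apimachinery/pkg/api/errors"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/apimachinery/pkg/util/wait"
          "k8s.io/client-go/kubernetes"
      )

      // waitForStaticPodRevision polls until the static (mirror) pod appears with the expected
      // revision label, or gives up after 3m30s -- the same wait reported in the failure message.
      func waitForStaticPodRevision(ctx context.Context, client kubernetes.Interface, ns, podName, revision string) error {
          return wait.PollUntilContextTimeout(ctx, 5*time.Second, 3*time.Minute+30*time.Second, true,
              func(ctx context.Context) (bool, error) {
                  pod, err := client.CoreV1().Pods(ns).Get(ctx, podName, metav1.GetOptions{})
                  if apierrors.IsNotFound(err) {
                      return false, nil // mirror pod not created yet; keep waiting
                  }
                  if err != nil {
                      return false, err
                  }
                  // Assumed label key: the rendered static pod manifest labels the pod with its revision.
                  return pod.Labels["revision"] == revision, nil
              })
      }

      For the failing run above, that would amount to namespace "openshift-kube-controller-manager", a pod named for the node (assumed naming), and revision "6".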

      David suspects that this actually happens far more often than the test failures indicate; however, this test should be a useful resource for finding affected runs.

      The test details page indicates this fails up to 10% of the time on some job variants. The most commonly affected component appears to be kube-controller-manager, but apiserver and etcd also show up at times. Use the test details link to find more job runs.

      Slack thread has more details from both deads@redhat.com and tjungblu@redhat.com.

      The suspicion is that fixing this could improve install times and reliability.

      jpoulin's notes: For some reason, this issue is seen much more prevalently in the single-node jobs. When combined with the cloned bug related to etcd, it is seen in close to 30% of single-node upgrade runs.

      Digging deeper into the logs, I see:

      I1018 19:11:06.671321       1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-controller-manager-operator", Name:"kube-controller-manager-operator", UID:"8e24d779-45f6-467c-b730-6925dc516f21", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-controller-manager changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready" to "RevisionControllerDegraded: Internal error occurred: resource quota evaluation timed out\nMissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: \"kube-controller-manager\" in namespace: \"openshift-kube-controller-manager\" for revision: 6 on node: \"ip-10-0-56-130.us-east-2.compute.internal\" didn't show up, waited: 3m30s\nNodeControllerDegraded: All master nodes are ready\nInstallerControllerDegraded: missing required resources: secrets: localhost-recovery-client-token-7",Progressing message changed from "NodeInstallerProgressing: 1 node is at revision 5; 0 nodes have achieved new revision 6" to "NodeInstallerProgressing: 1 node is at revision 5; 0 nodes have achieved new revision 7",Available message changed from "StaticPodsAvailable: 1 nodes are active; 1 node is at revision 5; 0 nodes have achieved new revision 6" to "StaticPodsAvailable: 1 nodes are active; 1 node is at revision 5; 0 nodes have achieved new revision 7"
      

      I believe this line could be the cause of such a delay:

      missing required resources: secrets: localhost-recovery-client-token-7

      Is this secret needed for the deployment of the static pod? If so, would it be possible to check if it is available before scheduling the pod?
      I tried scanning openshift/kubernetes for localhost-recovery-client, since I expected it to be part of the manifest, but I quickly found myself out of my depth: I don't know whether that is even the true source of the relevant manifest.
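
      To make that suggestion concrete, here is a hedged sketch of the kind of precondition check the question implies: confirm the revisioned secret exists before creating the installer pod for that revision. This is not the operator's actual code; the namespace and secret name come from the log line above, and the function and its placement are hypothetical.

      package installerprecheck

      import (
          "context"
          "fmt"

          apierrors "k8s.io/apimachinery/pkg/api/errors"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
      )

      // requiredSecretsExist returns an error naming any secret that is not yet present,
      // so a controller could requeue instead of launching an installer pod that will fail.
      func requiredSecretsExist(ctx context.Context, client kubernetes.Interface, ns string, names ...string) error {
          var missing []string
          for _, name := range names {
              _, err := client.CoreV1().Secrets(ns).Get(ctx, name, metav1.GetOptions{})
              if apierrors.IsNotFound(err) {
                  missing = append(missing, name)
                  continue
              }
              if err != nil {
                  return err
              }
          }
          if len(missing) > 0 {
              return fmt.Errorf("missing required resources: secrets: %v", missing)
          }
          return nil
      }

      // Hypothetical usage for the failing revision above:
      //   err := requiredSecretsExist(ctx, client, "openshift-kube-controller-manager",
      //       "localhost-recovery-client-token-7")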

            fkrepins@redhat.com Filip Krepinsky
            jpoulin Jeremy Poulin
            ying zhou ying zhou