Bug | Resolution: Cannot Reproduce | Normal | 4.11.z | Quality / Stability / Reliability | Important | Rejected | PM Sync
Description of problem:
When installing 1000+ SNOs via ACM/MCE using ZTP with GitOps, a small percentage of clusters never complete installation because the openshift-controller-manager cluster operator does not reconcile to Available.
Example:
oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig get co openshift-controller-manager
NAME                           VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
openshift-controller-manager             False       True          False      17h     Available: no daemon pods available on any node.
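To find which clusters in a fleet are stuck in this state, a scan along the following lines can be used. This is only a sketch and not part of the original report; it assumes the per-cluster kubeconfigs live under /root/hv-vm/sno/manifests/<cluster>/kubeconfig, as in the commands shown here.

for kc in /root/hv-vm/sno/manifests/*/kubeconfig; do
  # Hypothetical fleet scan: the directory name is taken as the cluster name.
  cluster=$(basename "$(dirname "$kc")")
  # Read the Available condition of the openshift-controller-manager cluster operator.
  avail=$(oc --kubeconfig="$kc" get co openshift-controller-manager \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null)
  # Flag any cluster where the operator is not Available=True.
  [ "$avail" != "True" ] && echo "$cluster: openshift-controller-manager Available=$avail"
done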
Version-Release number of selected component (if applicable):
Hub OCP and SNO OCP - 4.11.0
ACM - 2.6.0-DOWNSTREAM-2022-08-11-23-41-09 (FC5)
How reproducible:
- 21 of the 23 failures observed across 1728 installs were caused by this issue
- i.e. ~90% of the install failures are due to this issue
- overall failure rate of ~1.2% of the total installs
Additional info:
# oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       True          18h     Unable to apply 4.11.0: the cluster operator openshift-controller-manager has not yet successfully rolled out

# oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.11.0    True        False         False      10m
baremetal                                  4.11.0    True        False         False      17h
cloud-controller-manager                   4.11.0    True        False         False      17h
cloud-credential                           4.11.0    True        False         False      17h
cluster-autoscaler                         4.11.0    True        False         False      17h
config-operator                            4.11.0    True        False         False      18h
console                                    4.11.0    True        False         False      16h
csi-snapshot-controller                    4.11.0    True        False         False      17h
dns                                        4.11.0    True        False         False      17h
etcd                                       4.11.0    True        False         False      18h
image-registry                             4.11.0    True        False         False      17h
ingress                                    4.11.0    True        False         False      17h
insights                                   4.11.0    True        False         False      51s
kube-apiserver                             4.11.0    True        False         False      17h
kube-controller-manager                    4.11.0    True        False         False      17h
kube-scheduler                             4.11.0    True        False         False      17h
kube-storage-version-migrator              4.11.0    True        False         False      17h
machine-api                                4.11.0    True        False         False      17h
machine-approver                           4.11.0    True        False         False      17h
machine-config                             4.11.0    True        False         False      17h
marketplace                                4.11.0    True        False         False      17h
monitoring                                 4.11.0    True        False         False      17h
network                                    4.11.0    True        False         False      18h
node-tuning                                4.11.0    True        False         False      17h
openshift-apiserver                        4.11.0    True        False         False      11m
openshift-controller-manager                         False       True          False      18h     Available: no daemon pods available on any node.
openshift-samples                          4.11.0    True        False         False      17h
operator-lifecycle-manager                 4.11.0    True        False         False      17h
operator-lifecycle-manager-catalog         4.11.0    True        False         False      17h
operator-lifecycle-manager-packageserver   4.11.0    True        False         False      17h
service-ca                                 4.11.0    True        False         False      18h
storage                                    4.11.0    True        False         False      17h

# oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig get po -n openshift-controller-manager
NAME                       READY   STATUS        RESTARTS   AGE
controller-manager-26stq   0/1     Terminating   0          17h

# oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig describe po -n openshift-controller-manager
Name:                      controller-manager-26stq
Namespace:                 openshift-controller-manager
Priority:                  2000001000
Priority Class Name:       system-node-critical
Node:                      sno00090/fc00:1001::441
Start Time:                Mon, 15 Aug 2022 20:40:49 +0000
Labels:                    app=openshift-controller-manager
                           controller-manager=true
                           controller-revision-hash=5cdd6cbbf4
                           pod-template-generation=3
Annotations:               k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["fd01:0:0:1::26/64"],"mac_address":"0a:58:4c:a7:02:14","gateway_ips":["fd01:0:0:1::1"],"ip_address":"fd01:0:0:...
                           k8s.v1.cni.cncf.io/network-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "fd01:0:0:1::26" ], "mac": "0a:58:4c:a7:02:14", "default": true, "dns": {} }]
                           k8s.v1.cni.cncf.io/networks-status: [{ "name": "ovn-kubernetes", "interface": "eth0", "ips": [ "fd01:0:0:1::26" ], "mac": "0a:58:4c:a7:02:14", "default": true, "dns": {} }]
                           openshift.io/scc: restricted-v2
                           operator.openshift.io/force: a45bf93a-5310-494f-9b47-efe5e122e439
                           seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:                    Terminating (lasts 17h)
Termination Grace Period:  30s
IP:                        fd01:0:0:1::26
IPs:
  IP:  fd01:0:0:1::26
Controlled By:  DaemonSet/controller-manager
Containers:
  controller-manager:
    Container ID:
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fead5da1351abee9e77f5c46247c7947d6f9cf76b5bf7a8b4905222fe2564665
    Image ID:
    Port:          8443/TCP
    Host Port:     0/TCP
    Command:
      openshift-controller-manager
      start
    Args:
      --config=/var/run/configmaps/config/config.yaml
      -v=2
    State:          Terminated
      Reason:       ContainerStatusUnknown
      Message:      The container could not be located when the pod was terminated
      Exit Code:    137
      Started:      Mon, 01 Jan 0001 00:00:00 +0000
      Finished:     Mon, 01 Jan 0001 00:00:00 +0000
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     100m
      memory:  100Mi
    Environment:  <none>
    Mounts:
      /etc/pki/ca-trust/extracted/pem from proxy-ca-bundles (rw)
      /var/run/configmaps/client-ca from client-ca (rw)
      /var/run/configmaps/config from config (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-d6rhr (ro)
      /var/run/secrets/serving-cert from serving-cert (rw)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      config
    Optional:  false
  client-ca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      client-ca
    Optional:  false
  serving-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  serving-cert
    Optional:    false
  proxy-ca-bundles:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      openshift-global-ca
    Optional:  false
  kube-api-access-d6rhr:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              node-role.kubernetes.io/master=
Tolerations:                 op=Exists
Events:
  Type     Reason       Age                    From     Message
  ----     ------       ----                   ----     -------
  Warning  FailedMount  81m (x493 over 17h)    kubelet  MountVolume.SetUp failed for volume "proxy-ca-bundles" : object "openshift-controller-manager"/"openshift-global-ca" not registered
  Warning  FailedMount  63m (x502 over 17h)    kubelet  MountVolume.SetUp failed for volume "kube-api-access-d6rhr" : [object "openshift-controller-manager"/"kube-root-ca.crt" not registered, object "openshift-controller-manager"/"openshift-service-ca.crt" not registered]
  Warning  FailedMount  16m (x525 over 17h)    kubelet  MountVolume.SetUp failed for volume "client-ca" : object "openshift-controller-manager"/"client-ca" not registered
  Warning  FailedMount  12m (x527 over 17h)    kubelet  MountVolume.SetUp failed for volume "serving-cert" : object "openshift-controller-manager"/"serving-cert" not registered
  Warning  FailedMount  2m34s (x532 over 17h)  kubelet  MountVolume.SetUp failed for volume "config" : object "openshift-controller-manager"/"config" not registered

# oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig logs -n openshift-controller-manager controller-manager-26stq
Error from server (BadRequest): container "controller-manager" in pod "controller-manager-26stq" is terminated
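The FailedMount events above show the kubelet's volume manager treating the pod's ConfigMaps and Secrets as "not registered". As a rough cross-check on the node (a suggested addition, not part of the original report), the kubelet journal can be searched for the same messages:

# ssh core@sno00090 sudo journalctl -u kubelet --since "1 hour ago" | grep "not registered" | tail -n 5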
Restarting the kubelet resolves the issue:
# ssh core@sno00090 sudo systemctl restart kubelet
# oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig get clusterversion -w
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version False True 18h Unable to apply 4.11.0: the cluster operator openshift-controller-manager has not yet successfully rolled out
version False True 18h Working towards 4.11.0: 801 of 802 done (99% complete)
version 4.11.0 True False 0s Cluster version is 4.11.0
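For scripted recovery, the kubelet restart can be followed by a blocking wait instead of watching the ClusterVersion. The oc wait commands below are a suggested addition (not from the original report) and the 30m timeout is arbitrary:

# oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig wait --timeout=30m --for=condition=Available clusteroperator/openshift-controller-manager
# oc --kubeconfig=/root/hv-vm/sno/manifests/sno00090/kubeconfig wait --timeout=30m --for=condition=Available clusterversion/version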