Resolution: Duplicate
Description of problem:
The cluster failed to upgrade from 4.12.47 to 4.13.30 because the MCO could not evict pods that have no 'controller: true' ownerReferences.
jiazha-mac:~ jiazha$ omg get nodes
NAME                        STATUS                    ROLES                 AGE    VERSION
maxu-47686-b7jsh-rhel-1     Ready                     worker                2h20m  v1.25.16+6df2177
maxu-47686-b7jsh-rhel-0     Ready                     worker                2h20m  v1.25.16+6df2177
maxu-47686-b7jsh-worker-2   Ready                     worker                2h51m  v1.25.16+5c97f5b
maxu-47686-b7jsh-master-0   Ready                     control-plane,master  3h14m  v1.25.16+5c97f5b
maxu-47686-b7jsh-master-1   Ready                     control-plane,master  3h10m  v1.25.16+5c97f5b
maxu-47686-b7jsh-master-2   Ready                     control-plane,master  2h56m  v1.25.16+5c97f5b
maxu-47686-b7jsh-worker-1   Ready                     worker                2h51m  v1.25.16+5c97f5b
maxu-47686-b7jsh-worker-0   Ready,SchedulingDisabled  worker                2h51m  v1.25.16+5c97f5b

jiazha-mac:~ jiazha$ omg get mcp
NAME    CONFIG                                            UPDATED  UPDATING  DEGRADED  MACHINECOUNT  READYMACHINECOUNT  UPDATEDMACHINECOUNT  DEGRADEDMACHINECOUNT  AGE
worker  rendered-worker-f88335e9afd20564c05e8b0cd4573df2  False    True      True      5             0                  0                    1                     3h10m
master  rendered-master-1023ac264533c5f1926448ec0a816c28  True     False     False     3             3                  3                    0                     3h10m

jiazha-mac:~ jiazha$ omg get co machine-config -o yaml
apiVersion: config.openshift.io/v1
kind: ClusterOperator
...
  extension:
    master: all 3 nodes are at latest configuration rendered-master-1023ac264533c5f1926448ec0a816c28
    worker: 'pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node maxu-47686-b7jsh-worker-0 is reporting: \"failed to drain node: maxu-47686-b7jsh-worker-0 after 1 hour. Please see machine-config-controller logs for more information\""'
  relatedObjects:

jiazha-mac:~ jiazha$ omg -n openshift-machine-config-operator logs machine-config-controller-74b57df9d6-2gfmp -c machine-config-controller |grep "Drain failed"
2024-01-23T11:07:39.778225664Z I0123 11:07:39.778190 1 drain_controller.go:139] node maxu-47686-b7jsh-worker-0: Drain failed. Waiting 1 minute then retrying. Error message from drain: [error when waiting for pod "hello-pod" terminating: global timeout reached: 1m30s, error when waiting for pod "qe-app-registry-cwsvf" terminating: global timeout reached: 1m30s, error when waiting for pod "ocp-54745-pod-0" terminating: global timeout reached: 1m30s, error when waiting for pod "qe-app-registry-6cmvf" terminating: global timeout reached: 1m30s]
...
2024-01-23T12:16:39.381322587Z I0123 12:16:39.380041 1 drain_controller.go:139] node maxu-47686-b7jsh-worker-0: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: [error when waiting for pod "qe-app-registry-cwsvf" terminating: global timeout reached: 1m30s, error when waiting for pod "qe-app-registry-6cmvf" terminating: global timeout reached: 1m30s]
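To compare which pods block the drain at different points in the controller log, the pod names can be pulled out of each "Error message from drain" entry. The following is a minimal sketch, not part of the MCO or omg tooling: the function name `stuck_pods` and the regular expression are illustrative assumptions based only on the message format shown in the log lines above.

```python
import re
from typing import List

# Assumed pattern, derived from the per-pod errors in the log lines above.
POD_ERROR = re.compile(r'error when waiting for pod "([^"]+)" terminating')


def stuck_pods(drain_message: str) -> List[str]:
    """Return the pod names mentioned in a drain error message, in order, without duplicates."""
    names: List[str] = []
    for name in POD_ERROR.findall(drain_message):
        if name not in names:
            names.append(name)
    return names


# Error lists copied from the 11:07:39 and 12:16:39 log entries.
FIRST_MESSAGE = (
    '[error when waiting for pod "hello-pod" terminating: global timeout reached: 1m30s, '
    'error when waiting for pod "qe-app-registry-cwsvf" terminating: global timeout reached: 1m30s, '
    'error when waiting for pod "ocp-54745-pod-0" terminating: global timeout reached: 1m30s, '
    'error when waiting for pod "qe-app-registry-6cmvf" terminating: global timeout reached: 1m30s]'
)
LAST_MESSAGE = (
    '[error when waiting for pod "qe-app-registry-cwsvf" terminating: global timeout reached: 1m30s, '
    'error when waiting for pod "qe-app-registry-6cmvf" terminating: global timeout reached: 1m30s]'
)

if __name__ == "__main__":
    print(stuck_pods(FIRST_MESSAGE))
    print(stuck_pods(LAST_MESSAGE))
```

Run against the two entries, this lists four pods for the first message and only the two qe-app-registry pods for the last one.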
It seems that, in the end, only the two `qe-app-registry-xxx` pods failed to be evicted, as follows:
jiazha-mac:~ jiazha$ omg get pods -o wide -n openshift-marketplace
NAME                                   READY  STATUS   RESTARTS  AGE    IP  NODE
marketplace-operator-845b865dbd-qhd6d  0/1    Running  0         1h13m      maxu-47686-b7jsh-master-1
qe-app-registry-6cmvf                  0/1    Pending  0         2h33m      maxu-47686-b7jsh-worker-0
qe-app-registry-cwsvf                  0/1    Pending  0         2h17m      maxu-47686-b7jsh-worker-0

jiazha-mac:~ jiazha$ omg get pods qe-app-registry-cwsvf -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: 'true'
    k8s.v1.cni.cncf.io/network-status: "[{\n \"name\": \"openshift-sdn\",\n \
      \ \"interface\": \"eth0\",\n \"ips\": [\n \"\"\n ],\n\
      \ \"default\": true,\n \"dns\": {}\n}]"
    k8s.v1.cni.cncf.io/networks-status: "[{\n \"name\": \"openshift-sdn\",\n \
      \ \"interface\": \"eth0\",\n \"ips\": [\n \"\"\n ],\n\
      \ \"default\": true,\n \"dns\": {}\n}]"
    kubectl.kubernetes.io/last-applied-configuration: '{"apiVersion":"operators.coreos.com/v1alpha1","kind":"CatalogSource","metadata":{"annotations":{},"name":"qe-app-registry","namespace":"openshift-marketplace"},"spec":{"displayName":"Production Operators","image":"upshift.mirror-registry.qe.devcluster.openshift.com:6001/openshift-qe-optional-operators/aosqe-index:v4.12","publisher":"OpenShift QE","sourceType":"grpc","updateStrategy":{"registryPoll":{"interval":"15m"}}}} '
    openshift.io/scc: anyuid
  creationTimestamp: '2024-01-23T10:00:53Z'
  deletionGracePeriodSeconds: '30'
  deletionTimestamp: '2024-01-23T10:23:01Z'
  generateName: qe-app-registry-
  labels:
    catalogsource.operators.coreos.com/update: qe-app-registry
    olm.catalogSource: ''
    olm.pod-spec-hash: 9b66974d5
  managedFields:
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:cluster-autoscaler.kubernetes.io/safe-to-evict: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
        f:generateName: {}
        f:labels:
          .: {}
          f:catalogsource.operators.coreos.com/update: {}
          f:olm.catalogSource: {}
          f:olm.pod-spec-hash: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"632372ee-c42f-40b8-9da9-dc57097cf4ec"}: {}
      f:spec:
        f:containers:
          k:{"name":"registry-server"}:
            .: {}
            f:image: {}
            f:imagePullPolicy: {}
            f:livenessProbe:
              .: {}
              f:exec:
                .: {}
                f:command: {}
              f:failureThreshold: {}
              f:initialDelaySeconds: {}
              f:periodSeconds: {}
              f:successThreshold: {}
              f:timeoutSeconds: {}
            f:name: {}
            f:ports:
              .: {}
              k:{"containerPort":50051,"protocol":"TCP"}:
                .: {}
                f:containerPort: {}
                f:name: {}
                f:protocol: {}
            f:readinessProbe:
              .: {}
              f:exec:
                .: {}
                f:command: {}
              f:failureThreshold: {}
              f:initialDelaySeconds: {}
              f:periodSeconds: {}
              f:successThreshold: {}
              f:timeoutSeconds: {}
            f:resources:
              .: {}
              f:requests:
                .: {}
                f:cpu: {}
                f:memory: {}
            f:securityContext:
              .: {}
              f:readOnlyRootFilesystem: {}
            f:startupProbe:
              .: {}
              f:exec:
                .: {}
                f:command: {}
              f:failureThreshold: {}
              f:periodSeconds: {}
              f:successThreshold: {}
              f:timeoutSeconds: {}
            f:terminationMessagePath: {}
            f:terminationMessagePolicy: {}
        f:dnsPolicy: {}
        f:enableServiceLinks: {}
        f:nodeSelector: {}
        f:restartPolicy: {}
        f:schedulerName: {}
        f:securityContext: {}
        f:serviceAccount: {}
        f:serviceAccountName: {}
        f:terminationGracePeriodSeconds: {}
    manager: catalog
    operation: Update
    time: '2024-01-23T10:00:53Z'
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
          k:{"type":"ContainersReady"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
          k:{"type":"Initialized"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:status: {}
            f:type: {}
          k:{"type":"Ready"}:
            .: {}
            f:lastProbeTime: {}
            f:lastTransitionTime: {}
            f:message: {}
            f:reason: {}
            f:status: {}
            f:type: {}
        f:containerStatuses: {}
        f:hostIP: {}
        f:startTime: {}
    manager: kubelet
    operation: Update
    subresource: status
    time: '2024-01-23T10:00:53Z'
  - apiVersion: v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:k8s.v1.cni.cncf.io/network-status: {}
          f:k8s.v1.cni.cncf.io/networks-status: {}
    manager: multus
    operation: Update
    subresource: status
    time: '2024-01-23T10:00:55Z'
  name: qe-app-registry-cwsvf
  namespace: openshift-marketplace
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: 'false'
    controller: 'false'
    kind: CatalogSource
    name: qe-app-registry
    uid: 632372ee-c42f-40b8-9da9-dc57097cf4ec
  resourceVersion: '55526'
  uid: de223db3-6a3c-42f8-9c58-e80c8b9837de
spec:
  containers:
  - image: upshift.mirror-registry.qe.devcluster.openshift.com:6001/openshift-qe-optional-operators/aosqe-index:v4.12
    imagePullPolicy: Always
    livenessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=:50051
      failureThreshold: '3'
      initialDelaySeconds: '10'
      periodSeconds: '10'
      successThreshold: '1'
      timeoutSeconds: '5'
    name: registry-server
    ports:
    - containerPort: '50051'
      name: grpc
      protocol: TCP
    readinessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=:50051
      failureThreshold: '3'
      initialDelaySeconds: '5'
      periodSeconds: '10'
      successThreshold: '1'
      timeoutSeconds: '5'
    resources:
      requests:
        cpu: 10m
        memory: 50Mi
    securityContext:
      capabilities:
        drop:
        - MKNOD
      readOnlyRootFilesystem: 'false'
    startupProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=:50051
      failureThreshold: '10'
      periodSeconds: '10'
      successThreshold: '1'
      timeoutSeconds: '5'
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-d8vvx
      readOnly: 'true'
  dnsPolicy: ClusterFirst
  enableServiceLinks: 'true'
  imagePullSecrets:
  - name: qe-app-registry-dockercfg-nkr5j
  nodeName: maxu-47686-b7jsh-worker-0
  nodeSelector:
    kubernetes.io/os: linux
  preemptionPolicy: PreemptLowerPriority
  priority: '0'
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    seLinuxOptions:
      level: s0:c16,c5
  serviceAccount: qe-app-registry
  serviceAccountName: qe-app-registry
  terminationGracePeriodSeconds: '30'
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: '300'
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: '300'
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  volumes:
  - name: kube-api-access-d8vvx
    projected:
      defaultMode: '420'
      sources:
      - serviceAccountToken:
          expirationSeconds: '3607'
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
      - configMap:
          items:
          - key: service-ca.crt
            path: service-ca.crt
          name: openshift-service-ca.crt
status:
  conditions:
  - lastProbeTime: 'null'
    lastTransitionTime: '2024-01-23T10:00:53Z'
    status: 'True'
    type: Initialized
  - lastProbeTime: 'null'
    lastTransitionTime: '2024-01-23T10:00:53Z'
    message: 'containers with unready status: [registry-server]'
    reason: ContainersNotReady
    status: 'False'
    type: Ready
  - lastProbeTime: 'null'
    lastTransitionTime: '2024-01-23T10:00:53Z'
    message: 'containers with unready status: [registry-server]'
    reason: ContainersNotReady
    status: 'False'
    type: ContainersReady
  - lastProbeTime: 'null'
    lastTransitionTime: '2024-01-23T10:00:53Z'
    status: 'True'
    type: PodScheduled
  containerStatuses:
  - image: upshift.mirror-registry.qe.devcluster.openshift.com:6001/openshift-qe-optional-operators/aosqe-index:v4.12
    imageID: ''
    lastState: {}
    name: registry-server
    ready: 'false'
    restartCount: '0'
    started: 'false'
    state:
      waiting:
        reason: ContainerCreating
  hostIP:
  phase: Pending
  qosClass: Burstable
  startTime: '2024-01-23T10:00:53Z'
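The linked issues describe the problem as pods with no 'controller: true' ownerReferences, and the pod dump above shows a single CatalogSource owner with controller set to 'false'. The following is a minimal sketch of how that property could be checked on a parsed pod manifest. The helper names `_is_true` and `has_controller_owner` are hypothetical, and the string handling assumes that omg prints booleans as quoted strings, as it does in the dump above.

```python
from typing import Any, Dict


def _is_true(value: Any) -> bool:
    """Accept real booleans as well as the quoted 'true'/'false' strings seen in the omg output."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() == "true"


def has_controller_owner(pod: Dict[str, Any]) -> bool:
    """Return True if any ownerReference of the pod is marked controller: true."""
    owners = pod.get("metadata", {}).get("ownerReferences") or []
    return any(_is_true(ref.get("controller", False)) for ref in owners)


# Relevant fields copied from the qe-app-registry-cwsvf dump above.
POD = {
    "metadata": {
        "name": "qe-app-registry-cwsvf",
        "namespace": "openshift-marketplace",
        "ownerReferences": [
            {
                "apiVersion": "operators.coreos.com/v1alpha1",
                "blockOwnerDeletion": "false",
                "controller": "false",
                "kind": "CatalogSource",
                "name": "qe-app-registry",
                "uid": "632372ee-c42f-40b8-9da9-dc57097cf4ec",
            }
        ],
    }
}

if __name__ == "__main__":
    print(has_controller_owner(POD))  # prints False
```

A pod that this check reports as False has an owner but no controlling owner, which is the condition named in the linked issues.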
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
You can trigger this job: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-upgrade/job/upgrade-pipeline/47686/consoleFull
1. Build a 4.12.47 cluster.
2. Upgrade it to 4.13.30.
Actual results:
The upgrade failed.
jiazha-mac:~ jiazha$ omg get pods -o wide
NAME                                   READY  STATUS   RESTARTS  AGE    IP  NODE
marketplace-operator-845b865dbd-qhd6d  0/1    Running  0         1h13m      maxu-47686-b7jsh-master-1
qe-app-registry-6cmvf                  0/1    Pending  0         2h33m      maxu-47686-b7jsh-worker-0
qe-app-registry-cwsvf                  0/1    Pending  0         2h17m      maxu-47686-b7jsh-worker-0

jiazha-mac:~ jiazha$ omg get nodes
NAME                        STATUS                    ROLES                 AGE    VERSION
maxu-47686-b7jsh-rhel-1     Ready                     worker                2h20m  v1.25.16+6df2177
maxu-47686-b7jsh-rhel-0     Ready                     worker                2h20m  v1.25.16+6df2177
maxu-47686-b7jsh-worker-2   Ready                     worker                2h51m  v1.25.16+5c97f5b
maxu-47686-b7jsh-master-0   Ready                     control-plane,master  3h14m  v1.25.16+5c97f5b
maxu-47686-b7jsh-master-1   Ready                     control-plane,master  3h10m  v1.25.16+5c97f5b
maxu-47686-b7jsh-master-2   Ready                     control-plane,master  2h56m  v1.25.16+5c97f5b
maxu-47686-b7jsh-worker-1   Ready                     worker                2h51m  v1.25.16+5c97f5b
maxu-47686-b7jsh-worker-0   Ready,SchedulingDisabled  worker                2h51m  v1.25.16+5c97f5b

jiazha-mac:~ jiazha$ omg get mcp
NAME    CONFIG                                            UPDATED  UPDATING  DEGRADED  MACHINECOUNT  READYMACHINECOUNT  UPDATEDMACHINECOUNT  DEGRADEDMACHINECOUNT  AGE
worker  rendered-worker-f88335e9afd20564c05e8b0cd4573df2  False    True      True      5             0                  0                    1                     3h10m
master  rendered-master-1023ac264533c5f1926448ec0a816c28  True     False     False     3             3                  3                    0                     3h10m

jiazha-mac:~ jiazha$ omg get co machine-config
NAME            VERSION  AVAILABLE  PROGRESSING  DEGRADED  SINCE
machine-config  4.12.47  True       False        True      3h7m
Expected results:
The cluster can be updated successfully.
Additional info:
The must-gather log link: https://drive.google.com/file/d/1BRJPwc8YAtVh0x6PD4wyB5TPzl7qdVHS/view?usp=drive_link
- duplicates
OCPBUGS-28229 openshift-marketplace pods with no 'controller: true' ownerReferences
- Closed
- is blocked by
OPRUN-3204 Impact statement request for OCPBUGS-27826 MCO failed to drain the node due to the custom catalog source pod with no 'controller: true' ownerReferences
- Closed
- is caused by
OCPBUGS-7431 openshift-marketplace pods with no 'controller: true' ownerReferences
- Closed