OpenShift Bugs / OCPBUGS-27826

MCO failed to drain the node due to the custom catalog source pod with no 'controller: true' ownerReferences

      Description of problem:

      The cluster failed to upgrade from 4.12.47 to 4.13.30 because the MCO could not evict pods that lack a 'controller: true' ownerReference.
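
      For context, the drain that the MCO performs can be approximated manually with `oc adm drain`. This is only a sketch (the MCO runs the drain through its own controller and its internal options may differ), but it should surface the same blocking pods:

          # Sketch: approximate the MCO drain of the affected node.
          # Node name is taken from the output below; the per-attempt timeout
          # mirrors the 1m30s seen in the machine-config-controller logs.
          oc adm drain maxu-47686-b7jsh-worker-0 \
            --ignore-daemonsets \
            --delete-emptydir-data \
            --timeout=90s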

          jiazha-mac:~ jiazha$ omg get nodes 
      NAME                       STATUS                    ROLES                 AGE    VERSION
      maxu-47686-b7jsh-rhel-1    Ready                     worker                2h20m  v1.25.16+6df2177
      maxu-47686-b7jsh-rhel-0    Ready                     worker                2h20m  v1.25.16+6df2177
      maxu-47686-b7jsh-worker-2  Ready                     worker                2h51m  v1.25.16+5c97f5b
      maxu-47686-b7jsh-master-0  Ready                     control-plane,master  3h14m  v1.25.16+5c97f5b
      maxu-47686-b7jsh-master-1  Ready                     control-plane,master  3h10m  v1.25.16+5c97f5b
      maxu-47686-b7jsh-master-2  Ready                     control-plane,master  2h56m  v1.25.16+5c97f5b
      maxu-47686-b7jsh-worker-1  Ready                     worker                2h51m  v1.25.16+5c97f5b
      maxu-47686-b7jsh-worker-0  Ready,SchedulingDisabled  worker                2h51m  v1.25.16+5c97f5b
      jiazha-mac:~ jiazha$ omg get mcp
      NAME    CONFIG                                            UPDATED  UPDATING  DEGRADED  MACHINECOUNT  READYMACHINECOUNT  UPDATEDMACHINECOUNT  DEGRADEDMACHINECOUNT  AGE
      worker  rendered-worker-f88335e9afd20564c05e8b0cd4573df2  False    True      True      5             0                  0                    1                     3h10m
      master  rendered-master-1023ac264533c5f1926448ec0a816c28  True     False     False     3             3                  3                    0                     3h10m
      
      jiazha-mac:~ jiazha$ omg get co machine-config -o yaml
      apiVersion: config.openshift.io/v1
      kind: ClusterOperator
      ...
        extension:
          master: all 3 nodes are at latest configuration rendered-master-1023ac264533c5f1926448ec0a816c28
          worker: 'pool is degraded because nodes fail with "1 nodes are reporting degraded
            status on sync": "Node maxu-47686-b7jsh-worker-0 is reporting: \"failed to drain
            node: maxu-47686-b7jsh-worker-0 after 1 hour. Please see machine-config-controller
            logs for more information\""'
        relatedObjects:
      
      jiazha-mac:~ jiazha$ omg -n openshift-machine-config-operator logs machine-config-controller-74b57df9d6-2gfmp -c machine-config-controller |grep "Drain failed"
      2024-01-23T11:07:39.778225664Z I0123 11:07:39.778190       1 drain_controller.go:139] node maxu-47686-b7jsh-worker-0: Drain failed. Waiting 1 minute then retrying. Error message from drain: [error when waiting for pod "hello-pod" terminating: global timeout reached: 1m30s, error when waiting for pod "qe-app-registry-cwsvf" terminating: global timeout reached: 1m30s, error when waiting for pod "ocp-54745-pod-0" terminating: global timeout reached: 1m30s, error when waiting for pod "qe-app-registry-6cmvf" terminating: global timeout reached: 1m30s]
      ...
      2024-01-23T12:16:39.381322587Z I0123 12:16:39.380041       1 drain_controller.go:139] node maxu-47686-b7jsh-worker-0: Drain failed. Drain has been failing for more than 10 minutes. Waiting 5 minutes then retrying. Error message from drain: [error when waiting for pod "qe-app-registry-cwsvf" terminating: global timeout reached: 1m30s, error when waiting for pod "qe-app-registry-6cmvf" terminating: global timeout reached: 1m30s]

      In the end, only the two `qe-app-registry-xxx` pods failed to be evicted, as shown below:

      jiazha-mac:~ jiazha$ omg get pods -o wide -n openshift-marketplace 
      NAME                                   READY  STATUS   RESTARTS  AGE    IP           NODE
      marketplace-operator-845b865dbd-qhd6d  0/1    Running  0         1h13m  10.129.0.19  maxu-47686-b7jsh-master-1
      qe-app-registry-6cmvf                  0/1    Pending  0         2h33m               maxu-47686-b7jsh-worker-0
      qe-app-registry-cwsvf                  0/1    Pending  0         2h17m               maxu-47686-b7jsh-worker-0
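
      One way to cross-check which pods on the node lack a 'controller: true' ownerReference (a sketch, assuming `jq` is available locally; live API objects return the flag as a boolean):

          # List pods on the drained node whose ownerReferences contain no controller reference
          oc get pods -A --field-selector spec.nodeName=maxu-47686-b7jsh-worker-0 -o json \
            | jq -r '.items[]
                | select(([.metadata.ownerReferences[]? | select(.controller == true)] | length) == 0)
                | "\(.metadata.namespace)/\(.metadata.name)"'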
       
      jiazha-mac:~ jiazha$ omg get pods qe-app-registry-cwsvf -o yaml
      apiVersion: v1
      kind: Pod
      metadata:
        annotations:
          cluster-autoscaler.kubernetes.io/safe-to-evict: 'true'
          k8s.v1.cni.cncf.io/network-status: "[{\n    \"name\": \"openshift-sdn\",\n   \
            \ \"interface\": \"eth0\",\n    \"ips\": [\n        \"10.131.0.16\"\n    ],\n\
            \    \"default\": true,\n    \"dns\": {}\n}]"
          k8s.v1.cni.cncf.io/networks-status: "[{\n    \"name\": \"openshift-sdn\",\n  \
            \  \"interface\": \"eth0\",\n    \"ips\": [\n        \"10.131.0.16\"\n    ],\n\
            \    \"default\": true,\n    \"dns\": {}\n}]"
          kubectl.kubernetes.io/last-applied-configuration: '{"apiVersion":"operators.coreos.com/v1alpha1","kind":"CatalogSource","metadata":{"annotations":{},"name":"qe-app-registry","namespace":"openshift-marketplace"},"spec":{"displayName":"Production
            Operators","image":"upshift.mirror-registry.qe.devcluster.openshift.com:6001/openshift-qe-optional-operators/aosqe-index:v4.12","publisher":"OpenShift
            QE","sourceType":"grpc","updateStrategy":{"registryPoll":{"interval":"15m"}}}}
      
      
            '
          openshift.io/scc: anyuid
        creationTimestamp: '2024-01-23T10:00:53Z'
        deletionGracePeriodSeconds: '30'
        deletionTimestamp: '2024-01-23T10:23:01Z'
        generateName: qe-app-registry-
        labels:
          catalogsource.operators.coreos.com/update: qe-app-registry
          olm.catalogSource: ''
          olm.pod-spec-hash: 9b66974d5
        managedFields:
        - apiVersion: v1
          fieldsType: FieldsV1
          fieldsV1:
            f:metadata:
              f:annotations:
                .: {}
                f:cluster-autoscaler.kubernetes.io/safe-to-evict: {}
                f:kubectl.kubernetes.io/last-applied-configuration: {}
              f:generateName: {}
              f:labels:
                .: {}
                f:catalogsource.operators.coreos.com/update: {}
                f:olm.catalogSource: {}
                f:olm.pod-spec-hash: {}
              f:ownerReferences:
                .: {}
                k:{"uid":"632372ee-c42f-40b8-9da9-dc57097cf4ec"}: {}
            f:spec:
              f:containers:
                k:{"name":"registry-server"}:
                  .: {}
                  f:image: {}
                  f:imagePullPolicy: {}
                  f:livenessProbe:
                    .: {}
                    f:exec:
                      .: {}
                      f:command: {}
                    f:failureThreshold: {}
                    f:initialDelaySeconds: {}
                    f:periodSeconds: {}
                    f:successThreshold: {}
                    f:timeoutSeconds: {}
                  f:name: {}
                  f:ports:
                    .: {}
                    k:{"containerPort":50051,"protocol":"TCP"}:
                      .: {}
                      f:containerPort: {}
                      f:name: {}
                      f:protocol: {}
                  f:readinessProbe:
                    .: {}
                    f:exec:
                      .: {}
                      f:command: {}
                    f:failureThreshold: {}
                    f:initialDelaySeconds: {}
                    f:periodSeconds: {}
                    f:successThreshold: {}
                    f:timeoutSeconds: {}
                  f:resources:
                    .: {}
                    f:requests:
                      .: {}
                      f:cpu: {}
                      f:memory: {}
                  f:securityContext:
                    .: {}
                    f:readOnlyRootFilesystem: {}
                  f:startupProbe:
                    .: {}
                    f:exec:
                      .: {}
                      f:command: {}
                    f:failureThreshold: {}
                    f:periodSeconds: {}
                    f:successThreshold: {}
                    f:timeoutSeconds: {}
                  f:terminationMessagePath: {}
                  f:terminationMessagePolicy: {}
              f:dnsPolicy: {}
              f:enableServiceLinks: {}
              f:nodeSelector: {}
              f:restartPolicy: {}
              f:schedulerName: {}
              f:securityContext: {}
              f:serviceAccount: {}
              f:serviceAccountName: {}
              f:terminationGracePeriodSeconds: {}
          manager: catalog
          operation: Update
          time: '2024-01-23T10:00:53Z'
        - apiVersion: v1
          fieldsType: FieldsV1
          fieldsV1:
            f:status:
              f:conditions:
                k:{"type":"ContainersReady"}:
                  .: {}
                  f:lastProbeTime: {}
                  f:lastTransitionTime: {}
                  f:message: {}
                  f:reason: {}
                  f:status: {}
                  f:type: {}
                k:{"type":"Initialized"}:
                  .: {}
                  f:lastProbeTime: {}
                  f:lastTransitionTime: {}
                  f:status: {}
                  f:type: {}
                k:{"type":"Ready"}:
                  .: {}
                  f:lastProbeTime: {}
                  f:lastTransitionTime: {}
                  f:message: {}
                  f:reason: {}
                  f:status: {}
                  f:type: {}
              f:containerStatuses: {}
              f:hostIP: {}
              f:startTime: {}
          manager: kubelet
          operation: Update
          subresource: status
          time: '2024-01-23T10:00:53Z'
        - apiVersion: v1
          fieldsType: FieldsV1
          fieldsV1:
            f:metadata:
              f:annotations:
                f:k8s.v1.cni.cncf.io/network-status: {}
                f:k8s.v1.cni.cncf.io/networks-status: {}
          manager: multus
          operation: Update
          subresource: status
          time: '2024-01-23T10:00:55Z'
        name: qe-app-registry-cwsvf
        namespace: openshift-marketplace
        ownerReferences:
        - apiVersion: operators.coreos.com/v1alpha1
          blockOwnerDeletion: 'false'
          controller: 'false'
          kind: CatalogSource
          name: qe-app-registry
          uid: 632372ee-c42f-40b8-9da9-dc57097cf4ec
        resourceVersion: '55526'
        uid: de223db3-6a3c-42f8-9c58-e80c8b9837de
      spec:
        containers:
        - image: upshift.mirror-registry.qe.devcluster.openshift.com:6001/openshift-qe-optional-operators/aosqe-index:v4.12
          imagePullPolicy: Always
          livenessProbe:
            exec:
              command:
              - grpc_health_probe
              - -addr=:50051
            failureThreshold: '3'
            initialDelaySeconds: '10'
            periodSeconds: '10'
            successThreshold: '1'
            timeoutSeconds: '5'
          name: registry-server
          ports:
          - containerPort: '50051'
            name: grpc
            protocol: TCP
          readinessProbe:
            exec:
              command:
              - grpc_health_probe
              - -addr=:50051
            failureThreshold: '3'
            initialDelaySeconds: '5'
            periodSeconds: '10'
            successThreshold: '1'
            timeoutSeconds: '5'
          resources:
            requests:
              cpu: 10m
              memory: 50Mi
          securityContext:
            capabilities:
              drop:
              - MKNOD
            readOnlyRootFilesystem: 'false'
          startupProbe:
            exec:
              command:
              - grpc_health_probe
              - -addr=:50051
            failureThreshold: '10'
            periodSeconds: '10'
            successThreshold: '1'
            timeoutSeconds: '5'
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: FallbackToLogsOnError
          volumeMounts:
          - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
            name: kube-api-access-d8vvx
            readOnly: 'true'
        dnsPolicy: ClusterFirst
        enableServiceLinks: 'true'
        imagePullSecrets:
        - name: qe-app-registry-dockercfg-nkr5j
        nodeName: maxu-47686-b7jsh-worker-0
        nodeSelector:
          kubernetes.io/os: linux
        preemptionPolicy: PreemptLowerPriority
        priority: '0'
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext:
          seLinuxOptions:
            level: s0:c16,c5
        serviceAccount: qe-app-registry
        serviceAccountName: qe-app-registry
        terminationGracePeriodSeconds: '30'
        tolerations:
        - effect: NoExecute
          key: node.kubernetes.io/not-ready
          operator: Exists
          tolerationSeconds: '300'
        - effect: NoExecute
          key: node.kubernetes.io/unreachable
          operator: Exists
          tolerationSeconds: '300'
        - effect: NoSchedule
          key: node.kubernetes.io/memory-pressure
          operator: Exists
        volumes:
        - name: kube-api-access-d8vvx
          projected:
            defaultMode: '420'
            sources:
            - serviceAccountToken:
                expirationSeconds: '3607'
                path: token
            - configMap:
                items:
                - key: ca.crt
                  path: ca.crt
                name: kube-root-ca.crt
            - downwardAPI:
                items:
                - fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
                  path: namespace
            - configMap:
                items:
                - key: service-ca.crt
                  path: service-ca.crt
                name: openshift-service-ca.crt
      status:
        conditions:
        - lastProbeTime: 'null'
          lastTransitionTime: '2024-01-23T10:00:53Z'
          status: 'True'
          type: Initialized
        - lastProbeTime: 'null'
          lastTransitionTime: '2024-01-23T10:00:53Z'
          message: 'containers with unready status: [registry-server]'
          reason: ContainersNotReady
          status: 'False'
          type: Ready
        - lastProbeTime: 'null'
          lastTransitionTime: '2024-01-23T10:00:53Z'
          message: 'containers with unready status: [registry-server]'
          reason: ContainersNotReady
          status: 'False'
          type: ContainersReady
        - lastProbeTime: 'null'
          lastTransitionTime: '2024-01-23T10:00:53Z'
          status: 'True'
          type: PodScheduled
        containerStatuses:
        - image: upshift.mirror-registry.qe.devcluster.openshift.com:6001/openshift-qe-optional-operators/aosqe-index:v4.12
          imageID: ''
          lastState: {}
          name: registry-server
          ready: 'false'
          restartCount: '0'
          started: 'false'
          state:
            waiting:
              reason: ContainerCreating
        hostIP: 192.168.0.18
        phase: Pending
        qosClass: Burstable
        startTime: '2024-01-23T10:00:53Z'
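
      Note that the pod's only ownerReference (the qe-app-registry CatalogSource) is not a controller reference, which is what the summary points to. A quick way to confirm just that field, assuming the pod still exists when queried (sketch):

          # Print each ownerReference's kind/name and controller flag for the stuck pod
          oc -n openshift-marketplace get pod qe-app-registry-cwsvf \
            -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}/{.name} controller={.controller}{"\n"}{end}'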
      
      

      Version-Release number of selected component (if applicable):

          4.12.47

      How reproducible:

          often

      Steps to Reproduce:

      The failure can be reproduced by triggering this Jenkins job: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-upgrade/job/upgrade-pipeline/47686/consoleFull

          1. Build a 4.12.47 cluster.
          2. Upgrade it to 4.13.30 (see the sketch after this list).
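
      A minimal sketch of the upgrade step; the Jenkins pipeline may instead pin a specific payload image with --to-image, and the channel name here is an assumption:

          # Sketch of step 2
          oc adm upgrade channel stable-4.13
          oc adm upgrade --to=4.13.30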
          

      Actual results:

      The upgrade failed; the worker pool is degraded because maxu-47686-b7jsh-worker-0 cannot be drained:

      jiazha-mac:~ jiazha$ omg get pods -o wide 
      NAME                                   READY  STATUS   RESTARTS  AGE    IP           NODE
      marketplace-operator-845b865dbd-qhd6d  0/1    Running  0         1h13m  10.129.0.19  maxu-47686-b7jsh-master-1
      qe-app-registry-6cmvf                  0/1    Pending  0         2h33m               maxu-47686-b7jsh-worker-0
      qe-app-registry-cwsvf                  0/1    Pending  0         2h17m               maxu-47686-b7jsh-worker-0
      jiazha-mac:~ jiazha$ omg get nodes 
      NAME                       STATUS                    ROLES                 AGE    VERSION
      maxu-47686-b7jsh-rhel-1    Ready                     worker                2h20m  v1.25.16+6df2177
      maxu-47686-b7jsh-rhel-0    Ready                     worker                2h20m  v1.25.16+6df2177
      maxu-47686-b7jsh-worker-2  Ready                     worker                2h51m  v1.25.16+5c97f5b
      maxu-47686-b7jsh-master-0  Ready                     control-plane,master  3h14m  v1.25.16+5c97f5b
      maxu-47686-b7jsh-master-1  Ready                     control-plane,master  3h10m  v1.25.16+5c97f5b
      maxu-47686-b7jsh-master-2  Ready                     control-plane,master  2h56m  v1.25.16+5c97f5b
      maxu-47686-b7jsh-worker-1  Ready                     worker                2h51m  v1.25.16+5c97f5b
      maxu-47686-b7jsh-worker-0  Ready,SchedulingDisabled  worker                2h51m  v1.25.16+5c97f5b
      jiazha-mac:~ jiazha$ 
      jiazha-mac:~ jiazha$ omg get mcp 
      NAME    CONFIG                                            UPDATED  UPDATING  DEGRADED  MACHINECOUNT  READYMACHINECOUNT  UPDATEDMACHINECOUNT  DEGRADEDMACHINECOUNT  AGE
      worker  rendered-worker-f88335e9afd20564c05e8b0cd4573df2  False    True      True      5             0                  0                    1                     3h10m
      master  rendered-master-1023ac264533c5f1926448ec0a816c28  True     False     False     3             3                  3                    0                     3h10m
      jiazha-mac:~ jiazha$ omg get co machine-config
      NAME            VERSION  AVAILABLE  PROGRESSING  DEGRADED  SINCE
      machine-config  4.12.47  True       False        True      3h7m    
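
      The degraded reason can also be read directly from the worker MachineConfigPool conditions (a sketch, assuming the standard NodeDegraded condition type):

          # Print the NodeDegraded condition message from the worker pool
          oc get mcp worker \
            -o jsonpath='{.status.conditions[?(@.type=="NodeDegraded")].message}{"\n"}'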

      Expected results:

          The cluster upgrades to 4.13.30 successfully.

      Additional info:

      The must-gather logs: https://drive.google.com/file/d/1BRJPwc8YAtVh0x6PD4wyB5TPzl7qdVHS/view?usp=drive_link
