Type: Bug
Resolution: Won't Do
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.12
Component/s: HyperShift
Labels:

Severity: Moderate
Regression: No
Epic Link: SDE-2908
Sprint: Hypershift Sprint 242
sprint_count: 1
Release Blocker: Rejected
Blocked: False
Blocked Reason: None
Target Version: 4.14

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

Even after a hostedcontrolplane is successfully deleted, some PDBs can remain behind:

❯ k get pdb -n ocm-production-23eq513k328ldsivorgobo1vbu9j670k-mhs-hyper 
NAME                                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
aws-ebs-csi-driver-controller-pdb   N/A             1                 0                     4d1h
csi-snapshot-controller-pdb         N/A             1                 0                     4d1h
csi-snapshot-webhook-pdb            N/A             1                 0                     4d1h
ovn-raft-quorum-guard               2               N/A               0                     4d1h
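A listing like the one above can be checked mechanically. The following is a minimal sketch, not part of HyperShift: the function name `blocking_pdbs` is hypothetical, and it assumes the default `kubectl get pdb` column layout shown above, where no value contains a space.

```python
def blocking_pdbs(table: str) -> list[str]:
    """Return the names of PDBs whose ALLOWED DISRUPTIONS column is 0.

    Assumes the default `kubectl get pdb` text layout:
    NAME, MIN AVAILABLE, MAX UNAVAILABLE, ALLOWED DISRUPTIONS, AGE.
    """
    rows = [line for line in table.splitlines() if line.strip()]
    names = []
    for row in rows[1:]:  # the first non-empty line is the header
        fields = row.split()
        # Data rows have exactly one token per column, so index 3 is
        # ALLOWED DISRUPTIONS even though the header contains spaces.
        if len(fields) >= 5 and fields[3] == "0":
            names.append(fields[0])
    return names


# The PDB listing from this report.
PDB_TABLE = """\
NAME                                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
aws-ebs-csi-driver-controller-pdb   N/A             1                 0                     4d1h
csi-snapshot-controller-pdb         N/A             1                 0                     4d1h
csi-snapshot-webhook-pdb            N/A             1                 0                     4d1h
ovn-raft-quorum-guard               2               N/A               0                     4d1h
"""

for name in blocking_pdbs(PDB_TABLE):
    print(name)
```

All four PDBs in the listing report zero allowed disruptions, so an eviction of any pod they select is refused.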

Sometimes this happens because the uninstall gets stuck on another component, for example an awsendpointservice, as in OCPBUGS-13056. The leftover PDBs cause upgrades and MachineConfig rollouts on the management cluster to fail, because the machine-config-operator cannot drain nodes that are running the pods these PDBs protect. The pods in the hostedcontrolplane namespace are in the following state:

❯ k get po -n ocm-production-23eq513k328ldsivorgobo1vbu9j670k-mhs-hyper                                      
NAME                                                     READY   STATUS              RESTARTS   AGE
audit-webhook-585fbf86d6-mmjx4                           0/2     Init:0/1            0          5h40m
audit-webhook-585fbf86d6-q2rxl                           0/2     Init:0/1            0          9h
audit-webhook-7b56b4ddbc-46rdh                           0/2     Init:0/1            0          22h
aws-ebs-csi-driver-controller-bcb6bf86-k6vd5             0/7     ContainerCreating   0          9h
aws-ebs-csi-driver-operator-5665898f58-f4bg8             0/1     ContainerCreating   0          9h
capi-provider-5d77d577c4-bdbxw                           0/2     Init:0/1            0          5h40m
cloud-network-config-controller-76f4ccf965-hjhrj         0/3     Init:0/1            0          5h40m
cluster-api-fc4f9579-zcws2                               1/1     Running             0          5h41m
control-plane-operator-598ffd9696-kssb2                  2/2     Running             0          5h40m
csi-snapshot-controller-7d89bf444-khmbp                  0/1     ContainerCreating   0          9h
csi-snapshot-webhook-8c945d4f9-dvmpw                     0/1     ContainerCreating   0          9h
metrics-forwarder-deployment-65db577ff6-76jgl            1/1     Running             0          3h46m
metrics-forwarder-secret-ensurer-28055191-zdqfk          0/1     Completed           0          2m6s
metrics-forwarder-secret-ensurer-28055192-gwqh2          0/1     Completed           0          66s
metrics-forwarder-secret-ensurer-28055193-27n9k          0/1     Completed           0          6s
multus-admission-controller-7b6db8b49d-8ls5t             0/2     Init:0/1            0          5h40m
multus-admission-controller-7b6db8b49d-kcsgw             0/2     Init:0/1            0          5h40m
ovnkube-master-0                                         0/7     Init:0/1            0          3h45m
ovnkube-master-1                                         0/7     Init:0/1            0          9h
ovnkube-master-2                                         5/7     Running             0          4d1h
package-operator-remote-phase-manager-7f54bf6c68-2sfnh   0/1     ContainerCreating   0          3h46m
validating-webhook-patching-job-28055130-db729           0/1     Completed           0          63m
validating-webhook-patching-job-28055160-88drq           0/1     Completed           0          33m
validating-webhook-patching-job-28055190-hwcw6           0/1     Completed           0          3m6s
validation-webhook-6bb4c7ddc5-qk8sf                      1/1     Running             0          3d2h
validation-webhook-6bb4c7ddc5-rrvh5                      1/1     Running             0          5h40m
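To pick out the pods that are not fully ready from a listing like this, a small parser is enough. This is a sketch only: `not_ready_pods` is a hypothetical helper, it assumes the default `kubectl get po` column layout, and the sample is an excerpt of the listing above rather than the full output.

```python
def not_ready_pods(table: str) -> list[str]:
    """Return names of pods that are not Completed and whose READY
    column shows fewer ready containers than the total."""
    names = []
    for row in table.splitlines()[1:]:  # skip the header line
        fields = row.split()
        if len(fields) < 5:
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        if status == "Completed":
            continue  # finished job pods are expected to show 0/1
        have, want = ready.split("/")
        if have != want:
            names.append(name)
    return names


# Excerpt of the pod listing from this report.
POD_TABLE = """\
NAME                                              READY   STATUS              RESTARTS   AGE
audit-webhook-585fbf86d6-mmjx4                    0/2     Init:0/1            0          5h40m
aws-ebs-csi-driver-controller-bcb6bf86-k6vd5      0/7     ContainerCreating   0          9h
cluster-api-fc4f9579-zcws2                        1/1     Running             0          5h41m
ovnkube-master-2                                  5/7     Running             0          4d1h
metrics-forwarder-secret-ensurer-28055191-zdqfk   0/1     Completed           0          2m6s
"""

for name in not_ready_pods(POD_TABLE):
    print(name)
```

Note that `ovnkube-master-2` is reported even though its status is `Running`, because only 5 of its 7 containers are ready.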

Version-Release number of selected component (if applicable):

Observed throughout 4.12 hosted clusters

How reproducible:

100% reproducible when a hostedcontrolplane is deleted but other resources in its namespace remain

Steps to Reproduce:

1. Allow a hostedcontrolplane CR to completely delete
2. Cause the uninstall to fail due to an awsendpointservice that won't delete
3. In the meantime, the hostedcontrolplane namespace still contains leftover pods and PDBs, which prevent the management cluster from completing MachineConfig rollouts
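Why the leftover PDBs in step 3 block drains can be illustrated with a simplified model of how Kubernetes derives a PDB's allowed disruptions. This is a sketch under stated assumptions: the function `allowed_disruptions` is hypothetical, it handles integer `minAvailable` / `maxUnavailable` values only (not percentages), and the pod counts in the usage lines are read off the listings in this report rather than taken from the PDB status.

```python
from typing import Optional


def allowed_disruptions(
    expected_pods: int,
    healthy_pods: int,
    min_available: Optional[int] = None,
    max_unavailable: Optional[int] = None,
) -> int:
    """Simplified model of a PDB's ALLOWED DISRUPTIONS value.

    Exactly one of min_available or max_unavailable must be set,
    mirroring the PDB spec. Integer values only.
    """
    if (min_available is None) == (max_unavailable is None):
        raise ValueError("set exactly one of min_available or max_unavailable")
    if min_available is not None:
        desired_healthy = min_available
    else:
        desired_healthy = max(expected_pods - max_unavailable, 0)
    # Evictions are allowed only while healthy pods exceed the desired floor.
    return max(healthy_pods - desired_healthy, 0)


# ovn-raft-quorum-guard: minAvailable=2, three ovnkube-master pods, none fully ready.
print(allowed_disruptions(expected_pods=3, healthy_pods=0, min_available=2))

# csi-snapshot-controller-pdb: maxUnavailable=1, one pod stuck in ContainerCreating.
print(allowed_disruptions(expected_pods=1, healthy_pods=0, max_unavailable=1))

# For contrast, a healthy three-replica workload with minAvailable=2.
print(allowed_disruptions(expected_pods=3, healthy_pods=3, min_available=2))
```

In this model, a PDB whose pods never become ready stays at zero allowed disruptions, so every eviction attempt during a node drain is refused until the PDB or the pods are removed.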

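Actual results:

PDBs and non-ready pods remain in the hostedcontrolplane namespace, and node drains on the management cluster are blocked.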

Expected results:

When a hostedcontrolplane is deleted, all of its PDBs are cleaned up early in the teardown, so that they cannot outlive a stalled uninstall.

Additional info:

This bug doesn't impact the management clusters if every hostedcluster/hostedcontrolplane deletion eventually cleans up the hostedcontrolplane namespace. I understand that if we have high confidence the uninstall bugs are ironed out, this bug may not present an issue by itself. However, we are still in a state where new uninstall issues crop up fairly regularly.

is related to

OCPBUGS-13056 awsendpointservice stuck deleting due to invalid STS token

Closed

Assignee: Alberto Garcia Lamela
Reporter: Michael Shen
QA Contact: Jie Zhao
Votes: 0
Watchers: 7
Created: 2023/05/05 6:38 PM
Updated: 2023/09/05 2:11 PM
Resolved: 2023/09/05 2:11 PM
