  Machine Config Operator / MCO-1216

PodDisruptionBudget (PDB) causes the Machine Config Operator (MCO) to become degraded during OCP 4 upgrades

      • Red Hat OpenShift Container Platform 4.x
      • We frequently encounter a persistent issue that significantly delays upgrades whenever a PodDisruptionBudget (PDB) is configured in the cluster.
      • When a PDB is configured, the Machine Config Operator (MCO) often cannot drain the node, because evicting the protected pods would violate the PDB and the eviction requests are rejected. The stalled drain then holds up the rest of the upgrade.
      • The MCO retries the drain repeatedly, sometimes for 3 to 4 hours, before eventually reporting the error 'mcp degraded because it is unable to drain the node.' This prolonged process is highly inefficient and frustrating.
      • Many of our customers hit this issue on every upgrade, which makes upgrading extremely time-consuming and labor-intensive for them and hurts their overall operational efficiency and satisfaction.
      • The current workarounds are less than ideal: either remove the PDB, which is not always feasible, or manually delete the pods when the drain fails, which is labor-intensive and prone to human error.
      • The exact error is visible only in the machine-config-daemon pod logs, so customers find it hard to diagnose and resolve the issue on their own. As a result, they open a support ticket each time it occurs, which adds to their frustration and to our support workload.
      • Pod health makes no difference: healthy or not, the pods block the upgrade.
      • I am raising this Jira to emphasize how critical this issue is and to request that it be treated as high priority. We need a robust solution in the MCO that lets it remove the pods autonomously and effectively during the upgrade, preventing these delays and improving the overall customer experience.
      • For more detail, the existing solution article https://access.redhat.com/solutions/4857671 already documents the exact diagnostics and resolution. Please review the attached cases: their number is very high, and each one is time-consuming to resolve.
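As a rough illustration of why a drain can retry for hours, the Python sketch below models the PDB arithmetic (allowed disruptions = currently healthy pods minus the desired healthy count) together with a simple eviction retry loop. `PDBModel`, `simulate_drain`, and `blocking_pdbs` are hypothetical helpers, not part of the MCO or any Red Hat tooling. The model assumes integer `minAvailable`/`maxUnavailable` values and ignores percentages and the PDB's `unhealthyPodEvictionPolicy` setting. `blocking_pdbs` only reads the `status.disruptionsAllowed` field from the JSON printed by `oc get pdb -A -o json`, which may help spot the blocking budget without digging through the machine-config-daemon logs.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class PDBModel:
    """Hypothetical, simplified PodDisruptionBudget: integer values only,
    no percentages, and unhealthyPodEvictionPolicy is ignored."""
    expected_pods: int                      # pods matched by the PDB selector
    current_healthy: int                    # how many of them are Ready
    min_available: Optional[int] = None
    max_unavailable: Optional[int] = None

    def desired_healthy(self) -> int:
        if self.min_available is not None:
            return self.min_available
        if self.max_unavailable is not None:
            return max(0, self.expected_pods - self.max_unavailable)
        raise ValueError("set min_available or max_unavailable")

    def disruptions_allowed(self) -> int:
        # While this is 0, the eviction API refuses requests (HTTP 429).
        return max(0, self.current_healthy - self.desired_healthy())


def simulate_drain(pdb: PDBModel, pods_on_node: int, max_attempts: int) -> bool:
    """Retry evictions like a drain loop; True if the node empties in time.
    Assumes each evicted pod's replacement becomes Ready elsewhere at once."""
    remaining = pods_on_node
    for _ in range(max_attempts):
        if remaining == 0:
            break
        if pdb.disruptions_allowed() > 0:
            remaining -= 1  # eviction accepted
        # otherwise the eviction is rejected and the loop simply retries
    return remaining == 0


def blocking_pdbs(pdb_list_json: str) -> list[str]:
    """Return "namespace/name" for PDBs whose status reports zero allowed
    disruptions, given the JSON printed by `oc get pdb -A -o json`."""
    blocked = []
    for item in json.loads(pdb_list_json).get("items", []):
        if item.get("status", {}).get("disruptionsAllowed", 0) == 0:
            meta = item["metadata"]
            blocked.append(f"{meta['namespace']}/{meta['name']}")
    return blocked
```

For example, a PDB with `minAvailable` equal to the replica count (say 2 of 2) always reports zero allowed disruptions, so `simulate_drain` never empties the node however many attempts it makes. Unhealthy pods lower `current_healthy`, so an unhealthy workload can exhaust the budget as well, which is consistent with the observation above that pod health makes no difference. Note that `blocking_pdbs` treats a PDB with no reported status yet as blocking.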

            Assignee: Unassigned
            Reporter: Vishvranjan Mishra (rhn-support-vismishr)