Data Foundation Bugs / DFBUGS-505

[2293058] [Ceph] ceph cluster reported as not healthy - PodDisruptionBudgetAtLimit error reached


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Version: odf-4.16
    • Component: rook

      Description of problem:
      PodDisruptionBudgetAtLimit errors are reported and the Ceph cluster is unhealthy.

      Version-Release number of selected component (if applicable):
      Catalog source: rook-ceph-operator-stable-4.16-odf-catalogsource-openshift-marketplace
      Installed ClusterServiceVersion (CSV): odf-operator.v4.16.0-118.stable (rook-ceph-operator.v4.16.0-118.stable)
      Starting version: odf-operator.v4.16.0-94.stable
      Ceph image: registry.redhat.io/rhceph/rhceph-7-rhel9@sha256:17e899c9c4f2f64bc7acea361446a64927b829d6766e6dde42f8d0336b9125a4
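      For reference, the installed operator versions above can be cross-checked from the CLI. This is a generic sketch; it assumes ODF is installed in the default openshift-storage namespace.

      # List the installed ClusterServiceVersions and their phases.
      oc get csv -n openshift-storage

      # Show the version recorded in a specific CSV (name taken from the list above).
      oc get csv odf-operator.v4.16.0-118.stable -n openshift-storage -o jsonpath='{.spec.version}'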

      How reproducible:
      Ongoing status reported

      Steps to Reproduce:
      1. I have a Regional DR OpenShift Virtualization managed cluster as part of the Regional DR environment.
      2. Following an ODF upgrade, the MCO operator reconciles the VeleroNamespaceSecretKeyRef and CACertificates fields, as reported in https://bugzilla.redhat.com/show_bug.cgi?id=2277941.
      3. I reconfigured the CACertificates.
      4. After this, I noticed that the Ceph cluster was reported as not healthy, with PodDisruptionBudgetLimit errors:

      PodDisruptionBudgetLimit
      Jun 7, 2024, 11:09 PM
      The pod disruption budget is below the minimum disruptions allowed level and is not satisfied. The number of current healthy pods is less than the desired healthy pods.

      Summary
      The pod disruption budget registers an insufficient number of pods.
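      Per the alert text above, the budget is unsatisfied because fewer pods are healthy than the budget requires. A quick way to see which budget is exhausted is to list the PodDisruptionBudgets that Rook manages for the Ceph daemons. The sketch below assumes the default openshift-storage namespace, and the rook-ceph-osd PDB name is the usual Rook default rather than something confirmed from this cluster.

      # Show all PodDisruptionBudgets and their ALLOWED DISRUPTIONS.
      oc get pdb -n openshift-storage

      # Inspect the OSD budget in detail (current vs. desired healthy pods).
      oc describe pdb rook-ceph-osd -n openshift-storage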

      Actual results:
      The Ceph cluster is reported as not healthy.

      Expected results:
      The Ceph cluster should be healthy.

      Additional info:
      Data Foundation events reported:
      failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd-virtualization": rpc error: code = DeadlineExceeded desc = context deadline exceeded
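      The DeadlineExceeded provisioning failures are consistent with the Ceph cluster itself being unhealthy. As a hedged sketch (it assumes the Rook toolbox has been enabled and runs as the rook-ceph-tools deployment, the usual ODF default name), the cluster state can be inspected directly:

      # Overall cluster health, mon quorum, and OSD up/in counts.
      oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph status

      # Per-OSD view: which OSDs are down and where they live.
      oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph osd tree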

      Ceph OSD pods in CrashLoopBackOff (CLBO) status; startup log:
      + child_pid=
      + sigterm_received=false
      + trap sigterm SIGTERM
      + child_pid=1035934
      + wait 1035934
      + ceph-osd -foreground --id 2 --fsid c48929dd-4981-4e8a-b7b1-03751eb8eba3 --setuser ceph --setgroup ceph 'crush-location=root=default host=rdr-c2-gxmhx-worker-0-blw77 zone=nova' --osd-op-num-threads-per-shard=2 --osd-op-num-shards=8 --osd-recovery-sleep=0 --osd-snap-trim-sleep=0 --osd-delete-sleep=0 --bluestore-min-alloc-size=4096 --bluestore-prefer-deferred-size=0 --bluestore-compression-min-blob-size=8192 --bluestore-compression-max-blob-size=65536 --bluestore-max-blob-size=65536 --bluestore-cache-size=3221225472 --bluestore-throttle-cost-per-io=4000 --bluestore-deferred-batch-ops=16 --default-log-to-stderr=true --default-err-to-stderr=true --default-mon-cluster-log-to-stderr=true '-default-log-stderr-prefix=debug ' --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false --public-addr=242.1.255.251 --public-bind-addr=10.131.0.41 --cluster-addr=10.131.0.41
      debug 2024-06-19T10:22:41.500+0000 7f9c87c007c0 0 monclient(hunting): authenticate timed out after 300
      failed to fetch mon config (--no-mon-config to skip)
      + wait 1035934
      + ceph_osd_rc=1
      + '[' 1 -eq 0 ']'
      + exit 1
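      The failure mode in this log is the OSD timing out while authenticating to the monitors ("monclient(hunting): authenticate timed out"), i.e. it never reaches a mon from its public address (242.1.255.251, which looks like a Submariner Globalnet address in this Regional DR setup, though that is an inference from the log rather than a confirmed diagnosis). A minimal sketch for checking mon reachability, again assuming the default openshift-storage namespace:

      # List the monitor pods and the nodes/IPs they run on.
      oc -n openshift-storage get pods -l app=rook-ceph-mon -o wide

      # List the mon services; OSDs connect to these on ports 3300/6789.
      oc -n openshift-storage get svc -l app=rook-ceph-mon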

              sapillai Santosh Pillai
              kgoldbla Kevin Alon Goldblatt
              Neha Berry