- Bug
- Resolution: Unresolved
- Critical
- None
- odf-4.16
- None
Description of problem:
PodDisruptionBudgetAtLimit errors are reported and the Ceph cluster is unhealthy.
Version-Release number of selected component (if applicable):
rook-ceph-operator-stable-4.16-odf-catalogsource-openshift-marketplace
rook-ceph-operator.v4.16.0-118.stable
Installed version (ClusterServiceVersion/CSV): odf-operator.v4.16.0-118.stable
Starting version: odf-operator.v4.16.0-94.stable
Ceph container image: registry.redhat.io/rhceph/rhceph-7-rhel9@sha256:17e899c9c4f2f64bc7acea361446a64927b829d6766e6dde42f8d0336b9125a4
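To confirm the installed versions and the Ceph image actually in use on the managed cluster, something like the following can be run (a sketch; it assumes the default openshift-storage namespace):

# List the installed operator CSVs and their versions
oc get csv -n openshift-storage

# Show the Ceph image used by the running OSD pods
oc get pods -n openshift-storage -l app=rook-ceph-osd \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'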
How reproducible:
Ongoing; the unhealthy status is reported continuously.
Steps to Reproduce:
1. I have a Regional DR OpenShift Virtualization managed cluster as part of the Regional DR environment.
2. Following an ODF upgrade, the MCO operator reconciles the VeleroNamespaceSecretKeyRef and CACertificates fields, as reported in bz https://bugzilla.redhat.com/show_bug.cgi?id=2277941
3. I reconfigured the CACertificates.
4. After this I noticed that the ceph cluster was reported as not healthy, with PodDisruptionBudgetAtLimit errors:
PodDisruptionBudgetAtLimit
Jun 7, 2024, 11:09 PM
The pod disruption budget is below the minimum disruptions allowed level and is not satisfied. The number of current healthy pods is less than the desired healthy pods.
Summary
The pod disruption budget registers an insufficient number of pods.
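To confirm the alert, the PodDisruptionBudgets and the Ceph health can be inspected directly. The commands below are a sketch; they assume the default openshift-storage namespace and that the rook-ceph-tools toolbox pod is enabled on the StorageCluster:

# Show the Ceph PodDisruptionBudgets and how many disruptions are currently allowed
oc get pdb -n openshift-storage

# Query Ceph health from the toolbox pod
TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name | head -n1)
oc rsh -n openshift-storage "$TOOLS_POD" ceph status
oc rsh -n openshift-storage "$TOOLS_POD" ceph health detail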
Actual results:
The Ceph cluster is reported as not healthy.
Expected results:
The Ceph cluster should be healthy.
Additional info:
Data Foundation events reported:
failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd-virtualization": rpc error: code = DeadlineExceeded desc = context deadline exceeded
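The DeadlineExceeded provisioning failure can usually be traced in the RBD CSI provisioner logs. The deployment and container names below are the ODF defaults and may differ by release:

# Check the RBD CSI provisioner for stuck CreateVolume calls
oc logs -n openshift-storage deploy/csi-rbdplugin-provisioner -c csi-provisioner --tail=100

# List PVCs that are stuck Pending across namespaces
oc get pvc -A | grep Pending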
Ceph OSD pods are in CrashLoopBackOff (CLBO) status; log from a crashing OSD pod:
+ child_pid=
+ sigterm_received=false
+ trap sigterm SIGTERM
+ child_pid=1035934
+ wait 1035934
+ ceph-osd -foreground --id 2 --fsid c48929dd-4981-4e8a-b7b1-03751eb8eba3 --setuser ceph --setgroup ceph 'crush-location=root=default host=rdr-c2-gxmhx-worker-0-blw77 zone=nova' --osd-op-num-threads-per-shard=2 --osd-op-num-shards=8 --osd-recovery-sleep=0 --osd-snap-trim-sleep=0 --osd-delete-sleep=0 --bluestore-min-alloc-size=4096 --bluestore-prefer-deferred-size=0 --bluestore-compression-min-blob-size=8192 --bluestore-compression-max-blob-size=65536 --bluestore-max-blob-size=65536 --bluestore-cache-size=3221225472 --bluestore-throttle-cost-per-io=4000 --bluestore-deferred-batch-ops=16 --default-log-to-stderr=true --default-err-to-stderr=true --default-mon-cluster-log-to-stderr=true '-default-log-stderr-prefix=debug ' --default-log-to-file=false --default-mon-cluster-log-to-file=false --ms-learn-addr-from-peer=false --public-addr=242.1.255.251 --public-bind-addr=10.131.0.41 --cluster-addr=10.131.0.41
debug 2024-06-19T10:22:41.500+0000 7f9c87c007c0 0 monclient(hunting): authenticate timed out after 300
failed to fetch mon config (--no-mon-config to skip)
+ wait 1035934
+ ceph_osd_rc=1
+ '[' 1 -eq 0 ']'
+ exit 1
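The "authenticate timed out" message indicates the OSD cannot reach the Ceph monitors on its public address (242.1.255.251, which looks like a Submariner Globalnet address). As a starting point, the mon and OSD pods and the mon endpoints the OSDs are configured to use can be checked (a sketch, again assuming the default openshift-storage namespace):

# Check whether the mon pods are running and which node/IP they are on
oc get pods -n openshift-storage -l app=rook-ceph-mon -o wide

# Check the crash-looping OSD pods
oc get pods -n openshift-storage -l app=rook-ceph-osd -o wide

# The mon endpoints the OSDs try to reach are recorded in this ConfigMap
oc get configmap rook-ceph-mon-endpoints -n openshift-storage -o yaml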