Data Foundation Bugs
DFBUGS-454

[2254162] Add the WA mentioned in comment https://bugzilla.redhat.com/show_bug.cgi?id=2249976#c22 as a workaround for the issue in the troubleshooting guide.


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • odf-4.13
    • Documentation

      +++ This bug was initially created as a clone of Bug #2249976 +++

      +++ This bug was initially created as a clone of Bug #2247731 +++

      Description of problem (please be as detailed as possible and provide log
      snippets):

      Version of all relevant components (if applicable):

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what is the user impact)?

      Is there any workaround available to the best of your knowledge?

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?

      Is this issue reproducible?

      Can this issue be reproduced from the UI?

      If this is a regression, please provide more details to justify this:

      Steps to Reproduce:
      1.
      2.
      3.

      Actual results:

      Expected results:

      Additional info:

      — Additional comment from Alexander on 2023-11-03 03:36:49 CET —

      Cluster is in FIPS mode
      ODF 4.13
      OCP 4.13

      The NooBaa BackingStore object becomes "Rejected" when the noobaa-core pod is rescheduled/restarted with a different pod IP. This results in a heartbeat RPC failure.

      The expectation is to have the operator update the secret with the updated IP address.

      However, when this secret is edited manually, the operator seeks an agent_conf.json file located within the PV through "/noobaa_init_files/noobaa_init.sh".

      This means the only way to fix it is to go into each BackingStore pod and edit the agent_conf.json on the PV.

      1. Is this expected behaviour for the NooBaa operator? It seems like an architectural issue, since the secret is not being prioritized.
      2. Why does the Secret not take precedence?

      Recommendation: /noobaa_init_files/noobaa_init.sh should check for an existing secret first, and the agent_conf.json in each PV should be updated from it.
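
      As a rough sketch of where the two copies of the connection details live (the openshift-storage namespace and the <backingstore_name>/<hash> placeholders are assumptions based on this thread), both can be inspected like this:

      ```
      # Operator-managed secret for the pv-pool backingstore
      oc -n openshift-storage get secret backing-store-pv-pool-<backingstore_name> -o yaml

      # Copy persisted on the PV, which /noobaa_init_files/noobaa_init.sh reads
      oc -n openshift-storage exec <backingstore_name>-noobaa-pod-<hash> -- \
        cat /noobaa_storage/agent_conf.json
      ```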

      ```
      time="2023-11-02T16:24:49Z" level=info msg="✅ Exists: NooBaa \"noobaa\"\n"
      time="2023-11-02T16:24:49Z" level=info msg="✅ Exists: BucketClass \"it-local-bucket-class-prd\"\n"
      time="2023-11-02T16:24:49Z" level=info msg="SetPhase: Verifying" bucketclass=openshift-storage/it-local-bucket-class-prd
      time="2023-11-02T16:24:49Z" level=info msg="✅ Exists: ServiceAccount \"default\"\n"
      time="2023-11-02T16:24:49Z" level=info msg="SetPhase: Rejected" backingstore=openshift-storage/it-local-backing-store-prd
      time="2023-11-02T16:24:49Z" level=info msg="✅ Exists: BackingStore \"it-local-backing-store-prd\"\n"
      time="2023-11-02T16:24:49Z" level=info msg="SetPhase: temporary error during phase \"Verifying\"" bucketclass=openshift-storage/it-local-bucket-class-prd
      time="2023-11-02T16:24:49Z" level=warning msg="⏳ Temporary Error: NooBaa BackingStore \"it-local-backing-store-prd\" is not yet ready" bucketclass=openshift-storage/it-local-bucket-class-prd
      time="2023-11-02T16:24:49Z" level=info msg="Update event detected for it-local-backing-store-prd (openshift-storage), queuing Reconcile"
      time="2023-11-02T16:24:49Z" level=info msg="checking which bucketclasses to reconcile. mapping backingstore openshift-storage/it-local-backing-store-prd to bucketclasses"
      time="2023-11-02T16:24:49Z" level=info msg="UpdateStatus: Done" backingstore=openshift-storage/it-local-backing-store-prd
      time="2023-11-02T16:24:49Z" level=info msg="Start BackingStore Reconcile ..." backingstore=openshift-storage/noobaa-default-backing-store
      ```

      — Additional comment from RHEL Program Management on 2023-11-03 03:36:59 CET —

      This bug having no release flag set previously, is now set with release flag 'odf‑4.14.0' to '?', and so is being proposed to be fixed at the ODF 4.14.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any previously set while release flag was missing, have now been reset since the Acks are to be set against a release flag.

      — Additional comment from Liran Mauda on 2023-11-05 11:35:18 CET —

      Adding Boris Ranto

      Hi Alexander,

      It seems that this is an issue with kubectl itself, and it is indeed related to FIPS.

      From the core logs we can see:

      ```
      Nov-2 16:20:57.043 [WebServer/38] [WARN] core.util.os_utils:: discover_k8s_services: could not list OpenShift routes: Error: Command failed: kubectl api-versions
      FIPS mode is enabled, but the required OpenSSL library is not available

      at ChildProcess.exithandler (node:child_process:422:12)
      at ChildProcess.emit (node:events:517:28)
      at ChildProcess.emit (node:domain:489:12)
      at maybeClose (node:internal/child_process:1098:16)
      at Socket.<anonymous> (node:internal/child_process:450:11)
      at Socket.emit (node:events:517:28)
      at Socket.emit (node:domain:489:12)
      at Pipe.<anonymous> (node:net:350:12) {
      code: 1,
      killed: false,
      signal: null,
      cmd: 'kubectl api-versions ',
      stdout: '',
      stderr: 'FIPS mode is enabled, but the required OpenSSL library is not available\n'
      }
      ```

      The flow in the code is:
      discover_k8s_services -> _list_openshift_routes -> kube_utils.api_exists('route.openshift.io') -> exec_kubectl(`api-versions`, 'raw') -> and then `kubectl`
      The `kubectl` call itself returns the above error, so we cannot list the OpenShift routes.

      We can see from the logs that the NooBaa upstream version is 5.13.4-d296296 (first line in the logs), which translates to downstream 4.13.4.

      There was a CVE that led us to bump the Node version and, with it, OpenSSL.

      From the logs we can clearly see that FIPS is detected, as well as the OpenSSL version:

      ```
      detect_fips_mode: found /proc/sys/crypto/fips_enabled with value 1
      OpenSSL 3.0.7 1 Nov 2022 setting up
      ```
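
      A quick way to confirm both of the above from inside the core pod (the pod name and namespace are taken from this thread):

      ```
      # Is the node running in FIPS mode? (same check the core performs)
      oc -n openshift-storage exec noobaa-core-0 -- cat /proc/sys/crypto/fips_enabled

      # Reproduce the failing call from the log above
      oc -n openshift-storage exec noobaa-core-0 -- kubectl api-versions
      ```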

      We should check whether the correct version of kubectl was also updated in the 4.13 downstream build.

      @branto do we use the latest 4.13 kubectl?

      Best Regards,
      Liran

      — Additional comment from Sunil Kumar Acharya on 2023-11-06 13:29:23 CET —

      Moving the non-blocker BZs out of ODF-4.14.0. If you think this is a blocker issue for ODF-4.14.0, feel free to propose it as a blocker with a justification note.

      — Additional comment from Boris Ranto on 2023-11-08 14:45:13 CET —

      OK, I believe I know what the issue is: we use a RHEL 8 binary in the RHEL 9 container.

      We are currently using the ose-cli containers to get the binary:

      https://gitlab.cee.redhat.com/ceph/rhodf/-/blob/rhodf-4.14-rhel-9/distgit/containers/mcg-core/Dockerfile.in?ref_type=heads#L96

      However, these containers are RHEL 8 only, no matter which OCP version you use.

      After some digging, I found the openshift-clients RPM that is built by OCP. This RPM contains both oc and kubectl binaries. After some more digging, I was able to find the content sets that have this RPM in them:

      rhocp-4_DOT_14-for-rhel-9-aarch64-rpms
      rhocp-4_DOT_14-for-rhel-9-s390x-rpms
      rhocp-4_DOT_14-for-rhel-9-ppc64le-rpms
      rhocp-4_DOT_14-for-rhel-9-x86_64-rpms

      I have made the changes in a private branch to test it out:

      https://pkgs.devel.redhat.com/cgit/containers/mcg-core/tree/?h=private-oc-clients

      The scratch build did pass and it installed the oc/kubectl binaries in the final container:

      https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=56829321

      Based on the above, I'm providing devel_ack here. We should also back-port this all the way down to ODF 4.13 (no need to back-port it to the earlier releases as those are based on RHEL 8).
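
      A minimal sketch of that Dockerfile change (illustrative only; the actual Dockerfile.in, base image, and repository wiring live in the downstream repo linked above):

      ```
      # Before (approach being replaced): copy the binary out of the RHEL 8 based ose-cli image
      # COPY --from=ose-cli /usr/bin/kubectl /usr/bin/kubectl

      # After: install matching RHEL 9 binaries from the rhocp-4.14-for-rhel-9 content set
      RUN dnf install -y openshift-clients && dnf clean all
      ```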

      — Additional comment from Liran Mauda on 2023-11-08 17:11:55 CET —

      Based on Boris's analysis, moving to the build team.

      — Additional comment from Boris Ranto on 2023-11-09 22:54:43 CET —

      This should fix this:

      https://gitlab.cee.redhat.com/ceph/rhodf/-/commit/8f29799cd6ee3c11c9bf6c930f30e05e44a9dc46

      This is technically a regression from 4.12, where FIPS mode worked fine: that release was based on RHEL 8, so we could use the RHEL 8-based ose-cli container to get the oc/kubectl binaries, and they matched the RHEL version of the final container. The issue exists from ODF 4.13 onwards, and we should back-port the fix all the way there.

      — Additional comment from RHEL Program Management on 2023-11-09 22:54:53 CET —

      This BZ is being approved for the ODF 4.15.0 release, upon receipt of the 3 ACKs (PM, Devel, QA) for the release flag 'odf-4.15.0'.

      — Additional comment from RHEL Program Management on 2023-11-09 22:54:53 CET —

      Since this bug has been approved for the ODF 4.15.0 release, through release flag 'odf-4.15.0+', the Target Release is being set to 'ODF 4.15.0'.

      — Additional comment from Bipin Kunal on 2023-11-15 07:21:21 CET —

      Hi Alexander,

      We have a probable fix, and we would like to get it tested in a non-production environment before we let the customer use it in production. Can you check with the customer whether they have a test environment where they can simulate the scenario/issue and test the fix? We will provide a new image for mcg-core and steps to use that image.

      -Bipin Kunal
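
      A hypothetical sketch of how such a test image could be consumed (the actual steps and image were to be supplied by engineering; the NOOBAA_CORE_IMAGE variable and image reference are assumptions):

      ```
      # Point the operator at the test core image; note that OLM/ocs-operator may
      # reconcile this back, in which case the CSV would need to be edited instead.
      oc -n openshift-storage set env deploy/noobaa-operator \
        NOOBAA_CORE_IMAGE=<test-mcg-core-image>
      ```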

      — Additional comment from Alexander on 2023-11-16 00:34:05 CET —

      Hi Bipin,

      Thanks for that progress update, that is very good news. Okay, getting on that now and will let you know!

      — Additional comment from RHEL Program Management on 2023-11-16 13:23:08 IST —

      This bug having no release flag set previously, is now set with release flag 'odf‑4.15.0' to '?', and so is being proposed to be fixed at the ODF 4.15.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any previously set while release flag was missing, have now been reset since the Acks are to be set against a release flag.

      — Additional comment from RHEL Program Management on 2023-11-16 13:23:08 IST —

      The 'Target Release' is not to be set manually at the Red Hat OpenShift Data Foundation product.

      The 'Target Release' will be auto set appropriately, after the 3 Acks (pm,devel,qa) are set to "+" for a specific release flag and that release flag gets auto set to "+".

      — Additional comment from RHEL Program Management on 2023-11-16 13:24:18 IST —

      The 'Target Release' is not to be set manually at the Red Hat OpenShift Data Foundation product.

      The 'Target Release' will be auto set appropriately, after the 3 Acks (pm,devel,qa) are set to "+" for a specific release flag and that release flag gets auto set to "+".

      — Additional comment from RHEL Program Management on 2023-11-16 13:25:06 IST —

      The 'Target Release' is not to be set manually at the Red Hat OpenShift Data Foundation product.

      The 'Target Release' will be auto set appropriately, after the 3 Acks (pm,devel,qa) are set to "+" for a specific release flag and that release flag gets auto set to "+".

      — Additional comment from Boris Ranto on 2023-11-16 13:26:05 IST —

      The fix is now back-ported to 4.13 branch:

      https://gitlab.cee.redhat.com/ceph/rhodf/-/commit/09ae499321bb3b826ce8aa93ccb683c07d167926

      — Additional comment from RHEL Program Management on 2023-11-16 13:51:21 IST —

      This BZ is being approved for an ODF 4.13.z z-stream update, upon receipt of the 3 ACKs (PM,Devel,QA) for the release flag 'odf‑4.13.z', and having been marked for an approved z-stream update

      — Additional comment from RHEL Program Management on 2023-11-16 13:51:21 IST —

      Since this bug has been approved for ODF 4.13.5 release, through release flag 'odf-4.13.z+', and appropriate update number entry at the 'Internal Whiteboard', the Target Release is being set to 'ODF 4.13.5'

      — Additional comment from errata-xmlrpc on 2023-11-17 09:46:01 IST —

      This bug has been added to advisory RHBA-2023:123491 by ceph-build service account (ceph-build@IPA.REDHAT.COM)

      — Additional comment from krishnaram Karthick on 2023-11-23 11:05:37 IST —

      verification job is running here -> https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/9799/

      — Additional comment from errata-xmlrpc on 2023-11-23 22:18:56 IST —

      Advisory RHBA-2023:123491 expected publication date changed from 2023-12-04 to 2023-12-13

      — Additional comment from Uday kurundwade on 2023-11-29 21:22:10 IST —

      Hi All,

      On version "4.13.5-10", I am still seeing that a backing store with type=pv-pool is in the Rejected state even after the upgrade.

      Following are the steps I followed to reproduce and verify this bug:
      1. Deploy an ODF cluster with FIPS enabled on version 4.13.3
      2. Verify the noobaa-core logs for FIPS warnings:

      ```
      Nov-29 12:44:33.411 [BGWorkers/34] [WARN] core.util.os_utils:: discover_k8s_services: could not list k8s services: Error: Command failed: kubectl get service --selector="app=noobaa" -o=json
      FIPS mode is enabled, but the required OpenSSL library is not available

      at ChildProcess.exithandler (node:child_process:419:12)
      at ChildProcess.emit (node:events:513:28)
      at ChildProcess.emit (node:domain:489:12)
      at maybeClose (node:internal/child_process:1091:16)
      at ChildProcess._handle.onexit (node:internal/child_process:302:5)

      { code: 1, killed: false, signal: null, cmd: 'kubectl get service --selector="app=noobaa" -o=json', stdout: '', stderr: 'FIPS mode is enabled, but the required OpenSSL library is not available\n' }
      ```

      3. Create a backing store with type=pv-pool (a manifest sketch follows these steps)
      4. Restart the "noobaa-core" pod to move the backing store to the "Rejected" state:

      ```
      ➜ ~ oc get backingstore
      NAME                           TYPE            PHASE      AGE
      noobaa-default-backing-store   s3-compatible   Ready      4h10m
      ud-pv-bs                       pv-pool         Rejected   88m
      ```

      5. Upgrade ODF to the latest version, i.e. 4.13.5-10
      6. Check for the FIPS warning and error: the FIPS warnings are gone from the noobaa-core pod
      7. Check whether the PV backing store returns to the Ready state:

      ```
      ➜ ~ oc get backingstore
      NAME                           TYPE            PHASE      AGE
      noobaa-default-backing-store   s3-compatible   Ready      4h50m
      ud-pv-bs                       pv-pool         Rejected   128m
      ➜ ~
      ```
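
      A minimal manifest sketch for step 3 (the field names are assumptions based on the NooBaa BackingStore CRD; the name and size are arbitrary):

      ```
      cat <<EOF | oc apply -f -
      apiVersion: noobaa.io/v1alpha1
      kind: BackingStore
      metadata:
        name: ud-pv-bs
        namespace: openshift-storage
      spec:
        type: pv-pool
        pvPool:
          numVolumes: 1
          resources:
            requests:
              storage: 50Gi
      EOF
      ```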

      I tried moving the agent pod, i.e. ud-pv-bs-noobaa-pod-efe98ab8, to a different node, but no luck.
      ```
      ➜ ~ oc get backingstore
      NAME                           TYPE            PHASE      AGE
      noobaa-default-backing-store   s3-compatible   Ready      4h50m
      ud-pv-bs                       pv-pool         Rejected   128m
      ➜ ~
      ➜ ~ oc get po -l app=noobaa -o wide
      NAME                               READY   STATUS    RESTARTS   AGE     IP            NODE                              NOMINATED NODE   READINESS GATES
      noobaa-core-0                      1/1     Running   0          67m     10.128.2.43   ukurundw-5-xjv2n-worker-0-cz4gn   <none>           <none>
      noobaa-db-pg-0                     1/1     Running   0          66m     10.131.0.15   ukurundw-5-xjv2n-worker-0-9t9jd   <none>           <none>
      noobaa-endpoint-869bb7b854-8wr5q   1/1     Running   0          68m     10.131.0.14   ukurundw-5-xjv2n-worker-0-9t9jd   <none>           <none>
      noobaa-operator-65cbc6b9b7-5lxh6   1/1     Running   0          68m     10.129.2.39   ukurundw-5-xjv2n-worker-0-2g5zc   <none>           <none>
      ud-pv-bs-noobaa-pod-efe98ab8       1/1     Running   0          53m     10.128.2.52   ukurundw-5-xjv2n-worker-0-cz4gn   <none>           <none>
      ➜ ~ oc adm cordon ukurundw-5-xjv2n-worker-0-cz4gn
      node/ukurundw-5-xjv2n-worker-0-cz4gn cordoned
      ➜ ~ oc delete pod ud-pv-bs-noobaa-pod-efe98ab8
      pod "ud-pv-bs-noobaa-pod-efe98ab8" deleted
      ➜ ~ oc get po -l app=noobaa -o wide
      NAME                               READY   STATUS    RESTARTS   AGE     IP            NODE                              NOMINATED NODE   READINESS GATES
      noobaa-core-0                      1/1     Running   0          72m     10.128.2.43   ukurundw-5-xjv2n-worker-0-cz4gn   <none>           <none>
      noobaa-db-pg-0                     1/1     Running   0          70m     10.131.0.15   ukurundw-5-xjv2n-worker-0-9t9jd   <none>           <none>
      noobaa-endpoint-869bb7b854-8wr5q   1/1     Running   0          72m     10.131.0.14   ukurundw-5-xjv2n-worker-0-9t9jd   <none>           <none>
      noobaa-operator-65cbc6b9b7-5lxh6   1/1     Running   0          73m     10.129.2.39   ukurundw-5-xjv2n-worker-0-2g5zc   <none>           <none>
      ud-pv-bs-noobaa-pod-efe98ab8       1/1     Running   0          2m27s   10.131.0.25   ukurundw-5-xjv2n-worker-0-9t9jd   <none>           <none>
      ➜ ~ oc get backingstore
      NAME                           TYPE            PHASE      AGE
      noobaa-default-backing-store   s3-compatible   Ready      5h28m
      ud-pv-bs                       pv-pool         Rejected   166m
      ➜ ~
      ```

      Let me know if I am missing any step here.

      Thanks
      Uday

      — Additional comment from Uday kurundwade on 2023-12-01 16:54:41 IST —

      Attaching the must-gather location here:
      http://qerepo-backup01.lab.eng.blr.redhat.com/OCS/uday/BZ_2249976/must-gather.local.11831108998813961/
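
      For reference, such a must-gather is typically collected with a command like the following (the image tag is an assumption; use the one matching the installed ODF version):

      ```
      oc adm must-gather --image=registry.redhat.io/odf4/ocs-must-gather-rhel8:v4.13
      ```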

      — Additional comment from Liran Mauda on 2023-12-03 15:44:50 IST —

      Hi Uday,
      Either the build did not pick up the proper kubectl, or the kubectl itself has an issue.
      So we need to verify that the kubectl is the correct version, or get OCP to fix kubectl.

      This is not a NooBaa code issue, so we cannot help; this is either a build or an OCP issue.

      Best Regards,
      Liran.
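
      A quick way to check that from the running cluster (the pod name follows this thread; the rpm query assumes kubectl was installed from the openshift-clients RPM as described earlier):

      ```
      oc -n openshift-storage exec noobaa-core-0 -- kubectl version --client
      oc -n openshift-storage exec noobaa-core-0 -- rpm -q openshift-clients
      ```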

      — Additional comment from Liran Mauda on 2023-12-03 15:49:02 IST —

      Boris, can you verify?

      — Additional comment from Bipin Kunal on 2023-12-04 11:50:43 IST —

      Liran,

      With the new build, we do not observe the FIPS warning, as indicated in step 6 of comment #11. So I feel we have the right kubectl, but I would be happy to understand why you think the right kubectl is not in use.

      Can you help me understand why you say that the backingstore getting into a Rejected state has nothing to do with the NooBaa code?

      -Bipin Kunal

      — Additional comment from Boris Ranto on 2023-12-04 14:51:41 IST —

      Yes, the warning is now gone; the build fix worked as expected there, and it turned out the FIPS warning was non-blocking anyway. The error seems to be caused by something else, so we now believe it is a NooBaa code issue.

      — Additional comment from Liran Mauda on 2023-12-05 11:45:08 IST —

      Hi Bipin,

      I misread comment #11

      Uday,

      I cannot access the must-gather.
      Can you provide a valid link?

      — Additional comment from Uday kurundwade on 2023-12-05 13:19:08 IST —

      Hi Liran,

      As per our Google Chat conversation, I have shared the cluster for debugging.

      Clearing the needinfo now.

      Thanks,
      Uday

      — Additional comment from Liran Mauda on 2023-12-05 17:03:02 IST —

      Hi,

      The steps to un-reject a pv-pool-based backingstore when the address of noobaa-mgmt has changed are (a consolidated command sketch follows this list):
      1. Delete the secret of the backingstore (backing-store-pv-pool-<backingstore_name>)
      2. Exec into the agent pod: kubectl exec -it pod/<backingstore_name>-noobaa-pod-<hash> -- bash
      3. From inside the pod, delete the agent_conf.json file: rm -rf /noobaa_storage/agent_conf.json
      4. Delete the agent pod: kubectl delete pod/<backingstore_name>-noobaa-pod-<hash>
      5. Wait a few minutes and verify that the backingstore is in the Ready state
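
      A consolidated command sketch of the same workaround (<backingstore_name> is a placeholder; the openshift-storage namespace is assumed):

      ```
      BS=<backingstore_name>
      NS=openshift-storage
      # Resolves to pod/<backingstore_name>-noobaa-pod-<hash>
      AGENT_POD=$(oc -n "$NS" get pods -o name | grep "${BS}-noobaa-pod")

      oc -n "$NS" delete secret "backing-store-pv-pool-${BS}"
      oc -n "$NS" exec "$AGENT_POD" -- rm -f /noobaa_storage/agent_conf.json
      oc -n "$NS" delete "$AGENT_POD"
      oc -n "$NS" get backingstore "$BS" -w   # wait until PHASE returns to Ready
      ```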

      Please let me know how it goes.

      Best Regards,
      Liran.

      — Additional comment from RHEL Program Management on 2023-12-07 18:13:19 IST —

      Since this bug has been approved for ODF 4.13.6 release, through release flag 'odf-4.13.z+', and appropriate update number entry at the 'Internal Whiteboard', the Target Release is being set to 'ODF 4.13.6'

      — Additional comment from errata-xmlrpc on 2023-12-08 01:11:04 IST —

      This bug has been dropped from advisory RHBA-2023:123491 by Boris Ranto (branto@redhat.com)

      — Additional comment from Liran Mauda on 2023-12-12 12:49:28 IST —

      Hi,

      We have found the root cause for this, and it was indeed the kubectl FIPS issue.
      On a FIPS cluster that was affected by the kubectl issue, NooBaa fell back to getting the IP from the open interface.
      If a pv-pool-based backingstore is created on such an affected version, the IP is written into a persistent file on the PVC.
      Then, even after upgrading to a fixed version, the steps above are needed to fix the issue.

      This only affects FIPS clusters that use pv-pool-based backingstores created while running the affected versions.
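
      Before applying the steps below, one way to confirm a cluster is affected is to compare the address persisted on the agent's PV with the current noobaa-mgmt service (the pod and service names follow this thread; the namespace is assumed):

      ```
      oc -n openshift-storage get svc noobaa-mgmt
      oc -n openshift-storage exec <backingstore_name>-noobaa-pod-<hash> -- \
        cat /noobaa_storage/agent_conf.json
      # If agent_conf.json still points at a stale address, apply the steps below.
      ```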

      The steps to un-reject a pv-pool-based backingstore when the address of noobaa-mgmt has changed are:
      1. Delete the secret of the backingstore (backing-store-pv-pool-<backingstore_name>)
      2. Exec into the agent pod: kubectl exec -it pod/<backingstore_name>-noobaa-pod-<hash> -- bash
      3. From inside the pod, delete the agent_conf.json file: rm -rf /noobaa_storage/agent_conf.json
      4. Delete the agent pod: kubectl delete pod/<backingstore_name>-noobaa-pod-<hash>
      5. Wait a few minutes and verify that the backingstore is in the Ready state

      Best Regards,
      Liran.

              asriram@redhat.com Anjana Sriram
              rhn-support-bkunal Bipin Kunal
              Neha Berry Neha Berry
              Votes: 0
              Watchers: 10
