
[DFBUGS-153] [2284585] [MCG 4.16] NSFS Namespacestore appears Rejected even though it's functional


      Description of problem (please be as detailed as possible and provide log
      snippets):
      -----------------------------------------------------------------------
      Debugging a failed NSFS regression test in OCS-CI reveals that when we create an NSFS Namespacestore, it remains stuck in the Rejected phase with the following error in its YAML:
      ```
      ...
      status:
        conditions:
        - lastHeartbeatTime: "2024-06-03T12:17:14Z"
          lastTransitionTime: "2024-06-03T12:17:14Z"
          message: NamespaceStorePhaseRejected
          reason: 'Namespace store mode: STORAGE_NOT_EXIST'
          status: Unknown
          type: Available
      ```

      However, after pausing the test at this phase, I was still able to create the required MCG account that uses the namespacestore for its default buckets, and then create and write to and read from said bucket. It appears the Namespacestore is still functional despite the error message.
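      For quick checks, the phase and the rejection reason can be pulled out of the namespacestore JSON with jq. This is a minimal sketch against a sample document that mirrors the status quoted above; on a live cluster, `oc get namespacestore nsfs-nss -n openshift-storage -o json` would supply the real input:
      ```
      # Sample JSON mirroring the Rejected status quoted above (illustrative only).
      NSS_JSON='{"status":{"phase":"Rejected","conditions":[{"type":"Available","status":"Unknown","reason":"Namespace store mode: STORAGE_NOT_EXIST"}]}}'

      # Phase reported by the operator - prints: Rejected
      echo "$NSS_JSON" | jq -r '.status.phase'

      # Reason attached to the Available condition - prints: Namespace store mode: STORAGE_NOT_EXIST
      echo "$NSS_JSON" | jq -r '.status.conditions[] | select(.type=="Available") | .reason'
      ```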

      Version of all relevant components (if applicable):
      -----------------------------------------------------------------------
      OCP: 4.16.0-0.nightly-2024-06-02-000851
      ODF: 4.16.0-113
      ceph: 18.2.1-188.el9cp (b1ae9c989e2f41dcfec0e680c11d1d9465b1db0e) reef (stable)
      rook: v4.16.0-0.a2396a5186cc038b22154e857e0f7865e709d06a
      noobaa core: 5.16.0-03db21f
      noobaa operator: 5.16.0-705652b55ddaabc6bbdf16cb648c4f9a72345cf1

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what is the user impact)?
      -----------------------------------------------------------------------
      It fails the regression test, but otherwise the feature still seems functional.

      Is there any workaround available to the best of your knowledge?
      -----------------------------------------------------------------------
      No

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?
      -----------------------------------------------------------------------------
      3

      Is this issue reproducible?
      -----------------------------------------------------------------------------
      Yes

      Can this issue reproduce from the UI?
      -----------------------------------------------------------------------------
      N/A

      If this is a regression, please provide more details to justify this:
      -----------------------------------------------------------------------------
      The OCS-CI tests in https://github.com/red-hat-storage/ocs-ci/blob/master/tests/functional/object/mcg/test_nsfs.py had been passing before 4.16, and are now failing because the Namespacestore appears non-functional.

      Steps to Reproduce:
      ------------------------------------------------------------------------------
      1. Create an RWX CephFS PVC in the openshift-storage project:
      ```
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: my-pvc
        namespace: openshift-storage
      spec:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 25Gi
        storageClassName: ocs-storagecluster-cephfs
      ```

      2. Create a deployment that mounts the PVC at mount path "/nsfs":
      ```
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: nsfs-interface
        namespace: openshift-storage
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: nsfs-interface
        template:
          metadata:
            labels:
              app: nsfs-interface
          spec:
            containers:
            - command:
              - /bin/sh
              image: registry.access.redhat.com/ubi8/ubi:8.5-214
              imagePullPolicy: IfNotPresent
              name: ubi8
              stdin: true
              tty: true
              volumeMounts:
              - mountPath: /nsfs
                name: my-pvc
            securityContext:
              runAsUser: 1000620000
            volumes:
            - name: my-pvc
              persistentVolumeClaim:
                claimName: my-pvc
      ```

      3. Create an NSFS Namespacestore:
      ```
      apiVersion: noobaa.io/v1alpha1
      kind: NamespaceStore
      metadata:
        name: nsfs-nss
        namespace: openshift-storage
        labels:
          app: noobaa
        finalizers:
        - noobaa.io/finalizer
      spec:
        type: nsfs
        nsfs:
          pvcName: my-pvc
          subpath: "nsfs"
      ```

      4. Create a new MCG account that uses the NSFS NSS by default:
      ```
      noobaa account create nsfs-account --allow_bucket_create=True --default_resource nsfs-nss --gid 1234 --new_buckets_path / --nsfs_account_config=True --nsfs_only=False --uid 5678 -n openshift-storage
      ```

      5. Use its credentials to create a new NSFS bucket on the NSS via S3 and perform some I/O against it:
      ```
      ACC_NAME=nsfs-account
      S3_ENDPOINT=https://$(oc get route s3 -n openshift-storage -o json | jq -r '.status.ingress[0].host')
      S3_ACCESS_KEY=$(kubectl get secret noobaa-account-$ACC_NAME -n openshift-storage -o json | jq -r '.data.AWS_ACCESS_KEY_ID|@base64d')
      S3_SECRET_KEY=$(kubectl get secret noobaa-account-$ACC_NAME -n openshift-storage -o json | jq -r '.data.AWS_SECRET_ACCESS_KEY|@base64d')

      alias my_s3="AWS_ACCESS_KEY_ID=$S3_ACCESS_KEY AWS_SECRET_ACCESS_KEY=$S3_SECRET_KEY aws --endpoint $S3_ENDPOINT --no-verify-ssl s3"

      my_s3 mb s3://nsfs-bucket --region=us-east-2
      my_s3 sync test_objects/ s3://nsfs-bucket/
      my_s3 ls s3://nsfs-bucket/
      ```
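      The `@base64d` jq filter above decodes the base64-encoded values stored in the account secret. For reference, a minimal coreutils equivalent, shown on an illustrative value rather than a real credential:
      ```
      # Illustrative base64 string, not a real access key.
      ENCODED="QUtJQUVYQU1QTEU="

      # Decode it the way jq's @base64d does - prints: AKIAEXAMPLE
      echo "$ENCODED" | base64 -d
      ```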

      Actual results:
      ---------------------------------------------------------------
      The Namespacestore appears Rejected after its creation, but the bucket creation in the last step and the I/O against it work in spite of the NSS status.

      Expected results:
      ---------------------------------------------------------------
      The NSFS Namespacestore should reach the Ready phase shortly after its creation and remain there, and the bucket creation and I/O against it at the last step should work.

      Additional info:
      ---------------------------------------------------------------


            Nimrod Becker added a comment - We've had it in 4.17 and we didn't receive any customer cases on it. While we would still want to fix this, we have concerns about how exactly, and with no customer cases in the existing version in the field, this one is not blocking 4.18. Updating to 4.19.

            Data Foundation bot added a comment - Please backport the fix to ODF 4.18 and update the RNT (Release Note Type/Text) field appropriately.

            Nimrod Becker added a comment - rh-ee-aprinzse FYI (connected fixing PR https://github.com/noobaa/noobaa-core/pull/8474 )

            Data Foundation bot added a comment - FailedQA

            Sagi Hirshfeld added a comment - - edited

            I was still able to reproduce the issue with the following loop, which repeatedly creates and deletes a deployment and an NSFS Namespacestore that mount the PVC:

            $ SLEEP_TIME=75
            for i in $(seq 1 10); do
                sleep $SLEEP_TIME
                oc create -f nsfs_nss.yaml
                sleep $SLEEP_TIME
                RES=$(oc get namespacestore nsfs-nss -o json | jq -r ".status.phase")
                echo $RES
                if [[ "$RES" != "Ready" ]]; then
                    oc get namespacestore nsfs-nss -o yaml
                fi
                oc create -f nsfs_interface_deployment.yaml
                sleep $SLEEP_TIME
                RES=$(oc get namespacestore nsfs-nss -o json | jq -r ".status.phase")
                echo $RES
                if [[ "$RES" != "Ready" ]]; then
                    oc get namespacestore nsfs-nss -o yaml
                fi
                oc delete namespacestore nsfs-nss
                oc delete deployment nsfs-interface
            done

            (I attached the .yaml files that are referenced here under yamls.tar.gz)

            Eventually, the created NSFS Namespacestores will become Rejected due to the STORAGE_NOT_EXIST error. The Namespacestore is still functional. 

            Before the proposed fix, all it took to reproduce the error was to repeatedly create and delete only the Namespacestore. It's also worth mentioning that before the proposed fix, the above script produced the error with sleep intervals of 30 seconds, and now it only reproduces the error with 60+ seconds.

            Additionally, I managed to catch the following noobaa-endpoint error:

            Nov-21 15:31:37.340 [Endpoint/8] [ERROR] core.server.bg_services.namespace_monitor:: test_nsfs_resource: got error: [Error: No such file or directory] { code: 'ENOENT', context: 'Readdir _path=/nsfs/nsfs-nss ' } { fs_root_path: '/nsfs/nsfs-nss' } 
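            For reference, a minimal sketch for filtering such monitor errors out of the endpoint log stream, shown against the sample line above (on a live cluster, `oc logs` on the noobaa-endpoint deployment would supply the input):
            ```
            # Sample log line, copied from the error above.
            LOG_LINE="Nov-21 15:31:37.340 [Endpoint/8] [ERROR] core.server.bg_services.namespace_monitor:: test_nsfs_resource: got error: [Error: No such file or directory] { code: 'ENOENT', context: 'Readdir _path=/nsfs/nsfs-nss ' } { fs_root_path: '/nsfs/nsfs-nss' }"

            # Keep only namespace_monitor lines and surface the error code - prints: code: 'ENOENT'
            echo "$LOG_LINE" | grep "namespace_monitor" | grep -o "code: '[A-Z]*'"
            ```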

            The following steps helped capture this error:

            • Reducing the endpoint monitor interval from 3 minutes to 30 seconds:
              $ oc set env deployment/noobaa-endpoint CONFIG_JS_NAMESPACE_MONITOR_DELAY=30000
            • Increasing the logging level:
              $ oc patch configmap  noobaa-config  -p '{"data":{"NOOBAA_LOG_LEVEL": "all"}}'
              $ oc rollout restart deployment/noobaa-endpoint 

            Note that a live cluster with the issue is still available at https://url.corp.redhat.com/83992fc (requires RH VPN to inspect).

            Setting the status of the ticket back to ASSIGNED since the fix failed QA.

             


            Sagi Hirshfeld added a comment - I see that the fix that was referenced in the original BZ is already incorporated in the noobaa-core source code on 4.18.0-47:

            $ oc rsh sts/noobaa-core cat /root/node_modules/noobaa-core/src/endpoint/endpoint.js | head -n 252 | tail -n 11
            Defaulted container "core" out of: core, noobaa-log-processor
                        //wait with monitoring until pod has started
                        setTimeout(() => {
                            // Register a bg monitor on the endpoint
                            background_scheduler.register_bg_worker(new NamespaceMonitor({
                                name: 'namespace_fs_monitor',
                                client: internal_rpc_client,
                                should_monitor: nsr => Boolean(nsr.nsfs_config),
                            }));
                        }, 1000 * 60);
                    }

            So I'm setting the status to ON_QA 


            Nimrod Becker added a comment - There is a fix that was merged, so yes, probably ON_QA, but Modified is good enough.

            Eran Tamir added a comment - rh-ee-shirshfe should this one be in Modified status?

            Eran Tamir added a comment - rh-ee-shirshfe rh-ee-nbecker Does this one really need to be in Modified?

            Amit Prinz Setter added a comment - Generally, waiting until pods are available before proceeding to the next step is a good practice.

            Assuming we want to continue with this:
            The scenarios in #17 and the original description have a different order of namespacestore and deployment creation (the original description first creates the deployment, then the namespacestore, but #17 first creates the namespacestore). I'm guessing the original description is more accurate?

            If we're going by the original scenario - is it possible to get the state of the namespacestore before and after each step? I think that would help pin down the problematic step.

            If we're going by #17 - the deployment is for CephFS, right? If so, maybe it issues some S3 commands to noobaa upon creation? As mentioned in #10 and #12, this might be the trigger for the failure that results in a Rejected namespacestore status.
