
[DFBUGS-153] [2284585] [MCG 4.16] NSFS Namespacestore appears Rejected even though it's functional


      Description of problem (please be as detailed as possible and provide log
      snippets):
      -----------------------------------------------------------------------
      Debugging a failed NSFS regression test in OCS-CI reveals that when we create an NSFS Namespacestore, it remains stuck in the Rejected phase with the following error in its YAML:
      ```
      ...
      status:
        conditions:
        - lastHeartbeatTime: "2024-06-03T12:17:14Z"
          lastTransitionTime: "2024-06-03T12:17:14Z"
          message: NamespaceStorePhaseRejected
          reason: 'Namespace store mode: STORAGE_NOT_EXIST'
          status: Unknown
          type: Available
      ```

      However, after pausing the test at this phase, I was still able to create the required MCG account that uses the namespacestore for its default buckets, and then create and write to and read from said bucket. It appears the Namespacestore is still functional despite the error message.
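      For quick checks, the phase and the rejection reason can be pulled out of the namespacestore JSON with jq. This is a minimal sketch against a sample document that mirrors the status quoted above; on a live cluster, `oc get namespacestore nsfs-nss -n openshift-storage -o json` would supply the real input:
      ```
      # Sample JSON mirroring the Rejected status quoted above (illustrative only).
      NSS_JSON='{"status":{"phase":"Rejected","conditions":[{"type":"Available","status":"Unknown","reason":"Namespace store mode: STORAGE_NOT_EXIST"}]}}'

      # Phase reported by the operator - prints: Rejected
      echo "$NSS_JSON" | jq -r '.status.phase'

      # Reason attached to the Available condition - prints: Namespace store mode: STORAGE_NOT_EXIST
      echo "$NSS_JSON" | jq -r '.status.conditions[] | select(.type=="Available") | .reason'
      ```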

      Version of all relevant components (if applicable):
      -----------------------------------------------------------------------
      OCP: 4.16.0-0.nightly-2024-06-02-000851
      ODF: 4.16.0-113
      ceph: 18.2.1-188.el9cp (b1ae9c989e2f41dcfec0e680c11d1d9465b1db0e) reef (stable)
      rook: v4.16.0-0.a2396a5186cc038b22154e857e0f7865e709d06a
      noobaa core: 5.16.0-03db21f
      noobaa operator: 5.16.0-705652b55ddaabc6bbdf16cb648c4f9a72345cf1

      Does this issue impact your ability to continue to work with the product
      (please explain in detail what is the user impact)?
      -----------------------------------------------------------------------
      It fails the regression test, but otherwise the feature still seems functional.

      Is there any workaround available to the best of your knowledge?
      -----------------------------------------------------------------------
      No

      Rate from 1 - 5 the complexity of the scenario you performed that caused this
      bug (1 - very simple, 5 - very complex)?
      -----------------------------------------------------------------------------
      3

      Is this issue reproducible?
      -----------------------------------------------------------------------------
      Yes

      Can this issue reproduce from the UI?
      -----------------------------------------------------------------------------
      N/A

      If this is a regression, please provide more details to justify this:
      -----------------------------------------------------------------------------
      The OCS-CI tests in https://github.com/red-hat-storage/ocs-ci/blob/master/tests/functional/object/mcg/test_nsfs.py had been passing before 4.16, and are now failing because the Namespacestore appears non-functional.

      Steps to Reproduce:
      ------------------------------------------------------------------------------
      1. Create an RWX CephFS PVC in the openshift-storage project:
      ```
      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: my-pvc
        namespace: openshift-storage
      spec:
        accessModes:
        - ReadWriteMany
        resources:
          requests:
            storage: 25Gi
        storageClassName: ocs-storagecluster-cephfs
      ```

      2. Create a deployment that mounts the PVC at mount path "/nsfs":
      ```
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: nsfs-interface
        namespace: openshift-storage
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: nsfs-interface
        template:
          metadata:
            labels:
              app: nsfs-interface
          spec:
            containers:
            - command:
              - /bin/sh
              image: registry.access.redhat.com/ubi8/ubi:8.5-214
              imagePullPolicy: IfNotPresent
              name: ubi8
              stdin: true
              tty: true
              volumeMounts:
              - mountPath: /nsfs
                name: my-pvc
            securityContext:
              runAsUser: 1000620000
            volumes:
            - name: my-pvc
              persistentVolumeClaim:
                claimName: my-pvc
      ```

      3. Create an NSFS Namespacestore:
      ```
      apiVersion: noobaa.io/v1alpha1
      kind: NamespaceStore
      metadata:
        name: nsfs-nss
        namespace: openshift-storage
        labels:
          app: noobaa
        finalizers:
        - noobaa.io/finalizer
      spec:
        type: nsfs
        nsfs:
          pvcName: my-pvc
          subpath: "nsfs"
      ```

      4. Create a new MCG account that uses the NSFS NSS by default:
      ```
      noobaa account create nsfs-account --allow_bucket_create=True --default_resource nsfs-nss --gid 1234 --new_buckets_path / --nsfs_account_config=True --nsfs_only=False --uid 5678 -n openshift-storage
      ```

      5. Use its credentials to create a new NSFS bucket on the NSS via S3 and perform some I/O against it:
      ```
      ACC_NAME=nsfs-account
      S3_ENDPOINT=https://$(oc get route s3 -n openshift-storage -o json | jq -r '.status.ingress[0].host')
      S3_ACCESS_KEY=$(kubectl get secret noobaa-account-$ACC_NAME -n openshift-storage -o json | jq -r '.data.AWS_ACCESS_KEY_ID|@base64d')
      S3_SECRET_KEY=$(kubectl get secret noobaa-account-$ACC_NAME -n openshift-storage -o json | jq -r '.data.AWS_SECRET_ACCESS_KEY|@base64d')

      alias my_s3="AWS_ACCESS_KEY_ID=$S3_ACCESS_KEY AWS_SECRET_ACCESS_KEY=$S3_SECRET_KEY aws --endpoint $S3_ENDPOINT --no-verify-ssl s3"

      my_s3 mb s3://nsfs-bucket --region=us-east-2
      my_s3 sync test_objects/ s3://nsfs-bucket/
      my_s3 ls s3://nsfs-bucket/
      ```
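      The `@base64d` jq filter above decodes the base64-encoded values stored in the account secret. For reference, a minimal coreutils equivalent, shown on an illustrative value rather than a real credential:
      ```
      # Illustrative base64 string, not a real access key.
      ENCODED="QUtJQUVYQU1QTEU="

      # Decode it the way jq's @base64d does - prints: AKIAEXAMPLE
      echo "$ENCODED" | base64 -d
      ```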

      Actual results:
      ---------------------------------------------------------------
      The Namespacestore appears Rejected after its creation, but the bucket creation in the last step and the I/O against it work in spite of the NSS status.

      Expected results:
      ---------------------------------------------------------------
      The NSFS Namespacestore should reach the Ready phase shortly after its creation and remain there, and the bucket creation and I/O against it at the last step should work.

      Additional info:
      ---------------------------------------------------------------


            Nimrod Becker added a comment - We've had it in 4.17 and we didn't receive any customer cases on it. While we would still want to fix this, we have concerns about how exactly, and with no customer cases in the existing version in the field, this one is not blocking 4.18. Updating to 4.19.

            Data Foundation bot added a comment - Please backport the fix to ODF 4.18 and update the RNT (Release Note Type/Text) field appropriately.

            Nimrod Becker added a comment - rh-ee-aprinzse FYI (connected fixing PR https://github.com/noobaa/noobaa-core/pull/8474 )

            Data Foundation bot added a comment - FailedQA

            Sagi Hirshfeld added a comment - - edited

            I was still able to reproduce the issue with the following loop, which repeatedly creates and deletes a deployment and an NSFS Namespacestore that mount the PVC:

            $ SLEEP_TIME=75
            for i in $(seq 1 10); do
                sleep $SLEEP_TIME
                oc create -f nsfs_nss.yaml
                sleep $SLEEP_TIME
                RES=$(oc get namespacestore nsfs-nss -o json | jq -r ".status.phase")
                echo $RES
                if [[ "$RES" != "Ready" ]]; then
                    oc get namespacestore nsfs-nss -o yaml
                fi
                oc create -f nsfs_interface_deployment.yaml
                sleep $SLEEP_TIME
                RES=$(oc get namespacestore nsfs-nss -o json | jq -r ".status.phase")
                echo $RES
                if [[ "$RES" != "Ready" ]]; then
                    oc get namespacestore nsfs-nss -o yaml
                fi
                oc delete namespacestore nsfs-nss
                oc delete deployment nsfs-interface
            done

            (I attached the .yaml files that are referenced here under yamls.tar.gz)

            Eventually, the created NSFS Namespacestores will become Rejected due to the STORAGE_NOT_EXIST error. The Namespacestore is still functional. 

            Before the proposed fix, all it took to reproduce the error was to repeatedly create and delete only the Namespacestore. It's also worth mentioning that before the proposed fix, the above script produced the error with sleep intervals of 30 seconds, and now it only reproduces the error with 60+ seconds.

            Additionally, I managed to catch the following noobaa-endpoint error:

            Nov-21 15:31:37.340 [Endpoint/8] [ERROR] core.server.bg_services.namespace_monitor:: test_nsfs_resource: got error: [Error: No such file or directory] { code: 'ENOENT', context: 'Readdir _path=/nsfs/nsfs-nss ' } { fs_root_path: '/nsfs/nsfs-nss' } 
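            For reference, a minimal sketch for filtering such monitor errors out of the endpoint log stream, shown against the sample line above (on a live cluster, `oc logs` on the noobaa-endpoint deployment would supply the input):
            ```
            # Sample log line, copied from the error above.
            LOG_LINE="Nov-21 15:31:37.340 [Endpoint/8] [ERROR] core.server.bg_services.namespace_monitor:: test_nsfs_resource: got error: [Error: No such file or directory] { code: 'ENOENT', context: 'Readdir _path=/nsfs/nsfs-nss ' } { fs_root_path: '/nsfs/nsfs-nss' }"

            # Keep only namespace_monitor lines and surface the error code - prints: code: 'ENOENT'
            echo "$LOG_LINE" | grep "namespace_monitor" | grep -o "code: '[A-Z]*'"
            ```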

            The following steps helped capture this error:

            • Reducing the endpoint monitor interval from 3 minutes to 30 seconds:
              $ oc set env deployment/noobaa-endpoint CONFIG_JS_NAMESPACE_MONITOR_DELAY=30000
            • Increasing the logging level:
              $ oc patch configmap  noobaa-config  -p '{"data":{"NOOBAA_LOG_LEVEL": "all"}}'
              $ oc rollout restart deployment/noobaa-endpoint 

            Note that a live cluster with the issue is still available at https://url.corp.redhat.com/83992fc (requires RH VPN to inspect).

            Setting the status of the ticket back to ASSIGNED since the fix failed QA.

             


            Sagi Hirshfeld added a comment - I see that the fix that was referenced in the original BZ is already incorporated in the noobaa-core source code on 4.18.0-47:

            $ oc rsh sts/noobaa-core cat /root/node_modules/noobaa-core/src/endpoint/endpoint.js | head -n 252 | tail -n 11
            Defaulted container "core" out of: core, noobaa-log-processor
                        //wait with monitoring until pod has started
                        setTimeout(() => {
                            // Register a bg monitor on the endpoint
                            background_scheduler.register_bg_worker(new NamespaceMonitor({
                                name: 'namespace_fs_monitor',
                                client: internal_rpc_client,
                                should_monitor: nsr => Boolean(nsr.nsfs_config),
                            }));
                        }, 1000 * 60);
                    }

            So I'm setting the status to ON_QA 


            Nimrod Becker added a comment - There is a fix that was merged, so yes, probably ON_QA, but Modified is good enough.

            Eran Tamir added a comment - rh-ee-shirshfe should this one be in Modified status?

            Eran Tamir added a comment - rh-ee-shirshfe rh-ee-nbecker Does this one really need to be in Modified?

            Amit Prinz Setter added a comment - Generally, waiting until pods are available before proceeding to the next step is a good practice.

            Assuming we want to continue with this:
            The scenarios in #17 and the original description have a different order of namespacestore and deployment creation (the original description first creates the deployment, then the namespacestore, but #17 first creates the namespacestore). I'm guessing the original description is more accurate?

            If we're going by the original scenario - is it possible to get the state of the namespacestore before and after each step? I think that would help pin down the problematic step.

            If we're going by #17 - the deployment is for CephFS, right? If so, maybe it issues some S3 commands to noobaa upon creation? As mentioned in #10 and #12, this might be the trigger for the failure that results in a Rejected namespacestore status.
