OCPBUGS-12358: PVC mount failed with "system call failed: Structure needs cleaning" when starting linuxptp-daemon pod

    • Important
    • CNF RAN Sprint 235, CNF RAN Sprint 236, CNF RAN Sprint 237, CNF RAN Sprint 238, CNF RAN Sprint 239
    • Release note: When deployed through the ptp-operator and ZTP pipeline using HTTP transport, the linuxptp-daemon pod intermittently fails to start due to the PVC mount error "system call failed: Structure needs cleaning". The workaround is to delete the cloud-event-proxy-store-storage-class-http-events PVC and re-deploy.
    • 7/19: this will be closed once config map is added to 4.13
      4/26: telco review pending (JD/KY)
      Rel Note for Telco: Yes (4.13) - Jack will write the text.

      Description of problem:

      linuxptp-daemon pod does not start after a fresh install and configuration, failing with the following volume mount error:
      
      
        Warning  FailedMount  17m                  kubelet  Unable to attach or mount volumes: unmounted volumes=[pubsubstore], unattached volumes=[config-volume pubsubstore event-bus-socket socket-dir kube-api-access-kltbp linuxptp-certs]: timed out waiting for the condition
        Warning  FailedMount  15m (x3 over 24m)    kubelet  Unable to attach or mount volumes: unmounted volumes=[pubsubstore], unattached volumes=[socket-dir kube-api-access-kltbp linuxptp-certs config-volume pubsubstore event-bus-socket]: timed out waiting for the condition
        Warning  FailedMount  4m8s                 kubelet  Unable to attach or mount volumes: unmounted volumes=[pubsubstore], unattached volumes=[linuxptp-certs config-volume pubsubstore event-bus-socket socket-dir kube-api-access-kltbp]: timed out waiting for the condition
        Warning  FailedMount  4m2s (x16 over 24m)  kubelet  (combined from similar events): MountVolume.MountDevice failed for volume "local-pv-bc42d358" : local: failed to mount device /mnt/local-storage/storage-class-http-events/scsi-36f4ee08039aa91002a97d2f8da695437-part5 at /var/lib/kubelet/plugins/kubernetes.io/local-volume/mounts/local-pv-bc42d358 (fstype: xfs), error mount failed: exit status 32
      Mounting command: systemd-run
      Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/plugins/kubernetes.io/local-volume/mounts/local-pv-bc42d358 --scope -- mount -t xfs -o defaults /mnt/local-storage/storage-class-http-events/scsi-36f4ee08039aa91002a97d2f8da695437-part5 /var/lib/kubelet/plugins/kubernetes.io/local-volume/mounts/local-pv-bc42d358
      Output: Running scope as unit: run-r63873a5c66584274bed05ed7edfc8001.scope
      mount: /var/lib/kubelet/plugins/kubernetes.io/local-volume/mounts/local-pv-bc42d358: mount(2) system call failed: Structure needs cleaning.
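
      For triage, a minimal sketch of the checks that surface this state; the openshift-ptp namespace and the app=linuxptp-daemon pod label are assumptions (neither is stated in this bug):

      # Events and backing volume for the failing pod
      oc -n openshift-ptp describe pod -l app=linuxptp-daemon   # shows the FailedMount events above
      oc get pv local-pv-bc42d358 -o yaml                       # local PV backing the pubsubstore volume
      # On the node itself, the kernel log carries the XFS error behind "Structure needs cleaning"
      dmesg | grep -i 'XFS (sda5)'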
      

      Version-Release number of selected component (if applicable):

      4.13

      How reproducible:

      Seems to be reproducible - happened 2 out of 2 times in recent 4.13 PTP CI runs.

      Steps to Reproduce:

      1. Install an SNO DU via ZTP with HTTP events configured for both PTP and BMER (a verification sketch follows below).
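
      A hedged verification sketch for step 1; the default PtpOperatorConfig name ("default") and the openshift-ptp namespace are assumptions, not stated in this bug:

      oc -n openshift-ptp get ptpoperatorconfig default -o yaml   # confirm HTTP transport is configured for PTP events
      oc -n openshift-ptp get pvc                                 # events PVC created by the ptp-operator
      oc get pv | grep storage-class-http-events                  # local PV(s) created from the ZTP siteconfig partition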
      

      Actual results:

      The bmer pod started properly with its PVC mounted; the ptp daemon pod failed to start with the PVC mount error.

      Expected results:

      Both the bmer and linuxptp-daemon pods start properly with their PVCs mounted.

      Additional info:

      Restarting the linuxptp-daemon pod did not resolve this issue.

       


            Jack Ding added a comment -

            Closing this since the PVC is obsolete in 4.13. We have replaced it with a ConfigMap for persistent storage of subscriber data.

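
            For readers on 4.13 or later, a hedged way to confirm the ConfigMap-based storage mentioned above; the exact ConfigMap name is not given in this bug, so only the namespace listing is shown:

            oc -n openshift-ptp get configmap   # subscriber data now stored in a ConfigMap (name not specified here)
            oc -n openshift-ptp get pvc         # should no longer list cloud-event-proxy-store-storage-class-http-events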

            Jack Ding added a comment -

            ConfigMap CR is being QE tested so we are on a path to retire PVC.


            Ken Young added a comment -

            jacding@redhat.com - what is the latest on this issue?

            rhn-support-yliu1 aputtur@redhat.com - maybe we get on a plan to backport the CR and retire this mechanism.

            /KenY


            Yang Liu added a comment -

            jacding@redhat.com kenyis can we revisit the priority of this issue? Customers will likely encounter it if they configure PTP HTTP events as per the documentation during ZTP.

            This issue happens very frequently in QE CI env.


            Jack Ding added a comment -

            Release NOTE: When deployed through ptp-operator and ZTP pipeline using HTTP transport, the linuxptp-daemon pod intermittently fails to start due to PVC mount error "system call failed: Structure needs cleaning". The workaround is to delete the cloud-event-proxy-store-storage-class-http-events PVC and re-deploy.

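
            A minimal sketch of that workaround, assuming the openshift-ptp namespace and the app=linuxptp-daemon pod label, and that deleting the daemon pod is enough to trigger the re-deploy:

            oc -n openshift-ptp delete pvc cloud-event-proxy-store-storage-class-http-events
            oc -n openshift-ptp delete pod -l app=linuxptp-daemon   # ptp-operator re-creates the pod and the PVC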

            Ken Young added a comment -

            jacding@redhat.com 

            Can you write the release notes for this?

            /KenY

            cc rhn-support-kdassing 


            Yang Liu added a comment - - edited

            Some extra data: when the issue reoccurred, we tried reconfiguring ptp to use AMQP, deleted the existing PVC created by the ptp operator (without any change to the existing PVs), then configured bmer to use the same PV (a new PVC was created), and the bmer test ran successfully. That indicates the partitions and filesystems created via the ZTP siteconfig are fine; the issue is likely elsewhere.

            Wonder if this could happen when the PVC creation is too early - currently the PV and PVC creation for ptp are in the same policy.

            We have never seen this issue in bmer testing, and the only difference might be that in bmer tests we only create the PVC some time after ZTP/PV creation.

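
            A hedged sketch of the checks implied by that experiment, using the names quoted in this bug; whether the PV returns to Available after the claim is deleted depends on its reclaim policy:

            oc get pv | grep storage-class-http-events                     # PV status after the ptp PVC is deleted
            oc get pvc --all-namespaces | grep storage-class-http-events   # the new claim (bmer) bound against the same PV
            oc get pv local-pv-bc42d358 -o jsonpath='{.spec.persistentVolumeReclaimPolicy}{"\n"}'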

            Jack Ding added a comment - - edited

            The issue occurred again in helix49. One of the partitions created by ignitionConfigOverride shows error 117, which means "Structure needs cleaning". e2fsck (e2fsck -c -y /dev/sda5) shows the file structure is corrupted. Tried xfs_repair (xfs_repair /dev/sda5) but it could not repair it.

            [123956.990977] XFS (sda5): Mounting V5 Filesystem
            [123957.001505] XFS (sda5): Log inconsistent (didn't find previous header)
            [123957.001509] XFS (sda5): failed to find log head
            [123957.001510] XFS (sda5): log mount/recovery failed: error -117
            [123957.001598] XFS (sda5): log mount failed

            rhn-support-yliu1 found that the wipePartitionEntry=true option might help clean up the structure after partition creation.
            We will set wipePartitionEntry=true in ignitionConfigOverride for helix49 and continue monitoring CI tests to see if it resolves the issue.

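
            A hedged sketch of what that change might look like inside the SiteConfig ignitionConfigOverride, as an Ignition (spec 3.2.0) JSON fragment; only wipePartitionEntry=true, the /dev/sda device, and partition number 5 come from this bug, while the label, size, and start values are illustrative:

            {
              "ignition": { "version": "3.2.0" },
              "storage": {
                "disks": [
                  {
                    "device": "/dev/sda",
                    "partitions": [
                      {
                        "label": "storage-class-http-events",
                        "number": 5,
                        "sizeMiB": 10240,
                        "startMiB": 0,
                        "wipePartitionEntry": true
                      }
                    ]
                  }
                ]
              }
            }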

            Jack Ding added a comment - - edited

            The problem is intermittent and did not occur in today's CI test. I could not manually reproduce it either. Will continue to monitor the CI tests and debug on site when this issue reoccurs.


            Tao Liu added a comment -

            When 4.13 was installed a week ago, no issues were experienced. However, it should be noted that helix48 has only been installed with 4.12 since then.

