OCPBUGS-12358: PVC mount failed with "system call failed: Structure needs cleaning" when starting linuxptp-daemon pod

    • Important
    • CNF RAN Sprint 235, CNF RAN Sprint 236, CNF RAN Sprint 237, CNF RAN Sprint 238, CNF RAN Sprint 239
    • Release note: When deployed through the ptp-operator and ZTP pipeline using HTTP transport, the linuxptp-daemon pod intermittently fails to start due to the PVC mount error "system call failed: Structure needs cleaning". The workaround is to delete the cloud-event-proxy-store-storage-class-http-events PVC and re-deploy.
    • 7/19: this will be closed once config map is added to 4.13
      4/26: telco review pending (JD/KY)
      Rel Note for Telco: Yes (4.13) - Jack will write the text.

      Description of problem:

      linuxptp-daemon pod does not start after a fresh install and configuration, failing with the following volume mount error:
      
      
        Warning  FailedMount  17m                  kubelet  Unable to attach or mount volumes: unmounted volumes=[pubsubstore], unattached volumes=[config-volume pubsubstore event-bus-socket socket-dir kube-api-access-kltbp linuxptp-certs]: timed out waiting for the condition
        Warning  FailedMount  15m (x3 over 24m)    kubelet  Unable to attach or mount volumes: unmounted volumes=[pubsubstore], unattached volumes=[socket-dir kube-api-access-kltbp linuxptp-certs config-volume pubsubstore event-bus-socket]: timed out waiting for the condition
        Warning  FailedMount  4m8s                 kubelet  Unable to attach or mount volumes: unmounted volumes=[pubsubstore], unattached volumes=[linuxptp-certs config-volume pubsubstore event-bus-socket socket-dir kube-api-access-kltbp]: timed out waiting for the condition
        Warning  FailedMount  4m2s (x16 over 24m)  kubelet  (combined from similar events): MountVolume.MountDevice failed for volume "local-pv-bc42d358" : local: failed to mount device /mnt/local-storage/storage-class-http-events/scsi-36f4ee08039aa91002a97d2f8da695437-part5 at /var/lib/kubelet/plugins/kubernetes.io/local-volume/mounts/local-pv-bc42d358 (fstype: xfs), error mount failed: exit status 32
      Mounting command: systemd-run
      Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/plugins/kubernetes.io/local-volume/mounts/local-pv-bc42d358 --scope -- mount -t xfs -o defaults /mnt/local-storage/storage-class-http-events/scsi-36f4ee08039aa91002a97d2f8da695437-part5 /var/lib/kubelet/plugins/kubernetes.io/local-volume/mounts/local-pv-bc42d358
      Output: Running scope as unit: run-r63873a5c66584274bed05ed7edfc8001.scope
      mount: /var/lib/kubelet/plugins/kubernetes.io/local-volume/mounts/local-pv-bc42d358: mount(2) system call failed: Structure needs cleaning.
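
      For triage, a minimal sketch of the checks that surface this state; the openshift-ptp namespace and the app=linuxptp-daemon pod label are assumptions (neither is stated in this bug):

      # Events and backing volume for the failing pod
      oc -n openshift-ptp describe pod -l app=linuxptp-daemon   # shows the FailedMount events above
      oc get pv local-pv-bc42d358 -o yaml                       # local PV backing the pubsubstore volume
      # On the node itself, the kernel log carries the XFS error behind "Structure needs cleaning"
      dmesg | grep -i 'XFS (sda5)'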
      

      Version-Release number of selected component (if applicable):

      4.13

      How reproducible:

      Seems to be reproducible - happened 2 out of 2 times in recent 4.13 PTP CI runs.

      Steps to Reproduce:

      1. Install an SNO DU via ZTP with HTTP events configured for both PTP and BMER (a verification sketch follows below).
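
      A hedged verification sketch for step 1; the default PtpOperatorConfig name ("default") and the openshift-ptp namespace are assumptions, not stated in this bug:

      oc -n openshift-ptp get ptpoperatorconfig default -o yaml   # confirm HTTP transport is configured for PTP events
      oc -n openshift-ptp get pvc                                 # events PVC created by the ptp-operator
      oc get pv | grep storage-class-http-events                  # local PV(s) created from the ZTP siteconfig partition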
      

      Actual results:

      The bmer pod started properly with its PVC mounted; the ptp daemon pod failed to start with the PVC mount error.

      Expected results:

      Both the bmer and linuxptp-daemon pods start properly with their PVCs mounted.

      Additional info:

      Restarting the linuxptp-daemon pod did not resolve this issue.

       


            Jack Ding added a comment -

            Closing this since the PVC is obsolete in 4.13. We have replaced it with a ConfigMap for persistent storage of subscriber data.

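
            For readers on 4.13 or later, a hedged way to confirm the ConfigMap-based storage mentioned above; the exact ConfigMap name is not given in this bug, so only the namespace listing is shown:

            oc -n openshift-ptp get configmap   # subscriber data now stored in a ConfigMap (name not specified here)
            oc -n openshift-ptp get pvc         # should no longer list cloud-event-proxy-store-storage-class-http-events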

            Jack Ding added a comment -

            ConfigMap CR is being QE tested so we are on a path to retire PVC.


            Ken Young added a comment -

            jacding@redhat.com - what is the latest on this issue?

            rhn-support-yliu1 aputtur@redhat.com - maybe we get on a plan to backport the CR and retire this mechanism.

            /KenY


            Yang Liu added a comment -

            jacding@redhat.com kenyis can we revisit the priority of this issue? Customers will likely encounter it if they configure PTP HTTP events as per the documentation during ZTP.

            This issue happens very frequently in QE CI env.


            Jack Ding added a comment -

            Release NOTE: When deployed through ptp-operator and ZTP pipeline using HTTP transport, the linuxptp-daemon pod intermittently fails to start due to PVC mount error "system call failed: Structure needs cleaning". The workaround is to delete the cloud-event-proxy-store-storage-class-http-events PVC and re-deploy.

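
            A minimal sketch of that workaround, assuming the openshift-ptp namespace and the app=linuxptp-daemon pod label, and that deleting the daemon pod is enough to trigger the re-deploy:

            oc -n openshift-ptp delete pvc cloud-event-proxy-store-storage-class-http-events
            oc -n openshift-ptp delete pod -l app=linuxptp-daemon   # ptp-operator re-creates the pod and the PVC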

            Ken Young added a comment -

            jacding@redhat.com 

            Can you write the release notes for this?

            /KenY

            cc rhn-support-kdassing 


            Yang Liu added a comment - - edited

            Some extra data: when the issue reoccurred, we tried reconfiguring ptp to use AMQP, deleted the existing PVC created by the ptp operator (without any change to the existing PVs), then configured bmer to use the same PV (a new PVC was created), and the bmer test ran successfully. That indicates the partitions and filesystems created via the ZTP siteconfig are fine; the issue is likely elsewhere.

            Wonder if this could happen when the PVC creation is too early - currently the PV and PVC creation for ptp are in the same policy.

            We have never seen this issue in bmer testing, and the only difference might be that in bmer tests we only create the PVC some time after ZTP/PV creation.

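
            A hedged sketch of the checks implied by that experiment, using the names quoted in this bug; whether the PV returns to Available after the claim is deleted depends on its reclaim policy:

            oc get pv | grep storage-class-http-events                     # PV status after the ptp PVC is deleted
            oc get pvc --all-namespaces | grep storage-class-http-events   # the new claim (bmer) bound against the same PV
            oc get pv local-pv-bc42d358 -o jsonpath='{.spec.persistentVolumeReclaimPolicy}{"\n"}'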

            Jack Ding added a comment - - edited

            The issue occurred again in helix49. One of the partitions created by ignitionConfigOverride shows error 117, which means "Structure needs cleaning". e2fsck (e2fsck -c -y /dev/sda5) shows the file structure is corrupted. Tried xfs_repair (xfs_repair /dev/sda5) but it could not repair it.

            [123956.990977] XFS (sda5): Mounting V5 Filesystem
            [123957.001505] XFS (sda5): Log inconsistent (didn't find previous header)
            [123957.001509] XFS (sda5): failed to find log head
            [123957.001510] XFS (sda5): log mount/recovery failed: error -117
            [123957.001598] XFS (sda5): log mount failed

            rhn-support-yliu1 found that the wipePartitionEntry=true option might help clean up the structure after partition creation.
            We will set wipePartitionEntry=true in ignitionConfigOverride for helix49 and continue monitoring CI tests to see if it resolves the issue.

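
            A hedged sketch of what that change might look like inside the SiteConfig ignitionConfigOverride, as an Ignition (spec 3.2.0) JSON fragment; only wipePartitionEntry=true, the /dev/sda device, and partition number 5 come from this bug, while the label, size, and start values are illustrative:

            {
              "ignition": { "version": "3.2.0" },
              "storage": {
                "disks": [
                  {
                    "device": "/dev/sda",
                    "partitions": [
                      {
                        "label": "storage-class-http-events",
                        "number": 5,
                        "sizeMiB": 10240,
                        "startMiB": 0,
                        "wipePartitionEntry": true
                      }
                    ]
                  }
                ]
              }
            }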

            Jack Ding added a comment - - edited

            The problem is intermittent and did not occur in today's CI test. I could not manually reproduce it either. Will continue to monitor the CI tests and debug on site when this issue reoccurs.


            Tao Liu added a comment -

            When 4.13 was installed a week ago, no issues were experienced. However, it should be noted that helix48 has only been installed with 4.12 since then.

