Bug
Resolution: Obsolete
Severity: Normal
Version: 4.8
Description of problem:
The customer replaced a drive on node w8 that backed an OCS 4.8 OSD, and the newly attached drive is not provisioned as a PV by LSO.
This issue happens on OCP 4.8 / OCS 4.8 using the "local-storage-operator" to provide drives for Ceph/OCS.
So this bug is related to "openshift-local-storage", not to OCS 4.8.
The customer has a support exception to continue using this software.
The "diskmaker-manager-kqrd4" pod seems to be stuck in a loop and does not ingest any new drive to create a PV.
Deleting the "diskmaker-manager" pod associated with node w8 and the "local-storage-operator" pod does not make it move forward.
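For reference, the restart attempts were along these lines (a sketch; the operator pod name is a placeholder):
$ oc -n openshift-local-storage delete pod diskmaker-manager-kqrd4
$ oc -n openshift-local-storage delete pod <local-storage-operator-pod>
Both pods come back up, but the recreated diskmaker pod shows the same behavior (see the logs below).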
After startup, the "diskmaker-manager-kqrd4" pod (for node w8) quickly settles into a loop with the following logs:
2023-06-13T13:48:20.672103650Z {"level":"info","ts":1686664100.6719604,"logger":"deleter","msg":"Looking for released PVs to clean up","Request.Namespace":"openshift-local-storage","Request.Name":""}
2023-06-13T13:48:20.672249550Z I0613 13:48:20.672222 1635570 common.go:334] StorageClass "localblock" configured with MountDir "/mnt/local-storage/localblock", HostDir "/mnt/local-storage/localblock", VolumeMode "Block", FsType "", BlockCleanerCommand ["/scripts/quick_reset.sh"]
2023-06-13T13:48:50.673440909Z {"level":"info","ts":1686664130.673301,"logger":"deleter","msg":"Looking for released PVs to clean up","Request.Namespace":"openshift-local-storage","Request.Name":""}
2023-06-13T13:48:50.673568923Z I0613 13:48:50.673531 1635570 common.go:334] StorageClass "localblock" configured with MountDir "/mnt/local-storage/localblock", HostDir "/mnt/local-storage/localblock", VolumeMode "Block", FsType "", BlockCleanerCommand ["/scripts/quick_reset.sh"]
2023-06-13T13:49:20.674220779Z {"level":"info","ts":1686664160.6741552,"logger":"deleter","msg":"Looking for released PVs to clean up","Request.Namespace":"openshift-local-storage","Request.Name":""}
2023-06-13T13:49:20.674438831Z I0613 13:49:20.674405 1635570 common.go:334] StorageClass "localblock" configured with MountDir "/mnt/local-storage/localblock", HostDir "/mnt/local-storage/localblock", VolumeMode "Block", FsType "", BlockCleanerCommand ["/scripts/quick_reset.sh"]
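A quick way to confirm that the symlink controller is no longer reconciling in this pod is to count its log entries (a sketch; the container name is taken from the must-gather paths below):
$ oc -n openshift-local-storage logs diskmaker-manager-kqrd4 -c diskmaker-manager | grep -c "localvolumeset-symlink-controller"
On the affected pods this stays at or near zero, while a healthy pod logs a "Reconciling LocalVolumeSet" entry roughly every 60s.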
The customer is particularly interested in solving the issue on this w8 node in order to recreate a new OSD.
But it is not the only pod with this issue; it looks like multiple "diskmaker-manager" pods are affected and no longer loop over the drives:
$ grep -c "Device" 0050-inspect.local.7853900946867902518.tar.gz/inspect.local.7853900946867902518/namespaces/openshift-local-storage/pods/diskmaker-manager-*/diskmaker-manager/diskmaker-manager/logs/current.log | cut -b 81-500
namespaces/openshift-local-storage/pods/diskmaker-manager-22qrq/diskmaker-manager/diskmaker-manager/logs/current.log:0
namespaces/openshift-local-storage/pods/diskmaker-manager-68qpt/diskmaker-manager/diskmaker-manager/logs/current.log:0
namespaces/openshift-local-storage/pods/diskmaker-manager-7p7jv/diskmaker-manager/diskmaker-manager/logs/current.log:0
namespaces/openshift-local-storage/pods/diskmaker-manager-7ptl2/diskmaker-manager/diskmaker-manager/logs/current.log:12480
namespaces/openshift-local-storage/pods/diskmaker-manager-96vxc/diskmaker-manager/diskmaker-manager/logs/current.log:63344
namespaces/openshift-local-storage/pods/diskmaker-manager-9vjtf/diskmaker-manager/diskmaker-manager/logs/current.log:19085
namespaces/openshift-local-storage/pods/diskmaker-manager-fnqxw/diskmaker-manager/diskmaker-manager/logs/current.log:111879
namespaces/openshift-local-storage/pods/diskmaker-manager-gl4wf/diskmaker-manager/diskmaker-manager/logs/current.log:0
namespaces/openshift-local-storage/pods/diskmaker-manager-kqrd4/diskmaker-manager/diskmaker-manager/logs/current.log:0
namespaces/openshift-local-storage/pods/diskmaker-manager-pdnct/diskmaker-manager/diskmaker-manager/logs/current.log:129458
namespaces/openshift-local-storage/pods/diskmaker-manager-q6s52/diskmaker-manager/diskmaker-manager/logs/current.log:59920
namespaces/openshift-local-storage/pods/diskmaker-manager-zdstc/diskmaker-manager/diskmaker-manager/logs/current.log:0
In general, on every healthy diskmaker pod there is a loop every 60s that browses all block devices and creates a symlink in /mnt/local-storage/localblock if the conditions are met.
Like this:
2023-06-20T16:32:54.391660888Z {"level":"info","ts":1687278774.3914862,"logger":"localvolumeset-symlink-controller","msg":"Reconciling LocalVolumeSet","Request.Namespace":"openshift-local-storage","Request.Name":"local-block"}
2023-06-20T16:32:54.391872431Z I0620 16:32:54.391823 15549 common.go:334] StorageClass "localblock" configured with MountDir "/mnt/local-storage/localblock", HostDir "/mnt/local-storage/localblock", VolumeMode "Block", FsType "", BlockCleanerCommand ["/scripts/quick_reset.sh"]
2023-06-20T16:32:54.569914265Z {"level":"info","ts":1687278774.5697858,"logger":"localvolumeset-symlink-controller","msg":"match negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"loop0","matcher.Name":"inTypeList"}
2023-06-20T16:32:54.593580217Z {"level":"info","ts":1687278774.5934741,"logger":"localvolumeset-symlink-controller","msg":"match negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"loop1","matcher.Name":"inTypeList"}
2023-06-20T16:32:54.614147565Z {"level":"info","ts":1687278774.6140366,"logger":"localvolumeset-symlink-controller","msg":"match negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"loop2","matcher.Name":"inTypeList"}
2023-06-20T16:32:54.632720733Z {"level":"info","ts":1687278774.6326215,"logger":"localvolumeset-symlink-controller","msg":"match negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"loop3","matcher.Name":"inSizeRange"}
2023-06-20T16:32:54.637415307Z {"level":"info","ts":1687278774.6373258,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sdb","filter.Name":"noBindMounts"}
2023-06-20T16:32:54.640344701Z {"level":"info","ts":1687278774.6402955,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sdc","filter.Name":"noBindMounts"}
2023-06-20T16:32:54.642941844Z {"level":"info","ts":1687278774.6428287,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sdd","filter.Name":"noBindMounts"}
2023-06-20T16:32:54.645585695Z {"level":"info","ts":1687278774.645547,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sde","filter.Name":"noChildren"}
2023-06-20T16:32:54.645585695Z {"level":"info","ts":1687278774.645573,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sde1","filter.Name":"noBiosBootInPartLabel"}
2023-06-20T16:32:54.645784322Z {"level":"info","ts":1687278774.6456826,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sde2","filter.Name":"noFilesystemSignature"}
2023-06-20T16:32:54.645829280Z {"level":"info","ts":1687278774.645764,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sde3","filter.Name":"noBiosBootInPartLabel"}
2023-06-20T16:32:54.645867197Z {"level":"info","ts":1687278774.6458266,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sde4","filter.Name":"noFilesystemSignature"}
2023-06-20T16:32:54.646096199Z {"level":"info","ts":1687278774.6460469,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"rbd0","filter.Name":"noChildren"}
2023-06-20T16:32:54.650369603Z {"level":"info","ts":1687278774.6503327,"logger":"localvolumeset-symlink-controller","msg":"match negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"rbd0p1","matcher.Name":"inSizeRange"}
2023-06-20T16:32:54.653112455Z {"level":"info","ts":1687278774.6530833,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"rbd0p2","filter.Name":"noFilesystemSignature"}
2023-06-20T16:32:54.653160997Z {"level":"info","ts":1687278774.653139,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"rbd0p3","filter.Name":"noFilesystemSignature"}
2023-06-20T16:33:17.781643365Z {"level":"info","ts":1687278797.781562,"logger":"deleter","msg":"Looking for released PVs to clean up","Request.Namespace":"openshift-local-storage","Request.Name":""}
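For comparison, the symlinks that a healthy loop ends up creating can be inspected directly on the node (a sketch; the directory is the HostDir reported in the StorageClass configuration lines above):
$ oc debug node/<w8-node> -- chroot /host ls -l /mnt/local-storage/localblock
On w8 one would expect to see only the symlink behind local-pv-f5306836 (sdc), since the replacement drive was never ingested.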
On 6 nodes this is no longer the case (see the previous grep "Device" command above).
Affected nodes are: w2, w3, w4, w6, w7, w8.
The LocalVolumeSet looks quite standard:
{
  "apiVersion": "local.storage.openshift.io/v1alpha1",
  "kind": "LocalVolumeSet",
  "metadata": {
    "annotations": {},
    "name": "local-block",
    "namespace": "openshift-local-storage"
  },
  "spec": {
    "deviceInclusionSpec": {
      "deviceMechanicalProperties": ["NonRotational"],
      "deviceTypes": ["disk"],
      "maxSize": "5Ti",
      "minSize": "2Ti"
    },
    "fstype": "ext4",
    "maxDeviceCount": 3,
    "nodeSelector": {
      "nodeSelectorTerms": [
        {
          "matchExpressions": [
            {
              "key": "cluster.ocs.openshift.io/openshift-storage",
              "operator": "In",
              "values": [""]
            }
          ]
        }
      ]
    },
    "storageClassName": "localblock",
    "volumeMode": "Block"
  }
}
And the drive to be ingested on w8 is 3.5T, which is within the 2Ti-5Ti range of the criteria.
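This can be double-checked on the node with lsblk against the deviceInclusionSpec above (a sketch; the device name of the replacement drive is assumed here):
$ oc debug node/<w8-node> -- chroot /host lsblk -o NAME,SIZE,TYPE,ROTA,FSTYPE /dev/sdb
To be ingested it should report TYPE "disk", ROTA 0 (NonRotational), a SIZE between 2Ti and 5Ti, and, per the filters visible in the diskmaker logs, no filesystem signature or children.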
And the only worker node that doesn't have 2 PVs associated in local storage is w8:
$ for localpv in $(omg get pv -n openshift-local-storage |grep local| awk '{print $1}') ; do DEV=$(omg get pv -n openshift-local-storage $localpv -A -o json | jq '.metadata.annotations["storage.openshift.com/device-name"]') ; NODE=$(omg get pv -n openshift-local-storage $localpv -A -o json | jq '.metadata.labels["kubernetes.io/hostname"]' ) ; printf "%s,\t%s,\t%s\n" $localpv $NODE $DEV | tr -d \" ; done | sort -k 2
local-pv-a5bfdd0b, m0.aaa.bbb.ccc.ddd.eee, sdb
local-pv-897e5460, m0.aaa.bbb.ccc.ddd.eee, sdc
local-pv-85e43bbf, m0.aaa.bbb.ccc.ddd.eee, sdd
local-pv-8d13be14, m1.aaa.bbb.ccc.ddd.eee, sdb
local-pv-d331dc90, m1.aaa.bbb.ccc.ddd.eee, sdc
local-pv-4aec2195, m1.aaa.bbb.ccc.ddd.eee, sdd
local-pv-8136a36b, m2.aaa.bbb.ccc.ddd.eee, sdb
local-pv-3a57baf, m2.aaa.bbb.ccc.ddd.eee, sdc
local-pv-37cbee9d, m2.aaa.bbb.ccc.ddd.eee, sdd
local-pv-50e2b3fb, w0.aaa.bbb.ccc.ddd.eee, sdb
local-pv-a8f9abf1, w0.aaa.bbb.ccc.ddd.eee, sdc
local-pv-5ca1cf5c, w1.aaa.bbb.ccc.ddd.eee, sdb
local-pv-b78af445, w1.aaa.bbb.ccc.ddd.eee, sdc
local-pv-870a6554, w2.aaa.bbb.ccc.ddd.eee, sdb
local-pv-b1acc488, w2.aaa.bbb.ccc.ddd.eee, sdc
local-pv-2215dc76, w3.aaa.bbb.ccc.ddd.eee, sdb
local-pv-e40a8f20, w3.aaa.bbb.ccc.ddd.eee, sdc
local-pv-dc1075c2, w4.aaa.bbb.ccc.ddd.eee, sdb
local-pv-898de341, w4.aaa.bbb.ccc.ddd.eee, sdc
local-pv-89c985ac, w5.aaa.bbb.ccc.ddd.eee, sdb
local-pv-cc7509b9, w5.aaa.bbb.ccc.ddd.eee, sdc
local-pv-21253b6, w6.aaa.bbb.ccc.ddd.eee, sdb
local-pv-bfb8d2f1, w6.aaa.bbb.ccc.ddd.eee, sdc
local-pv-de18c144, w7.aaa.bbb.ccc.ddd.eee, sdb
local-pv-170bbd3f, w7.aaa.bbb.ccc.ddd.eee, sdc
local-pv-f5306836, w8.aaa.bbb.ccc.ddd.eee, sdc
We straced two diskmaker pods for 1800s.
We can see that the process still talks on the network, but it does not issue any syscalls related to /dev or /mnt/local-storage/localblock.
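The strace was gathered roughly as follows (a sketch; the PID lookup via crictl is the usual way to find the container's main process, and on RHCOS strace itself is typically run from a toolbox container):
$ oc debug node/<w8-node>
# chroot /host
# crictl ps --name diskmaker-manager -q            # container ID of the diskmaker container
# crictl inspect <container-id> | jq .info.pid     # host PID of the diskmaker process
# timeout 1800 strace -f -p <pid> -o /tmp/diskmaker.strace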
Restarting the pod does not help, so some "state" appears to be saved somewhere.
Version-Release number of selected component (if applicable):
OCP Version : 4.8.39
OCS Version : 4.8.14
local-storage : local-storage-operator.4.8.0-202208020324
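(For reference, the exact LSO build can be read from the CSV: $ oc -n openshift-local-storage get csv)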
How reproducible:
We have not tried to reproduce it. Multiple nodes seem to have the same diskmaker log symptoms.
Steps to Reproduce:
N/A
Actual results:
N/A
Expected results:
A PV is created by diskmaker and then later an OSD is created.
Additional info:
We have not yet tried to reboot node w8.
That is planned to be done.
Node w8 will need to be rebooted because there is still one loopback device (loop1) associated with a drive that has been replaced.
$ grep ^ sosreport-w8/sys/block/loop*/loop/backing_file
sosreport-w8/sys/block/loop0/loop/backing_file:/var/lib/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-f5306836/4da40238-0786-473e-bfb5-cf23c6ce1b5d
sosreport-w8/sys/block/loop1/loop/backing_file:/var/lib/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-7bfb9892/09797f5e-b321-47de-ac84-de4fd47ec603 (deleted)
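The same stale loop device can also be seen with losetup, which flags deleted backing files (a sketch):
$ oc debug node/<w8-node> -- chroot /host losetup -l
Whether detaching it manually with "losetup -d /dev/loop1" would be enough to avoid the reboot has not been verified.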
- impacts account
- OCPBUGS-15358 LocalStorageOperator does not create PersistentVolumes (Closed)