OpenShift Bugs / OCPBUGS-15929

[OCS 4.8] openshift-local-storage diskmaker stuck in "Looking for released PVs to clean up" and does not create a new PV


      Description of problem:

      The customer replaced a drive on node w8 that was associated with an OCS 4.8 OSD, and the newly attached drive is not provisioned as a PV by LSO.

       

      This issue occurs on OCP 4.8 with OCS 4.8, using the "local-storage-operator" to provide drives for Ceph/OCS,
      so this bug is against "openshift-local-storage" rather than OCS 4.8.
      The customer has a support exception to continue using this software.

       

      The "diskmaker-manager-kqrd4" pod seem to be stuck in a loop and don't ingest any new drive for creating a PV.
      Deleting this "diskmaker-manager" pod associated to node w8 and "local-storage-operator" pod does not make it move forward.

      The "diskmaker-manager-kqrd4" ( for w8 node ) after startup quickly loop with the following logs :

      2023-06-13T13:48:20.672103650Z {"level":"info","ts":1686664100.6719604,"logger":"deleter","msg":"Looking for released PVs to clean up","Request.Namespace":"openshift-local-storage","Request.Name":""}
      2023-06-13T13:48:20.672249550Z I0613 13:48:20.672222 1635570 common.go:334] StorageClass "localblock" configured with MountDir "/mnt/local-storage/localblock", HostDir "/mnt/local-storage/localblock", VolumeMode "Block", FsType "", BlockCleanerCommand ["/scripts/quick_reset.sh"]
      2023-06-13T13:48:50.673440909Z {"level":"info","ts":1686664130.673301,"logger":"deleter","msg":"Looking for released PVs to clean up","Request.Namespace":"openshift-local-storage","Request.Name":""}
      2023-06-13T13:48:50.673568923Z I0613 13:48:50.673531 1635570 common.go:334] StorageClass "localblock" configured with MountDir "/mnt/local-storage/localblock", HostDir "/mnt/local-storage/localblock", VolumeMode "Block", FsType "", BlockCleanerCommand ["/scripts/quick_reset.sh"]
      2023-06-13T13:49:20.674220779Z {"level":"info","ts":1686664160.6741552,"logger":"deleter","msg":"Looking for released PVs to clean up","Request.Namespace":"openshift-local-storage","Request.Name":""}
      2023-06-13T13:49:20.674438831Z I0613 13:49:20.674405 1635570 common.go:334] StorageClass "localblock" configured with MountDir "/mnt/local-storage/localblock", HostDir "/mnt/local-storage/localblock", VolumeMode "Block", FsType "", BlockCleanerCommand ["/scripts/quick_reset.sh"]
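      For reference, these logs come from the diskmaker-manager container of the affected pod; a minimal sketch of how to pull them from a live cluster:

      $ oc -n openshift-local-storage logs diskmaker-manager-kqrd4 -c diskmaker-manager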
      

       

      The customer is particularly interested in fixing this w8 node so that a new OSD can be recreated.
      However, it is not the only pod with this issue: multiple "diskmaker-manager" pods appear affected and no longer loop over the block devices:

      $ grep -c "Device" 0050-inspect.local.7853900946867902518.tar.gz/inspect.local.7853900946867902518/namespaces/openshift-local-storage/pods/diskmaker-manager-*/diskmaker-manager/diskmaker-manager/logs/current.log | cut -b 81-500
      namespaces/openshift-local-storage/pods/diskmaker-manager-22qrq/diskmaker-manager/diskmaker-manager/logs/current.log:0
      namespaces/openshift-local-storage/pods/diskmaker-manager-68qpt/diskmaker-manager/diskmaker-manager/logs/current.log:0
      namespaces/openshift-local-storage/pods/diskmaker-manager-7p7jv/diskmaker-manager/diskmaker-manager/logs/current.log:0
      namespaces/openshift-local-storage/pods/diskmaker-manager-7ptl2/diskmaker-manager/diskmaker-manager/logs/current.log:12480
      namespaces/openshift-local-storage/pods/diskmaker-manager-96vxc/diskmaker-manager/diskmaker-manager/logs/current.log:63344
      namespaces/openshift-local-storage/pods/diskmaker-manager-9vjtf/diskmaker-manager/diskmaker-manager/logs/current.log:19085
      namespaces/openshift-local-storage/pods/diskmaker-manager-fnqxw/diskmaker-manager/diskmaker-manager/logs/current.log:111879
      namespaces/openshift-local-storage/pods/diskmaker-manager-gl4wf/diskmaker-manager/diskmaker-manager/logs/current.log:0
      namespaces/openshift-local-storage/pods/diskmaker-manager-kqrd4/diskmaker-manager/diskmaker-manager/logs/current.log:0
      namespaces/openshift-local-storage/pods/diskmaker-manager-pdnct/diskmaker-manager/diskmaker-manager/logs/current.log:129458
      namespaces/openshift-local-storage/pods/diskmaker-manager-q6s52/diskmaker-manager/diskmaker-manager/logs/current.log:59920
      namespaces/openshift-local-storage/pods/diskmaker-manager-zdstc/diskmaker-manager/diskmaker-manager/logs/current.log:0
      

      In general, on every healthy diskmaker pod there is a loop every 60s that browses all block devices and creates a symlink in /mnt/local-storage/localblock when the conditions are met (a way to verify those symlinks on the node is sketched after the log excerpt below).

      Like this:

      2023-06-20T16:32:54.391660888Z {"level":"info","ts":1687278774.3914862,"logger":"localvolumeset-symlink-controller","msg":"Reconciling LocalVolumeSet","Request.Namespace":"openshift-local-storage","Request.Name":"local-block"}
      2023-06-20T16:32:54.391872431Z I0620 16:32:54.391823   15549 common.go:334] StorageClass "localblock" configured with MountDir "/mnt/local-storage/localblock", HostDir "/mnt/local-storage/localblock", VolumeMode "Block", FsType "", BlockCleanerCommand ["/scripts/quick_reset.sh"]
      2023-06-20T16:32:54.569914265Z {"level":"info","ts":1687278774.5697858,"logger":"localvolumeset-symlink-controller","msg":"match negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"loop0","matcher.Name":"inTypeList"}
      2023-06-20T16:32:54.593580217Z {"level":"info","ts":1687278774.5934741,"logger":"localvolumeset-symlink-controller","msg":"match negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"loop1","matcher.Name":"inTypeList"}
      2023-06-20T16:32:54.614147565Z {"level":"info","ts":1687278774.6140366,"logger":"localvolumeset-symlink-controller","msg":"match negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"loop2","matcher.Name":"inTypeList"}
      2023-06-20T16:32:54.632720733Z {"level":"info","ts":1687278774.6326215,"logger":"localvolumeset-symlink-controller","msg":"match negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"loop3","matcher.Name":"inSizeRange"}
      2023-06-20T16:32:54.637415307Z {"level":"info","ts":1687278774.6373258,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sdb","filter.Name":"noBindMounts"}
      2023-06-20T16:32:54.640344701Z {"level":"info","ts":1687278774.6402955,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sdc","filter.Name":"noBindMounts"}
      2023-06-20T16:32:54.642941844Z {"level":"info","ts":1687278774.6428287,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sdd","filter.Name":"noBindMounts"}
      2023-06-20T16:32:54.645585695Z {"level":"info","ts":1687278774.645547,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sde","filter.Name":"noChildren"}
      2023-06-20T16:32:54.645585695Z {"level":"info","ts":1687278774.645573,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sde1","filter.Name":"noBiosBootInPartLabel"}
      2023-06-20T16:32:54.645784322Z {"level":"info","ts":1687278774.6456826,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sde2","filter.Name":"noFilesystemSignature"}
      2023-06-20T16:32:54.645829280Z {"level":"info","ts":1687278774.645764,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sde3","filter.Name":"noBiosBootInPartLabel"}
      2023-06-20T16:32:54.645867197Z {"level":"info","ts":1687278774.6458266,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"sde4","filter.Name":"noFilesystemSignature"}
      2023-06-20T16:32:54.646096199Z {"level":"info","ts":1687278774.6460469,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"rbd0","filter.Name":"noChildren"}
      2023-06-20T16:32:54.650369603Z {"level":"info","ts":1687278774.6503327,"logger":"localvolumeset-symlink-controller","msg":"match negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"rbd0p1","matcher.Name":"inSizeRange"}
      2023-06-20T16:32:54.653112455Z {"level":"info","ts":1687278774.6530833,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"rbd0p2","filter.Name":"noFilesystemSignature"}
      2023-06-20T16:32:54.653160997Z {"level":"info","ts":1687278774.653139,"logger":"localvolumeset-symlink-controller","msg":"filter negative","Request.Namespace":"openshift-local-storage","Request.Name":"local-block","Device.Name":"rbd0p3","filter.Name":"noFilesystemSignature"}
      2023-06-20T16:33:17.781643365Z {"level":"info","ts":1687278797.781562,"logger":"deleter","msg":"Looking for released PVs to clean up","Request.Namespace":"openshift-local-storage","Request.Name":""}
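      For reference (as noted above), the symlinks created by a healthy diskmaker can be verified directly on the node; a minimal sketch using a debug pod (node name as redacted in this report; on w8 only the link for the one existing PV would be expected):

      $ oc debug node/w8.aaa.bbb.ccc.ddd.eee -- chroot /host ls -l /mnt/local-storage/localblock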
      

      On 6 nodes this is no longer the case (see the grep "Device" output above).
      Affected nodes: w2, w3, w4, w6, w7, w8.

      The LocalVolumeSet looks quite standard:

       

      {"apiVersion":"local.storage.openshift.io/v1alpha1","kind":"LocalVolumeSet","metadata":{"annotations":{},"name":"local-block","namespace":"openshift-local-storage"},"spec":{"deviceInclusionSpec":{"deviceMechanicalProperties":["NonRotational"],"deviceTypes":["disk"],"maxSize":"5Ti","minSize":"2Ti"},"fstype":"ext4","maxDeviceCount":3,"nodeSelector":{"nodeSelectorTerms":[{"matchExpressions":[{"key":"cluster.ocs.openshift.io/openshift-storage","operator":"In","values":[""]}]}]},"storageClassName":"localblock","volumeMode":"Block"}} 

       

      The drive to be ingested on w8 is 3.5T, which matches the size criteria (minSize 2Ti, maxSize 5Ti).
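      For reference, whether the replacement drive matches the deviceInclusionSpec (type disk, NonRotational, 2Ti-5Ti) and carries no filesystem signature can be checked from the node; a minimal sketch (the device name /dev/sdb is illustrative, not confirmed from the must-gather):

      $ oc debug node/w8.aaa.bbb.ccc.ddd.eee -- chroot /host lsblk -o NAME,TYPE,SIZE,ROTA,FSTYPE /dev/sdb
      # expected: TYPE "disk", ROTA 0 (NonRotational), SIZE between 2Ti and 5Ti, FSTYPE empty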

      The only worker node that does not have 2 local-storage PVs associated with it is w8:

      $ for localpv in $(omg get pv -n openshift-local-storage | grep local | awk '{print $1}') ; do \
          DEV=$(omg get pv -n openshift-local-storage $localpv -A -o json | jq '.metadata.annotations["storage.openshift.com/device-name"]') ; \
          NODE=$(omg get pv -n openshift-local-storage $localpv -A -o json | jq '.metadata.labels["kubernetes.io/hostname"]') ; \
          printf "%s,\t%s,\t%s\n" $localpv $NODE $DEV | tr -d \" ; \
        done | sort -k 2
      
      local-pv-a5bfdd0b,      m0.aaa.bbb.ccc.ddd.eee,  sdb
      local-pv-897e5460,      m0.aaa.bbb.ccc.ddd.eee,  sdc
      local-pv-85e43bbf,      m0.aaa.bbb.ccc.ddd.eee,  sdd
      
      local-pv-8d13be14,      m1.aaa.bbb.ccc.ddd.eee,  sdb
      local-pv-d331dc90,      m1.aaa.bbb.ccc.ddd.eee,  sdc
      local-pv-4aec2195,      m1.aaa.bbb.ccc.ddd.eee,  sdd
      
      local-pv-8136a36b,      m2.aaa.bbb.ccc.ddd.eee,  sdb
      local-pv-3a57baf,       m2.aaa.bbb.ccc.ddd.eee,  sdc
      local-pv-37cbee9d,      m2.aaa.bbb.ccc.ddd.eee,  sdd
      
      local-pv-50e2b3fb,      w0.aaa.bbb.ccc.ddd.eee,  sdb
      local-pv-a8f9abf1,      w0.aaa.bbb.ccc.ddd.eee,  sdc
      
      local-pv-5ca1cf5c,      w1.aaa.bbb.ccc.ddd.eee,  sdb
      local-pv-b78af445,      w1.aaa.bbb.ccc.ddd.eee,  sdc
      
      local-pv-870a6554,      w2.aaa.bbb.ccc.ddd.eee,  sdb
      local-pv-b1acc488,      w2.aaa.bbb.ccc.ddd.eee,  sdc
      
      local-pv-2215dc76,      w3.aaa.bbb.ccc.ddd.eee,  sdb
      local-pv-e40a8f20,      w3.aaa.bbb.ccc.ddd.eee,  sdc
      
      local-pv-dc1075c2,      w4.aaa.bbb.ccc.ddd.eee,  sdb
      local-pv-898de341,      w4.aaa.bbb.ccc.ddd.eee,  sdc
      
      local-pv-89c985ac,      w5.aaa.bbb.ccc.ddd.eee,  sdb
      local-pv-cc7509b9,      w5.aaa.bbb.ccc.ddd.eee,  sdc
      
      local-pv-21253b6,       w6.aaa.bbb.ccc.ddd.eee,  sdb
      local-pv-bfb8d2f1,      w6.aaa.bbb.ccc.ddd.eee,  sdc
      
      local-pv-de18c144,      w7.aaa.bbb.ccc.ddd.eee,  sdb
      local-pv-170bbd3f,      w7.aaa.bbb.ccc.ddd.eee,  sdc
      
      local-pv-f5306836,      w8.aaa.bbb.ccc.ddd.eee,  sdc
       

       

       

      We straced two diskmaker pods for 1800s.
      They still communicate on the network, but they make no syscalls related to /dev or /mnt/local-storage/localblock.
      Restarting the pod does not help, so some "state" is saved somewhere.
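      For reference, this kind of trace can be captured from the node; a minimal sketch, assuming host access (e.g. via "oc debug node/...") and that the diskmaker process can be found by name (the pgrep pattern below is illustrative):

      # find the diskmaker process and trace file-related syscalls for 1800s
      $ PID=$(pgrep -f diskmaker | head -n1)
      $ timeout 1800 strace -f -p "$PID" -e trace=file -o /tmp/diskmaker-manager.strace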

      Version-Release number of selected component (if applicable):

      OCP Version : 4.8.39
      OCS Version : 4.8.14
      local-storage : local-storage-operator.4.8.0-202208020324

       

      How reproducible:

      We have not tried to reproduce it. Multiple nodes seem to show the same symptom in the diskmaker logs.

       

      Steps to Reproduce:

      N/A

       

      Actual results:

      The replacement drive on w8 is never symlinked under /mnt/local-storage/localblock and no new PV is created.

       

      Expected results:

      A PV is created by the diskmaker for the replacement drive, and an OSD is later created on it.

       

      Additional info:

      We have not yet tried to reboot node w8; that is planned.

      Node w8 will need to be rebooted because there is still one loopback device (loop1) associated with a drive that has already been replaced (see the backing_file output below and the live check sketched after it).

       
      $ grep ^ sosreport-w8/sys/block/loop*/loop/backing_file
      sosreport-w8/sys/block/loop0/loop/backing_file:/var/lib/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-f5306836/4da40238-0786-473e-bfb5-cf23c6ce1b5d
      sosreport-w8/sys/block/loop1/loop/backing_file:/var/lib/kubelet/plugins/kubernetes.io~local-volume/volumeDevices/local-pv-7bfb9892/09797f5e-b321-47de-ac84-de4fd47ec603 (deleted)
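      For reference, the stale loop device can also be confirmed live on the node before the reboot; a minimal sketch (node name as redacted in this report):

      $ oc debug node/w8.aaa.bbb.ccc.ddd.eee -- chroot /host losetup -l
      # loop1 should still show a backing file under .../local-pv-7bfb9892/ marked "(deleted)"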
