Description of problem:
The vsphere-problem-detector-operator pod crashed and restarted with a fatal error: concurrent map writes. This appears to be a race condition in the SetCbtData function when multiple nodes are being scanned for Changed Block Tracking (CBT) status simultaneously.
Version-Release number of selected component (if applicable):
4.18.28
Relevant Output:
oc get pods vsphere-problem-detector-operator-5cdfbbd7d8-gmwdw
NAME                                                 READY   STATUS    RESTARTS   AGE
vsphere-problem-detector-operator-5cdfbbd7d8-gmwdw   1/1     Running   1          28d

oc logs vsphere-problem-detector-operator-5cdfbbd7d8-gmwdw -p
2026-01-02T07:54:52.698943047Z I0102 07:54:52.698882 1 node_cbt.go:52] Property not found for node cb-w1.cbdr.iiabank.com.jo
2026-01-02T07:54:52.698943047Z fatal error: concurrent map writes
2026-01-02T07:54:52.698943047Z I0102 07:54:52.698889 1 vsphere_check.go:321] CollectNodeCBT:cb-w3.cbdr.iiabank.com.jo passed
2026-01-02T07:54:52.702159386Z
2026-01-02T07:54:52.702159386Z goroutine 1610214 [running]:
2026-01-02T07:54:52.702206787Z github.com/openshift/vsphere-problem-detector/pkg/util.(*ClusterInfo).SetCbtData(0xc000f808e0?, {0x2e99d0e?, 0x1?})
2026-01-02T07:54:52.702206787Z github.com/openshift/vsphere-problem-detector/pkg/util/cluster_info.go:151 +0xa5
2026-01-02T07:54:52.702206787Z github.com/openshift/vsphere-problem-detector/pkg/check.(*CollectNodeCBT).CheckNode(0xc00127f040?, 0xc0014ae780, 0xc000d5f508, 0xc000c42008)
2026-01-02T07:54:52.702206787Z github.com/openshift/vsphere-problem-detector/pkg/check/node_cbt.go:53 +0x3d8
2026-01-02T07:54:52.702206787Z github.com/openshift/vsphere-problem-detector/pkg/operator.runSingleNodeSingleCheck(0xc0014ae780, 0xc0016f1840, 0xc000d5f508, 0xc000c42008, {0x336dc50, 0x4daf080})
Actual results:
The Go runtime detects unsafe concurrent map access and terminates the process, leading to a pod restart.
Expected results:
The pod should not crash. Calls to SetCbtData should be safe when CBT checks for multiple nodes run concurrently.
Additional info:
I will upload the must-gather and share further details soon.