-
Bug
-
Resolution: Done-Errata
-
Major
-
None
-
4.18.z
-
Quality / Stability / Reliability
-
False
-
-
None
-
Important
-
None
-
3/20: Fix is on it way to prevent sidecar crash due to concurrent write
-
None
-
None
-
None
-
Done
-
Release Note Not Required
-
N/A
-
None
-
None
-
-
None
Description
Description of problem: On frequent restarts of node .
The cloud-event-proxy experienced a fatal error, causing a loss of awareness of PTP Clock Class data. As a result, when a client later requested the Clock Class event, it returned a zero value with an "event not found" error.
Timeline of Events
- Clock Class Announced
Timestamp: 2025-03-18T09:36:22.660908894
Event: The system announced a Clock Class change.
Impact: This should have updated the cloud-event-proxy with the latest Clock Class data.
- cloud-event-proxy Fatal Error
Timestamp: 2025-03-18T09:36:22.654826461
Event: cloud-event-proxy crashed, encountering a fatal error.
Duration: The failure persisted until 2025-03-18T09:36:22.664739733.
Impact: During this period, cloud-event-proxy was non-functional, likely leading to data loss with PTP Clock Class updates.
- Sidecar Recovery & Transport Setup
Timestamp: 2025-03-18T09:36:25.440540718
Event: The sidecar component initiated recovery after the fatal crash .
Impact: This marks the re-establishment of event streaming, but prior events (such as the Clock Class announcement) were not successfully restored and there were no new updated clockclass announced by the daemon.
- Client Requested Clock Class Event
Timestamp: 2025-03-18T11:04:50.030682807Z
Event: A client requested the Clock Class event.
Expected Behavior: The request should return the most recent Clock Class announcement.
Actual Behaviour: Event Not Found Error
Timestamp: 2025-03-18T11:04:00.386282859
Event: The system returned an "event not found" error.
Cause: Since the Clock Class announcement occurred before the cloud-event-proxy crash, the proxy was unaware of PTP Clock Class updates after it was recovered.
Impact: The system defaulted to a zero value for the requested Clock Class.
Root Cause Analysis
The cloud-event-proxy fatal error caused event tracking loss.
The Clock Class announcement happened before the crash, but the event proxy failed to retain this update on recovery.
When the client requested the event later, it returned zero, leading to an "event not found" error.
Fatal Crash error details:
2025-03-18T09:36:22.211309058+00:00 stderr F time="2025-03-18T09:36:22Z" level=info msg="updating new subscriber in configmap" 2025-03-18T09:36:22.213575269+00:00 stderr F time="2025-03-18T09:36:22Z" level=info msg="configmap updated" 2025-03-18T09:36:22.213610918+00:00 stderr F time="2025-03-18T09:36:22Z" level=info msg="subscriber saved in configmap EndpointUri: http://ptp-event-publisher-service.openshift-ptp.svc.cluster.local:9043\n clientID: 46ccdb8f-80d4-3a54-942f-08c73a37b26e\n sub :{ { EndpointUri: \n UriLocation: \n SubscriptionId: 64cea808-8152-406b-83a9-384ffc1d4f07\n Resource: /cluster/node/abc/sync/ptp-status/ptp-clock-class-change\n\n }, { EndpointUri: \n UriLocation: \n SubscriptionId: 67abff17-a492-46fd-a73f-8c0b88876696\n Resource: /cluster/node/abc/sync/sync-status/os-clock-sync-state\n\n }, { EndpointUri: \n UriLocation: \n SubscriptionId: 7576c110-d275-4fe4-81e6-ad8c1efdbd5d\n Resource: /cluster/node/abc/sync/ptp-status/lock-state\n\n }, }"
2025-03-18T09:36:22.211309058+00:00 stderr F time="2025-03-18T09:36:22Z" level=info msg="updating new subscriber in configmap" 2025-03-18T09:36:22.213575269+00:00 stderr F time="2025-03-18T09:36:22Z" level=info msg="configmap updated" 2025-03-18T09:36:22.213610918+00:00 stderr F time="2025-03-18T09:36:22Z" level=info msg="subscriber saved in configmap EndpointUri: http://ptp-event-publisher-service-abc.openshift-ptp.svc.cluster.local:9043\n clientID: 46ccdb8f-80d4-3a54-942f-08c73a37b26e\n sub :{ { EndpointUri: \n UriLocation: \n SubscriptionId: 64cea808-8152-406b-83a9-384ffc1d4f07\n Resource: /cluster/node/abc/sync/ptp-status/ptp-clock-class-change\n\n }, { EndpointUri: \n UriLocation: \n SubscriptionId: 67abff17-a492-46fd-a73f-8c0b88876696\n Resource: /cluster/node/abc/sync/sync-status/os-clock-sync-state\n\n }, { EndpointUri: \n UriLocation: \n SubscriptionId: 7576c110-d275-4fe4-81e6-ad8c1efdbd5d\n Resource: /cluster/node/abc/sync/ptp-status/lock-state\n\n }, }" 2025-03-18T09:36:22.654826461+00:00 stderr P fatal error: 2025-03-18T09:36:22.654874485+00:00 stderr P concurrent map writes 2025-03-18T09:36:22.654896711+00:00 stderr F 2025-03-18T09:36:22.658309955+00:00 stderr F 2025-03-18T09:36:22.658309955+00:00 stderr F goroutine 151 [running]: 2025-03-18T09:36:22.658309955+00:00 stderr F github.com/redhat-cne/cloud-event-proxy/plugins/ptp_operator/metrics.(*PTPEventManager).GetStats(...) 2025-03-18T09:36:22.658309955+00:00 stderr F /go/src/github.com/redhat-cne/cloud-event-proxy/plugins/ptp_operator/metrics/manager.go:161 2025-03-18T09:36:22.658309955+00:00 stderr F github.com/redhat-cne/cloud-event-proxy/plugins/ptp_operator/metrics.(*PTPEventManager).ExtractMetrics(0xc000765d40, {0xc0004e20c0, 0x39}) 2025-03-18T09:36:22.658309955+00:00 stderr F /go/src/github.com/redhat-cne/cloud-event-proxy/plugins/ptp_operator/metrics/metrics.go:109 +0x326 2025-03-18T09:36:22.658309955+00:00 stderr F plugin/unnamed-c25f887eb2f7c1227773d9a8acc8a929d0a3f192.processMessages({0x23df4e0?, 0xc0009104a8?}) 2025-03-18T09:36:22.658309955+00:00 stderr F /go/src/github.com/redhat-cne/cloud-event-proxy/plugins/ptp_operator/ptp_operator_plugin.go:621 +0xa7 2025-03-18T09:36:22.658309955+00:00 stderr F created by plugin/unnamed-c25f887eb2f7c1227773d9a8acc8a929d0a3f192.listenToSocket in goroutine 132 2025-03-18T09:36:22.658309955+00:00 stderr F /go/src/github.com/redhat-cne/cloud-event-proxy/plugins/ptp_operator/ptp_operator_plugin.go:607 +0x245 2025-03-18T09:36:22.658309955+00:00 stderr F 2025-03-18T09:36:22.658309955+00:00 stderr F goroutine 1 [select]: 2025-03-18T09:36:22.658309955+00:00 stderr F main.ProcessOutChannel(0x21869c7?, 0xc00013e000)
Version-Release number of selected component (if applicable):
4.18.z
How reproducible:
Restart node multiple times -This is seen after couple of reboots ( gracefully - Cold reboot) of the server
- blocks
-
OCPBUGS-54964 Fatal error concurrent map write results in sidecar crash and losing past events
-
- Closed
-
- depends on
-
OCPBUGS-54160 [4.19] Fatal error concurrent map write results in sidecar crash and losing past events
-
- Closed
-
- is blocked by
-
OCPBUGS-54160 [4.19] Fatal error concurrent map write results in sidecar crash and losing past events
-
- Closed
-
- is cloned by
-
OCPBUGS-54160 [4.19] Fatal error concurrent map write results in sidecar crash and losing past events
-
- Closed
-
-
OCPBUGS-54964 Fatal error concurrent map write results in sidecar crash and losing past events
-
- Closed
-
- is duplicated by
-
OCPBUGS-54209 [4.19] Fatal error concurrent map write results in sidecar crash and losing past events
-
- Closed
-
- links to
-
RHBA-2025:3576 OpenShift Container Platform 4.18.z extras update