Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-53342

Fatal error concurrent map write results in sidecar crash and losing past events

XMLWordPrintable

    • Quality / Stability / Reliability
    • False
    • Hide

      None

      Show
      None
    • None
    • Important
    • None
    • 3/20: Fix is on it way to prevent sidecar crash due to concurrent write
    • None
    • None
    • None
    • Done
    • Release Note Not Required
    • N/A
    • None
    • None
    • None

      Description

      Description of problem: On frequent restarts of node .

      The cloud-event-proxy experienced a fatal error, causing a loss of awareness of PTP Clock Class data. As a result, when a client later requested the Clock Class event, it returned a zero value with an "event not found" error.

      Timeline of Events

      • Clock Class Announced

              Timestamp: 2025-03-18T09:36:22.660908894

              Event: The system announced a Clock Class change.

              Impact: This should have updated the cloud-event-proxy with the latest Clock Class data.

      •  cloud-event-proxy Fatal Error

              Timestamp: 2025-03-18T09:36:22.654826461

              Event: cloud-event-proxy crashed, encountering a fatal error.

              Duration: The failure persisted until 2025-03-18T09:36:22.664739733.

              Impact: During this period, cloud-event-proxy was non-functional, likely leading to data loss with PTP Clock Class updates.

      • Sidecar Recovery & Transport Setup

              Timestamp: 2025-03-18T09:36:25.440540718

              Event: The sidecar component initiated recovery after the fatal crash .

              Impact: This marks the re-establishment of event streaming, but prior events (such as the Clock Class announcement) were not successfully restored and there were no new updated clockclass announced by the daemon.

      •  Client Requested Clock Class Event

              Timestamp: 2025-03-18T11:04:50.030682807Z

              Event: A client requested the Clock Class event.

              Expected Behavior: The request should return the most recent Clock Class announcement. 

             Actual Behaviour:   Event Not Found Error

              Timestamp: 2025-03-18T11:04:00.386282859

              Event: The system returned an "event not found" error.

              Cause: Since the Clock Class announcement occurred before  the cloud-event-proxy crash, the proxy was unaware of PTP Clock Class updates after it was recovered.

              Impact: The system defaulted to a zero value for the requested Clock Class.

       

      Root Cause Analysis

          The cloud-event-proxy fatal error caused event tracking loss.

          The Clock Class announcement happened  before the crash, but the event proxy failed to retain this update on recovery.

          When the client requested the event later, it returned zero, leading to an "event not found" error.
      Fatal Crash error details: 

      2025-03-18T09:36:22.211309058+00:00 stderr F time="2025-03-18T09:36:22Z" level=info msg="updating new subscriber in configmap"
      2025-03-18T09:36:22.213575269+00:00 stderr F time="2025-03-18T09:36:22Z" level=info msg="configmap updated"
      2025-03-18T09:36:22.213610918+00:00 stderr F time="2025-03-18T09:36:22Z" level=info msg="subscriber saved in configmap   EndpointUri: http://ptp-event-publisher-service.openshift-ptp.svc.cluster.local:9043\n  clientID: 46ccdb8f-80d4-3a54-942f-08c73a37b26e\n  sub :{ {  EndpointUri: \n  UriLocation: \n  SubscriptionId: 64cea808-8152-406b-83a9-384ffc1d4f07\n  Resource: /cluster/node/abc/sync/ptp-status/ptp-clock-class-change\n\n }, {  EndpointUri: \n  UriLocation: \n  SubscriptionId: 67abff17-a492-46fd-a73f-8c0b88876696\n  Resource: /cluster/node/abc/sync/sync-status/os-clock-sync-state\n\n }, {  EndpointUri: \n  UriLocation: \n  SubscriptionId: 7576c110-d275-4fe4-81e6-ad8c1efdbd5d\n  Resource: /cluster/node/abc/sync/ptp-status/lock-state\n\n }, }" 
          2025-03-18T09:36:22.211309058+00:00 stderr F time="2025-03-18T09:36:22Z" level=info msg="updating new subscriber in configmap"
      2025-03-18T09:36:22.213575269+00:00 stderr F time="2025-03-18T09:36:22Z" level=info msg="configmap updated"
      2025-03-18T09:36:22.213610918+00:00 stderr F time="2025-03-18T09:36:22Z" level=info msg="subscriber saved in configmap   EndpointUri: http://ptp-event-publisher-service-abc.openshift-ptp.svc.cluster.local:9043\n  clientID: 46ccdb8f-80d4-3a54-942f-08c73a37b26e\n  sub :{ {  EndpointUri: \n  UriLocation: \n  SubscriptionId: 64cea808-8152-406b-83a9-384ffc1d4f07\n  Resource: /cluster/node/abc/sync/ptp-status/ptp-clock-class-change\n\n }, {  EndpointUri: \n  UriLocation: \n  SubscriptionId: 67abff17-a492-46fd-a73f-8c0b88876696\n  Resource: /cluster/node/abc/sync/sync-status/os-clock-sync-state\n\n }, {  EndpointUri: \n  UriLocation: \n  SubscriptionId: 7576c110-d275-4fe4-81e6-ad8c1efdbd5d\n  Resource: /cluster/node/abc/sync/ptp-status/lock-state\n\n }, }"
      2025-03-18T09:36:22.654826461+00:00 stderr P fatal error:
      2025-03-18T09:36:22.654874485+00:00 stderr P concurrent map writes
      2025-03-18T09:36:22.654896711+00:00 stderr F
      2025-03-18T09:36:22.658309955+00:00 stderr F
      2025-03-18T09:36:22.658309955+00:00 stderr F goroutine 151 [running]:
      2025-03-18T09:36:22.658309955+00:00 stderr F github.com/redhat-cne/cloud-event-proxy/plugins/ptp_operator/metrics.(*PTPEventManager).GetStats(...)
      2025-03-18T09:36:22.658309955+00:00 stderr F    /go/src/github.com/redhat-cne/cloud-event-proxy/plugins/ptp_operator/metrics/manager.go:161
      2025-03-18T09:36:22.658309955+00:00 stderr F github.com/redhat-cne/cloud-event-proxy/plugins/ptp_operator/metrics.(*PTPEventManager).ExtractMetrics(0xc000765d40, {0xc0004e20c0, 0x39})
      2025-03-18T09:36:22.658309955+00:00 stderr F    /go/src/github.com/redhat-cne/cloud-event-proxy/plugins/ptp_operator/metrics/metrics.go:109 +0x326
      2025-03-18T09:36:22.658309955+00:00 stderr F plugin/unnamed-c25f887eb2f7c1227773d9a8acc8a929d0a3f192.processMessages({0x23df4e0?, 0xc0009104a8?})
      2025-03-18T09:36:22.658309955+00:00 stderr F    /go/src/github.com/redhat-cne/cloud-event-proxy/plugins/ptp_operator/ptp_operator_plugin.go:621 +0xa7
      2025-03-18T09:36:22.658309955+00:00 stderr F created by plugin/unnamed-c25f887eb2f7c1227773d9a8acc8a929d0a3f192.listenToSocket in goroutine 132
      2025-03-18T09:36:22.658309955+00:00 stderr F    /go/src/github.com/redhat-cne/cloud-event-proxy/plugins/ptp_operator/ptp_operator_plugin.go:607 +0x245
      2025-03-18T09:36:22.658309955+00:00 stderr F
      2025-03-18T09:36:22.658309955+00:00 stderr F goroutine 1 [select]:
      2025-03-18T09:36:22.658309955+00:00 stderr F main.ProcessOutChannel(0x21869c7?, 0xc00013e000)  

      Version-Release number of selected component (if applicable):

      4.18.z    

      How reproducible:

          Restart node multiple times -This is seen after couple of reboots ( gracefully - Cold reboot) of the server 
       

              aputtur@redhat.com Aneesh Puttur
              aputtur@redhat.com Aneesh Puttur
              None
              None
              Bonnie Block Bonnie Block
              None
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

                Created:
                Updated:
                Resolved: