OpenShift Bugs / OCPBUGS-55121

Consumer is losing subscription to events after restarting linuxptp-daemon pod.


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Major
    • 4.14.z, 4.15.z, 4.17.z, 4.16.z, 4.18
    • Networking / ptp
    • Quality / Stability / Reliability
    • Important
    • 4/24: PR is under review; will merge this week
    • CNF RAN Sprint 269, CNF RAN Sprint 270
    • 2
    • Done
    • Release Note Not Required

      Description of problem:
      The consumer loses its event subscriptions after the linuxptp-daemon pod is restarted.

      From CI automation log:

      PTP Recovery [ptp-recovery]
      /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/ptp/tests/ptp_recovery.go:26
        should recover to stable state after delete PTP daemon pod [49738, test_id:49738]
        /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/ptp/tests/ptp_recovery.go:404
        > Enter [BeforeEach] PTP Recovery - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/util/execute/ginkgo.go:10 @ 12/04/24 10:07:00.275
        < Exit [BeforeEach] PTP Recovery - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/util/execute/ginkgo.go:10 @ 12/04/24 10:07:00.275 (0s)
        > Enter [BeforeEach] PTP Recovery - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/ptp/tests/ptp_recovery.go:44 @ 12/04/24 10:07:00.275
      2024/12/04 10:07:05 Reached PTP clock state 1 for all interfaces
        < Exit [BeforeEach] PTP Recovery - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/ptp/tests/ptp_recovery.go:44 @ 12/04/24 10:07:05.445 (5.17s)
        > Enter [It] should recover to stable state after delete PTP daemon pod - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/ptp/tests/ptp_recovery.go:404 @ 12/04/24 10:07:05.445
        STEP: validate event [LOCKED] after killing the publisher pod - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/ptp/tests/ptp_recovery.go:415 @ 12/04/24 10:07:05.455
      2024/12/04 10:07:05 Waiting for node to be reachable via ping
      2024/12/04 10:07:05 Node helix66.lab.eng.rdu2.redhat.com is reachable
      2024/12/04 10:07:05 Waiting for cluster to be reachable
      2024/12/04 10:07:08 Cluster is reachable
      2024/12/04 10:07:08 Waiting up to 45m0s for all pods in namespaces [openshift-ptp] to be healthy for 40s
      2024/12/04 10:07:53 All pods in namespaces [openshift-ptp] are healthy for 40s
        STEP: validate all ptp clocks are in LOCKED state in ptp metrics - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/ptp/tests/ptp_recovery.go:427 @ 12/04/24 10:07:53.516
      2024/12/04 10:08:08 Reached PTP clock state 1 for all interfaces for 10s seconds
        < Exit [It] should recover to stable state after delete PTP daemon pod - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/ptp/tests/ptp_recovery.go:404 @ 12/04/24 10:08:08.886 (1m3.441s)
        > Enter [AfterEach] PTP Recovery - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/ptp/tests/ptp_recovery.go:57 @ 12/04/24 10:08:08.886
      2024/12/04 10:08:08 cloud-events  exists 
      2024/12/04 10:08:08 found cloud-event-consumer container
      2024/12/04 10:08:08 Logs from last 1m0s for pod cloud-consumer-deployment-7667ff8cd4-zbw5r container cloud-event-consumer:
      time="2024-12-04T15:07:13Z" level=error msg="CurrentState:error 404 from url http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/cluster/node/helix66.lab.eng.rdu2.redhat.com/sync/gnss-status/gnss-sync-status/CurrentState, "
      time="2024-12-04T15:07:13Z" level=error msg="CurrentState:error 404 from url http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/cluster/node/helix66.lab.eng.rdu2.redhat.com/sync/sync-status/sync-state/CurrentState, "
      time="2024-12-04T15:07:13Z" level=error msg="CurrentState:error 404 from url http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/cluster/node/helix66.lab.eng.rdu2.redhat.com/sync/sync-status/os-clock-sync-state/CurrentState, "
      time="2024-12-04T15:07:13Z" level=error msg="CurrentState:error 404 from url http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/cluster/node/helix66.lab.eng.rdu2.redhat.com/sync/ptp-status/clock-class/CurrentState, "
      time="2024-12-04T15:07:13Z" level=error msg="CurrentState:error 404 from url http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/cluster/node/helix66.lab.eng.rdu2.redhat.com/sync/ptp-status/lock-state/CurrentState, "
      time="2024-12-04T15:08:08Z" level=info msg="checking for rest service health"
      time="2024-12-04T15:08:08Z" level=info msg="health check http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/health"
      time="2024-12-04T15:08:08Z" level=info msg="rest service returned healthy status"
      2024/12/04 10:08:08 bring up all slave and master interfaces on all ptp pods
      2024/12/04 10:08:12 Restore ptpconfigs to original specs
      2024/12/04 10:08:12 Check ptp clocks are in sync
      2024/12/04 10:08:27 Reached PTP clock state 1 for all interfaces for 10s seconds    
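The 404 lines in the consumer log above can be pulled out mechanically when triaging a CI run. The following is a minimal sketch, assuming the consumer keeps emitting errors in exactly the format shown in this report; the function name `failed_current_state` and the `SAMPLE_LOG` constant are hypothetical and are not part of the consumer or the test suite.

```python
import re

# Matches the consumer's error lines as they appear in this report, e.g.
#   time="..." level=error msg="CurrentState:error 404 from url http://.../CurrentState, "
# Assumption: the log format stays exactly as shown in the report.
LINE_RE = re.compile(
    r'time="(?P<time>[^"]+)" level=error '
    r'msg="CurrentState:error (?P<status>\d{3}) from url (?P<url>\S+?),'
)

API_PREFIX = "/api/ocloudNotifications/v2"
SUFFIX = "/CurrentState"

# Two error lines and one info line copied from the report.
SAMPLE_LOG = '''
time="2024-12-04T15:07:13Z" level=error msg="CurrentState:error 404 from url http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/cluster/node/helix66.lab.eng.rdu2.redhat.com/sync/gnss-status/gnss-sync-status/CurrentState, "
time="2024-12-04T15:07:13Z" level=error msg="CurrentState:error 404 from url http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/cluster/node/helix66.lab.eng.rdu2.redhat.com/sync/ptp-status/lock-state/CurrentState, "
time="2024-12-04T15:08:08Z" level=info msg="rest service returned healthy status"
'''


def failed_current_state(log_text):
    """Return (timestamp, http_status, resource_address) for each failed query."""
    failures = []
    for match in LINE_RE.finditer(log_text):
        # Keep only the part of the URL after the API version prefix.
        path = match.group("url").split(API_PREFIX, 1)[-1]
        if path.endswith(SUFFIX):
            path = path[: -len(SUFFIX)]
        failures.append((match.group("time"), int(match.group("status")), path))
    return failures


if __name__ == "__main__":
    for timestamp, status, resource in failed_current_state(SAMPLE_LOG):
        print(timestamp, status, resource)
```

Grouping the result by resource address makes it easy to see whether all five resources (gnss-sync-status, sync-state, os-clock-sync-state, clock-class, lock-state) were affected in a given run.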

      Version-Release number of selected component (if applicable):

          

      How reproducible:

      100%

      Steps to Reproduce:

          1. Deploy a single-node OpenShift (SNO) cluster and the cloud events consumer
          2. Monitor consumer events
          3. Restart linuxptp-daemon pod
          4. Monitor consumer events     
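Steps 2 and 4 can also be checked by querying the publisher directly rather than reading the consumer log. The sketch below is an assumption-laden illustration: the URL layout and the five resource paths are taken from the log lines in this report, while the helper names (`current_state_urls`, `http_status`, `missing_resources`) are hypothetical. It must run from somewhere that can reach the publisher service, for example inside the cluster.

```python
import urllib.error
import urllib.request

# Resource paths copied from the 404 lines in this report.
RESOURCES = [
    "sync/gnss-status/gnss-sync-status",
    "sync/sync-status/sync-state",
    "sync/sync-status/os-clock-sync-state",
    "sync/ptp-status/clock-class",
    "sync/ptp-status/lock-state",
]


def current_state_urls(base_url, node_name):
    """Build one CurrentState URL per resource, mirroring the URLs in the log."""
    base = base_url.rstrip("/")
    return [f"{base}/cluster/node/{node_name}/{r}/CurrentState" for r in RESOURCES]


def http_status(url, timeout=5.0):
    """Return the HTTP status of a GET; connection failures still raise URLError."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def missing_resources(base_url, node_name, get_status=http_status):
    """Return the CurrentState URLs that answer 404, as in the reported failure."""
    return [u for u in current_state_urls(base_url, node_name) if get_status(u) == 404]


if __name__ == "__main__":
    # Values taken from the report; adjust for another cluster.
    base = ("http://ptp-event-publisher-service-helix66.openshift-ptp"
            ".svc.cluster.local:9043/api/ocloudNotifications/v2")
    for url in missing_resources(base, "helix66.lab.eng.rdu2.redhat.com"):
        print("404:", url)
```

Running it once before and once after restarting the linuxptp-daemon pod gives a direct comparison: an empty list before and a non-empty list after would match the behavior described in this report.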

      Actual results:
      After the restart, the consumer's event subscriptions are sometimes missing, and its CurrentState queries return 404:

      2024/12/04 10:08:08 Logs from last 1m0s for pod cloud-consumer-deployment-7667ff8cd4-zbw5r container cloud-event-consumer:
      time="2024-12-04T15:07:13Z" level=error msg="CurrentState:error 404 from url http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/cluster/node/helix66.lab.eng.rdu2.redhat.com/sync/gnss-status/gnss-sync-status/CurrentState, "
      time="2024-12-04T15:07:13Z" level=error msg="CurrentState:error 404 from url http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/cluster/node/helix66.lab.eng.rdu2.redhat.com/sync/sync-status/sync-state/CurrentState, "
      time="2024-12-04T15:07:13Z" level=error msg="CurrentState:error 404 from url http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/cluster/node/helix66.lab.eng.rdu2.redhat.com/sync/sync-status/os-clock-sync-state/CurrentState, "
      time="2024-12-04T15:07:13Z" level=error msg="CurrentState:error 404 from url http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/cluster/node/helix66.lab.eng.rdu2.redhat.com/sync/ptp-status/clock-class/CurrentState, "
      time="2024-12-04T15:07:13Z" level=error msg="CurrentState:error 404 from url http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/cluster/node/helix66.lab.eng.rdu2.redhat.com/sync/ptp-status/lock-state/CurrentState, "

      Expected results:

      Event subscriptions should persist across the restart, and events should continue to appear in the consumer log.

      Additional info:

          

              jacding@redhat.com Jack Ding
              bblock@redhat.com Bonnie Block
              Bonnie Block