-
Bug
-
Resolution: Done-Errata
-
Major
-
None
-
4.14.z, 4.15.z, 4.17.z, 4.16.z, 4.18
-
Quality / Stability / Reliability
-
False
-
-
None
-
Important
-
None
-
4/24: PR is under review, will merge this week
-
None
-
None
-
CNF RAN Sprint 269, CNF RAN Sprint 270
-
2
-
Done
-
Release Note Not Required
-
N/A
-
None
-
None
-
None
-
None
Description of problem:
Consumer loses subscription to events after restarting linuxptp-daemon pod.
From CI automation log:
PTP Recovery [ptp-recovery]
/var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/ptp/tests/ptp_recovery.go:26
should recover to stable state after delete PTP daemon pod [49738, test_id:49738]
/var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/ptp/tests/ptp_recovery.go:404
> Enter [BeforeEach] PTP Recovery - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/util/execute/ginkgo.go:10 @ 12/04/24 10:07:00.275
< Exit [BeforeEach] PTP Recovery - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/util/execute/ginkgo.go:10 @ 12/04/24 10:07:00.275 (0s)
> Enter [BeforeEach] PTP Recovery - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/ptp/tests/ptp_recovery.go:44 @ 12/04/24 10:07:00.275
2024/12/04 10:07:05 Reached PTP clock state 1 for all interfaces
< Exit [BeforeEach] PTP Recovery - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/ptp/tests/ptp_recovery.go:44 @ 12/04/24 10:07:05.445 (5.17s)
> Enter [It] should recover to stable state after delete PTP daemon pod - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/ptp/tests/ptp_recovery.go:404 @ 12/04/24 10:07:05.445
STEP: validate event [LOCKED] after killing the publisher pod - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/ptp/tests/ptp_recovery.go:415 @ 12/04/24 10:07:05.455
2024/12/04 10:07:05 Waiting for node to be reachable via ping
2024/12/04 10:07:05 Node helix66.lab.eng.rdu2.redhat.com is reachable
2024/12/04 10:07:05 Waiting for cluster to be reachable
2024/12/04 10:07:08 Cluster is reachable
2024/12/04 10:07:08 Waiting up to 45m0s for all pods in namespaces [openshift-ptp] to be healthy for 40s
2024/12/04 10:07:53 All pods in namespaces [openshift-ptp] are healthy for 40s
STEP: validate all ptp clocks are in LOCKED state in ptp metrics - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/ptp/tests/ptp_recovery.go:427 @ 12/04/24 10:07:53.516
2024/12/04 10:08:08 Reached PTP clock state 1 for all interfaces for 10s seconds
< Exit [It] should recover to stable state after delete PTP daemon pod - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/ptp/tests/ptp_recovery.go:404 @ 12/04/24 10:08:08.886 (1m3.441s)
> Enter [AfterEach] PTP Recovery - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/ptp/tests/ptp_recovery.go:57 @ 12/04/24 10:08:08.886
2024/12/04 10:08:08 cloud-events exists
2024/12/04 10:08:08 found cloud-event-consumer container
2024/12/04 10:08:08 Logs from last 1m0s for pod cloud-consumer-deployment-7667ff8cd4-zbw5r container cloud-event-consumer:
time="2024-12-04T15:07:13Z" level=error msg="CurrentState:error 404 from url http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/cluster/node/helix66.lab.eng.rdu2.redhat.com/sync/gnss-status/gnss-sync-status/CurrentState, "
time="2024-12-04T15:07:13Z" level=error msg="CurrentState:error 404 from url http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/cluster/node/helix66.lab.eng.rdu2.redhat.com/sync/sync-status/sync-state/CurrentState, "
time="2024-12-04T15:07:13Z" level=error msg="CurrentState:error 404 from url http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/cluster/node/helix66.lab.eng.rdu2.redhat.com/sync/sync-status/os-clock-sync-state/CurrentState, "
time="2024-12-04T15:07:13Z" level=error msg="CurrentState:error 404 from url http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/cluster/node/helix66.lab.eng.rdu2.redhat.com/sync/ptp-status/clock-class/CurrentState, "
time="2024-12-04T15:07:13Z" level=error msg="CurrentState:error 404 from url http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/cluster/node/helix66.lab.eng.rdu2.redhat.com/sync/ptp-status/lock-state/CurrentState, "
time="2024-12-04T15:08:08Z" level=info msg="checking for rest service health"
time="2024-12-04T15:08:08Z" level=info msg="health check http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/health"
time="2024-12-04T15:08:08Z" level=info msg="rest service returned healthy status"
2024/12/04 10:08:08 bring up all slave and master interfaces on all ptp pods
2024/12/04 10:08:12 Restore ptpconfigs to original specs
2024/12/04 10:08:12 Check ptp clocks are in sync
2024/12/04 10:08:27 Reached PTP clock state 1 for all interfaces for 10s seconds
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Deploy SNO and cloud events consumer
2. Monitor consumer events
3. Restart linuxptp-daemon pod
4. Monitor consumer events
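The monitoring in steps 2 and 4 can be approximated by probing each resource's CurrentState endpoint before and after the restart. The following is a minimal sketch, not the CI suite's own check: the URL layout, port 9043, and the five resource paths are copied from the consumer log in this report, while the names `RESOURCES`, `current_state_url`, and `probe` are hypothetical, and the reading of a 404 as a lost subscription is an assumption drawn from the report.

```python
from urllib import error, request

# Resource paths copied from the 404 lines in the consumer log of this report.
RESOURCES = [
    "sync/gnss-status/gnss-sync-status",
    "sync/sync-status/sync-state",
    "sync/sync-status/os-clock-sync-state",
    "sync/ptp-status/clock-class",
    "sync/ptp-status/lock-state",
]


def current_state_url(service_host, node, resource, port=9043):
    """Build a CurrentState URL in the layout seen in the consumer log."""
    return (
        f"http://{service_host}:{port}/api/ocloudNotifications/v2"
        f"/cluster/node/{node}/{resource}/CurrentState"
    )


def probe(url, timeout=5.0):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except error.HTTPError as exc:
        # Non-2xx answers (such as the 404s in this report) land here.
        return exc.code
    except (error.URLError, OSError):
        # DNS failure, refused connection, or timeout.
        return None
```

Run from a pod that can resolve the publisher service, calling `probe(current_state_url(host, node, r))` for each `r` in `RESOURCES` before step 3 and again after step 4 would show which resources, if any, start returning 404.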
Actual results:
Sometimes the event subscriptions are missing:
2024/12/04 10:08:08 Logs from last 1m0s for pod cloud-consumer-deployment-7667ff8cd4-zbw5r container cloud-event-consumer:
time="2024-12-04T15:07:13Z" level=error msg="CurrentState:error 404 from url http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/cluster/node/helix66.lab.eng.rdu2.redhat.com/sync/gnss-status/gnss-sync-status/CurrentState, "
time="2024-12-04T15:07:13Z" level=error msg="CurrentState:error 404 from url http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/cluster/node/helix66.lab.eng.rdu2.redhat.com/sync/sync-status/sync-state/CurrentState, "
time="2024-12-04T15:07:13Z" level=error msg="CurrentState:error 404 from url http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/cluster/node/helix66.lab.eng.rdu2.redhat.com/sync/sync-status/os-clock-sync-state/CurrentState, "
time="2024-12-04T15:07:13Z" level=error msg="CurrentState:error 404 from url http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/cluster/node/helix66.lab.eng.rdu2.redhat.com/sync/ptp-status/clock-class/CurrentState, "
time="2024-12-04T15:07:13Z" level=error msg="CurrentState:error 404 from url http://ptp-event-publisher-service-helix66.openshift-ptp.svc.cluster.local:9043/api/ocloudNotifications/v2/cluster/node/helix66.lab.eng.rdu2.redhat.com/sync/ptp-status/lock-state/CurrentState, "
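To triage a consumer log like the one above, the failing resources can be pulled out mechanically. This is a rough sketch under the assumption that error lines keep the exact `CurrentState:error <status> from url <url>/CurrentState,` shape shown in this report; the pattern and the function name `failed_resources` are illustrative and not part of the consumer or the test suite.

```python
import re

# Matches the error-line shape seen in the consumer log of this report.
# \s+ after "url" tolerates a line wrap between "url" and the address.
PATTERN = re.compile(
    r'level=error msg="CurrentState:error (?P<status>\d{3}) from url\s+'
    r'(?P<url>\S+?)/CurrentState,'
)


def failed_resources(log_text):
    """Return (status, node, resource) tuples for each CurrentState error."""
    results = []
    for match in PATTERN.finditer(log_text):
        # Everything after /cluster/node/ is "<node>/<resource path>".
        _, _, tail = match.group("url").partition("/cluster/node/")
        node, _, resource = tail.partition("/")
        results.append((int(match.group("status")), node, resource))
    return results
```

Applied to the excerpt above, this would list the five resources (GNSS sync status, sync state, OS clock sync state, clock class, lock state) whose CurrentState requests returned 404.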
Expected results:
Event subscriptions should persist across the linuxptp-daemon pod restart, and events should continue to appear in the consumer log.
Additional info:
- blocks
-
OCPBUGS-55511 Consumer is losing subscription to events after restarting linuxptp-daemon pod.
-
- Closed
-
- clones
-
OCPBUGS-45680 Consumer is losing subscription to events after restarting linuxptp-daemon pod.
-
- Closed
-
- is blocked by
-
OCPBUGS-45680 Consumer is losing subscription to events after restarting linuxptp-daemon pod.
-
- Closed
-
- links to
-
RHBA-2025:4712 OpenShift Container Platform 4.18.z bug fix update