-
Bug
-
Resolution: Unresolved
-
Undefined
-
None
-
4.14.z
-
None
-
No
-
False
-
-
-
Description of problem:
On a dual-WPC grandmaster setup where both cards are properly synced, we see them lose sync to the GNSS / 1-PPS signal quite often. This is shown in the linuxptp-daemon-containers logs with the following fingerprint: $ grep ts2phc daemon.log |grep extts ... ts2phc[7979.272]: [ts2phc.0.config] ens2f0 extts index 0 at 1711124315.000000010 corr -10 src 1711124315.881141758 diff 0 ts2phc[7980.259]: [ts2phc.0.config] ens1f0 extts index 0 at 1711124316.000000000 corr 0 src 1711124316.904125569 diff 0 ts2phc[7980.688]: [ts2phc.0.config] ens2f0 extts index 0 at 1711124316.000000010 corr -10 src 1711124316.904125569 diff 0 ts2phc[7981.259]: [ts2phc.0.config] ens1f0 extts index 0 at 1711124317.000000000 corr 0 src 1711124316.498665691 diff 1000000000 ts2phc[7981.563]: [ts2phc.0.config] ens2f0 extts index 0 at 1711124317.000000010 corr -10 src 1711124316.498665691 diff 1000000000 ts2phc[7982.198]: [ts2phc.0.config] ens1f0 extts index 0 at 1711124317.373915578 corr 0 src 1711124318.574747824 diff -626084422 ts2phc[7982.216]: [ts2phc.0.config] ens2f0 extts index 0 at 1711124317.925023509 corr -10 src 1711124318.574747824 diff -74976501 ... The first few lines ending with "diff 0" show that the NIC clock is properly synchronized. Then, at timestamp 7981.259, it goes all the way to the max value (1000000000) in both NICs. It slowly converges back to 0, and after a few seconds it fails again. This is also shown by the PTP Events framework, where the ptp-state-change event is flipping between LOCKED and FREERUN: time="2024-03-22T16:01:16Z" level=debug msg="Got CurrentState: {\"specversion\":\"0.3\",\"id\":\"1ffed146-e576-4049-b307-cc9e3d878efe\",\"source\":\"/cluster/node/master-0.dl110platocp.example.com/sync/ptp-status/lock-state\",\"type\":\"event.sync.ptp-status.ptp-state-change\",\"subject\":\"/cluster/node/master-0.dl110platocp.example.com/sync/ptp-status/lock-state\",\"datacontenttype\":\"application/json\",\"time\":\"2024-03-22T16:01:16.971249066Z\",\"data\":{\"version\":\"v1\",\"values\":[{\"resource\":\"/cluster/node/master-0.dl110platocp.example.com/ens1fx/master\",\"dataType\":\"notification\",\"valueType\":\"enumeration\",\"value\":\"LOCKED\"},{\"resource\":\"/cluster/node/master-0.dl110platocp.example.com/ens1fx/master\",\"dataType\":\"metric\",\"valueType\":\"decimal64.3\",\"value\":\"0\"}]}} " time="2024-03-22T02:16:16Z" level=debug msg="Got CurrentState: {\"specversion\":\"0.3\",\"id\":\"c4f3c093-4f94-4c35-ad1d-0b98df3c6b9a\",\"source\":\"/cluster/node/master-0.dl110platocp.example.com/sync/ptp-status/lock-state\",\"type\":\"event.sync.ptp-status.ptp-state-change\",\"subject\":\"/cluster/node/master-0.dl110platocp.example.com/sync/ptp-status/lock-state\",\"datacontenttype\":\"application/json\",\"time\":\"2024-03-22T16:02:16.985507564Z\",\"data\":{\"version\":\"v1\",\"values\":[{\"resource\":\"/cluster/node/master-0.dl110platocp.example.com/ens1fx/master\",\"dataType\":\"notification\",\"valueType\":\"enumeration\",\"value\":\"FREERUN\"},{\"resource\":\"/cluster/node/master-0.dl110platocp.example.com/ens1fx/master\",\"dataType\":\"metric\",\"valueType\":\"decimal64.3\",\"value\":\"0\"}]}} " time="2024-03-22T16:03:16Z" level=debug msg="Got CurrentState: {\"specversion\":\"0.3\",\"id\":\"5e90af67-0621-4a35-bfdf-2f9046623fc6\",\"source\":\"/cluster/node/master-0.dl110platocp.example.com/sync/ptp-status/lock-state\",\"type\":\"event.sync.ptp-status.ptp-state-change\",\"subject\":\"/cluster/node/master-0.dl110platocp.example.com/sync/ptp-status/lock-state\",\"datacontenttype\":\"application/json\",\"time\":\"2024-03-22T16:03:16.973188143Z\",\"data\":{\"version\":\"v1\",\"values\":[{\"resource\":\"/cluster/node/master-0.dl110platocp.example.com/ens1fx/master\",\"dataType\":\"notification\",\"valueType\":\"enumeration\",\"value\":\"FREERUN\"},{\"resource\":\"/cluster/node/master-0.dl110platocp.example.com/ens1fx/master\",\"dataType\":\"metric\",\"valueType\":\"decimal64.3\",\"value\":\"0\"}]}} " time="2024-03-22T16:04:16Z" level=debug msg="Got CurrentState: {\"specversion\":\"0.3\",\"id\":\"335b79a7-ffec-498a-b6e7-4ed8f134df1f\",\"source\":\"/cluster/node/master-0.dl110platocp.example.com/sync/ptp-status/lock-state\",\"type\":\"event.sync.ptp-status.ptp-state-change\",\"subject\":\"/cluster/node/master-0.dl110platocp.example.com/sync/ptp-status/lock-state\",\"datacontenttype\":\"application/json\",\"time\":\"2024-03-22T16:04:16.970489851Z\",\"data\":{\"version\":\"v1\",\"values\":[{\"resource\":\"/cluster/node/master-0.dl110platocp.example.com/ens1fx/master\",\"dataType\":\"notification\",\"valueType\":\"enumeration\",\"value\":\"LOCKED\"},{\"resource\":\"/cluster/node/master-0.dl110platocp.example.com/ens1fx/master\",\"dataType\":\"metric\",\"valueType\":\"decimal64.3\",\"value\":\"0\"}]}} "
Version-Release number of selected component (if applicable):
OCP 4.14.12 ptp-operator 4.14.0-202402081809
How reproducible:
Always
Steps to Reproduce:
1. Set up dual-WPC grandmaster following documentation, using all ports on both NICs as grandmaster 2. Check linuxptp-daemon-container logs
Actual results:
ts2phc diff goes high
Expected results:
Once locked, ts2phc diff should stay stable
Additional info:
Originally, we thought this could be related to https://issues.redhat.com/browse/RHEL-19000, because we saw a "PTP failed to get time" message in dmesg that could be correlated to some of the instances. Testing with a kernel including the fix removed the message in dmesg, but the ts2phc behavior remained. NICs used in the tests are HP-branded, with FW 4.40