Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-31730

Dual-WPC Grandmaster: ts2phc loses sync and reports high offset values on both cards

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Unresolved
    • Icon: Undefined Undefined
    • None
    • 4.14.z
    • Networking / ptp
    • None
    • No
    • False
    • Hide

      None

      Show
      None
    • Hide
      5/3: The fix has been verified , with the test build
      4/24 : RHEL fix is in progress, to verify with test build
      Show
      5/3: The fix has been verified , with the test build 4/24 : RHEL fix is in progress, to verify with test build

      Description of problem:

      On a dual-WPC grandmaster setup where both cards are properly synced, we see them lose sync to the GNSS / 1-PPS signal quite often. This is shown in the linuxptp-daemon-containers logs with the following fingerprint:
      
      $ grep ts2phc daemon.log  |grep extts
      
      ...
      ts2phc[7979.272]: [ts2phc.0.config] ens2f0 extts index 0 at 1711124315.000000010 corr -10 src 1711124315.881141758 diff 0
      ts2phc[7980.259]: [ts2phc.0.config] ens1f0 extts index 0 at 1711124316.000000000 corr 0 src 1711124316.904125569 diff 0
      ts2phc[7980.688]: [ts2phc.0.config] ens2f0 extts index 0 at 1711124316.000000010 corr -10 src 1711124316.904125569 diff 0
      ts2phc[7981.259]: [ts2phc.0.config] ens1f0 extts index 0 at 1711124317.000000000 corr 0 src 1711124316.498665691 diff 1000000000
      ts2phc[7981.563]: [ts2phc.0.config] ens2f0 extts index 0 at 1711124317.000000010 corr -10 src 1711124316.498665691 diff 1000000000
      ts2phc[7982.198]: [ts2phc.0.config] ens1f0 extts index 0 at 1711124317.373915578 corr 0 src 1711124318.574747824 diff -626084422
      ts2phc[7982.216]: [ts2phc.0.config] ens2f0 extts index 0 at 1711124317.925023509 corr -10 src 1711124318.574747824 diff -74976501
      ...
      
      The first few lines ending with "diff 0" show that the NIC clock is properly synchronized. Then, at timestamp 7981.259, it goes all the way to the max value (1000000000) in both NICs. It slowly converges back to 0, and after a few seconds it fails again.
      
      This is also shown by the PTP Events framework, where the ptp-state-change event is flipping between LOCKED and FREERUN:
      
      time="2024-03-22T16:01:16Z" level=debug msg="Got CurrentState: {\"specversion\":\"0.3\",\"id\":\"1ffed146-e576-4049-b307-cc9e3d878efe\",\"source\":\"/cluster/node/master-0.dl110platocp.example.com/sync/ptp-status/lock-state\",\"type\":\"event.sync.ptp-status.ptp-state-change\",\"subject\":\"/cluster/node/master-0.dl110platocp.example.com/sync/ptp-status/lock-state\",\"datacontenttype\":\"application/json\",\"time\":\"2024-03-22T16:01:16.971249066Z\",\"data\":{\"version\":\"v1\",\"values\":[{\"resource\":\"/cluster/node/master-0.dl110platocp.example.com/ens1fx/master\",\"dataType\":\"notification\",\"valueType\":\"enumeration\",\"value\":\"LOCKED\"},{\"resource\":\"/cluster/node/master-0.dl110platocp.example.com/ens1fx/master\",\"dataType\":\"metric\",\"valueType\":\"decimal64.3\",\"value\":\"0\"}]}} "
      time="2024-03-22T02:16:16Z" level=debug msg="Got CurrentState: {\"specversion\":\"0.3\",\"id\":\"c4f3c093-4f94-4c35-ad1d-0b98df3c6b9a\",\"source\":\"/cluster/node/master-0.dl110platocp.example.com/sync/ptp-status/lock-state\",\"type\":\"event.sync.ptp-status.ptp-state-change\",\"subject\":\"/cluster/node/master-0.dl110platocp.example.com/sync/ptp-status/lock-state\",\"datacontenttype\":\"application/json\",\"time\":\"2024-03-22T16:02:16.985507564Z\",\"data\":{\"version\":\"v1\",\"values\":[{\"resource\":\"/cluster/node/master-0.dl110platocp.example.com/ens1fx/master\",\"dataType\":\"notification\",\"valueType\":\"enumeration\",\"value\":\"FREERUN\"},{\"resource\":\"/cluster/node/master-0.dl110platocp.example.com/ens1fx/master\",\"dataType\":\"metric\",\"valueType\":\"decimal64.3\",\"value\":\"0\"}]}} "
      time="2024-03-22T16:03:16Z" level=debug msg="Got CurrentState: {\"specversion\":\"0.3\",\"id\":\"5e90af67-0621-4a35-bfdf-2f9046623fc6\",\"source\":\"/cluster/node/master-0.dl110platocp.example.com/sync/ptp-status/lock-state\",\"type\":\"event.sync.ptp-status.ptp-state-change\",\"subject\":\"/cluster/node/master-0.dl110platocp.example.com/sync/ptp-status/lock-state\",\"datacontenttype\":\"application/json\",\"time\":\"2024-03-22T16:03:16.973188143Z\",\"data\":{\"version\":\"v1\",\"values\":[{\"resource\":\"/cluster/node/master-0.dl110platocp.example.com/ens1fx/master\",\"dataType\":\"notification\",\"valueType\":\"enumeration\",\"value\":\"FREERUN\"},{\"resource\":\"/cluster/node/master-0.dl110platocp.example.com/ens1fx/master\",\"dataType\":\"metric\",\"valueType\":\"decimal64.3\",\"value\":\"0\"}]}} "
      time="2024-03-22T16:04:16Z" level=debug msg="Got CurrentState: {\"specversion\":\"0.3\",\"id\":\"335b79a7-ffec-498a-b6e7-4ed8f134df1f\",\"source\":\"/cluster/node/master-0.dl110platocp.example.com/sync/ptp-status/lock-state\",\"type\":\"event.sync.ptp-status.ptp-state-change\",\"subject\":\"/cluster/node/master-0.dl110platocp.example.com/sync/ptp-status/lock-state\",\"datacontenttype\":\"application/json\",\"time\":\"2024-03-22T16:04:16.970489851Z\",\"data\":{\"version\":\"v1\",\"values\":[{\"resource\":\"/cluster/node/master-0.dl110platocp.example.com/ens1fx/master\",\"dataType\":\"notification\",\"valueType\":\"enumeration\",\"value\":\"LOCKED\"},{\"resource\":\"/cluster/node/master-0.dl110platocp.example.com/ens1fx/master\",\"dataType\":\"metric\",\"valueType\":\"decimal64.3\",\"value\":\"0\"}]}} "

      Version-Release number of selected component (if applicable):

      OCP 4.14.12
      ptp-operator 4.14.0-202402081809 

      How reproducible:

          Always

      Steps to Reproduce:

          1. Set up dual-WPC grandmaster following documentation, using all ports on both NICs as grandmaster
          2. Check linuxptp-daemon-container logs
         

      Actual results:

      ts2phc diff goes high

      Expected results:

      Once locked, ts2phc diff should stay stable

      Additional info:

      Originally, we thought this could be related to https://issues.redhat.com/browse/RHEL-19000, because we saw a "PTP failed to get time" message in dmesg that could be correlated to some of the instances. Testing with a kernel including the fix removed the message in dmesg, but the ts2phc behavior remained.
      
      NICs used in the tests are HP-branded, with FW 4.40

            aputtur@redhat.com Aneesh Puttur
            jpena@redhat.com Javier Pena
            Hen Shay Hassid Hen Shay Hassid
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

              Created:
              Updated: