OpenShift Bugs / OCPBUGS-57082

ptp pods do not die on error

    • Quality / Stability / Reliability: True
    • Depends on the release of OCPBUGS-55732
    • Important
    • 9/18: dependent PR for OCPBUGS-61180 is merging in 4.16
      7/3: depending on OCPBUGS-55732
      6/30: on hold; will wait for the unicast_master_table bug. Please mark this as on hold until it is backported. We are moving all clusters to 4.16 now, so that would be our goal version, but we do want to follow up on this regardless.
    • CNF RAN Sprint 277

      Description of problem:

          A linuxptp daemon pod can end up in a state where ptp4l repeatedly fails to start (a configuration parse error), yet the pod remains Running and healthy according to OCP, so it is never restarted automatically.

      Version-Release number of selected component (if applicable):

        PTP operator version 4.15.0-202505132237.     

      How reproducible:

          Systemic, but random.

      Steps to Reproduce:

          1. Install the PTP operator and upgrade OpenShift from 4.14 to 4.15.
          2. A linuxptp daemon pod can get stuck in an error state and never refreshes itself. It needs monitoring so that it is automatically recycled (see the sketch after these steps).
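
      As a stopgap for the monitoring mentioned in step 2, an external watchdog along these lines could recycle a stuck pod. This is a minimal sketch, not part of the PTP operator; the namespace, label selector, container name, and failure threshold are assumptions that would need to match the actual deployment.

      ```
      // Hypothetical external watchdog (not part of the PTP operator): tails the
      // linuxptp daemon container logs and deletes the pod when the ptp4l restart
      // loop is detected, so the DaemonSet recreates it.
      package main

      import (
          "bufio"
          "context"
          "log"
          "strings"
          "time"

          corev1 "k8s.io/api/core/v1"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/rest"
      )

      func main() {
          cfg, err := rest.InClusterConfig() // assumes the watchdog runs in-cluster
          if err != nil {
              log.Fatal(err)
          }
          client, err := kubernetes.NewForConfig(cfg)
          if err != nil {
              log.Fatal(err)
          }

          const ns = "openshift-ptp" // assumed namespace of the linuxptp daemon pods
          for {
              pods, err := client.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{
                  LabelSelector: "app=linuxptp-daemon", // assumed label
              })
              if err != nil {
                  log.Printf("list pods: %v", err)
                  time.Sleep(time.Minute)
                  continue
              }
              for _, pod := range pods.Items {
                  if inErrorLoop(client, ns, pod.Name) {
                      log.Printf("deleting %s: ptp4l restart loop detected", pod.Name)
                      _ = client.CoreV1().Pods(ns).Delete(context.TODO(), pod.Name, metav1.DeleteOptions{})
                  }
              }
              time.Sleep(time.Minute)
          }
      }

      // inErrorLoop counts how often the ptp4l config parse error appears in the
      // last few hundred log lines; repeated hits mean the daemon is stuck
      // recreating ptp4l without ever recovering.
      func inErrorLoop(client *kubernetes.Clientset, ns, name string) bool {
          tail := int64(300)
          req := client.CoreV1().Pods(ns).GetLogs(name, &corev1.PodLogOptions{
              Container: "linuxptp-daemon-container", // assumed container name
              TailLines: &tail,
          })
          stream, err := req.Stream(context.TODO())
          if err != nil {
              return false
          }
          defer stream.Close()

          hits := 0
          scanner := bufio.NewScanner(stream)
          for scanner.Scan() {
              if strings.Contains(scanner.Text(), "failed to parse configuration file") {
                  hits++
              }
          }
          return hits >= 5 // arbitrary threshold for "stuck"
      }
      ```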
       
          

      Actual results:

          This morning we had an outage that blocked trading activities due to clock sync errors. The root cause was that the linuxptp pods scheduled by the PTP operator were not working correctly. The pods were in a healthy state according to OCP while printing the following lines a few times a second:
      
      ```
      phc2sys[150296.077]: [ptp4l.0.config] Waiting for ptp4l...
      I0602 00:33:25.546967  393813 daemon.go:745] Starting ptp4l...
      I0602 00:33:25.546939  393813 daemon.go:844] Recreating ptp4l...
      I0602 00:33:25.546974  393813 daemon.go:746] ptp4l cmd: /bin/chrt -f 10 /usr/sbin/ptp4l -f /var/run/ptp4l.0.config  -s -m 
      I0602 00:33:25.547052  393813 daemon.go:674] ptp4l[1748824405]:[ptp4l.0.config] PTP_PROCESS_STATUS:1
      failed to parse configuration file /var/run/ptp4l.0.config
      line 72: missing table_id
      E0602 00:33:25.548539  393813 daemon.go:829] CmdRun() error waiting for ptp4l: exit status 254
      I0602 00:33:25.548583  393813 daemon.go:674] ptp4l[1748824405]:[ptp4l.0.config] PTP_PROCESS_STATUS:0
      phc2sys[150297.077]: [ptp4l.0.config] Waiting for ptp4l...
      I0602 00:33:26.548933  393813 daemon.go:745] Starting ptp4l...
      I0602 00:33:26.548904  393813 daemon.go:844] Recreating ptp4l...
      I0602 00:33:26.548942  393813 daemon.go:746] ptp4l cmd: /bin/chrt -f 10 /usr/sbin/ptp4l -f /var/run/ptp4l.0.config  -s -m 
      I0602 00:33:26.548998  393813 daemon.go:674] ptp4l[1748824406]:[ptp4l.0.config] PTP_PROCESS_STATUS:1
      failed to parse configuration file /var/run/ptp4l.0.config
      line 72: missing table_id
      ```

      This was fixed by a simple pod restart. Configurations were not changed; whatever the problem was, it was likely a temporary condition caused by the upgrade. Another node in the same cluster, with the same pod and the same configuration, never had this issue.
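
      For context on the parse error itself: in linuxptp, a [unicast_master_table] section must declare table_id before any master address entries, otherwise ptp4l aborts with exactly this "missing table_id" error. That appears to line up with the unicast_master_table bug (OCPBUGS-55732) referenced in the status notes above. A well-formed section looks roughly like the following; the interval and address values are made-up examples, not taken from this cluster's generated ptp4l.0.config:

      ```
      [unicast_master_table]
      table_id                1
      logQueryInterval        2
      UDPv4                   10.0.0.1
      ```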

      Expected results:

          The pod should automatically detect that it is in an error state and restart; a minimal sketch of one possible approach follows.
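
      One possible shape for that behavior, assuming the daemon's ptp4l restart loop can be changed: count consecutive start failures and exit the process once a threshold is hit, so that the container's restartPolicy brings up a fresh pod instead of the daemon retrying forever. This is an illustrative sketch, not the actual linuxptp-daemon code; runPtp4l and maxConsecutiveFailures are made-up names.

      ```
      // Illustrative fail-fast wrapper around a ptp4l restart loop. runPtp4l
      // stands in for whatever starts ptp4l and blocks until it exits.
      package main

      import (
          "log"
          "os"
          "os/exec"
          "time"
      )

      const maxConsecutiveFailures = 10 // give transient errors a chance to clear

      // runPtp4l starts ptp4l with the generated config and waits for it to exit.
      func runPtp4l() error {
          cmd := exec.Command("/bin/chrt", "-f", "10", "/usr/sbin/ptp4l",
              "-f", "/var/run/ptp4l.0.config", "-s", "-m")
          cmd.Stdout = os.Stdout
          cmd.Stderr = os.Stderr
          return cmd.Run()
      }

      func main() {
          failures := 0
          for {
              err := runPtp4l()
              if err == nil {
                  failures = 0
                  continue
              }
              failures++
              log.Printf("ptp4l exited with error (%d consecutive): %v", failures, err)
              if failures >= maxConsecutiveFailures {
                  // Exit non-zero instead of retrying forever: kubelet then restarts
                  // the container per the pod's restartPolicy, which is what finally
                  // cleared the stuck state reported above.
                  log.Fatalf("ptp4l failed %d times in a row; exiting so the pod restarts", failures)
              }
              time.Sleep(time.Second)
          }
      }
      ```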

      Additional info:

          
