Description of problem:
The linuxptp-daemon does not expose the process status metric (openshift_ptp_process_status) for the ptp4l process.
Version-Release number of selected component (if applicable):
OCP 4.20.0-rc.3 + ptp_operator 4.20.0-202509080953
OCP 4.20.0-ec.4 + ptp_operator 4.20.0-202507221345
OCP 4.19.14 + ptp_operator 4.19.0-202509230113
OCP 4.19.12 + ptp_operator 4.19.0-202509111607
OCP 4.19.5 + ptp_operator 4.19.0-202507232110
OCP 4.18.21 + ptp_operator 4.18.0-202507211933
How reproducible:
Completely random
Steps to Reproduce:
1. Deploy the spoke cluster.
2. Run the test_ptp.sh script: https://gitlab.cee.redhat.com/ran/ran-integration/-/blob/master/scripts/test_ptp.sh?ref_type=heads
Actual results:
The linuxptp-daemon pod does not provide the ptp4l process status metric. Neither openshift_ptp_process_status nor openshift_ptp_process_restart_count appears in the output:
[kni@registry.kni-qe-23 ~]$ oc -n openshift-ptp exec linuxptp-daemon-rzsc5 -- curl http://localhost:9091/metrics | grep ptp4l
Defaulted container "cloud-event-proxy" out of: cloud-event-proxy, kube-rbac-proxy, linuxptp-daemon-container
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 4741 0 4741 0 0 4629k 0 --:--:-- --:--:-- --:--:-- 4629k
openshift_ptp_clock_class{node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="ptp4l"} 6
openshift_ptp_clock_state{iface="ens1fx",node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="ptp4l"} 1
openshift_ptp_delay_ns{from="master",iface="ens1fx",node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="ptp4l"} 360
openshift_ptp_frequency_adjustment_ns{from="master",iface="ens1fx",node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="ptp4l"} 11
openshift_ptp_interface_role{iface="ens1f0",node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="ptp4l"} 2
openshift_ptp_interface_role{iface="ens1f1",node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="ptp4l"} 1
openshift_ptp_interface_role{iface="ens1f2",node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="ptp4l"} 2
openshift_ptp_interface_role{iface="ens1f3",node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="ptp4l"} 2
openshift_ptp_max_offset_ns{from="master",iface="ens1fx",node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="ptp4l"} 2.71339e+06
openshift_ptp_offset_ns{from="master",iface="ens1fx",node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="ptp4l"} 1
Expected results:
The expected output, which is obtained after restarting the pod, looks like this:
[kni@registry.kni-qe-23 ~]$ oc -n openshift-ptp exec linuxptp-daemon-4vdvs -- curl http://localhost:9091/metrics | grep ptp4l
Defaulted container "cloud-event-proxy" out of: cloud-event-proxy, kube-rbac-proxy, linuxptp-daemon-container
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 5391 0 5391 0 0 5264k 0 --:--:-- --:--:-- --:--:-- 5264k
openshift_ptp_clock_class{node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="ptp4l"} 6
openshift_ptp_clock_state{iface="ens1fx",node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="ptp4l"} 1
openshift_ptp_delay_ns{from="master",iface="ens1fx",node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="ptp4l"} 366
openshift_ptp_frequency_adjustment_ns{from="master",iface="ens1fx",node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="ptp4l"} 7
openshift_ptp_interface_role{iface="ens1f0",node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="ptp4l"} 2
openshift_ptp_interface_role{iface="ens1f1",node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="ptp4l"} 1
openshift_ptp_interface_role{iface="ens1f2",node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="ptp4l"} 2
openshift_ptp_interface_role{iface="ens1f3",node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="ptp4l"} 2
openshift_ptp_max_offset_ns{from="master",iface="ens1fx",node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="ptp4l"} 116
openshift_ptp_offset_ns{from="master",iface="ens1fx",node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="ptp4l"} 1
openshift_ptp_process_restart_count{config="ptp4l.0.config",node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="phc2sys"} 1
openshift_ptp_process_restart_count{config="ptp4l.0.config",node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="ptp4l"} 1
openshift_ptp_process_status{config="ptp4l.0.config",node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="phc2sys"} 1
openshift_ptp_process_status{config="ptp4l.0.config",node="sno.kni-qe-67.lab.eng.rdu2.redhat.com",process="ptp4l"} 1
Additional info:
I have encountered this problem at random across different environments, architectures (aarch64 and x86_64), and combinations of OCP and ptp_operator versions. This is the must-gather from the latest deployment where I saw it happen: https://drive.google.com/file/d/1XYR5kI0k2ccExnHS7g7Hg20fs5nvc2_2/view?usp=sharing
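
Because the failure is intermittent, it may help to automate the check instead of reading the grep output by eye. The following is a minimal sketch, not part of test_ptp.sh: the function name missing_process_status and its parameters are hypothetical. It assumes the /metrics text has already been captured (for example, with the curl command shown above) and that samples follow the same `name{labels} value` layout as in this report.

```python
import re

# Matches one labelled sample line, e.g.
# openshift_ptp_process_status{config="ptp4l.0.config",node="n",process="ptp4l"} 1
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def missing_process_status(metrics_text, processes=("ptp4l",)):
    """Return the processes with no openshift_ptp_process_status sample.

    Hypothetical helper: an empty list means every requested process
    reported a status sample in the given metrics text.
    """
    seen = set()
    for line in metrics_text.splitlines():
        match = _SAMPLE.match(line.strip())
        if not match or match.group("name") != "openshift_ptp_process_status":
            continue
        # Turn the label string into a dict so we can read the process label.
        labels = dict(_LABEL.findall(match.group("labels")))
        seen.add(labels.get("process"))
    return [p for p in processes if p not in seen]
```

Passing the text from the "Actual results" capture should return ["ptp4l"], while the capture taken after the pod restart should return an empty list. Calling it with processes=("ptp4l", "phc2sys") would cover both processes that appear in the expected output.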