Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-14681

openshift_ptp_clock_class is not updated in Prometheus metrics when the incomming ptp packets clock class is changed

XMLWordPrintable

    • Important
    • No
    • False
    • Hide

      None

      Show
      None
    • Hide
      6/23: QE is installing a 4.14 system today with the aim of validating this on Sunday. Then we can line up a 4.13 backport and regression.
      6/22: pr merged; working logistics with QE for testing and backporting.
      6/21: a short term fix is underway to address this issue. A PR has already been posted.
      Show
      6/23: QE is installing a 4.14 system today with the aim of validating this on Sunday. Then we can line up a 4.13 backport and regression. 6/22: pr merged; working logistics with QE for testing and backporting. 6/21: a short term fix is underway to address this issue. A PR has already been posted.

      Description of problem:

      openshift_ptp_clock_class is not updated in Prometheus metrics when the incomming ptp packets clock class is changed

      Version-Release number of selected component (if applicable):

       

      How reproducible:

       

      Steps to Reproduce:

      Step1: OCP linuxptp is synced from interface interface ens43f1, incomming clock class is 6 at first.
      
      Step2: In OCP monitor console web (such as https://console-openshift-console.apps.hztt-pz-25ae06-1-1r-10-69-38-82.ocp.hz.nsn-rdnet.net/), add metrics query for openshift_ptp_clock_class, it's value is shown with 6. 
      
      Step3: In ptp master side, do some change to make the ptp packets with clock class 248 (restart ptp master relative service to trigger clock class to 248, and there will be several seconds without ptp packets sending out), checking in linuxptp side, such as checking with pmc tool (kubectl get pods -n openshift-ptp |grep -i linuxptp |awk '{print $1}' |xargs -i kubectl exec -it -n openshift-ptp tp-daemon-container -- pmc -f /var/run/ptp4l.0.config -u -d 24 "GET PARENT_DATA_SET"), it shows the gm.ClockClass is already changed to 248, and also in metrics query for openshift_ptp_clock_class, it's value is also changed from 6 to 248.
      
      Step4: After some seconds in ptp matster side, the clock class is changed from 248 back to 6, checing in linuxptp side, the gm.ClockClass is changed from 248 back to 6, but in 
      metrics query for openshift_ptp_clock_class, it's value is always 248, and checking inside prometheus container "kubectl exec -it -n openshift-monitoring prometheus-k8s-0 -c prometheus -- bash" with "curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://ptp-monitor-service.openshift-ptp.svc.cluster.local:8443/metrics", it shows clock class always 248 

      Actual results:

      [3. Actual Result:]
      The openshift_ptp_clock_class in prometheus metrics is always 248
      workaround: two ways can fix this issue, 1, trigger ptp packets lost a while then recovery soon; or 2, delete linuxptp pods in OCP side to trigger new linuxptp pod creating.

      Expected results:

      [2. Expected Result:]
      The openshift_ptp_clock_class in prometheus metrics should be changed back to 6

      Additional info:

      [4. Analysis of Logs:]
      #In linuxptp container logs (20230530linuxptp1.log), we can see when there is some setting in ptp master side (restart ptp master to trigger clock class to 248), linuxptp logs shows annouce messege timeout for a while and later when packets incomming the clock class is 248:
      2023-05-30T01:42:12.483007360Z ptp4l[1454728.799]: [ptp4l.0.config] port 1: SLAVE to LISTENING on ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES
      2023-05-30T01:42:12.483048396Z ptp4l[1454728.799]: [ptp4l.0.config] selected local clock b49691.fffe.d13271 as best master
      2023-05-30T01:42:13.183202792Z phc2sys[1454729.499]: [ptp4l.0.config] port b49691.fffe.d13271-1 changed state
      2023-05-30T01:42:13.183304437Z phc2sys[1454729.499]: [ptp4l.0.config] reconfiguring after port state change
      2023-05-30T01:42:13.183570266Z phc2sys[1454729.500]: [ptp4l.0.config] selecting ens43f1 for synchronization
      2023-05-30T01:42:13.183614620Z phc2sys[1454729.500]: [ptp4l.0.config] nothing to synchronize
      2023-05-30T01:42:23.779279531Z I0530 01:42:23.779242   60724 main.go:120] ticker pull
      2023-05-30T01:42:38.761658342Z ptp4l[1454755.078]: [ptp4l.0.config] selected best master clock 4ea8aa.fffe.7a3006
      2023-05-30T01:42:38.761692895Z ptp4l[1454755.078]: [ptp4l.0.config] port 1: LISTENING to UNCALIBRATED on RS_SLAVE
      2023-05-30T01:42:38.763098842Z I0530 01:42:38.763067   60724 daemon.go:443] clock change event identified
      2023-05-30T01:42:38.763129016Z ptp4l[1685410958]:[ptp4l.0.config] CLOCK_CLASS_CHANGE 248.000000
      
      #later in ptp master side the clock class is changed to 6 (please check ptp wireshark logs 20230530ptp.pcap fetched in OCP interface ens43f10), frome frame 9922 announce message to frame 9952, and in OCP linuxptp pmc tool output, we can see the gm clockclass is also changed from 248 to 6:
      [core@master0 ~]$ kubectl exec -it -n openshift-ptp linuxptp-daemon-bgmcr -c linuxptp-daemon-container -- pmc -f /var/run/ptp4l.0.co    nfig -u -d 24 "GET PARENT_DATA_SET"
      sending: GET PARENT_DATA_SET
      b49691.fffe.d13271-0 seq 0 RESPONSE MANAGEMENT PARENT_DATA_SET
      parentPortIdentity                    4ea8aa.fffe.7a3006-2
      parentStats                           0
      observedParentOffsetScaledLogVariance 0xffff
      observedParentClockPhaseChangeRate    0x7fffffff
      grandmasterPriority1                  128
      gm.ClockClass                         248
      gm.ClockAccuracy                      0x21
      gm.OffsetScaledLogVariance            0x4e5d
      grandmasterPriority2                  128
      grandmasterIdentity                   4ea8aa.fffe.7a3006
      [core@master0 ~]$ kubectl exec -it -n openshift-ptp linuxptp-daemon-bgmcr -c linuxptp-daemon-container -- pmc -f /var/run/ptp4l.0.co    nfig -u -d 24 "GET PARENT_DATA_SET"
      sending: GET PARENT_DATA_SET
      b49691.fffe.d13271-0 seq 0 RESPONSE MANAGEMENT PARENT_DATA_SET
      parentPortIdentity                    4ea8aa.fffe.7a3006-2
      parentStats                           0
      observedParentOffsetScaledLogVariance 0xffff
      observedParentClockPhaseChangeRate    0x7fffffff
      grandmasterPriority1                  128
      gm.ClockClass                         6
      gm.ClockAccuracy                      0x21
      gm.OffsetScaledLogVariance            0x4e5d
      grandmasterPriority2                  128
      grandmasterIdentity                   4ea8aa.fffe.7a3006
      
      #But in console web, the clock class is always 248 (please check the har logs console-openshift-console.apps.hztt-pz-25ae06-1-1r-10-69-38-82.har of web console):
      "text": "{\"status\":\"success\",\"data\":{\"resultType\":\"matrix\",\"result\":[{\"metric\":{\"__name__\":\"openshift_ptp_clock_class\",\"container\":\"kube-rbac-proxy\",\"endpoint\":\"metrics\",\"instance\":\"10.69.38.82:8443\",\"job\":\"ptp-monitor-service\",\"namespace\":\"openshift-ptp\",\"node\":\"master0.hztt-pz-25ae06-1-1r-10-69-38-82.ocp.hz.nsn-rdnet.net\",\"pod\":\"linuxptp-daemon-bgmcr\",\"process\":\"ptp4l\",\"prometheus\":\"openshift-monitoring/k8s\",\"service\":\"ptp-monitor-service\"},\"values\":[[1685413389.496,\"248\"],[1685413395.496,\"248\"],[1685413401.496,\"248\"],[1685413407.496,\"248\"],[1685413413.496,\"248\"],[1685413419.496,\"248\"],[1685413425.496,\"248\"],[1685413431.496,\"248\"],[1685413437.496,\"248\"],[1685413443.496,\"248\"],[1685413449.496,\"248\"],[1685413455.496,\"248\"],[1685413461.496,\"248\"],[1685413467.496,\"248\"],[1685413473.496,\"248\"],[1685413479.496,\"248\"],[1685413485.496,\"248\"]
      ,[1685415111.496,\"248\"],[1685415117.496,\"248\"],[1685415123.496,\"248\"],[1685415129.496,\"248\"],[1685415135.496,\"248\"],[1685415141.496,\"248\"],[1685415147.496,\"248\"],[1685415153.496,\"248\"],[1685415159.496,\"248\"],[1685415165.496,\"248\"],[1685415171.496,\"248\"],[1685415177.496,\"248\"],[1685415183.496,\"248\"],[1685415189.496,\"248\"]]}]}}\n"
      },
      
      #And inside prometheus container (promethus_output.txt) it's 248:
      [core@master0 ~]$ kubectl exec -it -n openshift-monitoring prometheus-k8s-0 -- bash
      bash-4.4$ curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/service-ca.crt -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)" https://ptp-monitor-service.openshift-ptp.svc.cluster.local:8443/metrics
      # HELP cne_api_events_published Metric to get number of events published by the rest api
      # TYPE cne_api_events_published gauge
      cne_api_events_published{address="/cluster/node/master0.hztt-pz-25ae06-1-1r-10-69-38-82.ocp.hz.nsn-rdnet.net/sync/ptp-status/lock-state",status="success"} 45
      cne_api_events_published{address="/cluster/node/master0.hztt-pz-25ae06-1-1r-10-69-38-82.ocp.hz.nsn-rdnet.net/sync/ptp-status/ptp-clock-class-change",status="success"} 7
      cne_api_events_published{address="/cluster/node/master0.hztt-pz-25ae06-1-1r-10-69-38-82.ocp.hz.nsn-rdnet.net/sync/sync-status/os-clock-sync-state",status="success"} 25
      # HELP cne_api_publishers Metric to get number of publishers
      # TYPE cne_api_publishers gauge
      cne_api_publishers{status="active"} 3
      # HELP cne_events_ack Metric to get number of events produced
      # TYPE cne_events_ack gauge
      cne_events_ack{status="success",type="/cluster/node/master0.hztt-pz-25ae06-1-1r-10-69-38-82.ocp.hz.nsn-rdnet.net/sync/ptp-status/lock-state"} 45
      cne_events_ack{status="success",type="/cluster/node/master0.hztt-pz-25ae06-1-1r-10-69-38-82.ocp.hz.nsn-rdnet.net/sync/ptp-status/ptp-clock-class-change"} 7
      cne_events_ack{status="success",type="/cluster/node/master0.hztt-pz-25ae06-1-1r-10-69-38-82.ocp.hz.nsn-rdnet.net/sync/sync-status/os-clock-sync-state"} 25
      # HELP openshift_ptp_clock_class
      # TYPE openshift_ptp_clock_class gauge
      openshift_ptp_clock_class{node="master0.hztt-pz-25ae06-1-1r-10-69-38-82.ocp.hz.nsn-rdnet.net",process="ptp4l"} 248

              vgrinber@redhat.com Vitaly Grinberg
              rhn-support-nigsmith Nigel Smith
              Ofer Bochan Ofer Bochan
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

                Created:
                Updated:
                Resolved: