  Network Observability / NETOBSERV-1619

Dedicated metrics ports for netobserv as the default

    • Type: Bug
    • Resolution: Done
    • Fix Version: netobserv-1.6
    • Component: eBPF
    • Release Note Status: Needs release note text.
    • Release Note Type: Bug Fix
    • Sprints: NetObserv - Sprint 252, NetObserv - Sprint 253

      Description of problem:

      We recently found an issue with the default port assignment for the eBPF pods while running the automated test that verifies the eBPF agent, health, and console metrics. Currently, this issue reproduces only on an OCP 4.13 cluster with the ppc64le architecture. Because the port is unavailable, the eBPF pods in the "-privileged" namespace enter the CrashLoopBackOff state, as shown below:
      
      [root@rdr-noo-ocp-413-bastion-0 ~]# oc -n e2e-test-netobserv-d2jg4-privileged get po -o wide
      NAME                         READY   STATUS             RESTARTS      AGE   IP              NODE       NOMINATED NODE   READINESS GATES
      netobserv-ebpf-agent-27kx8   1/1     Running            0             58s   10.20.186.124   worker-1   <none>           <none>
      netobserv-ebpf-agent-2dsbj   0/1     CrashLoopBackOff   3 (17s ago)   58s   10.20.186.120   master-2   <none>           <none>
      netobserv-ebpf-agent-6nzpc   1/1     Running            0             58s   10.20.186.236   worker-0   <none>           <none>
      netobserv-ebpf-agent-mrm7z   0/1     Error              3 (39s ago)   58s   10.20.186.183   master-0   <none>           <none>
      netobserv-ebpf-agent-n4gwd   0/1     Error              3 (38s ago)   58s   10.20.186.79    master-1   <none>           <none>

      Steps to Reproduce:

      As mentioned above, this issue reproduces only on a 4.13 cluster with the ppc64le architecture while running the automated test that verifies the flowlogs-pipeline, eBPF agent, health, and console metrics. It can be reproduced with the following steps:
      
      1. Deploy OCP 4.13 cluster for ppc64le arch
      2. Run the automated test- "54043-High-66031-High-72959-Verify flowlogs-pipeline, eBPF agent and Console metrics"
      

      Actual results:

      The eBPF pods enter the CrashLoopBackOff state because the default metrics port is already in use.

      Expected results:

      The eBPF pods should start successfully on the default metrics port defined in the FlowCollector deployment.
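      As a possible workaround until dedicated default ports land, the agent's metrics port can presumably be overridden in the FlowCollector resource. This is a hedged sketch, assuming the v1beta2 API exposes spec.agent.ebpf.metrics.server.port; the port value 9103 is purely illustrative and should be any port known to be free on the nodes:

      ```yaml
      # Sketch only: assumes FlowCollector v1beta2 exposes
      # spec.agent.ebpf.metrics.server.port; 9103 is an illustrative free port.
      apiVersion: flows.netobserv.io/v1beta2
      kind: FlowCollector
      metadata:
        name: cluster
      spec:
        agent:
          type: eBPF
          ebpf:
            metrics:
              server:
                port: 9103
      ```

      Applying this would move the agent's Prometheus endpoint off :9102, avoiding the clash with kube-rbac-proxy shown below.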

      eBPF pod logs:

      [root@rdr-noo-ocp-413-bastion-0 ~]# oc -n e2e-test-netobserv-cxv4l-privileged logs netobserv-ebpf-agent-7xdw2
      
      time="2024-04-24T09:36:39Z" level=info msg="starting NetObserv eBPF Agent"
      time="2024-04-24T09:36:39Z" level=info msg="initializing Flows agent" component=agent.Flows
      time="2024-04-24T09:36:39Z" level=info msg="StartServerAsync: addr = :9102" component=prometheus
      time="2024-04-24T09:36:39Z" level=info msg="push CTRL+C or send SIGTERM to interrupt execution"
      time="2024-04-24T09:36:39Z" level=info msg="starting Flows agent" component=agent.Flows
      time="2024-04-24T09:36:39Z" level=warning msg="can't detect any network-namespaces err: open /var/run/netns: no such file or directory [Ignore if the agent privileged flag is not set]" component=ifaces.Watcher
      time="2024-04-24T09:36:39Z" level=warning msg="failed to add watcher to netns directory err: no such file or directory [Ignore if the agent privileged flag is not set]" component=ifaces.Watcher
      time="2024-04-24T09:36:39Z" level=info msg="Flows agent successfully started" component=agent.Flows
      time="2024-04-24T09:36:39Z" level=fatal msg="error in http.ListenAndServe: listen tcp :9102: bind: address already in use" component=prometheus
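      The fatal bind error above can be demonstrated with a minimal sketch (not the agent's actual code): two TCP listeners contend for one address, the second bind fails with "address already in use", and a fallback to a kernel-chosen ephemeral port recovers. The tryListen helper and the ephemeral-port fallback are illustrative assumptions; this is the class of conflict that a dedicated default metrics port avoids:

      ```go
      package main

      import (
      	"fmt"
      	"net"
      )

      // tryListen attempts to bind addr; if the bind fails (e.g. "address
      // already in use"), it falls back to ":0" so the kernel picks a free
      // ephemeral port. Illustrative only, not NetObserv agent behavior.
      func tryListen(addr string) (net.Listener, error) {
      	ln, err := net.Listen("tcp", addr)
      	if err == nil {
      		return ln, nil
      	}
      	fmt.Printf("bind %s failed (%v), falling back to an ephemeral port\n", addr, err)
      	return net.Listen("tcp", "127.0.0.1:0")
      }

      func main() {
      	// First listener takes the port, standing in for kube-rbac-proxy.
      	first, err := net.Listen("tcp", "127.0.0.1:0")
      	if err != nil {
      		panic(err)
      	}
      	defer first.Close()

      	// A second bind on the same address fails, like the agent's
      	// Prometheus endpoint in the logs above, then recovers on a free port.
      	second, err := tryListen(first.Addr().String())
      	if err != nil {
      		panic(err)
      	}
      	defer second.Close()
      	fmt.Println("second listener bound to", second.Addr())
      }
      ```

      Running this with `go run` prints the bind failure for the occupied address, then shows the second listener bound to a different port.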
       

      Process attached to port 9102:

      [root@master-2 core]# netstat -plano | grep :9102
      tcp6       0      0 :::9102                 :::*                    LISTEN      3543/kube-rbac-prox  off (0.00/0/0)
      tcp6       0      0 10.20.186.120:9102      10.20.186.236:57840     ESTABLISHED 3543/kube-rbac-prox  keepalive (11.40/0/0)
      tcp6       0      0 10.20.186.120:9102      10.20.186.124:45198     ESTABLISHED 3543/kube-rbac-prox  keepalive (1.17/0/0)
      
      [root@master-2 core]# ps -ef | grep 3543
      nfsnobo+    3543    3423  0 Apr19 ?        00:02:13 /usr/bin/kube-rbac-proxy --logtostderr --secure-listen-address=:9102 --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256 --upstream=http://127.0.0.1:29102/ --tls-private-key-file=/etc/pki/tls/metrics-cert/tls.key --tls-cert-file=/etc/pki/tls/metrics-cert/tls.crt
      root     3394891 3393742  0 17:39 pts/0    00:00:00 grep --color=auto 3543
      [root@master-2 core]# 

            Assignee: mmahmoud@redhat.com (Mohamed Mahmoud)
            Reporter: rh-ee-ahonkala (Aditya Honkalas)
            Votes: 0
            Watchers: 5

              Created:
              Updated:
              Resolved: