Distributed Tracing / TRACING-3347

OCP 4.10 - elasticsearch + jaeger pods referencing old certificates until manually deleted/restarted

Details

    • Bug
    • Resolution: Done
    • Undefined
    • rhosdt-3.0
    • None
    • Jaeger
    • None
    • Tracing Sprint # 239, Tracing Sprint # 240, Tracing Sprint # 241, Tracing Sprint # 242, Tracing Sprint # 243, Tracing Sprint # 244

    Description

      Issue: jaeger/elasticsearch pods are crash-looping. The customer reported the following errors:

      ~~~
      Before I restarted any pods, all three elasticsearch pods showed logs like so:

      2023-06-20T22:25:02.243702305Z [2023-06-20T22:25:02,243][ERROR][c.a.o.s.s.h.n.OpenDistroSecuritySSLNettyHttpServerTransport] [elasticsearch-cdm-istiosystemjaeger-1] SSL Problem Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)
      2023-06-20T22:25:02.243702305Z javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)

      I restarted the 3rd elasticsearch pod, elasticsearch-cdm-istiosystemjaeger-3-8db99fd54-wffnw, and started getting errors like so:

      2023-06-20T22:25:02.243702305Z [2023-06-20T22:25:02,243][ERROR][c.a.o.s.s.h.n.OpenDistroSecuritySSLNettyHttpServerTransport] [elasticsearch-cdm-istiosystemjaeger-1] SSL Problem Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)
      2023-06-20T22:25:02.243702305Z javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)

      And:

      2023-06-21T01:00:18.168137633Z [2023-06-21T01:00:18,161][ERROR][c.a.o.s.s.t.OpenDistroSecuritySSLNettyTransport] [elasticsearch-cdm-istiosystemjaeger-2] SSL Problem PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed
      2023-06-21T01:00:18.168137633Z javax.net.ssl.SSLHandshakeException: PKIX path validation failed: java.security.cert.CertPathValidatorException: validity check failed
      2023-06-21T01:00:18.168137633Z at sun.security.ssl.Alert.createSSLException(Alert.java:131) ~[?:?]

      And:

      2023-06-21T01:00:19.010710594Z [2023-06-21T01:00:19,010][WARN ][o.e.d.z.ZenDiscovery ] [elasticsearch-cdm-istiosystemjaeger-2] not enough master nodes discovered during pinging (found [[Candidate{node={elasticsearch-cdm-istiosystemjaeger-2}{-taatlZURZ6jxxPZC7KNWw}{FCP0cg0SRwSmSqAkwaOsfw}{x.x.x.45}{x.x.x.45:9300}, clusterStateVersion=-1}]], but needed [2]), pinging again
      ~~~
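When triaging a pod's log for the three signatures quoted above, a small filter can make it clearer which failure dominates. The following is only a sketch: the function name `summarize_es_tls_errors` and the short labels it prints are made up for illustration, and it counts matching log lines, not distinct events (each SSL error above produces at least two lines).

```shell
# summarize_es_tls_errors: read Elasticsearch log lines on stdin and print
# "<count><TAB><label>" for each of the three signatures seen in this issue.
# The labels are hypothetical names chosen for this sketch.
summarize_es_tls_errors() {
    awk '
        /Insufficient buffer remaining for AEAD cipher/ { aead++ }
        /PKIX path validation failed/                   { pkix++ }
        /not enough master nodes discovered/            { quorum++ }
        END {
            # Unset awk variables print as 0 with %d.
            printf "%d\taead_buffer\n", aead
            printf "%d\tpkix_validation\n", pkix
            printf "%d\tmaster_quorum\n", quorum
        }
    '
}
```

It could be fed from a live pod with something like `oc logs -c elasticsearch "$p1" | summarize_es_tls_errors`, where `$p1` holds the pod name as in the commands quoted in this report.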

      Triggered by: Expired Certificates (see below)
      ~~~
      /etc/openshift/elasticsearch/secret/logging-es.crt
      Validity
      Not Before: Jun 14 05:47:02 2023 GMT
      Not After : Jun 13 05:47:02 2025 GMT
      /etc/openshift/elasticsearch/secret/elasticsearch.crt
      Validity
      Not Before: Jun 14 05:47:01 2023 GMT
      Not After : Jun 13 05:47:01 2025 GMT
      /etc/openshift/elasticsearch/secret/..2023_06_21_01_00_03.1819449341/logging-es.crt
      Validity
      Not Before: Jun 14 05:47:02 2023 GMT
      Not After : Jun 13 05:47:02 2025 GMT
      /etc/openshift/elasticsearch/secret/..2023_06_21_01_00_03.1819449341/elasticsearch.crt
      Validity
      Not Before: Jun 14 05:47:01 2023 GMT
      Not After : Jun 13 05:47:01 2025 GMT
      /etc/elasticsearch/secret/elasticsearch.crt
      Validity
      Not Before: Jun 14 05:47:01 2023 GMT
      Not After : Jun 13 05:47:01 2025 GMT
      /etc/elasticsearch/secret/logging-es.crt
      Validity
      Not Before: Jun 14 05:47:02 2023 GMT
      Not After : Jun 13 05:47:02 2025 GMT
      /run/secrets/kubernetes.io/serviceaccount/ca.crt
      Validity
      Not Before: Apr 1 20:31:23 2021 GMT
      Not After : Mar 30 20:31:23 2031 GMT
      /run/secrets/kubernetes.io/serviceaccount/service-ca.crt
      Validity
      Not Before: May 31 21:01:25 2023 GMT
      Not After : Jul 29 21:01:26 2025 GMT
      /run/secrets/kubernetes.io/serviceaccount/..2023_06_21_01_00_03.1972118856/service-ca.crt
      Validity
      Not Before: May 31 21:01:25 2023 GMT
      Not After : Jul 29 21:01:26 2025 GMT
      /run/secrets/kubernetes.io/serviceaccount/..2023_06_21_01_00_03.1972118856/ca.crt
      Validity
      Not Before: Apr 1 20:31:23 2021 GMT
      Not After : Mar 30 20:31:23 2031 GMT
      And in a pod I hadn't restarted:

      oc exec -c elasticsearch $p1 -- sh -c 'for i in $(find /etc /run -name "*.crt" -o -name "*.cert"); do openssl x509 -noout -text -in $i | grep -i -A2 -e validity | grep -i "not after : jun 14 " && echo $i; done'
      unable to load certificate
      139852364314432:error:0909006C:PEM routines:get_name:no start line:crypto/pem/pem_lib.c:745:Expecting: TRUSTED CERTIFICATE
      Not After : Jun 14 05:16:32 2023 GMT
      /etc/elasticsearch/secret/elasticsearch.crt
      Not After : Jun 14 05:16:32 2023 GMT
      /etc/elasticsearch/secret/logging-es.crt
      command terminated with exit code 1

      $ oc get po $p1
      NAME READY STATUS RESTARTS AGE
      elasticsearch-cdm-istiosystemjaeger-1-85c49b687c-4nkqc 1/2 Running 0 16d
      ~~~
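The check above greps for a specific date and also trips over files that are not PEM certificates (the "unable to load certificate" line). A date-independent variant can be sketched with `openssl x509 -checkend`, which exits 0 when the certificate is still valid the given number of seconds from now. The function names and the exit-code scheme below are assumptions made for this sketch, not part of any tool.

```shell
# cert_status FILE [SECONDS]
# Exit codes (chosen for this sketch, not an openssl convention):
#   0 = certificate stays valid for at least SECONDS more (default 0 = valid now)
#   1 = certificate has expired, or will expire within SECONDS
#   2 = FILE could not be parsed as a PEM certificate
cert_status() {
    # Reject unparseable files first so a load failure is not reported as expiry.
    openssl x509 -noout -in "$1" >/dev/null 2>&1 || return 2
    if openssl x509 -noout -checkend "${2:-0}" -in "$1" >/dev/null 2>&1; then
        return 0
    fi
    return 1
}

# list_expiring DIR [SECONDS]
# Print every *.crt / *.cert under DIR that is expired or expires within SECONDS.
# Symlinks are followed by openssl, so Kubernetes secret mounts are covered.
list_expiring() {
    find "$1" \( -name '*.crt' -o -name '*.cert' \) 2>/dev/null |
    while IFS= read -r f; do
        rc=0
        cert_status "$f" "${2:-0}" || rc=$?
        if [ "$rc" -eq 1 ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

One way to use it would be to paste both functions into an `oc rsh -c elasticsearch "$p1"` session and run `list_expiring /etc/elasticsearch/secret`. This assumes `openssl` and `find` are available in the container, as they were for the command quoted above.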

      Workaround:
      Delete the affected pods so that they restart and pick up the new certificates. After the restart, the pods come up OK.
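The workaround can be sketched as a small helper. The function name `restart_pods` and the `DRY_RUN` switch are invented for this sketch; by default it only prints the `oc delete pod` commands so they can be reviewed before anything is removed. The namespace in the usage example is a guess based on the pod names in this report and should be replaced with the real one.

```shell
# restart_pods NAMESPACE POD...
# Delete the named pods so their controllers recreate them with the current
# certificates. With DRY_RUN=1 (the default in this sketch) the commands are
# only printed; set DRY_RUN=0 to actually run them.
restart_pods() {
    ns=$1
    shift
    for pod in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            printf 'oc delete pod %s -n %s\n' "$pod" "$ns"
        else
            oc delete pod "$pod" -n "$ns" || return 1
        fi
    done
}
```

For example, `restart_pods istio-system elasticsearch-cdm-istiosystemjaeger-1-85c49b687c-4nkqc` prints the delete command for the pod shown above, and the same call with `DRY_RUN=0` in the environment would execute it.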

      Data gathered:
      must-gather taken while the issue was occurring, before any action was taken:
      https://attachments.access.redhat.com/hydra/rest/cases/03543308/attachments/3b57aa03-80dc-42d9-8643-5f7a0416f35a?usePresignedUrl=true

      must-gather taken after the issue: https://attachments.access.redhat.com/hydra/rest/cases/03543308/attachments/79b6c69c-537c-4c01-99f3-9bf805e09cc9?usePresignedUrl=true

          People

            rvargasp@redhat.com Ruben Vargas Palma
            rhn-support-wrussell Will Russell
            Ruben Vargas Palma