OpenShift Bugs / OCPBUGS-2988

apiserver pods cannot reach etcd on single node IPv6 cluster: transport: authentication handshake failed: x509: certificate is valid for ::1, 127.0.0.1, ::1, fd69::2, not 2620:52:0:198::10


      11/10: flipping back to Green since it's ON_QA
      11/4: Red for 4.12 until the fix is posted.
      11/3: now rejected as release blocker; PR posted. Yellow as fix is still in flight.
      10/25: in progress (release blocker)
      10/19: Completely blocks System Test; issue has been discussed with the SDN team on Slack

      Description of problem:

      The openshift-apiserver, openshift-oauth-apiserver and kube-apiserver pods cannot validate the certificate presented by etcd when connecting to it, and report certificate validation errors:
      
      W1018 11:36:43.523673      15 logging.go:59] [core] [Channel #186 SubChannel #187] grpc: addrConn.createTransport failed to connect to {
        "Addr": "[2620:52:0:198::10]:2379",
        "ServerName": "2620:52:0:198::10",
        "Attributes": null,
        "BalancerAttributes": null,
        "Type": 0,
        "Metadata": null
      }. Err: connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for ::1, 127.0.0.1, ::1, fd69::2, not 2620:52:0:198::10"
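
      To make the mismatch in the log above explicit, here is a minimal Python sketch that parses the error text and compares the address the apiserver dialed against the IP entries listed in the certificate. The helper name `parse_san_mismatch` is hypothetical, and the sketch assumes every listed entry is an IP address (a DNS name in the list would make `ipaddress.ip_address` raise `ValueError`). It only restates what the message already says; it does not inspect the actual certificate.

```python
import ipaddress
import re

# Error text copied from the log in this report.
ERR = ('transport: authentication handshake failed: x509: certificate is valid for '
       '::1, 127.0.0.1, ::1, fd69::2, not 2620:52:0:198::10')


def parse_san_mismatch(message):
    """Return (allowed_ips, rejected_ip) parsed from an x509 SAN mismatch message.

    Hypothetical helper: assumes all listed entries are IP addresses.
    """
    match = re.search(r"certificate is valid for (.+), not (\S+)", message)
    if match is None:
        raise ValueError("not an x509 SAN mismatch message")
    # Duplicates such as the repeated ::1 collapse in the set.
    allowed = {ipaddress.ip_address(item.strip()) for item in match.group(1).split(",")}
    rejected = ipaddress.ip_address(match.group(2).strip('"'))
    return allowed, rejected


allowed, rejected = parse_san_mismatch(ERR)
print("certificate IPs:", sorted(str(ip) for ip in allowed))
print("dialed address :", rejected)
print("covered by cert:", rejected in allowed)
```

      The dialed address 2620:52:0:198::10 is not among the certificate's entries (loopback addresses plus fd69::2), which is why the handshake is rejected.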
      

      Version-Release number of selected component (if applicable):

      4.12.0-0.nightly-2022-10-18-041406

      How reproducible:

      100%

      Steps to Reproduce:

      1. Deploy a single-node OpenShift (SNO) cluster with single-stack IPv6 via the ZTP procedure
      

      Actual results:

      Deployment times out and some of the operators aren't deployed successfully.
      
      NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
      authentication                             4.12.0-0.nightly-2022-10-18-041406   False       False         True       124m    APIServerDeploymentAvailable: no apiserver.openshift-oauth-apiserver pods available on any node....
      baremetal                                  4.12.0-0.nightly-2022-10-18-041406   True        False         False      112m    
      cloud-controller-manager                   4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
      cloud-credential                           4.12.0-0.nightly-2022-10-18-041406   True        False         False      115m    
      cluster-autoscaler                         4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
      config-operator                            4.12.0-0.nightly-2022-10-18-041406   True        False         False      124m    
      console                                                                                                                      
      control-plane-machine-set                  4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
      csi-snapshot-controller                    4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
      dns                                        4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
      etcd                                       4.12.0-0.nightly-2022-10-18-041406   True        False         True       121m    ClusterMemberControllerDegraded: could not get list of unhealthy members: giving up getting a cached client after 3 tries
      image-registry                             4.12.0-0.nightly-2022-10-18-041406   False       True          True       104m    Available: The registry is removed...
      ingress                                    4.12.0-0.nightly-2022-10-18-041406   True        True          True       111m    The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 0/1 of replicas are available)
      insights                                   4.12.0-0.nightly-2022-10-18-041406   True        False         False      118s    
      kube-apiserver                             4.12.0-0.nightly-2022-10-18-041406   True        False         False      102m    
      kube-controller-manager                    4.12.0-0.nightly-2022-10-18-041406   True        False         True       107m    GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp [fd02::3c5f]:9091: connect: connection refused
      kube-scheduler                             4.12.0-0.nightly-2022-10-18-041406   True        False         False      107m    
      kube-storage-version-migrator              4.12.0-0.nightly-2022-10-18-041406   True        False         False      117m    
      machine-api                                4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
      machine-approver                           4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
      machine-config                             4.12.0-0.nightly-2022-10-18-041406   True        False         False      115m    
      marketplace                                4.12.0-0.nightly-2022-10-18-041406   True        False         False      116m    
      monitoring                                                                      False       True          True       98m     deleting Thanos Ruler Route failed: Timeout: request did not complete within requested timeout - context deadline exceeded, deleting UserWorkload federate Route failed: Timeout: request did not complete within requested timeout - context deadline exceeded, reconciling Alertmanager Route failed: retrieving Route object failed: the server was unable to return a response in the time allotted, but may still be processing the request (get routes.route.openshift.io alertmanager-main), reconciling Thanos Querier Route failed: retrieving Route object failed: the server was unable to return a response in the time allotted, but may still be processing the request (get routes.route.openshift.io thanos-querier), reconciling Prometheus API Route failed: retrieving Route object failed: the server was unable to return a response in the time allotted, but may still be processing the request (get routes.route.openshift.io prometheus-k8s), prometheuses.monitoring.coreos.com "k8s" not found
      network                                    4.12.0-0.nightly-2022-10-18-041406   True        False         False      124m    
      node-tuning                                4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
      openshift-apiserver                        4.12.0-0.nightly-2022-10-18-041406   True        False         False      104m    
      openshift-controller-manager               4.12.0-0.nightly-2022-10-18-041406   True        False         False      107m    
      openshift-samples                                                               False       True          False      103m    The error the server was unable to return a response in the time allotted, but may still be processing the request (get imagestreams.image.openshift.io) during openshift namespace cleanup has left the samples in an unknown state
      operator-lifecycle-manager                 4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
      operator-lifecycle-manager-catalog         4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m    
      operator-lifecycle-manager-packageserver   4.12.0-0.nightly-2022-10-18-041406   True        False         False      106m    
      service-ca                                 4.12.0-0.nightly-2022-10-18-041406   True        False         False      124m    
      storage                                    4.12.0-0.nightly-2022-10-18-041406   True        False         False      111m  
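
      For triage, the unhealthy rows in a listing like the one above can be picked out programmatically. The following Python sketch assumes the JSON shape returned by `oc get clusteroperators -o json` (an `items` list whose entries carry `status.conditions` with `type` and `status` fields). The function name `unhealthy_operators` is hypothetical, and the sample holds only two rows from the table, rewritten by hand in that assumed shape.

```python
def unhealthy_operators(cluster_operators):
    """Return names of operators that are not Available or that are Degraded."""
    unhealthy = []
    for item in cluster_operators["items"]:
        # Map condition type -> status string ("True" / "False" / "Unknown").
        conditions = {c["type"]: c["status"]
                      for c in item.get("status", {}).get("conditions", [])}
        if conditions.get("Available") != "True" or conditions.get("Degraded") == "True":
            unhealthy.append(item["metadata"]["name"])
    return unhealthy


# Two rows from the table in this report, in the assumed JSON shape.
sample = {"items": [
    {"metadata": {"name": "authentication"}, "status": {"conditions": [
        {"type": "Available", "status": "False"},
        {"type": "Progressing", "status": "False"},
        {"type": "Degraded", "status": "True"}]}},
    {"metadata": {"name": "baremetal"}, "status": {"conditions": [
        {"type": "Available", "status": "True"},
        {"type": "Progressing", "status": "False"},
        {"type": "Degraded", "status": "False"}]}},
]}

print(unhealthy_operators(sample))
```

      An operator with no reported conditions (such as the empty console row) is treated as unhealthy by this sketch, because a missing Available condition is not equal to "True".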
      
      

      Expected results:

      Deployment succeeds without issues.

      Additional info:

      I was unable to run must-gather, so I am attaching the pod logs copied from the host file system.

            bnemec@redhat.com Benjamin Nemec
            mcornea@redhat.com Marius Cornea
            Marius Cornea Marius Cornea