OpenShift Bugs / OCPBUGS-1623

Bug 2054200 - Custom created services in openshift-ingress removed even though the services are not of type LoadBalancer


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Affects Version: 4.10
    • Component: Networking / router
    • Severity: Important
    • Sprint: Sprint 225
      Cause: The logic in the ingress-operator didn't validate whether a kubernetes service object in the openshift-ingress namespace was actually created/owned by the ingress controller it was reconciling.

      Consequence: The ingress-operator would modify/remove kubernetes services with the same name and namespace regardless of ownership, which could cause unexpected behavior. This is quite rare because the service has to have a very specific name and it also has to be in the openshift-ingress namespace.

      Fix: The ingress-operator now checks the ownership of existing kubernetes services it attempts to create/remove, and if the ownership doesn't match, the ingress-operator returns an error and takes no action.

      Result: The ingress-operator won't modify/delete custom kubernetes services that share a name with one it wants to modify/remove in the openshift-ingress namespace.

      Manual backport mirror of https://bugzilla.redhat.com/show_bug.cgi?id=2094051 

      Description of problem:

      With an `IngressController` whose `endpointPublishingStrategy` is set to `Private`, a `kubernetes` service of type `NodePort` that was manually created in the `openshift-ingress` namespace following the operator's naming convention is removed when the `ingress-operator` is restarted.

      $ oc get ingresscontroller -n openshift-ingress-operator example-service-testing -o json
      {
        "apiVersion": "operator.openshift.io/v1",
        "kind": "IngressController",
        "metadata": {
          "creationTimestamp": "2022-02-14T10:34:35Z",
          "finalizers": [
            "ingresscontroller.operator.openshift.io/finalizer-ingresscontroller"
          ],
          "generation": 2,
          "name": "example-service-testing",
          "namespace": "openshift-ingress-operator",
          "resourceVersion": "19329705",
          "uid": "ffc9f14d-63ad-43bb-8a56-5e590cda9b38"
        },
        "spec": {
          "clientTLS": {
            "clientCA": { "name": "" },
            "clientCertificatePolicy": ""
          },
          "domain": "apps.example.com",
          "endpointPublishingStrategy": { "type": "Private" },
          "httpEmptyRequestsPolicy": "Respond",
          "httpErrorCodePages": { "name": "" },
          "tuningOptions": {},
          "unsupportedConfigOverrides": null
        },
        "status": {
          "availableReplicas": 2,
          "conditions": [
            { "lastTransitionTime": "2022-02-14T10:34:35Z", "reason": "Valid", "status": "True", "type": "Admitted" },
            { "lastTransitionTime": "2022-02-14T10:34:35Z", "status": "True", "type": "PodsScheduled" },
            { "lastTransitionTime": "2022-02-14T10:35:10Z", "message": "The deployment has Available status condition set to True", "reason": "DeploymentAvailable", "status": "True", "type": "DeploymentAvailable" },
            { "lastTransitionTime": "2022-02-14T10:35:10Z", "message": "Minimum replicas requirement is met", "reason": "DeploymentMinimumReplicasMet", "status": "True", "type": "DeploymentReplicasMinAvailable" },
            { "lastTransitionTime": "2022-02-14T10:35:10Z", "message": "All replicas are available", "reason": "DeploymentReplicasAvailable", "status": "True", "type": "DeploymentReplicasAllAvailable" },
            { "lastTransitionTime": "2022-02-14T10:34:35Z", "message": "The configured endpoint publishing strategy does not include a managed load balancer", "reason": "EndpointPublishingStrategyExcludesManagedLoadBalancer", "status": "False", "type": "LoadBalancerManaged" },
            { "lastTransitionTime": "2022-02-14T10:34:35Z", "message": "The endpoint publishing strategy doesn't support DNS management.", "reason": "UnsupportedEndpointPublishingStrategy", "status": "False", "type": "DNSManaged" },
            { "lastTransitionTime": "2022-02-14T10:35:10Z", "status": "True", "type": "Available" },
            { "lastTransitionTime": "2022-02-14T10:35:10Z", "status": "False", "type": "Degraded" }
          ],
          "domain": "apps.example.com",
          "endpointPublishingStrategy": { "type": "Private" },
          "observedGeneration": 2,
          "selector": "ingresscontroller.operator.openshift.io/deployment-ingresscontroller=example-service-testing",
          "tlsProfile": {
            "ciphers": [ "ECDHE-ECDSA-AES128-GCM-SHA256", "ECDHE-RSA-AES128-GCM-SHA256", "ECDHE-ECDSA-AES256-GCM-SHA384", "ECDHE-RSA-AES256-GCM-SHA384", "ECDHE-ECDSA-CHACHA20-POLY1305", "ECDHE-RSA-CHACHA20-POLY1305", "DHE-RSA-AES128-GCM-SHA256", "DHE-RSA-AES256-GCM-SHA384", "TLS_AES_128_GCM_SHA256", "TLS_AES_256_GCM_SHA384", "TLS_CHACHA20_POLY1305_SHA256" ],
            "minTLSVersion": "VersionTLS12"
          }
        }
      }

      $ oc get svc
      NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
      router-default LoadBalancer 172.30.242.215 a777bc4ce4da740d99abdaa899bf8e88-1599963277.us-west-1.elb.amazonaws.com 80:30779/TCP,443:31713/TCP 13d
      router-internal-default ClusterIP 172.30.233.135 <none> 80/TCP,443/TCP,1936/TCP 13d
      router-internal-example-service-testing ClusterIP 172.30.86.100 <none> 80/TCP,443/TCP,1936/TCP 87m

      After the `IngressController` creation, everything looks as expected, and for the Private `IngressController` we can see the `router-internal-example-service-testing` service.

      $ oc create svc nodeport router-example-service-testing --tcp=80
      service/router-example-service-testing created

      $ oc get svc
      NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
      router-default LoadBalancer 172.30.242.215 a777bc4ce4da740d99abdaa899bf8e88-1599963277.us-west-1.elb.amazonaws.com 80:30779/TCP,443:31713/TCP 13d
      router-internal-default ClusterIP 172.30.233.135 <none> 80/TCP,443/TCP,1936/TCP 13d
      router-internal-example-service-testing ClusterIP 172.30.86.100 <none> 80/TCP,443/TCP,1936/TCP 88m
      router-example-service-testing NodePort 172.30.2.39 <none> 80:31874/TCP 3s

      Now we create a `kubernetes` service of type NodePort with the same naming scheme as the services created by the `IngressController`. So far so good, and there is no impact on functionality.
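The collision can be made explicit with a small sketch. The naming scheme below (`router-<name>` and `router-internal-<name>`) is inferred from the `oc get svc` listings in this report, not taken from the operator source; `managedServiceNames` and `collides` are hypothetical helpers for illustration.

```go
package main

import "fmt"

// managedServiceNames returns the service names the ingress-operator appears
// to derive from an IngressController name, judging by the listings above:
// "router-<name>" for the load-balancer/node-port service and
// "router-internal-<name>" for the internal service.
func managedServiceNames(icName string) []string {
	return []string{
		"router-" + icName,
		"router-internal-" + icName,
	}
}

// collides reports whether a service name clashes with a name the operator
// derives for the given controller.
func collides(svcName, icName string) bool {
	for _, n := range managedServiceNames(icName) {
		if n == svcName {
			return true
		}
	}
	return false
}

func main() {
	// The manually created NodePort service uses exactly the name the
	// operator would pick for its own load-balancer service.
	fmt.Println(collides("router-example-service-testing", "example-service-testing")) // true
}
```

This is why the manually created `router-example-service-testing` service is at risk: on the next reconcile, the operator resolves the same name and treats the foreign service as its own.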

      $ oc get pod -n openshift-ingress-operator
      NAME READY STATUS RESTARTS AGE
      ingress-operator-7d56fd784c-plwpj 2/2 Running 0 78m

      $ oc delete pod ingress-operator-7d56fd784c-plwpj -n openshift-ingress-operator
      pod "ingress-operator-7d56fd784c-plwpj" deleted

      $ oc get svc
      NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
      router-default LoadBalancer 172.30.242.215 a777bc4ce4da740d99abdaa899bf8e88-1599963277.us-west-1.elb.amazonaws.com 80:30779/TCP,443:31713/TCP 13d
      router-internal-default ClusterIP 172.30.233.135 <none> 80/TCP,443/TCP,1936/TCP 13d
      router-internal-example-service-testing ClusterIP 172.30.86.100 <none> 80/TCP,443/TCP,1936/TCP 88m
      router-example-service-testing NodePort 172.30.2.39 <none> 80:31874/TCP 53s

      $ oc get svc
      NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
      router-default LoadBalancer 172.30.242.215 a777bc4ce4da740d99abdaa899bf8e88-1599963277.us-west-1.elb.amazonaws.com 80:30779/TCP,443:31713/TCP 13d
      router-internal-default ClusterIP 172.30.233.135 <none> 80/TCP,443/TCP,1936/TCP 13d
      router-internal-example-service-testing ClusterIP 172.30.86.100 <none> 80/TCP,443/TCP,1936/TCP 89m

      $ oc logs ingress-operator-7d56fd784c-7g48r -n openshift-ingress-operator -c ingress-operator
      2022-02-14T12:03:26.624Z INFO operator.main ingress-operator/start.go:63 using operator namespace {"namespace": "openshift-ingress-operator"}
      I0214 12:03:27.675884 1 request.go:668] Waited for 1.02063447s due to client-side throttling, not priority and fairness, request: GET:
      https://172.30.0.1:443/apis/apps.openshift.io/v1?timeout=32s
      2022-02-14T12:03:29.284Z INFO operator.main ingress-operator/start.go:63 registering Prometheus metrics for canary_controller
      2022-02-14T12:03:29.284Z INFO operator.main ingress-operator/start.go:63 registering Prometheus metrics for ingress_controller
      [...]
      2022-02-14T12:03:33.119Z INFO operator.dns dns/controller.go:535 using region from operator config {"region name": "us-west-1"}
      2022-02-14T12:03:33.417Z INFO operator.ingress_controller controller/controller.go:298 reconciling {"request": "openshift-ingress-operator/example-service-testing"}
      2022-02-14T12:03:33.509Z INFO operator.ingress_controller ingress/load_balancer_service.go:190 deleted load balancer service {"namespace": "openshift-ingress", "name": "router-example-service-testing"}
      [...]

      When restarting the `ingress-operator` pod, we can see that shortly after, the manually created `kubernetes` service of type NodePort is removed. Looking through the code, this seems related to
      https://bugzilla.redhat.com/show_bug.cgi?id=1914127
      but that change should only target `kubernetes` services of type LoadBalancer. However, we can clearly see that this happens for `kubernetes` services of any type if they match the pre-defined `IngressController` naming scheme.

      As this is not expected, and the `kubernetes` services don't carry any owner reference to the `IngressController`-created services, it is unexpected that they are being removed, and this should be fixed.

      OpenShift release version:

      • OpenShift Container Platform 4.9.15

      Cluster Platform:

      • AWS, but likely on other platforms as well

      How reproducible:

      • Always

      Steps to Reproduce (in detail):
      1. See the steps in the problem description

      Actual results:

      `kubernetes` services of any type and without an owner reference to the `IngressController` are being removed by the `IngressController` if they match a specific naming scheme.

      Expected results:

      `kubernetes` services without an owner reference to the `IngressController` should never be touched/modified/removed by it, as they may be required for 3rd-party integrations or similar.

      Impact of the problem:

      A 3rd-party implementation broke after updating to OpenShift Container Platform 4.8 because some helper services were removed unexpectedly.

      Additional info:

      Check
      https://bugzilla.redhat.com/show_bug.cgi?id=1914127
      as this seems to be the change that introduced this behavior. Although it appears specific to `kubernetes` services of type LoadBalancer, we are wondering why other services are in scope as well.
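The scoping the report argues for can be sketched like this. Again an illustrative Go model, not operator code: `service` is a minimal stand-in for `corev1.Service`, and `deletable` encodes the expectation that only a LoadBalancer-type service actually owned by the reconciling controller is ever a deletion candidate.

```go
package main

import "fmt"

// service is a minimal stand-in for corev1.Service with just the fields the
// scoping question needs; illustrative only.
type service struct {
	Name     string
	Type     string // "LoadBalancer", "NodePort", "ClusterIP", ...
	OwnerUID string // empty when the service has no owner reference
}

// deletable encodes the expected scoping: the operator should only ever remove
// a service that is both of type LoadBalancer and owned by the reconciling
// controller, not anything that merely matches the naming scheme.
func deletable(s service, controllerUID string) bool {
	return s.Type == "LoadBalancer" && s.OwnerUID == controllerUID
}

func main() {
	ctl := "ffc9f14d-63ad-43bb-8a56-5e590cda9b38"
	custom := service{Name: "router-example-service-testing", Type: "NodePort"}
	owned := service{Name: "router-default", Type: "LoadBalancer", OwnerUID: ctl}
	fmt.Println(deletable(custom, ctl), deletable(owned, ctl)) // false true
}
```

Under this rule the manually created NodePort service from the reproduction steps fails both checks and is left untouched.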

              Grant Spence (gspence@redhat.com)
              Hongan Li