[OCPBUGS-50587] Component Readiness: [Etcd] [operator-conditions] test regressed

      (Feel free to update this bug's summary to be more specific.)
      Component Readiness has found a potential regression in the following test:

      operator conditions etcd

      Significant regression detected.
      Fisher's Exact probability of a regression: 100.00%.
      Test pass rate dropped from 99.72% to 93.55%.

      Sample (being evaluated) Release: 4.18
      Start Time: 2025-02-04T00:00:00Z
      End Time: 2025-02-11T16:00:00Z
      Success Rate: 93.55%
      Successes: 87
      Failures: 6
      Flakes: 0

      Base (historical) Release: 4.17
      Start Time: 2024-09-01T00:00:00Z
      End Time: 2024-10-01T23:59:59Z
      Success Rate: 99.72%
      Successes: 355
      Failures: 1
      Flakes: 0

      View the test details report for additional context.

      The issue is currently being discussed in https://redhat-internal.slack.com/archives/C01CQA76KMX/p1739211100312459. It seems to specifically impact periodic-ci-openshift-cluster-control-plane-machine-set-operator-release-4.18-periodics-e2e-azure jobs, which (among other things) replace instances / change instance types during the job, and it appears to be impacting static pods.

      4.19 Test Analysis

      4.18 Test Analysis

            [OCPBUGS-50587] Component Readiness: [Etcd] [operator-conditions] test regressed

            Arda Guclu added a comment -

            Based on my initial investigation (I'll dig in more extensively tomorrow), I reached the same conclusions as rh-ee-bpalmer.

            (I'd like to note that I saw the `targetRevision is the generation of the deployment we're trying to apply. Can not be set on creation of a nodeStatus.` error, which was added by https://github.com/openshift/api/pull/2208/, and it prevented the installer_controller from incorrectly creating a new nodeStatus.)

            Bryce Palmer added a comment - edited

            rhn-engineering-dgoodwin From what I can tell, looking at a couple of the GCP failures, it doesn't look to be the same issue. From the logs and must-gather it looks to me like things are moving along and should eventually get to the right state. I haven't been able to pinpoint exactly what might be causing the delay, but looking at https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-cluster-control-plane-machine-set-operator-release-4.19-periodics-e2e-gcp/1896237384832389120 and the etcd operator logs https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-cluster-control-plane-machine-set-operator-release-4.19-periodics-e2e-gcp/1896237384832389120/artifacts/e2e-gcp/gather-extra/artifacts/pods/openshift-etcd-operator_etcd-operator-7849b869d8-xxpmf_etcd-operator.log, it had just finished progressing the second instance to revision 25, and I expect it was about to trigger the last instance to start transitioning before the logs cut off.

            I also don't see a revision 25 installer pod yet for one of the etcd instances in the set of pod logs (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-cluster-control-plane-machine-set-operator-release-4.19-periodics-e2e-gcp/1896237384832389120/artifacts/e2e-gcp/gather-extra/artifacts/pods/). In the must-gather, it looks like only one installer pod had just successfully completed. I'm also not seeing any issues in the installer pods rolling out the revisions.

            I'm not sure if this is a timing issue or if there is a deeper underlying issue that I haven't spotted yet. Just wanted to provide an initial update that, from an initial glance at the set of GCP failures, this doesn't appear to be the same issue.

            Devan Goodwin added a comment -

            Same job is having issues only on gcp.

            Same issue as this one? Or separate?

            Arda Guclu added a comment -

            From the latest runs, I think we can move this to VERIFIED manually, at least in order to unblock the backport PRs as they require specific Jira statuses.

            Arda Guclu added a comment -

            Ideally, we should wait some time and, after ensuring the problem has disappeared in 4.19, pursue the backports to 4.18. However, we are time-constrained. Maybe the best option is to manually move this bug to VERIFIED and merge the backports.

            Arda Guclu added a comment - edited

            Since rh-ee-bpalmer and I took over this work, I'll drop my findings so far.

            I opened https://github.com/openshift/origin/pull/29550 for better troubleshooting and ran this cluster-bot cluster:

            test e2e 4.19.0-0.nightly-2025-02-14-215306,openshift/api#2203,openshift/origin#29550 aws,ovn 

            This https://prow.ci.openshift.org/view/gs/test-platform-results/logs/release-openshift-origin-installer-launch-aws-modern/1891735658850947072 is the CI run that includes the logs.

            The tests failing in https://github.com/openshift/api/pull/2203 fail in this function: https://github.com/openshift/origin/blob/ff988fb159bb3b23b952861a1eee5f1e22a11ae6/test/extended/apiserver/patch.go#L109

            As rh-ee-bpalmer suspected, currentRevision is defaulted to 0 and that causes the failures (a minimal standalone sketch of the defaulting follows the excerpt below):

            kasOperatorYaml := testdata.MustAsset("test/extended/testdata/apiserver/operator-kube-apiserver-cr.yaml")
            unstructuredKasOperatorManifest := resourceread.ReadUnstructuredOrDie(kasOperatorYaml)
            manifest, _ := json.Marshal(unstructuredKasOperatorManifest)
            framework.Logf("JSONPATCH: unstructuredKasOperatorManifest %s", string(manifest))
            /*
                "apiVersion":"operator.openshift.io/v1",
                "kind":"KubeAPIServer",
                "metadata":{"generateName":"test-apiserver-"},
                "spec":{"managementState":"Managed"},
                "status":{"nodeStatuses":[{"nodeName":"master-1"},{"nodeName":"master-2"}]}}
            */
            unstructuredKasOperator, err := resourceClient.Create(ctx, unstructuredKasOperatorManifest, metav1.CreateOptions{})
            o.Expect(err).NotTo(o.HaveOccurred())
            manifest2, _ := json.Marshal(unstructuredKasOperator)
            framework.Logf("JSONPATCH: unstructuredKasOperator %s", string(manifest2))
            /*
                "apiVersion":"operator.openshift.io/v1",
                "kind":"KubeAPIServer",
                "metadata":{"creationTimestamp":"2025-02-18T08:41:18Z","generateName":"test-apiserver-","generation":1,"managedFields":[{"apiVersion":"operator.openshift.io/v1","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:generateName":{}},"f:spec":{".":{},"f:logLevel":{},"f:managementState":{},"f:operatorLogLevel":{}}},"manager":"openshift-tests","operation":"Update","time":"2025-02-18T08:41:18Z"}],"name":"test-apiserver-2sc47","resourceVersion":"202804","uid":"fde5fdff-a246-49d1-b8e5-469f69ad4f6e"},
                "spec":{"logLevel":"Normal","managementState":"Managed","operatorLogLevel":"Normal"}}
            */
            
            kasOperator := unstructuredToKubeAPIServerOperator(unstructuredKasOperator.Object)
            manifest3, _ := json.Marshal(kasOperator)
            framework.Logf("JSONPATCH: kasOperator %s", string(manifest3))
            /*
                "kind":"KubeAPIServer",
                "apiVersion":"operator.openshift.io/v1",
                "metadata":{"name":"test-apiserver-2sc47","generateName":"test-apiserver-","uid":"fde5fdff-a246-49d1-b8e5-469f69ad4f6e","resourceVersion":"202804","generation":1,"creationTimestamp":"2025-02-18T08:41:18Z","managedFields":[{"manager":"openshift-tests","operation":"Update","apiVersion":"operator.openshift.io/v1","time":"2025-02-18T08:41:18Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:generateName":{}},"f:spec":{".":{},"f:logLevel":{},"f:managementState":{},"f:operatorLogLevel":{}}}}]},
                "spec":{"managementState":"Managed","logLevel":"Normal","operatorLogLevel":"Normal","unsupportedConfigOverrides":null,"observedConfig":null,"forceRedeploymentReason":""},
                "status":{"readyReplicas":0}}
            */
            kasOperatorFromManifest := unstructuredToKubeAPIServerOperator(unstructuredKasOperatorManifest.Object)
            manifest4, _ := json.Marshal(kasOperatorFromManifest)
            framework.Logf("JSONPATCH: kasOperatorFromManifest %s", string(manifest4))
            /*
                "kind":"KubeAPIServer",
                "apiVersion":"operator.openshift.io/v1",
                "metadata":{"generateName":"test-apiserver-","creationTimestamp":null},
                "spec":{"managementState":"Managed","unsupportedConfigOverrides":null,"observedConfig":null,"forceRedeploymentReason":""},
            ----->>    "status":{"readyReplicas":0,"nodeStatuses":[{"nodeName":"master-1","currentRevision":0},{"nodeName":"master-2","currentRevision":0}]}}
            */
            kasOperator.Status = kasOperatorFromManifest.Status
            manifest5, _ := json.Marshal(kasOperator)
            framework.Logf("JSONPATCH: kasOperator with status %s", string(manifest5))
            /*
                "kind":"KubeAPIServer",
                "apiVersion":"operator.openshift.io/v1",
                "metadata":{"name":"test-apiserver-2sc47","generateName":"test-apiserver-","uid":"fde5fdff-a246-49d1-b8e5-469f69ad4f6e","resourceVersion":"202804","generation":1,"creationTimestamp":"2025-02-18T08:41:18Z","managedFields":[{"manager":"openshift-tests","operation":"Update","apiVersion":"operator.openshift.io/v1","time":"2025-02-18T08:41:18Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:generateName":{}},"f:spec":{".":{},"f:logLevel":{},"f:managementState":{},"f:operatorLogLevel":{}}}}]},
                "spec":{"managementState":"Managed","logLevel":"Normal","operatorLogLevel":"Normal","unsupportedConfigOverrides":null,"observedConfig":null,"forceRedeploymentReason":""},
                "status":{"readyReplicas":0,"nodeStatuses":[{"nodeName":"master-1","currentRevision":0},{"nodeName":"master-2","currentRevision":0}]}}
            */
            unstructuredKasOperator, err = resourceClient.UpdateStatus(ctx, kubeAPIServerOperatorToUnstructured(kasOperator), metav1.UpdateOptions{})
            o.Expect(err).NotTo(o.HaveOccurred())
            kasOperator = unstructuredToKubeAPIServerOperator(unstructuredKasOperator.Object)
            o.Expect(kasOperator.Status.NodeStatuses).To(o.Equal([]operatorv1.NodeStatus{
                {NodeName: "master-1"},
                {NodeName: "master-2"},
            })) 
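            To make the defaulting concrete, here is a minimal standalone sketch (not the origin test code; the struct below is a trimmed stand-in whose json tags mirror what the logged output implies) of why converting the manifest into the typed struct produces currentRevision: 0; a plain int32 cannot represent "unset", so the zero value is marshalled back out:

            package main

            import (
                "encoding/json"
                "fmt"
            )

            // Trimmed stand-in for operatorv1.NodeStatus (the real type lives in
            // github.com/openshift/api/operator/v1 and has more fields).
            type NodeStatus struct {
                NodeName        string `json:"nodeName"`
                CurrentRevision int32  `json:"currentRevision"`
                TargetRevision  int32  `json:"targetRevision,omitempty"`
            }

            type Status struct {
                NodeStatuses []NodeStatus `json:"nodeStatuses,omitempty"`
            }

            func main() {
                // The test manifest only sets nodeName on each entry.
                manifest := []byte(`{"nodeStatuses":[{"nodeName":"master-1"},{"nodeName":"master-2"}]}`)

                var status Status
                if err := json.Unmarshal(manifest, &status); err != nil {
                    panic(err)
                }

                // Marshalling the typed struct back re-emits currentRevision: 0 for
                // every entry, because int32 cannot distinguish "unset" from 0. That
                // defaulted 0 is what the UpdateStatus call then sends, and it is what
                // the new validation rejects for newly created node statuses.
                out, _ := json.Marshal(status)
                fmt.Println(string(out))
                // {"nodeStatuses":[{"nodeName":"master-1","currentRevision":0},{"nodeName":"master-2","currentRevision":0}]}
            }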

            Additionally, I opened https://github.com/openshift/api/pull/2206 to exercise whether we can allow currentRevision to be 0.

            For now, my open questions are:

            • Would it be backwards compatible to change the type of these values from int32 to *int32? I assume there would be disruptive side effects (see the sketch after this list).
            • We could modify the test to manually remove the defaulted currentRevision field from the patch, and it should then pass.
            • Would it be safe to allow currentRevision=0, or would this risk overwriting the currentRevision field with this default value?
              • This seems to be safe, as we are only targeting node status creation. Allowing currentRevision=0 for a newly created node status does not conflict with the race we are trying to fix.
              • According to the CI results in https://github.com/openshift/api/pull/2206, allowing currentRevision=0 for new node statuses works.
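            On the first open question, here is a hypothetical sketch (the real API field is a plain int32 today; this pointer variant is illustrative only, not a proposed patch) of how *int32 with omitempty would behave differently, and why the change would still be disruptive:

            package main

            import (
                "encoding/json"
                "fmt"
            )

            // Hypothetical pointer variant of the trimmed NodeStatus above; not the
            // real openshift/api type.
            type NodeStatusPtr struct {
                NodeName        string `json:"nodeName"`
                CurrentRevision *int32 `json:"currentRevision,omitempty"`
            }

            func main() {
                var s NodeStatusPtr
                _ = json.Unmarshal([]byte(`{"nodeName":"master-1"}`), &s)

                // With *int32 + omitempty, an unset revision stays nil and is simply
                // omitted on marshal, so no spurious currentRevision: 0 is sent.
                out, _ := json.Marshal(s)
                fmt.Println(string(out)) // {"nodeName":"master-1"}

                // The cost: every reader and writer of the field now has to handle
                // nil, and generated deepcopy/clients change shape, which is the kind
                // of disruption the question is about.
            }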

            Ben Luddy added a comment -

            I've opened https://github.com/openshift/api/pull/2203 as a centralized mitigation. Only the node controller is intended to add new entries to nodeStatuses, and when it does, it only populates the nodeName field. On the other hand, the installer controller is never meant to add or remove entries from nodeStatuses, only to modify fields of existing entries. The validation I'm proposing should reject writes that simultaneously add a new entry to nodeStatuses and populate the revision fields. This would reject the (incorrect) apply request from the installer controller observed in the resource-watch commit 0995359820c335ed146c81b112680e2b917c3f50 (see https://issues.redhat.com/browse/OCPBUGS-50587?focusedId=26585757&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-26585757).
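            For illustration only, this kind of guard can be expressed as CEL transition rules on the revision fields. The sketch below uses a trimmed stand-in type and kubebuilder markers; the actual rules merged in openshift/api#2203 / #2208 may be worded differently:

            // Trimmed stand-in for operatorv1.NodeStatus, showing only the shape of
            // the validation being described; not the exact markers from openshift/api.
            type NodeStatus struct {
                NodeName string `json:"nodeName"`

                // With optionalOldSelf, oldSelf.hasValue() is false only when this
                // list entry (correlated by nodeName, the list-map key) did not exist
                // in the old object, i.e. on creation of a nodeStatus. The rule
                // therefore allows updates to existing entries but rejects a write
                // that both adds a new entry and populates the revision field.
                // +kubebuilder:validation:XValidation:rule="oldSelf.hasValue() || self == 0",message="currentRevision can not be set on creation of a nodeStatus",optionalOldSelf=true
                CurrentRevision int32 `json:"currentRevision"`

                // +kubebuilder:validation:XValidation:rule="oldSelf.hasValue() || self == 0",message="targetRevision can not be set on creation of a nodeStatus",optionalOldSelf=true
                TargetRevision int32 `json:"targetRevision,omitempty"`
            }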

            David Eads added a comment -

            Not a blocker.

             

            This is caused by a race between the installer controller and node status controller. The code/API level flaw is that we have a multi-writer field that we didn't catch until now.  The race requires multiple updates and cache updates to happen in a specific sequence that is common, but not guaranteed.  In our testing we can see this race unblock itself multiple times in three hours. This makes the revision rollout extremely slow (this is bad), but not permanently stuck.

             

            Since removing masters is a rare activity and since the system unsticks itself eventually, we will ship 4.18.0 with this problem and correct it in a closely following z-stream update.

            Ben Luddy added a comment -

            Thanks to rh-ee-fbabcock for reproducing this with resource-watch enabled in https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_release/61541/rehearse-61541-periodic-ci-openshift-cluster-control-plane-machine-set-operator-release-4.18-periodics-e2e-aws/1889637478990614528.

            Zooming in on one feedback cycle:

            2025-02-12T14:00:34Z    I0212 14:00:34.748546       1 installer_controller.go:550] "ip-10-0-90-124.ec2.internal" is in transition to 24, but has not made progress because pod installer-24-ip-10-0-90-124.ec2.internal disappeared
            2025-02-12T14:00:34Z    I0212 14:00:34.748700       1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"545f1bb8-70bf-4570-9bdc-e3467cbc8ee6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'InstallerPodDisappeared' pods "installer-24-ip-10-0-90-124.ec2.internal" not found
            2025-02-12T14:00:34Z    I0212 14:00:34.766471       1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"545f1bb8-70bf-4570-9bdc-e3467cbc8ee6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MasterNodeRemoved' Observed removal of master node ip-10-0-90-124.ec2.internal 

            And correlating with commits in the resource-watch repo...

            db2dfd965c9aca79f57bf7f8b546ad6ad7ba26c4 (14:00:36) shows the node status entry being removed:

            @@ -304,15 +304,6 @@ metadata:
                     	f:lastFallbackCount: {}
                     	f:nodeName: {}
                     	f:targetRevision: {}
            -      	k:{"nodeName":"ip-10-0-90-124.ec2.internal"}:
            -        	.: {}
            -        	f:currentRevision: {}
            -        	f:lastFailedCount: {}
            -        	f:lastFailedReason: {}
            -        	f:lastFailedRevision: {}
            -        	f:lastFallbackCount: {}
            -        	f:nodeName: {}
            -        	f:targetRevision: {}
                   	k:{"nodeName":"ip-10-0-122-99.ec2.internal"}:
                     	.: {}
                     	f:currentRevision: {}
            @@ -476,7 +467,7 @@ metadata:
             	kind: ClusterVersion
             	name: version
             	uid: 7dd41d7c-4f0a-4399-8fb3-a406ba269c72
            -  resourceVersion: "70809"
            +  resourceVersion: "70811"
               uid: 09e24a30-92e1-4c74-b9eb-a070eb36e124
             spec:
               logLevel: Normal
            @@ -663,11 +654,4 @@ status:
             	lastFallbackCount: 0
             	nodeName: ip-10-0-80-174.ec2.internal
             	targetRevision: 0
            -  - currentRevision: 19
            -	lastFailedCount: 0
            -	lastFailedReason: ""
            -	lastFailedRevision: 0
            -	lastFallbackCount: 0
            -	nodeName: ip-10-0-90-124.ec2.internal
            -	targetRevision: 24
               readyReplicas: 0
            

            And 0995359820c335ed146c81b112680e2b917c3f50 (14:00:38) shows the node status entry coming back because it is still included in an apply configuration from the installer controller (a sketch of the apply mechanics follows the diff):

            @@ -277,12 +277,14 @@ metadata:
                         .: {}
                         f:lastTransitionTime: {}
                         f:message: {}
            +            f:reason: {}
                         f:status: {}
                         f:type: {}
                       k:{"type":"StaticPodsAvailable"}:
                         .: {}
                         f:lastTransitionTime: {}
                         f:message: {}
            +            f:reason: {}
                         f:status: {}
                         f:type: {}
                     f:nodeStatuses:
            @@ -304,6 +306,15 @@ metadata:
                         f:lastFallbackCount: {}
                         f:nodeName: {}
                         f:targetRevision: {}
            +          k:{"nodeName":"ip-10-0-90-124.ec2.internal"}:
            +            .: {}
            +            f:currentRevision: {}
            +            f:lastFailedCount: {}
            +            f:lastFailedReason: {}
            +            f:lastFailedRevision: {}
            +            f:lastFallbackCount: {}
            +            f:nodeName: {}
            +            f:targetRevision: {}
                       k:{"nodeName":"ip-10-0-122-99.ec2.internal"}:
                         .: {}
                         f:currentRevision: {}
            @@ -316,7 +327,7 @@ metadata:
                 manager: etcd-Installer
                 operation: Apply
                 subresource: status
            -    time: "2025-02-12T13:59:15Z"
            +    time: "2025-02-12T13:59:21Z"
               - apiVersion: operator.openshift.io/v1
                 fieldsType: FieldsV1
                 fieldsV1:
            @@ -466,7 +477,7 @@ metadata:
                 kind: ClusterVersion
                 name: version
                 uid: 7dd41d7c-4f0a-4399-8fb3-a406ba269c72
            -  resourceVersion: "70813"
            +  resourceVersion: "70836"
               uid: 09e24a30-92e1-4c74-b9eb-a070eb36e124
             spec:
               logLevel: Normal
            @@ -548,11 +559,13 @@ status:
               - lastTransitionTime: "2025-02-12T12:42:07Z"
                 message: 1 node is at revision 15; 1 node is at revision 19; 2 nodes are at revision
                   21; 0 nodes have achieved new revision 24
            +    reason: ""
                 status: "True"
                 type: NodeInstallerProgressing
               - lastTransitionTime: "2025-02-12T12:13:59Z"
                 message: 4 nodes are active; 1 node is at revision 15; 1 node is at revision 19;
                   2 nodes are at revision 21; 0 nodes have achieved new revision 24
            +    reason: ""
                 status: "True"
                 type: StaticPodsAvailable
               - lastTransitionTime: "2025-02-12T12:12:19Z"
            @@ -652,4 +665,11 @@ status:
                 lastFallbackCount: 0
                 nodeName: ip-10-0-80-174.ec2.internal
                 targetRevision: 0
            +  - currentRevision: 19
            +    lastFailedCount: 0
            +    lastFailedReason: ""
            +    lastFailedRevision: 0
            +    lastFallbackCount: 0
            +    nodeName: ip-10-0-90-124.ec2.internal
            +    targetRevision: 24
               readyReplicas: 0
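
            To make the apply mechanics explicit: because status.nodeStatuses is a map-type list keyed by nodeName, a server-side apply whose apply configuration was built from a stale cache and still lists the removed node re-creates that entry and claims ownership of its fields. The snippet below is only a rough sketch of what such a request amounts to (the real installer controller builds its apply configuration through typed clients; the names, revisions, and field manager here are taken from the diff above):

            package sketch

            import (
                "context"

                metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
                "k8s.io/apimachinery/pkg/runtime/schema"
                "k8s.io/apimachinery/pkg/types"
                "k8s.io/client-go/dynamic"
            )

            // reapplyStaleStatus shows how an apply that still carries the removed
            // node's entry brings that entry back into status.nodeStatuses.
            func reapplyStaleStatus(ctx context.Context, dc dynamic.Interface) error {
                // Apply configuration built from a cache that still contains the
                // removed node. JSON is acceptable for an apply patch body.
                patch := []byte(`{
                    "apiVersion": "operator.openshift.io/v1",
                    "kind": "Etcd",
                    "metadata": {"name": "cluster"},
                    "status": {"nodeStatuses": [
                        {"nodeName": "ip-10-0-90-124.ec2.internal", "currentRevision": 19, "targetRevision": 24}
                    ]}
                }`)

                gvr := schema.GroupVersionResource{Group: "operator.openshift.io", Version: "v1", Resource: "etcds"}

                force := true
                // Server-side apply to the status subresource as the installer's field
                // manager: the apiserver merges the entry back into nodeStatuses even
                // though the node controller had just removed it.
                _, err := dc.Resource(gvr).Patch(ctx, "cluster", types.ApplyPatchType, patch,
                    metav1.PatchOptions{FieldManager: "etcd-Installer", Force: &force}, "status")
                return err
            }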
            

            Ben Luddy added a comment -

            Digging into https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-cluster-control-plane-machine-set-operator-release-4.18-periodics-e2e-azure/1887581759378296832 and focusing on cluster-etcd-operator, here are some initial observations...

            There is clearly a cycle between the installer controller repeatedly creating installer pods on nodes that have been removed and the orphaned pods being garbage collected. Occasionally creating installer pods on removed nodes is (I think) not really a problem. The main problem is that the installer controller keeps choosing to work on a removed node, and that decision is based on the operator's status.nodeStatuses slice. This suggests a problem with the static pod node controller, which is responsible for adding and removing entries from status.nodeStatuses.

            The static pod node controller issues a JSON Patch request to remove entries from status.nodeStatuses when they have no corresponding Node, and we see that it's taking the codepath that does this repeatedly:

            Time              Namespace                Component                                   RelatedObject  Reason             Message
            22:07:09 (x1067)  openshift-etcd-operator  openshift-cluster-etcd-operator-etcd-node  etcd-operator  MasterNodeRemoved  Observed removal of master node ci-op-rjdpht1r-9b5b3-pwg9w-master-1
            22:43:58 (x481)   openshift-etcd-operator  openshift-cluster-etcd-operator-etcd-node  etcd-operator  MasterNodeRemoved  Observed removal of master node ci-op-rjdpht1r-9b5b3-pwg9w-master-2

            The node controller also self-reports this:

            NodeControllerDegraded: failed applying JSONPatch, err: the server rejected our request due to an error in our request 

            That is what clients see for any error that occurs while applying a well-formed JSON Patch to an object (https://github.com/openshift/kubernetes/blob/acb9630c20d2fa0d11bad4c38c7b28e2db9d818b/staging/src/k8s.io/apiserver/pkg/endpoints/handlers/patch.go#L394). It would occasionally be expected when the order or membership of status.nodeStatuses changes between when the node controller observed it and when the apiserver actually attempts to apply the patch.
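
            As a sketch of that failure mode (the exact patch the node controller builds in library-go may differ), a removal by list index is typically guarded by a test op on the entry observed at that index, so if another writer has reordered or already changed status.nodeStatuses since the controller read it from its cache, the test no longer matches, the whole patch is rejected, and the client sees the 422 described above:

            package sketch

            import (
                "context"
                "fmt"

                metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
                "k8s.io/apimachinery/pkg/runtime/schema"
                "k8s.io/apimachinery/pkg/types"
                "k8s.io/client-go/dynamic"
            )

            // removeNodeStatus removes status.nodeStatuses[index], guarded by a test
            // op on the nodeName the controller observed at that index.
            func removeNodeStatus(ctx context.Context, dc dynamic.Interface, index int, nodeName string) error {
                patch := []byte(fmt.Sprintf(
                    `[{"op":"test","path":"/status/nodeStatuses/%d/nodeName","value":"%s"},`+
                        `{"op":"remove","path":"/status/nodeStatuses/%d"}]`,
                    index, nodeName, index))

                gvr := schema.GroupVersionResource{Group: "operator.openshift.io", Version: "v1", Resource: "etcds"}

                // If the list changed since it was observed, the test op fails and the
                // apiserver returns HTTP 422, surfaced to clients as "the server
                // rejected our request due to an error in our request".
                _, err := dc.Resource(gvr).Patch(ctx, "cluster", types.JSONPatchType, patch,
                    metav1.PatchOptions{}, "status")
                return err
            }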

            We only have audit logs from the post-rollout nodes. From the audit logs we do have, I only see 6 (of 180) patches to the etcd operator status getting HTTP 422 responses. I can't tell if all of those requests are coming from the node controller.

            The operator status controller is observing flapping in NodeControllerDegraded:

            I0206 22:30:25.448296       1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"aa2b8361-af5c-4677-99d4-477409af4710", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: failed applying JSONPatch, err: the server rejected our request due to an error in our request\nEtcdMembersDegraded: No unhealthy members found"
            I0206 22:30:25.489343       1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"aa2b8361-af5c-4677-99d4-477409af4710", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: failed applying JSONPatch, err: the server rejected our request due to an error in our request\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
            I0206 22:30:31.084120       1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"aa2b8361-af5c-4677-99d4-477409af4710", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: failed applying JSONPatch, err: the server rejected our request due to an error in our request\nEtcdMembersDegraded: No unhealthy members found"
            I0206 22:30:31.131189       1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"aa2b8361-af5c-4677-99d4-477409af4710", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: failed applying JSONPatch, err: the server rejected our request due to an error in our request\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
            I0206 22:30:39.046442       1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"aa2b8361-af5c-4677-99d4-477409af4710", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: failed applying JSONPatch, err: the server rejected our request due to an error in our request\nEtcdMembersDegraded: No unhealthy members found"
            I0206 22:30:39.086478       1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"aa2b8361-af5c-4677-99d4-477409af4710", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: failed applying JSONPatch, err: the server rejected our request due to an error in our request\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
            I0206 22:30:43.056356       1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"aa2b8361-af5c-4677-99d4-477409af4710", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: failed applying JSONPatch, err: the server rejected our request due to an error in our request\nEtcdMembersDegraded: No unhealthy members found"
            I0206 22:30:43.087734       1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"aa2b8361-af5c-4677-99d4-477409af4710", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: failed applying JSONPatch, err: the server rejected our request due to an error in our request\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
            I0206 22:31:03.055778       1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"aa2b8361-af5c-4677-99d4-477409af4710", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: failed applying JSONPatch, err: the server rejected our request due to an error in our request\nEtcdMembersDegraded: No unhealthy members found"
            I0206 22:31:03.093115       1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"aa2b8361-af5c-4677-99d4-477409af4710", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: failed applying JSONPatch, err: the server rejected our request due to an error in our request\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
            I0206 22:31:41.301106       1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"aa2b8361-af5c-4677-99d4-477409af4710", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: failed applying JSONPatch, err: the server rejected our request due to an error in our request\nEtcdMembersDegraded: No unhealthy members found"
            I0206 22:31:41.331253       1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"aa2b8361-af5c-4677-99d4-477409af4710", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: failed applying JSONPatch, err: the server rejected our request due to an error in our request\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
            I0206 22:32:15.270467       1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"aa2b8361-af5c-4677-99d4-477409af4710", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: failed applying JSONPatch, err: the server rejected our request due to an error in our request\nEtcdMembersDegraded: No unhealthy members found"
            I0206 22:32:15.302992       1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"aa2b8361-af5c-4677-99d4-477409af4710", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: failed applying JSONPatch, err: the server rejected our request due to an error in our request\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"

            I went looking for a resource-watch tarball in the artifacts for this job and did not find it; that would have been interesting. There may be some kind of feedback between the node controller and the other controllers, with the effect that the node controller's nodeStatus element deletion gets repeatedly stomped and is never observed by the installer controller.

            This seems to be consistent across all static pod operators (scheduler, kube-controller-manager, etcd, and kube-apiserver) judging by the volume of duplicate MasterNodeRemoved events put out by their respective node controllers. It also looks like this state may eventually self-correct, but it can take a very long time (30m+). The test in question fails due to timeout.

