OpenShift Logging / LOG-2919

CLO is constantly failing to create already existing logging objects (HTTP 409)

    • Before this update, the Operators' general pattern for reconciling resources was to try to create before attempting to get or update, which would lead to constant HTTP 409 responses after creation. With this update, Operators first attempt to retrieve an object and only create or update it if it is either missing or not as specified. (A minimal sketch of this pattern follows the field list below.)
    • Log Collection - Sprint 226, Log Collection - Sprint 227
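
      A minimal sketch of the reconciliation pattern described in the release note above: get first, then create only if the object is missing and update only if it is not as specified. This is illustrative only; it assumes a controller-runtime style client, and the package and function names are made up, not the actual cluster-logging-operator code.

      package sketch

      import (
          "context"

          corev1 "k8s.io/api/core/v1"
          "k8s.io/apimachinery/pkg/api/equality"
          apierrors "k8s.io/apimachinery/pkg/api/errors"
          "sigs.k8s.io/controller-runtime/pkg/client"
      )

      // reconcileConfigMap fetches the live object first and only writes to the API
      // when the object is missing or differs from the desired state.
      func reconcileConfigMap(ctx context.Context, c client.Client, desired *corev1.ConfigMap) error {
          current := &corev1.ConfigMap{}
          err := c.Get(ctx, client.ObjectKeyFromObject(desired), current)
          if apierrors.IsNotFound(err) {
              // Only a genuinely missing object triggers a CREATE.
              return c.Create(ctx, desired)
          }
          if err != nil {
              return err
          }
          // The object exists: update only if it is not as specified.
          if !equality.Semantic.DeepEqual(current.Data, desired.Data) {
              current.Data = desired.Data
              return c.Update(ctx, current)
          }
          return nil // already as specified, no API write at all
      }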

      While looking into an API issue, we found the CLO constantly trying to create/recreate its objects, causing a large number of HTTP 409 errors against the API.

      From the API logs, we are seeing around 7500 failures per hour in a small lab cluster.

      kubectl-dev_tool audit -f ./kube-apiserver --by resource --user=system:serviceaccount:openshift-logging:cluster-logging-operator --failed-only -otop
      count: 14634, first: 2022-08-09T14:47:06-04:00, last: 2022-08-09T16:42:51-04:00, duration: 1h55m45.181771s
      3191x                v1/configmaps
      1504x                monitoring.coreos.com/prometheusrules
      1504x                v1/services
      1504x                monitoring.coreos.com/servicemonitors
      935x                 v1/serviceaccounts
      752x                 apps/daemonsets
      752x                 rbac.authorization.k8s.io/roles
      752x                 security.openshift.io/v1/securitycontextconstraints
      752x                 rbac.authorization.k8s.io/v1/clusterrolebindings
      752x                 scheduling.k8s.io/v1/priorityclasses 

      Looking at the actual failed requests, it's all HTTP 409s trying to create things that already exist (the create-first pattern behind this is sketched after the output below):

      kubectl-dev_tool audit -f ./kube-apiserver --by resource --user=system:serviceaccount:openshift-logging:cluster-logging-operator --failed-only
      had 1115196 line read failures
      18:47:06 [CREATE][     7.088ms] [409] /apis/scheduling.k8s.io/v1/priorityclasses                                          [system:serviceaccount:openshift-logging:cluster-logging-operator]
      18:47:06 [CREATE][     7.883ms] [409] /api/v1/namespaces/openshift-logging/serviceaccounts                                [system:serviceaccount:openshift-logging:cluster-logging-operator]
      18:47:06 [CREATE][     8.077ms] [409] /apis/security.openshift.io/v1/securitycontextconstraints                           [system:serviceaccount:openshift-logging:cluster-logging-operator]
      18:47:06 [CREATE][     8.343ms] [409] /apis/rbac.authorization.k8s.io/v1/namespaces/openshift-logging/roles               [system:serviceaccount:openshift-logging:cluster-logging-operator]
      18:47:06 [CREATE][    12.699ms] [409] /apis/rbac.authorization.k8s.io/v1/namespaces/openshift-logging/rolebindings        [system:serviceaccount:openshift-logging:cluster-logging-operator]
      18:47:06 [CREATE][     8.438ms] [409] /apis/rbac.authorization.k8s.io/v1/clusterroles                                     [system:serviceaccount:openshift-logging:cluster-logging-operator]
      18:47:06 [CREATE][     8.497ms] [409] /apis/rbac.authorization.k8s.io/v1/clusterrolebindings                              [system:serviceaccount:openshift-logging:cluster-logging-operator]
      18:47:06 [CREATE][    16.374ms] [409] /api/v1/namespaces/openshift-logging/services                                       [system:serviceaccount:openshift-logging:cluster-logging-operator]
      18:47:06 [CREATE][     7.865ms] [409] /apis/monitoring.coreos.com/v1/namespaces/openshift-logging/servicemonitors         [system:serviceaccount:openshift-logging:cluster-logging-operator]
      18:47:06 [CREATE][     11.21ms] [409] /apis/monitoring.coreos.com/v1/namespaces/openshift-logging/prometheusrules         [system:serviceaccount:openshift-logging:cluster-logging-operator]
      18:47:06 [CREATE][    10.356ms] [409] /api/v1/namespaces/openshift-logging/configmaps                                     [system:serviceaccount:openshift-logging:cluster-logging-operator]
      18:47:06 [CREATE][     6.324ms] [409] /api/v1/namespaces/openshift-logging/configmaps                                     [system:serviceaccount:openshift-logging:cluster-logging-operator]
      18:47:06 [DELETE][     1.694ms] [404] /api/v1/namespaces/openshift-logging/services/fluentd                               [system:serviceaccount:openshift-logging:cluster-logging-operator]
      18:47:06 [DELETE][     2.167ms] [404] /apis/monitoring.coreos.com/v1/namespaces/openshift-logging/servicemonitors/fluentd [system:serviceaccount:openshift-logging:cluster-logging-operator]
      18:47:06 [DELETE][     2.032ms] [404] /apis/monitoring.coreos.com/v1/namespaces/openshift-logging/prometheusrules/fluentd [system:serviceaccount:openshift-logging:cluster-logging-operator] 
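
      For contrast, a rough sketch of the create-first flow that produces the audit entries above: every reconcile pass unconditionally issues a CREATE, the API server answers 409 AlreadyExists for each object that is already there, and the operator then falls back to get/update. Reconciliation still succeeds, but each pass adds failed requests to the audit log. Again illustrative only; package and function names are assumptions, not the actual CLO code.

      package sketch

      import (
          "context"

          corev1 "k8s.io/api/core/v1"
          apierrors "k8s.io/apimachinery/pkg/api/errors"
          "sigs.k8s.io/controller-runtime/pkg/client"
      )

      // createThenUpdate issues a CREATE on every pass; once the object exists the
      // API server returns 409 AlreadyExists and the code falls back to get/update,
      // so the operator keeps working while each pass records a failed request.
      func createThenUpdate(ctx context.Context, c client.Client, desired *corev1.ConfigMap) error {
          err := c.Create(ctx, desired) // after the first pass this is always a 409
          if err == nil {
              return nil
          }
          if !apierrors.IsAlreadyExists(err) {
              return err
          }
          current := &corev1.ConfigMap{}
          if err := c.Get(ctx, client.ObjectKeyFromObject(desired), current); err != nil {
              return err
          }
          current.Data = desired.Data
          return c.Update(ctx, current)
      }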


            Qiaoling Tang added a comment - Verified using cluster-logging.v5.6.0.

            GitLab CEE Bot added a comment - CPaaS Service Account mentioned this issue in merge request !248 of openshift-logging / Log Collection Midstream on branch openshift-logging-5.6-rhel-8_upstream_bd89702a0be8fc97812be8b73431ea96: Updated US source to: a25ac97 Merge pull request #1701 from jcantrill/log2789

            Jeffrey Cantrill added a comment - Some issues you may be seeing are already reported in LOG-3049. rojacob@redhat.com just recently discovered the issue for LOG-3049.

            GitLab CEE Bot added a comment - CPaaS Service Account mentioned this issue in merge request !135 of openshift-logging / Log Collection Midstream on branch openshift-logging-5.6-rhel-8_upstream_d25d5ee8d88291462ccef1912b6d2450: Updated 2 upstream sources

            GitLab CEE Bot added a comment - CPaaS Service Account mentioned this issue in merge request !91 of openshift-logging / Log Collection Midstream on branch openshift-logging-5.6-rhel-8_upstream_bdc48bb458b8eafbbe80de10170aff4d: Updated 6 upstream sources

            Matt Robson added a comment - I will confirm once I get new data and let you know.

            Jeffrey Cantrill added a comment - If you consider it resolved with the upgrade, I would propose we close this as fixed in the next release. We are unlikely to fix this in 5.4.

            Jeffrey Cantrill added a comment - rhn-support-mrobson this is partially because of the "strategy" for object reconciliation. With the release of 5.5 we have moved to a "watch" from a 30s "periodic" poll, which should alleviate part of this issue. Is there any way you might be able to confirm whether there is an improvement in 5.5?
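
            A rough sketch of the difference Jeffrey describes, assuming a controller-runtime based operator. With the pre-5.5 behaviour of requeuing every reconcile on a fixed 30s interval, the full object set was re-reconciled (and, with the old create-first pattern, a burst of 409s logged) twice a minute; a watch-based setup only reconciles when the CR or one of its owned objects actually changes. The loggingv1 import path and the type names here are assumptions, not verified against the actual cluster-logging-operator source.

            package sketch

            import (
                "context"

                appsv1 "k8s.io/api/apps/v1"
                corev1 "k8s.io/api/core/v1"
                ctrl "sigs.k8s.io/controller-runtime"
                "sigs.k8s.io/controller-runtime/pkg/client"

                loggingv1 "github.com/openshift/cluster-logging-operator/apis/logging/v1" // assumed import path
            )

            type Reconciler struct {
                client.Client
            }

            func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
                // Reconcile owned objects here (get first, create/update only if needed).
                // Returning ctrl.Result{RequeueAfter: 30 * time.Second} would force a full
                // pass every 30 seconds; an empty Result relies on the watches registered below.
                return ctrl.Result{}, nil
            }

            // SetupWithManager registers watches on the CR and on the objects it owns,
            // so changes to those objects drive reconciliation instead of a timer.
            func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error {
                return ctrl.NewControllerManagedBy(mgr).
                    For(&loggingv1.ClusterLogging{}).
                    Owns(&corev1.ConfigMap{}).
                    Owns(&corev1.Service{}).
                    Owns(&appsv1.DaemonSet{}).
                    Complete(r)
            }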
