Loading...

Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: rhos-18.0.0
Affects Version/s: rhos-18.0.0
Component/s: openstack-nova
Labels:
- Patch
- Triaged

Story Points:
3
Blocked:
False
Blocked Reason:

Hide

None

Show
None
Ready:
False
Fixed in Build:
openstack-nova-27.1.1-18.0.20230720224959.b9089ac.el9ost
Regression:
None
Intelligence Requested:
Market:

Severity:
Moderate

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

+++ This bug was initially created as a clone of Bug #2170686 +++

+++ This bug was initially created as a clone of Bug #2170685 +++

+++ This bug was initially created as a clone of Bug #2170683 +++

+++ This bug was initially created as a clone of Bug #2168206 +++

Description of problem:
Since upgrade to 16.1.9 they're seeing these errors on mysql logs from DB nodes.
They also say the deleted instances are not being purged from the DB and rapidly increasing.
The load on the master DB node is increasing more than the average from time to time.

Version-Release number of selected component (if applicable):
RHOSP 16.1.9

Additional info:
In next updates and sosreports from all 3 DB and controller nodes are attached to the case.
Also DB dumps from nova and cinder DBs.

— Additional comment from Joaquín Veira on 2023-02-08 13:05:08 UTC —

Full error:

2023-02-01 21:40:10 1190123 [ERROR] InnoDB: Clustered record for sec rec not found index `instances_project_id_deleted_idx` of table `nova_cell0`.`instances`
InnoDB: sec index record PHYSICAL RECORD: n_fields 3; compact format; info bits 0
0: len 30; hex 666530353739313433626261343761653835303162356336633765383038; asc fe0579143bba47ae8501b5c6c7e808; (total 32 bytes);
1: len 4; hex 80000000; asc ;;
2: len 4; hex 81253331; asc %31;;

InnoDB: clust index record PHYSICAL RECORD: n_fields 55; compact format; info bits 0
0: len 4; hex 812520c7; asc % ;;
1: len 6; hex 000000000000; asc ;;
2: len 7; hex 80000000000000; asc ;;
3: len 5; hex 99ade0cb77; asc w;;
4: len 5; hex 99ade0cb77; asc w;;
5: SQL NULL;
6: SQL NULL;
7: len 30; hex 656661326266373836323135343634653934663764633563656230373732; asc efa2bf786215464e94f7dc5ceb0772; (total 32 bytes);
8: len 30; hex 666530353739313433626261343761653835303162356336633765383038; asc fe0579143bba47ae8501b5c6c7e808; (total 32 bytes);
9: len 30; hex 38623861623261312d653334392d343331332d396133382d613830306634; asc 8b8ab2a1-e349-4313-9a38-a800f4; (total 36 bytes);
10: len 0; hex ; asc ;;
11: len 0; hex ; asc ;;
12: len 4; hex 80000000; asc ;;
13: len 10; hex 6469726563746f722d64; asc director-d;;
14: len 30; hex 7373682d727361204141414142334e7a6143317963324541414141444151; asc ssh-rsa AAAAB3NzaC1yc2EAAAADAQ; (total 433 bytes);
15: len 4; hex 80000000; asc ;;
16: len 5; hex 6572726f72; asc error;;
17: len 4; hex 80002000; asc ;;
18: len 4; hex 80000002; asc ;;
19: len 11; hex 6c6a2d62756c6b74657374; asc lj-bulktest;;
20: SQL NULL;
21: SQL NULL;
22: len 10; hex 722d6338317875703276; asc r-c81xup2v;;
23: SQL NULL;
24: SQL NULL;
25: len 11; hex 6c6a2d62756c6b74657374; asc lj-bulktest;;
26: SQL NULL;
27: len 4; hex 6e6f7661; asc nova;;
28: len 1; hex 80; asc ;;
29: SQL NULL;
30: SQL NULL;
31: len 4; hex 8000062c; asc ,;;
32: SQL NULL;
33: len 30; hex 63386464376365342d616631392d343034342d623831302d353066633131; asc c8dd7ce4-af19-4044-b810-50fc11; (total 36 bytes);
34: SQL NULL;
35: SQL NULL;
36: SQL NULL;
37: SQL NULL;
38: len 0; hex ; asc ;;
39: SQL NULL;
40: SQL NULL;
41: SQL NULL;
42: len 4; hex 80000000; asc ;;
43: len 1; hex 80; asc ;;
44: len 1; hex 80; asc ;;
45: len 1; hex 80; asc ;;
46: len 4; hex 8000001e; asc ;;
47: len 4; hex 80000000; asc ;;
48: SQL NULL;
49: SQL NULL;
50: len 4; hex 80000000; asc ;;
51: SQL NULL;
52: len 4; hex 80000000; asc ;;
53: SQL NULL;
54: len 1; hex 80; asc ;;

TRANSACTION 422121737885096, ACTIVE 0 sec fetching rows
mysql tables in use 7, locked 0
0 lock struct(s), heap size 1128, 0 row lock(s)
MySQL thread id 1190123, OS thread handle 140636920887040, query id 34272351 172.16.21.72 nova Creating sort index
SELECT anon_1.instances_created_at AS anon_1_instances_created_at, anon_1.instances_updated_at AS anon_1_instances_updated_at, anon_1.instances_deleted_at AS anon_1_instances_deleted_at, anon_1.instances_deleted AS anon_1_instances_deleted, anon_1.instances_id AS anon_1_instances_id, anon_1.instances_user_id AS anon_1_instances_user_id, anon_1.instances_project_id AS anon_1_instances_project_id, anon_1.instances_image_ref AS anon_1_instances_image_ref, anon_1.instances_kernel_id AS anon_1_instances_kernel_id, anon_1.instances_ramdisk_id AS anon_1_instances_ramdisk_id, anon_1.instances_hostname

As for the instance archiving. This is our workflow:

[root@ctrl-nova-d-01 ~]# podman exec -it nova_api_cron /bin/bash
()[root@ctrl-nova-d-01 /]# cat /var/spool/cron/nova

HEADER: This file was autogenerated at 2023-01-22 17:48:07 +0000 by puppet.
HEADER: While it can still be managed manually, it is definitely not recommended.
HEADER: Note particularly that the comments starting with 'Puppet Name' should
HEADER: not be deleted, as doing so could cause duplicate cron jobs.
Puppet Name: nova-manage db archive_deleted_rows
PATH=/bin:/usr/bin:/usr/sbin SHELL=/bin/sh
1 0 * * * sleep `expr ${RANDOM} % 3600`; nova-manage db archive_deleted_rows --max_rows 1000 --before `date --date='today - 1 days' +%F` --until-complete --all-cells >>/var/log/nova/nova-rowsflush.log 2>&1
Puppet Name: nova-manage db purge
PATH=/bin:/usr/bin:/usr/sbin SHELL=/bin/sh
0 5 * * * sleep `expr ${RANDOM} % 3600`; nova-manage db purge --before `date --date='today - 14 days' +%D` --all-cells >>/var/log/nova/nova-rowspurge.log 2>&1

Even if we remove --until-complete or --before not much is archived but we still have total count of instances now over 100k which will pose a problem over time. To give you an example:

()[root@ctrl-nova-d-01 /]# nova-manage db archive_deleted_rows --max_rows 10000 --all-cells --verbose
---------------------------------------------------------+

Table

Number of Rows Archived

---------------------------------------------------------+

API_DB.instance_group_member	0
API_DB.instance_mappings	0
API_DB.request_specs	0
default.block_device_mapping	3
default.instance_extra	3
default.instance_faults	1
default.instance_id_mappings	3
default.instance_info_caches	3
default.instance_metadata	14
default.instance_system_metadata	51
default.virtual_interfaces	3

---------------------------------------------------------+

[root@ctrl-db-d-01 ~]# mysql -h 172.16.21.59 -e "select count from instances where deleted <> 0" nova;
----------

count

----------

104485

----------

So we still have 104k deleted instances but above db clean up is not cleaning them up.

As for the cinder cleanup. The command returns an error:

()[root@ctrl-api-d-02 /]# cinder-manage --debug db purge 1
Deprecated: Option "logdir" from group "DEFAULT" is deprecated. Use option "log-dir" from group "DEFAULT".
Purge command failed, check cinder-manage logs for more details.

The logs shows:

2023-02-02 15:58:06.302 770627 INFO cinder.db.sqlalchemy.api [req-cc5cdb71-9bae-4821-8c2f-d808f2992624 - - - - -] Purging deleted rows older than age=1 days from table=volume_glance_metadata
2023-02-02 15:58:06.407 770627 INFO cinder.db.sqlalchemy.api [req-cc5cdb71-9bae-4821-8c2f-d808f2992624 - - - - -] Deleted 73 rows from table=volume_glance_metadata
2023-02-02 15:58:06.408 770627 INFO cinder.db.sqlalchemy.api [req-cc5cdb71-9bae-4821-8c2f-d808f2992624 - - - - -] Purging deleted rows older than age=1 days from table=snapshot_metadata
2023-02-02 15:58:06.412 770627 INFO cinder.db.sqlalchemy.api [req-cc5cdb71-9bae-4821-8c2f-d808f2992624 - - - - -] Purging deleted rows older than age=1 days from table=attachment_specs
2023-02-02 15:58:06.416 770627 INFO cinder.db.sqlalchemy.api [req-cc5cdb71-9bae-4821-8c2f-d808f2992624 - - - - -] Purging deleted rows older than age=1 days from table=volume_type_projects
2023-02-02 15:58:06.419 770627 INFO cinder.db.sqlalchemy.api [req-cc5cdb71-9bae-4821-8c2f-d808f2992624 - - - - -] Purging deleted rows older than age=1 days from table=volume_type_extra_specs
2023-02-02 15:58:06.422 770627 INFO cinder.db.sqlalchemy.api [req-cc5cdb71-9bae-4821-8c2f-d808f2992624 - - - - -] Purging deleted rows older than age=1 days from table=volume_metadata
2023-02-02 15:58:06.436 770627 INFO cinder.db.sqlalchemy.api [req-cc5cdb71-9bae-4821-8c2f-d808f2992624 - - - - -] Deleted 26 rows from table=volume_metadata
2023-02-02 15:58:06.436 770627 INFO cinder.db.sqlalchemy.api [req-cc5cdb71-9bae-4821-8c2f-d808f2992624 - - - - -] Purging deleted rows older than age=1 days from table=volume_attachment
2023-02-02 15:58:06.469 770627 INFO cinder.db.sqlalchemy.api [req-cc5cdb71-9bae-4821-8c2f-d808f2992624 - - - - -] Deleted 30 rows from table=volume_attachment
2023-02-02 15:58:06.470 770627 INFO cinder.db.sqlalchemy.api [req-cc5cdb71-9bae-4821-8c2f-d808f2992624 - - - - -] Purging deleted rows older than age=1 days from table=volume_admin_metadata
2023-02-02 15:58:06.476 770627 INFO cinder.db.sqlalchemy.api [req-cc5cdb71-9bae-4821-8c2f-d808f2992624 - - - - -] Purging deleted rows older than age=1 days from table=transfers
2023-02-02 15:58:06.480 770627 INFO cinder.db.sqlalchemy.api [req-cc5cdb71-9bae-4821-8c2f-d808f2992624 - - - - -] Purging deleted rows older than age=1 days from table=snapshots
2023-02-02 15:58:06.484 770627 INFO cinder.db.sqlalchemy.api [req-cc5cdb71-9bae-4821-8c2f-d808f2992624 - - - - -] Purging deleted rows older than age=1 days from table=group_volume_type_mapping
2023-02-02 15:58:06.488 770627 INFO cinder.db.sqlalchemy.api [req-cc5cdb71-9bae-4821-8c2f-d808f2992624 - - - - -] Purging deleted rows older than age=1 days from table=workers
2023-02-02 15:58:06.491 770627 INFO cinder.db.sqlalchemy.api [req-cc5cdb71-9bae-4821-8c2f-d808f2992624 - - - - -] Purging deleted rows older than age=1 days from table=volumes
2023-02-02 15:58:09.336 770627 ERROR cinder.db.sqlalchemy.api [req-cc5cdb71-9bae-4821-8c2f-d808f2992624 - - - - -] DBError detected when purging from volumes: (pymysql.err.IntegrityError) (1451, 'Cannot delete or update a parent row: a foreign key constraint fails (`cinder`.`volume_attachment`, CONSTRAINT `volume_attachment_ibfk_1` FOREIGN KEY (`volume_id`) REFERENCES `volumes` (`id`))')
[SQL: DELETE FROM volumes WHERE volumes.deleted_at < %(deleted_at_1)s]
[parameters:

{'deleted_at_1': datetime.datetime(2023, 2, 1, 15, 58, 6, 491416)}

]
(Background on this error at: http://sqlalche.me/e/gkpj).: oslo_db.exception.DBReferenceError: (pymysql.err.IntegrityError) (1451, 'Cannot delete or update a parent row: a foreign key constraint fails (`cinder`.`volume_attachment`, CONSTRAINT `volume_attachment_ibfk_1` FOREIGN KEY (`volume_id`) REFERENCES `volumes` (`id`))')

So it appears we have a serious issues with purging old db records too.

No issues with glance purge.

— Additional comment from RHEL Program Management on 2023-02-16 22:30:08 UTC —

This item has been properly Triaged and planned for the release, and Target Release is now set to match the release flag. For details, see https://mojo.redhat.com/docs/DOC-1195410

— Additional comment from RHEL Program Management on 2023-02-16 22:34:21 UTC —

This item has been properly Triaged and planned for the release, and Target Release is now set to match the release flag.

external trackers

OpenStack gerrit 887978

OpenStack gerrit 887979

Red Hat Engineering Gerrit 444838