Extremely slow SQL queries after RT3 to RT4 upgrade

I’m currently trying to migrate from RT 3.8.2 to RT 4.4.1 and I’m running into some slow SQL queries (2-4 mins) which are causing the user homepage/landing page interface to time out for users who do not have the superuser right. I’m using MariaDB 5.5.52 on Centos 7.

By logging slow queries on the database, I’ve been able to investigate a bit further. One exemplar is:

SELECT COUNT(DISTINCT main.id) FROM Tickets main LEFT JOIN Groups Groups_1 ON ( Groups_1.Domain = ‘RT::Ticket-Role’ ) AND ( Groups_1.Instance = main.id ) LEFT JOIN CachedGroupMembers CachedGroupMembers_2 ON ( CachedGroupMembers_2.Disabled = ‘0’ ) AND ( CachedGroupMembers_2.MemberId = ‘3456’ ) AND ( CachedGroupMembers_2.GroupId = Groups_1.id ) WHERE ( ( main.Queue IN (‘3’, ‘4’, ‘6’, ‘7’, ‘7’, ‘8’, ‘8’, ‘9’, ‘9’, ‘10’, ‘10’, ‘10’, ‘11’, ‘11’, ‘11’, ‘12’, ‘12’, ‘13’, ‘13’, ‘13’, ‘15’, ‘15’, ‘16’, ‘17’, ‘19’, ‘19’, ‘20’, ‘21’, ‘21’, ‘21’, ‘22’, ‘25’, ‘26’, ‘26’, ‘26’, ‘28’, ‘29’, ‘29’, ‘31’, ‘31’, ‘34’, ‘37’, ‘38’, ‘38’, ‘40’, ‘40’, ‘41’, ‘42’, ‘46’, ‘47’, ‘48’, ‘48’, ‘10’, ‘11’, ‘16’, ‘17’, ‘19’, ‘25’, ‘26’, ‘31’, ‘4’, ‘40’, ‘41’) OR ( main.Owner = ‘3456’ ) OR ( CachedGroupMembers_2.MemberId IS NOT NULL AND Groups_1.Name = ‘Requestor’ ) OR ( CachedGroupMembers_2.MemberId IS NOT NULL AND Groups_1.Name = ‘AdminCc’ ) ) ) AND (main.IsMerged IS NULL) AND (main.Status != ‘deleted’) AND (main.Type = ‘ticket’) AND (main.Owner = ‘3456’ AND ( main.Status = ‘new’ OR main.Status = ‘open’ ) );

explain tells me:
±-----±------------±---------------------±-----±---------------------------------------------------±-----------±--------±----------------------------±-------±-------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
±-----±------------±---------------------±-----±---------------------------------------------------±-----------±--------±----------------------------±-------±-------------------------+
| 1 | SIMPLE | main | ref | Tickets1,Tickets2 | Tickets2 | 4 | const | 571 | Using where |
| 1 | SIMPLE | Groups_1 | ref | Groups2,Groups1 | Groups1 | 67 | const | 360409 | Using where; Using index |
| 1 | SIMPLE | CachedGroupMembers_2 | ref | DisGrouMem,CachedGroupMembers2,CachedGroupMembers3 | DisGrouMem | 12 | rt4.Groups_1.id,const,const | 1 | Using where; Using index |
±-----±------------±---------------------±-----±---------------------------------------------------±-----------±--------±----------------------------±-------±-------------------------+

My database is about 8 years old, since it’s the test database from my RT2 to RT3 migration from that time. It has about 150k tickets in it.

Things I’ve done to try to narrow down the problem:
Installed a fresh rt4 database, and compared the indexes between the two. In the end I threw away all indexes on my ported DB and create them afresh copying the ones in the fresh empty database. This seemed to help a little (halfed search times), but not enough.

Installed a separate Centos6/Mysql 5.1.73 server, uploaded the same database and executed the same query - very similar results.

On both these servers, there seems to be adequate resource - 8G memory, 4 cores (they are virtual machines) - observation suggests the entire DB is being held in memory with room to spare, and there is no disc io. the my.cnf in both cases is based on a 4G innodb setup, and specific values mention in tuning notes I’ve found are all at least as large as recommended.

Meanwhile out live RT3 database is running quite happily with over 300k tickets on a similar VM (mysql 5.0.26). If I do the same query there (minus the IsMerged since that column doesn’t exist) it takes just under 17 minutes!

This is a bit of a showstopper for the upgrade. It has the feel of a missing index, and it seems that this is a query that 4.4.1 does that 3.8.2 did not. I’m also prepared to believe we have some odd data (since this started in RT2 and I note that the validator (which I have run) cleans up some data quirks generated by bugs in some older versions.

Anyone had a similar experience, and what did you do to fix it?

Thanks.

John

My experience was similar, but with postgresql. The thing that finally cracked it was shrinking the cached group members table.

See this thread:

Thanks for this. Its sufficiently long ago that we moved from 2.2 to 3.8.6 that I’m not sure I’d run shrink_cgm_table but I have now done so, and this has reduced the size of CachedGroupMembers by about 20%. However this hasn’t reduced the query length by any more than that - it’s down to about 80s now, which is still a couple of orders of magnitude too long alas.

I’m also being a bit more systematic about shredding deleted tickets to see how much that will help.

The historic version of the DB I’m using has about 150 000 tickets in it though. Our live RT3 DB has 300 000 and CachedGroupMembers in it has 2.4million rows.

One of my colleagues has been playing with the specific query and notes that rewriting the query this way:

"Another way to do it is to do (Groups INNER JOIN CachedGroupMembers) as a
subquery, which avoids the UNION. Again, this gives the optimiser more
information, but it’s probably easier to get RT to use a subquery.

SELECT COUNT(DISTINCT Tickets.id)
FROM Tickets
LEFT JOIN
(SELECT Groups.Instance
FROM Groups INNER JOIN CachedGroupMembers
ON GroupId = Groups.id
WHERE CachedGroupMembers.Disabled = ‘0’
AND CachedGroupMembers.MemberId = ‘3456’
AND Groups.Domain = ‘RT::Ticket-Role’
AND Groups.Name IN (‘Requestor’, ‘AdminCc’)) AS y
ON y.Instance = Tickets.id
WHERE
(
Tickets.Queue IN (‘3’, ‘4’, ‘6’, ‘7’, ‘7’, ‘8’, ‘8’, ‘9’,
‘9’, ‘10’, ‘10’, ‘10’, ‘11’, ‘11’, ‘11’, ‘12’, ‘12’, ‘13’, ‘13’, ‘13’, ‘15’,
‘15’, ‘16’, ‘17’, ‘19’, ‘19’, ‘20’, ‘21’, ‘21’, ‘21’, ‘22’, ‘25’, ‘26’,
‘26’, ‘26’, ‘28’, ‘29’, ‘29’, ‘31’, ‘31’, ‘34’, ‘37’, ‘38’, ‘38’, ‘40’,
‘40’, ‘41’, ‘42’, ‘46’, ‘47’, ‘48’, ‘48’, ‘10’, ‘11’, ‘16’, ‘17’, ‘19’,
‘25’, ‘26’, ‘31’, ‘4’, ‘40’, ‘41’)
OR Tickets.Owner = ‘3456’
OR y.Instance IS NOT NULL
)
AND Tickets.IsMerged IS NULL
AND Tickets.Status != ‘deleted’
AND Tickets.Type = ‘ticket’
AND Tickets.Owner = ‘10’
AND (Tickets.Status = ‘new’ OR Tickets.Status = ‘open’);

And this query completes in 1.5s instead of 80s.

I suspect changing the page which generates the query (it’s somewhere on the user homepage) to generate this one instead may be beyond my expertise.

We have resolved this issue by changing from mariadb to postgresql whose query optimiser must handle this better.