Performance problems with postgres and rt4.2

Hi,

We’re long time users of RT and we love it. However, our existing installation (RT3.6.6 on FreeBSD) is getting rather old so we’re attempting an upgrade. We have around 300,000 tickets, 200 intermittent privileged users

The plan is to move to RT4.2.15 on a newUbuntu VM, by using the deb packages. We’ve got it installed and working OK and like the improvements from our older version. We’ve also imported and updated the old database successfully onto a dedicated postresql 9.5 host (lots of fast disk and memory).

Unfortunately the performance is not what I’d hoped for. We’re seeing home page load times of 4-6 seconds. I’ve been through the relevant tips at PerformanceTuning - Request Tracker Wiki and have seen some improvement but not much.

The worst performing page seems to be creating a new search in the query builder. I’ve tried to correlate it with logs on the database server and I think this query is the problem:

SELECT DISTINCT main.* FROM Users main CROSS JOIN ACL ACL_3 JOIN Principals Principals_1 ON ( Principals_1.id = main.id ) JOIN CachedGroupMembers CachedGroupMembers_2 ON ( CachedGroupMembers_2.MemberId = Principals_1.id ) JOIN CachedGroupMembers CachedGroupMembers_4 ON ( CachedGroupMembers_4.MemberId = Principals_1.id ) WHERE ((ACL_3.ObjectType = ‘RT::Queue’) OR (ACL_3.ObjectType = ‘RT::System’ AND ACL_3.ObjectId = 1)) AND (ACL_3.PrincipalId = CachedGroupMembers_4.GroupId) AND (ACL_3.PrincipalType = ‘Group’) AND (ACL_3.RightName = ‘OwnTicket’ OR ACL_3.RightName = ‘SuperUser’) AND (CachedGroupMembers_2.Disabled = ‘0’) AND (CachedGroupMembers_2.GroupId = ‘4’) AND (CachedGroupMembers_4.Disabled = ‘0’) AND (Principals_1.Disabled = ‘0’) AND (Principals_1.PrincipalType = ‘User’) AND (Principals_1.id != ‘1’) ORDER BY main.Name ASC

It is taking 137 seconds to complete. Rebuilding indexes on postgres doesn’t make any difference at all to this.

Is this sort of performance expected in the ‘newer’ versions of RT? Is there anything odd about the query that would be simple to fix? Is there perhaps a problem caused by the upgrade path?.

Thanks in advance for any help.

Ian Miller

A couple of quick suggestions:

  1. Run pgtune to get some sane memory settings, if you haven’t already. You can use apt to install it using Ubuntu. You might need to tweak some settings like work_mem. It might be that the ORDER BY is creating a sort that goes out to disk.

  2. Run VACUUM and ANALYZE on the DB (but not VACUUM FULL). You might have some table bloat, especially in the cachedgroupmembers table. There are a lot of updates and deletes going on in the cachedgroupmembers table. Also check out the sessions table. You can see what sort of bloat you might have by having a look at pg_stat_user_tables while connected to the RT DB.

  3. There are some MAJOR speed improvements in Pg 9.6. You might want to use it. It’s not too hard to set up the postgresql.org repos in Ubuntu.

Thanks for your reply. I’ve just tried pgtune, but the values it produced weren’t wildly different from what I’d calculated manually, so it didn’t make much difference.

However, Vacuum and Analyse have made a massive difference to the query I posted previously. It’s now running in less than a second. I’d previously been lazy and assumed “vaccuum full analyse” would clean and reindex everything in one go. Newbie DB admin Lesson Learnt.

I’m now seeing quite slow load times for ticket pages. There seem to be a number of queries involved, but this is the longest and is taking 11 seconds:

SELECT DISTINCT main.* FROM Users main CROSS JOIN ACL ACL_3 JOIN Principals Principals_1 ON ( Principals_1.id = main.id ) JOIN CachedGroupMembers CachedGroupMembers_2 ON ( CachedGroupMembers_2.MemberId = Principals_1.id ) JOIN CachedGroupMembers CachedGroupMembers_4 ON ( CachedGroupMembers_4.MemberId = Principals_1.id ) WHERE ((ACL_3.ObjectType = ‘RT::Queue’ AND ACL_3.ObjectId = 22) OR (ACL_3.ObjectType = ‘RT::System’ AND ACL_3.ObjectId = 1)) AND (ACL_3.PrincipalId = CachedGroupMembers_4.GroupId) AND (ACL_3.PrincipalType = ‘Group’) AND (ACL_3.RightName = ‘OwnTicket’) AND (CachedGroupMembers_2.Disabled = ‘0’) AND (CachedGroupMembers_2.GroupId = ‘4’) AND (CachedGroupMembers_4.Disabled = ‘0’) AND (Principals_1.Disabled = ‘0’) AND (Principals_1.PrincipalType = ‘User’) AND (Principals_1.id != ‘1’) ORDER BY main.Name ASC

Could there be problems in the CachedGroupMembers table. Does anyone know if there a way to rebuild this?

I’m not yet willing to upgrade PG version, but will bear it in mind if I can’t progress in other ways.

Thanks again,

Ian

Glad to hear things have improved!

You almost never want to run a VACUUM FULL, as it works by creating creating new tables, so you need double the storage to do it.

At this point, you should probably start looking into the autovacuum settings on your DB. These are generally set so that the autovacuum doesn’t happen very often. On my system, I have them set very aggressively. Probably overaggressive. Anyways, on my system the autovacuum/autoanalyze runs quite often, and the DB is fast. (It’s older than 9.5 as well. :))

You might also be missing an index. You can put the keyword EXPLAIN in front of the query and it will tell you how Pg is processing the query. This can be pretty useful.

Pg 9.6 does have the option of parallel sequential scan, which can improve seq scans by up to 4x.

I’ll look into autovacuum.

Prefixing the query with Explain results in this:

Unique (cost=156.11…156.21 rows=1 width=1183)
→ Sort (cost=156.11…156.12 rows=1 width=1183)
Sort Key: main.name, main.id, main.password, main.comments, main.signature, main.emailaddress, main.freeformcontactinfo, main.organization, main.realname, main.nickname, main.lang, main.emailencoding, main.webencoding, main.externalcontactinfoid, main.contactinfosystem, main.externalauthid, main.authsystem, main.gecos, main.homephone, main.workphone, main.mobilephone, main.pagerphone, main.address1, main.address2, main.city, main.state, main.zip, main.country, main.timezone, main.pgpkey, main.creator, main.created, main.lastupdatedby, main.lastupdated, main.authtoken, main.smimecertificate
→ Nested Loop (cost=1.85…156.10 rows=1 width=1183)
Join Filter: (principals_1.id = main.id)
→ Nested Loop (cost=1.56…155.78 rows=1 width=12)
→ Nested Loop (cost=1.28…151.71 rows=2 width=16)
→ Nested Loop (cost=0.85…148.59 rows=1 width=8)
→ Index Only Scan using disgroumem on cachedgroupmembers cachedgroupmembers_2 (cost=0.43…4.77 rows=17 width=4)
Index Cond: ((groupid = 4) AND (disabled = 0))
→ Index Scan using principals_pkey on principals principals_1 (cost=0.42…8.45 rows=1 width=4)
Index Cond: (id = cachedgroupmembers_2.memberid)
Filter: ((id <> 1) AND (disabled = 0) AND ((principaltype)::text = ‘User’::text))
→ Index Only Scan using cachedgroupmembers4 on cachedgroupmembers cachedgroupmembers_4 (cost=0.43…2.41 rows=71 width=8)
Index Cond: ((memberid = principals_1.id) AND (disabled = 0))
→ Index Only Scan using acl1 on acl acl_3 (cost=0.28…2.03 rows=1 width=4)
Index Cond: ((rightname = ‘OwnTicket’::text) AND (principaltype = ‘Group’::text) AND (principalid = cachedgroupmembers_4.groupid))
Filter: ((((objecttype)::text = ‘RT::Queue’::text) AND (objectid = 22)) OR (((objecttype)::text = ‘RT::System’::text) AND (objectid = 1)))
→ Index Scan using users_pkey on users main (cost=0.29…0.31 rows=1 width=1183)
Index Cond: (id = cachedgroupmembers_4.memberid)

Any pointers to resources to help me understand this?

Can you use "explain (analyze, buffers) " in front of this query? And copy/paste the result using Pre-formatted text on the forum? Here, this request on a similar DB size (though PG 9.1) take less than a second.

The explain (analyze, buffers) will tell you what it needs to do with your data. Just using explain will tell you what it should do in some generic case. As mentioned above, you should think about explain analyze, when just explain comes back with reasonable information as in your case. Here’s a good intro link to explain here: https://use-the-index-luke.com/sql/explain-plan/postgresql/getting-an-execution-plan

Here it is:

Unique  (cost=156.11..156.21 rows=1 width=1183) (actual time=11183.587..11183.653 rows=7 loops=1)
  Buffers: shared hit=1416419
  ->  Sort  (cost=156.11..156.12 rows=1 width=1183) (actual time=11183.585..11183.588 rows=48 loops=1)
        Sort Key: main.name, main.id, main.password, main.comments, main.signature, main.emailaddress, main.freeformcontactinfo, main.organization, main.realname, main.nickname, main.lang, main.emailencoding, main.webencoding, main.externalcontactinfoid, main.contactinfosystem, main.externalauthid, main.authsystem, main.gecos, main.homephone, main.workphone, main.mobilephone, main.pagerphone, main.address1, main.address2, main.city, main.state, main.zip, main.country, main.timezone, main.pgpkey, main.creator, main.created, main.lastupdatedby, main.lastupdated, main.authtoken, main.smimecertificate
        Sort Method: quicksort  Memory: 41kB
        Buffers: shared hit=1416419
        ->  Nested Loop  (cost=1.85..156.10 rows=1 width=1183) (actual time=43.652..11183.038 rows=48 loops=1)
              Join Filter: (principals_1.id = main.id)
              Buffers: shared hit=1416410
              ->  Nested Loop  (cost=1.56..155.78 rows=1 width=12) (actual time=43.606..11182.734 rows=48 loops=1)
                    Buffers: shared hit=1416266
                    ->  Nested Loop  (cost=1.28..151.71 rows=2 width=16) (actual time=0.079..180.823 rows=470716 loops=1)
                          Buffers: shared hit=4117
                          ->  Nested Loop  (cost=0.85..148.59 rows=1 width=8) (actual time=0.048..2.793 rows=327 loops=1)
                                Buffers: shared hit=1327
                                ->  Index Only Scan using disgroumem on cachedgroupmembers cachedgroupmembers_2  (cost=0.43..4.77 rows=17 width=4) (actual time=0.030..0.146 rows=330 loops=1)
                                      Index Cond: ((groupid = 4) AND (disabled = 0))
                                      Heap Fetches: 0
                                      Buffers: shared hit=5
                                ->  Index Scan using principals_pkey on principals principals_1  (cost=0.42..8.45 rows=1 width=4) (actual time=0.007..0.007 rows=1 loops=330)
                                      Index Cond: (id = cachedgroupmembers_2.memberid)
                                      Filter: ((id <> 1) AND (disabled = 0) AND ((principaltype)::text = 'User'::text))
                                      Rows Removed by Filter: 0
                                      Buffers: shared hit=1322
                          ->  Index Only Scan using cachedgroupmembers4 on cachedgroupmembers cachedgroupmembers_4  (cost=0.43..2.41 rows=71 width=8) (actual time=0.008..0.306 rows=1439 loops=327)
                                Index Cond: ((memberid = principals_1.id) AND (disabled = 0))
                                Heap Fetches: 0
                                Buffers: shared hit=2790
                    ->  Index Only Scan using acl1 on acl acl_3  (cost=0.28..2.03 rows=1 width=4) (actual time=0.023..0.023 rows=0 loops=470716)
                          Index Cond: ((rightname = 'OwnTicket'::text) AND (principaltype = 'Group'::text) AND (principalid = cachedgroupmembers_4.groupid))
                          Filter: ((((objecttype)::text = 'RT::Queue'::text) AND (objectid = 22)) OR (((objecttype)::text = 'RT::System'::text) AND (objectid = 1)))
                          Rows Removed by Filter: 0
                          Heap Fetches: 0
                          Buffers: shared hit=1412149
              ->  Index Scan using users_pkey on users main  (cost=0.29..0.31 rows=1 width=1183) (actual time=0.004..0.004 rows=1 loops=48)
                    Index Cond: (id = cachedgroupmembers_4.memberid)
                    Buffers: shared hit=144
Planning time: 4.205 ms
Execution time: 11183.926 ms

Have you customized the ticket pages?

I have a rather different query plan here. This query looks like the one used to get the list of potential owners for a ticket. Do you have a lot of users with ownticket right??

Here is the explain here:

Unique  (cost=89.73..89.81 rows=1 width=368) (actual time=3.053..3.054 rows=1 loops=1)
  Buffers: shared hit=355
  ->  Sort  (cost=89.73..89.74 rows=1 width=368) (actual time=3.050..3.050 rows=1 loops=1)
        Sort Key: main.name, main.id, main.password, main.authtoken, main.comments, main.signature, main.emailaddress, main.freeformcontactinfo, main.organization, main.realname, main.nickname, main.lang, main.gecos, main.homephone, main.workphone, main.mobilephone, main.pagerphone, main.address1, main.address2, main.city, main.state, main.zip, main.country, main.timezone, main.smimecertificate, main.creator, main.created, main.lastupdatedby, main.lastupdated
        Sort Method: quicksort  Memory: 25kB
        Buffers: shared hit=355
        ->  Nested Loop  (cost=46.26..89.72 rows=1 width=368) (actual time=0.347..2.977 rows=1 loops=1)
              Join Filter: (main.id = principals_1.id)
              Buffers: shared hit=344
              ->  Nested Loop  (cost=46.26..89.23 rows=1 width=376) (actual time=0.278..2.906 rows=1 loops=1)
                    Join Filter: (cachedgroupmembers_2.memberid = main.id)
                    Buffers: shared hit=340
                    ->  Hash Join  (cost=46.26..87.52 rows=6 width=8) (actual time=0.264..2.890 rows=1 loops=1)
                          Hash Cond: (cachedgroupmembers_2.memberid = cachedgroupmembers_4.memberid)
                          Buffers: shared hit=337
                          ->  Index Only Scan using disgroumem on cachedgroupmembers cachedgroupmembers_2  (cost=0.00..40.38 rows=21 width=4) (actual time=0.056..2.592 rows=478 loops=1)
                                Index Cond: ((groupid = 4) AND (disabled = 0))
                                Heap Fetches: 339
                                Buffers: shared hit=329
                          ->  Hash  (cost=45.07..45.07 rows=95 width=4) (actual time=0.185..0.185 rows=2 loops=1)
                                Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                Buffers: shared hit=8
                                ->  Nested Loop  (cost=0.00..45.07 rows=95 width=4) (actual time=0.179..0.182 rows=2 loops=1)
                                      Buffers: shared hit=8
                                      ->  Index Only Scan using acl1 on acl acl_3  (cost=0.00..4.43 rows=1 width=4) (actual time=0.120..0.121 rows=1 loops=1)
                                            Index Cond: ((rightname = 'OwnTicket'::text) AND (principaltype = 'Group'::text))
                                            Filter: ((((objecttype)::text = 'RT::Queue'::text) AND (objectid = 22)) OR (((objecttype)::text = 'RT::System'::text) AND (objectid = 1)))
                                            Rows Removed by Filter: 1
                                            Heap Fetches: 0
                                            Buffers: shared hit=4
                                      ->  Index Only Scan using disgroumem on cachedgroupmembers cachedgroupmembers_4  (cost=0.00..40.43 rows=21 width=8) (actual time=0.048..0.050 rows=2 loops=1)
                                            Index Cond: ((groupid = acl_3.principalid) AND (disabled = 0))
                                            Heap Fetches: 0
                                            Buffers: shared hit=4
                    ->  Index Scan using users_pkey on users main  (cost=0.00..0.27 rows=1 width=368) (actual time=0.005..0.006 rows=1 loops=1)
                          Index Cond: (id = cachedgroupmembers_4.memberid)
                          Buffers: shared hit=3
              ->  Index Scan using principals_pkey on principals principals_1  (cost=0.00..0.47 rows=1 width=4) (actual time=0.062..0.064 rows=1 loops=1)
                    Index Cond: (id = cachedgroupmembers_4.memberid)
                    Filter: ((id <> 1) AND (disabled = 0) AND ((principaltype)::text = 'User'::text))
                    Buffers: shared hit=4
Total runtime: 3.362 ms

maybe you need Set($AutocompleteOwners, 0);

Set($AutocompleteOwners, 0); is present already.
No customisation of the ticket page.
7 users appear in the owner dropdown on this ticket - is that the people with the the ownticket right? There are around 200 users across the system, with different rights in different queues.

so, not so many users. Can you give us row count for every RT tables?

Rowcounts as requested are below. Please bear in mind this is imported from a system that’s been running on 3.6.6 for some years:

customfields	55
attachments		1799811
objectclasses	1
articles		3
objecttopics	0
objectcustomfields	104
scripconditions	20
templates		140
users		56444
tickets		226500
customfieldvalues	200
queues		109
scripactions	23
topics		0
scrips		34
groupmembers	610491
acl			2822
cachedgroupmembers	2131016
principals		1019548
objectscrips	34
attributes		151388
links		6417
transactions	3021402
objectcustomfieldvalues	63328
groups		963104
sessions		282
classes		1

One other thing I was thinking about. How often are you rebuilding your indexes? VACUUM and ANALYZE help quite a bit, but they don’t actually rebuild the indexes, and if you have a lot of bloat in the indexes, they won’t get used properly.

I could be wrong but I don’t think that the indexes are out-of-date. The data is hardly changing at all as this isn’t yet in production. I did rebuild the indexes after doing the import and upgrade from the old version. I’ll do a ‘reindex’ and see what happens.

As expected, no difference after a reindex.

However, I’ve had another full read of the upgrading documentation and noticed something I’d previously missed in UPGRADING-3.8 - RT 4.2.13 Documentation - Best Practical, the reference to a ‘shrink-cgm-table’ script. As my gut feeling was that was a problem with the cachegroupmembers table, I thought I’d give it a go.

It hasn’t actually finished processing yet (set off an hour ago), but already the queries above are completing in less than 10ms.

So, a note to anyone in a similar upgrade scenario, make sure you properly read (not just skim) all the provided upgrade docs!