Tracking down a new performance issue (RT + Postgres)

This will no doubt turn out to be a “d’oh!” solution, but I’m a bit stuck on where to start.

Last week I upgraded from Postgres 9.6 to 15.12, and even though the upgrade itself went very smoothly, RT is noticeably slower at loading a ticket than it was prior to the upgrade.

I’ve checked my PG configuration and it is unchanged, so settings like “shared_buffers = 256MB” were the same on both versions.

The primary area where I’m seeing the slowness is ticket display: accessing a ticket via Ticket/Display.html?id=XXX now takes a bit over four seconds, where it used to take about half that. The Home page is still very quick; it’s really only the ticket Display that is slow.

I’m hoping someone can suggest something obvious that I haven’t done yet. I’m atrocious with SQL and have no idea what raw queries RT runs when displaying a ticket, so trying ANALYZE and other Postgres/SQL magic to find out what’s slowing things down is a non-starter for now.

I have reviewed the Performance Tuning wiki article, but I don’t think the issues there pertain to my situation.

The system has 256GB of RAM, fast SSD drives, plenty of cores (24 @ 3GHz), and is in no way overloaded, mostly idle…

So I’m missing something obvious and my suspicion is that it’s related to my Postgres 15 upgrade.

I can live with the four seconds, but I’d like to figure out what is causing it. Thank you for any suggestions!

Hey there,

I would try PostgreSQL query logging:

log_statement = all
log_duration = on

For busy databases, you could instead use log_min_duration_statement, which logs only the queries that take longer than a specified time.
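
For example (the 500 ms threshold is just an illustration; pick whatever fits your traffic):

log_min_duration_statement = 500    # only log statements that run for 500 ms or longer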

Try to find the query that corresponds to the slow page render, and then dig into that query further in PostgreSQL with EXPLAIN ANALYZE etc. Either the query itself is slow, or something in front of the DB (application server etc.) is.
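
For example, once you’ve pulled a slow query out of the log, in psql (the query below is just a stand-in, not one of RT’s):

    EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM tickets WHERE status = 'open';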

Let us know how you go.

Hi. 4-core + 8 GB RAM virtualized machine (load = 0.3) + PG 14.17 + RT 5.0.7.

I had a similar issue at the beginning; it went away after I reduced the RT log level to error and set up periodic execution of rt-fulltext-indexer.

Thanks @cgb, this was really helpful. I disabled email input to ensure the system was as quiet as it could be and I was the only user; I enabled the statement logs and the duration timestamps, which I think has narrowed things down. Looking at the logs, there are a LOT of SQL queries for that one “ticket display” that I ran, which is probably normal.

Most queries seemed to take less than 1ms, which is what I’d expect.

A few, however, stood out:

LOG:  execute <unnamed>: SELECT DISTINCT main.* FROM Users main CROSS JOIN ACL ACL_3 JOIN Principals Principals_1  ON ( Principals_1.id = main.id ) JOIN CachedGroupMembers CachedGroupMembers_2  ON ( CachedGroupMembers_2.MemberId = Principals_1.id ) JOIN CachedGroupMembers CachedGroupMembers_4  ON ( CachedGroupMembers_4.MemberId = Principals_1.id )  WHERE ((ACL_3.ObjectType = $1 AND ACL_3.ObjectId   = $2) OR (ACL_3.ObjectType = $3 AND ACL_3.ObjectId   = $4) OR (ACL_3.ObjectType = $5 AND ACL_3.ObjectId   = $6)) AND (ACL_3.PrincipalId = CachedGroupMembers_4.GroupId) AND (ACL_3.PrincipalType = $7) AND (ACL_3.RightName IN ($8, $9)) AND (CachedGroupMembers_2.Disabled = $10) AND (CachedGroupMembers_2.GroupId = $11) AND (CachedGroupMembers_4.Disabled = $12) AND (Principals_1.Disabled = $13) AND (Principals_1.PrincipalType = $14) AND (Principals_1.id != $15)  ORDER BY main.Name ASC 

That took 1453.457ms, and a few moments later, another query:

LOG:  execute <unnamed>: SELECT DISTINCT main.* FROM Users main CROSS JOIN ACL ACL_3 JOIN Principals Principals_1  ON ( Principals_1.id = main.id ) JOIN CachedGroupMembers CachedGroupMembers_2  ON ( CachedGroupMembers_2.MemberId = Principals_1.id ) JOIN CachedGroupMembers CachedGroupMembers_4  ON ( CachedGroupMembers_4.MemberId = Principals_1.id )  WHERE ((ACL_3.ObjectType = $1 AND ACL_3.ObjectId   = $2) OR (ACL_3.ObjectType = $3 AND ACL_3.ObjectId   = $4) OR (ACL_3.ObjectType = $5 AND ACL_3.ObjectId   = $6)) AND (ACL_3.PrincipalId = CachedGroupMembers_4.GroupId) AND (ACL_3.PrincipalType = $7) AND (ACL_3.RightName IN ($8, $9)) AND (CachedGroupMembers_2.Disabled = $10) AND (CachedGroupMembers_2.GroupId = $11) AND (CachedGroupMembers_4.Disabled = $12) AND (Principals_1.Disabled = $13) AND (Principals_1.PrincipalType = $14) AND (Principals_1.id != $15)  ORDER BY main.Name ASC 

Took 1328.925ms.

And a few moments after that:

LOG:  execute <unnamed>: SELECT DISTINCT main.* FROM Users main CROSS JOIN ACL ACL_3 JOIN Principals Principals_1  ON ( Principals_1.id = main.id ) JOIN CachedGroupMembers CachedGroupMembers_2  ON ( CachedGroupMembers_2.MemberId = Principals_1.id ) JOIN CachedGroupMembers CachedGroupMembers_4  ON ( CachedGroupMembers_4.MemberId = Principals_1.id )  WHERE ((ACL_3.ObjectType = $1 AND ACL_3.ObjectId   = $2) OR (ACL_3.ObjectType = $3 AND ACL_3.ObjectId   = $4)) AND (ACL_3.PrincipalId = CachedGroupMembers_4.GroupId) AND (ACL_3.PrincipalType = $5) AND (ACL_3.RightName IN ($6, $7)) AND (CachedGroupMembers_2.Disabled = $8) AND (CachedGroupMembers_2.GroupId = $9) AND (CachedGroupMembers_4.Disabled = $10) AND (Principals_1.Disabled = $11) AND (Principals_1.PrincipalType = $12) AND (Principals_1.id != $13)  ORDER BY main.Name ASC 

Took 1361.712ms.

So if you add those up, plus a few other 1–2 ms queries, you get to roughly the “about four seconds” I mentioned.

I tried to use EXPLAIN on these to learn more, but I’ve now discovered bind parameters and haven’t worked out an efficient way to substitute the 15-ish values back into the query. I do see examples of how to do that; it’s just going to be tedious (unless there’s a shortcut?).
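
One shortcut I’ve seen mentioned is to PREPARE the statement with typed placeholders and then EXPLAIN ANALYZE an EXECUTE of it, so the values only have to be typed once. A rough sketch with a trivial stand-in query (not the real RT query, which would need all ~15 parameters declared the same way):

    PREPARE slowq (text, int) AS
        SELECT * FROM acl WHERE objecttype = $1 AND objectid = $2;
    EXPLAIN (ANALYZE, BUFFERS) EXECUTE slowq('RT::Queue', 3);
    DEALLOCATE slowq;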

My sense is that it’s the database that is the issue, rather than the HTTP server that’s sitting in front, but perhaps someone else looking at this will spot something obvious…

Will try to figure out how to insert the parameters and get the EXPLAIN and ANALYZE working. Then, not sure!

Okay, I figured out how to set up my parameterised query, and got this back from EXPLAIN ANALYZE:

 Unique  (cost=219.95..220.02 rows=1 width=2648) (actual time=1877.847..1877.873 rows=3 loops=1)
   ->  Sort  (cost=219.95..219.95 rows=1 width=2648) (actual time=1877.844..1877.850 rows=4 loops=1)
         Sort Key: main.name, main.id, main.password, main.authtoken, main.comments, main.signature, main.emailaddress, main.freeformcontactinfo, main.organization, main.realname, main.nickname, main.lang, main.gecos, main.homephone, main.workphone, main.mobilephone, main.pagerphone, main.address1, main.address2, main.city, main.state, main.zip, main.country, main.timezone, main.smimecertificate, main.creator, main.created
, main.lastupdatedby, main.lastupdated
         Sort Method: quicksort  Memory: 25kB
         ->  Nested Loop  (cost=1.98..219.94 rows=1 width=2648) (actual time=1.001..1877.651 rows=4 loops=1)
               ->  Nested Loop  (cost=1.71..218.39 rows=2 width=2652) (actual time=0.734..319.200 rows=271454 loops=1)
                     ->  Nested Loop  (cost=1.28..216.46 rows=1 width=2656) (actual time=0.676..0.919 rows=4 loops=1)
                           ->  Nested Loop  (cost=0.86..215.36 rows=2 width=8) (actual time=0.649..0.753 rows=4 loops=1)
                                 ->  Index Only Scan using disgroumem on cachedgroupmembers cachedgroupmembers_2  (cost=0.43..63.17 rows=18 width=4) (actual time=0.611..0.627 rows=5 loops=1)
                                       Index Cond: ((groupid = 4) AND (disabled = 0))
                                       Heap Fetches: 5
                                 ->  Index Scan using principals_pkey on principals principals_1  (cost=0.43..8.46 rows=1 width=4) (actual time=0.021..0.021 rows=1 loops=5)
                                       Index Cond: (id = cachedgroupmembers_2.memberid)
                                       Filter: ((id <> 1) AND (disabled = 0) AND ((principaltype)::text = 'User'::text))
                                       Rows Removed by Filter: 0
                           ->  Index Scan using users_pkey on users main  (cost=0.42..0.55 rows=1 width=2648) (actual time=0.036..0.036 rows=1 loops=4)
                                 Index Cond: (id = principals_1.id)
                     ->  Index Scan using cachedgroupmembers3 on cachedgroupmembers cachedgroupmembers_4  (cost=0.43..1.68 rows=25 width=8) (actual time=0.042..66.614 rows=67864 loops=4)
                           Index Cond: (memberid = principals_1.id)
                           Filter: (disabled = 0)
               ->  Index Only Scan using acl1 on acl acl_3  (cost=0.27..0.77 rows=1 width=4) (actual time=0.005..0.005 rows=0 loops=271454)
                     Index Cond: ((rightname = ANY ('{SuperUser,OwnTicket}'::text[])) AND (principaltype = 'Group'::text) AND (principalid = cachedgroupmembers_4.groupid))
                     Filter: ((((objecttype)::text = 'RT::Ticket'::text) AND (objectid = 645921)) OR (((objecttype)::text = 'RT::Queue'::text) AND (objectid = 3)) OR (((objecttype)::text = 'RT::System'::text) AND (objectid = 1)))
                     Rows Removed by Filter: 0
                     Heap Fetches: 28
 Planning Time: 9.913 ms
 Execution Time: 1878.252 ms
(27 rows)

I’ve not done this for the other two queries yet, but I suspect the outcome would be similar. Any interpretation of this would be welcome; whilst I can see some nested loops being performed (likely for the JOINs), nothing jumps out at me as the definitive cause of the delay… but I could be wrong…

Can you try running a vacuumdb analyze on the RT database?

VACUUM verbose analyze;

I’m not strong at reading EXPLAIN ANALYZE output, but Google Gemini Pro suggested some indexes and an ANALYZE of the DB… If VACUUM ANALYZE doesn’t help, try comparing the indexes in your DB to RT’s source:

grep 'CREATE INDEX' /path/to/rt/etc/schema.Pg

pg_dump --schema-only rt5 | grep 'CREATE INDEX'

and see if anything is missing.
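
If it’s easier, you can also pull the index definitions straight out of the database (this assumes RT lives in the default public schema):

    SELECT indexname, indexdef
    FROM pg_indexes
    WHERE schemaname = 'public'
    ORDER BY indexname;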

Hi,
We found that this particular query is affected by the statistics the DB has gathered. Since ANALYZE is based on a sample, sometimes the gathered statistics cause a slower plan to be chosen. I would recommend running the following commands:

    vacuum(analyze, verbose) users;
    vacuum(analyze, verbose) acl;
    vacuum(analyze, verbose) principals;
    vacuum(analyze, verbose) cachedgroupmembers;
    vacuum(analyze, verbose) groupmembers;
    vacuum(analyze, verbose) groups;

Regards,
Ken

Thanks to everyone helping me try to figure this out. It’s sincerely appreciated.

The indexes/indices are a good lead, and I’m still reading through the VACUUM output myself to see if anything pops.

In the meantime, RT’s etc/schema.Pg reads:

CREATE INDEX  ACL1 on ACL(RightName, ObjectType, ObjectId,PrincipalType,PrincipalId);
CREATE INDEX AssetsCatalog ON Assets (Catalog);
CREATE INDEX AssetsName ON Assets (LOWER(Name));
CREATE INDEX AssetsStatus ON Assets (Status);
CREATE INDEX Attachments1 ON Attachments (Parent) ;
CREATE INDEX Attachments2 ON Attachments (TransactionId) ;
CREATE INDEX Attachments3 ON Attachments (Parent, TransactionId) ;
CREATE INDEX Attachments4 ON Attachments (Filename) ;
CREATE INDEX Attributes1 on Attributes(Name);
CREATE INDEX Attributes2 on Attributes(ObjectType, ObjectId);
CREATE INDEX AuthTokensOwner ON AuthTokens (Owner);
CREATE INDEX CachedGroupMembers2 on CachedGroupMembers (MemberId, GroupId, Disabled);
CREATE INDEX CachedGroupMembers3  on CachedGroupMembers (MemberId,ImmediateParentId);
CREATE INDEX CatalogsDisabled ON Catalogs (Disabled);
CREATE INDEX CatalogsName ON Catalogs (LOWER(Name));
CREATE INDEX Configurations1 ON Configurations (LOWER(Name), Disabled);
CREATE INDEX Configurations2 ON Configurations (Disabled);
CREATE INDEX CustomFieldValues1 ON CustomFieldValues (CustomField);
CREATE INDEX DisGrouMem  on CachedGroupMembers (GroupId,MemberId,Disabled);
CREATE INDEX Groups1 ON Groups (LOWER(Domain), LOWER(Name), Instance);
CREATE INDEX Groups2 On Groups (Instance);
CREATE INDEX Links1 ON Links (Base);
CREATE INDEX Links2 ON Links (Target);
CREATE INDEX Links3 ON Links (LocalBase);
CREATE INDEX Links4 ON Links (LocalTarget);
CREATE INDEX Links5 ON Links (Type);
CREATE INDEX ObjectCustomFields1 ON ObjectCustomFields (ObjectId);
CREATE INDEX ObjectCustomFieldValues1 ON ObjectCustomFieldValues (CustomField,ObjectType,ObjectId,Content); 
CREATE INDEX ObjectCustomFieldValues2 ON ObjectCustomFieldValues (CustomField,ObjectType,ObjectId); 
CREATE INDEX ObjectCustomFieldValues3 ON ObjectCustomFieldValues (SortOrder);
CREATE INDEX Tickets1 ON Tickets (Queue, Status) ;
CREATE INDEX Tickets2 ON Tickets (Owner) ;
CREATE INDEX Tickets3 ON Tickets (EffectiveId) ;
CREATE INDEX Transactions1 ON Transactions (ObjectType, ObjectId);
CREATE INDEX Users4 ON Users (LOWER(EmailAddress));

I sorted this output alphabetically to make it easier to compare against my own pg_dump output.

The schema in my Postgres database reads:

CREATE INDEX acl1 ON public.acl USING btree (rightname, objecttype, objectid, principaltype, principalid);
CREATE INDEX assetscatalog ON public.assets USING btree (catalog);
CREATE INDEX assetsname ON public.assets USING btree (lower((name)::text));
CREATE INDEX assetsstatus ON public.assets USING btree (status);
CREATE INDEX attachments1 ON public.attachments USING btree (parent);
CREATE INDEX attachments2 ON public.attachments USING btree (transactionid);
CREATE INDEX attachments3 ON public.attachments USING btree (parent, transactionid);
CREATE INDEX attachments4 ON public.attachments USING btree (filename);
CREATE INDEX attributes1 ON public.attributes USING btree (name);
CREATE INDEX attributes2 ON public.attributes USING btree (objecttype, objectid);
CREATE INDEX authtokensowner ON public.authtokens USING btree (owner);
CREATE INDEX cachedgroupmembers2 ON public.cachedgroupmembers USING btree (memberid, groupid, disabled);
CREATE INDEX cachedgroupmembers3 ON public.cachedgroupmembers USING btree (memberid, immediateparentid);
CREATE INDEX catalogsdisabled ON public.catalogs USING btree (disabled);
CREATE INDEX catalogsname ON public.catalogs USING btree (lower((name)::text));
CREATE INDEX configurations1 ON public.configurations USING btree (lower((name)::text), disabled);
CREATE INDEX configurations2 ON public.configurations USING btree (disabled);
CREATE INDEX contentindex_idx ON public.contenttable USING gin (contentindex);
CREATE INDEX customfieldvalues1 ON public.customfieldvalues USING btree (customfield);
CREATE INDEX disgroumem ON public.cachedgroupmembers USING btree (groupid, memberid, disabled);
CREATE INDEX groups1 ON public.groups USING btree (lower((domain)::text), lower((name)::text), instance);
CREATE INDEX groups2 ON public.groups USING btree (instance);
CREATE INDEX links1 ON public.links USING btree (base);
CREATE INDEX links2 ON public.links USING btree (target);
CREATE INDEX links3 ON public.links USING btree (localbase);
CREATE INDEX links4 ON public.links USING btree (localtarget);
CREATE INDEX links5 ON public.links USING btree (type);
CREATE INDEX objectcustomfields1 ON public.objectcustomfields USING btree (objectid);
CREATE INDEX objectcustomfieldvalues1 ON public.objectcustomfieldvalues USING btree (customfield, objecttype, objectid, content);
CREATE INDEX objectcustomfieldvalues2 ON public.objectcustomfieldvalues USING btree (customfield, objecttype, objectid);
CREATE INDEX objectcustomfieldvalues3 ON public.objectcustomfieldvalues USING btree (sortorder);
CREATE INDEX tickets1 ON public.tickets USING btree (queue, status);
CREATE INDEX tickets2 ON public.tickets USING btree (owner);
CREATE INDEX tickets3 ON public.tickets USING btree (effectiveid);
CREATE INDEX transactions1 ON public.transactions USING btree (objecttype, objectid);
CREATE INDEX users4 ON public.users USING btree (lower((emailaddress)::text));

The primary differences seem to be:

  • The dump spells out USING btree (btree is the default index method, so schema.Pg simply leaves it implicit)
  • Index name capitalisation (which I hope does not matter)
  • Indexes are created in the public schema (I have no idea if that matters)
  • contentindex_idx exists in my new DB but not in the original schema, and was created using the GIN method (which is probably expected for full-text content indexing)

I think that’s it. I don’t know enough about the different index types to know if BTREE is the right option for RT, but I do note it’s the PG default.

Thanks Ken! I had a look at this output and was paying particular attention to the hit rate (unless there’s something else I should be looking at?). For the most part, everything looked very good. Only the principals table had a slightly higher miss ratio (40 hits, 23 misses). Not too many entries were dirty and needed to be written out, across all of those tables.

There are 1028 dead rows on the principals table which don’t seem to go anywhere after the vacuum, but that may be normal and related to other things going on in the DB.
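
In case it helps, dead rows and the last (auto)vacuum/analyze times can also be checked with the standard pg_stat_user_tables view, e.g.:

    SELECT relname, n_live_tup, n_dead_tup,
           last_autovacuum, last_autoanalyze
    FROM pg_stat_user_tables
    WHERE relname IN ('users', 'acl', 'principals', 'cachedgroupmembers')
    ORDER BY relname;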

Hi,
It is not the hit counts that are important, but the statistics that the database has gathered for query planning. If they just happen to be a little bit off in the wrong direction, the query planner will choose a very non-optimal plan for the query. One thing that can help is to increase your statistics target from the default of 100. We found that when we hit this slow plan, we needed to run the list of vacuum commands until the correct/fast plan was chosen: in one psql session run the vacuum(analyze, …) commands, then in another psql window try the slow query, and repeat until the fast plan is chosen. Since ANALYZE is a sampling process, the table statistics can vary slightly from run to run.
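
For example, to raise the target on the columns involved in that plan (1000 is just an illustrative value; you can also raise default_statistics_target globally in postgresql.conf and re-run ANALYZE):

    ALTER TABLE cachedgroupmembers ALTER COLUMN memberid SET STATISTICS 1000;
    ALTER TABLE cachedgroupmembers ALTER COLUMN groupid SET STATISTICS 1000;
    ANALYZE cachedgroupmembers;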

HTH,
Ken

Hello,

I’m not much of a database expert, but I encountered similar difficulties during my last PostgreSQL version upgrade.

I worked around the problem with an export/import of the database.
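
Something along these lines (database names here are placeholders; note that a dump/restore does not carry planner statistics over, so run an ANALYZE afterwards):

    pg_dump -Fc rt5 > rt5.dump
    createdb rt5_fresh
    pg_restore -d rt5_fresh rt5.dump
    vacuumdb --analyze rt5_fresh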

RT homepage and ticket slow for privilegied user but fast as root after migration to Debian 12 - RT Users - Request Tracker Community Forum

Kind regards