RT5 slow performance and intermittent Internal Server Error

KayJay · September 2, 2020, 10:47pm

OS: RHEL 7.6
Apache2.4.6 with FastCGI mod_fcgid/2.3.9
Oracle Database

I have RT5 running in a custom location and everything works (ish) although RT sometimes takes several minutes to return a page and sometimes I get the dreaded Error 500 Internal Server Error. I’ve read through the docs several times and can’t see what config changes I need to make to improve performance. From the email side of things, everything appears to be running well and cases are being created. It’s simply the web interface that is having the issue.

As per web deployment doc I have disabled mod_speling and mod_cache and have the prefork MPM mod configured.

My SSL virtual host includes:
ScriptAlias /rt /app/rt5/sbin/rt-server.fcgi/
<Location /rt>
    Require all granted
    Options +ExecCGI
    AddHandler fcgid-script fcgi
</Location>

Does anyone have suggestions for where else I could look or investigate?

Unsure if the logs help. I saw similar messages when some files didn’t have the right permissions. As this is intermittent, I’ve ruled out permissions:

Example:
10.*.*.* - - [02/Sep/2020:11:08:59 +0100] "GET /rt/Admin/Lifecycles/ HTTP/1.1" 200 38349
10.*.*.* - - [02/Sep/2020:11:09:08 +0100] "GET /rt/Admin/Lifecycles/Modify.html?Type=ticket&Name=countermeasures HTTP/1.1" 500 547

[Wed Sep 02 11:15:08.204053 2020] [fcgid:warn] [pid 22862] [client 10.*.*.*:64133] mod_fcgid: error reading data, FastCGI server closed connection, referer: https://*******/rt/Admin/Lifecycles/
[Wed Sep 02 11:15:08.204180 2020] [core:error] [pid 22862] [client 10.*.*.*:64133] End of script output before headers: rt-server.fcgi, referer: https://*******/rt/Admin/Lifecycles/

knation · September 3, 2020, 1:48am

Does it tend to happen when doing a particular task or visiting specific pages in RT? If it’s just random, then maybe the server’s memory is filling up?

KayJay · September 3, 2020, 12:24pm

Thanks for your reply. It’s completely random but I am now focussed on the server config itself, as I’ve noticed stopping apache can take a while and result in it timing out and being killed.

I’ve also reviewed atop logs and noticed yesterday that multiple rt-server.fcgi processes were started, which resulted in all swap memory being consumed before the system killed them all off.

I’ll look to tweak the mod_fcgid.conf, and see if other modules are conflicting as I do have php running, but that doesn’t experience any issues and will continue to run whilst I experience issues with RT. I’m currently using PHP for Webmail that I eventually want RT to replace!

A new development is Error 500 on CSS files:
[03/Sep/2020:10:29:10 +0100] "GET /rt/NoAuth/css/elevator-light/squished-07928e9017d9e4f24077f9c5aabcc235.css HTTP/1.1" 500 547

knation · September 3, 2020, 1:09pm

What do you have set for your MaxSpareServers for mpm_prefork.conf?

KayJay · September 3, 2020, 1:43pm

I don’t have a config set-up for MPM so it’s using defaults. According to Apache docs that would be

MaxSpareServers 10
MinSpareServers 5

But it may be the MaxRequestWorkers directive that needs lowering, since it defaults to 256?

FYI in terms of memory I have:
Total: 3.7GB & 1.8GB Free
Swap: 2GB with 1.7GB free
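As a rough sanity check on the MaxRequestWorkers question, one could divide the memory left over for Apache by a per-process footprint. This is only a sketch: the 1024 MB reserve and the 150 MB per-process figure are placeholder assumptions, not measurements from this server, and `max_processes` is a hypothetical helper name.

```python
def max_processes(total_mb: float, reserved_mb: float, per_process_mb: float) -> int:
    """Return how many worker processes fit in RAM without touching swap.

    total_mb       -- physical memory on the box
    reserved_mb    -- memory kept back for the OS and other services (assumed)
    per_process_mb -- resident size of one worker process (assumed, measure it)
    """
    if per_process_mb <= 0:
        raise ValueError("per_process_mb must be positive")
    usable_mb = max(total_mb - reserved_mb, 0.0)
    return int(usable_mb // per_process_mb)


# 3.7 GB total as quoted above; the other two numbers are placeholders.
print(max_processes(3.7 * 1024, 1024, 150))
```

With those assumed figures the answer comes out far below the default of 256, so the default limits would let the server overcommit memory long before they are reached.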

KayJay · October 5, 2020, 4:17pm

Just to revisit this a month later: I’ve tweaked various configs and believed it had helped with stability. It has to some degree, as I previously had Apache processes running away and consuming memory and swap. The config change has helped, but I’m still getting a lot of “mod_fcgid: read data timeout in 30 seconds” errors, even when pushing email into RT using rt-mailgate.

fcgid.conf

FcgidOutputBufferSize 536870912
FcgidMaxRequestLen 536870912

FcgidMaxRequestsPerProcess 0
FcgidMaxProcesses 100
FcgidMaxProcessesPerClass 50
FcgidIOTimeout 30
FcgidBusyTimeout 300
FcgidIdleTimeout 60
FcgidIdleScanInterval 60
FcgidProcessLifeTime 3600

I’ve added z-mpm_prefork.conf

KeepAlive On
MaxKeepAliveRequests 500
KeepAliveTimeout 3
ServerLimit 23
StartServers 12

MinSpareServers 12
MaxSpareServers 23

MaxRequestWorkers 23
MaxConnectionsPerChild 10000

I have enabled SQL debug mode and can see SQL queries are quick:
[Mon Oct 5 15:56:28 2020] [debug]: SQL(0.000898s): SELECT * FROM Tickets WHERE id = ?; [ bound values: ‘261’ ] (/app/rt5/sbin/…/lib/RT/Interface/Web.pm:1356)

The last line in the RT log is
[Mon Oct 5 15:56:28 2020] [debug]: SQL(0.000867s): SELECT * FROM Queues WHERE id = ?; [ bound values: ‘23’ ] (/app/rt5/sbin/…/lib/RT/Interface/Web.pm:1356)

Then the following in Apache logs (just over 30 seconds)
[Mon Oct 05 16:57:02.700685 2020] [fcgid:warn] [pid 19755] [client 10...:12680] mod_fcgid: read data timeout in 30 seconds, referer: https://*****/rt/Ticket/Display.html?id=261
[Mon Oct 05 16:57:02.700793 2020] [core:error] [pid 19755] [client 10...:12680] End of script output before headers: rt-server.fcgi, referer: https://*****/rt/Ticket/Display.html?id=261

It’s like RT just stops processing for some reason.
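To see how often each kind of mod_fcgid warning occurs, the Apache error log can be tallied with a short script. This is a minimal sketch that assumes the log lines look like the excerpts quoted in this thread; `count_fcgid_warnings` is a hypothetical helper name.

```python
import re
from collections import Counter

# Capture the text after "mod_fcgid: " up to the first comma,
# e.g. "read data timeout in 30 seconds".
FCGID_WARNING = re.compile(r"\[fcgid:warn\].*?mod_fcgid: ([^,]+)")


def count_fcgid_warnings(lines):
    """Count mod_fcgid warnings in an iterable of error-log lines, grouped by message."""
    counts = Counter()
    for line in lines:
        match = FCGID_WARNING.search(line)
        if match:
            counts[match.group(1).strip()] += 1
    return counts


sample = [
    "[Mon Oct 05 16:57:02.700685 2020] [fcgid:warn] [pid 19755] "
    "mod_fcgid: read data timeout in 30 seconds, referer: (redacted)",
    "[Mon Oct 05 16:57:02.700793 2020] [core:error] [pid 19755] "
    "End of script output before headers: rt-server.fcgi",
]
for message, count in count_fcgid_warnings(sample).items():
    print(count, message)
```

In practice the function could be fed an open file handle for the error log instead of the sample list.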

knation · October 6, 2020, 12:43pm

I think FcgidIOTimeout 30 might be pretty low, and it would explain why you see a timeout at 30 seconds.

KayJay · October 6, 2020, 3:50pm

Thanks for the reply.

Possibly, but given the fact the server is handling a low load with only a couple of users, I would have thought 30 seconds would be plenty of time and that another issue is at play here. In general, the solution works well only taking a few seconds to display information/ticket/homepage etc… it just sometimes seems to get stuck.

I need a better picture of what is happening between the last RT log at Mon Oct 5 16:56:28 2020 and the fcgid warning at Mon Oct 05 16:57:02.700685 2020. I’ll look to see if I can get a more detailed view of what’s happening on the server. Perhaps atop or something else can help. But I will see if increasing the timeout helps in any way.
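The gap between the two log entries can be computed directly from the timestamps. The sketch below assumes the RT log is written in UTC while the Apache log uses local time at +0100, which would explain why the RT entry quoted earlier reads 15:56:28. Both offsets are parameters, so adjust them if that assumption is wrong; `gap_seconds` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

RT_FORMAT = "%a %b %d %H:%M:%S %Y"         # e.g. Mon Oct 5 15:56:28 2020
APACHE_FORMAT = "%a %b %d %H:%M:%S.%f %Y"  # e.g. Mon Oct 05 16:57:02.700685 2020


def gap_seconds(rt_stamp, apache_stamp, rt_utc_offset_hours=0, apache_utc_offset_hours=1):
    """Seconds elapsed between an RT log entry and a later Apache log entry."""
    rt_zone = timezone(timedelta(hours=rt_utc_offset_hours))
    apache_zone = timezone(timedelta(hours=apache_utc_offset_hours))
    rt_time = datetime.strptime(rt_stamp, RT_FORMAT).replace(tzinfo=rt_zone)
    apache_time = datetime.strptime(apache_stamp, APACHE_FORMAT).replace(tzinfo=apache_zone)
    return (apache_time - rt_time).total_seconds()


print(gap_seconds("Mon Oct 5 15:56:28 2020", "Mon Oct 05 16:57:02.700685 2020"))
```

Under that timezone assumption the gap is a little over 34 seconds, of which 30 would be the FcgidIOTimeout wait, so RT appears to stall within a few seconds of its last logged query.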

ALone · October 7, 2020, 12:41pm

@KayJay I had a similar issue, and after a week of struggle it turned out to be an RT5 bug.

To check if it is the same case you can follow two paths

Fast path (less accurate at confirming it is the same case):
Remove the “SavedSearches” module from all dashboards.

On your RT home page click edit on the top right corner
Drag and drop the module to the left.

Even one dashboard with this module can slow everything down.

Second path (more time-consuming but accurate):
Enable debug logging on the Apache server and search for the error below:
```
[130972] [Tue Aug 11 01:22:02 2020] [error]: Can't locate object method "ColumnMapClassName" via package "RT::SavedSearch" at /opt/production/rt5.0/share/html/Elements/CollectionAsTable/Header line 74.
Stack:
[/opt/production/rt5.0/share/html/Elements/CollectionAsTable/Header:73]
[/opt/production/rt5.0/share/html/Elements/CollectionList:143]
[/opt/production/rt5.0/share/html/Elements/SavedSearches:59]
[/opt/production/rt5.0/share/html/Widgets/TitleBox:61]
[/opt/production/rt5.0/share/html/Elements/SavedSearches:67]
[/opt/production/rt5.0/share/html/Elements/MyRT:99]
[/opt/production/rt5.0/share/html/index.html:78]
[/opt/production/rt5.0/sbin/…/lib/RT/Interface/Web.pm:710]
[/opt/production/rt5.0/sbin/…/lib/RT/Interface/Web.pm:389]
[/opt/production/rt5.0/share/html/autohandler:53]
(/opt/production/rt5.0/sbin/…/lib/RT/Interface/Web/Handler.pm:209)
```

The workaround is the same in both cases: remove the “SavedSearches” module from the home screens.

If that is the case this is a known bug as per :

KayJay · October 8, 2020, 3:04pm

Many thanks for your reply.

Good suggestion about changing the LogLevel to debug, which I’ve done this morning, as well as increased TimeOut to 45 seconds.

I still have a fairly vanilla install of RT5, upgraded from 4.4, and don’t have any SavedSearches. But thanks for your suggestion. I hope the debug mode turns something up.

blizzy · August 21, 2021, 8:25am

Hello, I hope you can help me. I’ve been experiencing slow performance with RT5. I looked at the process list and there were many instances of:

/usr/bin/perl -w /opt/rt5/sbin/rt-server.fcgi/

They are consuming a lot of RAM, which makes RT5 slow for every transaction.
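One way to put a number on this is to total the resident memory of the rt-server.fcgi processes from the output of `ps -eo rss,args` (RSS is reported in KiB). The sketch below parses such output; the sample text is made up for illustration and `fcgi_memory_mb` is a hypothetical helper name.

```python
def fcgi_memory_mb(ps_output: str, needle: str = "rt-server.fcgi"):
    """Return (process_count, total_rss_mb) for commands containing `needle`.

    Expects the text produced by `ps -eo rss,args`: an RSS column in KiB
    followed by the full command line.
    """
    count = 0
    total_kb = 0
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        # Skip the header row and anything that is not "<number> <command>".
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        rss_kb, command = parts
        if needle in command:
            count += 1
            total_kb += int(rss_kb)
    return count, total_kb / 1024


# Made-up sample output, not taken from a real server.
sample = """\
  RSS COMMAND
204800 /usr/bin/perl -w /opt/rt5/sbin/rt-server.fcgi/
102400 /usr/bin/perl -w /opt/rt5/sbin/rt-server.fcgi/
  5120 /usr/sbin/httpd -DFOREGROUND
"""
print(fcgi_memory_mb(sample))
```

Dividing the total by the process count gives an average footprint per process, which helps when choosing a process limit that fits in RAM.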

The mysql database is not in the same instance of the webserver. I’m using percona xtradb cluster and it is separated from the rt webserver.

I’m not sure whether this is caused by too many MySQL transactions, but when I checked the MySQL server, the processes there looked okay.

I’ve configured the Apache fcgid.conf with MaxRequestLen 10000000.

FcgidConnectTimeout 20
AddHandler fcgid-script .fcgi
MaxRequestLen 10000000

I badly need help.

knation · August 21, 2021, 6:10pm

You could try setting the max clients for Apache to a lower value.

blizzy · August 22, 2021, 12:10pm

Sorry, I really don’t know about apache configuration. How to change that?

knation · August 22, 2021, 12:41pm

I believe this config will work:

https://httpd.apache.org/mod_fcgid/mod/mod_fcgid.html#fcgidmaxprocesses

You can add it to where you added the previous changes