RT5 slow performance and intermittent Internal Server Error

OS: RHEL 7.6
Apache 2.4.6 with FastCGI mod_fcgid/2.3.9
Oracle Database

I have RT5 running in a custom location and everything works (ish) although RT sometimes takes several minutes to return a page and sometimes I get the dreaded Error 500 Internal Server Error. I’ve read through the docs several times and can’t see what config changes I need to make to improve performance. From the email side of things, everything appears to be running well and cases are being created. It’s simply the web interface that is having the issue.

As per web deployment doc I have disabled mod_speling and mod_cache and have the prefork MPM mod configured.

My SSL virtual host includes:
ScriptAlias /rt /app/rt5/sbin/rt-server.fcgi/
<Location /rt>
Require all granted
Options +ExecCGI
AddHandler fcgid-script fcgi
</Location>

Does anyone have suggestions on where else I could look or investigate?

Unsure if the logs help. I saw similar messages when some files didn’t have the right permissions. As this is intermittent, I’ve ruled out permissions:

Example:
10.*.*.* - - [02/Sep/2020:11:08:59 +0100] "GET /rt/Admin/Lifecycles/ HTTP/1.1" 200 38349
10.*.*.* - - [02/Sep/2020:11:09:08 +0100] "GET /rt/Admin/Lifecycles/Modify.html?Type=ticket&Name=countermeasures HTTP/1.1" 500 547

[Wed Sep 02 11:15:08.204053 2020] [fcgid:warn] [pid 22862] [client 10.*.*.*:64133] mod_fcgid: error reading data, FastCGI server closed connection, referer: https://*******/rt/Admin/Lifecycles/
[Wed Sep 02 11:15:08.204180 2020] [core:error] [pid 22862] [client 10.*.*.*:64133] End of script output before headers: rt-server.fcgi, referer: https://*******/rt/Admin/Lifecycles/

Does it tend to happen when doing a particular task or visiting specific pages in RT? If it’s just random then maybe the server’s memory is filling up?

Thanks for your reply. It’s completely random but I am now focussed on the server config itself, as I’ve noticed stopping apache can take a while and result in it timing out and being killed.

I’ve also reviewed atop logs and noticed yesterday that multiple rt-server.fcgi processes were started, which resulted in all swap memory being consumed before the system killed them all off.
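To put numbers on the runaway processes, something like the following could be run periodically alongside atop. It is only a sketch: it assumes Python 3.7 or newer and a Linux procps `ps`, and the names `summarize_fcgi` and `snapshot` as well as the sample figures are made up for illustration, not taken from this server.

```python
import subprocess

def summarize_fcgi(ps_output, needle="rt-server.fcgi"):
    """Return (process_count, total_rss_mb) from `ps -o rss= -o args=` output."""
    count = 0
    total_kb = 0
    for line in ps_output.splitlines():
        parts = line.split(None, 1)  # first field is RSS in KiB, the rest is the command line
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        rss_kb, args = parts
        if needle in args:
            count += 1
            total_kb += int(rss_kb)
    return count, total_kb / 1024.0

def snapshot():
    """Run ps and summarise the rt-server.fcgi processes currently alive."""
    out = subprocess.run(
        ["ps", "-e", "-o", "rss=", "-o", "args="],
        capture_output=True, text=True, check=True,
    ).stdout
    return summarize_fcgi(out)

# HYPOTHETICAL sample input, only to show the expected shape
sample = """\
 204800 /usr/bin/perl -w /app/rt5/sbin/rt-server.fcgi
 102400 /usr/bin/perl -w /app/rt5/sbin/rt-server.fcgi
   5120 /usr/sbin/httpd -DFOREGROUND
"""
print(summarize_fcgi(sample))
```

Logging the result of `snapshot()` every minute or so would show whether the process count or the per-process size is what grows before swap is exhausted.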

I’ll look to tweak the mod_fcgid.conf, and see if other modules are conflicting as I do have php running, but that doesn’t experience any issues and will continue to run whilst I experience issues with RT. I’m currently using PHP for Webmail that I eventually want RT to replace!

A new development is Error 500 on CSS files:
[03/Sep/2020:10:29:10 +0100] "GET /rt/NoAuth/css/elevator-light/squished-07928e9017d9e4f24077f9c5aabcc235.css HTTP/1.1" 500 547
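One way to check whether the 500s really are random, rather than tied to particular pages, is to tally them per path from the access log. A minimal sketch, assuming the usual Apache common/combined log layout shown in the excerpts above; the function name `count_500s` is hypothetical.

```python
import re
from collections import Counter

# Picks the request path and status code out of an Apache access log line
LINE_RE = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')

def count_500s(lines):
    """Count HTTP 500 responses per request path, ignoring the query string."""
    hits = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m and m.group("status") == "500":
            hits[m.group("path").split("?", 1)[0]] += 1
    return hits

# Two lines taken from the excerpts in this thread
sample_lines = [
    '10.*.*.* - - [02/Sep/2020:11:08:59 +0100] "GET /rt/Admin/Lifecycles/ HTTP/1.1" 200 38349',
    '10.*.*.* - - [02/Sep/2020:11:09:08 +0100] "GET /rt/Admin/Lifecycles/Modify.html'
    '?Type=ticket&Name=countermeasures HTTP/1.1" 500 547',
]
print(count_500s(sample_lines))
```

Passing an open file handle for the real access log instead of `sample_lines` and calling `.most_common(10)` on the result would show whether a handful of URLs dominate or the failures are spread evenly.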

What do you have set for your MaxSpareServers for mpm_prefork.conf?

I don’t have a config set-up for MPM so it’s using defaults. According to Apache docs that would be

MaxSpareServers 10
MinSpareServers 5

But it may be the MaxRequestWorkers directive that needs lowering, as it is set to 256 by default?

FYI in terms of memory I have:
Total: 3.7GB & 1.8GB Free
Swap: 2GB with 1.7GB free
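As a rough sanity check on a process limit such as MaxRequestWorkers, the arithmetic can be sketched like this. The total RAM comes from the figures above; the reserved headroom and the per-process size are pure assumptions and should be replaced with values measured on the server. `max_workers` is a hypothetical helper name.

```python
def max_workers(total_mb, reserved_mb, per_process_mb):
    """How many processes of a given resident size fit in RAM without using swap."""
    if per_process_mb <= 0:
        raise ValueError("per_process_mb must be positive")
    return max(0, int((total_mb - reserved_mb) // per_process_mb))

total_mb = 3.7 * 1024   # total RAM quoted above
reserved_mb = 1024      # ASSUMPTION: headroom for the OS, PHP and everything else
per_process_mb = 150    # ASSUMPTION: placeholder, measure the real RSS with ps or atop

print(max_workers(total_mb, reserved_mb, per_process_mb))
```

With those placeholder numbers the result is well below 256, which is why a default that high can push a small server into swap if every worker ends up busy at once.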

Just to revisit this a month later: I’ve tweaked various configs and believed it helped with stability. It has to some degree, as I was having apache processes running away and consuming memory & swap. The config change has helped, but I’m still getting a lot of “mod_fcgid: read data timeout in 30 seconds”, even when pushing email into RT using rt-mailgate.

fcgid.conf

FcgidOutputBufferSize 536870912
FcgidMaxRequestLen 536870912

FcgidMaxRequestsPerProcess 0
FcgidMaxProcesses 100
FcgidMaxProcessesPerClass 50
FcgidIOTimeout 30
FcgidBusyTimeout 300
FcgidIdleTimeout 60
FcgidIdleScanInterval 60
FcgidProcessLifeTime 3600

I’ve added z-mpm_prefork.conf

KeepAlive On
MaxKeepAliveRequests 500
KeepAliveTimeout 3
ServerLimit 23
StartServers 12

MinSpareServers 12
MaxSpareServers 23

MaxRequestWorkers 23
MaxConnectionsPerChild 10000

I have enabled SQL debug mode and can see SQL queries are quick:
[Mon Oct 5 15:56:28 2020] [debug]: SQL(0.000898s): SELECT * FROM Tickets WHERE id = ?; [ bound values: '261' ] (/app/rt5/sbin/…/lib/RT/Interface/Web.pm:1356)

The last line in the RT log is
[Mon Oct 5 15:56:28 2020] [debug]: SQL(0.000867s): SELECT * FROM Queues WHERE id = ?; [ bound values: '23' ] (/app/rt5/sbin/…/lib/RT/Interface/Web.pm:1356)

Then the following in Apache logs (just over 30 seconds)
[Mon Oct 05 16:57:02.700685 2020] [fcgid:warn] [pid 19755] [client 10.*.*.*:12680] mod_fcgid: read data timeout in 30 seconds, referer: https://*****/rt/Ticket/Display.html?id=261
[Mon Oct 05 16:57:02.700793 2020] [core:error] [pid 19755] [client 10.*.*.*:12680] End of script output before headers: rt-server.fcgi, referer: https://*****/rt/Ticket/Display.html?id=261

It’s like RT just stops processing for some reason.

I think FcgidIOTimeout 30 might be pretty low, and it would explain why you see a timeout at 30 seconds.

Thanks for the reply.

Possibly, but given the fact the server is handling a low load with only a couple of users, I would have thought 30 seconds would be plenty of time and that another issue is at play here. In general, the solution works well only taking a few seconds to display information/ticket/homepage etc… it just sometimes seems to get stuck.

I need a better picture of what is happening between the last RT log at Mon Oct 5 16:56:28 2020 and the fcgid warning at Mon Oct 05 16:57:02.700685 2020. I’ll look to see if I can get a more detailed view of what’s happening on the server. Perhaps atop or something else can help. But I will see if increasing the timeout helps in any way.
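For lining the two logs up, a small helper can compute the gap between the last RT entry and the Apache warning. This is a sketch under assumptions: the RT entry is stamped 15:56:28 in the log excerpt but referred to as 16:56:28 in this post, so the example assumes the RT log runs one hour behind the Apache log. The function name `stall_seconds` and the `rt_offset_hours` parameter are hypothetical, and the `%a`/`%b` parsing relies on an English locale.

```python
from datetime import datetime, timedelta

RT_FORMAT = "%a %b %d %H:%M:%S %Y"         # e.g. Mon Oct 5 15:56:28 2020
APACHE_FORMAT = "%a %b %d %H:%M:%S.%f %Y"  # e.g. Mon Oct 05 16:57:02.700685 2020

def stall_seconds(rt_stamp, apache_stamp, rt_offset_hours=0):
    """Seconds between the last RT log entry and the Apache warning."""
    rt_time = datetime.strptime(rt_stamp, RT_FORMAT) + timedelta(hours=rt_offset_hours)
    apache_time = datetime.strptime(apache_stamp, APACHE_FORMAT)
    return (apache_time - rt_time).total_seconds()

gap = stall_seconds(
    "Mon Oct 5 15:56:28 2020",
    "Mon Oct 05 16:57:02.700685 2020",
    rt_offset_hours=1,  # ASSUMPTION: RT log is one hour behind the Apache log
)
print(round(gap, 3))
```

A gap only slightly above the configured FcgidIOTimeout suggests the process went quiet right after its last logged query, rather than working slowly the whole time.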

@KayJay I had a similar issue, and after a week of struggle it turned out to be an RT5 bug.

To check if it is the same case, you can follow two paths:

  1. Fast path (less accurate for confirming it is the same case):
    Remove the “SavedSearches” module from all dashboards.
  • On your RT home page, click Edit in the top right corner.
  • Drag and drop the module to the left.

Even one dashboard with this module can slow everything down.

  2. Second path (more time-consuming but accurate):
    Enable the debug log on the Apache server and search for the error below:

     [130972] [Tue Aug 11 01:22:02 2020] [error]: Can't locate object method "ColumnMapClassName" via package "RT::SavedSearch" at /opt/production/rt5.0/share/html/Elements/CollectionAsTable/Header line 74.
    

    Stack:
    [/opt/production/rt5.0/share/html/Elements/CollectionAsTable/Header:73]
    [/opt/production/rt5.0/share/html/Elements/CollectionList:143]
    [/opt/production/rt5.0/share/html/Elements/SavedSearches:59]
    [/opt/production/rt5.0/share/html/Widgets/TitleBox:61]
    [/opt/production/rt5.0/share/html/Elements/SavedSearches:67]
    [/opt/production/rt5.0/share/html/Elements/MyRT:99]
    [/opt/production/rt5.0/share/html/index.html:78]
    [/opt/production/rt5.0/sbin/…/lib/RT/Interface/Web.pm:710]
    [/opt/production/rt5.0/sbin/…/lib/RT/Interface/Web.pm:389]
    [/opt/production/rt5.0/share/html/autohandler:53]
    (/opt/production/rt5.0/sbin/…/lib/RT/Interface/Web/Handler.pm:209)

The workaround is the same in both cases: remove the “SavedSearches” module from the home screens.
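If the logs are large, searching for that error can be scripted. A minimal sketch that looks for the message text quoted above; the function name `find_savedsearch_errors` is hypothetical, and which log file to feed it depends on where your RT and Apache errors end up.

```python
# The error text quoted above, matched as a plain substring
ERROR_TEXT = 'Can\'t locate object method "ColumnMapClassName" via package "RT::SavedSearch"'

def find_savedsearch_errors(lines):
    """Return (line_number, text) for every line carrying the ColumnMapClassName error."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, 1)
        if ERROR_TEXT in line
    ]

# Sample built from the log line quoted in this post plus one unrelated line
sample_log = [
    "[130972] [Tue Aug 11 01:22:02 2020] [debug]: some unrelated entry\n",
    '[130972] [Tue Aug 11 01:22:02 2020] [error]: Can\'t locate object method '
    '"ColumnMapClassName" via package "RT::SavedSearch" at '
    "/opt/production/rt5.0/share/html/Elements/CollectionAsTable/Header line 74.\n",
]
for number, text in find_savedsearch_errors(sample_log):
    print(number, text)
```

An empty result over a log that covers a slow period would suggest this particular bug is not the cause.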

If that is the case this is a known bug as per :

Many thanks for your reply.

Good suggestion about changing the LogLevel to debug, which I’ve done this morning, as well as increased TimeOut to 45 seconds.

I still have a fairly vanilla install of RT5, upgraded from 4.4, and don’t have any SavedSearches. But thanks for your suggestion. I hope the debug mode turns something up.