4.4.4 new install: Transient mod_fcgid timeouts causing 500 errors on login, possibly LDAP-related

We’ve recently upgraded to 4.4.4 from (I think) 4.4.1, via fresh install on a new server, DB restored from dump.

We’ve been getting transient mod_fcgid timeouts, possibly LDAP related - showing a 500 internal server error to the user.

The previous install used mod_fastcgi, but this was unavailable on the new server, so we went with mod_fcgid instead. Everything’s been mostly fine, but we’re getting transient ~15 minute outages every few days when nobody can log in, and new tickets are sometimes bounced back to the mailserver.

Already-logged-in users are generally fine.

What it looks like is that LDAP queries are timing out, then mod_fcgid is bailing, then Apache throws an error because it didn’t get any content.

Annoyingly, last time this happened we created a purely internal user to fall back on in case of LDAP outages - but it appears the (possible) LDAP failure is killing us before it can fall back to internal auth.

We don’t control the LDAP server unfortunately, so is there a way to make RT fail a bit more gracefully in case of timeouts, at least enough to fall back to internal auth?

From Apache error.log:

[fcgid:warn] [pid 30392:tid 140239859308288] [client REDACTED:49347] mod_fcgid: read data timeout in 40 seconds, referer: https://REDACTED/NoAuth/Login.html?next=40fc537e7bf974609d0c7c10b70b3a47
[core:error] [pid 30392:tid 140239859308288] [client REDACTED:49347] End of script output before headers: rt-server.fcgi, referer: https://REDACTED/NoAuth/Login.html?next=40fc537e7bf974609d0c7c10b70b3a47

From request-tracker log:

[debug]: Attempting to use external auth service: REDACTED_LDAP (/home/rt4/rt4package/sbin/../lib/RT/Authen/ExternalAuth.pm:288)
[debug]: Calling UserExists with $username (emergency_admin) and $service (REDACTED_LDAP) (/home/rt4/rt4package/sbin/../lib/RT/Authen/ExternalAuth.pm:329)
[debug]: UserExists params: username: emergency_admin , service: REDACTED_LDAP (/home/rt4/rt4package/sbin/../lib/RT/Authen/ExternalAuth/LDAP.pm:486)
[debug]: Attempting to use external auth service: REDACTED_LDAP (/home/rt4/rt4package/sbin/../lib/RT/Authen/ExternalAuth.pm:288)
[debug]: SSO Failed and no user to test with. Nexting (/home/rt4/rt4package/sbin/../lib/RT/Authen/ExternalAuth.pm:316)

From RT_SiteConfig.pm:

Set($ExternalAuth, 1);

Set($ExternalAuthPriority, ['REDACTED_LDAP']);
Set($ExternalInfoPriority, ['REDACTED_LDAP']);

Set($UserAutocreateDefaultsOnLogin, {Privileged => 0} );
Set($AutoCreateNonExternalUsers, 1);

Set($ExternalSettings, {
  'REDACTED_LDAP' => {
...
}

What are your Apache mod_fcgid timeout values (FcgidBusyTimeout and FcgidIOTimeout)? If they are set to the defaults, it is possible that the web request times out before LDAP returns a value (this would also cause internal login to fail if RT tries LDAP first and times out waiting for it).

You can also set a timeout value for LDAP in the “net_ldap_args” section of your RT config; see the documentation for Net::LDAP, the LDAP module RT uses.

If you set that timeout value lower than the Apache timeout, you should at least get something in the logs telling you LDAP timed out instead of the Apache 500 error.
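
For anyone finding this later, here is a rough sketch of what that could look like in RT_SiteConfig.pm. The 10-second value is only an assumption, picked to sit well below the 40 seconds mod_fcgid reported in the log above; the rest of the service block is left out and should stay as you already have it. As far as I know, the list in net_ldap_args is passed through to Net::LDAP->new, whose timeout option applies to connecting to the server, so treat this as a starting point rather than a guaranteed fix.

```perl
Set($ExternalSettings, {
    'REDACTED_LDAP' => {
        # ... keep your existing settings for this service here ...

        # Extra arguments handed to Net::LDAP->new when RT connects.
        'net_ldap_args' => [
            version => 3,
            # Assumed value: give up on the LDAP connection after 10 seconds,
            # i.e. before the 40-second mod_fcgid read timeout is reached.
            timeout => 10,
        ],
    },
});
```

Restart the web server after changing the config so the FastCGI processes pick up the new setting.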

It looks like this nailed it - there have been no 500 errors since setting the LDAP timeout - thanks!