RT 4.4 FastCGI processes frequently dying

We recently upgraded our RT instance from Debian Jessie to Debian Stretch, and with it upgraded from RT 4.2.13 to 4.4.1. We’re using nginx with the rt4-fcgi service as provided by the Debian packages.

Ever since the upgrade, we’ve had occasional reports of users getting a “Bad gateway” error from nginx when they try to post an update to a ticket. According to syslog, the rt4-fcgi service died and had to be restarted each time, but there are no obvious error messages from rt4-fcgi itself.

A quick strace of the rt4-fcgi processes revealed that they were dying with SIGPIPE when trying to write to their FastCGI socket:

16562 write(5, "\1\6\0\1\0\0\0\0\1\3\0\1\0\10\0\0\0\0\0\0\0\0\0\0", 24) = -1 EPIPE (Broken pipe)
16562 --- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=16562, si_uid=33} ---
16562 +++ killed by SIGPIPE +++

Does anyone have any clues as to what might be causing this to happen, or any thoughts on additional troubleshooting steps we can follow to try and track down the issue?

I have the same problem, and so far we are unable to determine the exact cause.

Increasing the amount of RAM available on the VM seemed to help a little - it dies less often.

Other than that, we just added a cron job to restart it when it dies. Not ideal, but it at least gets things running again when it happens.

# In /etc/cron.d/request-tracker4: check whether RT is running and restart it if not.
*/5 * * * *  root /var/scripts/monitor rt4-fcgi

The /var/scripts/monitor script itself:

#!/bin/sh
# Restart the named service if it is not currently active.
service="$1"

if ! /bin/systemctl -q is-active "$service.service"; then
    echo "Restarting service: $service"
    /bin/systemctl start "$service.service"
fi

Could this be an nginx timeout? You could try increasing some of the nginx timeout variables.
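For example, something along these lines in the RT location block (the socket path and the values here are illustrative assumptions, not taken from your config):

```nginx
location / {
    include fastcgi_params;
    # Socket path is an assumption; use whatever your rt4-fcgi unit binds to.
    fastcgi_pass unix:/run/rt4-fcgi.sock;

    # All three of these default to 60s; raise them if RT requests are slow.
    fastcgi_connect_timeout 60s;
    fastcgi_send_timeout    300s;
    fastcgi_read_timeout    300s;
}
```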

Does this happen if you do a large query like id > 0?

We’re not setting a FastCGI timeout, so this would be the default of 60 seconds. However, the users who get the error when posting a ticket update say it comes up within a second or two; they’re not sitting there for a minute before it times out. If it were timing out I’d also expect to see 504 error responses in the nginx logs, but there are none.

Looking at the logs, though, I did notice that all of the 502 errors from /Ticket/Update.html were preceded by a request to /Helpers/PreviewScrips recorded with a 499 status. This is an internal nginx status code indicating that the client dropped the connection before the request was completed:

[05/Mar/2020:10:50:24 +0000] "POST /Helpers/PreviewScrips HTTP/1.1" 499
[05/Mar/2020:10:50:24 +0000] "POST /Ticket/Update.html HTTP/1.1" 502
[09/Mar/2020:14:28:49 +0000] "POST /Helpers/PreviewScrips HTTP/1.1" 499
[09/Mar/2020:14:28:49 +0000] "POST /Ticket/Update.html HTTP/1.1" 502
[09/Mar/2020:16:12:28 +0000] "POST /Helpers/PreviewScrips HTTP/1.1" 499
[09/Mar/2020:16:12:28 +0000] "POST /Ticket/Update.html HTTP/1.1" 502
[10/Mar/2020:15:39:41 +0000] "POST /Helpers/PreviewScrips HTTP/1.1" 499
[10/Mar/2020:15:39:42 +0000] "POST /Ticket/Update.html HTTP/1.1" 502

This seems consistent with the advice in the FastCGI FAQ, which indicates that SIGPIPE is sent to a FastCGI application when the client aborts the connection. It looks like RT does not install a SIGPIPE handler, so the FastCGI process dies whenever a client drops a connection mid-request.
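The default behaviour is easy to demonstrate outside RT. In this sketch (plain bash, nothing RT-specific), the writer in a pipeline is killed by SIGPIPE as soon as the reader exits:

```shell
#!/bin/bash
# Demonstration only, not RT code: a writer with the default SIGPIPE
# disposition is killed once the reading end of its pipe goes away.
yes | head -n 1 > /dev/null

# 141 = 128 + 13 (SIGPIPE): bash reports that the writer was signal-killed.
echo "writer exit status: ${PIPESTATUS[0]}"
```

If the signal were ignored or handled instead (in Perl, by assigning to $SIG{PIPE}), the write would fail with EPIPE rather than killing the process, and the worker could carry on serving requests.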

Are there any errors in the RT logs that coincide with the timing of the 502 errors?

No, there are no errors being produced by RT. I believe the 502 errors are being generated by nginx because there are no RT FastCGI processes running at the time to process the request.

Looking at the JavaScript in Update.html, the aborted requests to /Helpers/PreviewScrips do make some sense. It triggers an AJAX request for this helper in response to the blur event on the UpdateContent element. Depending on the exact timing, a user can end up blurring the content update input just before submitting the form, so the in-flight AJAX call is aborted by the form submission.