We recently upgraded our RT instance from Debian Jessie to Debian Stretch, and with it upgraded from RT 4.2.13 to 4.4.1. We’re using nginx with the rt4-fcgi service as provided by the Debian packages.
Ever since the upgrade, we’ve had occasional reports of users getting a Bad Gateway error from nginx when they try to post an update to a ticket. According to syslog, the rt4-fcgi service died and had to be restarted, but there are no obvious error messages from rt4-fcgi itself.
I did a quick strace of the rt4-fcgi processes, which revealed that they were dying with SIGPIPE when they tried to write to their FastCGI socket.
Does anyone have any clues as to what might be causing this to happen, or any thoughts on additional troubleshooting steps we can follow to try and track down the issue?
We’re not setting a FastCGI timeout, so it would be at the default of 60 seconds. However, the users who reported getting an error when posting a ticket update say that it comes up within a second or two; they’re not sitting there for a minute before it times out. If it had timed out, I’d also expect to see 504 responses in the nginx logs, but there are none.
Looking at the logs, though, I did notice that all of the 502 errors from /Ticket/Update.html were preceded by a request to /Helpers/PreviewScrips that was recorded with a 499 status. This is an internal nginx status code indicating that the client dropped the connection before the request was completed.
This seems consistent with the advice in the FastCGI FAQ, which indicates that a SIGPIPE will be sent to a FastCGI application if the client aborts the connection. It looks like RT is not installing a SIGPIPE handler, so the FastCGI process dies whenever a connection is dropped by the client.
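For anyone wanting to check their own access log for the same pattern, here is a minimal sketch in Python. It assumes nginx's default "combined" log format and treats "preceded by" as "the previous request logged for the same client address"; both are assumptions, and the function name and sample lines are made up purely for illustration (they are not taken from our logs).

```python
import re

# Assumes nginx's default "combined" access log format.
LINE_RE = re.compile(
    r'^(?P<addr>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)

def aborted_then_failed(lines):
    """Return (client address, timestamp) for each 502 on /Ticket/Update.html
    whose previous logged request from the same client was a 499 on
    /Helpers/PreviewScrips."""
    last = {}   # client address -> (path, status) of its previous request
    hits = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that do not look like combined-format entries
        addr, path, status = m["addr"], m["path"], m["status"]
        prev = last.get(addr)
        if (status == "502"
                and path.startswith("/Ticket/Update.html")
                and prev is not None
                and prev[1] == "499"
                and prev[0].startswith("/Helpers/PreviewScrips")):
            hits.append((addr, m["time"]))
        last[addr] = (path, status)
    return hits

# Hypothetical sample lines, invented for illustration only.
SAMPLE = [
    '192.0.2.10 - - [01/Jan/2020:10:00:00 +0000] "POST /Helpers/PreviewScrips HTTP/1.1" 499 0 "-" "example"',
    '192.0.2.10 - - [01/Jan/2020:10:00:01 +0000] "POST /Ticket/Update.html HTTP/1.1" 502 173 "-" "example"',
    '192.0.2.11 - - [01/Jan/2020:10:00:02 +0000] "GET /Ticket/Display.html?id=1 HTTP/1.1" 200 5120 "-" "example"',
]

if __name__ == "__main__":
    for addr, when in aborted_then_failed(SAMPLE):
        print(addr, when)
```

To run it against a real log, pass an open file object instead of SAMPLE, adjusting the regular expression if your log_format differs.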
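To illustrate the mechanism (this is not RT code, just a small Python sketch of the general Unix behaviour): a process with the default SIGPIPE disposition is killed when it writes to a socket whose peer has gone away, while a process that ignores SIGPIPE gets an EPIPE error from the write and can carry on. The helper name run_child is made up, and the sketch assumes a Unix-like system. In Perl, ignoring the signal is normally written as $SIG{PIPE} = 'IGNORE'; whether that is the right fix inside RT is something I have not verified.

```python
import signal
import subprocess
import sys
import textwrap

# Child program: writes to a socket after the peer has closed it.
CHILD = textwrap.dedent("""
    import signal, socket, sys
    mode = sys.argv[1]
    # Python ignores SIGPIPE at startup, so restore the OS default to mimic
    # a program that has no SIGPIPE handling of its own.
    signal.signal(signal.SIGPIPE,
                  signal.SIG_DFL if mode == "default" else signal.SIG_IGN)
    a, b = socket.socketpair()
    b.close()                        # the "client" drops the connection
    try:
        a.sendall(b"response body")  # write to a socket with no reader
    except BrokenPipeError:
        sys.exit(0)                  # survived: the write failed with EPIPE
    sys.exit(1)                      # not expected: the write succeeded
""")

def run_child(mode):
    """Run the child with the given SIGPIPE mode and return its exit status.
    A negative value means the child was killed by that signal number."""
    return subprocess.run([sys.executable, "-c", CHILD, mode]).returncode

if __name__ == "__main__":
    print("default disposition:", run_child("default"))
    print("SIGPIPE ignored:    ", run_child("ignore"))
```

With the default disposition the child is terminated by the signal before it can log anything, which would match a FastCGI worker dying without leaving an error message of its own.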
No, there are no errors being produced by RT. I believe the 502 errors are being generated by nginx because there are no RT FastCGI processes running at the time to process the request.
Looking at the JavaScript in Update.html, the aborted requests to /Helpers/PreviewScrips do make some sense. I can see that it triggers an AJAX query for this helper in response to the blur event on the UpdateContent element. Depending on the exact timing, the user could end up blurring the content update input just before submitting the form, resulting in the AJAX call being aborted by the form submission.
We’ve just upgraded from 4.4.1 to 4.4.4 and we’re experiencing the same issue on a very regular basis. Other than “restart via cron”, has anyone worked out a better solution?
Experiencing these too on RT 5.0.1. Managed to “hide” the SIGPIPE errors by running my RT FastCGI process like this:
/opt/rt5/sbin/rt-server.fcgi --listen /var/tmp/rt-server.sock --nproc 4
With this config RT seems to restart pretty fast after a SIGPIPE error. (Users only experience a slow UI when this happens, not a user-visible error.)
Also seeing frequent “could not receive data from client: Connection reset by peer” messages in the PostgreSQL log, correlating with the SIGPIPE errors.
Somehow I couldn’t reproduce this. I’m wondering if it’s because of different versions of some modules.
Could you share your Perl module info (i.e. the “Loaded perl modules” section on /Admin/Tools/Configuration.html) so I can investigate?