EX_TEMPFAIL / receiving multiple copies of mail

Hello,

Sorry this is a bit long. The basic issue is I believe there may be a bug
somewhere since we got multiple copies of messages today after bringing RT
back up from a FastCGI lost communications error. RT correctly issued
EX_TEMPFAIL responses, but when RT came back up, we received one copy of
the message for each EX_TEMPFAIL response that was sent and one copy for
the successful send.

Some Background:

We are running RT3.0.0 (built from source) with Apache and FastCGI
(installed from FreeBSD ports). This is our first time using FastCGI and
we are encountering some lost communications issues with it. A couple of
time over the past three days, RT has quit responding and the only
solution is to restart apache. After this happened the first time, I
added “-idle-timeout 90 -processes 5” to my FastCGIServer directive for RT
to increase the timeout (it’s possible that it may take longer than 30
seconds to send an e-mail) and increased the processes to hopefully
improve access for multiple users. Last night it stopped responding
again, so I increased the timeout to 300 as a bandaid. I know this won’t
correct the problem, it will just make our users wait 5 minutes before
figuring out there is a FastCGI problem. The FastCGI problem is something
I need to research, however if anyone has insight about this, please feel
free to reply.

Typical FastCGI error in the apache error log:

[Wed Apr 2 08:52:53 2003] [error] [client 130.94.96.248] FastCGI: comm
with server “/usr/local/rt3/bin/mason_handler.fcgi” aborted: idle timeout (90 sec)

Here are the maillog entries (stripped of identifying information to
protect the privacy of the innocent) of one of the messages. We received
this message 4 times. Note there were 3 EX_TEMPFAIL’s and one successful
send.

% grep “h32GAWdo015894” /var/log/maillog

Apr 2 08:10:35 rt3 sm-mta[15894]: h32GAWdo015894: from=<>, size=1367,
class=0, nrcpts=1, msgid=<200304021625.h32GPCO99599@>, proto=ESMTP,
daemon=MTA, relay=<>

Apr 2 08:12:06 rt3 sm-mta[15895]: h32GAWdo015894:
to="|/usr/local/rt3/bin/rt-mailgate --queue queue-name --action correspond
–url http://www.example.com/", ctladdr=<> (26/0), delay=00:01:34,
xdelay=00:01:31, mailer=prog, pri=31621, dsn=4.0.0, stat=Deferred: prog
mailer (/bin/sh) exited with EX_TEMPFAIL

Apr 2 08:13:44 rt3 sm-mta[15900]: h32GAWdo015894:
to="|/usr/local/rt3/bin/rt-mailgate --queue queue-name --action correspond
–url http://www.example.com/", ctladdr=<> (26/0), delay=00:03:12,
xdelay=00:01:31, mailer=prog, pri=121621, dsn=4.0.0, stat=Deferred: prog
mailer (/bin/sh) exited with EX_TEMPFAIL

Apr 2 08:46:47 rt3 sm-mta[16006]: h32GAWdo015894:
to="|/usr/local/rt3/bin/rt-mailgate --queue queue-name --action correspond
–url http://www.example.com/", ctladdr=<> (26/0), delay=00:36:15,
xdelay=00:01:33, mailer=prog, pri=211621, dsn=4.0.0, stat=Deferred: prog
mailer (/bin/sh) exited with EX_TEMPFAIL

Apr 2 09:13:01 rt3 sm-mta[16281]: h32GAWdo015894:
to="|/usr/local/rt3/bin/rt-mailgate --queue queue-name --action correspond
–url http://www.example.com/", ctladdr=<> (26/0), delay=01:02:29,
xdelay=00:00:03, mailer=prog, pri=301621, dsn=2.0.0, stat=Sent

“If Verisign is the ‘Value of Trust’,
then trust ain’t worth much any more.” - me

Spam Trap Mail Key: ASK and you shall receive

Hello,

Sorry this is a bit long. The basic issue is I believe there may be a bug
somewhere since we got multiple copies of messages today after bringing RT
back up from a FastCGI lost communications error. RT correctly issued
EX_TEMPFAIL responses, but when RT came back up, we received one copy of
the message for each EX_TEMPFAIL response that was sent and one copy for
the successful send.

What MTA are you running?

http://www.bestpractical.com/rt – Trouble Ticketing. Free.

Jesse,

Sendmail 8.12.6p2
(standard FreeBSD 4.7 port with latest patch from last Sunday)

Thanks for your help.On Wed, 2 Apr 2003, Jesse Vincent wrote:

On Wed, Apr 02, 2003 at 10:48:34AM -0800, bill@daze.net wrote:

Hello,

Sorry this is a bit long. The basic issue is I believe there may be a bug
somewhere since we got multiple copies of messages today after bringing RT
back up from a FastCGI lost communications error. RT correctly issued
EX_TEMPFAIL responses, but when RT came back up, we received one copy of
the message for each EX_TEMPFAIL response that was sent and one copy for
the successful send.

What MTA are you running?


http://www.bestpractical.com/rt – Trouble Ticketing. Free.

So. I’m wondering if the server-side of things is completing, but the
client is just timing out before it gets the “All Clear” signal.

  1. have you set up the SendmailArguments to use DeliveryMode=b and
    ErrorMode=m ?

I was just using the defaults.

I will add
Set($SendmailArguments,"-oi -t -ODeliveryMode=b -OErrorMode=m");
to my RT_SiteConfig.pm and wait and see what happens.

Sendmail brand sendmail doesn’t deliver in the background like
everything else. This change should make RT return much more quickly,
which eliminates the fastcgi issue you’re seeing.

  1. what happens if you turn the FastCGI timeout way up? (Like 10
    minutes)

I turned it up to 5 minutes this morning (from 90 seconds). I don’t think
our users would be happy if they had to site and wait 10 minutes before a
timeout. They aren’t going to be too happy with 5 minutes either. If
FastCGI chokes again, I will increase it at that point. However I just
think this is prolonging the inevitable. I would prefer to find out why
FastCGI is losing communications with the program.

Because the RT server is spending a long time actually trying to send
the email in its “foreground” and waiting until it’s delivered before it
returns. Other MTAs default to background delivery, so they return
quickly. The FastCGI server isn’t seeing anything come back from the RT
server so it assumes it has hung and returns an error.

http://www.bestpractical.com/rt – Trouble Ticketing. Free.

The FastCGI problem has gotten worse this afternoon. I will probably
reconfigure this system to use mod_perl tonight and see if that helps.

The timeout isn’t occurring just when sending mail. It occurs when trying
to bring up the login page or do just about anything. In fact this last
time, Stopping and starting Apache did not resolve the problem. I had to
reboot the server to get RT to respond again. I have tried researching
the FastCGI side of things, but the docs available at fastcgi.com are
quite lacking in the type of information I’m looking for.

I even tried the 10 minute timeout. The result was 10 minutes of waiting
and then an error 500 with the idle timeout error written to the apache
log.

I have turned on debugging to the rt.log file. Maybe it will turn up
something before I attempt mod_perl.

By the way, were would you prefer bug reports, and it there a public list
of known bugs in RT3?

I presume you’re running with mysql. When the system is non-responsive,
what does mysql processlist tell you?On Wed, Apr 02, 2003 at 03:25:02PM -0800, bill@daze.net wrote:

The FastCGI problem has gotten worse this afternoon. I will probably
reconfigure this system to use mod_perl tonight and see if that helps.

The timeout isn’t occurring just when sending mail. It occurs when trying
to bring up the login page or do just about anything. In fact this last
time, Stopping and starting Apache did not resolve the problem. I had to
reboot the server to get RT to respond again. I have tried researching
the FastCGI side of things, but the docs available at fastcgi.com are
quite lacking in the type of information I’m looking for.

I even tried the 10 minute timeout. The result was 10 minutes of waiting
and then an error 500 with the idle timeout error written to the apache
log.

I have turned on debugging to the rt.log file. Maybe it will turn up
something before I attempt mod_perl.

By the way, were would you prefer bug reports, and it there a public list
of known bugs in RT3?

Sendmail brand sendmail doesn’t deliver in the background like
everything else. This change should make RT return much more quickly,
which eliminates the fastcgi issue you’re seeing.


rt-users mailing list
rt-users@lists.fsck.com
http://lists.fsck.com/mailman/listinfo/rt-users

Have you read the FAQ? The RT FAQ Manager lives at http://fsck.com/rtfm

http://www.bestpractical.com/rt – Trouble Ticketing. Free.

I presume you’re running with mysql. When the system is non-responsive,
what does mysql processlist tell you?

After the reboot yesterday everything ran fine until earlier today. Here
is the processlist from when the timeouts started occurring:

Process List
Id User Host db Command Time State Info kill

13 rt_user localhost rt3 Query 3101 Copying to tmp
table SELECT DISTINCT main.* FROM Tickets main, Groups Groups_1,
Principals Principals_2, CachedGroupMembe

14 rt_user localhost rt3 Query 603 Locked SELECT
ACL.id from ACL, Groups, Principals, CachedGroupMembers WHERE
(ACL.RightName = ‘SuperUser’ O

15 rt_user localhost rt3 Query 3071 User lock
SELECT GET_LOCK(‘Apache-Session-3a88a0a85901f2ce2a01e1be0d07c63d’, 3600)

16 rt_user localhost rt3 Query 608 Locked UPDATE
Principals SET ObjectId=‘5007’ WHERE id=‘5007’

17 rt_user localhost rt3 Query 530 User lock
SELECT GET_LOCK(‘Apache-Session-209dd683b7d6e71a532db03520db854b’, 3600)

40 mysqltool localhost mysql Sleep
41 mysqltool localhost mysql Query
SHOW PROCESSLIST

We have 5 RT processes started by FastCGIServer.

I stopped and started Apache to regain response. I then did a processlist
and saw the above 5 processes still running and 5 new sleeping processes.
So I stopped and started mysql to clear out the old processes.

Our config:
FreeBSD 4.7
rt3.0.0
mysql 3.23.52
Sendmail 8.12.6p2
Apache/1.3.27 (Unix) mod_ssl/2.8.11 OpenSSL/0.9.6g mod_fastcgi/2.2.12

I presume you’re running with mysql. When the system is non-responsive,
what does mysql processlist tell you?

After the reboot yesterday everything ran fine until earlier today. Here
is the processlist from when the timeouts started occurring:

Process List
Id User Host db Command Time State Info kill

13 rt_user localhost rt3 Query 3101 Copying to tmp
table SELECT DISTINCT main.* FROM Tickets main, Groups Groups_1,
Principals Principals_2, CachedGroupMembe

That’s a ticket search.

14 rt_user localhost rt3 Query 603 Locked SELECT
ACL.id from ACL, Groups, Principals, CachedGroupMembers WHERE
(ACL.RightName = ‘SuperUser’ O

That’s an ACL check.

15 rt_user localhost rt3 Query 3071 User lock
SELECT GET_LOCK(‘Apache-Session-3a88a0a85901f2ce2a01e1be0d07c63d’, 3600)

This is someone who has an existing page view in progress that’s waiting
on the other page view to finish, so they can unlock their session

16 rt_user localhost rt3 Query 608 Locked UPDATE
Principals SET ObjectId=‘5007’ WHERE id=‘5007’

That’s a user creation in progress

17 rt_user localhost rt3 Query 530 User lock
SELECT GET_LOCK(‘Apache-Session-209dd683b7d6e71a532db03520db854b’, 3600)

This is someone who has an existing page view in progress that’s waiting
on the other page view to finish, so they can unlock their session

We have 5 RT processes started by FastCGIServer.

I stopped and started Apache to regain response. I then did a processlist
and saw the above 5 processes still running and 5 new sleeping processes.
So I stopped and started mysql to clear out the old processes.

Our config:
FreeBSD 4.7
rt3.0.0
mysql 3.23.52

Can you verify that your RT database tables are InnoDB tables and not
MyISAM tables? (Did I already ask you this?)

Sendmail 8.12.6p2
Apache/1.3.27 (Unix) mod_ssl/2.8.11 OpenSSL/0.9.6g mod_fastcgi/2.2.12


rt-users mailing list
rt-users@lists.fsck.com
http://lists.fsck.com/mailman/listinfo/rt-users

Have you read the FAQ? The RT FAQ Manager lives at http://fsck.com/rtfm

http://www.bestpractical.com/rt – Trouble Ticketing. Free.

Can you verify that your RT database tables are InnoDB tables and not
MyISAM tables? (Did I already ask you this?)

No you didn’t and that is a good question to ask. The tables were all
MyISAM. Not sure how that happened since the CREATE table statements
specify table=innodb. Anyway, I have converted all of the tables to
InnoDB and stopped and started both mysql and apache to start with a clean
slate. Now to wait and see if that was the cause of the problem.

Thanks Jesse!