Large Queue size problems

Nick_Geron · April 19, 2010, 11:06pm

Hi all,

Our company currently runs RT for customer support interactions as well
as a central email abuse reporting system for customer IP blocks.
Recently we setup a feedback system with a large hosted mail provider
and we saw the level of incoming abuse/spam reports increase to 10s of
thousands a day. I have been trying to identify the source of an issue
that essentially boils down to this: When our RT queues are ‘large’
(over 100K tickets) the UI struggles to complete operations or consumes
all system resources.

To mitigate the issues, we have been using rt-shredder to cull out the
excess, but I have a backed up DB to test with. What I have found is
that on a particular type of search, the returned DB data set is so
large the apache process handling the request consumes almost all
available memory on the RT host, leading to swapping and/or a nasty
segfault.

Our setup involves three hosts:

1 dedicated Gentoo based DB host running MySQL 5.0 with innodb based
tables. 2G ram and 1 64bit quad core xeon running under VMWare vSphere 4.
2 load balanced Gentoo based apache servers running RT3.8.2 with the
same proc/cpu specs as the DB host.

The magic search that overloads apache works as follows:

Click on our large queue from RT at a glance Quick Search. The queue
in question contains 184744 new, 7 open and 7731 stalled tickets in my
dumped database.
Click on any ticket on any of the returned pages.

Apache then consumes so much memory that we have to kill the process at
best or restart the server at worst. In a browser, the ticket page
often fails to load or may be partially completed before the host
resources are exhausted.

The MySQL query also reveals that the last operation in this state
returns a large chunk of data, and often pops up in the slow query log
with an average execution time of 15 seconds. My first thought was
that we had an issue with out database. However several days of testing
indicated that this problem was directly related to apache/mod_perl
having to drink from a firehose.

Here’s the entry that always logs to the slow query log:

Time: 100419 16:58:07

User@Host: rtadmin[rtadmin] @ rt-test[172.20.0.99]

Query_time: 15 Lock_time: 0 Rows_sent: 184751 Rows_examined: 377948

use rt3;
SELECT main.* FROM Tickets main WHERE (main.Status != ‘deleted’) AND
(main.Queue = ‘11’ AND ( main.Status = ‘new’ OR main.Status = ‘open’ )
) AND (main.EffectiveId = main.id) AND (main.Type = ‘ticket’) ORDER BY
main.id ASC;

Interestingly, searching for the specific ticket via the main page
search box brings up the typed in ticket quickly and without incident.
Another tidbit, is this appears to involve some level of caching. If I
follow the above steps, then kill the process and finally select another
ticket NOT in the large queue (one off my own top 15 tickets) then the
same behavior is observed. Also, I see queries in the MySQL query log
that include data related to the previous search. I have performed a
battery of tests stopping daemons, clearing mason cache, clearing
browser cache and the like to figure all this out.

The one detail about our setup that I suspect plays a part here is that
a previous admin wrote a series of email handling scripts that always
re-writes the sender address before handing the email off to
rt-mailgate. We suspected at one point that part of the issue was
related to the query that looks up other tickets created by the sender.
An ‘explain’ in MySQL did show that the volume of data was forcing an on
disk temp table and filesort, but I haven’t directly correlated a slow
DB operation to the consumption of memory on the apache side. That
email handling apparatus is currently being replaced, but does having
the same ‘Creator’ on 100K + tickets sound like a really bad thing, or
is this normal for large shops?

Can I get some feedback on how our system compares to others using RT?
How many tickets do you collect in a day? What rough system specs are
you running on? Is this normal for large volumes of tickets? Is the
only answer ever more RAM?

Also, I updated the test rig to 3.8.7 from 3.8.2 today including all DB
upgrade operations with no change.

Any assistance would be very much appreciated. Our current game plan is
to build an archiving system to keep the queue numbers down, but at some
point large queues may be the norm for us. Thanks.

-Nick Geron

Jesse_Vincent · April 20, 2010, 4:09pm

Hi all,

Our company currently runs RT for customer support interactions as
well as a central email abuse reporting system for customer IP
blocks. Recently we setup a feedback system with a large hosted
mail provider and we saw the level of incoming abuse/spam reports
increase to 10s of thousands a day. I have been trying to identify
the source of an issue that essentially boils down to this: When
our RT queues are ‘large’ (over 100K tickets) the UI struggles to
complete operations or consumes all system resources.

That’s actually a known issue with the “next” and “prev” button mapping
on the ticket pages. My coworker Ruslan tells me that we’ve actually
got a fix for it in 3.8.8RC2.

Can you try out 3.8.8RC2 and tell me if that fixes it for you?

Best,
Jesse

Nick_Geron · April 21, 2010, 5:48pm

Hi all,

Our company currently runs RT for customer support interactions as
well as a central email abuse reporting system for customer IP
blocks. Recently we setup a feedback system with a large hosted
mail provider and we saw the level of incoming abuse/spam reports
increase to 10s of thousands a day. I have been trying to identify
the source of an issue that essentially boils down to this: When
our RT queues are ‘large’ (over 100K tickets) the UI struggles to
complete operations or consumes all system resources.

That’s actually a known issue with the “next” and “prev” button mapping
on the ticket pages. My coworker Ruslan tells me that we’ve actually
got a fix for it in 3.8.8RC2.

Can you try out 3.8.8RC2 and tell me if that fixes it for you?

Best,
Jesse

Discover RT’s hidden secrets with RT Essentials from O’Reilly Media.
Buy a copy at http://rtbook.bestpractical.com

I’ve just gotten 3.8.8RC2 running and, yes, the issue appears to have
been corrected. Loading tickets from the large queue’s new/open search
results page now brings up the ticket page quickly. Thanks for the help.

-Nick Geron