RTx-shredder with a very large database

Hello,

I run a large RT installation. We have around 20 queues, 200-300 users,
and many more requestors. I am currently running RT 3.0.12 on the
production RT, and RT 3.4.5 on a development machine.

Unfortunately we get a lot of spam going into the RT queues. The previous
administrator of RT set up a way of dealing with this: before going into
RT, all mail is checked with SpamAssassin, and if it reports a positive,
the mail is put into an RT queue named “ZZSpam”. Unfortunately over the years
this “ZZSpam” queue has gotten extremely large, with about 250,000
tickets, and is probably taking up the bulk of the database because many
spams come with large attachments.

Because this ZZSpam queue has become so unwieldy, I’d like to clean out
all tickets from that queue, as well as all users, attachments, etc. that
are associated with those tickets. I’ve tried the rtx-shredder program,
but it is extremely slow. I issued this command on the devel RT, which
is a dual 2ghz Xeon machine with 1 GB of RAM, and has no load on it:
time /opt/rt3/sbin/rtx-shredder --force --plugin 'Tickets=queue,ZZSpam;limit,20'

And it took 23 minutes and 17 seconds to complete. Obviously if it takes
that long to delete 20 tickets, I cannot use the tool to clean out 250,000
tickets.

Is there anything I can do to make rtx-shredder significantly faster? If
not, does anyone have any advice as to how I should go about cleaning out
this massive queue?

thanks for any help,
Jim Faulkner

And it took 23 minutes and 17 seconds to complete. Obviously if it takes
that long to delete 20 tickets, I cannot use the tool to clean out 250,000
tickets.

You mean you want it to take less than roughly 6.8 months at 24x7?! :)

duncan
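For the record, the extrapolation behind that figure, as a quick Python sketch (assuming 30-day months, which is why it lands a shade under 6.8):

```python
# Extrapolating Jim's measurement: 23m17s per 20-ticket batch,
# scaled to 250,000 tickets running around the clock.
batch_seconds = 23 * 60 + 17            # 1397 s for 20 tickets
total_seconds = batch_seconds * (250_000 / 20)
months = total_seconds / 86_400 / 30    # seconds -> days -> 30-day months
print(round(months, 1))                 # 6.7
```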

-----Original Message-----
From: rt-users-bounces@lists.bestpractical.com
[mailto:rt-users-bounces@lists.bestpractical.com] On Behalf
Of Duncan Shannon
Sent: Monday, February 27, 2006 12:16 PM
To: Jim Faulkner; rt-users@lists.bestpractical.com
Subject: RE: [rt-users] RTx-shredder with a very large database


You mean you want it to take less than roughly 6.8 months at 24x7?! :)

duncan


As the number of tickets in the database decreases, I am sure that the
execution time would speed up in proportion. It’s those first 100K or
so that you have to worry about :)

Eric

At Monday 2/27/2006 01:37 PM, Jim Faulkner wrote:

/opt/rt3/sbin/rtx-shredder --force --plugin 'Tickets=queue,ZZSpam;limit,20'

And it took 23 minutes and 17 seconds to complete. Obviously if it takes
that long to delete 20 tickets, I cannot use the tool to clean out 250,000
tickets.

Is there anything I can do to make rtx-shredder significantly faster? If
not, does anyone have any advice as to how I should go about cleaning out
this massive queue?

thanks for any help,
Jim Faulkner

Hello Jim,

This doesn’t solve the problem, but we have been able to shred about 330
tickets per hour, about 4,000 per night. So you should expect faster
performance than you’ve been getting.

How about shredding a single ticket while monitoring all the database
statements that it invokes? It may help you determine exactly what is
slowing things down.
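A minimal sketch of that idea, using SQLite's statement-trace hook as a stand-in (on MySQL you would enable the general query log instead; the table here is a simplified stand-in for the real RT schema):

```python
# Record every SQL statement a single delete triggers, then inspect them.
import sqlite3

conn = sqlite3.connect(":memory:")
statements = []
conn.set_trace_callback(statements.append)  # called with each SQL string executed

conn.execute("CREATE TABLE Tickets (id INTEGER PRIMARY KEY, Queue TEXT)")
conn.execute("INSERT INTO Tickets VALUES (1, 'ZZSpam')")
conn.execute("DELETE FROM Tickets WHERE id = 1")
conn.commit()

for stmt in statements:
    print(stmt)
```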

Good luck,
Steve


As the number of tickets in the database decreases, I am sure that the
execution time would speed up in proportion. It’s those first 100K or
so that you have to worry about :)

You assume the tables actually physically shrink when rows are deleted.
This is not true for MySQL InnoDB tables, nor for PostgreSQL tables,
without further maintenance (OPTIMIZE TABLE and VACUUM, respectively).

Hi.

Jim Faulkner wrote:

Is there anything I can do to make rtx-shredder significantly faster? If
not, does anyone have any advice as to how I should go about cleaning out
this massive queue?

I’m afraid I have a similar story, no spam queue (spam is filtered
before it reaches RT), but large amounts of redundant data.
I have a similar size RT installation.
I’ve installed rtx-shredder and tried that.
See my earlier post of 27Feb2006.

I thought that doing the deletes in MySQL directly (that is, not using
the RT API) might be faster, but I’m afraid that isn’t necessarily so.

I have created a script that does this, and it still shows considerably
long processing times.

What needs to be done to delete a ticket is this:

/~~~~~~~~~~~~~~~~~~~~~~~~~~~
select * from Tickets where id = <ticket-id>;
<1> select id from Transactions where ObjectId = <ticket-id>;
<4> select id from ObjectCustomFieldValues where ObjectId = <ticket-id>;
<3> select id from Links where LocalTarget = <ticket-id>;
<2> select id from Attachments where TransactionId = <transaction-id>
    ( for each transaction id retrieved with <1> )

- delete from Attachments where id in ( the Attachments retrieved with <2> )
- delete from Links where id = ( each Link in <3> )
- delete from ObjectCustomFieldValues where id in ( the
  ObjectCustomFieldValues in <4> )
- delete from Transactions where id in ( the Transactions in <1> )
- delete from Tickets where id = <ticket-id>;

Additionally, stuff needs to be deleted from
Principals, CachedGroupMembers, GroupMembers and Groups.

For each ticket:
- delete from Principals where
  id = Groups.id and Groups.Instance = <ticket-id>
- delete from CachedGroupMembers where GroupId = Groups.id
  and Groups.Instance = <ticket-id>
- delete from GroupMembers where GroupId = Groups.id and
  Groups.Instance = <ticket-id>
- delete from Groups where Instance = <ticket-id>
____________________________________________

There may be more that could be deleted, namely the Users (requestors) who
created the tickets, etc.

I’m toying with the idea of a script that runs each night and cleans out
1000 tickets or so. I have 90,000 tickets to delete in the first instance;
after that, a regular cleanup of resolved tickets.

Hope this contributes.

Kind regards.
Luke.


Luke

That helps a lot! I’m in the midst of writing my own tool to clean out
the database, and I was just trying to figure out that exact information.
Thank you very much!

Jim Faulkner

Jim Faulkner wrote:

That helps a lot! I’m in the midst of writing my own tool to clean out
the database, and I was just trying to figure out that exact information.
Thank you very much!

Why? When you’re done you will have RTx::Shredder, except not
as well tested and likely to break on some future version of RT.

-Todd

Todd Chapman wrote:

Why? When you’re done you will have RTx::Shredder, except not
as well tested and likely to break on some future version of RT.

Not sure whether web-interface invoked actions and RT API processed
requests will be equal in performance to straight SQL from a Perl script.
I would have to test both to give an objective reply.
I have tried both the API and direct SQL; both are slow.

BTW.
The stuff that I put in my script comes from the WIKI.
http://wiki.bestpractical.com/index.cgi?DatabaseAdmin

Kind regards.
Luke


Luke


I would guess that it’s the database work that is taking
the most time. You are guessing that it is the script
overhead. It would be better to benchmark Shredder before
you do a bunch of potentially wasted work. Perhaps adding
an index or two would speed things up…

-Todd


Bill R. Williams (brw@etsu.edu) wrote:

How about (for massive cleanup)

  • Dump the rt3 database (mysqldump rt3 > rt3.sql), plus whatever
    other switches might be needed to include the DROP TABLE IF EXISTS /
    CREATE TABLE directives (e.g. --add-drop-table).
  • Stop MySQL or at least the RT database.
  • Process that dump file with a perl script which drops the rows
    associated with the tickets you want to delete – a filter to
    omit the stuff as mentioned in your notes (below).
    (massDelete.pl <rt3.sql >rt3clean.sql)
  • Start MySQL (if stopped)
  • mysql -u rt_user <rt3clean.sql
    which will recreate the entire DB without the stuff you deleted.

I know that this sounds like killing a fly with a shotgun, but it
might actually be the easiest (quickest?) way to go. Of course the
downside is that you would want to stop your RT database until you
have filtered your dumped .sql, 'cause we’re talking DROP/CREATE on
every table. BUT it might not take that long to process.
On the upside, your DB would be nice, clean, and organized.

Just a thought. I’d certainly be interested in your thoughts on the
method. 'Cause I am uncomfortable with data that I can’t clean.
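A minimal sketch of the filter step, assuming a dump written with --skip-extended-insert (one row per INSERT statement) and matching only the Tickets table; a real filter would also need to handle Transactions, Attachments, and the other tables named earlier:

```python
# Stream a mysqldump file and drop INSERTs for the tickets being purged.
import re

DOOMED_TICKETS = {7, 9}  # example ids of tickets to purge

def filter_dump(lines, doomed):
    """Yield dump lines, skipping Tickets INSERTs whose id is doomed."""
    pat = re.compile(r"INSERT INTO `?Tickets`? VALUES \((\d+),")
    for line in lines:
        m = pat.search(line)
        if m and int(m.group(1)) in doomed:
            continue
        yield line

# Toy demonstration; a real run would stream: massDelete.py <rt3.sql >rt3clean.sql
sample = [
    "INSERT INTO `Tickets` VALUES (7,'ZZSpam');",
    "INSERT INTO `Tickets` VALUES (8,'General');",
]
kept = list(filter_dump(sample, DOOMED_TICKETS))
print(kept)  # only ticket 8 survives
```

Streaming line by line sidesteps the 14 GB editor-buffering problem Luke describes below.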

My response:
/~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a good approach in theory; however, the drawback here is how to
delete stuff, for example the attachments, in the dump file.
I have tried editing the dump file. Given that my dump file is 14 GB, I
haven’t found an editor that can buffer a file of that size.
I use vi; it can’t do that, or maybe it can but I haven’t found out how.
I’ve tried several other editors which claim to be able to do it, but
none effectively can.
By far the largest proportion of data in the dump file is in the
attachments table.
It would take some experimentation before a watertight script can be run
over the dump file. Given the amount of time it takes to do that, it
would be very cumbersome to test a script. Testing a smaller dump file
may be an option but it wouldn’t be wise to apply the outcomes of that
to the larger file and assume that all is well.

_____________________________________________________________________

Kind regards.

Luke

This is a good approach in theory, however the drawback here is how to delete
stuff, for example the attachments in the dump file.

I’ve been experimenting… and I was able to delete 350,000 rows from the
Tickets table in 77 minutes last night. Perhaps the “Attachment” table
would take a lot longer for deleting so many rows, but even so it seems to
me that the limitation here is in doing the SELECTs to figure out exactly
what needs to be deleted, rather than in doing the deletions themselves.

In my bash script I’m trying to do SELECTs like RT would, from viewing the
output of MySQL’s general query log. For example, this statement:
SELECT DISTINCT main.Id AS id FROM Attachments main, Transactions
Transactions_1, Tickets Tickets_2 WHERE ((Tickets_2.EffectiveId =
'$TICKETNUMBER')) AND ((Transactions_1.ObjectId = Tickets_2.id)) AND
((Transactions_1.ObjectType = 'RT::Ticket')) AND ((main.TransactionId =
Transactions_1.id))

seems to be a buttload faster than just:
SELECT id from Attachments where TransactionId = $TRANSACTIONID

I am by no means a SQL expert though, so maybe I’m missing some stuff by
using the former statement rather than the latter.
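As a sanity check, the two query shapes do return the same attachment ids; here is a toy SQLite stand-in (trimmed schema, invented ids):

```python
# Compare Jim's joined SELECT with the naive two-step lookup.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Tickets (id INTEGER PRIMARY KEY, EffectiveId INTEGER);
CREATE TABLE Transactions (id INTEGER PRIMARY KEY, ObjectId INTEGER, ObjectType TEXT);
CREATE TABLE Attachments (Id INTEGER PRIMARY KEY, TransactionId INTEGER);
INSERT INTO Tickets VALUES (7, 7);
INSERT INTO Transactions VALUES (100, 7, 'RT::Ticket'), (101, 7, 'RT::Ticket');
INSERT INTO Attachments VALUES (500, 100), (501, 101);
""")

TICKET = 7

# One pass: the joined query, as captured from the general log.
joined = {r[0] for r in conn.execute("""
    SELECT DISTINCT main.Id AS id
    FROM Attachments main, Transactions Transactions_1, Tickets Tickets_2
    WHERE Tickets_2.EffectiveId = ?
      AND Transactions_1.ObjectId = Tickets_2.id
      AND Transactions_1.ObjectType = 'RT::Ticket'
      AND main.TransactionId = Transactions_1.id
""", (TICKET,))}

# Two steps: one SELECT per transaction id, the slow shape.
two_step = set()
for (txn,) in conn.execute(
        "SELECT id FROM Transactions WHERE ObjectId = ?", (TICKET,)):
    for (aid,) in conn.execute(
            "SELECT Id FROM Attachments WHERE TransactionId = ?", (txn,)):
        two_step.add(aid)

print(joined == two_step)  # True
```

The speed difference comes from round trips, not results: one joined query versus one query per transaction row.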

Jim Faulkner