Thoughts: uniq attachment and multiple references?

After my continuing saga with importing tickets (tip: turn off MySQL
logging first lest you cause later embarrassment), I’m struck by a blatant
waste of space in RT (and pretty much all ticketing systems that I’ve
seen).

To wit, duplicate messages are kept on the backend in however many
copies were received. This is a Bad Thing™ when you’ve got nearly a gig of
data, and suspect that you can save ~5% by keeping only one copy of a given
message. This is a nightmare when you know there might be another gig in a
separate queue collecting routine reports to import into RT :wink:

Looking at the ‘Attachments’ table, I think that you could redefine it as
follows:

Table Attachments:
id, TransactionID, Parent, MessageID, Subject, Filename,
ContentType, ContentEncoding, *ContentID*, Headers, Creator,
Created.

ContentID is then used as an index into another table (I know, more
indirection) that stores the actual content (which the replaced field
‘Content’ currently does).

Table Content (no back references needed):
id, Content (blob)
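
A minimal sketch of the two-table layout, using Python and SQLite purely for illustration (RT itself is Perl on MySQL). The Digest column and the helper names open_db, store_content and add_attachment are assumptions, not part of the proposal; some lookup key like a digest is needed to find an existing copy without comparing blobs, and the Attachments column list is abbreviated.

```python
import hashlib
import sqlite3


def open_db():
    """Create an in-memory database with the proposed two-table layout."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Content (
            id      INTEGER PRIMARY KEY,
            Digest  TEXT NOT NULL UNIQUE,  -- assumption: lookup key, not in the proposal
            Content BLOB NOT NULL
        );
        CREATE TABLE Attachments (         -- abbreviated column list
            id            INTEGER PRIMARY KEY,
            TransactionID INTEGER,
            Subject       TEXT,
            ContentID     INTEGER REFERENCES Content(id)
        );
    """)
    return conn


def store_content(conn, body):
    """Return the Content id for body, inserting it only if it is new."""
    digest = hashlib.sha256(body).hexdigest()
    row = conn.execute(
        "SELECT id FROM Content WHERE Digest = ?", (digest,)
    ).fetchone()
    if row is not None:
        return row[0]  # an identical body is already stored
    cur = conn.execute(
        "INSERT INTO Content (Digest, Content) VALUES (?, ?)", (digest, body)
    )
    return cur.lastrowid


def add_attachment(conn, transaction_id, subject, body):
    """Insert an attachment row that points at the shared content."""
    content_id = store_content(conn, body)
    cur = conn.execute(
        "INSERT INTO Attachments (TransactionID, Subject, ContentID) "
        "VALUES (?, ?, ?)",
        (transaction_id, subject, content_id),
    )
    return cur.lastrowid
```

With this layout, two attachments carrying the same body produce two Attachments rows and a single Content row.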

The coding changes would appear to be in RT/Attachment.pm: Create(),
Content(), Quote(). (I’ve quite likely missed other bits, but code
mentioning Content seems to call the above routines happily.)

Would this be a good/bad thing to be doing (ie, code up and provide
patches for), or has Bruce been visiting too many (Amsterdam) Coffee Shops
in his lunch break? :wink:

                         Bruce Campbell                            RIPE
                                                                    NCC
                                                             Operations

Really, I’m not sure that 50 megs on a gig actually makes enough
difference for this to be worth the added complexity.
What I’d be curious about is how many of these attachments are bitwise
identical. Single-instance storage is, in fact, quite cool, if we can get
it just right. I’d be happy to see this in 2.2, if we can make sure it’s
genuinely happy and it doesn’t impact performance. The big place you missed
in your listing of things that would need changing is in Tickets.pm,
so we can continue to properly search for tickets by content value.
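
One way to answer the "how many are bitwise identical" question before committing to a schema change is to hash every stored body and count repeats. The sketch below is an illustration only: duplicate_stats is a hypothetical helper, the bodies are assumed to have been fetched from the Attachments table already, and equal SHA-256 digests are treated as meaning identical bytes.

```python
import hashlib
from collections import Counter


def duplicate_stats(bodies):
    """Summarise how much space identical attachment bodies take up.

    bodies is any iterable of bytes objects, one per stored attachment.
    """
    copies = Counter()  # digest -> number of stored copies
    size_of = {}        # digest -> size in bytes of one copy
    for body in bodies:
        digest = hashlib.sha256(body).hexdigest()
        copies[digest] += 1
        size_of[digest] = len(body)

    total = sum(size_of[d] * n for d, n in copies.items())
    unique = sum(size_of.values())
    saved = total - unique
    return {
        "total_bytes": total,
        "unique_bytes": unique,
        "saved_bytes": saved,
        "saved_fraction": saved / total if total else 0.0,
    }
```

The saved_fraction figure is the number to compare against the ~5% estimate.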

Man. I wish my coffee shops were as cool as yours. :wink: Maybe I can find a
contract gig in Amsterdam or something.

-j

On Thu, Dec 27, 2001 at 08:23:09PM +0100, Bruce Campbell wrote:

> [snip]


rt-devel mailing list
rt-devel@lists.fsck.com
http://lists.fsck.com/mailman/listinfo/rt-devel

http://www.bestpractical.com/products/rt – Trouble Ticketing. Free.

> Really, I’m not sure that 50 megs on a gig actually makes enough
> difference for this to be worth the added complexity.
> What I’d be curious about is how many of these attachments are bitwise
> identical. Single-instance storage is, in fact, quite cool, if we can get

50 megs over 1 gig isn’t much. It’s the wastage in the first place that I
dislike, and the possibility of mail loops having a much more devastating
effect than they should. Ideally, a mail loop results in a few hundred
links to the same data, not several hundred copies of the same data. (Even
more ideally, a mail loop doesn’t happen. Ha!)

( One of these days I’ll actually implement my SQL-based mail archive
system which stores messages on a per line basis, saving oodles on
subject lines alone. One day I’ll be really bored :wink: )

> it just right. I’d be happy to see this in 2.2, if we can make sure it’s
> genuinely happy and it doesn’t impact performance. The big place you missed
> in your listing of things that would need changing is in Tickets.pm,
> so we can continue to properly search for tickets by content value.

Hrm. And the big gotcha is the upgrade hit, as this would essentially
‘lose’ all ticket content until some migration is complete. It’s a NULL
field, so you could delete it without impact, and thus the _Init could set
some flags deciding which table to use at runtime.
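
A rough sketch of that runtime switch, again in Python with SQLite rather than RT's Perl. Everything named here is hypothetical: the AttachmentStore class, and the assumption that during migration the Attachments table can carry both the old Content column and the new ContentID column side by side. The flags are set once at construction time, which is roughly the role _Init would play.

```python
import sqlite3


class AttachmentStore:
    """Reads attachment content from whichever table currently holds it."""

    def __init__(self, conn):
        self.conn = conn
        # PRAGMA table_info yields one row per column; index 1 is the name.
        cols = {row[1] for row in conn.execute("PRAGMA table_info(Attachments)")}
        self.has_legacy = "Content" in cols    # old inline column still present
        self.has_shared = "ContentID" in cols  # new indirection available

    def content(self, attachment_id):
        """Return the body for one attachment, or None if it is not found."""
        if self.has_shared:
            row = self.conn.execute(
                "SELECT c.Content FROM Attachments a "
                "JOIN Content c ON c.id = a.ContentID WHERE a.id = ?",
                (attachment_id,),
            ).fetchone()
            if row is not None:
                return row[0]  # already migrated
        if self.has_legacy:
            row = self.conn.execute(
                "SELECT Content FROM Attachments WHERE id = ?",
                (attachment_id,),
            ).fetchone()
            if row is not None:
                return row[0]  # not migrated yet, read the old column
        return None
```

Rows that have not been migrated keep working through the old column, so the content is never 'lost' while the migration runs.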

I think I can see how Tickets.pm would be changed (special case though),
and can see how content searching could be made lots faster.
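
One possible reading of "lots faster", sketched under assumptions: if each distinct body lives once in the Content table, a content search only has to examine each distinct body once and can then fan out to every attachment that references it. The function name transactions_matching, the plain-text Content column and the use of SQLite's instr() are illustrative choices, not RT's actual search code.

```python
import sqlite3


def transactions_matching(conn, needle):
    """Return TransactionIDs whose attachment content contains needle.

    The filter touches only the Content table, so in principle each
    distinct body is examined once, however many attachments share it.
    """
    rows = conn.execute(
        """
        SELECT a.TransactionID
        FROM Content c
        JOIN Attachments a ON a.ContentID = c.id
        WHERE instr(c.Content, ?) > 0
        ORDER BY a.TransactionID
        """,
        (needle,),
    )
    return [row[0] for row in rows]
```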

> Man. I wish my coffee shops were as cool as yours. :wink: Maybe I can find a
> contract gig in Amsterdam or something.

heh. you’d be amazed at what the employment page of my company kicks up
from time to time. We’re only a small company though, with a small effect
on the Internet and running small, unimportant databases :wink:

                         Bruce Campbell                            RIPE
                      I saw you Flying,                             NCC
    I saw you Flying at the Coffee Shop                      Operations