RT::Extension::ExternalStorage

Hi,

I just want to give some feedback from our internal testing of this
extension.

Attachment table rows: 1593072
Attachment table datafile size: 61G.

It took 105 minutes to extract the attachments.
After extraction, the attachment directory size was 34G.

After an “optimize table rt4.Attachments;” run (which took 40 minutes),
the attachment table datafile size was 13G.

Attachment datafile + attachment directory = 13G + 34G = 47G

Compared to the previous 61G datafile size, we saved 14G.

After some checks for duplicate attachments, which this extension only
extracts once (which actually isn’t mentioned in the documentation), I
found out that the de-duplication feature saved 1.5G.

There seems to be a significant overhead when saving binary data in MySQL.

The only annoying thing about this extension is that, even if you have it
configured to save attachments on disk, it first saves the attachments
in the DB and you then have to extract them.
This makes a regular “optimize table rt4.Attachments;” necessary.
As this operation locks the table (up to MySQL version 5.6.17), you have
to plan regular downtime.

Maybe you can introduce an option to save the attachments directly on disk.

Chris

I just want to give some feedback from our internal testing of this
extension.

Thanks for the feedback!

Attachment table rows: 1593072
Attachment table datafile size: 61G.

It took 105 minutes to extract the attachments.
After extraction, the attachment directory size was 34G.

After an “optimize table rt4.Attachments;” run (which took 40 minutes),
the attachment table datafile size was 13G.

Attachment datafile + attachment directory = 13G + 34G = 47G

Compared to the previous 61G datafile size, we saved 14G.

Hm – interesting. I wouldn’t have expected MySQL to be that
inefficient at binary storage. Assuming the 13G is all text, that
means that it was using 61G - 13G = 48G to store 34G worth of data.
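
The arithmetic above can be checked with a few lines of Python. This is only a sketch using the figures reported in this thread; like the paragraph above, it assumes the 13G left in the table is all text.

```python
# Figures reported in this thread, in gigabytes.
before_db = 61   # Attachments datafile before extraction
after_db = 13    # Attachments datafile after extraction and "optimize table"
external = 34    # size of the attachment directory on disk

total_after = after_db + external        # datafile + directory
saved = before_db - total_after          # net space saved
used_for_binary = before_db - after_db   # assumes the remaining 13G is all text
overhead_ratio = used_for_binary / external

print(f"total after: {total_after}G, saved: {saved}G")
print(f"{used_for_binary}G in MySQL held {external}G of files "
      f"(about {overhead_ratio:.2f}x)")
```

Under that assumption, MySQL needed roughly 1.4 times the on-disk size to hold the same attachments.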

After some checks for duplicate attachments, which this extension only
extracts once (which actually isn’t mentioned in the documentation), I
found out that the de-duplication feature saved 1.5G.

Good catch that the de-duplication wasn’t documented; I’ll add a note
about that shortly.
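
For readers unfamiliar with the idea, de-duplication of this kind is commonly done by naming each stored file after a hash of its content. The thread does not say how the extension detects duplicates, so the sketch below is an assumption: it uses SHA-256 and a made-up directory layout, and it is written in Python for illustration rather than in RT's Perl.

```python
import hashlib
import tempfile
from pathlib import Path


def store(root: Path, content: bytes) -> str:
    """Write content under its SHA-256 digest; identical content is written once."""
    key = hashlib.sha256(content).hexdigest()
    path = root / key[:2] / key  # hypothetical layout, not the extension's
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(content)
    return key


root = Path(tempfile.mkdtemp())
logo = b"same signature image"
k1 = store(root, logo)
k2 = store(root, logo)            # duplicate upload, nothing new is written
k3 = store(root, b"other file")
stored_files = [p for p in root.rglob("*") if p.is_file()]
print(k1 == k2, len(stored_files))
```

Three uploads end up as two files on disk, which is the effect behind the 1.5G saving mentioned above.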

The only annoying thing about this extension is that, even if you have it
configured to save attachments on disk, it first saves the attachments
in the DB and you then have to extract them.

This is intentional. The code path that decides whether to leave the
data in the database needs information about other objects, which is
not available at EncodeLOB time. Storing the data temporarily in the
database also makes it robust against hard drives filling up or network
storage being unavailable (in the case of S3 or Dropbox).

This makes a regular “optimize table rt4.Attachments;” necessary.
As this operation locks the table (up to MySQL version 5.6.17), you have
to plan regular downtime.

Hm – why is the “optimize table” necessary? Does MySQL not reuse the
BLOB space until it sees such?

  • Alex

This makes a regular “optimize table rt4.Attachments;” necessary.

As this operation locks the table (up to MySQL version 5.6.17), you have
to plan regular downtime.
Hm – why is the “optimize table” necessary? Does MySQL not reuse the
BLOB space until it sees such?

After digging into the MySQL documentation, I found out that MySQL
does indeed reuse the freed space.

Thanks for the hint.

Chris

Hi,

On 11/02/2015 17:29, Christian Loos wrote:

[…]
The only annoying thing about this extension is that, even if you have it
configured to save attachments on disk, it first saves the attachments
in the DB and you then have to extract them.
This makes a regular “optimize table rt4.Attachments;” necessary.
As this operation locks the table (up to MySQL version 5.6.17), you have
to plan regular downtime.

Maybe you can introduce an option to save the attachments directly on disk.

We were looking for the same option, so we developed it.
The goals are:

  • store file on disk,
  • reuse some files instead of storing them every time,
  • create a CF which could propose a list of preloaded files to “attach”,
  • be able to populate this CF directly from the filesystem (through a
    script)
  • etc.

Plus, we wanted to keep the way RT manages files as much as possible.

The code consists of:

  • an SQL file, needed to store the references to the filesystem
    (the attachments table was not suitable because of transactionid, and
    we wanted to keep most of RT’s code)
  • Document.pm, a class inheriting from Record,
  • some minor changes to Attachment.pm and CustomField.pm,
  • some minor changes to share/html/Ticket/Attachment/dhandler
  • some minor changes to share/html/Download/CustomFieldValue/dhandler

The database used is PostgreSQL, but I think it would be easy to port to MySQL.

Let me know if you are interested.

This is a work in progress: no responsibility is taken for production
environments :)

Easter-eggs Spécialiste GNU/Linux
44-46 rue de l’Ouest - 75014 Paris - France - Métro Gaité
Phone: +33 (0) 1 43 35 00 37 - Fax: +33 (0) 1 43 35 00 76
emanganneau@easter-eggs.com - http://www.easter-eggs.com

Hi,

thanks for the suggestion, but RT::Extension::ExternalStorage works
perfectly for us. Also, the extension's functionality is already included
in RT 4.4, so we will stay with it.

Chris

Hi,

thanks for the suggestion, but RT::Extension::ExternalStorage works
perfectly for us. Also, the extension's functionality is already included
in RT 4.4, so we will stay with it.

I looked over ExternalStorage and had the same questions as you did (in
your first post): direct storage on disk and reuse of files.
ExternalStorage is a backup solution: the “master” data are in the DB,
and that’s not what I wanted.

I also needed a CF that offers a choice among uploaded files without
copying them into attachments (think of a signature image in every
email).

The way data are stored in RT makes it very difficult to separate data
from files (attachments, customfieldvalues, objectcustomfieldvalues and
all their access and storage procedures): that’s why I introduced a new
object (Document) and a minimum of code, in order to preserve the way RT
deals with associated data.

We are currently packaging the extension and will provide a tgz soon.

Regards,

I looked over ExternalStorage and had the same questions as you did
(in your first post): direct storage on disk and reuse of files.
ExternalStorage is a backup solution: the “master” data are in the
DB, and that’s not what I wanted.

ExternalStorage moves data out of the database as soon as it is
successfully stored externally. It is not kept in the database, and
the database does not need to be vacuumed. The initial storage in the
database is needed to ensure durability of the data, in case of file
permission failure, or network failure (in the case of the cloud
storage options). It is not a backup solution.
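
The flow described here (store in the database first, move out once the external write has succeeded) can be sketched as follows. This is not the extension's code: `Row`, `externalize` and the `upload` callback are hypothetical names, and the sketch is in Python rather than Perl. It only illustrates why a failed external write loses no data.

```python
from typing import Callable, Dict, Optional


class Row:
    """Stand-in for one attachment row; not RT's real schema."""

    def __init__(self, content: bytes) -> None:
        self.content: Optional[bytes] = content  # durable copy in the database
        self.external_key: Optional[str] = None


def externalize(rows: Dict[int, Row], upload: Callable[[bytes], str]) -> int:
    """Try to move each row's content out; keep it in the DB if the upload fails."""
    moved = 0
    for row in rows.values():
        if row.content is None:
            continue  # already externalized
        try:
            key = upload(row.content)
        except OSError:
            continue  # disk full, permissions, network: retry on a later run
        row.external_key = key
        row.content = None  # drop the DB copy only after a successful write
        moved += 1
    return moved


def failing_upload(data: bytes) -> str:
    raise OSError("storage unavailable")


blobs: Dict[str, bytes] = {}


def working_upload(data: bytes) -> str:
    key = f"blob-{len(blobs)}"
    blobs[key] = data
    return key


rows = {1: Row(b"report.pdf bytes"), 2: Row(b"photo.jpg bytes")}
print(externalize(rows, failing_upload))  # nothing moved, data still in the DB
print(externalize(rows, working_upload))  # both rows moved out
```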

It also does de-duplicate files, so I’m unclear what you mean by that
point.

  • Alex

ExternalStorage moves data out of the database as soon as it is
successfully stored externally. It is not kept in the database, and

You’re right, I re-checked the code and saw that I missed this:

my $__DecodeLOB = __PACKAGE__->can('_DecodeLOB');
*_DecodeLOB = sub {

so I thought data access was from the db.

the database does not need to be vacuumed. The initial storage in the
database is needed to ensure durability of the data, in case of file
permission failure, or network failure (in the case of the cloud
storage options).

I see. I must admit this is quite elegant.

It also does de-duplicate files, so I’m unclear what you mean by that
point.

If the data had been kept inside the DB, files would have been stored
twice, and de-duplication in the file system would not have meant
de-duplication in the DB (am I clear?). But I was wrong.

Maybe I will try to contribute to ExternalStorage instead of
reinventing the wheel…

On 01/09/2015 09:16, Alex Vandiver wrote:

ExternalStorage moves data out of the database as soon as it is
successfully stored externally. It is not kept in the database, and

You’re right, I re-checked the code and saw that I missed this:

my $__DecodeLOB = __PACKAGE__->can('_DecodeLOB');
*_DecodeLOB = sub {

so I thought data access was from the db.

Ah – yes, I could see how one would draw the conclusions you did if
one missed that section.

Maybe I will try to contribute to ExternalStorage instead of
reinventing the wheel…

It has also been merged into core for 4.4. If you have improvements
in this area, I’m sure they would be appreciated.

  • Alex

On 01/09/2015 12:23, Alex Vandiver wrote:

It has also been merged into core for 4.4. If you have improvements
in this area, I’m sure they would be appreciated.

Before I begin to submit code (or throw away what I did), I would like
to share my thoughts. I am sorry if these questions have already been
discussed!

  • 1st: every uploaded file is loaded into RAM before it is stored; when
    sending files, you must also load the file into RAM;
  • 2nd: the files are, IMHO, objects. With external storage, this is
    even more “true” (limits of my English…)

Preventing storage failure is a good thing, as is preventing the system
from running out of RAM.

This, plus the 2nd question, led me to create a new object (called
Document) which represents files. The files are linked to other
classical objects in RT via Link objects (type “AttachedTo” or
“AttachedBy”).

This meant changing:

  • the way files are uploaded
  • the way file content is fetched

The changes were quite simple: if a link of type “AttachedTo” exists,
send Document->content; otherwise send (Content|LargeContent) as usual.
The number of files impacted was rather low (3).
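
The rule just described can be sketched in a few lines. The patch itself was not posted in this thread, so everything below is hypothetical: the `Document`, `Link` and `Attachment` classes are minimal stand-ins written in Python, and `payload` is an invented helper name.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Document:
    content: bytes


@dataclass
class Link:
    type: str
    target: Document


@dataclass
class Attachment:
    content: Optional[bytes] = None
    large_content: Optional[bytes] = None
    links: List[Link] = field(default_factory=list)


def payload(att: Attachment) -> Optional[bytes]:
    """Prefer a linked Document; otherwise fall back to the usual columns."""
    for link in att.links:
        if link.type == "AttachedTo":
            return link.target.content
    return att.content if att.content is not None else att.large_content


linked = Attachment(links=[Link("AttachedTo", Document(b"from Document"))])
plain = Attachment(large_content=b"from LargeContent")
print(payload(linked), payload(plain))
```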

I felt that I was respecting the spirit of RT, but I am not so sure
now!

This, plus the 2nd question, led me to create a new object
(called Document) which represents files. The files are linked to
other classical objects in RT via Link objects (type “AttachedTo” or
“AttachedBy”).

This sounds like a purely backend change, yes? But it’s not clear to
me what problem you are attempting to solve by doing this. While
unifying the OCFV and Attachment data storage is a perhaps-useful
goal, I don’t see much actual gain from it – it will increase the
size of the Links table by orders of magnitude. I also don’t
understand how this would avoid having to load the file data into
memory when saving or loading (your point 2).

  • Alex

This sounds like a purely backend change, yes? But it’s not clear to
me what problem you are attempting to solve by doing this. While

I have been asked to make some adaptations for different clients:

  • file storage (which ExternalStorage does perfectly)
  • inserting Articles files into ticket responses,
  • check and limit attachments sizes (client and server),
  • create a CF which proposes to pick up a file in a list (files are
    stored server side),

Each of those developments could be done separately, but I began to see
a bigger picture while coding them: RT could have a way to manage files
server-side. Once you have this object, each of the developments above
becomes quite easy to write, without side effects.

So it’s not purely a backend change; it’s about adding a new object
to RT, which could use the backend methods you added in ExternalStorage,
for example.

unifying the OCFV and Attachment data storage is a perhaps-useful
goal, I don’t see much actual gain from it – it will increase the
size of the Links table by orders of magnitude. I also don’t

Gains:

  • all development concerning files and attachments is done in the same
    place (this is relevant for me at least! :) )
  • there is a table of uploaded files (and files needed by the server), so
    you have a clear and instant view of this part of your information
    system;
  • I can create a lot of useful (to me!) features, such as a CF with a
    list of files,
  • you can link files to any object in RT, thanks to the Links table,
    without having to create a specific CF or attribute.

The size of the Links table is a good point, but it will stay very (very)
much smaller than group membership, for instance.

understand how this would avoid having to load the file data into
memory when saving or loading (your point 2).

I have overloaded Interface/Web.pm to process attachments directly on
disk: this way the content of the uploaded files is never seen by
Perl. In the same spirit, I replaced the “Content” attribute in the
MimeEntities with a “Path” attribute.

When loading, I rewrote some handlers and some classes to “spew” (I use
Path::Tiny :) ) the file content without loading it into a variable. There
are still some cases where I need the content, but not many.
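
The idea of sending a file without holding all of it in memory can be illustrated with a chunked copy. The author's implementation is in Perl and was not posted, so this is only a Python analogue under stated assumptions: `send_file` is a hypothetical helper, and the `io.BytesIO` sink merely stands in for an HTTP response body.

```python
import io
import shutil
import tempfile
from pathlib import Path


def send_file(path: Path, out, chunk_size: int = 64 * 1024) -> None:
    """Copy a file to a writable binary stream in fixed-size chunks."""
    with path.open("rb") as src:
        shutil.copyfileobj(src, out, chunk_size)


tmp = Path(tempfile.mkdtemp()) / "attachment.bin"
tmp.write_bytes(b"x" * 200_000)

sink = io.BytesIO()  # stands in for the response body in this sketch
send_file(tmp, sink)
print(sink.tell())
```

With a real socket or response object as the sink, at most one chunk of the file is held in memory at a time, whatever the attachment size.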