Option to store attachments on the filesystem

Hello RT Users and Developers,

Our RT instance at the University of Oregon is outgrowing the standard settings in some ways. One way is with attachments. The size of our database is 15.3GB and 13.7GB of that comes from the Attachments table. If our attachments were stored on a high-performance fileserver (or locally if you prefer), our database would shrink to 1.6GB. This would have numerous positive ramifications:

  • Database dumps/backups would finish in 1/10 the time
  • Database restores would finish in 1/10 the time
  • Planned downtimes and disaster recovery situations could be more nimbly performed (scp’ing around the db dump, restoring, etc)
  • Backups could be taken much more frequently
  • More backups could be stored
  • MySQL replication would be more robust with less binary data to chew on
  • Larger attachments could be permitted because there would be less fear of the database growing too quickly
  • Reduced database load querying/inserting/deleting/joining attachments

I’ve read in previous posts to this mailing list (see below) that the arguments against this are that (1) attachments on the filesystem can’t be searched and (2) the data backing the application will not be in one tidy database package but instead spread out across the db and filesystem. For our instance we don’t care about #1. As for #2, while I understand the argument, I would actually argue the opposite: when attachments are on a high-performance, redundant SAN managed by a dedicated storage team, my job administering RT gets a whole lot easier because I only have to worry about ensuring the fileserver is mounted and $AttachmentsPath (just an example config option) is properly set. I worked previously at a company that ran one of the largest instances of Bugzilla in the world, and we served up 30TB of attachments over a fileserver without any problems. Can you imagine those attachments in a MySQL database? When ticket tracking systems are no longer small-ish, moving attachments out of the database becomes a must.
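To make the idea concrete, here is a minimal sketch of what a filesystem-backed attachment store could look like. This is purely illustrative: RT itself is written in Perl, `$AttachmentsPath` is only a hypothetical config name, and the function names here are invented. The idea is to key files by content digest so the database row only needs to hold the digest, not the blob:

```python
import hashlib
from pathlib import Path

# Hypothetical stand-in for an $AttachmentsPath-style config option.
ATTACHMENTS_PATH = Path("/var/rt/attachments")

def store_attachment(content: bytes, root: Path = ATTACHMENTS_PATH) -> str:
    """Write the attachment body to disk; return the digest to keep in the DB row."""
    digest = hashlib.sha256(content).hexdigest()
    # Shard into subdirectories so no single directory grows unboundedly.
    target = root / digest[:2] / digest[2:4] / digest
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(content)
    return digest

def load_attachment(digest: str, root: Path = ATTACHMENTS_PATH) -> bytes:
    """Read an attachment body back given the digest stored in the database."""
    return (root / digest[:2] / digest[2:4] / digest).read_bytes()
```

Content-addressed storage has the side benefit that identical attachments (the same file mailed to several tickets) are stored once.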

I’m not asking the RT folks to switch attachment storage to the filesystem instead of the database. My wish is that RT offers its administrators the ability to choose one or the other. I know this has been a hot topic in the past, but I was hoping we could revisit the issue. Best Practical folks – are you open to this? If so, would it help the process if I did all the work and submitted a patch? If so, should I file a bug so that we can talk about the way you would like this implemented?

Given my reading of the history of this issue, I think a lot of folks would benefit from this feature. I’ve included previous postings about this issue below. Let me know if and how I can help. We would love to upstream a patch so our local instance doesn’t diverge too severely from yours.

Thanks for your consideration, Geoff Mayes

One of the first, meaty discussions:

The best discussion of the issue:

Best Practical has recently worked on this issue:


Hi Geoff,

I had thought that something like this had already been implemented
by Best Practical for a customer. Hopefully, they can provide some
feedback regarding the utility and possible problems of such an
approach from personal experience. Maybe they would consider releasing
it as an extension.

As far as the assertion that “a lot of folks would benefit from this
feature”, I doubt that would be the case for the vast majority of RT
users. Most users can handle “one-stop-shopping” type applications
with far fewer problems. Once you divorce the metadata repository
from the actual ticket data, you add a whole slew of different failure
modes that will require much more sophisticated administration processes
to prevent, ameliorate, or recover from. Your reference to leveraging
an existing SAN+SAN management team gives a hint to the increase in
both complexity and cost of running an instance.

There are a wide range of RT users from systems that manage a handful
of tickets a week all the way to systems handling thousands of tickets
or more a week. Those on the small end can/should use whatever DB
backend that they are familiar with to simplify administration and
the “what did I do?!” errors due to a lack of familiarity. As you
move towards larger implementations, your DB backend needs to be
chosen based on its viability in an enterprise/large-scale environment.
I do not know the level of your local MySQL expertise and I am certainly
not a MySQL expert, but a 15GB database does not strike me as particularly
large, by any metric. Maybe you would benefit by changing your backend DB
to something that scales better. I know that other DBs support tablespaces
that can allow you to move certain tables to different filesystems/locations
to provide for more parallel partitioning across more I/O resources.

Sorry for the slight ramble. I am looking forward to this discussion and
if this feature is added, some documentation describing when and when not
to use it will be essential.

Regards,
Ken

I am looking into this type of functionality as well. We were thinking of an NFS share in a web directory to drop the attachment into, with a way to drop a link within the ticket. So the attachments may not even exist on the RT server, but there will be links in the ticket to a web server that houses the attachment.

RT Training Sessions (http://bestpractical.com/services/training.html)

  • Boston March 5 & 6, 2012


Hi Kevin and Joe,

Joe – that’s exactly what I did at a previous company where we served 30TB of Bugzilla attachments and it worked very well. I plan to do that with RT too.

Kevin – thanks so much for your detailed response. Yeah, definitely, I think this is an important conversation to have and I am so happy that others are weighing in.

As far as the assertion that “a lot of folks would benefit from this
feature”, I doubt that would be the case for the vast majority of RT
users. Most users can handle “one-stop-shopping” type applications
with far fewer problems.

Yes, you’re right. I was a bit overzealous in claiming that “a lot of folks would benefit from this feature.” By “a lot” I did not mean the majority but instead that a not insignificant number of RT users would be interested in this feature (per the previous list postings I referenced).

Once you divorce the metadata repository
from the actual ticket data, you add a whole slew of different failure
modes that will require much more sophisticated administration processes
to prevent, ameliorate, or recover from.

I don’t think this is the case, especially if this feature is implemented well. RT only needs to deal with filesystem attachments for the “Create Ticket” and “Display Ticket” pages where users can upload or view attachments. When displaying tickets, RT could use a 5-second timeout when trying to access the local filesystem (over AJAX, so the rest of the page still loads quickly from the database). If there is a timeout or failure, RT would log the issue (and optionally email the admin) and display a user-friendly message in the attachments UI area stating that the attachments couldn’t be displayed.
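The timeout-and-fallback behavior described above could be sketched roughly like this (Python, illustrative only; the 5-second figure and the message wording are just the example values from this proposal, and `fetch_attachment` is an invented name):

```python
import logging
from concurrent.futures import ThreadPoolExecutor, TimeoutError
from pathlib import Path

# Hypothetical user-facing fallback text; real wording would be RT's choice.
FALLBACK_MESSAGE = "Attachments are temporarily unavailable; the administrator has been notified."

def fetch_attachment(path: str, timeout: float = 5.0):
    """Try to read an attachment from the (possibly remote) filesystem.

    Returns (True, content) on success, or (False, user-facing message)
    if the read fails or exceeds the timeout, after logging for the admin.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(Path(path).read_bytes)
    try:
        return True, future.result(timeout=timeout)
    except (TimeoutError, OSError) as exc:
        logging.error("attachment fetch failed for %s: %s", path, exc)
        return False, FALLBACK_MESSAGE
    finally:
        # Don't block on a hung NFS read; let the worker finish in the background.
        pool.shutdown(wait=False)
```

One caveat with this approach: a truly hung NFS read leaves the worker thread blocked in the background, so a real implementation would also want mount-level timeouts (e.g. NFS soft mounts) rather than relying on application timeouts alone.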

I envision searching of attachments being disabled in RT if the user has chosen to store attachments on the filesystem and this would of course be well-documented for those choosing to install RT with filesystem attachments. RT could still keep the Attachments table and store metadata about files in that table when a user uploads a file, but this isn’t necessary and I’d let Best Practical decide this. Keeping attachments in the Transactions table isn’t necessary, but it fits with the design of RT and would be nice to have. Again, all of this is internal RT implementation details, and depending on how well it is done, the local administrator of RT could have an incredibly simple and easy experience if they chose to store attachments on the filesystem instead of the database.

The local filesystem or fileserver itself is, granted, an additional piece to manage, but it is an incredibly simple piece. I would much rather manage files on a filesystem than files in a database when I could have a database be 1.6GB instead of 15GB. And there are so many free (as well as expensive) ways to manage data (e.g. NFS, rsync, RAID, high-end redundant SANs, etc). Most importantly, the binary data is separated from the textual data, which permits separate backup schedules and schemes. Yay for modular design! As I said, I previously administered 30TB of attachments over NFS for a different tracking system and it worked very well.

Your reference to leveraging an existing SAN+SAN management team gives a hint to the increase in
both complexity and cost of running an instance.

This cost is up to the user, depending on how secure and robust they would like their data to be. RT would only provide the option of local or database attachment storage; the user can then decide whether a cheap SATA local disk serves the attachments or a super-expensive (or cheap), locally-mounted fileserver.

There are a wide range of RT users from systems that manage a handful
of tickets a week all the way to systems handling thousands of tickets
or more a week. Those on the small end can/should use whatever DB
backend that they are familiar with to simplify administration and
the “what did I do?!” errors due to a lack of familiarity.

Totally agree with this. An option to store attachments on the filesystem, however, is database-agnostic, so RT admins can select this option with MySQL, Oracle, Postgres, SQLite, etc.

As you move towards larger implementations, your DB backend needs to be
chosen based on its viability in an enterprise/large-scale environment.
I do not know the level of your local MySQL expertise and I am certainly
not a MySQL expert, but a 15GB database does not strike me as particularly
large, by any metric. Maybe you would benefit by changing your backend DB
to something that scales better. I know that other DBs support tablespaces
that can allow you to move certain tables to different filesystems/locations
to provide for more parallel partitioning across more I/O resources.

Our desire to store attachments outside of the database, at this point, has little to do with application performance and everything to do with backups, disaster recovery, upgrades, and downtimes. That being said, I do know that there are big performance gains that come from storing attachments outside of the database. Check out one Bugzilla core developer’s discussion of this issue and their work-in-progress implementation: Bugzilla bug 577532, “Only store text attachments in the database”. So moving attachments out of the database is an actual tuning option, just like the other options you mentioned. Why do something drastic like changing the database backend or performing complicated and expert-level tuning/sharding/partitioning, when I could just add a few config options to RT_SiteConfig.pm and run a script (for a pre-existing instance) that then sets up my instance to serve attachments from a filesystem instead of the db?

Here’s one recent example of how our current database size is negatively impacting us: We upgraded from 3.8.4 to 4.0.4 yesterday and it took almost an hour to dump our database and almost an hour to import the database (we were upgrading MySQL and the OSes as well). And then we had to import it again because max_allowed_packet was set too small (which wouldn’t have been a problem if attachments were outside the db: an anecdotal rather than logical argument, but nonetheless a real-world occurrence, as errors happen), so add another hour instead of only another 10 minutes. If attachments were stored outside of the database, we could have reduced just the backup and import phases from 3 hours to 20 minutes. That is a huge difference, especially when your application is used by thousands of customers waiting to log back in. The positive ramifications continue: internal development of RT is much faster with a small database because we can copy it around the network faster, perform imports in 1/10th the time, and keep our development database up-to-date much more easily.

If someone knows of a simpler way to cut the dump and restore times to 1/10 of their current values, I would love to hear it and am totally open to a different solution.

The main point I would like to restate is that larger or quickly-growing instances of RT are very different from smaller or slowly-growing instances. One pain point of the larger instances is the size of the database and how that affects backups, restores, disaster recoveries, and development. Having the option to store attachments outside of the database allows the larger RT instances to more easily manage their data for a much longer period of time. Most importantly for the Best Practical folks, this option increases the appeal of RT to larger organizations instead of only the “small- to medium-sized” market as stated at http://requesttracker.wikia.com/wiki/ManualIntroduction. The addition of this feature along with the recent SphinxSE option truly makes RT more feasible and attractive to larger organizations.

Kind regards, Geoff Mayes


I don’t think this is the case, especially if this feature is implemented well. RT only needs to deal with filesystem attachments for the “Create Ticket” and “Display Ticket” pages where users can upload or view attachments. When displaying tickets RT could have a 5 second timeout when trying to access the local filesystem (over AJAX so the page still loads quickly entirely from the database) and if there is a timeout or failure, RT would log the issue (and optionally email) the admin and display a user-friendly message in the same attachments UI area stating that the attachments couldn’t be displayed.

I may be mistaken, but I thought that all ticket content is currently stored
as an attachment in the DB and not just those available in the Create Ticket or
Display Ticket screens. Perhaps a size threshold could be used to push larger
attachments (for some value of large) to a filesystem store and store a meta-link
in the attachment table. This would allow for searching and indexing the smaller
updates and would help to keep the DB from being bogged down with the larger items.
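Ken's size-threshold idea could look roughly like the sketch below (Python, illustrative only; the 256 KB cutoff, the function name, and the return convention are all invented for the example): small attachments stay as rows in the DB and remain searchable, while large ones spill to disk with only a reference kept in the Attachments table.

```python
import hashlib
from pathlib import Path

# Hypothetical cutoff, "for some value of large".
SIZE_THRESHOLD = 256 * 1024

def plan_storage(content: bytes, store_root: Path):
    """Decide where an attachment body lives.

    Returns ("db", content) for small items, which stay searchable in the
    Attachments table, or ("file", path) after spilling a large item to disk,
    leaving only the path (a meta-link) to store in the table.
    """
    if len(content) <= SIZE_THRESHOLD:
        return "db", content
    digest = hashlib.sha256(content).hexdigest()
    store_root.mkdir(parents=True, exist_ok=True)
    target = store_root / digest
    target.write_bytes(content)
    return "file", str(target)
```

A scheme like this preserves full-text search over the small, mostly textual updates (the common case) while keeping bulky binaries out of the database.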

I envision searching of attachments being disabled in RT if the user has chosen to store attachments on the filesystem and this would of course be well-documented for those choosing to install RT with filesystem attachments. RT could still keep the Attachments table and store metadata about files in that table when a user uploads a file, but this isn’t necessary and I’d let Best Practical decide this. Keeping attachments in the Transactions table isn’t necessary, but it fits with the design of RT and would be nice to have. Again, all of this is internal RT implementation details, and depending on how well it is done, the local administrator of RT could have an incredibly simple and easy experience if they chose to store attachments on the filesystem instead of the database.

The local filesystem or fileserver itself is, granted, an additional piece to manage, but it is an incredibly simple piece. I would much rather manage files on a filesystem than files in a database when I could have a database be 1.6GB instead of 15GB. And there are so many free (as well as expensive) ways to manage data (e.g. NFS, rsync, RAID, high-end redundant SANs, etc). Most importantly, the binary data is separated from textual data, and separate backup schedules and schemes are then permitted. Yay for modular design! As I said previously, I previously administered 30TB of attachments over NFS for a different tracking system and it worked very well.

Your reference to leveraging an existing SAN+SAN management team gives a hint to the increase in
both complexity and cost of running an instance.

This cost is up to the user depending on how secure and robust they would like their data to be. RT only provides the option for local or database attachment storage. The user can then decide whether a cheap SATA local disk serves the attachments or a super-expensive (or cheap), locally-mounted fileserver.

There are a wide range of RT users from systems that manage a handful
of tickets a week all the way to systems handling thousands of tickets
or more a week. Those on the small end can/should use whatever DB
backend that they are familiar with to simplify administration and
the “what did I do?!” errors due to a lack of familiarity.

Totally agree with this. An option to store attachments on the filesystem, however, is database-agnostic, so RT admins can select this option with MySQL, Oracle, Postgres, SQLite, etc.

As you move towards larger implementations, your DB backend needs to be
chosen based on its viability in an enterprise/large-scale environment.
I do not know the level of your local MySQL expertise and I am certainly
not a MySQL expert, but a 15GB database does not strike me as particularly
large, by any metric. Maybe you would benefit by changing your backend DB
to something that scales better. I know that other DBs support tablespaces
that can allow you to move certain tables to different filesystems/locations
to provide for more parallel partitioning across more I/O resources.

Our desire to store attachments outside of the database, at this point, has little to do with application performance and everything to do with backups, disaster recovery, upgrades, and downtimes. That said, I do know that there are big performance gains from storing attachments outside of the database. Check out a discussion by one of Bugzilla’s core developers of this issue and their work-in-progress implementation: 577532 - Only store text attachments in the database. So moving attachments out of the database is an actual tuning option, just like the others you mentioned. Why do something drastic like changing the database backend, or perform complicated and expert-level tuning/sharding/partitioning, when I could just add a few config options to RT_SiteConfig.pm and run a script (for a pre-existing instance) that sets up my instance to serve attachments from the filesystem instead of the db?

I am not against having the option at all. I think we need to have good
documentation on the trade-offs or strengths and weaknesses of each approach.
I do think that the more people count on the ticket system as a resource and expect
it to be continuously available, the larger the consequences of adding additional
“moving parts” to the system. Having the ticket metadata available without the
actual data is useless in many environments, and currently, in general, all ticket
content is stored as an attachment. It is very easy to trivialize the issues
that come with managing a filestore, whether in a SQL DB or a filesystem (or other
data store). When reliability and availability are important, many measures need to
be taken to ensure access to all of the metadata+data and to handle business
continuity and disaster recovery.

Here’s one recent example of how our current database size is negatively impacting us: we upgraded from 3.8.4 to 4.0.4 yesterday and it took almost an hour to dump our database and almost an hour to import it (we were upgrading MySQL and the OSes as well). And then we had to import it again because max_allowed_packet was set too small (which wouldn’t have been a problem if attachments were outside the db: an anecdotal rather than logical argument, but nonetheless a real-world occurrence, as errors happen), so add another hour instead of only another 10 minutes. If attachments were stored outside of the database, we could have reduced just the backup and import phases from 3 hours to 20 minutes. That is a huge difference, especially when your application is used by thousands of customers waiting to log back in. The positive ramifications continue: internal development of RT is much faster with a small database because we can copy it around the network faster, perform imports in 1/10th the time, and keep our development database up-to-date much more easily.

If someone knows of a simpler way to cut the dump and restore times to 1/10th, I would love to hear it and would be totally open to a different solution.

As an example, using a different DB product you can replicate the backend to
a new instance and keep it in sync until the upgrade. Then you have effectively
zero time to copy the DB because the work was done outside the critical path
for the upgrade.

The main point I would like to restate is that larger or quickly-growing instances of RT are very different from smaller or slowly-growing instances. One pain point for the larger instances is the size of the database and how that affects backups, restores, disaster recovery, and development. Having the option to store attachments outside of the database allows the larger RT instances to manage their data much more easily over a much longer period of time. Most importantly for the Best Practical folks, this option increases the appeal of RT to larger organizations instead of the “small- to medium-sized” market as stated at http://requesttracker.wikia.com/wiki/ManualIntroduction. The addition of this feature along with the recent SphinxSE option truly makes RT more feasible and attractive to larger organizations.

I think that the option to store attachments out-of-band would be useful
in some situations and I would like to get some feedback from BP on that.
As far as the backup situation goes, a simple way to drop your backup time,
albeit with an increase in complexity, is to back up the attachments table
and other large content tables using an incremental/differential process
and not the “dump-it-all” approach. In other words, dump only the new
items every day. This will dramatically reduce your backup times as well
as simplify keeping test DBs up to date. Just a couple of ideas.
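
Ken’s incremental idea can be sketched concretely: mysqldump’s --where flag restricts a table dump to rows matching a condition, so a nightly job only needs to remember the highest Attachments id it has already exported. The helper below is illustrative, not an existing RT tool, and the database name is an assumption:

```python
# Illustrative helper for the incremental-dump approach (not an RT tool):
# export only Attachments rows added since the previous run, using
# mysqldump's --where option to filter by primary key.
import shlex

def incremental_dump_cmd(last_id, db="rt4", table="Attachments"):
    """Build the mysqldump argv for rows newer than last_id."""
    return [
        "mysqldump", "--single-transaction",
        "--where=id > %d" % last_id,
        db, table,
    ]

# e.g. after a previous run that stopped at id 120000:
print(" ".join(shlex.quote(arg) for arg in incremental_dump_cmd(120000)))
```

A wrapper script would record the new maximum id after each successful dump, so each night’s file contains only that day’s attachments.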

Regards,
Ken

Totally agree with this. An option to store attachments on the filesystem, however, is database-agnostic, so RT admins can select this option with MySQL, Oracle, Postgres, SQLite, etc.

What if it wasn’t DB agnostic?

Does any linux/open source DB provide a “FILESTREAM” option like MS SQL? It sounds like that would be an easy solution if it existed as, to the best of my knowledge, the app doesn’t need to know any different when using FILESTREAM. The DB handles the disk/DB interaction.

Brent


I think that PostgreSQL 9.1 includes support for Foreign Data Wrappers
that could be used to access the filesystem contents in a similar
fashion.

Regards,
Ken

I may be mistaken, but I thought that all ticket content is currently stored
as an attachment in the DB and not just those available in the Create Ticket or
Display Ticket screens.

You’re right: all uploaded attachments (images, logs, PDFs, etc) as well as all textual ticket updates (comments, correspondence, etc) are stored in the Attachments table. I’ve only been working on RT for a couple weeks, but I’ve worked on Bugzilla for a few years, so it has been very interesting to compare the two. I am surprised that uploaded attachments and ticket comments/correspondence are stored in the same table! That feels like overloading and/or de-normalization to me. So, yes, any implementation of the “store attachments on the filesystem” feature would have to change the current schema so that all textual ticket updates remain in the database, but all file uploads are stored on the filesystem. My brief reading of the patch submitted during a previous post about this issue (Carbon60: Managed Cloud Services) shows that the current schema can be kept, but a clean, non-hackish implementation should probably change the schema.

I wonder if we’ve been misunderstanding each other in other places because of my lack of understanding on this matter until now. If so, I apologize.

I do think that the more people count on the ticket system as a resource and expect
it to be continuously available, the larger the consequences of adding additional
“moving parts” to the system.
It is very easy to trivialize the issues
that come with managing a filestore, whether in a SQL DB or a filesystem (or other
data store). When reliability and availability are important, many measures need to
be taken to ensure access to all of the metadata+data and to handle business
continuity and disaster recovery.

As a general statement, I would agree with everything you wrote. I will say, though, that for our RT instance I’d take an NFS mount of attachments over an additional 14GB table in the database any day of the week. Additionally, our rate of RT usage is growing every month, so that 14GB Attachments table could easily become 40GB in two years. I like future-proofing.

As an example, using a different DB product you can replicate the backend to
a new instance and keep it in sync until the upgrade. Then you have effectively
zero time to copy the DB because the work was done outside the critical path
for the upgrade.

That’s a good idea. Unfortunately, we couldn’t have done this for our last upgrade because we moved from MySQL 4.1 to 5.1 (http://dev.mysql.com/doc/refman/5.1/en/replication-compatibility.html).

I’m pretty sure we are going to make this change to our instance. (I’m pushing for it.) I was just hoping to get this feature into the main RT repo to spread the love, de-duplicate the identical work done by others, and make future upgrades easier for us.

Hi Geoff,

There exists a Best Practical written extension for putting attachments
on disk, but it is not public. (Multiple mailing list posts have
mentioned it in the last year.)

As Ken points out, there are non-trivial issues and maintenance
associated with using a database store together with a filestore. This is one of
a handful of reasons it isn’t public.

On 12/22/2011 03:42 PM, Geoff Mayes wrote:

Most importantly for the Best Practical folks, this option increases
the appeal of RT to larger organizations instead of the “small- to
medium-sized” market as stated at
http://requesttracker.wikia.com/wiki/ManualIntroduction. The addition
of this feature along with the recent SphinxSE option truly makes RT
more feasible and attractive to larger organizations.

We deal with some huge RT instances at some huge organizations. The
assertion on the wiki page you point to is simply incorrect.

Best,
Thomas

I am looking into this type of functionality as well. We were
thinking of an NFS share in a web directory to drop the attachments
into, with a link placed within the ticket. So the attachments may
not even exist on the RT server, but there would be links in the
ticket to a web server that houses the attachment.

It’d be nice if this was the case, but the big problem with having the
web server directly serve your attachments is that suddenly you lose
all access control that RT normally provides around attachments.

Even if attachments are stored on disk, they must be served by RT, not
the web server directly.
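
Thomas’s point can be sketched in a few lines. This is a generic illustration (RT itself is Perl, and none of these names are RT APIs): every download passes through an application-level rights check, and the requested path is confined to the attachment root, so the web server never hands files out directly.

```python
# Generic illustration (not RT's actual API): even with files on disk,
# downloads must go through the application's ACL check rather than
# being served directly by the web server.
import os

def serve_attachment(visible_tickets, ticket_id, filename,
                     root="/var/rt/attachments"):
    """Return attachment bytes only if the user may see the ticket."""
    if ticket_id not in visible_tickets:      # ShowTicket-style right check
        raise PermissionError("no right to see ticket %d" % ticket_id)
    path = os.path.normpath(os.path.join(root, filename))
    if not path.startswith(os.path.join(root, "")):  # block ../ escapes
        raise ValueError("path escapes the attachment root")
    with open(path, "rb") as fh:
        return fh.read()
```

A direct web-server alias to the attachment directory would skip both checks, which is exactly the access-control loss described above.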

Thomas

Please keep replies on the list.

For the record, I’m not claiming that core RT shouldn’t support
attachments on disk in the future. I’m just trying to give you the
relevant info for right now.

I searched but struck out. Could you provide some links? Why isn’t
it public? Any way I can take a look at it? :)

The extension was originally the result of customer work, and it hasn’t
been made public. Will it be made public? We don’t know yet. For the
time being, you’ll need to contact sales@bestpractical.com if you’re
interested in it.

The mailing list threads I found just now with a search:

Is it that non-trivial? The Bugzilla in-house Attachments.pm module
we used to use was 200 lines of Perl and that handled the main 8TB
attachments datastore as well as the archive’s 30TB datastore, sorting
out discrepancies between the two. And larger organizations will
have the resources and expertise to do these kinds of things easily,
so if RT really is for organizations of all sizes, then how does it
cater to the non-trivial users in this matter?

Larger organizations take a variety of steps to help ensure RT performs
well. Yes, occasionally that includes putting attachments on disk, but
it also includes good database tuning and many other tweaks before that.
The fact is that a 15 or 20GB database is simply not large at all;
15-20GB fits on a single USB flash drive.

I just don’t see how keeping attachments in the database is optimal
for backing up and restoring the database when the database gets
beyond 10-20GB, especially when an organization hasn’t paid Oracle
$5,000 so they can run hotcopy on their InnoDB databases.

As Ken from rice.edu said earlier, there are smarter backup solutions
than dumping/restoring the entire DB every single time. He named a couple.

As for Oracle’s $5k tool, you can get similar results with Percona’s
completely free, open-source hot backup/copy tool for InnoDB called
XtraBackup.

How big are their databases and attachment tables? How do they do
backups and restores? What are their disaster recovery plans?

These are customers that we can’t provide that information on. There
was at least one thread in 2010 or 2011 on the mailing list asking
people to contribute stats about the sizes of their RT instance. Some
notably large ones came up.

Thomas