In theory it’s possible.
I would be more inclined to do this with failover at the virtual machine or disk level, using replication. We use KVM + DRBD + Ganeti to build server clusters in an active/standby setup. For any given VM/service, we can live-migrate the VMs without interrupting service or adding a lot of complexity to the application itself. Mostly we live-migrate VMs between servers in different physical locations when doing maintenance or upgrades, for example. For us, if a server or the network breaks, the worst case is about 30 minutes to make the standby instances for the affected VMs active.
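For what it's worth, the day-to-day operations on our side boil down to a couple of Ganeti commands, roughly like this (a sketch only; the instance name rt-vm is made up, and exact options may vary with your Ganeti version):

    # Planned maintenance: live-migrate the instance to its DRBD secondary node
    gnt-instance migrate rt-vm

    # Primary node is dead: start the instance on the secondary from its replica
    gnt-instance failover --ignore-consistency rt-vm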
We also do postgres replication to a couple of read-only instances. I think the only time we’ve ever switched postgres over is when upgrading postgres itself.
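The core of a streaming-replication setup is small, roughly like this (a sketch, assuming PostgreSQL 12 or later; host and role names are made up):

    # postgresql.conf on the primary (values are illustrative)
    wal_level = replica
    max_wal_senders = 5

    # postgresql.conf on each read-only standby
    hot_standby = on
    primary_conninfo = 'host=db-primary port=5432 user=replicator'
    # plus an empty standby.signal file in the standby's data directory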
If you use RT for incoming/outgoing email ticket updates, you may need to consider how emails arrive in RT. Which instance will receive the email? Do you have your own SMTP servers to queue incoming messages for one instance and, if they cannot reach it, try a second instance? It can make for quite a complicated multi-MX mail setup where emails get stuck in queues in various places, depending on the problem. (For example, if DNS can't be resolved, mail might be queued or rejected somewhere, or your spam filtering or recipient/sender verification might not work, etc.)
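Just to illustrate the moving parts involved (a sketch only; the host names, RT install path and queue name are assumptions, not a recommendation):

    ; DNS zone: two MX hosts, lower preference tried first
    example.com.    IN MX 10 mx1.example.com.
    example.com.    IN MX 20 mx2.example.com.

    # /etc/aliases on whichever MTA finally hands mail to RT
    rt: "|/opt/rt5/bin/rt-mailgate --queue General --action correspond --url https://rt.example.com/"

Every hop in that chain (DNS, both MXs, spam filtering, the mailgate URL) is something that can hold mail up independently of RT itself.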
You will want to avoid the “dual active” situation, where your network becomes split in two for some reason and both instances become the “active” instance, with some clients served by one and others served by the second.
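At the disk layer, one way to guard against that is DRBD 9's quorum support (a sketch, assuming DRBD 9 with at least three nodes so a majority exists; the resource name r0 is made up):

    resource r0 {
      options {
        quorum majority;         # a node may only write while it sees a majority of peers
        on-no-quorum io-error;   # otherwise fail I/O instead of letting the copies diverge
      }
      # ... nodes, volumes, etc.
    }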
My experience is that such outages happen very rarely. We have multiple sites, multiple network links, and multiple VM server clusters.
It’s a trade-off. Given the infrequency of outages, I think it’s not worth doing a lot of redundant setup with multiple instances of applications “just in case”. Just to do it reliably at the application level, we would need:
- Redundant RT instances.
- Multiple incoming MX servers.
- Multiple DNS resolvers.
- Active/Backup replication of IMAP/mail servers.
- Multiple replicated postgres DBs.
- Multiple directory/user verification sources/authentication.
- Multiple monitoring and syslog servers.
- …
So you end up with a setup that’s hellishly complex, with twice the maintenance and management headaches and a lot of moving parts to debug when it goes wrong. And it usually goes wrong in a new and amusing way, with a failure mode we didn’t think about. (e.g., you spend 30 minutes trying to understand why nobody can log in during the outage, or why some emails don’t show up, instead of 30 minutes actually fixing the underlying problem…)
Not worth it for an outage that maybe happens once or twice a year. 30 minutes is acceptable, IMHO, as we always take the same corrective action regardless of the application: just migrate the VMs to bring everything back online, then investigate the network problem, etc.