In theory it’s possible.
I would be more inclined to do this with failover at the virtual machine or disk level, using replication. We use KVM + DRBD + Ganeti to build server clusters in an active/standby setup. For any given VM/service, we can live-migrate the VMs without interrupting service or adding a lot of complexity to the application itself. Mostly we live-migrate VMs between servers in different physical locations when doing maintenance or upgrades, for example. For us, if a server or the network breaks, the worst case is about 30 minutes to make the standby instances for the affected VMs active.
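For what it's worth, the day-to-day operations on our side boil down to a couple of Ganeti commands, roughly like this (a sketch only; the instance name rt-vm is made up, and exact options may vary with your Ganeti version):

    # Planned maintenance: live-migrate the instance to its DRBD secondary node
    gnt-instance migrate rt-vm

    # Primary node is dead: start the instance on the secondary from its replica
    gnt-instance failover --ignore-consistency rt-vm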
We also do postgres replication to a couple of read-only instances. I think the only time we’ve ever switched postgres over is when upgrading postgres itself.
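The core of a streaming-replication setup is small, roughly like this (a sketch, assuming PostgreSQL 12 or later; host and role names are made up):

    # postgresql.conf on the primary (values are illustrative)
    wal_level = replica
    max_wal_senders = 5

    # postgresql.conf on each read-only standby
    hot_standby = on
    primary_conninfo = 'host=db-primary port=5432 user=replicator'
    # plus an empty standby.signal file in the standby's data directory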
If you use RT for incoming/outgoing email ticket updates, you may need to consider how emails arrive in RT. Which instance will receive the email? Do you have your own SMTP servers to queue incoming messages for one instance and, if they cannot reach it, try a second instance? It can make for quite a complicated multi-MX mail setup where emails get stuck in queues in various places, depending on the problem. (For example, if DNS can't be resolved, mail might be queued or rejected somewhere, or your spam filtering or recipient/sender verification might not work, etc.)
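Just to illustrate the moving parts involved (a sketch only; the host names, RT install path and queue name are assumptions, not a recommendation):

    ; DNS zone: two MX hosts, lower preference tried first
    example.com.    IN MX 10 mx1.example.com.
    example.com.    IN MX 20 mx2.example.com.

    # /etc/aliases on whichever MTA finally hands mail to RT
    rt: "|/opt/rt5/bin/rt-mailgate --queue General --action correspond --url https://rt.example.com/"

Every hop in that chain (DNS, both MXs, spam filtering, the mailgate URL) is something that can hold mail up independently of RT itself.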
You will want to avoid the “dual active” situation, where your network becomes split in two for some reason and both instances become the “active” instance, with some clients served by one and others served by the second.
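At the disk layer, one way to guard against that is DRBD 9's quorum support (a sketch, assuming DRBD 9 with at least three nodes so a majority exists; the resource name r0 is made up):

    resource r0 {
      options {
        quorum majority;         # a node may only write while it sees a majority of peers
        on-no-quorum io-error;   # otherwise fail I/O instead of letting the copies diverge
      }
      # ... nodes, volumes, etc.
    }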
My experience is that such outages happen very rarely. We have multiple sites, multiple network links, and multiple VM server clusters.
It’s a trade-off. Given the infrequency of outages, I think it’s not worth doing a lot of redundant setup with multiple instances of applications “just in case”. Just to do it reliably at the application level, we would need:
- Redundant RT instances.
- Multiple incoming MX servers.
- Multiple DNS resolvers.
- Active/Backup replication of IMAP/mail servers.
- Multiple replicated postgres DBs.
- Multiple directory/user verification sources/authentication.
- Multiple monitoring and syslog servers.
- …
So you end up with a setup that’s hellishly complex, with twice the maintenance and management headaches and a lot of moving parts to debug when it goes wrong. And it usually goes wrong in a new and amusing way, with a failure mode we didn’t think about. (e.g., you spend 30 minutes trying to understand why nobody can log in during the outage, or why some emails don’t show up, instead of 30 minutes actually fixing the underlying problem…)
Not worth it for an outage that maybe happens once or twice a year. 30 minutes is acceptable, IMHO, as we always take the same corrective action regardless of the application: just migrate the VMs to bring everything back online, then investigate the network problem, etc.