RT and Disaster Recovery - problem

Guadagnino_Cristiano · September 2, 2015, 3:11pm

Hi all,
are you using some kind of DR solution with RT?

Our RT servers are virtualized on VMware. As a DR solution, we keep
virtual machines in our second data center in sync with the first one.
VMs in the second data center are switched off.

If we have problems in the first data center, we power on the VMs in the
second and then change our DNS to point to the new VMs.
The only difference on the mirrored VMs is that they have different IPs.

So, right after powering them up, we have to connect and change the
configuration of those VMs to use the new IPs (at least in those cases
where we cannot use DNS aliases).
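
A reconfiguration like this is easy to leave incomplete, so one sanity check is to scan the configuration directories for any remaining reference to the old address. The following is only a minimal sketch in Python: the IP address and the directory paths are hypothetical placeholders, not values taken from this setup.

```python
import os

def find_ip_references(old_ip, roots):
    """Return (path, line_number, line) for every line that contains old_ip.

    Note: this is a plain substring match, so "192.0.2.10" also matches
    "192.0.2.100". Review the hits by hand.
    """
    hits = []
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "r", errors="replace") as handle:
                        for number, line in enumerate(handle, start=1):
                            if old_ip in line:
                                hits.append((path, number, line.rstrip()))
                except OSError:
                    # Unreadable file (permissions, broken link): skip it.
                    continue
    return hits

if __name__ == "__main__":
    # Hypothetical values: replace with the production IP and your own paths.
    OLD_IP = "192.0.2.10"
    CONFIG_ROOTS = ["/etc/httpd", "/opt/rt4/etc"]
    for path, number, line in find_ip_references(OLD_IP, CONFIG_ROOTS):
        print(f"{path}:{number}: {line}")
```

An empty result only means the listed directories are clean; the old address could still live elsewhere (database grants, hosts files, and so on).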

This is acceptable for us, because RT is not such a critical asset that
we cannot afford a downtime of a few minutes.

However, the problem is that - after reconfiguring the VMs - RT becomes
slow as a snail (tens of seconds for each page change/refresh).

I spent a couple of days building an exact copy of our production VMs and
experimenting with varying IPs and reconfiguring, and I am not able to
overcome the problem or understand where it comes from.

Has anybody ever heard of such a problem?

T.I.A.
Cris

Aaron_C_de_Bruyn · September 2, 2015, 3:26pm

I have not run into the issue, but we can try and figure out where the
slowness is coming from.

What OS/distro are you running?

Are you using Apache, Nginx or something else to serve up RT?

Have you checked that DNS is resolving properly on your machine?
(type: host google.com and see how long it takes for an answer)
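
The same check can be scripted so that the timing is explicit. `host google.com` only exercises a forward lookup; since the DR VMs come up with new IPs, it may also be worth timing the reverse lookup of the new address. This is a rough sketch, and the default hostname and IP are placeholders chosen so the script runs anywhere:

```python
import socket
import sys
import time

def timed_lookup(lookup, argument):
    """Run lookup(argument) and return (elapsed_seconds, result_or_error)."""
    start = time.perf_counter()
    try:
        result = lookup(argument)
    except OSError as error:
        # socket.gaierror and socket.herror are subclasses of OSError.
        result = error
    return time.perf_counter() - start, result

if __name__ == "__main__":
    # Placeholder defaults; pass the RT hostname and the DR VM's new IP instead.
    hostname = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    address = sys.argv[2] if len(sys.argv) > 2 else "127.0.0.1"

    elapsed, result = timed_lookup(socket.gethostbyname, hostname)
    print(f"forward lookup of {hostname}: {elapsed:.3f}s -> {result!r}")

    elapsed, result = timed_lookup(socket.gethostbyaddr, address)
    print(f"reverse lookup of {address}: {elapsed:.3f}s -> {result!r}")
```

A lookup that takes several seconds before failing would be worth following up; a fast answer in both directions makes DNS a less likely suspect.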

Are commands you run from the shell taking a long time to run, or is it just RT?

-A

Guadagnino_Cristiano · September 2, 2015, 3:40pm

Hi Aaron,
and thank you for the quick reply.

> What OS/distro are you running?
CentOS 6.7

> Are you using Apache, Nginx or something else to serve up RT?
Apache/2.2.15 (CentOS) Server at localhost Port 80

> Have you checked that DNS is resolving properly on your machine?
Yes, DNS is working properly and very fast. However, on my test VMs I
have used only IPs to avoid putting the DNS into the equation.

> Are commands you run from the shell taking a long time to run, or is
> it just RT?
Apparently it is just RT

Thank you!
Cris

chmrr · September 2, 2015, 5:35pm

> Our RT servers are virtualized on VMware.
> [snip]
> However, the problem is that - after reconfiguring the VMs - RT becomes
> slow as a snail (tens of seconds for each page change/refresh).

Do the VMs have any snapshots enabled? I know that historically, at
least, the mere presence of a snapshot file caused all I/O to be
copy-on-write (COW), which degraded I/O bandwidth by an order of magnitude.
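
One way to double-check for leftover snapshots is to look at the file names in the VM's datastore directory. The sketch below assumes the commonly seen VMware naming for snapshot artifacts (delta disks such as `vm-000001.vmdk` or `vm-000001-delta.vmdk`, and `.vmsn` state files); treat the patterns as an assumption and confirm the result in the vSphere snapshot manager.

```python
import os
import re
import sys

# Assumed naming patterns for snapshot artifacts: numbered delta disks
# ("-000001.vmdk", "-000001-delta.vmdk", "-000001-sesparse.vmdk") and
# ".vmsn" snapshot state files.
SNAPSHOT_PATTERN = re.compile(r"(-\d{6}(-delta|-sesparse)?\.vmdk|\.vmsn)$")

def snapshot_files(filenames):
    """Return the file names that look like snapshot artifacts."""
    return [name for name in filenames if SNAPSHOT_PATTERN.search(name)]

if __name__ == "__main__":
    # Directory holding the VM's files; defaults to the current directory.
    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    found = snapshot_files(sorted(os.listdir(directory)))
    print("\n".join(found) if found else "no snapshot-related files found")
```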

Alex

Guadagnino_Cristiano · September 3, 2015, 9:51am

Alex,
thank you for your reply.

No, the VMs do not have any snapshots enabled.

Cris

Aaron_C_de_Bruyn · September 3, 2015, 5:30pm

Sorry for the late reply; work exploded.

>> Are you using Apache, Nginx or something else to serve up RT?
> Apache/2.2.15 (CentOS) Server at localhost Port 80

If it’s bound to localhost, how are users accessing it? Or do you
have something else on the box with a public-facing IP that
proxies traffic to 127.0.0.1:80?

Or do you have something like spawn_fcgi running on 127.0.0.1:80 with
Apache proxying?

> Yes, DNS is working properly and very fast. However, on my test VMs I
> have used only IPs to avoid putting the DNS into the equation.

> Apparently it is just RT

Is it the first request, or all requests?
If I recall correctly, I had an issue with spawn_fcgi where the first
page took ~45 seconds to serve right after the process started.
After that, pages were snappy.

-A

Ram_Moskovitz1 · September 3, 2015, 5:49pm

> Our RT servers are virtualized on VMware.
> [snip]
> However, the problem is that - after reconfiguring the VMs - RT becomes
> slow as a snail (tens of seconds for each page change/refresh).

G, on your DR box, time exactly how long the home page takes to load,
three times in a row. The length of the delay may be very telling.
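
A small script makes the three measurements repeatable. This is a sketch, not a benchmark tool: the default URL is a placeholder, and an RT login page or an HTTP error still counts as a timed response. If the delay turns out nearly identical on every request, that may point to something timing out (a lookup or a connection attempt to an address that no longer answers) rather than to general load.

```python
import sys
import time
import urllib.request

def time_requests(url, attempts=3, timeout=120):
    """Fetch url `attempts` times in a row; return elapsed seconds per attempt."""
    timings = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                response.read()
        except OSError as error:
            # URLError, HTTPError and socket timeouts are OSError subclasses;
            # the elapsed time is still recorded for a failed request.
            print(f"request failed: {error}", file=sys.stderr)
        timings.append(time.perf_counter() - start)
    return timings

if __name__ == "__main__":
    # Placeholder default; pass the DR instance's RT URL as the first argument.
    url = sys.argv[1] if len(sys.argv) > 1 else "http://localhost/"
    for attempt, elapsed in enumerate(time_requests(url), start=1):
        print(f"attempt {attempt}: {elapsed:.2f}s")
```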

Guadagnino_Cristiano · September 4, 2015, 1:04pm

Hi Aaron.

-----Original message-----
From: Aaron C. de Bruyn aaron@heyaaron.com
Sent: Thu Sep 03 2015 19:30:40 GMT+0200 (CEST)
To: Guadagnino Cristiano guadagnino.cristiano@creval.it
Subject: Re: [rt-users] RT and Disaster Recovery - problem

>>> Are you using Apache, Nginx or something else to serve up RT?
>> Apache/2.2.15 (CentOS) Server at localhost Port 80

> If it’s bound to localhost, how are users accessing it? Or do you
> have something else on the box that has a public-facing IP that
> proxies traffic to port 127.0.0.1:80?

> Or do you have something like spawn_fcgi running on 127.0.0.1:80 with
> apache proxying?

RT is accessible only on our intranet.
Apache is configured with a few virtual domains, one of which is
dedicated to RT.
I don’t know why it reports localhost; however, this was captured on the
working production instance, so no problem here.

> Is it the first request, or all requests?
> I had an issue with spawn_fcgi if I recall correctly, that when the
> process first started it took ~45 seconds to serve the first page.
> After that, pages were snappy.
>
> -A

All requests, unfortunately.

Cris