Testing 4.0.0rc5 - experiences upgrading from 3.6.4 on RHEL6

So I figured I’d give 4.0.0rc5 a try on a disposable RHEL6 virtual machine tonight with a dump of our 3.6.4 database. I have to say, the various upgrade scripts seemed to have worked wonders, and I actually have a functioning 4.0.0rc5 running. It’s reporting a lot of errors in syslog however, most of them to do with ‘lifecycle’, such as:

Feb 18 20:01:51 rt RT: DBD::mysql::st execute failed: Unknown column ‘lifecycle’ in ‘field list’ at /usr/local/share/perl5/DBIx/SearchBuilder/Handle.pm line 509. (/usr/local/share/perl5/DBIx/SearchBuilder/Handle.pm:509)

The Queues table looks like:

mysql> describe Queues;
| Field | Type | Null | Key | Default | Extra |
| id | int(11) | NO | PRI | NULL | auto_increment |
| Name | varchar(200) | NO | UNI | | |
| Description | varchar(255) | YES | | NULL | |
| CorrespondAddress | varchar(120) | YES | | NULL | |
| CommentAddress | varchar(120) | YES | | NULL | |
| InitialPriority | int(11) | NO | | 0 | |
| FinalPriority | int(11) | NO | | 0 | |
| DefaultDueIn | int(11) | NO | | 0 | |
| Creator | int(11) | NO | | 0 | |
| Created | datetime | YES | | NULL | |
| LastUpdatedBy | int(11) | NO | | 0 | |
| LastUpdated | datetime | YES | | NULL | |
| Disabled | smallint(6) | NO | MUL | 0 | |
13 rows in set (0.01 sec)

What’d I miss?

A second, more fatal problem that has me completely baffled presently is that after the initial cycle of:

  • import old database
  • upgrade to 3.7.87 using the script that allows you to specify start/end revisions
  • generate the queries.sql file and then run those

I seem to be able to login once, but as soon as I explicitly logout, I can no longer login no matter how many mason/browser cache clearing loops I do.

A third problem is that I can’t seem to get RT to run when I first start apache. I get:

mkdir /opt/rt4/var/mason_data/obj/3452818388: Permission denied at /usr/local/share/perl5/HTML/Mason/Compiler/ToObject.pm line 107

Looks like the various directories were created as root:

[root@rt rt-4.0.0rc5]# ls -al /opt/rt4/var/mason_data/
total 16
drwxrwx—. 4 apache apache 4096 Feb 18 20:21 .
drwxr-xr-x. 5 root root 4096 Feb 18 17:50 …
drwxr-xr-x. 2 root root 4096 Feb 18 19:41 cache
drwxr-xr-x. 2 root root 4096 Feb 18 20:21 obj
[root@rt rt-4.0.0rc5]#

If I delete the subdirs under mason_data while apache is still running, it recreates the obj directory with a user ID it can deal with:

[root@rt rt-4.0.0rc5]# ls -al /opt/rt4/var/mason_data/
total 12
drwxrwx—. 3 apache apache 4096 Feb 18 20:25 .
drwxr-xr-x. 5 root root 4096 Feb 18 17:50 …
drwxr-xr-x. 3 apache apache 4096 Feb 18 20:25 obj
[root@rt rt-4.0.0rc5]#

and the site loads. Still can’t login tho.

Feb 18 20:12:24 rt RT: FAILED LOGIN for root from ##.##.##.## (/opt/rt4/sbin/…/lib/RT/Interface/Web.pm:605)
Feb 18 20:12:37 rt RT: FAILED LOGIN for root from ##.##.##.##(/opt/rt4/sbin/…/lib/RT/Interface/Web.pm:605)
Feb 18 20:13:08 rt RT: FAILED LOGIN for root from ##.##.##.## (/opt/rt4/sbin/…/lib/RT/Interface/Web.pm:605)

Help?

-d

So I figured I’d give 4.0.0rc5 a try on a disposable RHEL6 virtual machine tonight with a dump of our 3.6.4 database. I have to say, the various upgrade scripts seemed to have worked wonders, and I actually have a functioning 4.0.0rc5 running. It’s reporting a lot of errors in syslog however, most of them to do with ‘lifecycle’, such as:

Feb 18 20:01:51 rt RT: DBD::mysql::st execute failed: Unknown column ‘lifecycle’ in ‘field list’ at /usr/local/share/perl5/DBIx/SearchBuilder/Handle.pm line 509. (/usr/local/share/perl5/DBIx/SearchBuilder/Handle.pm:509)

It sure looks like something went wrong during your upgrades, as a “Lifecycle” column should have been added by the 3.9.7 upgrade script. Do you still have a log?

It sure looks like something went wrong during your upgrades, as a “Lifecycle” column should have been added by the 3.9.7 upgrade script. Do you still have a log?

Jesse,

I think I figured it out.

I had been using UPGRADING.mysql and README as my principal stepwise guide.

Line 3 of UPGRADING.mysql got me to 3.7.87.

I’m not sure what lines 4,5,6 accomplish exactly but I assumed they got me completely current (did all the rest). But clearly the schema was missing a lot of stuff.

On a hunch, I re-ran:

  /opt/rt4/sbin/rt-setup-database --dba root --prompt-for-dba-password --action upgrade

with a start release of 3.7.87 and no end (apply all updates) and I have a much healthier RT now. In fact, I can’t find a single darn thing wrong with it!! Very sexy new look. :wink:

(a little while later …)

Well maybe I can find one thing … I’m not able to get mailgate working any longer:

Feb 18 21:59:03 rt postfix/local[13847]: 052E6C064D: to=eval-support_ifax@rt4.ifax.com, relay=local, delay=0.39, delays=0.07/0/0/0.32, dsn=4.3.0, status=deferred (temporary failure. Command output: An Error Occurred ================= 302 Found )

I’m not sure if you can hand-crank mailgate or not, but trying, I see:

[root@rt postfix]# /opt/rt4/bin/rt-mailgate --queue sales --action comment --url http://rt4.ifax.com/ --debug
test
.
/opt/rt4/bin/rt-mailgate: temp file is ‘/tmp/D8IMPShIFr/6difOpAIKD’
/opt/rt4/bin/rt-mailgate: connecting to http://rt4.ifax.com//REST/1.0/NoAuth/mail-gateway
An Error Occurred

302 Found

/opt/rt4/bin/rt-mailgate: undefined server error

Which seems to be the same error as postfix was getting.

Apache log says:

[Fri Feb 18 22:06:32 2011] [error] Your request path is ‘/1.0/NoAuth/mail-gateway’ and it doesn’t match your Location(Match) ‘/opt/rt4/share/html/’. This should be due to the configuration error. See perldoc Plack::Handler::Apache2 for details.

Thoughts?

-d

[root@rt postfix]# /opt/rt4/bin/rt-mailgate --queue sales --action comment --url http://rt4.ifax.com/ --debug
test
.
/opt/rt4/bin/rt-mailgate: temp file is ‘/tmp/D8IMPShIFr/6difOpAIKD’
/opt/rt4/bin/rt-mailgate: connecting to http://rt4.ifax.com//REST/1.0/NoAuth/mail-gateway
An Error Occurred

302 Found

/opt/rt4/bin/rt-mailgate: undefined server error

Changing the invocation to:

/opt/rt4/bin/rt-mailgate --queue sales --action comment --url http://rt4.ifax.com

(ie: removing the trailing slash) fixes this problem. Presumably I could have handled that in the apache config as well.

Looking good!

-d

Okay, only a few problems remaining, but they’re doozies. Any help welcomed.

  1. Apache occasionally stops responding to http requests. I haven’t found a pattern yet, but it’s frequent. Unable to login, the page just hangs waiting for a response … the only log entry I see in apache’s error log is:

[Sat Feb 19 10:09:25 2011] [error] Your request path is ‘/index.html’ and it doesn’t match your Location(Match) ‘/opt/rt4/share/html/’. This should be due to the configuration error. See perldoc Plack::Handler::Apache2 for details.

I see that every time I try to reload the page, for what it’s worth. It turns out I see a line like that for EVERY page request, so it’s probably unrelated (but also worrying).

An apache restart clears the problem. I’ve applied RHEL6 updates & rebooted the box, but if the hang occurs again I’d welcome any suggestions on how to troubleshoot.

  1. At boot time, apache seems to setup the mason_data/cache directory as root, and stuff fails to run, reporting:

mkdir /opt/rt4/var/mason_data/obj/3452818388/standard/Install: Permission denied at /usr/local/share/perl5/HTML/Mason/Compiler/ToObject.pm line 107

  1. I haven’t had a lot of luck with the SELinux suggestions at: http://requesttracker.wikia.com/wiki/SELinux . Has anyone got RT and SELinux coexisting on RHEL6?

-Darren

Okay, only a few problems remaining, but they’re doozies. Any help welcomed.

  1. Apache occasionally stops responding to http requests. I haven’t found a pattern yet, but it’s frequent. Unable to login, the page just hangs waiting for a response … the only log entry I see in apache’s error log is:

Darren,

Can you try out this new version of Plack:

http://pause.perl.org/incoming/Plack-0.9970.tar.gz

We believe it may resolve your issues.

Thanks,

Jesse

Darren,

Can you try out this new version of Plack:

http://pause.perl.org/incoming/Plack-0.9970.tar.gz

We believe it may resolve your issues.

Jesse,

Thanks for your help!

I installed that yesterday, but had to restart apache about 6 hours later because rt-mailgate was failing to insert tickets correctly. Well, actually the email was getting added to the ticket, but postfix thought it was failing:

Feb 22 21:47:34 rt4 postfix/local[20829]: E2AE4C0672: to=support_ifax@rt4.corp.ifax.com, relay=local, delay=943, delays=762/0.02/0/180, dsn=4.3.0, status=deferred (temporary failure. Command output: An Error Occurred ================= 500 read timeout )
Feb 22 22:07:35 rt4 postfix/local[20920]: E2AE4C0672: to=support_ifax@rt4.corp.ifax.com, relay=local, delay=2143, delays=1963/0.02/0/180, dsn=4.3.0, status=deferred (temporary failure. Command output: An Error Occurred ================= 500 read timeout )

That’s just one 40-minute retry interval there, it kept injecting this transaction (and others) every 40 mins until I restarted apache. This seems like the same behavior I was seeing before installing this version of Plack.

Just to give you a bit more information about what I know here, I don’t see this when RT is busy, … it’s usually in the evenings when it’s been idle. And these read timeouts aren’t limited to rt-mailgate, as I mentioned the user just sees a seemingly eternal hang when trying to access RT via the web as well. One time when this happened I kept hitting reload about 20 times on the browser, and noticed that each time I did so an apache thread went into a “W” (sending reply) status as reported by mod_status. Not getting any joy there, I tried ‘apachectl graceful’ to restart httpd, and as you can see below each thread just went into a “G” state, but stayed there until I stopped apache and restarted it.

I’m very interested in getting to the bottom of this and will help any way I can … I’m just not entirely sure what to try next, or how to get any visibility into where the lockup is occurring.

Oh, and I’m still seeing these in the apache error log:

[Wed Feb 23 07:31:14 2011] [error] Your request path is ‘/NoAuth/Logout.html’ and it doesn’t match your Location(Match) ‘/opt/rt4/share/html/’. This should be due to the configuration error. See perldoc Plack::Handler::Apache2 for details.
[Wed Feb 23 07:45:16 2011] [error] Your request path is ‘/REST/1.0/NoAuth/mail-gateway’ and it doesn’t match your Location(Match) ‘/opt/rt4/share/html/’. This should be due to the configuration error. See perldoc Plack::Handler::Apache2 for details.
[Wed Feb 23 07:46:14 2011] [error] Your request path is ‘/robots.txt’ and it doesn’t match your Location(Match) ‘/opt/rt4/share/html/’. This should be due to the configuration error. See perldoc Plack::Handler::Apache2 for details.

-Darren

[root@rt4 ~]# apachectl status
Apache Server Status for localhost

Server Version: Apache/2.2.15 (Unix) DAV/2 PHP/5.3.2 mod_ssl/2.2.15
OpenSSL/1.0.0-fips mod_wsgi/3.2 Python/2.6.5 mod_perl/2.0.4 Perl/v5.10.1

Server Built: Aug 14 2010 08:53:48

Current Time: Monday, 21-Feb-2011 11:15:41 EST

Restart Time: Sunday, 20-Feb-2011 03:09:24 EST

Parent Server Generation: 4

Server uptime: 1 day 8 hours 6 minutes 17 seconds

Total accesses: 887 - Total Traffic: 16.3 MB

CPU Usage: u128.81 s53.18 cu0 cs0 - .157% CPU load

.00767 requests/sec - 147 B/second - 18.8 kB/request

34 requests currently being processed, 7 idle workers

GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG_G__W____…


Scoreboard Key:
“_” Waiting for Connection, “S” Starting up, “R” Reading Request,
“W” Sending Reply, “K” Keepalive (read), “D” DNS Lookup,
“C” Closing connection, “L” Logging, “G” Gracefully finishing,
“I” Idle cleanup of worker, “.” Open slot with no current process
[root@rt4 ~]# uptime
11:15:44 up 2 days, 22 min, 2 users, load average: 1.18, 0.37, 0.13
[root@rt4 ~]#

Darren,

I am looking at the hung response issue. for the Location error
you are getting, please update the apache config to use … instead of <Directory /opt/rt4/share/html>.

Thanks for looking into this! Just to be clear, the documentation in docs/web_deployment.pod suggests the following for mod_perl 2.xx:

           <Directory "/opt/rt4/share/html">
               Order allow,deny
               Allow from all

               SetHandler perl-script
               PerlResponseHandler Plack::Handler::Apache2
               PerlSetVar psgi_app /opt/rt4/sbin/rt-server
           </Directory>

In face my apache config is pretty much exactly as described in that section. You really want me to change Directory to Location?

If I do that, I see:

You haven’t yet configured your webserver to run RT. You appear to have installed RT’s web interface correctly, but haven’t yet configured your web server to “run” the RT server which powers the web interface. The next step is to edit your webserver’s configuration file to instruct it to use RT’s mod_perl or FastCGI handler. If you need commercial support, please contact us at sales@bestpractical.com.

And my RT is down. Please clarify your instructions here.

I’ve only been able to reproduce the hung issue when setting the
apache to have only one worker, and when the response finally
came out, i see the login page’s “time to display” being
something normal, like 0.09 secs, so that means the congestion
happened in apache or the modperl glue somehow.

Interesting data point, thanks.

In your case, does the page eventually came out rendered at all,
like after 5min?

I will try to let it wait longer … I think I have waited as long as 10 minutes in the past, but it’s worth trying and reporting back to you.

and can you see if it makes any difference if you change around line
222 of RT::Interface::Web::Handler from:

my $h = RT::Interface::Web::Handler::NewHandler(
‘HTML::Mason::PSGIHandler::Streamy’);
to:

my $h = RT::Interface::Web::Handler::NewHandler(
‘HTML::Mason::PSGIHandler’);

I will try this next time we see a lockup. Oddly, we did not have one at all yesterday (during heavy use) or last night (when RT was mostly idle). Perhaps the new version of Plack that Jesse had me install improved the situation after all?

-d

 Darren,

 I am looking at the hung response issue.  for the Location error
 you are getting, please update the apache config to use <Location
 /> ... </Location> instead of <Directory /opt/rt4/share/html>.

Thanks for looking into this! Just to be clear, the documentation in docs/web_deployment.pod
suggests the following for mod_perl 2.xx:

CL has been revamping those docs, his recommendation comes from those
updates. You can see his current branch (with tweaked docs) at
https://github.com/bestpractical/rt/blob/4.0%2Fapache2-fixes/docs/web_deployment.pod

-kevin

In your case, does the page eventually came out rendered at all,
like after 5min?

I will try to let it wait longer … I think I have waited as long as 10 minutes in the past, but it’s worth trying and reporting back to you.

So, we had two instances of the hang today, and they each followed a similar pattern.

In at least one case the problem cleared itself after some time. I do not know how long, it was a background apache thread started by rt-mailgate that eventually cleared, it was not a browser session.

Of the http worker threads, each one is blocked in a semop call:

[root@rt4 Plack-0.9970]# strace -p 30307
Process 30307 attached - interrupt to quit
semop(1802244, {{0, -1, SEM_UNDO}}, 1^C <unfinished …>
Process 30307 detached
[root@rt4 Plack-0.9970]# strace -p 30308
Process 30308 attached - interrupt to quit
semop(1802244, {{0, -1, SEM_UNDO}}, 1^C <unfinished …>
Process 30308 detached
[root@rt4 Plack-0.9970]# strace -p 30309
Process 30309 attached - interrupt to quit
semop(1802244, {{0, -1, SEM_UNDO}}, 1^C <unfinished …>
Process 30309 detached

except for one which is reading from fd 1:

[root@rt4 Plack-0.9970]# strace -p 30310
Process 30310 attached - interrupt to quit
read(1, ^C <unfinished …>

That FD is a network connection to our database server:

[root@rt4 Plack-0.9970]# ls -l /proc/30310/fd/1
lrwx------. 1 root root 64 Feb 24 17:44 /proc/30310/fd/1 → socket:[281592]
[root@rt4 Plack-0.9970]# netstat -antep | grep 281592
tcp 0 5 10.0.12.149:49410 10.0.11.100:3306 ESTABLISHED 48 281592 30310/httpd

The database server has no record of that tcp connection any longer, and mysqladmin processlist shows all threads sleeping.

and can you see if it makes any difference if you change around line
222 of RT::Interface::Web::Handler from:

my $h = RT::Interface::Web::Handler::NewHandler(
‘HTML::Mason::PSGIHandler::Streamy’);
to:

my $h = RT::Interface::Web::Handler::NewHandler(
‘HTML::Mason::PSGIHandler’);

I have not yet tried this. Given the new detailed information above, does it still make sense to do so?

-Darren

That FD is a network connection to our database server:

[root@rt4 Plack-0.9970]# ls -l /proc/30310/fd/1
lrwx------. 1 root root 64 Feb 24 17:44 /proc/30310/fd/1 → socket:[281592]
[root@rt4 Plack-0.9970]# netstat -antep | grep 281592
tcp 0 5 10.0.12.149:49410 10.0.11.100:3306 ESTABLISHED 48 281592 30310/httpd

The database server has no record of that tcp connection any longer, and mysqladmin processlist shows all threads sleeping.

bump any thoughts on this?

Should I be looking outiside of RT for a fix for this? Is this more likely to be a kernel/networking problem in RHEL6 or with mod_perl or the MySQL DBD/DBI stuff? Looks like RT is using persistent database connections, but that when the lockup occurs it’s usually because apache is waiting to get data back from a MySQL socket that no longer exists on the database server …

-d

Darren,

That FD is a network connection to our database server:

[root@rt4 Plack-0.9970]# ls -l /proc/30310/fd/1
lrwx------. 1 root root 64 Feb 24 17:44 /proc/30310/fd/1 → socket:
[281592]
[root@rt4 Plack-0.9970]# netstat -antep | grep 281592
tcp 0 5 10.0.12.149:49410
10.0.11.100:3306 ESTABLISHED 48 281592 30310/
httpd

The database server has no record of that tcp connection any
longer, and mysqladmin processlist shows all threads sleeping.

bump any thoughts on this?

Should I be looking outiside of RT for a fix for this? Is this more
likely to be a kernel/networking problem in RHEL6 or with mod_perl
or the MySQL DBD/DBI stuff? Looks like RT is using persistent
database connections, but that when the lockup occurs it’s usually
because apache is waiting to get data back from a MySQL socket that
no longer exists on the database server …

It seems the issue is with apache2 and plack. i can reproduce the
stalled connection (however not always) with the
plack-req.psgi example from the plack dist, with apache 2.2.16 on
ubuntu maverick. however it’s a bit different from yours.

strace says:

close(0) = 0
read(5, 0x7fffff1f67af, 1) = -1 EAGAIN (Resource
temporarily unavailable)
accept(4, {sa_family=AF_INET, sin_port=htons(64924),
sin_addr=inet_addr(“114.44.215.78”)}, [16]) = 0
fcntl(0, F_GETFD) = 0
fcntl(0, F_SETFD, FD_CLOEXEC) = 0
getsockname(0, {sa_family=AF_INET, sin_port=htons(6254),
sin_addr=inet_addr(“10.130.158.43”)}, [16]) = 0
fcntl(0, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(0, F_SETFL, O_RDWR|O_NONBLOCK) = 0
read(0, 0x7f8e98b0bf58, 8000) = -1 EAGAIN (Resource
temporarily unavailable)
poll([{fd=0, events=POLLIN}], 1, 300000

Cheers,
CLK

It seems the issue is with apache2 and plack. i can reproduce the stalled connection (however not always) with the
plack-req.psgi example from the plack dist, with apache 2.2.16 on ubuntu maverick. however it’s a bit different from yours.

You’re definitely looking at a different problem.

Is there any way I can make the database connections non-persistent? That might help the ones that are getting confused.

-Darren

It seems the issue is with apache2 and plack. i can reproduce the
stalled connection (however not always) with the
plack-req.psgi example from the plack dist, with apache 2.2.16 on
ubuntu maverick. however it’s a bit different from yours.

You’re definitely looking at a different problem.

Is there any way I can make the database connections non-persistent?
That might help the ones that are getting confused.

Try to remove the Plack::Handler::Apache2->preload line and make
MaxRequestsPerChild 1 ?