Overnight segfault since RT 3.8.1 upgrade

Hi all,

I’ve been having an odd issue since I upgraded from RT 3.8.0 to RT 3.8.1.
Every morning, any attempt to do anything with RT, including load the
"Login" page, results in a segfault from Apache. If I restart Apache,
everything works again until the next morning. I presume this means it’s
related to something that’s going on through cron, but since RT 3.8.0 was
working just fine, I’m not sure what that might be.

I have a fairly plain-vanilla CentOS 5.2 system (mysql-5.0.45-7.el5). I
can provide more details if they’d be helpful, but didn’t want to pollute
the mailing list unnecessarily. I did manage to use strace to capture what
was going on with Apache when a segfault happened, if that would be
helpful.

Has anyone seen this before? I don’t think I saw anything in the mailing
list archives, but I’d appreciate if anyone has any thoughts on what I can
do to fix this.

Thanks,

James

That sounds sort of like it could be related to the MySQL “morning bug”

Not sure though.

http://lists.bestpractical.com/cgi-bin/mailman/listinfo/rt-users

Community help: http://wiki.bestpractical.com
Commercial support: sales@bestpractical.com

Discover RT’s hidden secrets with RT Essentials from O’Reilly Media.
Buy a copy at http://rtbook.bestpractical.com

Hi,

I had the same problem on a SUSE 9.3 machine. I worked around it with a cron job that restarts the apache2 instance every morning.

I would also be interested if someone knew a better solution.

Best Regards,
Patrick

It does, but the thing that gets me is that this has only happened since I
upgraded to 3.8.1. If I switch back to 3.8.0 - granted, not something I
should do since there have been database changes - I don’t have the
overnight segfault problem anymore.

Thanks,

James

The database changes shouldn’t be “dangerous”, though that’s quite
interesting. By “switch back” do you mean restoring the server to an
older version or just switching the RT directory?

Can you catch a stacktrace from the segfault’s core dump? Can you make
it dump core without waiting for 10 hours of silence from your users?

-jesse

> By “switch back” do you mean restoring the server to an older
> version or just switching the RT directory?

I’m doing things in a slightly different way than the installation or
upgrading instructions call for, I think. Rather than upgrade /opt/rt3, I
do a fresh install to /opt/rt3-<build#>. I then copy over local changes,
and copy and check RT_SiteConfig.pm for updates. From there, I follow the
"upgrade" instructions as they relate to the database, and update my Apache
configs and my /opt/rt3 symlink. Doing it in this way lets me keep the
previous version intact in case something goes wrong and I need to switch
back in a hurry. All I need to do is change the symlink and the Apache
configs.
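
The symlink switch described above could be sketched roughly as follows. This is only an illustration under assumptions: the helper name switch_symlink and the scratch directory names are hypothetical, and nothing here comes from RT or its upgrade instructions. It creates the new link under a temporary name and renames it over the old one, so the link path never disappears partway through the switch.

```python
import os
import tempfile


def switch_symlink(link_path, new_target):
    """Point link_path at new_target, replacing any existing symlink.

    Hypothetical helper for illustration; not part of RT.
    """
    tmp_link = link_path + ".tmp"
    # Remove a leftover temporary link from an earlier interrupted run.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_target, tmp_link)
    # On POSIX systems the rename replaces the old link in a single step
    # and acts on the link itself, not on the directory it points to.
    os.replace(tmp_link, link_path)


# Demonstrate in a scratch directory rather than touching /opt.
base = tempfile.mkdtemp()
old_build = os.path.join(base, "rt3-old")  # stands in for the previous install
new_build = os.path.join(base, "rt3-new")  # stands in for the fresh install
os.mkdir(old_build)
os.mkdir(new_build)
current = os.path.join(base, "rt3")

switch_symlink(current, new_build)  # upgrade
switch_symlink(current, old_build)  # switch back in a hurry
print(os.readlink(current))
```

The Apache configs would still need to be changed separately, as noted above.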

> Can you catch a stacktrace from the segfault’s core dump? Can you make it
> dump core without waiting for 10 hours of silence from your users?

I’ll see what I can do, but I’ve only reliably seen this in the morning and
don’t know the exact trigger yet.

Thanks,

James

When I had this problem, it was a bug in DBD::mysql 4.007. I’m not sure of
the exact details, but it only happened on low-traffic sites. My guess is
that the connection pool times out when it shouldn’t, causing the HTTP
server to fail on the first try and then work again once the processes
reconnect. It seems to be related to the bug linked below. The fix is to
downgrade to 4.006 or even 3.x (your distribution’s version should have the
bugfix backported).

http://bugs.mysql.com/bug.php?id=36810
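
To illustrate the guess above, here is a minimal sketch of the ping-then-reconnect pattern that a long-lived server process would normally rely on after a quiet night. Everything in it is an assumption made for illustration: FakeConnection and ConnectionHolder are hypothetical stand-ins and not DBI or DBD::mysql APIs, and the 28800-second idle limit simply mirrors MySQL's default wait_timeout. Whether that timeout is what is actually being hit here is not confirmed.

```python
import time


class FakeConnection:
    """Stand-in for a database handle (hypothetical, not a real driver API)."""

    def __init__(self, idle_timeout, clock=time.monotonic):
        self.idle_timeout = idle_timeout
        self.clock = clock
        self.last_used = clock()

    def ping(self):
        # Model a server that drops connections idle longer than idle_timeout.
        alive = (self.clock() - self.last_used) <= self.idle_timeout
        if alive:
            self.last_used = self.clock()
        return alive


class ConnectionHolder:
    """Keeps one connection and replaces it when a ping reports it dead."""

    def __init__(self, connect):
        self._connect = connect
        self._conn = connect()
        self.reconnects = 0

    def get(self):
        # Ping before use; a failed ping should lead to a reconnect,
        # never to a crash.
        if not self._conn.ping():
            self._conn = self._connect()
            self.reconnects += 1
        return self._conn


# Simulated clock so the overnight gap can be replayed instantly.
now = [0.0]
holder = ConnectionHolder(lambda: FakeConnection(28800, clock=lambda: now[0]))
holder.get()            # evening request
now[0] += 10 * 3600     # ten quiet hours
holder.get()            # first request of the morning
print(holder.reconnects)
```

In this model the first morning request costs one reconnect and nothing else. A segfault at that point would suggest the ping itself is misbehaving, not the timeout.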

Curtis.

> Can you catch a stacktrace from the segfault’s core dump? Can you make it
> dump core without waiting for 10 hours of silence from your users?

It definitely looks like it’s MySQL-related now. I set CoreDumpDirectory
in my Apache configs and was greeted with a few cores this morning. Taking
a look at one of them (and skipping the “Reading” and “Loaded” symbols
statements), I saw:

Core was generated by `/usr/sbin/httpd'.
Program terminated with signal 11, Segmentation fault.
#0 0x00002aaab160d81e in mysql_ping ()
   from /usr/lib64/mysql/libmysqlclient.so.15
(gdb) thread apply all bt full

Thread 1 (process 1029):
#0 0x00002aaab160d81e in mysql_ping ()
   from /usr/lib64/mysql/libmysqlclient.so.15
No symbol table info available.
#1 0x00002afc41cb1fde in XS_DBD__mysql__db_ping (my_perl=, cv=) at mysql.xs:554
        dbh = (SV *) 0x2afc4667fe40
        RETVAL =
        sp =
        ax =
#2 0x00002afc39e94621 in XS_DBI_dispatch (my_perl=, cv=0x2afc472c0be0) at DBI.xs:3287
        markix = 0
        xscv = (CV *) 0xe
        sp = (SV **) 0x1
        ax = 1
        items = 1
        perinterp_sv =
        PERINTERP = (PERINTERP_t *) 0x2afc4858cf00
        h = (SV *) 0x2afc4667fe40
        st1 = (SV *) 0x2afc472c0be0
        st2 = (SV *) 0x2afc45e29820
        err_sv =
        tmp_svp =
        hook_svp = (SV **) 0x2afc461d2440
        mg =
        gimme = 0
        trace_flags = 0
        trace_level = 0
        is_DESTROY = 0
        is_unrelated_to_Statement = 1024
        keep_error = 1
        ErrCount = 0
        i =
        outitems = 1
        call_depth = 1
        is_nested_call = 0
        profile_t1 = 0
        meth_name = 0x2afc472c9bd0 "ping"
        ima = (const dbi_ima_t *) 0x2afc472cad10
        ima_flags =
        imp_xxh = (imp_xxh_t *) 0x2afc339350c0
        imp_msv = (SV *) 0x2afc338a40a0
        qsv = (SV *) 0x2afc49eb38c0
#3 0x00002aaab2a289f6 in Perl_pp_entersub ()
   from /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/CORE/libperl.so
No symbol table info available.
#4 0x00002aaab2a2229e in Perl_runops_standard ()
   from /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/CORE/libperl.so
No symbol table info available.
#5 0x00002aaab29cf5f0 in Perl_call_sv ()
   from /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/CORE/libperl.so
No symbol table info available.
#6 0x00002aaab2777b97 in modperl_callback ()
   from /etc/httpd/modules/mod_perl.so
No symbol table info available.
#7 0x00002aaab27782af in modperl_callback_run_handlers ()
   from /etc/httpd/modules/mod_perl.so
No symbol table info available.
#8 0x00002aaab27787ef in modperl_callback_per_dir ()
   from /etc/httpd/modules/mod_perl.so
No symbol table info available.
#9 0x00002aaab27728f0 in modperl_response_init ()
   from /etc/httpd/modules/mod_perl.so
No symbol table info available.
#10 0x00002aaab2772ab3 in modperl_response_handler_cgi ()
   from /etc/httpd/modules/mod_perl.so
No symbol table info available.
#11 0x00002afc3381f7ea in ap_run_handler () from /usr/sbin/httpd
No symbol table info available.
#12 0x00002afc33822c72 in ap_invoke_handler () from /usr/sbin/httpd
No symbol table info available.
#13 0x00002afc3382d5e8 in ap_process_request () from /usr/sbin/httpd
No symbol table info available.
#14 0x00002afc3382a870 in ap_register_input_filter () from /usr/sbin/httpd
No symbol table info available.
#15 0x00002afc33826a52 in ap_run_process_connection () from /usr/sbin/httpd
No symbol table info available.
#16 0x00002afc3383120b in ap_graceful_stop_signalled () from /usr/sbin/httpd
No symbol table info available.
#17 0x00002afc3383149a in ap_graceful_stop_signalled () from /usr/sbin/httpd
No symbol table info available.
#18 0x00002afc33831550 in ap_graceful_stop_signalled () from /usr/sbin/httpd
No symbol table info available.
#19 0x00002afc33832246 in ap_mpm_run () from /usr/sbin/httpd
No symbol table info available.
#20 0x00002afc3380ce04 in main () from /usr/sbin/httpd
No symbol table info available.
(gdb)
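
For anyone skimming a trace like this, the call chain can be pulled out of the gdb text with a few lines of script. The following is only a rough sketch: the helper name call_chain and the regular expression are my own assumptions about the frame-line layout shown above, and the sample is a shortened copy of the first few frames.

```python
import re

# Matches frame lines such as "#0 0x00002aaab160d81e in mysql_ping ()".
FRAME_RE = re.compile(r"^#(\d+)\s+0x[0-9a-fA-F]+ in (\w+) \(")


def call_chain(backtrace):
    """Return function names from gdb backtrace text, innermost frame first.

    Hypothetical helper for illustration only.
    """
    frames = {}
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            # Keyed by frame number, so a frame printed twice is kept once.
            frames[int(match.group(1))] = match.group(2)
    return [frames[number] for number in sorted(frames)]


sample = """\
#0 0x00002aaab160d81e in mysql_ping ()
from /usr/lib64/mysql/libmysqlclient.so.15
#1 0x00002afc41cb1fde in XS_DBD__mysql__db_ping (
#2 0x00002afc39e94621 in XS_DBI_dispatch (my_perl=,
#3 0x00002aaab2a289f6 in Perl_pp_entersub ()
"""

print(" <- ".join(call_chain(sample)))
```

Read from left to right, the result runs from the crashing function outward to its callers, which is how the trace points at the ping call made through DBD::mysql.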

I presume this means I should try a different version of DBD::mysql. I see
that I installed DBD::mysql 4.008 as part of the upgrade to RT 3.8.1. Does
this sound right? Anyone have any other thoughts?

Thanks,

James