Charset troubles

Hello

I just upgraded Apache to 2.4 and RT to the latest 3.8, and I get a charset
problem: anything that enters RT through rt-mailgate is fine, but any
non-ASCII character sent through the web interface gets corrupted: I get a ?
in a square instead, which is usually what happens when an ISO-8859-1
character is mistaken for UTF-8.
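
The symptom can be reproduced outside RT. Below is a minimal Python sketch, not taken from RT's code: the sample string is an arbitrary choice for illustration. It shows that a byte produced by ISO-8859-1 encoding is not valid UTF-8, so a lenient decoder substitutes U+FFFD, the replacement character that browsers typically draw as a ? in a box or diamond.

```python
# Illustration only: the sample text is an assumption, not data from RT.
text = "rhâ"

latin1_bytes = text.encode("iso-8859-1")   # b'rh\xe2'
utf8_bytes = text.encode("utf-8")          # b'rh\xc3\xa2'

# A lenient UTF-8 decoder replaces the invalid byte 0xE2 with U+FFFD.
misread = latin1_bytes.decode("utf-8", errors="replace")

# A strict UTF-8 decoder refuses the same bytes outright.
try:
    latin1_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```

Here `misread` ends with U+FFFD instead of â, and `strict_ok` is false.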

Older messages from before the upgrade display correctly, hence this is
really a problem at message POST time.

Using Firefox’s developer toolbar, I can see the POST request: it
contains no information about the charset. I assume this is why e-mail
behaves differently from the web interface: the former comes with a
Content-Type header featuring the charset information, while the latter
does not.

The page is being served as UTF-8, and the form does not say anything about
the accepted encoding (I tried patching it to specify that, but it did not
change anything). Acting on a UTF-8 page, the client should post in
UTF-8, but I have not been able to verify that (it goes through
SSL, which does not help for checking what happens on the wire).
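
One possible way to check the client encoding without reading the wire is to look at the request body shown by the browser's developer tools. The following Python sketch assumes the form is submitted as application/x-www-form-urlencoded (RT's forms may instead use multipart/form-data, in which case raw bytes are sent and this does not apply as-is). The helper name `looks_like_utf8` is hypothetical.

```python
from urllib.parse import quote, unquote_to_bytes

# The same character percent-encodes differently depending on the charset
# the client used when building the request body.
char = "â"
as_utf8 = quote(char, encoding="utf-8")          # two bytes: C3 A2
as_latin1 = quote(char, encoding="iso-8859-1")   # one byte: E2


def looks_like_utf8(percent_encoded: str) -> bool:
    """Return True if the percent-decoded bytes form valid UTF-8."""
    raw = unquote_to_bytes(percent_encoded)
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

If the POST body shows `%C3%A2` for â, the browser posted UTF-8; a lone `%E2` would point to ISO-8859-1.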

Does anyone have an idea of what is wrong? Here are the dependencies:

rt-3.8.17
p5-Apache-Session-1.93
p5-CGI-4.13
p5-CSS-Squish-0.10
p5-Cache-Simple-TimedExpiry-0.27
p5-Calendar-Simple-1.21
p5-Class-ReturnValue-0.55
p5-DBD-mysql-4.031
p5-DBI-1.633
p5-DBIx-SearchBuilder-1.66
p5-Data-ICal-0.22
p5-Email-Address-1.905
p5-FCGI-0.77
p5-File-ShareDir-1.102
p5-GD-2.53
p5-GDGraph-1.48
p5-GDTextUtil-0.86
p5-GnuPG-Interface-0.52
p5-HTML-Format-2.11
p5-HTML-Mason-1.56
p5-HTML-Parser-3.71
p5-HTML-RewriteAttributes-0.05
p5-HTML-Scrubber-0.11
p5-HTML-Tree-5.03
p5-HTTP-Server-Simple-0.44
p5-HTTP-Server-Simple-Mason-0.14
p5-Locale-Maketext-Fuzzy-0.11
p5-Locale-Maketext-Lexicon-1.00
p5-Log-Dispatch-2.44
p5-MIME-Types-2.09
p5-MIME-tools-5.505
p5-MailTools-2.14
p5-Module-Refresh-0.17
p5-Module-Versions-Report-1.06
p5-Net-3.05
p5-Net-Server-2.008
p5-PerlIO-eol-0.14
p5-Regexp-Common-2013031301
p5-Term-ReadKey-2.32
p5-Text-Autoformat-1.669004
p5-Text-Quoted-2.08
p5-Text-Template-1.46
p5-Text-WikiFormat-0.81
p5-Text-Wrapper-1.04
p5-Time-modules-2013.0912
p5-TimeDate-2.30
p5-Tree-Simple-1.23
p5-UNIVERSAL-require-0.17
p5-XML-RSS-1.56
p5-XML-Simple-2.20
p5-libwww-6.13
perl-5.20.2
apache-2.4.12
apr-1.5.1
apr-util-1.5.4
mod_fcgid-2.3.9
mod_perl-2.0.8
p5-libapreq2-2.12
postgresql-9.4.1

Relevant httpd.conf part for RT:
AddDefaultCharset UTF-8

<Directory "/usr/pkg/share/rt3/html/NoAuth/images/">
SetHandler none
Options -ExecCGI
allow from all
</Directory>

<Directory "/usr/pkg/share/rt3/html">
AddDefaultCharset UTF-8
SetHandler fcgid-script
Options +ExecCGI
</Directory>

In RT_SiteConfig.pm all I have about encoding is for e-mail:
Set(@EmailInputEncodings, qw(utf-8 iso-8859-1 us-ascii windows-1252));
Set($EmailOutputEncoding, 'iso-8859-1');

Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@netbsd.org

I just upgraded Apache to 2.4 and RT to latest 3.8

RT 3.8 reached end-of-life over a year ago. No release of Apache 2.4
had been made before RT 3.8 was in “critical security releases only.”
I’m unsurprised that there are incompatibilities between the two.

Please upgrade to a supported version of RT. The unmaintained 3.8
series also now has disclosed security vulnerabilities against it.

- Alex

Please upgrade to a supported version of RT.

After upgrading to RT 4.2.10, the problem vanished when updating tickets
on the web interface, but it still exists when creating a new ticket
from the web interface.

The generated HTML for creating and updating looks similar, hence I
assume it is the server-side handling that differs.

Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@netbsd.org

In the database, I have in both cases:
contenttype: text/plain
contentencoding: quoted-printable

An attachment that displays correctly (added by updating the ticket on
the web) indeed contains quoted-printable data:
content: rh=C3=A2=C3=A2=C3=A2

An attachment that does not display correctly (added at ticket creation
on the web) is not in quoted-printable:
content: rhâââ
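
The two stored values can be compared with a short Python sketch using the standard `quopri` module. The first value is copied from the row above. For the second, the exact bytes in the database are not shown in this thread, so ISO-8859-1 bytes are assumed purely for illustration.

```python
import quopri

# Row that displays correctly: quoted-printable wrapping UTF-8 bytes.
stored_ok = b"rh=C3=A2=C3=A2=C3=A2"
decoded_ok = quopri.decodestring(stored_ok)
text_ok = decoded_ok.decode("utf-8")

# Row that displays badly: labelled quoted-printable but stored raw.
# ASSUMPTION: the raw bytes are ISO-8859-1; the thread does not show them.
stored_bad = "rhâââ".encode("iso-8859-1")
decoded_bad = quopri.decodestring(stored_bad)   # nothing to unescape

try:
    decoded_bad.decode("utf-8")
    bad_is_utf8 = True
except UnicodeDecodeError:
    bad_is_utf8 = False
```

Under that assumption, decoding the second row as quoted-printable leaves the bytes unchanged, and they then fail UTF-8 decoding, which would match the broken display.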

Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@netbsd.org

In the database, I have in both cases

Which database? In your original mail, you said:

p5-DBD-mysql-4.031
[…]
postgresql-9.4.1

Which database are you using?

- Alex

PostgreSQL 9.4.1. And p5-DBD-postgresql-3.5.1 was missing in my list.

Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@netbsd.org

I may be able to debug some of it, but I would need some hints: where
is the attachment supposed to be converted into quoted-printable?
It happens through Ticket/Update.html but not through Ticket/Create.html.

The difference should not be that hard to spot.

Emmanuel Dreyfus
manu@netbsd.org

After upgrading to RT 4.2.10, the problem vanished when updating tickets
on the web interface, but it still exists when creating a new ticket
from the web interface.

The problem is still there with RT 4.2.11.
Hints on how to fix it would be welcome.

Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@netbsd.org

I fixed it. Replying to myself with the whole story for someone else’s
future reference.

The problem was database encoding. RT can use PostgreSQL with encoding
“UTF-8” or the default “SQL_ASCII”. The latter encoding means PostgreSQL
does not care about encoding and just gives back the bytes it was given
without any check. The former enforces UTF-8 usage and is able to
automatically transcode if the client claims to use another encoding.

My RT installation had been configured with the PostgreSQL database
using “UTF-8” encoding for a while. At some point I upgraded PostgreSQL
and reloaded the data from a dump after reinitializing the database.
But since I did not check for it, the new database got “SQL_ASCII”, a
setup where the application must take care of data encoding.

RT stores data as UTF-8, but it seems there are some conversions missing
in the code, especially on ticket creation through the web. I did not
find where it happens, but this action was introducing ISO-8859-1
characters in the database. After a few weeks, I had a database randomly
mixing ISO-8859-1 and UTF-8 data.

Fixing the situation required dumping the database, dropping it, creating
it again with “UTF-8” encoding, and reloading from the dump. But doing so
required cleaning the dump of any ISO-8859-1 characters; otherwise
PostgreSQL could not load it.

Using iconv(1) could not help, since there were also some UTF-8
characters in the database. I had to write external C functions for
PostgreSQL to perform queries such as
update attachments set content=qpfix(content),
contentencoding='quoted-printable' where not is_utf8(content);

is_utf8() is an external function that finds character sequences invalid
for UTF-8.
qpfix() is an external function that translates ISO-8859-1 into
quoted-printable UTF-8.
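
For readers who want to see the logic of the two helpers, here is a rough Python sketch of what they are described as doing. It is not the C code mentioned in this message (which is not shown in the thread). The function names are reused from the query above, and the bodies are assumptions based only on the descriptions.

```python
import quopri


def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (ASCII included)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def qpfix(data: bytes) -> bytes:
    """Reinterpret ISO-8859-1 bytes as text, then emit quoted-printable UTF-8.

    ASSUMPTION: the input really is ISO-8859-1; every byte sequence decodes
    as ISO-8859-1, so this cannot detect a wrong guess.
    """
    utf8 = data.decode("iso-8859-1").encode("utf-8")
    return quopri.encodestring(utf8)


# Mirror of the WHERE clause: only rows that are not valid UTF-8 get fixed.
rows = [b"rh\xc3\xa2", b"rh\xe2\xe2\xe2"]
fixed = [row if is_utf8(row) else qpfix(row) for row in rows]
```

One caveat with this approach: some ISO-8859-1 strings happen to be valid UTF-8 byte sequences, so a validity test alone can miss a few rows.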

This kind of fix had to be applied to various columns of the
attachments, users, and transactions tables. I can share the C code if
someone is interested.

After the proper fix, the database dump could be reimported into the UTF-8
encoded database, and the charset trouble on ticket creation from the
web disappeared.

Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@netbsd.org

I fixed it. Replying to myself with the whole story for someone else’s
future reference.

Good to hear the full debugging story.

The problem was database encoding. RT can use PostgreSQL with encoding
“UTF-8” or the default “SQL_ASCII”. The latter encoding means PostgreSQL
does not care about encoding and just gives back the bytes it was given
without any check. The former enforces UTF-8 usage and is able to
automatically transcode if the client claims to use another encoding.

My RT installation had been configured with the PostgreSQL database
using “UTF-8” encoding for a while. At some point I upgraded PostgreSQL
and reloaded the data from a dump after reinitializing the database.
But since I did not check for it, the new database got “SQL_ASCII”, a
setup where the application must take care of data encoding.

So this is the first place where things went awry. How was the
database created to reload the database dump, such that it got
SQL_ASCII? By hand using ‘createdb’ from the command line? And is
your template0 database marked as ‘SQL_ASCII’?

For reference,
https://docs.bestpractical.com/backups#Restoring-from-backups1 is the documented technique for loading in a Pg backup.

- Alex