On Thu, 5 Mar 2015 18:50:57 +0100 Palle Girgensohn
- Ensure that all data containing non-ASCII is quoted-printable
encoded for PostgreSQL, instead of merely all data not claiming to be
“text/plain”
Why is this? It seems it is doing more harm than good – it makes it
harder to use indexes and makes it harder to use it from the tables
directly? PostgreSQL is very capable of storing any kind of character
set? We use UTF-8 for everything, this fix breaks a lot of things for
us.
The commit in question is 3a9c38ed, which changes:
} elsif ( !$RT::Handle->BinarySafeBLOBs
&& $MIMEType !~ /text\/plain/gi
&& !Encode::is_utf8( $Body, 1 ) ) {
$ContentEncoding = 'quoted-printable';
}
…to:
} elsif ( !$RT::Handle->BinarySafeBLOBs
&& $Body =~ /\P{ASCII}/
&& !Encode::is_utf8( $Body, 1 ) ) {
$ContentEncoding = 'quoted-printable';
}
That is, any data which claimed to be “text/plain” would blindly be
attempted to be shoved into the database; this includes content which
contained invalid UTF-8 byte sequences. The commit was in RT 4.0; RT
4.2 is much better about transcoding to real UTF-8, or marking the part
as “application/octet-stream” if it cannot be decoded. In that sense,
this logic is now less necessary. However, the commit was absolutely
necessary at the time to not lose data. Erring on the side of data
preservation, at the expense of possibly-unnecessary encoding, seems
like not an unreasonable choice.
That being said, the Encode::is_utf8() check attempts to verify that we
only quoted-printable encode data whose bytes are not currently a
correctly-encoded UTF-8 byte stream. The bug now lies there, as after
the “utf8-reckoning” branch (2620658…af9fe7c), the $Body argument is
guaranteed to contain bytes, not characters. Per the Encode::is_utf8()
documentation:
[INTERNAL] Tests whether the UTF8 flag is turned on in the STRING. If
CHECK is true, also checks whether STRING contains well-formed UTF-8.
Returns true if successful, false otherwise.
The UTF8 flag is almost certainly off on $Body (it may be on, and
still contain only codepoints < 128, but this is unlikely), so the
well-formed-ness of the string (which is the relevant thing we wish to
confirm) is never checked.
I believe that this code may be replaced by (untested):
} elsif ( !$RT::Handle->BinarySafeBLOBs
&& !eval { Encode::decode( "UTF-8", $Body, Encode::FB_CROAK); 1 } ) {
$ContentEncoding = 'quoted-printable';
}
I may push a branch later that makes this change, and see what breaks.
All of that aside, note that “it makes it harder to use indexes” makes
little sense – there are no indexes on Content. The full-text index
uses its own tsvector, which it constructs after decoding the Content,
so the transfer encoding of the Content column is irrelevant to that.