RT saves data in quoted-printable, why?

http://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/net/QuotedPrintableCodec.html

In the release notes for RT 4.2.2 [https://www.bestpractical.com/release-notes/rt/4.2.2]:

  • Ensure that all data containing non-ASCII is quoted-printable encoded
    for PostgreSQL, instead of merely all data not claiming to be
    "text/plain"

Why is this? It seems it is doing more harm than good – it makes it harder to use indexes and makes it harder to use it from the tables directly? PostgreSQL is very capable of storing any kind of character set? We use UTF-8 for everything, this fix breaks a lot of things for us.
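For context on what the encoding does to the stored data: quoted-printable replaces each non-ASCII byte with an `=XX` escape, which is why the `Content` column is no longer directly readable as UTF-8 text. A quick illustration (Python, since its `quopri` module implements the same encoding as RT's Perl code):

```python
import quopri

# UTF-8 text containing non-ASCII characters
text = "Hälsningar från Sverige".encode("utf-8")

encoded = quopri.encodestring(text)
# 'ä' (0xC3 0xA5 pair for 'å', 0xC3 0xA4 for 'ä') becomes =C3=A4 etc.,
# so the stored bytes are pure ASCII but no longer human-readable
print(encoded)

# The round trip is lossless
assert quopri.decodestring(encoded) == text
```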

I think this was a bad design decision, but maybe I can’t see the entire picture?

Best regards,
Palle



On Thu, 5 Mar 2015 18:50:57 +0100 Palle Girgensohn wrote:

  • Ensure that all data containing non-ASCII is quoted-printable
    encoded for PostgreSQL, instead of merely all data not claiming to be
    "text/plain"

Why is this? It seems it is doing more harm than good – it makes it
harder to use indexes and makes it harder to use it from the tables
directly? PostgreSQL is very capable of storing any kind of character
set? We use UTF-8 for everything, this fix breaks a lot of things for
us.

The commit in question is 3a9c38ed, which changes:

} elsif (    !$RT::Handle->BinarySafeBLOBs
          && $MIMEType !~ /text\/plain/gi
          && !Encode::is_utf8( $Body, 1 ) ) {
      $ContentEncoding = 'quoted-printable';
}

…to:

} elsif (    !$RT::Handle->BinarySafeBLOBs
          && $Body =~ /\P{ASCII}/
          && !Encode::is_utf8( $Body, 1 ) ) {
      $ContentEncoding = 'quoted-printable';
}

That is, any data claiming to be “text/plain” would blindly be shoved
into the database; this included content containing invalid UTF-8 byte
sequences, which PostgreSQL rejects. The commit was in RT 4.0; RT 4.2
is much better about transcoding to real UTF-8, or marking the part as
“application/octet-stream” if it cannot be decoded. In that sense,
this logic is now less necessary. However, the commit was absolutely
necessary at the time to avoid losing data. Erring on the side of data
preservation, at the expense of possibly-unnecessary encoding, does
not seem like an unreasonable choice.
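The difference between the two conditions can be sketched as follows (a Python analogue, not the actual Perl, and ignoring the Encode::is_utf8() guard for the moment): the old check keyed on the claimed MIME type, the new one on the bytes themselves, so content mislabelled as “text/plain” but carrying invalid UTF-8 no longer slips through unencoded.

```python
def old_needs_qp(mime_type: str, body: bytes) -> bool:
    # Old logic: only parts NOT claiming text/plain were QP-encoded
    return "text/plain" not in mime_type.lower()

def new_needs_qp(mime_type: str, body: bytes) -> bool:
    # New logic: any body containing a non-ASCII byte is QP-encoded,
    # regardless of the claimed MIME type
    return any(b > 0x7F for b in body)

# A part claiming text/plain but carrying a Latin-1 byte, invalid as UTF-8:
body = b"caf\xe9"
print(old_needs_qp("text/plain", body))  # False: shoved in raw, data at risk
print(new_needs_qp("text/plain", body))  # True: safely QP-encoded
```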

That being said, the Encode::is_utf8() check attempts to verify that we
only quoted-printable encode data whose bytes are not currently a
correctly-encoded UTF-8 byte stream. The bug now lies there, as after
the “utf8-reckoning” branch (2620658…af9fe7c), the $Body argument is
guaranteed to contain bytes, not characters. Per the Encode::is_utf8()
documentation:

[INTERNAL] Tests whether the UTF8 flag is turned on in the STRING. If
CHECK is true, also checks whether STRING contains well-formed UTF-8.
Returns true if successful, false otherwise.

The UTF8 flag is almost certainly off on $Body (it may be on, and
still contain only codepoints < 128, but this is unlikely), so the
well-formed-ness of the string (which is the relevant thing we wish to
confirm) is never checked.

I believe that this code may be replaced by (untested):

} elsif (    !$RT::Handle->BinarySafeBLOBs
          && !eval { Encode::decode( "UTF-8", $Body, Encode::FB_CROAK); 1 } ) {
      $ContentEncoding = 'quoted-printable';
}
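The intent of that proposed check can be mimicked in Python, where a strict decode plays the role of Encode::decode with FB_CROAK: the decision is made by actually validating the byte stream, not by inspecting a flag that is almost certainly off.

```python
def needs_quoted_printable(body: bytes) -> bool:
    """QP-encode anything that is not a well-formed UTF-8 byte stream."""
    try:
        body.decode("utf-8")   # strict decode, like Encode::FB_CROAK
        return False
    except UnicodeDecodeError:
        return True

print(needs_quoted_printable("naïve".encode("utf-8")))  # False: valid UTF-8
print(needs_quoted_printable(b"\xff\xfe"))              # True: not UTF-8
```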

I may push a branch later that makes this change, and see what breaks.

All of that aside, note that “it makes it harder to use indexes” makes
little sense – there are no indexes on Content. The full-text index
uses its own tsvector, which it constructs after decoding the Content,
so the transfer encoding of the Content column is irrelevant to that.

  • Alex


https://issues.bestpractical.com/Ticket/Display.html?id=29735
Zito

https://issues.bestpractical.com/Ticket/Display.html?id=29735

Aha – thanks for digging that out! I thought I vaguely recalled
something in this area previously.
https://issues.bestpractical.com/Ticket/Attachment/286095/157750/utf8-encoding.patch
looks to be functionally fairly similar to the branch.

There are a few other, orthogonal fixes in there that may still be
interesting to tease out into their own commits. It looks like I see
changes to:

  • Fix the computed max size of base64’d attachments; I’d need to
    squint at it harder, but seems eminently reasonable.

  • Attempt to gracefully deal with TruncateLongAttachments truncating
    mid-byte of UTF-8 data. As above; the decode/encode is an interesting
    trick to attempt to ensure that the byte stream is consistent. I’d
    like to test it a bit, but seems not unreasonable.

  • Choose base64 vs QP based on which is shorter; I’m less convinced by
    this, since it means that for large data, it gets QP’d, base64’d, and
    then one of those again – which isn’t terribly efficient. I’m less
    convinced by the tradeoff of computation time to stored in-database
    size.
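The mid-byte truncation problem in the second item can be sketched as follows (Python, with a hypothetical helper name; the patch itself does this in Perl): a naive byte-level cut can split a multi-byte UTF-8 sequence, and a lenient decode/encode round trip drops the dangling partial character, which is the trick described above.

```python
def truncate_utf8(body: bytes, limit: int) -> bytes:
    """Cut to at most `limit` bytes without leaving a partial UTF-8 character."""
    cut = body[:limit]
    # errors="ignore" silently drops a trailing incomplete multi-byte sequence
    return cut.decode("utf-8", errors="ignore").encode("utf-8")

body = "påverkan".encode("utf-8")   # 'å' is two bytes: 0xC3 0xA5
print(body[:2])                     # b'p\xc3' - ends mid-character
print(truncate_utf8(body, 2))       # b'p'     - clean cut
print(truncate_utf8(body, 3))       # b'p\xc3\xa5' - whole character kept
```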

If you’re interested in reworking the patch into a 2-3 commit series,
I’m happy to apply for 4.2-trunk.

  • Alex

Hi,

https://issues.bestpractical.com/Ticket/Display.html?id=29735

Aha – thanks for digging that out! I thought I vaguely recalled
something in this area previously.
https://issues.bestpractical.com/Ticket/Attachment/286095/157750/utf8-encoding.patch
looks to be functionally fairly similar to the branch.

Thanks for your attention to this…

There are a few other, orthogonal fixes in there that may still be
interesting to tease out into their own commits. It looks like I see
changes to:

  • Fix the computed max size of base64’d attachments; I’d need to
    squint at it harder, but seems eminently reasonable.

  • Attempt to gracefully deal with TruncateLongAttachments truncating
    mid-byte of UTF-8 data. As above; the decode/encode is an interesting
    trick to attempt to ensure that the byte stream is consistent. I’d
    like to test it a bit, but seems not unreasonable.

It is maybe not too efficient, but it is easy, and safety first. :-)

  • Choose base64 vs QP based on which is shorter; I’m less convinced by
    this, since it means that for large data, it gets QP’d, base64’d, and
    then one of those again – which isn’t terribly efficient. I’m less
    convinced by the tradeoff of computation time to stored in-database
    size.

You are right. My intention was to preserve as much readable text as
possible. A text may contain some invalid characters while the rest of
it is readable, so QP is more appropriate, because it leaves most of
the text readable.
Measuring the length of the Base64- vs. QP-encoded data gives an
indication of how many ASCII characters are there:

len(Base64) < len(QP): mostly binary data, probably an octet stream
len(QP) < len(Base64): mostly ASCII characters, probably text

But this is probably a corner case, and it is not very interesting.
Most text should be valid UTF-8, and the rest is not interesting these
days.
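That length heuristic can be made concrete (a Python sketch, not the patch's Perl): base64 always costs about 4/3 of the input regardless of content, while QP costs 1 byte per printable ASCII character but 3 per escaped byte, so comparing the two encoded lengths roughly measures the ASCII fraction.

```python
import base64
import quopri

def shorter_encoding(body: bytes) -> str:
    """Pick the transfer encoding that yields the smaller result."""
    qp = quopri.encodestring(body)
    b64 = base64.b64encode(body)
    return "quoted-printable" if len(qp) <= len(b64) else "base64"

print(shorter_encoding(b"mostly plain ASCII text"))  # quoted-printable
print(shorter_encoding(bytes(range(256))))           # base64
```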

If you’re interested in reworking the patch into a 2-3 commit series,
I’m happy to apply for 4.2-trunk.

  • Alex


This is a slightly newer version that I am using in a production instance of rt-4.2.9.
I will be happy if some part of it is usable for the RT mainline.

Thanks for the fine software!
Cheers
Zito