Bug about subject in utf-8

Albert_Shih · August 31, 2016, 9:12pm

Hi everyone.

I’ve found a very weird encoding/decoding bug with subjects encoded in
UTF-8.

If a requestor sends an email with a subject encoded in UTF-8, like

Bonjour =?utf-8?q?=C3=A0?= vous

RT will create a ticket with a subject like this (encoded):

=?UTF-8?B?WyBSVFRBRyAjTlVNQkVSIF0gQm9uam91ciDDoCB2b3Vz?=

which decodes to something like

[ RTTAG #NUMBER ] Bonjour à vous
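
The decoded form can be checked by hand. The following is a minimal Ruby sketch, not RT code: `decode_encoded_word` is a hypothetical helper that handles a single RFC 2047 encoded-word of the form =?charset?encoding?payload?=, and the input is the B-encoded subject shown above, assuming the payload is only the base64 text.

```ruby
# Decode one RFC 2047 encoded-word (=?charset?encoding?payload?=) to UTF-8.
# Illustrative helper only; the name is made up for this example.
def decode_encoded_word(word)
  m = /\A=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=\z/.match(word)
  raise ArgumentError, "not an encoded-word: #{word}" unless m

  charset, encoding, payload = m[1], m[2].downcase, m[3]
  bytes = if encoding == "b"
            payload.unpack("m").first               # B encoding is base64
          else
            payload.tr("_", " ").unpack("M").first  # Q: "_" is a space, =XX a byte
          end
  # The decoded bytes are in the declared charset; convert them to UTF-8.
  bytes.force_encoding(charset).encode("UTF-8")
end

subject = decode_encoded_word("=?UTF-8?B?WyBSVFRBRyAjTlVNQkVSIF0gQm9uam91ciDDoCB2b3Vz?=")
puts subject
```

This prints the tagged subject with the accented character intact, which matches the plain-text form given above.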

So far everything is correct. The problem arises when the person who
answers this ticket sends a subject encoded like this:

=?utf-8?q?Re=3A?==?utf-8?q?_=5BRTTAG =?utf-8?q? #NUMBER=5D?= Bonjour =?utf-8?q?=C3=A0?= vous

because in that case RT drops the space between the RTTAG and the #NUMBER.
So the incoming mail (to RT) has the space, the outgoing mail drops the
space, and RT thinks it’s a new ticket and adds a new [ RTTAG #NUMBER ] tag.

I use RT 4.2.13 on FreeBSD 10 with all packages up to date.

Does anyone see a solution (besides changing $ExtractSubjectTagMatch and
$ExtractSubjectTagNoMatch)?

Regards.

JAS
Albert SHIH
DIO bâtiment 15
Observatoire de Paris
5 Place Jules Janssen
92195 Meudon Cedex
France
Téléphone : +33 1 45 07 76 26/+33 6 86 69 95 71
xmpp: jas@obspm.fr
Heure local/Local time:
mer 31 aoû 2016 22:49:09 CEST

chmrr · September 1, 2016, 6:41am

> So far everything is correct. The problem arises when the person who
> answers this ticket sends a subject encoded like this:
>
> =?utf-8?q?Re=3A?==?utf-8?q?_=5BRTTAG =?utf-8?q? #NUMBER=5D?= Bonjour =?utf-8?q?=C3=A0?= vous
>
> because in that case RT drops the space between the RTTAG and the #NUMBER.

What mail client is generating that? Whatever it is, it is violating
RFC 2047 spec in multiple ways.

First, RFC 2047 (section 2) says:

    unencoded white space characters (such as SPACE and HTAB) are
    FORBIDDEN within an 'encoded-word'

As such, "=?utf-8?q? #NUMBER=5D?=" is not a valid encoded-word.
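
Because white space always ends a token, this rule can be checked mechanically. The following is a minimal Ruby sketch under that assumption: `malformed_tokens` and `ENCODED_WORD` are hypothetical names, and the check only looks at the =?charset?encoding?payload?= shape, not at whether the charset or payload is meaningful.

```ruby
# A complete encoded-word: =?charset?encoding?payload?= with no "?" or white
# space inside any of the three fields.
ENCODED_WORD = /\A=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=\z/

# Split a header value on white space, as an atom-based parser would, and
# return the tokens that contain encoded-word delimiters without being
# exactly one well-formed encoded-word.
def malformed_tokens(header)
  header.split(/[ \t]+/).select do |token|
    (token.include?("=?") || token.include?("?=")) && !ENCODED_WORD.match?(token)
  end
end

subject = "=?utf-8?q?Re=3A?==?utf-8?q?_=5BRTTAG =?utf-8?q? #NUMBER=5D?= Bonjour =?utf-8?q?=C3=A0?= vous"
malformed_tokens(subject).each { |token| puts token }
```

For the subject quoted above this reports three tokens: the run-together start of the header, the bare "=?utf-8?q?", and "#NUMBER=5D?=". The last two are the pieces of the encoded-word that was split by the unencoded space.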

Secondly, RFC 2047 (section 5) says:

    However, an 'encoded-word' that appears in a header field defined as
    '*text' MUST be separated from any adjacent 'encoded-word' or 'text'
    by 'linear-white-space'.

As such, "=?utf-8?q?Re=3A?==?utf-8?" is not valid, as the two
encoded-words are not separated by spaces.

Even ignoring those errors, the example you gave still isn’t parsable.
My best attempt splits it into the following tokens:

    =?utf-8?q?Re=3A?=        # "Re:"
    =?utf-8?q?_=5BRTTAG      # " [RTTAG", but no closing "?=" ?!
    =?utf-8?q?#NUMBER=5D?=   # "#NUMBER]"
    Bonjour                  # "Bonjour"
    =?utf-8?q?=C3=A0?=       # "à"
    vous                     # "vous"

Were it somehow parsed as the above, RT would still be correct in
omitting the space before the number, because space between
encoded-words is removed, per RFC 2047 (section 6.2):

    When displaying a particular header field that contains multiple
    'encoded-word's, any 'linear-white-space' that separates a pair of
    adjacent 'encoded-word's is ignored.
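
The display rule quoted above can be illustrated with a small strict decoder. This is a sketch in Ruby, not RT's implementation: `decode_header` and `decode_payload` are hypothetical helpers, and the input is an assumed well-formed variant of the token list above, with the missing "?=" added and every token separated by a single space.

```ruby
EW = /\A=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=\z/

# Decode the payload of one encoded-word to a UTF-8 string.
def decode_payload(charset, encoding, payload)
  bytes = if encoding.downcase == "b"
            payload.unpack("m").first               # base64
          else
            payload.tr("_", " ").unpack("M").first  # Q: "_" is a space, =XX a byte
          end
  bytes.force_encoding(charset).encode("UTF-8")
end

def decode_header(header)
  parts = header.split(/([ \t]+)/)  # even indexes: tokens, odd indexes: white space
  decoded = parts.each_with_index.map do |part, i|
    if i.odd?
      # Drop white space only when both neighbours are encoded-words.
      between_encoded = EW.match?(parts[i - 1]) && EW.match?(parts[i + 1].to_s)
      between_encoded ? "" : part
    elsif (m = EW.match(part))
      decode_payload(m[1], m[2], m[3])
    else
      part
    end
  end
  decoded.join
end

puts decode_header("=?utf-8?q?Re=3A?= =?utf-8?q?_=5BRTTAG?= =?utf-8?q?#NUMBER=5D?= Bonjour =?utf-8?q?=C3=A0?= vous")
```

The result has no space between RTTAG and #NUMBER, because those two tokens are adjacent encoded-words, while the spaces around the plain word Bonjour survive.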

In short, fix the mail client. Failing that, set
$ExtractSubjectTagMatch, as this is not a bug in RT.

Alex

Albert_Shih · September 1, 2016, 7:42am

On 31/08/2016 at 23:41:07-0700, Alex Vandiver wrote:

>> So far everything is correct. The problem arises when the person who
>> answers this ticket sends a subject encoded like this:
>>
>> =?utf-8?q?Re=3A?==?utf-8?q?_=5BRTTAG =?utf-8?q? #NUMBER=5D?= Bonjour =?utf-8?q?=C3=A0?= vous
>>
>> because in that case RT drops the space between the RTTAG and the #NUMBER.
>
> What mail client is generating that? Whatever it is, it is violating

SOGo.

> RFC 2047 spec in multiple ways.

And yes, I haven’t found any other client that does that.

> First, RFC 2047 (section 2) says:
>
>     unencoded white space characters (such as SPACE and HTAB) are
>     FORBIDDEN within an 'encoded-word'
>
> As such, "=?utf-8?q? #NUMBER=5D?=" is not a valid encoded-word.

Well, I think that’s my bad: I changed the subject a little to match the
tag from my first email. The real subject is

=?utf-8?q?Re=3A?==?utf-8?q?_=5BInfo?= Obspm =?utf-8?q?#31684=5D?= Bonjour =?utf-8?q?=C3=A0?= vous

> Secondly, RFC 2047 (section 5) says:
>
>     However, an 'encoded-word' that appears in a header field defined as
>     '*text' MUST be separated from any adjacent 'encoded-word' or 'text'
>     by 'linear-white-space'.
>
> As such, "=?utf-8?q?Re=3A?==?utf-8?" is not valid, as the two
> encoded-words are not separated by spaces.

So can you just confirm that

=?utf-8?q?Re=3A?==?utf-8?q?_=5BInfo?= Obspm =?utf-8?q?#31684=5D?= Bonjour =?utf-8?q?=C3=A0?= vous

is still not valid (so I can file a bug report against the mail client)?

I’m not very good with Perl, but when I try decoding this line with Ruby

irb(main):008:0> Mail::Encodings.unquote_and_convert_to('=?utf-8?q?Re=3A?==?utf-8?q?_=5BInfo?= Obspm =?utf-8?q?#31684=5D?= Bonjour =?utf-8?q?=C3=A0?= vous','utf-8')
=> "Re: [Info Obspm #31684] Bonjour à vous"

the result seems correct. And if I try the other way

irb(main):009:0> Mail::Encodings.q_value_encode('Re: [Info Obspm #31684] Bonjour à vous','UTF-8')
=> "=?UTF-8?Q?Re:_[Info_Obspm_#31684]_Bonjour_=C3=A0_vous?="

Thanks a lot for your help

Regards.

Albert SHIH

chmrr · September 1, 2016, 8:39am

>> First, RFC 2047 (section 2) says:
>>
>>     unencoded white space characters (such as SPACE and HTAB) are
>>     FORBIDDEN within an 'encoded-word'
>>
>> As such, "=?utf-8?q? #NUMBER=5D?=" is not a valid encoded-word.
>
> Well, I think that’s my bad: I changed the subject a little to match the
> tag from my first email. The real subject is
>
> =?utf-8?q?Re=3A?==?utf-8?q?_=5BInfo?= Obspm =?utf-8?q?#31684=5D?= Bonjour =?utf-8?q?=C3=A0?= vous

OK, that’s a little different. Rather better. It still violates:

    However, an 'encoded-word' that appears in a header field defined as
    '*text' MUST be separated from any adjacent 'encoded-word' or 'text'
    by 'linear-white-space'.

But:

> I’m not very good with Perl, but when I try decoding this line with Ruby
>
> irb(main):008:0> Mail::Encodings.unquote_and_convert_to('=?utf-8?q?Re=3A?==?utf-8?q?_=5BInfo?= Obspm =?utf-8?q?#31684=5D?= Bonjour =?utf-8?q?=C3=A0?= vous','utf-8')
> => "Re: [Info Obspm #31684] Bonjour à vous"
>
> the result seems correct.

For decoders that are lenient to encoded-words that aren’t
space-separated, that’s correct. The difference between this and what
you had previously is the non-encoded word between the two
encoded-words, which makes the space significant.

And indeed, this does point to an RT bug. Namely, for historical and
bad reasons, RT doesn’t use the standard MIME-words decoding library,
which would produce:

perl -MEncode -lE 'print Encode::encode("utf8",
    Encode::decode("MIME-header",
    "=?utf-8?q?Re=3A?==?utf-8?q?_=5BInfo?= Obspm =?utf-8?q?#31684=5D?= Bonjour =?utf-8?q?=C3=A0?= vous"))'

Re: [Info Obspm #31684] Bonjour à vous

Instead, it rolls its own, and gets it wrong:

perl -Ilib -MRT=-init -le 'print RT::I18N::DecodeMIMEWordsToUTF8(
    "=?utf-8?q?Re=3A?==?utf-8?q?_=5BInfo?= Obspm =?utf-8?q?#31684=5D?= Bonjour =?utf-8?q?=C3=A0?= vous","Subject")'

Re: [Info Obspm#31684] Bonjourà vous

Specifically, it removes spaces before the second and later
encoded-words, due to

https://github.com/bestpractical/rt/blob/stable/lib/RT/I18N.pm#L445

                     ([^=]*)        # trailing
                    /xgcs;
    return $str unless @list;

    # add everything that hasn't matched to the end of the latest
    # string in array this happen when we have 'key="=?encoded?="; key="plain"'
    $list[-1] .= substr($str, pos $str);

    my @parts;
    while (@list) {
        my ($prefix, $charset, $encoding, $enc_str, $trailing) =
                splice @list, 0, 5;
        $charset  = _CanonicalizeCharset($charset);
        $encoding = lc $encoding;

        if ( $encoding eq 'q' ) {
            use MIME::QuotedPrint;
            $enc_str =~ tr/_/ /;              # RFC 2047, 4.2 (2)
            $enc_str = decode_qp($enc_str);
        } elsif ( $encoding eq 'b' ) {
            use MIME::Base64;

This looks to be a bug. I’ve pushed 4.2/encoded-word-spaces to
address it; if you’d like to test the fix locally, you can apply
commit bdd6bd9 ("Stop removing space before 2nd and later MIME
encoded-words") from bestpractical/rt on GitHub.
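
The difference between the two outputs above can be reproduced with a small model. This is a sketch in Ruby, not RT's code and not the actual fix: `decode_subject`, `BEFORE_WORD` and `BETWEEN_WORDS` are hypothetical names. As a simplification, `BEFORE_WORD` drops white space in front of every encoded-word, which for this subject has the same effect as dropping it before the second and later ones.

```ruby
EW = /=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=/

# White space directly in front of any encoded-word (models the wrong output).
BEFORE_WORD = /[ \t]+(?==\?[^?\s]+\?[bBqQ]\?)/
# White space that sits between the end of one encoded-word and the start of
# the next one (the only white space RFC 2047 says to ignore).
BETWEEN_WORDS = /(?<=\?=)[ \t]+(?==\?[^?\s]+\?[bBqQ]\?)/

# Decode the payload of one encoded-word to a UTF-8 string.
def decode_payload(charset, encoding, payload)
  bytes = if encoding.downcase == "b"
            payload.unpack("m").first               # base64
          else
            payload.tr("_", " ").unpack("M").first  # Q: "_" is a space, =XX a byte
          end
  bytes.force_encoding(charset).encode("UTF-8")
end

# Remove the white space selected by `drop`, then decode every encoded-word,
# leniently accepting words that are not separated by white space.
def decode_subject(header, drop)
  header.gsub(drop, "").gsub(EW) { decode_payload($1, $2, $3) }
end

subject = "=?utf-8?q?Re=3A?==?utf-8?q?_=5BInfo?= Obspm =?utf-8?q?#31684=5D?= Bonjour =?utf-8?q?=C3=A0?= vous"
puts decode_subject(subject, BEFORE_WORD)    # space lost after "Obspm" and "Bonjour"
puts decode_subject(subject, BETWEEN_WORDS)  # spaces next to plain words are kept
```

With `BETWEEN_WORDS`, a space is dropped only when both of its neighbours are encoded-words, so the plain word Obspm keeps the space that the ticket tag depends on.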

Thanks for the more complete bug report.

Alex

Albert_Shih · September 2, 2016, 2:08pm

On 01/09/2016 at 01:39:53-0700, Alex Vandiver wrote:

> This looks to be a bug. I’ve pushed 4.2/encoded-word-spaces to
> address it; if you’d like to test the fix locally, you can apply
> commit bdd6bd9 ("Stop removing space before 2nd and later MIME
> encoded-words") from bestpractical/rt on GitHub.

OK, I just applied this fix, and everything seems to work nicely.

Big thanks for the help.

> Thanks for the more complete bug report.

No, thank you…

Regards.

JAS
Albert SHIH