Bug with UTF-8 encoded subjects

Hi everyone.

I have found a very strange bug in how RT decodes and re-encodes a subject
that is encoded in UTF-8.

If a requestor sends an email whose subject is encoded in UTF-8, like

Bonjour =?utf-8?q?=C3=A0?= vous

RT creates a ticket with a subject like this (encoded):

=?UTF-8?B?WyBSVFRBRyAjTlVNQkVSIF0gQm9uam91ciDDoCB2b3Vz=?=

which means something like

[ RTTAG #NUMBER ] Bonjour à vous

So up to this point everything is correct. The problem arises when the
person who answers this ticket sends a subject encoded like this:

=?utf-8?q?Re=3A?==?utf-8?q?_=5BRTTAG =?utf-8?q? #NUMBER=5D?= Bonjour =?utf-8?q?=C3=A0?= vous

because in that case RT drops the space between the RTTAG and the #NUMBER.
The incoming mail (to RT) has the space, the outgoing mail drops it, so RT
thinks it is a new ticket and adds a new set of [ RTTAG #NUMBER ].

I use RT 4.2.13 on FreeBSD 10 with all packages up to date.

Does anyone see a solution (besides changing $ExtractSubjectTagMatch and
$ExtractSubjectTagNoMatch)?

Regards.

JAS
Albert SHIH
DIO bâtiment 15
Observatoire de Paris
5 Place Jules Janssen
92195 Meudon Cedex
France
Téléphone : +33 1 45 07 76 26/+33 6 86 69 95 71
xmpp: jas@obspm.fr
Heure local/Local time:
mer 31 aoû 2016 22:49:09 CEST

> So up to this point everything is correct. The problem arises when the
> person who answers this ticket sends a subject encoded like this:
>
> =?utf-8?q?Re=3A?==?utf-8?q?_=5BRTTAG =?utf-8?q? #NUMBER=5D?= Bonjour =?utf-8?q?=C3=A0?= vous
>
> because in that case RT drops the space between the RTTAG and the #NUMBER.

What mail client is generating that? Whatever it is, it is violating
the RFC 2047 spec in multiple ways.

First, https://tools.ietf.org/html/rfc2047#page-5 :

    unencoded white space characters (such as SPACE and HTAB) are
    FORBIDDEN within an 'encoded-word'

As such, “=?utf-8?q? #NUMBER=5D?=” is not a valid encoded-word.

Secondly, https://tools.ietf.org/html/rfc2047#page-7 :

    However, an 'encoded-word' that appears in a header field defined as
    '*text' MUST be separated from any adjacent 'encoded-word' or 'text'
    by 'linear-white-space'.

As such, “=?utf-8?q?Re=3A?==?utf-8?” is not valid, as the two
"encoded-word"s are not separated by spaces.

Even ignoring those errors, the example you gave still isn’t parsable.
My best attempt splits it into the following tokens:

=?utf-8?q?Re=3A?=        # "Re:"
=?utf-8?q?_=5BRTTAG      # " [RTTAG", but no closing "?=" ?!
=?utf-8?q?#NUMBER=5D?=   # "#NUMBER]"
Bonjour                  # "Bonjour"
=?utf-8?q?=C3=A0?=       # "à"
vous                     # "vous"

Were it somehow parsed as the above, RT would still be correct in
omitting the space before the number, because space between
encoded-words is removed, https://tools.ietf.org/html/rfc2047#page-10 :

    When displaying a particular header field that contains multiple
    'encoded-word's, any 'linear-white-space' that separates a pair of
    adjacent 'encoded-word's is ignored.
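The whitespace rule quoted above can be illustrated with a minimal decoder sketch in plain Ruby. It is not RT's code and not a complete RFC 2047 implementation: the names ENCODED_WORD, decode_word and decode_header are invented for this example, only the Q and B encodings are handled, and, as a leniency assumption, encoded-words are recognised even when they are not separated by white space.

```ruby
# Illustrative sketch only; all names here are hypothetical.
ENCODED_WORD = /=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=/

# Decode the payload of one encoded-word into a UTF-8 string.
def decode_word(charset, encoding, text)
  bytes =
    if encoding.downcase == 'q'
      # Q encoding: "_" stands for a space, "=XX" for the byte with hex value XX.
      text.b.tr('_', ' ').gsub(/=([0-9A-Fa-f]{2})/) { [$1].pack('H2') }
    else
      # B encoding is plain base64.
      text.unpack1('m')
    end
  bytes.force_encoding(charset).encode('UTF-8')
end

def decode_header(value)
  out = +''
  pos = 0
  seen_encoded = false
  value.scan(ENCODED_WORD) do
    m = Regexp.last_match
    gap = value[pos...m.begin(0)]
    # White space that only separates two encoded-words is dropped;
    # anything else between them is kept as it is.
    out << gap unless seen_encoded && gap.strip.empty?
    out << decode_word(m[1], m[2], m[3])
    pos = m.end(0)
    seen_encoded = true
  end
  out << value[pos..-1]
end

puts decode_header('=?utf-8?q?a?= =?utf-8?q?b?=')     # prints "ab"
puts decode_header('Bonjour =?utf-8?q?=C3=A0?= vous') # prints "Bonjour à vous"
```

In the first call the space sits between two encoded-words and disappears; in the second call the spaces touch plain text and are kept.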

In short, fix the mail client. Failing that, set
$ExtractSubjectTagMatch, as this is not a bug in RT.

- Alex

On 31/08/2016 at 23:41:07-0700, Alex Vandiver wrote:

> What mail client is generating that? Whatever it is, it is violating

SOGo.

> RFC 2047 spec in multiple ways.

And yes, I have not found any other client that does this.

> First, https://tools.ietf.org/html/rfc2047#page-5
> unencoded white space characters (such as SPACE and HTAB) are
> FORBIDDEN within an 'encoded-word'
>
> As such, "=?utf-8?q? #NUMBER=5D?=" is not a valid encoded-word.

Well, I think that is my fault: I altered the subject slightly to match the
tag used in my first email. The real subject is

=?utf-8?q?Re=3A?==?utf-8?q?_=5BInfo?= Obspm =?utf-8?q?#31684=5D?= Bonjour =?utf-8?q?=C3=A0?= vous

> Secondly, https://tools.ietf.org/html/rfc2047#page-7
> However, an 'encoded-word' that appears in a header field defined as
> '*text' MUST be separated from any adjacent 'encoded-word' or 'text'
> by 'linear-white-space'.
>
> As such, "=?utf-8?q?Re=3A?==?utf-8?" is not valid, as the two
> "encoded-word"s are not separated by spaces.

So can you just confirm that

=?utf-8?q?Re=3A?==?utf-8?q?_=5BInfo?= Obspm =?utf-8?q?#31684=5D?= Bonjour =?utf-8?q?=C3=A0?= vous

is still not valid (so that I can file a bug report against the mail client)?
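One rough way to check a header against the two RFC 2047 rules quoted above is a pair of regular expressions. The sketch below is only an approximation of those two rules, not a full validator, and rfc2047_problems is a name invented for this example rather than part of RT or the Mail gem.

```ruby
# Illustrative sketch only; rfc2047_problems is a hypothetical helper.
def rfc2047_problems(header)
  problems = []
  # Rule 1: an encoded-word must not contain unencoded white space.
  inner_space = /=\?[^?\s]+\?[bBqQ]\?[^?\s]*\s[^?]*\?=/
  problems << :whitespace_inside_encoded_word if header.match?(inner_space)
  # Rule 2: "?=" directly followed by "=?" means that two encoded-words
  # are not separated by linear white space.
  problems << :missing_separator if header.match?(/\?==\?/)
  problems
end

first = '=?utf-8?q?Re=3A?==?utf-8?q?_=5BRTTAG =?utf-8?q? #NUMBER=5D?= Bonjour =?utf-8?q?=C3=A0?= vous'
real  = '=?utf-8?q?Re=3A?==?utf-8?q?_=5BInfo?= Obspm =?utf-8?q?#31684=5D?= Bonjour =?utf-8?q?=C3=A0?= vous'

p rfc2047_problems(first) # => [:whitespace_inside_encoded_word, :missing_separator]
p rfc2047_problems(real)  # => [:missing_separator]
```

Under these two checks the altered subject from the first email trips both rules, while the real subject only trips the missing separator between the first two encoded-words.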

I am not very good with Perl, but when I use Ruby to decode this line

irb(main):008:0> Mail::Encodings.unquote_and_convert_to('=?utf-8?q?Re=3A?==?utf-8?q?_=5BInfo?= Obspm =?utf-8?q?#31684=5D?= Bonjour =?utf-8?q?=C3=A0?= vous','utf-8')
=> "Re: [Info Obspm #31684] Bonjour à vous"

the result seems correct. If I try it the other way round, I get:

irb(main):009:0> Mail::Encodings.q_value_encode('Re: [Info Obspm #31684] Bonjour à vous','UTF-8')
=> "=?UTF-8?Q?Re:[Info_Obspm#31684]Bonjour=C3=A0_vous?="

> In short, fix the mail client. Failing that, set
> $ExtractSubjectTagMatch, as this is not a bug in RT.

Thanks a lot for your help.

Regards.

Albert SHIH
Heure local/Local time:
jeu 1 sep 2016 09:21:31 CEST

> Well, I think that is my fault: I altered the subject slightly to match the
> tag used in my first email. The real subject is
>
> =?utf-8?q?Re=3A?==?utf-8?q?_=5BInfo?= Obspm =?utf-8?q?#31684=5D?= Bonjour =?utf-8?q?=C3=A0?= vous

OK, that’s a little different. Rather better. It still violates:

    However, an 'encoded-word' that appears in a header field defined as
    '*text' MUST be separated from any adjacent 'encoded-word' or 'text'
    by 'linear-white-space'.

But:

> I am not very good with Perl, but when I use Ruby to decode this line
>
> irb(main):008:0> Mail::Encodings.unquote_and_convert_to('=?utf-8?q?Re=3A?==?utf-8?q?_=5BInfo?= Obspm =?utf-8?q?#31684=5D?= Bonjour =?utf-8?q?=C3=A0?= vous','utf-8')
> => "Re: [Info Obspm #31684] Bonjour à vous"
>
> the result seems correct.

For decoders that are lenient to encoded-words that aren’t
space-separated, that’s correct. The difference between this and what
you had previously is the non-encoded word between the two
encoded-words, which makes the space significant.

And indeed, this does point to an RT bug. Namely, for historical and
bad reasons, RT doesn’t use the standard MIME-words decoding library,
which would produce:

perl -MEncode -lE 'print Encode::encode("utf8",
  Encode::decode("MIME-header",
  "=?utf-8?q?Re=3A?==?utf-8?q?_=5BInfo?= Obspm =?utf-8?q?#31684=5D?= Bonjour =?utf-8?q?=C3=A0?= vous"))'

Re: [Info Obspm #31684] Bonjour à vous

Instead, it rolls its own, and gets it wrong:

perl -Ilib -MRT=-init -le 'print RT::I18N::DecodeMIMEWordsToUTF8(
  "=?utf-8?q?Re=3A?==?utf-8?q?_=5BInfo?= Obspm =?utf-8?q?#31684=5D?= Bonjour =?utf-8?q?=C3=A0?= vous","Subject")'

Re: [Info Obspm#31684] Bonjourà vous

Specifically, it removes spaces before the second and later
encoded-words, due to
https://github.com/bestpractical/rt/blob/stable/lib/RT/I18N.pm#L445
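The difference between the two behaviours can be modelled without any decoding at all. The sketch below is not RT's code: SEGMENTS, join_rfc and join_buggy are hypothetical names, the real subject is split by hand into already-decoded encoded-words and plain text, and join_buggy is only an assumed model of "removes spaces before the second and later encoded-words".

```ruby
# Segments of the real subject; :encoded entries are shown already decoded.
SEGMENTS = [
  [:encoded, 'Re:'],
  [:encoded, ' [Info'],
  [:text,    ' Obspm '],
  [:encoded, '#31684]'],
  [:text,    ' Bonjour '],
  [:encoded, 'à'],
  [:text,    ' vous']
].freeze

# RFC 2047 rule: a text segment is dropped only if it is pure white space
# and sits directly between two encoded-words.
def join_rfc(segments)
  segments.each_with_index.map do |(kind, text), i|
    between = i > 0 && i + 1 < segments.size &&
              segments[i - 1][0] == :encoded && segments[i + 1][0] == :encoded
    if kind == :text && between && text.strip.empty?
      ''
    else
      text
    end
  end.join
end

# Assumed model of the behaviour described above (not RT's actual code):
# white space in front of every encoded-word after the first is removed.
def join_buggy(segments)
  out = +''
  seen_encoded = false
  segments.each do |kind, text|
    out.rstrip! if kind == :encoded && seen_encoded
    out << text
    seen_encoded = true if kind == :encoded
  end
  out
end

puts join_rfc(SEGMENTS)   # prints "Re: [Info Obspm #31684] Bonjour à vous"
puts join_buggy(SEGMENTS) # prints "Re: [Info Obspm#31684] Bonjourà vous"
```

The second line reproduces the missing spaces seen in the DecodeMIMEWordsToUTF8 output, which is what stops the "[Info Obspm #31684]" tag from being recognised.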

This looks to be a bug. I’ve pushed 4.2/encoded-word-spaces to
address it; if you’d like to test the fix locally, you can apply
https://github.com/bestpractical/rt/commit/bdd6bd96 .

Thanks for the more complete bug report.

- Alex

On 01/09/2016 at 01:39:53-0700, Alex Vandiver wrote:

> Specifically, it removes spaces before the second and later
> encoded-words, due to
> https://github.com/bestpractical/rt/blob/stable/lib/RT/I18N.pm#L445
>
> This looks to be a bug. I've pushed 4.2/encoded-word-spaces to
> address it; if you'd like to test the fix locally, you can apply
> https://github.com/bestpractical/rt/commit/bdd6bd96 .

OK, I have just applied this fix, and everything seems to work nicely.

Many thanks for the help.

> Thanks for the more complete bug report.

No, thank you…

Regards.

JAS
Albert SHIH
Heure local/Local time:
ven 2 sep 2016 16:05:55 CEST