RT 3.8 mangles html attachment

Tim_Cutts · February 22, 2009, 8:12am

Correction, the weird question mark characters are between every character
in the original document. Like so:

<�h�e�a�d�>� � �<�m�e�t�a�

Fascinating. Does it do this with all html attachments?

That looks suspiciously like full 16-bit Unicode to me.

Tim

Tom_Lahti · February 23, 2009, 7:14pm

It was pointed out earlier in the thread that the encoding was UTF-16LE. I
was surprised when no one said “the conversion library we use doesn’t
support automatic detection of/conversion from 16-bit encodings”, but I
don’t know what is being used in RT.

With the iconv library, if you want to convert from UTF-16, you have to
specify it as the “from” code. As far as I know. But it does work if you do.
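
As a rough illustration of why the "from" encoding has to be named, here is a minimal sketch in Python rather than iconv itself. The sample string is made up, and nothing here reflects what RT actually does.

```python
# Hypothetical sample: a fragment of an HTML attachment stored as UTF-16LE
# with no byte order mark.
raw = "<head> <meta".encode("utf-16-le")

# Read as a single-byte encoding, every character is followed by a NUL byte,
# which many viewers draw as a placeholder glyph.
misread = raw.decode("latin-1")

# Naming the source encoding explicitly recovers the text.
text = raw.decode("utf-16-le")

print(repr(misread))
print(repr(text))
```

The same idea applies to iconv: the converter is told the source encoding instead of being left to assume one.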

– ============================
Tom Lahti
BIT Statement LLC

(425)251-0833 x 117
http://www.bitstatement.net/
– ============================

Jesse_Vincent · February 23, 2009, 7:14pm

Todd and I got further into it. We’re using Encode::Guess, which should
handle this. Todd had some promising places to dig for a bug.

Todd_Chapman1 · February 23, 2009, 8:04pm

I’ll be sending a test as soon as I finish converting this bugzilla
instance to a new RT instance.

Tom_Lahti · February 23, 2009, 11:04pm

Curious: does Encode::Guess handle UTF-16(LE|BE) without a byte order mark?
That would be … fascinating.

Todd_Chapman1 · February 23, 2009, 11:06pm

No, it doesn’t. I think the problem is that the RT code splits the message
into lines and processes each line separately. Only the first line has
the BOM (byte order mark), so the conversion fails on the rest of the
lines.
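
The suspected failure can be sketched as follows. This is a Python illustration only, not RT's code: `decode_if_bom` is a hypothetical stand-in for a detector that recognises UTF-16 only when a BOM is present, and the sample document is made up.

```python
import codecs

def decode_if_bom(chunk: bytes) -> str:
    """Hypothetical detector: UTF-16 only if a BOM is present, else Latin-1."""
    if chunk.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return chunk.decode("utf-16")  # the codec consumes the BOM
    return chunk.decode("latin-1")

# Made-up two-line attachment: BOM followed by UTF-16LE text.
whole = codecs.BOM_UTF16_LE + "<head>\n<meta>\n".encode("utf-16-le")

# Decoding the whole body at once works, because the BOM is at the start.
print(repr(decode_if_bom(whole)))

# Splitting on the newline byte first leaves only the first piece with a BOM.
# It also strands the second byte of each newline at the start of the next
# piece, so the later pieces come out with NULs between the characters.
per_line = [decode_if_bom(piece) for piece in whole.split(b"\n")]
for line in per_line:
    print(repr(line))
```

If this guess about the cause is right, detecting the encoding once on the whole body before any line splitting would avoid the problem.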