RT 3.8 mangles html attachment

Todd_Chapman1 · February 19, 2009, 3:44pm

We have an RT instance in a trusted environment. I have the following in
RT_SiteConfig.pm:

Set($TrustHTMLAttachments, 1);
Set($PreferRichText, 1);
Set($MaxAttachmentSize , 10000000);

I even turned off the HTML scrubber, yet when I attach an html file to a
ticket and then save it back to my filesystem, the md5sum is changed. This
doesn’t happen for other file types.

Anyone know what is going on?

Thanks!

Jesse_Vincent · February 19, 2009, 3:47pm

We have an RT instance in a trusted environment. I have the following in
RT_SiteConfig.pm:

Set($TrustHTMLAttachments, 1);
Set($PreferRichText, 1);
Set($MaxAttachmentSize , 10000000);

I even turned off the HTML scrubber, yet when I attach an html file to a
ticket and then save it back to my filesystem, the md5sum is changed.

md5sum changed and “mangled” are hardly the same thing. But yes, RT
canonicalizes text to UTF8 as it comes in.
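
As a rough illustration of why the checksum changes, here is a minimal Python sketch (not RT's Perl code), assuming a file that was originally ISO-8859-1. Converting the same text to UTF-8 yields different bytes, so the MD5 differs even though the characters are unchanged. The sample string is made up.

```python
import hashlib

# Hypothetical sample text; any non-ASCII character will do.
text = "café"

# Bytes as they might be uploaded, in a legacy single-byte encoding.
original_bytes = text.encode("iso-8859-1")

# Bytes after conversion to UTF-8 (the "canonical" form).
canonical_bytes = original_bytes.decode("iso-8859-1").encode("utf-8")

original_md5 = hashlib.md5(original_bytes).hexdigest()
canonical_md5 = hashlib.md5(canonical_bytes).hexdigest()

print(original_md5 == canonical_md5)            # False: the bytes differ
print(canonical_bytes.decode("utf-8") == text)  # True: same characters
```

A changed checksum on its own therefore does not show that the content was damaged.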

Todd_Chapman1 · February 19, 2009, 3:49pm

Thanks for the reply, Jesse.

The html no longer displays correctly in the browser after canonicalization.
Suggestions?

Jesse_Vincent · February 19, 2009, 4:05pm

Thanks for the reply, Jesse.

The html no longer displays correctly in the browser after
canonicalization. Suggestions?

What do you mean “no longer displays correctly”?

Todd_Chapman1 · February 19, 2009, 4:11pm

The original file, when opened in a browser, looks like a formatted web
page. After processing by RT, Safari renders the file as what looks like
plain text. In Firefox there are a bunch of weird question mark characters
representing the spaces between characters. FF’s page info says the type is
“text/html” and the encoding is UTF-8.

Jesse_Vincent · February 19, 2009, 4:35pm

Correction, the weird question mark characters are between every character
in the original document. Like so:

<�h�e�a�d�>� � �<�m�e�t�a�

Fascinating. Does it do this with all html attachments?

Todd_Chapman1 · February 19, 2009, 4:40pm

Jesse,

That does not appear to be the case. Can I send the original file directly
to you?

Todd_Chapman1 · February 19, 2009, 4:33pm

Correction, the weird question mark characters are between every character
in the original document. Like so:

<�h�e�a�d�>� � �<�m�e�t�a� �h�t�t�p�-�e�q�u�i�v�=�C�o�n�t�e�n�t�-�T�y�p�e�
�c�o�n�t�e�n�t�=�"�t�e�x�t�/�h�t�m�l�;� �c�h�a�r�s�e�t�=�u�n�i�c�o�d�e�"�>�
� �<�m�e�t�a� �n�a�m�e�=�P�r�o�g�I�d�
�c�o�n�t�e�n�t�=�W�o�r�d�.�D�o�c�u�m�e�n�t�>� � �<�m�e�t�a�
�n�a�m�e�=�G�e�n�e�r�a�t�o�r� �c�o�n�t�e�n�t�=�"�M�i�c�r�o�s�o�f�t�
�W�o�r�d� �1�2�"�>� � �<�m�e�t�a� �n�a�m�e�=�O�r�i�g�i�n�a�t�o�r�
�c�o�n�t�e�n�t�=�"�M�i�c�r�o�s�o�f�t� �W�o�r�d� �1�2�"�>� � �<�l�i�n�k�
�r�e�l�=�F�i�l�e�-�L�i�s�t� �

Todd_Chapman1 · February 19, 2009, 4:42pm

According to FF the original file has an encoding of UTF-16LE. It was
generated by Word. (I know, I know)

Jesse_Vincent · February 19, 2009, 4:45pm

According to FF the original file has an encoding of UTF-16LE. It was
generated by Word. (I know, I know)

Now we’re getting somewhere. Was it attached to a mail as an
attachment? If so, what do the headers for the original attachment look
like as they hit RT. If not, how did it get into RT?

Todd_Chapman1 · February 19, 2009, 4:54pm

It was attached using the web interface.

Todd_Chapman1 · February 19, 2009, 4:55pm

It was attached in the web interface on the Create.html page. (Not a custom
field)

Jesse_Vincent · February 19, 2009, 5:29pm

It was attached in the web interface on the Create.html page. (Not a
custom field)

And what headers is RT serving it out with? Is RT announcing it as utf8?
Is that stored in the database as content-type?

If you save the raw data from RT to disk and open it in a browser, does
it render correctly? (Use wget, not your browser to save it. The browser
could corrupt it)

Todd_Chapman1 · February 19, 2009, 5:56pm

It displays the same when downloaded with wget. (same md5)

Here are the headers via the wget -S option:

HTTP/1.1 200 OK
Server: Apache/2.2.3 (CentOS)
Set-Cookie: RT_SID_techrt.80=b6f63c39b4f0992034a41bb57f9bac80; path=/
Connection: close
Content-Type: text/html;charset=utf-8
Length: unspecified [text/html]

Jesse_Vincent · February 19, 2009, 7:08pm

I don’t know how FF figures out that it is UTF-16LE.

I’d recommend starting in lib/RT/I18N.pm, sub SetMIMEEntityToUTF8.
Instrument there.

Jesse_Vincent · February 20, 2009, 6:04pm

 Well, what does the database say for content-type? Is the content in the
 database 'right'?
Sorry. And thanks again for all the help!

mysql> select Subject, Filename, ContentType, ContentEncoding, Headers
from Attachments where id=10792\G
X-Mailer: MIME-tools 5.426 (Entity 5.426)
Content-Type: text/html;
charset="utf-8";
Content-Transfer-Encoding: binary
X-RT-Original-Encoding: utf-8
Content-Length: 73017

This really does look like our content-type sniffing for HTML probably
wants to look inside the content for an encoding. But there’s a chicken
and egg problem there.

I think you probably want to see if Encode::Guess does the right thing
with your utf-16 html. If so, then it might be a problem in how RT uses
it.

I look forward to further triage.

-j
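
One way to approach the chicken-and-egg problem is to check for a byte order mark before trusting any declared charset. The following is a minimal Python sketch of that idea. It is not RT's implementation and not how Encode::Guess works internally; the function name `sniff_bom` is hypothetical, and the sketch assumes the file actually starts with a BOM.

```python
import codecs
from typing import Optional

# Byte order marks mapped to Python codec names (UTF-8 and UTF-16 only;
# UTF-32 is deliberately left out of this sketch).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes) -> Optional[str]:
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

# Hypothetical stand-in for the start of a UTF-16LE HTML file.
sample = codecs.BOM_UTF16_LE + "<head>".encode("utf-16-le")
print(sniff_bom(sample))     # utf-16-le
print(sniff_bom(b"<head>"))  # None
```

A caller would still need a fallback, such as the declared charset or a guessing step, for input that has no BOM.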

Todd_Chapman1 · February 20, 2009, 6:34pm

Hmmm. Just noticed this error:

[Fri Feb 20 18:32:55 2009] [debug]: Converting 'UTF-16' to 'utf-8' for
text/html - Re Eprize RPC interface failing on DC registration.htm
(/opt/rt3-devel/bin/…/lib/RT/I18N.pm:234)
[Fri Feb 20 18:32:55 2009] [error]: Encoding error:
UTF-16:Unrecognised BOM 78 at
/usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/Encode.pm line 190.

And just found this in production:

Feb 19 10:39:28 c0sup-rt02 RT: Encoding error: UTF-16:Unrecognised BOM
78 at /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/Encode.pm line
190. Stack: [/usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/Encode.pm:190]
[/opt/rt3/bin/…/lib/RT/I18N.pm:235]
[/opt/rt3/bin/…/lib/RT/I18N.pm:153]
[/opt/rt3/bin/…/lib/RT/Interface/Web.pm:853]
[/opt/rt3/share/html/Ticket/Update.html:308]
[/opt/rt3/share/html/autohandler:311] defaulting to ISO-8859-1 →
UTF-8 (/opt/rt3/bin/…/lib/RT/I18N.pm:239)

Word putting in an invalid BOM? I upgraded Encode from 2.26 to 2.31
but it had no effect.
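
The production log above ends with a fallback to ISO-8859-1 → UTF-8. As a hedged illustration of what such a fallback would do to UTF-16LE input, here is a small Python sketch (RT itself is Perl, and the sample string is made up). Treating each byte as an ISO-8859-1 character keeps the NUL byte that follows every ASCII character, which would be consistent with the stray characters between letters reported earlier in the thread.

```python
import codecs

# Hypothetical stand-in for the start of the Word-generated file.
uploaded = codecs.BOM_UTF16_LE + "<head>".encode("utf-16-le")

# Fallback: pretend the bytes are ISO-8859-1 and re-encode them as UTF-8.
stored = uploaded.decode("iso-8859-1").encode("utf-8")

# What a UTF-8 reader then sees: the two BOM bytes become two Latin-1
# characters, and a NUL remains after every original character.
seen = stored.decode("utf-8")
print(repr(seen))
print(seen.count("\x00"))  # 6: one NUL per character of "<head>"
```

Decoding the whole upload as UTF-16 before converting would avoid the interleaved NULs.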

Jesse_Vincent · February 20, 2009, 6:43pm

Hmmm. Just noticed this error:

[Fri Feb 20 18:32:55 2009] [debug]: Converting 'UTF-16' to 'utf-8' for
text/html - Re Eprize RPC interface failing on DC registration.htm
(/opt/rt3-devel/bin/…/lib/RT/I18N.pm:234)
[Fri Feb 20 18:32:55 2009] [error]: Encoding error:
UTF-16:Unrecognised BOM 78 at
/usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/Encode.pm line 190.

How small a test case can you get to fail like that?

Todd_Chapman1 · February 20, 2009, 7:46pm

The attached script and input file trigger the error. I think the
problem is the loop on @lines. The BOM is only in the first line so
the rest is cornfused.

encode_test.pl (577 Bytes)

file (40 Bytes)
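
To illustrate the hypothesis about the loop on @lines, here is a minimal Python sketch. The attached encode_test.pl is not reproduced here, and the sample data is made up. The sketch assumes the data is split on newline bytes before decoding: only the first chunk then carries the BOM, and later chunks start in the middle of a two-byte character.

```python
import codecs

# Hypothetical two-line document, encoded as UTF-16LE with a leading BOM.
text = "<head>\n<meta>\n"
data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Decoding the whole buffer at once works: the codec consumes the BOM.
print(data.decode("utf-16") == text)  # True

# Splitting the raw bytes on "\n" before decoding breaks things up badly.
chunks = data.split(b"\n")
boms = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
for chunk in chunks:
    # Only the first chunk starts with a BOM; the others begin with the
    # trailing 0x00 byte of the preceding UTF-16LE newline.
    print(chunk[:2], chunk.startswith(boms))
```

If this matches what the test script does, decoding the full content in one call rather than line by line would be the thing to try.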

Jesse_Vincent · February 20, 2009, 9:35pm

The attached script and input file trigger the error. I think the
problem is the loop on @lines. The BOM is only in the first line so
the rest is cornfused.

If you’re up for actually rewriting that as a test file that loads the
data and checks what it creates, I’m game for trying to debug it.