UTF-8 problems

Gentlemen,

We have a problem here with RT 3.0.7_01 in the handling of localized
messages that come into the system. Some messages are being converted
to utf-8 TWICE while being injected into the database. As a result,
broken utf-8 encoding appears.

I’ve tried to debug the I18N code myself, and I can see only one
(correct) transformation of single-byte koi8-r messages into utf-8;
the resulting utf-8 that I see saved in the log is correct. No further
calls of I18N->SetMIMEEntityToEncoding are visible in the log, yet the
text in the database appears to be double-encoded into utf-8.

What’s wrong here, and where should I look? I only see other calls to
Encode::from_to within SendEmail and AttachmentOverlay, but for the
latter it is quite hard to understand what it is called for and when.

Any help would be appreciated.

NF
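
For readers following the thread, the symptom described above can be reproduced at the byte level in a few lines. This is a Python sketch, not RT's Perl code path: the sample text is made up, and it assumes the second conversion treats the already-converted utf-8 bytes as Latin-1 characters, which is one common way double encoding arises.

```python
# Illustrative only: reproduces the symptom, not RT's actual code path.
original = "Привет"                     # hypothetical sample message text
koi8_bytes = original.encode("koi8-r")  # single-byte form, as it arrives by mail

# Correct, single conversion: koi8-r -> utf-8.
utf8_once = koi8_bytes.decode("koi8-r").encode("utf-8")

# Faulty second conversion (assumed mechanism): the utf-8 bytes are
# mistaken for Latin-1 characters and encoded to utf-8 again.
utf8_twice = utf8_once.decode("latin-1").encode("utf-8")

# Each step doubles the size here, because every byte is above 0x7F.
print(len(koi8_bytes), len(utf8_once), len(utf8_twice))  # 6 12 24

# Undoing exactly one utf-8 layer recovers the correct bytes.
repaired = utf8_twice.decode("utf-8").encode("latin-1")
print(repaired == utf8_once)
```

If text pulled from the database shrinks back to valid utf-8 when one layer is undone like this, that is a reasonable hint that it was encoded twice on the way in.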

“Nick” == Nick Filimonov nick@freenet.ru writes:

We have a problem here with RT 3.0.7_01 with handling of
localized messages that come into the system. The problem is that
some messages are being converted to utf-8 TWICE, while being
injected into the database. As a result, broken utf-8 encoding
appears.

This must be similar to my problem report, and your description looks
much more accurate. I hope this will give RT developers better hints
to find the problem.

Jesse, did you receive my test case by the way?

Sam
Samuel Tardieu – sam@rfc1149.net

We have the same problems here. Installing the hacked IO::Stringy fixed
the corrupted-attachments problem, but the double-encoding problem
still persists.

Jesse, I can set up a test environment with remote access and example
corrupted messages for you, if you are interested.

O.
Ondřej Surý sury.ondrej@globe.cz
Globe Internet s.r.o.

Hello,

On Wednesday, 14 January 2004 15:51 +0100, Ondřej Surý
sury.ondrej@globe.cz wrote:

We have the same problems here. Installing the hacked IO::Stringy fixed
the corrupted-attachments problem, but the double-encoding problem
still persists.

Same here (even after I upgraded from perl 5.8.0 to 5.8.2).

Dirk.

Hi,

I can also reproduce this, but not every time. On Jan 7 there was a mail
from Jesse on the list about a problem in perl. Could this have anything
to do with it? Here it is:

On Wednesday, January 07, 2004 13:18:00 -0500 Jesse Vincent jesse@bestpractical.com wrote:

Nicholas has tracked the intermittent bug that causes attachment
corruption for some users to a bug in perl’s “join” method. There is a
potential fix that doesn’t involve directly modifying perl’s source code,
but we don’t have that available just yet.

On Mon, Jan 05, 2004 at 10:24:27PM -0800, Nicholas Adrian Vinen wrote:

Hello,
I am a consultant for a company which uses RT for their internal
support. They asked me to fix a problem they were having where
attaching binary files to a ticket caused the file to become corrupt
sometimes. They tracked it down to the case where the mod_perl session
which serves the request to add the attachment to the ticket has
previously been used to perform some ticket-related operation. I finally
tracked down this problem to a bug in perl. Here is a detailed
description of the problem:

When you attach a file to a ticket using RT, it saves the file you
attach into a file in /tmp. It then adds a MIME::Body::File record to
the MIME::Entity which represents the ticket. Later, it calls
make_singlepart() on the MIME::Entity, which converts the entity into a
string. During this process, it calls as_string() on the
MIME::Body::File. This causes the file to be read in and printed into a
string using the IO::Scalar object. IO::Scalar’s print() function calls
the function join() on the data as it is read in, before that data is
appended onto the destination string.

The problem occurs inside join(). join() recycles the string objects
into which it does the joining, and which it later returns. It never
touches the UTF8 flag on these strings. So, on the initial run, it has
no strings to recycle (or few), and when they are created they are set
to ASCII. So all the results of join() are ASCII, which is what MIME and
RT want, as ASCII is also what is used for processing binary data. The
problem is that on the second and subsequent executions of RT within the
perl system, the recycled strings often have the UTF8 flag set. So
join('', $string), where $string is ASCII, will often return a UTF8
string. When this UTF8 string is later converted into ASCII it is
modified, and so the binary data is corrupted.
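
The thread does not show the exact path inside perl that alters the bytes. As a rough model only, assume the wrongly flagged string ends up having each of its bytes re-expressed as a UTF-8 character. Python has no per-string UTF8 flag, so the sketch below imitates that step with a Latin-1 to UTF-8 conversion; the function name and the sample bytes are hypothetical.

```python
# Model only: imitates bytes being re-expressed as UTF-8 characters.
# This is an assumed mechanism, not a trace of what perl actually does.
binary = bytes([0x89, 0x50, 0x4E, 0x47, 0xFF, 0x00, 0x7F, 0x80])  # sample data


def upgrade_like_perl(data: bytes) -> bytes:
    """Treat each byte as a Latin-1 character and store it as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


corrupted = upgrade_like_perl(binary)

# Bytes below 0x80 pass through unchanged; every byte from 0x80 upward
# turns into two bytes, so the attachment grows and no longer matches.
print(binary.hex(), "->", corrupted.hex())
print(len(binary), "->", len(corrupted))
```

Under this model, pure-ASCII attachments survive untouched, which would fit the observation that only binary files were reported as corrupted.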

The solution is to apply the following patch to perl (tested with
perl 5.8.2), which sets the UTF8 flag on the returned string to
something sensible.

diff -u perl-5.8.2/doop.c perl-5.8.2-patched/doop.c
--- perl-5.8.2/doop.c 2003-09-30 10:09:51.000000000 -0700
+++ perl-5.8.2-patched/doop.c 2004-01-05 23:23:13.000000000 -0800
@@ -647,6 +647,9 @@
     register STRLEN len;
     STRLEN delimlen;
     STRLEN tmplen;
+    int utf8;
+
+    utf8 = (SvUTF8(del)!=0);
 
     (void) SvPV(del, delimlen); /* stringify and get the delimlen */
     /* SvCUR assumes it's SvPOK() and woe betide you if it's not. */
@@ -674,22 +677,37 @@
         SvTAINTED_off(sv);
 
     if (items-- > 0) {
-        if (*mark)
+        if (*mark) {
+            utf8 += (SvUTF8(*mark)!=0);
             sv_catsv(sv, *mark);
+        }
         mark++;
     }
 
     if (delimlen) {
         for (; items > 0; items--,mark++) {
             sv_catsv(sv,del);
+            utf8 += (SvUTF8(*mark)!=0);
             sv_catsv(sv,*mark);
         }
     }
     else {
-        for (; items > 0; items--,mark++)
+        for (; items > 0; items--,mark++) {
+            utf8 += (SvUTF8(*mark)!=0);
             sv_catsv(sv,*mark);
+        }
     }
     SvSETMAGIC(sv);
+    if( utf8 )
+    {
+        if( utf8 != sp-oldmark+1 && ckWARN_d(WARN_UTF8) )
+        {
+            Perl_warner(aTHX_ packWARN(WARN_UTF8), "Joining UTF8 and ASCII strings");
+        }
+        SvUTF8_on(sv);
+    } else {
+        SvUTF8_off(sv);
+    }
 }
 
 void
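
For anyone who does not read perl internals, the decision the patch adds at the end of join() can be modeled in a few lines. This is a Python sketch under stated assumptions: `PerlString` and `patched_join` are hypothetical names standing in for a perl scalar with its UTF8 flag and for the patched routine, and the model ignores details such as null items and the warning category check.

```python
import warnings
from typing import List, NamedTuple


class PerlString(NamedTuple):
    """Hypothetical stand-in for a perl scalar: text plus its UTF8 flag."""
    text: str
    utf8: bool


def patched_join(delim: PerlString, items: List[PerlString]) -> PerlString:
    # Count how many inputs (delimiter included) carry the UTF8 flag,
    # mirroring the `utf8` counter in the patch.
    flagged = int(delim.utf8) + sum(int(item.utf8) for item in items)
    total = len(items) + 1  # plays the role of sp-oldmark+1 in the patch
    if flagged and flagged != total:
        # Some inputs are flagged and some are not: warn, as the patch does.
        warnings.warn("Joining UTF8 and ASCII strings")
    text = delim.text.join(item.text for item in items)
    # The flag is now always set explicitly from the inputs, never
    # inherited from whatever the recycled result string carried before.
    return PerlString(text, utf8=flagged > 0)


# All-ASCII inputs give an unflagged result, however often this runs.
print(patched_join(PerlString("", False), [PerlString("data", False)]))
```

The point of the model is the last line of `patched_join`: the result's flag depends only on the inputs, which is the property the unpatched join() lacked.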

There may be other perl functions with similar problems; this is beyond
the scope of my job. However, I hope that the maintainers of perl will
be proactive in attempting to find and fix any similar problems, as the
way they have added UTF8 support to perl doesn’t make it obvious when
such bugs exist. I’d say that any built-in function that returns a
string should be checked for (a) whether it sets the UTF8 flag at all
and (b) whether the value it sets it to is sensible. Also, I think
warnings when mixed types of strings are passed into functions are
sensible, as this can be dangerous: since we don’t know what character
set the ASCII strings are in, the routines themselves can’t really
handle this case properly if any extended characters are present.

  I hope this helps.

        Nicholas




rt-devel mailing list
rt-devel@lists.bestpractical.com
The rt-devel Archives

On Wednesday, January 14, 2004 18:01:41 +0100 Dirk Pape pape-rt@inf.fu-berlin.de wrote:

same here (even after I upgraded from perl 5.8.0 to 5.8.2)

Dirk.


rt-users mailing list
rt-users@lists.bestpractical.com
The rt-users Archives

Have you read the FAQ? The RT FAQ Manager lives at http://fsck.com/rtfm

I can also reproduce this, but not every time.

Sorry, I can reproduce this every time. I was wrong. I’ll send a report
in a separate mail in a short while.

/Palle

I have installed the work-around IO::Stringy from Jesse, which fixed
the corrupted attachments but not the double UTF-8 problem. I will be
brave, recompile perl 5.8.2 with this fix, and report the results.

O.

On Thu, 2004-01-15 at 00:16, Palle Girgensohn wrote:

Hi,

I can also reproduce this, but not every time. Jan 7, there was a mail from
Jesse on the list about a problem in perl. Can this have anything to do
with it? Check it out here:

On Wednesday, January 07, 2004 13:18:00 -0500 Jesse Vincent jesse@bestpractical.com wrote:

Nicholas has tracked the intermittent bug that causes attachment
corruption for some users to a bug in perl’s “join” method. There is a
potential fix that doesn’t involve directly modifying perl’s source code,
but we don’t have that available just yet.

Ondřej Surý sury.ondrej@globe.cz
Globe Internet s.r.o.

“Ondřej” == Ondřej Surý sury.ondrej@globe.cz writes:

I have installed work-around IO::Stringy from Jesse which fixed
corrupted attachments, but not double UTF-8 problem. I will be
brave and recompile perl 5.8.2 with this fix and report results.

Where can I get the IO::Stringy fix from?

Sam
Samuel Tardieu – sam@rfc1149.net

http://download.bestpractical.com/pub/rt/devel/IO-stringy-Hacked-For-UTF8-2.109-BestPractical-Hack-20040107.tar.gz

A.

From: rt-users-bounces@lists.bestpractical.com
[mailto:rt-users-bounces@lists.bestpractical.com] On Behalf Of Samuel
Tardieu
Sent: Thursday, January 15, 2004 10:48 AM
To: rt-users@lists.fsck.com
Subject: [rt-users] Re: UTF-8 problems

Where can I get the IO::Stringy fix from?