Hello,
I am a consultant for a company which uses RT for their internal
support. They asked me to fix a problem they were having where
attaching binary files to a ticket caused the file to become corrupt
sometimes. They tracked it down to the case where the mod_perl session
which serves the request to add the attachment to the ticket has
previously been used to perform some ticket-related operation. I finally
tracked down this problem to a bug in perl. Here is a detailed
description of the problem:
When you attach a file to a ticket using RT, it saves the attachment
to a temporary file in /tmp. It then adds a MIME::Body::File
record to the MIME::Entity which represents the ticket. Later, it calls
make_singlepart() on the MIME::Entity, which converts the entity into a
string. During this process, it calls as_string() on the
MIME::Body::File. This causes the file to be read in and printed into a
string using the IO::Scalar object. IO::Scalar’s print() function calls
the function join() on the data as it is read in, before that data is
appended onto the destination string.
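
To make that flow concrete, here is a rough sketch in plain Perl. It is not the MIME::Tools or IO::Scalar code: the helper name slurp_via_join and the 4096-byte chunk size are made up for the example, and File::Temp stands in for the file RT writes to /tmp.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a file in chunks and append each chunk to a
# string by way of join(), roughly as the print() path described above does.
sub slurp_via_join {
    my ($path) = @_;
    open my $in, '<', $path or die "cannot open $path: $!";
    binmode $in;                          # raw bytes, no I/O layers
    my $out = '';
    while (read($in, my $chunk, 4096)) {
        $out .= join('', $chunk);         # the join() step discussed in this report
    }
    close $in;
    return $out;
}

# Stand-in for the uploaded attachment saved to a temporary file.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;
my $payload = join '', map { chr($_) } 0 .. 255;   # every byte value once
print {$fh} $payload;
close $fh;

my $copy = slurp_via_join($path);
print length($copy), " bytes read, ",
      ($copy eq $payload ? "identical" : "different"), "\n";
```

On a perl without the bug the copy should match the payload byte for byte; the point of the sketch is only to show where join() sits in the path.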
The problem occurs inside join(). join() recycles the string objects
into which it does the joining and which it later returns, and it never
touches the UTF8 flag on these strings. On the initial run it has no
strings to recycle (or few), and newly created strings are set to
ASCII. So all the results of join() are ASCII, which is what MIME and
RT want, as ASCII is also what is used for processing binary data. On
the second and subsequent executions of RT within the same perl
interpreter, however, the recycled strings often have the UTF8 flag
set. So join('', $string), where $string is ASCII, will often return a
UTF8 string. When this UTF8 string is later converted into ASCII it is
modified, and so the binary data is corrupted.
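
The effect of a stray flag on binary data can be shown with a small sketch. A fixed perl will not hand back a flagged string from join() here, so utf8::upgrade() is used as a stand-in for that step; that substitution is an assumption of the example, not what the buggy join() does internally.

```perl
use strict;
use warnings;

my $bytes   = "\x89PNG\xff\x00";   # six octets of "binary" data
my $flagged = $bytes;
utf8::upgrade($flagged);           # ASSUMPTION: stands in for join() returning a flagged string

printf "original flagged? %s\n", utf8::is_utf8($bytes)   ? "yes" : "no";
printf "copy flagged?     %s\n", utf8::is_utf8($flagged) ? "yes" : "no";
printf "equal as strings? %s\n", $bytes eq $flagged       ? "yes" : "no";

# Octets that result if the flagged copy is written out as UTF-8.
my $octets = $flagged;
utf8::encode($octets);
printf "%d octets in, %d octets out\n", length($bytes), length($octets);
```

The two strings compare equal, yet the flagged copy grows from 6 to 8 octets once encoded, because the two bytes above 0x7F each become a two-byte sequence. That kind of silent change is what ruins an attachment.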
The solution is to apply the following patch to perl (tested with
perl 5.8.2), which sets the UTF8 flag on the returned string to
something sensible.

diff -u perl-5.8.2/doop.c perl-5.8.2-patched/doop.c
--- perl-5.8.2/doop.c 2003-09-30 10:09:51.000000000 -0700
+++ perl-5.8.2-patched/doop.c 2004-01-05 23:23:13.000000000 -0800
@@ -647,6 +647,9 @@
     register STRLEN len;
     STRLEN delimlen;
     STRLEN tmplen;
+    int utf8;
+
+    utf8 = (SvUTF8(del)!=0);
 
     (void) SvPV(del, delimlen); /* stringify and get the delimlen */
     /* SvCUR assumes it's SvPOK() and woe betide you if it's not. */
@@ -674,22 +677,37 @@
         SvTAINTED_off(sv);
 
     if (items-- > 0) {
-        if (*mark)
+        if (*mark) {
+            utf8 += (SvUTF8(*mark)!=0);
             sv_catsv(sv, *mark);
+        }
         mark++;
     }
 
     if (delimlen) {
         for (; items > 0; items--,mark++) {
             sv_catsv(sv,del);
+            utf8 += (SvUTF8(*mark)!=0);
             sv_catsv(sv,*mark);
         }
     }
     else {
-        for (; items > 0; items--,mark++)
+        for (; items > 0; items--,mark++) {
+            utf8 += (SvUTF8(*mark)!=0);
             sv_catsv(sv,*mark);
+        }
     }
     SvSETMAGIC(sv);
+    if( utf8 )
+    {
+        if( utf8 != sp-oldmark+1 && ckWARN_d(WARN_UTF8) )
+        {
+            Perl_warner(aTHX_ packWARN(WARN_UTF8), "Joining UTF8 and ASCII strings");
+        }
+        SvUTF8_on(sv);
+    } else {
+        SvUTF8_off(sv);
+    }
 }
 
 void

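
For readers who do not read perl's C internals, the logic of the patch can be approximated at the Perl level. This is only an analogue under stated assumptions: checked_join is a hypothetical name, and utf8::upgrade()/utf8::downgrade() re-encode the buffer as needed, whereas SvUTF8_on()/SvUTF8_off() in the patch only flip the flag.

```perl
use strict;
use warnings;

# Hypothetical Perl-level analogue of the patched join().
sub checked_join {
    my ($delim, @items) = @_;

    # Count flagged operands, delimiter included, as the patch does.
    my $flagged = grep { utf8::is_utf8($_) } $delim, @items;

    # Some but not all operands flagged: same message as in the patch.
    warn "Joining UTF8 and ASCII strings\n"
        if $flagged && $flagged != @items + 1;

    my $out = join $delim, @items;

    # Derive the result's flag from the inputs rather than from whatever
    # the result buffer held before.
    if ($flagged) { utf8::upgrade($out) }
    else          { utf8::downgrade($out) }
    return $out;
}

my $plain = checked_join('', "\x00\xff", "abc");   # all unflagged: result unflagged
my $wide  = checked_join(',', "x", "\x{263A}");    # mixed input: warns, result flagged
```

The design point is the same as in the C patch: the flag on the result is a function of the operands only, so what a recycled string held earlier no longer matters.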
There may be other perl functions with similar problems. Finding them
is beyond the scope of my job; however, I hope that the maintainers of
perl will be proactive in attempting to find and fix any similar
problems, as the way UTF8 support has been added to perl doesn't make
it obvious when such bugs exist. I'd say that any built-in function that
returns a string should be checked for (a) whether it sets the UTF8
flag at all and (b) whether the value it sets it to is sensible. I also
think it is sensible to warn when mixed types of strings are passed
into a function, as this can be dangerous: since we don't know what
character set the ASCII strings are in, the routines themselves can't
really handle this case properly if any extended characters are present.
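
The kind of check suggested in (a) and (b) could start from something like the sketch below. The list of built-ins is only a sample picked for the example, and priming each one with a flagged string first is my guess at how to get a recycled result buffer into play; neither is a complete audit.

```perl
use strict;
use warnings;

# ASSUMPTION: this set of built-ins is just a sample chosen for illustration.
my %builtins = (
    join    => sub { join '', @_ },
    substr  => sub { substr $_[0], 0 },
    sprintf => sub { sprintf '%s', $_[0] },
    reverse => sub { scalar reverse $_[0] },
    repeat  => sub { $_[0] x 2 },
    lc      => sub { lc $_[0] },
);

my $input = "\x00\x7f\x80\xff";   # unflagged byte string
my @suspect;
for my $name (sort keys %builtins) {
    # Prime with a flagged string, then feed plain bytes through the same op.
    $builtins{$name}->("\x{263A}");
    for (1 .. 3) {
        my $result = $builtins{$name}->($input);
        push @suspect, $name if utf8::is_utf8($result);
    }
}
print @suspect ? "flag leaked from: @suspect\n" : "no flag leaks seen\n";
```

Any name reported as leaking would be a candidate for the same treatment as join().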
I hope this helps.
Nicholas