Patch for RT 3.0.3 attachment conversion problem (2)

Autrijus_Tang · June 26, 2003, 10:50am

Hello Autrijus,

On Thursday, 26 June 2003 at 12:22 +0200, Dirk Pape
pape-rt@inf.fu-berlin.de wrote:

and this is what I get forwarded as notification for adminCC.
Web display of Bild.jpg is ok, pdf corrupted.

I tried it again with a non-empty text part with the same effect.

I have found the culprit!

Encode::Guess 1.08 is broken and Encode::Guess 1.06 (the version shipped
with Perl) works.

The reason is that Encode::Guess 1.08 introduces this bogus logic:

if ($octet =~ /\x00/o){ # if \x00 found, we assume UTF-(16|32)(BE|LE)

but in fact it may just be random binary data that happens to have
“\x00” inside it. So if you nullify this if() condition, everything
should start working.
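
To make the failure mode concrete, here is a minimal sketch. It is not taken from Guess.pm: has_nul is a hypothetical helper that only mirrors the condition quoted above, and the byte string is just the first few bytes of a typical JFIF file. It shows that the "\x00" test fires on ordinary binary data:

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the 1.08 condition quoted above.
sub has_nul {
    my ($octet) = @_;
    return $octet =~ /\x00/ ? 1 : 0;
}

# First bytes of a typical JFIF (JPEG) file: SOI marker, APP0 marker,
# a two-byte segment length, then the "JFIF\0" identifier.
my $jpeg_head = "\xFF\xD8\xFF\xE0\x00\x10JFIF\x00";

# Plain text for comparison.
my $text = "Web display of Bild.jpg is ok";

printf "jpeg: %d, text: %d\n", has_nul($jpeg_head), has_nul($text);
```

The JPEG header trips the check although it is not UTF-16 or UTF-32 text, which is the case the attachment conversion runs into.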

Cc’ing Kogai-san to try finding a solution. Kogai-san, can we
somehow disable this helpful guessing of “\x00”, via a
$Encode::Guess::NoUTF32Guessing control variable or something?

Specifically, I think it is wrong for Guess.pm to return UTF32/16
without the user explicitly setting it in the Suspects list, but I’m
willing to be convinced otherwise.

Thanks,
/Autrijus/

Autrijus_Tang · June 26, 2003, 4:27pm

But one thing you should be careful about is that a guessed encoding
is, after all, just a guess. You should not rely too much upon it. If
you have an alternate way to tell the encoding explicitly, use that instead.

Advice very well taken. Since these are MIME entities we’re talking
about, RT will use all hints possible (content-type.charset, etc)
before falling back to Guess.
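
As a rough sketch of that "hints first, Guess last" order: the helper below is hypothetical and is not RT's actual code. pick_charset and its arguments are made-up names; only find_encoding and guess_encoding are real Encode / Encode::Guess functions.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use Encode::Guess;    # exports guess_encoding

# Hypothetical helper, not RT's actual code: return the canonical name
# of the encoding to use for a MIME body, or undef if nothing fits.
sub pick_charset {
    my ($declared, $octets) = @_;

    # 1. Trust an explicit, recognised charset from Content-Type.
    if (defined $declared and length $declared) {
        my $enc = find_encoding($declared);
        return $enc->name if $enc;
    }

    # 2. Fall back to guessing. guess_encoding() returns an encoding
    #    object on success and a plain error string on failure.
    my $guess = guess_encoding($octets);
    return ref $guess ? $guess->name : undef;
}

print pick_charset('iso-8859-1', "caf\xE9"), "\n";
```

With a declared charset the guesser is never consulted, so a stray "\x00" in the body cannot change the result.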

Cc’ing Kogai-san to try finding a solution. Kogai-san, can we
somehow disable this helpful guessing of “\x00”, via a
$Encode::Guess::NoUTF32Guessing control variable or something?

That’s possible. Though the name should be NoUTF1632 (horrible but
more accurate) or something because it guesses not only UTF-32 (which
is hardly ever used for the time being) but also UTF-16.

I’d say that $NoUTFAutoGuess is the right name, since it should
eliminate all unrequested guessing of this kind.

Code and POD patch as below, against 1.08.

Thanks,
/Autrijus/

--- Guess.pm.orig Fri Jun 27 00:17:48 2003
+++ Guess.pm Fri Jun 27 00:25:33 2003
@@ -18,6 +18,7 @@
 sub perlio_ok { 0 }
 
 our @EXPORT = qw(guess_encoding);
+our $NoUTFAutoGuess = 0;
 
 sub import { # Exporter not used so we do it on our own
     my $callpkg = caller;
@@ -70,22 +71,27 @@
     return unless defined $octet and length $octet;
 
     # cheat 0: utf8 flag;
-    Encode::is_utf8($octet) and return find_encoding('utf8');
+    if ( Encode::is_utf8($octet) ) {
+        return find_encoding('utf8') if !$NoUTFAutoGuess;
+        Encode::_utf8_off($octet);
+    }
     # cheat 1: BOM
     use Encode::Unicode;
-    my $BOM = unpack('n', $octet);
-    return find_encoding('UTF-16')
-        if (defined $BOM and ($BOM == 0xFeFF or $BOM == 0xFFFe));
-    $BOM = unpack('N', $octet);
-    return find_encoding('UTF-32')
-        if (defined $BOM and ($BOM == 0xFeFF or $BOM == 0xFFFe0000));
+    if (!$NoUTFAutoGuess) {
+        my $BOM = unpack('n', $octet);
+        return find_encoding('UTF-16')
+            if (defined $BOM and ($BOM == 0xFeFF or $BOM == 0xFFFe));
+        $BOM = unpack('N', $octet);
+        return find_encoding('UTF-32')
+            if (defined $BOM and ($BOM == 0xFeFF or $BOM == 0xFFFe0000));
+    }
     my %try = %{$obj->{Suspects}};
     for my $c (@_){
         my $e = find_encoding($c) or die "Unknown encoding: $c";
         $try{$e->name} = $e;
         $DEBUG and warn "Added: ", $e->name;
     }
-    if ($octet =~ /\x00/o){ # if \x00 found, we assume UTF-(16|32)(BE|LE)
+    if (!$NoUTFAutoGuess and $octet =~ /\x00/o){ # if \x00 found, we assume UTF-(16|32)(BE|LE)
         my $utf;
         my ($be, $le) = (0, 0);
         if ($octet =~ /\x00\x00/o){ # UTF-32(BE|LE) assumed
@@ -188,6 +194,10 @@
 
   # tries all major Japanese Encodings as well
   use Encode::Guess qw/euc-jp shiftjis 7bit-jis/;
 
+If the C<$Encode::Guess::NoUTFAutoGuess> variable is set to a true
+value, no heuristics will be applied to UTF8/16/32, and the result
+will be limited to the suspects and C<ascii>.
+
 =over 4
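
For illustration, a caller might use the new variable roughly as follows. This is only a sketch and assumes an Encode::Guess that already contains the patch above; guess_without_utf_heuristics is a hypothetical wrapper name, and the sample bytes are again the start of a JFIF file.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding

# Hypothetical wrapper: guess with the UTF heuristics switched off.
sub guess_without_utf_heuristics {
    my ($octets, @suspects) = @_;

    # local restores the previous value when the sub returns, so other
    # callers of Encode::Guess are unaffected.
    local $Encode::Guess::NoUTFAutoGuess = 1;

    my $enc = guess_encoding($octets, @suspects);
    return ref $enc ? $enc->name : undef;    # undef: no confident guess
}

# Binary data containing "\x00" (first bytes of a JFIF file).
my $binary = "\xFF\xD8\xFF\xE0\x00\x10JFIF\x00";
my $name   = guess_without_utf_heuristics($binary);
print defined $name ? "guessed $name\n" : "no guess, treat as binary\n";
```

Using local rather than a plain assignment keeps the setting scoped to the one call, which seems the safer choice in a long-running process.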
