REG: HTML mails

Sanjeev_Gopal · May 2, 2001, 2:27pm

Hello,

I’ve got rt 1.0.7 running, and have also implemented ‘stripmime’, as a lot
of mails had HTML content. This works fine if the content is HTML and
text, but in case a pure HTML mail turns up, the mail has to be read
separately, and moreover, when we reply to such a mail, the requestor
doesn’t get to see any of his original content.

Is it possible to use an HTML-to-text converter, like the one referenced
below:

    http://userpage.fu-berlin.de/~mbayer/tools/html2text.html

Please advise.

Regards,
Sanjeev Gopal
IT Consultant
Antarix e Applications Limited
Phone: +91 44 820 3554
Fax: +91 44 827 2274

Eric_Goodman · May 4, 2001, 10:35pm

Yes, it is possible. I’ve done this at my site, but my code is still
so ugly that I didn’t want to share it yet.

Each “part” of a MIME message has a name (like “message”, “message,
part 1”), a type (“text”, “application”), and a subtype (“text”?,
“html”, etc.)

Stripmime works by using MIME::Parser to break the incoming email
into its component parts, identifying any parts that aren’t plain
text, and making them links. A plaintext message body (named
“message”) comes with a type/subtype of “text/text” (I think). A
mixed message comes with two parts to the body, “message, part 1” and
“message, part 2” of type/subtype “text/text” and “text/html”
respectively. Stripmime handles both cases well.
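The part-walking described above can be sketched roughly as follows. This is only an illustration, not stripmime itself: the `list_parts` helper and the sample message are made up for the example, and it keeps bodies in memory instead of writing them to a web-visible directory. The names are built the same way the script later in this message builds them ("message", "message, part 1", and so on).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MIME::Parser;

# Hypothetical helper: return [name, mime_type] pairs for every leaf part,
# naming parts "message", "message, part 1", ... as dump_entity does.
sub list_parts {
    my ($entity, $name) = @_;
    $name = "message" unless defined $name;
    my @parts = $entity->parts;
    return ([$name, $entity->mime_type]) unless @parts;   # single part
    my @found;
    for my $i (0 .. $#parts) {                             # multipart: recurse
        push @found, list_parts($parts[$i], "$name, part " . ($i + 1));
    }
    return @found;
}

# Made-up sample message with a plain-text and an HTML version of the body.
my $raw = join "\n",
    'From: user@example.com',
    'Subject: test',
    'MIME-Version: 1.0',
    'Content-Type: multipart/alternative; boundary="XX"',
    '',
    '--XX',
    'Content-Type: text/plain',
    '',
    'Hello',
    '--XX',
    'Content-Type: text/html',
    '',
    '<p>Hello</p>',
    '--XX--',
    '';

my $parser = MIME::Parser->new;
$parser->output_to_core(1);            # keep bodies in memory, no temp files
my $entity = $parser->parse_data($raw);

my @listing = list_parts($entity);
print "$_->[0]: $_->[1]\n" for @listing;
```

Running this against a few real messages from your own mailers is probably the quickest way to see which names and types actually turn up.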

The case you describe is HTML only. For this I think you typically
see a message body with name “message, part 1” (though I would expect
you might see just “message”) and type/subtype “text/html”.

All I did was add a check for this third case, and if found run the
HTML through HTML::FormatText (a module that can convert html to
plain text). I made a couple of other modifications to the script
(that I haven’t really reviewed), hence my hesitation to send this to
the list. I tried to note my mods with “EJG” comments. I expect some
are missing.
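Taken on its own, the conversion step looks something like the sketch below. The `html_to_text` wrapper and the sample HTML are invented for illustration; the margins are the ones used in the script. It parses from a string, where the script parses the file that MIME::Parser wrote to disk.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical wrapper: convert an HTML string to wrapped plain text.
sub html_to_text {
    my ($html) = @_;
    my $tree      = HTML::TreeBuilder->new_from_content($html);
    my $formatter = HTML::FormatText->new(leftmargin => 4, rightmargin => 60);
    my $text      = $formatter->format($tree);
    $tree->delete;                     # free the parse tree explicitly
    return $text;
}

# Made-up sample input.
my $text = html_to_text(
    '<html><body><h1>Ticket</h1><p>My printer is <b>broken</b>.</p></body></html>'
);
print $text;
```

Inline markup such as bold is simply dropped, so the result is readable text and not a faithful rendering of the HTML.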

However, in case it is of use, the modified version of the script is
included below. Note that HTML::FormatText relies on
HTML::TreeBuilder, and it was a fairly long process to locate and
install all of the various Perl modules on which those two depend in
turn.

Hope this helps!

— Eric

#!/usr/bin/perl
use MIME::Parser;
use HTML::FormatText;
use HTML::TreeBuilder;
$now = time();
$basepath = "http://YOUR_SITE/stripmime/$now-$$";
$basefilepath = "/YOUR_HTML_PATH/stripmime/$now-$$";
$outputprog = "/RT_PATH/bin/rt-mailgate @ARGV";

sub dump_entity {
 my ($entity, $checksentry, $name) = @_;
 defined($name) or $name = "message";
 my $IO;

 # EJG: Head appears to be the deliver info.
 # Output the head, if it's the root level head
 # Otherwise, it's just some crappy mime header
 if ($name eq "message") {
    print OUT  $entity->head->original_text."\n";
 }

 # Output the body:
 my @parts = $entity->parts;

 if (@parts) {                     # multipart...
     my $i;
     foreach $i (0 .. $#parts) {       # dump each part...
         dump_entity($parts[$i], 0, ("$name, part ".(1+$i)));
     }
 }
 else {                            # single part...

     # Get MIME type, and display accordingly...
     my ($type, $subtype) = split('/', $entity->head->mime_type);
     my $body = $entity->bodyhandle;

     # If it's text, display it, perhaps
     my $path = $body->path;
     my ($filename) = ($path =~ /\/([^\/]+)$/);

     if ($type =~ /^(text|message)$/ && $subtype ne "html") {
        print OUT "\n>>> Text component $filename:\n" if
            ($filename !~ "msgauto");
        if ($IO = $body->open("r")) {
            print OUT $_ while (defined($_ = $IO->getline));
            $IO->close;
            push (@deletetemp, "$basefilepath/$filename");
            $keepgoing = false;
        }
     }
     else {
        # EJG: Added case for Apple headers
        if ( ($type eq "application") && ($subtype eq "applefile") ) {
            print OUT "\n>>> $type/$subtype component, $name:\n";
            print OUT "Not relevant, deleted\n";
            push (@deletetemp, "$basefilepath/$filename");
        }
        else {
            # EJG: Added 3rd condition (to avoid ".html.html" files)
            if ($subtype eq "html" && $filename =~ /msgauto/ &&
                $filename !~ /\.html$/ ) {
                $newfilename = "$filename.html";
                $renametemp{"$basefilepath/$filename"} =
                    "$basefilepath/$newfilename";
                $filename = $newfilename;
            }
            # EJG: If the message or the first part of the message is HTML,
            # EJG: invoke HTML::FormatText to convert it to text.
            if ($subtype eq "html" &&
                ($name eq "message" || $name eq "message, part 1") ) {
                my $htmltree = new HTML::TreeBuilder;
                my $htmlformat = new HTML::FormatText(
                    leftmargin=>4, rightmargin=>60 );
                $htmltree->parse_file( "$basefilepath/$filename" );
                if ($name eq "message") {
                    print OUT "$sentrystr"."\n";
                }
                print OUT "Incoming HTML message detected -- converted to text only.\n";
                print OUT "\n\n==========================================\n";
                print OUT $htmlformat->format( $htmltree );
                print OUT "\n\n==========================================\n";
                print OUT "Original HTML version available at URL below.\n";
            }
            print OUT "\n>>> $type/$subtype component, $name:\n";
            print OUT "<A HREF=\"$basepath/$filename\">\n";
            print OUT "$basepath/$filename\n";
            print OUT "<\/A>\n";
        }
     }
 }
 1;
}

# main

sub main {

 # Create a new MIME parser:
 my $parser = new MIME::Parser;

 # Set the output directory:
 (-d "$basefilepath") or mkdir "$basefilepath",0755 or die "mkdir: $!";
 (-w "$basefilepath") or die "can't write to directory";
 $parser->output_dir($basefilepath);
 open (OUT, "|$outputprog");

 $parser->output_prefix("msgauto");


 # Read the MIME message:
 $entity = $parser->read(\*STDIN) or die "couldn't parse MIME stream";

 # Dump it out:
 dump_entity($entity, 1);
 close(OUT);


 # Delete unneeded temporary files
 foreach (@deletetemp) {
    unlink ($_);
 }

 # Rename our temporary files that were renamed (html, etc.)
 foreach (keys %renametemp) {
    rename ($_, $renametemp{$_});
 }

 # Delete our directory, or at least try -- won't delete if it's not empty
 rmdir($basefilepath);

}

&main();

exit(0);

Eric_Goodman · May 4, 2001, 11:57pm

Whoops!

An undefined debug string I thought I’d deleted from that last script
prior to mailing didn’t get cleaned out fully. Sorry about that.

             if ($name eq "message") {
                print OUT "$sentrystr"."\n";
             }

This is the culprit that should be deleted.

Note that I also assume later in the script that you’ve modified RT
to actually display URLs as links. If not, then at this point:

          print OUT "\n>>> $type/$subtype component, $name:\n";
          print OUT "<A HREF=\"$basepath/$filename\">\n";
          print OUT "$basepath/$filename\n";
          print OUT "<\/A>\n";

You’ll want to remove the 2nd and 4th lines. By default RT will
“regularize” them, and display the text literally (that is, as
<A HREF="LINK">LINK</A>, instead of as a link to “LINK”). That can also
end up looking pretty funky when received in email, which will
usually interpret it as two clickable links with a bunch of
less-than, greater-than characters around them.

— Eric

Dave_Sherohman · May 7, 2001, 1:39pm

> Each “part” of a MIME message has a name (like “message”, “message,
> part 1”), a type (“text”, “application”), and a subtype (“text”?,
> “html”, etc.)

Note that parts are not required to be named.

> A plaintext message body (named
> “message”) comes with a type/subtype of “text/text” (I think).

text/plain

> The case you describe is HTML only. For this I think you typically
> see a message body with name “message, part 1” (though I would expect
> you might see just “message”) and type/subtype “text/html”.

The body’s name is MTA-dependent. A safer way to check for this would be
to look for messages which have a text/html part, but no text/plain part.
(You wouldn’t want to look for text/html as the only part because it
could have attachments, but still lack a plaintext version of the body.)
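A rough sketch of that check, assuming MIME-tools is installed: collect the types of all leaf parts and test for text/html without text/plain. The helper names and the two sample messages are invented for the example. One caveat: a plain-text file attachment would also count as a text/plain part here, so a stricter version would need to look at Content-Disposition as well.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MIME::Entity;

# Hypothetical helper: MIME types of all leaf (non-multipart) parts.
sub leaf_types {
    my ($entity) = @_;
    my @parts = $entity->parts;
    return ($entity->mime_type) unless @parts;
    return map { leaf_types($_) } @parts;
}

# True if there is a text/html part somewhere but no text/plain part.
# Caveat: a text/plain *attachment* also counts as a plain-text part here.
sub is_html_only {
    my ($entity) = @_;
    my %seen = map { $_ => 1 } leaf_types($entity);
    return ($seen{'text/html'} && !$seen{'text/plain'}) ? 1 : 0;
}

# Made-up sample: HTML body plus an attachment, no plain-text version.
my $html_with_attachment = MIME::Entity->build(Type => 'multipart/mixed');
$html_with_attachment->attach(Type => 'text/html', Data => '<p>Hi</p>');
$html_with_attachment->attach(
    Type     => 'application/octet-stream',
    Encoding => 'base64',
    Data     => 'xyz',
);

# Made-up sample: both a plain-text and an HTML version of the body.
my $both = MIME::Entity->build(Type => 'multipart/alternative');
$both->attach(Type => 'text/plain', Data => 'Hi');
$both->attach(Type => 'text/html',  Data => '<p>Hi</p>');

print "html + attachment: ", is_html_only($html_with_attachment), "\n";
print "plain + html:      ", is_html_only($both), "\n";
```

This does not depend on part names at all, which is the point of the suggestion.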

Eric_Goodman · May 7, 2001, 10:37pm

>> A plaintext message body (named
>> “message”) comes with a type/subtype of “text/text” (I think).
>
> text/plain

That’s what happens when I get lazy and try to respond from memory!

>> The case you describe is HTML only. For this I think you typically
>> see a message body with name “message, part 1” (though I would expect
>> you might see just “message”) and type/subtype “text/html”.
>
> The body’s name is MTA-dependent. A safer way to check for this would be
> to look for messages which have a text/html part, but no text/plain part.
> (You wouldn’t want to look for text/html as the only part because it
> could have attachments, but still lack a plaintext version of the body.)

Good to know! Still, it works for most of the mailers we’re using,
and stripmime itself looks for a part named “message” in determining
what to use as the message body, so I think my additions will work in
most cases where stripmime would have worked for a corresponding
plain-text or mixed-type message.

That’s not a great excuse for not implementing a better search, but
then again, that’s why I hadn’t sent my code to the list earlier.

— Eric

James_Dumser · May 8, 2001, 1:04pm

>> The case you describe is HTML only. For this I think you typically
>> see a message body with name “message, part 1” (though I would expect
>> you might see just “message”) and type/subtype “text/html”.
>
> The body’s name is MTA-dependent. A safer way to check for this would
> be to look for messages which have a text/html part, but no text/plain
> part. (You wouldn’t want to look for text/html as the only part
> because it could have attachments, but still lack a plaintext version
> of the body.)

Netscape Mail encodes such emails as

    multipart/mixed
        multipart/alternative
            text/plain
            text/html
        attachments …

In other words, it nests a multipart/alternative into a multipart/mixed.
I was in the process of developing a filter to extend stripmime to handle
these other cases. In my flow, if I see a multipart/alternative entity,
I grab the text/plain sub-entity (I’m assuming it will always be there)
and ignore the rest.
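The flow described above might look roughly like this. It is a sketch only, not the actual filter: `select_parts` and the sample message are invented for illustration. Since a text/plain alternative may not always be present, the sketch falls back to keeping all the alternatives when none is found.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MIME::Entity;

# Hypothetical helper: return the leaf entities worth keeping.
# For multipart/alternative, keep only the text/plain sub-entity if there
# is one; for any other multipart, recurse into every sub-entity.
sub select_parts {
    my ($entity) = @_;
    my @parts = $entity->parts;
    return ($entity) unless @parts;                        # leaf part
    if ($entity->mime_type eq 'multipart/alternative') {
        my ($plain) = grep { $_->mime_type eq 'text/plain' } @parts;
        return ($plain) if $plain;                         # ignore the rest
    }
    return map { select_parts($_) } @parts;                # fallback / other multipart
}

# Made-up message nested the way described for Netscape Mail.
my $alt = MIME::Entity->build(Type => 'multipart/alternative');
$alt->attach(Type => 'text/plain', Data => 'Hi');
$alt->attach(Type => 'text/html',  Data => '<p>Hi</p>');

my $msg = MIME::Entity->build(Type => 'multipart/mixed');
$msg->add_part($alt);
$msg->attach(
    Type     => 'application/octet-stream',
    Encoding => 'base64',
    Data     => 'xyz',
);

my @kept = map { $_->mime_type } select_parts($msg);
print join(", ", @kept), "\n";
```

For the sample message this keeps the plain-text body and the attachment and drops the HTML alternative.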

James Dumser james.dumser@ericsson.com