HTML stripper

I’ve seen numerous posts here from people ($self->include) wishing that RT
could take incoming HTML mail and strip it down to plain text. I wrote this
Perl script to do that. It may not be the most elegant solution, but it works
and can be used for RT, Majordomo, whatever.

Basically it takes mail from STDIN and spits out a new one to STDOUT, so you
can put in your /etc/aliases:

 rt-queue: "| htmldump | rt-mailgate etc..."

If incoming mail is a straight text/html MIME type, the script will run it
through lynx -dump (or you can use html2txt) to generate a text version. Since
this may not be the prettiest formatting, a header is attached saying “this
was generated from HTML automatically etc” and the original HTML email is
preserved. The output of the script in this case will be a multipart MIME
email which has the text part first and then the HTML as another MIME
attachment, given the name “original.html” so it’s obvious when viewed in the
RT ticket.
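
For anyone who wants to see the shape of that transformation without opening the attachment, here is a minimal sketch. It is not the htmldump Perl script: it is written in Python, it uses the standard library's html.parser as a crude stand-in for lynx -dump, and the names wrap_html_message, html_to_text and NOTICE are made up for illustration.

```python
from email import message_from_string, policy
from email.message import EmailMessage
from html.parser import HTMLParser

# Hypothetical notice text; the real script's wording is not shown in this thread.
NOTICE = "[This text was generated automatically from an HTML message.]"


class _TextExtractor(HTMLParser):
    """Collects text nodes only; far cruder than `lynx -dump`."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Return the text content of an HTML string with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def wrap_html_message(raw):
    """Turn a bare text/html mail into a text part plus original.html."""
    original = message_from_string(raw, policy=policy.default)
    if original.get_content_type() != "text/html":
        return raw  # not a bare HTML mail: pass it through untouched

    html = original.get_content()

    wrapped = EmailMessage()
    # Copy a few common headers; a real filter would carry over more.
    for name in ("From", "To", "Subject", "Date", "Message-ID"):
        if original[name] is not None:
            wrapped[name] = original[name]

    # Text part first, then the untouched HTML as a named attachment.
    wrapped.set_content(NOTICE + "\n\n" + html_to_text(html) + "\n")
    wrapped.add_attachment(html, subtype="html", filename="original.html")
    return wrapped.as_string()
```

In a mail pipe this would be driven by something like `sys.stdout.write(wrap_html_message(sys.stdin.read()))`.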

If the incoming email is already multipart, any text parts and attachments are
passed on unchanged. HTML parts are treated as above, with the exception that
if the MIME header already has a filename for the HTML part, it won’t get
given the “original.html” name.
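
The naming rule for multipart mail can be sketched on its own. Again this is Python rather than the Perl in htmldump, label_html_parts is a hypothetical name, and the sketch only shows how the filename is chosen; it does not generate the text version.

```python
from email import message_from_string, policy


def label_html_parts(raw, default_name="original.html"):
    """Name unnamed text/html parts; leave every other part as it arrived."""
    msg = message_from_string(raw, policy=policy.default)
    if not msg.is_multipart():
        return msg  # only multipart mail is handled in this sketch

    for part in msg.walk():
        if part.get_content_type() != "text/html":
            continue  # text parts and attachments pass through unchanged
        if part.get_filename() is None:
            # No filename in the MIME header, so fall back to the default.
            del part["Content-Disposition"]
            part.add_header("Content-Disposition", "attachment",
                            filename=default_name)
    return msg
```

An HTML part that arrives with its own filename keeps it, so an attached report is not confused with the message body.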

I’m sure it can be improved, but it seems to work well enough for what I need.

I suppose attaching the script would help…

htmldump (4.14 KB)

Why re-invent the wheel?

http://scifi.squawk.com/demime.html

Works wonderfully for us.

-Todd

(Replying to Craig Schenk’s post of Tue, Mar 30, 2004 at 07:45:53PM -0500.)


rt-users mailing list
rt-users@lists.bestpractical.com
The rt-users Archives

Have you read the FAQ? The RT FAQ Manager lives at http://fsck.com/rtfm

Demime takes a sledgehammer approach, which is why I chose not to use it. It
doesn’t just strip HTML, it strips ANYTHING that isn’t plain text. People using
our RT do sometimes need to send non-text attachments such as Word documents,
JPEGs, or even HTML documents. Demime clobbers them all, and is a great tool if
you want nothing but text in your RT, mailing list, etc. You could of course
set up multiple aliases for the same RT queues, one MIME-stripped and one
MIME-allowed, but that would require more administration and more attentive
users. The script I wrote doesn’t clobber attachments; it just shows a
plain-text version of the HTML in them, so someone reading a ticket can still
get at the MIME attachments if they want.

(Replying to Todd Chapman’s post of 31-Mar-2004.)

I tried your script. When I replied to an e-mail from
RT your script stripped the subject… and the content.

Here is the attachment info from the RT database. Only
this header survives.

X-RT-Original-Encoding: iso-8859-1
Content-Length: 0

(Replying to Craig Schenk’s post of Tue, Mar 30, 2004 at 09:13:25PM -0500.)

I worked with this code for several hours, trying to make it work.

It didn’t encode the other parts (like images). It didn’t set a header
to indicate that the mail was filtered.

It shouldn’t convert HTML if there is a text alternative for the HTML.

And I prefer ISO-8859-1.

I looked at a couple of other alternatives (demime and stripmime) that
didn’t do what I wanted.

First of all, it should create a text/plain part from HTML if none exists.
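
The two rules here (leave the mail alone when a text alternative exists, generate text/plain only when none does) boil down to one check. A hedged Python sketch follows; it is not code from the script, needs_generated_text is a hypothetical name, and it assumes that parts marked as attachments should not count as the readable body.

```python
from email import message_from_string, policy


def needs_generated_text(raw):
    """True if the mail has an HTML body part but no text/plain body part."""
    msg = message_from_string(raw, policy=policy.default)
    has_html = False
    for part in msg.walk():
        if part.is_attachment():
            continue  # assumption: attachments are not the readable body
        content_type = part.get_content_type()
        if content_type == "text/plain":
            return False  # a text alternative already exists
        if content_type == "text/html":
            has_html = True
    return has_html
```

A filter would only run its HTML-to-text conversion when this returns True.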

Before I continue modifying htmldump, are there any other alternatives?
Craig, do you have a later version of htmldump?

jonas@rit.se RIT AB http://www.rit.se
Box 70, 428 21 Kållered Visiting address: G:a Riksvägen 36
Tel: +46 (0)31 751 8600 Fax: +46 (0)31 751 8609

I’ve done a little work on it but not much; I’ve been busy with other
projects. I was planning to get back to it next week and try to fix the bug
(it seems to be chomping MIME entities too aggressively when run as a mail
pipe, even though it worked fine in command-line simulations).

I’m going to recommend that you try to avoid altering the incoming HTML
message, as anything we do to a message that’s not reversible means less
information for staff working with tickets down the line. Instead, make the
display logic smarter… which is just what I did while sitting at home sick
last night… (it’ll be in RT 3.2)

http://svn.bestpractical.com/index.cgi/public/log/rt/branches/rt-3.1/html/Elements/ScrubHTML
http://svn.bestpractical.com/index.cgi/public/log/rt/branches/rt-3.1/html/Ticket/Elements/ShowTransaction
http://svn.bestpractical.com/index.cgi/public/log/rt/branches/rt-3.1/html/Ticket/Elements/ShowTransactionAttachments

That’s fantastic. It’s what I’ve done with limited success. Will this
also populate replies and AdminCc mail properly? That’s where I’ve
fallen down.

(Replying to Jesse Vincent’s post.)

With regards,

Say_Ten
