Storing messages containing invalid encodings

Hello,

We have found that messages from one particular sender are declared
as being in UTF-8, but contain byte sequences which are not valid
UTF-8; in particular 0xb2, 0xb3 and 0xb9. These appear to relate to
particularly brain-dead renderings of various quotation marks:
http://www.memoryhole.net/kyle/2007/08/superscriptone.html
(although that page doesn’t cover the extra breakage of inserting
those particular bytes into a UTF-8 encoded document).

With PostgreSQL at least, attachments are stored internally as
Unicode characters, so PostgreSQL not unreasonably refuses to store
such an attachment. Of course, it’s then impossible to create a ticket.

In an ideal world, the correspondent would receive the error message,
enquire further, be told why his/her message wasn’t usable, and fix
his/her software.

In practice, this is unlikely to happen in this particular case, and
the messages are considered to be of high value to the organisation.

So, what to do? I’ve thought of four possibilities:

One: validate all data received by RT and pass it through a
heuristic routine which would replace all invalid byte sequences with
some number of U+FFFD characters before storing the message. This
might be controversial behaviour if the expectation is that RT stores
exactly what was supplied to it.
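
For what it’s worth, a minimal sketch of what such a routine might do
(Python purely for illustration, since RT itself is Perl; the sample
byte is one of the ones we’ve been seeing):

    def sanitize_utf8(raw: bytes) -> str:
        """Decode bytes claimed to be UTF-8, replacing any invalid
        sequences with U+FFFD instead of rejecting the whole message."""
        # errors='replace' substitutes U+FFFD for each undecodable byte.
        return raw.decode("utf-8", errors="replace")

    # A body declared as UTF-8 but containing a bare 0xb9 still decodes,
    # with U+FFFD standing in for the offending byte.
    print(sanitize_utf8(b"see footnote \xb9 below"))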

An alternative approach would be to alter the database schema to allow
for an attachment with an unknown or invalid encoding; the binary data
would be stored unmodified, and the web interface would offer the raw
data for download, to be interpreted at the user’s whim.
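
The decision point might look something like this (the names here are
invented for illustration and don’t reflect RT’s actual schema):

    def classify_attachment(raw: bytes, declared_charset: str = "utf-8"):
        """Return (content, encoding_label): decoded text when the
        declared charset is honest, otherwise the untouched bytes
        flagged as unknown so the UI can offer them for download."""
        try:
            return raw.decode(declared_charset), declared_charset
        except (UnicodeDecodeError, LookupError):
            # Keep the original bytes verbatim; render nothing inline.
            return raw, "unknown"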

A third approach might involve filtering the incoming message outside
of RT; this might be the most practical way to achieve the behaviour we
desire, especially since it could easily be confined to individual queues.
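
Such a filter could be a simple pass-through sitting in the mail
pipeline ahead of rt-mailgate for the affected queues. A sketch only,
since it ignores MIME structure entirely, which a real filter ought to
respect:

    #!/usr/bin/env python3
    """Read a raw message on stdin, replace any bytes that are not
    valid UTF-8 with U+FFFD, and write the result to stdout for
    delivery to rt-mailgate."""
    import sys

    raw = sys.stdin.buffer.read()
    cleaned = raw.decode("utf-8", errors="replace").encode("utf-8")
    sys.stdout.buffer.write(cleaned)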

Yet another acceptable workaround might be a much smaller modification:
notify the queue owners, as well as the correspondent, that a message
failed to be stored.

Our logs indicate we’ve had 9 such occurrences (although some may
relate to a separate UTF-8-related bug fixed in RT 3.8.8, which we’ve
only just installed) across some 37,000 tickets, so it’s not a
particularly common problem.

I would be interested to hear from anyone else encountering this issue,
and about any work done to improve the situation for the unfortunate
recipient of highly important garbage emails. When it comes down to both
user expectations and the oft-quoted principle of being liberal in what
one accepts, there is clearly some room for improvement here.

Cheers,
Dominic.

Dominic Hargreaves, Systems Development and Support Team
Computing Services, University of Oxford


Hi Dominic,

I would choose any approach that keeps bad data out of the database,
in this case incorrect UTF-8. Is it possible to reroute bad attachments
to separate storage for review by the responsible parties, ideally
before they reach RT? Maybe some sort of bad-data quarantine, similar
to anti-spam quarantines. RT could perhaps also automatically sanitize
the data if needed, using iconv and noting that in the ticket somehow.
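
A quarantine step in front of RT might look roughly like this (the
path, naming and exit behaviour are placeholders, not a recommendation):

    #!/usr/bin/env python3
    """Valid UTF-8 messages pass straight through to stdout; anything
    containing invalid sequences is written untouched to a quarantine
    directory for a human to review."""
    import pathlib
    import sys
    import time

    QUARANTINE = pathlib.Path("/var/spool/rt-quarantine")

    raw = sys.stdin.buffer.read()
    try:
        raw.decode("utf-8")                    # strict validity check
        sys.stdout.buffer.write(raw)           # clean: hand on to RT
    except UnicodeDecodeError:
        QUARANTINE.mkdir(parents=True, exist_ok=True)
        (QUARANTINE / f"msg-{int(time.time())}.eml").write_bytes(raw)
        sys.exit(75)                           # EX_TEMPFAIL for the MTA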

Regards,
Ken

> I would choose any approach that keeps bad data out of the database,
> in this case incorrect UTF-8.

That’s sort of a non-starter for me. On the open Internet, bad data is
a reality. RT needs to be able to deal. I suspect that “storing the
bad data base64 encoded” may be a plausible “solution”.
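
Just to make the shape of that concrete, something like the following
(the field names here are invented for illustration, not RT’s actual
column names):

    import base64

    def store_body(raw: bytes) -> dict:
        """Keep valid UTF-8 as text; anything else is stored verbatim
        as base64 so nothing is lost and the database only ever sees
        well-formed data."""
        try:
            return {"Content": raw.decode("utf-8"),
                    "ContentEncoding": "none"}
        except UnicodeDecodeError:
            return {"Content": base64.b64encode(raw).decode("ascii"),
                    "ContentEncoding": "base64"}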

-jesse

> > I would choose any approach that keeps bad data out of the database,
> > in this case incorrect UTF-8.
>
> That’s sort of a non-starter for me. On the open Internet, bad data is
> a reality. RT needs to be able to deal. I suspect that “storing the
> bad data base64 encoded” may be a plausible “solution”.
>
> -jesse

I agree. But this does keep the bad data out of the database, by
converting it to well-formed UTF-8 via base64 encoding. Hopefully this
can be done in a way that avoids forcing all of the good Internet
citizens and RT users to take the CPU/IO performance hit for the small
group of badly behaved apps and worse.

Regards,
Ken