Invalid chars when creating mariadb fulltext index

I’m migrating an rt instance that we’ve been running for almost 20 years. We’re currently on rt 4.2.13 / mysql 5.1.73 with sphinx fulltext. I’m testing a migration to rt 5.0.0 / mariadb 10.3.17.

I see rt doesn’t support sphinx with mariadb, so I’m looking at native fulltext indexing. The issue I’m having is that we have content in our database (~1.6 million tickets) that contains invalid unicode characters. Trying to run rt-setup-fulltext-index fails with

[93315] [Wed Nov 18 20:49:02 2020] [warning]: DBD::mysql::st execute failed: Incorrect string value: ‘\xBD\xC3\xB3\xD7\xC6\xC4…’ for column rt3.AttachmentsIndex.Content at row 92 at /app/rt5/sbin/rt-fulltext-indexer line 238. (/app/rt5/sbin/rt-fulltext-indexer:238)

and a bunch more similar messages. The sequence in question is actually \xbd\xc3\xb3\xd7\xc6\xc4\xc5\xa9, which is complete nonsense. This is part of a legitimate ticket which contains a spam message forwarded in the body. RT accepted the message, the database was fine with it as Attachments Content is a longblob, but now we’re trying to stuff that into a longtext column which isn’t going to work.

I wondered if anyone had already encountered and fixed this issue. As far as I can see my options are:

  • Clean up the database (which presumably won’t stop this happening in the future, unless RT now sanitizes / rejects messages with invalid characters before they get into the DB).

  • Modify rt-fulltext-indexer to remove invalid characters before inserting to AttachmentsIndex.

Thanks for any help.