Utf8 and accents

Curtis_Bruneau · August 8, 2008, 8:20pm

I need some suggestions. I have come to the conclusion that none of the
case-insensitive utf8 collations handle French properly, at least not the
way latin1 does. All accented variants of a letter are treated as the
same character: although they are distinct at the byte level, they cannot
coexist in a unique index, and both sorting and queries using any variant
treat them as equal.

So I’m in a bit of a bind. If I were to use RT with a case-sensitive
collation like utf8_bin, would the application behave as expected? I know
searches would be much stricter and possibly confusing to end users.

My other option would be to continue to use latin1. Is there any way to
do this with the latest code base? It’s probably not configurable, and I
don’t want to have to maintain diffs for the necessary changes, unless
they are fairly minimal…

The issue in question → http://bugs.mysql.com/bug.php?id=34130

They said it’s on the ‘todo’ list. MSSQL handles this with ci_ai, ci_as,
cs_ai and cs_as collations, where case sensitivity and accent sensitivity
are chosen independently. Hopefully MySQL comes around to it…
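
To make the four combinations concrete, here is a minimal Python sketch. It only approximates the idea with Unicode decomposition and case folding; it is not how MySQL or MSSQL implement collations, and `strip_accents` and `compare_key` are hypothetical helper names. Two strings count as "equal" when their keys match, which is also what decides whether a unique index would reject the second value.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (e.g. é -> e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def compare_key(text: str, case_sensitive: bool, accent_sensitive: bool) -> str:
    """Hypothetical comparison key: strings are 'equal' when keys match."""
    if not accent_sensitive:
        text = strip_accents(text)
    if not case_sensitive:
        text = text.casefold()
    # Normalize so composed and decomposed spellings compare the same.
    return unicodedata.normalize("NFC", text)


words = ["cote", "Cote", "coté", "Coté"]
modes = {
    "ci_ai": (False, False),
    "ci_as": (False, True),
    "cs_ai": (True, False),
    "cs_as": (True, True),
}

for name, (case_sensitive, accent_sensitive) in modes.items():
    distinct = {compare_key(w, case_sensitive, accent_sensitive) for w in words}
    # A unique index could hold only this many of the four words.
    print(name, len(distinct))
```

Under these assumptions the ci_ai-style key collapses all four words into one value, which matches the behaviour described above for the case-insensitive utf8 collations, while the cs_as-style key keeps all four apart, much like a binary collation.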

Character differences for MySQL … http://www.collation-charts.org/mysql60/

Curtis

Ruslan_Zakirov1 · August 8, 2008, 8:35pm

utf8_bin is a good choice; you’re free to use a binary collation. Maybe
the utf8_general_ci collation will work better for you. Any collation is
OK as long as you know how to deal with it in MySQL.

No, we wouldn’t return to latin1, as that setup was totally wrong and had
consequences: it actually violated the purpose of the setting. RT was
storing UTF-8 encoded data in latin1 columns, so collations worked
absolutely incorrectly for everything, even latin1 text, and were close
to binary.

At this point I can suggest you either move to a binary collation or
create a new one and send it to the MySQL team for inclusion.
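
The point about UTF-8 data sitting in a latin1 column can be illustrated with a small Python sketch. It uses Python's `latin-1` codec as a stand-in for how a single-byte column would see the stored bytes (an assumption, since MySQL's latin1 is not byte-for-byte identical to it), and `as_latin1_column` is a hypothetical helper name.

```python
def as_latin1_column(text: str) -> str:
    """Reinterpret the UTF-8 bytes of text one byte per character."""
    return text.encode("utf-8").decode("latin-1")


stored_e_acute = as_latin1_column("é")  # two characters: 'Ã' and '©'
stored_plain_e = as_latin1_column("e")  # still the single character 'e'

# The accented letter is no longer one letter with an accent, so a
# single-byte collation has no accent to fold away.
print(len(stored_e_acute), len(stored_plain_e))
print(stored_e_acute == stored_plain_e)
```

Because each accented letter becomes a pair of unrelated single-byte characters, the collation's accent rules never get a chance to treat it as its base letter, which is consistent with the "close to binary" behaviour described above.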

Best regards, Ruslan.

Curtis_Bruneau · August 8, 2008, 8:51pm

OK, I was just wondering; I’ll give it a try… I was more curious whether
string comparison clauses would still work internally, since binary
collations are sensitive to everything, case included. I’m guessing
that’s all fine, because I think Postgres stores its data in a binary,
case-sensitive form and relies on the OS to do collation (something like
that; other Postgres databases around here seem to be case sensitive).

Understood, I didn’t like that idea either. Oddly enough,
latin1_swedish_ci (the latin1 default) isn’t supposed to be accent
sensitive, while latin1_general_ci is, yet my old database (MySQL 4.1)
seems to index accented values and treat them as separate. The collation
isn’t specified, so I’m assuming swedish, but it’s behaving like general;
perhaps the old version respected the differences. I’m basically trying
to get the same behaviour as before (perhaps if swedish had been enforced
before, I wouldn’t be in this position). Regardless, this isn’t really an
issue with RT.

Thanks again for your time. I’m really excited to launch 3.8.x; compared
to 3.4.x, our users are loving it, especially the reporting and all that.
Curtis.

Curtis_Bruneau · August 11, 2008, 1:54pm

I have a question that’s probably obvious… If I go ahead with utf8_bin,
any variation in the case of incoming email addresses will be regarded as
distinct, right? I can see this causing many issues. I may just get rid
of my accented email addresses and possibly merge the tickets, or just
delete the users, as those aren’t valid addresses anyway. I don’t think I
could pad the emails enough to get the users to match; looking through my
data, emails come in with all kinds of different casing.
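
One way to gauge the size of the problem before switching might be to list the addresses that differ only by case. The following Python sketch does that on a plain list of strings; the sample addresses are made up, `group_by_casefold` is a hypothetical helper, and nothing here reflects how RT itself matches users.

```python
from collections import defaultdict


def group_by_casefold(addresses):
    """Return groups of addresses whose spellings differ only by letter case."""
    groups = defaultdict(set)
    for address in addresses:
        cleaned = address.strip()
        groups[cleaned.casefold()].add(cleaned)
    # Keep only keys that have more than one distinct spelling.
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}


# Hypothetical sample data, not taken from any real RT database.
sample = [
    "Jane.Doe@Example.com",
    "jane.doe@example.com",
    "bob@example.com",
    "BOB@example.com",
    "carol@example.com",
]

for key, spellings in group_by_casefold(sample).items():
    # Each group would become several distinct values under a binary collation.
    print(key, spellings)
```

Each reported group is a set of spellings that a case-insensitive collation treats as one value but a binary collation such as utf8_bin would treat as several.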

Curtis