Larry Ullman's Book Forums


Hello,

 

I've already spent quite a long time trying to find the right syntax for a very simple SELECT query, but to no avail. I've also been reading chapter 9, on Globalization, from the MySQL 5.1 Reference Manual, but I still don't understand what's happening (or, rather, I don't want to believe that most MySQL functions are not multi-byte safe!).

 

The three SELECT queries I've tried are the following:

 

SELECT flexion FROM flexions WHERE flexion LIKE 'aimer%';
SELECT flexion FROM flexions WHERE flexion LIKE 'aimer%' COLLATE utf8_swedish_ci;
SELECT flexion FROM flexions WHERE MATCH (flexion) AGAINST ('aimer*' IN BOOLEAN MODE);

 

Whatever the query, the results always include "aimèrent"… which they shouldn't!

 

Does anyone have any idea of how to force MySQL to distinguish between 'e' and 'è'?

 

Functions like SUBSTRING() explicitly state that they are multi-byte safe. I've read no such thing in the manual about LIKE or about MATCH. Are there other functions I could use instead?
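(A rough sketch of one more thing that might be worth testing: forcing a binary collation on the comparison with COLLATE utf8_bin, in the same position where utf8_swedish_ci was tried above. A binary collation distinguishes 'e' from 'è', but it is also case-sensitive, so it may only be a partial answer. The connection details below are placeholders.)

<?php
// Rough sketch only: host, database, user, and password are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8', 'user', 'password');

// COLLATE utf8_bin forces a code-point comparison, so 'aimer%' should no longer
// match 'aimèrent' (but note the comparison also becomes case-sensitive).
$stmt = $pdo->query("SELECT flexion FROM flexions WHERE flexion LIKE 'aimer%' COLLATE utf8_bin");
foreach ($stmt as $row) {
    echo $row['flexion'], "\n";
}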

 

With thanks for your help,


I set up a database some time back where I was storing everything in Japanese using UTF-8, and it was all fine.

 

With that said, I don't remember exactly what I did, but so long as everything along the line is UTF-8 (including the database and tables), the query *should* be able to differentiate between the two 'e's.

 

Sorry I couldn't be of more help.


Thanks, HartleySan.

 

I've at last found the explanation (at least I think so), and it's discouraging:

http://dev.mysql.com/doc/refman/5.0/en/charset-collation-effect.html

 

Example 2: Searching for German Umlauts

 

Suppose that you have three tables that differ only by the character set and collation used:

 

mysql> SET NAMES utf8;
mysql> CREATE TABLE german1 (
    ->   c CHAR(10)
    -> ) CHARACTER SET latin1 COLLATE latin1_german1_ci;
mysql> CREATE TABLE german2 (
    ->   c CHAR(10)
    -> ) CHARACTER SET latin1 COLLATE latin1_german2_ci;
mysql> CREATE TABLE germanutf8 (
    ->   c CHAR(10)
    -> ) CHARACTER SET utf8 COLLATE utf8_unicode_ci;

 

Each table contains two records:

 

mysql> INSERT INTO german1 VALUES ('Bar'), ('Bär');
mysql> INSERT INTO german2 VALUES ('Bar'), ('Bär');
mysql> INSERT INTO germanutf8 VALUES ('Bar'), ('Bär');

 

Two of the above collations have an A = Ä equality, and one has no such equality (latin1_german2_ci). For that reason, you'll get these results in comparisons:

 

mysql> SELECT * FROM german1 WHERE c = 'Bär';
+------+
| c    |
+------+
| Bar  |
| Bär  |
+------+
mysql> SELECT * FROM german2 WHERE c = 'Bär';
+------+
| c    |
+------+
| Bär  |
+------+
mysql> SELECT * FROM germanutf8 WHERE c = 'Bär';
+------+
| c    |
+------+
| Bar  |
| Bär  |
+------+

 

This is not a bug but rather a consequence of the sorting properties of latin1_german1_ci and utf8_unicode_ci (the sorting shown is done according to the German DIN 5007 standard).

 

If I understand rightly (and I'll test to confirm), I would be better off with some latin1_xxx character set than with Unicode, which really seems odd to me, since I thought the aim of Unicode was to take into account the peculiarities of each language. I don't (much) care about sorting, but I really care about distinguishing between words such as "a" and "à", which are not at all the same in French (the first is a verb, the second a preposition).


Thank you, HartleySan. I had indeed seen that page… it's what sent me to the one I quoted, which only confirmed what I understood from the page you refer to.

 

It's really, really annoying. What's the use of being able to store the data in whatever language you choose if you can't retrieve it in a way that's meaningful in that language? That's a rhetorical question, of course, since none of us can change a thing about it.

 

I think I'll have to change the default character set and collation for this project, which also means having a separate configuration file, a separate connection file, separate functions, and separate templates, since the HTML files won't be using the same encoding… Really despairing!


Well, I have not looked into this thoroughly, but I find it hard to believe that UTF-8 cannot be used to achieve what you want.

 

For example, what if you retrieve all the results, including the ones with the 'e' you don't want, and then filter them further on the PHP side? I know that's an extra step, but better to have something working half-assed than not at all. It's getting late here, but I will search for an answer and try to get back to you before I go to bed.
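Something along these lines, maybe (just a sketch: the connection details are placeholders, and strncmp() is only one way to do the byte-level check). PHP compares the raw UTF-8 bytes, so 'aimer' and 'aimè…' stay distinct even though MySQL returned both:

<?php
// Rough sketch only: connection details are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8', 'user', 'password');

$prefix = 'aimer';

// Cast a wide (accent-insensitive) net in SQL first...
$stmt = $pdo->prepare('SELECT flexion FROM flexions WHERE flexion LIKE ?');
$stmt->execute(array($prefix . '%'));

// ...then keep only the rows whose raw UTF-8 bytes really start with the prefix.
// strncmp() compares bytes, so 'aimèrent' is dropped even though MySQL matched it.
$matches = array();
foreach ($stmt as $row) {
    if (strncmp($row['flexion'], $prefix, strlen($prefix)) === 0) {
        $matches[] = $row['flexion'];
    }
}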


This is the best thing I could find:

 

http://stackoverflow.com/questions/5408188/mysql-search-like-with-optional-french-characters

 

And the main quote of interest:

 

Since MySQL's search implementation is very, very poor, there is no way to do it using just MySQL. (By the way, you are not using full-text searching, as mentioned in the tags.)

I would put this logic in your application, so it would find all the variants for those cases and append them to the query.

 

Like I recommended before, just put the logic in PHP. Another poster says that the collation may solve the issue, but that doesn't seem to be the case.


Thanks, HartleySan. Doing part of the work in PHP could be a solution sometimes, but I think that for my project there are numerous cases where it wouldn't be very helpful. What makes me pessimistic about existing solutions is that Google, for instance, is quite bad at searches in French (or German, or any language using diacritics [accents, and so forth], I suppose). For instance, "élève" is a noun and means a student in French, while "élevé" is a past participle and means "raised up", "high up", and many other things; it's just not the same word as "élève". Google is unable (or unwilling) to differentiate between the two, so Google searches usually give thousands of answers where there could be only a few dozen if the real spelling were taken into account.

 

If it's midnight for you in Japan, it was tea time for me in Europe, and I've been thinking of another solution while drinking my tea: for all the tables I have to rely on for searches, I can add a column with a numeric equivalent of the words; fortunately, as far as I can see right now, all these tables would have single words in the columns I'm interested in. So creating a PHP script that automatically creates the numerical equivalent of each word before I enter both forms into the database should be rather easy. This way, I could keep the convenience of Unicode for alphabetical sorting and still be able to retrieve the spellings I want right from the database.

 

I'll play around with this possibility in the next few days!
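(A possible sketch of that idea, in case it helps: storing a hex encoding of each word's UTF-8 bytes in an extra column. The keyhex column name is made up for the example, and the connection details are placeholders. Hex keys are plain ASCII, so no collation can fold them, and prefix searches still work because the hex of a prefix is a prefix of the hex of the whole word.)

<?php
// Rough sketch only: connection details and the keyhex column are assumptions.
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8', 'user', 'password');

// On INSERT: store the word plus a hex "numeric equivalent" of its UTF-8 bytes.
$word = 'aimèrent';
$ins = $pdo->prepare('INSERT INTO flexions (flexion, keyhex) VALUES (?, ?)');
$ins->execute(array($word, bin2hex($word)));

// On SELECT: search the hex column; 'aima' and 'aimà' give different hex prefixes.
$sel = $pdo->prepare('SELECT flexion FROM flexions WHERE keyhex LIKE ? ORDER BY flexion');
$sel->execute(array(bin2hex('aima') . '%'));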


You might consider running anything you insert as text into the database through htmlentities(). It should convert à to &agrave;. Then be sure to run htmlentities() on any text you are searching for as well, so the query searches for the entity code instead of the character. I haven't tried it, but unless the semicolon causes a problem, it seems like it should work. And you should be able to print the converted text directly in the browser and have it display the character. If you need to, you can convert it back with html_entity_decode().
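Something like this, roughly (untested, as I said; with UTF-8 input you have to pass the character set to htmlentities() explicitly, and the stored strings get longer, so the columns would need to be wide enough):

<?php
// Rough sketch of the htmlentities() idea; untested.
$word = 'aimèrent';

// Store the entity-encoded form: 'aimèrent' becomes 'aim&egrave;rent'.
$stored = htmlentities($word, ENT_QUOTES, 'UTF-8');

// Apply the same conversion to the search term, so that 'aimer%' and the
// entity-encoded form of 'aimèrent' no longer look alike to MySQL.
$pattern = htmlentities('aimer', ENT_QUOTES, 'UTF-8') . '%';

// Convert back whenever the real characters are needed outside the browser.
$plain = html_entity_decode($stored, ENT_QUOTES, 'UTF-8');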


Thank you for the suggestion, Paul. Since my last post, I've been thinking that I could create a new column in the database with the same text as in the Unicode column, and just a different character set (latin1) and collation (latin1_general_ci), so that I could benefit both from utf8 and from latin1. There's a multibyte function called mb_convert_encoding() that seems promising. But if it doesn't work, I'll certainly try htmlentities() as you suggest.
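(For the record, a minimal sketch of how mb_convert_encoding() could be used for that second column: convert the UTF-8 string to ISO-8859-1 before storing it. The main caveat is that any character with no latin1 equivalent is lost in the conversion.)

<?php
// Rough sketch only: convert a UTF-8 string to ISO-8859-1 (latin1) for the new column.
$utf8 = 'aimèrent';
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

// $utf8 would go into the utf8 column and $latin1 into the latin1 column.
// Characters absent from latin1 (œ, for instance) are dropped or replaced.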


I spent quite a lot of time trying to do things with mb_convert_encoding() and other functions, also considering Paul's suggestion of converting to HTML entities (though I had no idea how complex it would be starting from multibyte strings), and various other possibilities that all seemed bound to fail if I also wanted or needed to use functions like stripslashes() or trim()… and you know what works? Using exactly the same variable/text in both columns of my database: one with a UTF-8 character set and the corresponding collation, the other with a latin1 character set and the corresponding collation. (So my INSERT query is something like INSERT INTO flexions (flexion, conversion) VALUES ($string, $string); no other work needed.)
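(For anyone who wants to reproduce this, a rough sketch of what the table and the INSERT might look like; the column sizes, the utf8 collation, and the connection details are assumptions. Because the connection character set is utf8, MySQL itself converts the incoming string to latin1 for the conversion column, which is why the same variable can be bound to both placeholders.)

<?php
// Rough sketch only: column sizes, the utf8 collation, and the connection details are assumptions.
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8', 'user', 'password');

$pdo->exec("CREATE TABLE IF NOT EXISTS flexions (
    flexion    VARCHAR(64) CHARACTER SET utf8   COLLATE utf8_general_ci,
    conversion VARCHAR(64) CHARACTER SET latin1 COLLATE latin1_general_ci
)");

// The same UTF-8 string is bound twice; MySQL converts it to latin1 for the
// conversion column because the connection character set is utf8.
$string = 'aimèrent';
$stmt = $pdo->prepare('INSERT INTO flexions (flexion, conversion) VALUES (?, ?)');
$stmt->execute(array($string, $string));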

 

These two columns are called "flexion" and "conversion", and here are the results, depending on which of the two columns I use for the WHERE condition:

 

// SELECTion based on a latin1 column (conversion)
SELECT flexion FROM flexions WHERE conversion LIKE 'aima%' ORDER BY flexion;

 

Rows: 13
flexion
aima
aimai
aimaient
aimais
aimais
aimait
aimant
aimas
aimasse
aimassent
aimasses
aimassiez
aimassions
 

// SELECTion based on a UTF8 column (flexion)
SELECT flexion FROM flexions WHERE flexion LIKE 'aima%' ORDER BY flexion;

 

Rows: 16
flexion
aima
aimai
aimaient
aimais
aimais
aimait
aimâmes
aimant
aimas
aimasse
aimassent
aimasses
aimassiez
aimassions
aimât
aimâtes

 

So I can have the best of both encoding worlds with no headaches!

