Larry Ullman's Book Forums

Thanks to this book I have got a multiple access mailing list database ready to go, subject to the ISP server being updated.

 

Before it goes live, I have to deal with accents. Being just off the coast of France, many names here contain é, ê, ë, etc., and the preg_match() pattern listed in chapter 16 rejects them outright.

 

I am assuming that our Canadian friends must have already addressed this problem, so I hope someone will be able to supply a pattern match that will allow e and a with accents to be approved.

 

Colin

Link to comment
Share on other sites

There is a way using Unicode, among others. I will look into the details later, but if you do some Googling for regexes using "foreign" (i.e., non-English) characters or how to use Unicode in regexes, I think you'll find what you're looking for.

 

If you can't resolve this by the time I have a chance to look into it more thoroughly, I'll provide more info later.

Link to comment
Share on other sites

Thanks for the reply.

I found most of the results from the Google search difficult to understand.

However, this appears to do what I want:

 

$_foreign = "ÁÈÉáèé";

if(preg_match('/[^a-zA-Z0-9_-]+/', $_foreign))

 

Would you concur?

 

Colin

Link to comment
Share on other sites

Looking at it again and comparing it with what I have:

 

if (preg_match ('/^[A-Za-zé \'.-]{2,40}$/i', $trimmed['last_name']))

 

So what I gleaned from another site isn't going to do what I want.

 

I want to match the last_name field against a set of approved characters, and that set must include ÁÈÉáèé.

 

Colin

Link to comment
Share on other sites

Okay, I think I finally found the answer. I knew you'd have to search on Unicode, but it took me forever to find the Unicode code points for French.

 

If you have the time, here are some of the better explanations I found regarding foreign languages for regexes in general:

 

http://www.regular-expressions.info/unicode.html

(Basically explains that you have to use Unicode for all non-English characters. Also explains how to search for Unicode characters using the PCRE regex flavor, which is the one used by PHP.)

 

http://php.net/manual/en/function.preg-match.php

(This is the PHP.net manual explanation of the preg_match function. Do a search for "Unicode" to see some specific examples of how to search for Unicode characters.)

 

http://tlt.its.psu.edu/suggestions/international/bylanguage/french.html

(This was helpful for me, since I neither speak nor understand French well. Perhaps this is of no help to you.)

 

Anyway, after a bunch of searching, I came to realize that French doesn't have its own predefined Unicode block the way a language like Greek does. Almost all French characters that aren't already part of English are together in the Unicode block called "Latin-1 Supplement" (the exceptions I know of are œ, Œ and Ÿ, which are in "Latin Extended-A").

 

To see the various Unicode code groups, see the following:

http://www.unicode.org/charts/

 

And the chart you need for French is available at the following link:

http://www.unicode.org/charts/PDF/U0080.pdf

 

Now, obviously that chart contains a lot of Latin extended characters that are not in French, so basically, I think you need to pick out the ones you want. I'm not a French expert, but the following seem to be the Unicode code points you want for your characters, as per the PDF that is the last link I just provided:

 

Á = 00C1
È = 00C8
É = 00C9
á = 00E1
è = 00E8
é = 00E9

 

So with that info, the only conclusion I can make is that the following regex and PHP is what you want:

 

if (preg_match('/^[A-Za-z\x{00C1}\x{00C8}\x{00C9}\x{00E1}\x{00E8}\x{00E9} \'.-]{2,40}$/u',$trimmed['last_name']))

 

Again, I can't guarantee it'll be what you want, but that's the best I can conclude from not knowing French.
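
For anyone who wants to try the pattern outside the form script, here is a minimal sketch. The function names is_valid_last_name() and is_valid_last_name_any_letter() are made up for illustration. The first wraps the pattern above; the second swaps the explicit code points for \p{L} (any Unicode letter), which is an alternative of mine and not something taken from the book. It assumes PHP 7 or later (for the "\u{...}" string escapes) and input that is already UTF-8.

```php
<?php
// Hypothetical helper: wraps the explicit code-point pattern from the post above.
function is_valid_last_name(string $name): bool
{
    // The /u modifier makes PCRE treat both pattern and subject as UTF-8.
    $pattern = '/^[A-Za-z\x{00C1}\x{00C8}\x{00C9}\x{00E1}\x{00E8}\x{00E9} \'.-]{2,40}$/u';
    // preg_match() returns 1 on a match, 0 on no match, and false on error
    // (for example when the subject is not valid UTF-8).
    return preg_match($pattern, $name) === 1;
}

// Hypothetical alternative: accept any Unicode letter instead of a fixed list.
function is_valid_last_name_any_letter(string $name): bool
{
    return preg_match('/^[\p{L} \'.-]{2,40}$/u', $name) === 1;
}

$carre  = "Carr\u{00E9}";   // Carré
$muller = "M\u{00FC}ller";  // Müller; ü (U+00FC) is not in the explicit list

var_dump(is_valid_last_name($carre));              // bool(true)
var_dump(is_valid_last_name($muller));             // bool(false)
var_dump(is_valid_last_name_any_letter($muller));  // bool(true)
var_dump(is_valid_last_name("Carr\xE9"));          // bool(false): a Latin-1 byte, not valid UTF-8
```

The explicit list is stricter but has to be extended for every new accent (ê, ç and so on); \p{L} is more permissive and will also accept letters from non-Latin scripts, so which one fits depends on how tightly the names should be restricted.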

Link to comment
Share on other sites

I'm really no expert with regular expressions, but if you are using UTF-8, you should have a look at the multibyte functions, such as mb_ereg(), mb_ereg_match() and so on. Also be aware that the most common string functions, such as substr() and strlen(), work on bytes rather than characters, so they can miscount or split an accented letter in a French word; use their multibyte equivalents (mb_substr(), mb_strlen()) instead. And you must go UTF-8 all along the line: your HTML files must be in UTF-8, your database (or at least the tables where you have French) must be in UTF-8, and so on.
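
To illustrate the difference, here is a small sketch comparing the byte-based functions with their multibyte equivalents. It assumes the mbstring extension is loaded and PHP 7 or later; the variable names are only for illustration.

```php
<?php
// Carré, written with an escape so the encoding of this file does not matter.
$name = "Carr\u{00E9}";

var_dump(strlen($name));              // int(6): é takes two bytes in UTF-8
var_dump(mb_strlen($name, 'UTF-8'));  // int(5): five characters

$cut_bytes = substr($name, 0, 5);              // stops in the middle of é
$cut_chars = mb_substr($name, 0, 5, 'UTF-8');  // keeps the whole é

var_dump(mb_check_encoding($cut_bytes, 'UTF-8'));  // bool(false): broken UTF-8
var_dump($cut_chars === $name);                    // bool(true)
```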

Link to comment
Share on other sites

  • 2 weeks later...

Many thanks to those who have responded.

 

I converted the MySQL collation to utf8_general_ci, added the meta http-equiv content tag with charset utf-8, and put this code in place:

 

if (preg_match('/^[A-Za-z\x{00C1}\x{00C8}\x{00C9}\x{00CA}\x{00E1}\x{00E8}\x{00E9}\x{00EA} \'.-]{2,40}$/u',$trimmed['last_name']))

 

This worked and accepted the match. So far so good.

 

But when I looked at the mysql db, the word Carré had been stored as CarrĀ©

 

Where did that come from?

Link to comment
Share on other sites

Please excuse my ignorance but where would I do that?

 

Following other advice pages, I have set utf8 in the PHP config files and the MySQL files, and am looking into the means of confirming it when opening the database.

 

What (to my mind) in the era of international trade should be so simple, is proving to be an absolute minefield.

Link to comment
Share on other sites

But when I looked at the mysql db, the word Carré had been stored as CarrĀ©

 

Where did that come from?

 

First, your database, or at least the column where you store "Carré", must use utf8 and an appropriate collation (probably utf8_general_ci). If you change the character set after populating the database, you must convert the existing data to utf8:

ALTER TABLE tablename CONVERT TO CHARACTER SET utf8;

 

Of course, if only one column in the table should use utf8, change just that column (with ALTER TABLE and a MODIFY clause) rather than converting the whole table.

 

Second, you must add one of these two lines to your database connection file:

mysqli_set_charset($dbc, 'utf8');
OR
mysqli_query($dbc, 'SET NAMES utf8');

 

I use the first, in my PHP connection file, just after the connection itself. This makes sure PHP also uses utf8 for the data it sends to and receives from the database.
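
As a rough sketch of what such a connection file might look like: the function name connect_utf8() and all four parameters are placeholders of mine, not anything from the book, and your own file will differ.

```php
<?php
// Hypothetical helper; host, user, password and database name are placeholders.
function connect_utf8(string $host, string $user, string $password, string $database): mysqli
{
    $dbc = mysqli_connect($host, $user, $password, $database);
    if (!$dbc) {
        throw new RuntimeException('Could not connect: ' . mysqli_connect_error());
    }
    // Set the connection character set right after connecting, before any query.
    if (!mysqli_set_charset($dbc, 'utf8')) {
        throw new RuntimeException('Could not set charset: ' . mysqli_error($dbc));
    }
    return $dbc;
}

// Example call (not executed here; the values are made up):
// $dbc = connect_utf8('localhost', 'username', 'password', 'mailing_list');
```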

 

From the command line (in the mysql client), you must use

CHARSET UTF8;

immediately after choosing the database if you want the diacritics to display correctly.
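
As for where the odd characters come from: in UTF-8, é is stored as two bytes, and if any step along the way reads those bytes as a single-byte character set, each byte turns into its own character. The sketch below reproduces that with Latin-1 (ISO-8859-1) as the assumed single-byte set; the exact garbage you see depends on which character set is wrongly assumed, so this is an illustration rather than a diagnosis of your setup. It needs the mbstring extension.

```php
<?php
$name = "Carr\u{00E9}";  // Carré as UTF-8

var_dump(bin2hex("\u{00E9}"));  // string(4) "c3a9": é is two bytes in UTF-8

// Treat those UTF-8 bytes as if they were Latin-1 and convert them to UTF-8 again.
$garbled = mb_convert_encoding($name, 'UTF-8', 'ISO-8859-1');

var_dump($garbled === "Carr\u{00C3}\u{00A9}");  // bool(true): the text now reads "CarrÃ©"
var_dump(mb_strlen($garbled, 'UTF-8'));         // int(6): one character became two
```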

Link to comment
Share on other sites

Thank you so much for your help

 

The database, tables and all char columns had already been converted to utf8.

 

I have now added

 

mysqli_set_charset($dbc, 'utf8');

 

as suggested and now the results in my php web page all show correctly.

 

Thank you for your patience with me.

 

Mr Ullman, perhaps the next edition could include a few more notes on this subject.

Link to comment
Share on other sites

It's all in the book, really, if you look at chapters 14 and 15. There are a few explanations from these two chapters that won't work because they apply to PHP 6 only, but it's really a minority. All the functions (mysqli_set_charset(), etc.) are explained. And Larry provides us with a very good introduction to Unicode and all that entails.

 

All the same I also spent ages trying to get my head around all these globalisation problems. I'm afraid that can't be avoided at this stage of PHP and MySQL development. UTF8 is still more of an added layer than an integral part of the languages.
