Larry Ullman's Book Forums

Search Engine For A Message Board Doesn't Index Cyrillic Characters



Hi, I was working through the code for creating a search engine that accompanies this book, from here:

 

http://www.peachpit.com/articles/article.aspx?p=1802754&seqNum=4

 

and encountered an issue in the se_index.php file.

 

 

Everything worked fine until I tried Cyrillic characters; the

 

preg_match_all('/\b\w+\b/', $content, $output)

 

function didn't allow them to get through:

 

 

 

After researching the net, I found the solution: adding the u modifier (for Unicode) to the pattern, like this:

 

preg_match_all('/\b\w+\b/u', $content, $output);

 

and it worked on the localhost with LAMPP.
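
To see what the u modifier changes, here is a minimal sketch that runs the same pattern with and without it. The sample string is made up for illustration and is not from se_index.php. The behaviour noted in the comments assumes a PHP build where /u also enables Unicode character properties for \w and \b, which appears to be the case on current PHP versions but may not hold on older ones.

```php
<?php
// Illustrative input (an assumption, not from the original post):
// Latin letters, Cyrillic letters, and digits in one UTF-8 string.
$content = 'hello привет 42';

// Without /u the subject is handled byte by byte, so \w typically
// recognizes only ASCII word characters.
preg_match_all('/\b\w+\b/', $content, $bytes);

// With /u the subject is decoded as UTF-8. On builds where /u also turns on
// Unicode character properties, \w and \b cover Cyrillic letters as well.
preg_match_all('/\b\w+\b/u', $content, $utf8);

print_r($bytes[0]);
print_r($utf8[0]);
```

Comparing the two printed arrays on localhost and on the hosting account is a quick way to tell whether the two environments treat /u the same way.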

 

Recently I tried to index a website on shared hosting that runs PHP 5.2.17, and the same method that worked on localhost didn't work with Cyrillic there. I also tried

 

preg_match_all('/\b\[a-zA-Z\p{Cyrillic}0-9]+\b/u', $content, $output);

and other combinations of regular expressions with \p{Cyrillic}, but nothing has worked so far.

If anybody knows how to solve this issue, please let me know. Thanks.
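
A possible explanation, offered as an assumption rather than something confirmed for that host: on older PHP and PCRE builds, /u may only switch on UTF-8 decoding, so \w and \b keep their ASCII meaning, and \p{Cyrillic} fails outright if PCRE was compiled without Unicode property support. With an ASCII-only \b there is no word boundary between a space and a Cyrillic letter, so even a correct character class placed between two \b anchors would find nothing. The sketch below drops \b and \w altogether. The name se_extract_words is hypothetical, and the fallback covers only the basic Cyrillic block.

```php
<?php
// Hypothetical helper (not part of se_index.php): extract words without
// relying on \b or \w, so the result does not depend on how the host's
// PCRE build interprets those escapes.
function se_extract_words($content)
{
    // Preferred pattern: any Unicode letter or digit, plus underscore.
    // This needs PCRE compiled with Unicode property support.
    $ok = @preg_match_all('/[\p{L}\p{N}_]+/u', $content, $output);

    if ($ok === false) {
        // Fallback: ASCII alphanumerics plus the basic Cyrillic block
        // (U+0400 to U+04FF) written as code points. This needs UTF-8
        // support in PCRE but no Unicode property tables.
        $ok = @preg_match_all('/[a-zA-Z0-9_\x{0400}-\x{04FF}]+/u', $content, $output);
    }

    // Return an empty list if both patterns failed.
    return $ok === false ? array() : $output[0];
}

print_r(se_extract_words('PHP 5.2.17 и кириллица'));
```

Because the character class is greedy, it already consumes whole runs of letters and digits, so the \b anchors are not needed. If the input is not valid UTF-8, both calls fail and the helper returns an empty array.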

 


  • 1 month later...

Sorry for the long delay. I haven't had an opportunity to try it with the search engine on live web hosting yet, but I just managed to solve a related task: leaving Cyrillic and Latin characters, numbers, dots, dashes, and spaces intact in a string, while replacing any other character with an underscore.

 

The 1st link:
http://php.net//manual/en/regexp.reference.unicode.php

 

was of help!

 

Here is the code:

$modified_filename = preg_replace("/[^\\s\\p{L}0-9 .-]/u", "_", $filename);
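
As a quick check of that line, here is a small sketch that applies the same pattern to a made-up filename (the sample value is an assumption for illustration). Note that \p{L} matches letters from any script, not only Cyrillic and Latin, and \s also keeps tabs and newlines.

```php
<?php
// Hypothetical sample filename, for illustration only.
$filename = 'Отчет 2014 (final)*.txt';

// Same pattern as in the line above, written in single quotes so the
// backslashes do not need to be doubled.
$modified_filename = preg_replace('/[^\s\p{L}0-9 .-]/u', '_', $filename);

echo $modified_filename, "\n";
```

Here the parentheses and the asterisk become underscores, while the letters, digits, spaces, and the dot are left alone.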

 

 

Thanks again.
