Larry Ullman's Book Forums


Hello Larry and all,

 

I have finally got round to trying to integrate Zend's Lucene into a website (well, not really, just my localhost, but if all goes well I'll roll it out on a live project in place of Sphider). I should mention that this local project will not be using Zend as a framework as a whole, but it will use Zend for the Lucene module.

 

I've had a little bash at it, although not a lot, as I wanted to check my logic first. I've read up on it at Zend, but obviously that is all geared towards their framework; it was still good to see their code. I've also read a good series of posts at http://www.phpriot.com/articles/zend-search-lucene (I hope that link is OK here; I think you said anything that can aid people is fine? If not, feel free to remove my link).

 

I've also read your articles on it with Yii. Just using small snippets from phpRiot, I have managed to generate a folder which I guess holds the documents. It contains files such as segments.gen, segment_1, write.lock.file and read.lock.file.

 

I guess those are the documents for the index, is that correct?

 

I'm far from (a) fully understanding Lucene and (b) getting a working search engine,

 

but I wanted to run a couple of questions by you. In your series on creating an index, you say that:

 

Lucene is not a spider and cannot crawl your site (the same goes for Zend_Search_Lucene)

 

So to me that would suggest loading every URL through:

$doc = Zend_Search_Lucene_Document_Html::loadHTMLFile($url);

 

I guess that wouldn't be too hard, using something like my Sitemap.xml and extracting the raw URLs from it. I imagine this is also the safer and easier bet, as the full HTML page would already have everything I could want indexed, although I'm sure it's more computationally expensive. The main reason I'm leaning towards this approach is that I can't guarantee all my content will be stored in a DB, and for a first attempt I think keeping it simple is a good starting point.
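
Pulling the URLs out of a sitemap could be sketched roughly as follows. This is only a minimal sketch: extractSitemapUrls() is a made-up helper name, and it assumes the sitemap is a standard sitemaps.org <urlset> document with one <loc> per <url> entry.

```php
<?php
/**
 * Return every <loc> URL found in a sitemap XML string.
 * Hypothetical helper; assumes a standard sitemaps.org <urlset> layout.
 */
function extractSitemapUrls($sitemapXml)
{
    $xml = simplexml_load_string($sitemapXml);
    if ($xml === false) {
        // The string could not be parsed as XML
        return array();
    }

    $urls = array();
    foreach ($xml->url as $entry) {
        // Cast the <loc> element to a string and strip stray whitespace
        $loc = trim((string) $entry->loc);
        if ($loc !== '') {
            $urls[] = $loc;
        }
    }

    return $urls;
}
```

The result could then stand in for a hard-coded list, e.g. $urlArray = extractSitemapUrls(file_get_contents('Sitemap.xml')); where the file path is again just an assumption about where the sitemap lives.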

 

So, trying to piece together a start, would you say that this overview is OK, bad, wrong, or ugly? Or all of them! :lol:

 



<?php
require_once 'Zend/Search/Lucene.php';

// Where to save our index (single quotes keep the backslashes literal)
$indexPath = 'C:\xampp\htdocs\ZendSearch\\';

// Create a new index at that path
$index = Zend_Search_Lucene::create($indexPath);

/**
 * List of URLs.
 * This should be a function that returns each URL
 * rather than a hard-coded array.
 */
$urlArray = array(
    'http://www.link1.com/',
    'http://www.link2.com/',
    'http://www.link3.com/',
    'http://www.link4.com/'
);

// Iterate over the URLs, load each page's HTML, and add it to the index
foreach ($urlArray as $url) {
    $doc = Zend_Search_Lucene_Document_Html::loadHTMLFile($url);
    $index->addDocument($doc);
}

 

After that I'm unsure. I take it that commits the documents (URLs) to the index? I also note that in both your example and phpRiot's you can add your own fields. By doing what I suggest, do I severely limit the search's abilities, since I don't pass it any fields?

 

Thanks

 

Jonathon


I've spent a lot of time over the years implementing different search engines for a site. I have used PHPDig and Sphider, and toyed with many other solutions. When I started using Zend_Lucene, I spent a lot of time trying to use a spider agent that would create Lucene-compatible indexes, then use Zend_Lucene to search those indexes, and I was never able to get something working along those lines. Writing my own indexer in Zend_Lucene worked best for me.
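
As a rough illustration of the "own indexer" route, a page loaded as an HTML document can still be given extra fields before it is added to the index. This is a minimal sketch, not a definitive implementation: buildHtmlDocument() and the 'url' and 'indexed' field names are made up for the example, and it assumes the Zend Lucene classes have already been loaded (e.g. via require_once 'Zend/Search/Lucene.php').

```php
<?php
/**
 * Load one page as an HTML document and attach extra fields to it.
 * buildHtmlDocument() is a hypothetical helper, not part of Zend.
 */
function buildHtmlDocument($url)
{
    $doc = Zend_Search_Lucene_Document_Html::loadHTMLFile($url);

    // Store the URL so search results can link back to the page;
    // UnIndexed fields are stored with the document but not searchable.
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $url));

    // Record when the page was indexed (stored, not searchable).
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('indexed', time()));

    return $doc;
}
```

Inside the indexing loop this would be used as $index->addDocument(buildHtmlDocument($url)); and a call to $index->commit(); after the loop makes the write to disk explicit rather than leaving it until the index object is destroyed.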


I've used Sphider before and was fairly happy with it, but I feel (possibly wrongly) that Lucene would be a more powerful tool; maybe that's just because it's from Zend. A little spider would be nice, but I suppose using a function to pull up all the URLs isn't so bad really, and I imagine it's a lot easier too. I'm intrigued by these .gen files now. I'd like to see exactly what has been put into the index, and how loading the URLs as HTML documents differs from using something like this (the constructor of a class that extends Zend_Search_Lucene_Document):


public function __construct($document)
{
    $this->addField(Zend_Search_Lucene_Field::Keyword('document_id', $document->id));
    $this->addField(Zend_Search_Lucene_Field::UnIndexed('url', $document->url));
    $this->addField(Zend_Search_Lucene_Field::UnIndexed('created', $document->created));
    $this->addField(Zend_Search_Lucene_Field::UnIndexed('teaser', $document->teaser));
    $this->addField(Zend_Search_Lucene_Field::Text('title', $document->title));
    $this->addField(Zend_Search_Lucene_Field::Text('author', $document->author));
    $this->addField(Zend_Search_Lucene_Field::UnStored('content', $document->body));
}

 

That is how the non-Yii article indexed its documents. But I'm sure all will become clear (I hope). I'll look into searching/querying the index now; I guess that's the next step.
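
For the querying step, a first attempt might look something like the sketch below. It is only a sketch under assumptions: searchIndex() is a made-up helper, $index is expected to be the object returned by Zend_Search_Lucene::open() or ::create(), and it assumes every indexed document has a stored 'title' field (which should hold for both the HTML documents and the custom class above).

```php
<?php
/**
 * Run a query against an index and collect score/title pairs.
 * searchIndex() is a hypothetical helper, not part of Zend.
 */
function searchIndex($index, $queryString)
{
    $results = array();

    // find() returns the matching hits for the query string
    foreach ($index->find($queryString) as $hit) {
        $results[] = array(
            'score' => $hit->score, // relevance score of this hit
            'title' => $hit->title, // stored 'title' field of the document
        );
    }

    return $results;
}
```

To get a feel for what actually went into the index, $index->numDocs() and $index->getFieldNames() should report the number of documents and the field names it holds.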

 

Thanks for the help
