Whatever....: fulltext index

Showing posts with label fulltext index. Show all posts

Tuesday, April 24, 2007

realtime fulltext index

For past some days, i have been struggling with getting something which can allow me to create, insert, update and search fulltext indexes - big fulltext indexes. Not very large but somewhere around 3-4 GB of data.

You would suggest mysql, but with mysql the inserts on table with fulltext indexes is very slow. For every insert, data is put in the index which leads to slow inserts.

What else? Well, i had come across some other fulltext engines like sphinx and senna. Compiled mysql with senna but there was no benefit. The index size was almost double and the searches were also slow - as slow as mysql fulltext index.

What about sphinx. Well, had integrated sphinx with mysql. But the sphinx engine has lots of limitations - like the first 2 columns must be integer and the 3rd column should be a text. Rest all columns should be integers. So if i use sphinx, i would need to write a stored procedure and trigger it to port data from my current table to a parallel sphinx table whenever a new row is inserted. And what would i do if i have to run a query - i would be joining both the tables. Would that be fast. Dont know. Let me figure out how to get this thing running...

You would say - what about lucene - my favorite search engine. Well dear, lucene is in java and there is no way i could integrate it with mysql if i have to do it in a jiffy. I would need to design and build a library to get this thing running. Forget it. In fact dbsight provides a similar type of library. Maybe i will get that thing working and check it out. At a small price, I might get what the organization requires and that too without wasting any resources on building it.

Hope i get some solution to the situation i am in right now. I have solutions for it, but not a solution which requires least amount of resources to get it running. Till i get one, the search will continue...

Sunday, May 21, 2006

mysql Fulltext search versus lucene

Here is the comparison between mysql fulltext and lucene search engines. On the forefront the only thing that distinguishes one from another is

==> speed of fulltext search in lucene is much faster as compared to mysql
==> lucene is much more complex to use as compared to mysql.

In mysql, you can simply mark an index on a text/varchar column as fulltext, and your work is done. All you need to do next is to fire MATCH AGAINST queries. Adding and modification of indexes is handled by mysql internally as and when new data is added. On the other hand, in lucene, the addition/modification of documents is to be handled programatically. Moreover Mysql is pluggable to your application using available apis. Mysql is running on a port and all you need to do is connect to the port and fire sql queries to that port using your application. Whereas in case of lucene, you will have to plug your application to the index using lucene api.

Another difference is that lucene is very efficient in searching large no of documents. Where as in case of mysql as the no of documents increases, the speed of search goes down. Mysql uses RAM to cache the index and use it during serving a query. So, if the size of your fulltext index exceeds the RAM, you will experience a major fall in the search performance. Where as in case of lucene, the size of index does not affect the search performance even when the size exceeds the RAM on the system.

With mysql, when a fulltext index is created on a table, inserts on the table become very slow. Lets analyze this. Well... for each record, mysql does some major processing to fix the record in current index. After the record is indexed, the cache/RAM containing the index needs to be rebuilt, since the index which was previously there is not correct - does not have the new record. So, with each record Mysql fetches the complete index in the cache/RAM. So if you are performing search and inserts/updates on a single table with fulltext index, the performance of both search & indexing goes very very down. On the other hand, with lucene, addition of new documents is not a major drawback. Documents can be added on the fly. Which makes indexing very fast. And this process does not affect the search. Two things to be noted here.

==> lucene does not allow you to modify a document. Modifying a document is equivalent to deleting the current document and adding the modified document to the index.

==> lucene requires an object of the index to perform the search. You will know about it when you use the api. Whenever you add a new document to the index, a new object of the index has to be created to include the new document in the index. But creation of a new object is not a major overhead. Though it does slow down the searching process to some extent.

With lucene, you do not have the flexibility to join two indexes and form a single query. Like in mysql you can do something like this

SELECT TABLE1.COLA, TABLE2.COLB FROM TABLE1, TABLE2 WHERE MATCH(TABLE1.COL1) AGAINST 'TEXT1' AND MATCH(TABLE2.COL2) AGAINST 'TEXT2' AND TABLE1.ID=TABLE2.FOREIGNKEY

(Pls dont see the syntax, look for the meaning/logic behind. I am not good at syntaxes. :-D ) This cannot be done with lucene. You will have to play with the data in such a way that your index contains both the data of say TABLE1 & TABLE2 and then you will have to play with the search to get the data that you need. Too complex right??

Also mysql comes with inbuilt list of stopwords and a default word tokenizer, which separates the words based on " ", ",", "." etc. Whereas in case of lucene, both - the list of stop words and the word tokenizer has to be defined by us. This is advantageous, because then you can define your own stopwords and tokenize the text as per your requirements.

In case of mysql the searches are by default case insensitive. Whereas in case of lucene, you can make the searches case-sensitive or case-insensitive, the way you want it to be.

With mysql you have the minimum length of word to be indexed which is by default 4. So all words which have less than 4 characters will not be indexed. What will you do if you want to index words like "php", "asp", "c"? You will have to decrease the minimum length from 4 to 1. And this will increase the index size drastically and slow down all your searches as a consequence. There are no such issues in lucene.

In mysql, every correct word in the collection and in the query is weighted according to its significance in the collection or query. Consequently, a word that is present in many documents has a lower weight and if a word is rare, it has higher weight. So if a word is present in 50% of the rows in a table, a query searching for that word will result in 0 result. This, mysql terms as relevance. But for me, it resulted in incorrect results for a query.

This link http://dev.mysql.com/doc/connector/j/en/fulltext-search.html will give a better idea of mysql fulltext search.

In lucene, there are some advanced options like

proximity search - find documents where there is one word between searchword1 and searchword2

wildcard search - find documents which have word like searchword* or maybe search?word etc etc...

fuzzy/similarity searches - find documents with words sounding similar to roam~ (will look for roam, foam etc...)

Term boosting - you can boost a term to move relevant documents to the top. So for example, you can say that you want documents with word "lucene" to be more relevant than those with word "mysql". Then you can do something like -> lucene^4 mysql .

Sorting of results is extremely fast if you are using lucene. In case of mysql, if you expect your results to come out fast, you will have to forget sorting. Sorting takes huge amount of time. Even retrieving the total no of documents in mysql is a chore. Where as for lucene the total no of documents come out as a default.

From this, if you are looking for a fulltext index on a small table without much hassles, go for mysql fulltext index. Whereas if you are looking at searching a large collection of data then go for lucene.

Whew... i wrote a lot... And there is more to write...Maybe next time... Do let me know if i have missed anything.