Sunday, May 21, 2006

mysql Fulltext search versus lucene

Here is the comparison between mysql fulltext and lucene search engines. On the forefront the only thing that distinguishes one from another is

==> speed of fulltext search in lucene is much faster as compared to mysql
==> lucene is much more complex to use as compared to mysql.

In mysql, you can simply mark an index on a text/varchar column as fulltext, and your work is done. All you need to do next is to fire MATCH AGAINST queries. Adding and modification of indexes is handled by mysql internally as and when new data is added. On the other hand, in lucene, the addition/modification of documents is to be handled programatically. Moreover Mysql is pluggable to your application using available apis. Mysql is running on a port and all you need to do is connect to the port and fire sql queries to that port using your application. Whereas in case of lucene, you will have to plug your application to the index using lucene api.

Another difference is that lucene is very efficient in searching large no of documents. Where as in case of mysql as the no of documents increases, the speed of search goes down. Mysql uses RAM to cache the index and use it during serving a query. So, if the size of your fulltext index exceeds the RAM, you will experience a major fall in the search performance. Where as in case of lucene, the size of index does not affect the search performance even when the size exceeds the RAM on the system.

With mysql, when a fulltext index is created on a table, inserts on the table become very slow. Lets analyze this. Well... for each record, mysql does some major processing to fix the record in current index. After the record is indexed, the cache/RAM containing the index needs to be rebuilt, since the index which was previously there is not correct - does not have the new record. So, with each record Mysql fetches the complete index in the cache/RAM. So if you are performing search and inserts/updates on a single table with fulltext index, the performance of both search & indexing goes very very down. On the other hand, with lucene, addition of new documents is not a major drawback. Documents can be added on the fly. Which makes indexing very fast. And this process does not affect the search. Two things to be noted here.

==> lucene does not allow you to modify a document. Modifying a document is equivalent to deleting the current document and adding the modified document to the index.

==> lucene requires an object of the index to perform the search. You will know about it when you use the api. Whenever you add a new document to the index, a new object of the index has to be created to include the new document in the index. But creation of a new object is not a major overhead. Though it does slow down the searching process to some extent.

With lucene, you do not have the flexibility to join two indexes and form a single query. Like in mysql you can do something like this

SELECT TABLE1.COLA, TABLE2.COLB FROM TABLE1, TABLE2 WHERE MATCH(TABLE1.COL1) AGAINST 'TEXT1' AND MATCH(TABLE2.COL2) AGAINST 'TEXT2' AND TABLE1.ID=TABLE2.FOREIGNKEY

(Pls dont see the syntax, look for the meaning/logic behind. I am not good at syntaxes. :-D ) This cannot be done with lucene. You will have to play with the data in such a way that your index contains both the data of say TABLE1 & TABLE2 and then you will have to play with the search to get the data that you need. Too complex right??

Also mysql comes with inbuilt list of stopwords and a default word tokenizer, which separates the words based on " ", ",", "." etc. Whereas in case of lucene, both - the list of stop words and the word tokenizer has to be defined by us. This is advantageous, because then you can define your own stopwords and tokenize the text as per your requirements.

In case of mysql the searches are by default case insensitive. Whereas in case of lucene, you can make the searches case-sensitive or case-insensitive, the way you want it to be.

With mysql you have the minimum length of word to be indexed which is by default 4. So all words which have less than 4 characters will not be indexed. What will you do if you want to index words like "php", "asp", "c"? You will have to decrease the minimum length from 4 to 1. And this will increase the index size drastically and slow down all your searches as a consequence. There are no such issues in lucene.

In mysql, every correct word in the collection and in the query is weighted according to its significance in the collection or query. Consequently, a word that is present in many documents has a lower weight and if a word is rare, it has higher weight. So if a word is present in 50% of the rows in a table, a query searching for that word will result in 0 result. This, mysql terms as relevance. But for me, it resulted in incorrect results for a query.

This link http://dev.mysql.com/doc/connector/j/en/fulltext-search.html will give a better idea of mysql fulltext search.

In lucene, there are some advanced options like

  • proximity search - find documents where there is one word between searchword1 and searchword2

  • wildcard search - find documents which have word like searchword* or maybe search?word etc etc...

  • fuzzy/similarity searches - find documents with words sounding similar to roam~ (will look for roam, foam etc...)

  • Term boosting - you can boost a term to move relevant documents to the top. So for example, you can say that you want documents with word "lucene" to be more relevant than those with word "mysql". Then you can do something like -> lucene^4 mysql .


Sorting of results is extremely fast if you are using lucene. In case of mysql, if you expect your results to come out fast, you will have to forget sorting. Sorting takes huge amount of time. Even retrieving the total no of documents in mysql is a chore. Where as for lucene the total no of documents come out as a default.

From this, if you are looking for a fulltext index on a small table without much hassles, go for mysql fulltext index. Whereas if you are looking at searching a large collection of data then go for lucene.

Whew... i wrote a lot... And there is more to write...Maybe next time... Do let me know if i have missed anything.

19 comments:

Roland Bouman said...

Thanks for this very insightful comparison!

It was a pleasant read, and I think it gave me a graps of the differences between these two methods of fulltext indexing.

I've got a queston though, in a couple of places, you're telling something about performance issues: inserting new documents, the actual search and the sorting. Can you back this by a few figures? I'd be very grateful if you did, maybe in a sequal article?

Anyway, thanks again sofar.

noodl said...

Thanks Jayant. This will make a great explaination for that guy who asked on freenode#apache today about the differences. Let's hope he comes back again!

Jayant Kumar said...

I will be putting down some figures to the performance benchmarks soon...

pabloj said...

Has anyone considered placing CLucene (http://sourceforge.net/projects/clucene/) in an UDF, it's index in a blob field and using it to perform full text searches in MySQL? Or am I going the wrong path?

Jayant Kumar said...

As far as i know CLucene is just a replication of lucene-java version in c. The way indexing and search works for CLucene and for java-lucene is the same.

However what i have read is that CLucene
is 3 times faster than java-lucene due to the fact that the C code is closer to machine language without any intermediate virtual machine as in java.

Anonymous said...

Good overall comparison of Lucene vs Mysql fulltext search!

CLucene is not in C. It's an implementation of Lucene in C++. Apache foundation started an index-level compatible project in C for Lucene, codename lucene4c, but it hasn't got very far as CLucene.

Another great indexing toolkit which is very much in the same ballpark as Lucene but less fanfare exposure is Xapian http://www.xapian.org. Its searching time is very fast and its features are comparable to Lucene. It's also very mature as it used to be the core of a web search engine before being released as an open-source project.

Anonymous said...

Can Lucene index the text content which resides inside Mysql MyIsam tables, or "just" files?

Jayant Kumar said...

lucene can index any type of textual data whether it resides in mysql or in files or in any other database.

You will have to write down programs to fetch data from any data source be it mysql or db2 or text files and feed it into lucene to make the index.

Razvan said...

But why you want to compare a information retrieval library (lucene) with a relational database management system (MySQL) in the first place, when each of them is made with a different scope?
One is a specialised information retrieval tool that allows you to optimize full-text searching to your custom demands (and thus fast) while the other is a relational database (full-text searching is not the main scope).
Use each of them for what they are.

Jayant Kumar said...

Yup razvan, you are right. Mysql is meant for data storage and retrieval but it also has fulltext indexing capabilities. What i have done is a simple comparison of mysql fulltext capabilities with the fulltext indexing capabilities of lucene.

But mysql was not meant for fulltext searches - as data storage and retrieval is its prime objective. Whereas in case of lucene, fulltext search is its only objective.

I wanted to given an alternative to people who want to look into large fulltext indexes and fast searches.

srinivas said...

That was a very good comparison and was very usefull

-Srinivas Varma

Zixin said...

This is a very helpful and informative comparison.
But specifically, what's the fundamental reason that makes Lucene is faster than database index, since database index may also be Hash table/map? Is it because of the segment strategy Lucene uses?

H@r@ld said...

Speed of sorting in Mysql depends on indexing! If you sort according to a column with a B-tree index, sorting is lightning fast, since mysql won't need to copy the data, only the index.

Martin Wildam said...

I would really prefer to just use MySQL because I am planning a project that should have a lot of relational information that should also be searchable accordingly. But I have large chunks of text (document content) that also needs to be indexed.

So if I need to use Lucene I would miss the option to combine relational with fulltext search getting a single result.

And currently I do not really see many options. :-(

Jayant Kumar said...

@Martin: If you want to combine a fulltext search into a single query with relational tables, then mysql fulltext is the best option. But there is sphinx engine - also available, which could be plugged into mysql. It is much faster than default mysql-fulltext engine.

Again, additional effort is only required when performance becomes a concern.

Martin Wildam said...

Thank you for the hint with Sphinx - I will have a look at this also.

If performance will become an issue - I don't know. In general I prefer to do things well right from the beginning. Doing major changes later sucks.

Chetan said...
This comment has been removed by a blog administrator.
Anonymous said...

It's what I really matter, do you have some example code? can you explain how to connect lucene and Mysql? I've just begun to read about lucene and I think it's perferct for my paper. I cannot imagine how lucene treats mysql tables. Thanks in advance

Jayant Kumar said...

To connect lucene to mysql, you will have to write your own jdbc connector - connect from java to mysql and pull data from mysql to lucene.