Friday, June 02, 2006

Benchmarking results of mysql, lucene and sphinx...

Finally I was able to benchmark the three search engines:
- mysql fulltext search
- lucene search engine
- sphinx (www.sphinxsearch.com)

I came across sphinx while searching for full text search engines. It is a very good engine. A few points regarding sphinx:
-> Very simple to configure, create index and search
-> Very easy to integrate with php and mysql; APIs for both are available. I was able to build an index and search it using sphinx in a few hours.
-> The index it creates is a combination of all the mysql fields. There is no distinction between the fields being searched, so you search the index as a whole rather than individual fields of the index.
-> Of course, since its source code is available, the searching process can be customized to your needs. Moreover, version 0.9.6, which is under development, will provide field-wise search.
-> Since it is written in C++, it is expected to be faster than lucene.
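To give a flavour of how simple the configuration is, here is a minimal sphinx.conf sketch in the style of the 0.9.x releases. The database, table, column and path names are made up for illustration and are not the actual benchmark setup:

```ini
source resumes_src
{
    type     = mysql
    sql_host = localhost
    sql_user = root
    sql_pass =
    sql_db   = test
    # the first selected column must be the unique document id
    sql_query = SELECT id, title, description FROM resumes
}

index resumes
{
    source = resumes_src
    path   = /var/data/sphinx/resumes
}
```

With that in place, building the index is a single run of the bundled indexer command, and searches go through the provided APIs.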

I did the benchmarking on my own laptop, a Dell Inspiron 700m running linux (fedora core 4), kernel 2.6.11. Machine configuration:
Processor : Intel(R) Pentium(R) M processor 1.60GHz
Cache size : 2048 KB
Memory : 512 MB

I took a table containing 1 lakh (100,000) records; the data size was 456 MB. I then created indexes on some fields of the table.

INDEXING TIME

Mysql Version - 5.1.9 (Beta)
Stop words : Built in
Indexing words of length >= 2 and <= 84. (Mysql has a feature that lets you specify the min and max length of words to index. The default min length is 4; I changed it to 2 so that words like net, php and hr get indexed. To index all words, set it to 1.)
Indexing time : 1440 seconds
Index size : 263 MB (the 456 MB of data remains unchanged).
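The min and max word lengths mentioned above are controlled by mysql server variables; a sketch of the relevant my.cnf lines for the values used here (these are the standard variable names):

```ini
[mysqld]
ft_min_word_len = 2
ft_max_word_len = 84
```

After restarting mysqld, an existing MyISAM fulltext index has to be rebuilt to pick up the new limits, e.g. with REPAIR TABLE mytable QUICK (where mytable is a placeholder for your table).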

Lucene Version - 1.9.1
Stop words : 660 general words (like is, if, this etc...)
Indexing time : 2761 seconds (the default configuration was used during indexing; parameters like mergeFactor and maxMergeDocs can tune indexing to be much faster, though aggressive values may cause a "Too many open files" error on linux).
Index Size : 109 MB (No data is stored; I stored only the unique id of each document, which I can use to retrieve the document later.)

Sphinx Version - 0.9.5
Stop words : NONE
Indexing time : 246 seconds (using the default configuration; I am not sure whether indexing can be tuned).
Index Size : 211.1 MB (3.1 MB Index - .spi file + 208 MB Data - .spd file).
3.1 MB of index looks extremely good. Also, with sphinx there is no need to maintain a separate id for retrieving data, since the unique id of your data is kept in the index, whereas with lucene you have to maintain a separate id and enforce its uniqueness in your program. Both the indexing time and the data compression are good.

SEARCH TIME

The searches were done using scripts. I ran a number of searches on randomly selected words and worked out an average time. For lucene and mysql, the search was done on 2 fields with an OR between them; for sphinx, the search was done on the complete index.
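The scripts themselves are not shown in the post. As a rough sketch of what such a harness can look like in Python: the search call below is a stub, and the word list and thread counts are placeholders, not the actual benchmark inputs. It reports the same three numbers as the columns in the tables that follow (total searches, total time in ms, average time per search in ms).

```python
import threading
import time

def run_benchmark(search_fn, words, searches_per_thread, concurrency):
    """Run searches_per_thread searches on each of `concurrency` threads.
    Returns (total searches, total time in ms, average time per search in ms)."""
    def worker():
        for i in range(searches_per_thread):
            search_fn(words[i % len(words)])

    threads = [threading.Thread(target=worker) for _ in range(concurrency)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed_ms = (time.time() - start) * 1000
    total = searches_per_thread * concurrency
    return total, elapsed_ms, elapsed_ms / total

# Stub standing in for a real fulltext query against mysql, lucene or sphinx.
def fake_search(word):
    time.sleep(0.001)

total, total_ms, avg_ms = run_benchmark(fake_search, ["php", "net", "hr"], 5, 2)
print(total)  # 10 total searches, as in the second row of each table below
```

The average column in each table is simply the total time divided by the total number of searches.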

Searches/thread | Concurrency (no. of simultaneous threads) | Total searches | Total time (ms) | Average time (ms)

MySQL
 5 | 1 |   5 |  347153 | 69430.6
 5 | 2 |  10 |  727359 | 72735.9
10 | 3 |  30 | 2288392 | 76279.73
I found that searching for an exact phrase, which can be done using "in boolean mode" queries, is more resource hungry. Query times in mysql are extremely high. Mysql runs largely from RAM, so with more RAM and a suitably configured mysql the queries would be much faster. Concurrency does not affect query execution speed to a major extent.
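The actual queries are not shown in the post; the general shape of a two-field OR search and of an exact-phrase boolean-mode search in mysql would be something like the following. Table and column names are my assumptions, and each MATCH clause needs a corresponding fulltext index:

```sql
-- OR across two separately indexed fields
SELECT id FROM resumes
WHERE MATCH (title) AGAINST ('php')
   OR MATCH (description) AGAINST ('php');

-- exact phrase in boolean mode (the resource-hungry case mentioned above)
SELECT id FROM resumes
WHERE MATCH (title, description) AGAINST ('"php developer"' IN BOOLEAN MODE);
```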
LUCENE
 5 | 1 |   5 |  673 | 134.6
 5 | 2 |  10 |  615 |  61.5
10 | 3 |  30 |  897 |  29.9
50 | 3 | 150 | 2762 |  18.41
Initially searches are slow, but as we keep searching, the index gets cached in RAM and searches speed up. The searches are very fast compared to mysql. From these results alone, concurrency does not appear to be an issue. But I have been using lucene for some time now and have found that it does have some concurrency issues: if you are searching a huge index of, say, 10,000,000 (1 crore) records with an index size of 8-10 GB, then with 100 concurrent searches issues pop up and searches seem to get locked. Still, performance-wise it is much better than mysql.
SPHINX
 5 | 1 |   5 |  512 | 102.4
 5 | 2 |  10 |  733 |  73.3
10 | 3 |  30 | 2272 |  75.73
50 | 3 | 150 | 4439 |  29.59
Single searches are faster than in lucene. But we have to consider that there is no OR clause in this search, so the engine does not have to fetch two different result sets and do a union on them. As the concurrency of searches is increased, the average time per search does not drop as much as in lucene, which suggests there may be concurrency issues here too. Since I have not explored this in depth, I cannot comment on the specific problems.


To sum up, lucene seems to be the right choice for the time being if you are looking to search large amounts of data and performance is your primary goal. Its list of features is also impressive. Sphinx comes in next: indexing time is very small, and indexing and searching are hassle free. As it evolves, sphinx may overtake lucene some time down the line, providing both a good feature list and performance. Mysql fulltext search comes last, and it seems it should be used only if the data set is small and quick development time is required.

==================
Date : 12 Aug 2009
Some more recent benchmarks on Lucene versus Sphinx
==================

14 comments:

Roland Bouman said...

Hey,

thank you!

One of the most important conclusions seems to be that MySQL fulltext search is not the right choice if you really need to rely on fast fulltext searching, but I keep wondering about the memory situation. At the very least, your article tells me that I should really care about memory when I'm planning on MySQL fulltext search, but I'm really curious what would happen if the machine had loads and loads of memory. Did you have any indication that memory usage was in fact the limiting factor for MySQL fulltext search on your system?

kind regards,

Roland

Peter Zaitsev said...

Hi,

I would be careful with benchmarks. First you really should be using same stop word list and configuration as it can affect results dramatically.

I would also consider the different matching strategies the search engines use. MySQL and Lucene use probability matching by default, while Sphinx uses an algorithm which takes word position into account. This is much slower (could be 10 times) and requires much larger indexes, as you have to store word positions rather than simply how frequent a word is in the text. It sounds bad, but it is what is required for good search quality - major search engines such as Google/Yahoo, for example, account for phrases.

Yet another important thing to consider is the size of the data. In your case the data was close to memory size, which normally makes the load CPU bound. For large sizes it is important how efficiently disk IO is done.

Finally, there are different modes for Sphinx: you can use "MATCH_ANY" if you want an "OR" matching mode. And as I mentioned, there is typically much more work performed finding full phrases and close keywords.

Jayant Kumar said...

Before shifting our product to lucene, I used to run it on the mysql full text search engine. The servers we use have loads of RAM, and things used to run very well on them. But as soon as the data+index size (.MYI+.MYD) of the mysql table exceeded the physical RAM available, I started running into performance issues.

So it would be safe to say that if you have RAM > (.MYI+.MYD) and have configured mysql accordingly, mysql full text search would work well.

Jayant Kumar said...

Hello Andrew,

I know that I may not have done a fair comparison for sphinx. Since I
have been using lucene for about 2 years now, I know its ins and outs,
whereas I have just started exploring sphinx.

Pls find more comments in the mail itself.

--Jayant

On 6/3/06, Andrew Aksyonoff shodan@shodan.ru wrote:
> Hello Jayant,
>
> Thursday, June 1, 2006, 4:13:40 PM, you wrote:
> JK> I would like to try out the 0.9.6. Pls send me the cvs snapshot.
>
> I accidentally came across your blog entry comparing Sphinx
> and Lucene, and would like to add a few comments, as well as ask
> a few questions.
>
> First, comparing Lucene index which is using stopwords against
> Sphinx index which is NOT using stopwords is, well, a little unfair ;)

I agree with that. Including stop words will make a huge difference. I
will do that when I run sphinx on a large dataset.

>
> Second, there is an OR clause in Sphinx already. See SetMatchMode()
> API call and pass SPH_MATCH_ANY there. Also, not having to merge the
> results in SPH_MATCH_ALL mode is in fact of minor importance; the
> slower part is not merging the result sets or intersecting them.

I had passed SPH_MATCH_PHRASE. By OR clause I meant an "OR" between two
fields. Since you mention it, I could have done an OR between two
words in the index; the results would then have been more directly
comparable.

>
> Third, the concurrency issues might very well be mistaken for
> caching issues. There's a known deficiency in Sphinx: it doesn't cache
> anything for now; it just depends on OS for file caching, and always
> does all the CPU work for each query. So if your test queries are the
> same, or use the same words, then it's highly probable that Lucene
> caches some intermediate results, while Sphinx does not. This is
> obviously a problem in Sphinx, and I'm of course going to fix that,
> so I'm really interested if the slowdown you experienced is a real
> concurrency issue or caching issue. Should probably finally do my
> own tests, though!

As far as I know, lucene also does not do any query/result caching.
There is a .tis file which, as far as I remember, is loaded into
memory, which can result in increased performance. Please correct me
if I am wrong.

>
> Now for the questions... :)
>
> I wonder if you are going to test the engine on some *really* large
> dataset, ie. 10-20 GB of text or something. With 500 M dataset, and
> 100-200 M index, it could just fit in memory, which is not the real
> world scenario with huge datasets.

Yes, I will certainly do that, but it will take some time.

>
> I also wonder what was the ranking mode used with Lucene and how
> would you judge the relevance. If I'm not mistaken about Lucene
> defaults, Sphinx typically does more processing per query, because it
> always processes word positions, trying to rank matching phrases
> higher - relevance actually was my primary concern when initially
> developing Sphinx.

Lucene also does some relevance ranking, though I have not gone into
the depth of how it does that.

>
> Finally, I'd like to thank you for your benchmarking work. Your
> concurrency test was really interesting and really gives some food for
> thought. And... please do send me your results next time you
> benchmark! :)

Thanks :) . Will certainly do that...

>
> --
> Best regards,
> Andrew mailto:shodan@shodan.ru
>
>

Gaurav said...

Hi Jayant,

Good Effort! But I would also like to have a look at your scripts, which you used for benchmarking. Can you please provide your scripts also?

Gaurav

gnolkha AT gmail DOT com

ganesh said...

Jayant,

How good do you think Zend Lucene is for checking misspellings/the auto suggest function? What option would you recommend?

Wouter said...

I'm currently doing some benchmarking myself, and on our server (4 GB memory) the MySQL fulltext index seems much faster than Lucene (I tested with the Zend port): over 100 times faster, even after running lucene's optimize(). Search time also seems to increase significantly with more records (documents) in Lucene.

Jayant Kumar said...

@wouter : I would suggest you go through http://jayant7k.blogspot.com/2007/06/lucene-in-php-lucene-in-java.html, where you can clearly see that lucene (the php port) is way slower than the actual lucene in java.

In fact, looking at this, I think lucene ported to c/c++ should be much faster than the current lucene in java.

Tom said...

Wouter, I think the Zend port of Lucene may be substantially slower than the original/java version, by as much as 10-100x.

Adrian said...

I've been searching for an indexing solution for a website I developed, www.job.ro.

The good thing about Sphinx is that it can connect to a mysql server and index the results of an SQL query. 200,000 resumes were indexed in 2 minutes. It is very fast, and it can be compiled with mysql so you can write queries that return everything you want, sorted by relevance.
The Sphinx engine is used to handle a forum indexing service with billions of entries, so I think it is pretty good :)

Vijay Kumar said...

Jayant,
Our company used Sphinx. I need information from you regarding the performance. Lucene is taking more time for indexing. Can you give any performance tips? Thanks in advance. My gmail id is vijaykumar.ravva@gmail.com

vijay

Vijay Kumar said...

Jayant,
We need to index 20-30 million records daily. As of now Lucene is indexing a million records in 15 mins, but that does not match the speed we want. Could you please let me know any optimization techniques to speed this up? It is a straightforward selection from the database. I think the indexing itself takes very little time; the data access part is taking more time. Any inputs are highly appreciated.
Thanks Vijay

Jayant Kumar said...

@vijay : I will have to look at the complete architecture before I can suggest any changes/optimizations. There are different ways to do the same thing, and I may find more efficient ways to handle your scenario.

For starters, you can go ahead and change some parameters in the IndexWriter to speed up indexing:

1. use more ram - setRAMBufferSizeMB().
2. turn off the compound file - setUseCompoundFile(false).
3. pass autocommit = false in the writer constructor.
4. increase mergeFactor - be careful, you might run out of file descriptors (you can increase those as well).
5. run optimize() at the end or at certain intervals.
6. use threads to create multiple indexes simultaneously and merge them at the end - be careful not to overload or underload the cpu.

Let me know your contact email. And we could discuss this offline.