
Friday, October 01, 2010

Database speed tests (mysql and postgresql) - part 3 - code

Here is the code structure:

dbfuncs.php : contains the classes and functions for firing queries on mysql and pgsql
mysqlinsert.php : creates and fires inserts on mysql
mysqlselect.php : creates and fires selects on mysql
pgsqlinsert.php : creates and fires inserts on pgsql
pgsqlselect.php : creates and fires selects on pgsql
benchmark.php : script used to control concurrency and number of queries per script

Please refer to http://jayant7k.blogspot.com/2010/09/database-speed-tests-mysql-and_29.html and http://jayant7k.blogspot.com/2010/09/database-speed-tests-mysql-and.html for benchmarks of selects and inserts respectively.

And the code....

dbfuncs.php
abstract class dbfuncs
{
  abstract function insertqry($qry);
  abstract function selectqry($qry);
  abstract function connectdb();

  public function log($str)
  {
    $file = "error.log";
    $fp = fopen($file, "a");
    fwrite($fp, "$str\n");
    fclose($fp);
  }
}

class mysqldb extends dbfuncs
{
  private $user = "root";
  private $pass = "jayant";
  private $host = "localhost";
  //private $port = 3307;
  private $socket = "/tmp/mysql.sock";
  private $database = "benchmark";

  public $db;

  function __construct()
  {
    $this->connectdb();
  }

  public function connectdb()
  {
    $this->db = mysql_connect($this->host.':'.$this->socket, $this->user, $this->pass) or die(mysql_error());
    mysql_select_db($this->database, $this->db);
  }

  public function insertqry($qry)
  {
    if(!mysql_ping($this->db))
      $this->connectdb();

    mysql_query($qry, $this->db) or $this->log(mysql_error());
  }

  public function selectqry($qry)
  {
    if(!mysql_ping($this->db))
      $this->connectdb();

    $rs = mysql_query($qry, $this->db) or $this->log(mysql_error());
    return $rs;
  }
}

class pgsqldb extends dbfuncs
{
  private $dsn = "host=localhost port=5432 user=jayant password=12qwaszx dbname=benchmark";

  public $db;

  function __construct()
  {
    $this->connectdb();
  }

  public function connectdb()
  {
    $this->db = pg_connect($this->dsn) or die(pg_last_error());
  }

  public function insertqry($qry)
  {
    if(!pg_ping($this->db))
      $this->connectdb();

    pg_query($this->db, $qry) or $this->log(pg_last_error($this->db));
  }

  public function selectqry($qry)
  {
    if(!pg_ping($this->db))
      $this->connectdb();

    $rs = pg_query($this->db, $qry) or $this->log(pg_last_error($this->db));
    return $rs;
  }
}

function logtime($str)
{
    $file = "benchmark.log";
    $fp = fopen($file, "a");
    fputs($fp, $str);
    fclose($fp);
}

mysqlinsert.php
include "dbfuncs.php";

$scriptno = $argv[1]+1;
$count = $argv[2];

$mysql = new mysqldb();
$start = microtime(true);
for($x=0; $x<$count; $x++)
{
  $xx = $x*$scriptno;
  $qry = "insert into data (val, txt) values ('$xx','$x in $scriptno')";
  $mysql->insertqry($qry);
}
$end = microtime(true);
$log = "\nMysql innodb Time to insert $count in run $scriptno = ".($end-$start);
logtime($log);

pgsqlinsert.php
include "dbfuncs.php";

$scriptno = $argv[1]+1;
$count = $argv[2];

$pgsql = new pgsqldb();
$start = microtime(true);
for($x=0; $x<$count; $x++)
{
  $xx = $x*$scriptno;
  $qry = "insert into data (val, txt) values ('$xx','$x in $scriptno')";
  $pgsql->insertqry($qry);
}
$end = microtime(true);
$log = "\nAvg Pgsql Time to insert $count in run $scriptno = ".($end-$start);
logtime($log);

mysqlselect.php
include "dbfuncs.php";

$scriptno = $argv[1]+1;
$count = $argv[2];

$mysql = new mysqldb();
$start = microtime(true);
for($x=0; $x<$count; $x++)
{
  $xx = $x*$scriptno;
  $qry = "select * from `data` where val ='$xx'";
  $mysql->selectqry($qry);
}
$end = microtime(true);
$log = "\nMysql innodb Time to select $count in run $scriptno = ".($end-$start);
logtime($log);

pgsqlselect.php
include "dbfuncs.php";

$scriptno = $argv[1]+1;
$count = $argv[2];

$pgsql = new pgsqldb();
$start = microtime(true);
for($x=0; $x<$count; $x++)
{
  $xx = $x*$scriptno;
  $qry = "select * from data where val ='$xx'";
  $pgsql->selectqry($qry);
}
$end = microtime(true);
$log = "\nPgsql Time to select $count in run $scriptno = ".($end-$start);
logtime($log);

benchmark.php
$count = 100000;
$concurrency = 40;

for($i=0; $i<$concurrency; $i++)
{
   exec("php -q mysqlselect.php $i $count > /dev/null &");
//   exec("php -q pgsqlselect.php $i $count > /dev/null &");
//   exec("php -q mysqlinsert.php $i $count > /dev/null &");
//   exec("php -q pgsqlinsert.php $i $count > /dev/null &");
}
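
Each worker script takes the script number and the per-script query count as its two arguments, so a single worker can also be fired by hand for a quick check (illustrative invocations):

php -q mysqlinsert.php 0 10000
php -q pgsqlselect.php 0 100000

benchmark.php simply forks $concurrency such workers in the background; each worker appends its timing to benchmark.log via logtime().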

All test runs were individual - meaning that only one of the four scripts was enabled at a time, with different values of count and concurrency. For inserts $count was set to 10,000 and for selects $count was set to 100,000, while $concurrency was varied.

I am using Ubuntu 10.04 (32-bit, kernel 2.6.32-24) with an ext4 file system. The machine has around 3 GB of RAM and an Intel Core 2 Duo T5870 @ 2.0 GHz.

Wednesday, September 29, 2010

Database speed tests (mysql and postgresql) - part 2

Here is the comparison between mysql and postgresql for selects (only). I used the same table that I had created earlier http://jayant7k.blogspot.com/2010/09/database-speed-tests-mysql-and.html while comparing insertion speed. The table now holds approximately 10 million records (9,755,366, as per the table information below), and I ran selects on them. I also modified the configuration of both mysql and postgresql to enable faster selects.

Mysql

In mysql I specifically disabled the query cache - the reason being that I would use innodb for tables with a large number of inserts, due to its support for row level locking. And with every insert the query cache is marked as dirty. So enabling the query cache would lead to caching of queries, which would not be the scenario in a live setup.

Mysql Configuration :
innodb_buffer_pool_size = 256M
key_buffer_size = 256M
read_buffer_size = 512K
sort_buffer_size = 512K
query_cache_size = 0
thread_cache_size = 32
table_open_cache = 64

table information :
No of records : 9755366
data (ibd) size : 692 MB

Mysql : (average time per script in seconds; each script fires 100000 selects)

Avg time for 500000 selects with concurrency = 5 : 58.67
Avg time for 1000000 selects with concurrency = 10 : 122.8
Avg time for 2000000 selects with concurrency = 20 : 225.67
Avg time for 3000000 selects with concurrency = 30 : 351.66
Avg time for 4000000 selects with concurrency = 40 : 452.3


PostgreSQL :

Mysql stores the same data more compactly than postgres: around 700 MB in innodb versus around 900 MB in postgres.

Postgres configuration :
shared_buffers = 128MB
work_mem = 1MB
random_page_cost = 4.0
effective_cache_size = 256MB


table information :
No of records : 9755366
data size : 912 MB

Pgsql : (average time per script in seconds; each script fires 100000 selects)

Avg time for 500000 selects with concurrency = 5 : 86.8
Avg time for 1000000 selects with concurrency = 10 : 144.74
Avg time for 2000000 selects with concurrency = 20 : 274.37
Avg time for 3000000 selects with concurrency = 30 : 402.92
Avg time for 4000000 selects with concurrency = 40 : 528.17



Mysql seems to perform better with selects. The numbers also show that as concurrency increases, selects in innodb take less time than in postgresql.

So why would you switch from mysql to postgresql? Only if you have a very high ratio of inserts to selects, where the gain on inserts outweighs, to some extent, the loss on selects.

Monday, September 27, 2010

Database speed tests (mysql and postgresql) - part 1

There have been major changes in mysql and postgres over the past couple of years. Mysql has been focusing on improving and optimizing innodb. Postgres, on the other hand, has been focusing on database replication and hot standby.

Recently postgres came out with version 9.0, which has built-in replication and hot standby - the two most requested features in postgresql. Earlier, people used to shy away from postgres because there was no proper "easily deployable" solution available for replication. With this release, postgres has taken a major step forward. Here http://www.postgresql.org/docs/9.0/static/release-9-0 is a list of features that have been introduced in postgres 9.0.

Mysql has released the RC version of MySQL 5.5, which has a bunch of improvements over the previous version: support for multi-core cpus, changes in innodb for effective use of available I/O capacity, and semisynchronous replication are some of the features that mysql 5.5 promises. Here is a list of all the new features in MySQL 5.5: http://dev.mysql.com/doc/refman/5.5/en/mysql-nutshell.html

It has been a long time since I posted my last benchmark http://jayant7k.blogspot.com/2008/06/mysql-versus-postgresql.html, and I believe it is time I did some rough benchmarks and posted the results. The scope is to check innodb tables in mysql 5.5.6 against the tables in postgresql 9.0, benchmarking pure inserts and selects only. This post focuses on inserts; I will focus on selects in my next post. I am running these tests on my laptop, which has an Intel Core 2 Duo T5870 @ 2.00 GHz and 3 GB of RAM.

I created a simple php script to perform the benchmark, which spawns multiple php scripts that run in the background (the scripts were later published in part 3). Let me know if you need them and I will share them with you.

Mysql:

Innodb Settings in my.cnf :

innodb_buffer_pool_size = 256M
innodb_additional_mem_pool_size = 8M
innodb_log_file_size = 64M
innodb_log_buffer_size = 8M
innodb_flush_log_at_trx_commit = 1
innodb_lock_wait_timeout = 50
innodb_file_per_table

Table structure :
CREATE TABLE `data` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`val` int(11) NOT NULL,
`txt` varchar(20) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_val` (`val`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1


Avg Time for 20000 inserts with concurrency of 2 : 1367 seconds
Load on machine : 3.4
size on disk : 9 MB

Avg Time for 50000 inserts with concurrency of 5 : 1537.2 seconds
Load on machine : 4.6
Size on disk : 13 MB

Avg Time for 100000 inserts with concurrency of 10 : 1255 seconds
Load on machine : 3.5
Size on disk : 17 MB

Avg Time for 200000 inserts with concurrency of 20 : 1403 seconds
Load on machine : 4.2
Size on disk : 26 MB

Time for 400000 inserts with concurrency of 40 : 1730 seconds
Load on machine : 6.6
Size on disk : 52 MB


Postgresql :

Settings in postgresql.conf:
shared_buffers = 64MB
work_mem = 1MB
synchronous_commit = on


Table structure :
 Column |         Type          |                     Modifiers                      | Storage  | Description
--------+-----------------------+----------------------------------------------------+----------+-------------
 id     | integer               | not null default nextval('data_id_seq'::regclass)  | plain    |
 val    | integer               | not null                                           | plain    |
 txt    | character varying(20) |                                                    | extended |
Indexes:
    "data_pkey" PRIMARY KEY, btree (id)
    "idx_val" btree (val)
Has OIDs: no
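
For reference, the equivalent DDL for this table, reconstructed from the \d output above, would be roughly:

CREATE TABLE data (
    id  serial PRIMARY KEY,
    val integer NOT NULL,
    txt varchar(20)
);
CREATE INDEX idx_val ON data (val);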

Avg Time for 20000 inserts with concurrency of 2 : 221.3 seconds
Load on machine : 2.0
size on disk : 2 MB

Avg Time for 50000 inserts with concurrency of 5 : 232 seconds
Load on machine : 2.0
Size on disk : 4.8 MB

Avg Time for 100000 inserts with concurrency of 10 : 248.75 seconds
Load on machine : 4.0
Size on disk : 9.8 MB

Avg Time for 200000 inserts with concurrency of 20 : 276.34 seconds
Load on machine : 3.6
Size on disk : 19 MB

Time for 400000 inserts with concurrency of 40 : 350.11 seconds
Load on machine : 5.8
Size on disk : 38 MB



The numbers show that mysql inserts are heavy as compared to pgsql. The base timings are almost 5 times higher in mysql, and as the concurrency goes up, the time required for inserts in mysql spikes up more steeply than for postgres.

I did a sample run on mysql with innodb_flush_log_at_trx_commit = 2, and the benefit was huge:

Avg Time for 20000 inserts with concurrency of 2 (innodb_flush_log_at_trx_commit=2) : 5.2 seconds
Avg Time for 100000 inserts with concurrency of 10 (innodb_flush_log_at_trx_commit=2) : 18.69 seconds

Similarly, I disabled synchronous_commit on postgres and did a sample run:

Avg Time for 20000 inserts with concurrency of 2 (synchronous_commit = off) : 2.95 seconds
Avg Time for 100000 inserts with concurrency of 10 (synchronous_commit = off) : 15.06 seconds
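
For reference, these two durability trade-offs can be toggled as follows - the mysql variable is dynamic, and synchronous_commit can be changed per session or in postgresql.conf. This is a sketch, not the exact commands used in these runs:

-- mysql : flush the innodb log to disk about once per second instead of at every commit
SET GLOBAL innodb_flush_log_at_trx_commit = 2;

-- postgresql : return from commit before the WAL is flushed to disk
SET synchronous_commit = off;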

PS : The average time is the time for 10000 inserts (inserts per instance)

Let's see what the selects tell us - in the next post.

Wednesday, June 04, 2008

MySQL versus PostgreSQL - part II

My earlier post mysql versus postgresql brought me lots of negative comments - that I did not compare the transactional database of pgsql with the transactional engine (innodb) of mysql. The main reason I did not do that was that I had found InnoDB to be very slow as compared to MyISAM.

But after all those comments I ran the benchmarks again, using the same scripts and the same technology on the same machine (my laptop), and here are the results. I created a new table in both mysql (using the InnoDB engine) and pgsql, and I disabled binary logging in mysql to speed up insert/update/delete queries. Please refer to the earlier post for the setup information.

The following notation will be used:

<operation(select/insert/update/delete)> : <no_of_threads> X <operations_per_thread>

For example, Select : 2 X 100000 means 2 threads each firing 100000 selects.


  • First I ran a single thread of inserts, both before and after disabling binary logging in mysql
    Mysql Insert : 1 X 100000
    Time : 65.22 Sec (binary logging enabled)
    Time : 32.62 Sec (binary logging disabled)
    So disabling binary logging in mysql would make your insert/update/delete queries take half the time.
    Pgsql Insert : 1 X 100000
    Time : 53.07 Sec
    Inserts in mysql are very fast.

  • Selects : 2 X 100000
    Mysql time : 30.1 Sec
    Pgsql time : 29.92 Sec
    Both are about the same

  • Updates : 2 X 50000
    Mysql time : 29.38 Sec
    Pgsql time : 36.98 Sec
    Mysql updates are faster

  • Ran 4 threads with different no_of_operations/thread
    Run 1 [Select : 1 X 100000, Insert : 1 X 50000, Update : 1 X 50000, Delete : 1 X 20000]
    Mysql time : 40.86 Sec
    Pgsql time : 45.03 Sec
    Run 2 [Select : 1 X 100000, Insert : 1 X 100000, Update : 1 X 50000, Delete : 1 X 10000]
    Mysql time : 49.91 Sec
    Pgsql time : 63.38 Sec
    Run 3 [Select : 1 X 100000, Insert : 1 X 20000, Update : 1 X 20000, Delete : 1 X 1000]
    Mysql time : 29.83 Sec
    Pgsql time : 29.3 Sec
    It can be seen that increasing the share of insert/update/delete queries hurts the performance of pgsql. Pgsql performs better when the number of selects is very high, whereas mysql-innodb holds up well across all three runs.

  • I did 4 runs with different numbers of threads.
    Run 1: 12 threads [Select : 2X30000 + 3X20000, Insert : 1X20000 + 2X10000, Update : 2X10000, Delete : 2X1000]
    Mysql time : 31.16 Sec
    Pgsql time : 30.46 Sec
    Run 2: 12 threads [Select : 2X50000 + 2X40000 + 1X30000, Insert : 1X20000 + 2X15000, Update : 2X15000, Delete : 2X2000]
    Mysql time : 52.25 Sec
    Pgsql time : 53.03 Sec
    Run 3: 20 Threads [Select : 4X50000 + 4X40000 + 2X30000, Insert : 2X20000 + 3X15000, Update : 2X20000 + 1X15000, Delete : 2X5000]
    Mysql time : 169.81 Sec
    Pgsql time : 136.04 Sec
    Run 4: 30 Threads [Select : 2X50000 + 3X40000 + 3X30000 + 3X20000 + 4X10000, Insert : 1X30000 + 2X20000 + 3X10000, Update : 3X20000 + 3X10000, Delete : 1X10000 + 2X5000]
    Mysql time : 200.25 Sec
    Pgsql time : 156.9 Sec
    So it can be said that for a small system with low concurrency, mysql performs better; but as concurrency increases, pgsql performs better. I also saw that while running the pgsql benchmark the system load was twice that of the mysql benchmark.


Enabling mysql binary logging for replication would of course add an overhead, and enabling trigger-based replication in pgsql would be another overhead. The fact that replication in mysql is closely integrated with the database server makes building a high-availability system easier, whereas creating slaves using replication in pgsql is not that easy - all available replication products for pgsql are external, third-party software. Still, for a high-concurrency system, pgsql would be a better choice.

Tuesday, June 03, 2008

MySQL versus PostgreSQL

I created and ran some simple tests on mysql and postgresql to figure out which one is faster. It is already known that postgresql is more stable and reliable than mysql. Pgsql has a rich set of features - it is a complete RDBMS and also supports fulltext search.

All benchmarks were done on my laptop - an Intel Core 2 Duo (2.0 GHz) with 4 MB L2 cache and 2 GB of RAM. I have a 64-bit Ubuntu system loaded with MySQL 5.1.24-rc (64-bit binary) and PostgreSQL 8.3.1 (compiled from source).

I wrote my benchmark scripts in python, using psycopg2 as the connector from python to postgres and mysql-python as the connector from python to mysql.

The benchmarking was done in phases. First, simple insert, update and select queries were run to check the raw speed of these queries. Then threads were created to run simultaneous insert, update, select and delete queries, and I checked the benchmark times for different numbers of concurrent threads.

I created a simple table on both mysql and pgsql, using the MyISAM engine in mysql:

ABC(id int not null auto_increment primary key, value varchar(250));

Queries that were run are:

Insert(I) : Insert ignore into ABC (id, value) ... (for pgsql, a rule has to be created to ignore duplicate inserts - see the sketch below)
Update(U) : Update ABC set value=<something> where id=<random_id>
Select(S) : Select * from ABC where id=<random_id>
Delete(D) : Delete from ABC where id=<random_id>
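
A sketch of the kind of rule pgsql needs to emulate insert ignore - the exact rule used in these tests was not published:

CREATE RULE abc_ignore_duplicates AS
    ON INSERT TO ABC
    WHERE EXISTS (SELECT 1 FROM ABC WHERE id = NEW.id)
    DO INSTEAD NOTHING;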



  • Insert - 100000 rows in 1 thread
    Time taken for Mysql : 20.8 seconds
    Time taken for Pgsql : 58.1 seconds
    So the raw insert speed of mysql is much better than pgsql's

  • 100000 selects in 1 thread
    Time taken for Mysql : 21.76 seconds
    Time taken for Pgsql : 20.15 seconds
    Raw selects are slightly faster in pgsql than in mysql

  • Selects - 2 threads of 100000 selects
    Time taken for Mysql : 40.46 seconds
    Time taken for Pgsql : 27.38 seconds
    So, if I increase the concurrency of selects, pgsql performs much better than mysql

  • Update - 2 threads of 50000
    Time taken for Mysql : 23.97 seconds
    Time taken for Pgsql : 34.03 seconds
    Mysql looks better in handling updates here.

  • 4 Threads
    Run 1 : [100000 Selects, 50000 Inserts, 50000 Updates, 20000 Deletes]
    Time taken for Mysql : 45.25 seconds
    Time taken for Pgsql : 54.58 seconds
    Run 2 : [100000 Selects, 100000 Inserts, 50000 Updates, 10000 Deletes]
    Time taken for Mysql : 59.05 seconds
    Time taken for Pgsql : 69.38 seconds
    Run 3 : [100000 Selects, 20000 Inserts, 20000 Updates, 1000 Deletes]
    Time taken for Mysql : 35.54 seconds
    Time taken for Pgsql : 31.23 seconds
    These runs show that mysql is good when you have a very large number of inserts/updates/deletes as compared to selects, but pgsql's performance surpasses mysql's when the number of selects is much higher.

  • Finally, let's approach the real-life scenario, where generally the number of selects is much greater than the number of inserts and there are multiple threads performing selects and inserts.
    I will use the following notation here - <no_of_threads> X <no_of_operations(select/insert/update/delete)_per_thread>
    So, for example, 3 X 20 Selects = 3 threads of 20 selects each

    Run 1 : [2 X 30000 selects, 3 X 20000 selects, 1 X 20000 inserts, 2 X 10000 inserts, 2 X 10000 updates, 2 X 1000 deletes] Total - 12 threads
    Time taken for Mysql : 42.28 seconds
    Time taken for Pgsql : 44.28 seconds
    Both Mysql and Pgsql are almost on par.

    Run 2 : [2 X 50000 selects, 2 X 40000 selects, 1 X 30000 selects, 1 X 20000 inserts, 2 X 15000 inserts, 2 X 15000 updates, 2 X 2000 deletes] Total - 12 threads, but the number of selects is quite high
    Time taken for Mysql : 61.02 seconds
    Time taken for Pgsql : 48.60 seconds
    So, as we increase the number of operations (especially selects), mysql's performance degrades, whereas pgsql's performance remains almost the same

    Run 3 : [4 X 50000 selects, 4 X 40000 selects, 2 X 30000 selects, 2 X 20000 inserts, 3 X 15000 inserts, 3 X 15000 updates, 2 X 3000 deletes] Total - 20 threads (10 threads for selects, 5 for inserts, 3 for updates and 2 for deletes), which is the normal trend in database servers.
    Time taken for Mysql : 169.31 seconds
    Time taken for Pgsql : 128.7 seconds
    Bingo - so as concurrency increases, pgsql becomes faster than mysql.



My earlier benchmarks with pgsql 7.x were not as good as this one. With postgresql 8.3.1, the speed of serving concurrent requests has increased a lot. So, in a high-concurrency environment, I would generally recommend going with postgresql rather than mysql.

Please check the comments section. We have some really interesting comments there...

Thursday, June 07, 2007

lucene in php & lucene in java

Something I found out while helping Mr. Nguyen from Vietnam with an issue.

He used php-lucene from the Zend Framework for building a lucene index and searching on it, and was facing issues with search times. It turned out that mysql fulltext search was performing better than the lucene index.

So I did a quick benchmark and found the following:

1. Indexing using php-lucene takes a huge amount of time as compared to java-lucene. I indexed 30000 records; indexing took 1673 seconds and optimization another 210 seconds, for a total of 1883 seconds - which is a hell of a lot of time. (A sketch of the indexing code appears below, after the search timings.)

2. An index created using php-lucene is compatible with java-lucene, so an index created by php-lucene can be read by java-lucene and vice versa.

3. Search in php-lucene is very slow as compared to java-lucene. The times for 100 searches were -

jayant@jayantbox:~/myprogs/java$ java searcher
Total : 30000 docs
t2-t1 : 231 milliseconds

jayant@jayantbox:~/myprogs/php$ php -q searcher.php
Total 30000 docs
total time : 15 seconds


So I thought that maybe php was retrieving the documents upfront, and changed the code to fetch all matched documents in both php and java. The times for 100 searches then were -

jayant@jayantbox:~/myprogs/java$ java searcher
Total : 30000 docs
t2-t1 : 2128 milliseconds

jayant@jayantbox:~/myprogs/php$ php -q searcher.php
Total 30000 docs
total time : 63 seconds
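
For reference, the indexing side in php-lucene looks roughly like the sketch below. This is a sketch only - the actual indexing script was not published, and the field names and the $rows data source are hypothetical:

<?php
include("Zend/Search/Lucene.php");

// create a fresh index at /tmp/myindex
$index = Zend_Search_Lucene::create("/tmp/myindex");

// $rows is assumed to hold the 30000 records fetched from the database
foreach($rows as $row)
{
  $doc = new Zend_Search_Lucene_Document();
  // store only the unique id; index the text without storing it
  $doc->addField(Zend_Search_Lucene_Field::Keyword("id", $row["id"]));
  $doc->addField(Zend_Search_Lucene_Field::UnStored("content", $row["content"]));
  $index->addDocument($doc);
}

// the optimize step alone took 210 seconds in the run above
$index->optimize();
?>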


The php code for searching the lucene index is:

<?php
/*
 * searcher.php
 * On 2007-06-06
 * By jayant
 */

include("Zend/Search/Lucene.php");

$index = new Zend_Search_Lucene("/tmp/myindex");
echo "Total ".$index->numDocs()." docs\n";
$query = "java";
$s = time();
for($i=0; $i<100; $i++)
{
  $hits = $index->find($query);
  // retrieve all documents. Comment this out if you dont want to retrieve documents
  foreach($hits as $hit)
    $doc = $hit->getDocument();
}
$total = time()-$s;
echo "total time : $total seconds";
?>



And the java code for searching the lucene index is:

/*
 * searcher.java
 * On 2007-06-06
 * By jayant
 */

import org.apache.lucene.search.*;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.document.*;

public class searcher {

  public static void main(String args[]) throws Exception
  {
    IndexSearcher s = new IndexSearcher("/tmp/myindex");
    System.out.println("Total : " + s.maxDoc() + " docs");
    QueryParser q = new QueryParser("content", new StandardAnalyzer());
    Query qry = q.parse("java");

    long t1 = System.currentTimeMillis();
    for(int x = 0; x < 100; x++)
    {
      Hits h = s.search(qry);
      // retrieve all documents. Comment this out if you dont want to retrieve documents
      for(int y = 0; y < h.length(); y++)
      {
        Document d = h.doc(y);
      }
    }
    long t2 = System.currentTimeMillis();
    System.out.println("t2-t1 : " + (t2 - t1) + " ms");
  }
}


Hope I haven't missed anything here.

Friday, June 02, 2006

Benchmarking results of mysql, lucene and sphinx...

Finally I was able to do a benchmark of the 3 search engines:
- mysql fulltext search
- lucene search engine
- sphinx www.sphinxsearch.com

I came across sphinx while doing a search on fulltext search engines. It is a very good engine. A few points regarding sphinx:
-> Very simple to configure, build an index with, and search.
-> Very easy to integrate with php and mysql - APIs for both are available. I was able to build an index and search it using sphinx in a few hours.
-> The index that is created is a combination of all the mysql fields; there is no distinction between different fields being searched. So you perform a search on the index as a whole, not on individual fields of the index.
-> Of course, since its source code is available, the search process can be customized according to your needs. Moreover, the 0.9.6 version, which is under development, will provide field-wise search.
-> Since it is written in C++, it is supposed to be faster as compared to lucene. A minimal configuration sketch follows this list.
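
For reference, a minimal sphinx source/index definition looks roughly like this - a sketch with hypothetical database and field names, using the 0.9.x config syntax:

source src_data
{
    type      = mysql
    sql_host  = localhost
    sql_user  = root
    sql_pass  =
    sql_db    = test
    # every text column selected here is merged into one combined index,
    # which is why field-wise search is not possible in this version
    sql_query = SELECT id, title, content FROM documents
}

index idx_data
{
    source = src_data
    path   = /tmp/idx_data
}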

I did the benchmarking on my own laptop - a Dell Inspiron 700m running linux (fedora core 4), kernel 2.6.11. Configuration of the machine:
Processor : Intel(R) Pentium(R) M processor 1.60GHz
Cache size : 2048 KB
Memory : 512 MB

I took a table containing 1 lakh (100,000) records; the data size was 456 MB. I created indexes on some fields from the table.

INDEXING TIME

Mysql Version - 5.1.9 (Beta)
Stop words : built in
Indexing words of length >=2 & <=84. (Mysql has a feature that lets you specify the min & max length of the words you want to index. By default the min length is 4; I changed it to 2 so that I could index words like net, php, hr etc. If you want to index all words, change this to 1. See the my.cnf sketch below.)
Indexing time : 1440 seconds
Index size : 263 MB (456 MB data - remains the same).
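
The relevant my.cnf settings are ft_min_word_len and ft_max_word_len; after changing them, the server must be restarted and existing fulltext indexes rebuilt (e.g. with REPAIR TABLE <table> QUICK). A sketch:

[mysqld]
ft_min_word_len = 2
ft_max_word_len = 84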

Lucene Version - 1.9.1
Stop words : 660 general words (like is, if, this etc...)
Indexing time : 2761 seconds. (The default configuration was used during indexing. There are parameters like mergefactor and maxmergedocs with which indexing can be tuned to run much faster, though that may result in a Too Many Open Files error on linux.)
Index Size : 109 MB. (No data is stored. I stored only the unique id of each document, using which I can retrieve the document later.)

Sphinx Version - 0.9.5
Stop words : NONE
Indexing time : 246 seconds (using the default configuration; I don't have much idea whether indexing can be tuned).
Index Size : 211.1 MB (3.1 MB index - .spi file + 208 MB data - .spd file).
3.1 MB of index looks extremely good. Also, in the case of sphinx there is no need to maintain a separate id for retrieval of data, since the unique id of your data is maintained in the index - as compared to lucene, where you have to maintain a separate id and enforce uniqueness on it in your program. The indexing time and data compression are both good.

SEARCH TIME

The searches were done using scripts. I did a number of searches on randomly selected words and then came out with an average time. In the case of lucene and mysql, the search was done on 2 fields with an OR between them; in the case of sphinx, the search was done on the complete index.

Searches/thread | Concurrency (no of simultaneous threads) | Total searches | Total time (ms) | Average time (ms)

MySQL
 5 | 1 |   5 |  347153 | 69430.6
 5 | 2 |  10 |  727359 | 72735.9
10 | 3 |  30 | 2288392 | 76279.73

I found that searching for an exact phrase, which can be done using "in boolean mode" queries, is more resource hungry. The query times in mysql are extremely high. Mysql relies heavily on RAM, so with more RAM and an accordingly configured mysql the queries would be much faster. Concurrency does not affect query execution speed to a major extent.
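
An exact-phrase search "in boolean mode" of the kind mentioned above looks roughly like this (illustrative table and field names):

SELECT id FROM documents
WHERE MATCH (title, content)
AGAINST ('"exact phrase"' IN BOOLEAN MODE);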

LUCENE
 5 | 1 |   5 |  673 | 134.6
 5 | 2 |  10 |  615 |  61.5
10 | 3 |  30 |  897 |  29.9
50 | 3 | 150 | 2762 |  18.41

Initially searches are slow, but as we keep searching, the index gets cached in RAM and the speed of searches increases. The searches are very fast as compared to MySQL. From these results it does not seem that concurrency is an issue. But I have been using lucene for some time now and have found that there are some concurrency issues: if you are searching a huge index of, say, 10,000,000 records with an index size of 8-10 GB, then with a concurrency of 100 simultaneous searches, issues pop up, as searches seem to get locked. Still, performance-wise it is much better than mysql.

SPHINX
 5 | 1 |   5 |  512 | 102.4
 5 | 2 |  10 |  733 |  73.3
10 | 3 |  30 | 2272 |  75.73
50 | 3 | 150 | 4439 |  29.59

Single searches are faster than in lucene. But here we have to consider the fact that there is no OR clause in the search, so the search engine does not have to fetch 2 different result sets and do a union on them. As the concurrency of searches is increased, the average time per search does not drop as sharply as in lucene, clearly pointing out that there may be concurrency issues here. Since I have not explored this to a great extent, I cannot comment on the problems related to concurrency here.


To sum up, lucene seems to be the right choice for the time being if you are looking to search large amounts of data and performance is your primary goal. Its list of available features is also impressive. Sphinx comes in next, with very short indexing times and hassle-free indexing/searching; with evolution, sphinx may overtake lucene some time down the line, providing both a good feature list and performance. MySQL fulltext search comes as a last option which, it seems, should be used only if the data set is small and quick development time is required.

==================
Date : 12 Aug 2009
Some more recent benchmarks on Lucene versus Sphinx
==================