Saturday, May 27, 2006

Missing posts...

There are a few posts I have missed. A lot has happened in the meantime...

I went home, and then to Bhopal for my colleague's wedding on the 14th of May. Had a wonderful day on the 15th at the famous lake of Bhopal. Had lunch at a four-star, went boating, hung around in odd clothes... All in all, we had great fun. Below are some pics from the trip...

The clothes were bought specially for the occasion. We had people asking where we had come from. Hope we made some waves...

Secondly, last Sunday one of my colleagues, Ankit, and I went for a ride on the Delhi-Jaipur highway. Something around 150 km from the starting point. We started at around 5:30 in the morning and reached the place, Jungle Babbler (a restaurant on the highway at Dharuhera), around 9. The ride was very pleasant. Will post pics of it some time later. Saw some great valleys and greenery on the way. We were back by 11:30 am, and by that time the sun was up and hot...

Lastly, yesterday we celebrated color day at the office, and we were supposed to wear red shirts and blue pants/jeans. Saw "The Da Vinci Code" at Waves, Noida. A nice movie for people who have not read the book. But since I had read the book, I found lots of stuff missing. Well... I agree that it is difficult to show almost every sequence of a book in a movie. But still, some thrills were lacking.

Also, my other roommate and I shifted to another residence in Noida, Sector 19. Very close to Sector 18 and our office...

Anyway, that's 2 great weeks behind me... And in the meantime, I have been working on benchmarking Lucene, MySQL fulltext search and Sphinx. Will try to post the benchmark numbers as soon as they are out...

Sunday, May 21, 2006

mysql Fulltext search versus lucene

Here is a comparison between the MySQL fulltext and Lucene search engines. On the surface, the main things that distinguish one from the other are

==> fulltext search in Lucene is much faster than in MySQL
==> Lucene is much more complex to use than MySQL.

In MySQL, you can simply mark an index on a text/varchar column as fulltext, and your work is done. All you need to do next is fire MATCH ... AGAINST queries. Adding to and modifying the index is handled by MySQL internally as and when new data is added. In Lucene, on the other hand, adding and modifying documents has to be handled programmatically. Moreover, MySQL plugs into your application through the available client APIs: it runs as a server on a port, and all your application needs to do is connect to that port and fire SQL queries. With Lucene, you have to plug your application into the index using the Lucene API.

Another difference is that Lucene is very efficient at searching a large number of documents, whereas with MySQL the speed of search goes down as the number of documents increases. MySQL uses RAM to cache the index and uses that cache while serving a query. So if the size of your fulltext index exceeds the RAM, you will experience a major fall in search performance. With Lucene, in my experience, search performance held up even when the size of the index exceeded the RAM on the system.

With MySQL, once a fulltext index is created on a table, inserts into the table become very slow. Let's analyze this. Well... for each record, MySQL does some major processing to fit the record into the current index. After the record is indexed, the copy of the index cached in RAM needs to be rebuilt, since what was there before is no longer correct - it does not have the new record. So, as far as I can tell, with each record MySQL loads the complete index into the cache again. If you are performing searches and inserts/updates on a single table with a fulltext index, the performance of both search and indexing drops badly. With Lucene, on the other hand, adding new documents is not a major drawback. Documents can be added on the fly, which makes indexing very fast, and this process does not affect the search. Two things are to be noted here.

==> Lucene does not allow you to modify a document in place. Modifying a document is equivalent to deleting the current document and adding the modified document to the index.

==> Lucene needs a searcher object opened on the index to perform a search. You will see this when you use the API. Whenever you add a new document to the index, a new searcher has to be opened for the new document to show up in the results. Opening a new one is not a major overhead, though it does slow down searching to some extent.

With Lucene, you do not have the flexibility to join two indexes in a single query, the way you can join tables in MySQL with something like this


(Please don't look at the syntax; look at the meaning/logic behind it. I am not good at syntax. :-D ) This cannot be done with Lucene. You will have to arrange the data so that your index contains the data of both, say, TABLE1 and TABLE2, and then play with the search to get the data you need. Too complex, right??

Also, MySQL comes with a built-in list of stopwords and a default word tokenizer, which separates words on " ", ",", "." etc. In Lucene, both the list of stop words and the tokenizer are decided by you through the analyzer you choose or write. This is an advantage, because you can define your own stopwords and tokenize the text as per your requirements.

In MySQL, searches are case-insensitive by default. In Lucene, you can make searches case-sensitive or case-insensitive, whichever way you want.

MySQL has a minimum length for a word to be indexed, which is 4 by default. So words with fewer than 4 characters will not be indexed. What do you do if you want to index words like "php", "asp", "c"? You have to decrease the minimum length from 4 to 1. And this increases the index size drastically and slows down all your searches as a consequence. There are no such issues in Lucene.

In MySQL, every correct word in the collection and in the query is weighted according to its significance in the collection or query. Consequently, a word that is present in many documents has a lower weight, and a rare word has a higher weight. So if a word is present in 50% or more of the rows in a table, a natural-language query searching for that word will return 0 results. This is what MySQL terms relevance. But for me, it meant incorrect results for a query.
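
To get a feel for why a very common word stops matching, here is a minimal Java sketch of an inverse-document-frequency style weight. It is a simplification for illustration only, not MySQL's actual formula, and the names TermWeightSketch and weight are made up for this example. The weight shrinks as a word appears in more rows and is clamped to zero once the word appears in half of the rows or more.

```java
final class TermWeightSketch {
    /**
     * Simplified IDF-style weight: log((N - n) / n), clamped at zero.
     * Assumption for illustration; MySQL's real weighting has more terms.
     */
    static double weight(int totalRows, int rowsWithWord) {
        if (rowsWithWord <= 0) {
            return 0.0; // the word does not occur at all
        }
        double ratio = (double) (totalRows - rowsWithWord) / rowsWithWord;
        // ratio <= 1 means the word is in at least half of the rows
        return Math.max(0.0, Math.log(ratio));
    }

    public static void main(String[] args) {
        int totalRows = 1000;
        int[] counts = {10, 250, 500, 900};
        for (int count : counts) {
            System.out.printf("word in %d of %d rows -> weight %.3f%n",
                    count, totalRows, weight(totalRows, count));
        }
    }
}
```

With a weight of zero, a row containing only that word scores nothing, which is roughly how a word present in half the table ends up behaving like a stop word.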

This link will give a better idea of mysql fulltext search.

In Lucene, there are some advanced options like

  • proximity search - find documents where searchword1 and searchword2 occur within a given number of words of each other

  • wildcard search - find documents which have words like searchword* or maybe search?word etc...

  • fuzzy/similarity searches - find documents with words spelled similarly to the search word, e.g. roam~ (will look for roam, foam etc...)

  • term boosting - you can boost a term to move relevant documents to the top. So for example, you can say that you want documents with the word "lucene" to be more relevant than those with the word "mysql". Then you can do something like -> lucene^4 mysql .
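
As far as I know, the fuzzy search in the list above is based on edit distance, i.e. how many single-character changes turn one word into another. Here is a minimal Java sketch of that idea. The class FuzzyMatchSketch, the method names and the maxEdits threshold are all hypothetical and only for illustration; this is not Lucene code.

```java
final class FuzzyMatchSketch {
    /** Classic dynamic-programming Levenshtein edit distance. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** A candidate matches if it is within maxEdits changes of the term. */
    static boolean isFuzzyMatch(String term, String candidate, int maxEdits) {
        return editDistance(term, candidate) <= maxEdits;
    }

    public static void main(String[] args) {
        String[] candidates = {"roam", "foam", "roams", "lucene"};
        for (String candidate : candidates) {
            System.out.println("roam ~ " + candidate + " : "
                    + isFuzzyMatch("roam", candidate, 1));
        }
    }
}
```

So "foam" and "roams" are one change away from "roam" and would match, while "lucene" would not.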

Sorting of results is extremely fast if you are using Lucene. With MySQL, if you expect your results to come out fast, you will have to forget about sorting: it takes a huge amount of time. Even retrieving the total number of matching documents is a chore in MySQL, whereas in Lucene the total comes out by default.

To sum up, if you are looking for a fulltext index on a small table without much hassle, go for the MySQL fulltext index. If you are looking at searching a large collection of data, go for Lucene.

Whew... I wrote a lot... And there is more to write... Maybe next time... Do let me know if I have missed anything.

Thursday, May 18, 2006

matrimonial ads worth a look

These are Girls ads taken from
These are actual ads on a matrimony site. Grammar
and spell errors have no
place in a profile description as everything is
straight from the heart!
Disclaimer : I am not responsible if you forget your
basic grammar after reading this mail...


- Hello To Viewvers My Name is Sowmya , I am single
i dont have male, If any
one whant to marrie to me u can visite to my home. I
am not a good education
but i working all field in bangalroe.. if u like me
u welcome to my heart...
when ever u whant to meet pls viset my resident or
send u letter.. Thanks
yours Regards Sowmya ~*~

i want very simple boy. from brahmin educated family
from orissa state she
is also know about RAMAYAN, GEETA BHAGABATA, and
other homework


Wants a man who knows me better and can adjust with
me forever. he may never
create any difficulties in my life or his life by
which the entire life can run smoothly. thank you

(The principle of running life smoothly was never so


he should be good looking and should have a service.
he Shoulsd have one
brother and one sister. he should be educated.

(ain't it unique !! 1 brother 1 sister criteria !)


I am a happy-go-lucky kind of person. Enjoys every
moments of life. I love
to make friendship. Becauese friendship is a first
step of love. I am
looking for my dreamboy who will love me more than
i. Because i love myself
a lot. If u think that is u then why to late come on
........ hold my hand forever !!!

(The dilwale dulhaniya effect)


i am simple girl.I have lot ofproblemin mylife
because ofmylucknow i
amlooking oneboyhe caremeandloveme lot lot lot

(I don't know why but this is one of my favorites)


My husband should be as 'Shiva' as in Kahani Ghar
Ghar Ki and as Tanwerr as in KSBKBT......

(Ok I haven't seen these soaps but I am sure she
must be demanding too much, ain't he?)


i want a boy with no drinks if he wants he can wear
jeans in house but while
steping out of house he should give recpect to our cast

(by not wearing his jeans? Wat the hell...)



(all of us are loughing{laughing})


whatever he may be but he should feel that he is
going to be someone groom
and he must think of the future life if he is
toolike this he would bde called the man of the lamp

(I am clueless, I feel so lost. Can anyone tell me
what this girl wants)


i love my patner i marriage the patner ok i search
my patner and i love the
patner ok thik hai the patner has a graduate ok

(I am again clueless but I liked the use of "ok".
The person is suffering from "Ok-syndrome")



(the "ok syndrome" again)


iam pranati my family histoy my two brother two
sister and fater&mother sister complity marred

(somebody please explain in comments section how to
get married 'completely'?)


iam very simpel and hanest. i have three sister one
brother and parent. i am doing postal sarvice and tailor master my
original resdence at kalahandi diste naw iam staing at rayagada dist.

(actually what is this girl doing? Postal service or tailor.??)


my name is farhanbegum and i am unmarried. pleaes
you marrige me pleaes
pleaes pleaes pleaes pleaes pleaes pleaes

(height of desperation! :) )


Iwant one boy who love me or my mother. he love me
heartly or he havea frank
he's skin colour 'normal'not a black or not a
whitey. IThink the main think
is heart if your heart is beautiful then you are beautiful.
but iam not a handsome girl or not a good looking. but
my Mom say that Iam a good girl. My father already expired . iam
bye bye.

(uttama purishinin)


iam kanandevi. i do owo sistar.he was marred.

(No comments)



(maybe the poor guy meant BAD habits)


hello i am a good charactarised woman. i want to run
my life happily.i divorced my first husband.his charactor is not
good'. i expect the good minded and clean habits boy who may be in
the same caste or other caste accepted ...

(but credit cards not accepted..???)


my colour is black,but my heart is white.i like
social service



i'm looking out for who lives in bombay, boy simple
who trust me lot should be roman catholic, LOVE ME ONLY.

(Now that criterion is a must, isn't it?)


to be married on jan-2005. working man perferable

(this girl has fixed the marriage date too! But she
is yet to find a bridegroom.
I wish her best of luck on behalf of all of us. I am
sure she will get one soon.)


i would like a beautyfull boy. and i do not want his
any treasure. because boy is the maharaja.

(Now he is going to be a lucky boy! Any takers?)


ssc failed three times and worked with privated ltd
company which not paying
salary at present.

(Any takers again?)

Friday, May 12, 2006

discovering lucene

How I discovered Lucene and how I tuned it for use in our organization... Oops, I may have to leave out specific details about its usage in our organization, but I will try to give a brief general idea.

Well.... I found out about Lucene through the MySQL website. I was just going through the forums for the MySQL fulltext search engine. At that time MySQL 4.0 had just rolled out and we were using its fulltext search engine. But it was terribly slow. Since the data size was not "that" large and the number of searches could also be kept down easily, we were able to survive on it. But we were aware that we would hit a bottleneck some time later and, well, we had to do something about it. Soooo... I was browsing the forums and somewhere I found someone mention that MySQL fulltext search is slow and that a better alternative would be "aspseek" or "lucene". I first tried out aspseek, but it did not allow me to do a boolean query across different fields. Later I tried Lucene. It is written completely in Java, but recently ports of Lucene to C and other languages have been coming out... Another engine that is said to be better than Lucene is "egothor". But there is not much support/awareness around it, and I have tried but been unable to use it for field-based searches.

What I did was build a huge index on one of the small P-III machines and try out the search. I wrote a program to fire random queries concurrently at the index and checked the load on the system. It turned out that the searches were extremely fast and the total number of results was obtained in a flash, but retrieving a large number of documents after the search was a tedious process. Another thing I found is that with the compound file structure (Lucene has 2 forms of index structure - compound and simple - will explain in detail later) and increased concurrency in searches, the speed of search would go down and the load on the system would shoot up. So we decided on the simple index structure.

The most important thing we did was find a bug in that version of Lucene (1.3 was the version in use at the time) related to thread concurrency: a class which was synchronized and was not supposed to be. I pointed it out and sent thread stack dumps to Doug Cutting, the creator of Lucene, and he helped us sort it out. So we recompiled Lucene with the patches, modified the data to suit the search we wanted, created our own analyzer (Lucene has analyzers - will explain in detail later) and then used Lucene. It still serves our purpose, though since then numerous changes have been made to the data and the search to optimize things.

To begin, Lucene is just an API: a set of tools that allows you to create an index and then search it. The index created is an inverted index (you will have to do some googling to find out what an inverted index is, if you don't know). And the index is basically a directory containing the files that make up the index. When the compound index structure is used, the index data is packed into essentially a single file. When the simple index structure is used, the index directory contains lots of files (generally 1 file per field being indexed and 7 (I think) more files).
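
If you would rather not google it, here is a toy Java sketch of what an inverted index is: a map from each word to the ids of the documents that contain it, so a search is a lookup instead of a scan. The class TinyInvertedIndex and its methods are made up for illustration; Lucene's real index is far more elaborate than this.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

final class TinyInvertedIndex {
    // term -> sorted ids of the documents containing that term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Split the text on whitespace and record docId under every token. */
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Look the term up directly; no document is scanned at search time. */
    SortedSet<Integer> search(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new TreeSet<Integer>() : hits;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "lucene is fast");
        index.add(2, "mysql fulltext is simple");
        System.out.println("is     -> " + index.search("is"));
        System.out.println("lucene -> " + index.search("lucene"));
        System.out.println("sphinx -> " + index.search("sphinx"));
    }
}
```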

An analyzer is the most important part of the index. It defines how the data to be searched is broken up and formatted. You can break data into phrases or words, and convert all of them to either lower-case or upper-case (search can also be made case-sensitive). For example, there is the whitespace analyzer, which breaks the text to be indexed into tokens (or words) separated by white space. There is the StandardAnalyzer, which retains only the alphanumeric stuff from your text and discards special characters. There is a stop-word analyzer, which drops the words found in the stop word list provided to it. This may seem like a lot to take in. In fact, this part was the most difficult one when I started out with Lucene. It may be difficult to get exactly what you want from an analyzer and, like me, you may end up writing your own analyzer and defining your own stop words.

What are stop words? Oh... I forgot. Stop words, or noise words, are words which are neither indexed nor searched. Something like "the" can be made a stop word, since it is a common word and is not relevant during search.
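
To make the idea concrete, here is a minimal Java sketch of what an analyzer does: split on white space, lower-case every token and drop the stop words. SimpleAnalyzerSketch is a made-up class for illustration and does not use the Lucene Analyzer API; the stop word list is whatever you pass in.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class SimpleAnalyzerSketch {
    private final Set<String> stopWords = new HashSet<>();

    SimpleAnalyzerSketch(Collection<String> stopWords) {
        for (String word : stopWords) {
            this.stopWords.add(word.toLowerCase(Locale.ROOT));
        }
    }

    /** Whitespace tokenization, lower-casing, then stop word removal. */
    List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> stops = new ArrayList<>();
        stops.add("the");
        stops.add("a");
        SimpleAnalyzerSketch analyzer = new SimpleAnalyzerSketch(stops);
        System.out.println(analyzer.analyze("The quick fox jumped over the lazy dog"));
    }
}
```

The same analysis has to be applied to the indexed text and to the query, otherwise a query for "The" would never match a token stored as "the".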

I will just put down 2 small, basic programs for indexing and search. Treat them as sketches rather than copy-and-paste material. I don't believe in spoon-feeding.

INDEXING: A program which would index text files in a directory.

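import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/** Declare the main class. Index all text files under a directory. */
public class IndexFiles {
    // Name of the index directory, where the index will be stored.
    static final File INDEX_DIR = new File("index");

    public static void main(String[] args) {
        // idxsrc is the directory where the files to be indexed are stored
        final File docDir = new File("idxsrc");
        try {
            // create an IndexWriter, use the standard analyzer and create a new index
            IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);
            System.out.println("Indexing to directory '" + INDEX_DIR + "'...");
            // call the function which does the actual indexing
            indexDocs(writer, docDir);
            // optimize the index - very important - increases speed of search
            writer.optimize();
            writer.close();
        } catch (IOException e) { // exception handling
            System.out.println("caught " + e.getClass() + ": " + e.getMessage());
        }
    }

    static void indexDocs(IndexWriter writer, File file) throws IOException {
        if (file.isDirectory()) { // walk into directories, file by file
            String[] files = file.list();
            for (int i = 0; files != null && i < files.length; i++)
                indexDocs(writer, new File(file, files[i]));
            return;
        }
        System.out.println("adding " + file);
        try {
            Document doc = new Document();
            // the path is stored as-is and not broken into tokens
            doc.add(new Field("path", file.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
            // a field built from a Reader is tokenized and indexed, but not stored
            doc.add(new Field("contents", new FileReader(file)));
            writer.addDocument(doc);
        } catch (FileNotFoundException fnfe) {
            // the file vanished or is unreadable - skip it
        }
    }
}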

Now this is a very crude program to index a directory containing text files. The program creates 2 fields in the index - "path" and "contents". The important thing to note here is how the index is made up. An index is a collection of documents, and each document has any number of fields. For each field, you can decide whether you want to store the data or not while indexing. You can also decide whether to break up the data for that field into tokens for search, using the analyzer you have specified.

SEARCH: Now let's search the index we created

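import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

// Declare the class
public class SearchFiles {
    /** main method */
    public static void main(String[] args) throws Exception {
        String index = "index"; // index to search
        String field = "contents"; // default field to search

        // open up the index
        Searcher searcher = new IndexSearcher(index);
        // create an analyzer - to convert both indexed data and search query to same format
        Analyzer analyzer = new StandardAnalyzer();

        // queries are typed in by the user
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));

        while (true) {
            System.out.print("Query: "); // prompt the user
            String line = in.readLine();
            if (line == null || line.trim().length() == 0)
                break; // stop on end of input or an empty line

            // create a query from the string put by user
            Query query = QueryParser.parse(line, field, analyzer);
            System.out.println("Searching for: " + query.toString(field));

            // hits defines the pointer to the result set obtained after search
            Hits hits = searcher.search(query);
            System.out.println(hits.length() + " total matching documents");

            // get and display first 10 results/documents
            final int DISPLAY = 10;
            int end = Math.min(hits.length(), DISPLAY);
            for (int i = 0; i < end; i++) {
                Document doc = hits.doc(i);
                String path = doc.get("path"); // show the path - which was stored
                System.out.println((i + 1) + ". " + path);
            }
        }
        // close the index
        searcher.close();
    }
}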

Another crude program, this time for search. For a search, the field to be searched has to be named. Suppose we had 2 fields, say "contents" and "contents1"; then we would give a query like this:

contents:phrase1 AND/OR contents1:phrase2

There is a lot that can be done with Lucene. This is just the starting point. Wellll, I could have skipped it and just given the link, but blogs don't work that way. So..

Maybe some time later, I will write down something more advanced about Lucene...

How sinful am i ?

Your Deadly Sins
Greed: 100%
Pride: 80%
Envy: 60%
Sloth: 60%
Lust: 40%
Wrath: 40%
Gluttony: 20%
Chance You'll Go to Hell: 57%
You'll die in a shuttle crash, on your way to your resort on the moon.

Thursday, May 11, 2006

Yahoo! Mail Beta is here......

Yes yes yes,,,
Just found out that Yahoo! Mail Beta is here...
So what's new with Yahoo! Mail Beta? Well, the good point is that there is nothing new: it is almost exactly like your Outlook Express, with an extra "ad" panel on the right-hand side...

Have not explored all the features yet. But at first glance, it looks coooool and fast...

Here is a snapshot.

AND... you can also switch to Yahoo! Mail Beta. Just follow the steps below

  • log in to Yahoo Mail
  • click Options
  • select Account information from the left panel
  • go to Member Information, General Preferences, Preferred Content
  • select, for example, Yahoo UK
  • click Finished
  • go to Yahoo Mail
  • you'll see a page that says "It's the New Yahoo! Mail Beta... and you're invited."
  • click on "Try Beta Now".

It should give Gmail some tough competition... Let's see...

Monday, May 08, 2006

The best book i have read...

I have read many books till now. Lots and lots. Started off in my 5th standard with lots of "famous 5's", and then in my high school days moved on to "hardy boys", "nancy drew" and "the three investigators - alfred hitchcock". I started on "hardy boys" in my 8th, and by 9th the library had run out of "hardy boys" I had not read. I used to read 2 books in a week's time. Loads of stuff. People used to call me a bookworm.

By 10th, I gave up reading books. Too much studying and too many tuitions. And no new books coming up. Soo, I did well in both my 10th and 12th. Went to college and had a great college life. Then in my 3rd year, I was again introduced to the pleasures of reading. Got "10 commandments - john grasim" from one of my friends. And finished it fast. From then onwards there has been no pause in my reading.

Read lots of books which i got from Almost 95% of them were classics. Some of the titles i remember are "Journey to the core of the earth", "sherlock holmes" - the complete series, "Tarzan" original complete series - a very good book, "The Time machine" and other books by the same author. The one i liked the most is a book named "A woman in white". Great book about revenge.

I like fiction. I have tried reading some non-fiction stuff, but was never successful in staying awake while reading. The only non-fiction I was able to complete was "a train from pakistan". Within fiction, I like reading science fiction, adventure and horror-thriller types. When I am reading a book, I prefer to get detached from the world. To be in an imaginary world of my own, where impossible things happen. That's what I prefer.

Then I landed a job in Delhi. And used to spend my time gathering books by "sidney sheldon", "robin cook", "stephen king", "robert ludlum", "patricia cornwell", "john grisham" and some other random authors. I got tired of "sidney sheldon" very soon. Medical thrillers by "robin cook" still seem good. But the most exciting are the stories by "stephen king". He seems to write horror in a very normal way. His stories are normal in a way and abnormal in a way. It strikes.

I have been lucky to go through the "harry potter" series by "J K Rowling". The first 6 books are great. By the way, there are 2 versions of the 5th and 6th books. I hope I have gone through the right ones. And I am eagerly waiting for the 7th book, which is supposed to be the final one.

To return to the topic, the best book I have read is "The Dark Tower" by Stephen King. There is no other book which can beat this one, according to me. He starts off with the first book, "The gunslinger", where he introduces "the hero" of the story, the last gunslinger, and how he is seeking The Dark Tower. The first book is the most boring one. There is nothing there except travelling. It may dishearten you from reading the other books. It did dishearten me. But I was free and there was nothing else to read, so I continued with book 2, "The Drawing Of The Three". This is where the actual story, the quest and the will to "save the dark tower", and hence the world, starts. With each book, the author introduces new ideas and imaginations and makes the previous ones seem ordinary. He introduces time travelling, high-end magic, hi-tech machinery with artificial intelligence - so advanced that it seems more like machines controlling the future of human beings - lost civilizations, and vampires, werewolves and draculas. Each book takes the reader to a new level of imagination and thrill. As the author moves through "book 3: The Waste Lands", "book 4: Wizard And Glass", "book 5: The Wolves Of Calla", "book 6: Song of Susannah" and finally "book 7: The Dark Tower", you feel like you cannot stop reading. You are totally engrossed, and keep thinking about what will happen next.

The book to me seems to be the god of "Harry potter", "Matrix" and "Lord of the rings". Lots of Stephen King books have been made into movies. I just hope this one also gets made into one - no, into a series of movies, somewhat like the Harry Potter series.

For more information you can visit I have the complete ebook with me. Drop me a comment if you need me to mail it to you.

Friday, May 05, 2006


Went swimming yesterday at Kendriya Vidyalaya near Adobe in Noida. Had a great time for an hour. After a long, long time I went swimming and swam in water worth swimming in. Had a chance to swim some time back when we were on a picnic at some resort (don't remember the name). But the water there was so shallow that it was difficult to remain submerged in it...

I love to swim. Learned swimming when I was in 4th or 5th standard. Dad taught me. We had a great pool in Baroda. And really cheap. We used to join for a quarter in summer and it cost us something around 350/- for the membership. The pool was very clean, very deep and had a diving board. Used to practice all types of diving there... I swam for some years and then in 9th left it due to studies.

But later my college friends and I joined again in our 2nd or 3rd year of college. And we used to have a great time. My highest record was around 22 lengths. 1 length was 50 meters. Used to get very tired and hungry.

The pool where I swam yesterday is not that long, but I managed something around 4-5 lengths (starting from 3/4th of the pool).

Just hope that I will join the pool later and maybe after a month will be able to break my previous record. Let's see....