
Sunday, December 12, 2010

how to go about apache-solr

I had explored lucene for building full text search engines. I had even gone into the depth of modifying lucene's core classes to change existing functionality and add new functionality. Being comfortable with lucene, I never looked at solr. I had the notion that solr was a simple web interface on top of lucene, so it would not be very customizable. But recently that belief was broken. I have been going through solr 1.4.1. Here is a list of features that solr provides by default: http://lucene.apache.org/solr/features.html.

Solr 1.4.1 combines
  • Lucene 2.9.3 - an older version of lucene. The recent version of lucene 3.0.2 is based on java 5 and has some really good performance improvements over the 2.x version of lucene. I wish we had lucene 3.x on solr. Lucene powers the core full-text search capability of solr.
  • Tika 0.4 - Again the latest version here is 0.8. Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
  • Carrot2 3.1.0 - Here also the latest version is 3.4.2. Carrot2 is an open source search results clustering engine. It readily integrates with Lucene as well.

To install solr, simply download it from http://lucene.apache.org/solr/index.html and untar it. You can launch solr by navigating to the <solr_directory>/example directory and running java -jar start.jar. This starts a sample solr server without any data. You can go to http://localhost:8983/solr/admin to see the admin page for solr. To post some sample files to solr, go to the <solr_dir>/example/exampledocs directory and run java -jar post.jar *.xml. This loads the example documents into the solr server and creates an index on them.

Now the real fun begins when you want to index your own site on solr. First of all, you need to define the schema and identify how data will be ported into the index.

For starters
-- copy example directory : cp example myproject
-- go to solr/conf directory : cd myproject/solr/conf
-- ls will show you the directory contents : $ ls
admin-extra.html dataimport.properties mapping-ISOLatin1Accent.txt schema.xml solrconfig.xml stopwords.txt xslt
data-config.xml elevate.xml protwords.txt scripts.conf spellings.txt synonyms.txt

These are the files in solr which need to be tweaked to get solr working...
We will go through these files one by one.

schema.xml
-- you need to define the fieldTypes such as string, boolean, binary, int, float, double etc...
-- Each field type has a class and certain properties associated with it.
-- You can also specify in the schema how a fieldType is analyzed/tokenized/stored.
-- Any filters related to a fieldType can also be specified here.
-- Let's take the example of a text field
<!-- the text field is of type solr.TextField -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <!-- the analyzer to be applied during indexing -->
    <analyzer type="index">
        <!-- pass the text through the following tokenizer -->
        <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
        <!-- and then apply these filters -->
        <!-- use the stop words specified in stopwords.txt for the stop word filter -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- avoid stemming words which are in the protwords.txt file -->
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    </analyzer>
    <!-- for queries use the following analyzer and filters -->
    <!-- generally the analyzers/filters for indexing and querying are the same -->
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    </analyzer>
</fieldType>
-- Once we are done defining the field types, we can go ahead and define the fields that will be indexed.
-- Each field can have additional attributes like type=fieldType, stored=true/false, indexed=true/false, omitNorms=true/false
-- if you want to ignore indexing of some fields, you can create an ignored field type and specify the type of the field as ignored
<fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />
<dynamicField name="*" type="ignored" multiValued="true" />
-- in addition to these, you also have to specify the unique-key to enforce uniqueness among multiple documents
<uniqueKey>table_id</uniqueKey>
-- Default Search field and default search operator are among other things that can be specified.

solrconfig.xml
- used to define the configuration for solr
- parameters like datadir (where index is to be stored), and indexing parameters can be specified here.
- You can also configure caches like queryResultCache, documentCache or fieldValueCache and their caching parameters.
- It also handles warming of cache
- There are request handlers for performing various tasks.
- Replication and partitioning are sections in request handling.
- Various search components are also available to handle advanced features like faceted search, moreLikeThis, highlighting
- All you have to do is put the appropriate settings in the xml file and solr will handle the rest.
- Spellchecking is available which can be used to generate a list of alternate spelling suggestions.
- Clustering is also a search component, it integrates search with Carrot2 for clustering. You can select which algorithm you want for clustering - out of those provided by carrot2.
- Porting of data can be done using multiple formats like xml and csv. There are request handlers available for all these formats.
- A more interesting way of porting data is by using the DataImportHandler - it ports data directly from mysql into the lucene index. We will go into detail on this below.
- There is inbuilt deduplication support as well. All you have to do is set it up by telling it which fields to monitor, and it will automatically deduplicate the documents.

DataImportHandler

For people who use sphinx for search, a major benefit is that they do not have to write any code for porting data. You provide a query in sphinx and it automatically pulls data out of mysql and pushes it into the sphinx engine. DataImportHandler is a similar tool available for solr. You can register a DataImportHandler as a requestHandler:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
    <!-- specify the file which has the config for the database connection and the query to be fired for getting data -->
    <!-- you can also specify parameters to handle incremental porting -->
    <str name="config">data-config.xml</str>
    </lst>
</requestHandler>

data-config.xml
<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/databaseName?zeroDateTimeBehavior=convertToNull"
                 user="root"  password="jayant" />
    <document>
        <entity name="outer" pk="table_id"
        query="SELECT table_id, data1, field1, tags, rowname, date FROM mytable"
        deltaImportQuery="SELECT table_id, data1, field1, tags, rowname, date FROM mytable where table_id='${dataimporter.delta.table_id}'"
        deltaQuery="SELECT table_id from mytable where last_modified_time > '${dataimporter.last_index_time}'">
        
        <!-- this is the map which says which column will go into which field name in the index -->
        <field column="table_id" name="table_id" />
        <field column="data1" name="data1" />
        <field column="field1" name="field1" />
        <field column="tags" name="tags" />
        <field column="rowname" name="rowname" />
        <field column="date" name="date" />

            <!-- getting content from another table for this table_id -->
            <entity name="inner"
            query="SELECT content FROM childtable where table_id = '${outer.table_id}'" >
            <field column="content" name="content" />
            </entity>
        </entity>
    </document>
</dataConfig>
Importing data

Once you start the solr server using java -jar start.jar, you can see the server working on
http://localhost:8983/solr/

It will show you a "welcome message"

To import data using dataimporthandler use
http://localhost:8983/solr/dataimport?command=full-import (for full import)
http://localhost:8983/solr/dataimport?command=delta-import (for delta import)

To check the status of dataimporthandler use
http://localhost:8983/solr/dataimport?command=status

Searching

The ultimate aim of solr is searching. So let's see how we can search in solr and how to get results back from it.
http://localhost:8983/solr/select/?q=solr&start=0&rows=10&fl=rowname,table_id,score&sort=date desc&hl=true&wt=json

Now this says that
- search for the string "solr"
- start from 0 and get 10 results
- return only fields : rowname, table_id and score
- sort by date descending
- highlight the results
- return output as json

All you need to do is process the output and render the results.
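Here is a rough sketch of firing the same select query from Java using nothing but the JDK (the query, fields and sort are the ones from the URL above; swap in whatever json library you prefer to parse the response):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class SolrSearch
{
    public static void main(String[] args) throws Exception
    {
        // build the same select URL - query, paging, field list, sort, highlighting, json output
        String url = "http://localhost:8983/solr/select/?q=" + URLEncoder.encode("solr", "UTF-8")
                   + "&start=0&rows=10&fl=rowname,table_id,score"
                   + "&sort=" + URLEncoder.encode("date desc", "UTF-8")
                   + "&hl=true&wt=json";

        // fire the request and dump the raw json response
        BufferedReader in = new BufferedReader(new InputStreamReader(new URL(url).openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null)
        {
            System.out.println(line);
        }
        in.close();
    }
}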

There is a major apprehension about using solr because it provides an http interface for communication with the engine. But I don't think that is a flaw. Of course you can go ahead and create your own layer on top of lucene for search, but solr follows some standards for search and it would be difficult to replicate all of them. Another option is to create a wrapper around the http interface with just the limited functionality that you need. Http is an easier way of communication as compared to defining your own server and your own protocols.

Definitely solr provides an easy to use - ready made solution for search on lucene - which is also scalable (remember replication, caching and partitioning). And in case the solr guys missed something, you can pick up their classes and modify/create your own to cater to your needs.

Tuesday, October 06, 2009

intro to lucene 2.9

What crap!!! Why do they have to come out with a new version every now and then, and make people rewrite their code to upgrade? How much do they still have to improve their code? Just because of their frequent upgrades, I have to change my code every now and then. Why should I upgrade to lucene 2.9?

To answer this question - you build something and then you figure out that, oh, if it had been done another way it would have been better. So for example you make an omelette and figure out that adding a bit of cheese and pepper would have improved its taste. So next time you try that, but then you figure out that making it in butter would bring out even more taste. Or you buy a Pentium 3 PC with 1 GB RAM and after 2 years you see that it is outdated - the software has grown and so have its processing demands. To run the currently available software you would need to upgrade to a Pentium 4 Core 2 Duo, and maybe upgrade your graphics card from the previous nvidia 9800 GT to an ATI Radeon 4870 X2 to play recent games more effectively. And maybe upgrade your 20 inch CRT television to a 42 inch HD LCD for better graphics display.

It is the same reason that lucene keeps on optimizing its code and improving the features - they realize that better code leads to faster indexing and searching on the same machine.

The reason why you should upgrade your lucene version is defined by the list of features that lucene 2.9 provides:


  • Per segment searching and caching (can lead to much faster reopen among other things). FieldCache takes advantage of the fact that most segments of the index are static, only processing the parts that change, saving time and memory. Also faster searching across multiple segments.

  • Near real-time search capabilities added to IndexWriter - new way to search the current in-memory segment before index is written to disk.

  • New Query types

  • Smarter, more scalable multi-term queries (wildcard, range, etc)

  • A freshly optimized Collector/Scorer API

  • Improved Unicode support and the addition of Collation contrib

  • A new Attribute based TokenStream API

  • A new QueryParser framework in contrib with a core QueryParser replacement impl included.

  • Scoring is now optional when sorting by Field, or using a custom Collector, gaining sizable performance when scores are not required.

  • New analyzers (PersianAnalyzer, ArabicAnalyzer, SmartChineseAnalyzer)

  • New fast-vector-highlighter for large documents

  • Lucene now includes high-performance handling of numeric fields (NumericField & NumericRangeQuery). Such fields are indexed with a trie structure, enabling simple to use and much faster numeric range searching without having to externally pre-process numeric values into textual values. This improves the Lucene number indexing, and is faster for searching numbers, geo-locations, and dates, faster for sorting, and hugely faster for range searching.
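To make the last point concrete, here is a minimal sketch of indexing and querying a numeric field with the new classes (the field name "price" and the values are made up for illustration):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

public class NumericExample
{
    public static void main(String[] args)
    {
        // index side: the value is indexed as a trie structure, no manual padding of numbers into strings
        Document doc = new Document();
        doc.add(new NumericField("price").setIntValue(750));

        // search side: a range query over the same trie encoded field
        Query q = NumericRangeQuery.newIntRange("price", 500, 1000, true, true);
        System.out.println(q); // prints something like price:[500 TO 1000]
    }
}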



For the newbies, all this techno-rant is just there to make you feel good about upgrading. In brief - faster search and more features.

Let's take a look at how you would go about indexing and searching using lucene 2.9.

Here is a very rough example. What I have done is use the twitter api to search for keywords on twitter, fetch the micro-blogs and create an index on them using lucene 2.9. And then use the same program to open the index and run a search - displaying only the first n results. You can fetch the twitter api from http://yusuke.homeip.net/twitter4j/en/index.html


import twitter4j.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.store.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.search.*;
import org.apache.lucene.document.*;
import java.io.*;
import java.util.Date;
import java.util.ArrayList;
import java.util.List;

public class lucene
{
public static void main(String[] args) throws Exception
{
if(args.length != 3)
{
System.out.println("Usage : java lucene <index/search> <dirname> <string>");
System.exit(1);
}

if(!args[0].equalsIgnoreCase("index") && !args[0].equalsIgnoreCase("search"))
{
System.out.println("Usage : java lucene <index/search> <dirname> <string>");
System.exit(1);
}
System.out.println(args[0]+","+args[1]+","+args[2]);

lucene lu = new lucene(args[0], args[1]);
if(args[0].equalsIgnoreCase("index"))
lu.indexFiles(args[2]);
else if(args[0].equalsIgnoreCase("search"))
lu.searchFiles(args[2]);


}

File index_dir;
String action;

public lucene(String action, String dirname) throws Exception
{
this.index_dir = new File(dirname);
this.action = action;

if(index_dir.exists() && action.equalsIgnoreCase("index"))
{
System.out.println("Index already exisits... enter another another directory for indexing...");
System.exit(1);
}
}

public void indexFiles(String searchstr) throws Exception
{
Twitter tw = new Twitter();
System.out.println("Getting tweets for "+searchstr);
twitter4j.Query qry = new twitter4j.Query("source:twitter4j "+searchstr);
qry.setRpp(50);

QueryResult res = tw.search(qry);
List<Tweet> tweets = res.getTweets();
System.out.println("Got "+tweets.size()+" tweets in "+res.getCompletedIn()+" : "+res.getMaxId());

// constructor changed from lucene 2.4.1
IndexWriter iw = new IndexWriter(NIOFSDirectory.open(this.index_dir), new WhitespaceAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);

int docs = 0;
for(int z=0; z<tweets.size(); z++)
{
Tweet twt = (Tweet)(tweets.get(z));
String user = twt.getFromUser();
String usrTwt = twt.getText();
System.out.println("Got : "+user+" => "+usrTwt);

Document d = new Document();
// constructor for Field changed - introduced new constants ANALYZED & NOT_ANALYZED. Not storing NORMS improve performance.
d.add(new Field("user", user, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS, Field.TermVector.YES));
d.add(new Field("tweet", usrTwt, Field.Store.YES, Field.Index.ANALYZED_NO_NORMS, Field.TermVector.WITH_POSITIONS_OFFSETS));

iw.addDocument(d);
docs++;
}

System.out.println("optimizing..."+docs+" docs");
iw.optimize();
iw.close();
}

public void searchFiles(String searchstr) throws Exception
{
BufferedReader br = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
QueryParser parser = new QueryParser("tweet",new WhitespaceAnalyzer());
// New constructor in 2.9 - pass true to open in readonly mode.
IndexReader ir = IndexReader.open(NIOFSDirectory.open(this.index_dir), true);
Searcher searcher = new IndexSearcher(ir);
int ResultsPerPage = 5;
do
{
org.apache.lucene.search.Query qry = parser.parse(searchstr);
System.out.println("Searching for : "+searchstr);

//use TopScoreDocCollector to get results and do paging. Get 2 page in a go. Do not sort on score.
TopScoreDocCollector collector = TopScoreDocCollector.create(2*ResultsPerPage, false);
searcher.search(qry, collector);
//get total no of hits found;
int totalResults = collector.getTotalHits();
int start = 0;
int end = Math.min(totalResults, ResultsPerPage);
ScoreDoc[] hits = collector.topDocs().scoreDocs;

System.out.println("Total hits : "+totalResults+", end : "+end);

for(int i=start; i<end; i++)
{
Document doc = searcher.doc(hits[i].doc);
System.out.println(i+"] "+doc.get("user")+" => "+doc.get("tweet"));
}


System.out.print("\nQuery (enter \"quit\" to exit): ");
searchstr = br.readLine();
if(searchstr == null)
{
break;
}
searchstr = searchstr.trim();
if(searchstr.length()==0)
{
break;
}

}while(!searchstr.equalsIgnoreCase("quit"));

}
}

Monday, September 07, 2009

monitor and improve lucene search

How ?
Simple - use lucidgaze available at http://www.lucidimagination.com/Downloads/LucidGaze-for-Lucene

With lucidgaze you could analyze

  • Read/write stats on index utilization, distribution and throughput

  • Query efficiency stats - show how effectively user input is analyzed and decomposed for processing by the index

  • Mapping of tokenizers, token-streams and analyzers - makes transparent how text is processed and indexed



So, we went ahead and downloaded the api but were unable to find any example program or source code. After some amount of tinkering around we were finally able to figure out how to use it.

You could get the code here : http://ai-cafe.blogspot.com/2009/09/lucid-gaze-tough-nut.html

Monday, February 18, 2008

lucene search class diagram

Check out the UML diagram for lucene search & scoring that I found on the net

Application flow:
1. Create a query
2. Pass it to the searcher
3. The searcher creates a weight
4. The weight creates a scorer
5. The searcher creates a collector and passes it to the scorer
6. The scorer finds matches
7. The scorer calculates scores
8. Matches are written to the collector
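One way to see steps 5 to 8 from application code is to hand the searcher your own collector. A rough sketch against the 2.x API of that time (HitCollector; the index path and field name are made up):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class CollectorExample
{
    public static void main(String[] args) throws Exception
    {
        IndexSearcher searcher = new IndexSearcher("/tmp/myindex");
        Query query = new QueryParser("contents", new StandardAnalyzer()).parse("lucene");

        // the searcher builds the weight and scorer internally; the scorer pushes every match into this collector
        searcher.search(query, new HitCollector()
        {
            public void collect(int doc, float score)
            {
                System.out.println("doc " + doc + " scored " + score);
            }
        });

        searcher.close();
    }
}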

Friday, July 06, 2007

Document scoring in lucene - Part 2

This is an addition to the previous post on document scoring/calculating relevance in lucene. If you find that post inadequate you can refer to scoring @ lucene.org and the formula at Default similarity formula.

What I did was create a txt file and some code to index it, and then tried to find out in practice how lucene calculates relevance using the DefaultSimilarity class. Here are the file and the source code.

file: file_a.txt
jayant project manager project leader team leader java linux c c++ lucene apache solaris aix unix minix gnome kde ubuntu redhat fedora rpm deb media player vlc evolution exchange microsoft java vb vc vc++ php mysql java

source code: FIndexer.java
import java.io.*;
import java.util.Date;
import org.apache.lucene.index.*;
import org.apache.lucene.document.*;
import org.apache.lucene.analysis.*;

public class FIndexer
{

public static void main (String[] args) throws Exception
{
String src_path = args[0];
String des_path = args[1];

File f_src = new File(src_path);
File f_des = new File(des_path);

if(!f_src.isDirectory() || !f_des.isDirectory())
{
System.out.println("Error : "+f_src+" || "+f_des+" is not directory");
System.exit(1);
}

IndexWriter writer = new IndexWriter(f_des,new WhitespaceAnalyzer(), true);

File[] files = f_src.listFiles();
for(int x=0; x < files.length; x++)
{
Document doc = new Document();
if(files[x].isFile())
{
BufferedReader br = new BufferedReader(new FileReader(files[x]));
StringBuffer content = new StringBuffer();
String line = null;
while( (line = br.readLine()) != null)
{
content.append(line).append("\n");
}

Field f1 = new Field("name",files[x].getName(), Field.Store.YES, Field.Index.NO);
doc.add(f1);
Field f2 = new Field("content", content.toString(), Field.Store.NO, Field.Index.TOKENIZED);
doc.add(f2);
Field f3 = new Field("content2", content.toString(), Field.Store.NO, Field.Index.TOKENIZED);
doc.add(f3);

}
writer.addDocument(doc);
}
writer.optimize();
writer.close();
}
}


I created copies of file_a.txt as file_b.txt and file_c.txt, and edited file_c.txt, such that file_a.txt and file_b.txt have the same content (the word java occurs 3 times in both files) while in file_c.txt the word java occurs only 2 times.

I created an index using FIndexer.java and used luke to fire queries on the index.

First I did a search for content:java. And as expected I got 3 results with the following scores.





Rank   Score    Filename
0      0.1928   file_b.txt
1      0.1928   file_a.txt
2      0.1574   file_c.txt


Let's take the score of file_b.txt and see how it is calculated. The score is a product of
tf = 1.7321 (java occurs 3 times)
idf = 0.7123 (document freq = 3)
and the fieldNorm = 0.1562 (based on the number of terms in the field)

And the score of file_c.txt is a product of
tf = 1.4142 (java occurs 2 times)
idf = 0.7123 (document freq = 3)
and the same fieldNorm = 0.1562

Here the score is equivalent to the fieldWeight since there is just one field in the query. If more than one field is used in the query, then each term's contribution to the score would be a product of the queryWeight and the fieldWeight.
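For the curious, the 0.1928 can be reproduced by hand from the DefaultSimilarity formulas of this lucene version (tf = sqrt(frequency), idf = ln(numDocs/(docFreq+1)) + 1, lengthNorm = 1/sqrt(terms in the field), with the norm stored in the index as a single byte, hence the limited precision). A rough sketch:

public class ScoreByHand
{
    public static void main(String[] args)
    {
        // content:java in file_b.txt - java occurs 3 times, 3 docs in the index, all 3 contain java
        double tf = Math.sqrt(3);                  // 1.7321
        double idf = Math.log(3.0 / (3 + 1)) + 1;  // 0.7123
        double fieldNorm = 0.15625;                // lengthNorm as decoded from the single byte norm in the index

        double fieldWeight = tf * idf * fieldNorm; // ~0.1928, the score luke reports
        System.out.println(fieldWeight);
    }
}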

So let's change the query to content:java^5 content2:java^2. Here I am boosting java in content by 5 and java in content2 by 2. That is, java in content is 2.5 times more important than java in content2. Let's check the scores.





Rank   Score    Filename
0      0.2506   file_b.txt
1      0.2506   file_a.txt
2      0.2046   file_c.txt


Let's look at how the score was calculated.

Again for file_b.txt:
0.2506 = 0.1790 (weight of content:java^5) + 0.0716 (weight of content2:java^2)

Out of which Weight of content:java^5 is
= 0.9285(Query weight) [ 5 (boost) * 0.7123 (idf docFreq=3) * 0.2607 (queryNorm) ]
* 0.1928(field weight) [ 1.7321 (tf=3) * 0.7123 (idf docFreq=3) * 0.1562 (fieldNorm) ]

And weight of content2:java^2 is
= 0.3714(Query weight) [ 2 (boost) * 0.7123 (idf docFreq=3) * 0.2607 (queryNorm) ]
* 0.1928(Field weight) [ 1.7321 (tf=3) * 0.7123 (idf docFreq=3) * 0.1562 (fieldNorm) ]


The same formula is used for calculating the score of file_c.txt, except that the term frequency factor is 1.4142 (the frequency of java in content and content2 is 2).

This explains a little about scoring in lucene. The scoring can be altered either by changing the DefaultSimilarity class or by extending DefaultSimilarity and overriding certain factors/calculations in it. More on how to change the scoring formula later.
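As a teaser for that follow-up, here is a rough sketch of the extension route (the class name is hypothetical; tf() and lengthNorm() are the DefaultSimilarity methods being overridden) - for example flattening term frequency and ignoring field length:

import org.apache.lucene.search.DefaultSimilarity;

// a hypothetical similarity that flattens tf and ignores field length
public class MySimilarity extends DefaultSimilarity
{
    // any non-zero term frequency counts the same
    public float tf(float freq)
    {
        return freq > 0 ? 1.0f : 0.0f;
    }

    // do not penalize longer fields
    public float lengthNorm(String fieldName, int numTokens)
    {
        return 1.0f;
    }
}

// usage: both the writer (norms are baked in at index time) and the searcher need it, e.g.
// writer.setSimilarity(new MySimilarity()); searcher.setSimilarity(new MySimilarity());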

Thursday, June 07, 2007

lucene in php & lucene in java

Something I found out while solving an issue for Mr. Nguyen from Vietnam.

He used php-lucene in the Zend framework for building a lucene index and searching on it, and was facing issues with search times. It turned out that the mysql full text index was performing better than the lucene index.

So I did a quick benchmark and found the following:

1. Indexing using php-lucene takes a huge amount of time as compared to java-lucene. I indexed 30000 records and the time it took was 1673 seconds. Optimization time was 210 seconds. Total time for index creation was 1883 seconds. Which is a hell of a lot of time.

2. An index created using php-lucene is compatible with java-lucene. So an index created by php-lucene can be read by java-lucene and vice versa.

3. Search in php-lucene is very slow as compared to java-lucene. The times for 100 searches were -

jayant@jayantbox:~/myprogs/java$ java searcher
Total : 30000 docs
t2-t1 : 231 milliseconds

jayant@jayantbox:~/myprogs/php$ php -q searcher.php
Total 30000 docs
total time : 15 seconds


So I thought that maybe php was retrieving the documents upfront, and changed the code to extract all documents in both php and java. Still, the times for 100 searches were -

jayant@jayantbox:~/myprogs/java$ java searcher
Total : 30000 docs
t2-t1 : 2128 milliseconds

jayant@jayantbox:~/myprogs/php$ php -q searcher.php
Total 30000 docs
total time : 63 seconds


The code for php search for lucene index is:

<?php
/*
* searcher.php
* On 2007-06-06
* By jayant
*
*/

include("Zend/Search/Lucene.php");

$index = new Zend_Search_Lucene("/tmp/myindex");
echo "Total ".$index->numDocs()." docs\n";
$query = "java";
$s = time();
for($i=0; $i<100; $i++)
{
$hits = $index->find($query);
// retrieve all documents. Comment this code if you dont want to retrieve documents
foreach($hits as $hit)
$doc = $hit->getDocument();

}
$total = time()-$s;
echo "total time : $total s";
?>



And the code for java search of lucene index is

/*
* searcher.java
* On 2007-06-06
* By jayant
*
*/

import org.apache.lucene.search.*;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.document.*;

public class searcher {

public static void main (String args[]) throws Exception
{
IndexSearcher s = new IndexSearcher("/tmp/myindex");
System.out.println("Total : "+s.maxDoc()+" docs");
QueryParser q = new QueryParser("content",new StandardAnalyzer());
Query qry = q.parse("java");

long t1 = System.currentTimeMillis();
for(int x=0; x< 100; x++)
{
Hits h = s.search(qry);
// retrieve all documents. Comment this code if you dont want to retrieve documents
for(int y=0; y< h.length(); y++)
{
Document d = h.doc(y);
}

}
long t2 = System.currentTimeMillis();
System.out.println("t2-t1 : "+(t2-t1)+" ms");
}
}


Hope I haven't missed anything here.

Tuesday, April 24, 2007

realtime fulltext index

For the past few days, I have been struggling to find something which allows me to create, insert into, update and search fulltext indexes - big fulltext indexes. Not very large, but somewhere around 3-4 GB of data.

You would suggest mysql, but with mysql the inserts on a table with a fulltext index are very slow. For every insert, the data is put into the index, which leads to slow inserts.

What else? Well, I had come across some other fulltext engines like sphinx and senna. I compiled mysql with senna but there was no benefit. The index size was almost double and the searches were also slow - as slow as the mysql fulltext index.

What about sphinx? Well, I had integrated sphinx with mysql. But the sphinx engine has lots of limitations - like the first 2 columns must be integers, the 3rd column should be text, and all remaining columns should be integers. So if I use sphinx, I would need to write a stored procedure and a trigger to port data from my current table to a parallel sphinx table whenever a new row is inserted. And what would I do if I have to run a query - I would be joining both tables. Would that be fast? Don't know. Let me figure out how to get this thing running...

You would say - what about lucene, my favorite search engine? Well, lucene is in java and there is no way I could integrate it with mysql in a jiffy. I would need to design and build a library to get this thing running. Forget it. In fact dbsight provides a similar type of library. Maybe I will get that thing working and check it out. At a small price, I might get what the organization requires, and that too without wasting any resources on building it.

Hope I get some solution to the situation I am in right now. I have solutions for it, but not one which requires the least amount of resources to get running. Till I get one, the search will continue...

Saturday, July 08, 2006

document scoring/calculating relevance in lucene

How is the scoring of a document calculated in lucene? Well, first of all, what is scoring and what does it do? Scoring is the factor which decides the order in which documents are returned after a search is done using lucene. Something like "order by relevance". If we provide an order by/sort by clause in the search query, then ordering is done on that. Else, ordering is done on the score of the document. All documents returned from the search have a score number associated with them. The document with the highest score comes at the top and the one with the least score comes at the bottom.

Scoring is implemented in the similarity class of lucene (org.apache.lucene.search.Similarity).

Let's see how the score of a document is calculated.

The score of a query q for document d is defined as follows (t refers to a term):

score(q,d) = coord(q,d) * queryNorm(sumOfSquaredWeights) *
             Σ over t in q of ( tf(t in d) * idf(t)^2 * getBoost(t in q) * getBoost(t.field in d) * lengthNorm(t.field in d) )


where


sumOfSquaredWeights = Σ over t in q of ( idf(t) * getBoost(t in q) )^2


Now, let's decode it...

tf = term/phrase(t) frequency in a document(d). Terms and phrases repeated in a document indicate the topic of the document, so implementations of this method usually return larger values when freq is large, and smaller values when freq is small.

idf = term document frequency(no of documents which contain the term). Terms that occur in fewer documents are better indicators of topic, so implementations of this method usually return larger values for rare terms, and smaller values for common terms.

getBoost (t in q) = Gets the boost for this clause in the query. Documents matching this clause will (in addition to the normal weightings) have their score multiplied by b. The boost b=1.0 by default.

getBoost (t.field in d) = Gets the boost for the field in the document if the term is in the field. Again boost for the field b is 1.0 by default.

lengthNorm(t.field in d) = normalization value for a field, given the total number of terms contained in the field. These values, together with field boosts, are stored in the index and multiplied into scores for hits on each field by the search code. Matches in longer fields are less precise, so implementations of this method usually return smaller values when numTokens is large, and larger values when numTokens is small. These values are computed when IndexWriter.addDocument(Document d) is called and stored. Thus they have limited precision, and documents must be re-indexed if this method is altered.

coord = Score factor based on the fraction of all query terms that a document contains. This value is multiplied into scores. The presence of a large portion of the query terms indicates a better match with the query, so implementations of this method usually return larger values when the ratio between these parameters is large and smaller values when the ratio between them is small.

queryNorm = Computes the normalization value for a query given the sum of the squared weights of each of the query terms. This value is then multiplied into the weight of each query term. This does not affect ranking, but rather just attempts to make scores from different queries comparable.

So effectively, the score of a document returned by a search is calculated keeping in mind the frequency of the term/phrase in the document, the number of documents containing the term, the boosting factor of a clause in the query, the boosting factor of a field, the total number of terms in a field, the fraction of all query terms contained in the document, and a normalization factor based on the idf and boost of the terms in the query.

If you are searching a single field using a query, the results should be very relevant, provided the text entered is meaningful and grammatically correct. With the logic here, even if you are searching multiple fields, the order of relevance of the results should be good. But relevance can be increased by assigning boost factors to fields, either during the query formation process or during the indexing process. Fields can be boosted according to their order of importance, resulting in very relevant results.

Boosting during the query formation process provides the flexibility to change the boost factor dynamically, but slows down the search and scoring due to dynamic computation of the score. Whereas if fields are boosted during the indexing process, the search will be much faster, but you lose the flexibility of altering the boost factor dynamically.
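To make the two options concrete, here is a small sketch (field names and boost values are made up for illustration) showing a boost applied at indexing time versus the same idea expressed in the query string:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BoostExample
{
    public static void main(String[] args)
    {
        // index-time boost: baked into the stored norms when the document is added,
        // fast at search time but fixed until you re-index
        Document doc = new Document();
        Field title = new Field("title", "lucene scoring", Field.Store.YES, Field.Index.TOKENIZED);
        title.setBoost(3.0f);
        doc.add(title);
        doc.add(new Field("body", "document scoring in lucene ...", Field.Store.NO, Field.Index.TOKENIZED));

        // query-time boost: computed on every search, flexible but recomputed each time
        String query = "title:lucene^3 body:lucene";
        System.out.println(query);
    }
}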

Anything I have missed, please let me know and I will add it...

Friday, June 02, 2006

Benchmarking results of mysql, lucene and sphinx...

Finally I was able to do a benchmark of the 3 search engines:
- mysql fulltext search
- lucene search engine
- sphinx www.sphinxsearch.com

I came across sphinx while doing a search on full text search engines. It is a very good engine. A few points regarding sphinx:
-> Very simple to configure, create an index and search.
-> Very easy to integrate with php and mysql. APIs for the same are also available. I was able to build an index and search using sphinx in a few hours.
-> The index which is created is a combination of all the fields of mysql. There is no distinction between different fields being searched. So you can perform a search on the index as a whole, but not on individual fields of the index.
-> Of course since its source code is available, the searching process can be customized according to your needs. Moreover version 0.9.6, which is under development, will provide field-wise search.
-> Since this is in C, it is supposed to be faster as compared to lucene.

I did the benchmarking on my own laptop. It is a dell Inspiron 700m running linux (fedora core 4) kernel 2.6.11. Configuration of the m/c ==>>
Processor : Intel(R) Pentium(R) M processor 1.60GHz
Cache size : 2048 KB
Memory : 512 MB

I took a table containing 1 lakh (100,000) records. The data size was 456 MB. And created an index on some fields from the table.

INDEXING TIME

Mysql Version - 5.1.9 (Beta)
Stop words : Built in
Indexing words of length >=2 & <=84 (There is a feature in mysql only which allows you to specify the min & max length of the words you want to index. By default the min length is 4. I changed it to 2 so that I can index words like net, php, hr etc. If you want to index all words, change this to 1.)
Indexing time : 1440 seconds
Index size : 263 MB (456 MB data - remains same).

Lucene Version - 1.9.1
Stop words : 660 general words (like is, if, this etc...)
Indexing time : 2761 seconds (default configuration was used during indexing. There are certain parameters like mergeFactor and maxMergeDocs using which indexing can be tuned to work much faster, though it may result in a Too Many Open Files error in linux.)
Index Size : 109 MB (No data is stored. Had stored only the unique id of each document using which i can retrieve the document later.)

Sphinx Version - 0.9.5
Stop words : NONE
Indexing time : 246 seconds (using the default configuration. Don't have much idea whether indexing can be tuned.)
Index Size : 211.1 MB (3.1 MB Index - .spi file + 208 MB Data - .spd file).
3.1 MB of index looks extremely good. Also in the case of sphinx, there is no need to maintain a separate id for retrieval of data, since the unique id of your data is maintained in the index. As compared to lucene, where you have to maintain a separate id and enforce uniqueness on it in your program. The indexing time and data compression are both good.

SEARCH TIME

The searches were done using scripts. I did a number of searches on randomly selected words and then came up with an average time. In the case of lucene and mysql, the search was done on 2 fields with an OR between them. In the case of sphinx, the search was done on the complete index.

Searches/Thread   Concurrency (no of simultaneous threads)   Total searches   Total time (ms)   Average time (ms)

MySQL
5                 1                                          5                347153            69430.6
5                 2                                          10               727359            72735.9
10                3                                          30               2288392           76279.73

Found that a search for an exact phrase, which can be done using "in boolean mode" queries, is more resource hungry. The query time in mysql is extremely high. Mysql runs purely on RAM, so with more RAM and an accordingly configured mysql the queries would be much faster. Concurrency does not affect query execution speed to a major extent.

LUCENE
5                 1                                          5                673               134.6
5                 2                                          10               615               61.5
10                3                                          30               897               29.9
50                3                                          150              2762              18.41

Initially searches are slow. But as we keep on searching, the index gets cached in RAM and the speed of the searches increases. The searches are very fast as compared to MySQL. Here, from the results, it does not seem that concurrency is an issue. But I have been using lucene for some time now and have found that there are some concurrency issues in lucene. So if you are searching a huge index of say 100,00,000 (10 million) records and the index size is say 8-10 GB, then with a concurrency of 100 searches at the same time, issues pop up, as searches seem to get locked. But still, performance wise it is much better than mysql.

SPHINX
5                 1                                          5                512               102.4
5                 2                                          10               733               73.3
10                3                                          30               2272              75.73
50                3                                          150              4439              29.59

Single searches are faster than in lucene. But here we have to consider the fact that there is no OR clause in the search, so the search engine does not have to get 2 different result sets and do a union on them. But as the concurrency of searches is increased, the average time per search does not drop as sharply as it does in lucene, clearly pointing out that there may be concurrency issues here. Since I have not explored this to a great extent, I cannot comment on the problems related to concurrency here.


To sum up, lucene seems to be the right choice for the time being if you are looking to search large amounts of data and performance is your primary goal. The list of available features is also impressive. Sphinx comes in next, where indexing time is very small and indexing/searching is hassle free. With evolution, sphinx may overtake lucene some time down the line, providing both a good list of features and performance. MySQL fulltext search comes as a last option, which, it seems, should be used only if the data set is small and quick development time is required.

==================
Date : 12 Aug 2009
Some latest benchmarks on Lucene Versus Sphinx
==================

Sunday, May 21, 2006

mysql Fulltext search versus lucene

Here is a comparison between the mysql fulltext and lucene search engines. On the surface, the things that distinguish one from the other are:

==> speed of fulltext search in lucene is much faster as compared to mysql
==> lucene is much more complex to use as compared to mysql.

In mysql, you can simply mark an index on a text/varchar column as fulltext, and your work is done. All you need to do next is fire MATCH AGAINST queries. Adding and modifying index entries is handled by mysql internally as and when new data is added. On the other hand, in lucene, the addition/modification of documents has to be handled programmatically. Moreover, mysql is pluggable into your application using the available apis. Mysql runs on a port and all you need to do is connect to that port and fire sql queries at it from your application. Whereas in the case of lucene, you will have to plug your application into the index using the lucene api.

Another difference is that lucene is very efficient at searching large numbers of documents, whereas in the case of mysql, as the number of documents increases, the speed of search goes down. Mysql uses RAM to cache the index and uses it while serving a query. So, if the size of your fulltext index exceeds the RAM, you will experience a major fall in search performance. Whereas in the case of lucene, the size of the index does not affect search performance even when it exceeds the RAM on the system.

With mysql, when a fulltext index is created on a table, inserts into the table become very slow. Let's analyze this. Well... for each record, mysql does some major processing to fit the record into the current index. After the record is indexed, the cache/RAM containing the index needs to be rebuilt, since the index which was previously there is no longer correct - it does not have the new record. So, with each record, mysql fetches the complete index into the cache/RAM. So if you are performing searches and inserts/updates on a single table with a fulltext index, the performance of both search & indexing degrades badly. On the other hand, with lucene, the addition of new documents is not a major drawback. Documents can be added on the fly, which makes indexing very fast. And this process does not affect the search. Two things to be noted here:

==> lucene does not allow you to modify a document. Modifying a document is equivalent to deleting the current document and adding the modified document to the index.

==> lucene requires an object of the index to perform the search. You will know about it when you use the api. Whenever you add a new document to the index, a new object of the index has to be created to include the new document in the index. But creation of a new object is not a major overhead. Though it does slow down the searching process to some extent.
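A rough sketch of what both of these notes look like in code (the "id" field, the values and the index path are made up; the API names are from the lucene 1.9/2.x line, so check them against the version you use):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;

public class UpdateExample
{
    public static void main(String[] args) throws Exception
    {
        String indexDir = "/tmp/myindex";

        // "modify" a document = delete the old one by its unique id, then add the new version
        IndexReader reader = IndexReader.open(indexDir);
        reader.deleteDocuments(new Term("id", "42"));
        reader.close();

        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        Document doc = new Document();
        doc.add(new Field("id", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("contents", "the edited text", Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        // searches only see the change through a freshly opened reader/searcher
        IndexSearcher searcher = new IndexSearcher(indexDir);
        // ... run queries ...
        searcher.close();
    }
}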

With lucene, you do not have the flexibility to join two indexes and form a single query. Like in mysql you can do something like this

SELECT TABLE1.COLA, TABLE2.COLB FROM TABLE1, TABLE2 WHERE MATCH(TABLE1.COL1) AGAINST 'TEXT1' AND MATCH(TABLE2.COL2) AGAINST 'TEXT2' AND TABLE1.ID=TABLE2.FOREIGNKEY

(Please don't mind the syntax, look at the meaning/logic behind it. I am not good at syntax. :-D ) This cannot be done with lucene. You will have to arrange the data in such a way that your index contains the data of both, say, TABLE1 & TABLE2, and then you will have to play with the search to get the data that you need. Too complex, right??

Also, mysql comes with an inbuilt list of stopwords and a default word tokenizer, which separates the words based on " ", ",", "." etc. Whereas in the case of lucene, both the list of stop words and the word tokenizer have to be defined by us. This is advantageous, because then you can define your own stopwords and tokenize the text as per your requirements.

In case of mysql the searches are by default case insensitive. Whereas in case of lucene, you can make the searches case-sensitive or case-insensitive, the way you want it to be.

With mysql you have the minimum length of word to be indexed which is by default 4. So all words which have less than 4 characters will not be indexed. What will you do if you want to index words like "php", "asp", "c"? You will have to decrease the minimum length from 4 to 1. And this will increase the index size drastically and slow down all your searches as a consequence. There are no such issues in lucene.

In mysql, every correct word in the collection and in the query is weighted according to its significance in the collection or query. Consequently, a word that is present in many documents has a lower weight, and if a word is rare, it has a higher weight. So if a word is present in 50% of the rows in a table, a query searching for that word will return 0 results. This is what mysql terms relevance. But for me, it resulted in incorrect results for a query.

This link http://dev.mysql.com/doc/connector/j/en/fulltext-search.html will give a better idea of mysql fulltext search.

In lucene, there are some advanced options like

  • proximity search - find documents where there is one word between searchword1 and searchword2

  • wildcard search - find documents which have word like searchword* or maybe search?word etc etc...

  • fuzzy/similarity searches - find documents with words sounding similar to roam~ (will look for roam, foam etc...)

  • Term boosting - you can boost a term to move relevant documents to the top. So for example, you can say that you want documents with word "lucene" to be more relevant than those with word "mysql". Then you can do something like -> lucene^4 mysql .
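To see what these query types look like from code, here is a small sketch that simply parses each of them (the field name "contents" is just for illustration; the printed form is what lucene actually runs):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class AdvancedQueries
{
    public static void main(String[] args) throws Exception
    {
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());

        String[] samples = {
            "\"searchword1 searchword2\"~1",   // proximity - at most one word between them
            "searchword* OR search?word",      // wildcard
            "roam~",                           // fuzzy/similarity
            "lucene^4 mysql"                   // term boosting
        };

        for (int i = 0; i < samples.length; i++)
        {
            Query q = parser.parse(samples[i]);
            System.out.println(samples[i] + "  ->  " + q.toString("contents"));
        }
    }
}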


Sorting of results is extremely fast if you are using lucene. In the case of mysql, if you expect your results to come out fast, you will have to forget sorting. Sorting takes a huge amount of time. Even retrieving the total number of matching documents in mysql is a chore, whereas with lucene the total number of documents comes out by default.

From this, if you are looking for a fulltext index on a small table without much hassle, go for the mysql fulltext index. Whereas if you are looking at searching a large collection of data, then go for lucene.

Whew... I wrote a lot... And there is more to write... Maybe next time... Do let me know if I have missed anything.

Friday, May 12, 2006

discovering lucene

How I discovered lucene and how I tuned it for use in our organization... Oops, I may have to leave out specific details about its usage in our organization, but I will try to give a brief general idea.

Well.... I found out about lucene through the mysql website. I was just going through the forums on the mysql full text search engine. At that time mysql 4.0 had just rolled out and we were using the full text search engine of mysql. But it was terribly slow. Since the data size was not "that" large and the number of searches was also small, we were able to survive on it. But we were aware that we would be hitting a bottleneck some time later and, well, we had to do something about it. Soooo... I was just browsing the forums and somewhere I found someone mention that mysql full text search is slow and a better alternative to it would be "aspseek" or "lucene". I first tried out aspseek, but it did not allow me to do a boolean query using different fields. Later I tried lucene. It is completely in java, but recently some versions of lucene for c and other languages have been coming out... Another thing which is better than lucene is "egothor". But there is not much support/awareness about it. And I have tried but have been unable to use it to perform field based searches.

What I did was build up a huge index on one of our small P-III machines and try out the search. I made a program to fire random queries concurrently at the index and checked the load on the system. It turned out that the searches were extremely fast and the total number of results found was obtained in a flash, but the retrieval of a large number of documents after the search was a tedious process. Another thing I found was that with the compound file structure (lucene has 2 forms of index structure - compound and simple - explained in detail later) and increased concurrency in searches, the speed of search would go down and the load on the system would shoot up. So we decided on using the simple index structure.

The most important thing we did was find a bug in that version of lucene (at that time 1.3 was the version being used) related to thread concurrency - a class which was synchronized and was not supposed to be. We pointed it out and sent thread stack dumps to Doug Cutting - the creator of lucene - and he helped us solve it. So we recompiled lucene with the patches, modified the data to suit the search we wanted, created our own analyzer (lucene has analyzers - explained in detail later) and then used lucene. It still serves our purpose, though from that time till now, numerous changes have been made to the data and the search to optimize the search.

To begin with, lucene is just an API - a set of tools that allows you to create an index and then search it. The index created is an inverted index (you will have to do some googling to find out what an inverted index is - if you don't know it already). And the index is basically a directory containing files related to the index. When the compound index structure is used, the index directory contains a single file. But when the simple index structure is used, the index directory contains lots of files (generally 1 file per field being indexed and 7 (I think) more files).

An analyzer is the most important part of the index. It defines how the data to be searched is broken up and formatted. You can break data into phrases or words, and convert all of them to either lower-case or upper-case (search can also be made case sensitive). For example there is the whitespace analyzer, which breaks the text to be indexed into tokens (or words) separated by white space. There is a StandardAnalyzer which retains only alphanumeric content from your text and discards special characters. There is a stop-word analyzer which breaks text on the basis of the stop word list provided to it. This may seem too heavy. In fact this part was the most difficult one when I started out with lucene. It may be difficult to get exactly what you want from your analyzer and, like me, you may end up making your own analyzer and defining your own stop words.

What are stop words? Oh... I forgot. Stop words or noise words are words which are neither indexed nor searched. Something like "the" can be made a stop word, since it is a common word and is not relevant during search.
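For example, a bare-bones custom analyzer with its own stop word list could look something like this (a rough sketch; the class name and stop word list are made up, and the tokenizer/filters used are the stock lucene ones):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// a hypothetical analyzer: split on whitespace, lowercase, drop our own stop words
public class MyAnalyzer extends Analyzer
{
    private static final String[] MY_STOP_WORDS = { "the", "a", "an", "is", "of" };

    public TokenStream tokenStream(String fieldName, Reader reader)
    {
        TokenStream stream = new WhitespaceTokenizer(reader);
        stream = new LowerCaseFilter(stream);
        stream = new StopFilter(stream, MY_STOP_WORDS);
        return stream;
    }
}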

I will just put down 2 small and basic programs for indexing and search. Don't copy and paste them, they won't work as-is. I don't believe in spoon feeding.

INDEXING: A program which would index text files in a directory.

/** Declare the main class. Index all text files under a directory. */
public class IndexFiles
{
// Name of the index directory. where index would be stored.
static final File INDEX_DIR = new File("index");

// The main method
public static void main(String[] args)
{
// idxsrc is the directory where files to be indexed is stored
final File docDir = new File("idxsrc");

// Start the indexing process
try
{
//create an object of the indexwriter, use standard analyzer and create new index
IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);
System.out.println("Indexing to directory '" +INDEX_DIR+ "'...");
//call a function which does the actual indexing
indexDocs(writer, docDir);
System.out.println("Optimizing...");
//optimize the index - very important - increases speed of search
writer.optimize();
writer.close();

} catch (IOException e) //exception handling
{
e.printStackTrace();
}
}

static void indexDocs(IndexWriter writer, File file)
throws IOException
{
System.out.println("adding " + file);
try
{
Document doc = new Document();
doc.add(new Field("path", file.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("contents", new FileReader(file), Field.Store.NO, Field.Index.TOKENIZED));
writer.addDocument(doc);
}
catch (FileNotFoundException fnfe)
{
fnfe.printStackTrace();
}
}
}

Now this is a very crude program to index a directory containing text files. The program creates 2 fields in the index - "path" and "contents". The important thing to note over here is how the index is made up. An index is a collection of documents and each document has n number of fields. For each field, you can decide whether you want to store the data or not while indexing. Also you can decide whether you want to break up the data for that field into tokens for search using the analyzer that you have specified.

SEARCH: Now lets search the index created

// Declare the class
public class SearchFiles
{
/** main method */
public static void main(String[] args) throws Exception
{
String index = "index"; //index to search
String field = "contents"; //default field to search
String queries = null;

//open up the index
Searcher searcher = new IndexSearcher(IndexReader.open(index));
//create an analyzer - to convert both indexed data and search query to same format
Analyzer analyzer = new StandardAnalyzer();

BufferedReader in = new BufferedReader(new InputStreamReader(System.in));

while (true)
{
if (queries == null) // prompt the user
System.out.print("Query: ");

String line = in.readLine();

if (line == null || line.length() == -1)
break;

//create a query from the string put by user
Query query = QueryParser.parse(line, field, analyzer);
System.out.println("Searching for: " + query.toString(field));

//hits defines the pointer to the result set obtained after search
Hits hits = searcher.search(query);

System.out.println(hits.length() + " total matching documents");

// get and display the first 10 results/documents
final int DISPLAY = 10;
int end = Math.min(hits.length(), DISPLAY);
for (int i = 0; i < end; i++)
{
Document doc = hits.doc(i);
String path = doc.get("path"); //show the path - which was stored
System.out.println((i+1) + ". " + path);
}
}
//close the index
searcher.close();
}
}


Another crude program, for search. For searching, the field in which the search is to be performed has to be named. Suppose we had 2 fields - say "contents" and "contents1" - then we would give a query like this:

contents: phrase1 AND/OR contents1: phrase 2

There is a lot that can be done using lucene. This is just the starting point. Well, I could have skipped all this and just given the link lucene.apache.org, but blogs don't work that way. So..

Maybe some time later, I will write down something more advanced about lucene...

Sunday, April 02, 2006

Work Work Work ...

Have been working straight for 4 days... getting only 2-4 hours of sleep in the night.

Well, I am a hard worker, but not that hard. Now the question comes, what is the urgency for this hard work? And I would say, buddy, it is difficult to maintain the product on which I am working. People call it RESDEX - short for "Resume Database Access". We have more than 55 lakh (5.5 million) profiles as of today. And it is my job to ensure that everything works without any problem. Providing a search on such a huge database, and getting results in a matter of seconds, is a difficult task.

What the user sees is a search form (well, actually 4 search forms). Which is the tip of the iceberg. What I have to see is the complete iceberg. And it is a big iceberg. Like my colleague says - "Har click ki alag kahani hai" (Every click has its own story).

This is the first time I am talking about my work over here, though all I do is work and work. Actually I have done things in my life which I am proud of. Building a system architecture block by block is something which is really very exciting.

Earlier I used to work for an organization named 3di systems. I worked on lots of projects over there, but my major contribution was in handling the website named www.newgateway4u.com. It was an MLM (Multi Level Marketing) website. The logic behind the website was - in theoretical terms - a binary tree. To explain in layman's terms, each user registered with the website had to pay a certain amount to get himself activated. And once activated, he had to work on getting active people below him. Each user can have at most one person on the left and one on the right. The tree used to look something like this



            Root
           /    \
          L      R
           \    / \
           LR  RL  RR
               /
             RLL


And the site used to work like this - if R is active and both RL & RR are active, then some percentage of the activation fees used to be passed to R. So Root would get a percentage of the activation money from L, R, LR, RL, RR and RLL, if all of them were active. There used to be weekly calculations to work out the amount payable to all active users.

The owner of the site used to make money by keeping whatever remained after giving out the percentages to the eligible active people.

When I joined the project, there were 5000 users and the amount calculation time was 11 hours. The site was pathetically slow. And then I joined. I moved the complete binary tree to a doubly linked list in memory. Everything became cool. Calculations that used to take 11 hours were done in 30 minutes. WOW!!

And then, after 1 year of working with the site, I left the organization. When I left, the number of users was 50,000 and the weekly calculation time was 4 hours. Boy!! The code I wrote and how I handled the application architecture. The ideas that used to flow from me. I was good. And I am proud of myself.

And after that I joined Naukri. And worked over here. Handled servers, built up lots of applications. And finally got into resdex - the Baap (the daddy) of all applications. RESDEX started off with I don't remember how many, but we hit a bottleneck at 10 lakh (1 million) CVs, when we shifted the architecture to something better. I still remember the date. It was 11th June 2004. Things stabilized. And we kept on adding more features and CVs to the product. The product is going stable. But there is always a fear of what would happen once the number reaches something like 100 lakh (10 million). How will we provide a search on such a large database?

Challenges are unlimited. And step by step evolution is the only way to adapt to them.

Cool yaar...

And currently I am just adding one more block to the architecture...
And it is a bit of a heavy block. Well, all blocks are heavy... Right!!

Wish that everything goes right and I get to sleep on Monday...