Tuesday, October 06, 2009

intro to lucene 2.9

What a pain! Why do they have to come out with a new version every now and then, and make people rewrite their code to upgrade? How much more do they still have to improve their code? Because of their frequent upgrades, I have to change my code again and again. Why should I upgrade to lucene 2.9?

To answer this question: you build something, and then you figure out that - oh, if it had been done this other way, it would have been better. For example, you make an omelette and realize that a bit of cheese and pepper would have improved its taste. So next time you try that, and then you figure out that making it in butter brings out even more flavour. Or you buy a Pentium 3 PC with 1 GB RAM, and after 2 years you see that it is outdated - the software has grown and so has the available processing power. To run current software you would need to upgrade to a Pentium 4 or Core 2 Duo, maybe swap your Nvidia 9800 GT for an ATI Radeon 4870 X2 to play recent games properly, and maybe upgrade your 20 inch CRT television to a 42 inch HD LCD for a better display.

It is for the same reason that lucene keeps optimizing its code and improving its features - better code leads to faster indexing and searching on the same machine.

The reason why you should upgrade your lucene version is defined by the list of features that lucene 2.9 provides:


  • Per-segment searching and caching (can lead to much faster reopen, among other things). The FieldCache takes advantage of the fact that most segments of the index are static: only the parts that changed are re-processed, saving time and memory. Searching across multiple segments is faster too.

  • Near real-time search capabilities added to IndexWriter - new way to search the current in-memory segment before index is written to disk.

  • New Query types

  • Smarter, more scalable multi-term queries (wildcard, range, etc)

  • A freshly optimized Collector/Scorer API

  • Improved Unicode support and the addition of Collation contrib

  • A new Attribute based TokenStream API

  • A new QueryParser framework in contrib with a core QueryParser replacement impl included.

  • Scoring is now optional when sorting by Field, or using a custom Collector, gaining sizable performance when scores are not required.

  • New analyzers (PersianAnalyzer, ArabicAnalyzer, SmartChineseAnalyzer)

  • New fast-vector-highlighter for large documents

  • Lucene now includes high-performance handling of numeric fields (NumericField & NumericRangeQuery). Such fields are indexed with a trie structure, enabling simple-to-use and much faster numeric range searching, without having to pre-process numeric values into padded text yourself. Indexing numbers, geo-locations and dates is faster, sorting is faster, and range searching is hugely faster.
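To get a feel for the trie idea behind NumericField, here is a tiny self-contained sketch. This is not Lucene's actual NumericUtils encoding (the real one splits the bits differently and encodes terms compactly); it just shows the core trick - index each value at several precisions by shifting low bits away, so a range query can match a handful of coarse prefixes instead of enumerating every distinct value in the range:

```java
import java.util.ArrayList;
import java.util.List;

public class TriePrefixDemo
{
    // Produce the terms a single long value would be indexed under,
    // one per precision level. "shift:prefix" stands in for a real
    // encoded term.
    static List<String> prefixes(long value, int precisionStep)
    {
        List<String> terms = new ArrayList<String>();
        for (int shift = 0; shift < 64; shift += precisionStep)
        {
            terms.add(shift + ":" + (value >>> shift));
        }
        return terms;
    }

    public static void main(String[] args)
    {
        for (String t : prefixes(20091006L, 8))
            System.out.println(t);
    }
}
```

With a precision step of 8 a long produces just 8 terms, and large ranges are mostly covered by the coarse (high-shift) terms - which is where the speedup for range queries comes from.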



For the newbies, all this techno-rant is just there to make you feel good about upgrading. In brief: faster search and more features.
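The per-segment caching point above can be pictured with a toy sketch in plain Java. This is not Lucene's FieldCache - just the idea: segments never change once written, so cached field values are keyed by segment and survive a reopen; only newly created segments pay the loading cost:

```java
import java.util.HashMap;
import java.util.Map;

public class SegmentCacheDemo
{
    // Toy per-segment field cache: values are loaded once per segment
    // and reused across reopens, because old segments are immutable.
    private final Map<String, int[]> cache = new HashMap<String, int[]>();
    public int loads = 0; // counts how often we actually hit "disk"

    public int[] getValues(String segmentName, int[] onDisk)
    {
        int[] cached = cache.get(segmentName);
        if (cached == null)
        {
            loads++;                 // simulate an expensive load
            cached = onDisk.clone();
            cache.put(segmentName, cached);
        }
        return cached;
    }
}
```

If a reader first sees segments _0 and _1 and, after a reopen, sees _0, _1 and _2, only _2 gets loaded the second time - three loads in total instead of five. Pre-2.9 caches keyed on the whole index had to rebuild everything on every reopen.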

Let's take a look at how you would go about indexing and searching using lucene 2.9.

Here is a very rough example. What I have done is use the twitter api to search for keywords on twitter, fetch the micro-blogs, and create an index using lucene 2.9. The same program can then open the index and run a search - displaying only the first n results. You can fetch the twitter api from http://yusuke.homeip.net/twitter4j/en/index.html


import twitter4j.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.store.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.search.*;
import org.apache.lucene.document.*;
import java.io.*;
import java.util.Date;
import java.util.ArrayList;
import java.util.List;

public class lucene
{
public static void main(String[] args) throws Exception
{
if(args.length != 3)
{
System.out.println("Usage : java lucene <index/search> <dirname> <string>");
System.exit(1);
}

if(!args[0].equalsIgnoreCase("index") && !args[0].equalsIgnoreCase("search"))
{
System.out.println("Usage : java lucene <index/search> <dirname> <string>");
System.exit(1);
}
System.out.println(args[0]+","+args[1]+","+args[2]);

lucene lu = new lucene(args[0], args[1]);
if(args[0].equalsIgnoreCase("index"))
lu.indexFiles(args[2]);
else if(args[0].equalsIgnoreCase("search"))
lu.searchFiles(args[2]);


}

File index_dir;
String action;

public lucene(String action, String dirname) throws Exception
{
this.index_dir = new File(dirname);
this.action = action;

if(index_dir.exists() && action.equalsIgnoreCase("index"))
{
System.out.println("Index already exists... enter another directory for indexing...");
System.exit(1);
}
}

public void indexFiles(String searchstr) throws Exception
{
Twitter tw = new Twitter();
System.out.println("Getting tweets for "+searchstr);
twitter4j.Query qry = new twitter4j.Query("source:twitter4j "+searchstr);
qry.setRpp(50);

QueryResult res = tw.search(qry);
List<Tweet> tweets = res.getTweets();
System.out.println("Got "+tweets.size()+" tweets in "+res.getCompletedIn()+" : "+res.getMaxId());

// constructor changed from lucene 2.4.1
IndexWriter iw = new IndexWriter(NIOFSDirectory.open(this.index_dir), new WhitespaceAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);

int docs = 0;
for(int z=0; z<tweets.size(); z++)
{
Tweet twt = (Tweet)(tweets.get(z));
String user = twt.getFromUser();
String usrTwt = twt.getText();
System.out.println("Got : "+user+" => "+usrTwt);

Document d = new Document();
// constructor for Field changed - introduced new constants ANALYZED & NOT_ANALYZED. Omitting norms improves performance.
d.add(new Field("user", user, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS, Field.TermVector.YES));
d.add(new Field("tweet", usrTwt, Field.Store.YES, Field.Index.ANALYZED_NO_NORMS, Field.TermVector.WITH_POSITIONS_OFFSETS));

iw.addDocument(d);
docs++;
}

System.out.println("optimizing..."+docs+" docs");
iw.optimize();
iw.close();
}

public void searchFiles(String searchstr) throws Exception
{
BufferedReader br = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
QueryParser parser = new QueryParser("tweet",new WhitespaceAnalyzer());
// New constructor in 2.9 - pass true to open in readonly mode.
IndexReader ir = IndexReader.open(NIOFSDirectory.open(this.index_dir), true);
Searcher searcher = new IndexSearcher(ir);
int ResultsPerPage = 5;
do
{
org.apache.lucene.search.Query qry = parser.parse(searchstr);
System.out.println("Searching for : "+searchstr);

//use TopScoreDocCollector to get results and do paging. Collect 2 pages in one go. The boolean says whether documents are scored in doc-id order; false is the safe default.
TopScoreDocCollector collector = TopScoreDocCollector.create(2*ResultsPerPage, false);
searcher.search(qry, collector);
//get total no of hits found;
int totalResults = collector.getTotalHits();
int start = 0;
int end = Math.min(totalResults, ResultsPerPage);
ScoreDoc[] hits = collector.topDocs().scoreDocs;

System.out.println("Total hits : "+totalResults+", end : "+end);

for(int i=start; i<end; i++)
{
Document doc = searcher.doc(hits[i].doc);
System.out.println(i+"] "+doc.get("user")+" => "+doc.get("tweet"));
}


System.out.print("\nQuery (enter \"quit\" to exit): ");
searchstr = br.readLine();
if(searchstr == null)
{
break;
}
searchstr = searchstr.trim();
if(searchstr.length()==0)
{
break;
}

}while(!searchstr.equalsIgnoreCase("quit"));

}
}
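One feature the example above does not use is near real-time search. In lucene 2.9 you get it through IndexWriter.getReader(), which returns a reader that also sees the uncommitted in-memory segment. Here is a toy model in plain Java - not the Lucene API, just the shape of the idea: documents sit in a buffer that is searchable before it is flushed to disk:

```java
import java.util.ArrayList;
import java.util.List;

public class NearRealTimeDemo
{
    // Toy model of near real-time search: the "writer" keeps an
    // in-memory buffer that a search can see before commit().
    private final List<String> committed = new ArrayList<String>();
    private final List<String> buffer = new ArrayList<String>();

    public void addDocument(String doc) { buffer.add(doc); }

    // Flush the in-memory buffer to the "on-disk" segments.
    public void commit()
    {
        committed.addAll(buffer);
        buffer.clear();
    }

    // A search consults both committed segments and the live buffer.
    public boolean search(String term)
    {
        for (String d : committed) if (d.contains(term)) return true;
        for (String d : buffer) if (d.contains(term)) return true;
        return false;
    }
}
```

The point is that a freshly added document is findable immediately, without waiting for the (comparatively expensive) flush and reopen cycle that pre-2.9 code needed.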
