Sunday, December 12, 2010

how to go about apache-solr

I had explored lucene for building full text search engines. I had even gone into the depths of modifying lucene's core classes to change existing behaviour and add new functionality. Being comfortable with lucene, I never looked at solr. I had the notion that Solr was just a simple web interface on top of lucene and so would not be very customizable. But recently that belief was broken. I have been going through solr 1.4.1. Here is a list of features that solr provides by default: http://lucene.apache.org/solr/features.html.

Solr 1.4.1 combines
  • Lucene 2.9.3 - an older version of lucene. The more recent lucene 3.0.2 is based on java 5 and has some really good performance improvements over the 2.x versions. I wish we had lucene 3.x in solr. Lucene powers the core full-text search capability of solr.
  • Tika 0.4 - Again the latest version here is 0.8. Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
  • Carrot2 3.1.0 - Here also the latest version is 3.4.2. Carrot2 is an open source search results clustering engine. It readily integrates with Lucene as well.

To install solr, simply download it from http://lucene.apache.org/solr/index.html and untar it. You can launch solr by navigating to the <solr_directory>/example directory and running java -jar start.jar. This starts the sample solr server without any data. You can go to http://localhost:8983/solr/admin to see the admin page for solr. To post some sample files to solr, run java -jar post.jar *.xml from the <solr_directory>/example/exampledocs directory. This will load the example documents into the solr server and create an index on them.
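Put together, the steps look like this (the top-level directory name depends on the exact download you untar):

cd <solr_directory>/example
java -jar start.jar
# in another terminal, index the bundled sample documents
cd <solr_directory>/example/exampledocs
java -jar post.jar *.xml

Then browse to http://localhost:8983/solr/admin and try a few searches against the sample data.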

Now the real fun begins when you want to index your own site on solr. First of all, you need to define the schema and identify how data will be ported into the index.

For starters
-- copy example directory : cp example myproject
-- go to solr/conf directory : cd myproject/solr/conf
-- ls will show you the directory contents : $ ls
admin-extra.html dataimport.properties mapping-ISOLatin1Accent.txt schema.xml solrconfig.xml stopwords.txt xslt
data-config.xml elevate.xml protwords.txt scripts.conf spellings.txt synonyms.txt

These are the only files in solr which need to be tweaked to get solr working with your own data.
We will go through them one by one.

schema.xml
-- you need to define the field types such as string, boolean, binary, int, float, double etc...
-- Each field type has a class and certain properties associated with it.
-- Also you can specify how a fieldtype is analyzed/tokenized/stored in the schema
-- Any filters related to a fieldType can also be specified here.
-- Let's take an example of a text field
<!-- the text field is of type solr.TextField -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <!-- this analyzer is applied during indexing -->
    <analyzer type="index">
        <!-- pass the text through the following tokenizer -->
        <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
        <!-- and then apply these filters -->
        <!-- use the stop words specified in stopwords.txt for the stop word filter -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- avoid stemming words which are listed in the protwords.txt file -->
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    </analyzer>
    <!-- for queries use the following analyzer and filters -->
    <!-- generally the analyzers/filters for indexing and querying are the same -->
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    </analyzer>
</fieldType>
-- Once we are done defining the field types, we can go ahead and define the fields that will be indexed.
-- Each field can have additional attributes like type=fieldType, stored=true/false, indexed=true/false, omitNorms=true/false
-- if you want to ignore indexing of some fields, you can create an ignored field type and specify the type of the field as ignored
<fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />
<dynamicField name="*" type="ignored" multiValued="true" />
-- in addition to these, you also have to specify the unique-key to enforce uniqueness among multiple documents
<uniqueKey>table_id</uniqueKey>
-- The default search field and the default search operator are among the other things that can be specified; a sketch of some field definitions is given below.
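For illustration, the fields for the mysql example used later in this post could be declared roughly like this (the field names come from the data-config.xml shown further down; the exact types and attributes are assumptions):

<!-- the unique key of the document -->
<field name="table_id" type="string" indexed="true" stored="true" required="true"/>
<field name="rowname" type="text" indexed="true" stored="true"/>
<field name="data1" type="text" indexed="true" stored="true"/>
<field name="field1" type="string" indexed="true" stored="true"/>
<field name="tags" type="text" indexed="true" stored="true" multiValued="true"/>
<field name="date" type="date" indexed="true" stored="true"/>
<!-- content is searched but need not be returned in results -->
<field name="content" type="text" indexed="true" stored="false"/>

<!-- default field and operator used when the query does not specify them -->
<defaultSearchField>rowname</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>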

solrconfig.xml
- used to define the configuration for solr
- parameters like dataDir (where the index is to be stored) and indexing parameters can be specified here.
- You can also configure caches like queryResultCache, documentCache or fieldValueCache and their caching parameters (see the sketch after this list).
- It also handles cache warming.
- There are request handlers for performing various tasks.
- Replication and partitioning (distributed search) are configured as part of the request handling section.
- Various search components are also available to handle advanced features like faceted search, moreLikeThis, highlighting
- All you have to do is put the appropriate settings in the xml file and solr will handle the rest.
- Spellchecking is available which can be used to generate a list of alternate spelling suggestions.
- Clustering is also a search component; it integrates search with Carrot2 and lets you choose which of the algorithms provided by carrot2 to use for clustering.
- Data can be ported in using multiple formats like XML and CSV. There are request handlers available for all these formats.
- A more interesting way of porting data is the DataImportHandler - it ports data directly from mysql into the index. We will go into detail on this below.
- There is inbuilt deduplication support as well. All you have to do is set it up by telling it which fields to monitor and it will deduplicate automatically.
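As an illustration, the cache settings in solrconfig.xml look roughly like this (the sizes here are just placeholders, not recommendations):

<query>
    <!-- caches the ordered ids of documents matching recent queries -->
    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
    <!-- caches stored fields of documents so they need not be re-read from disk -->
    <documentCache class="solr.LRUCache" size="512" initialSize="512"/>
</query>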

DataImportHandler

For people who use sphinx for search, a major benefit is that they do not have to write any code for porting data. You provide a query in sphinx and it automatically pulls data out of mysql and pushes it into the sphinx engine. DataImportHandler is a similar tool available in solr. You register the DataImportHandler as a requestHandler:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <!-- specify the file which has the config for the database connection and the queries to be fired for getting data -->
        <!-- you can also specify parameters to handle incremental porting -->
        <str name="config">data-config.xml</str>
    </lst>
</requestHandler>

data-config.xml
<dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/databaseName?zeroDateTimeBehavior=convertToNull"
                user="root" password="jayant" />
    <document>
        <entity name="outer" pk="table_id"
            query="SELECT table_id, data1, field1, tags, rowname, date FROM mytable"
            deltaImportQuery="SELECT table_id, data1, field1, tags, rowname, date FROM mytable where table_id='${dataimporter.delta.table_id}'"
            deltaQuery="SELECT table_id from mytable where last_modified_time > '${dataimporter.last_index_time}'">

            <!-- this map says which column goes into which field name in the index -->
            <field column="table_id" name="table_id" />
            <field column="data1" name="data1" />
            <field column="field1" name="field1" />
            <field column="tags" name="tags" />
            <field column="rowname" name="rowname" />
            <field column="date" name="date" />

            <!-- pull content from another table for this table_id -->
            <entity name="inner"
                query="SELECT content FROM childtable where table_id = '${outer.table_id}'" >
                <field column="content" name="content" />
            </entity>
        </entity>
    </document>
</dataConfig>
Importing data

Once you start the solr server using java -jar start.jar, you can see the server running at
http://localhost:8983/solr/

It will show you a "welcome message"

To import data using dataimporthandler use
http://localhost:8983/solr/dataimport?command=full-import (for full import)
http://localhost:8983/solr/dataimport?command=delta-import (for delta import)

To check the status of dataimporthandler use
http://localhost:8983/solr/dataimport?command=status

Searching

The ultimate aim of solr is searching. So let's see how we can search in solr and how to get results back from it.
http://localhost:8983/solr/select/?q=solr&start=0&rows=10&fl=rowname,table_id,score&sort=date desc&hl=true&wt=json

Now this says that
- search for the string "solr"
- start from 0 and get 10 results
- return only fields : rowname, table_id and score
- sort by date descending
- highlight the results
- return output as json

All you need to do is process the output and render the results.
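With wt=json, the response comes back roughly in this shape (the field values here are made up purely for illustration):

{
  "responseHeader": { "status": 0, "QTime": 4 },
  "response": {
    "numFound": 27,
    "start": 0,
    "docs": [
      { "table_id": "101", "rowname": "some row", "score": 1.42 }
    ]
  },
  "highlighting": {
    "101": { "rowname": ["a snippet mentioning <em>solr</em>"] }
  }
}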

There is a major apprehension about using solr because it exposes an http interface for communicating with the engine. But I don't think that is a flaw. Of course you can go ahead and create your own layer on top of lucene for search, but solr implements a number of search standards and it would be difficult to replicate all of them. Another option is to create a wrapper around the http interface exposing only the limited functionality that you need. Http is an easier way of communicating as compared to defining your own server and your own protocols.

Solr definitely provides an easy-to-use, ready-made solution for search on top of lucene - one which is also scalable (remember replication, caching and partitioning). And in case the solr guys missed something, you can pick up their classes and modify them or create your own to cater to your needs.
