Wednesday, March 19, 2014

An interesting suggester in Solr

Auto suggestion has evolved over time. Lucene has a number of implementations for type ahead suggestions. Existing suggesters generally find suggestions whose whole prefix matches the current user input. Recently a new AnalyzingInfixSuggester has been developed which finds matches of tokens anywhere in the user input and in the suggestion. Here is how the implementation looks.




Let us see how to implement this in Solr. Firstly, we will need solr 4.7 which has the capability to utilize the Lucene suggester module. For more details check out  SOLR-5378 and  SOLR-5528. To implement this, look at the searchComponent named "suggest" in solrconfig.xml. Make the following changes.

<searchComponent name="suggest" class="solr.SuggestComponent">
      <lst name="suggester">
      <str name="name">mySuggester</str>
      <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
      <str name="suggestAnalyzerFieldType">text_ws</str>
      <str name="dictionaryImpl">DocumentDictionaryFactory</str>     <!-- org.apache.solr.spelling.suggest.HighFrequencyDictionaryFactory -->
      <str name="field">cat</str>
      <str name="weightField">price</str>
      <str name="buildOnCommit">true</str>
    </lst>
</searchComponent>

Here we have changed the type of lookup to use - lookupImpl to AnalyzingInfixLookupFactory. And defined the Analyzer to use for building the dictionary as text_ws - which is a simple WhiteSpaceTokenizer factory implementation. The field to be used for providing suggestions is "cat" and we use the "price" field as weight for sorting the suggestions.

Also change the default dictionary for suggester to "mySuggester". Add the following line to "requestHandler" by the name of "/suggest".

<str name="suggest.dictionary">mySuggester</str>

Once these configurations are in place, simply restart the solr server. In order to check the suggester, index all the documents in the exampleDocs folder. The suggester index is created when the documents are committed to the index. In order to check the implementation simply use the following URL.

http://localhost:8983/solr/suggest/?suggest=true&suggest.q=and&suggest.count=2

The output will be somewhat similer to this...


<response>
  <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">3</int>
  </lst>
  <lst name="suggest">
    <lst name="mySuggester">
      <lst name="and">
        <int name="numFound">2</int>
        <arr name="suggestions">
        <lst>
          <str name="term">electronics <b>and</b> computer1</str>
          <long name="weight">2199</long>
          <str name="payload"/>
        </lst>
        <lst>
          <str name="term">electronics <b>and</b> stuff2</str>
          <long name="weight">279</long>
          <str name="payload"/>
        </lst>
        </arr>
      </lst>
    </lst>
  </lst>
</response>

We are already getting the suggestions as highlighted. Try getting suggestions for some other partial words like "elec", "comp" and see the output. Let us note down some limitations that I came across while implementing this.

Checkout the type of the field "cat" which is being used for providing suggestions in schema.xml. It is of the type "string" and is both indexed and stored. We can change the field name in our solrconfig.xml to provide suggestions based on some other field, but the field has to be both indexed and stored. We would not want to tokenize the field as it may mess up with the suggestions - so it is recommended to use "string" fields for providing suggestions.

Another flaw, that I came across is that the suggestions do not work on multiValued fields. "cat" for example is multivalued. Do a search on all "electronics" and get the "cat" field.

http://localhost:8983/solr/collection1/select/?q=electronics&fl=cat

We can see that in addition to "electronics", the cat field also contains "connector", "hard drive" and "memory". But a search on those strings does not give any suggestions.

http://localhost:8983/solr/suggest/?suggest=true&suggest.q=hard&suggest.count=2

So, it is recommended that the field be of type "string" and not multivalued. If there are multiple fields on which suggestions are to be provided, it is recommended to merge them into a single "string" field in our index.

3 comments:

Ajay Sharma said...

Hi Jayen,

Thanks for the nice article. I was searching for implementing Infix based suggester in Solr and your article helped me to implement it.

After implementing it, it is working as expected but found that it is giving duplicate results.

For instance, if I am searching it on keywords and keywords are spread across multiple documents, instead of giving multiple keywords, it is giving list of duplicate keywords.

Do you know how to solve it?

Here is my solrconfig implementation for the same.



mySuggester
AnalyzingInfixLookupFactory
text_general
DocumentDictionaryFactory
tags
name_for
true






true
30
mySuggester


suggest


Thanks
Ajay

Ajay Sharma said...

Hi Jayen,

Thanks for the nice article.

I implemented Infix suggester using AnalyzingInfixLookupFactory but it is giving me duplicate results.

Do you know how to solve it?

Here is my solrconfig file.



mySuggester
AnalyzingInfixLookupFactory
text_general
DocumentDictionaryFactory
tags
name_for
true





true
30
mySuggester


suggest




Jayant Kumar said...

Hi Ajay,

I will have to look at the solr config and schema to figure it out.

Regards
Jayant