
Friday, July 06, 2007

Document scoring in lucene - Part 2

This is an addition to the previous post, document scoring/calculating relevance in lucene. If you find that post inadequate, you can refer to scoring@lucene.org and the formula at Default similarity formula.

What I did was create some txt files and some code to index them, and then tried to find out in practice how lucene calculates relevance using the DefaultSimilarity class. Here is the file and the source code.

file: file_a.txt
jayant project manager project leader team leader java linux c c++ lucene apache solaris aix unix minix gnome kde ubuntu redhat fedora rpm deb media player vlc evolution exchange microsoft java vb vc vc++ php mysql java

source code: FIndexer.java
import java.io.*;
import org.apache.lucene.index.*;
import org.apache.lucene.document.*;
import org.apache.lucene.analysis.*;

public class FIndexer
{
    public static void main(String[] args) throws Exception
    {
        String src_path = args[0];
        String des_path = args[1];

        File f_src = new File(src_path);
        File f_des = new File(des_path);

        if (!f_src.isDirectory() || !f_des.isDirectory())
        {
            System.out.println("Error : " + f_src + " || " + f_des + " is not a directory");
            System.exit(1);
        }

        IndexWriter writer = new IndexWriter(f_des, new WhitespaceAnalyzer(), true);

        File[] files = f_src.listFiles();
        for (int x = 0; x < files.length; x++)
        {
            if (!files[x].isFile())
                continue; // index only plain files, skip sub-directories

            BufferedReader br = new BufferedReader(new FileReader(files[x]));
            StringBuffer content = new StringBuffer();
            String line = null;
            while ((line = br.readLine()) != null)
            {
                content.append(line).append("\n");
            }
            br.close();

            Document doc = new Document();
            doc.add(new Field("name", files[x].getName(), Field.Store.YES, Field.Index.NO));
            doc.add(new Field("content", content.toString(), Field.Store.NO, Field.Index.TOKENIZED));
            doc.add(new Field("content2", content.toString(), Field.Store.NO, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        writer.optimize();
        writer.close();
    }
}


I created copies of file_a.txt as file_b.txt and file_c.txt, and edited file_c.txt. So file_a.txt and file_b.txt have the same content (the word java occurs 3 times in both files), while file_c.txt has the word java occurring only 2 times.

I created an index using FIndexer.java and used luke to fire queries on the index.

First I did a search for content:java, and as expected I got 3 results with the following scores.





Rank  Score   Filename
0     0.1928  file_b.txt
1     0.1928  file_a.txt
2     0.1574  file_c.txt


Let's take the score for file_b.txt and see how it is calculated. The score is a product of
tf = 1.7321 (java occurs 3 times, tf = sqrt(3))
idf = 0.7123 (document freq = 3)
fieldNorm = 0.1562 (the length normalization stored in the index for the field)

And the score of file_c.txt is a product of
tf = 1.4142 (java occurs 2 times, tf = sqrt(2))
idf = 0.7123 (document freq = 3)
fieldNorm = 0.1562

Here the score is equivalent to the fieldWeight, since there is just one field in the query. If more than one field is used in the query, then the score for each field is the product of its queryWeight and fieldWeight, and these per-field weights are added together.
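The single-field arithmetic above can be reproduced with plain Java. This is just a sketch, assuming DefaultSimilarity's defaults (tf = sqrt(freq), idf = 1 + ln(numDocs/(docFreq + 1))) and taking 0.15625 as the byte-precision fieldNorm stored in the index for these documents:

```java
// Sketch of the DefaultSimilarity arithmetic for the query content:java.
public class ScoreSketch {
    public static void main(String[] args) {
        int numDocs = 3, docFreq = 3;                 // java occurs in all 3 files
        double idf = 1 + Math.log((double) numDocs / (docFreq + 1)); // 0.7123
        double fieldNorm = 0.15625;                   // stored lengthNorm (byte precision)
        double tfB = Math.sqrt(3);                    // java occurs 3 times in file_b.txt
        double tfC = Math.sqrt(2);                    // java occurs 2 times in file_c.txt
        System.out.printf("file_b score = %.4f%n", tfB * idf * fieldNorm); // 0.1928
        System.out.printf("file_c score = %.4f%n", tfC * idf * fieldNorm); // 0.1574
    }
}
```

Running this reproduces the 0.1928 and 0.1574 scores that luke reported.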

So let's change the query to content:java^5 content2:java^2. Here I am boosting java in content by 5 and java in content2 by 2; that is, java in content is 2.5 times more important than java in content2. Let's check the scores.





Rank  Score   Filename
0     0.2506  file_b.txt
1     0.2506  file_a.txt
2     0.2046  file_c.txt


Let's look at how the score was calculated.

Again for file_b.txt, the score is the sum of the two clause weights:
0.2506 = 0.1790 (weight of content:java^5) + 0.0716 (weight of content2:java^2)

Out of which, the weight of content:java^5 is
= 0.9285 (queryWeight) [ 5 (boost) * 0.7123 (idf, docFreq=3) * 0.2607 (queryNorm) ]
* 0.1928 (fieldWeight) [ 1.7321 (tf=3) * 0.7123 (idf, docFreq=3) * 0.1562 (fieldNorm) ]
= 0.1790

And the weight of content2:java^2 is
= 0.3714 (queryWeight) [ 2 (boost) * 0.7123 (idf, docFreq=3) * 0.2607 (queryNorm) ]
* 0.1928 (fieldWeight) [ 1.7321 (tf=3) * 0.7123 (idf, docFreq=3) * 0.1562 (fieldNorm) ]
= 0.0716


The same formula is used for calculating the score of file_c.txt, except that the term frequency is 1.4142 (java occurs 2 times in content and in content2, so tf = sqrt(2)).
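As a sanity check, the boosted-query arithmetic above can also be reproduced in plain Java. Again this is a sketch assuming DefaultSimilarity's formulas; 0.15625 is taken as the byte-precision fieldNorm stored in the index:

```java
// Sketch of the score for file_b.txt under content:java^5 content2:java^2.
public class BoostedScoreSketch {
    public static void main(String[] args) {
        // idf = 1 + ln(numDocs / (docFreq + 1)) with numDocs = 3, docFreq = 3
        double idf = 1 + Math.log(3.0 / 4);                  // 0.7123
        double fieldNorm = 0.15625;                          // stored lengthNorm (byte precision)
        // queryNorm = 1 / sqrt(sumOfSquaredWeights) over both boosted clauses
        double sumSq = Math.pow(5 * idf, 2) + Math.pow(2 * idf, 2);
        double queryNorm = 1 / Math.sqrt(sumSq);             // 0.2607
        // fieldWeight is identical for content and content2 (same text in both)
        double fieldWeight = Math.sqrt(3) * idf * fieldNorm; // 0.1928 for file_b.txt
        double score = (5 * idf * queryNorm) * fieldWeight   // content:java^5  -> 0.1790
                     + (2 * idf * queryNorm) * fieldWeight;  // content2:java^2 -> 0.0716
        System.out.printf("%.4f%n", score);                  // prints 0.2506
    }
}
```

The printed value matches the 0.2506 that luke shows for file_b.txt.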

This explains a little about scoring in lucene. The scoring can be altered by either changing the DefaultSimilarity class or extending DefaultSimilarity and overriding certain factors/calculations in it. More on how to change the scoring formula later.

Saturday, July 08, 2006

document scoring/calculating relevance in lucene

How is the score of a document calculated in lucene? Well, first of all, what is scoring and what does it do? Scoring is the factor that decides the order in which documents are returned after a search is done using lucene. Something like "order by relevance". If we provide an order by/sort by clause in the search query, then ordering is done on that. Else, ordering is done on the score of the document. All documents returned from the search have a score number associated with them. The document with the highest score comes on top and the one with the least score comes at the bottom.

Scoring is implemented in the similarity class of lucene (org.apache.lucene.search.Similarity).

Lets see how the score of a document is calculated.

The score of a query q for document d is defined as follows (t refers to a term):





score(q,d) =
    Σ ( tf(t in d) * idf(t)^2 * getBoost(t in q) * getBoost(t.field in d) * lengthNorm(t.field in d) ) * coord(q,d) * queryNorm(sumOfSquaredWeights)
  t in q


where


sumOfSquaredWeights =
    Σ ( idf(t) * getBoost(t in q) )^2
  t in q


Now, let's decode it...

tf = term/phrase(t) frequency in a document(d). Terms and phrases repeated in a document indicate the topic of the document, so implementations of this method usually return larger values when freq is large, and smaller values when freq is small.

idf = inverse document frequency, computed from the document frequency of the term (the number of documents which contain it). Terms that occur in fewer documents are better indicators of topic, so implementations of this method usually return larger values for rare terms, and smaller values for common terms.

getBoost (t in q) = Gets the boost for this clause in the query. Documents matching this clause will (in addition to the normal weightings) have their score multiplied by b. The boost b=1.0 by default.

getBoost (t.field in d) = Gets the boost for the field in the document if the term is in the field. Again boost for the field b is 1.0 by default.

lengthNorm(t.field in d) = normalization value for a field, given the total number of terms contained in the field. These values, together with field boosts, are stored in the index and multiplied into scores for hits on each field by the search code. Matches in longer fields are less precise, so implementations of this method usually return smaller values when numTokens is large, and larger values when numTokens is small. These values are computed during IndexWriter.addDocument(Document d) and stored. Thus they have limited precision, and documents must be re-indexed if this method is altered.

coord = Score factor based on the fraction of all query terms that a document contains. This value is multiplied into scores. The presence of a large portion of the query terms indicates a better match with the query, so implementations of this method usually return larger values when the ratio between these parameters is large and smaller values when the ratio between them is small.

queryNorm = Computes the normalization value for a query given the sum of the squared weights of each of the query terms. This value is then multiplied into the weight of each query term. This does not affect ranking, but rather just attempts to make scores from different queries comparable.
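The factors above have concrete implementations in DefaultSimilarity. The following plain-Java sketch shows them, assuming the defaults tf = sqrt(freq), idf = 1 + ln(numDocs/(docFreq + 1)), lengthNorm = 1/sqrt(numTokens), coord = overlap/maxOverlap and queryNorm = 1/sqrt(sumOfSquaredWeights); the real class additionally encodes lengthNorm into a single byte, losing precision:

```java
// Plain-Java sketch of DefaultSimilarity's factor implementations (for illustration only).
public class SimilaritySketch {
    static double tf(double freq) { return Math.sqrt(freq); }
    static double idf(int docFreq, int numDocs) {
        return 1 + Math.log((double) numDocs / (docFreq + 1));
    }
    static double lengthNorm(int numTokens) { return 1 / Math.sqrt(numTokens); }
    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }
    static double queryNorm(double sumOfSquaredWeights) {
        return 1 / Math.sqrt(sumOfSquaredWeights);
    }

    public static void main(String[] args) {
        // Example: a term occurring 3 times in a 37-token field,
        // present in all 3 documents of a 3-document index.
        double w = tf(3) * idf(3, 3) * lengthNorm(37);
        System.out.println(w);
    }
}
```

Extending DefaultSimilarity and overriding any one of these methods is how the scoring is customised in practice.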

So effectively, the score of a document returned by a search is calculated keeping in mind the frequency of the term/phrase in the document, the number of documents containing the term, the boosting factor of a clause in the query, the boosting factor of a field, the total number of terms in a field, the fraction of all query terms contained in the document and a normalization factor based on the idf and boost of the terms in the query.

If you are searching a single field using a query, the results should be very relevant, provided the text entered is meaningful and grammatically correct. With this logic, even if you are searching multiple fields, the order of relevance of the results should be good. But relevance can be increased further by assigning a boost factor to fields, either during the query formation process or during the indexing process. Fields can be boosted according to their order of importance, resulting in very relevant results.

Boosting during the query formation process provides the flexibility to change the boost factor dynamically, but slows down the search and scoring due to dynamic computation of the score. Whereas if fields are boosted during the indexing process, the search will be much faster, but you would lose the flexibility of altering the boost factor dynamically.

Anything I have missed, please let me know and I will add it...