Whatever....: December 2010

Sunday, December 12, 2010

how to go about apache-solr

I had explored lucene for providing full text search engines. I had even gone into the depth of modifying the core classes of lucene to change and also add new functionality into lucene. Being good at lucene, i never looked at solr. I had the notion that Solr was a simple web interface on top of lucene so it will not be very customizable. But recently my belief was broken. I have been going though solr 1.4.1. Here is a list of features that solr provides by default : http://lucene.apache.org/solr/features.html.

Solr 1.4.1 combines

Lucene 2.9.3 - an older version of lucene. The recent version of lucene 3.0.2 is based on java 5 and has some really good performance improvements over the 2.x version of lucene. I wish we had lucene 3.x on solr. Lucene powers the core full-text search capability of solr.
Tika 0.4 - Again the latest version here is 0.8. Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.
Carrot2 3.1.0 - Here also the latest version is 3.4.2. Carrot is an open source search result clustering engine. It readily integrates with Lucene as well.

To install solr simply download it from http://lucene.apache.org/solr/index.html and untar it. You can launch the solr by simply navigating to <solr_directory>/example directory and running java -jar start.jar. This will start the sample solr server without any data. You can go to http://localhost:8983/solr/admin to see the admin page for solr. To post some sample files to solr simply do <solr_dir>/example/exampledocs$ java -jar post.jar *.xml. This will load example documents in solr server and create an index on them.

Now the real fun begins when you want to index your own site on solr. First of all, you need to define the schema and identify how data will be ported into the index.

For starters
-- copy example directory : cp example myproject
-- go to solr/conf directory : cd myproject/solr/conf
-- ls will show you the directory contents : $ ls
admin-extra.html dataimport.properties mapping-ISOLatin1Accent.txt schema.xml solrconfig.xml stopwords.txt xslt
data-config.xml elevate.xml protwords.txt scripts.conf spellings.txt synonyms.txt

These are the only files in solr which need to be tweaked for getting solr working...
We will see all the files one by one

schema.xml
-- you need to define the fieldType such as String, boolean, binary, int, float, double etc...
-- Each field type has a class and certain properties associated with it.
-- Also you can specify how a fieldtype is analyzed/tokenized/stored in the schema
-- Any filters related to a fieldType can also be specified here.
-- Lets take an example of a text field

# the text field is of type solr.TextField
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    # the analyzer is to be applied during indexing.
    <analyzer type="index">
        # pass the text through the following tokenizer
        <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
        # and then apply these filters
        # use the stop words specified in stopwords.txt for the stopwordfilter
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        # avoid stemming words which are in protwords.txt file
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    </analyzer>
    # for query use the following analyzers and filters
    # generally the analyers/filters for indexing and querying are same
    <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
    </analyzer>
</fieldType>

-- Once we are done defining the field types, we can go ahead and define the fields that will be indexed.
-- Each field can have additional attributes like type=fieldType, stored=true/false, indexed=true/false, omitNorms=true/false
-- if you want to ignore indexing of some fields, you can create an ignored field type and specify the type of the field as ignored

<fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />
<dynamicField name="*" type="ignored" multiValued="true" />

-- in addition to these, you also have to specify the unique-key to enforce uniqueness among multiple documents

<uniqueKey>table_id</uniqueKey>

-- Default Search field and default search operator are among other things that can be specified.

solrconfig.xml
- used to define the configuration for solr
- parameters like datadir (where index is to be stored), and indexing parameters can be specified here.
- You can also configure caches like queryResultCache, documentCache or fieldValueCache and their caching parameters.
- It also handles warming of cache
- There are request handlers for performing various tasks.
- Replication and partitioning are sections in request handling.
- Various search components are also available to handle advanced features like faceted search, moreLikeThis, highlighting
- All you have to do is put the appropriate settings in the xml file and solr will handle the rest.
- Spellchecking is available which can be used to generate a list of alternate spelling suggestions.
- Clustering is also a search component, it integrates search with Carrot2 for clustering. You can select which algorithm you want for clustering - out of those provided by carrot2.
- porting of data can be made using multiple options like xml, csv. There are request handlers available for all these formats
- A more interesting way of porting data is by using the dataImportHandler - it ports data directly from mysql to lucene. Will go in detail on this.
- There is an inbuilt dedup results handler as well. So all you have to do is set it up by telling it which fields to monitor and it will automatically deduplicates the results.

DataImportHandler

For people who use sphinx for search, a major benefit is that they do not have to write any code for porting data. You can provide a query in sphinx and it automatically pulls data out of mysql and pushes it to sphinx engine. DataImportHandler is a similar tool available for linux. You can register a dataimporthandler as a requestHandler

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
    # specify the file which has the config for database connection and query to be fired for getting data
    # you can also specify parameters to handle incremental porting
    <str name="config">data-config.xml</str>
    </lst>
</requestHandler>

data-config.xml

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/databaseName?zeroDateTimeBehavior=convertToNull"
                 user="root"  password="jayant" />
    <document>
        <entity name="outer" pk="table_id"
        query="SELECT table_id, data1, field1, tags, rowname, date FROM mytable"
        deltaImportQuery="SELECT table_id, data1, field1, tags, rowname, date FROM mytable where table_id='${dataimporter.delta.table_id}'"
        deltaQuery="SELECT table_id from mytable where last_modified_time > '${dataimporter.last_index_time}'">
        
        # this is the map which says which column will go into which fieldname in the index
        <field column="table_id" name="table_id" />
        <field column="data1" name="data1" />
        <field column="field1" name="field1" />
        <field column="tags" name="tags" />
        <field column="rowname" name="rowname" />
        <field column="date" name="date" />

            # getting content from another table for this tableid
            <entity name="inner"
            query="SELECT content FROM childtable where table_id = '${outer.table_id}'" >
            <field column="content" name="content" />
            </entity>
        </entity>
    </document>
</dataConfig>

Importing data

Once you start the solr server using java -jar start.jar, you can see the server working on
http://localhost:8983/solr/

It will show you a "welcome message"

To import data using dataimporthandler use
http://localhost:8983/solr/dataimport?command=full-import (for full import)
http://localhost:8983/solr/dataimport?command=delta-import (for delta import)

To check the status of dataimporthandler use
http://localhost:8983/solr/dataimport?command=status

Searching

The ultimate aim of solr is searching. So lets see how can we search in solr and how to get results from solr.
http://localhost:8983/solr/select/?q=solr&start=0&rows=10&fl=rowname,table_id,score&sort=date desc&hl=true&wt=json

Now this says that
- search for the string "solr"
- start from 0 and get 10 results
- return only fields : rowname, table_id and score
- sort by date descending
- highlight the results
- return output as json

All you need to do is process the output and the results.

There is a major apprenhension in using solr due to the reason that it provides an http interface for communication with the engine. But i dont think that is a flaw. Ofcourse you can go ahead and create your own layer on top of lucene for search, but then solr uses some standards for search and it would be difficult to replicate all these standards. Another option is to create a wrapper around the http interface with limited functionality that you need. Http is an easier way of communication as compred to defining your own server and your own protocols.

Definitely solr provides an easy to use - ready made solution for search on lucene - which is also scalable (remember replication, caching and partitioning). And in case the solr guys missed something, you can pick up their classes and modify/create your own to cater to your needs.

Carnation - is it really worth getting your car serviced there ??

Before I go ahead with a walk through of my experience, I would like you to answer some questions
1. Was the cost of your service at carnation higher than the cost of your normal service at your earlier service center?
2. Were some parts found to be faulty? Are you sure they were faulty?
3. Did you ever get a call asking for a feedback?
4. If you were offered tea/coffee and you said yes - did you actually get your tea/coffee?

Now for my experience.

It was one fine day that I decided that instead of going to my gm service center, I decided to go to carnation. I could say that it was a bad day because I did not know that I would be duped. So on 20th of November, i took my aveo for service at the Noida sector 63 service station of carnation.

Upon entry i saw that there are not so many cars there. Generally on a saturday or sunday, the service centers are heavily loaded. I thought that it might be better, because I would get better attention. I did not realize that this was due to the fact that most of the prople there are either too rich to care or first-timers like me. My car was taken to the workshop and examined.

They opened the bonet and showed me that the car has hit something due to which the fan is rubbing against some "lamda". I would have to get a body work done. I was like hell. Cant you move the fan or twist the lamda. They told me that that neither can be done without hurthing the car. I was like... ok... You are the one who know more about the car. Upon delivery, i saw that in addition to the body work (pulling the front grill out), they have also moved the fan to the left --- what? but they told me that this was not possible. So a simple "moving the fan" cost me 3000/-. I could have definitely avoided the body work.

Another mechanic came hopping around with his "light in mobile" and told me that the drive belt is faulty. It was totally dark and i was unable to see. I thought that he might be having "night eyes" like to owl to be able to see a fault in such a dark place. But then they showed it to me from below and it was really cracked up. I think that was what which gave me the confidence to trust(wrongly) in them.

Another mechanic hopped and told me that the suspention is leaking. The supervisor who was advicing me first told me that the suspention was fine. But then he looked at the mechanic and then back to the suspention and then according to him the suspention was faulty. It was leaking and should be changed. I should have said no. Now the new suspention which they have put in creaks like anything. I am not an expert and i do not know whether they have put in genuine parts or duplicate parts. They charged me around 2000/- for the suspension. I could have saved that.

And eventually the most absurd thing was when the supervisor told me that the disk brake pads were totally rubbed out. I was like - what? How can he see it without touching or opening them? Maybe they have a way which i am not aware of. I was like ok - replace them. They charged me 2000/- something for the brake pads and 500/- something for the labour. Now, by luck i saw them replace the brake pads. First of all they were not even 50% rubbed. And secondly it was a 5 minute job - just removing some screws, changing the brake pads and putting them back. I paid 500/- for this 5 minute labour job. And i am not even sure now whether they are genuine or not.

One another thing that they replaced was a bearing. The supervisor called me and told me that it is faulty and will not last for even 300 kms. I was like - ok change it. When i went to the shop and took the spare - he told me that it is faulty and i could feel the vibrations if i rotate the bearing. I felt nothing. Except for monetary gain, i am not sure why he changed it.

So, i would have gone to my normal gm service center, this service would have costed me 3000 (body work) + 2500 (brake pads) + 2000 (suspention) + 1500 (bearing) = 9000/- less. The service which was 14000/- would then be only 5000/-.

Now the question is why did I not feel that something was wrong - all the while they were duping me. The reason being that i am not a mechanic. I do not know what part will run and what will not run. Hell, i am not even aware of the location of all the parts. So, it is easy to dupe me. The second reason was that I wanted to get my car serviced by carnation - at least once experience what they have to say? In fact I should have drove out when they told me that the drive belt was faulty without even looking at it.

So, when did i realize that i have been duped? 1 week after the service, the car was as rough as it was before the service. That was when my doubt was confirmed that the service was not worth it. I had realized that i have been duped of a few parts. But 1 week after when the car was as rough as before, I took a look at all the parts. And then I realized that I have been duped.

What could i do? I sent a mail to j.khattar@carnation.in regarding the fact that i was duped. There was one call where the lady called at around 11 am - and started explaining that the parts that they had changed were supposed to be changed. I told her that I was in a meeting and asked her to call back in 1 hour. I did not get back any calls.

When I get my car serviced at GM, I get a call after a week regarding how the car is performing - or if i am facing any issues. But I never got any such calls from carnation. I believe the business model that Mr. Khatter is following is of duping first-timers. They do not expect the people to return. If by mistake anyone does, he is worth duping again.

Here is the mail that I sent to j.khattar@carnation.in. I did not get any response though.

Hi,

This is the first time i visited Carnation for servicing my car - Aveo 1.6 LT. For the past 4 years, the car was being serviced by the GM authorized service center in sector 8 noida. The average cost of my service for the past 4 years was around 3000/- per service. At carnation the cost of 1 service was 13800(something) It has been 10 days after the service and the car feels very rough. My previous experience is that the car becomes rough after almost 2 months - when i used to get my car serviced at GM.

5 parts were replaced and a body repairing work was done.

1. brake pads
2. shocker
3. spark plugs
4. drive belt
5. some bearing

Out of the 5 parts,
- the brake pads were fine - i have the old ones and they could still run for another 10,000 kms.
- the new shocker creaks a lot - i am not sure if they put orignial or duplicate.
- the bearing - looked fine to me - i am not sure why it was changed.

In addition to this the body repair for 3000/- was done, which was not required.
I was told that body repair was the only solution - i should have referred a local mechanic as well.

The personal experience was good - but i get almost similar personal touch at GM as well.

Would i get my car serviced at carnation again ?
No

Reason :
I am not a car mechanic. I cannot tell whether a part is defective or not, genuine or duplicate. The service is not good.
Parts are unnecessarily replaced - even when the current parts are in working condition and doing fine. The service is very expensive - and not worth it.

Service location:
Sector 63, Noida.

2 of my friends who have got their car serviced at carnation have similar feedback.

Regards
Jayant Kumar

This review is intended to prevent people from getting duped by carnation. People at carnation ask you for tea/coffee. But if you say yes, you would never get your tea/coffee. Have you noticed it?

Samsung and me

I and samsung do not have a very long relation, but whatever relation we have is very flawed. I have been trying to "believe" in samsung products, in their "limitless" support. But samsung makes it difficult for me to "believe" in them.

Lets take case by case what has happened in the past 4 years. I got a samsung refrigerator and a washing machine around 3 years back. The life of the refrigerator was not even 2 years. Within 1.5 years the refrigerator was out of gas. For those who are new to the refrigerator business, I would like to make it clear that refrigerator does use some gas for cooling purposes. Now i have used 2 refrigerators in my lifetime of 25 years. The first one was a really old one "leonard". It ran for around 15 years - after which it lost its color and the gas. We sold it at a profit after using it for 15 years. The next one was "kelvinator". It is still running. It needed a gas change after around 6 years - that too because I moved to delhi and there was no one at home for around 6 months. But my dear Samsung refrigerator lost its gas in 1.5 years only. I had the notion that it would be one in a million cases. But after 6 months of gas change the compressor went for a toss. Thanks to Samsung, for even it does not provide reliable products it does provide a guarantee - due to which i could get the compressor change. But both times i had to pay for the gas change. If it goes on like this, by the time the refrigerator is 10 years old (if it does run for 10 years), the cost of maintenance would exceed the cost of the refrigerator. Bang - that's where samsung is making money...

My second experience was with the washing machine i had bought. I still have an LG washing machine at home - and it still runs like a charm. In fact i enjoy washing clothes. It is still as noiseless as it was when we bought it around 10 years ago. But Samsung washing machine - it sounds like an electric drill. We have to wash clothes only in the afternoon. If we try washing clothes in the morning, it disturbs the sleep of people in this home as well as those of the neighbours. The dryer keeps on running at full speed even if you shut off the machine and open the dryer lid. It seems that the brake pads are worn off. What happened to "high" safety standards (if any) of Samsung.

I had got a Samsung mobile around 3 years back. It started facing problems after 1 year and i had to get it resolved. But recently I got another Samsung mobile - and it was within a month that i realized that this purchase was a big mistake. Within a month, the mobile stopped charging. Now this is a very interesting story.

I went to the samsung service center in sector 18, noida. The service center has closed and there was no clue as to where it has shifted. Even the customer care were not aware that the service center was closed. I went to a samsung dealer in noida sector 18, to enquire about another service center - and he did not know any service centers. It seems that once the product has been sold, the dealers forget the customer. What happened to "customer is the king" motto?? So I moved to another service center - which is in sector 29, Noida. The scene there was humorous and frustrating. There was a person who had the exact similar problem with his samsung mobile. He has been trying to rectify his mobile for the past 1 month - but the service center were unable to rectify it. They never have the required parts.

I had to sit in queue and wait for my call. Once i was at the reception, i told him my problem. The "technician" checked my mobile and verified the issue. He then opened up the mobile completely and told me that i will have to deposit the mobile for 7 days to get it fixed. It seems that the charging jack was faulty. I was like hey - what about my data? and why 7 days? Well according to them my data would be safe and 7 days are required to get the new part and fix it. And there is no guarantee that the new phone will not be scratched. The service people we so adamant that they denied to take the phone if i wrote "no scratch" on the form.

That was when i called up the customer care again and he connected me to the manager - Mr Ranjit. I had to repeat my story 2-3 times - I spent around an hour on the phone - after which the manager told me that i do not need to deposit the phone - all I need to do is give a photocopy of my bill. They will arrange for the part and once it is here he will personally call me to get it fixed. I was about to say "liar liar!!!", but I let it go. After waiting for 3 days, i tried calling the service center - but no one picked up the phone. I tried calling again after 6 days - it is easier to get your calls picked up in government offices. Eventually I gave up. No point in running after them. Maybe 6 months down the line, when my phone is all used up and scratched, i will dump it in the service center and purchase a "non" samsung phone.

It is just a random thought, whether there is a problem between me and Samsung - that Samsung products do not suit me or it is a general case. For one thing, i never had a samsung TV. Onida and LG both have been running for more than 10 years now without a single issue. It makes more sense to provide better products rather than "limitless" service (troubles) for endless problems on inferior products.