Saturday, June 15, 2013

Step by step: setting up SolrCloud

Here is a quick step-by-step guide for setting up SolrCloud on a few machines. We will set up SolrCloud to match a production environment - that is, ZooKeeper will run as a separate ensemble and Solr will run inside Tomcat instead of the default Jetty.

Apache ZooKeeper is a centralized service for maintaining configuration information. In the case of SolrCloud, the Solr configuration is maintained in ZooKeeper. Let's set up ZooKeeper first.

We will be setting up SolrCloud on 3 machines. Make the following entries in the /etc/hosts file on all machines.

172.22.67.101 solr1
172.22.67.102 solr2
172.22.67.103 solr3

Let's set up the ZooKeeper ensemble on 2 machines.

Download and untar ZooKeeper into the /opt/zookeeper directory on both servers solr1 & solr2. On both servers, do the following:

root@solr1$ mkdir /opt/zookeeper/data
root@solr1$ cp /opt/zookeeper/conf/zoo_sample.cfg /opt/zookeeper/conf/zoo.cfg
root@solr1$ vim /opt/zookeeper/conf/zoo.cfg

Make the following changes in the zoo.cfg file

dataDir=/opt/zookeeper/data
server.1=solr1:2888:3888
server.2=solr2:2888:3888

Save the zoo.cfg file.

server.x=[hostname]:nnnnn[:nnnnn] : here x should match the server id in the myid file in the ZooKeeper data directory.
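For reference, if the remaining settings from zoo_sample.cfg (tickTime, initLimit, syncLimit, clientPort) are left at their defaults, the resulting zoo.cfg would look something like this:

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/opt/zookeeper/data
clientPort=2181
server.1=solr1:2888:3888
server.2=solr2:2888:3888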

Assign a different id to each ZooKeeper server.

on solr1

root@solr1$ echo 1 > /opt/zookeeper/data/myid

on solr2

root@solr2$ echo 2 > /opt/zookeeper/data/myid

Start ZooKeeper on both servers.

root@solr1$ cd /opt/zookeeper
root@solr1$ ./bin/zkServer.sh start
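
You can verify that the ensemble is up by checking the status on each server - one should report Mode: leader and the other Mode: follower.

root@solr1$ /opt/zookeeper/bin/zkServer.sh status
root@solr2$ /opt/zookeeper/bin/zkServer.sh status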

Note : in the future, when you need to reset the cluster/shard information, do the following

root@solr1$ ./bin/zkCli.sh -server solr1:2181
[zk: solr1:2181(CONNECTED) 0] rmr /clusterstate.json

Now let's set up Solr and start it with the external ZooKeeper.

Install Solr on all 3 machines. I installed it in the /opt/solr folder.
Start the first Solr instance and upload the Solr configuration into the ZooKeeper cluster.

root@solr1$ cd /opt/solr/example
root@solr1$ java -DzkHost=solr1:2181,solr2:2181 -Dbootstrap_confdir=solr/collection1/conf/ -DnumShards=2 -jar start.jar

Here the number of shards is specified as 2, so our cluster will have 2 shards with multiple replicas per shard.
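
To confirm that the configuration was uploaded, you can list the /configs znode from zkCli; with -Dbootstrap_confdir and no -Dcollection.configName, the config set is typically stored under the name "configuration1".

root@solr1$ cd /opt/zookeeper
root@solr1$ ./bin/zkCli.sh -server solr1:2181
[zk: solr1:2181(CONNECTED) 0] ls /configs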

Now start Solr on the remaining servers.

root@solr2$ cd /opt/solr/example
root@solr2$ java -DzkHost=solr1:2181,solr2:2181 -DnumShards=2 -jar start.jar

root@solr3$ cd /opt/solr/example
root@solr3$ java -DzkHost=solr1:2181,solr2:2181 -DnumShards=2 -jar start.jar

Note : it is important to pass the "numShards" parameter; otherwise numShards defaults to 1.

Point your browser to http://172.22.67.101:8983/solr/
Click on "Cloud" -> "Graph" in the left-hand section and you can see that there are 2 nodes in shard 1 and 1 node in shard 2.

Let's feed some data and see how it is distributed across the shards.

root@solr3$ cd /opt/solr/example/exampledocs
root@solr3$ java -jar post.jar *.xml
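
post.jar issues a commit at the end of the run; if you index through some other client, you can trigger a commit explicitly via the standard update handler, for example:

root@solr3$ curl 'http://172.22.67.101:8983/solr/collection1/update?commit=true'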

Let's check how the data was distributed between the two shards. Head back to the Solr admin cloud graph page at http://172.22.67.101:8983/solr/.

  • Click on the first shard's collection1; you may have 14 documents in this shard.
  • Click on the second shard's collection1; you may have 18 documents in this shard.
  • Check the replicas for each shard; they should have the same counts.

At this point, we can start issuing some queries against the collection:

Get all documents in the collection:
http://172.22.67.101:8983/solr/collection1/select?q=*:*

Get all documents in the collection belonging to shard1:
http://172.22.67.101:8983/solr/collection1/select?q=*:*&shards=shard1

Get all documents in the collection belonging to shard2:
http://172.22.67.101:8983/solr/collection1/select?q=*:*&shards=shard2
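
To query only the core on a particular node, without distributing the request across the cluster, the standard distrib=false parameter can be used:
http://172.22.67.101:8983/solr/collection1/select?q=*:*&distrib=false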

Let's check what ZooKeeper has for the cluster state.

root@solr3$ cd /opt/zookeeper/
root@solr3$ ./bin/zkCli.sh -server solr1:2181
[zk: 172.22.67.101:2181(CONNECTED) 1] get /clusterstate.json
{"collection1":{
    "shards":{
      "shard1":{
        "range":"80000000-ffffffff",
        "state":"active",
        "replicas":{
          "172.22.67.101:8983_solr_collection1":{
            "shard":"shard1",
            "state":"active",
            "core":"collection1",
            "collection":"collection1",
            "node_name":"172.22.67.101:8983_solr",
            "base_url":"http://172.22.67.101:8983/solr",
            "leader":"true"},
          "172.22.67.102:8983_solr_collection1":{
            "shard":"shard1",
            "state":"active",
            "core":"collection1",
            "collection":"collection1",
            "node_name":"172.22.67.102:8983_solr",
            "base_url":"http://172.22.67.102:8983/solr"}}},
      "shard2":{
        "range":"0-7fffffff",
        "state":"active",
        "replicas":{"172.22.67.103:8983_solr_collection1":{
            "shard":"shard2",
            "state":"active",
            "core":"collection1",
            "collection":"collection1",
            "node_name":"172.22.67.103:8983_solr",
            "base_url":"http://172.22.67.103:8983/solr",
            "leader":"true"}}}},
    "router":"compositeId"}}

It can be seen that there are 2 shards. Shard 1 has 2 replicas and shard 2 has only 1 replica. As you keep adding more nodes, the number of replicas per shard will keep increasing.

Now let's configure Solr to run inside Tomcat and add the ZooKeeper-related and numShards configuration to Tomcat.

Install Tomcat in the /opt folder on all machines, and create the following Tomcat context file on all machines.

cat /opt/apache-tomcat-7.0.40/conf/Catalina/localhost/solr.xml
<?xml version="1.0" encoding="UTF-8"?>
<Context path="/solr"
     docBase="/opt/solr/example/webapps/solr.war"
     allowLinking="true"
     crossContext="true"
     debug="0"
     antiResourceLocking="false"
     privileged="true">

     <Environment name="solr/home" override="true" type="java.lang.String" value="/opt/solr/example/solr" />
</Context>
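
Note: with Solr 4.3 and later, solr.war no longer bundles the SLF4J/log4j jars, so you may also need to copy the logging jars and log4j.properties from the Solr example into Tomcat's lib directory on each machine. Assuming the default Solr example layout:

root@solr1$ cp /opt/solr/example/lib/ext/*.jar /opt/apache-tomcat-7.0.40/lib/
root@solr1$ cp /opt/solr/example/resources/log4j.properties /opt/apache-tomcat-7.0.40/lib/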

On solr1 & solr2, create the solr.xml file for shard1:

root@solr1$ cat /opt/solr/example/solr/solr.xml

<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true" zkHost="solr1:2181,solr2:2181"> 
  <cores defaultCoreName="collection1" adminPath="/admin/cores" zkClientTimeout="${zkClientTimeout:15000}" hostPort="8080" hostContext="solr">
    <core loadOnStartup="true" shard="shard1" instanceDir="collection1/" transient="false" name="collection1"/>
  </cores>
</solr>
 

On solr3, create the solr.xml file for shard2:

root@solr3$ cat /opt/solr/example/solr/solr.xml

<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true" zkHost="solr1:2181,solr2:2181"> 
  <cores defaultCoreName="collection1" adminPath="/admin/cores" zkClientTimeout="${zkClientTimeout:15000}" hostPort="8080" hostContext="solr">
    <core loadOnStartup="true" shard="shard2" instanceDir="collection1/" transient="false" name="collection1"/>
  </cores>
</solr>
 
 

Set the numShards variable as part of the Solr startup environment variables (via Tomcat's setenv.sh) on all machines.

root@solr1$ cat /opt/apache-tomcat-7.0.40/bin/setenv.sh
export JAVA_OPTS=' -Xms4096M -Xmx8192M -DnumShards=2 '

To be on the safer side, clean up the clusterstate.json in ZooKeeper. Now start Tomcat on all machines and check the catalina.out file for any errors. Once all nodes are up, you should be able to point your browser to http://172.22.67.101:8080/solr -> Cloud -> Graph and see the 3 nodes which form the cloud.
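
Assuming the default locations used above, the cleanup (run once against ZooKeeper) and the Tomcat startup (run on every machine) would look something like this:

root@solr1$ /opt/zookeeper/bin/zkCli.sh -server solr1:2181
[zk: solr1:2181(CONNECTED) 0] rmr /clusterstate.json
[zk: solr1:2181(CONNECTED) 1] quit
root@solr1$ /opt/apache-tomcat-7.0.40/bin/startup.sh
root@solr1$ tail -f /opt/apache-tomcat-7.0.40/logs/catalina.out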

Let's add a new core to shard 2. It will be added as a replica of the current node on shard 2.

http://172.22.67.103:8080/solr/admin/cores?action=CREATE&name=collection1_shard2_replica2&collection=collection1&shard=shard2
And let's check clusterstate.json now:

root@solr3$ cd /opt/zookeeper/
root@solr3$ ./bin/zkCli.sh -server solr1:2181
[zk: 172.22.67.101:2181(CONNECTED) 2] get /clusterstate.json
{"collection1":{
    "shards":{
      "shard1":{
        "range":"80000000-ffffffff",
        "state":"active",
        "replicas":{
          "172.22.67.101:8080_solr_collection1":{
            "shard":"shard1",
            "state":"active",
            "core":"collection1",
            "collection":"collection1",
            "node_name":"172.22.67.101:8080_solr",
            "base_url":"http://172.22.67.101:8080/solr",
            "leader":"true"},
          "172.22.67.102:8080_solr_collection1":{
            "shard":"shard1",
            "state":"active",
            "core":"collection1",
            "collection":"collection1",
            "node_name":"172.22.67.102:8080_solr",
            "base_url":"http://172.22.67.102:8080/solr"}}},
      "shard2":{
        "range":"0-7fffffff",
        "state":"active",
        "replicas":{
          "172.22.67.103:8080_solr_collection1":{
            "shard":"shard2",
            "state":"active",
            "core":"collection1",
            "collection":"collection1",
            "node_name":"172.22.67.103:8080_solr",
            "base_url":"http://172.22.67.103:8080/solr",
            "leader":"true"},
          "172.22.67.103:8080_solr_collection1_shard2_replica2":{
            "shard":"shard2",
            "state":"active",
            "core":"collection1_shard2_replica2",
            "collection":"collection1",
            "node_name":"172.22.67.103:8080_solr",
            "base_url":"http://172.22.67.103:8080/solr"}}}},
    "router":"compositeId"}}

Similar to adding cores, you can unload and delete a core in Solr:

http://172.22.67.101:8080/solr/admin/cores?action=UNLOAD&core=collection1_shard2_replica2&deleteIndex=true
More details can be obtained from

http://wiki.apache.org/solr/SolrCloud

Wednesday, June 12, 2013

How BigPipe works

BigPipe is a concept invented by Facebook to help speed up page load times. It parallelizes browser rendering and server processing to achieve maximum efficiency. To understand BigPipe, let's first see how a user request-response cycle is executed in the traditional scenario:

  • The browser sends an HTTP request to the web server.
  • The web server parses the request, pulls data from the storage tier, then formulates an HTML document and sends it to the client in an HTTP response.
  • The HTTP response is transferred over the Internet to the browser.
  • The browser parses the response from the web server, constructs a DOM tree representation of the HTML document, and downloads the CSS and JavaScript resources referenced by the document.
  • After downloading the CSS resources, the browser parses them and applies them to the DOM tree.
  • After downloading the JavaScript resources, the browser parses and executes them.

In this scenario, while the web server is processing and creating the HTML document, the browser sits idle; and while the browser is rendering the HTML page, the web server remains idle.

BigPipe breaks the page into smaller chunks known as pagelets, and runs page rendering in the browser and processing on the server as parallel processes, speeding up page load time.

The request-response cycle in the BigPipe scenario is as follows:

  • The browser sends an HTTP request to the web server.
  • The server quickly renders a page skeleton containing the <head> tags and a body with empty div elements which act as containers for the pagelets. The HTTP connection to the browser stays open as the page is not yet finished.
  • The browser starts downloading the BigPipe JavaScript library and after that it starts rendering the page.
  • The PHP server process is still executing and is building the pagelets. Once a pagelet has been completed, its results are sent to the browser inside a BigPipe.onArrive(...) JavaScript tag.
  • The browser injects the HTML code for the received pagelet into the correct place. If the pagelet needs any CSS resources, those are also downloaded.
  • After all pagelets have been received, the browser starts loading all external JavaScript files needed by those pagelets asynchronously.
  • After the JavaScript files are downloaded, the browser executes all inline JavaScript.

This results in a parallel system in which the browser renders pagelets as they are being generated on the server. From the user's perspective the page is rendered progressively: the initial page content becomes visible much earlier, which dramatically improves the perceived latency of the page.

Source: https://www.facebook.com/note.php?note_id=389414033919
Open BigPipe implementation: https://github.com/garo/bigpipe

Monday, June 10, 2013

Getting started with replication from MySQL to MongoDB

Use Tungsten Replicator to replicate from MySQL to MongoDB.

MySQL tables are equivalent to collections in MongoDB. The replication works by replicating inserts and updates, but all DDL statements on MySQL are ignored...

Replication in detail