
Monday, February 20, 2012

scaling using mongodb : Map-Reduce on sharded collection (part 3)

The idea here is to create a sharded database and then run map-reduce on it to process the data. This is a very basic example. I created a sharded collection "posts" with the following structure, and the goal is to find the count of each tag across the complete sharded collection. I am using two machines named "241" and "243" as shards for mongodb. The mongos service is running on a third machine, "249".

Input collection structure :

mongos> db.posts.find()
{ "_id" : ObjectId("4f4221149a6777895a000000"), "id" : "0", "content" : "data for content 0", "tags" : [ "tag1", "tag2", "tag3", "tag4" ] }
{ "_id" : ObjectId("4f4221149a6777895a000001"), "id" : "1", "content" : "data for content 1", "tags" : [ "tag2", "tag4", "tag5", "tag7", "tag9" ] }

Output collection structure :

mongos> db.tags.find()
{ "_id" : "tag1", "value" : 14705 }
{ "_id" : "tag3", "value" : 14418 }

Let's go through the step-by-step process of creating the collection, populating test data and running map-reduce.

Let's create the posts collection by inserting a few records. If you print the collection stats, you will see that it is not sharded.

mongos> db.printCollectionStats()
---
posts
{
        "sharded" : false,
        "primary" : "shard241",
        "ns" : "test.posts",
        "count" : 2,
        "size" : 256,
        "avgObjSize" : 128,
        "storageSize" : 8192,
        "numExtents" : 1,
        "nindexes" : 1,
        "lastExtentSize" : 8192,
        "paddingFactor" : 1,
        "flags" : 1,
        "totalIndexSize" : 8176,
        "indexSizes" : {
                "_id_" : 8176
        },
        "ok" : 1
}
---

To shard the collection, you will first need to create an index on "id", and then shard the collection using "id" as the shard key.

mongos> db.posts.ensureIndex({id:1})
mongos> db.posts.getIndexes()
[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "ns" : "test.posts",
                "name" : "_id_"
        },
        {
                "v" : 1,
                "key" : {
                        "id" : 1
                },
                "ns" : "test.posts",
                "name" : "id_1"
        }
]

mongos> use admin
switched to db admin
mongos> db.runCommand( { shardcollection : "test.posts" , key : { id : 1 } } )
{ "collectionsharded" : "test.posts", "ok" : 1 }





The collection "posts" is now sharded. Lets populate some test data into the collection. Here is the php script that I used to populate data into the collection.

<?php
$m = new Mongo( "mongodb://192.168.1.249:10003", array("persist" => "x") );
$db = $m->test;
$table = $db->posts;
$start = 0;
$end = 200000;
for($i=$start; $i<$end; $i++)
{
        $tags = getTag();
        $obj = array("id"=>"$i", "content"=>"data for content $i", "tags"=>$tags);
        $table->insert($obj);
        echo "$i:".implode(',',$tags);
}
$found = $table->count();
echo "Found : $found\n";

function getTag()
{
        $tagArray = array('tag1','tag2','tag3','tag4','tag5','tag6','tag7','tag8','tag9','tag10','tag11','tag12','tag13','tag14','tag15','tag16','tag17','tag18','tag19','tag20','tag21','tag22','tag23','tag24','tag25','tag26','tag27','tag28','tag29','tag30','tag31','tag32','tag33','tag34','tag35','tag36','tag37','tag38','tag39','tag40','tag41','tag43');

        $tags = array();
        $tagcount = rand(2,5);

        $count = sizeof($tagArray);
        for($x=0; $x<$tagcount; $x++)
        {
                $tid = rand(0,$count);

                $tags[] = $tagArray[$tid];
        }
        return $tags;
}
?>



I pushed 200,000 records into the collection. Here is how the data was sharded between "241" and "243":

mongos> db.printCollectionStats()
---
posts
{
        "sharded" : true,
        "flags" : 1,
        "ns" : "test.posts",
        "count" : 200000,
        "numExtents" : 10,
        "size" : 24430872,
        "storageSize" : 32743424,
        "totalIndexSize" : 15534400,
        "indexSizes" : {
                "_id_" : 6508096,
                "id_1" : 9026304
        },
        "avgObjSize" : 122.15436,
        "nindexes" : 2,
        "nchunks" : 4,
        "shards" : {
                "shard241" : {
                        "ns" : "test.posts",
                        "count" : 109889,
                        "size" : 13423484,
                        "avgObjSize" : 122.15493598947415,
                        "storageSize" : 17978183,
                        "numExtents" : 8,
                        "nindexes" : 2,
                        "lastExtentSize" : 12083200,
                        "paddingFactor" : 1,
                        "flags" : 1,
                        "totalIndexSize" : 8531049,
                        "indexSizes" : {
                                "_id_" : 3573332,
                                "id_1" : 4957718
                        },
                        "ok" : 1
                },
                "shard243" : {
                        "ns" : "test.posts",
                        "count" : 90111,
                        "size" : 10913985,
                        "avgObjSize" : 121.11711711711712,
                        "storageSize" : 33251771,
                        "numExtents" : 8,
                        "nindexes" : 2,
                        "lastExtentSize" : 12083200,
                        "paddingFactor" : 1,
                        "flags" : 1,
                        "totalIndexSize" : 13274730,
                        "indexSizes" : {
                                "_id_" : 6617370,
                                "id_1" : 6657360
                        },
                        "ok" : 1
                }
        },
        "ok" : 1
}
---




Now we will create the map and reduce functions. The map function checks the tags array of each record in the posts collection and, for each element of the array, emits the tag with a count of 1. Next we create a reduce function which counts the occurrences of each tag and returns the final count. The map function can call emit(key, value) any number of times to feed data to the reducer. The reduce function receives an array of emitted values for a key and reduces them to a single value. The structure of the object returned by the reduce function must be identical to the structure of the map function's emitted value.

mongos> map = function() {
... if(!this.tags) {
... return;
... }
... for ( index in this.tags) {
... emit(this.tags[index],1);
... }
... }
function () {
    if (!this.tags) {
        return;
    }
    for (index in this.tags) {
        emit(this.tags[index], 1);
    }
}
mongos> reduce = function(key, values) {
... var count = 0;
... for(index in values) {
... count += values[index];
... }
... return count;
... }
function (key, values) {
    var count = 0;
    for (index in values) {
        count += values[index];
    }
    return count;
}
To understand how it works, let's say that after some iterations map emits the value { "tag1", 1 }. Suppose at that point "tag1" has a count of 50. That is, the document can be represented as:

{ "tag1", 50 }

If map again emits { "tag1", 1 }, reduce will be called as follows :

reduce( "tag1", [50,1] )

The result will be a simple combination of the counts for tag1

{ "tag1", 51 }

To invoke map-reduce, run the following command. It runs map-reduce on the "posts" collection, with "map" as the map function and "reduce" as the reduce function, and redirects the output to a collection named "tags".

mongos> result =  db.runCommand( {
... "mapreduce" : "posts",
... "map" : map, //name of map function
... "reduce" : reduce,  //name of reduce function
... "out" : "tags" } )
{
        "result" : "tags",
        "shardCounts" : {
                "192.168.1.241:10000" : {
                        "input" : 109889,
                        "emit" : 499098,
                        "reduce" : 6400,
                        "output" : 43
                },
                "192.168.1.243:10000" : {
                        "input" : 90111,
                        "emit" : 200395,
                        "reduce" : 3094,
                        "output" : 43
                }
        },
        "counts" : {
                "emit" : NumberLong(699493),
                "input" : NumberLong(200000),
                "output" : NumberLong(43),
                "reduce" : NumberLong(9494)
        },
        "ok" : 1,
        "timeMillis" : 9199,
        "timing" : {
                "shards" : 9171,
                "final" : 28
        }
}

Now look at the output documents. The output collection has one document per tag - in our case 43.

mongos> db.tags.find()
{ "_id" : "tag1", "value" : 14643 }
{ "_id" : "tag2", "value" : 14705 }
{ "_id" : "tag3", "value" : 14418 }
{ "_id" : "tag4", "value" : 14577 }
{ "_id" : "tag5", "value" : 14642 }
{ "_id" : "tag6", "value" : 14505 }
{ "_id" : "tag7", "value" : 14623 }
{ "_id" : "tag8", "value" : 14529 }
{ "_id" : "tag9", "value" : 14767 }
{ "_id" : "tag10", "value" : 14489 }
has more
 

mongos> db.tags.count()
43


References

http://cookbook.mongodb.org/patterns/count_tags/
http://www.mongodb.org/display/DOCS/MapReduce/

Thursday, February 16, 2012

Scaling using mongodb : HA and Automatic sharding (part 2)

We discussed creating HA using replica sets in the earlier post (part 1). In this post we will create a sharded cluster of replica sets consisting of multiple mongodb servers. If you noticed the parameters passed while starting mongodb on servers 241 & 242 earlier, you would have seen that in addition to the db path, log path and replica set parameters, I also added the parameter '--shardsvr'. This enables sharding on that instance of mongodb.

For creating a cluster of database servers using mongodb you need data nodes, config servers and router servers. Data nodes store the data. Config servers store the sharding information, which the client uses to figure out where the data resides on the data nodes. Router servers are used by the client to communicate with the config servers and data nodes; the clients interface to the database through the router servers.

On each of the config servers, start mongodb with the parameter '--configsvr'.

230:/home/mongodb } ./bin/mongod --configsvr --logappend --logpath ./data/log --pidfilepath ./data/mongodb.pid --fork --dbpath ./data/config --directoryperdb --bind_ip 192.168.1.230 --port 10002
231:/home/mongodb } ./bin/mongod --configsvr --logappend --logpath ./data/log --pidfilepath ./data/mongodb.pid --fork --dbpath ./data/config --directoryperdb --bind_ip 192.168.1.231 --port 10002
And start the routing service on the routing servers.

233:/home/mongodb } ./bin/mongos --port 10003 --configdb 192.168.1.230:10002,192.168.1.231:10002 --logappend --logpath ./data/route_log --fork --bind_ip 192.168.1.233 &
The same service can be started on routing server 234. To shard a database and add nodes to the shard, you need to connect to the routing server.

233:/home/mongodb } ./bin/mongo 192.168.1.233:10003
233:/home/mongodb } mongos> use admin
switched to db admin

To add the first replica set (set_a) to the cluster, run the following command.

233:/home/mongodb } mongos> db.runCommand( { addshard : "set_a/192.168.1.241:10000,192.168.1.242:10000,192.168.1.243:10000", name : "shard1" } )
{ "shardAdded" : "shard1", "ok" : 1 }

Similarly replica sets of the other 2 shards can also be added to the mongodb cluster. Eventually you can fire a listShards command to see the shards added to the cluster.

233:/home/mongodb } mongos > db.runCommand( {listshards:1} )
{
        "shards" : [
                {
                        "_id" : "shard1",
                        "host" : "192.168.1.241:10000,192.168.1.242:10000,192.168.1.243:10000"
                },
                {
                        "_id" : "shard2",
                        "host" : "192.168.1.244:10000,192.168.1.245:10000,192.168.1.246:10000"
                },
                {
                        "_id" : "shard3",
                        "host" : "192.168.1.247:10000,192.168.1.248:10000,192.168.1.249:10000"
                }           
        ],                                                                                                                                                        
        "ok" : 1                    
}              

To enable sharding on a database, say "test", run the following command.

233:/home/mongodb } mongos > db.runCommand( { enablesharding : "test" } )
{ "ok" : 1 }

To shard on a particular key: the test collection consists of the fields id, name and email. Let's shard on id.

233:/home/mongodb } > db.runCommand( { shardcollection : "test.test", key : { id : 1 } } )
{ "collectionsharded" : "test.test", "ok" : 1 }

I used a php script to push more than 20,000 records to the sharded cluster. You can check the status of the sharded cluster using the following commands.

233:/home/mongodb } mongos > db.printShardingStatus()
--- Sharding Status ---
  sharding version: { "_id" : 1, "version" : 3 }
  shards:
        {  "_id" : "shard1",  "host" : "192.168.1.241:10000,192.168.1.242:10000,192.168.1.243:10000" }
        {  "_id" : "shard2",  "host" : "192.168.1.244:10000,192.168.1.245:10000,192.168.1.246:10000" }
        {  "_id" : "shard3",  "host" : "192.168.1.247:10000,192.168.1.248:10000,192.168.1.249:10000" }
  databases:
        {  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
        {  "_id" : "test",  "partitioned" : true,  "primary" : "shard1" }
                test.test chunks:
                                shard1        2
                                shard2        1
                        { "id" : { $minKey : 1 } } -->> { "id" : "1" } on : shard1 { "t" : 2000, "i" : 1 }
                        { "id" : "1" } -->> { "id" : "999" } on : shard2 { "t" : 1000, "i" : 3 }
                        { "id" : "999" } -->> { "id" : { $maxKey : 1 } } on : shard2 { "t" : 2000, "i" : 0 }

233:/home/mongodb } mongos > db.printCollectionStats()
---
test
{
        "sharded" : true,
        "flags" : 1,
        "ns" : "test.test",
        "count" : 19998,
        "numExtents" : 6,
        "size" : 1719236,
        "storageSize" : 2801664,
        "totalIndexSize" : 1537088,
        "indexSizes" : {
                "_id_" : 662256,
                "id_1" : 874832
        },
        "avgObjSize" : 85.97039703970397,
        "nindexes" : 2,
        "nchunks" : 3,
        "shards" : {
                "shard1" : {
                        "ns" : "test.test",
                        "count" : 19987,
                        "size" : 1718312,
                        "avgObjSize" : 85.97148146295092,
                        "storageSize" : 2793472,
                        "numExtents" : 5,
                        "nindexes" : 2,
                        "lastExtentSize" : 2097152,
                        "paddingFactor" : 1,
                        "flags" : 1,
                        "totalIndexSize" : 1520736,
                        "indexSizes" : {
                                "_id_" : 654080,
                                "id_1" : 866656
                        },
                        "ok" : 1
                },
                "shard2" : {
                        "ns" : "test.test",
                        "count" : 11,
                        "size" : 924,
                        "avgObjSize" : 84,
                        "storageSize" : 8192,
                        "numExtents" : 1,
                        "nindexes" : 2,
                        "lastExtentSize" : 8192,
                        "paddingFactor" : 1,
                        "flags" : 1,
                        "totalIndexSize" : 16352,
                        "indexSizes" : {
                                "_id_" : 8176,
                                "id_1" : 8176
                        },
                        "ok" : 1
                }
        },
        "ok" : 1
}
---

To use multiple routing servers in a database connection string - for example, when connecting to mongodb from a programming language such as php - the following syntax can be used

mongodb://[username:password@]host1[:port1][,host2[:port2],...]/db_name

This will connect to at least one host from the list of hosts provided. If none of the hosts are available, an exception will be thrown.
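
To make this concrete, here is a minimal PHP sketch using the same legacy Mongo driver as the population script in the map-reduce post above. It lists both routing servers (233 and 234 from the setup above); the database name and the error handling are assumptions made for illustration.

<?php
// Minimal sketch: list both mongos routers so the driver can fall back to the
// second one if the first is unreachable.
try {
    $m  = new Mongo("mongodb://192.168.1.233:10003,192.168.1.234:10003/test",
                    array("persist" => "x"));
    $db = $m->test;
    echo "Connected. Documents in test.test : " . $db->test->count() . "\n";
} catch (MongoConnectionException $e) {
    // Thrown when none of the listed hosts could be reached.
    echo "Could not connect to any mongos : " . $e->getMessage() . "\n";
}
?>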

Wednesday, February 15, 2012

Scaling using mongodb : HA and Automatic sharding (part 1)

Mongodb introduces the concepts of automatic sharding and replica sets. Let's see how both can be used to create a highly available, horizontally scalable database cluster.

First let's take a look at the HA capability of mongodb. HA is handled in mongodb by something known as replica sets. A replica set is basically two or more mongodb nodes which are copies/replicas of each other. The replica set automatically elects a primary and provides automated failover and disaster recovery.

Let's create a replica set of 2 nodes first and proceed from there.

I have taken 2 machines, 192.168.1.241 and 192.168.1.242, for creating the replica set. Let's download the tar of mongodb 2.0.2 and put it on 241 and 242. We will create separate directories for the database and the replica db. Here is how we go about it.

241:/home/mongodb } mkdir -p data/db
241:/home/mongodb } mkdir -p repl/db
242:/home/mongodb } mkdir -p data/db
242:/home/mongodb } mkdir -p repl/db

Create primary mongodb replica set server on 241

241:/home/mongodb } ./bin/mongod --shardsvr --replSet set_a --logappend --logpath ./data/log --pidfilepath ./data/mongodb.pid --fork --dbpath ./data/db --directoryperdb --bind_ip 192.168.1.241 --port 10000 &
Create the secondary replica set member on 242, running on port 10001

242:/home/mongodb } ./bin/mongod --shardsvr --replSet set_a --logappend --logpath ./data/log --pidfilepath ./data/mongodb.pid --fork --dbpath ./data/db --directoryperdb --bind_ip 192.168.1.242 --port 10001 &
Check whether the replica set is working fine.

241:/home/mongodb } ./bin/mongo 192.168.1.241:10000
241: } > rs.status()
{
        "startupStatus" : 3,
        "info" : "run rs.initiate(...) if not yet done for the set",
        "errmsg" : "can't get local.system.replset config from self or any seed (EMPTYCONFIG)",
        "ok" : 0
}
241: } > rs.initiate()
{
        "info2" : "no configuration explicitly specified -- making one",
        "me" : "192.168.1.243:10000",
        "info" : "Config now saved locally.  Should come online in about a minute.",
        "ok" : 1
}
241: } > rs.isMaster()
db.isMaster()
{
        "setName" : "set_a",
        "ismaster" : true,
        "secondary" : false,
        "hosts" : [
                "192.168.1.241:10000"
        ],
        "primary" : "192.168.1.241:10000",
        "me" : "192.168.1.241:10000",
        "maxBsonObjectSize" : 16777216,
        "ok" : 1
}

We can see that 241 is the master, but 242 has not been added to the replica set. Let's add 242 as a secondary server of the replica set.

241: } >  rs.add("192.168.1.242:10001")
{ "ok" : 1 }

241: } > rs.conf()
{
        "_id" : "set_a",
        "version" : 2,
        "members" : [
                {
                        "_id" : 0,
                        "host" : "192.168.1.241:10000"
                },
                {
                        "_id" : 1,
                        "host" : "192.168.1.242:10001"
                }
        ]
}
241: } > rs.status()
{
        "set" : "set_a",
        "date" : ISODate("2012-02-15T09:42:07Z"),
        "myState" : 1,
        "members" : [
                {
                        "_id" : 0,
                        "name" : "192.168.1.241:10000",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "optime" : {
                                "t" : 1329298881000,
                                "i" : 1
                        },
                        "optimeDate" : ISODate("2012-02-15T09:41:21Z"),
                        "self" : true
                },
                {
                        "_id" : 1,
                        "name" : "192.168.1.242:10001",
                        "health" : 1,
                        "state" : 3,
                        "stateStr" : "RECOVERING",
                        "uptime" : 46,
                        "optime" : {
                                "t" : 0,
                                "i" : 0
                        },
                        "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
                        "lastHeartbeat" : ISODate("2012-02-15T09:42:07Z"),
                        "pingMs" : 5339686
                }
        ],
        "ok" : 1
}
242 is in recovery mode - replicating the data that is already there on 241. Once the data is completely replicated, the second node starts acting as a secondary node.

241: } > rs.status()
{
        "set" : "set_a",
        "date" : ISODate("2012-02-15T10:33:32Z"),
        "myState" : 1,
        "members" : [
                {
                        "_id" : 0,
                        "name" : "192.168.1.241:10000",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "optime" : {
                                "t" : 1329298881000,
                                "i" : 1
                        },
                        "optimeDate" : ISODate("2012-02-15T09:41:21Z"),
                        "self" : true
                },
                {
                        "_id" : 1,
                        "name" : "192.168.1.242:10001",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 3131,
                        "optime" : {
                                "t" : 1329298881000,
                                "i" : 1
                        },
                        "optimeDate" : ISODate("2012-02-15T09:41:21Z"),
                        "lastHeartbeat" : ISODate("2012-02-15T10:33:31Z"),
                        "pingMs" : 0
                }
        ],
        "ok" : 1
}

Now the replica set is working fine and the secondary node will act as primary if the primary goes out of service. Ideally there should be 3 or more nodes added to any replica set.

Tuesday, December 06, 2011

Postgresql replication

There are many solutions for postgresql replication available in the market. Almost all of them are third party solutions, since there was no inbuilt replication in postgresql. Postgresql 9.0 introduced replication into the database - it is also known as streaming replication. It can be used only for master-slave replication; there is no master-master or clustering feature available with postgresql SR (streaming replication).

The way SR works is that log files (known as XLOG files) are shipped to the standby or slave server over the network. Multiple slave servers can connect to the master over the network. The standby servers continuously replay the XLOG records shipped from the master in continuous recovery mode. As soon as XLOG files are shipped, they are replayed on the slave, which makes the latest data available on the slave almost immediately. Log shipping does not interfere with query execution on the master. In case the primary goes offline, the standby server will wait for the primary to become active.

Here is how I did a test setup of master-slave replication using postgresql.

I had 2 machines 241 and 242. I downloaded postgresql-9.1.1.tar.bz2 on both.

Steps to setup replication :

1. untar, compile and install

241/242 ]#  tar -xvjf postgresql-9.1.1.tar.bz2
241/242 ]#  cd postgresql-9.1.1
241/242 postgresql-9.1.1]#  ./configure
241/242 postgresql-9.1.1]#  make
241/242 postgresql-9.1.1]#  sudo make install

This will install postgresql in /usr/local/pgsql folder

2. Setup 241 as master. Initialize the database cluster on 241

241 ]# adduser postgres
241 ]# mkdir /usr/local/pgsql/data
241 ]# chown postgres /usr/local/pgsql/data
241 ]# su - postgres
241 ]# /usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data

Do not start the postgres database server now.

3. Configure the master server to listen on all ip addresses.

241 ]# vim /usr/local/pgsql/data/postgresql.conf
  
    listen_addresses = '*'
  
4. Allow the standby server to connect to postgresql on the master with replication privilege

241 ]# vim /usr/local/pgsql/data/pg_hba.conf
  
    host   replication   postgres    192.168.1.242/22    trust
  
5. Setup replication related parameters in the master server

241 ]# vim /usr/local/pgsql/data/postgresql.conf

    # To enable read-only queries on a standby server, wal_level must be set to
    # "hot_standby". But you can choose "archive" if you never connect to the
    # server in standby mode.
    wal_level = hot_standby

    # Set the maximum number of concurrent connections from the standby servers.
    max_wal_senders = 5

    # To prevent the primary server from removing the WAL segments required for
    # the standby server before shipping them, set the minimum number of segments
    # retained in the pg_xlog directory. At least wal_keep_segments should be
    # larger than the number of segments generated between the beginning of
    # online-backup and the startup of streaming replication. If you enable WAL
    # archiving to an archive directory accessible from the standby, this may
    # not be necessary.
    wal_keep_segments = 128
  
    # Enable WAL archiving on the primary to an archive directory accessible from
    # the standby. If wal_keep_segments is a high enough number to retain the WAL
    # segments required for the standby server, this is not necessary.
    archive_mode    = on
    archive_command = 'cp %p /usr/local/pgsql/data/pg_archive/%f'

6. start postgresql on master

241 ]# /usr/local/pgsql/bin/postgres -D /usr/local/pgsql/data >logfile 2>&1 &

7. Copy the master server's data to the standby server

241 ]# /usr/local/pgsql/bin/psql -c "SELECT pg_start_backup('label', true)"
 pg_start_backup
-----------------
 0/4000020
(1 row)

241 ]# rsync -a /usr/local/pgsql/data/ root@192.168.1.242:/usr/local/pgsql/data --exclude postmaster.pid

241 ]# /usr/local/pgsql/bin/psql -c "SELECT pg_stop_backup()"
NOTICE:  pg_stop_backup complete, all required WAL segments have been archived
 pg_stop_backup
----------------
 0/40000D8
(1 row)

This will also copy all the configuration parameters and authentication related stuff from the primary to the standby slave, ensuring that the slave can be converted to a master/primary in case of a failover.

8. Change postgresql.conf to enable readonly queries on standby server

242 ]# vim /usr/local/pgsql/data/postgresql.conf

    hot_standby = on
  
9. Enable recovery on the standby server and change configuration.

242 ]# cp /usr/local/pgsql/share/recovery.conf.sample /usr/local/pgsql/data/recovery.conf
242 ]# vim /usr/local/pgsql/data/recovery.conf


    # Specifies whether to start the server as a standby. In streaming replication,
    # this parameter must to be set to on.
    standby_mode          = 'on'

    # Specifies a connection string which is used for the standby server to connect
    # with the primary.
    primary_conninfo      = 'host=192.168.1.241 port=5432 user=postgres'

    # Specifies a trigger file whose presence should cause streaming replication to
    # end (i.e., failover). Once the trigger file is found the server acts as a primary server.
    trigger_file = '/home/postgres/failover'

    # Specifies a command to load archive segments from the WAL archive. If
    # wal_keep_segments is a high enough number to retain the WAL segments
    # required for the standby server, this may not be necessary. But
    # a large workload can cause segments to be recycled before the standby
    # is fully synchronized, requiring you to start again from a new base backup.
    restore_command = 'cp /usr/local/pgsql/data/pg_archive/%f "%p"'

10. Start postgres on standby server. This will start streaming replication on the standby server.

242 ]# /usr/local/pgsql/bin/postgres -D /usr/local/pgsql/data >logfile 2>&1 &

11. You can check the status of streaming replication using either the ps command or psql - the postgresql command prompt

241 (primary) ]# /usr/local/pgsql/bin/psql -c "SELECT pg_current_xlog_location()"
 pg_current_xlog_location
--------------------------
 0/5000EC0
(1 row)

242 (standby) ]# /usr/local/pgsql/bin/psql -c "select pg_last_xlog_receive_location()"
 pg_last_xlog_receive_location
-------------------------------
 0/5000EC0
(1 row)

242 (standby) ]$ /usr/local/pgsql/bin/psql -c "select pg_last_xlog_replay_location()"
 pg_last_xlog_replay_location
------------------------------
 0/5000EC0
(1 row)

To check using ps use the following commands

241 (master)]# ps ax | grep sender
 2728 ?        Ss     0:00 postgres: wal sender process postgres 192.168.1.242(54792) streaming 0/5000EC0
 2768 pts/1    R+     0:00 grep sender

242 (standby)]# ps ax| grep receiver
 28125 ?        Ss     0:00 postgres: wal receiver process   streaming 0/5000EC0
 28154 pts/1    S+     0:00 grep receiver


To do a failover, all that needs to be done is to create the 'trigger' file at the specified location. This will automatically turn off standby mode and the postgres server will start acting as a primary or master.

Do remember to use the "pg_ctl stop" command to stop either the primary or standby server. This will ensure a graceful shutdown and no records will be missed during replication.

In order to create another standby server, repeat the steps from 7 onwards - after adding the ip of the new standby server to the master configuration as in step 4.


Tuesday, January 25, 2011

Theory : database sharding strategies

There are a number of database sharding strategies that meet the diverse requirements of different categories of application.

Shard by Modulus
For many applications, it is appropriate to shard based on a shard key such as a User ID. Using a modulus of a numeric ID, especially an auto increment primary key, ensures even distribution of data between shards.
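
As a quick illustration, here is a hedged PHP sketch of modulus-based shard selection. The shard list and the user id are made-up values; the point is only the id % number_of_shards arithmetic.

<?php
// Sketch: map a numeric user id to a shard using a simple modulus.
function getShardForUser($userId, $totalShards)
{
    return $userId % $totalShards;
}

// Illustrative shard list - replace with real shard hosts.
$shards = array('shard0.example.com', 'shard1.example.com',
                'shard2.example.com', 'shard3.example.com');

$userId  = 1234567;
$shardNo = getShardForUser($userId, count($shards));
echo "User $userId maps to shard $shardNo ({$shards[$shardNo]})\n";
?>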

Shard by Date/Time Range
For time-based data such as feeds or blogs where data is accumulating over time, it may make sense to shard by date range. For example, each shard could contain data for a single month. New shards can be added each month and old shards can be dropped once historic data is no longer needed.
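
A hedged PHP sketch of the same idea: map a record's timestamp to a per-month shard. The naming scheme (feeds_YYYY_MM) is an assumption made purely for illustration.

<?php
// Sketch: each calendar month gets its own shard, e.g. feeds_2012_02.
function getShardForTimestamp($timestamp)
{
    return 'feeds_' . date('Y_m', $timestamp);
}

echo getShardForTimestamp(strtotime('2012-02-20')) . "\n"; // feeds_2012_02
echo getShardForTimestamp(time()) . "\n";                  // shard for the current month
?>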

Master Lookup
It is sometimes a requirement to control the sharding manually or in an application specific manner. One example would be a requirement to host key customer accounts on shards hosted on higher specification hardware. To support this requirement, a master shard can be created which contains lookup tables to map customer IDs to a specific shard number.
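
Here is a hedged PHP sketch of that lookup step. The customer_shard table, its columns and the PDO connection are assumptions made for illustration only.

<?php
// Sketch: resolve a customer's shard number from a lookup table kept on the
// master shard. Assumed schema: customer_shard(customer_id INT PRIMARY KEY, shard_no INT).
function getShardForCustomer(PDO $masterDb, $customerId)
{
    $stmt = $masterDb->prepare(
        'SELECT shard_no FROM customer_shard WHERE customer_id = ?');
    $stmt->execute(array($customerId));
    $shardNo = $stmt->fetchColumn();

    // Customers not present in the lookup table fall back to a default shard.
    return ($shardNo === false) ? 0 : (int) $shardNo;
}
?>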

Session-based Sharding
Some categories of application, particularly user-centric web applications, can choose a shard when a customer logs in and then direct all queries to the same shard for the duration of that user session.

Fixed Shard
Tables are mapped to specific fixed shards. Also known as table based sharding.

Custom Sharding
If there is any specific logic for sharding, a piece of code can be used to shard data based on that logic.

Global Tables
Global tables are tables which are hosted in all shards and whose data is automatically replicated across all shards. The benefit is that these tables can be used in joins with sharded tables in each shard. Global tables are typically fairly static tables or tables with a low write volume, such as product codes, countries, and other reference data.

Monday, June 07, 2010

Innodb now supports native AIO on Linux

With the exception of Windows, InnoDB has used 'simulated AIO' on all other platforms to perform certain IO operations. The IO requests that have been performed in a 'simulated AIO' way are the write requests and the readahead requests for the datafile pages. Let us first look at what 'simulated AIO' means in this context.

We call it 'simulated AIO' because it appears asynchronous from the context of a query thread, but from the OS perspective the IO calls are still synchronous. The query thread simply queues the request in an array and then returns to its normal work. One of the IO helper threads, which are background threads, then takes the request from the queue and issues a synchronous IO call (pread/pwrite), meaning it blocks on the IO call. Once it returns from the pread/pwrite call, this helper thread calls the IO completion routine on the block in question, which includes doing a merge of buffered operations, if any, in case of a read. In case of a write, the block is marked as 'clean' and is removed from the flush_list. Some other bookkeeping also happens in the IO completion routine.

What we have changed in InnoDB Plugin 1.1 is to use the native AIO interface on Linux. Note that this feature requires that your system has libaio installed. libaio is a thin wrapper around the kernelized AIO on Linux; it is different from POSIX AIO, which requires user level threads to service AIO requests. There is a new boolean switch, innodb_use_native_aio, to choose between simulated and native AIO, the default being native AIO.

How does this change the design of the InnoDB IO subsystem? Now the query thread, instead of enqueueing the IO request, actually dispatches the request to the kernel and returns to its normal work. The IO helper thread, instead of picking up enqueued requests, waits on the IO wait events for any completed IO requests. As soon as it is notified by the kernel that a certain request has been completed, it calls the IO completion routine on that request and then goes back to waiting on the IO wait events. In this new design the IO requesting thread becomes a kind of dispatcher while the background IO thread takes on the role of a collector.

What will this buy us? The answer is simple – scalability. For example, consider a system which is heavily IO bound. In InnoDB one IO helper thread works on a maximum of 256 IO requests at one time. Assume that the heavy workload results in the queue being filled up. In simulated AIO the IO helper thread will go through these requests one by one, making a synchronous call for each request. This means serialisation, forcing the request that is serviced last to wait for the other 255 requests before it gets a chance. What this implies is that with simulated AIO there can be at most 'n' IO requests in parallel inside the kernel, where 'n' is the total number of IO helper threads (this is not entirely true because query threads are also allowed to issue synchronous requests, but I'll gloss over that detail for now). In case of native AIO all 256 requests are dispatched to the kernel, and if the underlying OS can service more requests in parallel then we'll take advantage of that.

The idea of coalescing contiguous requests is now offloaded to the kernel/IO scheduler. What this means is that which IO scheduler you are using, or the properties of your RAID/disk controller, may now have more effect on the overall IO performance. This is also true because many more IO requests will now be inside the kernel than before. Though we have not run tests to specifically certify any particular IO scheduler, the conventional wisdom has been that for database engine workloads the noop or deadline scheduler would perhaps give optimal performance. I have heard that lately a lot of improvements have gone into cfq as well. It is for you to try, and as always YMMV. And we look forward to hearing your story.

NOTE: InnoDB has always used native AIO on Windows and it continues to do so in Plugin 1.1. innodb_use_native_aio has no effect on Windows.

Friday, March 05, 2010

Storing trees in databases

A tree is a connected, acyclic graph. Let's look at different options for storing such trees in databases.

Parent - Child Model

The most common model for storing hierarchical information is storing the reference of the parent node along with the child node.

So, for a tree where A is the root:

Employee   Boss   Salary
A          NULL   1000
B          A      900
C          A      950
D          C      800
E          C      700
F          C      600

For this model, the table to be created would be:

create table employee_tree(employee varchar(10) not null primary key, boss varchar(10), salary decimal(10,2) not null);

Some sample operations :
  • Finding the root node : select * from employee_tree where boss is NULL
  • Finding all leaf nodes : select e1.* from employee_tree as e1 left join employee_tree as e2 on e1.employee=e2.boss where e2.employee is null
  • Retrieving a single path : select e1.employee as lev1, e2.employee as lev2, e3.employee as lev3 from employee_tree as e1 left join employee_tree as e2 on e2.boss = e1.employee left join employee_tree as e3 on e3.boss = e2.employee where e1.employee = 'A' and e3.employee='F';

Some problems with this model:
  • Update : if you run "update employee_tree set employee = 'C1' where employee='C'", then all nodes below "C" are left with a boss who does not exist. To handle this, you should run a transaction with two queries, one being the above query and the other being "update employee_tree set boss = 'C1' where boss = 'C'"
  • Insert : if you run a query "Insert into employee_tree (employee, boss) values ('G','H'),('H','G')" you get two employees 'G' & 'H' who are both employees and bosses of each other.
  • Insert : Another problem with insert is "Insert into employee_tree (employee, boss) values ('M','M')". Now 'M' is the boss of 'M'.
  • Delete : Run "Delete from employee_tree where employee = 'C'". Again we create a situation where 'D','E','F' are left with a boss who does not exist. So in effect, deletion requires the tree to be traversed.
  • Descendants : It is difficult to get the number of descendants of a given node in a single query. You will have to write some script and run some recursive function to get that number

Fixing this model: Following checks should be in place to fix this model.
  • An employee cannot be his own boss : Check if ( boss != employee ) or (boss = 0 and employee = 0)
  • On deletion use cascade to remove the employees to that boss
  • To prevent cyclic relations : add unique key (employee, boss)
  • Check for validity of tree (edges = nodes - 1) : count(*) from employee_tree - 1 = count(boss) from employee_tree where boss is not null
  • Check for only a single root : select count(*) from employee_tree where boss is null = 1

Path Enumeration model

create table employee(emp_id char(1) not null primary key, emp_name varchar(10), emp_path varchar(255) not null);
emp_id   emp_name   emp_path
A        A Name     A
B        B Name     AB
C        C Name     AC
D        D Name     ACD
E        E Name     ACE
F        F Name     ACF
The sample operations are easy here
  • Finding root node : select * from employee where length(emp_path)=1;
  • Finding all leaf nodes : This is going to be difficult; you will have to write a script to extract this info
  • Prevent cyclic relations : check whether after removing the node from the path string, the path is reduced by only 1 node. Query : CHECK(NOT EXISTS(select * from employee as e1, employee as e2 where length(replace(e1.emp_path,e2.emp_id,'')) < (length(e1.emp_path)-length(e2.emp_id)) ));
  • No path can be longer than the total no of nodes in the tree : CHECK( (select max(length(e1.emp_path)) from employee as e1) <= (select count(e2.emp_id) * length(e2.emp_id) from employee as e2) )
  • Depth of the tree is the path divided by the length of emp_id string : CHAR_LENGTH(emp_path) / CHAR_LENGTH(emp_id); If emp_ids are not fixed length then the depth is : CHAR_LENGTH(emp_path) - CHAR_LENGTH(REPLACE(emp_path,'/','')) + 1 [here / is the separator that is path is written as 'A/B/C'. If you use any other separator, you need to change it in the function as well]
  • To find all subtrees for a parent : the immediate solution is "Select * from employee where emp_path like concat('%',<Parent_emp_id>,'%')". But this query does not use indexes. Another way of doing this is "select * from employee where emp_path like concat((select emp_path from employee where emp_id='<Parent_emp_id>'),'%');"
  • To find the immediate subchildren of a parent : "select * from employee where emp_path like concat((select emp_path from employee where emp_id='<Parent_emp_id>'),'_');"
  • To search for superiors : "create table sequence(seq int not null); insert into sequence values (1),(2),(3),(4),(5); select substring(e1.emp_path from (seq * char_length(e1.emp_id)) for char_length(e1.emp_id)) as emp_id from employee e1, sequence s1 where e1.emp_id = '<emp_id_to_search_for>' and s1.seq <= char_length(e1.emp_path)/char_length(e1.emp_id);"
  • To find the relationships among superiors : "select e2.* from employee e1, employee e2 where e1.emp_id = '<emp_id_to_search_for>' and POSITION(e2.emp_path in e1.emp_path) = 1;"
  • To delete a subtree : "DELETE from employee where emp_path like concat((select emp_path from employee where emp_id='<Dead_node_emp_id>'),'%')" We may need to write the two queries separately;
  • To delete a node : "Delete from employee where emp_id='<emp_id_of_dead_node>'" And "Update employee set emp_path = replace(emp_path,'<emp_id_of_dead_node>','')" Both queries should be in a single transaction.
  • Inserting a node : "select emp_path from employee where emp_id = '<emp_id_of_boss>'; insert into employee values ('<new_emp_id>','<new_emp_name>',concat('<selected_emp_boss_path>','<new_emp_id>'))"

A derivative of the path enumeration model is the edge enumeration model. In Edge enumeration model the path consists of a set of integers that make the path from the root to the node - from left to right.

For example :
Employee name   Edge path
A               1.
B               1.1.
C               1.2.
D               1.2.1
E               1.2.2
F               1.2.3
The benefit of the edge enumeration model over the path enumeration model is that you do not have to worry about long strings. The numbers also give an implied, meaningful ordering to the siblings.

Nested set model


Here each node is assigned a beginning and an ending node hierarchy number based on the total amount of data in the hierarchical tree.

An example table could be :

create table emp(emp_id int not null auto_increment primary key, emp_name varchar(10), lft int not null, rgt int not null);
Emp id   Emp Name   Lft   Rgt
1        A          1     12
2        B          2     3
3        C          4     11
4        D          5     6
5        E          7     8
6        F          9     10
Operations in this model:
  • Finding the root node : "select * from emp where lft = 1"
  • Finding leaf nodes : "select * from emp where lft = rgt - 1"
  • Finding subtree : "select node.* from emp as node, emp as parent where node.lft > parent.lft and node.lft < parent.rgt and parent.emp_id='<parent_employee_id>';"
  • Retrieving a single path : "select parent.* from emp as node, emp as parent where node.lft > parent.lft and node.lft < parent.rgt and node.emp_id='<leaf_node_id>' order by parent.lft;"
  • Finding depth of nodes : "select node.emp_name, (COUNT(parent.emp_name)-1) as depth from emp as node, emp as parent where node.lft BETWEEN parent.lft and parent.rgt group by node.emp_name order by node.lft;"
  • Add a single node as a child of a node with no existing children : If you want to insert a node 'G' below 'B', then the new node 'G' would have left and right values '3' and '4', and all nodes after it (in sequence) would have to increase their left and right values by 2. The procedure for it would be
    lock table emp write;
    select @myleft := lft from emp where emp_name='<add_below_this_node>';
    update emp set rgt = rgt+2 where rgt > @myleft;
    update emp set lft = lft+2 where lft > @myleft;
    insert into emp values ('','G',@myleft+1,@myleft+2);
    unlock tables;
    
  • As a single node after a node (not as a new child) : modify the earlier procedure a little.
    lock table emp write;
    select @myright := rgt from emp where emp_name='<add_after_this_node>';
    update emp set rgt = rgt+2 where rgt > @myright;
    update emp set lft = lft+2 where lft > @myright;
    insert into emp values ('','G',@myright+1,@myright+2);
    unlock tables;
    
  • Delete a node and its children : Deletion is just the opposite of adding. After deletion, its width has to be subtracted from all nodes to the right.
    lock table emp write;
    select @myleft := lft, @myright := rgt, @mywidth := rgt - lft + 1 from emp where emp_name='<node_to_be_deleted>';
    delete from emp where lft between @myleft and @myright;
    update emp set rgt = rgt - @mywidth where rgt > @myright;
    update emp set lft = lft - @mywidth where lft > @myright;
    unlock tables;
    
  • Delete only the parent node: In this case the child nodes will have to be moved up to the level of the parent.
    lock table emp write;
    select @myleft := lft, @myright := rgt, @mywidth := rgt - lft + 1 from emp where emp_name='<parent_to_be_deleted>';
    delete from emp where lft = @myleft;
    update emp set rgt = rgt - 1, lft = lft - 1 where lft between @myleft and @myright;
    update emp set rgt = rgt - 2 where rgt > @myright;
    update emp set lft = lft - 2 where lft > @myleft;
    unlock tables;
    
Hybrid model

A parent-child with nested set model (node, parent, left, right) would serve the purpose much better. But with hybrid models, inserts are more expensive as compared to selects. So it is important to evaluate the pros and cons and figure out what would be best for the purpose that your application serves.

Nested intervals model.

This model is similar to the nested set model but uses rational numbers (a/b) instead of integers. So in our above tree the left and right values would change.
Emp id   Emp Name   Lft     Rgt
1        A          0/11    11/11
2        B          1/11    2/11
3        C          3/11    10/11
4        D          4/11    5/11
5        E          6/11    7/11
6        F          8/11    9/11

This is a bit complex to implement, but plays well with all the queries.

Proprietary solutions

Oracle, DB2 and Microsoft SQL Server support hierarchical data. In open source, Postgresql 8.4 supports the addition and querying of hierarchical data, basically by using a "CONNECT BY" clause. It would be better to refer to their official documentation for info regarding the syntax for implementing hierarchical data in relational databases.

Wednesday, September 02, 2009

DB Basics : Types of joins

Join Types

So, you have designed a normalized database design for your application, maybe up to the 3rd normal form. Now, when you need to run queries, you will need to join the tables to get the required information. There has to be some common data that allows those tables to be connected in some meaningful way. Although it's possible to have more than one common column between two tables, most often the join key will be the primary key of one table and a foreign key in the other.

Let's perform queries against the following tables.

company : company_id (pk), company_name
model : model_id(pk), company_id(fk), model_name

Here, we have a company table and a model table in a 1:n relationship. A car company can have multiple models in the market. Let us take some sample data:

company_id   company_name
1            Audi
2            GM
3            Ford
4            Toyota
5            Tata
6            BMW

model_id   company_id   model_name
1          1            A4
2          1            A6
3          1            A8
4          1            Q7
5          2            Chevrolet
6          2            Hummer
7          3            Ikon
8          3            Endeavor
9          3            Fiesta
10         4            Corola
11         4            Camry
12         4            Innova
13         5            Indica
14         5            Nano
15         5            Safari
16         NULL         Custom


Inner join : An inner join is defined as a join in which rows must match in both tables in order to be included in the result set.

mysql> select t1.company_name as 'Manufacturer', t2.model_name as 'Model' from company t1 INNER JOIN model t2 on t1.company_id = t2.company_id where t1.company_name = 'Audi';

+--------------+-------+
| Manufacturer | Model |
+--------------+-------+
| Audi         | A4    |
| Audi         | A6    |
| Audi         | A8    |
| Audi         | Q7    |
+--------------+-------+


Outer Join : Outer joins will return records in one table that aren’t matched in another. Outer joins can be further divided into the two types of left and right. In a left outer join, all records from the first (left-hand) table in a join that meet any conditions set in the WHERE clause are returned, whether or not there’s a match in the second (right-hand) table.

mysql> select t1.company_name as 'Manufacturer', t2.model_name as 'Model' from company t1 left join model t2 on t1.company_id = t2.company_id where t1.company_name in ('Toyota','BMW');
+--------------+--------+
| Manufacturer | Model  |
+--------------+--------+
| Toyota       | Corola |
| Toyota       | Camry  |
| Toyota       | Innova |
| BMW          | NULL   |
+--------------+--------+


Here 'BMW' is returned even though it does not have any entry in the model table.

Similar to the left outer join, a right outer join returns all records from the second (right-hand) table in a join that meet any conditions set in the WHERE clause, whether or not there’s a match in the first (left-hand) table.

mysql> select t1.company_name as 'Manufacturer', t2.model_name as 'Model' from company t1 right join model t2 on t1.company_id = t2.company_id where t2.model_name in ('Custom','Nano');
+--------------+--------+
| Manufacturer | Model  |
+--------------+--------+
| Tata         | Nano   |
| NULL         | Custom |
+--------------+--------+


Cross-join :

The cross-join, also referred to as a Cartesian product, returns all the rows in all the tables listed in the query: each row in the first table is paired with every row in the second table. This happens when there is no relationship defined between the two tables. We do not require cross joins in our general applications, so we should try to avoid them. A cross join happens when we fail to provide a join condition in the query.

mysql> select t1.company_name as 'Manufacturer', t2.model_name as 'Model' from company t1, model t2;
+--------------+-----------+
| Manufacturer | Model     |
+--------------+-----------+
| Audi         | A4        |
| GM           | A4        |
| Ford         | A4        |
| Toyota       | A4        |
| Tata         | A4        |
| BMW          | A4        |
| Audi         | A6        |
| GM           | A6        |
.....
.....
.....
| Ford         | Custom    |
| Toyota       | Custom    |
| Tata         | Custom    |
| BMW          | Custom    |
+--------------+-----------+

Friday, November 28, 2008

postgresql replication using slony-I

As most postgresql users must know, postgresql does not provide any inbuilt replication solution. There are lots of 3rd party replication products available for postgresql; Slony is one of them. Slony is a trigger based replication solution, that is, it uses triggers to push data to the slave. Slony is supposed to be one of the most stable replication solutions for postgresql.

You can download slony from www.slony.info. There are two major versions of slony - Slony-I & Slony-II. Slony-I is a simple master-slave replication solution, whereas Slony-II is an advanced multi-master replication solution. We will go ahead with the simple master-slave replication solution, so we will download Slony-I. The latest version available is Slony-I 1.2.15. Slony-I 2.0 is in RC and should be released soon, but we will go with a stable release - 1.2.15.

The postgresql version being used is 8.3.3. To install slony, simply untar the downloaded file and run
./configure --with-pgconfigdir=<path to pg_config>
make
sudo make install


I have used two machines for setting up replication. Installed postgresql and slony-I on both of them.

master server ip : 172.16.3.211
slave server ip : 172.16.3.173


We will be working with the superuser postgres which is used to start and stop the postgresql server.

Quick steps

  • Define environment variables on master & slave. The main purpose is to make our task easier. Lets create an env.sh file containing all the definitions.

    #!/bin/sh

    REPLICATIONUSER=postgres
    CLUSTERNAME=replcluster
    MASTERDBNAME=repltestdb
    SLAVEDBNAME=repltestdb
    MASTERHOST=172.16.3.211
    SLAVEHOST=172.16.3.173
    MASTERPORT=5432
    SLAVEPORT=5432
    MASTERDBA=postgres
    SLAVEDBA=postgres
    PSQL=/usr/local/pgsql/bin/psql
    CREATEDB=/usr/local/pgsql/bin/createdb
    CREATELANG=/usr/local/pgsql/bin/createlang
    CREATEUSER=/usr/local/pgsql/bin/createuser
    PGDUMP=/usr/local/pgsql/bin/pg_dump

    export REPLICATIONUSER CLUSTERNAME MASTERDBNAME SLAVEDBNAME MASTERHOST SLAVEHOST PSQL CREATEDB CREATELANG CREATEUSER PGDUMP MASTERPORT SLAVEPORT MASTERDBA SLAVEDBA


    As you can see here, my postgresql is installed in /usr/local/pgsql. I have defined the IP addresses & ports of the master and slave servers. I have used the superuser postgres for replication. And I have defined the master and slave databases to be used for replication. You can replicate between databases with different names on master and slave - just change the names in all the scripts.

  • Create database on master & slave
    On master run
    /usr/local/pgsql/bin/createdb -O $REPLICATIONUSER -h $MASTERHOST -p $MASTERPORT $MASTERDBNAME
    On slave run
    /usr/local/pgsql/bin/createdb -O $REPLICATIONUSER -h $SLAVEHOST -p $SLAVEPORT $SLAVEDBNAME

  • Since slony-I depends on triggers for replication, you will need to install the plpgsql procedural language on the master to generate and run the triggers & stored procedures that push data to the slave.
    /usr/local/pgsql/bin/createlang -h $MASTERHOST -p $MASTERPORT plpgsql $MASTERDBNAME

  • Put some tables in the $MASTERDBNAME on master, which you want to replicate. And port the tables to slave. It has to be done manually.

    Dump the tables on master
    /usr/local/pgsql/bin/pg_dump -s -U $MASTERDBA -h $MASTERHOST -p $MASTERPORT $MASTERDBNAME > replmaster.sql

    Import the tables on slave
    /usr/local/pgsql/bin/psql -U $SLAVEDBA -h $SLAVEHOST -p $SLAVEPORT $SLAVEDBNAME < replmaster.sql

  • And now configure the databases for replication. When you install Slony-I, it puts two binaries, slonik and slon, in the pgsql/bin directory. Slonik is the tool used for creating configuration tables, stored procedures and triggers. All we need to do is create a configuration file and pass it to the slonik tool. Here I am assuming that there are two tables which need to be replicated - parent & child.

    vim replconfig.cnf
    # define the namespace the replication system uses in our example it is
    # replcluster
    cluster name = replcluster;
    # admin conninfo's are used by slonik to connect to the nodes one for each
    # node on each side of the cluster, the syntax is that of PQconnectdb in
    # the C-API
    node 1 admin conninfo = 'dbname=repltestdb host=172.16.3.211 port=5432 user=postgres';
    node 2 admin conninfo = 'dbname=repltestdb host=172.16.3.173 port=5432 user=postgres';
    # init the first node. Its id MUST be 1. This creates the schema
    # _$CLUSTERNAME containing all replication system specific database
    # objects.
    init cluster ( id=1, comment = 'Master Node');
    # Add unique keys to table that do not have one.
    # This command adds a bigint column named "_Slony-I_$CLUSTERNAME_rowID" to the table which will have a default value of nextval('_$CLUSTERNAME.s1_rowid_seq') and have UNIQUE & NOT NULL constraints applied on it.
    # table add key (node id = 1, fully qualified name = 'table_name');
    # Slony-I organizes tables into sets. The smallest unit a node can
    # subscribe is a set.
    # you need to have a set add table() for each table you wish to replicate
    create set (id=1, origin=1, comment='parent child table')
    set add table (set id=1, origin=1, id=1, fully qualified name = 'public.parent', comment='parent table');
    set add table (set id=1, origin=1, id=2, fully qualified name = 'public.child', comment='child table');
    # Create the second node (the slave) tell the 2 nodes how to connect to
    # each other and how they should listen for events.
    store node (id=2, comment = 'Slave node');
    store path (server = 1, client = 2, conninfo='dbname=repltestdb host=172.16.3.211 port=5432 user=postgres');
    store path (server = 2, client = 1, conninfo='dbname=repltestdb host=172.16.3.173 port=5432 user=postgres');
    store listen (origin=1, provider = 1, receiver =2);
    store listen (origin=2, provider = 2, receiver =1);


    Pass the config file to slonik for creating required triggers & config tables.

    /usr/local/pgsql/bin/slonik replconfig.cnf

  • Lets start the replication daemons on master & slave

    On master run
    /usr/local/pgsql/bin/slon $CLUSTERNAME "dbname=$MASTERDBNAME user=$MASTERDBA host=$MASTERHOST port=$MASTERPORT" > slon.log &

    On slave run
    /usr/local/pgsql/bin/slon $CLUSTERNAME "dbname=$SLAVEDBNAME user=$SLAVEDBA host=$SLAVEHOST port=$SLAVEPORT" > slon.log &

    Check out the output in slon.log files

  • Now everything is set up, and from the slon.log files on master and slave you can see that both servers are trying to sync with each other. But replication is not yet under way. To start replication we need to make the slave subscribe to the master. Here is the required config file for doing this

    startrepl.cnf
    # This defines which namespace the replication system uses
    cluster name = replcluster;
    # connection info for slonik to connect to master & slave
    node 1 admin conninfo = 'dbname=repltestdb host=172.16.3.211 port=5432 user=postgres';
    node 2 admin conninfo = 'dbname=repltestdb host=172.16.3.173 port=5432 user=postgres';
    # Node 2 subscribes set 1
    subscribe set ( id = 1, provider = 1, receiver = 2, forward = no);


    Passing this file to slonik will do the trick and replication would start happening.

    /usr/local/pgsql/bin/slonik startrepl.cnf



Now simply make some inserts, updates and deletes on the master and check whether they are happening on the slave as well. Since replication is now in full swing, all changes to master tables should be replicated on the slave.

Please note that new tables & changes to table structures won't be replicated automatically. So whenever a new table is created or an existing table is altered, the changes have to be manually propagated to the slave and the scripts need to be run to make appropriate changes in the triggers and config tables.

Another important thing to note is that postgresql on the master and slave should be able to communicate with both the ip addresses. For this, add the ip addresses to pgsql/data/pg_hba.conf.

For the above replication I had added the lines
host all all 172.16.3.211/32 trust
host all all 172.16.3.173/32 trust

to the pg_hba.conf file in both master & slave.

Tuesday, June 03, 2008

MySQL versus PostgreSQL

I created and ran some simple tests on mysql and postgresql to figure out which one is faster. It is already known that postgresql is more stable and reliable than mysql. Pgsql has a rich set of features - it is a complete RDBMS and also supports fulltext search.

All benchmarks were done on my laptop - Intel core 2 duo (2.0 GHz) with 4MB L2 cache & 2 GB ram. I have a 64 bit ubuntu system loaded with MySQL 5.1.24-rc (64 bit binary) and PostgreSQL 8.3.1 (compiled from source).

I used python as a scripting language for writing down my benchmark scripts. I used psycopg2 as a connector from python to postgres and mysql-python as a connector from python to mysql.

The benchmarking was done in phases. Firstly simple Insert, update and select queries were run to check the raw speed of these queries. Then threads were created to run simultaneous insert, update, select and delete queries. I checked the benchmark times for different number of concurrent threads.

I created a simple table on both mysql and pgsql. I used the MyISAM storage engine to create the table in mysql:

ABC(id int not null auto_increment primary key, value varchar(250));

Queries that were run are:

Insert(I) : Insert ignore into ABC (id, value) ...(For pgsql, a rule has to be created to ignore duplicate inserts)
Update(U) : Update ABC set value=<something> where id=<random_id>
Select(S) : Select * from ABC where id=<random_id>
Delete(D) : Delete from ABC where id=<random_id>



  • Insert - 100000 rows in 1 thread
    Time taken for Mysql : 20.8 seconds
    Time taken for Pgsql : 58.1 seconds
    So, raw insert speed of mysql is much better as compared to pgsql

  • 100000 selects in 1 thread
    Time taken for Mysql : 21.76 seconds
    Time taken for Pgsql : 20.15 seconds
    Raw selects are better in pgsql as compared to mysql

  • Selects - 2 threads of 100000 selects
    Time taken for Mysql : 40.46 seconds
    Time taken for Pgsql : 27.38 seconds
    So, if I increase the concurrency of selects, pgsql performs much better than mysql

  • Update - 2 threads of 50000
    Time taken for Mysql : 23.97 seconds
    Time taken for Pgsql : 34.03 seconds
    Mysql looks better in handling updates here.

  • 4 Threads
    Run 1 : [100000 Selects, 50000 Inserts, 50000 Updates, 20000 Deletes]
    Time taken for Mysql : 45.25 seconds
    Time taken for Pgsql : 54.58 seconds
    Run 2 : [100000 Selects, 100000 Inserts, 50000 Updates, 10000 Deletes]
    Time taken for Mysql : 59.05 seconds
    Time taken for Pgsql : 69.38 seconds
    Run 3 : [100000 Selects, 20000 Inserts, 20000 Updates, 1000 Deletes]
    Time taken for Mysql : 35.54 seconds
    Time taken for Pgsql : 31.23 seconds
    These runs show that Mysql is good when you have a very large no of inserts/updates/deletes as compared to selects. But pgsql's performance surpasses that of mysql when the number of selects is much higher.

  • Finally, let's approach the real life scenario where generally the number of selects is much higher than the number of inserts and there are multiple threads performing selects and inserts.
    I will use the following notation here - <no_of_threads> X <no_of_operations(select/insert/update/delete)_per_thread>
    So, for example 3 X 20 Selects = 3 threads of 20 Selects in each thread

    Run 1 : [2 X 30000 selects, 3 X 20000 selects, 1 X 20000 inserts, 2 X 10000 inserts, 2 X 100000 updates, 2 X 1000 deletes] Total - 12 threads
    Time taken for Mysql : 42.28 seconds
    Time taken for Pgsql : 44.28 seconds
    Both Mysql and Pgsql are almost at par.

    Run 2 : [2 X 50000 selects, 2 X 40000 selects, 1 X 30000 selects, 1 X 20000 inserts, 2 X 15000 inserts, 2 X 15000 updates, 2 X 2000 deletes] Total - 12 threads but number of selects are quite high
    Time taken for Mysql : 61.02 seconds
    Time taken for Pgsql : 48.60 seconds
    So, as we increase the number of operations (especially selects), mysql's performance degrades, whereas pgsql's performance remains almost the same

    Run 3 : [4 X 50000 selects, 4 X 40000 selects, 2 X 30000 selects, 2 X 20000 inserts, 3 X 15000 inserts, 3 X 15000 updates, 2 X 3000 deletes] Total - 20 threads (10 threads for select, 5 for insert, 3 for update and 2 for delete) Which is the normal trend in database servers.
    Time taken for Mysql : 169.31 seconds
    Time taken for Pgsql : 128.7 seconds
    Bingo, so as concurrency increases pgsql becomes faster than mysql.



My earlier benchmarks with pgsql 7.x were not as good as this one. With postgresql 8.3.1, the speed of serving concurrent requests has increased a lot. So, in a high concurrency environment, I would generally recommend going with postgresql rather than mysql.

Please check the comments section. We have some really interesting comments there...