
Monday, February 20, 2012

scaling using mongodb : Map-Reduce on sharded collection (part 3)

The idea here is to create a sharded database and then run map-reduce on it to process the data. This is a very basic example. I created a sharded collection "posts" with the following structure. The goal is to find the count of each tag across the complete sharded collection. I am using two machines named "241" and "243" as shards for mongodb. The mongos service is running on a third machine, "249".

Input collection structure :

mongos> db.posts.find()
{ "_id" : ObjectId("4f4221149a6777895a000000"), "id" : "0", "content" : "data for content 0", "tags" : [ "tag1", "tag2", "tag3", "tag4" ] }
{ "_id" : ObjectId("4f4221149a6777895a000001"), "id" : "1", "content" : "data for content 1", "tags" : [ "tag2", "tag4", "tag5", "tag7", "tag9" ] }

Output collection structure :

mongos> db.tags.find()
{ "_id" : "tag1", "value" : 14643 }
{ "_id" : "tag3", "value" : 14418 }

Let's walk step by step through creating the collection, populating test data and running map-reduce.

Let's create the posts collection by inserting a few records. If you print the collection stats, you will see that it is not yet sharded.

mongos> db.printCollectionStats()
---
posts
{
        "sharded" : false,
        "primary" : "shard241",
        "ns" : "test.posts",
        "count" : 2,
        "size" : 256,
        "avgObjSize" : 128,
        "storageSize" : 8192,
        "numExtents" : 1,
        "nindexes" : 1,
        "lastExtentSize" : 8192,
        "paddingFactor" : 1,
        "flags" : 1,
        "totalIndexSize" : 8176,
        "indexSizes" : {
                "_id_" : 8176
        },
        "ok" : 1
}
---

To shard the collection, you first need to create an index on "id", and then shard the collection using "id" as the key.

mongos> db.posts.ensureIndex({id:1})
mongos> db.posts.getIndexes()
[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "ns" : "test.posts",
                "name" : "_id_"
        },
        {
                "v" : 1,
                "key" : {
                        "id" : 1
                },
                "ns" : "test.posts",
                "name" : "id_1"
        }
]

mongos> use admin
switched to db admin
mongos> db.runCommand( { shardcollection : "test.posts" , key : { id : 1 } } )
{ "collectionsharded" : "test.posts", "ok" : 1 }





The collection "posts" is now sharded. Let's populate it with some test data. Here is the PHP script I used to populate data into the collection.

<?php
$m = new Mongo( "mongodb://192.168.1.249:10003", array("persist" => "x") );
$db = $m->test;
$table = $db->posts;
$start = 0;
$end = 200000;
for($i=$start; $i<$end; $i++)
{
        $tags = getTag();
        $obj = array("id"=>"$i", "content"=>"data for content $i", "tags"=>$tags);
        $table->insert($obj);
        echo "$i:".implode(',',$tags)."\n";
}
$found = $table->count();
echo "Found : $found\n";

// returns a random set of 2-5 tags
function getTag()
{
        $tagArray = array('tag1','tag2','tag3','tag4','tag5','tag6','tag7','tag8','tag9','tag10','tag11','tag12','tag13','tag14','tag15','tag16','tag17','tag18','tag19','tag20','tag21','tag22','tag23','tag24','tag25','tag26','tag27','tag28','tag29','tag30','tag31','tag32','tag33','tag34','tag35','tag36','tag37','tag38','tag39','tag40','tag41','tag43');

        $tags = array();
        $tagcount = rand(2,5);

        $count = sizeof($tagArray);
        for($x=0; $x<$tagcount; $x++)
        {
                // rand() is inclusive on both ends, so cap at $count-1
                // to avoid an out-of-range index
                $tid = rand(0,$count-1);

                $tags[] = $tagArray[$tid];
        }
        return $tags;
}
?>



I pushed 200,000 records into the collection. Here is how the data was sharded between "241" and "243":

mongos> db.printCollectionStats()
---
posts
{
        "sharded" : true,
        "flags" : 1,
        "ns" : "test.posts",
        "count" : 200000,
        "numExtents" : 10,
        "size" : 24430872,
        "storageSize" : 32743424,
        "totalIndexSize" : 15534400,
        "indexSizes" : {
                "_id_" : 6508096,
                "id_1" : 9026304
        },
        "avgObjSize" : 122.15436,
        "nindexes" : 2,
        "nchunks" : 4,
        "shards" : {
                "shard241" : {
                        "ns" : "test.posts",
                        "count" : 109889,
                        "size" : 13423484,
                        "avgObjSize" : 122.15493598947415,
                        "storageSize" : 17978183,
                        "numExtents" : 8,
                        "nindexes" : 2,
                        "lastExtentSize" : 12083200,
                        "paddingFactor" : 1,
                        "flags" : 1,
                        "totalIndexSize" : 8531049,
                        "indexSizes" : {
                                "_id_" : 3573332,
                                "id_1" : 4957718
                        },
                        "ok" : 1
                },
                "shard243" : {
                        "ns" : "test.posts",
                        "count" : 90111,
                        "size" : 10913985,
                        "avgObjSize" : 121.11711711711712,
                        "storageSize" : 33251771,
                        "numExtents" : 8,
                        "nindexes" : 2,
                        "lastExtentSize" : 12083200,
                        "paddingFactor" : 1,
                        "flags" : 1,
                        "totalIndexSize" : 13274730,
                        "indexSizes" : {
                                "_id_" : 6617370,
                                "id_1" : 6657360
                        },
                        "ok" : 1
                }
        },
        "ok" : 1
}
---




Now we will create the map and reduce functions. The map function checks the tags array of each record in the posts collection; for each element of the array it emits the tag with a count of 1. The reduce function counts the occurrences of each tag and returns the final count. The map function may call emit(key, value) any number of times to feed data to the reducer. The reduce function receives an array of emitted values for a key and reduces them to a single value. The structure of the value returned by the reduce function must be identical to the structure of the map function's emitted value.

mongos> map = function() {
... if(!this.tags) {
... return;
... }
... for ( index in this.tags) {
... emit(this.tags[index],1);
... }
... }
function () {
    if (!this.tags) {
        return;
    }
    for (index in this.tags) {
        emit(this.tags[index], 1);
    }
}
mongos> reduce = function(key, values) {
... var count = 0;
... for(index in values) {
... count += values[index];
... }
... return count;
... }
function (key, values) {
    var count = 0;
    for (index in values) {
        count += values[index];
    }
    return count;
}
To understand how this works, let's say that after some iterations map emits the value { "tag1", 1 }, and suppose at that point "tag1" already has a count of 50. That is, the document can be represented as:

{ "tag1", 50 }

If map again emits { "tag1", 1 }, reduce will be called as follows:

reduce( "tag1", [50,1] )

The result will be a simple combination of the counts for tag1:

{ "tag1", 51 }
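
The whole cycle can be simulated in a few lines of plain JavaScript, with no MongoDB needed. This is just an illustration of the semantics described above; the mapReduce driver here is a toy helper of my own, not the real mongo command:

```javascript
// map: emit (tag, 1) for every tag of every document
function map(doc, emit) {
    if (!doc.tags) {
        return;
    }
    for (var i in doc.tags) {
        emit(doc.tags[i], 1);
    }
}

// reduce: sum the emitted counts for one key
function reduce(key, values) {
    var count = 0;
    for (var i in values) {
        count += values[i];
    }
    return count;
}

// toy driver: group emitted values by key, then reduce each group
function mapReduce(docs) {
    var grouped = {};
    docs.forEach(function (doc) {
        map(doc, function (key, value) {
            (grouped[key] = grouped[key] || []).push(value);
        });
    });
    var out = {};
    for (var key in grouped) {
        out[key] = reduce(key, grouped[key]);
    }
    return out;
}

var result = mapReduce([
    { id: "0", tags: ["tag1", "tag2", "tag3", "tag4"] },
    { id: "1", tags: ["tag2", "tag4", "tag5"] }
]);
// result is { tag1: 1, tag2: 2, tag3: 1, tag4: 2, tag5: 1 }
```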

To invoke map-reduce, run the following command. It states that mapreduce is run on the "posts" collection, using "map" as the map function and "reduce" as the reduce function, with the output written to a collection named "tags".

mongos> result =  db.runCommand( {
... "mapreduce" : "posts",
... "map" : map, // the map function defined above
... "reduce" : reduce,  // the reduce function defined above
... "out" : "tags" } )
{
        "result" : "tags",
        "shardCounts" : {
                "192.168.1.241:10000" : {
                        "input" : 109889,
                        "emit" : 499098,
                        "reduce" : 6400,
                        "output" : 43
                },
                "192.168.1.243:10000" : {
                        "input" : 90111,
                        "emit" : 200395,
                        "reduce" : 3094,
                        "output" : 43
                }
        },
        "counts" : {
                "emit" : NumberLong(699493),
                "input" : NumberLong(200000),
                "output" : NumberLong(43),
                "reduce" : NumberLong(9494)
        },
        "ok" : 1,
        "timeMillis" : 9199,
        "timing" : {
                "shards" : 9171,
                "final" : 28
        }
}

Look at the output documents. The output collection contains one document per tag - in our case 43.

mongos> db.tags.find()
{ "_id" : "tag1", "value" : 14643 }
{ "_id" : "tag2", "value" : 14705 }
{ "_id" : "tag3", "value" : 14418 }
{ "_id" : "tag4", "value" : 14577 }
{ "_id" : "tag5", "value" : 14642 }
{ "_id" : "tag6", "value" : 14505 }
{ "_id" : "tag7", "value" : 14623 }
{ "_id" : "tag8", "value" : 14529 }
{ "_id" : "tag9", "value" : 14767 }
{ "_id" : "tag10", "value" : 14489 }
has more
 

mongos> db.tags.count()
43


References

http://cookbook.mongodb.org/patterns/count_tags/
http://www.mongodb.org/display/DOCS/MapReduce/

Thursday, February 16, 2012

Scaling using mongodb : HA and Automatic sharding (part 2)

We discussed creating HA using replica sets in part 1. In this post we will create a sharded cluster of replica sets consisting of multiple mongodb servers. If you noticed the parameters passed when starting mongodb on servers 241 & 242 earlier, you would have seen that in addition to the db path, log path and replica set parameters, I also added the parameter '--shardsvr', which enables sharding on that instance of mongodb.

For creating a cluster of database servers using mongodb you need data nodes, config servers and router servers. Data nodes store the data. Config servers store the sharding information, which clients use to figure out where data resides on the data nodes. Router servers are used by the client to communicate with the config servers and data nodes; clients interface with the database through the router servers.

On each of the config servers, start mongodb with the parameter '--configsvr'.

230:/home/mongodb } ./bin/mongod --configsvr --logappend --logpath ./data/log --pidfilepath ./data/mongodb.pid --fork --dbpath ./data/config --directoryperdb --bind_ip 192.168.1.230 --port 10002
231:/home/mongodb } ./bin/mongod --configsvr --logappend --logpath ./data/log --pidfilepath ./data/mongodb.pid --fork --dbpath ./data/config --directoryperdb --bind_ip 192.168.1.231 --port 10002
And start the routing service on the routing servers.

233:/home/mongodb } ./bin/mongos --port 10003 --configdb 192.168.1.230:10002,192.168.1.231:10002 --logappend --logpath ./data/route_log --fork --bind_ip 192.168.1.233 &
The same service can be started on routing server 234. To shard a database and add nodes to the shard, you need to connect to the routing server.

233:/home/mongodb } ./bin/mongo 192.168.1.233:10003
233:/home/mongodb } mongos> use admin
switched to db admin

To add the first replica set (set_a) to the cluster, run the following command.

233:/home/mongodb } mongos> db.runCommand( { addshard : "set_a/192.168.1.241:10000,192.168.1.242:10000,192.168.1.243:10000", name : "shard1" } )
{ "shardAdded" : "shard1", "ok" : 1 }

Similarly replica sets of the other 2 shards can also be added to the mongodb cluster. Eventually you can fire a listShards command to see the shards added to the cluster.

233:/home/mongodb } mongos > db.runCommand( {listshards:1} )
{
        "shards" : [
                {
                        "_id" : "shard1",
                        "host" : "192.168.1.241:10000,192.168.1.242:10000,192.168.1.243:10000"
                },
                {
                        "_id" : "shard2",
                        "host" : "192.168.1.244:10000,192.168.1.245:10000,192.168.1.246:10000"
                },
                {
                        "_id" : "shard3",
                        "host" : "192.168.1.247:10000,192.168.1.248:10000,192.168.1.249:10000"
                }           
        ],                                                                                                                                                        
        "ok" : 1                    
}              

To enable sharding on a database say "test" run the following commands.

233:/home/mongodb } mongos > db.runCommand( { enablesharding : "test" } )
{ "ok" : 1 }

Next, shard the collection on a particular key. The test collection consists of the fields id, name and email. Let's shard on id.

233:/home/mongodb } > db.runCommand( { shardcollection : "test.test", key : { id : 1 } } )
{ "collectionsharded" : "test.test", "ok" : 1 }

I used a php script to push more than 20,000 records to the sharded cluster. You can check the status of sharded cluster using the following commands.

233:/home/mongodb } mongos > db.printShardingStatus()
--- Sharding Status ---
  sharding version: { "_id" : 1, "version" : 3 }
  shards:
        {  "_id" : "shard1",  "host" : "192.168.1.241:10000,192.168.1.242:10000,192.168.1.243:10000" }
        {  "_id" : "shard2",  "host" : "192.168.1.244:10000,192.168.1.245:10000,192.168.1.246:10000" }
        {  "_id" : "shard3",  "host" : "192.168.1.247:10000,192.168.1.248:10000,192.168.1.249:10000" }
  databases:
        {  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
        {  "_id" : "test",  "partitioned" : true,  "primary" : "shard1" }
                test.test chunks:
                                shard1        2
                                shard2        1
                        { "id" : { $minKey : 1 } } -->> { "id" : "1" } on : shard1 { "t" : 2000, "i" : 1 }
                        { "id" : "1" } -->> { "id" : "999" } on : shard2 { "t" : 1000, "i" : 3 }
                        { "id" : "999" } -->> { "id" : { $maxKey : 1 } } on : shard2 { "t" : 2000, "i" : 0 }
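
Those chunk ranges are what mongos uses to route an operation by its shard key. A rough sketch of the lookup (findShard is a hypothetical helper of mine, not a mongo function; the "id" values compare as strings here because the script stored them as strings):

```javascript
// stand-ins for $minKey / $maxKey so plain string comparison works
var MIN_KEY = "", MAX_KEY = "\uffff";

// the chunk table printed by db.printShardingStatus() above
var chunks = [
    { min: MIN_KEY, max: "1",     shard: "shard1" },
    { min: "1",     max: "999",   shard: "shard2" },
    { min: "999",   max: MAX_KEY, shard: "shard2" }
];

// find the chunk whose [min, max) range contains the key
function findShard(key) {
    for (var i = 0; i < chunks.length; i++) {
        if (key >= chunks[i].min && key < chunks[i].max) {
            return chunks[i].shard;
        }
    }
    return null;
}

// findShard("0") -> "shard1"  (sorts before "1")
// findShard("5") -> "shard2"  ("1" <= "5" < "999" as strings)
```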

233:/home/mongodb } mongos > db.printCollectionStats()
---
test
{
        "sharded" : true,
        "flags" : 1,
        "ns" : "test.test",
        "count" : 19998,
        "numExtents" : 6,
        "size" : 1719236,
        "storageSize" : 2801664,
        "totalIndexSize" : 1537088,
        "indexSizes" : {
                "_id_" : 662256,
                "id_1" : 874832
        },
        "avgObjSize" : 85.97039703970397,
        "nindexes" : 2,
        "nchunks" : 3,
        "shards" : {
                "shard1" : {
                        "ns" : "test.test",
                        "count" : 19987,
                        "size" : 1718312,
                        "avgObjSize" : 85.97148146295092,
                        "storageSize" : 2793472,
                        "numExtents" : 5,
                        "nindexes" : 2,
                        "lastExtentSize" : 2097152,
                        "paddingFactor" : 1,
                        "flags" : 1,
                        "totalIndexSize" : 1520736,
                        "indexSizes" : {
                                "_id_" : 654080,
                                "id_1" : 866656
                        },
                        "ok" : 1
                },
                "shard2" : {
                        "ns" : "test.test",
                        "count" : 11,
                        "size" : 924,
                        "avgObjSize" : 84,
                        "storageSize" : 8192,
                        "numExtents" : 1,
                        "nindexes" : 2,
                        "lastExtentSize" : 8192,
                        "paddingFactor" : 1,
                        "flags" : 1,
                        "totalIndexSize" : 16352,
                        "indexSizes" : {
                                "_id_" : 8176,
                                "id_1" : 8176
                        },
                        "ok" : 1
                }
        },
        "ok" : 1
}
---

To use multiple routing servers in a database connection string - for example, when connecting to mongodb from a programming language such as PHP - the following syntax can be used:

mongodb://[username:password@]host1[:port1][,host2[:port2],...]/db_name

This will connect to at least one host from the list provided. If none of the hosts are available, an exception will be thrown.
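
As a sketch of what such a string looks like once assembled (buildMongoUri is an illustrative helper of my own, not part of any driver):

```javascript
// assemble a multi-host mongodb connection string from its parts
function buildMongoUri(hosts, dbName, user, password) {
    var auth = user ? user + ":" + password + "@" : "";
    return "mongodb://" + auth + hosts.join(",") + "/" + dbName;
}

// both routing servers from the setup above, so the client can fail over
var uri = buildMongoUri(
    ["192.168.1.233:10003", "192.168.1.234:10003"],
    "test"
);
// uri: "mongodb://192.168.1.233:10003,192.168.1.234:10003/test"
```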

Wednesday, February 15, 2012

Scaling using mongodb : HA and Automatic sharding (part 1)

Mongodb introduces the concepts of automatic sharding and replica sets. Let's see how both can be used to create a highly available, horizontally scalable cluster of databases.

First let's take a look at the HA capability of mongodb. HA is handled in mongodb by something known as Replica Sets. Replica sets are basically two or more mongodb nodes which are copies/replicas of each other. A replica set automatically selects a primary (master) and provides automated failover and disaster recovery.

Let's create a replica set of 2 nodes first and proceed from there.

I have taken 2 machines, 192.168.1.241 and 192.168.1.242, for creating the replica set. Let's download the tar of mongodb 2.0.2 and put it on 241 and 242. We will create separate directories for the database and the replica db. Here is how we go about it.

241:/home/mongodb } mkdir -p data/db
241:/home/mongodb } mkdir -p repl/db
242:/home/mongodb } mkdir -p data/db
242:/home/mongodb } mkdir -p repl/db

Create primary mongodb replica set server on 241

241:/home/mongodb } ./bin/mongod --shardsvr --replSet set_a --logappend --logpath ./data/log --pidfilepath ./data/mongodb.pid --fork --dbpath ./data/db --directoryperdb --bind_ip 192.168.1.241 --port 10000 &
Create the secondary replica set server on 242, running on port 10001.

242:/home/mongodb } ./bin/mongod --shardsvr --replSet set_a --logappend --logpath ./data/log --pidfilepath ./data/mongodb.pid --fork --dbpath ./data/db --directoryperdb --bind_ip 192.168.1.242 --port 10001 &
Check whether the replica set is working fine.

241:/home/mongodb } ./bin/mongo 192.168.1.241:10000
241: } > rs.status()
{
        "startupStatus" : 3,
        "info" : "run rs.initiate(...) if not yet done for the set",
        "errmsg" : "can't get local.system.replset config from self or any seed (EMPTYCONFIG)",
        "ok" : 0
}
241: } > rs.initiate()
{
        "info2" : "no configuration explicitly specified -- making one",
        "me" : "192.168.1.241:10000",
        "info" : "Config now saved locally.  Should come online in about a minute.",
        "ok" : 1
}
241: } > rs.isMaster()
db.isMaster()
{
        "setName" : "set_a",
        "ismaster" : true,
        "secondary" : false,
        "hosts" : [
                "192.168.1.241:10000"
        ],
        "primary" : "192.168.1.241:10000",
        "me" : "192.168.1.241:10000",
        "maxBsonObjectSize" : 16777216,
        "ok" : 1
}

We can see that 241 is master, but 242 is not added to the replica set. Lets add 242 as a secondary server of the replica set.

241: } >  rs.add("192.168.1.242:10001")
{ "ok" : 1 }

241: } > rs.conf()
{
        "_id" : "set_a",
        "version" : 2,
        "members" : [
                {
                        "_id" : 0,
                        "host" : "192.168.1.241:10000"
                },
                {
                        "_id" : 1,
                        "host" : "192.168.1.242:10001"
                }
        ]
}
241: } > rs.status()
{
        "set" : "set_a",
        "date" : ISODate("2012-02-15T09:42:07Z"),
        "myState" : 1,
        "members" : [
                {
                        "_id" : 0,
                        "name" : "192.168.1.241:10000",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "optime" : {
                                "t" : 1329298881000,
                                "i" : 1
                        },
                        "optimeDate" : ISODate("2012-02-15T09:41:21Z"),
                        "self" : true
                },
                {
                        "_id" : 1,
                        "name" : "192.168.1.242:10001",
                        "health" : 1,
                        "state" : 3,
                        "stateStr" : "RECOVERING",
                        "uptime" : 46,
                        "optime" : {
                                "t" : 0,
                                "i" : 0
                        },
                        "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
                        "lastHeartbeat" : ISODate("2012-02-15T09:42:07Z"),
                        "pingMs" : 5339686
                }
        ],
        "ok" : 1
}
242 is in recovery mode - replicating the data that is already there on 241. Once the data is completely replicated, the second node starts acting as a secondary node.

241: } > rs.status()
{
        "set" : "set_a",
        "date" : ISODate("2012-02-15T10:33:32Z"),
        "myState" : 1,
        "members" : [
                {
                        "_id" : 0,
                        "name" : "192.168.1.241:10000",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "optime" : {
                                "t" : 1329298881000,
                                "i" : 1
                        },
                        "optimeDate" : ISODate("2012-02-15T09:41:21Z"),
                        "self" : true
                },
                {
                        "_id" : 1,
                        "name" : "192.168.1.242:10001",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 3131,
                        "optime" : {
                                "t" : 1329298881000,
                                "i" : 1
                        },
                        "optimeDate" : ISODate("2012-02-15T09:41:21Z"),
                        "lastHeartbeat" : ISODate("2012-02-15T10:33:31Z"),
                        "pingMs" : 0
                }
        ],
        "ok" : 1
}

Now the replica set is working fine, and the secondary node will take over as primary if the primary goes out of service. Ideally there should be 3 or more nodes in any replica set.
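
The reason for 3 or more nodes: a replica set can only elect a primary when a strict majority of its members is reachable. A quick illustration (canElectPrimary is a hypothetical helper, not a MongoDB API):

```javascript
// a primary can be elected only if more than half the members are up
function canElectPrimary(totalMembers, reachableMembers) {
    return reachableMembers > totalMembers / 2;
}

// with 2 members, losing one leaves no majority -> no primary
canElectPrimary(2, 1); // false
// with 3 members, losing one still leaves a majority
canElectPrimary(3, 2); // true
```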

Thursday, July 29, 2010

Cassandra Data model : WTF is a supercolumn


The Pieces

Let’s first go thru the building blocks before we see how they can all be stuck together:

Column

The column is the lowest/smallest increment of data. It’s a tuple (triplet) that contains a name, a value and a timestamp.

Here’s a column represented in JSON-ish notation:
{  // this is a column
    name: "emailAddress",
    value: "arin@example.com",
    timestamp: 123456789
}
That’s all it is. For simplicity sake let’s ignore the timestamp. Just think of it as a name/value pair.

Also, it’s worth noting is that the name and value are both binary (technically byte[]) and can be of any length.

SuperColumn

A SuperColumn is a tuple w/ a binary name & a value which is a map containing an unbounded number of Columns – keyed by the Column’s name. Keeping with the JSON-ish notation we get:
{   // this is a SuperColumn
    name: "homeAddress",
    // with an infinite list of Columns
    value: {
        // note the keys is the name of the Column
        street: {name: "street", value: "1234 x street", timestamp: 123456789},
        city: {name: "city", value: "san francisco", timestamp: 123456789},
        zip: {name: "zip", value: "94107", timestamp: 123456789},
    }
}
Column vs SuperColumn

Columns and SuperColumns are both tuples w/ a name & value. The key difference is that a standard Column’s value is a “string” and in a SuperColumn the value is a Map of Columns. That’s the main difference… their values contain different types of data. Another minor difference is that SuperColumns don’t have a timestamp component to them.

Before We Get Rolling

Before I move on I wanna simplify our notation a couple ways: 1) ditch the timestamps from Columns & 2) pull the Columns’ & SuperColumns’ names component out so that it looks like a key/value pair. So we’re gonna go from:
{ // this is a super column
    name: "homeAddress",
    // with an infinite list of columns
    value: {
        street: {name: "street", value: "1234 x street", timestamp: 123456789},
        city: {name: "city", value: "san francisco", timestamp: 123456789},
        zip: {name: "zip", value: "94107", timestamp: 123456789},
    }
}
to

 homeAddress: {
    street: "1234 x street",
    city: "san francisco",
    zip: "94107",
}
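
That simplification is mechanical - drop the timestamps and flatten each Column to a key/value pair. A throwaway sketch (simplify is my own helper, nothing to do with Cassandra's API):

```javascript
// flatten a SuperColumn: map each inner Column's name to just its value
function simplify(superColumn) {
    var out = {};
    for (var key in superColumn.value) {
        out[key] = superColumn.value[key].value;
    }
    return out;
}

var homeAddress = simplify({
    name: "homeAddress",
    value: {
        street: { name: "street", value: "1234 x street", timestamp: 123456789 },
        city:   { name: "city",   value: "san francisco", timestamp: 123456789 },
        zip:    { name: "zip",    value: "94107",         timestamp: 123456789 }
    }
});
// homeAddress: { street: "1234 x street", city: "san francisco", zip: "94107" }
```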
Grouping ‘Em

There’s a single structure used to group both the Columns and SuperColumns…this structure is called a ColumnFamily and comes in 2 varieties Standard & Super.

ColumnFamily

A ColumnFamily is a structure that contains an infinite number of Rows. Huh, did you say Rows? Ya – rows. To make it sit easier in your head, just think of it as a table in an RDBMS.

OK – each Row has a client supplied (that means you) key & contains a map of Columns. Again, the keys in the map are the names of the Columns and the values are the Columns themselves:
UserProfile = { // this is a ColumnFamily
    phatduckk: {   // this is the key to this Row inside the CF
        // now we have an infinite # of columns in this row
        username: "phatduckk",
        email: "phatduckk@example.com",
        phone: "(900) 976-6666"
    }, // end row
    ieure: {   // this is the key to another row in the CF
        // now we have another infinite # of columns in this row
        username: "ieure",
        email: "ieure@example.com",
        phone: "(888) 555-1212"
        age: "66",
        gender: "undecided"
    },
}
Remember: for simplicity we're only showing the value of the Column but in reality the values in the
map are the entire Column.
You can think of it as a HashMap/dictionary or associative array. If you start thinking that way then you’re on the right track.

One thing I want to point out is that there’s no schema enforced at this level. The Rows do not have a predefined list of Columns that they contain. In our example above you see that the row with the key “ieure” has Columns with names “age” and “gender” whereas the row identified by the key “phatduckk” doesn’t. It’s 100% flexible: one Row may have 1,989 Columns whereas the other has 2. One Row may have a Column called “foo” whereas none of the rest do. This is the schemaless aspect of Cassandra.

A ColumnFamily Can Be Super Too

Now, a ColumnFamily can be of type Standard or Super.

What we just went over was an example of the Standard type. What makes it Standard is the fact that all the Rows contains a map of normal (aka not-Super) Columns… there’s no SuperColumns scattered about.

When a ColumnFamily is of type Super we have the opposite: each Row contains a map of SuperColumns. The map is keyed with the name of each SuperColumn and the value is the SuperColumn itself. And, just to be clear, since this ColumnFamily is of type Super, there are no Standard ColumnFamily’s in there. Here’s an example:
AddressBook = { // this is a ColumnFamily of type Super
    phatduckk: {    // this is the key to this row inside the Super CF
        // the key here is the name of the owner of the address book

        // now we have an infinite # of super columns in this row
        // the keys inside the row are the names for the SuperColumns
        // each of these SuperColumns is an address book entry
        friend1: {street: "8th street", zip: "90210", city: "Beverley Hills", state: "CA"},

        // this is the address book entry for John in phatduckk's address book
        John: {street: "Howard street", zip: "94404", city: "FC", state: "CA"},
        Kim: {street: "X street", zip: "87876", city: "Balls", state: "VA"},
        Tod: {street: "Jerry street", zip: "54556", city: "Cartoon", state: "CO"},
        Bob: {street: "Q Blvd", zip: "24252", city: "Nowhere", state: "MN"},
        ...
        // we can have an infinite # of SuperColumns (aka address book entries)
    }, // end row
    ieure: {     // this is the key to another row in the Super CF
        // all the address book entries for ieure
        joey: {street: "A ave", zip: "55485", city: "Hell", state: "NV"},
        William: {street: "Armpit Dr", zip: "93301", city: "Bakersfield", state: "CA"},
    },
}
Keyspace

A Keyspace is the outermost grouping of your data. All your ColumnFamily’s go inside a Keyspace. Your Keyspace will probably be named after your application.

Now, a Keyspace can have multiple ColumnFamily’s but that doesn’t mean there’s an imposed relationship between them. For example: they’re not like tables in MySQL… you can’t join them. Also, just because ColumnFamily_1 has a Row with key “phatduckk” that doesn’t mean ColumnFamily_2 has one too.

Sorting

OK – we’ve gone through what the various data containers are about but another key component of the data model is how the data is sorted. Cassandra is not queryable like SQL – you do not specify how you want the data sorted when you’re fetching it (among other differences). The data is sorted as soon as you put it into the cluster and it always remains sorted! This is a tremendous performance boost for reads but in exchange for that benefit you’re going to have to make sure to plan your data model in a such a way that you’re able satisfy your access patterns.

Columns are always sorted within their Row by the Column’s name. This is important so i’ll say it again: Columns are always sorted by their name! How the names are compared depends on the ColumnFamily’s CompareWith option. Out of the box you have the following options: BytesType, UTF8Type, LexicalUUIDType, TimeUUIDType, AsciiType, and LongType. Each of these options treats the Columns’ names as a different data type, giving you quite a bit of flexibility. For example: using LongType will treat your Columns’ names as 64-bit Longs. Let’s try and clear this up by taking a look at some data before and after it’s sorted:

// Here's a view of all the Columns from a particular Row in random order
    // Cassandra would "never" store data in random order. This is just an example
    // Also, ignore the values - they don't matter for sorting at all
    {name: 123, value: "hello there"},
    {name: 832416, value: "kjjkbcjkcbbd"},
    {name: 3, value: "101010101010"},
    {name: 976, value: "kjjkbcjkcbbd"}
So, given the fact that we’re using the LongType option, these Columns will look like this when they’re sorted:


    <!--
    ColumnFamily definition from storage-conf.xml
    -->
    <ColumnFamily CompareWith="LongType" Name="CF_NAME_HERE"/>

    // See, each Column's name is treated as a 64bit long
    // in effect, numerically ordering our Columns' by name
    {name: 3, value: "101010101010"},
    {name: 123, value: "hello there"},
    {name: 976, value: "kjjkbcjkcbbd"},
    {name: 832416, value: "kjjkbcjkcbbd"}
As you can see the Columns’ names were compared as if they were 64-bit Longs (aka: numbers that can get pretty big). Now, if we’d used another CompareWith option we’d end up with a different result. If we’d set CompareWith to UTF8Type, our sorted Columns’ names would be treated as UTF8-encoded strings, yielding a sort order like this:

<!--
    ColumnFamily definition from storage-conf.xml
    -->
    <ColumnFamily CompareWith="UTF8Type" Name="CF_NAME_HERE"/>

    // Each Column name is treated as a UTF8 string
    {name: 123, value: "hello there"},
    {name: 3, value: "101010101010"},
    {name: 832416, value: "kjjkbcjkcbbd"},
    {name: 976, value: "kjjkbcjkcbbd"}
The result is completely different!
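To make the difference concrete, here's a quick Python sketch (not Cassandra code — plain tuples stand in for Columns) showing how the same four names sort under LongType-style vs UTF8Type-style comparison:

```python
# The four Columns from the example above, as (name, value) tuples
columns = [
    (123, "hello there"),
    (832416, "kjjkbcjkcbbd"),
    (3, "101010101010"),
    (976, "kjjkbcjkcbbd"),
]

# CompareWith="LongType": names compared as integers
long_sorted = sorted(columns, key=lambda c: c[0])

# CompareWith="UTF8Type": names compared as strings
utf8_sorted = sorted(columns, key=lambda c: str(c[0]))

print([name for name, _ in long_sorted])  # [3, 123, 976, 832416]
print([name for name, _ in utf8_sorted])  # [123, 3, 832416, 976]
```

Same data, two completely different orders, purely because of how the names are compared.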

This sorting principle applies to SuperColumns as well but we get an extra dimension to deal with: not only do we determine how the SuperColumns are sorted in a Row but we also determine how the Columns within each SuperColumn are sorted. The sort of the Columns within each SuperColumn is determined by the value of CompareSubcolumnsWith. Here’s an example:

// Here's a view of a Row that has 2 SuperColumns in it.
    // currently they're in some random order

    { // first SuperColumn from a Row
        name: "workAddress",
        // and the columns within it
        value: {
            street: {name: "street", value: "1234 x street"},
            city: {name: "city", value: "san francisco"},
            zip: {name: "zip", value: "94107"}
        }
    },
    { // another SuperColumn from same Row
        name: "homeAddress",
        // and the columns within it
        value: {
            street: {name: "street", value: "1234 x street"},
            city: {name: "city", value: "san francisco"},
            zip: {name: "zip", value: "94107"}
        }
    }
Now if we decided to set both CompareSubcolumnsWith & CompareWith to UTF8Type we’d have the following end result:

// Now they're sorted

    { // this SuperColumn comes first b/c when treated as UTF8 strings
        // "homeAddress" sorts before "workAddress"
        name: "homeAddress",

        // the Columns within this SC are sorted by their names too
        value: {
            city: {name: "city", value: "san francisco"},
            street: {name: "street", value: "1234 x street"},
            zip: {name: "zip", value: "94107"}
        }
    },
    {
        name: "workAddress",
        value: {
            // the Columns within this SC are also sorted by their names
            city: {name: "city", value: "san francisco"},
            street: {name: "street", value: "1234 x street"},
            zip: {name: "zip", value: "94107"}
        }
    }
I want to note that in the last example CompareSubcolumnsWith & CompareWith were set to UTF8Type but this doesn’t have to be the case. You can mix and match the values of CompareSubcolumnsWith & CompareWith as necessary.

The last bit about sorting I want to mention is that you can write a custom class to perform the sorting. The sorting mechanism is pluggable… you can set CompareSubcolumnsWith and/or CompareWith to any fully-qualified class name as long as that class implements org.apache.cassandra.db.marshal.IType (aka you can write custom comparators).
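As a rough Python analogy (a real custom comparator would be a Java class plugged into Cassandra, not Python), a key function plays the same pluggable-sort role: you hand the sort a rule for comparing Column names. The timestamp-style names below are made up for illustration:

```python
# Python analogy for a pluggable comparator: in Cassandra you'd supply a
# Java class; here a key function decides the sort order instead.
def newest_first_key(name):
    # pretend each Column name is a unix-timestamp string; sort newest first
    return -int(name)

columns = ["1250438004", "1250557004", "1250558004"]
ordered = sorted(columns, key=newest_first_key)
print(ordered)  # ['1250558004', '1250557004', '1250438004']
```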

Example Schema

Alrighty – Now we’ve got all the pieces of the puzzle so let’s finally put ‘em all together and model a simple blog application. We’re going to model a simple app with the following specs:

support a single blog
we can have multiple authors
entries contain title, body, slug & publish date
entries can be associated with any # of tags
people can leave comments but can't register: they enter profile info each time (just keeping it simple)
comments have text, time submitted, commenter's name & commenter's email
must be able to show all posts in reverse chronological order (newest first)
must be able to show all posts within a given tag in reverse chronological order
Each of the following sections will describe a ColumnFamily that we’re going to define in our app’s Keyspace, show the xml definition, talk about why we picked the particular sort option(s) as well as display the data in the ColumnFamily w/ our JSON-ish notation.

Authors ColumnFamily

Modeling the Authors ColumnFamily is going to be pretty basic; we’re not going to do anything fancy here. We’re going to give each Author their own Row & key it by the Author’s full name. Inside the Rows each Column is going to represent a single “profile” attribute for the Author.

This is an example of using each Row to represent an object… in this case an Author object. With this approach each Column will serve as an attribute. Super simple. I want to point out that since there’s no “definition” of what Columns must be present within a Row we kinda sorta have a schemaless design.

We’ll be accessing the Rows in this ColumnFamily via key lookup & will grab every Column with each get (ex: we won’t ever be fetching just the first 3 Columns from the Row with key ‘foo’). This means we don’t care how the Columns are sorted, so we’ll use the BytesType sort option since it doesn’t require any validation of the Columns’ names.

<!--
    ColumnFamily: Authors
    We'll store all the author data here.

    Row Key => Author's name (implies names must be unique)
    Column Name: an attribute for the entry (title, body, etc)
    Column Value: value of the associated attribute

    Access: get author by name (aka grab all columns from a specific Row)

    Authors : { // CF
        Arin Sarkissian : { // row key
            // and the columns as "profile" attributes
            numPosts: 11,
            twitter: phatduckk,
            email: arin@example.com,
            bio: "bla bla bla"
        },
        // and the other authors
        Author 2 {
            ...
        }
    }
-->
<ColumnFamily CompareWith="BytesType" Name="Authors"/>
BlogEntries ColumnFamily

Again, this ColumnFamily is going to act as a simple key/value lookup. We’ll be storing 1 entry per Row. Within that Row the Columns will just serve as attributes of the entry: title, body, etc (just like the previous example). As a small optimization we’ll denormalize the tags into a Column as a comma separated string. Upon display we’ll just split that Column’s value to get a list of tags.

The key to each Row will be the entry’s slug. So whenever we want to grab a single entry we can simply look it up by its key (slug).

<!--
    ColumnFamily: BlogEntries
    This is where all the blog entries will go:

    Row Key => post's slug (the seo friendly portion of the uri)
    Column Name: an attribute for the entry (title, body, etc)
    Column Value: value of the associated attribute

    Access: grab an entry by slug (always fetch all Columns for Row)

    fyi: tags is a denormalization... its a comma separated list of tags.
    im not using json in order to not interfere with our
    notation but obviously you could use anything as long as your app
    knows how to deal w/ it

    BlogEntries : { // CF
        i-got-a-new-guitar : { // row key - the unique "slug" of the entry.
            title: This is a blog entry about my new, awesome guitar,
            body: this is a cool entry. etc etc yada yada
            author: Arin Sarkissian  // a row key into the Authors CF
            tags: life,guitar,music  // comma sep list of tags (basic denormalization)
            pubDate: 1250558004      // unixtime for publish date
            slug: i-got-a-new-guitar
        },
        // all other entries
        another-cool-guitar : {
            ...
            tags: guitar,
            slug: another-cool-guitar
        },
        scream-is-the-best-movie-ever : {
            ...
            tags: movie,horror,
            slug: scream-is-the-best-movie-ever
        }
    }
-->
<ColumnFamily CompareWith="BytesType" Name="BlogEntries"/>
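The read side of this ColumnFamily can be sketched in a few lines of Python, with a plain dict standing in for the CF (the data is the example data above — this is app-side pseudocode, not an actual Cassandra client call):

```python
# BlogEntries as a dict: Row key = slug, Row = dict of Column name -> value
blog_entries = {
    "i-got-a-new-guitar": {
        "title": "This is a blog entry about my new, awesome guitar",
        "body": "this is a cool entry. etc etc yada yada",
        "author": "Arin Sarkissian",
        "tags": "life,guitar,music",   # the denormalized comma-sep list
        "pubDate": "1250558004",
        "slug": "i-got-a-new-guitar",
    },
}

entry = blog_entries["i-got-a-new-guitar"]  # one get by Row key (slug)
tags = entry["tags"].split(",")             # undo the denormalization
print(tags)  # ['life', 'guitar', 'music']
```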
TaggedPosts ColumnFamily

Alright – here’s where things get a bit interesting. This ColumnFamily is going to do some heavy lifting for us. It’s going to be responsible for keeping our tag/entry associations. Not only is it going to store the associations but it’s going to allow us to fetch all BlogEntrys for a certain tag in pre-sorted order (remember all that sorting jazz we went thru?).

A design point I want to point out is that we’re going have our app logic tag every BlogEntry with the tag “__notag__” (a tag I just made up). Tagging every BlogEntry with “__notag__” will allow us to use this ColumnFamily to also store a list of all BlogEntrys in pre-sorted order. We’re kinda cheating but it allows us to use a single ColumnFamily to serve “show me all recent posts” and “show me all recent posts tagged ‘foo’”.

Given this data model if an entry has 3 tags it will have a corresponding Column in 4 Rows… 1 for each tag and one for the “__notag__” tag.

Since we’re going to want to display lists of entries in chronological order we’ll make sure each Column’s name is a time UUID and set the ColumnFamily’s CompareWith to TimeUUIDType. This will sort the Columns by time, satisfying our “chronological order” requirement. So doing stuff like “get the latest 10 entries tagged ‘foo’” is going to be a super efficient operation.
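Python's uuid1() happens to generate exactly this kind of version-1 (time-based) UUID, so here's a small sketch of the idea. Note that TimeUUIDType compares the timestamp embedded in the UUID, which is what sorting on u.time mimics below — raw byte order of a v1 UUID is not chronological:

```python
import time
import uuid

# Generate a few time-based (version 1) UUIDs, a moment apart
names = []
for _ in range(3):
    names.append(uuid.uuid1())
    time.sleep(0.001)

# Sorting by the embedded 60-bit timestamp mimics TimeUUIDType's ordering
chronological = sorted(names, key=lambda u: u.time)

# Since we generated them in order, the sort leaves them as-is
assert chronological == names
```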

Now when we want to display the 10 most recent entries (on the front page, for example) we would:

grab the last 10 Columns in the Row w/ key “__notag__” (our “all posts” tag)
loop thru that set of Columns
while looping, we know the value of each Column is the key to a Row in the BlogEntries ColumnFamily
so we go ahead and use that to grab the Row for this entry from the BlogEntries ColumnFamily. this gives us all the data for this entry
one of the Columns from the BlogEntries Row we just grabbed is named “author” and the value is the key into the Authors ColumnFamily we need to use to grab that author’s profile data.
at this point we’ve got the entry data and the author data on hand
next we’ll split the “tags” Column’s value to get a list of tags
now we have everything we need to display this post (no comments yet – this aint the permalink page)
We can go through the same procedure above using any tag… so it works for “all entries” and “entries tagged ‘foo’”. Kinda nice.
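The steps above can be sketched with plain dicts standing in for the three ColumnFamilies (the slugs and author data come from the examples in this post; the timeuuid names are fake placeholders):

```python
# TaggedPosts: Row key = tag, Columns = (timeuuid-ish name, slug) in time order
tagged_posts = {
    "__notag__": [("timeuuid_1b", "i-got-a-new-guitar"),
                  ("timeuuid_2b", "another-cool-guitar")],
}
# BlogEntries: Row key = slug
blog_entries = {
    "i-got-a-new-guitar": {"title": "...", "author": "Arin Sarkissian",
                           "tags": "life,guitar,music"},
    "another-cool-guitar": {"title": "...", "author": "Arin Sarkissian",
                            "tags": "guitar,"},
}
# Authors: Row key = author's name
authors = {"Arin Sarkissian": {"twitter": "phatduckk", "numPosts": 11}}

def recent_posts(tag="__notag__", count=10):
    posts = []
    # grab the last `count` Columns from the tag's Row
    for _, slug in tagged_posts[tag][-count:]:
        entry = blog_entries[slug]          # Column value -> BlogEntries Row
        author = authors[entry["author"]]   # "author" Column -> Authors Row
        tags = [t for t in entry["tags"].split(",") if t]
        posts.append((entry, author, tags))
    return posts

print(len(recent_posts()))  # 2
```

Swap the `"__notag__"` key for any tag and the same function serves the tag pages.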

<!--
    ColumnFamily: TaggedPosts
    A secondary index to determine which BlogEntries are associated with a tag

    Row Key => tag
    Column Names: a TimeUUIDType
    Column Value: row key into BlogEntries CF

    Access: get a slice of entries tagged 'foo'

    We're gonna use this CF to determine which blog entries to show for a tag page.
    We'll be a bit ghetto and use the string __notag__ to mean
    "don't restrict by tag". Each entry will get a column in here...
     this means we'll have to have #tags + 1 columns for each post.

    TaggedPosts : { // CF
        // blog entries tagged "guitar"
        guitar : {  // Row key is the tag name
            // column names are TimeUUIDType, value is the row key into BlogEntries
            timeuuid_1 : i-got-a-new-guitar,
            timeuuid_2 : another-cool-guitar,
        },
        // here's all blog entries
        __notag__ : {
            timeuuid_1b : i-got-a-new-guitar,

            // notice this is in the guitar Row as well
            timeuuid_2b : another-cool-guitar,

            // and this is in the movie Row as well
            timeuuid_3b : scream-is-the-best-movie-ever,
        },
        // blog entries tagged "movie"
        movie: {
            timeuuid_1c: scream-is-the-best-movie-ever
        }
    }
-->
<ColumnFamily CompareWith="TimeUUIDType" Name="TaggedPosts"/>
Comments ColumnFamily

The last thing we need to do is figure out how to model the comments. Here we’ll get to bust out some SuperColumns.

We’ll have 1 Row per entry. The key to the Row will be the entry’s slug. Within each Row we’ll have a SuperColumn for each comment. The name of each SuperColumn will be a UUID that we’ll be applying the TimeUUIDType to. This will ensure that all the comments for an entry are sorted in chronological order. The Columns within each SuperColumn will be the various attributes of the comment (commenter’s name, comment time, etc).

So, this is pretty simple as well… nothing fancy.

<!--
    ColumnFamily: Comments
    We store all comments here

    Row key => row key of the BlogEntry
    SuperColumn name: TimeUUIDType

    Access: get all comments for an entry

    Comments : {
        // comments for scream-is-the-best-movie-ever
        scream-is-the-best-movie-ever : { // row key = row key of BlogEntry
            // oldest comment first
            timeuuid_1 : { // SC Name
                // all Columns in the SC are attribute of the comment
                commenter: Joe Blow,
                email: joeb@example.com,
                comment: you're a dumb douche, the godfather is the best movie ever
                commentTime: 1250438004
            },

            ... more comments for scream-is-the-best-movie-ever

            // newest comment last
            timeuuid_2 : {
                commenter: Some Dude,
                email: sd@example.com,
                comment: be nice Joe Blow this isnt youtube
                commentTime: 1250557004
            },
        },

        // comments for i-got-a-new-guitar
        i-got-a-new-guitar : {
            timeuuid_1 : { // SC Name
                // all Columns in the SC are attribute of the comment
                commenter: Johnny Guitar,
                email: guitardude@example.com,
                comment: nice axe dawg...
                commentTime: 1250438004
            },
        }

        ..
        // then more Super CF's for the other entries
    }
-->
<ColumnFamily CompareWith="TimeUUIDType" ColumnType="Super"
    CompareSubcolumnsWith="BytesType" Name="Comments"/>
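The access pattern can be sketched in Python with a dict standing in for the CF. Sorting on the commentTime attribute here produces the same ordering that the TimeUUIDType SuperColumn names would give you for free:

```python
# Comments: Row key = entry slug, one dict per SuperColumn (= one comment)
comments = {
    "scream-is-the-best-movie-ever": [
        {"commenter": "Some Dude", "commentTime": 1250557004},
        {"commenter": "Joe Blow", "commentTime": 1250438004},
    ],
}

def comments_for(slug):
    # oldest comment first, mimicking TimeUUIDType's chronological sort
    return sorted(comments[slug], key=lambda c: c["commentTime"])

print([c["commenter"] for c in comments_for("scream-is-the-best-movie-ever")])
# ['Joe Blow', 'Some Dude']
```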
Woot!

That’s it. Our little blog app is all modeled and ready to go. It’s quite a bit to digest but in the end you end up with a pretty small chunk of XML you’ve gotta store in storage-conf.xml:

<Keyspace Name="BloggyAppy">
        <!-- other keyspace config stuff -->

        <!-- CF definitions -->
        <ColumnFamily CompareWith="BytesType" Name="Authors"/>
        <ColumnFamily CompareWith="BytesType" Name="BlogEntries"/>
        <ColumnFamily CompareWith="TimeUUIDType" Name="TaggedPosts"/>
        <ColumnFamily CompareWith="TimeUUIDType" Name="Comments"
            CompareSubcolumnsWith="BytesType" ColumnType="Super"/>
    </Keyspace>
Now all you need to do is figure out how to get the data in and out of Cassandra. That’s all accomplished via the Thrift interface. The API wiki page does a decent job of explaining what the various endpoints do so I won’t go into all those details. But, in general, you just compile the cassandra.thrift file and use the generated code to access the various endpoints. Alternatively you can take advantage of this Ruby client or this Python client.

Alrighty… hopefully all that made sense & you finally understand WTF a SuperColumn is and can start building some awesome apps.

Source : http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model