Showing posts with label algorithm. Show all posts

Tuesday, September 07, 2010

Hashing algos : Consistent Hashing

Hashing is a way of mapping keys to locations. Normally you would hash using a simple key % n algorithm, which ensures that keys are mapped evenly across n splits. The problem with this algorithm is that adding or removing a node (or a split) requires a complete rehash of all the keys. And if you have a huge data set, rehashing and redistributing all the keys is simply not feasible.

Consistent hashing is a way of hashing that ensures that adding or removing a slot or node does not change the mapping of keys to slots significantly. When using consistent hashing, only K/n keys need to be remapped on average - where K is the number of keys and n is the number of slots.

The way this works is that both keys and slots are mapped onto the edge of a circle: all slots are mapped to a series of angles around the circle, and the bucket where each item is stored is chosen by selecting the next highest angle to which an available bucket maps. So each bucket contains the keys that map to an angle between its own angle and the next smallest angle. If a bucket becomes unavailable, the keys that mapped to it are remapped to the next highest bucket (the next bucket on the circle), so only the keys which were in the bucket that became unavailable are affected. Similarly, when a bucket is added, the keys between the new bucket and the next smallest bucket are mapped to the new bucket; keys which should now be associated with the new bucket but were stored elsewhere previously become unavailable.

[figure 1: objects 1, 2, 3, 4 and slots A, B, C placed around the hash circle]
[figure 2: the same circle after a new slot D is added]

Here is an example. Objects 1, 2, 3 and 4 map to slots A, B and C. To find which slot an object belongs to, we move around the circle until we find a slot. So here objects 1 and 4 go into slot A, object 2 goes into slot B and object 3 goes into slot C. If C is removed, object 3 will belong to slot A. If another slot D is added, as shown in figure 2, it will take objects 3 and 4 and leave only object 1 belonging to A.
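The circle walk above can be sketched in a few lines with Java's TreeMap, where ceilingKey finds the next slot clockwise from a position. This is just an illustration of the lookup, with made-up slot positions instead of real hash values:

```java
import java.util.TreeMap;

public class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addSlot(String name, int position) { ring.put(position, name); }
    public void removeSlot(int position) { ring.remove(position); }

    // The next slot at or after the key's position, wrapping around the circle.
    public String slotFor(int keyPosition) {
        Integer pos = ring.ceilingKey(keyPosition);
        if (pos == null) pos = ring.firstKey(); // wrap back to the start of the circle
        return ring.get(pos);
    }

    public static void main(String[] args) {
        HashRing r = new HashRing();
        // hypothetical positions for slots A, B and C on the circle
        r.addSlot("A", 100); r.addSlot("B", 200); r.addSlot("C", 300);
        System.out.println(r.slotFor(150)); // B (next slot clockwise)
        r.removeSlot(300);                  // slot C goes away
        System.out.println(r.slotFor(250)); // A (wraps past the end of the circle)
    }
}
```

Note that only the keys that mapped to the removed slot change owners; everything else stays put, which is the whole point of consistent hashing.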

Let's look at a PHP example which does consistent hashing for us.
<?php
class ConsistentHasher
{
  private $nodeDistribution;
  private $virtualNodes;

  // nodeDistribution is the replica count per node.
  public function __construct($nodenames, $nodedistribution)
  {
    $this->nodeDistribution = $nodedistribution;
    $this->virtualNodes = array();

    for($i=0; $i<count($nodenames); $i++)
    {
      $this->addNode($nodenames[$i]);
    }
  }

  // The addNode() function takes a name and creates virtual nodes (or replicas) by 
  // appending the index of the local loop to the node name.
  // The hash value of a virtual node is an MD5 hash, base converted into an integer. 
  // The virtual node is added to a list and sorted by its value so that we ensure 
  // a lowest to highest traversal order for looking up previous and next virtual nodes
  public function addNode($name)
  {
    for($i=0; $i<$this->nodeDistribution; $i++)
    {
      //int representation of $key (8 hex chars = 4 bytes = 32 bit)
      $virtualNodeHashCode = base_convert(substr(md5($name.$i, false),0,8),16,10);
      $this->virtualNodes[$name.$i] = $virtualNodeHashCode;
    }
    asort($this->virtualNodes, SORT_NUMERIC);
  }

  public function dump()
  {
    print_r($this->virtualNodes);
    echo "\n\n";
  }

  // The removeNode() function takes a name and removes its corresponding virtual nodes 
  // from the virtualNode list.
  // We then resort the list to ensure a lowest to highest traversal order.
  public function removeNode($name)
  {
    for($i=0; $i<$this->nodeDistribution; $i++)
    {
      unset($this->virtualNodes[$name.$i]);
    }
    asort($this->virtualNodes, SORT_NUMERIC);
  }

  // The hashToNode() function takes a key and locates the node where its value resides.
  // We loop through our virtual nodes, stopping before the first one that has a
  // hash greater than that of the key’s hash.
  // If we come to the end of the virtual node list, we loop back to the beginning to 
  // form the conceptual circle.

  public function hashToNode($key)
  {
    $keyHashCode = base_convert(substr(md5($key, false),0,8),16,10);
    $virtualNodeNames = array_keys($this->virtualNodes);
    $firstNodeName = $virtualNodeNames[0];
    $lastNodeName = $virtualNodeNames[count($virtualNodeNames)-1];
    $prevNodeName = $lastNodeName;

    foreach($this->virtualNodes as $name => $hashCode)
    {
      if($keyHashCode < $hashCode)
        return $prevNodeName;

      if($name == $lastNodeName)
        return $firstNodeName;

      $prevNodeName = $name;
    }
    return $prevNodeName;
  }
}

// demo

$hash = new ConsistentHasher(array("node1","node2","node3","node4","node5"),10);
$hash->dump();

$hash->removeNode("node2");
$hash->dump();

$hash->addNode("node6");
$hash->dump();

echo $hash->hashToNode("testing123")."\n";
echo $hash->hashToNode("key1111")."\n";
echo $hash->hashToNode("data4444")."\n";
?>


Here is a library which provides consistent hashing for PHP:
http://code.google.com/p/flexihash/

Memcached is a widely used distributed cache whose clients commonly use consistent hashing to map keys to caching nodes.

References:
http://en.wikipedia.org/wiki/Consistent_hashing
http://www.osconvo.com/post/view/2010/9/1/distributed-hashing-algorithms-by-example-consistent-hashing

Friday, May 28, 2010

Quick sort algorithm

Quicksort follows the simple divide-and-conquer strategy of partitioning the data into parts and sorting each part.

Simple steps for the algorithm are:
  • Pick an element, called a pivot, from the list.
  • Reorder the list so that all elements with values less than the pivot come before the pivot, while all elements with values greater than the pivot come after it (equal values can go either way). After this partitioning, the pivot is in its final position. This is called the partition operation.
  • Recursively sort the sub-list of lesser elements and the sub-list of greater elements.

The average-case time complexity of the quicksort algorithm is O(n log n), and the worst-case time complexity is O(n^2). Quicksort has a space complexity of O(log n) for the recursion stack.

Let's look at a Java program that does quicksort.

public class QuickSort {

    public int[] data;
    public QuickSort(int[] data)
    {
        this.data = data;
    }

    public void sortMe()
    {
        this.sortMe(data, 0, data.length-1);
    }

    public void sortMe(int[] array, int start, int end)
    {
        int i=start;
        int k=end;

        if((end - start) < 1) //return if there are fewer than 2 elements
            return;

        int pivot = array[start]; //take an integer as pivot
        while(k > i) //while the left and right indices have not met
        {
            //from left to right find first element greater than pivot
            while(array[i] <= pivot && i <= end && k > i) 
                i++;
            //from right to left find first element smaller than pivot
            while(array[k] > pivot && k >= start && k >= i)
                k--;
            //if left index is still smaller than right, swap the elements
            if(k > i)
                swap(array, i, k);
        }
        //swap the last element in the left partition with the pivot.
        swap(array, start, k);

        //quicksort left partition
        sortMe(array, start, k-1);
        //quicksort right partition
        sortMe(array, k+1, end);
    }

    public void swap(int[] array, int idx1, int idx2)
    {
        int temp=array[idx1];
        array[idx1] = array[idx2];
        array[idx2] = temp;
    }

    public void printMe(int[] data)
    {
        for(int x=0; x<data.length; x++)
        {
            System.out.print(data[x]+", ");
        }
        System.out.println();
    }

    public static void main(String[] args)
    {
        int[] data = {9,2,6,4,11,88,66,44,55,32,23,5,32,10};
        QuickSort qs = new QuickSort(data);
        System.out.print("Before Sorting : ");
        qs.printMe(data);
        qs.sortMe();        
        System.out.print("After Sorting : ");
        qs.printMe(qs.data);
    }

}

Tuesday, December 30, 2008

Priority Queue & Heapsort

A priority queue is a queue where each item has a priority associated with it. And the item with the highest priority is at the top of the queue. So, you will be removing the highest priority items from the queue first.

A priority queue can be implemented using a heap, where the heap is implemented using a complete binary tree. A complete binary tree has the heap property if the root has higher priority than its children and each subtree also satisfies the same heap property.

Now, let's try to understand this in plain English.

A complete binary tree is a tree filled level by level, from left to right. It can also be seen that for each node, the parent is always bigger than its child nodes; that is the heap property. And since the root has the highest priority and is the first element, this is also a priority queue. If you remove the root, you have to re-arrange the elements so that the new root again has the highest priority.

The benefit of using a heap structure is that inserting a new element and removing the root are both handled in O(log n) time, that is, the time taken to re-arrange the elements. When the highest priority element is removed, we put the last element in the tree (the lowest, right-most element) at the root position. This new root is compared with both its children, and if it has lower priority than either of them, it is exchanged with the higher priority child. This goes on until the node reaches a place where its children have lower priority and its parent has higher priority. When inserting a new element, we place it at the last available node (remember, the tree should always be a complete tree) and move it up (basically the reverse of removing the root).
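Java ships with exactly this structure in java.util.PriorityQueue, a binary heap backed by an array. It is a min-heap by default, so a reversed comparator gives the highest-priority-first behaviour described above. A quick sketch:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class PriorityQueueDemo {
    public static void main(String[] args) {
        // reverseOrder() turns the default min-heap into a max-heap
        PriorityQueue<Integer> pq = new PriorityQueue<>(Comparator.reverseOrder());
        for (int v : new int[]{5, 16, 9, 11, 4})
            pq.add(v);                 // each insert is O(log n)
        System.out.println(pq.poll()); // 16 - highest priority comes out first
        System.out.println(pq.poll()); // 11
    }
}
```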

A heap of n elements can easily be stored in n sequential locations in an array. With positions numbered from 1, the left child of the node at position k is placed at position 2k, and the right child at position 2k+1. (With 0-based array indices, as in the pseudo code below, the children of the node at index k are at 2k+1 and 2k+2.)

So, for the above heap, the elements would be stored as

16, 11, 9, 10, 5, 6, 8, 1, 2, 4
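We can verify that this array really satisfies the heap property with a quick check; a small sketch using 0-based indices, where the children of the node at index k sit at 2k+1 and 2k+2:

```java
public class HeapCheck {
    // true if every parent is >= both of its children (max-heap property)
    static boolean isMaxHeap(int[] a) {
        for (int k = 0; 2 * k + 1 < a.length; k++) {
            if (a[k] < a[2 * k + 1]) return false;                         // left child
            if (2 * k + 2 < a.length && a[k] < a[2 * k + 2]) return false; // right child
        }
        return true;
    }

    public static void main(String[] args) {
        int[] heap = {16, 11, 9, 10, 5, 6, 8, 1, 2, 4};
        System.out.println(isMaxHeap(heap)); // true
    }
}
```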



Let's check out the pseudo code for building and sorting the heap (heapPop builds the heap by sifting elements down, heapPush builds it by sifting them up, and heapSort uses the heap to sort the array):

function heapPop(arr, count) // build the heap: sift every parent down, last parent first
{
  start = (count-2)/2;
  while(start >= 0)
  {
    shiftDown(arr, start, count-1);
    start = start-1;
  }
}

function shiftDown(arr, start, end)
{
  root = start;
  while( root*2+1 <= end ) // while root has at least one child
  {
    child = root*2+1; // left child
    if ( child+1 <= end ) and ( arr[child] < arr[child+1] )
      child = child+1; // pick the larger of the two children
    if ( arr[root] < arr[child] )
    {
      swap(arr[root], arr[child]);
      root = child;
    }
    else
      return;
  }
}

function heapPush( arr, count) // alternative way to build the heap: sift each new element up
{
  end = 1;
  while (end < count )
  {
    shiftUp(arr, 0, end);
    end = end+1;
  }
}

function shiftUp(arr, start, end)
{
  child = end;
  while (child > start )
  {
    parent = floor((child-1)/2);
    if(arr[parent] < arr[child])
    {
      swap(arr[parent],arr[child]);
      child = parent;
    }
    else
      return;
  }
}

function heapSort(arr, count) // input is unsorted array "arr" of "count" elements
{
  heapPop(arr, count); // build the heap
  end = count-1;
  while(end > 0)
  {
    swap(arr[end], arr[0]); // move the current max to its final position
    end = end-1;
    shiftDown(arr, 0, end); // restore the heap property on the remaining elements
  }
}
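The pseudo code above translates directly into Java. Here is a compact sketch: build the heap by sifting every parent down, then repeatedly swap the root (the max) to the end and re-heapify the shrunk prefix:

```java
import java.util.Arrays;

public class HeapSort {
    public static void heapSort(int[] arr) {
        int n = arr.length;
        // heapify: sift down every non-leaf node, last parent first
        for (int start = (n - 2) / 2; start >= 0; start--)
            shiftDown(arr, start, n - 1);
        // repeatedly move the max to the end and restore the heap
        for (int end = n - 1; end > 0; end--) {
            int tmp = arr[0]; arr[0] = arr[end]; arr[end] = tmp;
            shiftDown(arr, 0, end - 1);
        }
    }

    static void shiftDown(int[] arr, int start, int end) {
        int root = start;
        while (2 * root + 1 <= end) {      // while root has at least one child
            int child = 2 * root + 1;      // left child
            if (child + 1 <= end && arr[child] < arr[child + 1])
                child++;                   // pick the larger of the two children
            if (arr[root] < arr[child]) {
                int tmp = arr[root]; arr[root] = arr[child]; arr[child] = tmp;
                root = child;
            } else {
                return;
            }
        }
    }

    public static void main(String[] args) {
        int[] data = {16, 11, 9, 10, 5, 6, 8, 1, 2, 4};
        heapSort(data);
        System.out.println(Arrays.toString(data)); // [1, 2, 4, 5, 6, 8, 9, 10, 11, 16]
    }
}
```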