Showing posts with label system architecture. Show all posts

Thursday, August 05, 2010

LinkedIn data infrastructure

Jay Kreps of LinkedIn presented some informative details of how they process data at the recent Hadoop Summit. Kreps described how LinkedIn crunches 120 billion relationships per day and blends large scale data computation with high volume, low latency site serving.

Much of LinkedIn's important data is offline - it moves fairly slowly. So they use daily batch processing with Hadoop as an important part of their calculations. For example, they pre-compute data for their "People You May Know" product this way, scoring 120 billion relationships per day in a MapReduce pipeline of 82 Hadoop jobs that requires 16 TB of intermediate data. This job uses a statistical model to predict the probability of two people knowing each other. Interestingly, they use Bloom filters to speed up large joins, yielding a 10x performance improvement.
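The talk doesn't include code, but the Bloom-filter trick is easy to sketch. Below is a minimal, hypothetical Python illustration (the toy BloomFilter class and the member ids are my own, not LinkedIn's implementation): the filter is built from the smaller side of a join and used to cheaply discard most non-matching keys from the larger side before the expensive join/scoring step.

import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash functions over a fixed-size bit array."""
    def __init__(self, size_bits=1 << 20, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# Build the filter from the smaller join side (e.g. members with recent activity),
# then use it to discard most non-matching keys from the larger side before the
# expensive join. False positives are rechecked exactly; false negatives cannot
# occur, so no candidate pairs are lost.
active_members = {"m1", "m7", "m42"}
bloom = BloomFilter()
for member_id in active_members:
    bloom.add(member_id)

candidate_edges = [("m1", "m9"), ("m3", "m5"), ("m42", "m7")]
survivors = [edge for edge in candidate_edges if bloom.might_contain(edge[0])]
exact = [edge for edge in survivors if edge[0] in active_members]  # recheck survivors
print(exact)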

They have two engineers who work on this pipeline and are able to test five new algorithms per week. To achieve this rate of change, they rely on A/B testing to compare new approaches against old ones, using a "fly by instruments" approach to optimize results. Achieving performance improvements also requires operating on large-scale data, so they rely on large-scale cluster processing. To get there they moved from custom graph-processing code to Hadoop MapReduce code - this required some thoughtful design, since many graph algorithms don't translate into MapReduce in a straightforward manner.

LinkedIn invests heavily in open source, with the goal of building on best-in-class components and recruiting community involvement. Two of these open source projects were presented as central to their data infrastructure. Azkaban is an open source workflow system for Hadoop, providing cron-like scheduling and make-like dependency analysis, including restart. It is used to control the ETL jobs that push database and event logs to their edge server storage, Voldemort.

Voldemort is LinkedIn's NoSQL key/value storage engine. Every day they push an updated multi-billion-edge probabilistic relationship graph out to their live site, where it is queried when rendering web pages. This data is read-only: it is computed in the cluster jobs, then filtered through a faceted search in real time, e.g., restricting to certain companies of interest, or excluding people the user has indicated they do not know. Voldemort grew out of their problems using databases for this workload, which required sharding and devolved into a system where moving data around was a full-time manual job. Voldemort is fully distributed and decentralized, and supports partitioning and failover.

LinkedIn updates their live servers with a large-scale parallel fetch of results from Hadoop into Voldemort, warms up the cache, then performs an atomic switchover to the new day's data on each server separately. They keep the previous day's data on the server to allow near-instantaneous fallback in case of problems with new data sets. LinkedIn builds an index structure in their Hadoop pipeline - this produces a multi-TB lookup structure that uses perfect hashing (requiring only 2.5 bits per key). This process trades cluster computing resources for faster server responses; it takes LinkedIn about 90 minutes to build a 900 GB data store on a 45-node development cluster. They use Hadoop to process massive batch workloads, taking their Hadoop cluster down for upgrades periodically, whereas their Voldemort servers never go down.
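As a rough sketch of the switchover idea (the class, paths and method names below are assumptions for illustration, not Voldemort's actual mechanism): each store keeps the current and previous data versions and swaps a pointer atomically, so fallback is just swapping back.

import threading

class ReadOnlyStoreSwapper:
    """Sketch of an atomic current/previous data-version swap with fallback."""
    def __init__(self, initial_version_path):
        self._lock = threading.Lock()
        self._current = initial_version_path
        self._previous = None

    def swap_in(self, new_version_path):
        # In a real store this is where the new version would be fetched,
        # verified and the cache warmed before going live.
        with self._lock:
            self._previous = self._current
            self._current = new_version_path

    def rollback(self):
        with self._lock:
            if self._previous is None:
                raise RuntimeError("no previous version to fall back to")
            self._current, self._previous = self._previous, self._current

    def current_version(self):
        with self._lock:
            return self._current

store = ReadOnlyStoreSwapper("/data/pymk/version-0041")
store.swap_in("/data/pymk/version-0042")   # today's Hadoop output
store.rollback()                            # near-instant fallback if the new data is bad
print(store.current_version())              # /data/pymk/version-0041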

For more details, see the slides from the presentation.

Copied from http://www.infoq.com/news/2010/08/linkedin-data-infrastructure

Friday, May 01, 2009

How Facebook stores billions of photos

The new photo infrastructure merges the photo serving tier and storage tier into one physical tier. It implements a HTTP based photo server which stores photos in a generic object store called Haystack. The main requirement for the new tier was to eliminate any unnecessary metadata overhead for photo read operations, so that each read I/O operation was only reading actual photo data (instead of filesystem metadata). Haystack can be broken down into these functional layers -

* HTTP server
* Photo Store
* Haystack Object Store
* Filesystem
* Storage

In the following sections we look closely at each of the functional layers from the bottom up.

Storage

Haystack is deployed on top of commodity storage blades. The typical hardware configuration of a 2U storage blade is –

* 2 x quad-core CPUs
* 16GB – 32GB memory
* hardware raid controller with 256MB – 512MB of NVRAM cache
* 12+ 1TB SATA drives

Each storage blade provides around 10TB of usable space, configured as a RAID-6 partition managed by the hardware RAID controller. RAID-6 provides adequate redundancy and excellent read performance while keeping the storage cost down. The poor write performance is partially mitigated by the RAID controller NVRAM write-back cache. Since the reads are mostly random, the NVRAM cache is fully reserved for writes. The disk caches are disabled in order to guarantee data consistency in the event of a crash or a power loss.

Filesystem

Haystack object stores are implemented on top of files stored in a single filesystem created on top of the 10TB volume.

Photo read requests result in read() system calls at known offsets in these files, but in order to execute the reads, the filesystem must first locate the data on the actual physical volume. Each file in the filesystem is represented by a structure called an inode which contains a block map that maps the logical file offset to the physical block offset on the physical volume. For large files, the block map can be quite large depending on the type of the filesystem in use.

Block based filesystems maintain mappings for each logical block, and for large files, this information will not typically fit into the cached inode and is stored in indirect address blocks instead, which must be traversed in order to read the data for a file. There can be several layers of indirection, so a single read could result in several I/Os depending on whether or not the indirect address blocks are cached.

Extent based filesystems maintain mappings only for contiguous ranges of blocks (extents). A block map for a contiguous large file could consist of only one extent which would fit in the inode itself. However, if the file is severely fragmented and its blocks are not contiguous on the underlying volume, its block map can grow large as well. With extent based filesystems, fragmentation can be mitigated by aggressively allocating a large chunk of space whenever growing the physical file.

Currently, the filesystem of choice is XFS, an extent based filesystem providing efficient file preallocation.

Haystack Object Store

Haystack is a simple log structured (append-only) object store containing needles representing the stored objects. A Haystack consists of two files – the actual haystack store file containing the needles, plus an index file. The following figure shows the layout of the haystack store file:


The first 8KB of the haystack store is occupied by the superblock. Immediately following the superblock are needles, with each needle consisting of a header, the data, and a footer:


A needle is uniquely identified by its <Offset, Key, Alternate Key, Cookie> tuple, where the offset is the needle offset in the haystack store. Haystack doesn’t put any restriction on the values of the keys, and there can be needles with duplicate keys. The following figure shows the layout of the index file -





There is a corresponding index record for each needle in the haystack store file, and the order of the needle index records must match the order of the associated needles in the haystack store file. The index file provides the minimal metadata required to locate a particular needle in the haystack store file. Loading and organizing index records into a data structure for efficient lookup is the responsibility of the Haystack application (Photo Store in our case). The index file is not critical, as it can be rebuilt from the haystack store file if required. The main purpose of the index is to allow quick loading of the needle metadata into memory without traversing the larger Haystack store file, since the index is usually less than 1% the size of the store file.
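The figures referenced above are not reproduced here. Purely as an illustration (the field names, widths and ordering below are assumptions, not the real Haystack on-disk format), the fixed parts of a needle and an index record could be packed like this in Python:

import struct
import zlib

# Assumed needle header/footer and index-record layouts -- illustrative only.
NEEDLE_HEADER = struct.Struct("<IQQIBQ")   # magic, key, alternate key, cookie, flags, data size
NEEDLE_FOOTER = struct.Struct("<II")       # magic, checksum of the data
INDEX_RECORD  = struct.Struct("<QQIQ")     # key, alternate key, flags, offset in store file

def pack_needle(key, alt_key, cookie, flags, data):
    header = NEEDLE_HEADER.pack(0xFACE0001, key, alt_key, cookie, flags, len(data))
    footer = NEEDLE_FOOTER.pack(0xFACE0002, zlib.crc32(data) & 0xFFFFFFFF)
    return header + data + footer

def pack_index_record(key, alt_key, flags, offset):
    return INDEX_RECORD.pack(key, alt_key, flags, offset)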

Haystack Write Operation

A Haystack write operation synchronously appends new needles to the haystack store file. After the needles are committed to the larger Haystack store file, the corresponding index records are then written to the index file. Since the index file is not critical, the index records are written asynchronously for faster performance.

The index file is also periodically flushed to the underlying storage to limit the extent of the recovery operations caused by hardware failures. In the case of a crash or a sudden power loss, the haystack recovery process discards any partial needles in the store and truncates the haystack store file to the last valid needle. Next, it writes missing index records for any trailing orphan needles at the end of the haystack store file.

Haystack doesn’t allow overwriting an existing needle offset, so if a needle’s data needs to be modified, a new version of it must be written with the same <Key, Alternate Key, Cookie> values. Applications can then assume that among the needles with duplicate keys, the one with the largest offset is the most recent one.
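A minimal sketch of this write path, reusing the pack_needle/pack_index_record helpers from the layout sketch above (again an assumption-laden illustration, not Facebook's code): the needle append is fsync'd synchronously, while the index record is written without an fsync and left to periodic flushes.

import os

SUPERBLOCK_SIZE = 8 * 1024   # first 8KB of the store file is the superblock

def append_needle(store_file, index_file, key, alt_key, cookie, data):
    """Synchronously append a needle; write the index record without fsync."""
    needle = pack_needle(key, alt_key, cookie, flags=0, data=data)
    offset = store_file.seek(0, os.SEEK_END)
    store_file.write(needle)
    store_file.flush()
    os.fsync(store_file.fileno())             # the needle itself is committed synchronously

    index_file.write(pack_index_record(key, alt_key, 0, offset))
    index_file.flush()                        # no fsync: the index is rebuildable, so async is fine
    return offset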

Haystack Read Operation

The parameters passed to the haystack read operation include the needle offset, key, alternate key, cookie and the data size. Haystack then adds the header and footer lengths to the data size and reads the whole needle from the file. The read operation succeeds only if the key, alternate key and cookie match the ones passed as arguments, if the data passes checksum validation, and if the needle has not been previously deleted (see below).
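A sketch of the read path, using the same assumed layout (NEEDLE_HEADER/NEEDLE_FOOTER from the earlier sketch): seek to the supplied offset, read header, data and footer in a single read, then verify key, cookie, deletion flag and checksum before returning the data.

import zlib

DELETED_FLAG = 0x1   # assumed position of the "deleted" bit

def read_needle(store_file, offset, key, alt_key, cookie, data_size):
    total = NEEDLE_HEADER.size + data_size + NEEDLE_FOOTER.size
    store_file.seek(offset)
    raw = store_file.read(total)              # one read covers header, data and footer
    if len(raw) < total:
        raise IOError("short read")

    _, n_key, n_alt, n_cookie, flags, n_size = NEEDLE_HEADER.unpack_from(raw, 0)
    data = raw[NEEDLE_HEADER.size:NEEDLE_HEADER.size + data_size]
    _, checksum = NEEDLE_FOOTER.unpack_from(raw, NEEDLE_HEADER.size + data_size)

    if (n_key, n_alt, n_cookie) != (key, alt_key, cookie):
        raise ValueError("key/cookie mismatch")
    if n_size != data_size:
        raise ValueError("size mismatch")
    if flags & DELETED_FLAG:
        raise ValueError("needle has been deleted")
    if zlib.crc32(data) & 0xFFFFFFFF != checksum:
        raise ValueError("checksum mismatch")
    return data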

Haystack Delete Operation

The delete operation is simple – it marks the needle in the haystack store as deleted by setting a “deleted” bit in the flags field of the needle. However, the associated index record is not modified in any way so an application could end up referencing a deleted needle. A read operation for such a needle will see the “deleted” flag and fail the operation with an appropriate error. The space of a deleted needle is not reclaimed in any way. The only way to reclaim space from deleted needles is to compact the haystack (see below).

Photo Store Server

Photo Store Server is responsible for accepting HTTP requests and translating them to the corresponding Haystack store operations. In order to minimize the number of I/Os required to retrieve photos, the server keeps an in-memory index of all photo offsets in the haystack store file. At startup, the server reads the haystack index file and populates the in-memory index. With hundreds of millions of photos per node (and the number will only grow with larger capacity drives), we need to make sure that the index will fit into the available memory. This is achieved by keeping a minimal amount of metadata in memory, just the information required to locate the images.

When a user uploads a photo, it is assigned a unique 64-bit id. The photo is then scaled down to 4 different sizes. Each scaled image has the same random cookie and 64-bit key, and the logical image size (large, medium, small, thumbnail) is stored in the alternate key. The upload server then calls the photo store server to store all four images in the Haystack.

The in-memory index keeps the following information for each photo:

Haystack uses the open source Google sparse hash data structure to keep the in-memory index small, since it only has 2 bits of overhead per entry.
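Sketching the in-memory index with a plain Python dict (Google sparsehash is swapped for a dict purely for illustration, and since the per-photo fields are not spelled out above, the flags/size/offset triple below is an assumption):

# Assumed per-photo metadata: flags, size, and offset in the store file.
# Keyed by (photo key, alternate key) so all four scaled sizes can coexist.
in_memory_index = {}

def load_index(index_records):
    """index_records: iterable of (key, alt_key, flags, offset, size) tuples."""
    for key, alt_key, flags, offset, size in index_records:
        in_memory_index[(key, alt_key)] = (flags, size, offset)

def locate(key, alt_key):
    entry = in_memory_index.get((key, alt_key))
    if entry is None or entry[2] == 0:        # offset 0 marks a known-deleted photo
        return None
    return entry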

Photo Store Write/Modify Operation

A write operation writes photos to the haystack and updates the in-memory index with the new entries. If the index already contains records with the same keys, then this is a modification of existing photos, and only the index records' offsets are modified to reflect the location of the new images in the haystack store file. Photo store always assumes that if there are duplicate images (images with the same key), the one stored at the larger offset is the valid one.

Photo Store Read Operation

The parameters passed to a read operation include the haystack id and the photo key, size and cookie. The server performs a lookup in the in-memory index based on the photo key and retrieves the offset of the needle containing the requested image. If found, it calls the haystack read operation to get the image. As noted above, the haystack delete operation doesn't update the index file record, so a freshly populated in-memory index can contain stale entries for previously deleted photos. A read of a previously deleted photo will fail, and the in-memory index is updated to reflect that by setting the offset of the particular image to zero.

Photo Store Delete Operation

After calling the haystack delete operation, the in-memory index is updated by setting the image offset to zero, signifying that the particular image has been deleted.

Compaction

Compaction is an online operation which reclaims the space used by the deleted and duplicate needles (needles with the same key). It creates a new haystack by copying needles while skipping any duplicate or deleted entries. Once done it swaps the files and in-memory structures.
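A toy version of the compaction idea (the real operation is online and swaps files and in-memory structures atomically, which this sketch ignores): walk the old store in order, let the largest offset win for each key, and keep only needles that are not marked deleted.

def compact(needles):
    """needles: iterable of (offset, key, alt_key, deleted, payload), in store order.

    Returns the surviving payloads keyed by (key, alt_key); duplicates are
    resolved in favour of the largest offset, and deleted needles are dropped.
    """
    latest = {}
    for offset, key, alt_key, deleted, payload in needles:
        latest[(key, alt_key)] = (offset, deleted, payload)   # later (larger offset) wins
    return {
        ident: payload
        for ident, (offset, deleted, payload) in latest.items()
        if not deleted
    }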

HTTP Server

The HTTP framework we use is the simple evhttp server provided with the open source libevent library. We use multiple threads, with each thread being able to serve a single HTTP request at a time. Because our workload is mostly I/O bound, the performance of the HTTP server is not critical.

Summary

Haystack presents a generic HTTP-based object store containing needles that map to stored opaque objects. Storing photos as needles in the haystack eliminates the metadata overhead by aggregating hundreds of thousands of images in a single haystack store file. This keeps the metadata overhead very small and allows us to store each needle’s location in the store file in an in-memory index. This allows retrieval of an image’s data in a minimal number of I/O operations, eliminating all unnecessary metadata overhead.

Saturday, April 28, 2007

LiveJournal - system architecture

Let's discuss the system architecture of LiveJournal.

LiveJournal, or LJ for short, kicked off as a hobby project in April 1999 and was built entirely on open source. It reached 2.8 million accounts in April 2004 and 6.8 million accounts in April 2005; currently it has more than 10 million accounts. It caters to several thousand hits per second and a large number of MySQL queries.

Here is a complex diagram which roughly outlines the architecture of LJ.

The technologies visible here are -

Caching - memcached
MySQL clusters - HA & LB
HTTP load balancing - using Perlbal
MogileFS - distributed file system

Let's start off with MySQL...

A single MySQL server won't be able to handle a large number of reads and writes; as reads and writes increase, the server slows down. The next stage is to have two servers in a master-slave architecture, in which the master handles all inserts and the slaves are read-only. But then queries have to be spread out in such a manner that the replication lag between master and slave (though very small) is handled. As the number of database and web servers increases, chaos increases: the site is fast for a while and then slow again, and there is a need for more servers with higher configurations. Adding slaves does not reduce the write load either, since every write still has to be replayed on every slave. Eventually you reach a situation where the number of writes is very large compared to the number of reads, resulting in heavy I/O and low CPU utilization.

The best way to handle such a situation is to divide the database. The way LJ did this was by creating user clusters: each user was assigned a cluster number, and each cluster had multiple machines in a master-slave arrangement. The first query finds the cluster number for that user from the global database, and subsequent queries for that user can then be directed to that user cluster. Of course, a few issues such as uniqueness of user ids and moving users between clusters had to be tackled. Caching of MySQL connections and using the MySQL query cache to cache query results added to the better performance of the site.
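A hedged sketch of that routing logic (the table, host and function names below are made up for illustration): look up the user's cluster number once in the global database, then send reads to a slave and writes to the master of that cluster.

import random

# Assumed topology: one global DB plus N user clusters, each a master with slaves.
clusters = {
    1: {"master": "db-c1-master", "slaves": ["db-c1-s1", "db-c1-s2"]},
    2: {"master": "db-c2-master", "slaves": ["db-c2-s1"]},
}

def cluster_for_user(global_db, user_id):
    # In real life this would be a query against the global database, e.g.
    # SELECT clusterid FROM usermap WHERE userid = %s
    return global_db[user_id]

def connection_for(user_id, global_db, for_write=False):
    cluster = clusters[cluster_for_user(global_db, user_id)]
    return cluster["master"] if for_write else random.choice(cluster["slaves"])

global_db = {1001: 1, 1002: 2}
print(connection_for(1001, global_db))                  # a slave in cluster 1
print(connection_for(1002, global_db, for_write=True))  # cluster 2 master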

Again, the problem was the single point of failure in the master databases: if any master database died, the site would go down. To avoid this situation, a master-master cluster was created; in case of any problem, the other master comes into play and handles all active connections.

Which database engine to use - InnoDB or MyISAM? InnoDB allows concurrent reads and writes and so is comparatively fast, whereas MyISAM takes table-level locks and so is not as fast as InnoDB.

And then there is MySQL Cluster, which is an in-memory engine. It requires about 2-4x of RAM for the dataset, so it is good for handling small data sets only.

An even better way of storing the database is by using shared storage - SAN, SCSI, DRBD. You turn a pair of InnoDB machines into a cluster that looks like a single box from the outside, with a floating IP address. Heartbeat is used to move the IP, mount/unmount the filesystem, and start/stop MySQL. DRBD can be used to sync one machine's block device with another; this requires a dedicated gigabit link between the two machines to handle the high volume of data transfer.

Cache

Memcached is used to cache records which have already been computed, for frequent access. It is an open source distributed caching system, instances of which can be run on any machine wherever free memory is available. It also provides simple APIs for different languages like Java, PHP, Perl, Python and Ruby. And it is extremely fast.

LJ created 28 instances of memcached on 12 machines (not dedicated) and was able to cache 30 GB of data. This cache was getting a hit rate of 90-93%, which reduced the number of queries to the database to a great extent. They started by caching things that were very frequently accessed and aim at caching almost everything possible. With a cache there is the extra overhead of keeping it updated.
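The usage pattern described here is the standard cache-aside one; a minimal sketch with a plain dict standing in for a memcached client and placeholder database functions (all names below are made up for illustration):

cache = {}                       # stand-in for a memcached client

def load_from_db(key):
    # Placeholder for the real (expensive) MySQL query.
    return {"user": key, "friends": []}

def save_to_db(key, value):
    # Placeholder for the real MySQL write.
    pass

def get_with_cache(key):
    value = cache.get(key)
    if value is None:            # cache miss: fall back to the database
        value = load_from_db(key)
        cache[key] = value       # populate so the next read is a hit
    return value

def update(key, new_value):
    save_to_db(key, new_value)
    cache[key] = new_value       # the extra overhead: keeping the cache in sync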

http load balancing

After trying a large number of reverse proxies, the LJ people were unable to find anything which satisfied their needs, so they built their own reverse proxy - Perlbal - a small, fast, manageable HTTP web server which can do internal redirects.
It is single-threaded, asynchronous and event-based. It handles dead nodes, and works in multiple modes - static web server, reverse proxy and plug-ins.

It allows persistent connections and has no complex load-balancing logic - it uses whatever backend is free. It connects fast and has multiple queues - for free and paid users.


MogileFS - Distributed File System

Files belong to classes. MogileFS tracks which disks files are on and keeps replicas on devices on different hosts. It has libraries available for most languages - PHP, Perl, Java, Python.

Clients, trackers, a MySQL database cluster and storage nodes were all brought under MogileFS. It handles automatic file replication, deletion and so on.

I have put in only the major points; finer details can be found in the link below.


source:
http://danga.com/words/2005_oscon/oscon-2005.pdf

Saturday, August 05, 2006

mysql partitioning

Table partitioning is a concept which allows a table's data to be stored in multiple locations on one or more disks, while access to the data remains through ordinary SQL queries. The way the data is split up and retrieved is invisible to the end user.

MySQL allows the user to select the rules by which data is spread over the multiple partitions of a table. MySQL supports horizontal partitioning, whereby the rows of a table are spread across multiple locations; vertical partitioning is not supported. So using MySQL partitioning, you cannot assign different columns to different physical partitions of a table.

Benefits of partitioning a table :
1. Being able to store more data in one table than can be held on a single disk or filesystem partition (though current operating systems already allow extremely large files on disk).
2. Data that loses its usefulness can often easily be removed from the table by dropping the partition containing only that data. Adding new data can likewise be facilitated by adding a new partition specially for that data.
3. Some queries can be greatly optimized by virtue of the fact that data satisfying a given WHERE clause can be stored on only one or a few partitions, thereby excluding the remaining partitions from the search. Data can also be re-organized to move more frequently accessed data into one partition.
4. Queries involving aggregate functions such as SUM() and COUNT() can easily be parallelized.
5. Greater query throughput can be achieved by spreading data seeks over multiple disks.

I had used partitioning with MyISAM tables and found it very useful. The complete logic of how rows were divided, stored and accessed was invisible to me. When I created some partitions for the table, the same number of .MYI and .MYD files were created, though the table definition remained in the single .frm file. A separate .PAR file was also created for the table, which I think is used to record how the table is partitioned.

Moreover, an insert on the table locks all the partitions for a short time - that is, the time needed by MySQL to decide in which partition the record should be put.

The ways in which a table can be partitioned are listed below:

RANGE partitioning -> Here table partitions are defined based on a range of data to be stored in each partition. For example -

CREATE TABLE abc (
name VARCHAR(100) NOT NULL,
email VARCHAR(40),
dob DATE NOT NULL
)
PARTITION BY RANGE( YEAR(dob) ) (
PARTITION p0 VALUES LESS THAN (1960),
PARTITION p1 VALUES LESS THAN (1970),
PARTITION p2 VALUES LESS THAN (1980),
PARTITION p3 VALUES LESS THAN (1990),
PARTITION p4 VALUES LESS THAN MAXVALUE
);

This would create 5 partitions in the table abc. Partition p0 will contain records whose year of birth is before 1960, partition p1 will contain records whose year of birth is from 1960 to 1969, and so on. In case a row is entered whose dob cannot be accommodated in any of the partitions, an error is thrown and the data is not inserted in the table.

LIST partitioning -> Here table partitions are defined based on a list of values for each partition. Rows whose value matches one of the values in a partition's list are inserted into that partition.

CREATE TABLE abcs (
name VARCHAR(100),
dob DATE NOT NULL DEFAULT '1979-01-01',
age INT
)
PARTITION BY LIST(age) (
PARTITION p0 VALUES IN (3,5,6,9,17),
PARTITION p1 VALUES IN (1,2,10,11,19,20),
PARTITION p2 VALUES IN (4,12,13,14,18),
PARTITION p3 VALUES IN (7,8,15,16)
);

This would create 4 partitions in the table abcs. Partition p0 would contain records whose age is 3, 5, 6, 9 or 17. Again, if you try to insert a record whose age is not in any of the partition lists created, an error is thrown and the record is not inserted.

HASH partitioning -> Here table data is evenly divided between all partitions using a simple function. For example -

CREATE TABLE abc (
name VARCHAR(100),
dob DATE NOT NULL DEFAULT '1970-01-01',
age INT
)
PARTITION BY HASH( YEAR(dob) )
PARTITIONS 4;

Here 4 partitions are created and data is distributed using the following formula
Partition number = MOD(YEAR(dob),number_of_partitions);
So since my year of birth is 1980, my record will be in partition number 0 (MOD(1980, 4) = 0).

LINEAR HASH partitioning -> This is similar to hash partitioning except that the algorithm used to divide the data is different. The syntax is also almost the same: we use PARTITION BY LINEAR HASH instead of PARTITION BY HASH.

The algorithm used is :

Given an expression expr, the partition in which the record is stored when linear hashing is used is partition number N from among num partitions, where N is derived according to the following algorithm:

  • Find the next power of 2 greater than num. We call this value V; it can be calculated as:
    V = POWER(2, CEILING(LOG(2, num)))

  • Set N = F(column_list) & (V - 1).

  • While N >= num
    {
    Set V = CEIL(V / 2)
    Set N = N & (V - 1)
    }
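The same algorithm written out as a small Python function (a direct transcription of the steps above, not MySQL source code):

import math

def linear_hash_partition(value, num_partitions):
    """Partition number for LINEAR HASH, following the steps above."""
    v = 2 ** math.ceil(math.log2(num_partitions))   # next power of 2 >= num
    n = value & (v - 1)
    while n >= num_partitions:
        v = math.ceil(v / 2)
        n = n & (v - 1)
    return n

# Example: 6 partitions, YEAR(dob) = 1980 -> V = 8, N = 1980 & 7 = 4,
# which is already less than 6, so the record goes to partition 4.
print(linear_hash_partition(1980, 6))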



The advantage in partitioning by linear hash is that the adding, dropping, merging, and splitting of partitions is made much faster, which can be beneficial when dealing with tables containing extremely large amounts of data. The disadvantage is that data is less likely to be evenly distributed between partitions as compared with the distribution obtained using regular hash partitioning.

KEY partitioning ->
Partitioning by key is similar to partitioning by hash, except that where hash partitioning employs a user-defined expression, the hashing function for key partitioning is supplied by the MySQL server. The syntax used is PARTITION BY KEY instead of PARTITION BY HASH.

In most cases either a primary key or a unique key is used to create partitions.

Mysql also supports sub partitioning whereby a partition can be divided into further sub partitions of similar type.

I think I am creating a very long blog post about MySQL partitioning. However, there is a list of syntaxes available on the MySQL site to alter, analyze, optimize, rebuild, check and repair partitions. Please check http://dev.mysql.com/doc/refman/5.1/en/partitioning.html. It will give a more detailed idea of whatever ranting I have done over here...

I will be signing off from here now. Next blog may or may not contain more info about partitions...

Anyways will keep blogggging...

Sunday, April 02, 2006

Work Work Work ...

Have been working straight for 4 days... getting only 2-4 hours of sleep in the night.

Well, I am a hard worker, but not that hard. Now the question comes: what is the urgency for this hard work? And I would say, buddy, it is difficult to maintain the product I am working on. People call it RESDEX, short for "Resume Database Access". We have more than 55 lakh profiles as of today, and it is my job to ensure that everything works without any problem. Providing search on such a huge database and getting results in a matter of seconds is a difficult task.

What the user sees is a search form (well, actually 4 search forms), which is the tip of the iceberg. What I have to see is the complete iceberg, and it is a big iceberg. Like my colleague says - "Har click ki alag kahani hai" (Every click has its own story).

This is the first time I am talking about my work here, though all I do is work and work. Actually, I have done things in my life which I am proud of. Building a system architecture block by block is something which is really very exciting.

Earlier I used to work for an organization named 3di systems. I worked on lots of projects over there, but my major contribution was in handling the website www.newgateway4u.com. It was an MLM (Multi Level Marketing) website. The logic for the website was - in theoretical terms - a binary tree. To explain in layman's terms, each user registered with the website had to pay a certain amount to get himself activated. And once activated, he had to work on getting active people below him. Each user can have at most one person on the left and one on the right. The tree used to look something like this:



      Root
     /    \
    L      R
     \    / \
     LR  RL  RR
         /
      RLL


And the site used to work like this - if R is active and both RL & RR are active, then some percentage of the activation fees is passed to R. So Root would get a percentage of the activation money from L, R, LR, RL, RR and RLL, if all of them are active. There were weekly calculations to work out the amount payable to all active users.

The owner of the site made money from whatever remained after paying out the percentages to eligible active users.

When I joined the project, there were 5,000 users and the amount calculation took 11 hours. The site was pathetically slow. And then I joined. I moved the complete binary tree to a doubly linked list in memory, and everything became cool: calculations that used to take 11 hours were done in 30 minutes. WOW!!

And then, after 1 year of working on the site, I left the organization. When I left, the number of users was 50,000 and the weekly calculation time was 4 hours. Boy!! The code I wrote and how I handled the application architecture. The ideas that used to flow from me. I was good. And I am proud of myself.

And after that I joined Naukri and worked over here. I handled servers, built up lots of applications, and finally got into RESDEX - the Baap of all applications. RESDEX started off with I don't remember how many CVs, but we hit a bottleneck at 10 lakh CVs, when we shifted the architecture to something better. I still remember the date: it was 11th June 2004. Things stabilized, and we kept adding more features and CVs to the product. The product is staying stable, but there is always a fear of what will happen once the number reaches something like 100 lakh. How will we provide search on such a large database?

Challenges are unlimited. And step by step evolution is the only way to adapt to them.

Cool yaar...

And currently I am just adding one more block to the architecture...
And it is a bit of a heavy block. Well, all blocks are heavy... Right!!

Wish that everything goes right and I get to sleep on Monday...