Showing posts with label http. Show all posts

Friday, May 01, 2009

How Facebook stores billions of photos

The new photo infrastructure merges the photo serving tier and storage tier into one physical tier. It implements an HTTP-based photo server which stores photos in a generic object store called Haystack. The main requirement for the new tier was to eliminate any unnecessary metadata overhead for photo read operations, so that each read I/O operation reads only actual photo data (instead of filesystem metadata). Haystack can be broken down into these functional layers -

* HTTP server
* Photo Store
* Haystack Object Store
* Filesystem
* Storage

In the following sections we look closely at each of the functional layers from the bottom up.

Storage

Haystack is deployed on top of commodity storage blades. The typical hardware configuration of a 2U storage blade is –

* 2 x quad-core CPUs
* 16GB – 32GB memory
* hardware raid controller with 256MB – 512MB of NVRAM cache
* 12+ 1TB SATA drives

Each storage blade provides around 10TB of usable space, configured as a RAID-6 partition managed by the hardware RAID controller (RAID-6 dedicates two drives' worth of capacity to parity, so 12 × 1TB drives yield roughly 10TB). RAID-6 provides adequate redundancy and excellent read performance while keeping the storage cost down. The poor write performance is partially mitigated by the RAID controller's NVRAM write-back cache. Since reads are mostly random, the NVRAM cache is fully reserved for writes. The disk caches are disabled in order to guarantee data consistency in the event of a crash or power loss.

Filesystem

Haystack object stores are implemented on top of files stored in a single filesystem created on top of the 10TB volume.

Photo read requests result in read() system calls at known offsets in these files, but in order to execute the reads, the filesystem must first locate the data on the actual physical volume. Each file in the filesystem is represented by a structure called an inode which contains a block map that maps the logical file offset to the physical block offset on the physical volume. For large files, the block map can be quite large depending on the type of the filesystem in use.

Block based filesystems maintain mappings for each logical block, and for large files, this information will not typically fit into the cached inode and is stored in indirect address blocks instead, which must be traversed in order to read the data for a file. There can be several layers of indirection, so a single read could result in several I/Os depending on whether or not the indirect address blocks are cached.

Extent based filesystems maintain mappings only for contiguous ranges of blocks (extents). A block map for a contiguous large file could consist of only one extent which would fit in the inode itself. However, if the file is severely fragmented and its blocks are not contiguous on the underlying volume, its block map can grow large as well. With extent based filesystems, fragmentation can be mitigated by aggressively allocating a large chunk of space whenever growing the physical file.

Currently, the filesystem of choice is XFS, an extent based filesystem providing efficient file preallocation.
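
As a sketch of this preallocation strategy: on Linux, an application can reserve space up front with posix_fallocate before appending data, which lets an extent-based filesystem such as XFS allocate one large contiguous extent instead of fragmenting the file as it grows. The function name below is illustrative, not Haystack's actual code:

```python
import os

# Sketch: reserve a large contiguous region up front, assuming Linux and
# an extent-based filesystem such as XFS. Preallocating keeps the block
# map to a handful of extents even as the file is appended to later.
def create_preallocated(path, size):
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        # Reserve `size` bytes starting at offset 0 in one call; the
        # file's logical size is extended to `size` as well.
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)
```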

Haystack Object Store

Haystack is a simple log structured (append-only) object store containing needles representing the stored objects. A Haystack consists of two files – the actual haystack store file containing the needles, plus an index file. The following figure shows the layout of the haystack store file:


The first 8KB of the haystack store is occupied by the superblock. Immediately following the superblock are needles, with each needle consisting of a header, the data, and a footer:


A needle is uniquely identified by its <Offset, Key, Alternate Key, Cookie> tuple, where the offset is the needle's offset in the haystack store. Haystack doesn't put any restriction on the values of the keys, and there can be needles with duplicate keys. The following figure shows the layout of the index file -





There is a corresponding index record for each needle in the haystack store file, and the order of the needle index records must match the order of the associated needles in the haystack store file. The index file provides the minimal metadata required to locate a particular needle in the haystack store file. Loading and organizing index records into a data structure for efficient lookup is the responsibility of the Haystack application (Photo Store in our case). The index file is not critical, as it can be rebuilt from the haystack store file if required. The main purpose of the index is to allow quick loading of the needle metadata into memory without traversing the larger Haystack store file, since the index is usually less than 1% the size of the store file.
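
To make the needle layout concrete, here is a hypothetical sketch in Python. The post names the needle's parts (header, data, footer) and its identifying fields, but the exact field widths, magic numbers, and checksum choice below are assumptions:

```python
import struct
import zlib

# Hypothetical needle framing; field widths and magic values are assumed,
# not taken from the real Haystack format.
HEADER = struct.Struct("<IQQQBI")  # magic, cookie, key, alt_key, flags, data_len
FOOTER = struct.Struct("<II")      # magic, crc32 of the data
HDR_MAGIC, FTR_MAGIC = 0x48415953, 0x4E444C45

def pack_needle(cookie, key, alt_key, data, flags=0):
    hdr = HEADER.pack(HDR_MAGIC, cookie, key, alt_key, flags, len(data))
    ftr = FOOTER.pack(FTR_MAGIC, zlib.crc32(data) & 0xFFFFFFFF)
    return hdr + data + ftr

def unpack_needle(buf):
    magic, cookie, key, alt_key, flags, n = HEADER.unpack_from(buf, 0)
    assert magic == HDR_MAGIC, "bad header magic"
    data = bytes(buf[HEADER.size:HEADER.size + n])
    fmagic, crc = FOOTER.unpack_from(buf, HEADER.size + n)
    # The footer checksum lets the reader validate the data on every read.
    assert fmagic == FTR_MAGIC and crc == (zlib.crc32(data) & 0xFFFFFFFF)
    return cookie, key, alt_key, flags, data
```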

Haystack Write Operation

A Haystack write operation synchronously appends new needles to the haystack store file. After the needles are committed to the larger Haystack store file, the corresponding index records are then written to the index file. Since the index file is not critical, the index records are written asynchronously for faster performance.

The index file is also periodically flushed to the underlying storage to limit the extent of the recovery operations caused by hardware failures. In the case of a crash or a sudden power loss, the haystack recovery process discards any partial needles in the store and truncates the haystack store file to the last valid needle. Next, it writes missing index records for any trailing orphan needles at the end of the haystack store file.

Haystack doesn’t allow overwrite of an existing needle offset, so if a needle’s data needs to be modified, a new version of it must be written using the same tuple. Applications can then assume that among the needles with duplicate keys, the one with the largest offset is the most recent one.
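
The largest-offset-wins rule can be sketched as follows. This toy class is illustrative only; it elides the header/footer framing and the asynchronous index write:

```python
# Toy sketch of the append-only write path: needles are appended
# synchronously to the store, and the in-memory index keeps, for each
# (key, alternate key), only the largest offset, so the newest
# duplicate wins.
class HaystackSketch:
    def __init__(self):
        self.store = bytearray()   # stands in for the haystack store file
        self.index = {}            # (key, alt_key) -> offset of latest needle

    def write(self, key, alt_key, data):
        offset = len(self.store)   # append-only: new needle goes at the end
        self.store += data         # a real needle also carries header/footer
        self.index[(key, alt_key)] = offset  # later write supersedes earlier
        return offset
```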

Haystack Read Operation

The parameters passed to the haystack read operation include the needle offset, key, alternate key, cookie and the data size. Haystack then adds the header and footer lengths to the data size and reads the whole needle from the file. The read operation succeeds only if the key, alternate key and cookie match the ones passed as arguments, if the data passes checksum validation, and if the needle has not been previously deleted (see below).

Haystack Delete Operation

The delete operation is simple: it marks the needle in the haystack store as deleted by setting a "deleted" bit in the flags field of the needle. However, the associated index record is not modified in any way, so an application could end up referencing a deleted needle. A read operation for such a needle will see the "deleted" flag and fail with an appropriate error. The space of a deleted needle is not reclaimed; the only way to recover it is to compact the haystack (see below).

Photo Store Server

Photo Store Server is responsible for accepting HTTP requests and translating them to the corresponding Haystack store operations. In order to minimize the number of I/Os required to retrieve photos, the server keeps an in-memory index of all photo offsets in the haystack store file. At startup, the server reads the haystack index file and populates the in-memory index. With hundreds of millions of photos per node (and the number will only grow with larger capacity drives), we need to make sure that the index will fit into the available memory. This is achieved by keeping a minimal amount of metadata in memory, just the information required to locate the images.

When a user uploads a photo, it is assigned a unique 64-bit id. The photo is then scaled down to 4 different sizes. Each scaled image has the same random cookie and 64-bit key, and the logical image size (large, medium, small, thumbnail) is stored in the alternate key. The upload server then calls the photo store server to store all four images in the Haystack.

The in-memory index keeps the following information for each photo:

Haystack uses the open source Google sparse hash data structure to keep the in-memory index small, since it only has 2 bits of overhead per entry.
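
A sketch of such a minimal index, with a plain Python dict standing in for Google's sparse hash. The field choices are assumptions, not the actual Haystack structure: each 64-bit key maps to the needle offsets of its four scaled images, and an offset of zero marks a missing or deleted image:

```python
from array import array

# Minimal in-memory index sketch: 64-bit photo key -> offsets of the four
# scaled images (large, medium, small, thumbnail). An offset of 0 means
# the image is missing or deleted. A dict stands in for sparse hash.
SIZES = ("large", "medium", "small", "thumbnail")

index = {}  # key -> array("Q", [off_large, off_medium, off_small, off_thumb])

def set_offset(key, size, offset):
    entry = index.setdefault(key, array("Q", [0, 0, 0, 0]))
    entry[SIZES.index(size)] = offset

def get_offset(key, size):
    entry = index.get(key)
    off = entry[SIZES.index(size)] if entry else 0
    return off or None  # 0 -> missing or deleted
```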

Photo Store Write/Modify Operation

A write operation writes photos to the haystack and updates the in-memory index with the new entries. If the index already contains records with the same keys, then this is a modification of existing photos, and only the index record offsets are updated to reflect the location of the new images in the haystack store file. The photo store always assumes that among duplicate images (images with the same key), the one stored at the larger offset is valid.

Photo Store Read Operation

The parameters passed to a read operation include the haystack id and the photo key, size and cookie. The server performs a lookup in the in-memory index based on the photo key and retrieves the offset of the needle containing the requested image. If found, it calls the haystack read operation to get the image. As noted above, the haystack delete operation doesn't update the index file record, so a freshly populated in-memory index can contain stale entries for previously deleted photos. A read of a previously deleted photo will fail, and the in-memory index is updated to reflect that by setting the offset of the particular image to zero.

Photo Store Delete Operation

After calling the haystack delete operation, the in-memory index is updated by setting the image offset to zero, signifying that the particular image has been deleted.

Compaction

Compaction is an online operation which reclaims the space used by deleted and duplicate needles (needles with the same key). It creates a new haystack by copying needles while skipping any duplicate or deleted entries. Once done, it swaps the files and in-memory structures.
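
A minimal sketch of a compaction pass, assuming the needles can be scanned in store-file order (the tuple layout here is illustrative):

```python
# Copy live needles into a new store, skipping deleted entries and
# superseded duplicates. Scanning in offset order means the last needle
# seen for a key is the one with the largest offset, i.e. the live one.
def compact(needles):
    """needles: iterable of (offset, key, deleted, data) in store order."""
    latest = {}
    for offset, key, deleted, data in needles:
        latest[key] = (deleted, data)  # larger offset wins
    # Keep only keys whose latest version is not deleted.
    return [(key, data) for key, (deleted, data) in latest.items() if not deleted]
```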

HTTP Server

The HTTP framework we use is the simple evhttp server provided with the open source libevent library. We use multiple threads, with each thread being able to serve a single HTTP request at a time. Because our workload is mostly I/O bound, the performance of the HTTP server is not critical.

Summary

Haystack presents a generic HTTP-based object store containing needles that map to stored opaque objects. Storing photos as needles aggregates hundreds of thousands of images in a single haystack store file. This keeps the metadata overhead very small and allows each needle's location in the store file to be held in an in-memory index, so an image's data can be retrieved in a minimal number of I/O operations, eliminating all unnecessary metadata overhead.

Monday, October 30, 2006

404 - a legend




You've felt it. You know the power of 404. You're surfing one night, you've got your new modem, your fast Mac, your ergonomic mouse, your precise mousing surface. Bring it on. You've got 7 windows open at a time, and you flip through them like dealing cards. You could tile them on the screen, like a little website mosaic, but you choose to keep them stacked on each other, so that each flip is a new adventure. Then suddenly, it screeches to a halt: 404. You don't want to believe it; that site was there last week! You reload, hoping it was a fluke. How could your painstakingly-compiled bookmarks betray you? 404 glares back at you, challenging you to contact the referring page's administrator. Questioning your spelling skills. What does 404 want from you?

Relax! 404 is your friend. It just wants to help you get where you want to go. It might intimidate you at first, with its stark white background and unadorned black text. But just think about it for a minute: 404 is baring its soul for you. It gives you its message and asks nothing in return. No login and password, no banner ads, no mailing list to keep you informed of future updates. All 404 has, it offers to you, knowing the likelihood that you will scorn it nonetheless, and leave as quickly as you came in. And 404 will continue to do so for every visitor, regardless of color, religion, or gender. 404 is nothing if not fair.

But why leave 404 so quickly? Why not stay a while and have a drink? 404 is an oasis on the web. It's like a rest stop with clean bathrooms on the interstate. 404 doesn't ask you to 'Click Here' or 'Visit our Sponsor'. It's perfectly satisfied if you just sit there and do nothing. 404 doesn't care how many visitors it's had since 8/1/96, and it's not tracking your click-through rate. So consider just hanging out for a while and relaxing. 404 is easy to get along with.

404 is full of intrigue. What did it used to be? What internet delight has escaped you? Will it return? 404 will never tell. Its mystery drives you to return again and again. Where there is 404, there is always the potential for something new. 404 is the eternal ebb and flow of life. One day you will return and 404 will be gone, replaced by a new page about South Park or the webmaster's cats. And it will be filled with the bittersweet memory of 404. You will be driven to seek out 404 in other places. Your desire for 404 will start to overshadow your career, your loved ones, even your passion for role-playing video games. 404 draws you deeper and deeper into its vortex. You must admit that you are powerless before it. 404 is not evil, it is a natural force that defies control. 404 is wild and free.

404 believes in your abilities. It doesn't try to lose you in a crazy series of redirects, it challenges your problem-solving skills. It asks, "Now that you're here, what are you going to do?" 404 willingly hands you the reins. Have you ever just closed a window on 404? No! You've considered your options, exercised your reasoning skills, and firmly chosen a course of action to deal with your situation. When you successfully navigate 404, you feel the blood coursing in your veins, the wind in your hair, and it's good to be alive. 404 is life-affirming. When you find 404, you know that even though the thing you were looking for no longer exists, there is still 404, stepping in to fill the void.

Where there was darkness, there is now 404. And all is right with the world.

Tuesday, October 17, 2006

Reverse Proxy using squid

What is proxy server caching? Hmm... let's start with what Apache or any other web server does. Whenever you send a request to Apache, the request (whether static or dynamic) is processed: the appropriate file is located in the filesystem, the content type is identified, and the data is then streamed from the file to the web and hence to the browser. That is what Apache does. So where does a proxy server come into the picture? Proxy servers like Squid are set up at the gateway level in cyber cafes or large companies. The proxy server caches web content in its internal cache; whenever a request comes in, it compares the modification time of the cached content with that on the origin server. If the times match, the content is served from the proxy cache; otherwise, it is fetched from the remote server, served, and cached for future requests.

So, now, what is reverse proxy caching? A reverse proxy is installed in the neighbourhood of a web server; all incoming traffic to the web server passes through it. This reduces the load on a busy server by placing a web cache between the server and the internet.

The following benefits are derived by deploying reverse proxy servers alongside web servers:

1. Increase the capacity of existing servers and avoid purchasing new ones.
2. Cache static content at the proxy level, leaving the web server free to handle dynamic content.
3. Reduce the response time of web requests.
4. Act as an additional layer of defence against hacking.
5. Load balancing: distribute load across several web servers.
6. Compression: optimize and compress web content to speed up download times.

A reverse proxy server intercepts requests to the web server and responds out of a store of cached pages. Dynamic web content cannot generally be cached. The reverse proxy caches static pages and images based on the HTTP headers returned with the response. The important headers are:

Last-Modified -> when the page was last modified
Expires -> when the page expires, so that it can be removed from the proxy cache
Cache-Control -> whether the page should be cached
Pragma -> similar to Cache-Control; a deciding factor in whether the page should be cached
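
As a rough sketch of how a cache might act on these headers (greatly simplified compared to what Squid actually does; `headers` is assumed to be a plain dict of response header values):

```python
# Simplified cacheability decision based on the headers listed above.
# Real caches follow the HTTP specification in far more detail.
def is_cacheable(headers):
    cc = headers.get("Cache-Control", "").lower()
    if "no-store" in cc or "no-cache" in cc or "private" in cc:
        return False
    if "no-cache" in headers.get("Pragma", "").lower():
        return False
    # Cache only responses carrying a validator or explicit freshness info.
    return any(h in headers for h in ("Expires", "Last-Modified", "Cache-Control"))
```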

Here is what I did to install Squid:

>> download the squid source gz file.
>> tar -xvzf squid-2.6.STABLE4.tar.gz - creates a directory squid-2.6.STABLE4
>> cd squid-2.6.STABLE4
>> ./configure --disable-internal-dns - makes squid use the /etc/hosts file for DNS lookups.
>> make - compile the files
>> make install (as root) - copies the compiled files to /usr/local/squid
>> vi /usr/local/squid/etc/squid.conf
>> make the following changes. The configuration file format changed in squid 2.6; the new configuration settings are shown here. The older httpd_accel_* directives have been deprecated.

http_port 80 vhost
The socket address where squid listens for HTTP client requests. The default is 3128 (for a proxy server). Various options can follow the port number, like
transparent : support for transparent proxies
vhost : accelerator using the vhost directive
vport : accelerator with IP virtual host support
defaultsite= : main website name for accelerators
protocol= : protocol to reconstruct accelerated requests with. Default is http.
no-connection-auth : prevent forwarding of Microsoft connection-oriented authentication
tproxy : support for Linux TPROXY for spoofing outgoing connections using the client IP address


cache_peer hostname type proxy-port icp-port [options]

For Apache running on localhost on port 81, the cache_peer directive for the reverse proxy would be

cache_peer localhost parent 81 0 originserver
hostname : the cache peer to which a connection is to be established
type : how the cache peer is treated (as parent, sibling or multicast)
parent -> the child cache forwards requests to its parent cache. If the parent does not hold the requested object, it forwards the request on behalf of the child.
sibling -> a peer from which only objects already held in its cache may be requested; a sibling cannot forward cache misses on behalf of the peer.
multicast -> a multicast packet goes from one machine to one or more peers.
proxy port : the port number where the cache listens for peer requests
icp port : used for querying neighbour caches about objects
options : many options are available, such as
proxy-only -> objects fetched from this cache should not be saved locally
weight=n -> specify the weight of a parent
round-robin -> define a set of parents to be used in round-robin fashion
weighted-round-robin -> like round-robin, but the frequency of each parent is based on its round-trip time
originserver -> contact this parent as an origin server; used for accelerator setups

That's it: just start your Apache and Squid, and everything should run fine.
Hope this helps...

Source : http://www.visolve.com/squid/whitepapers/reverseproxy.php with some customizations to upgrade it for the new version.

Monday, April 17, 2006

HTTP protocol : absolute / relative urls

The question here is which one is better: absolute URLs or relative URLs? The aim is to obtain better performance from the HTTP server.

Performance-wise there is no difference between an absolute and a relative URL, though relative URLs need to be converted to absolute URLs before they can be used. But who does the conversion: the client/browser or the HTTP server? If the resolution were done at the server end, absolute URLs would be better, since the extra processing required to convert relative URLs to absolute ones would be avoided at the server.

According to http://www.htmlhelp.com/faq/html/basics.html, the browser resolves relative URLs, not the server. Relative URLs are resolved and converted to absolute URLs before the request is sent to the server.

To quote exactly:

Before the browser can use a relative URL, it must resolve the relative URL to produce an absolute URL. If the relative URL begins with a double slash (e.g., //www.htmlhelp.com/faq/html/), then it will inherit only the base URL's scheme. If the relative URL begins with a single slash (e.g., /faq/html/), then it will inherit the base URL's scheme and network location.

If the relative URL does not begin with a slash (e.g., all.html, ./all.html or ../html/), then it has a relative path and is resolved as follows.

1. The browser strips everything after the last slash in the base document's URL and appends the relative URL to the result.
2. Each "." segment is deleted (e.g., ./all.html is the same as all.html, and ./ refers to the current "directory" level in the URL hierarchy).
3. Each ".." segment moves up one level in the URL hierarchy; the ".." segment is removed, along with the segment that precedes it (e.g., foo/../all.html is the same as all.html, and ../ refers to the parent "directory" level in the URL hierarchy).

Please note that the browser resolves relative URLs, not the server. The server sees only the resulting absolute URL. Also, relative URLs navigate the URL hierarchy. The relationship (if any) between the URL hierarchy and the server's filesystem hierarchy is irrelevant.
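
The resolution rules quoted above can be checked with Python's urljoin, which implements the same reference-resolution algorithm:

```python
from urllib.parse import urljoin

base = "http://www.htmlhelp.com/faq/html/basics.html"

# Double slash: inherits only the base URL's scheme.
assert urljoin(base, "//www.htmlhelp.com/faq/html/") == "http://www.htmlhelp.com/faq/html/"
# Single slash: inherits the scheme and network location.
assert urljoin(base, "/faq/html/") == "http://www.htmlhelp.com/faq/html/"
# No leading slash: resolved against the base path; "." and ".." collapse.
assert urljoin(base, "all.html") == "http://www.htmlhelp.com/faq/html/all.html"
assert urljoin(base, "./all.html") == "http://www.htmlhelp.com/faq/html/all.html"
assert urljoin(base, "../all.html") == "http://www.htmlhelp.com/faq/all.html"
```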


From this following things can be derived:

1. When using relative URLs, the processing to resolve relative to absolute is done at the client/browser end.
2. Relative URLs are shorter and take up somewhat less bandwidth.
3. Moving documents from one server to another is easier with relative URLs, since you don't need to change the URLs in all the HTML documents.
4. The main disadvantage of relative URLs is that all content has to be on the same server; if you need to spread the page content across different servers, you will need to use absolute URLs.

This is what I got from the web...

All types of discussion on this are welcome.