Thursday, August 05, 2010

Linked-in data infrastructure

Jay Kreps of LinkedIn presented some informative details of how they process data at the recent Hadoop Summit. Kreps described how LinkedIn crunches 120 billion relationships per day and blends large scale data computation with high volume, low latency site serving.

Much of LinkedIn's important data is offline - it moves fairly slowly. So they use daily batch processing with Hadoop as an important part of their calculations. For example, they pre-compute data for their "People You May Know" product this way, scoring 120 billion relationships per day in a mapreduce pipeline of 82 Hadoop jobs that requires 16 TB of intermediate data. This job uses a statistical model to predict the probability of two people knowing each other. Interestingly they use bloom filters to speed up large joins, yielding a 10x performance improvement.

They have two engineers who work on this pipeline, and are able to test five new algorithms per week. To achieve this rate of change, they rely on A/B testing to compare new approach to old approaches, using a "fly by instruments" approach to optimize results. To achieve performance improvements, they also need to operate on large scale data - they rely on large scale cluster processing. To achieve that they moved from custom graph processing code to Hadoop mapreduce code - this required some thoughtful design since many graph algorithms don't translate into mapreduce in a straightforward manner.

LinkedIn invests heavily in open source,with the goal of building on best in class components, recruiting community involvement. Two of these open source projects were presented as central to their data infrastructure. Azkaban is an open source workflow system for Hadoop, providing cron-like scheduling and make-like dependency analysis, including restart. It is used to control ETL jobs that push database and event logs to their edge server storage, Voldemort.

Voldemort is LinkedIn's NoSQL key/value storage engine. Every day they push out an updated multi-billion edge probabilistic relationship graph to their live site for querying in rendering web pages.  This data is read-only: it is computed in these cluster jobs, but then filtered through a faceted search in realtime, e.g., restricting to certain companies of interest, or excluding people the user has indicated are not known to them. It grew out of their problems using databases to solve this problem, which required sharding and devolved into a system with a fulltime manual job moving data around.  Voldemort is fully distributed and decentralized, and supports partitioning and fail over.

LinkedIn updates their live servers with a large scale parallel fetch of results from Hadoop into Voldemort, warms up the cache, then institutes an atomic switchover to the new day's data on each server separately. They keep the previous day's data in the server to allow near instantaneous fallback in case of problems with new data sets. LinkedIn builds an index structure in their Hadoop pipeline - this produces a multi-TB lookup structure that uses perfect hashing (requiring only 2.5 bits per key). This process trades off cluster computing resources for faster server responses; it takes LinkedIn about 90 minutes to build a 900 GB data store on a 45 node development cluster. They use Hadoop to process massive batch workloads, taking their Hadoop cluster down for upgrades periodically, whereas their Voldemort servers never go down.

For more details, see the slides from the presentation.

Copied from http://www.infoq.com/news/2010/08/linkedin-data-infrastructure

No comments: