Monday, August 23, 2010

Troubleshooting: MySQL server has gone away

When running a query against a database, the following error can be generated:

ERROR 2006 (HY000) at line NNN: MySQL server has gone away

Where "NNN" is the line number of the script currently being run where the error occurred.

Possible Causes and Resolution

This is a general error with a number of possible causes. The one certainty is that MySQL is no longer listening on the connection, but this does not necessarily mean that MySQL itself is unavailable.

Possible causes:

MySQL has crashed.
Check that MySQL is still running on the server. If it is not running and this was not a planned shutdown, it is reasonable to conclude that MySQL has crashed. I have personally never experienced this, but if it happens, investigate whether you can upgrade to a newer version.

MySQL has been shut down.
Check that MySQL is still running. If it is not, and it was shut down (accidentally or intentionally), then that is the cause of the problem. Restart MySQL.

Network connectivity has failed.
Even if the database is still up, is it possible to contact the server? If you can ping the server (or MySQL is running on the localhost) then basic network connectivity is not the problem.
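
A quick way to check this from the application host is a plain TCP probe. The sketch below is a minimal check, assuming Python and the default MySQL port 3306 with a hypothetical hostname; if this fails, the problem is at the network level rather than inside MySQL.

import socket

def can_reach(host, port=3306, timeout=3):
    """Return True if a TCP connection to the MySQL port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(can_reach("db.example.com"))   # hypothetical hostname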

MySQL networking has failed.
The problem may be the MySQL connection. If the query was trying to transfer a large block of data then the MySQL packet-size may be set too small.

Try increasing the value of max_allowed_packet and restart the MySQL service. In MySQL Administrator, select "Startup Variables"; on the "Advanced Networking" tab this is shown under "Data / Memory size" and is labelled "Max. packet size." The default is 1MB; one of my systems now runs at 16MB.
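
For a quick check or a temporary change without editing the configuration, the variable can also be inspected and raised from a client session. This is a minimal sketch assuming the PyMySQL client and placeholder credentials; note that SET GLOBAL requires the SUPER privilege, only affects new connections, and is lost on restart unless max_allowed_packet is also set in my.cnf.

import pymysql

conn = pymysql.connect(host="localhost", user="root", password="secret")
with conn.cursor() as cur:
    cur.execute("SHOW VARIABLES LIKE 'max_allowed_packet'")
    print(cur.fetchone())            # e.g. ('max_allowed_packet', '1048576') = 1MB default
    cur.execute("SET GLOBAL max_allowed_packet = 16 * 1024 * 1024")   # raise to 16MB
conn.close()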

The database connection has timed out and thus the SQL has failed.
Try increasing the wait_timeout, or reconnecting just before firing the query.
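
A common client-side pattern is to test the connection just before the query and reconnect if it has been dropped. This is a minimal sketch assuming the PyMySQL client and placeholder connection details; Connection.ping(reconnect=True) re-opens the connection if the server has closed it after wait_timeout.

import pymysql

conn = pymysql.connect(host="localhost", user="app", password="secret", database="mydb")

def run_query(sql, params=None):
    conn.ping(reconnect=True)        # re-open the connection if it timed out
    with conn.cursor() as cur:
        cur.execute(sql, params)
        return cur.fetchall()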

Modified after copying from http://www.cryer.co.uk/brian/mysql/trouble_mysql_gone_away.htm

Thursday, August 05, 2010

LinkedIn data infrastructure

Jay Kreps of LinkedIn presented some informative details of how they process data at the recent Hadoop Summit. Kreps described how LinkedIn crunches 120 billion relationships per day and blends large scale data computation with high volume, low latency site serving.

Much of LinkedIn's important data is offline - it moves fairly slowly. So they use daily batch processing with Hadoop as an important part of their calculations. For example, they pre-compute data for their "People You May Know" product this way, scoring 120 billion relationships per day in a mapreduce pipeline of 82 Hadoop jobs that requires 16 TB of intermediate data. This job uses a statistical model to predict the probability of two people knowing each other. Interestingly they use bloom filters to speed up large joins, yielding a 10x performance improvement.
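
The talk did not include code, but the bloom-filter trick is simple to illustrate. The toy sketch below (plain Python, not LinkedIn's implementation) builds a filter over the keys of the smaller join side and uses it to discard records from the larger side that cannot possibly match; false positives simply pass through to the real join.

import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted MD5 digests of the key.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

# Build the filter over the smaller side's join keys...
members = BloomFilter()
for member_id in ("m1", "m2", "m3"):
    members.add(member_id)

# ...then drop rows from the larger side whose key is definitely absent.
edges = [("m1", "m9"), ("m7", "m2"), ("m5", "m6")]
joinable = [edge for edge in edges if members.might_contain(edge[0])]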

They have two engineers who work on this pipeline, and are able to test five new algorithms per week. To achieve this rate of change, they rely on A/B testing to compare new approaches to old ones, using a "fly by instruments" approach to optimize results. Achieving performance improvements also means operating on large-scale data, so they rely on large-scale cluster processing. To do that, they moved from custom graph-processing code to Hadoop mapreduce code; this required some thoughtful design, since many graph algorithms don't translate into mapreduce in a straightforward manner.
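
As a rough illustration of how a graph computation like this can be phrased as map and reduce steps, here is a toy common-friends scorer in plain Python. It is not LinkedIn's pipeline (which is 82 Hadoop jobs driven by a statistical model); the names and the tiny graph are made up.

from collections import defaultdict
from itertools import combinations

adjacency = {                      # hypothetical tiny social graph
    "alice": ["bob", "carol"],
    "bob":   ["alice", "carol", "dave"],
    "carol": ["alice", "bob"],
    "dave":  ["bob"],
}

def map_phase(user, friends):
    # Every pair of a user's friends shares that user as a common friend.
    for a, b in combinations(sorted(friends), 2):
        yield (a, b), 1

def reduce_phase(pairs):
    scores = defaultdict(int)
    for key, count in pairs:
        scores[key] += count
    return scores

intermediate = [kv for user, friends in adjacency.items() for kv in map_phase(user, friends)]
counts = reduce_phase(intermediate)

# Recommend only pairs that are not already connected.
suggestions = {(a, b): n for (a, b), n in counts.items() if b not in adjacency[a]}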

LinkedIn invests heavily in open source, with the goal of building on best-in-class components and recruiting community involvement. Two of these open source projects were presented as central to their data infrastructure. Azkaban is an open source workflow system for Hadoop, providing cron-like scheduling and make-like dependency analysis, including restart. It is used to control the ETL jobs that push database and event logs to their edge server storage, Voldemort.
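
The make-like dependency idea is easy to picture with a small sketch. The snippet below is not Azkaban's API, just an illustration in Python (3.9+) of running hypothetical ETL jobs in dependency order; Azkaban adds scheduling, restart of failed jobs, and a UI on top of this kind of ordering.

from graphlib import TopologicalSorter

# Hypothetical flow: push logs and a database snapshot, build the store, load Voldemort.
dependencies = {
    "push_event_logs":  set(),
    "push_db_snapshot": set(),
    "build_store":      {"push_event_logs", "push_db_snapshot"},
    "load_voldemort":   {"build_store"},
}

def run(job):
    print("running", job)   # a real runner would launch the Hadoop job and retry on failure

for job in TopologicalSorter(dependencies).static_order():
    run(job)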

Voldemort is LinkedIn's NoSQL key/value storage engine. Every day they push out an updated multi-billion-edge probabilistic relationship graph to their live site for querying when rendering web pages. This data is read-only: it is computed in these cluster jobs, but then filtered through a faceted search in realtime, e.g., restricting to certain companies of interest, or excluding people the user has indicated are not known to them. Voldemort grew out of LinkedIn's problems using databases for this workload, which required sharding and devolved into a system with a full-time manual job moving data around. Voldemort is fully distributed and decentralized, and supports partitioning and failover.
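
The serving pattern described boils down to a key/value lookup followed by request-time facet filtering. The sketch below is not Voldemort's API; it is a hypothetical in-memory stand-in that shows the shape of the read path: fetch the precomputed candidate list for a member, then restrict by company and drop people the user has said they do not know.

precomputed_store = {               # hypothetical key -> value layout
    "member:42": [
        {"id": 7,  "company": "Acme",    "score": 0.83},
        {"id": 19, "company": "Initech", "score": 0.61},
        {"id": 23, "company": "Acme",    "score": 0.40},
    ],
}

def people_you_may_know(member_id, company=None, excluded=frozenset()):
    candidates = precomputed_store.get(f"member:{member_id}", [])
    return [c for c in candidates
            if c["id"] not in excluded
            and (company is None or c["company"] == company)]

print(people_you_may_know(42, company="Acme", excluded={23}))   # -> only the id 7 candidate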

LinkedIn updates their live servers with a large scale parallel fetch of results from Hadoop into Voldemort, warms up the cache, then institutes an atomic switchover to the new day's data on each server separately. They keep the previous day's data in the server to allow near instantaneous fallback in case of problems with new data sets. LinkedIn builds an index structure in their Hadoop pipeline - this produces a multi-TB lookup structure that uses perfect hashing (requiring only 2.5 bits per key). This process trades off cluster computing resources for faster server responses; it takes LinkedIn about 90 minutes to build a 900 GB data store on a 45 node development cluster. They use Hadoop to process massive batch workloads, taking their Hadoop cluster down for upgrades periodically, whereas their Voldemort servers never go down.
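
The swap-and-fallback part of that process is straightforward to sketch. The class below is a toy stand-in, not Voldemort's implementation: it keeps the current and previous day's read-only data, switches to a new data set atomically, and can roll back if the new data misbehaves.

import threading

class DailyStore:
    def __init__(self, initial_data):
        self._lock = threading.Lock()
        self._current = initial_data
        self._previous = None

    def swap_in(self, new_data):
        # Cache warm-up and sanity checks would happen before calling this.
        with self._lock:
            self._previous = self._current
            self._current = new_data

    def rollback(self):
        # Near-instantaneous fallback to the previous day's data.
        with self._lock:
            if self._previous is not None:
                self._current, self._previous = self._previous, self._current

    def get(self, key):
        with self._lock:
            return self._current.get(key)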

For more details, see the slides from the presentation.

Copied from http://www.infoq.com/news/2010/08/linkedin-data-infrastructure