Extreme Agility at Facebook

The Facebook social utility is phenomenally successful. As of summer 2009, the site attracted around 300 million visitors per month. It is often noted that if Facebook were a nation, it would rank among the five most populous in the world, and the growth seems to be accelerating! In a nutshell, Facebook has changed the way everyday individuals worldwide conduct their social lives.

Robert Johnson (pictured), director of engineering at Facebook, gave the last keynote at OOPSLA 2009. Robert’s talk, “Moving Fast at Scale”, aimed to shed some light on Facebook’s scaling issues and successes, and to elaborate on the kinds of processes the company has used to deal with such incredible growth.

Facebook’s architecture is based on a typical hierarchical PHP Web application model, with a data caching layer and services extracted into separate components. The caching layer uses the stable and fast open source memcached software and sits on top of one of the largest MySQL installations. The caching layer is so critical to Facebook’s success that Facebook is now one of the main contributors to memcached.
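To make the caching idea concrete, here is a minimal sketch of the common "cache-aside" pattern, in which reads are served from the cache when possible and fall back to the database otherwise. This is an illustration only, not Facebook's code: the class name `CacheAsideStore`, its methods, and the plain dictionaries standing in for memcached and MySQL are all assumptions made for the example.

```python
class CacheAsideStore:
    """Hypothetical cache-aside store: a fast cache in front of a slow database."""

    def __init__(self, database):
        self.database = database  # dict standing in for MySQL (assumption)
        self.cache = {}           # dict standing in for memcached (assumption)
        self.db_reads = 0         # counts how often a read reaches the database

    def get(self, key):
        # Serve from the cache when the key is already there.
        if key in self.cache:
            return self.cache[key]
        # Cache miss: read from the database and remember the result.
        self.db_reads += 1
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value
        return value

    def set(self, key, value):
        # Write to the database, then invalidate the cached copy so the
        # next read fetches the fresh value.
        self.database[key] = value
        self.cache.pop(key, None)


store = CacheAsideStore({"user:1": "Alice"})
store.get("user:1")  # miss: goes to the database
store.get("user:1")  # hit: served from the cache
```

Note that every write invalidates the cached entry, so the pattern pays off only when a value is read many times between writes.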

To support their extreme scale needs, the various service components use a homegrown RPC mechanism called Thrift, which is now an Apache open source project. Its interface language was influenced by the CORBA interface definition language (IDL); it is designed to bind to various languages and is optimized for speed over the wire.

Perhaps the most interesting and revealing aspect of Robert’s talk was the discussion of Facebook’s somewhat unique development process. On the surface, the process appears to pursue contradictory goals: minimizing downtime, scaling fast, and updating code extremely quickly. First, Facebook developers are encouraged to push code often and quickly. Pushes are never delayed and are applied directly to parts of the infrastructure. The idea is to quickly find issues and their impact on the rest of the system, and to fix any bugs that result from these frequent small changes.

Second, Facebook has limited QA (quality assurance) teams but does lots of peer code review. Since the Facebook engineering team is relatively small, all team members are in frequent communication. The team uses various staging and deployment tools, as well as strategies such as A/B testing and gradual, geographically targeted launches. According to Robert, this has resulted in a site that has experienced less than three hours of downtime in the past three years.
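One common way to implement A/B tests and gradual launches is to assign each user to a stable bucket and enable a feature only for buckets below a rollout percentage. The sketch below shows that general technique; the talk did not describe Facebook's actual tooling, and the function names `bucket` and `is_enabled` and the experiment name are hypothetical.

```python
import hashlib


def bucket(user_id: str, experiment: str, buckets: int = 100) -> int:
    """Map a user to a stable bucket in the range [0, buckets)."""
    # Hashing the experiment name together with the user id gives each
    # experiment its own independent, repeatable assignment.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def is_enabled(user_id: str, experiment: str, rollout_percent: int) -> bool:
    """Return True if the feature is switched on for this user."""
    return bucket(user_id, experiment) < rollout_percent


# Roll a hypothetical feature out to roughly 10% of users.
enabled_users = [u for u in ("alice", "bob", "carol")
                 if is_enabled(u, "new-feed", 10)]
```

Because the bucket is derived from a hash rather than stored, raising the percentage only ever adds users: anyone who saw the feature at 10% still sees it at 50%.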

To help Facebook deal with the enormous amount of data that it collects daily from its users, the team has developed various backend batch services that use Hadoop, Scribe, and Hive. For instance, data (e.g., photos, statuses, and comments) is periodically moved to central repositories, not only to facilitate access but also to perform data analytics. The team has even created a tool called HiPal that provides a graphical, SQL-like interface to the data, which allows the marketing and business teams to run queries and make informed business decisions. Most of these tools have been released as open source. Facebook’s culture is that the “world is a better place due to open source.”

Naturally, while Facebook’s process is adapted to its unusual growth and success, it is not without challenges. For instance, caching relational data allows quick access to read-mostly data (e.g., photos, statuses, and profiles), since such data is read far more often than it is written, but it can fail miserably when the access pattern is write-heavy instead.

This quickly became apparent to the team when they introduced the “Like” feature, the ability for friends and fans to give a thumbs up to posted items: items posted by popular users such as US President Barack Obama or Hollywood actor Vin Diesel received thousands of likes within minutes. By gradually deploying features and having all developers think of “horizontal scaling by design” for all features, the Facebook team hopes to encourage new features that will integrate smoothly with the site.
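A well-known way to scale a write-heavy counter horizontally is to split it into several shards, so that concurrent increments do not all hit the same record. The sketch below illustrates that general technique for a "like" count. The talk did not say this is how Facebook solved the problem, and the `ShardedCounter` class is a hypothetical, in-memory stand-in for shards that would normally live on separate servers.

```python
import random


class ShardedCounter:
    """Hypothetical like-counter whose writes are spread over several shards."""

    def __init__(self, num_shards=8, rng=None):
        # Each list slot stands in for a counter row on a separate server.
        self.shards = [0] * num_shards
        self.rng = rng if rng is not None else random.Random()

    def increment(self):
        # Pick a shard at random so no single shard absorbs every write.
        index = self.rng.randrange(len(self.shards))
        self.shards[index] += 1

    def value(self):
        # Reads are more expensive: they must sum across all shards.
        return sum(self.shards)


likes = ShardedCounter(num_shards=4)
for _ in range(1000):
    likes.increment()
total = likes.value()  # total likes across all shards
```

The trade-off is deliberate: each write touches only one shard, while a read has to combine all of them, which suits a counter that is incremented in bursts.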

Another notable aspect of the Facebook engineering team is the high ratio of active users to developers, which currently stands at 1.1 million users per developer. This is an attractive recruiting figure, since every new Facebook engineer knows that they will quickly have a huge impact (positive or negative), which should keep the adrenaline high when pushing new features. Naturally, an immediate question to ask of such a fast-paced, high-impact development environment is whether burnout is, or will become, a significant issue at Facebook as the company matures. Other issues raised by the audience at the Q&A session related to security and privacy as well as the social graph of the network…

Regardless of one’s take on Facebook, and whether it continues to grow to connect more than 500 million or even reaches the unprecedented figure of 1 billion active users, one thing is certain: Facebook has forever changed how a large portion of the world’s population conducts its social life. This has global, national, regional, and local consequences, as was noted during the Iran elections of 2009. In the process, the Facebook team has also taken agility to the extreme and devised a set of principles, tools, and process improvements that allow its engineers to have a quick and real impact on you and me.

11/11/2009 – first post and also minor update to fix some typos and grammar
11/12/2009 – noted that photo is of Robert Johnson. A few more grammar fixes (nothing major)