20 Obstacles to Scalability

Web applications can grow in fits and starts. Customer numbers can increase rapidly, and application usage patterns can vary seasonally. This unpredictability necessitates an application that is scalable. What is the best way of achieving such scalability?

This article reveals 20 of the biggest bottlenecks that limit scalability. By ferreting these out in your environment and applications, and stamping out the worst offenders, you will be on your way to hyper growth.

10 Obstacles to Scaling Performance

Performance is key to Web scalability. If you add more customers, you want your application to continue servicing them all equally quickly. Too much latency will cause users to give up. You can keep your application’s performance from degrading by knowing all the ways it can get clogged up and avoiding those bottlenecks.

1. Two-phase commit. Normally when data is changed in a database, it is written both to memory and to disk. When a commit happens, a relational database makes a commitment to freeze the data somewhere on real storage media. Remember, memory does not survive a crash or reboot. Even if the data is cached in memory, the database still has to write it to disk. MySQL binary logs or Oracle redo logs fit the bill.

With a MySQL cluster or a replicated storage layer such as Distributed Replicated Block Device (DRBD) or Amazon Multi-AZ (Multi-Availability Zone), a commit occurs not only locally, but also at the remote end. A two-phase commit means waiting for an acknowledgment from the far end. Because of network and other latency, each commit can be delayed by milliseconds, as though all the cars on a highway were slowed down by big loads. For those considering using Multi-AZ or read replicas, the Amazon RDS (Relational Database Service) use-case comparison at http://www.iheavy.com/2012/06/14/rds-or-mysql-ten-use-cases/ will be helpful.

Synchronous replication has these issues as well; hence, MySQL’s solution is semi-synchronous replication, which relaxes some of the guarantees of a true two-phase commit.
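A rough back-of-the-envelope sketch shows why those extra milliseconds matter. It assumes a single session issuing commits one after another; the function name and the 1 ms flush and 5 ms acknowledgment figures are illustrative assumptions, not measurements.

```python
def max_commits_per_second(local_flush_ms: float, remote_ack_ms: float = 0.0) -> float:
    """Upper bound on back-to-back commits for one session.

    Each commit waits for the local flush plus, when the commit is
    synchronous, the acknowledgment from the remote end.
    """
    return 1000.0 / (local_flush_ms + remote_ack_ms)


# Illustrative numbers only (assumptions, not benchmarks).
local_only = max_commits_per_second(local_flush_ms=1.0)
with_remote = max_commits_per_second(local_flush_ms=1.0, remote_ack_ms=5.0)

print(f"local commit only: {local_only:.0f} commits/s")
print(f"waiting for remote ack: {with_remote:.0f} commits/s")
```

Under these assumed numbers, a few milliseconds of network latency cuts the ceiling for a single session several times over, which is the compromise semi-synchronous replication tries to soften.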

2. Insufficient Caching. Caching is very important at all layers, so where is the best place to cache: at the browser, the page, the object, or the database tier? Let’s work through each of these.

Browser caching might seem out of reach, until you realize the browser takes directives from the Web server and the pages it renders. Therefore, if the objects contained therein have longer expire times, the browser will cache them and will not need to fetch them again. This is faster for not only the user, but also the servers hosting the website, as all returning visitors will weigh less.

Details about browser caching are available at http://www.iheavy.com/2011/11/01/5-tips-cache-websites-boost-speed/. Be sure to set expire headers and cache control.
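As a minimal sketch of what setting those headers can look like, the helper below builds a Cache-Control and an Expires header for a response. The function name and the one-day lifetime are assumptions for illustration; in practice these headers are usually set in the Web server configuration.

```python
import time
from email.utils import formatdate
from typing import Dict, Optional


def caching_headers(max_age_seconds: int, now: Optional[float] = None) -> Dict[str, str]:
    """Build response headers that let a browser reuse an object."""
    issued = time.time() if now is None else now
    return {
        # Modern browsers honor max-age; Expires covers older clients.
        "Cache-Control": f"public, max-age={max_age_seconds}",
        "Expires": formatdate(issued + max_age_seconds, usegmt=True),
    }


# Hypothetical policy: static images may be cached for one day.
headers = caching_headers(max_age_seconds=86400)
for name, value in headers.items():
    print(f"{name}: {value}")
```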

Page caching requires using a technology such as Varnish (https://www.varnish-cache.org/). Think of this as a mini Web server with high speed and low overhead. It cannot handle complex pages as Apache can, but it can handle the very simple ones better. It therefore sits in front of Apache and reduces load, allowing Apache to handle the more complex pages. This is like a traffic cop letting the bikes go through an intersection before turning full attention to the more complex motorized vehicles.

Object caching is done by something like memcache. Think of it as a Post-it note for your application. Before going to the database, the application first checks the object cache for the data it needs. If it finds that data, it gets results 10 to 100 times faster, allowing it to construct the page faster and return everything to the user in the blink of an eye. If it does not find the data, or finds only part of it, it makes database requests and puts those results in memcache for later sessions to benefit from.
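The check-the-cache-first flow can be sketched as follows. The ObjectCache class is an in-process stand-in for a real cache server, and slow_lookup is a hypothetical database query; neither is a real memcache client API.

```python
from typing import Callable, Dict, List, Optional


class ObjectCache:
    """In-process stand-in for an object cache such as memcache."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._store.get(key)

    def set(self, key: str, value: str) -> None:
        self._store[key] = value


def fetch(key: str, cache: ObjectCache, load_from_db: Callable[[str], str]) -> str:
    """Check the cache first; on a miss, query the database and store the result."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(key)
    cache.set(key, value)
    return value


db_calls: List[str] = []


def slow_lookup(key: str) -> str:
    """Hypothetical database query; records each call so misses are visible."""
    db_calls.append(key)
    return f"row-for-{key}"


cache = ObjectCache()
first = fetch("user:42", cache, slow_lookup)   # miss: goes to the database
second = fetch("user:42", cache, slow_lookup)  # hit: served from the cache
print(first, second, db_calls)
```

A production version would also need expiry and invalidation so that the cache does not serve stale rows; this sketch leaves both out.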

3. Slow Disk I/O, RAID 5, Multitenant Storage. Everything, everything, everything in a database is constrained by storage—not by the size or space of that storage but by how fast data can be written to those devices.

If you are using physical servers, watch out for RAID 5, a type of RAID (redundant array of independent disks) that gives up one disk’s worth of capacity to parity for protection. It comes with a huge write penalty, however, which you must always carry. What’s more, if you lose a drive, these arrays are unusably slow during rebuild.

The solution is to start with RAID 10, which gives you striping over mirrored sets. This results in no parity calculation and no penalty during a rebuild.
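The difference can be estimated with the commonly cited write-penalty rule of thumb: a small RAID 5 write costs about four back-end I/Os (read data, read parity, write data, write parity), while a RAID 10 write costs two (one per mirror). The disk count and per-disk IOPS below are assumed figures for illustration only.

```python
# Commonly cited back-end I/Os per logical write (a rule of thumb, not a benchmark).
WRITE_PENALTY = {"raid5": 4, "raid10": 2}


def effective_write_iops(disks: int, iops_per_disk: int, level: str) -> float:
    """Estimate the random-write IOPS an array can sustain."""
    return disks * iops_per_disk / WRITE_PENALTY[level]


# Assumed hardware: 8 disks at 150 IOPS each.
raid5 = effective_write_iops(8, 150, "raid5")
raid10 = effective_write_iops(8, 150, "raid10")
print(f"RAID 5: {raid5:.0f} write IOPS, RAID 10: {raid10:.0f} write IOPS")
```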

Cloud environments may work with technology such as Amazon EBS (Elastic Block Store), a virtualized disk similar to a storage area network. Since it is network based, you must contend and compete with other tenants (aka customers) reading and writing to that storage. Further, those individual disk arrays can handle only so much reading and writing, so your neighbors will affect the response time of your website and application.

Recently Amazon rolled out a badly branded offering called Provisioned IOPS (I/O operations per second). That might sound like a great name to techies, but to everyone else it does not mean anything noteworthy. It is nonetheless important. It means you can lock in and guarantee the disk performance your database is thirsty for. If you are running a database on Amazon, then definitely take a look at this.

4. Serial Processing. When customers are waiting to check out in a grocery store with 10 cash registers open, that is working in parallel. If every cashier is taking a lunch break and only one register is open, that is serialization. Suddenly a huge line forms and snakes around the store, frustrating not only the customers checking out, but also those still shopping. It happens at bridge tollbooths when not enough are open, or in sports arenas when everyone leaves at the same time.

Web applications should definitely avoid serialization. Do you see a backup waiting for API calls to return, or are all your Web nodes working off one search server? Anywhere your application forms a line, that is serialization and should be avoided at all costs.
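The grocery-store analogy translates directly into code. In this sketch, call_api is a hypothetical stand-in for a slow network call, and the pool of 10 workers plays the role of 10 open registers; the sleep duration is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def call_api(request_id: int) -> int:
    """Hypothetical stand-in for a slow network call."""
    time.sleep(0.05)
    return request_id * 2


def run_serial(ids: List[int]) -> List[int]:
    """One register open: each call waits for the previous one."""
    return [call_api(i) for i in ids]


def run_parallel(ids: List[int], workers: int = 10) -> List[int]:
    """Ten registers open: calls overlap while they wait on the network."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(call_api, ids))


ids = list(range(10))

start = time.perf_counter()
serial = run_serial(ids)
serial_seconds = time.perf_counter() - start

start = time.perf_counter()
parallel = run_parallel(ids)
parallel_seconds = time.perf_counter() - start

print(f"serial: {serial_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
```

Threads suit this case because the work is waiting on I/O; CPU-bound work would call for a different approach.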

5. Missing Feature Flags. Developers normally build in features and functionality for business units and customers. Feature flags are operational necessities that allow those features to be turned on or off in either back-end config files or administration UI pages.

Why are they so important? If you have ever had to put out a fire at 4 A.M., then you understand the need for contingency plans. You must be able to disable ratings, comments, and other auxiliary features of an application, just so the whole thing does not fall over. What’s more, as new features are rolled out, sometimes the kinks do not show up until a horde of Internet users hit the site. Feature flags allow you to disable a few features, without taking the whole site offline.
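A feature flag can be as simple as a lookup against a config file. The sketch below assumes a JSON config and hypothetical flag names ("comments", "ratings"); real deployments often reload the flags at runtime so that no restart is needed.

```python
import json
from typing import List


class FeatureFlags:
    """Reads on/off switches from a back-end config (JSON assumed here)."""

    def __init__(self, config_text: str) -> None:
        self._flags = json.loads(config_text)

    def enabled(self, name: str) -> bool:
        # Unknown flags default to off, so a typo never enables a feature.
        return bool(self._flags.get(name, False))


def render_page(flags: FeatureFlags) -> List[str]:
    """Always serve the core content; add auxiliary features only when enabled."""
    parts = ["article"]
    if flags.enabled("comments"):
        parts.append("comments")
    if flags.enabled("ratings"):
        parts.append("ratings")
    return parts


# Hypothetical config during an incident: ratings switched off.
CONFIG = '{"comments": true, "ratings": false}'
flags = FeatureFlags(CONFIG)
print(render_page(flags))
```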

6. Single Copy of the Database. You should always have at least one read replica or MySQL slave online. This allows for faster recovery in the event that the master fails, even if you are not using the slave for browsing—but you should do that, too, since you are going to build a browse-only mode, right?

Having multiple copies of a database suggests horizontal scale. Once you have two, you will see how three or four could benefit your infrastructure.

7. Using your Database for Queuing. A MySQL database server is great at storing tables of data and the relationships between them. Unfortunately, it is not great at serving as a queue for an application. Despite this, a lot of developers fall into the habit of using a table for this purpose. For example, does your app have some sort of jobs table, or perhaps a status column, with values such as “in-process,” “inqueue,” and “finished”? If so, you are inadvertently using tables as queues.

Such solutions run into scalability hang-ups because of locking challenges and the scan and poll process to find more work. They will typically slow down a database. Fortunately, some good open source solutions are available such as RabbitMQ (http://www.rabbitmq.com/) or Amazon’s SQS (Simple Queue Service; http://aws.amazon.com/sqs/).
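The key difference is that a real queue hands work to a consumer as it arrives, with no polling and no locking of rows. The sketch below uses Python's standard-library queue as an in-process stand-in to show that shape; it is not the RabbitMQ or SQS client API, and the job names are made up.

```python
import queue
import threading
from typing import List

jobs: queue.Queue = queue.Queue()
finished: List[str] = []


def worker() -> None:
    while True:
        job = jobs.get()  # blocks until work arrives: no polling, no table scan
        if job is None:   # sentinel value: shut the worker down
            jobs.task_done()
            break
        finished.append(job.upper())  # placeholder for the real work
        jobs.task_done()


consumer = threading.Thread(target=worker)
consumer.start()

for name in ["resize-image", "send-email"]:
    jobs.put(name)
jobs.put(None)

jobs.join()      # wait until every queued item has been processed
consumer.join()
print(finished)
```

A message broker adds what this in-process version lacks: durability across restarts and delivery to consumers on other machines.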

8. Using a Database for Full-Text Searching. Page searching is another area where applications get caught. Although MySQL has had full-text indexes for some time, they have worked only with MyISAM tables, the legacy table type that is not crash proof, not transactional, and just an all-around headache for developers.

One solution is to go with a dedicated search server such as Solr (http://lucene.apache.org/solr/). These servers have good libraries for whatever language you are using and high-speed access to search. These nodes also scale well and will not bog down your database.

Alternatively, Sphinx SE, a storage engine for MySQL, integrates the Sphinx server right into the database. If you are looking on the horizon, full-text search is coming to InnoDB, MySQL’s default storage engine, in version 5.6 of MySQL.

9. Object-Relational Mappers. The ORM, the bane of every Web company that has ever used it, is like cooking with MSG. Once you start using it, it is hard to wean yourself off.

The plus side is that ORMs help with rapid prototyping and allow developers who are not SQL masters to read and write to the database as objects or memory structures. They are faster, cleaner, and offer quicker delivery of functionality—until you roll out on servers and want to scale.

Then your database administrator (DBA) will come to the team with a very slow-running, very ugly query and say, “Where is this in the application? We need to fix it. It needs to be rewritten.” Your dev team then says, “We do not know!” And an incredulous look is returned from the ops team.

The ability to track down bad SQL and root it out is essential. It will happen, and your DBA team will need to index properly. If queries are coming from ORMs, they do not lend themselves to all of this. Then you are faced with a huge technical debt and the challenge of ripping and replacing.

10. Missing Instrumentation. Instrumentation provides speedometers and fuel gauges for Web applications. You would not drive a car without them, would you? They expose information about an application’s internal workings. They record timings and provide feedback about where an application spends most of its time.

One very popular Web services solution is New Relic (http://newrelic.com/), which provides visual dashboards that appeal to everyone: project managers, developers, the operations team, and even business units all can peer at the graphs and see what is happening.

Some open source instrumentation projects are also available.
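Even without a dedicated tool, the core idea of recording timings takes only a few lines. This is a minimal sketch, assuming a hypothetical build_page function and an in-memory timings table; a real setup would ship these numbers to a metrics system.

```python
import functools
import time
from collections import defaultdict
from typing import Callable, DefaultDict, List

# Function name -> list of observed durations in seconds.
timings: DefaultDict[str, List[float]] = defaultdict(list)


def timed(func: Callable) -> Callable:
    """Record how long each call to func takes, even if it raises."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            timings[func.__name__].append(time.perf_counter() - start)

    return wrapper


@timed
def build_page(user_id: int) -> str:
    """Hypothetical page handler; the sleep stands in for real work."""
    time.sleep(0.01)
    return f"<html>user {user_id}</html>"


page = build_page(7)
slowest = max(timings, key=lambda name: sum(timings[name]))
print(f"most time spent in: {slowest}")
```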

10 Obstacles to Scaling Beyond Optimization Speed

Speed is not the only thing that can gum up scalability. The following 10 problems affect the ability to maintain and build scalability for a Web application. Best practices can avoid these issues.

1. Lack of a Code Repository and Version Control. Though it is rare these days, some Internet companies do still try to build software without version control. Those who use it, however, know the everyday advantage and organizational control it provides for a team.

If you are not using it, you are going to spiral into technical debt as your application becomes more complex. It will not be possible to add more developers and work on different parts of your architecture and scaffolding.

Once you start using version control, be sure to get all components in there, including configuration files and other essentials. Missing pieces that have to be located and tracked down at deployment time become an additional risk.

2. Single Points of Failure. If your data is on a single master database, that is a single point of failure. If your server is sitting on a single disk, that is a single point of failure. This is just technical vernacular for an Achilles heel.

These single points of failure must be rooted out at all costs. The trouble is recognizing them. Even relying on a single cloud provider can be a single point of failure. Amazon’s data center or zone failures are a case in point. If it had multiple providers or used Amazon differently, AirBNB would not have failed when part of Amazon Web Services went down in October 2012 (http://www.iheavy.com/2012/10/23/airbnb-didnt-have-to-fail/).

3. Lack of Browse-only mode. If you have ever tried to post a comment on Yelp, Facebook, or Tumblr late at night, you have probably gotten a message to the effect, “This feature is not available. Try again later.” “Later” might be five minutes or 60 minutes. Or maybe you are trying to book airline tickets and you have to retry a few times. To nontechnical users, the site still appears to be working normally, but it just has this strange blip.

What’s happening here is that the application is allowing you to browse the site, but not make any changes. It means the master database or some storage component is offline.

Browse-only mode is implemented by keeping multiple read-only copies of the master database, using something such as MySQL replication or Amazon read replicas. Since the application will run almost fully in browse mode, it can hit those databases without the need for the master database. This is a big, big win.
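One way to picture this is a small router that sends reads to replicas and refuses writes when the master is down. The class, the replica names, and the master_up flag are all hypothetical; a real application would derive that flag from a health check.

```python
from typing import List


class ReadOnlyMode(Exception):
    """Raised when a write is attempted while the master is offline."""


class Router:
    def __init__(self, master_up: bool, replicas: List[str]) -> None:
        self.master_up = master_up
        self.replicas = replicas
        self._next = 0

    def for_read(self) -> str:
        # Round-robin across read-only copies; the master is never needed.
        replica = self.replicas[self._next % len(self.replicas)]
        self._next += 1
        return replica

    def for_write(self) -> str:
        if not self.master_up:
            raise ReadOnlyMode("This feature is not available. Try again later.")
        return "master"


router = Router(master_up=False, replicas=["replica-1", "replica-2"])
read_target = router.for_read()  # browsing still works

try:
    router.for_write()
    message = "saved"
except ReadOnlyMode as exc:
    message = str(exc)  # show the user a friendly notice instead of an error page

print(read_target, "|", message)
```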

4. Weak communication. Communication may seem a strange place to take a discussion on scalability, but the technical layers of Web applications cannot be separated from the social and cultural ones that the team navigates.

Strong lines of communication are necessary, and team members must know whom to go to when they are in trouble. Good communication demands confident and knowledgeable leadership, with the openness to listen and improve.

5. Lack of Documentation. Documentation happens at a lot of layers in a Web application. Developers need to document procedures, functions, and pages to provide hints and insight to future generations looking at that code. Operations teams need to add comments to config files to provide change history and insight when things break. Business processes and relationships can and should be documented in company wikis to help people find their own solutions to problems.

Documentation helps at all levels and is a habit everyone should embrace.

6. Lack of Fire drills. Fire drills always get pushed to the back burner. Teams may say, “We have our backups; we’re covered.” True, until they try to restore those backups and find they are incomplete, missing some config file or crucial piece of data. If that happens when you are fighting a real fire, then something you do not want will be hitting that office fan.

Fire drills allow a team to run through the motions, before they really need to. Your company should task part of its ops team with restoring its entire application a few times a year. With AWS and cloud servers, this is easier than it once was. It is a good idea to spin up servers just to prove that all your components are being backed up. In the process you will learn how long it takes, where the difficult steps lie, and what to look out for.

7. Insufficient Monitoring and Metrics. Monitoring falls into the same category of best practices as version control: it should be so basic you cannot imagine working without it; yet there are Web shops that go without, or with insufficient monitoring—some server or key component is left out.

Collecting system- and server-level data over time is important, and so is tracking application and business-level availability. If you do not want to roll your own, consider a Web services solution to provide your business with real uptime.

8. Cowboy Operations. You roll into town on a fast horse, walk into the saloon with guns blazing, and you think you are going to make friends? Nope, you are only going to scare everyone into complying but with no real loyalty. That is because you will probably break things as often as you fix them. Confidence is great, but it is best to work with teams. The intelligence of the team is greater than any of the individuals.

Teams need to communicate what they are changing, do so in a managed way, plan for any outage, and so forth. Caution and risk aversion win the day. Always have a Plan B. You should be able to undo the change you just made, and be aware which commands are destructive and which ones cannot be undone.

9. Growing Technical Debt. As an app evolves over the years, the team may spend more and more time maintaining and supporting old code, squashing bugs, or ironing out kinks. Therefore, they have less time to devote to new features. This balance of time devoted to debt servicing versus real new features must be managed closely. If you find your technical debt increasing, it may be time to bite the bullet and rewrite. It will take time away from the immediate business benefit of new functionality and customer features, but it is best for the long term.

Technical debt is not always easy to recognize or focus on. As you are building features or squashing bugs, you are more attuned to details at the five-foot level. It is easy to miss the forest for the trees. That is why generalists are better at scaling the Web (http://www.iheavy.com/2011/10/25/why-generalists-better-scaling-web/).

10. Insufficient Logging. Logging is closely related to metrics and monitoring. You may enable a lot more of it when you are troubleshooting and debugging, but on an ongoing basis you will need it for key essential services. Server syslogs, Apache and MySQL logs, caching logs, among others should all be working. You can always dial down logging if you are getting too much of it, or trim and rotate log files, discarding the old ones.
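For application logs, trimming and rotating can be handled by the logging library itself. The sketch below uses Python's standard RotatingFileHandler; the logger name, the tiny 1 KB size limit, and the temporary directory are assumptions chosen so that the rotation is easy to observe.

```python
import logging
import logging.handlers
import os
import tempfile

log_dir = tempfile.mkdtemp()  # throwaway directory for this demonstration
log_path = os.path.join(log_dir, "app.log")

# Keep the current file plus two older ones; anything older is discarded.
handler = logging.handlers.RotatingFileHandler(log_path, maxBytes=1024, backupCount=2)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("webapp")  # hypothetical logger name
logger.setLevel(logging.INFO)         # dial up to DEBUG when troubleshooting
logger.propagate = False
logger.addHandler(handler)

for i in range(200):
    logger.info("request %d served", i)

logger.removeHandler(handler)
handler.close()

log_files = sorted(os.listdir(log_dir))
print(log_files)
```

Server logs such as syslog, Apache, and MySQL are usually rotated by an external tool instead, but the policy is the same: cap the size, keep a few old files, discard the rest.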

Conclusion

These 20 obstacles can affect scalability and result in performance degradation of a Web application. By avoiding these obstacles and following the practices outlined here, Web-application developers will be able to minimize latency and guarantee that their applications can scale as needed.