Enterprise reliability is a discipline that ensures applications will deliver the required business functionality in a consistent, predictable, and cost-effective manner without compromising core aspects such as availability, performance, and maintainability.
While achieving a high level of reliability is a common goal of most enterprises, reliability engineering involving third-party applications can be a complex landscape. First-party software affords the luxury of building a modular and extensible application that integrates seamlessly with an enterprise’s IT ecosystem. Third-party software does not always have the same flexibility. Incorporating an off-the-shelf enterprise application within an existing IT ecosystem, without compromising functionality and reliability, is a classic engineering and philosophical problem the CIO’s office has to deal with all the time.
Despite this complexity, enterprises still pursue and select third-party software to power their business verticals such as human resources (HR), legal, and finance, since it makes economic sense to pay for an enterprise application rather than building the software in-house. Enterprises sometimes base their buying decisions only on the required business functionality, however, and tend to overlook the application’s overall reliability. This can compromise the availability and supportability of the application and increase the cost of managing it in the long run.
This article describes a core set of principles and engineering methodologies that enterprises can apply to navigate the complex environment of enterprise reliability and to deliver highly reliable and cost-efficient applications.
Reliability axioms are a set of principles that emphasize the values and behaviors that help foster and maintain the culture of enterprise reliability.
Culture = Reliability Axioms (Values) x Reliability Engineering (Behaviors)
These five core axioms define enterprise reliability and form the basis of this article: Focus on the customer; select the right vendor; invest in a common application platform; engineer reliability to be cost effective; and build an engineering-centric support organization (or site reliability engineering, SRE).
Focus on the customer. Customer objectives determine the reliability of an application. Having a well-defined set of customer objectives is foundational, as these translate into tangible and measurable goals. These goals, also known as service-level objectives (SLOs), drive the overall reliability posture of an application such as availability, performance, data integrity, monitoring, and responding to incidents. SLOs ensure an application is engineered to meet the precise needs of the customer.
Select the right vendor. The choice of vendor impacts the reliability of the core application. Choosing an enterprise application involves much more than just buying software that meets the business functionality. It involves partnering with a vendor that thinks and builds software with similar principles and values to the enterprise (for example, secure by design, scalable software components, open APIs for extensibility, and ease of support and maintenance).
Invest in a common platform. The overall reliability of an application is the sum total of the reliability of the business’s core application and all its dependencies. Transforming the baseline dependencies into a common platform can help standardize and bring consistency in how an application is managed. Using a common platform can drastically decrease technical silos and increase overall reliability and efficiency. A common platform could mean having a shared deployment manager, CI/CD (continuous integration/continuous delivery) frameworks, or shared service management workflows for monitoring, logging, backups, and so on.
Engineer reliability to be cost effective. Over-engineering reliability breaks the ROI (return on investment) curve. Reliability is a function of how mature an application is and, as a result, its overall availability. Imagine you have a service with a 99.9% SLO. Adding an extra nine (99.99%) sublinearly increases the availability of your service, as shown in Figure 1.
Figure 1. Reliability as a function of availability.
While improving the reliability of your service from 43.2 minutes of downtime per month to 4.32 minutes can be tempting, it can represent a significant engineering feat with a hefty price tag. Therefore, when specifying the required availability of a service, the decision should be based on the business requirements: “How available (how many nines) does my system need to be, in order to meet the business objectives?”
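The downtime figures quoted here follow directly from the availability target. The short sketch below reproduces them, assuming a 30-day month; the function name `allowed_downtime_minutes` is illustrative and not taken from any tool.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes (assumes a 30-day month)


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: float = MINUTES_PER_30_DAY_MONTH) -> float:
    """Return the maximum downtime, in minutes, that an availability target permits."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_minutes * (100 - availability_pct) / 100


# Each additional nine cuts the permitted downtime by a factor of ten.
for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.3f} minutes/month")
```

For 99.9% this yields 43.2 minutes per month, and for 99.99% it yields 4.32 minutes, matching the values above.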
Build an engineering-centric support organization. Application reliability is preserved by SREs. Designing a perfect application doesn’t guarantee a high-quality production experience, at least not without the support of an SRE organization. Both the application and the IT ecosystem where it runs change constantly, with developers pushing new code, vendors publishing new security patches, and infrastructure teams updating the software of the underlying platform.
Reliability is not a “build once and forget for life” construct; it is a continuous process of maintaining and upholding its principles and methodologies. Enterprises that recognize the need and invest in developing SRE skills stand out from the rest because they recognize that without these skills, enterprise reliability cannot be sustained.
Designing Enterprise Reliability Engineering
Designing for enterprise reliability is a multidimensional problem that spans multiple entities: customer, vendor, platform engineering, cost, and the SRE organization. The rest of this article expands on these axioms and describes the behaviors, principles, and methodologies that influence and shape the discipline of enterprise reliability.
Customer objectives. “If you don’t understand your customer objectives, then you do not need to exist as an org.” Whether you are a traditional IT organization or a mature SRE org, this fundamental principle holds true.
Translate customer objectives to SLOs. In an enterprise setting, a typical customer is the owner of a business vertical such as legal, finance, or HR trying to accomplish a specific business goal. Having a well-defined set of business objectives lays the foundation for developing concrete functional requirements, allowing you to effectively translate those requirements into quantifiable and measurable outcomes, also known as SLOs.
Defining SLOs early on leads to a better design and implementation of the overall system. Arriving at a clear set of measurable SLOs, however, is an exhaustive process with a lot of considerations (for example, what is technically feasible vs. infeasible, expensive vs. cost effective, reliable vs. fragile). Closely involving the customer and vendor throughout this process is crucial, as it develops a shared understanding of requirements, constraints, and trade-offs, and helps reconcile the gap between aspirational and achievable SLOs.
Documenting the SLOs, including a strong rationale for the established targets and thresholds (for example, 99.9% uptime), is key, as this becomes the contract among all the parties (SRE team, software vendor, and customer). This rigor also creates a culture of transparency and openness to inform how the system should be designed and how the service should operate. For a deep dive into engineering SLOs effectively, refer to the SLO chapter in the SRE book.1
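One possible way to make a documented SLO concrete is to record it as structured data that can be checked against measurements. The following is a minimal sketch under that assumption; the `SLO` class, its field names, and the example service are hypothetical and only illustrate the idea of keeping the target and its rationale together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A documented service-level objective (all field names are illustrative)."""
    name: str
    target_pct: float   # for example, 99.9 means 99.9% of events must be good
    window_days: int    # measurement window agreed by all parties
    rationale: str      # why this target was chosen

    def is_met(self, good_events: int, total_events: int) -> bool:
        """Compare the measured share of good events against the target."""
        if total_events == 0:
            return True  # no events in the window, so nothing violated the target
        return 100 * good_events / total_events >= self.target_pct


# Hypothetical example record.
uptime = SLO(
    name="hr-portal availability",
    target_pct=99.9,
    window_days=30,
    rationale="Agreed by the business owner, the vendor, and the SRE team.",
)
print(uptime.is_met(good_events=9_995, total_events=10_000))
```

Keeping the rationale next to the target means the agreement can be revisited later without reconstructing why the number was chosen.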
Empathy toward customer and vendor is key. Customers (business owners) may not always have the same level of understanding of the problem space. Their approach could be purely business driven, and they may expect the application never to go down. Likewise, the vendor may not entirely understand how the IT ecosystem is designed and cannot operate independently to deliver the system. The SRE team should become a true partner to bring alignment between the customer and vendor and develop a shared understanding of the overall objective, specific requirements, and constraints of the domain.
Given the nature of third-party domains, it may be difficult to find a perfect system that meets 100% of the business functionality, as there are many variables in the equation (for example, third-party software, hardware, cost, and vendor). Therefore, working closely with the customers in developing a set of detailed requirements and distinguishing core vs. optional requirements helps with the trade-off analysis—for example, if the application has constraints, evaluating their impact on business objectives or revisiting and adjusting customer requirements without compromising the business objectives, or finding a new vendor altogether.
Taking customers through this journey from beginning to end helps them better understand the space and weigh in on all important considerations, ultimately allowing them to make effective business-driven decisions.
SLOs as a means to customer happiness; solve for SLOs. Solving for customer happiness based on objective goals is key; it is better to cater to functionality based on the customers’ objectives in a measurable way (SLOs). Customers have only one fundamental criterion: Is the system able to translate business objectives into business functionality in a cost-effective and reliable manner?
Having this objective view creates a transparent and blameless culture. The key point to remember, however, is that SLOs are not fixed for life: As business needs evolve, the system SLOs need to be revisited. Therefore, having a strong discipline of revisiting the SLO agreements periodically with the customer helps tackle these changes and adjust the scope and expectations as business needs evolve.
Vendor selection. Enterprise-application engineering with a vendor is a long-term investment that goes beyond the application itself. Therefore, it’s important to select a vendor that aligns with the values and principles of the enterprise—for example, software design discipline (scale and performance), data security and privacy management, use of open standards, and ease of operations and maintenance.
To ensure a vendor meets its requirements, an enterprise needs a rigorous evaluation and validation process. Two distinct sets of evaluations determine and shape the reliability of an application:
- Functional evaluation: represents the business functionality required by the customer.
- Infrastructure evaluation: represents the application’s IT requirements.
Functional evaluation. Functional requirements are derived directly from customer objectives and form the basis of the evaluation process. Each functional requirement has a set of key functional characteristics. The goal of the evaluation process is to do an in-depth analysis of these characteristics and assess the feasibility of third-party software.
To understand this, consider the following scenario. Assume your enterprise is evaluating a third-party IT inventory system to manage your corporate IT asset information. One of your business objectives is to predict the supply and demand for your inventory in real time. This could result in a requirement for a centralized global inventory database that updates in real time every time a checkout happens.
Based on this scenario, let’s analyze the core characteristics that a functional evaluation should delve into.
Functional specification. Does the vendor understand the functional requirement and the expected outcome? In the scenario just described, the functional requirement is to maintain a global inventory database for all asset information. The expected outcome is the ability to track asset information and update the global inventory database in real time.
Dependencies and constraints. Does the vendor need to be aware of any core dependencies or constraints? For example, does the global inventory database depend on any external entities? Is a centralized database required for reads and writes, or is a distributed setup required? What are the pros and cons of both approaches?
Functional interfaces. Does the vendor understand all the end-to-end functional interfaces involved in this requirement? For example, does the inventory database have any reporting interfaces? How does the admin interact with the database? How do the users interact with the database when they do a checkout? What is the end-to-end flow?
Geographic requirements. Does the enterprise have a presence across the globe? Will users access this inventory system from different regions? What are the specific performance and latency requirements for these users?
Scale and load requirements. How many users are going to use the inventory system, both globally and per region? What are the QPS (queries per second) or load requirements for these users? Are there any peak or off-peak volume requirements or considerations?
Security requirements. Does the vendor understand the security posture of the system? Are there any specific access restrictions based on user type (for example, admin vs. normal user)? What is the authentication and authorization mechanism? Does the application depend on a centralized authorization service such as LDAP (Lightweight Directory Access Protocol) or AD (Active Directory)? Is there a single sign-on dependency?
Compliance requirements. Does the vendor understand and meet the compliance requirements for this application?
Exception-handling requirements. Does the vendor understand the key failure modes based on the design of the system? How does the vendor’s software handle exceptions (for example, request timeouts, retries during write failures, and connection resets)?
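To make the question about retries concrete, the sketch below shows one common pattern an evaluator might look for: retrying transient failures with exponential backoff and giving up after a bounded number of attempts. It is an illustration only, not a description of any vendor’s software; `call_with_retries` and its parameters are hypothetical names.

```python
import time


def call_with_retries(operation, max_attempts=3, base_delay_s=0.1,
                      retry_on=(TimeoutError, ConnectionResetError),
                      sleep=time.sleep):
    """Run operation(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure to the caller
            # Wait 0.1 s, then 0.2 s, then 0.4 s, ... before the next attempt.
            sleep(base_delay_s * 2 ** (attempt - 1))


# Usage: a simulated write that times out twice before succeeding.
attempts = {"count": 0}


def flaky_write():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated request timeout")
    return "written"


print(call_with_retries(flaky_write))
```

Retrying a write is only safe when the write is idempotent, which is itself a useful point to raise with the vendor.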
Release management. What software release management discipline does the vendor use? What is the release cycle? How are changes tested before being released to the customer? What is the QA/qualification process?
Testing and validation. Does the vendor have a holistic testing plan that covers the end-to-end workflow, and does it include all the edge cases? What is the testing plan for measuring load and performance?
Infrastructure evaluation. Infrastructure requirements create the foundation for the whole application. Therefore, ensuring the end-to-end reliability of this base layer is critical.
Every enterprise is unique and has its own set of infrastructure requirements and constraints. When evaluating an enterprise application, you want to ensure the vendor can comply with the requirements of the enterprise’s IT ecosystem. For example, suppose your enterprise has fully adopted virtualization for internal efficiencies and other business reasons. In this case, the vendor’s application should be compatible with and supported on VMware. Otherwise, the application could become a nonstandard model in your IT organization, driving up costs related to infrastructure, licensing, hardware, and support.
Following are a set of key infrastructure requirements to ensure a vendor’s software is compatible with an enterprise’s IT ecosystem.
Core infrastructure. The vendor must meet an enterprise’s hardware, network, and operating system requirements. This includes the specific hardware models, enterprise databases, and software and operating system versions that the IT team supports.
Networking. The vendor must meet the authentication and authorization requirements of the network—for example, LDAP or AD, or single sign-on requirements.
Infrastructure security. The vendor should understand and meet the enterprise’s security policies related to access management, perimeter security, and data encryption.
Infrastructure sizing. The enterprise should derive a concrete sizing plan including the number of environments and compute and storage requirements based on its functional requirements, and evaluate the vendor closely to ensure its software can scale and meet those sizing needs.
High availability and disaster recovery. The SRE team should have a clear understanding of reliability requirements based on SLOs and customer objectives. Deciding on the high-availability design (such as active-active or active-passive), the disaster-recovery requirements and strategy, and the data recovery (recovery point objective) and restore (recovery time objective) targets is critical when engaging the vendor. The enterprise must ensure the vendor’s application can meet its requirements, or that the vendor is willing to collaborate with the SRE team to provide the needed reliability.
Data management. The vendor should have a clear data management discipline and methodologies when it comes to data integrity, backup, recovery, and retention. Does the vendor have a strong data security discipline such as encryption of data both in transit and at rest?
Integrations. Make a list of all the dependent systems and necessary integrations that the IT ecosystem requires—for example, authentication services such as LDAP or AD; corporate mail service and the necessary integrations; and service management workflows such as centralized backups, monitoring, and logging.
Operability. Ensure the vendor has a strong discipline of software updates/upgrades, clearly defined maintenance windows, and so on.
These requirements provide an overview of the core aspects and characteristics you should evaluate when choosing an enterprise application. Note this is not an exhaustive list, and requirements may vary among enterprises.
Functional and infrastructure requirements can heavily influence the design and delivery of an application. Therefore, evaluating the feasibility of these requirements is a crucial step in engineering the reliability of an application.
Common application platform. Most enterprises rely on third-party software to support the operations and needs of their business verticals (Figure 2). Running different third-party applications, however, can lead to a large number of disparate systems within an enterprise. Not having a common baseline across applications makes maintaining the reliability and efficiency of service more difficult over time. This creates a lot of overhead for the SRE team and increases the organization’s operational costs.
A common platform provides a standard operating environment in which to run all of a system’s applications, enhancing the overall reliability and efficiency of an enterprise. The key principle of implementing a common platform is to identify, build, and enforce a set of shared modules and standards that can be reused across the applications that support the business verticals.
On the other hand, overengineering a common platform can have a negative impact. If a platform has too many standards in place or becomes too rigid, an enterprise’s delivery and execution speed can decrease significantly.
The goal is to develop a strategy that allows enterprises to find the right balance between optimizing for reliability and maintaining the development speed needed to deliver and support business functionality. Finding this balance requires a careful analysis of the trade-offs and net benefits.
Common platform layout. An application platform consists of a set of modules that can be grouped into three main categories (Figure 3):
- Infrastructure deployment modules.
- Application management modules.
- Common service modules.
Figure 3. Common platform layout.
Infrastructure deployment modules provide intent-based deployment of an end-to-end application environment based on a set of resource requirements such as CPU, memory, operating systems, and the number of instances. This mechanism is highly efficient since the workflows only need to be configured once and can be triggered as needed. It also provides a standardized, consistent, and predictable environment, which improves overall reliability.
Many enterprises are already embracing open-source technologies to help them manage the underlying infrastructure of their applications. Tools such as Terraform provide abstractions to handle the provisioning and deployment of end-to-end environments agnostic to the underlying platform (for example, on premises vs. cloud).
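The idea of intent-based deployment can be sketched in a few lines: the desired environment is declared once, and a planning step works out what must change to reach it. The example below is a simplified illustration in Python, not Terraform syntax; `EnvironmentIntent` and `plan` are hypothetical names, and a real tool would track far more state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentIntent:
    """Declared resource requirements for one application environment."""
    cpu_cores: int
    memory_gb: int
    operating_system: str
    instances: int


def plan(intent: EnvironmentIntent, running_instances: int) -> list[str]:
    """Return the actions needed to move from the current state to the intent."""
    delta = intent.instances - running_instances
    if delta > 0:
        spec = f"{intent.cpu_cores} CPU, {intent.memory_gb} GB, {intent.operating_system}"
        return [f"create instance ({spec})"] * delta
    if delta < 0:
        return ["destroy instance"] * -delta
    return []  # already converged: nothing to do


# Hypothetical intent: three identical instances, one of which already exists.
intent = EnvironmentIntent(cpu_cores=4, memory_gb=16, operating_system="linux", instances=3)
for action in plan(intent, running_instances=1):
    print(action)
```

Because the plan is derived from the declared intent, running it repeatedly against a converged environment produces no actions, which is what makes the resulting environments consistent and predictable.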
Application management modules handle critical workflows during the life of an application. A few examples of these workflows include:
- Configuration management workflows to deploy application configuration.
- Release management workflows to manage software releases and rollbacks.
- Security management workflows to manage secrets and certificate deployments.
Software solutions such as Puppet, Chef, and Ansible provide frameworks and solutions for enterprises to orchestrate these workflows across their applications.
Common service modules manage the standardized workflows that can be shared across all applications, such as logging, monitoring, and reporting. This layer can also include custom service modules for the specific needs of an enterprise, such as a custom web front end or a single sign-on service.
Some examples of common service modules include:
- Monitoring module to collect and publish metrics for reporting and alerting.
- Backup module to execute backups, retention, and recovery.
- Log collection module to securely ship logs to a centralized log service.
- Custom Weblogic/Tomcat as a service offering middleware capabilities.
- Managed DBaaS (database as a service) module to manage database workflows.
Combining infrastructure deployment, application management, and common service modules creates a platform that enables enterprises to move away from managing monolithic applications and into a new realm of modular, extensible, and reusable applications.
Cost engineering. When enterprises opt for third-party software, they are making a cost- and ROI-based decision to use a “reliable” enterprise application that delivers the business functionality in a cost-effective manner. Determining the right reliability-to-cost trade-off that sustains the ROI curve is the crux of cost engineering.
Reliability-to-cost trade-off. Figure 4 illustrates how reliability (the number of nines) directly influences the overall availability or reduction in downtime. The reduction with each additional nine is sublinear. While it is extremely tempting to add a nine, it is important to recognize that engineering an additional nine can be expensive, and overengineering reliability produces diminishing ROI. To understand this, let’s look at the following scenario.
Figure 4. Availability and reliability.
Enterprise ABC is looking for a third-party sales application that can provide market analysis and insights. The sales team predicts they can generate an average of $600/hour of revenue by leveraging those insights. Their revenue target per quarter is approximately $1.2 million. What is the required uptime (availability SLO) for this application?
If the application was available 100% of the time, the maximum revenue would be:
Net revenue = hours in a quarter (3 months x 30 days x 24 hours = 2,160) x earnings per hour ($600)
$1,296,000 (~$1.29M) = 2,160 hours in a quarter x $600 per hour
The net revenue (~$1.29 million) clearly exceeds the target revenue of $1.2 million, but 100% availability is infeasible. Figure 5 illustrates how to choose the perfect availability SLO that meets the ROI.
Figure 5. Selecting the right availability SLO.
Here are the key conclusions reached in this scenario:
- A 90% availability SLO generates ~$1.16 million in revenue, which falls short of the target revenue of $1.2 million. This SLO is not feasible.
- A 95% availability SLO generates ~$1.23 million in revenue, which comfortably meets (slightly exceeds) the revenue objective of $1.2 million. This SLO is feasible.
- A 99% availability SLO generates ~$1.28 million in revenue, which far exceeds the revenue objective of $1.2 million, but it comes with additional overhead:
  - A 95% SLO guarantees no more than 36 hours of downtime per month and still comfortably meets the target revenue.
  - In contrast, a 99% SLO guarantees no more than 7.2 hours of downtime per month, but the cost of engineering and support can be higher.
  - As long as the cost to engineer a 99% SLO does not exceed $80,000 ($1.28 million minus $1.2 million), this is a viable option.
- The net revenue growth for each additional nine provides diminishing returns (delta revenue). For example, between 99.99% and 99.999%:
  - There is a significant reduction in downtime per month from 4.32 minutes to 25.92 seconds, but the revenue increase is only $116.64 per quarter.
  - To choose a 99.999% SLO, the added engineering cost should be <$116.64.
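The figures in this scenario can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: it assumes a 2,160-hour quarter (3 x 30 x 24) and revenue that scales linearly with availability, and the name `quarterly_revenue` is hypothetical.

```python
HOURS_PER_QUARTER = 3 * 30 * 24   # 2,160 hours, as in the scenario
REVENUE_PER_HOUR = 600            # dollars per hour
TARGET_REVENUE = 1_200_000        # dollars per quarter


def quarterly_revenue(availability_pct: float) -> float:
    """Revenue for a quarter if the application is up availability_pct of the time."""
    return HOURS_PER_QUARTER * REVENUE_PER_HOUR * availability_pct / 100


previous = None
for slo in (90, 95, 99, 99.9, 99.99, 99.999):
    revenue = quarterly_revenue(slo)
    # Delta is the extra revenue gained over the previous, lower SLO.
    delta = 0.0 if previous is None else revenue - previous
    verdict = "meets target" if revenue >= TARGET_REVENUE else "misses target"
    print(f"{slo:>7}%  ${revenue:>12,.2f}  delta ${delta:>10,.2f}  {verdict}")
    previous = revenue
```

The delta column makes the diminishing return visible: moving from 99.99% to 99.999% adds $116.64 per quarter, so any engineering cost above that amount outweighs the gain.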
Account for application dependencies. To design a system with a 99.9% SLO, the rule of thumb is to have all critical dependent systems provide an additional nine (that is, 99.99%). This means you have to factor in the reliability investment (additional cost) for your application and all of its critical dependencies, because a system is only as available as the sum of its dependencies.2
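A rough calculation shows why the extra nine matters. The sketch below assumes a simplified model in which every critical dependency sits on the request path and fails independently, so the availabilities multiply. The service’s own 99.99% figure and the count of three dependencies are hypothetical.

```python
from math import prod


def composite_availability(service_pct: float, dependency_pcts: list[float]) -> float:
    """Availability (in percent) of a service whose critical dependencies must all be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    fractions = [service_pct / 100] + [p / 100 for p in dependency_pcts]
    return 100 * prod(fractions)


# Hypothetical service with three critical dependencies.
with_extra_nine = composite_availability(99.99, [99.99, 99.99, 99.99])
without_extra_nine = composite_availability(99.99, [99.9, 99.9, 99.9])
print(f"dependencies at 99.99%: {with_extra_nine:.3f}%")
print(f"dependencies at 99.9%:  {without_extra_nine:.3f}%")
```

Under these assumptions, dependencies at 99.99% leave the composite at roughly 99.96%, which clears a 99.9% SLO. Dependencies at only 99.9% pull it down to roughly 99.69%, which does not.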
Choose an SLO that fits the ROI curve. The ideal SLO is one that delivers the required functionality with a degree of reliability that fits within the ROI curve. In the previous scenario, the best SLO would be 95%, because it is the least expensive option that meets the business goal ($1.2 million).
Overengineering reliability produces diminishing ROI. The previous scenario shows that increasing the availability of a service does not always translate into significant revenue growth. In fact, with each additional nine, the benefit of engineering the reliability increases sublinearly, breaking the ROI curve.
Preserving Enterprise Reliability
Reliability is not just a systems design problem. You can have the world’s best-designed system, but without proper rigor and discipline, preserving core aspects of the system such as availability, performance, and security can become extremely difficult. Reliability is a responsibility that should be shared across all teams involved in the system, including vendors, development, and SRE. The SRE teams are ultimately accountable, however, since they are responsible for achieving their SLOs. During the lifecycle of an application there are a few critical junctures where maintaining proper rigor can translate into preserving the reliability of the service.
Design for standardization and uniformity. Reliability is preserved when you recognize the importance of uniformity and invest in standardization. One of the challenges of enterprise applications is there is no agreement or consensus among vendors on common standards around software technologies, operating systems, and workflow orchestration methodologies, such as release management and patch management. Each vendor provides its own flavor.
The role of SRE is to publish common standards for the portfolio of tools and technology they support (the base operating system, release management, and configuration frameworks) and the minimum operational maturity they expect from the vendor (for example, automated installs and seamless patching workflows).
Mature enterprises that rely on multiple software vendors recognize the importance of having a baseline ecosystem and strong operational maturity. They not only consider business functionality, but also account for ecosystem maturity when looking for third-party applications.
Change management. Change is powerful. You can build a highly reliable system, but one small change (a bad config push or a software bug) can compromise the reliability of the entire system. Preserving reliability comes from rigorous change management, with a set of checks and balances that can detect, prevent, or minimize the impact of problems. SRE should be responsible for maintaining this rigor. Consider the following checks and balances.
Measure, monitor, and alert. Measure, monitor, and introduce thresholds to alert for everything that is on the critical path of your SLO. This provides the ability to proactively detect and fix issues.
Streamline change. Require all changes to go through validation and regression testing. This should be enforced as a strong requirement across all teams that introduce changes.
Dedicated canary environment. Every critical production application should have a dedicated canary environment as a prerequisite. It should be an exact replica of the production environment. This allows for testing user-facing impact such as load and performance.
Phased rollouts help reveal unforeseen issues (those not uncovered by tests) that are discovered only in production. This provides the agility to roll back the changes quickly and minimize the impact.
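A phased rollout can be pictured as a loop over stages with a health check after each one. The sketch below is a minimal illustration under that assumption; `phased_rollout`, the stage names, and the callback signatures are hypothetical, and a real pipeline would add soak times and finer-grained health signals.

```python
def phased_rollout(stages, deploy, healthy, rollback):
    """Deploy stage by stage; on the first unhealthy stage, roll back everything deployed."""
    completed = []
    for stage in stages:
        deploy(stage)
        completed.append(stage)
        if not healthy(stage):
            # Undo in reverse order so the most recent change is reverted first.
            for done in reversed(completed):
                rollback(done)
            return False
    return True


# Usage: a simulated rollout in which the second stage turns out to be unhealthy.
log = []
succeeded = phased_rollout(
    stages=["canary", "region-1", "region-2"],
    deploy=lambda stage: log.append(("deploy", stage)),
    healthy=lambda stage: stage != "region-1",
    rollback=lambda stage: log.append(("rollback", stage)),
)
print(succeeded, log)
```

In this run the problem is caught at the second stage, so the final stage is never touched and the impact stays limited to the stages already deployed.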
Rollbacks and restore. Another key discipline is to ensure every change can be rolled back. It is particularly important to understand the dependency graph of the change and ensure an atomic rollback. This is difficult in complex systems, but in such cases having a clear restore point is key for most critical changes.
Error budgets are a simple concept. Every service has a target SLO, and if it exceeds that SLO, then that positive delta of uptime becomes the budget to use in pushing any changes or releases. This is a powerful concept explained in depth in the SRE book.1 Sharing this rigor with your application development team is a good way to ensure service reliability.
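The arithmetic behind an error budget is small enough to sketch. The example below uses one common formulation, in which the budget is the downtime the SLO permits over a window and each incident consumes part of it. A positive remainder corresponds to the service running above its SLO. The function names and the 30-day window are assumptions made for illustration.

```python
def error_budget_remaining_minutes(slo_pct: float, downtime_minutes: float,
                                   window_days: int = 30) -> float:
    """Return the unspent error budget, in minutes, for the given window."""
    window_minutes = window_days * 24 * 60
    budget_minutes = window_minutes * (100 - slo_pct) / 100
    return budget_minutes - downtime_minutes


def can_release(slo_pct: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """Allow further changes only while some error budget is left."""
    return error_budget_remaining_minutes(slo_pct, downtime_minutes, window_days) > 0


# A 99.9% SLO permits 43.2 minutes of downtime in a 30-day window.
print(error_budget_remaining_minutes(99.9, downtime_minutes=10))
print(can_release(99.9, downtime_minutes=50))
```

When the budget is exhausted, releases pause until reliability recovers, which gives development and SRE teams a shared, objective rule for when to push changes.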
Outages and incidents. No matter how reliable a system is, you should anticipate and prepare for a disaster. Rather than solving for no outages, which is impractical, the focus should be on effectively managing the outage (minimizing downtime) and learning from it, so the same patterns don’t repeat.
Resiliency testing. The goal here is to stress test application resiliency by breaking the system, observing the effects of the breakage, and subsequently improving the reliability of the application.
Incident preparedness. The SRE team should periodically run fire drills to practice incident management that involves extensive coordination with partner teams, timely communication to stakeholders, and restoring the service as soon as possible. Responding to and handling an actual incident without this preparation can reduce the speed and effectiveness of restoring the service.
Learning from outages. A repeated outage is not an outage anymore; it is a mistake. For every outage there should be a thorough post-mortem that clearly identifies the root cause of the outage and focuses on what went wrong and what can be improved going forward. It is critical for enterprises to foster a blameless post-mortem culture that focuses on improving the reliability of the application.
The Future of Enterprise Reliability
Over the past few years, cloud platform providers have increasingly focused on enterprises, offering a suite of secure, reliable, and cost-effective products from highly scalable compute, storage, and networking services to modernized managed offerings such as container as a service (Kubernetes), serverless, and DBaaS. In addition, cloud providers are delivering advanced services in the realms of AI (artificial intelligence), ML (machine learning), and big data, opening a wide range of possibilities for enterprises to rethink and transform their business verticals.
This shift represents a tremendous opportunity for enterprises to embrace and adopt the cloud. Undertaking such a large-scale migration, however, introduces a new challenge: How can enterprises adapt and rapidly evolve without reducing their reliability?
Cloud migration strategy. Enterprises typically have complex business requirements, so a lift-and-shift strategy to migrate 100% of their workloads to a single cloud provider may not be feasible. A hybrid cloud environment provides the flexibility for workloads to operate seamlessly across both public and private cloud environments. This approach greatly simplifies the cloud adoption strategy and provides a controlled environment that ensures a predictable level of reliability throughout the transition to the cloud.
Enterprises that thoughtfully embrace the hybrid cloud strategy have less risk in terms of overall reliability and have a faster path to cloud transformation. Investing in a common application platform, coupled with the adoption of technologies such as Kubernetes (https://kubernetes.io/), Istio (https://istio.io/), and serverless computing (https://en.wikipedia.org/wiki/Serverless_computing), provides the flexibility to operate workloads, agnostic to the cloud provider. Technologies such as the GCP (Google Cloud Platform) Anthos platform (https://cloud.google.com/anthos/) can also help enterprises expedite their transition to the cloud in a reliable and efficient manner.
VEC ecosystem. Developing a strong relationship among vendors, enterprises, and cloud providers is pivotal to the future of enterprise reliability. Cloud providers need to motivate software vendors, through partnership programs, to modernize third-party software embracing cloud-based technologies and building certified multicloud-compliant software offerings. This VEC (vendor-enterprise-cloud) ecosystem coupled with the technological shift will bring a rapid transformation shaping the enterprise domain.
Maintaining enterprise reliability is a continuous process that is in a crucial moment with the advent of the cloud. The next decade will be the era of large-scale enterprise transformations leveraging cloud capabilities, and only those enterprises that grasp the discipline of reliability engineering will be able to transform successfully into the realm of cloud-based enterprise computing.