Site Reliability Engineering: What Is It? Why Is It Important for Online Businesses?
Hell hath no fury like a customer who cannot access an online service when they want to. Customers expect online services to always be on, always be accessible, and always treat them like there’s no one else in the world who matters more. Thank heavens, then, that these online services can use site reliability engineering (SRE) to keep their customers happy, engaged, and, most importantly, feeling valued.
Human beings are fickle. We like something one minute and have an intense disagreement with it the next. If we use an online service, we want it to dedicate all its attention (and CPU and threads) to us. The slightest appearance of divided attention (or performance lag) is enough to make us jump ship and look for the next online service. So many relationships between people and online services have been frayed, and eventually destroyed, only because we couldn’t bear the thought of sharing a service’s processing power with someone else. Of course, online services don’t want us to leave, but how do they make sure they don’t have to mend broken hearts? Fortunately, there is a particularly effective form of relationship therapy they can use. It’s called site reliability engineering.
What is site reliability engineering?
Site reliability engineering (SRE) is an operational model in which a dedicated team of reliability-focused engineers runs online services more reliably. These engineers work across the realm of “Anything-as-a-Service.” Their reliability-focused work does not discriminate between infrastructure, software, networking, or platforms. If something is going to affect a customer's perception of service reliability, site reliability engineers are looking after it.
Site reliability engineering, as a way of ensuring system agility, availability, and performance, attempts to bring as much of the system into a resilient, predictable, and measurable state as is possible. It supports an organization’s capability to sustain an appropriate level of reliability for its services by implementing and continually enhancing data-driven production feedback loops.
Production feedback loops, you say?
Webster defines feedback as “the transmission of evaluative or corrective information about an action, event, or process to the original or controlling source.” That’s a mouthful for something that could easily be described as “frequent communication.”
Production feedback loops, within the context of an online system, are communication mechanisms between teams and the people who constitute them. In a hyper-competitive economic landscape, riddled with more competitors than you can shake a stick at, technology teams (development and operations) that are not constantly talking to each other are hastening the end of their business.
Historically, a poorly configured feedback loop has been a significant reason for development and operations to have no sense of shared ownership. In recent times, philosophies like Agile and DevOps have alleviated some of these concerns, but old habits die hard. People have a natural tendency to fall back into their own cocoon at the first opportunity. Site reliability engineering (and site reliability engineers) play an intermediary role between these cocooned silos. Site reliability engineers’ effectiveness comes from their dedicated purpose of establishing and maintaining production feedback loops. As a part of their responsibility, they collect, aggregate, synthesize, analyze, and report on data from production servers and ensure both development and operations teams are aware of the state the systems are in.
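As a rough illustration of the "collect, aggregate, and report" part of that responsibility, the sketch below rolls raw production measurements up into a per-service summary that both development and operations could read. Everything in it is an assumption made for illustration: the `RequestRecord` shape, the `summarize` function, the service names, and the sample values are hypothetical and do not come from any particular monitoring tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """One hypothetical production measurement: a single handled request."""
    service: str
    latency_ms: float
    ok: bool


def summarize(records):
    """Aggregate raw request records into a per-service summary."""
    by_service = defaultdict(list)
    for rec in records:
        by_service[rec.service].append(rec)

    report = {}
    for service, recs in by_service.items():
        total = len(recs)
        failures = sum(1 for r in recs if not r.ok)
        report[service] = {
            "requests": total,
            "error_rate": failures / total,
            "max_latency_ms": max(r.latency_ms for r in recs),
        }
    return report


# Illustrative sample data, not real measurements.
records = [
    RequestRecord("checkout", 120.0, True),
    RequestRecord("checkout", 480.0, False),
    RequestRecord("search", 35.0, True),
    RequestRecord("search", 50.0, True),
]

for service, stats in sorted(summarize(records).items()):
    print(service, stats)
```

In practice the records would come from logs or a metrics pipeline rather than a hard-coded list. The point is only that the same summarized view is shared with both teams.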
Production feedback loops rely on data, not opinion
Absolute power corrupts absolutely, they say. Experience and familiarity with systems have a similar tendency, but instead of becoming corrupted, we start to favor instinct over data. For effective site reliability engineering, instinct is derived from proactive analysis of system performance data. Changes in the environment and a deeper understanding of the system will also lead to the measurements themselves being modified to adapt to the changing circumstances.
Appropriate level of reliability
Many online systems are expected to “always be available,” and one of the differences between an online service that succeeds and one that fails is the effort the success stories put into “always being available.” Does this mean that their infrastructure never fails? That’s like saying a barking dog never bites. It does, and I have teeth marks to prove it.
Amazon, Google, Alibaba, Facebook: they all have outages, but the cloak of invisibility they wrap around their infrastructure's failures lets those outages go unnoticed unless they are prolonged. Herein lies their promise of providing “appropriate levels of reliability.”
To provide reliability at appropriate levels, an online service must consider the nature of its business, the users it targets, and the cost involved in keeping the lights on. To find an optimal balance among these three concerns, site reliability engineering tracks an outage “budget.”
The outage “budget” is primarily determined by the service level indicator (SLI) and the service level objective (SLO). The SLI answers the question of what is measured and where, while the SLO is the range of acceptable values for the SLI within a given time period.
*Note: for a detailed discussion of SLI and SLO, read the note ‘SLO in Site Reliability Engineering.’
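To make the outage budget concrete, here is a minimal sketch of the arithmetic, assuming the SLI is availability measured as the fraction of successful requests. The 99.9% target, the 30-day window, the request counts, and the function names are illustrative assumptions rather than values from this note.

```python
def error_budget_minutes(slo_target, window_days):
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def availability_sli(good_requests, total_requests):
    """SLI: the fraction of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def budget_remaining(slo_target, good_requests, total_requests):
    """Fraction of the request-based budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = (1.0 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    return 1.0 - actual_failures / allowed_failures


# Illustrative values only: a 99.9% SLO over a 30-day window.
SLO_TARGET = 0.999
print(f"Allowed downtime: {error_budget_minutes(SLO_TARGET, 30):.1f} minutes")
print(f"Measured SLI: {availability_sli(999_500, 1_000_000):.4%}")
print(f"Budget remaining: {budget_remaining(SLO_TARGET, 999_500, 1_000_000):.0%}")
```

Under these assumptions, a service that fails 500 of 1,000,000 requests has spent about half of its budget. How a team reacts as the budget shrinks (for example, by slowing releases) is a policy choice this sketch does not model.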
Is site reliability engineering just a longer word for DevOps?
In the opinion of this analyst, they have the same fundamental principle supporting them: the setting of proper expectations between all stakeholders to avoid surprises and gotchas. While DevOps focuses on continuous delivery all the way to deployment, SRE focuses on continuous operations at the point of customer value creation.
Whether site reliability engineering is a fancy wrapper on DevOps or its own standalone concept, it is an integral part of an online service’s success. It's no longer just “call help desk to solve your problem” but rather “yes, we know there is a chance of a problem occurring, we know what it is, we know why it is, and before it hits our users, we will resolve it.”