Updated: May 26, 2023
The ever-accelerating business world has been obsessed for quite a while with the word “agility” when it comes to IT product development. And it’s not hard to see why: the prospect of getting the much-desired solution to fix your pain points ASAP is and always will be enticing.
However, the COVID-19 pandemic has shown that simply being Agile may not be good enough anymore: Transposit’s report suggests that ever since the world was forced to adopt a remote working model, MTTR (mean time to repair) has increased (93.6%), with downtime rates growing by as much as 68.4%!
Now, how does one ensure that agility will no longer hamper reliability? Meet DevOps and Site Reliability Engineering (SRE). These concepts have been around for quite some time and, considering the challenges posed in the wake of the pandemic, are more relevant than ever.
The difference between SRE and DevOps
DevOps and SRE both took shape in the 2000s (SRE at Google around 2003, the term “DevOps” around 2009) to find an equilibrium between development agility and system stability. However, the terms are often misused: some think they are the same; some think they’re competing ideas. Most believe that a company always has to choose between them. So let’s dig deep into what DevOps and SRE are and whether a DevOps vs SRE debate even makes sense.
DevOps is, at its core, a methodology that reduces silos between development, testing, QA, and operations teams to accelerate application development, improve software quality, increase infrastructure availability, maximize application performance, and reduce costs. Now, all of this sounds awesome. There is one problem, though. DevOps is basically a set of abstract principles, some of which many companies have struggled to put into practice. To help end this struggle, in 2016 Google published a book called “Site Reliability Engineering”, shedding light on their internal DevOps practices, but most importantly, giving easy-to-understand practical advice on how to make DevOps work.
So, in a nutshell, while DevOps is a philosophy, SRE is one good way of implementing that philosophy.
How does SRE add to DevOps methodology?
If you look at the DevOps manifesto, you’ll probably find that there are 5 key categories that DevOps is broken into, the methodology’s mantras, if you like, which could be put as:
- Removing organizational silos
- Accepting failure as normal
- Deploying small incremental changes
- Benefiting from tooling and automation
- Measuring everything
All of these are undoubtedly integral to a team’s success in finding a proper balance between agility and reliability, and we’re about to find out why. But, again, as neat as they sound, they don’t look like concrete instructions (“Measuring everything”? Well, of course!). So let’s go through these principles one by one and see where the difference between DevOps and SRE truly lies.
Removing organizational silos
DevOps idea: the communication between people who write code (developers) and people who maintain the running system (operators) must be seamless so that quick changes in code do not damage the infrastructure or threaten the system’s stability. Initially aimed at breaking down the wall between dev and ops teams, DevOps has rapidly spread beyond the software delivery pipeline to areas such as security, finance, HR, marketing, and sales, where collaboration is vital.
SRE implementation: you need to build a tight-knit cross-functional team not only by bringing developers and operators together but also by expanding synergy-provoking practices to finance, human resources, executive leadership teams, and more. The culture of better communication and knowledge sharing that DevOps and SRE inherently demand can be created via frequent stand-ups, while integration and automation are to be deployed with special toolsets.
Accepting failure as normal
DevOps idea: no man-made system can be 100% reliable, so a failure of the said system shouldn’t be perceived as a disaster by any company, but instead, should be treated as normality… as long as a lesson is learned in the process.
SRE implementation: you need to internally agree on the amount of downtime that is acceptable in given circumstances and be prepared to swiftly deal with system failures (since they are inevitable and shouldn’t come as a surprise); one way to do that is to hold so-called “blameless post-mortems,” where time won’t be wasted on seeking whom to blame for the failure. Instead, the team, in a routine manner, figures out ways of improving the system, focusing on the future, not the past.
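To make “the amount of downtime that is acceptable” concrete, here is a minimal Python sketch that converts an availability target into a downtime allowance. The function name `allowed_downtime` and the 30-day period are illustrative assumptions, not part of any SRE standard or tool.

```python
from datetime import timedelta


def allowed_downtime(availability_target: float, period: timedelta) -> timedelta:
    """Return the downtime allowance implied by an availability target.

    availability_target is a fraction between 0 and 1 (0.999 means 99.9%).
    This helper is hypothetical and exists only for illustration.
    """
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    # Whatever share of the period is not promised as "up" may be spent as downtime.
    return period * (1.0 - availability_target)


if __name__ == "__main__":
    month = timedelta(days=30)  # assumed review period
    for target in (0.99, 0.999, 0.9999):
        minutes = allowed_downtime(target, month).total_seconds() / 60
        print(f"{target:.2%} over 30 days allows {minutes:.1f} minutes of downtime")
```

For a 99.9% target over 30 days, this works out to roughly 43 minutes. Agreeing on such a number up front gives the team a shared yardstick for deciding whether an outage is within expectations or a sign that reliability work should take priority.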
Deploying small incremental changes
DevOps idea: making frequent but small changes to the code helps teams react to issues faster and fix bugs more easily. Why? It’s simple. Looking for a bug in 100 lines of code is far easier than in 100,000 lines of code. It also enables the development process to be generally much more flexible and alert to sudden changes.
SRE implementation: you need to note that it’s not the actual number of deploys per day that matters. Striving for an inordinate number of deploys just for the sake of them is wasted effort. You should indeed deploy often, but also make these deploys count: the more sensible the nature of the deploy is, the easier it is to fix a potential bug, thus reducing the cost of failure.
Benefiting from tooling and automation
DevOps idea: humans are not good at performing large volumes of monotonous work efficiently. Handling crucial workflows manually takes a lot of time and energy, so companies that leverage tooling and automation can improve these processes dramatically.
SRE implementation: you should consider what long-term improvements to the system need to be made and automate the tasks that will be done regularly in a year or a couple of years’ time (SRE calls this “automating this year’s job away”). In doing so, you avoid investing in short-term gains and focus on long-term automation.
Measuring everything
DevOps idea: having tangible metrics that measure different aspects of your development process not only helps to tell whether the company is working on a certain project successfully but also provides justification for this or that business decision.
SRE implementation: you should adopt Service-Level Agreements (SLAs), Service-Level Objectives (SLOs), and Service-Level Indicators (SLIs), keep track of the system’s Mean Time Between Failures (MTBF) and Mean Time To Recovery (MTTR), and have a defined Error Budget. These will help your project run much more efficiently.
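As a rough illustration of how MTBF and MTTR relate to availability, the Python sketch below derives both from a list of outages. The `Incident` record, the function names, and the sample numbers are all made up for this example; real incident data would come from your own monitoring or ticketing system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    """One outage, in hours since the start of the observation window (hypothetical format)."""
    start_hour: float
    end_hour: float

    @property
    def duration(self) -> float:
        return self.end_hour - self.start_hour


def mean_time_to_recovery(incidents: List[Incident]) -> float:
    """Average outage length in hours."""
    return sum(i.duration for i in incidents) / len(incidents)


def mean_time_between_failures(incidents: List[Incident], window_hours: float) -> float:
    """Total time the system was up, divided by the number of failures."""
    uptime = window_hours - sum(i.duration for i in incidents)
    return uptime / len(incidents)


def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability: share of time the system is up."""
    return mtbf / (mtbf + mttr)


if __name__ == "__main__":
    window = 720.0  # a 30-day window, in hours (assumed)
    outages = [Incident(100.0, 101.0), Incident(400.0, 400.5), Incident(600.0, 601.5)]
    mttr = mean_time_to_recovery(outages)
    mtbf = mean_time_between_failures(outages, window)
    print(f"MTTR: {mttr:.2f} h, MTBF: {mtbf:.2f} h, availability: {availability(mtbf, mttr):.4%}")
```

The sketch shows why both numbers matter: availability improves either when failures become rarer (higher MTBF) or when recovery becomes faster (lower MTTR).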
We’ll soon elaborate on what SLAs, SLOs, and SLIs actually entail, but the rest of the picture should be more than clear by now. DevOps was created to make IT development better. Meanwhile, SRE was meant to show HOW exactly we should do that. As many SRE specialists like to say, “SRE implements DevOps.”
SRE Metrics
SRE, as a concept, is next to impossible to imagine without the Service-Level Agreement (SLA), Service-Level Objective (SLO), and Service-Level Indicator (SLI). As stated above, these are the core notions that relate to the measurement of your SRE implementation success in many ways. And, much like the names of these concepts, their natures are very similar to each other, yet with some crucial differences:
- SLA is referred to as an agreement between the service provider and the customer about such metrics as uptime, downtime, responsiveness, responsibilities, etc. In other words, it acts as a set of promises made to the customer and represented by various metrics, and a set of consequences if these promises are not lived up to;
- SLO is, in turn, referred to as an agreement within an SLA about one specific metric, e.g., uptime or response time. Basically, an SLO is an individual promise made to the customer. So, in this respect, it’s possible to see an SLA as a certain set of SLOs;
- Lastly, SLI is an indicator, i.e. the actual measured value, that shows whether the system is functioning in compliance with this or that SLO.
A typical example of all these three notions working together would be something along these lines: an SLA you made with your customer states that the system will be available 99.9% of the time (the so-called “three nines of availability”), so it would have the SLO in it that would be 99.9% uptime, and the SLI would be the actual measurement of the system’s uptime.
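The example above can be sketched in a few lines of Python. Here the SLI is approximated as the share of successful requests, which is one common way to measure availability but is an assumption of this sketch rather than something the example prescribes; the function names and the request counts are likewise invented for illustration.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Measured availability: fraction of requests served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means no budget has been used, 0.0 means it is exhausted,
    and a negative value means the SLO has been breached.
    """
    budget = 1.0 - slo  # failures the SLO tolerates
    if budget <= 0.0:
        raise ValueError("an SLO of 100% leaves no error budget")
    spent = 1.0 - sli   # failures actually observed
    return 1.0 - spent / budget


if __name__ == "__main__":
    slo = 0.999  # the "three nines" objective from the example
    sli = availability_sli(good_requests=999_400, total_requests=1_000_000)
    remaining = error_budget_remaining(sli, slo)
    status = "met" if sli >= slo else "breached"
    print(f"SLI {sli:.4%} vs SLO {slo:.1%}: {status}, {remaining:.0%} of error budget left")
```

With these sample numbers the service is above its objective and has used a little more than half of its error budget. A team could use the remaining budget to decide how much risk, such as new releases, it can still afford in the period.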
Which types of companies need SRE and DevOps?
Considering that DevOps and SRE are there to assist development teams with securing great system stability whilst still being very agile, it’s relatively safe to say that any dev company should, to some extent, have a grasp of what DevOps/SRE techniques are and how to implement them. They’re modern software development essentials.
In our previous pieces, we’ve already looked at how beneficial DevOps can be for large-scale manufacturing businesses and at the massive impact it has had on financial services giants, but the sheer brilliance of DevOps and SRE is in their universal applicability: your business does NOT have to be a software development business to reap benefits from these practices. As long as you’re dealing with update roll-outs, infrastructural changes, growth, and upscaling, feel free to delve into this philosophy.
And, effectively, there’s no team that’s too small for DevOps/SRE, either. You don’t even need to have a dedicated SRE specialist if you’re a small company. In this case, it may pay to train one of your team members to use the SRE methodology as the learning curve is not that massive.
So taking all of that into account, we can easily make a case that the ideas and concepts behind DevOps and SRE are there for every business to relish — large enterprise or a small start-up, IT or non-IT, they’re for everyone.
DevOps or SRE? You can leverage them both
In an attempt to settle the Site Reliability Engineering vs. DevOps debate, we can now say for certain that there’s no point in an either-or statement. In fact, how can we be talking about a debate here if the two things we are desperately trying to contrast are virtually the same, with one being a vital part of the other?
If you say that you can do DevOps well, chances are you do that with the help of SRE principles.
If you say that you can do SRE well, you should realize that we’re technically talking about DevOps.
So it’s not a “red pill–blue pill” scenario at all: both DevOps and SRE are to be embraced, and we’re very excited to see how they both develop in years to come.