The difference between Site Reliability Engineering, System Administration, and DevOps

Date June 2, 2014

I wrote earlier that I was going to SREcon14 in Santa Clara, and I did. I also met some readers - hi everyone!

Attending this conference where I wasn't automatically part of the target audience was fascinating. I went in with a lot of misconceptions about SRE in general, and without a good feel for who SREs were. Now, I've got to say that I have a much better concept of both Site Reliability Engineering and the people who do it. And I'd like to share.

To start out, quickly, let's do the impossible. System administrators have said for a long time that it's really hard to define system administration, just because the responsibilities are so expansive. Yes, that's true, but when you boil it down, what you get is that System Administrators are the IT operations staff responsible for designing, building, and maintaining an organization's computer infrastructure. It's really not that hard to define, is it?

The next step is to work on DevOps, the recent movement to help organizations' IT move in agile, performant ways. It takes the IT operations staff mentioned above and builds a healthy working relationship between them and the organization's development team. Each side sees how its work influences and affects the other, and by combining knowledge and effort, they produce a more robust, reliable, agile product.

I came into SREcon thinking that Site Reliability Engineering was basically DevOps++. I thought it was the practices of DevOps at scale, carried out by super-smart, classically trained system administrators doing awesome things at that scale. But I was wrong. So wrong.

SREcon keynote speaker Ben Treynor (founder of Google's SRE team) laid it out clearly:

Site Reliability is what happens when a software engineer is tasked with what used to be called operations.

In a recent interview, he explained further:

To SRE, software engineers are people who know enough about programming languages, data structures and algorithms, and performance to be able to write software that is effective. Crucially, while the software may accomplish a task at launch, it also has to be efficient at accomplishing that task even as the task grows.

In other words, when infrastructure is code, you only hire people who write code. And that makes a lot of sense, especially for anyone who operates at web-scale. In the keynote, he also mentioned that he sees the end of "on-premise computing", which is where we own the physical hardware that our services run on; instead, we would use *aaS (most likely provided by a company that hires SREs to run its infrastructure). While I do think that the majority of new infrastructures being built today will come online with that model, I suspect it will take longer to win the hearts and minds of everyone, even once the technology and price points both make sense.

In my mind, the distinction between classically trained system administrators and SREs is really pretty important. Historically, a *lot* of system administrators have come into their roles through work like tech support, or even just running Linux on their desktops and then transitioning that into server work. It should be pretty clear that the same path won't work to transition into SRE.

The majority of people doing SRE work will inevitably come out of Computer Science courses in universities, because that's the source of most of the programmers who can do what Google et al. need them to do. It's not that you can't learn software engineering on your own - it's just that most people don't. And no amount of time on help desk will compensate.

To everyone who is currently a system administrator, but not an SRE, what should you take from this? Well, take it as evidence that software is eating the world, and if you don't know how to program, now is an excellent time to start. Seriously, go now. Also, if you're mentoring someone, encourage them to learn, as well. There's probably nothing more important to someone's future in IT than the ability to program. Don't worry about the language, just learn something to start with.

Not every infrastructure needs an SRE, but every infrastructure could use an administrator who acted more like one. Instrumentation, scalability, repeatability...these are the watchwords of reliable infrastructures, and each one of us should work to improve them, and by doing so, improve our infrastructures.

  • Dag

    I've always said, part of my job is to blow holes in my Director's crazy ideas. Usually get a death glare from him, but, hey, we're running five 9s, must be doing something right.

  • http://www.wizardbeard.net Lord_NShYH

    Matt,

    Great post! SRE is a very interesting discipline, and I understand the need to tap CS graduates - it makes sense. I'm glad I never stopped programming!

    -Lord_NShYH

  • http://darksim905.com Shawn Sheikhzadeh

    At first you said you thought SREs were basically doing Sysadmin/DevOps/Operations, what have you, at scale. Then you go on to say that it's about having essentially what equates to a programming background, going into Operations & learning how to write code to solve a problem with a goal of working great at scale. I'm not too sure I see a distinction. It's all just a subset of System Administration.

  • http://www.standalone-sysadmin.com Matt Simmons

    Shawn: It is definitely doing system administration. The primary difference is in who does it (or, put more correctly, the skills and background that they have).

    A sysadmin in a traditional shop who doesn't program would find it almost impossible to move into an SRE role.

  • http://ifconfig.blogspot.com Fred Woodbridge

    I'm tickled that you simultaneously agree and disagree with how difficult it is to define system administration.
