June 2, 2014
I wrote earlier that I was going to SREcon14 in Santa Clara, and I did. I also met some readers – hi everyone!
Attending this conference where I wasn’t automatically part of the target audience was fascinating. I went in with a lot of misconceptions about SRE in general, and without a good feel for who SREs were. Now, I’ve got to say that I have a much better concept of both Site Reliability Engineering and the people who do it. And I’d like to share.
To start out, quickly, let’s do the impossible. System administrators have said for a long time that it’s really hard to define system administration, just because the responsibilities are so expansive. Yes, that’s true, but when you boil it down, what you get is that system administrators are the IT operations staff responsible for designing, building, and maintaining an organization’s computer infrastructure. It’s really not that hard to define, is it?
The next step is to work on DevOps, the recent movement to help organizations’ IT move in agile, performant ways. It takes the IT operations staff mentioned above and builds a healthy working relationship with the organization’s development team, allowing each side to see how its work influences and affects the other. By combining knowledge and effort, the two produce a more robust, reliable, agile product as a result.
I came into SREcon thinking that Site Reliability Engineering was basically DevOps++. I thought it was the practices of DevOps at scale, and I thought that it was basically super-smart, classically trained system administrators doing awesome things at a big scale. But I was wrong. So wrong.
SREcon keynote speaker Ben Treynor (founder of Google’s SRE team) laid it out clearly:
Site Reliability is what happens when a software engineer is tasked with what used to be called operations.
In a recent interview, he explained further:
To SRE, software engineers are people who know enough about programming languages, data structures and algorithms, and performance to be able to write software that is effective. Crucially, while the software may accomplish a task at launch, it also has to be efficient at accomplishing that task even as the task grows.
In other words, when infrastructure is code, you only hire people who write code. And that makes a lot of sense, especially for anyone who operates at web-scale. In the keynote, he also mentioned that he sees the end of “on-premise computing”, where we own the physical hardware that our services run on, with *aaS taking its place (most likely provided by a company that hires SREs to run its infrastructure). While I do think that the majority of new infrastructures being built today will be coming online with that model, I suspect it will take longer to win the hearts and minds of everyone, even once the technology and price points both make sense.
In my mind, the distinction between classically trained system administrators and SREs is really pretty important. Historically, a *lot* of system administrators have come into their roles through work like tech support, or even just running Linux on their desktops and then transitioning that into server work. It should be pretty clear that the same path won’t work to transition into SRE.
The majority of people doing SRE work will inevitably come out of Computer Science courses in universities, because that’s the source of most of the programmers who can do what Google et al. need them to do. It’s not that you can’t learn software engineering on your own – it’s just that most people don’t. And no amount of time on the help desk will compensate.
To everyone who is currently a system administrator, but not an SRE, what should you take from this? Well, take it as evidence that software is eating the world, and if you don’t know how to program, now is an excellent time to start. Seriously, go now. Also, if you’re mentoring someone, encourage them to learn, as well. There’s probably nothing more important to someone’s future in IT than the ability to program. Don’t worry about the language, just learn something to start with.
Not every infrastructure needs an SRE, but every infrastructure could use an administrator who acted more like one. Instrumentation, scalability, repeatability…these are the watchwords of reliable infrastructures, and each one of us should work to improve them, and by doing so, improve our infrastructures.