Becoming a Site Reliability Engineer (SRE)

November 1, 2019

Introduction

They say writing is a form of meditation. Well, I always say it, and I think it's true. Reflecting on your past and writing down your thoughts is a meditative activity because it helps you ponder your decisions and become more self-aware. This article is my reflection on a very recent career move that I've made. It's also a contemplation of my past four years at Buffer.

I've recently switched careers, from being a product engineer to a site reliability engineer (SRE). I became Buffer's first SRE this year and learned a lot from the experience. I want to share everything I've learned along the way, as well as the challenges I faced. If you're a developer who is thinking about getting into SRE, I'm sure you will find this article very useful.

This is one of the rare instances where I sat down one Saturday morning and finished the first draft in one sitting. There is a lot to cover, and I tried to keep it short but insightful.

Let's start, shall we?

Who am I?

My name is Tigran, and I'm a senior software engineer at Buffer. Buffer is a fully remote company with no office. I primarily work from home or a co-working space in NYC. I joined Buffer almost four years ago as a backend engineer on a team called "Buffer for Business." We went through a lot of team restructuring after I started. My team consisted of a PM, product engineers, a designer, and a data engineer.

Before Buffer, I was in a fin-tech startup based in NYC for a year, and before that, I was at school doing my MS in Computer Science. When I was at school, I did two internships, and one of them was at Twitter. I loved my time at school as I got exposed to a great environment with a bunch of smart people.

Right after joining Buffer, I started to work on backend services that pull and store social media analytics data for our customers. We were heavy AWS users, and as I was a novice to AWS, I learned a lot about the services running in the cloud. We were using Elastic Beanstalk for all our environments, and I got to deploy my first pull request on my first day. It felt amazing to run a Slack command that could deploy my GitHub PR to our production servers.

As a backend engineer, I worked mostly with PHP and JavaScript. I wanted to learn as much as possible about our infrastructure. How was the build system set up on Jenkins, and who wrote the Slack bot that could do things for engineers? How could I spin up a new EC2 instance in our infrastructure? I had so many questions that made me excited about my work.

Then, over time, I transitioned into a full-stack engineering role working with React/Redux and our PHP API. I had to teach myself React so I could help my co-workers build Buffer Analyze. It was a brand new product that we were planning to launch as part of our multi-product vision. It felt like I was in a startup with fast-paced turnaround and decision making.

Becoming a full-stack engineer also helped me make strides on my side projects. I could finally build things from zero to production and not struggle with state management or JavaScript code. When I started my career, I thought I wanted to be a specialist. My view has changed over time, and I started to grasp the power of being a generalist and how much impact and influence I could have on the company. One day, I could build interactive charts with React, and the next, deploy a backend SQS worker or look up the right format for a cron schedule.

Breaking down the monolith

In early to mid-2016, we decided to break down our monolithic application into multiple services and embrace Kubernetes. We were confident that it would allow us to move faster and align our engineering work with the multi-product vision we had as a company. By that time, we had worked with Docker and had a good experience with it. Docker entirely powered our local dev environment, and we had moved away from Vagrant.

As this was in 2016, Kubernetes wasn't as popular and mainstream as it is now. We knew we were throwing ourselves into a completely new world that had a lot of uncertainties and open questions. We knew that we would face technical challenges that we had never encountered before. I remember that our CTO, Dan, and I would attend local k8s meetups to see how other engineering teams solved the problems we were so keen to find answers for.

Of course, the whole microservice transition was incremental and involved a lot of experimentation and mistakes. Even with the mistakes, looking back now, it turned out to be the right decision. I think, most importantly, we never stopped iterating on our infrastructure and making it more reliable and secure.

When we decided to transition from AWS to Kubernetes, I was one of the early advocates for this change. As a product engineer, I saw the value we would get if we could break down our analytics infrastructure into microservices. We could use Kubernetes to orchestrate and manage our entire cloud infrastructure.

Why make a role switch?

Until early 2019, I continued to be part of the Buffer Analyze team. I helped my team move our analytics data infrastructure from AWS to Kubernetes. We started to build new features on top of our new cloud infrastructure and launched the product to the public. It was exciting to see new users praise Buffer Analyze.

Of course, this didn't come without new challenges for our team. The move from a monolithic to a microservice architecture increased our infrastructure complexity by several orders of magnitude. The transition introduced the major, complex problems that one faces when working with distributed systems. We needed to think about networking, observability, deployments, and security.

Sometimes, this meant that product engineering teams had to solve some of these common problems instead of focusing on building great products for our customers. This complexity reduced product engineering productivity and team velocity.

I thought this created a need for a new type of engineer who could be embedded into product teams. This role would help engineers implement reliability-oriented features and educate them about operational best practices, automation, monitoring, and observability. These engineers are usually called SREs (Site Reliability Engineers).

I knew I loved working with developers. I started my side project last year to help developers with cron job monitoring. My work at Buffer and my personal project helped me discover my passion for developer productivity and tools. I also knew that I enjoyed being a backend generalist and preferred working on infrastructure-type projects. I loved writing code but also enjoyed understanding how complex distributed systems work. It's not a coincidence that I took mainly database and distributed systems courses at RIT and specialized in databases.

Considering my professional interests and the needs of the company, I decided to write a pitch document. I named the document "Tigran's role change motivation" and shared it with our engineering leadership. Of course, it wasn't that sudden. I had prior calls with my manager and co-workers, openly sharing the motivation behind my role change. I'm very grateful that we have a very open culture at Buffer, and I felt very supported throughout the entire process of my role transition.

After weeks of discussion and working closely with my manager, we created a role transition plan for me. It was a one- to three-month plan during which I would hand my responsibilities to a newly hired engineer. We also worked together to define what my new role would be in the infrastructure team. We never had an SRE before, so as the first SRE at Buffer, I had to think deeply about my roadmap and future projects.

After two months of preparation and hard work, I gradually handed all my product engineering responsibilities to the product engineer who replaced me in my old team. I was lucky he picked up things very quickly.

In August of 2019, I became a full-time Site Reliability Engineer at Buffer as part of the infrastructure team. Our team consists of four engineers and one engineering manager. It's already been three months full of all kinds of emotions and learnings. I got started with my projects, met my new team in person at our product-engineering summit, and started taking on more SRE responsibilities.

What I'm working on as an SRE

Quite often, I see people wondering and asking questions about an SRE's role in an organization. I don't think there is a definitive answer, as the role varies across companies. However, I believe there are common projects that most SREs work on as part of their role. At least, that's what I learned by talking to other SREs in our industry and attending tech conferences.

Because I like being a generalist, I work across the entire stack of infrastructure. I like juggling multiple projects at the same time; it makes my work more fun and exciting. Working on a vast range of technical problems helps me create mental models that come in handy when thinking about large systems. I like working on various projects because I can use this practice to see the broad picture of the problems I'm trying to solve.

I usually decide what I should work on, but I always consult with my manager. I think as one grows into a more senior role, this becomes inevitable. One does a lot more thinking and less coding. I always try to work on impactful projects, and this was, in part, why I wanted to switch my role. In this way, I could expand my scope of influence at the company. Every problem I tackle is likely to impact all engineers across multiple teams.

The vision of our team is to make product teams