Anurag Gupta

Watch this Session
(includes captions)

Restarts and rollbacks don't fix everything: Automating Day 2 Operations

As our systems grow exponentially larger and more complex, so does the challenge DevOps engineers and SREs face in keeping production systems online. To get ahead of ticket queues and improve availability, we need to automate the remediation of issues entirely. This is more attainable than most people realize: while the causes of an incident may number in the thousands, the number of remediations is usually small and consistent.

In this session, I’ll describe real outages I saw at AWS, group them into their common infrastructure resolutions, and describe how we built speculative, automated resolutions that reduced tickets, improved availability, and reduced costs while growing our fleet 1000x. You’ll walk away with concrete ideas that you can put into place to improve availability and reduce burnout.

More about Anurag Gupta

Prior to Shoreline, I spent 7.5 years at Amazon Web Services (AWS), where I ran their analytic and relational database services. Within analytics, this included Athena, Data Pipeline, EMR, Glue, Lake Formation, Managed Blockchain, and Redshift. Within relational databases, this included Aurora for MySQL and PostgreSQL, and RDS for MariaDB, MySQL, Oracle, PostgreSQL, and SQL Server. Managing services deployed on millions of nodes while ticketing on a per-instance basis taught me the importance of distributed control and automation.