As our systems grow exponentially larger and more complex, so does the challenge DevOps engineers and SREs face in keeping production systems online. To get ahead of ticket queues and improve availability, we need to automate the remediation of issues entirely. This is more attainable than most people realize: while the causes of incidents may number in the thousands, the number of remediations is usually small and consistent.
In this session, I’ll describe real outages I saw at AWS, group them by their common infrastructure resolutions, and describe how we built speculative, automated resolutions that cut tickets, improved availability, and lowered costs while growing our fleet 1000x. You’ll walk away with concrete ideas that you can put into place to improve availability and reduce burnout.
Prior to Shoreline, I spent 7.5 years at Amazon Web Services (AWS), where I ran their analytic and relational database services. Within analytics, this included Athena, Data Pipeline, EMR, Glue, Lake Formation, Managed Blockchain, and Redshift. Within relational databases, this included Aurora for MySQL and PostgreSQL and RDS for MariaDB, MySQL, Oracle, PostgreSQL, and SQL Server. Managing services deployed on millions of nodes while ticketing on a per-instance basis taught me the importance of distributed control and automation.