In this Twitch Engineering Blog article, I share how my former team at Twitch achieved four nines (99.99%) of availability on AWS for a .NET microservice on the world's largest video game streaming platform. The story uses medieval castle defenses as an analogy for availability defenses.
Defend Your Castle: High Availability for High-Stakes Cloud Services
My time at Twitch gave me first-hand insight into operational excellence in a way my earlier career had not. My consulting experience had equipped me with a working knowledge of proper architecture for high availability at scale, but not the experience of actually managing operations.
I had previously thought of operations management as a necessary but somewhat boring area of IT. I couldn't have been more wrong. One of my proudest achievements was leading an initiative to raise availability for a critical .NET microservice at Twitch. We exceeded expectations because we treated it as a moonshot project:
- We treated availability like security, using techniques like threat modeling.
- We used a defense-in-depth approach to designing our availability defenses.
- We innovated. Our systems engineer designed brilliant CDN behaviors that shielded customers from the impact of service failures.
- We optimized. Our senior software engineer attained an astounding 50-fold increase in service performance through pioneering work with response compression techniques.
If you're running .NET microservices on AWS, rest assured that four nines of availability at mammoth scale is both achievable and sustainable. Read my article on the Twitch Engineering Blog for a blow-by-blow account of what we learned and the principles we followed.
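
For a sense of how little room four nines leaves, here is a minimal sketch of the downtime arithmetic in Python. The function name and constants are illustrative and do not come from the Twitch article. It assumes availability is measured simply as uptime divided by total time, which may differ from how a given service defines it.

```python
def downtime_budget_minutes(availability: float, period_minutes: float) -> float:
    """Return the minutes of downtime allowed in a period at a given availability.

    `availability` is a fraction (0.9999 for four nines), assumed here to mean
    uptime divided by total time.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_minutes


MINUTES_PER_YEAR = 365 * 24 * 60      # non-leap year
MINUTES_PER_30_DAYS = 30 * 24 * 60    # a 30-day month

FOUR_NINES = 0.9999

yearly = downtime_budget_minutes(FOUR_NINES, MINUTES_PER_YEAR)
monthly = downtime_budget_minutes(FOUR_NINES, MINUTES_PER_30_DAYS)

print(f"Four nines allows about {yearly:.2f} minutes of downtime per year")
print(f"Four nines allows about {monthly:.2f} minutes of downtime per 30 days")
```

Under these assumptions the budget works out to roughly 52.56 minutes per year, or about 4.32 minutes in a 30-day month, which is why a single prolonged incident can consume the whole allowance.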