The Stack Overflow Podcast

How chaos engineering preps developers for the ultimate game day

Episode Summary

On this sponsored episode, our fourth in the series with Intuit, Ben and Ryan chat with Deepthi Panthula, Senior Product Manager, and Shan Anwar, Principal Software Engineer, both of Intuit, about how they use self-serve chaos engineering tools to control the blast radius of failures, how game day tests and drills keep their systems resilient, and how their investment in open-source software powers their program.

Episode Notes

In complex service-oriented architectures, failure can happen in individual servers and containers, then cascade through your system. Good engineering takes into account possible failures. But how do you test whether a solution actually mitigates failures without risking the ire of your customers? That’s where chaos engineering comes in, injecting failures and uncertainty into complex systems so your team can see where your architecture breaks. 

On this sponsored episode, our fourth in the series with Intuit, Ben and Ryan chat with Deepthi Panthula, Senior Product Manager, and Shan Anwar, Principal Software Engineer, both of Intuit, about how they use self-serve chaos engineering tools to control the blast radius of failures, how game day tests and drills keep their systems resilient, and how their investment in open-source software powers their program.

Episode notes: 

Sometimes old practices work in new environments. The Intuit team uses Failure Mode and Effects Analysis (FMEA), a procedure developed by the US military in 1949, to ensure that their developers understand possible points of failure before code makes it to production.

The team uses LitmusChaos to inject failures into their Kubernetes-based systems and power their chaos engineering efforts. It’s open source and maintained by Intuit and others.
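
For a concrete picture of what that kind of fault injection looks like, here is a minimal sketch of a LitmusChaos pod-delete experiment submitted with the Kubernetes Python client. The namespace, labels, durations, and service account are illustrative placeholders rather than Intuit's configuration, and the sketch assumes the LitmusChaos operator and the pod-delete experiment are already installed in the cluster.

    # Minimal sketch: trigger a LitmusChaos pod-delete experiment against a
    # hypothetical "checkout" deployment. Assumes the LitmusChaos operator,
    # the pod-delete ChaosExperiment, and a suitable service account exist.
    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() when running in-cluster

    chaos_engine = {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": "checkout-pod-delete", "namespace": "checkout"},
        "spec": {
            "engineState": "active",
            "appinfo": {  # scope the blast radius to one labeled deployment
                "appns": "checkout",
                "applabel": "app=checkout-service",
                "appkind": "deployment",
            },
            "chaosServiceAccount": "pod-delete-sa",
            "experiments": [{
                "name": "pod-delete",
                "spec": {"components": {"env": [
                    # keep the run short and limited to a quarter of the pods
                    {"name": "TOTAL_CHAOS_DURATION", "value": "60"},
                    {"name": "PODS_AFFECTED_PERC", "value": "25"},
                ]}},
            }],
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="litmuschaos.io", version="v1alpha1",
        namespace="checkout", plural="chaosengines", body=chaos_engine,
    )

The operator reconciles the ChaosEngine, runs the experiment, and records a verdict in a ChaosResult object that can be queried the same way.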

If you’ve been following this series, you know that Intuit is a big fan of open-source software. Special shout out to Argo Workflows, which makes their compute-intensive Kubernetes jobs run much more smoothly.

Connect on LinkedIn with Deepthi Panthula and Zeeshan (Shan) Anwar.

If you want to see what Stack Overflow users are saying about chaos engineering, check out Chaos engineering best practice, asked by user NingLee two years ago.

Episode Transcription

[intro music plays]

Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I'm your host, Ben Popper, Director of Content here at Stack Overflow, joined as I often am by my colleague and collaborator, Ryan Donovan. How's it going, Ryan? 

Ryan Donovan Oh, it's going pretty well. How’re you doing?

BP I'm good. So we have a sponsored episode from Intuit today, and it's on a topic that you and I have worked on before: chaos engineering. It sounds really cool. We did a great blog post on it: Failing Over Without Falling Over. That was a couple years ago, and we're excited to chat about it today. So, Shan, Deepthi, welcome to the Stack Overflow Podcast.

Deepthi Panthula Hi! Thanks a lot for having us here. 

BP Of course. Deepthi, why don't you go first. Just quickly tell the audience who you are, how it is you got into the world of technology, and how you found yourself in this role, maybe specializing in this particular area. 

DP Hello, everyone. I'm Deepthi Panthula, Senior Product Manager at Intuit, owning the vision and strategy for the reliability engineering track, particularly focusing on chaos and performance engineering to improve our system resiliency along with our development teams’ experience and productivity. Usually PMs spend a lot of time thinking about how to delight our customers with new features and capabilities, because our main objective is to attract and retain users who will help generate revenue for the company. But reliability is equally important because it impacts our customer satisfaction, and reliability is key to making sure that our new features and services are actually available when our customers need them. That's what motivated me to get into the reliability space, because this is a must for any product.

BP That's great. So Shan, I'm sure our audience would like to know a little bit of the same about you. What was it that brought you to the world of software and technology, and how'd you find yourself in the role you're at now focusing on things like chaos engineering? 

Shan Anwar Thanks, Ben. Hi, everyone. I'm Shan Anwar, Principal Engineer here at Intuit, leading the chaos engineering initiative under the reliability umbrella, and I work really closely with the product team and Deepthi. So, it's interesting. I started off my career in the information security space, where I learned how to break things, find vulnerabilities, and attack systems. Later, when I was working on a platform, I realized that in the products we build, everything breaks. That's what drew me in, because the similarity is that there are reliability issues in the system, and the question is how I can find them, reproduce them, and figure out the impact, because in the end, whatever product we are delivering, it's to make sure that our customers have a delightful experience.

BP Were you the kind of child who preferred to break toys versus building them? Where does this instinct come from to want to be the one creating problems? No, I'm just joking.

SA It's just curiosity. When you are looking at the thing, how is it built and what's underneath the hood? Just to know the gritty details, sometimes you have to break a little bit to learn more. 

BP Yeah. It's funny, people of my generation, I'm living in a more rural area now and when I talk to neighbors and I say something's wrong with my car, they say, “Oh, just pop the hood and look underneath and figure it out.” And I just stare at them. I've never done that. I would never try to do that. But like you say, some people get curious. They take things apart and then they know how to put them back together. So it's a cool skill to have. 

RD So we've talked with some folks in chaos engineering, and we did a little bit of it in my last job– the engineering team did. I'm wondering how it works for you all in practice. Are you intentionally breaking things in production? Do you have a staging ground for this? 

SA So chaos engineering, just to touch on that topic again, is about intentional, planned failures that we inject on purpose, and by definition it has to happen in production. But at Intuit we wanted to make sure that we can actually practice this before we even go into production. So we have test environments, and we also wanted a controlled way of doing things to limit the blast radius, maybe running in a canary environment, so that we don't impact our customers while we learn. The main thing is how to really test. Chaos engineering has multiple parts to it: what are we trying to achieve, what is our hypothesis, and then building on top of that, how we want to inject failures and how we want to observe them. All those pieces come into play, and that's where the tooling comes in, and that's what we are trying to address with the help of our technologists.

BP Deepthi, anything you want to add on that, sort of the specifics of how you build these chaos engineering simulations or tests in a way that keeps it feeling real, teaches you something, but also doesn't, as you pointed out, maybe put your clients’ experience in danger?

DP So running it once doesn't help. This is an iterative process where you need to keep experimenting, take continuous feedback, make improvements, and keep going. If everything looks good, then you move on to the next failure condition or real-time scenario and continue testing.

BP Yeah, it was interesting. I was listening to someone speak at the Next.js conference and they were saying the thing that had helped them advance most in their career as an engineer was being on call, even early on. It's a very stressful thing, but it's also the thing that puts you in a position to learn when something breaks, and hopefully you're there with a senior engineer who can talk you through it. Being in those situations is where you're going to get some of the most value in terms of education in a career, and that will serve you going forward.

RD You see how the thing works by seeing how it breaks. 

BP Yeah, exactly. So I guess for either of you, I would love to hear some stories about how you've done this in practice, specific tests or trials you ran, things that broke, and how you or your teams or your employees learned from it. What kind of interesting war stories, anecdotes, or data points can you provide for us?

DP So before I answer, let me set some context on the infrastructure scale we operate at Intuit. We have around 6,500 technologists across 1,000+ teams among different business units like TurboTax, QuickBooks, Mailchimp, and Credit Karma, and we are running our services in around 250+ Kubernetes clusters, which is huge, so it is of utmost importance for us to be prepared for failures. And in order to be prepared, we need to build resilient systems. So at Intuit, as part of our organization's charter, we provide different experiences for our development teams to run chaos testing, be it enabling them to run continuous testing as part of the pipeline, giving them a self-serve UI to perform on-demand testing, or making sure that teams participate in these company-wide mandatory game days. Let me share a small story from a recent game day event where we were trying to simulate a Kubernetes failure event. There was one team who had implemented the resiliency patterns and created all the alerts and monitoring, making sure that everything was good before the actual game day, but when they tried it out in the pre-prod environment with limited traffic, they saw around 10 to 15 percent impact. So before the actual game day, the team fixed those bugs, validated again using the capabilities I was just talking about, like the self-serve chaos testing tools, and they were very confident and prepared for the production game day. And as expected, it went really smoothly and there was no impact to the users. This is huge, and the reason I'm sharing it here is that the team not only was proactive in making these changes and making sure that their end users weren’t impacted, but they also shared their journey and learnings with all the other teams in the organization, which motivated those teams to apply these principles to avoid potential incidents. And going back to Shan's point, chaos engineering definitely needs to be run in a dynamic runtime environment, but it always helps if teams start practicing as part of their design phase or as part of any changes to production. That will go a long way.

BP And Shan, how about you? 

SA Yeah, it's the same as Deepthi was mentioning. Aside from the big company-wide game days, we also do smaller drills, where teams come into play because they want to learn rather than already knowing all the resiliency patterns, because most engineers and technologists really focus on solving the business problem. And as you both have probably seen working in the chaos engineering and reliability space, it's pretty heavy lifting for somebody who is not familiar with it, so we also provide guidance and best practices around what to look for. Internally we use something called FMEA, failure mode and effects analysis, at design time. It's a very old practice that started in the 1940s: you go on paper, look at your design, and make sure you uncover all the possible failure scenarios, which is hard to do completely in today's fast-moving world with so much distributed compute happening. That's what's helping us educate our technologists and build tools so that we can provide help at such scale.
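
For readers who want to see what that worksheet boils down to, here is a minimal sketch of an FMEA-style scoring pass using the classic severity, occurrence, and detection ratings and their risk priority number; the failure modes and scores below are invented for illustration, not taken from Intuit's analysis.

    # Minimal FMEA-style sketch: rank hypothetical failure modes by risk
    # priority number (RPN = severity * occurrence * detection), each factor
    # scored 1-10 as in the classic worksheet. All values are invented.
    from dataclasses import dataclass

    @dataclass
    class FailureMode:
        description: str
        severity: int    # 1 (negligible) to 10 (catastrophic)
        occurrence: int  # 1 (rare) to 10 (frequent)
        detection: int   # 1 (easily detected) to 10 (hard to detect)

        @property
        def rpn(self) -> int:
            return self.severity * self.occurrence * self.detection

    modes = [
        FailureMode("Downstream payment API times out", severity=8, occurrence=6, detection=4),
        FailureMode("Pod evicted during a node upgrade", severity=5, occurrence=7, detection=3),
        FailureMode("Stale cache served after region failover", severity=7, occurrence=3, detection=8),
    ]

    # Highest RPN first: these are the scenarios most worth a chaos experiment.
    for mode in sorted(modes, key=lambda m: m.rpn, reverse=True):
        print(f"{mode.rpn:4d}  {mode.description}")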

RD Yeah, it's interesting you say that sometimes you can't even understand which dependencies are going to fail down there. You have so many downstream dependencies in these big service architectures. What do you recommend to folks to build resiliency into their services?

SA So there are many ways a system can break. You're touching on something where we have so many dependencies in place, so what kind of design patterns can we apply? There are resiliency patterns out there already. If your downstream is not responding, first you retry when you see the error, but how many times do you retry? You don't want to keep retrying continuously and cause more chaos and cascading failures. So are you applying best practices such that there's a circuit breaker in place to control that? Is there a fallback if a dependency is not there, and so forth? And then, let's say you're in a particular region or availability zone if you're using AWS, are you able to fail over or evacuate from it so that you're not impacting your SLOs? So there are various ways. I mean, every application is different, but some of these design patterns can help.
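
To make those patterns concrete, here is a minimal sketch of a capped retry with exponential backoff wrapped behind a simple circuit breaker. The thresholds and the call it protects are hypothetical, and a production service would more likely lean on a hardened library or a service mesh for this.

    # Minimal sketch of two of the resiliency patterns discussed here: bounded,
    # backed-off retries and a circuit breaker that stops hammering a failing
    # dependency. Thresholds and the protected call are hypothetical.
    import random
    import time

    class CircuitBreaker:
        def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
            self.failure_threshold = failure_threshold
            self.reset_after_s = reset_after_s
            self.failures = 0
            self.opened_at = None

        def allow(self) -> bool:
            if self.opened_at is None:
                return True
            # half-open: allow a trial call once the cool-down has elapsed
            return time.monotonic() - self.opened_at >= self.reset_after_s

        def record(self, ok: bool) -> None:
            if ok:
                self.failures, self.opened_at = 0, None
                return
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()

    def call_with_retry(fn, breaker: CircuitBreaker, attempts: int = 3, base_delay_s: float = 0.2):
        """Call a flaky dependency a bounded number of times, backing off between tries."""
        if not breaker.allow():
            # fail fast so the caller can use a fallback instead of piling on load
            raise RuntimeError("circuit open: skip the call and use a fallback")
        for attempt in range(attempts):
            try:
                result = fn()
                breaker.record(ok=True)
                return result
            except Exception:
                breaker.record(ok=False)
                if attempt == attempts - 1:
                    raise
                # exponential backoff with jitter so retries don't synchronize
                time.sleep(base_delay_s * (2 ** attempt) + random.uniform(0, 0.1))

The shape is what matters: bounded retries, backoff with jitter, and a breaker that fails fast to a fallback instead of amplifying a cascading failure.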

DP And it's always better to be proactive than reactive. So we always want teams to apply these resiliency patterns continuously, validate them, and get continuous feedback. And as Shan was mentioning, the FMEA, the failure mode and effects analysis, needs to be part of the design phase, and teams should follow the best practices and standard templates so that it's consistent across the organization. Let's say you have a downstream service and you are following one way of doing things while other teams are doing it differently; then things might not work as effectively. So I think it's always better to be on a standard set of tools and best practices. That would really help.

BP I have a question, because we were just talking a bit about microservices and how distributed compute has become, and how difficult it can be to tell where a problem comes from, and you mentioned working with different teams internally. Do you work closely with the observability team, which is often tasked with identifying the source of a problem and tracing it back? You’re intentionally creating those problems ahead of time and running trial exercises, but what does the interaction with the observability team look like?

SA Let me take that question first. So prior to joining the chaos engineering, I was part of the observability team. 

BP Okay, so you’ve got both sides.

DP That's where the smile is coming from.

SA So I can draw on my previous work in that sense. Without observability, I don't think any of these chaos engineering efforts can be successful, because if any failure comes in, the goal is actually to learn from it. Chaos engineering is not about injecting a fault and that's it. It's really about poking at the chaos that's already in your system and seeing if we are able to detect it. With observability, all three pillars come into play. Do you have metrics configured correctly? Do you have logs and traces that really tie things together when you are trying to do the RCA, or root cause analysis? So it's essential that we work really closely together, and they are part of our platform. The modern platform that we are working on provides basically everything to our developers. We have a unified experience, a development portal we call it, and from there we give them all the tooling to start building their application, plus observability and now chaos tooling out of the box.

RD So that's a nice segue into talking about the tooling. I've heard of a couple of tools: Chaos Monkey, probably one of the originals, and Gremlin. I know you all have a different stack. Can you tell us a little bit about how your tooling works and how it compares to the stuff that's out there?

SA I think in the past we were building a lot of small tools, before Chaos Monkey and other things came along, to do the FMEA. And we wanted to be part of cloud native chaos engineering, since we're leading in this space of container-based solutions with Kubernetes. We wanted to start from there, so we looked into the principles of what technologies we could leverage that are cloud native, are open source, have a strong community to back them up, and have a plug-in architecture. Once we started looking deeper into this we came across LitmusChaos. At that time it was actually fairly new, and we are one of the maintainers there too. It provides us the capability to inject failures, and it has a lot of multi-cloud and cross-cloud support now. And the best thing is that with the community and open source, we are able to integrate it with our unified experience within the development portal. So that was the reasoning behind it, but we wanted to make sure that whatever we build, the implementation is abstracted out, so tomorrow we can change to some other provider. If we see that Gremlin has something better, we can just move over to them, or if Amazon's fault injection simulator fits, we can switch over there. At the same time, we want to make sure that we have experiences for our developers, both in a continuous CI/CD way of doing things, ad hoc, and as part of a game day scenario. And one more thing I need to add is that Intuit is big into open source. We have the Argo tools, and in particular I want to call out Argo Workflows, which is a general-purpose container-native workflow engine mostly used for compute-intensive jobs, machine learning, data processing, and even pipelines. But we leverage it for chaos experiments. So that was a really handy tool for us to use and apply for cluster-wide or game day scenarios.
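
As a rough illustration of how a workflow engine can orchestrate a game day, here is a minimal sketch of an Argo Workflow that applies a chaos manifest and then probes a health endpoint, submitted with the Kubernetes Python client. The images, URLs, namespaces, and steps are placeholders for illustration, not Intuit's actual pipeline.

    # Minimal sketch: a two-step Argo Workflow that injects chaos and then
    # verifies a health endpoint. Images, URLs, and the "chaos" namespace are
    # placeholders; a real game day would add RBAC, timeouts, and rollback steps.
    from kubernetes import client, config

    config.load_kube_config()

    workflow = {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "game-day-"},
        "spec": {
            "entrypoint": "game-day",
            "templates": [
                {
                    "name": "game-day",
                    "steps": [  # each inner list runs as one sequential step
                        [{"name": "inject-chaos", "template": "apply-chaos-engine"}],
                        [{"name": "verify-slo", "template": "probe-service"}],
                    ],
                },
                {
                    "name": "apply-chaos-engine",
                    "container": {
                        "image": "bitnami/kubectl",
                        "command": ["kubectl"],
                        "args": ["apply", "-f", "https://chaos-manifests.example.com/pod-delete-engine.yaml"],
                    },
                },
                {
                    "name": "probe-service",
                    "container": {
                        "image": "curlimages/curl",
                        "command": ["curl"],
                        "args": ["--fail", "http://checkout.checkout.svc.cluster.local/healthz"],
                    },
                },
            ],
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="argoproj.io", version="v1alpha1",
        namespace="chaos", plural="workflows", body=workflow,
    )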

BP So we've touched on a few things here that I think have come up in a lot of podcasts– the idea of compute moving to this more distributed and more flexible world, microservices that are extremely useful but can lead to their own complexities, observability being key to running kind of a modern organization, and then chaos engineering being a great way to stress test some of this stuff and make sure it's robust. For both of you, what do you see happening in the future? You don't have to give away the roadmap, but what are the things you're thinking about for the next year or two that you're excited about, things you're going to build to continue to improve this process, or areas where other technologies, whether that be cloud or AI/ML might come in and help you to do a more robust version of chaos engineering?

DP I feel that the chaos engineering discipline is used to build a more reliable tomorrow, and it's pretty much globally embraced by the tech community. It's not an old methodology. It started recently, but it's picking up very well. And as we rely more and more on complex cloud infrastructure and distributed systems, the ability to identify potential issues before they lead to outages is becoming very crucial and important. So I feel there is a lot of future for this chaos engineering discipline for sure. Our focus at Intuit is to deliver simple, secure, and scalable chaos testing tools along with self-serve integrated experiences where developers can experiment with failures and identify weaknesses, to make sure that we have reliable and highly fault-tolerant systems. So two things come to mind that I'm excited about. One is creating a self-serve experience that lets developers release faster while also ensuring high availability and fewer incidents. And now that we already have a solution we have been using, we are contributing to open source because we want to make sure the industry around it can benefit from it too. The second thing is championing reliability ownership across all the development teams. This is not just one team's responsibility; all the teams who are responsible for their services need to own their reliability. I think that's when we will all win together and make sure that our end users have a delightful and uninterrupted experience. So those are the two things that I'm excited about, at least.

[music plays]

BP All righty, everybody. It is that time of the show. I want to shout out a member of the community who came on and spread some knowledge. Shout out today to NingLee, who asked a question two years ago: “What are some best practices for chaos engineering?” Well, the answers might be a bit dated, but we will at least put it in the show notes and you can take a look. Maybe there are some things you can learn going forward and figure out what's changed. I am Ben Popper. I am the Director of Content here at Stack Overflow. You can always find me on Twitter @BenPopper. Email us with questions or suggestions at podcast@stackoverflow.com. And if you like what you heard, leave us a rating and a review. It really helps.

RD I'm Ryan Donovan. I edit the blog here at Stack Overflow. You can find it at stackoverflow.blog. And if you want to find me on Twitter, I'm @RThorDonovan. 

DP I'm Deepthi Panthula, Senior Product Manager at Intuit, focusing on building the reliability engineering capabilities. So please reach out to me on LinkedIn or on Twitter if you want to provide feedback or want to learn more about what we are building and how we are performing company-wide game days.

SA I'm Shan Anwar, Principal Engineer at Intuit leading the chaos tooling initiatives. If you want to learn more about chaos engineering, observability, or just want to have a chat in general, feel free to reach out to me on LinkedIn. Thank you.

BP All right, everybody. Thanks for listening and we will talk to you soon.

[outro music plays]