The Stack Overflow Podcast

Banking on a serverless world

Episode Summary

Kathleen Vignos, VP of Software Engineering at Capital One, sits down with Ryan to explore shifting to 100% serverless architecture in the enterprise, deploying talent for better customer experiences, and fostering AI innovation and tech advancement in a regulated banking environment.

Episode Notes

Explore how Capital One is using tech to innovate the banking experience here.

Connect with Kathleen on LinkedIn and visit her blog.

Shoutout to user Theraot for answering the question How to connect a signal with extra arguments in Godot 4, which won them a Lifeboat badge.

Episode Transcription

[intro music plays]

Ryan Donovan: AssemblyAI just launched a new streaming speech-to-text model purpose-built for voice agents. Universal-Streaming delivers ultra-fast, immutable transcripts with intelligent end-of-turn detection. The API is available now at assemblyai.com/stackoverflow.

Hello everyone and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I am Ryan Donovan, your host, and today we are gonna be talking about serverless in the enterprise with our guest, Kathleen Vignos, VP of Software Engineering at Capital One. How are you doing today, Kathleen?

Kathleen Vignos: I'm doing fine. Ryan, how are you? 

RD: I'm good enough. Good enough. So at the top of the show, we'd like to get to know our guests, find out how they got into software and technology.

KV: Before I came into software, I actually was designing buildings for earthquakes. I was a structural engineer. And I found myself drawn to the analysis of structures, the stresses on a structure during an earthquake, which leads you to finite element analysis and programming and things like that. When I started, structural engineering was very manual. We were still using drafting tables and we were still handwriting our calculations, so the technical option was really mostly in this analysis category. I followed the curiosity breadcrumbs and sort of found myself in early tech by way of an emerging technology consulting company, Accenture at the time, where they were helping lead other companies to digital transformation. We're in quite a different place now, where most companies are well along their technical journey and highly digital. Outside of that, I worked for large software companies doing professional services, and then I got into freelance web development, which I did for many years before stepping into a role at Wired, where I became the tech lead and then ran the engineering organization for the Wired website. That led to Twitter, where I led full stack teams, but in the back half of my time there, I was actually closer down in the infrastructure. I led infrastructure automation in our platform engineering organization. We were doing a lot of projects that had to do with automating repair and remediation on hundreds of thousands of servers in a data center, and a lot of the work was very closely tied to which server your process was running on. And we were on a journey to move more toward the serverless mindset: not that we would choose a server, but that we would just deploy, and that would all be managed for you with Kubernetes, et cetera. So from there, I came to Capital One. And here at Capital One, I am helping with a lot of tech transformation in the card technology organization, leveraging cloud, serverless, AI, et cetera.

RD: So can you tell me a little bit about the impetus for that transformation and how that went at an organization the size of Capital One? 

KV: Well, first of all, Capital One started a cloud journey. A lot of banks find themselves in mainframe or data center environments, and so, you know, Capital One started that journey a long time ago and is 99% in the cloud at this point. So that's the first step. Then once you're in the cloud, if you've lifted and shifted, you might find that you're still running large batch processes, you're still having to provision large servers, and you're probably over-provisioned, all of those things. And so that leads you to the serverless patterns. In our organization, we've been taking batch processes and breaking them down, because in a serverless environment, you're either gonna run out of time, since functions can only run for a certain amount of time, or you're gonna run out of memory. So you can't just run these long processes in batch. You've gotta start breaking down and re-architecting your services in order to operate in a serverless environment.

RD: Yeah. And I imagine that being payments related, having this operate pretty quickly is pretty important, right? Any engineering there, any software, has to operate fast. Did you do anything to remediate the cold start problem that serverless has?

KV: There are ways to configure so that you have, you know, the ability to warm caches, warm clusters, bring them online before you need them. You can set up autoscaling configuration to make sure that you're not dealing with a cold start problem.
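
For readers curious what that looks like in practice, here's a minimal sketch of configuring AWS Lambda provisioned concurrency with boto3, one common way to keep execution environments warm before traffic arrives. The function name and alias are hypothetical, not Capital One's actual setup:

```python
import boto3

lambda_client = boto3.client("lambda")

# Keep N execution environments initialized so requests skip the cold start.
lambda_client.put_provisioned_concurrency_config(
    FunctionName="payments-processor",    # hypothetical function name
    Qualifier="live",                     # alias or version to keep warm
    ProvisionedConcurrentExecutions=10,   # environments kept initialized
)

# Check status; it takes a short while for environments to become READY.
status = lambda_client.get_provisioned_concurrency_config(
    FunctionName="payments-processor",
    Qualifier="live",
)
print(status["Status"], status["AvailableProvisionedConcurrentExecutions"])
```

Provisioned concurrency trades a small standing cost for predictable latency, which is the warm-before-you-need-them idea Kathleen describes.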

RD: Mm-hmm. Were there any programs or procedures that resisted being broken down, ones where it was like, 'this is a really hard one to break down into constituent parts'?

KV: Oh yeah [laughter]. I mean, there's lots and lots of places where the batch problem is a very hard problem. It's pretty difficult to go from a large batch-oriented process to totally real time. So you start breaking it down by saying, 'okay, can we get closer to operating as if we were real time, even though we are still batch?' Are you breaking down the file? Are you breaking it down to run in 15-minute increments instead of, you know, running for six hours? There are multiple different ways to start to break that down. Those are the kinds of things we've started to do, because there are situations where we might receive a large batch from some other system that we have no control over. So then what do we do downstream from that to feed into a system that is more real time, more event-driven, that behaves as if these were events instead of batch streams? Then it can get closer to being deployable in a serverless model, where you have, you know, your Lambdas, and they're not gonna time out, and you can run a bunch of things in parallel instead of running things sequentially, line by line, in a large batch file.
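
As a rough illustration of that fan-out idea, here's a sketch of splitting one large batch file into per-record events on an SQS queue so parallel consumers can process them instead of one long sequential job. The queue URL and the newline-delimited record format are assumptions; this is a generic pattern, not Capital One's pipeline:

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/records"  # hypothetical

def fan_out(batch_file_path: str) -> None:
    """Turn one big batch file into individual events, sent in groups of 10
    (the SQS batch-send limit), so parallel consumers replace one long job."""
    with open(batch_file_path) as f:
        lines = [line.strip() for line in f if line.strip()]

    for start in range(0, len(lines), 10):
        chunk = lines[start:start + 10]
        sqs.send_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[
                {"Id": str(start + i), "MessageBody": json.dumps({"record": line})}
                for i, line in enumerate(chunk)
            ],
        )
```

Each message can then trigger a short-lived Lambda, so no single invocation runs long enough to hit the timeout she mentions.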

RD: It's interesting, these sort of big foundational changes. You know, we just went through a move to the cloud with our public site, and found how many things were directly tied to the data center, discovering all these things where it's like, 'oh, we gotta figure out how to undo this little thing'. With all of those, were there additional big engineering projects that you had to spin up, like, 'oh, we have to do this first'?

KV: Well, actually, I'll tell you a funny story about the data center. When I was at Twitter, I used to take my teams on field trips to our data center, which is super awesome, because these days you can't get into a data center and walk around. But we had the ability to do that, which was awesome. And the first time I went, I was not working in the platform engineering organization yet; I was working with application development teams. We're walking around, and one of my engineers goes, 'oh look, there's my server!'. Which is the absolute anti-pattern of serverless. You should not be able to walk into a data center and go, 'oh, look', right? The failure mechanisms there are definitely not what you wanna have. Now, what do you want? You want to not care. There's a cluster, and if it fails over, a server dies, has a problem, needs a repair, whatever, you don't care. It spins that down, it pulls up another one, you're good to go. Plus, you probably have multiple clusters running in the first place, so you're not just dependent on one machine, one cluster, et cetera. Coming back to Capital One: in our organization, we did have quite a lot of different batch processes and, you know, things running on EC2 and ECS. And so over the last couple of years we have just been slowly marching toward making that happen. Setting standards for what it means to be serverless. What exactly is the goal? How are we measuring progress? Having enterprise goals around getting to serverless. Because, you know, we're working with our product partners, and they have a lot of business ideas and things that they wanna ship, and we wanna help them do that, but we also wanna make progress on this serverless journey. So having that, 'okay, we're trying to get to 50% serverless, or 75% serverless, or a hundred percent serverless this quarter'. In fact, my own team just in the last couple of months reached a hundred percent serverless. And our product partners have been very supportive and really cheered us on in that journey. But it helped to have an enterprise check-in. I'm held accountable by my leadership too: 'hey, where are you on your serverless journey?' So those things help. Engineers are excited; they're on board. If you're owning your own infrastructure, then you are having to deal with every single vulnerability patch that comes, every upgrade. We have had to move out of a model where we have our own images, to move to the provided images, to standardize. There's a lot of standardization steps. Standardizing the pipeline as you go along is also really important. I could go on and on, but there's been a lot of important pieces. Oh, I was gonna finish the point about developers: no developer wants to sit managing vulnerabilities all day. They wanna be building features and product. We are getting there, and we're getting to that developer satisfaction.

RD: So what does serverless mean in that context, especially since you're at a hundred percent?

KV: For us, it means getting out of EC2 and ECS instances and getting more into managed services; it's taking those steps up the stack for us. It's a combination of Fargate and Lambda. As long as we are deploying in those models, we're considering ourselves serverless.

RD: Talking about moving up the stack, in terms of managing only the things that you need to manage, right? I assume that's part of the move to serverless, that you don't wanna be futzing with the bits and pieces of the infrastructure as much, right?

KV: Yeah, I think any business needs to consider where your business value is, where you should be investing your talent. And for some businesses, well, Amazon Web Services, Google, Azure, etc., right? They're gonna be optimizing the heck out of everything that's happening at the infrastructure layer. But for most businesses, certainly startups, all of your business value is gonna be in building components way up at the top of the stack. You don't have the team, the resources, the talent to spread across and go vertically deep. Most times you don't need it, except maybe for some specific ML cases, right? Of course, if you're trying to do something super innovative with ML, now you need GPUs, you've gotta highly optimize those, you've gotta worry about that kind of performance. While we are making use of ML, and we certainly have teams dedicated to those use cases, for me, as a more business-domain engineering owner, everything I think about has to do with our business domain. It makes sense for us to take our talent and deploy it to customer experiences, and/or agent experiences, and/or things that improve our operational processes, automate them, bring in AI wherever we can, bring in ML decisioning wherever we can. That's really where we're focused. We don't wanna be focused down at the infrastructure level.

RD: Yeah, that's, you know, the old build-versus-buy trade-off, right? We've talked about it; we came up with a third option of 'borrow' with the open source stuff. Are you all a fan of open source software?

KV: Absolutely. Capital One has an open source group and we take advantage of open source technologies and we contribute to open source technologies as well.

RD: You know, we talked about not building the things lower down in the stack. I know developers love to build everything, and I've heard of a specific financial company where they build every piece of their stack. Are there arguments to be made for going a little lower in the stack sometimes?

KV: It probably depends on the use case, right? That's gonna be unique, and, again, on how close you are to transaction-level capabilities. You know, our partnership with AWS is quite strong, so we also always wanna take advantage of that relationship and work together on any place where we need to get more from AWS services or regions or whatever. We need to be having those conversations too. I think it's been a really successful partnership in pushing AWS's services to do all kinds of different things. So we look for those opportunities as well.

RD: And you mentioned AI and ML as part of your purview. What sort of things are you getting into there?

KV: Data is the basis for everything that we do, and certainly the foundation for AI and for ML. So I think that we've paid a lot of attention to how we standardize our data, how we understand our data, how we protect our data, govern our data. When you have that baseline, there's a lot of innovation that you can build on top of it. Certainly we wanna understand our customers and meet needs. We wanna be able to make offers and predict, you know, the kinds of things that they might want us to do next for them. For example, in my space, when does someone need a payment reminder? When is that gonna be a helpful thing that will help them on their journey? There are a number of places in my particular stack where we have ML models that have been tuned and customized for the customer experience.

RD: Are you touching on any of the generative AI stuff, or does that not have the use case yet?

KV: Oh, no, there's tons of conversation and activity around generative AI at Capital One. We have a large generative AI team. We have a lot of work that we're doing on how we build our own foundation for all generative AI use cases, so that, again, we can be governed, we can be well managed, we can be transparent. We are a bank, we are regulated. We want to have relationships with our regulators so they understand what we are trying to do. AI can be quite a black box, and so how do you make sure that you're responsibly using it and can explain how you got the answer you got? You can't just say, 'oh, I don't know, my model gave it to me', or, 'you know, my foundational LLM generated this response'. You've gotta do a lot of work. And so there's tons of investment in doing this well, doing it right, doing the right thing for our customers. And of course, as you can imagine, our developers are extremely eager, enthusiastic experimenters. So we have sandboxes where our developers get a chance to experiment, working with our product and business folks to see what we can do. There's all kinds of use cases. It's really, really an exciting time.

RD: It's definitely the greenfield right now, the untapped territory. I like what you said about protecting the data, the governance. Are you looking into explainability solutions? Are you building anything?

KV: Are we building anything specific? I think it's more broad than that: how are we making sure that we have a responsible platform and foundation for everything that we do? Each different business domain has different sets of controls, different sets of regulations that we need to follow. So that's probably not gonna be a one-size-fits-all type of solution for explainability; it's gonna vary across our different business areas.

RD: It's something I've talked to folks about, and I'm interested in how that works. I talked to somebody who said he could read individual neurons in an LLM. I'm still not sure what that means.

KV: So in software development, a lot of times you get the requirements from the product and the business, and then you build a thing, and then at the end you go, 'okay, now how are we gonna prove that it does what we meant for it to do? How are we gonna monitor it?' In our group, we've done a lot to pull that whole set of requirements forward. What is the set of curated data that we're gonna need for auditability, for controls, for regulatory audit, et cetera? Do we wanna design for that from the beginning? We think a lot about that too, which reduces our risk and makes us more confident that our systems are doing exactly what they are supposed to be doing at all times.
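
To make the 'pull requirements forward' idea concrete, here's one generic pattern for designing auditability in from day one, a sketch of my own, not Capital One's system: a decorator that records each decision's inputs, output, and timing as a structured event.

```python
import functools
import json
import time
import uuid

def audited(fn):
    """Record every call's inputs, output, and timing as a structured event.
    In a real system this would go to durable, access-controlled storage."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        event = {
            "id": str(uuid.uuid4()),
            "function": fn.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "timestamp": time.time(),
        }
        result = fn(*args, **kwargs)
        event["output"] = result
        print(json.dumps(event, default=str))  # stand-in for an audit sink
        return result
    return wrapper

@audited
def should_send_payment_reminder(days_until_due: int) -> bool:
    return days_until_due <= 3  # placeholder decision logic

should_send_payment_reminder(2)
```

The point is that the audit record exists because the system was designed for it, not bolted on afterward.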

RD: Do you have an engineering challenge that either you've recently completed that you're most proud of, or something in the future that you're very excited to tackle? 

KV: Our group has done a lot of work around how you move forward once you've made the journey to the cloud, where maybe you lifted and shifted some things. It depends on the group and what you find yourself with. I think we're moving toward a future where we can configure, instead of writing every single thing in procedural code, where your business logic is kind of all mixed and fragmented. How do we get to models where we can establish better workflows and then configure the business needs and policies and capabilities? So my group is getting ready to launch some additional features in a model that's more of a low-code, high-configuration model.
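
As a generic sketch of that low-code, high-configuration idea (not the product Kathleen describes), business policies can live as data that a small engine evaluates, so changing a policy is a config edit rather than a code change:

```python
# Policies as data: each rule names a field, a comparison, and a threshold.
# Adding or changing a policy is a config edit, not a code change.
POLICIES = [
    {"field": "days_until_due", "op": "lte", "value": 3, "action": "send_reminder"},
    {"field": "balance", "op": "gte", "value": 10_000, "action": "flag_review"},
]

OPS = {"lte": lambda a, b: a <= b, "gte": lambda a, b: a >= b}

def evaluate(account: dict) -> list[str]:
    """Return the actions whose configured conditions the account satisfies."""
    return [
        p["action"]
        for p in POLICIES
        if OPS[p["op"]](account[p["field"]], p["value"])
    ]

print(evaluate({"days_until_due": 2, "balance": 500}))  # ['send_reminder']
```

Launching a new policy then becomes a reviewed configuration change instead of a code deployment.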

RD: I talked to somebody recently who's using generative AI to create components instead of tokens. That's an interesting way to have a repeatable, generative experience -

KV: Repeatable experiences with gen AI, right? I see a lot of examples of experimentation with generative AI that's very focused. Now, I'm not saying this is Capital One, I'm speaking more generally. It's very focused on constraining the prompting and the RAG model to get a more repeatable answer, which doesn't feel like the step-change opportunity that AI represents.

RD: It almost seems like an oxymoron, like it's missing the point of what an LLM does -

KV: And why would you, in some cases, really wanna reinvent? If you have a clear business workflow that works well and is repeatable and sound, something you can be confident in, that scales and is well managed, that's probably not the thing you rerun.

RD: Right? [laughter]

KV: You know, I think there's other better opportunities. So we're gonna be chasing after the bigger ways to transform. 

RD: Is there anything I didn't talk about that you wanna talk about? 

KV: You know, we didn't really talk about cost considerations with serverless, and I think that's a question that people continue to have. The promise is it's gonna be cheaper, but then you're like, 'well, really, is it? Is it really?' And there's a lot of effort that goes into proving whether or not that is true, right? You gotta kind of go do the work, and then you find out. One thing I have particularly appreciated and enjoyed: I bring my team together every month for a readout on the performance and health of all of our systems, and we always look at cloud costs. As we've been making this journey to serverless, you see the graph showing how much we're using EC2 or ECS or Fargate or Lambda, etc. There's a period of time in the middle where you see the cost go up, because now you're running multiple instances that are doing some similar things. Then there's the moment where you're able to shut down a bunch of old clusters, and then you kind of see where you land. And it's been really gratifying to see the cost come down, to see the spend efficiency go up. That's good, right? That's the hypothesis, and seeing that happen is certainly where we wanna be.

RD: With serverless, my layman's impression is that the cost per unit is a little more expensive, but you use less because of its nature.

KV: Right. You're not reserving instances, you're not running over-provisioned by whatever percentage and underutilized; you're using as you go. Of course, it takes a lot of skill to set up your configuration properly, so that you do the proper autoscaling and the dynamic nature of that is configured correctly. Luckily we have really talented folks who are able to do that. There's a big caveat there -

RD: Yeah -

KV: It should be cheaper. If it's configured properly, it will be cheaper, because then you're taking advantage of the pay-as-you-go model.
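
As a back-of-the-envelope illustration of why, here's the pay-as-you-go arithmetic with illustrative us-east-1 list prices. The rates and workload numbers are assumptions for the math only; check current pricing:

```python
# Back-of-the-envelope comparison with illustrative us-east-1 list prices
# (assumptions for the arithmetic only, not a pricing reference).
requests_per_month = 5_000_000
avg_duration_s = 0.2          # 200 ms per invocation
memory_gb = 0.5               # 512 MB function

lambda_compute = requests_per_month * avg_duration_s * memory_gb * 0.0000166667
lambda_requests = requests_per_month / 1_000_000 * 0.20
lambda_total = lambda_compute + lambda_requests

# An always-on instance pays for every hour whether traffic arrives or not.
ec2_total = 0.0416 * 24 * 30  # one t3.medium, on-demand, for a month

print(f"Lambda: ${lambda_total:.2f}/month")  # ~ $9.33
print(f"EC2:    ${ec2_total:.2f}/month")     # ~ $29.95
```

Past some sustained utilization the always-on instance wins instead, which is why the configuration work Kathleen mentions matters.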

RD: Do you think cloud computing takes a little more monitoring on the payment side, on the cost side, than, say, having a server rack somewhere?

KV: Definitely. You should always be monitoring your costs. You should have alerting on costs, like if things are going above what you expect. I get consistent emails that show me what's going on with our costs. We have monitoring and dashboards and all of those things, with alerting and thresholds. So yeah, you definitely have to keep an eye on it, because sometimes you can be surprised, certainly as we increase volumes, right? We'll throw new scale at it. Say we have a partnership, and we have a bunch of new data that we're bringing in, and you might expect costs to go up by X, but really they go up by Y. That can be a great signal of 'oh, whoa, something's misconfigured', or 'oh, whoa, things are not scaling properly', right? And we need to go in and address that.
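
A minimal sketch of that kind of cost alerting, using AWS CloudWatch billing metrics (which live in us-east-1 and require billing alerts to be enabled on the account). The threshold and SNS topic ARN are hypothetical:

```python
import boto3

# Billing metrics only exist in us-east-1.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="monthly-spend-above-expected",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,                     # evaluate every 6 hours
    EvaluationPeriods=1,
    Threshold=10000.0,                # alert above $10k, for example
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:cost-alerts"],  # hypothetical
)
```

AWS Budgets alerts would be an alternative; either way, the point is the threshold-plus-notification loop Kathleen describes.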

RD: And it's pretty easy to forget about something that's running on there, because I remember at a previous job, they did a review of cloud compute and found five or six figures' worth of savings just by shutting things down.

KV: It is really important to have consistent system monitoring and automated teardown of unused instances.
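
Here's a sketch of what that automated teardown might look like with boto3, assuming idleness is judged by average CPU over two weeks. A real version would respect exclusion tags and notify owners before stopping anything:

```python
import boto3
from datetime import datetime, timedelta, timezone

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

def stop_idle_instances(max_avg_cpu: float = 2.0, days: int = 14) -> None:
    """Stop running instances whose average CPU stayed under the threshold."""
    now = datetime.now(timezone.utc)
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                instance_id = instance["InstanceId"]
                stats = cloudwatch.get_metric_statistics(
                    Namespace="AWS/EC2",
                    MetricName="CPUUtilization",
                    Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
                    StartTime=now - timedelta(days=days),
                    EndTime=now,
                    Period=86400,          # one datapoint per day
                    Statistics=["Average"],
                )
                points = stats["Datapoints"]
                if points and all(p["Average"] < max_avg_cpu for p in points):
                    print(f"Stopping idle instance {instance_id}")
                    ec2.stop_instances(InstanceIds=[instance_id])
```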

RD: Alright everyone, it is that time of the show where we shout out somebody who came onto Stack Overflow, dropped a little knowledge, shared a little curiosity, helped out the community, and got a badge for it. Today we're shouting out the winner of a Lifeboat badge. Congrats to Theraot for dropping an answer on how to connect a signal with extra arguments in Godot 4. If you're curious about that, we'll have the answer in the show notes. I am Ryan Donovan. I edit the blog and host the podcast here at Stack Overflow. If you want to share some comments with us, or topics we should cover, email us at podcast@stackoverflow.com, and if you wanna reach out to me directly, you can find me on LinkedIn.

KV: I'm Kathleen Vignos, VP of Software Engineering at Capital One, and you can check us out at Capital One Tech. That's capitalone.com/tech. I personally also have a blog; it's kathleencodes.com. And you can find me on LinkedIn.

RD: Thank you very much for listening, everyone, and we'll talk to you next time.