The Stack Overflow Podcast

Think you don’t need observability? Think again

Episode Summary

Ben and Ryan chat with Daniela Miao, cofounder and CTO of Momento, a real-time data platform. They discuss the advantages of real-time observability, the challenges of multi-tenancy in databases and caching, the use of WebAssembly in UI development, and the benefits of Rust. Daniela also shares her experiences working at AWS and a startup focused on observability, which led to the creation of Momento.

Episode Notes

Momento is a real-time data platform designed to help developers ship better products faster. Explore the platform here or get started in the docs.

Connect with Daniela on LinkedIn and follow Momento on X.

Stack Overflow user Simon Juhl won a Lifeboat badge for dropping some knowledge on HTML/CSS: change Date input highlight color.

Episode Transcription

[intro music plays]

Ben Popper Maximize cloud efficiency with DoiT, an AWS Premier Partner. Let DoiT guide you from cloud planning to production. With over 2,000 AWS customer launches and more than 400 AWS certifications, DoiT empowers you to get the most from your cloud investment. Learn more at doit.com. DoiT.

BP Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I am your host, Ben Popper, Director of Content here at Stack Overflow, joined today by my colleague and collaborator, Ryan Donovan, Editor of our blog. And our guest today is Daniela Miao, who is the co-founder and CTO of Momento, a real-time data infrastructure platform. Before that, she helped build DynamoDB at AWS, serving as the tech lead, and was with IBM Canada's Software Lab. So we're going to talk about uptime and availability, observability, and also her passion for Rust, which is always one of the most popular and beloved languages in our annual developer survey. Without further ado, Daniela, welcome to the show. 

Daniela Miao Thank you, Ben. Super excited for our chat today. Thank you for having me on the show. I'm really looking forward to all of our topics. 

BP So first of all, tell folks, how did you get into the world of software and technology? What led you down this path?

DM Well, mostly through dabbling with it when I was much younger and just learning programming, starting with C and C++, and then I grew to really love the application of it. And of course, all the exciting things constantly happening in tech keep me interested and keep me on my toes– including Rust in more recent years, especially working on it with my team here at Momento.

BP So let's go back a little to the experience I mentioned at AWS, because it sounds like maybe that was formative to creating the company you're at today. What were you doing at AWS and how did you get assigned to that particular project? 

DM I was working on the back end engineering team for DynamoDB, which is a NoSQL database now scaling to tens of millions, hundreds of millions of transactions per second. How I got into it: I was initially working on ElastiCache, and at the time, the ElastiCache team was already doing really well with their product, while DynamoDB was still considered early in its development journey, so I wanted something that could show me that kind of product and technical growth. Then, subsequent to AWS, I joined an early-stage startup– 10 people at the time– called Lightstep, focused on observability and tracing. I was there for four years, and the company grew exponentially while I was there. That gave me combined experience in distributed systems and observability, which very much seeded the idea for Momento.

BP Nice. I was reading up a little bit on DynamoDB before we got started, and it said that Werner Vogels, who's CTO at Amazon, wanted to do this, especially after a 2004 holiday season where there was a lot of stress and some downtime or some failures. Do you remember going in thinking about Amazon at scale and AWS, which I don't know what year you started, but has just continued to grow and grow? Did you think about specific pain points that you were trying to solve for and can you tell us a little bit about some of the engineering decisions you were involved with or the buildout that helped to solve for those?

DM I think the biggest underlying primitive that Dynamo was very successful at is this idea of multi-tenancy: having a big pool of resources and being able to offer it to isolated workloads and customers without them noticing any difference in performance. In fact, what they do end up noticing is increased availability, more economical costs, and the ability to absorb unpredictable spikes, which are really the underlying principles that we're trying to apply to caching workloads as well. With today's real-time applications, you really cannot, as a developer, predict the type of workload you will actually get in production. I think a lot of the underlying design decisions we made at DynamoDB contribute to its success today. 

Ryan Donovan So I know I've heard a little bit about multi-tenancy in a cloud provider situation. How does multi-tenancy in databases work? 

DM The way you can think about it is that in traditional databases, you as a developer have to predict ahead of time what your maximum load could be, and then you pay for bigger and beefier machines to account for that. First of all, it's wasteful, because if you're overestimating, you're paying for way more resources to be up 24/7 than you need. Secondly, we're usually not very good at predicting workloads, so if you're underestimating, you're looking at an incident and an outage right when you need capacity the most– if your app blows up in popularity, you want to be able to support the peak traffic, and instead you have to scramble to figure out a live migration from a smaller instance to a bigger instance. How multi-tenancy works is that there's a gigantic pool of resources, like DynamoDB has behind the scenes, and it does very sophisticated workload distribution, spreading the traffic over multiple machines and instances to dampen the effect of these bursts and absorb them, because you're working with hundreds, if not thousands, of machines today– with Dynamo, I bet it's thousands. So it gives you better availability because it's able to absorb that unpredictable traffic, but during times of little or no traffic, you're not paying for idle instances, because the capacity is being used for someone else's workload. That's the beauty of multi-tenancy. And like I said, it's not as simple as that– there's a lot of sophistication to ensure that workloads are isolated so that you don't feel the impact of other people's workloads. 
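The contrast Daniela describes– a fixed-size instance clipping a burst versus a shared pool absorbing it– can be sketched in a few lines of Rust. This is a toy model, not how DynamoDB actually works; the capacity numbers and tenant demands are invented for illustration.

```rust
/// Single-tenant: each tenant pre-provisions a fixed slice of capacity,
/// so a burst above that slice is simply clipped (errors/throttling).
fn single_tenant_served(provisioned: u64, demand: u64) -> u64 {
    demand.min(provisioned)
}

/// Multi-tenant: all tenants draw from one shared pool, so one tenant's
/// burst can borrow capacity that idle tenants aren't using right now.
fn multi_tenant_served(pool: u64, demands: &[u64]) -> u64 {
    let mut remaining = pool;
    let mut served = 0;
    for &d in demands {
        let take = d.min(remaining);
        served += take;
        remaining -= take;
    }
    served
}

fn main() {
    // Three tenants each worth 100 units of capacity; tenant 0 bursts
    // to 250 while the other two are nearly idle.
    let demands = [250, 10, 10];

    // Isolated: the burst is clipped at the tenant's own 100 units.
    let isolated: u64 = demands.iter().map(|&d| single_tenant_served(100, d)).sum();

    // Shared: the same 300 units of total capacity absorb most of the burst.
    let shared = multi_tenant_served(300, &demands);

    println!("isolated served: {isolated}, shared served: {shared}");
}
```

With identical total capacity, the shared pool serves 270 of 270 units of demand while the isolated setup serves only 120– the "beauty of multi-tenancy" in miniature, minus the real-world isolation machinery she mentions.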

RD These multi-tenants, are they actually sharing the same database instance or just the same cloud resources? 

DM It depends on the exact implementation. Behind the scenes, a lot of the time it's sharing the same physical machine– and these are large, beefy instances at AWS– so they're definitely sharing CPU and things like memory. With regard to the database instances, you can choose to isolate a given customer's workload on its own database instance; you just run multiple of them on the same machine. I think that's an implementation detail that evolves over time behind the scenes, and that's the beauty of it. As the end user, as long as I get my data, the security of my data, and isolation for my workload, the implementation behind the scenes can always evolve to provide better performance and better cost without the end user noticing.

BP You talked about that from the perspective of: I don't want a single instance devoted to me– if I don't use enough of it, I'm paying for more than I need, and if something goes down, now I'm noticing it– so you spread it around. That way, you get the best of both worlds. You transitioned from AWS to a company that was working on observability. There you're looking at what's using up the most resources, whether there are bottlenecks somewhere, and if there was a failure, can you trace it back? Talk to us a little bit about what you learned in observability coming from the database world.

DM I think observability is also just a super interesting field, because people often need it the most when they don't have it. What I mean by that is that when you're developing an application, observability is never at the forefront, because you're really just focused on core functionality. And oftentimes in the startup world, you don't even know whether or not your product or application will be successful enough for observability to matter. People tend to under-invest in observability early on, and then it becomes too late and too burdensome to add it when the application is actually successful enough for observability, debugging, and performance optimizations to matter. At Lightstep, the observability startup I mentioned, we often got customers coming in with a sense of urgency: “I need visibility yesterday. I want it ASAP.” And there's a lot of debt that's accumulated by that point. So the way I think about it is that if you amortize a little bit of the cost of observability– you put a little bit up front– it's very similar to testing. If you put a little more visibility into your application while it's small, and it scales with your stack, then it's a lot easier and more natural to observe your system when it's under duress, and it becomes an easy fix at that point. 
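The "put a little up front" advice can be as small as wrapping hot paths in a timing helper from day one. Here's a minimal hand-rolled sketch in Rust– in practice you'd reach for a real instrumentation library such as OpenTelemetry; the `Metrics` type and its methods here are invented for illustration.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// A bare-bones metrics sink: operation name -> (call count, total micros).
#[derive(Default)]
struct Metrics {
    ops: HashMap<String, (u64, u128)>,
}

impl Metrics {
    /// Time a closure and record its latency under `name`.
    fn timed<T>(&mut self, name: &str, f: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = f();
        let micros = start.elapsed().as_micros();
        let entry = self.ops.entry(name.to_string()).or_insert((0, 0));
        entry.0 += 1; // one more call observed
        entry.1 += micros; // accumulate latency
        out
    }
}

fn main() {
    let mut metrics = Metrics::default();
    // Instrument the work as you write it, not after the outage.
    let sum: u64 = metrics.timed("sum_to_1000", || (1..=1000u64).sum());
    let (calls, total_us) = metrics.ops["sum_to_1000"];
    println!("sum={sum}, calls={calls}, total_us={total_us}");
}
```

The point of the sketch is the amortization she describes: the wrapper costs a line per call site while the code is small, and the data is already there when the system is under duress.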

BP Hard to convince folks to put the resources into that when it's a small, lean team, maybe not a lot of runway and they're trying to figure out what they need to change in the product to get that product market fit and to get a little bit of traction to get to the next round. 

DM Totally.

BP But I hear you, if you have good documentation and some good observability, when you suddenly start to hit scale, things will run a lot smoother.

DM Totally. And I think it's a constant balance– I'm not suggesting that you go all out. That's why I'm happy to see a lot of the innovation on the instrumentation side as well. Make it automatic, make it part of the development journey so that it's not another thing developers have to learn, and then it's just naturally there when you copy/paste code and add more features. It's just automatically part of your stack. 

BP Remind me, what was the name of the startup you worked at? 

DM It was called Lightstep and it was subsequently acquired by ServiceNow a few years later.

RD We've had a bunch of folks come on and talk observability, but it seems like most people think of observability as almost an extension of logs– you're writing some data to a file and then checking it later. What are the additional challenges and advantages of real-time observability?

DM I think a lot of it now has some overlap with ‘analytics.’ Logs, I think, are very helpful for debugging and trying to figure out what's going on. But in today's world, where you have multiple orders of magnitude more data and transactions compared to a decade ago, information overload is an understatement. You really need to be able to look at what's going on at a higher level than logs. Logs give you the details that you need, but aggregate insights are what I always come back to: patterns. The machine needs to be able to give you insight abstracted to the human level– what's happening on this machine in the past 15 minutes or the past hour, what's happening across a set of my machines in region X versus region Y. It's all abstracted beyond the single application, the single process level. A lot of what I worked on at Lightstep was tracing insights, aggregate insights, metrics designed to answer, “How's my system doing overall?” And I think that goes a little bit into analytics. It's not just about debugging an issue anymore; it's about predicting where my system is spending a lot of its time and resources so that I can optimize it and catch outages before they happen. 
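The shift she describes– from individual log lines to aggregate insight– is essentially a group-by over telemetry. A hedged sketch in Rust; the region names and latency figures are made up, and a real pipeline would compute percentiles over windows rather than simple averages.

```rust
use std::collections::HashMap;

/// Roll up per-request latency samples into an average per region--
/// the kind of aggregate a human can actually reason about, versus
/// scanning the raw "log lines" one at a time.
fn avg_latency_by_region(samples: &[(&str, u64)]) -> HashMap<String, u64> {
    let mut acc: HashMap<String, (u64, u64)> = HashMap::new();
    for &(region, ms) in samples {
        let e = acc.entry(region.to_string()).or_insert((0, 0));
        e.0 += ms; // latency sum
        e.1 += 1; // sample count
    }
    acc.into_iter().map(|(r, (sum, n))| (r, sum / n)).collect()
}

fn main() {
    // Raw telemetry: one (region, latency-ms) sample per request.
    let samples = [("us-east", 20), ("us-east", 40), ("eu-west", 90), ("eu-west", 110)];
    let rollup = avg_latency_by_region(&samples);
    println!("{rollup:?}");
}
```

Two log lines per region say little on their own; the rollup immediately surfaces the "region X versus region Y" pattern she mentions.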

BP So you've had really interesting experiences as a developer. You got to work inside of a big company on a new product, which was kind of fun. Then you went to a startup from zero to a hundred people and then that got acquired. And now you're in this new position of co-founder and CTO, and you've raised venture capital, you're trying to create your own thing. So what was the inspiration that led you and your co-founder to decide that this is an idea we can build a business around and that genesis of how it got started?

DM I think it was sort of natural for us, because we both had the startup itch. Obviously I had been through a complete lifecycle, from being early to getting acquired, and I thought that journey was really interesting and stretched me in ways I couldn't imagine, so I wanted to do it from the front row seat. So Khawaja, my co-founder, and I were exploring different ideas, and we started by really just becoming application builders. We were building a fitness app that we thought could have real-life impact at the peak of the pandemic, to try to help others. And in building that app out, we realized, first of all, that we weren't very good app builders– we're infrastructure builders, platform builders. But we did discover multiple pain points, and that's the point: to discover problems and think about how our experiences would fit. Manually managing multiple caching clusters was definitely a pain point for us personally, and as we did more user interviews, we found it was a pain point that resonated with others. And we saw this paradigm I mentioned, of sharing resources in a more sophisticated manner via multi-tenancy, and didn't see it applied to the caching world at all. It's been done multiple times in the database world, not just with Dynamo but with things like Mongo as well. So we thought there was something here, pitched it, raised money for it, acquired design partners in production, and our first product is the Momento caching product in the real-time infrastructure world. 

RD So talk a little more about the multi-caching aspect. I think I've thought about caching as a single in and out, but what's the problem you're talking about here? What is the multiple part of the caching? 

DM Caching is usually used as a load-shedding layer. It's a performance layer for sure, but it's oftentimes also used to front a lot of the traffic that would otherwise go to the database. Think of everybody turning on their TV during the Super Bowl and watching the same ads. If that's served from some database, you're getting millions of transactions hitting the same database in the same second, and databases aren't usually designed for that kind of bursty, unpredictable traffic– that's why Dynamo became so successful. But even DynamoDB, in order to provide database guarantees, has hard limits on how much you can query a given item in a given second, and that's where people usually put a caching layer, to be able to serve those requests very quickly in a short time period. So caching sees bursty activity even more often, and the peak-to-average ratio is even higher than for databases. And yet the caching world, in terms of platform provisioning and capacity management, is still very much the same as what we had to do with databases back in the day. You choose an instance type, it gives you a limited set of resources– CPU, memory, network bandwidth, connections– and that's what you get. If you get bursty activity that goes above any of those limits, you start getting errors and your application starts suffering. So again, the same set of problems. We thought we could bring this sophistication around multi-tenancy to caching: you have a big pool of resources that can be utilized for your cache when it has bursty traffic, and when it doesn't, it can be utilized for someone else's bursty traffic. That really plays to our advantage in terms of how we're able to offer better availability, at cost-effective levels, compared to traditional caching and provisioning. 
And that's just the dollar cost of infrastructure– if you include the mental burden of having to benchmark, predict, and hope that your cluster size is big enough during your biggest event, then I think the value goes up exponentially.
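The load-shedding role caching plays here is the classic cache-aside pattern: check the cache first, and only fall through to the database on a miss. A minimal Rust sketch– the `db_reads` counter stands in for real database load, and the "database" is simulated.

```rust
use std::collections::HashMap;

/// Cache-aside read: serve from cache if present; otherwise "query the
/// database" (simulated), count the read, and populate the cache.
fn get(
    cache: &mut HashMap<String, String>,
    db_reads: &mut u64,
    key: &str,
) -> String {
    if let Some(v) = cache.get(key) {
        return v.clone(); // cache hit: the database never sees this request
    }
    *db_reads += 1; // cache miss: one real database read
    let value = format!("value-for-{key}");
    cache.insert(key.to_string(), value.clone());
    value
}

fn main() {
    let mut cache = HashMap::new();
    let mut db_reads = 0;
    // A "Super Bowl ad" key requested a million times...
    for _ in 0..1_000_000 {
        get(&mut cache, &mut db_reads, "superbowl-ad");
    }
    // ...costs the database exactly one read; the cache sheds the rest.
    println!("db reads: {db_reads}");
}
```

This is the shape of the problem, not of Momento's product– their argument is that the cache layer itself should be a multi-tenant pool rather than a fixed-size instance you provision for the worst case.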

BP I remember seeing in the email, with a little bit of background, that you had clients like Paramount and Taco Bell. We had someone on recently from Shopify, and I can imagine with Paramount and the Olympics, suddenly you've got this burst of activity. Or Taco Bell– you're running a weekly special, one of them just happens to hit the right zeitgeist, and all of a sudden everybody's using the app. There were a few other things on your site alongside caching– Topics and Storage. Storage is coming soon, so do you want to tell us a little bit about Topics? 

DM For sure. If you look at AWS, the interesting thing about being an infrastructure provider is that people often have multiple levels of problems or pain points with their infrastructure, and when they start using you for one, they start talking to you about the other problems they're having. And the interesting thing about this paradigm of multi-tenancy and resource sharing is that it's not limited to databases or caching– you can apply it to all kinds of resources. So with the same architecture, we were able to come up with different lines of product. Caching is obviously for performance and load-shedding reasons, and then Topics, our real-time Pub/Sub service, is for instant delivery to many, many devices. Think about chat, video comments, or emoji reactions, where you have one single source– somebody reacts, somebody posts a comment– and you have to instantly propagate that to potentially millions of users who are watching a stream or playing a game. This problem is very bursty in nature and unpredictable, so the same underlying computing paradigms get applied here, and that's what Topics is here to do. If you need instant delivery to up to millions of users, at scale and with this bursty nature, that's what Momento Topics is designed for. It's especially useful in things like media– Paramount streaming, sporting events, where it's a one-to-many relationship– and gaming is a big vertical for us as well: chats, raids, tournaments, where a lot of people come together and you need things to be instant, because what a bad experience it would be for one player to get information a few seconds behind another player in the tournament. So that's what Topics is. And then Storage, I'll just talk a little bit about. Naturally, when you want a caching layer, you have a database you're trying to work with. A lot of our customers may not. 
They might be building a new application and they kind of want a one-stop shop for both caching and their durable data. So really for us, it’s extending our architecture to work with databases as well, and that's what Momento Storage will be. 
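The one-to-many delivery Topics handles can be sketched with channels: one publisher, many subscribers, each getting its own copy of every message. This is a single-process toy in Rust, nothing like Momento's actual distributed implementation– the interesting engineering lives in doing this fan-out to millions of devices under bursty load.

```rust
use std::sync::mpsc;

/// A toy topic: publishing clones the message into every subscriber's channel.
struct Topic {
    subscribers: Vec<mpsc::Sender<String>>,
}

impl Topic {
    fn new() -> Self {
        Topic { subscribers: Vec::new() }
    }

    /// Register a new subscriber and hand back its receiving end.
    fn subscribe(&mut self) -> mpsc::Receiver<String> {
        let (tx, rx) = mpsc::channel();
        self.subscribers.push(tx);
        rx
    }

    /// Fan a message out to every subscriber. In a real system this
    /// fan-out is where the bursty, one-to-millions scaling problem lives.
    fn publish(&self, msg: &str) {
        for tx in &self.subscribers {
            let _ = tx.send(msg.to_string());
        }
    }
}

fn main() {
    let mut topic = Topic::new();
    let viewer_a = topic.subscribe();
    let viewer_b = topic.subscribe();
    topic.publish("GOAL!");
    println!("{} / {}", viewer_a.recv().unwrap(), viewer_b.recv().unwrap());
}
```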

RD I know in the initial contact we talked about Rust, and Rust is a big, well-loved language in our survey– every year it tops the charts. I'm wondering how Rust plays in here and how you think it improves things, especially the real-time nature of the platform. So if you could talk a little bit about what Rust does for you. 

DM For sure. I think a lot of the folks in the audience are probably familiar with the performance benefits that Rust offers. A popular saying is that it gives you the higher-level abstractions of a modern language paradigm with the performance of lower-level languages like C and C++. That's the reason we started exploring it in the first place. Most of our stack was actually written in a JVM-based language, Kotlin, because initially most of our engineers were very familiar with developing and debugging in Java or JVM-based languages. Once we reached the point where there wasn't much low-hanging fruit left in performance optimizations within the JVM– and a lot of that is caused by garbage collection, the classic characteristic of the JVM– we started looking at other languages, the traditional C/C++ versus Rust. We found that the way Rust is designed, with memory safety in mind, gives you guardrails that you cannot work around, where you have to think about a lot of things ahead of time. Its ownership model forces you to think about who owns this variable, when you can let it go, and immutability. It teaches you these concepts that you must learn in order to even develop in it, which seems like a little bit of tax you pay at the beginning, but then you develop much faster in a way that prevents you from making errors, and it catches errors very early, at compile time. By the time the program compiles and runs, it compiles to a binary and runs as if you had written it in a lower-level language like C or C++, which gave us a step-function improvement in our optimizations. And us being a real-time platform, performance is in our DNA. We need to be faster, we need to get cheaper, and we need to be able to serve more requests per second constantly to be able to stay competitive. 
So I think for us, choosing Rust and really migrating our core services over to Rust has been a game changer. 

RD I hear a lot of folks talk about fighting with the borrow checker and complaining about the Rust ownership model, but it's an interesting way to think about it that it almost makes you think about the program in a different way. Do you think there's ways to take that way of thinking about memory safety to other languages? 

DM I think it depends on how much things like performance, and scaling with performance in mind, matter to a given application. It's hard for me to answer in one broad stroke, because I understand that for some people an app is an app– they're making a lot of them, more than half won't be successful, and the one that is successful, maybe they can work on optimizing. For us, working on an infrastructure platform, when we're down, our customers are down, so we have to think about performance and reliability as part of our DNA, because the negative consequences are too significant. So for us, like I mentioned, that little bit of tax that forces you to think about the ownership model ahead of time prevents us from inadvertently committing things like data races, which are almost impossible to observe and catch early on, and then a really big pain to debug later at runtime. The one thing I've heard about interpreter-based languages like Python and JavaScript versus Rust is, yes, Rust takes you a little bit longer to get going, because you have to learn how immutability and ownership work, and it changes the way you think about programming. But once you have it, you can continue with the same velocity even when your application has scaled and grown a lot more popular. Whereas with Python and JavaScript– and I have seen this– maybe it takes you a couple of days to get going, and it's awesome to go from zero to an application really quickly, but if your application does become serious and popular, it becomes very complex, and it can become almost impossible to track down where it's slow or where a bug is happening, because everything happens at runtime and you can't observe the actual bugs. It's all a trade-off, I would say, but we've liked the side of the trade-off that we ended up on.
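The data-race point is concrete in Rust: sharing a plain mutable reference across threads simply doesn't compile, so you're pushed toward an explicitly synchronized type like `Arc<Mutex<_>>` and the race is ruled out before runtime. A small sketch:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Increment a shared counter from `n_threads` threads, `per` times each.
/// Sharing a bare `&mut u64` across these threads would be a compile
/// error; the borrow checker forces explicit synchronization, which is
/// how the data race becomes impossible rather than merely unlikely.
fn count_with_threads(n_threads: usize, per: u64) -> u64 {
    let counter = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..n_threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per {
                    *counter.lock().unwrap() += 1; // lock = exclusive access
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    total
}

fn main() {
    // Exact, not approximate: the lost-update race that the same code
    // exhibits in C/C++ with an unsynchronized counter cannot occur here.
    println!("count = {}", count_with_threads(8, 1_000));
}
```

The equivalent unsynchronized C/C++ version compiles fine and silently drops increments; in Rust, that version is rejected at compile time, which is the "tax up front, debugging saved later" trade she describes.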

BP Very cool. Daniela, anything you feel like we missed? 

DM I would actually love to talk about one more thing with Rust that I think people don't talk about enough. Because of its performance comparisons to C and C++, people think of it as a back end, system-level programming language. And even with us at Momento, obviously we're using it to build an infrastructure product. But more recently, I've met more and more Rust developers through my involvement with QCon San Francisco– a conference that a lot of developers go to, where I have a whole Rust track lined up for folks– and I've met a lot of Rust developers who are working on Rust for higher-level applications, for UI. For instance, Amazon Prime Video adopted Rust because it compiles into WebAssembly and makes their UI super snappy, super fast. And with Gen AI especially, people like snappy applications– they want things to be faster and they want the application to be reactive. So I found that really interesting: it's not limited to system-level programming. People are using it for UI and for front end applications. 

RD That is interesting. I've definitely heard of people complaining about the heavy JavaScript load on clients and traditionally solving that through server-side compiling. I've also heard of the heavy JavaScript load being an undue burden on places with slower internet connections. Does the WebAssembly help with that? 

DM Completely. I think part of the reason for server-side compiling is that you can compile things down to these minified bin packs– I'm not a web developer, so I apologize if I'm not using the right words. But the idea is to front-load the work and have the clients, like you mentioned– whether it's a phone or honestly a super simple tablet device– do as little work as possible for rendering and things like that. The benefit of WebAssembly is that it compiles ahead of time into machine code, and it cross-compiles over multiple platforms, whether that's iOS, Android, or some generic tablet device. It also helps with power– on mobile devices, lowering power consumption is key as well. I don't want to give away too much of the Prime Video engineer's talk, but they're rewriting their Prime Video applications in Rust compiled to WebAssembly, because they've seen such a big difference in performance and power consumption. So I found that to be super interesting, and it's continuing to unlock the power of the language.
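For flavor, the Rust-to-WebAssembly boundary she's describing tends to look like a plain exported function: you compile the crate with a wasm target (e.g. `cargo build --target wasm32-unknown-unknown`) and call the export from JavaScript. This sketch also runs natively; the function name and its arithmetic are invented for illustration, and real projects usually layer `wasm-bindgen` on top for richer types.

```rust
/// The kind of function a web UI could call from JavaScript once this
/// crate is compiled to WebAssembly. A real wasm build would also mark
/// it no_mangle (or use wasm-bindgen) so the export keeps its name;
/// `extern "C"` pins down the calling convention across the boundary.
pub extern "C" fn scale_latency_budget(frames: u32, ms_per_frame: u32) -> u32 {
    frames * ms_per_frame
}

fn main() {
    // Natively this is just a function call; compiled to wasm, the same
    // ahead-of-time-compiled code runs inside the browser's sandbox.
    println!("{}", scale_latency_budget(60, 16));
}
```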

[music plays]

BP All right, we have a Lifeboat Badge winner. We give a Lifeboat when somebody comes and saves a question from the dustbin of obscurity with a great answer. Somebody asked, “I have an input of type date in my page, and I want to change the color of the highlight inside of it, and I don't know if there's a CSS selector for that.” Well, Simon Juhl knows what to do and was awarded a Lifeboat Badge. Congratulations, Simon. Anybody else who's curious, we'll put that question in the show notes. Over 13,000 people have benefited from this little tidbit of knowledge. I am Ben Popper. I'm the Director of Content here at Stack Overflow. Find me on X @BenPopper. If you want to come on the show as a guest or you have a suggestion for topics, email us: podcast@stackoverflow.com. And if you like the show, the nicest thing you could do for us is leave a rating and a review. 

RD I'm Ryan Donovan. I edit the blog here at Stack Overflow. You can find it at stackoverflow.blog. And if you want to reach out to me, you can find me on LinkedIn. 

DM And I'm Daniela Miao. I'm the CTO and co-founder at Momento, a real time infrastructure platform. You can find me on LinkedIn to connect further, and please follow us on Twitter @MomentoHQ, or check out our website at gomomento.com. Thank you.

BP All right, everybody. Thanks for listening, and we will talk to you soon.

[outro music plays]