The Stack Overflow Podcast

Building zero tier systems on bare metal

Episode Summary

On this episode of the podcast, we talk to Mauricio Linhares, a senior software engineer at Stripe, about the pain of migrating monoliths to microservices, defining zero-tier systems, and why plugging all your servers into the same power supply is a bad idea.

Episode Notes

While Mauricio and team had to get back to bare metal, most programmers are headed in the opposite direction. It’s why MIT switched from Scheme to Python.

At Stack Overflow, we’re familiar with what happens to websites during physical failures, like hurricanes.

Connect with Mauricio on LinkedIn.

Congrats to Lifeboat badge winner The Nail, who pinned a solid answer on the question if->return vs. if->else efficiency.

Episode Transcription

[intro music plays]

Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I'm your host, Ben Popper, worst coder in the world, Director of Content at Stack Overflow, joined as I often am by my colleague and collaborator, Ryan Thor Donovan, editor of our blog, champion of our newsletter, CMU graduate and 20 year anniversary attendee. Ryan, how was your time with all the old fogies from CMU?

Ryan Donovan It was good. We had the even older crowds there. It was their 50th. So it's good to see what everybody's doing, and everybody's in software.

BP And when you go to a reunion, the people who just graduated are there and they want to party, and then there's the 10 year, the 20 year, and then there’s 40 and 50 and you're like, “Whoa.” But they were at their 20 year reunion once.

RD That's right. Back in the day.

BP Well, were there any particular topics of conversation that dominated or was it just the usual AI hype cycle?

RD It was just the usual. Honestly, a lot of people didn't actually talk about AI. They were just like, “Meh.”

BP Yeah. They were like, “You got any kids?” Right, that's nice. All right, well today we are going to talk about technology. We are going to be chatting with Mauricio Linhares, who is a Senior Software Engineer at Stripe, and he has worked at Stripe as well as DigitalOcean. We're going to be chatting about things like monoliths and microservices, what it means to work in the ‘cloud’, how you do migrations from a monolith to a microservice, and a bunch of other sort of interesting things that he's done throughout his career. So Mauricio, welcome to the Stack Overflow Podcast.

Mauricio Linhares Hey, it's great to be here. Having been a user for many, many, many, many years and having answered many questions and asked many questions as well, it's great to be here talking to you folks.

BP That's awesome. If you're brave enough to share your username, we'll add it to the show notes.

ML Oh, yeah. Just look for Mauricio Linhares. It was a lot of Ruby and Java. So now I'm mostly doing Golang and that kind of stuff so it's always moving, always switching.

BP So Mauricio, for folks who are listening, one of the reasons you're on the podcast is because you work with a colleague of ours, Roberta Arcoverde, who's been at Stack for a very long time and helped us through a lot of big technology building, architecting, and evolution. Tell the folks a little bit about who you are, what you do, and sort of how you got into the place you are as a senior software engineer at a very well-respected tech company.

ML It was kind of a wild chase, I guess. I left college, started to work for a local company. My hometown is just a couple miles away from Roberta's hometown. We're from the same region in the country. We kind of almost have the same accent, but the people from our hometown have a slightly different accent that makes them sound a little bit different from us. But we’re kind of from the same place, as we would say. And I worked for a year at a local company and it was right at the time when Ruby on Rails was eating everything. Everyone was talking about Ruby, everyone was doing Rails. This was like 2005-2006 and it was the talk of the town. And I was doing Java back then and people were saying, “Hey, those things that are taking you a week to do in Java, you could do it in five minutes in Rails,” and my eyes just started glittering and I was like, “Well, I should have a look at this different language.” And I started playing with Ruby, did a little bit of Rails, and it was around the time where there was a lot of people hiring people remotely from all over the place and I landed a job at a consultancy in South Africa. And it was me and three other guys doing this work. We were mostly doing consultancy for other companies that needed Ruby on Rails experience. So we needed apps to be built, kind of the beginning of the Software as a Service kind of thing, so we built a lot of that kind of stuff. We built social networks. It was a lot of different things. And since that time I haven't really ever worked for a company in Brazil anymore. And eventually just switching jobs and working with Rails and other stuff, I eventually moved to the US eight years ago and started to work at a local company in Philly and had an amazing time there. I moved to DigitalOcean, still living in Philly working remotely, but I would visit our office in New York almost every week. And eventually before the apocalypse happened, we decided to have a kid. 
My wife got pregnant and we were like, “Hey, we need to move elsewhere. We need a bigger place. We can’t just be inside the city anymore.” And we moved to Florida and now I'm a neighbor. It's just 10 minutes away from our place and we have our little one in here. And last year I switched jobs and came to Stripe. I was looking for a different challenge. I worked a lot in platforms and building services like internal services for the business. Most of the time that I was working at DigitalOcean I was on the team that was helping people migrate from the monolith to microservices, so there was a lot of infrastructure. Most of my work was infrastructure and I was like, “Well, maybe I should do something that is not infrastructure anymore, work on stuff that's customer-facing and that kind of stuff,” and Stripe seemed like a really good opportunity to do that kind of work. Work more on teams that are building features to customers instead of an internal team that's building infrastructure and platform. So that's how I ended up working on card payments here.

BP Interesting.

RD So you've done a lot of work. You are sort of an expert on the migration from monolith to microservices, and that's obviously not a super new thing, but it is for some companies still who are migrating from their monoliths. What would you say are the most important things for people to keep in mind when they're making that migration?

ML I think one of the most important things that people should keep in mind is that both the systems need to continue working. So that was one of the things that was number one for us back at DigitalOcean. As we were building new systems and building new functionality, allowing people to migrate out of the services, the old and the new services had to work together. So it couldn't be a thing where you're just going to say, “We're just going to flip and switch everything from one place to the other,” because that would just not be feasible. The amount of traffic and all the work that had to go into making all of this work, it couldn't just be a big bang switch where everything changes and now all the traffic lives in the new place. We had to make sure that the migration path was a migration path that was slowly moving traffic and features into the new system. And also there's a lot of stuff that I think people don't pay attention to with the amount of work that has already been invested in that monolith. So you're going to have metrics, you're going to have alerting, you're going to have logging, you're going to have a lot of infrastructure that exists around that solution already that might not be visible because you are just so used to it that you don't even see that all of those pieces exist inside the system. And when you start to migrate that out to a new microservice that exists in its own place, its own code base, its own special way of building, you're not going to have access to all of those things that are already built for the monolith unless someone is building that for you. And that was one of the things that we had in mind back at DigitalOcean that we really wanted to do. We had to build all of the capabilities that existed in the monolith so that people migrating into these microservices wouldn't have to build it themselves. 
Because if you imagine 20 teams all building their own microservices, all building their own thing, they would end up with 20 different solutions for every single one of these things. They would have their own logging, they would have their own metrics, all different ways of doing this stuff, and that was one of the number one goals that we had that we just did not want people to go that direction, just reinventing the wheel every single time they were building a new service. And I think this is something that people miss when they're doing this kind of migration. There's all of this stuff, all this history that is built and maybe not visible anymore, and it starts to be really visible when you migrate it. There's a lot of growing pains that you get as you're moving into microservices.
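
The gradual migration described above, where traffic moves slowly from the monolith to the new service while both keep working, can be sketched as percentage-based routing. This is only an illustrative Python sketch under stated assumptions, not DigitalOcean's actual gateway: the names `bucket_for` and `choose_backend`, the backend labels, and the idea of hashing a stable request key are all hypothetical.

```python
import hashlib


def bucket_for(request_key: str) -> int:
    """Map a stable key (for example a user ID) to a bucket in [0, 100)."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_backend(request_key: str, rollout_percent: int) -> str:
    """Send roughly rollout_percent of keys to the new service.

    The same key always lands in the same bucket, so a given caller does
    not bounce between the old and new systems as the percentage grows.
    """
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    if bucket_for(request_key) < rollout_percent:
        return "microservice"
    return "monolith"


if __name__ == "__main__":
    # At 0 percent everything stays on the monolith; at 100 percent
    # everything has moved. Values in between shift traffic gradually.
    for percent in (0, 10, 50, 100):
        print(percent, choose_backend("customer-123", percent))
```

Because the bucket is derived from the key rather than chosen at random per request, raising the percentage only ever moves callers in one direction, which makes a rollback a matter of lowering the number again.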

BP Yeah, that's interesting. We were on a call or a podcast recently with the CEO of Retool, and I think his pitch was similar. Instead of building internal tools at every company and reinventing the wheel– or as you make a good point, even different departments building it– get a SaaS provider and people can pick off the shelf stuff and it's up to Retool to keep them up to date and do all that kind of stuff. Another podcast we were on and they were talking about creating sort of a developer portal, but basically a design and developer language that was universal throughout the company. So if somebody built a developer tool internally, there was sort of a guidebook that was like, “This is how you can build it so that we can make sure it integrates.”

RD Like a developer platform or something.

BP Yeah. I don't think it was the Backstage one, but.

RD I mean, a lot of the sort of internal developer tooling for services ends up being service meshes and API gateways and the sort of things that handle the traffic shaping and manage failovers and such. So how did you all at DigitalOcean solve that? Did you create the kind of ligaments that tied everything together or was it something else?

ML We had to, we had to. That is one of the downsides of actually being the cloud provider. You can't hire cloud providers to do that work for you. We couldn't have a third party system or application sit in between the user doing the operations so we just had to do a lot of the work ourselves. So rolling out all of the systems, and we had to do feature flagging, we had to do the actual API gateway. Back when we started, Envoy and all the service mesh things that we have in place right now were not things, so we just had to develop a solution that would work for the environment that we were in. And it was an environment where a lot of stuff was changing. That was the beginning of everyone using Kubernetes so there was a lot of learning on how we would be running Kubernetes in an environment like that, and that surfaced a lot of stuff that we did not understand about the network and how routing traffic into the Kubernetes cluster was going to work and mixing. So we have the virtual machines. Can we use the virtual machines, or if we use the virtual machines that we create ourselves, what if the virtual machines break? Is that going to break a control panel? So there's even the consideration of how are you going to build the systems in tiers so that, as a tier that's on top of you breaks, that doesn't break you? And that was another challenge that we had to have, because as you are building the cloud, depending on where you are, you can't just dog food. Because those systems are being built on top of the primitives that you're providing, so your primitives cannot depend on these services. So we could not build a system that creates the virtual machines on top of virtual machines. That had to be built on something else lower level so that breaking the virtual machines doesn't break this internal system, and that was one of the challenges of it. It was a business that was already running. 
There was a lot of stuff going on, and how do we make sure that these tiers are actually in the right place? We used to call those the zero tier systems. It would be the systems that would have to run independent of everything else. They can only depend on themselves and they have to do all the work themselves. So building this kind of stuff when everyone is talking about, “Oh yeah, we have cloud formations, we have autoscaling groups, and we have managed Kubernetes,” and then inside we look, “Oh yeah, we just have to run everything manually or in an automated fashion, but not using cloud solutions because the systems are so low level that they can fail.” They would only be able to fail by themselves. They could not use the other solutions that we had and building virtual machines and using managed Kubernetes because it would just make it impossible. You’d just have a cycle so one breaks and then everything breaks. So that was one of the biggest challenges that we had. We just had to roll all of this ourselves.
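
The tiering rule described above, where a lower tier must never depend on something built on top of it, can be checked mechanically. The following is a minimal Python sketch under assumptions: the service names, the tier numbers, and the `tier_violations` helper are hypothetical and are not taken from DigitalOcean's systems.

```python
from typing import Dict, List, Tuple


def tier_violations(
    tiers: Dict[str, int],
    depends_on: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Return (service, dependency) pairs where a service relies on a higher tier.

    Tier 0 is the lowest level: those services may only depend on other
    tier 0 services. In general a service may depend on its own tier or
    anything below it, never on something above it.
    """
    violations = []
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            if tiers[dependency] > tiers[service]:
                violations.append((service, dependency))
    return violations


if __name__ == "__main__":
    # Hypothetical layout: the VM scheduler is a zero tier system.
    tiers = {"vm-scheduler": 0, "managed-kubernetes": 2, "control-panel": 3}
    depends_on = {
        "vm-scheduler": ["managed-kubernetes"],  # a cycle waiting to happen
        "managed-kubernetes": ["vm-scheduler"],
        "control-panel": ["managed-kubernetes", "vm-scheduler"],
    }
    for service, dependency in tier_violations(tiers, depends_on):
        print(f"{service} must not depend on {dependency}")
```

A check like this could run in CI against a declared dependency map, so that a zero tier system picking up a dependency on a higher level product is caught before it creates the cycle where one failure takes everything down.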

RD I mean that's interesting. I think when everybody talks about cloud, they sort of forget that there's actual silicon and metal underneath it. Did you run your stuff on an operating system or was it something even lower than that?

ML It was mostly operating system, so it was a mix of operating system and Kubernetes depending on what kind of system we were using. So most of the stuff that was compute heavy like the API gateways, that did not actually have to hold any data, they would just ship data elsewhere, they could all run on Kubernetes. But we had a lot of services like Kafka and databases that all had to run directly on hardware that was not virtualized. So those were actual machines that we had to kind of operate by ourselves using Ansible, Chef, and that kind of stuff. So there was a lot of manual operation to make the cloud work. Contrary to what most people expect, there is a level where you can't get that much automation out of it.

BP I was watching a talk the other day from a famous MIT professor and it was about why they had changed their course to move SICP over to Python, and he was saying, “When we designed this, people had to get down at a low level and understand the metal and the memory and how the compiler worked and all that kind of stuff. Now when I talk to students, they just take a bunch of libraries and packages and poke and prod and see if they can get out what they want.” So it's interesting to hear you say that you may want to have this incredibly lean, nimble, agile, serverless, headless microservices company, but somebody at the end of the day has to be responsible for some of the bedrock, otherwise one piece breaking would kind of cause the whole thing to spin out of control.

ML Yeah, it's just computers. At the end, there's going to be a large server that has lots of CPUs, lots of memory, lots of disks, and that is one of the things that I think most people don't pay attention to, but we just had to. We had to look at the disks, how many disks do you need? What kind of RAID setup are you going to use so that you're not going to lose data? What are the failure modes? In a lot of cases, we actually had to look at placement. So if you're going to get the server, you want to make sure that you have a database. This database is going to have to fail over to other databases. They cannot be plugged into the same wiring and they cannot be plugged into the same power supply. And you actually have to go there and make sure that, yes, when we're going to plug in these boxes, you have to make sure that this box has to sit on this switch on this power supply and the backup needs to be on another switch on another power supply, because you don't want one switch causing your database to be gone and then your failover doesn't work because it's unreachable as well. So even the placement of the boxes is going to affect it.
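
The placement rule above, that a database and its failover must not share a switch or a power supply, is the kind of constraint that can be verified before anything is racked. Here is a small Python sketch under assumptions: the `Placement` record and its field names are hypothetical and simply stand in for whatever inventory data is available.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Placement:
    """Where a machine physically sits (hypothetical inventory record)."""
    host: str
    switch: str
    power_supply: str


def shared_failure_domains(primary: Placement, failover: Placement) -> List[str]:
    """List the failure domains that the primary and its failover have in common.

    An empty list means losing a single switch or a single power supply
    cannot take out both machines at once.
    """
    shared = []
    if primary.switch == failover.switch:
        shared.append("switch")
    if primary.power_supply == failover.power_supply:
        shared.append("power_supply")
    return shared


if __name__ == "__main__":
    primary = Placement("db-primary", switch="switch-a", power_supply="psu-a")
    failover = Placement("db-failover", switch="switch-a", power_supply="psu-b")
    problems = shared_failure_domains(primary, failover)
    if problems:
        print("failover shares:", ", ".join(problems))
```

The same comparison extends naturally to other domains such as rack, building, or region by adding fields to the record.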

RD Right. If you want to get even more strict, you can put them in different buildings. Just protecting against godzillas.

ML And we had to do that. It has to be different buildings and different regions altogether. So we had to make sure, “Okay, so we have a bunch of stuff that sits in data centers in New York and New Jersey. We also have to have the backups in San Francisco or in Toronto, so that if one of these places fails completely, the stuff has to move to these other places. All the traffic and processing has to move to these other places.” And most of the time we would have to have hot systems where both of them are working and operating, even if they're not taking all the traffic, but so that it's easy for you to flip a switch and say, “Hey, now we're going to send all of this traffic to this other completely different region because we have lost a full data center or multiple data centers in a region.” So there's a lot of work that has to be done at this level just because of how low level you are to make sure it works, when in a cloud provider you just say, “Yeah, I'm just going to use different availability zones and that is going to work.” And that's one of the things that, whenever I see people saying, “Oh, it's so hard for you to deploy stuff right now. It was much easier. We would FTP to a box and deploy to the box and then it would just work.” And I'm just like, “No, you don't want to go back to the time of FTP. I've been there. It's not fun.”

BP Yeah. Underneath all that, “Just log into this website and you'll have an app running in minutes,” there are real people mucking around in real buildings plugging and unplugging things. Stack Overflow has a famous story about when Hurricane Sandy came to New York and people were walking up and down stairs bailing out buckets of water so that they could make sure Stack Overflow’s servers didn't go on the fritz. So tell us a little bit about your recent decision. You said you wanted to move away from the backend and do things that were more consumer-facing. Was that just something you felt was an opportunity to grow, to stretch and learn new things? What was it about the consumer-facing side of stuff that interested you?

ML It was mostly because I felt like I've been doing this work for such a long time, before joining DigitalOcean I was coming from small startups where I would just be working all over the place, but I ended up always moving to the operations because it was always this pain of DevOps and how do we make all of this scale, how do we make all of the systems work together. And after years and years and years doing this kind of work and being on the back office, I thought maybe this is a good time for me to go work on something where there's more user-facing visibility to the features. So it is a completely different experience, because while in the past I used to be the person that people would reach out to to set up architecture and all that kind of stuff, now I'm on the other side. Now I'm asking. And I think having the empathy to also notice the difference in that now I have to actually work with these other teams to make sure that the work that I need them to do is on schedule. So they have to build the infrastructure, we have to set up a new environment, or we're going to start processing elsewhere. So you just have to go and work with all of these different infrastructure teams to make sure that all the things are going to be aligned for when we're going to do the launch and that kind of stuff. So I feel like I needed to get this perspective on the other side. Being someone that was mostly working and getting the asks from other teams, now I am the one asking people to do this kind of basic infrastructure work and finding out where the holes are. So we have all of these pieces on the platform that need improvement and now I am the one also helping drive the direction, where in the past I would be on the other side, just working on the platform, making all of these things run. 
So having this kind of perspective of working on the other side where I'm building the features, building the applications for someone to run these applications at is definitely something that I would recommend to anyone that is working in operations to do every once in a while to get this perspective on the other side.

RD Is there an incident where you were on the infra team that you gained more empathy for the askers when you were now asking? Was there something that you were like, “Oh, I get why they were asking that.”

ML Yeah. The perspective that you have as you're building infrastructure is very different from the perspective of the people that are using the infrastructure. And one of the things that really caught us off guard was that we had to build new dashboards and we had to build ways, especially for the operations team, to find out if there was something going on in the systems. Because as the API gateway back then, it was the first layer of all the traffic. So any traffic that was going through the public API, the control panel at DigitalOcean, had to flow through this API gateway. So we had visibility into all the metrics, and what we ended up doing was building a really complex dashboard with a lot of panels, a lot of metrics– not a lot of explanation because we were like, “We understand all of this. We don't need explanations for it.” But then when operations people would open that dashboard, they would be like, “What are these numbers? What do they mean? What is this about? How do I even figure out if there's something going on?” because it's graphs all over the place and there's no definition on what is going on with the system. And at that point we were like, “Hey, we made a huge mistake in here. This dashboard is really useful for us as developers and the people that are running the platform, but not for the customers of the platform or for the operations teams that need to understand what's going on in here.” So one of the first things that we did after that was starting to build separate dashboards for the separate demographics that we had. So one, we would build dashboards that were services-specific for the people building new microservices that would sit behind the API gateway, so those would be high level dashboards with only metrics for your system so that you could see how your system is behaving on that specific environment, and then we would build different dashboards for incident detection with multiple levels. 
So instead of just saying, “Oh, we're going to build this dashboard that only looks at our system,” we will say, “What are the other systems that are related to us that could also be causing incidents?” So we would be looking at the CDN provider that we had. We would be looking at the main databases that we had. We would be looking at the highest traffic systems that we had, and we ended up building a tiered dashboard where you would be looking from the high level all the way to the low level to the main databases, and all high level perspectives, high level metrics with colors to make sure that people would see. You open the dashboard and right on top was a bunch of panels that would be green, yellow, or red showing high level metrics and high visibility. So when you open it and you see, “Oh, the API gateway is showing yellow,” we do have an issue here. The database is showing red, the issue is at the database. So that way they wouldn't have to actually go through every single one of these panels and understand all these different pieces. And we also introduced descriptions to the metrics. We started to add marks for deployments, so if there was a deployment on that application, that would be marked on the dashboard. If something started to happen right after there was a deployment mark, it's very likely it was the deployment that caused it. So building these separate dashboards for the separate people that had to look at the data was one of the things that became really visible to us. We are not building something just for us. This has to be built for everyone, even people that are not direct customers like the operations people. Because at the beginning we saw our direct customers are the teams that are migrating into microservices. And then we noticed, no, our direct customers are many more people in the business that actually want to have a high level perspective of how systems are behaving. 
And this really led us to improve the way we were sharing data, the way we were exporting data, building dashboards, collecting metrics to make all of this more visible to everyone else.
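
The top row of green, yellow, and red panels described above amounts to collapsing each system's metrics into a single status and pointing at the worst one. A minimal Python sketch of that idea follows; the thresholds, the use of error rate as the metric, and the names `panel_color`, `summarize`, and `worst_panel` are assumptions for illustration, not the actual dashboard logic.

```python
from typing import Dict, Optional

# Higher number means more severe.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def panel_color(error_rate: float, warn_at: float = 0.01, alert_at: float = 0.05) -> str:
    """Turn an error rate into a traffic-light color (thresholds are assumed)."""
    if error_rate >= alert_at:
        return "red"
    if error_rate >= warn_at:
        return "yellow"
    return "green"


def summarize(error_rates: Dict[str, float]) -> Dict[str, str]:
    """Compute one color per system for the top row of the dashboard."""
    return {name: panel_color(rate) for name, rate in error_rates.items()}


def worst_panel(colors: Dict[str, str]) -> Optional[str]:
    """Return the system with the most severe color, or None if all are green."""
    worst = max(colors, key=lambda name: SEVERITY[colors[name]], default=None)
    if worst is None or colors[worst] == "green":
        return None
    return worst


if __name__ == "__main__":
    colors = summarize({"cdn": 0.001, "api-gateway": 0.02, "main-database": 0.09})
    print(colors)
    print("look first at:", worst_panel(colors))
```

With the hypothetical numbers in the example, the gateway shows yellow and the database shows red, so an operator is pointed at the database without having to read every panel.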

BP Yeah, we had a great guest on recently who's also written for the blog who talked about observability debt, and I think what you're talking about is building tools internally that start to take care of some of that stuff, because as you point out, it can be hard to know where it started. And when you're extending things in different directions and piping things out to microservices, it’s harder and harder for teams to understand why their particular service isn't working because something upstream or downstream has gone wrong. That was Jean Yang from Akita. Observability debt is the new technical debt.

ML And that is true. There is so much data that we have to process that actually operating on all of it and making sense of it is a lot of work in itself.

RD Yeah, you no longer just have a text file in a certain folder. Now it's just all these services, everything is just throwing out data.

ML Yeah, and even as you start to mix the variables, you have this small service that's not taking traffic from India and it's just a trickle of traffic that's not working. But if you don't have tracking for these collections of variables, you will never notice that that is happening– that specific amount of traffic from that specific country for that specific application.
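
The point above about combinations of variables can be made concrete: a failure confined to one country and one service disappears in an overall error rate but stands out once requests are grouped by both dimensions. This is a minimal Python sketch with made-up data; the function name `failing_segments`, the record shape, and the thresholds are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Each record is (country, service, succeeded).
Request = Tuple[str, str, bool]


def failing_segments(
    requests: Iterable[Request],
    min_requests: int = 5,
    max_error_rate: float = 0.5,
) -> List[Tuple[str, str]]:
    """Return (country, service) pairs whose error rate exceeds max_error_rate.

    Segments with fewer than min_requests are ignored so that one or two
    stray failures do not raise an alarm.
    """
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    errors: Dict[Tuple[str, str], int] = defaultdict(int)
    for country, service, succeeded in requests:
        key = (country, service)
        totals[key] += 1
        if not succeeded:
            errors[key] += 1
    return sorted(
        key
        for key, total in totals.items()
        if total >= min_requests and errors[key] / total > max_error_rate
    )


if __name__ == "__main__":
    # 1,000 healthy requests hide 10 failing ones: about 1% errors overall,
    # but 100% errors for this one country and service.
    traffic = [("US", "billing", True)] * 1000 + [("IN", "billing", False)] * 10
    print(failing_segments(traffic))
```

In practice this grouping would usually be expressed as labels on metrics in a monitoring system rather than computed by hand, but the principle is the same: the segment has to be tracked for the trickle to be visible.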

[music plays]

BP All right, everybody. It is that time of the show. We want to shout out someone from Stack Overflow who came on and helped save a little knowledge from the dustbin of history. Awarded April 12th to The Nail, “What is the efficiency of if->return vs. if->else?” If you're curious, The Nail has an answer for you and has helped 25,000 people over the years with that little tidbit of knowledge. And of course, don't forget to check out Mauricio, who is in the top 0.61% overall. So Mauricio, we appreciate all the knowledge you shared. Ruby on Rails, Ruby, Java, Ruby on Rails 3, ActiveRecord, and Scala are your top areas of contribution, so we really appreciate it. All right, everybody. As always, thanks for listening. I am Ben Popper. I'm the Director of Content here at Stack Overflow. Find me on Twitter @BenPopper. Email us with questions or suggestions, podcast@stackoverflow.com. And if you like what you heard, why don’t you leave us a rating and a review. It really helps.

RD I'm Ryan Donovan. I edit the blog here at Stack Overflow. You can find it at stackoverflow.blog. And if you want to find me on Twitter, I'm @RThorDonovan.

ML And I'm Mauricio Linhares, Senior Software Engineer at Stripe. And to find me, you can search for Mauricio Jr on Twitter. And if you're in Brazil or you can actually understand Portuguese, you can go listen to our Portuguese podcast on technology, it's hipsters.tech. So go there, you're going to find me and Roberta there.

BP Very cool. All right, everybody. As always, thanks for listening, and we will talk to you soon.

[outro music plays]