The Stack Overflow Podcast

How developer experience can escape the spreadsheet

Episode Summary

Ben and Ryan are joined by Cortex cofounders Anish Dhar, CEO, and Ganesh Datta, CTO. Cortex offers an internal developer portal that helps devs document and reinforce organizational best practices and improve developer productivity. The portal includes features like scorecards that incentivize developers to improve their work and AI-powered search to make finding information easier.

Episode Notes

Cortex is an internal developer portal that cuts noise and helps devs build and continuously improve software. Explore their docs or see what’s happening on their blog.

Cortex is also hiring, so if you’re an engineer who wants to work on these kinds of problems, check out their careers page.

Connect with Anish on LinkedIn or X.

Ganesh is also on LinkedIn and X.

Shoutout to Alex Chesters, who earned a Great Question badge with How to count occurrences of an element in a Swift array?

Episode Transcription

[intro music plays]

Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I am Ben Popper, Director of Content here at Stack Overflow, joined by my co-host with the most, Ryan Donovan, Editor of our blog, maestro of the newsletter, oldest living technical writer. 

Ryan Donovan We die young, we die young. 

BP Maybe. Ryan, we're going to have some folks on today from Cortex, and what's interesting about this in terms of our history of doing the podcast together for the last four or five years is that they worked, at least in part, as engineers at Uber, figuring out issues and pain points at a hyperscaling company, then took some of that knowledge and moved it into a product that became its own external company. We did a podcast recently with the folks from Temporal. I think there's been at least one other.

RD Chronosphere is another one.

BP Chronosphere was the other one. Temporal and Chronosphere, there’s a time theme there. And then obviously we had that gentleman writing for us who then went on to be a big Substacker. You know how there's a PayPal Mafia? There'll be an Uber Mafia– people who worked there and went on to do their own great things. But without further ado, I'd like to introduce our guests. Anish Dhar is co-founder and CEO and Ganesh Datta is co-founder and CTO at Cortex. So welcome to both of you on the Stack Overflow Podcast. 

Anish Dhar Thank you so much. 

BP So Anish, let's start with you. Just quickly for our audience, how'd you get into the world of software and technology, what landed you at Uber, and how did you transition from working there as an engineer to becoming a founder yourself? 

AD Absolutely. Well, I think the really cool thing about both our backgrounds is that we both grew up in the Bay Area, and growing up in the Bay Area, you're just constantly surrounded by technology. My dad was an engineer, and so growing up I always had this fascination with computer science, with building things. I loved hacking my Nintendo Wii. And that was, I think, what really got me excited about pursuing computer science and becoming an engineer. So I just kind of had this natural interest in building websites or software or messing with technology. And I remember Uber, especially in 2013/2014, was having this meteoric rise and it felt like the company everyone was talking about just because of the impact it was having across the world. And so for me, joining a company like Uber was always my dream role and job, and when I got there things were moving so fast. I remember in my first onboarding, the CTO at the time, Thuan, mentioned that the company was literally doubling every six months and they had just entered China, and the technology problems they were trying to solve from a scale standpoint were so massive that it felt like the perfect opportunity to join. And one of the things I quickly learned right when I joined Uber was just the scale of their infrastructure. Uber builds pretty much everything in-house, and one of the results of that was that there was almost this explosion of services and infrastructure. And so one team could have over 400 services, and they each handled a specific component. One service was calculating the price a rider would be charged for a fare, another would be doing the matching algorithm between a rider and driver, all of these just kind of working together to build the software that is now Uber. What was interesting was that over my few years there, the number of services didn't just scale linearly– it grew exponentially. 
And what ended up happening is that over the four years there, the engineers who wrote those services ultimately started leaving the company. And I remember one specific moment where I got alerted for a service and spent almost 15 minutes just trying to look up where the documentation for it even lived, and when I found that documentation, it was completely outdated and I didn't even know who I should contact because the original person who wrote it had left the company. And so I started noticing things like this happening, not just at Uber, but really in a lot of the conversations I was having with friends who were also engineers, and I realized that because of this service sprawl, there was almost this natural tendency toward developer productivity problems. Everything from new engineer onboarding to tracking and enforcing best practices became difficult. We would often track a lot of this data in spreadsheets, which would immediately go out of date when you needed it, and just scaling this became enormously difficult. I think that's really what ultimately led me and Ganesh to start something like Cortex. 

BP I can't say that the story you tell is unfamiliar to us here at Stack Overflow. There was sort of a founding ethos of, “We're all great engineers here. We know what we're doing. We're going to build it better, and so maybe we'll make our own mail system. Maybe we'll make our own CRM. Why would we pay for it? We can roll our own.” And as you said, we learned the hard way that when the stakeholders leave, they don't always pass that knowledge along, there isn't always great documentation, and it's not easy to keep up with the pace of development that a service provider who's focused on that one thing exclusively can manage. 

AD That's exactly right. And I remember there were several internal tools that Uber had built to try and solve different pieces of this equation, but they were all scattered across the company and different teams weren't aware of different tools. For example, there's a tool called uOwn that would attempt to tell you which team owns a specific piece of software, which was super useful, but what it didn't tell you is, okay, does that piece of software actually meet our quality standards, which were constantly changing as Uber was evolving. And I think that really inspired a lot of the roadmap that we have here at Cortex. 

RD It's interesting. When I first heard about developer portals like this, I was like, “Oh man, I wish I had this at my last job.” One of my responsibilities there was figuring out internal documentation for service-oriented architecture. And we had only a hundred or so services, but nobody had documented who owned them, where their repos were, anything about them, so I put together a spreadsheet that eventually became the sort of de facto thing, and it was still a mess. That's all we had. So how did it go from all these ad hoc solutions to being an actual portal, an actual product? 

AD I would say that it was kind of this iterative process, because when we started the company, it was really around this specific problem that there are a ton of services and we don't know who owns them, so let's start with building just a microservice catalog. And that was really what it was for almost a year. And what was really interesting about that is we had just gotten into Y Combinator, actually, and I was our company's SDR, so I would email almost a hundred people every single day, basically asking them if they have this problem, and people would respond and say that this is definitely a problem and they would get on a call with us, but that's kind of where it would end, because people would say, “Yes, this is a problem, but I've tried building something like this internally. How much better is this really than a spreadsheet? How are engineers going to maintain it? What can I do with this catalog?” And then we realized that there are really two problems that every company faces. The first problem is, how do I organize my services in a way where I can understand things like who owns a service or where does the documentation for this service live? But then the second problem, which you inevitably reach after you try to solve the first problem, is how do I get engineers to actually care about the quality of this data? And I think that second problem is actually much harder to solve, and is actually where we realized a lot of the ROI comes from. Because what you do with the data that ultimately is your catalog is where you can really drive meaningful improvements and see ROI, especially around questions like: how do you create this culture of continuous improvement? How do I get engineers to care about reliability or security from day one? How do I create this culture of operational excellence? And I think this is a goal that many CTOs and VPs of engineering have today, especially in today's economic environment. 
That really led us to build our second product, which is called Scorecards, which effectively lets you enforce best practices and help engineers understand what good looks like. And I think that's kind of the natural evolution to what is now an internal developer portal, really around cataloging, scorecarding, and now our third, which is developer self-serve. And I think the combination of those three is what ultimately makes this portal. 

BP So Ganesh, tell us a little bit about your perspective. How does it differ being the CTO? Were you also at Uber and came over? 

Ganesh Datta So I was not at Uber. I was also a software engineer. I was at a fintech startup, got to see the monolith to microservice journey from the first service we pulled out of the monolith, and by the time I left we had almost 200-ish services. It was a very similar journey. The first service we pulled out, lots of learning, so then we made the infrastructure a lot easier to spin up new services. We built scaffolding and templates and all these things, and eventually we made it so easy to spin up services that people obviously took advantage of that and they would spin up new services as they needed it. And I remember there was a week where we spun up six or seven services and it was just becoming a nightmare to track all this stuff. And a very similar story– we had a spreadsheet that I was maintaining to track all this. I was using a spreadsheet to track production readiness standards because people started spinning up new services. It was like, “This team over here is logging things a different way. This team over here has a completely different convention for how they do basic code quality standards,” and so everyone was kind of doing different things. To try to corral that together, we had started putting together these spreadsheets and guidelines and things like that. And very similar to Anish, I had a moment where I was alerted, it was the middle of the night, 2 AM, and it’s some service named after a Game of Thrones character or something, and I'm like, “I can't believe this. It's 2 AM. I'm scrambling through Slack and Confluence trying to find information and it's the nightmare of all these services that we had created.” It was great from a velocity and scalability and all that stuff standpoint, don't get me wrong, but the human complexity of it had just grown so much, and that was kind of when I talked to Anish and I was like, “Uber, well-known case of microservices. You guys did it. It went haywire, way too many services. 
You guys must have solved this problem. How did you do it?” And Anish's answer was, “Eh, it's kind of the same. Yeah, there's this uOwn thing, but it's this kind of hodgepodge of different tools and Confluence and you ping people on Slack.” And I think that was the initial, “Huh, if something like Uber and something like the startup that I was at both had the same class of problems even though we were operating at completely different scales, there's probably something here to solve in this microservice ownership quality space,” and so that's what we started iterating. Clearly there's something to solve here. That's how we ended up working on this. 

RD You know a company is maturing when they start renaming the cute names on their services to something usable. 

GD Exactly. 

RD So I remember one of the big pains was getting everybody to update stuff. Every quarter, I'd have to email all of the service owners and say, “Is this still correct?” What's the better way? How do you do that better than just emailing? 

GD It's a great question. What's funny is that when we first started demoing Cortex, the way we started our demos was actually that we’d pull up a spreadsheet and we’d just be like, “You probably have a spreadsheet that looks like this– a bunch of services and a bunch of blanks, red cells that were never filled out,” and people would immediately say, “I don't even care what you're going to show me. If you're telling me that at the end of this demo you're going to tell me how to kill the spreadsheet, I'm all ears.” And a big part of that was that we have all these pieces of information that we don't know are true. We actually haven't filled out this information. We have to run after developers all the time and ask them, “Hey, can you please tell us if this is still active? Where do your docs live?” or whatever. And so what we found is that a system like Cortex is only as powerful as the data it can collect, and so the way we think about Cortex is that it's this data platform around your engineering ecosystem, so it's very much integration-focused. Can we treat Cortex and the catalog as a pointer to other systems? Instead of it telling us everything about your services, it's telling us where we can find out more. And so with Cortex you say, “This over here is my on-call rotation. You can use these tags to go look at my monitors. If you look at this label, you're going to find all the tickets associated with it.” And so you kind of just tell us how to find things and then we do the finding for you. And that takes away an entire class of problems, which is, I need to go chase people to have them update things. Because once you define the core data model, everything else is kind of automated and we're pulling things in from different data sources for you, and that naturally keeps the catalog more up to date over time, including ownership. 
I think ownership is a big one, because with a spreadsheet what you run into is, “Hey, Anish, can you please update this information about your service?” and maybe Anish doesn't exist at the company anymore. Maybe Anish is on a different team and he's like, “I don't know. I don't own it. Somebody else now.” And so now you're not just chasing somebody to go and update information, but you're trying to chase down who the actual person is. And so can we take a different lens to ownership, which is, can we reflect your HR system from a service ownership standpoint and map those two things together to say, “This service is owned by team X. Team X comes from Workday or Okta or wherever that is,” and if that team doesn't exist anymore or people have left that team, update that and flag it and say that service is orphaned and just reassign it. So everything starts with ownership because if you don't have ownership, then even when you need to chase somebody, if you don't know who to chase down, the entire life cycle just ends right there. 
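The “catalog as pointers” model Ganesh describes– entries that reference other systems instead of duplicating their data, with ownership validated against a team directory– can be sketched roughly like this. All names, fields, and IDs here are illustrative inventions, not Cortex's actual data model:

```python
# Sketch of a pointer-based service catalog: each entry stores references
# into external systems (on-call rotations, monitor tags, ticket labels)
# rather than copies of their data, and ownership is checked against a
# team directory such as one synced from an HR system.
from dataclasses import dataclass, field

@dataclass
class ServiceEntry:
    name: str
    owner_team: str              # team ID, expected to exist in the directory
    oncall_rotation: str         # pointer: rotation ID in the paging system
    monitor_tags: list = field(default_factory=list)  # pointer: monitor query tags
    ticket_label: str = ""       # pointer: label used to find related tickets

def find_orphans(catalog, team_directory):
    """Flag services whose owning team no longer exists in the directory."""
    return [s.name for s in catalog if s.owner_team not in team_directory]

catalog = [
    ServiceEntry("payments", "team-payments", "ROT-123", ["svc:payments"], "payments"),
    ServiceEntry("westeros", "team-got", "ROT-999"),  # owning team has dissolved
]
teams = {"team-payments", "team-riders"}
print(find_orphans(catalog, teams))  # -> ['westeros']
```

Because each entry only points at other systems, the expensive part (monitors, tickets, on-call state) can be re-fetched automatically, and the orphan check is what would trigger the “flag it and reassign it” step described above.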

BP So it sounds like you're doing some data labeling to create this richer metadata and you're piping that to various systems that exist. So now there's a place to understand where the tickets were. There's a place to understand where the documentation might live. There's a place to understand who the team is or if it's been orphaned. Does that sound right? 

GD Exactly. And everything from that to vulnerabilities from Snyk and Wiz to alerting information from PagerDuty and so on and so forth. All the things in our ecosystem are now connected into the catalog, because all those things matter in terms of visibility and ownership and these kinds of things. So we've been able to connect all these different tools and pull that data in. 

AD And one of the really interesting things that we've seen– the natural question is, okay, we show engineers the information, but then how do we actually incentivize the behavior change to get them to update their package version, for example? Engineers love building software that impacts the business, and sometimes these migrations, which are still incredibly important to the health of the business, just take a long time. How do you track and encourage engineers to change the behavior? So one of the things we did really early on is actually invest in this concept of gamification, and with our Scorecards product you can actually define different levels. For example, a lot of our customers will do things like bronze, silver, gold, and different rules correspond to those levels. And you can kind of build this incentive system– as engineers complete the rules, they increase levels, and we've actually found this has done just amazing things from a cultural perspective inside of a company. Engineers love seeing their services at the highest level, and we actually build reports that break down quality across the entire company. And so a lot of our customers, at engineering all-hands, will promote the teams that do really well and congratulate the engineers who got to the highest level. You can even put badges inside of your GitHub repos and things like that. And so we found that it's not just a technology problem. Obviously, you need to get the visibility and you need to drive the visibility across the team, but then it's cultural as well, and Cortex kind of tries to solve both. 
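The bronze/silver/gold leveling Anish describes can be sketched as a small rule engine: each level is a list of checks, and a service earns the highest level for which it– and every level below it– passes. The rule names and fields here are made-up examples, not Cortex's actual Scorecards API:

```python
# Sketch of a leveled scorecard: levels are ordered, each with rules
# (predicates over a service's metadata). A service must satisfy all
# lower levels before it can earn a higher one.
LEVELS = [
    ("bronze", [lambda s: s.get("has_oncall", False)]),
    ("silver", [lambda s: s.get("has_runbook", False),
                lambda s: s.get("monitor_count", 0) >= 1]),
    ("gold",   [lambda s: s.get("slo_defined", False)]),
]

def score(service):
    """Return the highest level earned, or None if even bronze fails."""
    earned = None
    for level, rules in LEVELS:
        if all(rule(service) for rule in rules):
            earned = level
        else:
            break  # can't skip a level: lower levels gate higher ones
    return earned

svc = {"has_oncall": True, "has_runbook": True, "monitor_count": 2}
print(score(svc))  # -> 'silver' (no SLO defined yet, so not gold)
```

Aggregating `score()` across all services is what would feed the company-wide quality reports and all-hands shout-outs mentioned above.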

BP I'm hearing a lot of echoes of Stack Overflow, which is to say, listen, you want somebody to know where to find the best answer. You've got to have tags on it. You've got to have a date on it. You've got to have an accepted answer versus not. That metadata tells you something about whether or not this information is still accurate or relevant or trustworthy, and then in that context, the wisdom of the crowd helps to determine it. People are voting between all the different answers, or this answer should be deprecated because we no longer do it like that. And then there's the gamification side of it– the badges and the internet points. People love them and it helps them take pride in stuff. Is there any wisdom of the crowds within the system you're building? Do people vote on stuff or are encouraged to collaborate to figure out, when something is deprecated or orphaned, how to find the next person? 

GD I think there's a lot of behind the scenes collaboration that needs to happen before Scorecards are defined. And what I mean by that is that we always think about developers as craftspeople. Developers want to build good software. It's not that they're shipping bugs or not following production standards because they hate it. It's just that sometimes you don't have a shared language of what ‘good’ looks like. And so the ability to kind of have teams come together and say, “Hey, as an organization, these are the things we care about. We have a shared language now. We can iterate on it together. And once we have that, now we can hold ourselves to that standard.” That's kind of where we see that wisdom in the crowd. Just by driving that visibility and saying, “This is what ‘good’ looks like, and we've all iterated to this definition,” that itself raises the floor because people are like, “Okay, cool. I know what we're striving for. I can go do those things. But before I had no idea and so it was my team versus your team trying to do different things.” I think that's where the wisdom in the crowd really comes in as the shared language of good and how you get people to all kind of move in that same direction. You mentioned this earlier that you learn a lot of these lessons the hard way, and so it's this idea of autonomy with guardrails. Not, “Hey guys, you all have to do these things the same way,” but can you let people run, but say, “Hey, the wisdom of the crowd is, these are the guardrails. If you do these things, you'll be good, and within those guardrails, you can kind of go do whatever you want.”

RD I mentioned this at the top of the episode. Y'all came out of Uber like a couple other folks we've talked to. I've heard that Uber has something like thousands of services. Is there something about Uber's both software architecture and culture that makes it a place where novel solutions come out and they're able to get turned into companies?

AD I think it's a little bit of both. Uber, when I was there, had a very, very heavy bias towards building everything in-house. Even the first few years I was there, everything was hosted on-prem. They were not in the cloud. I think now they finally have a hybrid solution with both AWS and on-prem infrastructure. But there was just a really heavy bias to build internal tools, to build our own observability platform, and that led to some amazing, amazing technologies being created, which ultimately is why you see amazing companies like Chronosphere and Temporal. And the interesting thing about their culture, I think, was that there was an incentive around getting promoted: the more services you create, the more it looks like you're shipping code, which means, ultimately, maybe you'll get promoted to the next level. This even led to Uber building their own internal chat system called uChat for the longest time and not using systems like Slack or things like that. And when you get to the point of building your own chat system, I think that's when you know maybe you've crossed the line.

BP No, this is the golden rule. Whatever you make a goal– what do they say? 

RD Anything that becomes a metric becomes a goal.

BP Anything that becomes a metric will eventually be exploited and manipulated. 

AD That's exactly right, and that's what really led to that explosion of over 4,000 services. And what was interesting is that right when I was leaving, there was actually now this huge push to build things back into the monolith. Does this really need to be a service? And it was interesting because I remember I switched teams. I used to work on the Uber Eats team and then I switched to the bikes and scooters team, and we would rebuild services that should have been shared between the two teams. But because there was such a decentralized culture on engineering and there were just too many services with no centralized catalog or documentation, there was no way of communicating between these two teams that you built this, let me reuse your code or let me reuse your software, and I think ultimately that just led to a loss of productivity. I think that's one of the benefits and reasons why we really push our customers to open the catalog across the entire organization and give engineers the flexibility to understand what are other people building, even in teams that aren't related to myself.

RD Ultimately I've heard that service-oriented architecture isn't a software architecture, it's a people architecture. Getting distributed, self-organizing teams to coordinate and communicate is the challenge you have to solve. 

GD Conway's law I think is a great way people describe it.

RD Yeah, Conway’s law. 

BP So you mentioned a scorecard and how that kind of has to almost be an act of culture, an act of people first, and then once it's agreed upon, can be enacted in practice and then hopefully leads to healthy results. So can you just talk to me a little bit about how a company goes about settling on a scorecard and then how you track and measure that for them? And are you able to then say to them after a quarter or a year, “Look, here are the changes, here are the productivity boosts, or here's the amount of deprecation you were able to do that saved you XYZ money/time/compute?”

GD It's a great question. When we think about Scorecards, the idea of a scorecard is just being able to define a set of criteria that you care about with a leveling system. So as you get better, you achieve higher levels, and so you're moving towards something all the time, but then the fundamental concept is a culture of continuous improvement. So it's not, you're production ready or you're not production ready, it's a continuous spectrum of readiness. You're consistently getting better, your service is getting more secure, it's getting more performant, whatever that is, and so how do we create a culture of continuous improvement? So step one is, what are the things that we're trying to move? And so a lot of organizations will say, “Hey, we're trying to improve developer productivity. We're trying to improve MTTR.” What are your high level metrics that you really care about? Then take that and translate that down one level further. To your point, you don't want it just to be a metric that you can game. What are the inputs into that metric? If we do these things, we will see an impact on this metric. Okay, so let's unwind that a bit further. That generally is, “Okay, we're trying to improve MTTR. If we had good on-call practices, if we had good monitoring and alerting, if we had runbooks and visibility into that, if we had consistency around how we operate, we're naturally going to get the benefits of that from an operational standpoint, and MTTR would go down.” Cool, that sounds great. Now we take that and we codify it into a set of systems which is like a scorecard. So step one for a lot of organizations is that you can only improve what you can measure. And so most organizations, if you've been doing production readiness in a spreadsheet for the longest time, you probably don't have a ton of visibility. And so step one is literally just put this criteria into Cortex and see what it spits out. Where are you today in production readiness? 
How many teams are actually doing all the things you should be doing, and what do you not know about? And then once you have that visibility, our customers can figure out, okay, based on this criteria, what do we think is the most basic? If you don't do these three things, you should not be in production. That's the stick, that is the bare minimum. And so that is usually something like: you have reasonable code quality metrics, or at least you're producing quality metrics. You have an on-call rotation, but not just one level– you actually have an escalation policy. You have some basic monitors and some SLOs. That's the basics. If you do that, you're good to go. And so as people iterate and they build scorecards, this is where it kind of becomes more of an art than a science, because you say, “How do we incentivize developers to continuously improve over time?” So if they're doing the basics, what to us as an organization is the next level of maturity? Where are we trying to go as an organization? What is top of mind for us? For a lot of organizations, this is things like SLO adoption. And so in the next level, we're going to move SLO adoption up to that silver level so that people can get a little bit better and they get that win and they see the impact of that, and so on and so forth. And so it's kind of this process of: where are we today, what do we care about, how do we plan to move it, how do we incentivize the movement towards those metrics, quantify and measure it, see the impact on the original metric, and then go back to the scorecard and tweak the inputs. And so when you think about scorecards as being the inputs– are we doing all the things we should be doing– that is kind of how organizations are thinking with scorecards and driving that behavior. But that process sounds a lot easier than it really is, because it requires people to align on what production readiness looks like, what we care about, what the top-level metrics are, and so on and so forth. 
And so once you have that defined, that's the knowledge in the wisdom of the crowd. It's like, “Okay, we know where we're going and we know how to get there.” 

BP All right. So tell me what your hot new Gen AI feature is and how that makes the product something I should buy today. No, I'm just kidding. Is there any Gen AI or AI at all in this or no, that's not the idea. We're trying to simplify things. 

GD There is AI, fortunately or unfortunately, depending on how you look at it. The way we think about the problem space is– we talked about this a lot, and Ryan, you probably know this as well. One of the biggest problems in an engineering organization is just finding information. That's a big class of problems in and of itself, and the catalog is meant to help you solve that problem to some degree– where do things live, where are they deployed, is it healthy, etc, etc. All this stuff is trying to make it easier for developers to find things. And so to us, AI seemed like the perfect use case for this, where it's like, “Hey, why can I not ask and introspect my data in these kinds of ways? Hey, who should I talk to about the payment service?” “Cool, you should go talk to Anish. Anish worked on it for the last six months.” Questions like that are so easy to answer once you have all this data, and so that's kind of where our Gen AI strategy is coming through.

[music plays]

BP All right, everybody. It is that time of the show. We want to shout somebody out who came on Stack Overflow and helped to share a little knowledge or spread their curiosity. Awarded three hours ago, a Great Question badge to Alex Chesters. This question has been given a score of a hundred or more: “How to count occurrences of an element in a Swift array.” Congrats, Alex, on asking a great question. 104,000 other people had this question and you've helped them out, so we appreciate your curiosity. As always, I am Ben Popper. Find me on X @BenPopper. If you want to come on the show and discuss something, if you want to be a guest or give us a topic or tell us who the guest should be, email us at podcast@stackoverflow.com. We want to hear from you. And if you liked today's program, leave us a rating and a review. 

RD I'm Ryan Donovan. I edit the blog here at Stack Overflow. You can find it at stackoverflow.blog. And if you want to reach out to me on X and tell me the secret tech story that nobody's talking about, you can find me @RThorDonovan. 

BP Spill some T. 

RD Spill some T. Get the deets. 

AD I'm Anish. I'm the co-founder and CEO of Cortex. I really appreciate you both having us on the show. You can follow us on LinkedIn. Cortex sponsors a ton of conferences, so if you're ever at a conference, please stop by our booth. Ganesh and I are often there. We'll give you a T-shirt, socks, all the swag that you want, and we just love talking about service complexity, about the product, about the space in general. You can also visit Cortex.io, where you can learn more about the product and get a demo and things like that. 

BP Very cool. 

GD I'm Ganesh. Thanks so much for having us. I really enjoyed the conversation. You can also find me on LinkedIn, on Twitter, or email me at ganesh@cortex.io. We are hiring staff engineers, so if you are looking to work on developer portals and solve these fun problems and cultural problems, please shoot me an email.

BP That's great. We have been talking to lots of folks these days who are looking for jobs, so we'll put the link in the show notes and hopefully some people will apply. Thanks everybody for listening. We will talk to you soon.

[outro music plays]