On this episode of Leaders of Code, Ben Popper hosts a conversation with Maureen Makes, VP of Engineering at Recursion, and Ellen Brandenberger, Senior Director of Product Management for OverflowAPI. They discuss AI's role in drug discovery, the scaling and integration challenges that come with it, and how innovation helps teams meet biotech's high standards.
[intro music plays]
Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I am Ben Popper, one of the hosts of the Stack Overflow Podcast, and today we have another episode of Leaders of Code. This is our series where folks in key positions at organizations on the frontiers of technology get a chance to sit down and have a conversation. Today we are going to be chatting with Maureen Makes, who's the VP of Engineering at Recursion, and Ellen Brandenberger, who is a Senior Director of Product Management for our OverflowAPI products.
Ellen Brandenberger Correct.
BP Oh, I nailed it. Okay. Well, welcome to both of you. Maureen, tell us a little bit about yourself. How'd you get into the world of software and technology, and what brought you to the role you're at today?
Maureen Makes So I have one of those increasingly common nontraditional tech backgrounds. My degree is in environmental science, and I started my career doing building energy efficiency work, which landed me in San Francisco in the early 2000s. It seemed like a lot of the interesting things that were happening were in the world of software. That's where I was seeing people really having a meaningful impact on the world and staying on that cutting edge of how we make people's lives better using technology, and that was a big driver for me even early on. So I started teaching myself to code at night, going to groups like Women Who Code and Girl Develop It, and just meeting people, talking to people, and ended up landing at a couple different education companies. I spent about five years at Pluralsight as well as some other EdTech companies before I landed at Recursion, where I've been for the last three years working to help people through drug discovery.
BP So I'm a little bit familiar with the use of artificial intelligence for drug discovery, but only in the sense that I've heard a lot about AlphaFold and its potential: it can figure out all kinds of things that would've taken us years, and now we can do them in minutes, which might lead to drug discoveries. When you say drug discovery, what do you mean and how do you go about it?
MM So at Recursion we are looking at really the end-to-end, from how you even know where there are opportunities through bringing drugs to market in the clinic using AI. And the way we approach that is we have labs in Salt Lake City, where I currently am, as well as in California and in the UK, where we are generating data that's fit for purpose for machine learning models. So we're not running experiments that are designed for a human to look at the output. We're generating image data, video data, compound structure data, all of these things that can then be looked at with models in this massive search space. So we have about 45 petabytes or so of our own high-dimensional biological data that we've generated, and then we have a fair amount more data from partnerships as well. And we use that space to ask, where is there an interaction that currently isn't known that could lead to a cure for a disease? And then from there, be able to go, “From where we see that interaction, are there ways that we can quickly get to what a compound might look like in a human? How do we understand how this will interact? How do we understand where there are opportunities?” And really the goal is to be able to bring more drugs into the top of the funnel of drug discovery in the hope that we can have more ultimate success with those: both more clinical trials and more successful clinical trials.
BP And Ellen, we went over your title, which continues to extend and morph and change in new and exciting ways. For folks who are listening, quickly clarify: what do you mean by the OverflowAPI suite, and what are the ways that customers or partners interact with it?
EB Yeah, happy to. And actually, Maureen, funny enough, I was working in bootcamps and the EdTech space myself before I came to Stack, so definitely very common there as well. So OverflowAPI is the portfolio of data and knowledge solutions products that we have here at Stack Overflow. So if you think about the umbrella of Stack Overflow’s community of public knowledge, whether it be Stack Overflow or Stack Exchange sites such as Super User or Ask Ubuntu, the sum total of questions and answers on our sites, as well as community validation of the correctness, recency, and relevance of that knowledge, is incredibly valuable for a lot of enterprises as they build AI tools. We've been working to partner with folks in the space. There are some public partnerships with the likes of OpenAI and Google, but we've been expanding to think about the opportunities for community knowledge to be leveraged in a variety of products: everything from large language models, to smaller enterprises looking to scale their AI products with real-time RAG and augmentation, to large tech enterprises who are interested in combining that knowledge with their own internal knowledge bases to make developer teams more efficient. So my team is actually thinking about what the products in that space are, how we build for new and emerging needs in that category, and what it means to accelerate technology teams, not just with AI but with better knowledge and insights about how they can proceed and what they can trust in that space overall.
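[Editor's note: for readers curious about the mechanics of the retrieval-augmented generation pattern Ellen mentions, here is a minimal, self-contained sketch. The snippets and the bag-of-words scoring are illustrative stand-ins, not Stack Overflow's actual implementation; a real system would use learned embeddings, a vector index, and the OverflowAPI feed.]

import math
from collections import Counter

# Stand-in community Q&A snippets; a real system would pull these from a
# licensed knowledge feed rather than a hardcoded list.
SNIPPETS = [
    "Q: How do I read a file line by line in Python? A: Iterate over the file object.",
    "Q: What does git rebase do? A: It replays commits onto a new base commit.",
    "Q: How do I parse JSON in Python? A: Use json.loads on the string.",
]

def embed(text):
    """Toy bag-of-words 'embedding'; real systems use learned vector embeddings."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norms = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norms if norms else 0.0

def build_prompt(query, k=2):
    """Retrieve the k best-matching snippets and prepend them to the prompt."""
    q = embed(query)
    ranked = sorted(SNIPPETS, key=lambda s: cosine(q, embed(s)), reverse=True)
    context = "\n".join(ranked[:k])
    return f"Use this community knowledge to answer:\n{context}\n\nQuestion: {query}"

# The assembled prompt would then be sent to whatever language model is in use.
print(build_prompt("how do I parse a JSON string in Python"))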
BP So Maureen, these days when folks talk about AI, they kind of use that as a blanket term to refer to mostly what's happening in the generative AI space, but let's focus in a little on what you're doing. When you talk about AI and drug discovery, are we talking about LLMs? Are we talking about something completely different? What are the sort of tools and technologies that you're using?
MM Yeah, so really kind of all of the above. I think we know that people haven't solved the drug discovery problem of how we cure disease at scale, and so I think we have to be looking at a variety of approaches. So we, for example, have LLMs that crawl our data search spaces to look for unique biology. We also have our own foundation models, image foundation models, to look at phenomics. We have one hosted on NVIDIA’s platform called Phenom-Beta, and we used to have an internal version of it: a masked autoencoder that looks at huge amounts of image data and is able to pick out cell structures and changes in cell structures and really rebuild that space. And then we also think about it in our day-to-day workflows, using LLMs for advancement, for coding tools. For all of those pieces, really, we try to make sure that we are looking at all of our available options and saying, “What are the tools that make sense for this job here?”
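[Editor's note: Recursion's Phenom models aren't public code, but here is a minimal sketch of the masked-autoencoder idea Maureen describes: hide most patches of an image, reconstruct them from the visible ones, and learn representations of image structure as a byproduct. All sizes and names are illustrative; assumes PyTorch.]

import torch
import torch.nn as nn

PATCH, DIM, MASK_RATIO = 16, 128, 0.75

class TinyMAE(nn.Module):
    """Toy masked autoencoder: encode visible patches, reconstruct hidden ones."""
    def __init__(self, img_size=64):
        super().__init__()
        self.n = (img_size // PATCH) ** 2          # number of patches per image
        self.patch_dim = PATCH * PATCH             # single-channel patch pixels
        self.embed = nn.Linear(self.patch_dim, DIM)
        self.pos = nn.Parameter(torch.zeros(1, self.n, DIM))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, DIM))
        layer = lambda: nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.decoder = nn.TransformerEncoder(layer(), num_layers=1)
        self.head = nn.Linear(DIM, self.patch_dim)

    def patchify(self, imgs):
        # (B, 1, H, W) -> (B, n_patches, patch_dim)
        b = imgs.shape[0]
        x = imgs.unfold(2, PATCH, PATCH).unfold(3, PATCH, PATCH)
        return x.reshape(b, -1, self.patch_dim)

    def forward(self, imgs):
        patches = self.patchify(imgs)
        tokens = self.embed(patches) + self.pos
        b, n, d = tokens.shape
        keep_n = int(n * (1 - MASK_RATIO))
        shuffle = torch.rand(b, n).argsort(dim=1)   # random patch order per image
        keep, hide = shuffle[:, :keep_n], shuffle[:, keep_n:]
        idx = keep.unsqueeze(-1).expand(-1, -1, d)
        encoded = self.encoder(torch.gather(tokens, 1, idx))  # visible patches only
        # Rebuild the full sequence with mask tokens standing in for hidden patches.
        full = self.mask_token.expand(b, n, d).clone()
        full.scatter_(1, idx, encoded)
        recon = self.head(self.decoder(full + self.pos))
        # Compute reconstruction loss only on the patches the model never saw.
        mask = torch.zeros(b, n, device=imgs.device).scatter_(1, hide, 1.0)
        return (((recon - patches) ** 2).mean(-1) * mask).sum() / mask.sum()

# One training step on random stand-in images (real inputs would be microscopy).
model = TinyMAE()
loss = model(torch.rand(4, 1, 64, 64))
loss.backward()
print(f"reconstruction loss: {loss.item():.4f}")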
BP Ellen, from your perspective, what do you see both inside of Stack Overflow and with the clients and partners? What's the variety of the different flavors of AI that are being used? And I know we often say it all comes back to the data in the end, but are they being utilized in ways that are complementary? Are they in their own lanes? What's happening in your world?
EB I mean, I think the end use cases are slightly different from what Maureen just described. Typically our expertise at Stack Overflow is more in the developer space and the engineering space. But a lot of those methods for accessing data and leveraging AI in new ways to generate insights, whether it be crawling an existing data space to look at something from a new angle or aggregating and generating new insights by looking at images, those same patterns exist within the developer space as well, which is super fun to hear. So while the end use case might be different, a lot of the methods under the hood are actually starting to be shared across verticals, which I think is great. When it comes to Stack Overflow, a lot of what we think about with the Stack Exchange sites goes back to that developer efficiency piece. So we're looking at use cases like: how do we help organizations identify which piece of knowledge might be most contextualized and specific for an engineer working in a particular code base, and how does that look different from enterprise A to enterprise B? It might be very different in different contexts. Conversely, one of the interesting things that we see is that a lot of the community knowledge around math and mathematics has been really compelling for organizations thinking about building logic proofs into LLMs. So how do we enable organizations to build reasoning models, or go beyond pure generative AI, into structuring how we think about what the right answer is when you string lots of insights or even agents together? So there's a lot in there, but certainly some of those newer use cases are starting to come to light.
MM Yeah, I mean, I love thinking about what the template engines are, what the structures are that we need within these. I think that's where we really see huge advancement: not just, “Oh, are we using an LLM to help us write our unit tests? Cool, we should do that too,” but also how do we train these models, how do we give them the right amounts of data to bring us the structure and the processes we know work, rather than just fully solving for that. And I love that that's what you guys are thinking about.
EB I think it's interesting to watch the industry evolve too, because a year or two ago it was a lot about summarization and chat and how we do those things really well. And there are certainly some questions still to be answered in those spaces, but it's exciting to me to see the industry move beyond that a little bit more: how do we move into reasoning and research and evaluation and some of those higher-order tasks that really hadn't been explored to date with quite as much insight?
BP Maureen, what are some of the challenges you're seeing? I mean, this could be everything from cleaning the data to having enough GPUs to using up all the power in your local town. People have different problems depending on their scale. What are you experiencing?
MM I think one of the challenges we're always thinking about is data storage and retrieval, because we have a lot of data, as I mentioned, and we are generating more data all the time, and we've hit all kinds of fun issues with that. At one of our sites, the volume of data coming out was exceeding the bandwidth of the fiber; we had to increase the diameter of the fiber out of that building, out of that lab, to be able to keep up with the scale. One of the things I think about a lot is that we just combined with a company in the UK, and so I'm thinking a lot about data locality– I actually live in London now– and how you operate well across regions. Because we have our cloud environment– we are a Google Cloud partner– but we also have our own HPC, our own supercomputer, BioHive, here in Utah, which is the 35th largest supercomputer in the world. We want the data that's most relevant for our models there, but we don't want to constantly be thinking about egress and moving across regions, especially now that you add an international element to that. And so we have certain ways that we approach that, including deciding what core data always needs to live there, and then thinking ahead to what can come straight out of the lab and into the HPC environment before it even hits the cloud, and what needs to go in different places and locations. And I have a team dedicated to object storage that works on a lot of these problems, from archiving things that we don't think we'll need access to in the future to being as intelligent as we can about not losing the data we have. Making it as accessible as is reasonable while being sensitive to cost is a big challenge for us.
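[Editor's note: Recursion's actual storage setup isn't public, but since Maureen mentions Google Cloud and archiving cold data, here is a minimal sketch of one common mechanism: object lifecycle rules that automatically move aging lab output to cheaper storage classes. The bucket name and age thresholds are hypothetical; assumes the google-cloud-storage Python client and configured credentials.]

from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()
# Hypothetical bucket holding raw lab output; not a real Recursion bucket.
bucket = client.get_bucket("example-lab-output")

# Move objects to colder (cheaper) storage classes as they age, so rarely
# accessed data stops competing with hot data on cost.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=365)
bucket.patch()  # push the updated lifecycle configuration to Cloud Storage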
BP I asked that question about some of the physical and logistical needs, and I was expecting you not to talk about that, but now I'm insanely excited to hear about the physical and logistical needs, because that's, for me, where the rubber meets the road. It's like, “We need more cable.”
EB The infrastructure needs to be there or else the whole system breaks down.
BP That stuff to me is wild. That is on a whole other level.
EB So my team thinks a little bit less about how we implement AI ourselves, because we're more on the infrastructure side, and a little bit more about how we provide infrastructure to other teams in the space. A lot of what Maureen just discussed around capacity challenges and relevance challenges comes up: a lot of folks are struggling with, “We have all this data. What data is most relevant to this context?” And how we think about balancing that with performance, both from a speed perspective and from an accuracy perspective, is really, really important. I would also say another challenge that folks are broadly facing across the industry right now is productizing. Maureen and her team are probably well ahead of the curve in this regard, in part because of the space they're in, but a lot of engineering and product teams right now are really struggling with, “Okay, I made this AI concept, and it's maybe leveraging a foundational model and maybe leveraging some internal data that we have, but how do we bring that to scale? How do we make it better? How do we iterate on it over time?” A lot of the conversations my team has been having with folks in the industry really center around that problem. Some have focused on seeing agents as a solution to it, but in some ways agents are an abstraction of that problem further up the stack. So there's a lot to unpack there in the industry overall, but the capacity challenges, the throttling challenges, and the protection of your data are central across the board for sure.
MM Yeah, I think that because of where we play, a lot of those are problems that honestly I get to punt on, because most of our users are either at our company or at other pharmaceutical companies that we have close partnerships with. So if we think about large usage of a model for us, we're talking maybe tens of people, hundreds of people, and rarely thousands of people on a day-to-day basis. And so I think it's interesting to see the other side of it and the things that you’re looking at there, because they're just different types of problems, and definitely things that I need to start thinking about as we grow.
EB I also think sort of one of the unique things about Gen AI and AI more broadly right now is it's not nearly as linear as people expect it to be. That may seem obvious on the surface, but when you really start to dig into the problems of scale, the compelling argument of software over the last 15 or so years was, “Hey, software can scale almost infinitely. Software is eating the world,” kind of energy. But generative AI is a lot more non-linear. There's a lot more between the generation of new content, the orchestration of agents or other data more broadly, and factorization of data fundamentally. You need to think about kind of a lot more of those edge cases in order for things to come to scale and be performant at scale. So it's opening up a whole new set of challenges I think for engineering leaders more broadly.
BP So Maureen, I guess I wanted to ask you: are there emerging trends within the last 6-12 months that you've gotten excited about, that you think are going to help push what you and your team are doing forward?
MM Yeah, I mean, there are a lot on the scientific side of the world where I quickly get out of my depth as an engineer, so I'll leave those for the scientists. But in my world, we're really excited about a lot of the AI-assisted coding tools right now. Late last year we spun up an AI Coding Guild to really hone our practices and tools and the education of our workforce, because I think we've all seen that the way you use those tools dictates the outcome and success you get. Having a group internally that owns teaching people how to make good use of the different tools available to them, and that experiments and tests in programmatic ways with the new things coming online, so that we're informed and proactive about the changes we're seeing evolve, has really shaped the way we onboard new developers, the way we work with our dev teams, and the way they think about their work. And I think we're just barely dipping our toes into that, and I'm excited to see what pushing that and really investing in it can do.
EB As a team that plays a lot in the developer space, you're certainly not alone in asking that question. I think almost every enterprise out there has some flavor of what you just described: how do we use these tools, how do we educate folks, how do we onboard, how do we give feedback? And I think there are still a lot of questions about security and compliance that go alongside that as well. Making sure that your code doesn't end up in some third-party code base somewhere is a security risk that a lot of CTOs out there are struggling with right now, from what I've heard across the board. But on top of that, there's a really compelling argument that software engineering is evolving in the tools it uses and how it thinks about what good code development looks like, particularly at different levels of maturity. An entry-level software developer might use a code assistant tool in a very different way than someone who's 5, 10, 15 years into their career, and depending on what kind of work they're doing as well. So, going back to our original point, I think it's also a big plug for L&D in organizations right now: to think about developing engineering teams and providing them with the right tools to accelerate based on different levels of baseline knowledge in existing employees. And I think there's a lot more efficiency to be gained, especially at the top of that range of experience, with the tools that we have right now, but starting to solve for some of those other engineering profiles is probably a really exciting frontier as well.
BP The last question that we had prepared here was something along the lines of, “How do you support innovation within your engineering team while maintaining high standards required in biotech?” It sounds like Maureen, you've got this guild and are focused on best practices, so talk to us about some of the ones you think have been most important that have developed out of working on this in a very hands-on way.
MM I think when I was reading that question, I was also thinking a lot about how it implies a trade-off: that innovation has to come at the expense of high standards in biotech. I actually don't think that trade-off necessarily exists. I think innovation can support high quality. Take code reviews, for example. We know we want really good code reviews, and we say that's a quality standard, but they're not always the most interesting part of people's day. Can we take the parts that are just catching syntax changes out of the equation, so people can focus on the things only they can do really well, like looking at our systems and our structures? For me, that trade-off doesn't exist; we want to be constantly pushing ourselves forward to say, “Can we do this better? Can we do this faster?” Because the reality of pharmaceuticals is that a lot of our early trials were in the rare disease space, in things that don't have cures, and time really matters to those patients even if it feels esoteric to us. So something we try to keep in mind is that there is a real cost to people in not moving fast, and so using innovation to allow us to move quickly into the market is key to getting to the high standard that we want.
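[Editor's note: one common way teams take the syntax nitpicks Maureen mentions out of human review is to run automated format and lint checks before a reviewer ever sees the code. A minimal sketch, assuming Python tooling like black and flake8 is installed; the tools and gate are illustrative, not Recursion's actual setup.]

import subprocess
import sys

def run_checks(paths):
    """Run formatter and linter checks; a failure here blocks review, so
    human reviewers can focus on design and correctness instead of style."""
    checks = [
        ["black", "--check", *paths],  # formatting: fails if files need reformatting
        ["flake8", *paths],            # linting: catches unused imports, syntax slips
    ]
    for cmd in checks:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return result.returncode
    return 0

if __name__ == "__main__":
    sys.exit(run_checks(sys.argv[1:] or ["."]))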
EB I was just going to say, I think that's a really good point. We often think of the two things as diametrically opposed, meaning efficiency and innovation have to be inherently at odds. But to your point, Maureen, I actually think one of the really interesting things in the space right now is that AI is driving consumer expectations to change. What people used to expect to wait on, they no longer do, and in some cases, what they didn't expect to wait on before, they want even faster now, which in healthcare, to your point, is really important. How do we automate and eliminate all the things that are known, and then accelerate teams to push their thinking on the things that aren't, so that we can do better there? That's really the central theme across all the industries right now.
MM Ellen, I'm curious: you talked about hearing a lot from CTOs about the security and privacy question. Are you seeing any good answers in how organizations are approaching this and that balance?
EB Yeah, I think it's a mix. It also really depends on the vertical. If you think about banking or financial services or healthcare, those industries tend to be a little more attuned to those risks than, say, consumer products, as a contrast. So industries really matter there, as do the standards they're held to, at least on a legal and privacy basis. I will say my original point is a little skewed towards the banking take overall, but I think CTOs across industries are really interested right now in understanding who has access to their data, what data is going into the models and services their teams are consuming, and the ability to validate and protect the boundaries of each of those things. That's sort of the best balance the industry has right now: “Let's partition all of those things and only combine them when they need to be combined.” But even that's not perfect. Generation still has hallucination, and while the industry has gotten a lot better there, thinking about how we store data at a deep level, and who can access it and when and how, are the root questions to ask to build a secure foundation for a lot of industries right now.
MM I think we've also been looking a lot at which code bases are actually high risk. A lot of the code you write is just not necessarily wildly proprietary. Not to say we want it out in the world, but there are higher-risk and lower-risk parts of our business, and we choose to test on the lower-risk parts and things that are less directly interfacing there.
EB The same thing applies to how my team thinks about community knowledge. There are some areas of our community knowledge that are really well-known and really public. There are other portions that are a little more subjective. I always use this example– it's not necessarily in the high-risk category, but it's a little more subjective: how do you make the best cup of coffee? I have an opinion. I think Ben probably has an opinion, and so do you, Maureen, but they might not be the same. So subjectivity is one element of it. But we're also less inclined to encourage reuse of things like the legal knowledge or healthcare knowledge on our sites, because we know those things are a little more high risk– to your point, Maureen– less from a PII perspective, which we absolutely avoid across the board, but more because, when combined with other things, this could potentially create the wrong scenario. And so being intentional about what we include and what we don't is really central to enabling our partners to use community knowledge in a way that's safe and protected, for their own benefit and for their users' benefit as well.
[music plays]
BP All right, everybody. Thank you so much for listening. That was a really terrific conversation; I hope you enjoyed it. As always, I want to shout out a user who came on Stack Overflow, shared a little bit of knowledge or curiosity, and in doing so helped our whole community. Awarded 11 hours ago to Den: a Populist Badge for giving an answer so good it got a score double that of the accepted answer. “How to detect scroll direction programmatically in SwiftUI ScrollView.” Den has your answer, and it's a good one, better than the accepted answer even. Congrats on your Populist Badge. As always, I am Ben Popper. I'm one of the hosts here at the Stack Overflow Podcast. If you have questions or suggestions for the program, shoot us an email: podcast@stackoverflow.com. And if you liked what you heard today, the nicest thing you could do is tell one person to listen to the Stack Overflow Podcast, or leave us a rating and a review.
MM I'm Maureen Makes, VP of Engineering at Recursion. I can be found mostly on LinkedIn. I'm not necessarily on the other socials as much as I should be. And we have some open roles at Recursion in our Salt Lake, Toronto, London, and New York sites right now, so go check out our hiring page there. I think we're always interested in people looking to really stay on the edge of what we're doing in technology.
EB Awesome. I'm Ellen Brandenberger. I'm Senior Director of Product Management for OverflowAPI, the knowledge solutions products here at Stack Overflow. Like Maureen, you can find me on LinkedIn. And similar, we are hiring at Stack Overflow, so if you want to help us build sort of the future of knowledge management for developers and other technical roles, check us out at stackoverflow.co.
BP Sweet. All right, everybody. Thanks for listening, and we will talk to you soon.
[outro music plays]