The Stack Overflow Podcast

Medical research made understandable with AI

Episode Summary

On today’s episode we chat with two leaders from Sorcero: CEO Dipanwita Das and CTO Hellmut Adolphs. Sorcero uses AI and large language models to make medical texts more discoverable and readable, helping knowledge spread more easily and increasing the chances that doctors and patients will find the solutions they need.

Episode Notes

Sorcero uses a mix of natural language processing, generative AI, and even more old-school symbolic AI, where they craft their own ontologies, to try to ingest the river of new medical data and make it easier to search and comprehend.

Less than 0.2% of the global population can read a medical paper! AI can help make these dense works up to 700% more readable.

Medical affairs teams are the groups inside big pharmaceutical companies that help surface the right information to health providers. It’s hard for them to keep up with the thousands of new articles and research papers published each month, much less unpack that information.

Connect with Dipanwita Das and Hellmut Adolphs on LinkedIn. 

Congrats to Lifeboat badge winner John Carrell for saving the question Self join vs. inner join with an excellent answer. 

Episode Transcription

[intro music plays]

Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I'm your host, Ben Popper, Director of Content here at Stack Overflow, joined as I often am by our blog editor supreme, newsletter aficionado, and Seattle resident for the moment, Ryan Donovan. What's going on, Ryan?

Ryan Donovan Not much, man. How’re you doing? 

BP I'm pretty good. So I live in the United States of America and have felt for a long time like the medical system and the healthcare system have some serious issues, so when I got a pitch to talk to some folks who are using technology to try to improve the experience for doctors and patients, I was very excited. We're going to be chatting today with two folks from Sorcero, which is an AI analytics and insights company focused on the healthcare and life sciences space. So without further ado, I want to welcome Hellmut and Dipanwita to the program. Welcome to you both. 

Hellmut Adolphs Thank you. 

Dipanwita Das Thank you, Ben. 

BP So D, let's start with you. I know you have done startups before. You've been through Y Combinator, you've gotten a degree in business. What brought you to where you are today at this confluence of AI and healthcare?

DD That’s a great question. Well, a couple of things. Most specifically, before I started Sorcero with Walter and Richard, I had spent a number of years in public health at the intersection of public health data and technology, and really discovered that one of the main uses of AI, being able to compute and analyze very large volumes of information faster and more efficiently than a human being alone can, is most effective, most powerful, and most meaningful in healthcare and healthcare data. That was really the origination of this. More recently, as both a caregiver and a patient, I went on to discover some of the nightmares of how patients actually get treated, and in fact sometimes why they don't. And it isn't as simple as saying, “I have a good doctor or a bad doctor.” It's about whether the doctor is set up to treat patients effectively. Do they have all the information they need, in the right format, and so on and so forth? And in fact, whose responsibility is it to gather and communicate that information? As I began peeling back the layers of the onion, we zeroed in on what Sorcero is today and that medical analytics interplay, and really the need to build a company that is dedicated to unlocking that specific data and analytics challenge.

BP Very cool. And Hellmut, how about yourself? How did you find yourself in your current role, and tell folks a little bit about what it is you do. 

HA Sure, sure. So I am the CTO at Sorcero. I have been in technology pretty much my whole life, a couple of decades working on different verticals in tech, the last 15 years probably in different tech companies and startups, some of which I co-founded, in big data and AI analytics. The last few years I've been in healthcare. I do have a lot of commitment to the improvement of healthcare in the country, and I've approached it from different angles, trying to help. My previous startup was also in AI, but this one is different because not only am I proud to be with an awesome team, but we're also focused on specific healthcare problems, leveraging some of the latest AI technologies. I've been with Sorcero for about a year now. 

BP Cool. 

RD So obviously the data that comes from your medical experiences is very powerful. I've seen ML stories about finding tumors early, but it's also treated very sensitively in the US with HIPAA laws and such. Can you talk a little bit about how you all use data to improve healthcare? 

DD Sure. One of the things about Sorcero is that we're not used at the point of care, which means that our platform is not being directly accessed by a doctor or a patient. In fact, we stepped away a little bit from that and are looking at the infrastructure that supports better decision making by the doctor, or increases a patient or a patient advocacy group’s ability to advocate for the treatment, which means that we’re working with literature. On the third party external data side, we primarily focus on the masses of literature out there. We also work with clinical trial data or with patents. So on the third party side, there's a whole bunch of unstructured scientific information growing at an insane pace. On the internal side, we support our customers in unpacking and better computing and analyzing the conversational data, the survey data, and the notes they're collecting from the conversations their teams are having with physicians. As we unpack and help them analyze that information, we're able to help them pinpoint how to treat a patient population, so a group of patients with a specific set of indicators and then a treatment that's in the market. And the treatment could be one thing or it could be a combination. As you know, some of the most exciting new treatments are a combination. So that's where we play, at the intersection of those two things.

BP I'm going to repeat that back to you like I'm five and you tell me if I've got it right. Let's say there’s a cohort of patients and a hospital seeing them again and again, and the outcomes are not terrific and they're wondering if there are alternatives. Your technology has been reading all the latest literature and looking at studies published by academics, but also by pharmaceutical companies and by patient advocacy groups. And when they query you to say, “Is there something we could be doing differently?” It says, “Yes. If you look at X, Y, and Z, I've read everything out there,” and as the AI can kind of synthesize it for you, “you might try this because it's had better outcomes.” And then similarly, like you said for the patient advocacy group, they could go to the hospital and say, “We don't feel like our needs are being served. We've talked to this system and it said a better alternative would be X, Y, and Z. Can you please try to provide that?” and then they try to make that change happen.

DD That was beautiful, with one difference. The querying is not being done by the physician or the provider. The querying is being done to the healthcare company, to the life sciences company. So when the doctors are at a point when they're unsure of how to use a drug that's in market or unsure of how to treat a specific patient that they're looking at, they ask medical affairs, “Hey, do you have any data?” And so it's actually the medical affairs folks who are using our platform and asking exactly the questions that you just laid out and then preparing a package for physicians, partly because they have much better, broader coverage and are experts themselves. Our users are medical experts themselves.

BP Gotcha. That makes sense. 

RD Yeah. I think the clinical trials piece is very interesting. There's so much data produced around it and it's such a risk bringing a drug through the clinical trial process. What sort of data analysis do you all have that simplifies that? 

DD Hellmut, I think you should speak to PLSs and accessibility, both across patients and physicians.

BP Just so folks know, what does that abbreviation you're using stand for? 

DD Yeah, important. Less than 0.2% of the global population can actually read a medical document. That's where we start, which means for any of these things that are coming out, the readability is very, very low, even for doctors. A PLS is a Plain Language Summary. Sometimes it's also called a patient lay summary, depending on who the end audience for that consumption is. PLSs are generally known to increase engagement with a piece of content by 16x, and in fact, PLSs are regulatorily mandated in the EU, which means every drug company actually has to create one as a companion piece to everything they're putting out. What we find is that producing PLSs has been a purely manual process so far, and usually they actually come out several grades higher than the required grade seven readability. And this is great because you can measure it; there are so many indices to measure readability. So this is a very appropriate space to apply generative AI. What we've done is create a PLS generation workbench that supports the medical writer in writing PLSs faster, and we can generate one in minutes. It's vastly cheaper and it's much higher quality. We're finding that the PLSs we're generating are 700% more readable than what's in the market, and we've actually done the studies and the numbers on this. So that's what PLSs do. When you're talking about a clinical trial report, it's fresh, you know it's important, it's something we need to know, but it's inaccessible both to doctors and certainly to patients, and PLSs make it much more accessible.
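The readability indices D mentions are simple, computable formulas. A minimal sketch using the Flesch-Kincaid grade-level formula, one common index (illustrative only; the transcript doesn't specify which metrics Sorcero actually uses, and the syllable counter here is a rough heuristic):

```python
import re

def syllable_count(word: str) -> int:
    """Rough syllable estimate: count groups of consecutive vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59.
    A score near 7 corresponds to the grade-seven target mentioned above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(syllable_count(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)
```

Because the score is just arithmetic over sentence, word, and syllable counts, a writing workbench can recompute it after every edit and show whether a draft is trending toward the target grade.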

BP This is very interesting.

HA So that's a good example on the PLS side. We basically summarize publications for consumption that requires them to be a little bit more digestible. And the way we do that is with a combination of multiple machine learning processes and algorithms, together with generative AI, to generate the summary. Of course, this has a lot of complexity around making sure it's accurate and that the generative AI is not generating data that is not true. I'm not sure if you guys have heard about hallucinations, so we employ a lot of different steps in analyzing the results to make sure the output does not include things that were not originally in the text we're summarizing. We try to identify new entities that were somehow magically inserted there. We identify changes in context and terms that are not mentioned in the original text, and we make it easy for the user, in this case the medical expert, the person who's generating the PLS summary, to point out things they should look out for. And it's as simple as highlighting, “I think this looks like something you want to verify before you say this summary is finalized.” So we are a company that does a lot of AI-plus-human work and accelerates their work; that's our main thing. And we leverage the human attention to make that output accurate and valid, but we make it very easy for them to identify possible occurrences of hallucination. 
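The kind of check Hellmut describes, flagging terms in a summary that never appear in the source, can be sketched crudely with a capitalized-term comparison. This is illustrative only: the drug names are invented, and a production system would use a real biomedical entity recognizer rather than this capitalization heuristic:

```python
import re

def entity_candidates(text: str) -> set[str]:
    """Crude entity proxy: capitalized words that are not at the start
    of a sentence (where capitalization carries no information)."""
    cands = set()
    for sentence in re.split(r"[.!?]+\s*", text):
        words = sentence.split()
        for w in words[1:]:  # skip the sentence-initial word
            w = w.strip(",;:()\"'")
            if w[:1].isupper() and w[1:].islower():
                cands.add(w)
    return cands

def flag_hallucination_candidates(source: str, summary: str) -> set[str]:
    """Terms present in the summary but absent from the source: candidates
    to highlight for the medical writer to verify before finalizing."""
    return entity_candidates(summary) - entity_candidates(source)
```

The point is not that the heuristic is reliable, but that the output is a set of highlights for a human reviewer, matching the AI-plus-human workflow described above.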

BP So just to put this in context, Ryan and I have actually been working a lot with folks at Stack Overflow for the launch of OverflowAI on some similar ideas on how do you do retrieval augmented generation where we say, “Don't go out and look at the whole internet and then tell us an answer. Look just at this corpus of text that we think is relevant, you're much less likely to hallucinate.” I've also seen examples of AI that were trained on Stack Overflow questions to say, “If somebody asks this question and you're going to explain it, try to explain it like one of these top rated Stack Overflow answers, because that's the most accessible, the one that's easiest to understand, the one that's delivering the best information.” So it's kind of cool to think about that in the medical context. Hellmut, obviously you're leveraging the new stuff, the generative AI and the LLMs, but also are you working with some of the techniques that we may be familiar with from older AI breakthroughs that are more focused on data analytics or statistics or trying to wade through, like you said, mountains of texts that are changing every day with new releases and identify what's valuable and what's not?
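Retrieval augmented generation as Ben describes it can be sketched in a few lines: rank a fixed corpus against the query, then build a prompt that restricts the model to only those passages. A toy word-overlap ranker stands in here for a real embedding search, and the prompt wording is invented for illustration:

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank corpus passages by word overlap with the query; keep the top k.
    Real systems would use vector embeddings, but the shape is the same."""
    q = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda doc: len(q & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    """Construct a prompt that confines the model to the retrieved context,
    which is what makes hallucination less likely."""
    context = "\n".join(retrieve(query, corpus))
    return (f"Answer using ONLY the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")
```

The design choice doing the work is the narrowed context: the model is never shown the whole internet, only the passages the retriever judged relevant.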

HA Yeah, we actually have a pretty diverse set of AI services that we manage inside our infrastructure to support our products. We have all kinds of things, ranging from NLP services, so analyzing text and producing insights out of data or identifying entities, to combining that with other services, including annotation, labeling, knowledge graphs, and your basic machine learning services, supported in many cases by cloud platforms that enable us to do so. We obviously now leverage GenAI as well. We do all sorts of data processing and enrichment on data we consume from multiple sources, including open data and customer data, so we have the complexity of dealing with heterogeneous, diversified data from open sets and proprietary sets from our customers, which, like you guys mentioned earlier, adds a whole other level of complexity around privacy, security, protection from data leakage, compliance, all those things. Scaling a data platform that supports a variety of models at the same time, that is secure, protects IP and privacy, and is compliant with GDPR and with HIPAA, you can imagine the amount of stuff we have to do to achieve that. And there are other services, including privacy-preserving services, things we do purposely to improve interpretability and explainability, which is a big challenge with AI. At the same time, we support custom AI models from our customers, which is one of our superpowers, let's say. We built this infrastructure and our MLOps capabilities to allow our customers to come and say, “Hey, we have been experimenting with this particular model. Can we run it through the data in your infrastructure and combine it with your other insights and the data that we're going to give you?” So we allow them to play with the different pieces of the puzzle and derive all sorts of interesting insights and analytics.

DD One of the things that Hellmut has not told you is that we, from the very beginning, actually have taken a hybrid approach to AI, which means that we've used both symbolic and non-symbolic approaches. We actually use ontologies extensively throughout our stack. That's a lot of where our proprietary methodologies come from, which makes what we produce much more measurably accurate, benchmarkable, explainable, transparent. So yes, we’ve definitely always taken a hybrid approach, and we’ve also philosophically been in the AI plus human camp, and you’re going to see that all over the UX of our products. And because we serve experts, this is the right way to go. 

BP This is very exciting for Ryan. He's going to tell you about how LLMs don't know a gosh darn thing, and he's going to want to know about ontology, so I'm excited. 

RD Yeah, I love giving them a symbolic understanding of material. With that, do you have feedback from the users, from the expert customers, in sort of tagging and flagging data? 

HA Yeah, and that's something that we continue to build on. As we evolve quickly as a startup with new features, we always think, “Okay, how are we going to incorporate into this new feature potential feedback mechanisms? And are these models that we can feed into or are our models not in our control?” Sometimes we use models that are pre-trained so we would have to then do things like incorporate this data as either fine-tuning or potentially doing other methodologies to extend and enhance those capabilities like correcting issues around accuracy and things like that. But the feedback mechanism is definitely one important way in which you can show customers that you are committed to explainability and also to improving the system's performance, of course. 

DD Ryan, however, we do spare our customers from having to tag. That seems like an atrocious use of their time. So the tagging of content and concepts across our hybrid dataset is something that is done by machine, in fact using AI. We extensively use AI for thematic labeling and for ontological labeling. However, our customers give us feedback on the suggestions we're giving them, so on the actual output: is this the right recommendation, is this the right analysis? That's where they're giving us feedback, and we can, of course, take that feedback in and do something about it. But it also gives our customers control over that output, where they know they're not going to have to consume whatever we throw at them. They do have that checkpoint. 

HA And there are a lot of other aspects that make life sciences data complex. I mentioned diversity of data sources and heterogeneous formats, but then you also have high dimensionality. When you talk about ingesting data captured from the interactions of a healthcare professional with another healthcare professional, there are a lot of nuanced details, like the tone and the sentiment and the context, that add more complexity to the analysis of that data, especially when we're talking about unstructured data. Then I mentioned regulatory and compliance issues, and finally, domain-specific language. And paired with that domain-specific language complexity is contextual understanding, because sometimes medical terms are used differently in different contexts. So those are things we constantly look out for and apply to our validation and the quality of inferences as we verify them. 

BP And so are you able to gather new medical data from around the globe in different languages, or are you limited to a certain subset of languages that your systems can understand?

DD So we are inherently multilingual, and we currently do sit on a global database. The language our customers are using most extensively is English, and what's coming up next, from what I understand from the customers, is French, Portuguese, Spanish, and then actually Japanese. So yes, there is going to be multilingual support in the future, and we do have a global dataset, which is one of the things our customers really care about, because they're often trying to find a researcher, let's say, in Brazil or in India, who isn't often covered when you're looking at a US-only dataset.

RD So you talk about the global dataset, do you have a central database and do you also have separate sort of customer-specific databases? 

DD Yeah, we do both. Hellmut just talked about tenancy, et cetera, so there's always a room for each customer to live in. But then we have, and continue to build out, what we call the Sorcero Scientific Content Repository, which is where we take all of the third party data that is available, whether it's licensed, open access, or something we buy, and we put it all into a single database, normalize it, unify it, and enrich it. That enrichment means we're also generating a lot of novel data and a lot of novel metadata. Then we can pipe that into different applications for different customers. Our customers get to choose those data sources, but they also often bring their own data to the table. So in the customer rooms, it’s a mix of our core data plus what they're bringing to the table, together. 

RD So you must have a series of pretty complex ETL pipelines and such to manage all this data.

DD Hellmut, what are we doing? 

HA So we are building SEER following the data vault pattern, and it allows us to support, like I mentioned, a very diverse set of data sources, some of them very big, like PubMed and UPMC. We ingest clinical trials, and we are also recipients of the OpenAlex author data and metadata, which we ingest into our system. Those are all huge datasets that require things like deduplication and record tracking, and we also put substantial effort into data lineage tracking. A lot of the time, the customer really wants to know why their specific queries render specific results on certain publication outputs, or which authors are relevant to this or that, and we need to be able to explain that, because they can also go and try to look those authors up in other data systems and try to match, and we need to explain why we decided, or our system decided, that was the output. So lineage is important, and the data vault pattern is also key because it allows us to add additional open data sources, or other data sources, without disrupting the pipeline. And there are different levels, all the way from the raw data vault up to your business analytics levels. That's where there's everything from feeding other purpose-built applications to just your plain analytics through dashboarding and things like that.
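The deduplication and lineage tracking Hellmut mentions can be sketched as a keyed upsert: records that share a stable identifier collapse to one entry, while every contributing source is recorded so the system can later explain where a result came from. The field names and key scheme here are hypothetical, not Sorcero's actual schema:

```python
import hashlib

def record_key(title: str, doi: str = "") -> str:
    """Stable dedup key: prefer a normalized DOI; otherwise
    hash the whitespace-normalized, lowercased title."""
    if doi:
        return "doi:" + doi.strip().lower()
    normalized = " ".join(title.lower().split())
    return "title:" + hashlib.sha256(normalized.encode()).hexdigest()

def ingest(records: list[dict], store: dict) -> None:
    """Upsert records into the store. Duplicate records merge into one
    entry, and the 'sources' list preserves lineage: every upstream
    dataset that supplied a copy of this record."""
    for rec in records:
        key = record_key(rec["title"], rec.get("doi", ""))
        entry = store.setdefault(key, {"record": rec, "sources": []})
        entry["sources"].append(rec["source"])
```

Keeping the per-source history instead of overwriting it is what lets an analyst answer "why did the system return this publication?" after the fact.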

BP Yeah, it's interesting to hear you say that. I think that's kind of one of the core tenets for Stack Overflow as we look to get into the world of Gen AI, that there has to be attribution. If we're going to provide you with an answer for why you should write this code, you also want to know who provided the answer, what their reputation is, what kind of licenses this code has so that I can actually put it into production if I want to. And in your case, you're giving me an answer, but I'd like to check your references, or maybe like you said, dig deeper in. Now that I know which paper was significant, I can go and try to maybe find some more stuff within that that is useful. 

RD What's the p-value on these results? 

DD Oh my God, you just really hit a nerve with that p-value thing. Physicians are influencers unto themselves. I never thought I'd use that word, but they’re the influencers. Our customers are tracking who has the highest influence for their specific therapeutic area and then trying to zero in on them and target them. “If I were retail, I'd already have this,” is the joke. Well, it's not the joke, it's the truth. If I were retail, I'd already have that kind of fine-grained understanding of my target customer and my influencers and how I move them around. In life sciences, we're trying to give them that level of fine-grained analysis of their customer base, who’s influencing them, and who to trust.

BP So I checked out your website and there were a few case studies there. I would love to hear maybe from each of you, tell us about one case study or instance where you felt like, as you mentioned before, there's a positive impact we can measure here on the healthcare system, and then one thing that you're looking forward to in the upcoming year. 

DD Sure. So we recently won an award for our work with AstraZeneca. We have three products in market, one of them is intelligent publication monitoring, which, as the name indicates, supports our customers in monitoring publications pertinent to their therapeutic area. One of our oldest customers in this has been AstraZeneca's oncology team, and because this is public, I can call them out by name. We just won an award with our partners for a zeitgeist in the application of AI to medical affairs transformation. This is incredibly exciting because for the first four and a half years of Sorcero being alive, I had to explain to people what medical affairs was. And so let me tell you why this is so important. Medical and scientific affairs is the team in pharma that is responsible for collecting, analyzing, and distributing absolutely all of the data that governs the safety and efficacy of every one of the products that they bring to market. Making sure that they can do their work means that the physicians will have the information they need, the patients will have the information they need, the regulators will have the information they need, and of course the insurers will as well. So having a case that tells us as a team that our platform is actually moving the needle forward in equipping this team for the future and for improving patient outcomes totally made our day.

HA Awesome. On my side, I can talk about two different things I'm excited about; there are a lot of exciting things in what we're doing. On the technology side, there’s a substantial effort and investment going into building the data asset that we're going to combine with our AI services. There's a mutualism between the two: we enrich the data, but we also use the AI to classify and tag. All of that is growing, and as we partner with more entities and ingest more data, it's going to become richer and richer, so that's very exciting and a big problem to solve. At the same time, when we work with medical affairs, for instance, it's exciting to hear things like, “Hey, these teams are actually looking at opportunities for medications to be applied in other fields where they have not been tested, and the data accelerates that process of discovery.” That to me is super exciting, because you're potentially going to be able to help a cohort of patients by using something that already exists but has not been applied in that fashion. You're analyzing how some providers are actually testing and applying it in the real world, gathering all the data, and bubbling it up to these teams so they can notify and inform providers that it can be used in that way. That’s amazing. So pretty exciting stuff, both on the tech side and in the real world, because you can see how it can make a big difference.

BP Very cool. If you can explain to me maybe later after the episode why I always fill out the form for the new doctor online and then when I get there they still have me fill it out on paper, if the AI could explain that to me I would really like to know. 

RD I don't think that's their domain. 

BP It’s an unsolvable challenge.

DD It is one, however, that I support solving for. 

BP Yeah, we should definitely solve for that.

[music plays]

BP All right, everybody. It is that time of the show. Let's shout out someone from the Stack Overflow community who came on and shared a little knowledge. Thanks to John Carrell, who was awarded a Lifeboat Badge on August 8th for helping to explain the difference between self join and inner join. You've helped 40,000 people, so we appreciate you bringing your knowledge to Stack Overflow. As always, I am Ben Popper. You can find me on Twitter/X @BenPopper. You can email us with questions or suggestions: podcast@stackoverflow.com. And if you like the show, leave us a rating and a review, because it really helps.

RD I'm Ryan Donovan. I edit the blog here at Stack Overflow. You can find it at stackoverflow.blog. And you can reach out to me on Twitter/X @RThorDonovan. 

DD My name is Dipanwita Das, I go by D. I'm the CEO and one of the co-founders of Sorcero. You can find me on LinkedIn. But really, really go look at the website: www.sorcero.com. Ask us questions, go ask for a demo.

HA And I'm Hellmut Adolphs, and I'm CTO at Sorcero. And you can find me also on LinkedIn. And yes, I encourage you to go check out our website. Thank you, guys. 

BP Sweet. All right, everybody. Thanks for listening, and we will talk to you soon.

[outro music plays]