Happy New Year! In this episode, Ryan talks with Jetify founder and CEO Daniel Loreto, a former engineering lead at Google and Twitter, about what AI applications have in common with Google Search. They also discuss the challenges inherent in developing AI systems, why a data-driven approach to AI development is important, the implications of non-determinism, and the future of test automation.
Jetify gives developers a cloud environment for building AI-powered applications.
Check out their blog or explore Jetify Cloud, a suite of managed services designed to make software development easier for teams.
Daniel is on LinkedIn.
Stack Overflow user Dhaval Simaria earned a Lifeboat badge by explaining the Difference between pushing a docker image and installing helm image.
[intro music plays]
Ryan Donovan Hello, everybody, and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I'm Ryan Donovan, your host for this episode, and today we are pondering the question: How are AI apps like Google Search? It's a mystery, and our sphinx for the question today is Daniel Loreto of Jetify. Hi, Daniel. How are you today?
Daniel Loreto Hi, Ryan. I'm doing great. Thank you for having me.
RD So at the beginning of every episode, we like to get to know our guests, see how they got into software and technology. So can you tell us a little bit about how you got to where you are today?
DL I don't know how far back you want me to go. I guess as a kid I was always interested in technology. My dad brought a PC home, an IBM PC– I'm old now, so floppy disks, hard drives, all that kind of stuff– but I just got really interested in it and that's how I got into tech. Anyway, since then, I went to college to study computer science. I ended up working at Google, I worked at Airbnb, I worked at a few startups, I worked at Twitter. So I've been through several big technology companies and several small ones as well.
RD And then you founded Jetify. When was that?
DL I'd say about three years ago we founded Jetify. High-level, we've been wanting to simplify what it takes to develop applications on the cloud. We've been focused on developer environments so far, but we have some products coming soon that are all about AI agents that help with software development and in many ways drive those developer environments to do tasks on behalf of developers.
RD So I'm sure a lot of folks are wondering how AI apps are similar to Google Search. Sort of offhandedly, I would say they're both pretty opaque to the user, a lot of rainmaking and hand-waving, but can you tell us where you're coming from on that?
DL Absolutely. I worked on Google Search– this is back in 2005/2006. At the time, LLMs didn't exist, deep learning wasn't a thing. Google Search was mostly, I would say, a very sophisticated but rule-based system. But what I see as the common thread between the work we used to do there at Search Quality and AI applications today is that they're data-driven systems where it's very hard to predict the behavior just by thinking through the logic that you wrote. If you think about AI and LLMs, there's a lot of nondeterminism. You might have a very open-ended set of inputs. If you're allowing your users to say whatever, and then the LLM can respond in many different ways to those inputs, it becomes very hard to predict the behavior. And so I think you need certain tools and processes to really ensure that those systems behave the way you want to.
RD I feel like they're both similar in that there's a whole industry of folks trying to sort of explain and understand them– the SEO gurus with Google Search, and the prompt engineers with AI. Do you think there's a way we can make AI apps a little less opaque for developers– for understanding them, testing them, and getting predictable results from your AI app?
DL I think there are a few different techniques that people should be adopting when developing AI software. Maybe the first one I'll start with is that you have to look at the data. What that means to me is that, first, you need to store somewhere traces of how the system responds to different inputs. So your user might type something, and internally, let's say you're building an AI agent or a workflow that's taking several steps, and throughout those steps it's creating certain prompts that talk to an LLM. I think you need a system of record where you can see, for any session, exactly what the end user typed, exactly what prompt your system internally created, exactly what the LLM responded to that prompt, and so on for each step of the system or the workflow. That way you get in the habit of really looking at the data that is flowing and the steps that are being taken, as opposed to just looking at the beginning and the absolute end result and not understanding what's happening in between.
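To make that concrete, here is a minimal sketch of that kind of system of record in Python, assuming a hypothetical `call_llm` helper and a plain JSONL file as the trace store. The names and fields are illustrative, not any particular library's API.

```python
import json
import time
import uuid

def call_llm(prompt: str) -> str:
    """Placeholder for whatever model client you actually use."""
    return "...model response..."

def traced_step(session_id: str, step: str, prompt: str, log_path: str = "traces.jsonl") -> str:
    """Run one step of the workflow and record exactly what went in and what came out."""
    response = call_llm(prompt)
    record = {
        "session_id": session_id,
        "step": step,
        "timestamp": time.time(),
        "prompt": prompt,
        "response": response,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return response

# One user session flowing through two steps of a hypothetical workflow.
session = str(uuid.uuid4())
user_input = "Summarize my open pull requests"
plan = traced_step(session, "plan", f"Break this request into steps: {user_input}")
answer = traced_step(session, "answer", f"Carry out this plan: {plan}")
```

The point is less the storage format than the habit: every prompt and every response, per step and per session, ends up somewhere you can inspect later.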
RD So when you were at Google Search, how did you tackle this problem of gathering the data and then understanding it? Like you said, it's very hard to think through the logic of this just looking at the code you've yourself written.
DL So we actually had a few different tools that we used. The first one was, if you worked on Google Search, then on any search page you could add a little CGI parameter to the URL, and below any result on that page you could see debugging information explaining exactly what steps the ranking algorithm had taken, what the scores were, and why it decided to place that result there. So this comes back to being able to understand what the system did. Another tool we had was the ability to stand up two versions of the system, one with a new change and one without, and then automatically run a ton of search queries against both systems and automatically diff the search results that Option A and Option B would generate, and then you would be able to inspect those changes. So you could now see, “Oh, this is how my change actually changed search results for this query.” You can go through all those different queries and kind of see for yourself, and then, because you had that debugging data I mentioned before, if there was something surprising you could dive deeper and try to understand what happened. And so I imagine with AI systems, one can do the same, where you're doing these little diffs or experiments as you're iterating on the system to try to understand, “Hey, if I replay my end user inputs against version one of the system and against version two of the system, how do my outputs change?” Then you inspect those, really get into the details, and understand whether it's better or worse.
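A rough sketch of how that diffing idea might translate to an LLM system: replay a set of recorded user inputs against two versions and diff the outputs. The `run_v1` and `run_v2` functions here are stand-ins for whatever actually runs your workflow.

```python
import difflib

def run_v1(user_input: str) -> str:
    """Current version of the system (stub)."""
    return f"v1 answer to: {user_input}"

def run_v2(user_input: str) -> str:
    """Candidate version with your change applied (stub)."""
    return f"v2 answer to: {user_input}"

# Recorded end-user inputs you want to replay against both versions.
recorded_inputs = [
    "how do I roll back a bad deploy?",
    "summarize yesterday's error logs",
]

for user_input in recorded_inputs:
    a, b = run_v1(user_input), run_v2(user_input)
    if a != b:
        print(f"=== {user_input} ===")
        diff = difflib.unified_diff(
            a.splitlines(), b.splitlines(), fromfile="v1", tofile="v2", lineterm=""
        )
        print("\n".join(diff))
```

In practice you would pull the recorded inputs from the trace log above and eyeball (or score) the cases where the two versions disagree.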
RD Did you ever run into anything that was really surprising in the search results, anything where you're like, “That's not supposed to be there”?
DL Totally. That would happen all the time. I don't think it was my change, but I remember somebody on the team working on this. They were working on image search, and the idea came up, which I think was already used for web search results, that if somebody is clicking on an image, that's probably a better image than the ones that are not getting clicked on. If you think about it, somebody searched for something, they're seeing 20 images, they're probably clicking on the ones they want. It makes sense. So why don't we collect that click data and then feed it back to the search system and boost the images that are getting more clicks? When you try that, it mostly works, but later, as we rolled out and learned more about the data, it turned out that sometimes you have a surprising image. Let's say, for example– I'm making this up– you search for images of dogs, and something went really wrong early in the algorithm and it found a hot dog. So now you're presenting 20 images, 19 are actual dogs, one is a hot dog. The aspect of that image being surprising might actually cause the user to click on it, too. It's like, “Hey, wait, why is this here? I don't understand,” and they might click on it. And so all of a sudden that's a case where you have this kind of reinforcement loop of the signal boosting the wrong thing.
RD Right, the unintended consequences.
DL Exactly, right.
RD So I know a lot of people for AI apps are trying to dig into things like explainability and to try to trace the reasoning. With a rule-based algorithm like Google Search, it's easier to say, “Here's why this was chosen.” Are there ways to get that out of the sort of massive list of parameters of whatever is in the LLMs?
DL Yeah. I think what you're pointing out, just to kind of say it more explicitly, is that in a rule-based system, the rules are deterministic. The nondeterminism comes mainly from the quantity of data that goes through the system, but in the case of an LLM, the LLM itself is probabilistic, nondeterministic, it can give different responses. In fact, over time, it might drift. If you're relying on an external model, which I'm sure many people are, that model is not controlled by you. It might be, well, improving over time, but it might change. To me, that means that one needs to be constantly monitoring some measure of quality, and then you need to couple those measurements with, again, the debuggability and maybe even human judgments that you're doing frequently to understand if a system continues to behave the way you intended. The other thing I'll say is that nondeterminism means that if you are creating a workflow that has several nondeterministic steps, and each of those nondeterministic steps has a probability of producing an erroneous output, well, the error compounds because you're essentially multiplying the error rate across the different steps. And so I tend to think that you also want to kind of control the quality of each step in that workflow, and maybe you need some sort of feedback loop to kind of control the precision or the quality of the output that you're getting. It depends on the application, so I can't be too specific. Some applications might allow for more errors that are presented to the user, and the user is fine, like you're presenting five options and they get to choose, but other applications might need much higher precision. And so depending on the application, you'll need to control those error rates a lot more.
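A back-of-the-envelope illustration of the compounding Daniel describes, assuming each step succeeds independently:

```python
# If each step of a workflow succeeds independently with probability p_i,
# the whole workflow succeeds with roughly the product of the p_i.
def workflow_success_rate(step_success_rates):
    total = 1.0
    for p in step_success_rates:
        total *= p
    return total

# Four steps that are each 95% reliable on their own...
print(workflow_success_rate([0.95] * 4))  # ~0.81, so roughly a 19% end-to-end error rate
```

Four "pretty good" steps already leave nearly one in five runs ending badly, which is why per-step quality control matters.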
RD In some ways, I think the nondeterminism is a feature and not a bug of LLMs. That surprise that they give you is something that is almost valuable about it, right?
DL I agree, I agree. In fact, as we've been developing some agents ourselves, I might run a workflow four times because I actually kind of want to see the variety of outputs and then I end up improving my initial prompt based on what I liked about the different outputs. So it's a little exploratory at first. My prompt might be a little bit vague, like, “Do this,” and I'm just like, “Okay, let's see what you do 10 different times,” and then as I see the output, I’m like, “Oh, I really like how you structured the data here, or I really like how you mentioned this aspect here.” I might then incorporate those requests into my prompt and then start to reduce the variability of the output. Like you said, that initial nondeterminism actually helps me kind of explore the possibility of outputs.
RD Right. So for some applications, for some uses, like you said, that nondeterminism helps you as the user to figure things out, but for a developer who doesn't want such a sort of distribution of responses, what can they do to narrow down how an LLM responds?
DL I think there's a few tools. Obviously, if there are cases where you don't need the LLM and you can do kind of traditional logic in one of the steps of the workflow, you should do that. Where you do need an LLM or a machine learning model, I think you have a few choices. One is that you can be more precise in your prompting, and that's kind of what I was describing before. You can start vague and you're going to get a lot of variability in the output, but as you learn exactly what you want, you can make that prompt more precise and kind of reduce the space of responses that the LLM will produce. If you have a fairly specific use case, eventually I think you can train your own model or fine-tune your own model for that application. A lot of the variability in my head comes from having a general model, and so it's a double-edged sword. They let you solve all sorts of problems, but also they might go in more different directions than a more specific model.
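Here is a small sketch of that progression from vague to precise prompting, with a hypothetical `call_llm` that takes a temperature-style knob; the exact parameter names depend on whichever model API you use.

```python
def call_llm(prompt: str, temperature: float = 1.0) -> str:
    """Placeholder for your model client; most APIs expose a temperature-like setting."""
    return "...model response..."

# Exploratory phase: a vague prompt and a higher temperature, sampled several times
# to see the range of things the model might do.
vague_prompt = "Summarize this incident report."
candidates = [call_llm(vague_prompt, temperature=0.9) for _ in range(5)]

# Once you know what you actually want: a precise prompt, a lower temperature,
# and an explicit output format to shrink the space of possible responses.
precise_prompt = (
    "Summarize this incident report as exactly three bullet points: "
    "impact, root cause, and follow-up action. "
    "Respond with only the bullet list, no preamble."
)
final = call_llm(precise_prompt, temperature=0.2)
```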
RD So if you need the specific use cases, you’ve got to fine-tune it to that use case.
DL I think at Jetify we're still playing with this, but sometimes you can have two systems that can kind of help correct each other. So for example, let's just say in the case of software development which is a lot of what we're thinking about, let's say you're generating some code via the LLM. In that vertical, you have the ability to try to compile the code. You have the ability to try to run the code. And so you might do that as an extra step in the workflow, and if you detect an error from the compiler, feed that back and try to correct. And so now you have essentially the LLM and compiler both kind of feeding off each other and trying to generate something more accurate.
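A simplified sketch of that generate-compile-retry loop, using Python's built-in `compile()` as a stand-in for whatever compiler or test runner fits your stack, and a placeholder `call_llm`:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for your model client; returns Python source code as a string."""
    return "def add(a, b):\n    return a + b\n"

def generate_with_feedback(task: str, max_attempts: int = 3) -> str:
    """Ask the model for code, check that it compiles, and feed errors back on failure."""
    prompt = f"Write a Python function for this task:\n{task}"
    for _ in range(max_attempts):
        code = call_llm(prompt)
        try:
            compile(code, "<generated>", "exec")  # cheap syntax check as a verifier
            return code
        except SyntaxError as err:
            # Feed the compiler's complaint back into the next prompt.
            prompt = (
                f"The previous attempt failed to compile with: {err}\n"
                f"Task: {task}\n"
                "Return corrected Python code only."
            )
    raise RuntimeError("No compilable code after several attempts")

print(generate_with_feedback("add two numbers"))
```

A real setup would run tests or a linter rather than just a syntax check, but the shape is the same: the deterministic tool supplies the error signal and the LLM supplies the correction.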
RD There's another sort of similarity between search and LLMs. I saw a sort of jokey meme Venn diagram recently that said that LLMs are just technically a slow database, and search is essentially a very large database too. Is an LLM just a sort of unpredictable search?
DL There's some analogies, but I think fundamentally it is a different kind of system. A search in my head is, behind the scenes, you have an index of keywords that you're kind of looping through and then finding the documents that match, whereas with an LLM there's kind of that neural network and it's a lot more probabilistic. I suppose where the analogy makes sense is in terms of trying to get the output out of the system. There's that aspect of, are you crafting the right query? So there I think there's an analogy that makes sense. Am I typing the right keywords on Google so I can find the right page, and then am I prompting the LLM the right way so that it can produce the output that I want? That's where I think the analogy makes sense.
RD You could look at it as a sort of bafflingly indexed database.
DL I guess that's fair.
RD So for folks who want to build LLM applications at scale, what are the lessons they can take from your search experience?
DL The practices that you want to adopt to develop these types of systems are– I already mentioned the explainability. So I do think you want to collect a dataset that helps you evaluate the performance of the system. And here when I say performance, I'm not talking about the speed, I'm talking about the quality of the results. Is it producing the output you expect or not? I think that often involves humans in the loop, not necessarily at the time that the system is creating the response, but at least after the fact to judge whether certain cases that the system is trying to handle were handled correctly or not. So I think you need to save those datasets, and later you can use those datasets either for training a new model or simply to continue measuring the performance of your current system and ensuring that the precision and recall, the quality of the system is to the standard that you want it to be.
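One minimal shape such an evaluation set could take: recorded inputs with human judgments attached, replayed against the current system to track a quality score over time. The field names and the `run_system` stub are illustrative.

```python
# Each case pairs a recorded input with a human judgment of the correct outcome.
eval_set = [
    {"input": "cancel my subscription", "expected_intent": "cancellation"},
    {"input": "why was I charged twice?", "expected_intent": "billing_dispute"},
]

def run_system(user_input: str) -> str:
    """Stub for the workflow under test; returns the intent it decided on."""
    return "cancellation" if "cancel" in user_input else "billing_dispute"

def accuracy(cases) -> float:
    correct = sum(run_system(c["input"]) == c["expected_intent"] for c in cases)
    return correct / len(cases)

# Run this on every change (or on a schedule) to catch drift in model behavior.
print(f"accuracy: {accuracy(eval_set):.2f}")
```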
RD So I hear a fair amount about the sort of governance and safety things, but kind of as an add-on. Do you think things like de-biasing, preventing malicious responses, and privacy shields should be included as first-party concerns when you're training up an LLM, or are they something you should think about once you have the sort of use case UI fit?
DL To me, I think it all goes back to the use case you're trying to solve. But depending on your use case, you should really care about the safety and possible responses of your LLM. And I think it'd be smart as another best practice to have a layer or layers in your own system where you can decide to filter out certain responses and not move forward with them. You definitely want to avoid profanity– again, depending on the use case, but profanity, violence, etc. But then, I don't know, let’s say you're doing something financial with the LLM and then there's just some sort of suggestions you think would be very financially dangerous. You might want to think about how you monitor for those and maybe even limit those when you don't have enough confidence in the output.
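A sketch of one such filtering layer: a deterministic check that sits between the model and the user and falls back when it can't vouch for a response. The term list and fallback message are placeholders for whatever policy fits the use case.

```python
BLOCKED_TERMS = {"placeholder_profanity", "placeholder_slur"}  # stand-in policy list

def passes_policy(response: str) -> bool:
    """Cheap deterministic gate; a second model or human review could sit behind it."""
    lowered = response.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)

def guarded_reply(response: str) -> str:
    """Only forward responses the filter layer is willing to vouch for."""
    if passes_policy(response):
        return response
    # Fall back rather than forwarding a response you don't have confidence in.
    return "Sorry, I can't help with that request."

print(guarded_reply("Here's a summary of your portfolio's risk exposure."))
```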
RD It's interesting that you bring up layers. Do you think it's a sort of requirement these days that if you're making an AI application you have multiple LLMs checking responses and prompts, or is it okay to just sort of go freewheeling and have one prompt, one response?
DL I kind of think it's better to try to divide the system into smaller components that you can then measure, test, and refine individually. So I'll put it this way– if you're doing something that just requires a very simple prompt, maybe your advantage there is around the UX and stuff that you're building, but it's not really the AI because you can just go to ChatGPT directly or Claude directly and get the same result. If you're doing a more complicated workflow, for a particular vertical, you're creating additional value over what you might get from just using ChatGPT directly, then I think you'll want to kind of layer the system. And by the way, layers might mean other LLMs, but it could also mean other deterministic pieces in the system that are interacting and/or driving the behavior of the LLM.
RD The rule-based AI is not dead, right?
DL Right. I think in the end, the interesting AI systems are going to be kind of like compound systems, it seems. You're going to have some deterministic things, you're going to have some rules, you're going to have LLMs, and you kind of use the right tool at the right time to drive an overall workflow.
RD So what are you most excited about either working on or tackling? What are the challenges and things for the future that you're excited about?
DL I'm just very excited in general about AI agents, personally. We've seen what LLMs can do– how they can generate English and images and audio. I get extra excited when I start seeing those LLMs coupled with tools that they can drive. The computer use example from Anthropic is one, and I think all sorts of APIs and other actions can be built into systems that essentially have an LLM driving them, with those tools allowing it to do input and output with the ‘external world.’ I'm super excited about that. In fact, a lot of what we are playing with is exactly that. Right now we have, in kind of a private beta with a few customers, a fully automated QA engineer, which is really an AI that goes through a web application and is kind of clicking around, forming testing plans, deciding if it's seeing bugs, and then creating reports out of those things. And so all of a sudden you have this kind of rote manual work that people hate doing– in my opinion, I've never met an engineer who loves having to click through their application 50 times and all the different edge cases– that we can automate, and that allows developers to focus on the cool part of software development, the creative part, the part that they enjoy the most.
RD I see a lot of people talking about ClickOps and trying to automate that away. I've also heard that there's something lost by not interacting with those edge cases. Do you agree?
DL I agree, but in a certain sense. I don't think the value comes from the developers doing the human labor. I think the value comes from observing with your own eyes kind of where the UI broke or where the feature broke and what was confusing about it. And so in this example I'm mentioning, we're actually making the AI create recorded videos of the steps that it takes. My hypothesis– and we'll see– is that if we can still show those videos to developers to explain the areas where we feel the application broke, they will still get that kind of insight into their own application by observing, and not necessarily by being the ones moving the mouse.
[music plays]
RD Thank you very much, ladies and gentlemen, for listening today. As always, we're going to shout out a winner of a badge, and today we have a fresh Lifeboat Badge to shout out. Congrats to Dhaval Simaria for providing an answer to: “The difference between pushing a docker image and installing a helm image.” If you are curious about that, we have a solid answer for you and it's won a badge. I am Ryan Donovan. I edit the blog here at Stack Overflow. You can find it at stackoverflow.blog, and if you want to reach out to me, you can find me on LinkedIn.
DL I'm Daniel Loreto, CEO and founder of Jetify. You can check everything we're working on at our website, jetify.com, and you can find me on LinkedIn as well if you'd like to connect.
RD All right. Thank you very much, and we'll talk to you next time.
[outro music plays]