The Stack Overflow Podcast

Is the enterprise (actually) ready for AI?

Episode Summary

Maryam Ashoori, Head of Product for watsonx.ai at IBM, joins Ryan and Eira to talk about the complexity of enterprise AI, the role of governance, the AI skill gap among developers, how AI coding tools impact developer productivity, what chain-of-thought reasoning entails, and what observability and monitoring look like for AI.

Episode Notes

watsonx.ai is an enterprise-grade AI studio. Developers can get started in the watsonx Developer Hub.

We published a technical behind-the-scenes look at watsonx, as well as a Q&A on why it’s business-ready

Find Maryam on LinkedIn

Congrats to Stack Overflow user Michael Kolber, who earned a Lifeboat badge with a straightforward and effective answer to Is it possible to download a website’s entire code, HTML, CSS and JavaScript files?.

Episode Transcription

[intro music plays]

Ryan Donovan: Well, hello there. Welcome to the Stack Overflow podcast, a place to talk all things software and technology. I am your host, Ryan Donovan, and we are here to talk all about AI and the enterprise. Is the enterprise ready for AI? Is AI ready for the enterprise? I'm joined today by co-host Eira May. How are you doing today?

Eira May: I'm doing pretty well. How are you doing, Ryan? 

RD: I am, you know, surviving, thriving, staying aliving. And I wanna introduce our guest today, Maryam Ashoori, Senior Director of Product Management at IBM. [inaudible] Welcome to the show, Maryam. 

Maryam Ashoori: Thanks for having me. 

RD: So, at the top of the show, we'd like to get to know our guest. How did you get into software and technology? 

MA: Oh, that's the fun part. It actually goes back to high school. I have an older sister that did computer engineering in undergrad, so she used to bring her assignments home. And I remember this one that I helped her with an Oracle form, user interface, drag and drop off the boxes to create a warp. And I was fascinated with this drag and drop. Things happen. Code is generated behind the scene, and I think it was sort of trigger for me to follow in her footprint and get into computer engineering in undergrad. Followed by AI actually for my master's degrees, I have two Master's degrees in AI. Believe it or not, surprise, surprise, the title of my dissertations for those AI degrees was multi-agent systems, that after 20 years is coming back again. So exciting time to work on that. But at the time it was mostly around publishing papers and writing papers. And me having, being a developer at heart, I wanted to build. So for my PhD, I went back to my engineering roots and I did systems design engineering with a focus on human computer interactions. So basically how people would interact with AI systems. And that was my segue to my first job at IBM as a user experience designer, and throughout the year I joined the IBM research, worked on AI and human computer interactions for years. Doing research and building this stuff, but at some point I felt like, you know what I wanna see the results of what I built. So I decided to leave IBM, I joined Lyft, the micro mobility company. I was the Head of Engineering for their bikes and scooters operation technology, so think of putting the right bike in the right place at the right time. And in case of E-bikes, you wanna make sure they are charged. You minimize the cost of operation and you maximize the utilization. So that was fun time, working on micro mobility, especially during pandemic because we saw how people started using bikes instead of public transportations. It was amazing until I started thinking, when I was looking at my product managers, I felt like they are the ones that make the call to build the right product, versus, I in engineering was working on how to build the product right. So I wanted to get into also building the right product, not just building the product right. And that was the segue to get into product management. So I came back to IBM, and because of my background in AI, I took over the AI portfolio. And over the past two years, I've been leading the Watsonx.ai, which is IBM's flagship GenAI product for enterprise.

RD: We've worked with y'all at IBM on a couple blogs about the Watsonx and I was struck in doing in those projects the strong focus on governance. That was one of the primary pillars. Can you talk about why the governance piece is so important to Watsonx? 

MA: Yeah. We take responsible implementations of AI very seriously because we designed for enterprise production and scale. If you're looking to last year, the market was mostly exploring with GenAI, looking for wow factor and aha moments, right? But at this point, most of the enterprises are moved to production. When you move to production, you think about actual use, putting these models in front of your users, right? Having immediate impact on your finances. So ensuring a responsible implementation of AI comes to the play. We are well aware of the limitations of LLMs. Lack of explainability, hallucinations, guardrails not in place. But if you think about it, last year, the absolute worst thing that could happen via LLMs was inappropriate content. But this year with agents, LLMs are taking actions. They can access data, they can execute tasks, they can call API calls. So the risks that are associated to those agents are very amplified now. So when you think about that, like for enterprises that wanna take advantage of these agent automations and bringing LLMs to every single corner of their enterprise systems, risks become a very important topic that we gotta tackle that head on. As a matter of fact, we just published a report with our AI Ethics Board on some of the risks associated with LLMs and agents, and what are the ways to mitigate them, in terms of like the governance systems in place, in terms of the observability that you need to bring into the picture to ensure the transparency and traceability of the actions for these agents.

RD: Yeah, I know when we do surveys of our users on AI, there was a big trust gap. You know, there's a lot of people who are using it, but there's a much smaller amount of people trusting it. And I know you all did a survey recently of a thousand AI developers. Can you tell us a little bit about what you learned from that?

MA: Yeah, so we talked to a thousand AI developers in the US that were mostly tasked to create AI applications, to understand what are the challenges in their work. And the reason that we did that was we saw a lot of excitement in the market around Generative AI, but when you really think about how enterprises can capitalize on their investment and realize the potentials of GenAI, it heavily falls on their developers to harvest and master the very complicated, ever changing, modern AI stack. Multiple layers all the way from computer models, to framework to applications. They have to optimize every single layer of the stack. They are dealing with some skill gaps. We wanted to understand like, if it's the case or not, to what extent some of the limitations that they mentioned. So we talked to them. And interestingly, the finding was sort of reinforcing what our hypothesis was: in terms of the skills we noticed there is a major gap in terms of AI skills of developers in the market. The AI application developers that we talked to, only one third of those application developers viewed themselves as an expert in Generative AI. So you see the gap. Now paired it with a market that is evolving rapidly. So we ask them, ‘Hey, how many tools in your typical day are you using to build AI applications?’ and on average, most of them, they set five to 15 tools. The majority of them said they can't afford to spending more than two hours on learning a new tool. So they are craving for tools that are easy to master, catering to their limited AI skills. They are good software engineers, but AI skills is a different thing. And they are asked by enterprises to optimize the stack because everyone wants to optimize the cost. So all the pressure goes on developers, and we are relying heavily on developers to manage the complexity of the stack.

EM: Yeah, that's a huge undertaking obviously. I wonder if we could return to something you mentioned about risk. Obviously huge new developments coming in this space. There's a significant experience gap. All of us are trying to kinda get our arms around this technology. Are there risks that you feel like the enterprise is very cognizant of? And then on the other hand, are there risks that you feel like maybe aren't being talked about enough? 

MA: Yeah, so it also depends on what industries we are talking about. In highly regulated industries like finance, from the first moment, like our insurance for example, companies, they think about risk. Questioning every single thing, but it might not be similar in other domains. It's also use-case specific, like when the stakes are high versus the stakes are low, it's just the normal automation that you're taking that if things go wrong, the consequences are limited. Right. So given that we've been looking, and also the customers, they have been looking into the end-to-end, all the way from building, because you know the AI lifecycle doesn't start with data, it starts from a model. You build a model and you take a pre-trained model and build upon it all the way to production landscape. The main question to ask is how is the model trained? Right. For many of the models in the market, you have absolute minimum visibility into the training of the models. Even if the model is open source, open source is just a delivery channel. You have access to the weights, but you have no idea how those weights are trained, right? So lack of access to the transparency of how the model is trained is a big item. Then you grab the model for LLMs. Through the input, you can nudge the model to potentially create an inappropriate output. So even though if you have put together all the guardrails in place and everything, a person can come in and manipulate the inputs to get an undesirable output, right? Because of that, it's essential for enterprise to automatically document the lineage of who touched the model. To do what? At what point? So if something goes wrong, they have the option to go back and figure out was it caused by the model? Was it caused by the input? Was it caused by the user? What happened? Right. So that's the second item. The third item is the guardrails in place. Most of the time you have sensitive information that you do not want to go to the model or you wanna protect the model in terms of like this behaviors that we just mentioned, like in terms of jailbreaking or anything. So before the data is passed to the LLM, you wanna make sure that you put guardrails in place in terms of input. We do the same thing on the output. HTAP use and profanity is filtered. PII is all filtered. Jailbreak is filtered. We've been actually looking into, we call them detectors, a list of detectors that you can configure on and on, and an orchestrator or orchestrates this, depends on your list because you don't wanna just filter everything, right? If you filter everything potentially, there's nothing left to come out of the model [Laughter]. So, we call them guardrails. I think these are really the three aspects of just LLM governance and observability. So now when agents come into the picture, right, it's a whole new area. Like they take the LLMs and they connect them to the whole functions, enterprise workloads, external calls and everything. Because of that, the two things becomes essential, in addition to all the LLM governance that we talked about. One is observability in terms of transparency and traceability of the actions for the agents. So you know exactly what is the function that the agent is about to call or run and, stop it if it's not appropriate. So it's not just content that you need to filter, you need to stop that action, right? And the second thing is you wanna be able to monitor the performance of the agent over time to detect any drifting performance, drifting biases, drift on anything, and mitigate it back. Right? So these are the new element, we call them ‘adjunct guardrails’ that we've been looking into, but this is an area that is yet to mature in the whole market. Like collectively with the community and the enterprise we are trying to understand what are the risk factors. It's like collecting information to see what are the gaps in this cycle/lifecycle and how collectively as the community we can overcome those.

RD: I wanna jump back to the tool chain. You mentioned AI developers have five to 15 tools that they touch. Seems like a lot. But I'm also thinking, you know, regular software developers have a database, they have programming language, they have frameworks, they have all these dependencies. How unusual is it to have those five to 15? And is it a function of just such a new technology, so much in flux?

MA: Yeah, all the existing software engineering tools that they had access to, they are not going away. This is just the AI stack that is exposing this new sets of tooling to the picture. And that's why it really makes it complicated because presumably all the existing domains, they are master of that. They are software engineers. They know how to tackle that. Like they are comfortable with how to perfect the processes. We've been doing software engineering for more than like multiple 10 years, right? Decades. So, let's say that we have a good handle of that [laughter] so now in the new world -

RD: ‘Let's say!’ [laughter]

MA: Let's say that, exactly [laughter] [inaudible] So now with AI, in the past 24 months, the stack is changing, the tools are changing, the technology providers are changing, many of them, like startups that maybe they weren't in the market six months ago, or maybe they didn't last more than six months. And as a developer, it exposes a certain level of risks to you because you have to understand that technology. You have to integrate with them. You have to rely on that technology provider to maintain that. And if the technology provider is not in the picture, you need to have an alternative to switch away. So you gotta move forward through sort of AI agnostic view, but also you have dependency to those AI tools. The second part that makes it complicated was, let's say historically, if you were a backend developer, you were a backend developer. If you were a frontend developer, you were a frontend developer, that was your expertise. For AI stack, you gotta be everywhere. Starting from lower level, multi GPU. If your firm is deploying them on their own infrastructure, you wanna experiment with multiple GPU types to see which one is more efficient. You wanna experiment in deploying your solutions on Prem or on different clouds to see which one gives you the efficiency that you need. You go up to stack models. There are hundred thousands of models in the market at this point, which is the right model for me to pick? Right? Different versioning. A new model is coming up. Like, what is the thing that I'm using like deprecation of these models? The models come in, the models leave the market. What's gonna happen to my systems that I've integrated with the models? The models are typically not batch work compatible, so these are imposing risk. Then going up the stack; AI frameworks and libraries. So for example, for agents, we have Crewai, LLAMAIndex, Land Graph, Autogen B-I-M-B. There is a series of them. How would you pick which one is the right one and build your application on top of this? So lots of decisions that needs to be made by the developer in a fast evolving market that doesn't have a sense of like long-term stability for the lifecycle of the application that the developers are building.

RD: I think one of the big risks, obvious risks is keeping everything updated, security patched. But I wonder if there's another risk in sort of outsourcing part of that layer itself to AI. Like, I don't have time to learn this. I'm just gonna give it to, you know, whatever AI chat or whatever code that I have to say like, you handle this.

MAi: I'm glad you asked because we asked the same question in that survey one, are you using AI assisted coding? The majority of them said yes. The second question was, how much time saving are you getting out of them? In average, out of thousand developers that we talked to, the majority of them said they get one to two hours of time saving a day. 4% said more than four hours a day. What does it tell us? It tells us that there is a major opportunity for developers to take advantage of AI assisted coding, but that the niche group of developers that have figured out how to effectively use AI for the purpose of what they are developing, they are the winners of the game. So moving forward, those developers that have perfected through prompt engineering or systems or whatever they have in place to maximize their productivity using AI developers, assisted coding, are gonna potentially replace the developers that are not using AI effectively in their jobs. And because of that, the time saving that is gonna bring up to development, we are gonna see accelerated application development. We've seen that over the past two years, I expect to see that even more on the software engineering side of the house moving forward. On the sense of the AI providers like IBM, we are running the AI stack. We are designing that modern AI stack. What can we do on our side to help developers, right? To simplify those complexity. Can we use AI for that purpose? What's the solution? And we've been asking that question ourselves too. So, you know, for Generative AI, the common use case is still, the most common use case is content grounded question and answering. And the pattern that we are using is retrieval augmented generation. And it's a very basic cycle that like, ‘Hey, LLM, when I send something to you, don't hallucinate, go to that body of information that is available to you, get the information, if you find it provide citation, if you don't find it, say I don't find it’. Right? Very simple view for developers. This is not actually that simple because if you wanna optimize your rack pipeline, you need to purposefully experiment and pick the right embedding models, the right generative model, like every single one of them is associated with task. The right chunk portion, right? So now if you don't have the AI background, how are you gonna know what is the right parameter to choose for this pipeline. Hyper parameter optimization, right? So in our portfolio, we felt like, ‘Hey, let's automate this’. Let's automatically create multiple pipelines, show the factors, the measurements that the developers care about in terms of, for example, faithfulness or whatever they care about. Calculate that, show it to the developer, and just help the developer pick the pipeline that makes sense. One single click deployment and they have everything. So we are taking, abstracting, away the complexity of that AI knowledge of developers to help them focus on the application building side versus the optimization parts. So I think that's an example of using AI for the purpose of helping developers.

EM: You talked to, you said about a thousand developers. What else do you wanna tease out of that to talk about a little bit more? 

MA: Yeah, we asked if you were to pick top two challenges for the developers in this era, what would that be? And the majority of them highlighted one, the lack of standards in terms of development processes. And the second one was the ethical and responsible implementation of this technology. On the first one, if you look back to the number of tools that they need to use, the lifecycle of these tools, the providers of these tools, are they gonna be in the market for long to maintain this or they are gonna go away? The rate at which the tools are being developed is pushing the developers to sort of try to establish an agnostic view from these tools, which brings it back to standards. Are there standards in the market, existing or emerging, or gap that we should fill-in to help them to deliver that purpose? And that standards are at every layer, like models. If a new framework comes to the picture, is it built based on a standard agent development process in the market that everyone is adopting, or no, it's gonna break everything. It may feel cliche. Oh why developers care about responsible implementation of AI, but especially for agents, when you think about it, the responsibility falls under developers. The developer builds an agent that is capable of tool [inaudible]. The developer is expected to design a system that makes the right call. So now the developer is responsible for all the risks associated with the LLM planning and reasoning capabilities and the shortcoming of that, and the developer is expected to monitor the logs of the actions of the agents. So you see the responsibility. You are responsible for your agents that you built, basically. So now because of that, suddenly the developer is on the hook to do the observability and make sure that the agent is not going rogue, right? And the stack for that in the market is not mature. Most of the technology providers in the market today are talking about agent building, not necessarily monitoring in production, which is the challenge for developers to overcome.

RD: Yeah. A lot of developments in the sort of monitoring, observability, explainability area. I heard somebody talk about understanding what LLM is doing by observing individual neurons. Is that a thing that you've heard? 

MA: Not individual neurons, but I was just with IBM research talking about agent evaluation and I'm like, what can we do in terms of observability of the agents? And basically one approach that we were talking about was chain of dot reasoning. Basically step-by-step, break it down to smaller pieces and solve them. So now when we break it down to step-by-step, you can think of a node at each step, so we can use LLMs as a judge to evaluate the efficiency of each node. Was it a right decision? Was the right tools called? If it wasn't the right tools called, can we use the LLM to generate the right prompt to fix that at the node level? And these are some examples of different ways that we can look into how agent is making that decision, going through the cycle of reasoning and calling/firing up different calls to tools to evaluate that and mitigate that. And we are using LLMs as a judge, so not a human needs to go actually look into what's happening. And you can log everything and run evaluation, periodic evaluation on the performance of the judge on its own. 

RD: Well, yeah, traditionally, humans haven't scaled very well.

MA: [Laughter] Not to that node level. Yeah -

Ryan Donovan: Yeah. Speaking of chain of thought, it seems like every model today is chain of thought reasoning on some level. What does that actually mean? 

MA: Yeah. Let's go back to the definition of agents actually. What's commonly recognized as ‘agent’ in the market is basically LLM with function calling. But if you look into the definition of agents like 20 years ago, it’s really an autonomous system that can make decisions with some sort of reasoning and planning capabilities, and autonomously can make actions. Depending on who you talk to, people believe the LLMs of today are capable of reasoning. Or they are not. If you ask me, I would say that they are not doing reasoning. This is just rudimentary planning that the agent is able to do by breaking a complex problem to step-by-step, resolving it. So basically, I would call it the step-by-step thinking, which is really the chain of thoughts. It's like when I ask you a complex question, break it down, try to solve one by one and see if the final result is gonna be different. It was a paper, I think it went out in 2022, so at this point, three years old. But it revolutionized the way that the market was thinking about LLMs. Last year, we saw another paper focused on what is something that is called ‘inference time scaling’, which means that, so now you have the model, you can pair it on chain of thoughts planning, break down the complex thought task to smaller task, and then do lots of reasoning like inference on its own to solve those smaller one. So instead of one single inference that you would do to get the results back, you are doing a lot of inferencing and inference time called reasoning and thinking, to get the performance that you want. There are good and bad parts associated with that. On the bright side, now we see that the agents can expose some sort of thinking behavior that is well suited for complex tasks. On the not so bright side, let's say that if you ask an agent, where is Rome? And the agent is doing reasoning, the agent is gonna start thinking, okay, so let's see what is Rome? Let's see if Rome is a country or a city and, and goes through the whole thing. And every time it's token is sent in and out, that translates to cost and that translates to latency. For one simple answer, the agent didn't need to go through that. So because of that, the challenge that we are seeing in the market is: to be well mindful of when to activate this reasoning capabilities and when not to activate it. Some of the models in the market, for example, Deep Seek, the reasoning is activated no matter what the question is, it’s gonna go through the cycle of reasoning. For some of the models, like [inaudible] thinking, that we released a few weeks ago, you can toggle that thinking on and off. So when you need it, you can turn it on. When you don't need it, you stay away from that. I expect the market to keep evolving around this depth of the reasoning and thinking capabilities, as well as the configuration in terms of cost saving and latency control, and the mapping to the use cases, because we don't always need reasoning. The technology may not be mature today, but it will fast forward in the next three months. 

EM: Oh, that was great, Miriam. Thank you so much. 

RD: Well, ladies and gentlemen, thank you for listening today. We are going to shout out the Lifeboat badge winner. Somebody who dropped in on the question, is it possible to download a website's entire code H-T-M-L-C-S-S and JavaScript files? Congrats to Michael Colber for answering that and winning a badge. I am Ryan Donovan, if you liked what you heard today, you can reach us with comments, corrections, insults at podcast@stackoverflow.com and if you wanna reach out to me directly, you can find me on LinkedIn. 

EM: And I'm Eira May. I'm also on the editorial team here at Stack. You can similarly find me on LinkedIn. 

MA: I'm Maryam Ashoori, Senior Director of Product Management and Head of Product for Watsonx.ai. Excited to be talking to you. 

RD: All right, thank you very much everyone, and we'll talk to you next time.