The Stack Overflow Podcast

Can GenAI 10X developer productivity?

Episode Summary

Anand Das, cofounder and CTO of Bito AI, joins Ben and Ryan for a conversation about the intersection of developer productivity and GenAI.

Episode Notes

Bito AI is an AI coding tool that helps developers work more productively with features like code completion within the IDE and personalized answers drawn from your codebase. Get started with their docs here.

ICYMI: Retrieval augmented generation (RAG) is a way of addressing LLM hallucinations and outdated training data.

Listen to our recent episode about how an original architect of Jira is rethinking meaningful engineering metrics.

Connect with Anand on LinkedIn or Twitter.com

Shoutout to Stack Overflow user Jan Kardaš, whose answer to Go: Retrieve a string from between two characters or other strings earned them a Lifeboat badge.

Episode Transcription

[intro music plays]

Ben Popper As a dynamic global organization, Citi is fueled by talented individuals with diverse perspectives. Explore front end to back end software engineers, application developers, scrum masters, product owners, and more at jobs.citi.com/tech. At Citi, they're more than just a bank. 

BP Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I'm your hostess with the mostest, Ben Popper, Director of Content here at Stack Overflow, joined as I often am by my partner in crime, Editor of our blog and sender of our newsletter, Ryan Thor Donovan. Ryan, what's happening? 

Ryan Donovan Oh, got the Friday vibes. 

BP Got those Friday vibes. Today we are going to have on the program Anand Das, who is the CTO and co-founder of Bito, a productivity tool for developers that tries to accelerate software development using AI models like the ones that come from OpenAI and Anthropic. He has been a CTO at other places like Eyeota, which was acquired for $165 million in 2021, and he co-founded and served as CTO of PubMatic, which went public in 2020. Anand has also held various engineering roles at Panta Systems, a high-performance computing startup led by the CTO of Veritas, and worked at Veritas and Symantec in various capacities on storage and backup products. He's got seven patents to his name, so quite a diverse background in the world of software technology, and even hardware it sounds like. So we're excited to chat. We love to talk about developer productivity, and we obviously can't help but talk about what's happening in the world of Gen AI and the tools that companies like OpenAI and Anthropic are bringing to the world. So without further ado, Anand, welcome to the program. 

Anand Das Thank you, thank you. 

BP So I just gave folks a bit of an overview there, but maybe tell us a little bit about, given your sort of polyglot background working in a lot of different areas, what brought you to your current role and the focus on productivity tools for developers?

AD So I've been a developer all through my life, at every stage: developing code straight out of college, where somebody used to give me requirements, and then later on managing teams which are developing a bunch of tools that work together to deliver a solution to the customers. But as you start managing teams, you figure out what things you were missing as a software developer and where the loopholes are. And as new people come in and old people go off and do the things that they want to, there's a lot of information leakage. So how do you help the developer team be on the same page and develop code which is according to the specifications of your organization and delivers value to your customers? That is a core problem that I've been dealing with for the last 20-odd years and wanted to solve. That is where the idea of Bito started. Again, we didn't start using Gen AI on day one. It was more like the Stack Overflow concept in a way, wherein there's a lot of tribal knowledge. Can we gather that tribal knowledge and put it all together, but accessible in the flow? When developers are coding in their IDE or maybe a VM and so on, you have to switch between a browser and your development environment, which basically sucks away a lot of energy as well as your thought process when you're actually coding, so you want the information to be available right where you are when you're coding. That is where the idea all started. We started this in 2021, thinking and building tools around it, but then we got users saying, “I don't want to provide information. I don't have time to explain to people all the things that they want. If there is a bug or a major issue then I might get involved and so on. Can you automatically generate this?” And OpenAI and others started launching GPT-3.5, or GPT-3 I'd say, toward the end of 2021, but nothing was available early on except for GPT-2, which wasn't that great. And once GPT-3.5 came out, we figured out that we could use AI for this, and that is where everything started with Gen AI. 

BP It's so amazing to me that most people were not really following along. You could play with GPT-1 and I remember there were little things. “Oh, it writes good short stories.” “Oh, it is decent at imitating poems.” “Oh, it's fun for doing Dungeons and Dragons.” But it was kind of thought of as a joke, and then GPT-3 was the one where people started getting it to mock up websites and do a little bit of this and that, but there was some sort of step function it hit at 3.5 and then again at 4 that obviously took it in a different direction. It's cool to hear that you were paying attention to that early, and I guess that's how you got on here. So let's discuss this, because we have OverflowAI coming out, we have a roadmap, and I think a lot of folks are basically pushing the same general idea, which is, “Let's supercharge enterprise search, whether that's for your codebase or your documentation.” Everything is going to be ingested by the model. It's going to train on it, or you're going to have some embeddings that live in a vector database, and then when you ask a question that you need answered within your organization –“Why am I getting this bug?”– instead of having to tap me on the shoulder or look at an FAQ that's out of date, you're just going to get an answer from the chatbot. So tell us a little bit about how you do that and how you think it's different. Do you have a specific approach that you think differentiates you? For Stack Overflow, the differentiation is crowdsourcing and voting, which we've been doing for a while, and that's how you make sure the information is accurate, relevant, up to date, et cetera.

AD So I'd say the general principles stay the same. Anybody who's building this would typically implement retrieval augmented generation. If you're answering questions on code, documents, or whatever, you'll take the input set, preferably break it down into chunks, and you will index it using either embeddings or a combination: you'll have embeddings for some of it, and then you might overlay that with a semantic search index plus other indexes. Because when somebody is asking a question, it can boil down to, “Explain something in the code to me,” or it can be, “Can I make changes for adding this column to a database? Can you tell me all the places I need to make changes?” or somebody might say, “Tell me all the files which are there in my current repository.” And these three things break down into multiple different objectives. “Give me a directory listing,” which doesn't need an AI. “Give me where all the changes are required,” which actually requires some amount of reasoning: find the symbol, just the table name, then figure out, if you want to add a column, what should I do there? Is it MySQL or something else? Then figure out where this table is used in code, then figure out which lines of code have it, and then figure out the changes required for it. Versus somebody saying, “Explain this to me,” which is: figure out the symbol, figure out the file which actually has the definition, go through those lines, and explain that. So again, what is different is that we are trying to build an index on your existing code and answer questions from it. We are not trying to search the web or use generic information. We can use generic information to generate code that you're asking for, but we're primarily focused on your codebase and giving you information from your codebase. The other thing that you mentioned, like on Stack Overflow where you're crowdsourcing data and you're getting data from users, that is very important, because generative AI can actually hallucinate and you have to do a bunch of things to make sure that it doesn't. The other thing is, how do you know whether the code that it's providing is going to run or not, or whether it created something off the top of its mind? For example, I want to access this API and the API doesn't exist. You don't want OpenAI or any model that you're using to suddenly give you an API which doesn't exist, and you're like, “Okay, I can use this,” and when you start running it, there's no definition for it. So putting guardrails around it and so on, those are the differences that are there. And obviously at some point in time everybody will catch up, because everybody who's providing a solution will need things like that. And then it's how much better you get at it and how well you use the context window that LLMs have to give more of a complete result rather than an incomplete one.
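
To make the retrieval flow Anand describes concrete, here is a minimal sketch of retrieval augmented generation over a codebase: chunk the files, index the chunks, retrieve the most relevant ones for a question, and build a prompt from them. The function names are illustrative, and the bag-of-words “embedding” is only a stand-in for a real embedding model; this is not Bito's implementation.

```python
# Minimal RAG sketch (illustrative only): chunk files, index them, retrieve
# the top-k chunks for a question, and build a prompt from them.
import math
from collections import Counter

def chunk(text: str, size: int = 40) -> list[str]:
    """Split a file into fixed-size line chunks (naive; the conversation later
    notes that code should really be split on function boundaries)."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_index(files: dict[str, str]) -> list[tuple[str, str, Counter]]:
    """Index every chunk of every file as (path, chunk_text, vector)."""
    return [(path, c, embed(c)) for path, text in files.items() for c in chunk(text)]

def retrieve(index, question: str, k: int = 3):
    """Return the k chunks most similar to the question."""
    qv = embed(question)
    return sorted(index, key=lambda item: cosine(item[2], qv), reverse=True)[:k]

def build_prompt(question: str, hits) -> str:
    """Assemble the retrieved chunks plus the question into one model prompt."""
    context = "\n\n".join(f"# {path}\n{text}" for path, text, _ in hits)
    return f"Answer using only this code context:\n{context}\n\nQuestion: {question}"
```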

BP The context window went from 8k to 32k to 100k, so we got a lot more context window as of the announcement this week. 

RD Speaking of context windows, codebases today are just sprawling across multiple repos, multiple services. They're pretty huge. Is there a sort of hard limit on the context that an AI can understand when searching a codebase, even with retrieval augmented generation?

AD So there is a limit. If you look at OpenAI's GPT-4 Turbo now, it's supposed to give you a 128k context window, which is great. Anthropic already gives you 100k. And of the other models which are out there, most have reached the 32k limit. But at the same point in time, the amount of context that you can give is limited to that window, so you have to be very careful about what you fill into that context to get the answer. Sometimes you're able to fit in everything that you need so your answer is complete, but as you said, you might have larger codebases, or in bigger organizations you might have multiple repositories for microservices, and you might be building something which uses multiple microservices. So if you're generating a piece of code, you need access to that information so that you generate proper code that somebody can use, so they don't have to take the code, tinker around, and change 90 percent of it to make it work. Now, if you just limit yourself to the context window, even at 128k, and the codebase is huge and the relevant context is huge, then the answer that you're going to get is going to be incomplete. There are multiple ways to solve it. One is that if the context is more than the context window, let me actually fire more than one request to generate the answer: I will form a chain of prompts, keep running them on different contexts, and then gather the results and combine them together. Now that works well with plain text data, but with code you cannot just take this piece of code and that piece of code, which are different contexts, and say, “Let's jam them together.” What if it doesn't fit into the context window? So you have to come up with techniques which say, I'm going to provide a solution continuously, one thing at a time, and then based on what I've generated and the next piece of information that I have, let me update the existing code and then finally get the code that works and provide it to the user. But there are certain use cases that you won't be able to solve completely today without taking a totally different, programmatic approach, like, “I want to upgrade from Java version X to Java version Y across my repository.” That is something that, it's not that you cannot do it, but the amount of effort required to do it is huge. It's not something like, “I'll just use an LLM, give a prompt, give my code, attach a RAG, and I'll get something that will be working up and running quickly.” So there is a limit. 
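
A rough sketch of the chained-prompt approach Anand outlines, where context larger than the window is fed in pieces and a draft answer is carried forward and refined. Here call_llm is a placeholder for whatever model API you use, and the prompt wording is an assumption, not a quoted implementation.

```python
# Sketch of the chained-prompt ("refine") pattern: when the relevant context
# won't fit in one window, feed it in pieces and carry the partial answer forward.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire up your model of choice here")

def answer_over_large_context(question: str, context_chunks: list[str]) -> str:
    draft = ""
    for piece in context_chunks:
        if not draft:
            # First pass: answer from the first slice of context alone.
            prompt = f"Context:\n{piece}\n\nQuestion: {question}\nAnswer:"
        else:
            # Later passes: revise the draft so it also accounts for new context.
            prompt = (
                f"Existing draft answer:\n{draft}\n\n"
                f"Additional context:\n{piece}\n\n"
                f"Update the draft so it also accounts for this context.\n"
                f"Question: {question}"
            )
        draft = call_llm(prompt)
    return draft
```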

RD That's why so many people are still going to be putting off their Java upgrades. 

BP Yeah. “I'd like for you to completely refactor this codebase in a different language,” is a tall order, but it's coming. I'm sure it's coming. 

AD Yeah, it will come. You can do it for smaller sized programs, or if you can fit everything into the context window, you can do it for a piece of code, but the thing is, when it goes beyond the context window. And I think context windows will keep on changing and they'll keep on increasing. 

BP So let's get down to the nitty gritty. What are some of the fun things you've been finding with RAG, tricks that make it work better or things where you're like, “If you do this, it really screws it up.” What have you been working on? What's the bleeding edge of how to get the best out of your RAG approach? 

AD So there are a couple of things, and there are different areas. One is performance. Obviously when somebody is asking a question, they expect the answers to be available instantaneously. With RAG it's there, but it's not quite there, because if your codebase is huge, and not everybody's using Pinecone– Pinecone is costly. So if you're indexing a big codebase and you're running a vector database, it does take some time to get the context. And after getting the context, the other thing is the LLMs. You'll also figure out when you're running LLMs that if you give a larger context, it will take a bit longer than something with a smaller context. So managing performance is important. Now, when it comes to creating a RAG index, the thing is, how do you actually segregate the data that you index? When I say creating chunks, you have a big file which you're dividing into multiple pieces and indexing, and the question is what content those pieces have. If I just take the approach of breaking down a code file the way I break down a text file, the RAG will be there, it will give you the context, but your answers might be wrong, because when you're searching for a particular function and you're chunking, you're not looking at function boundaries. So you might get a chunk where the full function is not there: part of it is in your context and the remaining part is missing. And then you try to get an answer and you get a wrong answer. 
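
A small sketch of the fix Anand is pointing at: chunk code on definition boundaries instead of arbitrary line counts, so a function never gets split across chunks. This uses Python's standard ast module and only handles Python files; a real tool would need a parser per language, and this sketch drops top-level code that sits outside any definition.

```python
# Sketch: split Python source on function/class boundaries using the standard
# ast module, so a definition never gets cut in half across two chunks.
import ast

def chunk_by_definition(source: str) -> list[str]:
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and available on Python 3.8+.
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks
```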

BP Do you have an approach to attribution? I know for Stack Overflow, one of the things we're looking at is, “Okay, we're going to get you an answer that is generated through a RAG kind of process, but it's only going to look at Stack Overflow questions with an accepted answer,” and then when it's done, it's going to show you the sources. And with the recent upgrade to ChatGPT, where now you can use Bing, it gives me an answer and it cites the websites that it got it from, so I can go check that. Do you have that within your system, but for the company's codebase?

AD Yes. So whenever any answer is provided like, “Where do you want to make changes,” and so on, whatever is required based on the question, we basically provide the links to the file from where the code was picked up or the information was used to provide this output.
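
A minimal sketch of that kind of attribution, assuming retrieval returns (file path, chunk text, vector) tuples like the earlier sketch: keep the paths of the chunks that fed the prompt and return them as sources alongside the answer.

```python
# Sketch: return the answer together with the files the context was drawn from,
# roughly the attribution behavior described above.
from dataclasses import dataclass

@dataclass
class AttributedAnswer:
    answer: str
    sources: list[str]  # file paths the retrieved context came from

def answer_with_sources(question: str, hits, call_llm) -> AttributedAnswer:
    context = "\n\n".join(f"# {path}\n{text}" for path, text, _ in hits)
    answer = call_llm(f"Context:\n{context}\n\nQuestion: {question}")
    return AttributedAnswer(answer=answer,
                            sources=sorted({path for path, _, _ in hits}))
```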

[music plays]

BP As a dynamic global organization, Citi is fueled by talented individuals with diverse perspectives. Explore front end to back end software engineers, application developers, scrum masters, product owners, and more at jobs.citi.com/tech. At Citi, they're more than just a bank.

[music plays]

RD I like that early on in your website you talk about how security is the priority. Everybody with an enterprise codebase is very much worried about their security. How do you provide that while also indexing the entire codebase and also sending those to AI APIs? 

AD So there are two pieces to security. One is, being a tool which provides help to developers, we use LLMs and we don't own the LLMs. We use LLMs from OpenAI or Anthropic or, for that matter, Amazon Bedrock, and even Google; we've started using Google models also. So one is, for enterprises, you can deploy these models in a VPC which is private to them, and we can use the API URLs from there, so they can be assured that it might be going to an LLM model, but it's not going to a third party; it's in my VPC. So that's one level of security. The second level of security is the data that is transient in between. Even if you do RAG, whenever you're answering a question, you're identifying the relevant context using RAG based on the question, and then you're passing that context along with the question to the LLM, which is going to pass through us. And then there's the RAG portion, which is the index: where do you maintain it? For consumers today, we say that we'll index whatever is open as a project in your IDE, and we maintain that index on the user's machine itself. We use an in-memory vector DB that we built so that the data stays out there. It does go to an LLM when an individual user is using a non-enterprise account, and with the LLMs we work with, like OpenAI, Anthropic, or, for that matter, OpenAI on Azure and Amazon Bedrock models, we have an agreement in place that anything coming through our APIs, you're not going to track, you're not going to use for learning or anything. And then all the transactions actually happen. We don't log anything except for whether the results were liked by the user or not, and the telemetry data of how many tokens were used and so on, for cost management and stuff like that. So all the data is on the user's machine; that's the primary goal. We've made sure that there is no logging. How do I figure out whether things are working or not? If things are not working and the user has an issue, let them share the data. We are not going to capture their data which is flowing through the platform or the system. And we do all of the normal things to keep it secure: encryption, keys, and stuff like that. So that is how we provide security. And the other thing is, for enterprises, the indexes, the RAG, can be maintained within their own data centers if they want. Today we support clouds rather than on-premise systems, because it's easier to manage and monitor.
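
An illustrative sketch of the deployment choice described here: route model calls to a customer-controlled endpoint inside their VPC when one is configured, and log only aggregate telemetry (token counts, thumbs up/down), never prompt or code contents. The field names and URLs are made up for illustration and are not Bito's actual configuration.

```python
# Illustrative config sketch: prefer a customer's private VPC endpoint when set,
# and record only aggregate telemetry, never the request or response text.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    enterprise_vpc_url: str | None          # e.g. "https://llm.internal.example.com/v1" (hypothetical)
    vendor_url: str = "https://api.example-llm-vendor.com/v1"  # hypothetical default

def resolve_endpoint(cfg: ModelConfig) -> str:
    # Prefer the customer-controlled endpoint so requests never leave their VPC.
    return cfg.enterprise_vpc_url or cfg.vendor_url

def log_telemetry(prompt_tokens: int, completion_tokens: int, liked: bool | None) -> None:
    # Token counts and thumbs up/down only; prompt and completion text is not logged.
    print({"prompt_tokens": prompt_tokens,
           "completion_tokens": completion_tokens,
           "liked": liked})
```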

BP Makes sense. So another thing that I'm curious about is how you would measure the impact on developer productivity, because that's something we talk about a lot. Ryan just did a great episode recently about DORA metrics and to what degree we can trust those. Have you been able to do any case studies with folks and put hard numbers around the kinds of improvements you can make with this strategy? 

AD So as I said, we don't collect much of the data which is flowing through our system to figure out what kinds of questions people ask, what answers they got, and then try to verify later on whether those were right or wrong. So we depend upon user feedback, which is a thumbs up/thumbs down that we have in our UI. Plus we survey the users on a regular basis on what they feel good about and what they feel bad about. And based on the surveys and the thumbs up/thumbs down data, what we have gathered from our users is that it has improved their productivity by at least 30 percent. And when we say 30 percent, in which areas? One is for people who are working remote and whose teams are spread across different countries; language forms a barrier, but now that barrier is kind of removed because of AI. The other is people who are coming into a project and trying to solve bugs and don't really know a particular language that is being used– somebody wrote a script in Python and the guy doesn't know Python– they can actually understand what that script does, logically figure out that there is an issue, and then have AI actually write the code. So there was one guy who is actually a Node.js programmer and he's like, “I could actually fix issues in Python code, which I never did before, and it just took me five minutes.” So those kinds of productivity gains are there. The other thing that we have seen people do is move repetitive tasks to AI. For example, you need to create docstrings and I'm really bad at that, so let me just have AI create docstrings for my code, or create commit messages. There's one guy who created release messages and tried to compare Jira with the code and then come up with, “Is the issue fixed?” From the product manager perspective that's a big deal, because they don't know; they have to depend upon the developers. But now they can have AI check the dev's code changes: does it look like what I asked for in the Jira ticket? So a 30 percent productivity gain is what we have seen. Some people talk about 40 to 50. We'll get there. 

BP Nice. 

RD A friend of mine talks about using it, especially for things like creating type definitions in TypeScript, having a big JSON blob and just having it just take care of it instead of spending a bunch of time writing out this big, boring type definition. 

AD YAML files, Terraform files, DevOps guys love that. And LLM models are really good because they have the knowledge of it so you give the requirements and they can immediately generate those files. So now they're kind of like, “I don't have to sit down and type and look at syntax issues.” 

RD Can an LLM figure out if a YAML file has the right spacing on it? 

BP That's ultimately unknowable. Never. 

AD Yeah, that's a problem. There are also other challenges. I'll give a very simple example. If you give it a piece of code which is big enough and then ask an LLM to rewrite it, if it has repetitive code, most of the time the LLM will show you how to change the first portion of the code and then later on it will just say “//todo,” for the other functions it will follow the same thing. You will say, “Okay, fine, but I want the whole thing rewritten.” And you might give the same thing again, and then it might give you the top two functions and then say “todo” for everything else that should be rewritten. So sometimes it just doesn't understand the human needs, and you have to put additional guardrails and additional prompting in place to make it do things which are very simple to understand from a human perspective. 
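
One possible guardrail for the “//todo” behavior Anand describes, sketched under the assumption that you control the prompting layer: scan the model's rewrite for placeholder markers and re-prompt for the complete file, with a hard cap on retries. call_llm is again a placeholder for a real model call.

```python
# Sketch of a guardrail for placeholder elisions: detect "todo"-style markers
# in the model's rewrite and re-prompt for the full file, up to max_retries.
import re

PLACEHOLDER = re.compile(r"//\s*todo|#\s*todo|\.\.\.\s*(rest|remaining)", re.IGNORECASE)

def rewrite_fully(code: str, instruction: str, call_llm, max_retries: int = 3) -> str:
    prompt = f"{instruction}\nRewrite the ENTIRE file; do not elide any function.\n\n{code}"
    result = call_llm(prompt)
    for _ in range(max_retries):
        if not PLACEHOLDER.search(result):
            return result
        result = call_llm(
            "Your previous output replaced some functions with TODO placeholders. "
            f"Return the complete rewritten file with every function included.\n\n{result}"
        )
    return result  # may still contain placeholders; the caller should surface that
```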

BP So I'd be curious to know, do you think there's going to be a healthy relationship between the big model providers who have an API that you access and the startups like yourself? How do you balance the cost of accessing that API against what you can charge your customers, and how do you balance the service you provide with maybe competing services that big tech giants who create foundation models will also want to offer? 

AD So the thing is, with startups the only advantage you have is that you're fast and you're continuously working. The big companies have the advantage of a massive workforce. They also have the advantage of existing tools already being used, so for them, it's just an upsell. They can obviously give stuff for free, which we cannot. So those kinds of challenges are there, but I think in the end it comes down to the value that you bring to the table, the ease of use, how quickly I can get started on this, and do I need to be on a particular platform, or I have various different things I use and can you connect with all of them without giving me a headache? Anybody who does that will see a larger share of wallet when it comes to the ecosystem. So obviously there will be multiple players, and having multiple players is good because it keeps you on your toes; otherwise you become stagnant and you're not going to evolve. That is the good part of having competition, but the other thing is that you always have to bring value to the table, and the users are going to tell you whether you're bringing value to the table or not. So it's continuous improvement, I'd say. Hopefully that answers the question, but the thing is, there's no easy answer like, “I have this feature which makes me different,” because the other guys will have it in six months to a year. It's going to be a race. So it's the overall value that you bring to the table and then what you give along with it. Now, when you're talking about coding assistants, people think about them as a simple tool which is in your IDE. There are a lot of organizations which have developers using the IDE, but there are also checks that they do in their CI/CD pipeline. Can an LLM or your tool actually help them when they are in their CI/CD process? Because there you do a catch-all. If somebody is not using an IDE that is supported by an AI assistant, what do you do? The DevOps guy is going to use Vi, Emacs, other tools that they want to, and there's no plugin out there. They will do whatever they want to do, so you want to do a catch-all. Can you apply the same rules out there? Can you take those rules that you apply in CI/CD and percolate them down to the developers, wherein developers don't have to do anything and it's being done automatically? Can I have agents generate your test scripts the moment you save the file? Those are the kinds of things that will evolve over a period of time, and those will differentiate how you're adding value and differentiate you. 

RD Since you brought it up, I wanted to ask about agents. I keep hearing that they are the future of AI, that you'll have a team of testers or linters or whatever working beside you on your code. Do you think you'll be getting into that area, providing some sort of agents that are helping folks understand the code as they're working?

AD Yes. I wouldn't say we'll get into that area. Based on the people we're working with and the needs that they have, we have already dived into it. But what people call agents can differ. For example, if you're looking at agents or tools in terms of LlamaIndex and LangChain, I have an LLM prompt and I run it as a chain. Or some people have it where I'll give a natural language question and you have multiple agents, and the system will figure out, based on the question, which agents to run, give them the right inputs, get the output, combine it, and then deliver the results. But those kinds of things sometimes may not work for something which is very specific, or they can work, but to make them work you'll need something more. So a very simple example is, if you want to write a unit test case for a piece of code using agents, you might have a developer agent which generates the unit test case and you might have a critic agent which criticizes the test case that is generated, and then once the critic criticizes it, the developer agent takes the job back and modifies the code based on what the critic said. How many times are you going to run this process? There has to be a stop to it. And when do you stop? When the critic says that everything is good? And how does the critic know whether it is good or not? So do you then say that I need a code analysis tool which looks at code coverage and quality, which is outside the Gen AI realm? So you'll have a mixture of non-AI tools plus AI together, complementing each other, to get you the right results. Now to do that, if I just run an agent, you don't want to be in a state wherein I've given it this task and it's endlessly going back and forth between the developer and the critic and I'm not seeing any results for 10 hours. So you have to put a stop to that also. So can you do it today end to end and have it running in an enterprise? The answer is no. Can you make it happen with all the other tools, the right tooling, prompt engineering guidelines, and so on? Yes, but the time required to build that is a bit different from just making a call to an LLM, and making that process repeatable without hallucinations or changes in output requires another level of tooling and instrumentation. So those things have to happen. So I'd say agents are there. It's a good concept, and right now they're at a stage wherein you can do a POC or a pilot. To take it to the enterprise level, we have to do a lot more work to get there, and it will happen over a period of time. 
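
A sketch of the bounded developer/critic loop Anand describes, with the two safeguards he calls out: an objective, non-LLM check (here a code-coverage run passed in as a function) and a hard cap on iterations so the loop cannot spin for hours. The function signatures are assumptions for illustration, not any particular framework's API.

```python
# Sketch of a bounded developer/critic agent loop: a "developer" call writes
# tests, a "critic" call reviews them, and an external (non-LLM) coverage check
# plus an iteration cap decide when to stop.
def generate_tests(code: str, call_llm, run_coverage, max_rounds: int = 3,
                   target_coverage: float = 0.8) -> str:
    tests = call_llm(f"Write unit tests for this code:\n{code}")
    for _ in range(max_rounds):
        coverage = run_coverage(code, tests)   # external tool, not an LLM judgment
        if coverage >= target_coverage:
            break                              # objective stop condition reached
        critique = call_llm(
            f"Critique these tests for gaps:\nCode:\n{code}\nTests:\n{tests}"
        )
        tests = call_llm(
            f"Improve the tests based on this critique:\n{critique}\n\nTests:\n{tests}"
        )
    return tests
```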

BP We had this conversation with somebody just the other day, which is, will these systems start to come up with their own accuracy models, as you pointed out? Will that be a separate system? Their own way of evaluating, “Hey, I'm in this actor critic mode,” or, “I'm in this chain of thought reasoning. How do I know when to stop?” Because those GPU cycles are expensive for inference. A human being is going to stop because they have to feed their kids or they're going to go on vacation. They're going to get to a point where they say, “I feel good about it. I've worked hard enough on it. I'm done.” The AI will never say that. You have to say, “When is it time for you to stop doing this?” And maybe there's even a problem of overfitting, we don't know. If you endlessly ask it to revise and improve its answer, you might end up going backwards at a certain point.

AD Yeah. And there are simple things which actually make it tough as well. If you just ask it, “Hey, give me performance improvement changes on this code and then rewrite this code with those changes,” and then you provide the same code and ask for performance changes again, it will give you another set of things. And if you say, “Give me the whole list,” it will give you a whole list, but after you make those changes, it will again give you a list. So it itself doesn't know when to stop, so you have to give guidelines like, “What is it that I'm expecting, and what level is okay for me?” And then you have to use the right model which understands that, because if you use GPT-3.5 and GPT-4 you'll have different results than if you use Anthropic. So when we go multi-model, this problem expands. 

BP I like that idea a lot. It's like, “I want you to do this for six hours, and then if you get an improvement in memory or speed or a few more tests passing, that's an improvement, keep that. And if you do it again and you don't see any major improvement over 3%, stop.” When you see diminishing returns, that's when you need to stop.
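
A sketch of that stopping rule: accept a revision only when a measured metric improves by more than a threshold, and stop at the first round of diminishing returns. The metric is assumed to be higher-is-better (say, a benchmark score or number of passing tests), and propose stands in for an LLM asked for a performance rewrite.

```python
# Sketch of a diminishing-returns stop: keep iterating only while the measured
# improvement per round stays above min_gain (3% here), up to max_rounds.
def optimize_until_diminishing(code: str, propose, measure,
                               min_gain: float = 0.03, max_rounds: int = 10) -> str:
    best, best_score = code, measure(code)
    for _ in range(max_rounds):
        candidate = propose(best)              # e.g. an LLM asked for a perf rewrite
        score = measure(candidate)             # e.g. benchmark throughput, tests passed
        if best_score and (score - best_score) / best_score < min_gain:
            break                              # no meaningful gain this round: stop
        if score > best_score:
            best, best_score = candidate, score
    return best
```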

AD And you'll need to use a mix of Gen AI and some real world tools. So for example, when you have a code interpreter, you have a code interpreter which does the job of interpretation. But now once the code is there, whether it works or not, and if there are any issues, you'll only figure it out after running it. So you have an agent which runs real world tools which compiles the code, runs it, figures out the issues, provides it as a feedback, and then you go and change it. So it'll be a combination, it won't only be completely Gen AI, at least for some time before Gen AI is able to do a bunch of more stuff.

BP Uh-oh, that's when we're all making art and planting seeds.

[music plays]

BP All right, everybody. It is that time of the show. Let's shout out someone who came on Stack Overflow and helped to spread a little knowledge. A Lifeboat Badge was awarded to Jan Kardaš on November 1st. They came on and found a question that had a score of -3 or lower, gave it a great answer, and that answer now has a score of 20 or more. The question is, “In Go, how to retrieve a string from between two characters or other strings?” Jan has an answer for you and has helped over 33,000 people, so we appreciate it, Jan. As always, I am Ben Popper, Director of Content here at Stack Overflow. You can find me on X @BenPopper. You can email us with questions or suggestions for the podcast at podcast@stackoverflow.com. And if you like what you hear, leave us a rating and a review, because it really helps. 

RD I'm Ryan Donovan. I'm the Editor of the blog here at Stack Overflow. You can find it at stackoverflow.blog. And if you want to reach out to me on X, you can find me @RThorDonovan.

AD Hi, I'm Anand. I'm on Twitter @AnandDas. I am co-founder and CTO at Bito, and check out what we do at Bito.ai. 

BP Awesome. All right, everybody. Thanks for listening, and we will talk to you soon.

[outro music plays]