The Stack Overflow Podcast

Even high-quality code can lead to tech debt

Episode Summary

Ben talks with Eran Yahav, a former researcher on IBM Watson who’s now the CTO and cofounder of AI coding company Tabnine. Ben and Eran talk about the intersection of software development and AI, the evolution of program synthesis, and Eran’s path from IBM research to startup CTO. They also discuss how to balance the productivity and learning gains of AI coding tools (especially for junior devs) against very real concerns around quality, security, and tech debt.

Episode Notes

Tabnine is an AI code assistant that offers AI tools for code generation, testing, and code review.

Eran was previously a researcher at IBM, where he worked on IBM Watson.

Connect with Eran on LinkedIn.

Stack Overflow user Anders earned a Populist badge with their first-class answer to "How to detect the current screen resolution?".

Episode Transcription

[intro music plays]

Ben Popper Can a blockchain do that? Algorand has answers. Developers are using the open source Algorand blockchain to build solutions disrupting finance, supply chain tracking, climate tech, and more. Hear from devs, learn about the tech, and start building on-chain. Blockchain solutions aren’t hypothetical, they’re here. Check out canablockchaindothat.com. Can a blockchain do that? Algorand can.

BP Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I am Ben Popper, the Director of Content here at Stack Overflow, and today I am joined by a guest with a lot of experience in the world of software development and at the new frontier of software development and AI: where those two things intersect, how they might be changing the way developers and programmers work, and what role is key to retain for humans and what makes sense to give over to the machine. So without further ado, I'd love to bring on Eran Yahav, who is the CTO over at Tabnine.

Eran Yahav Great. Thank you, Ben. Thank you for having me. 

BP My pleasure. So tell folks first just a little bit about how you got into the world of software and technology. 

EY I initially got into this entire world by just writing computer games because that's what I enjoyed as a teenager. Then I continued to do some research, went to academia, got my PhD in program analysis and program synthesis, proceeded with an academic career, became a professor of CS at Technion, which is a leading Israeli university, and then bumped into stuff that could actually work, program synthesis for real, which got me very excited, and this is what got us to start Tabnine back in the day.

BP Very cool. So for folks who are listening who don't know, can you define program synthesis?

EY So program synthesis is a really old problem dating back to the 50s: you write a high-level specification (in the original papers, it was a logic formula) and get a program generated that satisfies exactly that mathematical specification. And there's a long history of that problem being worked on with various approaches, from deductive synthesis to inductive synthesis, all sorts of ways to get from the description to a working program.
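To make that concrete, here is a minimal, purely illustrative Python sketch of the inductive flavor of synthesis Eran mentions: the specification is a handful of input/output examples, and the synthesizer enumerates tiny candidate programs until one satisfies them all. The grammar and names are invented for the example and don't correspond to any particular research system.

```python
# Toy inductive program synthesis: the "spec" is a set of (input, output)
# examples, and we enumerate candidates from a tiny grammar until one fits.

from itertools import product

# A minimal expression grammar over one integer variable x and a constant c.
CANDIDATES = [
    ("x + c", lambda x, c: x + c),
    ("x * c", lambda x, c: x * c),
    ("x - c", lambda x, c: x - c),
    ("c - x", lambda x, c: c - x),
]

def synthesize(examples, max_const=10):
    """Return a program (as text) satisfying every (input, output) example."""
    for (name, fn), c in product(CANDIDATES, range(-max_const, max_const + 1)):
        if all(fn(x, c) == y for x, y in examples):
            return name.replace("c", str(c))
    return None  # no program in this tiny language satisfies the spec

# The spec: triple the input.
print(synthesize([(1, 3), (2, 6), (5, 15)]))  # -> "x * 3"
```

Real synthesizers search far larger program spaces with smarter pruning, but the structure of the problem, a spec plus a search for a program that satisfies it, is the same.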

BP Okay. So when I hear that, my mind immediately goes to the events of the last two years, coming up on two years since November of 2022, which is to say, I ask in the highest-level language I know, natural language, for an AI system to write some computer code for me, and it goes ahead and creates the function that I asked for, and when I test it, it runs as intended. Is that program synthesis in the way that you mean it, and how would you say that relates to the original field, the mathematical or logical version?

EY In the mathematical version, there is a notion of satisfying the specification correctly. There is a very precise specification, a very concrete, mathematically defined notion of what is needed to satisfy it. Here, we're in the real world where everything is under-specified, really. You say, "Write a calculator. Generate a React application that implements a calculator," and there are so many missing details. Do you want a scientific calculator? How many buttons is it going to have, with this layout or that layout? So satisfying that spec is really your decision. Did I get what I wanted or not? So it's up to the human to say whether the spec was satisfied.

BP That makes sense. I think one of the things that's interesting and challenging about generative AI as we refer to it today is that it's nondeterministic, and so in a lot of ways that's quite different from asking it to do a math problem that has very tight boundaries around how it might be solved or how you might prove out the solution. But you've been working at Tabnine since 2014 and you worked at IBM in the Watson division before that. Let's return to that earlier era. Deep learning and neural nets were around, but they weren't all the rage. Let's talk a little about IBM Watson. What were you working on there? What approaches were you taking? 

EY Back in my IBM research days, we did synthesis more in the mathematical sense, mostly focusing on low-level concurrent programs in which you had to infer and generate the synchronization. So the challenge was, "I give you a sequential program that is the specification, you know exactly what the computation should do. Now make it concurrent, make it run on multiple processors with a high level of concurrency. Be more efficient, but still compute exactly the same function as the sequential program did." That's a very mathematical form of synthesis, so you can prove that what you get is actually correct with respect to the sequential specification, so that was very cool work. Once we started looking into neural nets, I think the first works that we did were around generating code using recurrent neural networks, and you could get some mileage out of that, but back in the day, those things were not even strong enough to really get the syntax completely right. In today's models, you rarely get the syntax wrong, but back in the day, even getting the correct syntax was a challenge. So that's a long, long time ago.
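As a rough illustration of the setup Eran describes, and not the actual research system, the sketch below treats a plain sequential function as the specification and checks that a concurrent version computes the same result, with the lock standing in for the synchronization a synthesizer in that line of work would have to infer.

```python
# The sequential version *is* the specification; the concurrent version must
# compute exactly the same function, and the lock is the synchronization.

import threading

def sequential_sum(values):
    # Specification: the straightforward single-threaded computation.
    total = 0
    for v in values:
        total += v
    return total

def concurrent_sum(values, num_workers=4):
    total = 0
    lock = threading.Lock()          # the "inferred" synchronization

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:                   # without this, updates to `total` can race
            total += partial

    chunks = [values[i::num_workers] for i in range(num_workers)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

data = list(range(1000))
assert concurrent_sum(data) == sequential_sum(data)  # same function, now concurrent
```

In the research setting, the interesting part is proving that equivalence for all inputs rather than testing it on one, which is what makes that form of synthesis mathematically precise.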

BP So you've been working in academia the whole time. You're at IBM, and then you decide you want to branch out and help create a new company. What's the conversation there? What's the genesis of the startup?

EY The genesis is that I talked to Dror, who's an old friend from my military service, and I showed him the stuff and he said, "Hey, this can actually work. This is working, this can actually generate code for real engineers and not just in the corridors of academia," and we both got excited. And just for reference, at that time I think we could generate maybe a single function, or maybe less than that. Those were the days. I think that we were actually the first to market with something that could do AI code completion. That's circa 2018-ish, so that was really, really cool.

BP Cool. All right, so you started in 2018, and we're now in 2024. How has the tech stack or the approach evolved internally? And talk to us a little bit about the product you have in the market right now, for folks who just aren't familiar with the company.

EY First of all, the first generation is really kind of the dog playing the piano. The fact that it generates code at all is so impressive that you are less concerned about the quality of the code. You're so impressed by, let's call it, the party trick. And I think in 2018, and even until almost a year ago, it was mostly that people were so impressed by the productivity you can get from this technology that we rarely stopped to ask ourselves, "Is this generating the code that we need, the code that we want, or just the code that it wants or is able to generate?" I think part of the maturation of the area is that now we're really questioning whether we're getting the quality that we need and how we get to a higher degree of autonomy. So we're also getting more ambitious in what we would like to generate, and as we get more ambitious in the amount of code that we want to generate and the scope that we delegate to AI, the question of how we can trust what the AI is doing becomes more front and center, for us at least. Just to level set on Tabnine: Tabnine is an AI assistant that helps you accelerate your work across the software development lifecycle, going from code generation, test generation, documentation generation, code review, et cetera, all the way from ideation to production, looking to accelerate all of that. And in the last year or so, we've been really focused on the enterprise, making Tabnine the AI assistant that enterprises can trust, and that means mostly two things. One is very deep organizational awareness: Tabnine really being able to understand the organization, how things are done around here, what existing code we have, and when you ask me to write something, should I use an existing microservice or should I recreate the code? The second thing is really quality and validation: how do I guarantee that what I got back from the AI system is of high quality and something that I can actually use?

BP All right, so this is the part where I sneak in a plug for Stack Overflow. How do you learn about what a company's internal rules or best practices are, how do you learn about its code base and how that's applied? We have Stack Overflow for Teams, which is kind of like a private Stack Overflow, where you might ask exactly that: am I supposed to use a microservice here, and if so, which one? What are our rules for API specs when I'm trying to connect with a new provider? And if that data, that information and knowledge, was ingested by Tabnine, then it would be able to give you the right answer right there in the IDE when you're getting to work. Is that the idea?

EY That's the idea. I think the two products are really complementary. Stack Overflow for Teams is really a place to curate, let's call it, the heavy head of questions, the things that everybody's asking. And there are maybe nuances, and some human has to weigh in to make sure that we get the nuance right. Maybe some architectural questions as well, which are hard to learn automatically from the code. But for the long tail of questions about how to do things in the org, like what code exists, and when I'm trying to generate something new, should I actually be using existing stuff, I think that's the place where Tabnine can really complement Stack Overflow for Teams. And the other direction is also true, where the Tabnine context engine can connect to Stack Overflow for Teams, and basically it learns from that and uses that as a baseline for some of the suggestions and decisions that we're making.

BP Let's say you go into a large enterprise and they say, “We're excited about Tabnine, it's going to help our developers be more productive,” and not just that, but hopefully the code you help them generate will be of a higher quality and more secure. And you say, “Great, the first thing I’ve got to do is get the context of your whole organization, like you said, so that I understand it inside and out.” And they provide you with access to– I'm going to name some services, maybe these are not all the ones, but a GitHub, a Jira, a Confluence, a wiki, you name it. What does the LLM do or what does the underlying AI system that you have do when some of that data is inaccurate or contradictory or out of date? How do you deal with errors in the source material, the grounding? 

EY We've dealt a lot with that. So first of all, the Tabnine context engine is a component that connects to all this source information, aggregates it, correlates it, and kind of tries to make sense out of it. I think the sophisticated customers understand that they have good projects and bad projects over their history, and you typically have the conversation of, "Hey, please connect Tabnine to these 30 million lines of code," and then they kind of think for a second and say, "Actually, no, 27 million of them, don't even look at those, they're legacy," and we need to break away from that and do some curation. But still, there are definitely some contradictions, even in the good projects, and some of that surfaces all the way to the human as two options for doing something and the human has to decide, but some of them may be resolved by majority or recency or some other heuristic.
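Purely as an illustration of what "resolved by majority or recency" could look like in principle, and not as a description of Tabnine's actual mechanism, here is a small sketch; the conventions and field names are hypothetical.

```python
# Hypothetical sketch of resolving contradictory conventions mined from a
# code base "by majority or recency," as described above.

from collections import Counter
from dataclasses import dataclass
from datetime import date

@dataclass
class Observation:
    convention: str   # e.g. "use-kafka" vs. "use-rabbitmq"
    last_seen: date   # when this convention last appeared in the history

def resolve(observations, strategy="majority"):
    if strategy == "majority":
        counts = Counter(o.convention for o in observations)
        winner, votes = counts.most_common(1)[0]
        # No clear majority: surface the conflict to a human instead.
        if votes <= len(observations) / 2:
            return None
        return winner
    if strategy == "recency":
        return max(observations, key=lambda o: o.last_seen).convention
    raise ValueError(f"unknown strategy: {strategy}")

history = [
    Observation("use-kafka", date(2021, 3, 1)),
    Observation("use-kafka", date(2022, 7, 9)),
    Observation("use-rabbitmq", date(2024, 1, 15)),
]
print(resolve(history, "majority"))  # -> "use-kafka"
print(resolve(history, "recency"))   # -> "use-rabbitmq"
```

The two strategies can disagree, as they do here, which is exactly why the harder conflicts end up being surfaced to a human.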

BP Okay. So I think one of the most interesting questions to me, and one that we've discussed a lot on this show, is the idea that with these AI agents out in the field, it's now possible for developers to produce a step function more code. These things could be running all the time, but more code is not necessarily better. To your point, if it's low quality, you're just going to have a lot of lines that in the future you consider tech debt, or that you want to deprecate, or that are creating issues with the parts that are high quality. So let's walk through a few of the things here. What is the eval method? What is the process Tabnine runs to try and ensure that the code you're suggesting or helping to generate along with a human programmer, in that sort of pair programming approach, is of high quality?

EY There's really a central question that you touched on: when I'm generating more code, am I generating tech debt or am I generating an asset? And it is a very real question. I think it's important to realize that it may be really high-quality code, but just not the way that we do things around here; foreign code that is high quality is still technical debt. It can be a high-quality replication of an existing microservice, and that's still technical debt. So quality is not just about looking at the generated code in isolation and asking, "Is that good?" It is also a question of the organizational context in which it operates. And also, you can't really ask an LLM, "Dear LLM, review this code and tell me what you think about it," because the judgment of the LLM is, to a large degree, immaterial. It's the judgment of the average of all knowledge of humanity, so to speak.

BP Right. Some people think the Earth is round, some people think the Earth is flat. What are you going to do? 

EY Exactly. Maybe the way we do microservices is not exactly the way that Claude Sonnet thinks. 

BP Tabs or spaces, there's personal preference. 

EY Personal preference and org preference, et cetera. So our approach to all of this is to learn the rules from the history of the org and show them to the human as part of the process, so the rules are extracted in a way that is visible to humans. Humans can look at them and say, "Yes, that's right. That's not right. That's right. That's right." Clearly, we don't always get it right, and the history, as you said, is full of contradictions. So maybe you learned the rule that we shouldn't use Kafka, but actually some of our projects use Kafka, so there is nuance there as well. And we surface these rules to the humans. Once humans approve them, they become enforced by Tabnine code review, and that is currently happening in the pull request or merge request but is very soon going to happen in the IDE as well. So it's a shift-left of that knowledge being enforced. We call the whole approach 'Tabnine Coaching' because it really acts as a coach. It really keeps you on track, not based on some amorphous best practices of the world, but on how we do stuff around here. And Tabnine also comes equipped with some built-in best practices for common libraries and common things. Again, you can choose what you want to enforce.

BP I like the idea of a coach because then the metaphor occurs to me, which is to say, “Hey, this team has a system and we built it up under this coach,” and it's not necessarily the best system for every team, but it works for you. And you might come in with a best practice, but it doesn't fit the system and so it's not a great addition, so you could think of the code in that way. Again, once you understand the culture or the system that they have, then you're in the best position to recommend something which is a quality fit. 

EY I think there's another subtle thing here, which is that when we say coach, we also mean that the human improves as part of that process. If you're a junior developer, you see the coaching that you get and say, "Oh yeah, this actually makes sense. This is how we should do things from now on." So it helps you onboard and improve within the organization; it makes you a better programmer on the team.

BP I can certainly see how it would be especially helpful, and this is true for Stack Overflow for Teams too, for the new hires. How can I dip into all of the institutional knowledge, the tribal wisdom that's inside of here without having to constantly poke someone on the shoulder who's been here for 10 years? That's a big advantage when you've got a new hire. 

EY Exactly right. 

BP Okay, so let's talk about security for a second. There's best practices you could bring, there's tests you could bring, but what is the approach to security to ensure that the code Tabnine generates doesn't introduce any bugs? That would be a quick way to get on the customer's bad side. 

EY Right now, the approach for security is within coaching. So within coaching, we have a bunch of security rules and best practices as well. We encode all of the common things– SQL injection, whatever, all the all-time favorites that never go away. 

BP No public API keys. I gotcha. 

EY Exactly. The coaching is not a replacement for a SAST tool or something that comes later in the process and goes way deeper on the security side, but it makes sure that you at least adhere to the obvious rules and don't make really fatal flaws that any sensible developer would observe immediately.

BP And so another question would be, of the new code that's being generated, are you also adding things like code comments or documentation? Does Tabnine do that? 

EY You have to decide as a human what documentation you'd like. Personally, I'm in the camp that thinks documentation should say why something has been done and not what it does, but there are many, many flavors and nuances to what people like to see as documentation. I think, interestingly, one of the things that is not released yet is a test agent that adds tests during the PR, so it really makes sure that your test coverage reaches a certain target by going through the existing tests and generating what's missing. That works really well right now, and I don't have a release date yet, but it's coming.

BP I think you could talk to your marketing team and maybe they can get back to you, but we need to come up with a catchphrase or a word that creates clarity around the idea that more code isn't better, but more robust code is unequivocally better. So quality is the right word for it, but to your point, nobody is going to argue about more tests if they're auto-generated and they catch 90% of the issues with a 5% error rate or something like that.

EY Even then there's nuance, because tests take time to run and they're not necessarily right. But generally I agree with you.

BP So let's talk just a little bit about the size of the company. Have you been growing rapidly? I know you raised a round last year. This is obviously a super exciting time for generative AI in the world of software development. Talk to us a little bit about what you're doing and how you stay competitive with some companies that are among the biggest in the world.

EY I don't think there is a lot of competition, but the distance from a demo to actually providing consistent value in an enterprise is very large. I think we've shown over the past few years that we are very consistently delivering value to the enterprise. It's not just doing the party-trick kind of AI code generation, and I think this will continue. I think our approach is really incremental, because we have, I don't even know the exact number of enterprise customers right now, but probably around a hundred or something. I genuinely don't know. That keeps you honest. You deliver features that are being tried in an enterprise on a daily basis, and our approach is incremental. For example, we released a test generation agent that creates a test plan and not just individual tests, and we see good adoption of that. And that is an incremental step over the version one agent that we had, which generated the individual tests. So it's almost like version one is individual tests, and version two is test plans that the user can edit and that generate the tests in bulk. It's a very incremental kind of process, always adding value and making sure that it is consistent with what enterprise users are looking for. Similarly for the context engine, it's getting better all the time. Initially it was looking just at, say, similarity via vector embeddings, then we started adding some semantic inference, semantic edges in the context engine, and then connected it to Jira. So it's a more incremental approach. I think this is what helps us be competitive, because we are very consistent in delivering this enterprise value.
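For readers unfamiliar with the "similarity via vector embeddings" step, here is a generic, self-contained sketch of that retrieval idea, with a toy bag-of-words embedding standing in for a real embedding model. It is illustrative only and says nothing about how Tabnine's context engine is actually built; the file paths and descriptions are invented.

```python
# Generic sketch of retrieval by embedding similarity: embed code snippets and
# a query into vectors, then rank snippets by cosine similarity to the query.

import math
from collections import Counter

def embed(text):
    # Toy embedding: bag-of-words term counts (a stand-in for a real model).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical snippets of organizational code, summarized as text.
snippets = [
    ("billing/invoice.py",  "create an invoice for a customer with line items"),
    ("notify/email.py",     "send an email notification to a recipient"),
    ("billing/payments.py", "charge a customer credit card"),
]

query = embed("create an invoice for a customer")
best = max(snippets, key=lambda s: cosine(query, embed(s[1])))
print(best[0])  # -> "billing/invoice.py", the most similar existing code
```

The "semantic edges" Eran mentions would layer relationships such as call graphs or issue links on top of this kind of similarity search, but pure embedding retrieval is the usual starting point.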

BP Do people ever come to you and say, “We'd like to use Tabnine for refactoring or to remove lines of code?” 

EY Absolutely, it happens all the time. In software, as in literature, writing is rewriting, so you're kind of refactoring all the time, and maybe you'd like the AI assistant to help you with that. I think it does that pretty well. For global refactorings, though, we're not quite there yet. We'll get there.

BP You have a hundred enterprise clients, you've been in the market for a while, you say you feel like you can deliver consistent value. What are the metrics that you use for ROI? Is it time to completion for a project? Is it developer satisfaction? Is it number of bugs reduced? I don't know. How do you measure it? 

EY Measuring productivity is tricky. I find that there are two approaches. Mature organizations may apply something like DORA metrics. That's not exactly measuring productivity, it's more measuring the maturity of the development team, but it's kind of a proxy for it. So you can see some improvement in release cycles and in the velocity of requests being merged, et cetera. But most organizations don't have that level of visibility, so then you have to resort to more activity-based metrics in the sense of, "Okay, how many lines of code were generated? How many lines of code were actually adopted? How many comments in the pull request were taken from the Tabnine coaching?" So it's more activity-based, but the ROI in this area is very obvious to anyone who's used it.

BP Well, sure. Hopefully your central customer is the head of engineering, but sometimes that person gets a question from the chief financial officer that says, “We've got too many seats and too many SaaS products. I’ve got to figure out how to trim 10%, so I want a one-page report that explains what this service is doing.”

EY I think that the price of AI assistants right now is lower than the coffee or the peanuts budget in the office. 

BP I don't know, have you been to the office at X lately? I think that it's pretty sparse. No, I'm just joking. So I just want to look with you out to the future for a minute. What are you most excited about that's coming in the next year in terms of what these generative AI products can do, but also how do you speak to the college student or the junior developer who's just getting started who feels like, “Wow, these AI agents can do most of what I can do or do it better.”

EY Yes, AI agents are going to replace a lot of the low level code generation tasks of software engineers. We have to be honest with ourselves that for many things, Stack Overflow actually replaced code generation in a lot of the previous generation because people were googling stuff and copy/pasting it. 

BP We were generating code through copy and paste. 

EY Exactly, and we still do. So really, when we look at software engineering as a job, I think the job description was never about generating code. It's always been solving business problems using software. And if you can do that without generating code, without writing code yourself, that's great, more power to you. And I think for certain applications, this is easier than for others. So can you generate your next React application using mostly prompts and not writing a single line of code? I think you probably can. Do you care about the finest details in the code of that React application? You probably don't, as long as it behaves and looks the way you want. These are the kinds of applications I expect to be automated first. But if you look at React versus the code of a nuclear reactor, so React versus the reactor, I would not generate the code for the nuclear reactor using prompts. It's much more nuanced. The code itself is an asset. You need to look at it and maintain it. You need it to be of high quality, and you need to convince yourself, and probably your review committee or something, that this code does what it's supposed to. There is also an opportunity here for engineers to learn from the AI. Maybe there are things that nobody would show you how to do; now the AI can show you how to do them, and you should apply critical thinking. You should say, "Oh yes, this looks like the right thing," or, "This does not look like the right thing."

BP It's a great point. I have read some interesting blog posts from software developers talking about that aspect of it, that it lets you level up faster, to your point about easier onboarding, and that it lets you reach farther into other domains. Maybe I've learned this one language but now I've moved to a company, they work in a different tech stack. It'll be easier for me to get up to speed quickly if I can rely on my coach. 

EY So it's interesting, just a tangent on academic life where people have asked, “How should we do homework?”

BP Oh yeah, you're a professor. Okay, so only in-person tests and pencil and paper. I got it. 

EY No, that's absolutely the wrong approach. The right approach is to say that generation is not the barrier now, and the homework should not be to generate anything. The homework could be to apply critical thinking to the code that you get and decide whether this is a good way to do it, a bad way to do it, or hard to interpret, which I think is the skill that is increasingly required and becoming the central skill as we move forward with code-generating agents.

[music plays]

BP All right, everybody. It is that time of the show. Let's shout out someone who came on Stack Overflow, contributed a little knowledge or curiosity, and helped to answer somebody's question. A Populist Badge was awarded two hours ago to Anders. A Populist Badge is earned when you give an answer to a question that already has an accepted answer, but your answer is so good that it gets more than twice the upvotes of the accepted one. So Anders provided an answer to "How to detect the current screen resolution?". Thanks to Anders, 120,000 people have benefited from your reply. As always, I am Ben Popper. I'm the Director of Content here at Stack Overflow. Find me on X @BenPopper. If you want to come on the program or there's something specific you want to hear us talk about, shoot me an email, podcast@stackoverflow.com. And if you liked the conversation today, do me a favor, subscribe to the show and leave us a rating and a review.

EY I'm Eran Yahav, I'm the CTO of Tabnine. You can find me online just by googling my name, or on Twitter with the handle @YahavE. 

BP And where should developers go if they want to learn more about Tabnine? 

EY Go to Tabnine.com. 

BP Great. All right, we'll put some links in the show notes.

[outro music plays]