The Stack Overflow Podcast

Exploring the magic of instant Python refactoring with Sourcery

Episode Summary

We chat with Nick Thapen and Brendan Maginnis, co-founders of Sourcery.ai, which runs in the background of your IDE and makes real-time suggestions for improving your Python code.

Episode Notes

Nick is now Sourcery's CTO. You can find him on Twitter here.

Brendan serves as Sourcery's CEO. You can find him on Twitter here.

You can try out Sourcery for free here and check out the company's open positions here.

Our lifeboat badge of the week, fittingly, goes to Martin Evans, for explaining how to parse an integer from a string in Python.

Episode Transcription

Brendan Maginnis Paul doesn't even need to send his code over to us. He can just install Sourcery in his IDE and Sourcery will understand the code as he's writing it, and offer those refactoring suggestions in real time.

Paul Ford Which IDE? Because I am in Emacs. [Brendan laughs]

[intro music]

Ben Popper Big Tech has advantages in budget and resources when it comes to building powerful infra, right? With CockroachDB, you can now build on top of that! The founders came from Google and basically built open source Spanner - but with a serverless option you can use for free at cockroachlabs.com/stackoverflow!

BP Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I am Ben Popper, the Director of Content here at Stack Overflow. And I am joined by my wonderful co-host, Paul Ford. Hi, Paul. 

PF Beeeen! 

BP So good to have you on. Paul, I know you let the world know that you're taking a new role at Postlight. And you're gonna be on the show here and there. But we may be parting ways in the near future. 

PF Still nearby. Not going very far. 

BP Still nearby. Yeah, not going anywhere. Friend of the show, always. So Paul, today we have two folks coming on. They're from a company called Sourcery.ai.

PF You better spell that.

BP Yeah, it's a good one. S O U R C E R Y.

PF It's a classic.

BP Classic black magic URL. Instant Python refactoring, making your code clearer and more concise, and more Pythonic. Oh, they've got lots of sort of neologisms here.

PF Well Pythonic is classic. I've been writing a lot of Python recently as part of my transition to a new role. I'm going to tell you, I'm someone whose code needs to be refactored urgently. [Ben laughs] This is great. 

BP Okay, good, there'll be a free demo after this. So we want to welcome to the show Nick Thapen and Brendan Maginnis.

Nick Thapen Hi there, thanks. Great to be on.

BM Hello!

PF Hello!

BP So yeah, why don't each of you in turn, Nick, we'll start with you. Just tell us sort of like who you are, you know, your role at the company. And then yeah, we can talk a little bit, for folks who don't know, at a high level, about what this stuff is. Like, what is refactoring? And why is it great? But first, introduce yourselves.

Nick Thapen Yeah, I'm Nick. I'm the CTO at Sourcery. That feels like a bit of a grand title, because there are only three of us at the moment. We are hiring.

PF Once you get to four, CTO is a valid term.

NT Yeah, that'll be a bit more real. So I basically do a lot of the coding, a lot of the creation of the refactoring engine. My background has been in software engineering since 2005. Then I did a Master's in AI in 2013. I did research in public college, on Twitch analytics and machine learning for two years. And we started Sourcery in 2018. And I've been friends with Brendan since 2007.

BM Yes, I met Nick at my first job out of university. And yeah, we worked together for many years, working on learning to code in the first place. And one thing that I just always loved, or I learned to love in programming, is refactoring. I remember the first time I wrote my first refactoring, and it was like, wow, I've actually improved this code that some expert has written before me, and then you slowly get better at it, you start doing bigger and bigger refactorings. And at some stage, the idea became, oh, I wonder if this process of refactoring can be automated. That's really the idea behind Sourcery.

PF Alright, let's break it down for people. When you say refactoring, so I've written some Python code, and boy have I, and it is messy. It's a mix of, I've identified, I've written some class files up top, and then I get to my main loop, and oops, now I've got about 50 things going on, or not loop, inside of my main, my main function, and I have everything in one big file that is the typical state that I leave my Python code in. And it sounds like you might have a way to help. So I'd be curious, when you say refactoring, what do you mean? Let's start from that point, here's my pile of messy Python.

BP Because, yeah, as a layman, when I hear it, I assume you mean something like editing a document, you know, for concision, for clarity, tuning a machine, so that runs a little faster, a little bit more efficiently. But what to you is sort of the heart and soul of it as you would explain it to a layman?

BM Well, if I was to explain it to someone who didn't write any code at all, I would think of it as you've written an email, or you've written an essay or something like that. And it's got all of the content in there that you'd like. But then you go through the editing process, you go through there, you restructure your essay. So things are in the right order, which makes it much easier to read and understand. And the argument is a lot stronger, and more direct. And if we think about that in source code, you've got your single main function, and it's got all of the logic in there. It's got all of the variables, and it's long and it's difficult to understand. And once you've written that, you can refactor it, you can go through there, you can split it out into smaller functions that each do a single individual thing, and are very clear, easy to understand and easy to read. And when you refactor code, it makes it much easier to read and understand. And therefore, the next person that comes along can understand your code and they can build on top of it.

NT I guess the other important thing with refactoring is the code still does exactly what it did before, it's just much easier for the next person to come in and change it.

BP Right. Yeah, if you're moving stuff around in an essay, you know, you're less likely to introduce an error or, you know, create some issue with a dependency. So let's pretend that Paul is sending his stuff over to Sourcery. What is it going to do? And how is that different than, let's say, what a human being would do? Or in what way is it emulating a person?

BM So Paul doesn't even need to send his code over to us. He can just install Sourcery in his IDE and Sourcery will understand the code as he's writing it, and offer those refactoring suggestions in real time.

PF Which IDE? Because I am in Emacs. [Brendan laughs] Assuming I already have a problem. Yeah. Yep. Damn it. There we go. Alright, so am I loading up VS Code? What am I doing?

BM So we do actually have an Emacs integration, it's not particularly strong. We haven't written it ourselves. It was actually contributed by someone else. 

PF Listen, Emacs users are used to this. Don't worry about a thing. If it only works about 1/3 of the time. That's still great. 

BM I reckon it's probably half the time. So yeah. You're a happy man. [Paul laughs] Yeah, so you probably have to do some config with Emacs. But we support VS Code, we support PyCharm. Those are the two main ones that we support. And because we support the language server protocol, we support other IDEs that have implementations for that. So that includes Emacs, it includes Vim, it includes Sublime. And so those are the five that we currently actively support.

PF You know, I think our audience would be really interested to learn a little bit about the language server protocol, what it is and where it comes from.

BM Yeah, so the language server protocol was an idea introduced by Microsoft for their VS Code editor. And the idea is, well, the problem that occurred in the past was, you would have all of these different IDEs and you'd have all of these different languages. And then if you wanted to get, let's say, code completion, or jump-to-symbol functionality in your IDE for a new language, you'd have to write a new implementation for each IDE. So you've got N IDEs, and you've got N languages, and you've got N×N implementations that you have to do. And so the idea is, you have a single protocol that IDEs can implement, and language servers can implement. And then each language can have all of this functionality available across all of the different IDEs with only one implementation. And that's a real time saving for people like us who are writing Sourcery. We only need to write against the language server protocol API. And then we can work in VS Code, Sublime, Vim, Emacs, anything else that uses it. So it's been really, really powerful.
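As a rough illustration of the protocol Brendan describes: the language server protocol exchanges JSON-RPC messages, each preceded by a Content-Length header. The sketch below frames a "go to definition" request of the kind any LSP-capable editor could send. The helper name `frame_lsp_message`, the file URI, and the cursor position are made up for the example; this is not code from Sourcery.

```python
import json


def frame_lsp_message(payload: dict) -> bytes:
    """Encode a JSON-RPC payload with the Content-Length header that LSP uses."""
    body = json.dumps(payload).encode("utf-8")
    header = f"Content-Length: {len(body)}\r\n\r\n".encode("ascii")
    return header + body


# A "go to definition" request. The URI and position are hypothetical.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "textDocument/definition",
    "params": {
        "textDocument": {"uri": "file:///tmp/example.py"},
        "position": {"line": 10, "character": 4},
    },
}

framed = frame_lsp_message(request)
```

Because every editor sends the same message shape, a tool only has to understand this one format rather than each editor's plugin API.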

PF It's one of those things I remember when I first learned about it, and I'm like, oh, they actually did that. Like, that's the sort of thing that people talk about for like 30 years is, why doesn't it work this way? And then LSP is one of those things where like, oh, they fixed it, they figured it out. Because all of a sudden, all this support started showing up in Emacs. And I was like, this doesn't belong here. This has never been like this. So okay, you know, I think it'd be good for the audience, like, describe good Python code, like, what are we trying to get to, and then let's take a second and kind of talk about how the refactoring helps you get towards that state. So So what makes good Python?

NT Yeah, so a lot of it comes down to short functions that are not deeply nested. The enemy of readable code is far too much nesting, especially in Python, where it's all whitespace: when you're four levels down, you have to go back and see which if conditions actually hold true. And then you get to things like doing things in the obvious way, don't be too clever with it. But the essence of it is good short functions, not repeating yourself all over the place, and not getting into deeply nested loops and things.
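To picture the nesting problem Nick mentions, here is a small made-up example: the same logic written with three levels of nesting and then flattened with early returns. The function names and the shape of `order` are assumptions for illustration only.

```python
def shipping_cost_nested(order):
    # Deeply nested version: each condition pushes the real work further right.
    if order is not None:
        if order["items"]:
            if order["country"] == "UK":
                return 5
            else:
                return 15
        else:
            return 0
    else:
        return 0


def shipping_cost_flat(order):
    # Same behaviour with a guard clause: handle the edge cases first, return early.
    if order is None or not order["items"]:
        return 0
    return 5 if order["country"] == "UK" else 15
```

Both functions return the same value for every input; only the second can be read without tracking which branch you are inside.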

PF You just described the opposite of my code. And this is great. I will say the Python syntax, because everybody's encouraging the, like, with x as x block structure. And then you nest those, and now you're like three levels deep. It's just like every now and then the syntax is just like, nah, you're going to be indenting. There's no escape for you. Okay, so, I have all those problems, nested functions, you know, unclear class structure, so on and so forth. I've opened up my code, and I've opened up your tool, what comes next?

BM So you can think of it like a pair programmer. You're writing your code. You've got a mate next to you. And let's say it's Ben. Ben's reading your code. He's a Python expert--

PF I mean, okay, alright, let's see how this goes.

BM And you're writing your code. And occasionally Ben will see an area for improvement. And he'll just tap you on the shoulder and say, why don't you do this instead? And so the way that works in Emacs or anywhere else is you'll get a little highlight on the first line of code that can be improved. And then you can hover your mouse over and you get a description of the change. So you have an English description at the top of what the change is going to be. And then you have the code there, you can see the before and after. And at that point, you make the review, you understand the change that has been made. And then you can just apply it to your code base when you're happy with it. And then you carry on programming. So it keeps you in the flow of the problem that you're working on. But it also keeps you writing high quality code all the time. So I think probably a good thing to do is just describe some of the refactorings that Sourcery can do. So you've got an idea.

PF Yes, absolutely. 

BM So the first thing to note is it mainly works at the function or method level. It's analyzing within a function or method and the refactorings are mainly in that scope. So yeah, we've got relatively small things like converting a for loop into a list comprehension. Or if you have, for instance, a counter variable that you're incrementing each time you go through the loop, we'll replace that with an enumerate call instead, which makes the code nice and clean. Another example might be, you declare a variable at the top of your function, and it's not used until several lines down. And we'll just move the variable closer to its first usage so that it's much more in your mind what it actually means when you come to read that line of code.
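The two small refactorings Brendan names can be sketched as before and after pairs. These are hand-written illustrations with made-up function names, not output from Sourcery.

```python
def squares_of_evens_before(numbers):
    # A for loop that builds up a list with append.
    result = []
    for n in numbers:
        if n % 2 == 0:
            result.append(n * n)
    return result


def squares_of_evens_after(numbers):
    # The same logic as a list comprehension.
    return [n * n for n in numbers if n % 2 == 0]


def numbered_before(names):
    # A counter variable incremented by hand on every pass.
    lines = []
    i = 0
    for name in names:
        lines.append(f"{i}: {name}")
        i += 1
    return lines


def numbered_after(names):
    # enumerate supplies the counter, so the manual bookkeeping disappears.
    lines = []
    for i, name in enumerate(names):
        lines.append(f"{i}: {name}")
    return lines
```

In each pair the behaviour is unchanged, which is the property that makes the change a refactoring rather than a rewrite.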

NT The interesting thing is refactorings are composable. So a lot of it is, we make one change, and that allows us to make another change: you can merge an if condition together and reduce that nesting, that might allow another change, we might remove some redundant logic. So the changes often have three, four or five steps in them. And then the diff is going to be one, and it's like, I composed these things together, and here's the new state of the code.
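One way to picture that composition is to write out each intermediate step by hand. The sketch below is a made-up example (the function names and the `user` and `post` dictionaries are assumptions): merging the nested ifs makes the flag assignment redundant, and removing that makes the temporary variable redundant.

```python
def can_publish_v0(user, post):
    # Starting point: nested conditions setting a flag.
    allowed = False
    if user["age"] >= 18:
        if post["status"] == "draft":
            allowed = True
    return allowed


def can_publish_v1(user, post):
    # Step 1: merge the nested ifs into one condition.
    allowed = False
    if user["age"] >= 18 and post["status"] == "draft":
        allowed = True
    return allowed


def can_publish_v2(user, post):
    # Step 2: the if only copies a boolean, so assign the condition directly.
    allowed = user["age"] >= 18 and post["status"] == "draft"
    return allowed


def can_publish_v3(user, post):
    # Step 3: the variable is returned immediately, so inline it.
    return user["age"] >= 18 and post["status"] == "draft"
```

A reader would only be shown the jump from the first version to the last, but each step on its own preserves behaviour.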

PF Why Python?

BM Couple of reasons. Firstly, Python is a nice simple language. So we had no idea whether it was possible to do Sourcery. So we wrote a prototype, and we wrote it in Python, because Python is nice and simple. The AST is simple, there's not many complexities that go on. And there's not many ways of shooting yourself in the foot that we would have to understand to make sure the code changes that we suggested were refactorings. So that's the main reason actually. And then, Python is actually very, very popular now. It's the second most popular language, I think there's over 10 million professional Python developers. And it's probably the one that's fastest growing, in that people learning to program tend to learn Python first nowadays.
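For a sense of how approachable the Python AST is, here is a minimal sketch using the standard library `ast` module to spot for loops whose entire body is a single `.append(...)` call, the kind of loop that is a candidate for a list comprehension. The function name `find_append_loops` and the sample source are made up, and this is an illustration of the idea rather than how Sourcery itself is implemented.

```python
import ast

# Sample code to analyse; line 1 of this string is blank, the loop is on line 4.
SOURCE = '''
def collect(values):
    out = []
    for v in values:
        out.append(v * 2)
    return out
'''


def find_append_loops(source: str) -> list[int]:
    """Return line numbers of for loops whose whole body is one .append(...) call."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.For) or len(node.body) != 1:
            continue
        stmt = node.body[0]
        if (
            isinstance(stmt, ast.Expr)
            and isinstance(stmt.value, ast.Call)
            and isinstance(stmt.value.func, ast.Attribute)
            and stmt.value.func.attr == "append"
        ):
            hits.append(node.lineno)
    return hits


loop_lines = find_append_loops(SOURCE)
```

A real tool would need many more checks before suggesting a change (for example, that the list is freshly created just before the loop), which is where the edge cases discussed later come in.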

BP And so I guess, you know, the system was trained on best practices through like, machine learning and reinforcement learning and some of the techniques that you might be familiar with?

PF No, wouldn't have to be. No, no, don't worry about that, that gets away, yeah, stay away from that.

NT We don't actually use any machine learning in the system. So it's sort of a set of rules of things that can be refactored, quite a large set of refactoring building blocks. And then it's like a search engine for combining them together to get a bigger change that we can make, guided by a lot of code quality metrics: here's where your code currently is, we think this is an improved state, we can suggest that.
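Nick doesn't say which quality metrics are used, so the following is only an assumed stand-in: a maximum nesting depth measure computed from the AST, which a rule-based search could use to decide that one version of a function is an improvement over another. The names `max_nesting_depth`, `BEFORE`, and `AFTER` are hypothetical.

```python
import ast

# Statement types counted as adding a level of nesting (an assumption for this sketch).
NESTING_NODES = (ast.If, ast.For, ast.While, ast.With, ast.Try)


def max_nesting_depth(source: str) -> int:
    """Return the deepest level of nested control-flow blocks in the source."""

    def depth(node: ast.AST, current: int) -> int:
        if isinstance(node, NESTING_NODES):
            current += 1
        deepest = current
        for child in ast.iter_child_nodes(node):
            deepest = max(deepest, depth(child, current))
        return deepest

    return depth(ast.parse(source), 0)


BEFORE = '''
def f(a, b):
    if a:
        if b:
            return 1
    return 0
'''

AFTER = '''
def f(a, b):
    if a and b:
        return 1
    return 0
'''

before_depth = max_nesting_depth(BEFORE)
after_depth = max_nesting_depth(AFTER)
```

Note that this simple measure treats an `elif` chain as nested, because the AST represents `elif` as an `if` inside the `else` branch; a production metric would likely handle that case separately.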

BP It's interesting, I was just talking to the folks from AWS, and they have this thing called, like, CodeGuru, which similarly, you know, as you're going along, will say, hey, this looks like, you know, it might not be a best practice, or, you know, this might be introducing a bug or a security flaw. And they said it was, yeah, like sort of trained on a large data set to make those kinds of suggestions. But it sounds like it might be a little bit more comprehensive or broad in the kind of suggestions it makes.

PF I can help you out, Ben. It's actually, so I'm reading through the refactorings on GitHub, and what you're looking at, so I understand where you're coming from. And I think a lot of our audience will be too, especially because GitHub just released Copilot or is about to release Copilot. And so there's two ways to look at this world. One would be I understand the internal logic of this programming language. And I also understand the patterns that people use. I'm going to look for the patterns they use, which I can do automatically by interpreting the language's syntax tree. I can watch as they program. And when I see one of those patterns, and I know there's a faster, better way to do it, I can automatically recommend that. That is sort of declarative and rules based. And then there is, like, this looks in a vector space similar to that. I think I should bring up a couple recommendations for what usually happens when these spaces overlap. And that's the machine learning approach. Both have benefits. This wouldn't scale; what Sourcery does is not for billions of different possible patterns. It's more for, which is very relevant to the world of refactoring, let's focus on code quality. And Python is great, because there actually is a sense of quality in the language. It's not just like, do whatever you want to do. I'm curious for you both, you know, you've gone very deep, probably deeper than you ever expected, inside of Python and sort of the language itself and how the language puts itself together. What would you change about Python? [Ben laughs]

BM I mean, to be honest with you, it's very, very good language now.

PF It really is! That's what I want to hear! Because I like it. And all the JavaScript people in my world are like, eh, I don't know.

BP Oh boy. So actually Python is perfect, is the takeaway from this episode?

BM No, no. It's not perfect. There is a small number of things that I would take away that are just oddities. So for instance, with a for loop, you can have an else branch. And it's just so confusing to think what that actually does. Yeah, and it basically means if you don't enter the for loop, if you never go through the for, you go straight to the else. That's normally not a problem. Because if I'm writing code, I'm never gonna write an else branch on a for loop. But for us, we have to understand all of the different ways that Python is actually used. And it's possible that someone might use this in the real world. So we have to understand, we have to analyze it and understand it, and make sure our refactorings make sense in the context of this.
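A small example of the for/else construct may help here. Brendan describes the case where the loop is never entered; more generally, Python runs the else branch whenever the loop finishes without hitting a `break`, and an empty iterable is one instance of that. The function name `first_negative` is made up for the sketch.

```python
def first_negative(numbers):
    for n in numbers:
        if n < 0:
            result = n
            break
    else:
        # Runs only when the loop ends without a break,
        # which includes the case where numbers is empty.
        result = None
    return result
```

So `first_negative([])` and `first_negative([1, 2])` both take the else branch, while `first_negative([1, -2, 3])` skips it because the loop exits through `break`.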

NT Yeah, I mean, another for-based one, and it's quite abstruse, is that the target of the for loop sticks: the last value it used is going to be a variable from then on. So people use that in their code, like they expect that to be true. So we have to account for that. So a lot of the coding is sort of, maybe there's this edge case where this is not a refactoring, so we can't suggest it to you, because we don't want to break your code and we only suggest things that are correct. And there are these little edge cases like that, things people wouldn't normally use, but just in case they do, you have to account for them. The other one is mutable default arguments, which are very rarely used properly.
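Both edge cases Nick mentions are easy to show in a few lines. The function names below are made up; the behaviour is standard Python.

```python
def last_seen(items):
    for item in items:
        pass
    # The loop target is still bound here to the last element.
    # If items is empty, item was never assigned and this line raises an error.
    return item


def append_shared(value, bucket=[]):
    # The default list is created once, when the function is defined,
    # so every call that omits bucket appends to the same list.
    bucket.append(value)
    return bucket


def append_fresh(value, bucket=None):
    # The usual fix: use None as the default and create a new list per call.
    if bucket is None:
        bucket = []
    bucket.append(value)
    return bucket
```

A tool that turned the loop in `last_seen` into a comprehension would silently remove the leftover `item` variable, which is why a refactoring engine has to check whether later code depends on it.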

PF I have to say, as we're talking, I'm looking at the current refactorings document on the Sourcery.ai GitHub Wiki. And I'm losing track of this conversation, because it's a great Python style guide. I mean, I know the computer would do it for me if I use Sourcery. But I'm just like, wow, there's dictionary comprehensions. I never knew that. So talk a little bit, because I share your fondness for Python. I think after a long time of programming in my life, I find it to be a very like, just the best possible compromise for a lot of situations, which is what I want out of my technology. Are there any libraries, any sort of like parts of the ecosystem that you wish more people knew about? Like for me, you know, every time I dive back into SQLAlchemy, I'm just like, boy, this solves a lot of problems, right? And then of course, they change it a lot. So that part's tricky. But like other things that you've come across in the world that you wish more programmers knew about?

BM Well, we have one that's very specific to Sourcery, right, because of the way we actually package it up and deliver it to you. So we don't want to ship our Python source code to your machine and run it on your machine. So actually, we use this library called Nuitka, which is spelled N U I T K A. It cross compiles your Python code to C code. And then it compiles the C code to a binary and packages in the Python version that you're using. And it transitively walks through all of your imports, and builds all of those as well. And so based on your virtual environment, it just builds everything into a single binary that you can ship to Windows, Mac, and Linux. It's incredible.

PF That is incredible. So wait, I could install--I can distribute Python apps with this tool?

BM Yeah. So we then also wrap it into a PyPI package as well, or a wheel. So you can just install it, and it'll run on your local machine, whatever OS you're running.

PF I mean, because that's always the gap, where's my Windows Installer?

BP So I had a question. You know, we were talking about how you started this with Python. But when it comes to future plans, you're mentioning, you know, it's sort of the second most popular and fastest growing. But yeah, what about, you know, helping folks with, you know, JavaScript or something else. Would that just be sitting down with folks who are great at JavaScript? And again, like Paul said, writing a great set of rules, you know, a style guide, and, you know, giving the system the rules it needs or is the nature of JavaScript versus Python make that a much more difficult task?

BM So we've still got a long way to go with Python. So as Nick was saying, all of the rules that we have at the moment are manually written by him and me. But we want to start actually automatically generating a bunch of these rules. And we're going to do that by mining GitHub, and then building a validation engine that can say, here's a piece of code, here's some code afterwards, is it actually a true refactoring? Because the most important thing that we want to do is make sure that we never break a user's code. So we've got a long, long way to go in terms of building out the Python refactoring functionality, both in terms of the number and scale of refactorings that we can do at the function level, but also the scope. So as I said near the start, we're mainly focused on the function level. Nick recently started building some class level refactorings, like extracting methods. And we've got a long way to go up that scope into class level, module level, project level. So really, we've got maybe a year, two years' worth of work on Python, to take it really, really far. And then we're going to start thinking JavaScript and Java. Those are our two next big languages to tackle.

PF What is some Python code that you think is exemplary? Like, I mean, obviously, you've spent time looking at sort of what makes good code? Is there a library out there that people should go and GitHub and read in order to learn how to be better Python programmers?

BM Yeah, I mean, there's a lot. I wish we could publish our source code. [Brendan laughs] Yeah, FastAPI is fantastic. Lovely code base. It's just brilliant. So it's taking advantage of all of the latest features of Python, so type hints, to write an HTTP web back end. And it's just exceptionally well designed and really, really easy to use.

PF There you go.

BP So I guess yeah, like, one other question I wanted to ask, before we wrap things up, is kind of, you know, a broader question that occurs to me, which is you're saying, you know, the one thing you don't want to do is make people's code worse or break it. But I guess, you know, as you look out, and maybe this doesn't apply to Python, but to what degree do you ever think about, like, things becoming increasingly uniform and homogenous versus differentiated and expressive? Like if everybody is going to start using a wonderful tool like yours, which would definitely help them and improve their code, but also, yeah, they're using AWS CodeGuru. And also, you know, they're using GitHub Copilot. Like, at a certain point, there's a lot of sameness that gets introduced to the system, or, you know, people who are learning are not necessarily learning some things, because everybody has this sort of AI pair programmer. You know, maybe that doesn't bother you at all, or, you know, is that something you think about?

BM Yeah, I can sort of imagine it. It's almost like when people move from writing assembly code to writing higher level languages, like I've never written assembly code. I don't understand things at that very deep level. And potentially, that affects some of my thought processes at this higher level. And so potentially, these tools are pulling up the ladder for the next generation, they don't get the chance to make their own mistakes and learn their own way. So yeah, there are potentially downsides of skipping straight to best practices, and not investigating all of the mistakes that you can make and understanding why they're problems.

NT Potential downsides, but you've got to think about the upside as well. You can skip straight to writing nice, brilliant code, everyone's gonna be able to learn a lot faster, and everyone will be able to pick up on the code bases much more quickly, which will be pretty awesome.

BP So yeah, we discussed that this would never break your code. But what if you give it code that's broken, will it fix it for you?

NT Nope. 

BP Nope, out of luck. Too bad.

BM So the goal is to keep the code identical in functionality. That's our main goal.

PF Ben, broken code doesn't parse, right? It's actually hard to sort of--you can't figure out human intention. It just doesn't, a computer can't do it. So you have to get to like a somewhat of a baseline before the computer can do it. IDEs also do a lot of that for you. They're like, I don't know what the hell you're trying to say. I'm gonna just put a red wavy line, you can figure it out, right?

BP I think we're getting close, where I do the description of what I want, Copilot will write that. I'll run it through CodeGuru first, and then I'll pass it over to Sourcery. And maybe at the end, we'll get something that actually works.

BM It'd be perfect!

NT They did that the other day. And I said, that's the perfect way of writing code.

PF I wonder. We'll see. [Ben laughs] Somehow the answer to like plain English coding or AI assisted coding, it's always right around the corner and then you play and then you find out like a month later that actually you have to do a lot of work no matter what. 

BP GPT-4, fingers crossed.

PF Yeah, for all that computers promise to do everything for you, they end up creating an enormous amount of work.

[music]

BP So Nick, Brendan, thanks so much for coming on. If folks wanted to try this out if they write Python and they're interested in learning more if they have a company where Python is the main language and they might want to use this in a business capacity, how should they check it out? You know, where can they get started?

NT Yeah, just go to Sourcery.ai, and you can try it out straight away in your IDE for free.

BP Alright, great. Well if that is the case, I will wrap up the episode. I usually shout out the winner of a lifeboat badge. Let's see if we got any new ones. Here you go. Awarded yesterday to Martin Evans. 'Python parse--'

PF Ah look at that, it just brings this together. You know what else is exciting? Sourcery.ai has every vowel in its name. [Ben laughs] You don't always see that. 

BM I've never noticed that before. That's brilliant. 

PF It's got them all. I'm literally sitting here as Ben is talking. It's got an A, an E, an I, an O--oh my goodness!

BP Good crossword puzzle. Good crossword puzzle word.

PF Yeah, that is exactly, that's the aspiration. Right? Like that's the goal.

BP Alright, great. Well, we'll say our goodbyes. I am Ben Popper. I'm the Director of Content here at Stack Overflow. You can always find me on Twitter @BenPopper. You can email us at podcast@StackOverflow.com with ideas or suggestions or if you want to come on the show. And if you liked the show, please do leave a rating and review on whatever platform you're listening. It really helps. Paul, who are you?

PF I'm Paul Ford, lifelong friend of Stack Overflow. You can get in touch with me @ftrain on Twitter. And check out my company, I co-founded it. It's called Postlight.

NT Awesome. I'm Nick Thapen. CTO at Sourcery. You can find me @NThapan on Twitter. Yeah, come check out Sourcery.

BM And I'm Brendan Maginnis. I'm the CEO at Sourcery. I'm on Twitter, @Brendan_M6S. Tell me all about your Sourcery experiences. I'd love to know them. [Brendan laughs]

PF And you know what, if you're out there and you haven't tried Python, and you're an engineer, just give it a go. It's not a stressful on-ramp. It's not like JavaScript where you're like, what the hell is happening? You'll understand it and you'll love it.

BM There's excellent, excellent tutorials out there as well. Very, very good.

PF They truly are.

BP Alright everybody. Thanks for listening. We'll talk to you soon.

[outro music]