The Stack Overflow Podcast

Using AI to fake your own voice, podcasting has never been easier

Episode Summary

We chat with Andrew Mason, a serial entrepreneur who has worked on Groupon, Detour, and now Descript. His latest company harnesses machine learning for automatic transcription, but also for more unique features. Messed up a line or mispronounced a name? Descript can generate new words from scratch that match a speaker's voice.

Episode Notes

Mason began his career as a developer, went on to be a CEO, but also found time to produce an 80s alt rock album full of advice on how to run your startup.

Slack began life as a video game company, eventually pivoting to make an internal chat tool it had built into its main business. Descript had a similar journey, taking the editing software Mason and his team developed at Detour, and moving it to become the center of a new business after Detour was acquired by Bose.

Headquartered in Montreal, Lyrebird is the AI division of Descript. It was founded by PhD students studying under Yoshua Bengio, who won the Turing Award in 2019 for his pioneering research into deep learning and neural networks.

Our lifeboat badge of the week goes to Avinash, who explained what to do with an invalid syntax error that arises while running an AWS command.

Episode Transcription

Andrew Mason And so we're trying to take the best of text and bring it to audio and video, and just make a better and better editor, something that anyone can use. But that doesn't have a trade off for speed and power and flexibility, much like a word processor.

[intro music]

Ben Popper Couchbase is the SQL friendly, NoSQL JSON document database. The Couchbase Java SDK recently added great new features for Java developers like you. Check out couchbase.com/StackOverflow for sample apps and tutorials.

BP Hello, everybody. Welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I am Ben Popper, the Director of Content here at Stack Overflow, and I have with me as I often do, Cassidy Williams of Netlify. Hi, Cassidy.

Cassidy Williams Hello, glad to be here.

BP For the folks who don't know, tell them who you are and what you do.

CW I am currently Director of Developer Experience at Netlify and I like to make memes on the internet.

BP We have a great guest today. And we're going to talk about a product that both you and I have quite a lot of love for. Andrew Mason, the CEO of Descript, is going to be joining us. He has had a varied career, he's done a bunch of different companies, some of which you've probably heard of. So we want to welcome him to the show and chat about, yeah, some of the very cool technology that's behind Descript. Andrew, welcome to the Stack Overflow Podcast.

AM I'm really happy to be here. Thanks, Ben.

BP So for folks who don't know, just give them a little bit of background, yourself as an entrepreneur, somebody who's interested in software and technology, what should they know about you as we start this conversation?

AM Yeah, well, I'm probably best known for my hit recording album called Hardly Working. And then after that, you know, I was a podcaster. Well, I was an aspiring podcaster, there was a podcast that I wanted to make, but I never felt like I had time. So I had an idea for a new way to edit podcasts, which would be by making it work like a Word document, basically, where you could edit audio by editing text. So we built that, that's called Descript. And that was kind of the first version of it. And we've now expanded it to where it works with video. So it does everything from basic screen recording all the way up to full on moviemaking, where you're adding titles and pulling together many different clips. It's really a full featured multitrack editor both for audio and video. But the key idea is that all of your media is turned into something that looks like a document. But when you edit the document, you're also moving the media around. You can even type new words in there, and it'll generate audio in your voice with the generative audio feature we have called overdub.
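
A minimal sketch of the "edit audio by editing text" idea Mason describes here. It assumes each transcribed word carries start and end timestamps, and that deleting words from the document translates into a list of audio ranges to keep. The `Word` type and `keep_ranges` function are hypothetical names for illustration, not Descript's actual data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """One transcribed word with its position in the audio (hypothetical model)."""
    text: str
    start: float  # seconds from the beginning of the recording
    end: float


def keep_ranges(words, deleted):
    """Return merged (start, end) audio ranges that survive after deleting
    the words whose indices are in `deleted`."""
    ranges = []
    previous_kept = False
    for index, word in enumerate(words):
        if index in deleted:
            previous_kept = False
            continue
        if previous_kept:
            # Extend the current range so adjacent kept words play without a cut.
            ranges[-1] = (ranges[-1][0], word.end)
        else:
            ranges.append((word.start, word.end))
        previous_kept = True
    return ranges


words = [
    Word("so", 0.0, 0.3),
    Word("we", 0.3, 0.5),
    Word("um", 0.5, 0.9),
    Word("built", 0.9, 1.4),
    Word("that", 1.4, 1.7),
]
# Deleting the word "um" in the text leaves two audio ranges to splice together.
print(keep_ranges(words, deleted={2}))
```

Keeping the media untouched and deriving a cut list from the text means an edit can be undone simply by restoring the deleted words.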

BP Yeah, it does some kind of magic stuff. There's the room tone, there's the ability to, you know, screw something up on a podcast, and then dub it in, in your own voice, getting into, you know, like, interesting, deep fake territory there. What was the MVP? Like, did you build that? And did you find a group of folks to work with when you kicked that off? How did you get this project started?

AM Well, as these sorts of things often go, it started out as something else. I was building a startup called Detour, which was basically an audio tour app and half the company was made up of engineers that were building the app and half the company was made up of ex public radio producers. And so we built it as a tool for making those audio tours, which were glorified podcasts. And it was through that and seeing how hard it was to use the tools that exist today for making a podcast that are all really designed for making music first and foremost, that we thought, gosh, it would be a lot easier to do this stuff if it just worked like a word processor. And isn't transcription really good now, couldn't we build something like that? So the MVP was really just single track audio editing, it didn't have recording or multitrack, or effects or anything like that. But that was enough to validate the idea.

BP Cassidy, I know you mentioned you've been playing around with this. What have you been utilizing it for? And as you're using it, you know, from your perspective, are you thinking like, was it clear to you how the technology worked? Or would you have to like peer under the hood a little bit here?

CW You know, audio stuff blows my mind. And so I, I'm sure that there's a lot of very interesting code happening behind the scenes that I just don't fully understand. But what I thought was really, really cool about it was being able to remove filler words. And that was the big selling point for me, where I could get rid of all of the like 'uhs' and 'ums' and 'sos'.

BP I'm doing it right now. These are like, I do the filler words that like encourage the other person. [Cassidy laughs] Mmmm. Mhm.

CW Mmm. Yeah. Being able to just say remove filler words and getting rid of them shocked me, and that just kind of immediately sold me on it. And I had first seen it because a friend of mine runs a podcast and he was like, oh, yeah, I'm trying out this newer tool. And when he gave me a demo of it, it just blew my mind. Yeah, once I realized that overdub was a thing, that just kind of, I'm amazed with the product in general.
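
As a rough illustration of the "remove filler words" feature discussed above, the sketch below flags filler tokens in a transcript so they can be deleted. The filler list and the `filler_indices` function are assumptions made for this example. A real tool would need context to decide whether a word like "so" is filler, which this sketch does not attempt.

```python
import string

# Hypothetical filler list; words such as "so" or "like" are left out on
# purpose because they are only sometimes filler.
FILLERS = {"um", "uh", "er", "mhm", "mmm"}


def filler_indices(tokens):
    """Return the positions of tokens that are filler words, ignoring
    case and any punctuation attached to the token."""
    return [
        index
        for index, token in enumerate(tokens)
        if token.strip(string.punctuation).lower() in FILLERS
    ]


tokens = "So, um, we built, Uh, an editor.".split()
print(filler_indices(tokens))
```

The returned indices identify which words to delete from the transcript, and with them the matching stretches of audio.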

BP So yeah, Andrew, tell us a little bit like what is the tech stack here? I had the chance to talk with some folks recently from Rev.com. So I know they're somehow involved; they mentioned Descript. But yeah, like, what's under the hood, to the degree that you can describe it without giving away the secret sauce?

AM Well, the app is all built using web technology. But when you're working with audio and video, there's still some stuff that is difficult to do well if you're working purely in a browser, so Descript runs primarily in an Electron app. Parts of the experience can run directly in the browser, like basic playback and collaboration, commenting, but to use the editor, you need to be in our Electron app. And yeah, we outsource the actual transcription. But most of the other stuff, the AI stuff that we have for overdub, to generate audio, is all in house.

BP And so I guess you mentioned, right, working with audio can be challenging, and then it comes to video. Like how do you make it cross platform? Like in some ways, those two things don't seem like they would be easy to combine. You know what I mean? Like removing filler words or doing overdubs? I haven't really played around with the video side of it because I'm just using a product but can I overdub in video? Can you like replay it? Can you drop in some frames there? Like if my smile isn't quite right. Like are you applying the same practices to both? Are they a little bit different?

AM Yeah, they're similar. So you could just remove all the filler words and it'll put jump cuts in there. A lot of the types of videos that people are making with Descript are screen recordings, for example, where the person might not be on screen or they might just be in a little preview webcam view in the corner of the screen. So it's fairly unnoticeable, but even where it is noticeable, jump cuts are, we've found with customers, just more of an acceptable style these days if that's what you're going for, and then with overdub that works as well. And one of the most common uses for overdub is not necessarily to have an audio book or a full voiceover read in your voice. Although it works great for that if you want to do like a video voiceover or something. But what we see a lot is using it for corrections. So if you realize later that you want to make an editorial correction and you find a better word for something, you can just delete it and replace it with a different word in Descript. The overdub is contextual to the audio that came before and after. So it blends everything in in a way that's consistent with the way that you're speaking in that particular sentence. And that works with video as well. And what we'll do is we'll speed up or slow down the video by the slightest amount to make it all fit in. And again, when you're doing screen recordings or something like that, it tends to be a pretty seamless workflow.
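
The last step Mason mentions, nudging the video speed so a replaced word fits, comes down to simple arithmetic. The sketch below assumes the video segment is retimed so it lasts exactly as long as the newly generated audio. The `video_rate_for_replacement` function and its `max_change` threshold are hypothetical; the transcript only says the adjustment is kept slight.

```python
def video_rate_for_replacement(original_s, replacement_s, max_change=0.15):
    """Return the playback rate that makes a video segment of `original_s`
    seconds last `replacement_s` seconds, or None if the change would exceed
    `max_change` (an assumed limit on how noticeable the retiming may be)."""
    if original_s <= 0 or replacement_s <= 0:
        raise ValueError("durations must be positive")
    # A rate above 1.0 speeds the video up; below 1.0 slows it down.
    rate = original_s / replacement_s
    if abs(rate - 1.0) > max_change:
        return None
    return rate


# The replacement word is slightly longer than the original, so the video
# segment has to play a little slower to cover it.
print(video_rate_for_replacement(0.50, 0.55))
```

When the function returns None, a caller would need some other strategy, such as a jump cut, which is outside the scope of this sketch.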

BP So for this episode, what I was thinking I would do is like take some of the words you say and just, like a Mad Libs, replace them with nonsense words. See to what degree Descript can make it sort of like sound, you know, completely seamless. So some of what you're saying now is gonna make sense to the audience, some of it is gonna make no sense at all. But it'll all sound like we recorded it live. That's the idea. That's the goal, right?

AM Well, you won't be able to do that because you can only use overdub on your own voice. So we have a verification process.

CW Dangit.

AM And unfortunately, I could, I could share my voice with you and let you do that. So we'll see how it, we'll see how it goes.

CW Kind of like that, though, that there's that verification because then you can make sure that someone isn't making something that you don't want out of your own voice.

BP Right. And I have a few banks or like different services where I do like a voiceprint identification. So I guess that is a pretty serious biometric. To have some security right?

AM Yep. Yep.

CW What are some of the coolest use cases you've seen where people have taken Descript and just run with it?

AM Well, with overdub in particular, there's podcasters that use it for kind of like a scratch track, especially if you're doing a long form, scripted podcast, where you're pulling in a lot of interview tape, and the host reading their lines is often something that happens at the very end of the process. And the cool thing about Descript is you can just write your voiceover track around your show, and start hearing it right away, start doing your sound design right away. And then only at the very end, after you've gone through all the different stages of editorial review, you bring the host in to record, and that means less retakes down the road, less time you need to spend in the studio, going back and doing the same thing again and again. We've also seen a bunch of people using it for voiceover for tutorial videos and things like that where honestly, on some of these videos, if somebody didn't tell you that it was a synthetic voice, you wouldn't know. Like, it's the kind of thing that when someone points it out, you can tell for the long form. For the short form, like individual word correction, most of the time, it's seamless. Like you, you really would have no idea. It's that good. For long form, if somebody is like, hey, this is synthetic, can you tell? You'll be like, Oh, yeah, it's getting, they're saying the same kind of tonality in every sentence after you've heard enough of them, but if somebody doesn't point it out, you kind of don't notice, and it's really useful for that kind of thing.

BP And all you have to do, you know, a couple generations down, you introduce a little, like, randomness little noise in there. So that people can't realize it's it's a machine or—

CW Have someone cough in the background. Yeah.

BP Yeah, give it that, you know, little butterfly effect that you need to make sure. That's very cool. Yeah, I mean, I think a lot about right, like how we do filming for big movies now, a lot of times you'll be acting a scene, nobody's on the other side, or during the pandemic, like lots of people, you know, did stuff from home where they just like, sent in their face, you know, like, they'd be like, we're gonna do this commercial, this big athlete, and they just record their face and do the whole thing with their body. So it's part of that whole wave of kind of amazing technologies that, yeah, let you ghost in kind of a version of somebody, and then they can do it later. I like that. Scary but fun.

CW For content creation, just being able to scale yourself. That's so great, because I can't tell you how many times I've wanted to make some kind of online course and I'm just like, okay, but I have to set aside like an hour just for recording, probably more, an hour is optimistic. And then I've got to edit all the videos, I've got to make sure words line up with everything. And so being able to have a tool like this, where I can remove the filler words and overdub certain parts that I've loved and probably write a script and add in my own human voice here and there, that would speed up my processes so much.

AM Yeah, it really does change the way that you go about making stuff. Like I've been making videos for, I don't know, over a decade for the various startups that I've done. And editing is such a burden that you do everything that you can to avoid it. Like we kind of think of it like the difference between the typewriter and the word processor. You can edit with the typewriter but you'd rather not, right, it's like, you try to get it right the first time. But with overdub when I'm recording, or sorry, with Descript, when I'm recording a voiceover for a release video or something, I'll write my script, I'll read it, I'll mess up a bunch of times along the way. But it's all good, I just keep going. Because it's so easy to just go back and delete the mistakes, you're just looking at the text and you don't even need to listen to be able to eyeball it and fix it up. Or you'll be just recording from a very rough outline, and you kind of riff on what you want to say in real time. And again, it's so easy to edit that you find yourself being much more loose with what you're recording. And then even when you have something more specific, the ability to see the text keeps you in your editorial brain. You're not flipping to a timeline and staring, like the Matrix, at this waveform that's putting you into a technical mindset. You can just stay purely in a creation mindset, and that really makes a difference.

BP Is there a minimum amount of time that it like needs to listen to my voice to be able to learn it? Like does it need a certain amount of data in order to do a good job with like an overdub or it just—yeah, how does that work?

AM Yeah, like, the way to think of it is that your chance of having a good sounding voice increases the more audio that you provide up to an hour or so. So if you've given us five minutes of audio, it might sound good, but chances aren't great. If you've given us 10 minutes of audio, there's a pretty good chance it'll sound good. If you've given us 30 minutes, then it's going to sound good, in all likelihood. But it's I don't know, it's the mystery of AI, it's just not as deterministic as we wish it was.

BP And so yeah, I guess another thing that I thought was really interesting was, yeah, like its ability to identify different people and then sort of be able to tease them apart. And that's something that, you know, seems like it's really powerful within the context of podcasts, especially if you're doing it over and over again, there's a group of two or three people, right, the ability to sort of like for the machine to gain clarity around who's speaking to separate them out to know different voices and be able to imitate different voices. It's pretty heady stuff. So very cool.

AM Yeah, we have speaker identification, which is increasingly standard in most transcription, automatic transcription tools that you'll find. But the other thing that we do that's super cool and hard and uniquely us, is for a podcast like this, we're recording on three separate audio tracks, and you can pull those all into Descript separately, transcribe them separately, and then we combine them into something that for the most part operates as if it's one document, one object. So the speaker labels are put in there when the different people are talking. And then if you have an instance of over talk or something like that, you can always drill into the individual tracks and edit them. But that process of kind of dynamically synthesizing multiple transcripts into a single editable document is a really cool thing about Descript.
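
One way to picture the multitrack merge Mason describes is to interleave the per-speaker transcripts by timestamp and group consecutive words from the same speaker into labeled turns. This is a sketch under that assumption, with hypothetical names (`TimedWord`, `merge_tracks`). It does not handle over talk, and it is not a description of how Descript actually does it.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class TimedWord:
    """A transcribed word tagged with the track (speaker) it came from."""
    speaker: str
    text: str
    start: float  # seconds from the beginning of the session


def merge_tracks(tracks):
    """Merge per-speaker transcripts into one list of (speaker, text) turns.

    `tracks` maps a speaker label to a list of (text, start_seconds) pairs,
    one list per separately recorded audio track.
    """
    words = sorted(
        (
            TimedWord(speaker, text, start)
            for speaker, track in tracks.items()
            for text, start in track
        ),
        key=lambda word: word.start,
    )
    # Consecutive words from the same speaker collapse into a single turn.
    return [
        (speaker, " ".join(word.text for word in turn))
        for speaker, turn in groupby(words, key=lambda word: word.speaker)
    ]


tracks = {
    "BP": [("Hi,", 0.0), ("Cassidy.", 0.4), ("Great.", 3.0)],
    "CW": [("Hello,", 1.5), ("glad", 1.8), ("to", 1.9), ("be", 2.0), ("here.", 2.1)],
}
for speaker, text in merge_tracks(tracks):
    print(speaker, text)
```

Because each word keeps its track label, an edit to the merged document can still be traced back to the individual track it came from.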

BP Yeah, it's perfect for this remote world. Well, that's exactly what we do, although I send it to our producer and she correlates it. But knowing that you can do that, if you don't have a producer on your side is very, is a very neat feature.

CW In an ideal world, where do you want to see Descript go? Like do you want, do you want to see it go into any other types of editing? Do you want to see it go change certain features of how y'all edit so far?

AM We're an editor, we think that editing is the missing link for really making audio and video a ubiquitous communication medium, as opposed to something that's really just the domain of specialists, it's getting easier and easier to capture this content. But editing is holding it back. Like screen recording is a good example where it's really taking off as a communications medium. But at some point, you get tired of listening to people's rambley extemporaneous takes on whatever the issue is, and you want them to go back to email where they have a backspace key. And so we're trying to take the best of text and bring it to audio and video, and just make a better and better editor, something that anyone can use. But that doesn't have a trade off for speed and power and flexibility. Much like a word processor.

BP Yeah, Cass and I were talking about this the other day, just like the Gmail auto suggest, and to what degree we're comfortable with, like, like letting that write our emails for us. So I like the idea, you know, you're doing a screencast and notice, as you're starting to get boring, you're off on a rant or on a tangent, it's just like—

CW I'm gonna cut you off. [Cassidy laughs]

BP Yeah, Descript knows when you're interesting and when you're not, and it helps you to stay interesting. Most of the time.

AM Yeah, I mean, even like, the bar is lower for that for where we need to take video to and audio, which is like, we mostly just want to get the technical stuff out of the way for users so that they have some agency in their craft. Right, if you're a writer, you have it made because you learn how to type. And that's it, the rest of your career is dedicated to your craft. But if you work in audio or video, it's all about the tools and you know, mastering the tools and keeping up to date with the latest generation of tools and how they're evolving and don't step away from them for a year because you're gonna forget how to use them. And so that's what we want for audio and video, that same thing that people that work in text have. It's mostly just about giving them agency in their craft, not necessarily even taking on editorial decisions, or creative decisions, like that would be great too, I guess. But just letting people say what they want to say more easily is a pretty good start.

BP Or we could all just go back to blogging.

CW Yeah!

[music]

BP Alright, today, we are giving out a lifeboat to Avinash: 'Invalid syntax error on running AWS command.' If this has happened to you, we might have some information in our show notes that can help you out. So thanks to Avinash, who gets a lifeboat badge: took a question with a score of negative three or less, got it up to a score of three or more and an answer score of 20 or more. I am Ben Popper, Director of Content here at Stack Overflow. You can always find me on Twitter @BenPopper, you can always email us, podcast@StackOverflow. And if you liked the show, leave a rating and review, it really helps. Cassidy, who are you and where can people find you on the internet?

CW I'm Cassidy Williams, Director of Developer Experience at Netlify. You can find me @cassidoo on most things. You can also look up Cassidy Williams, and there's me and a Scooby Doo character and I'm not the Scooby Doo character.

BP Andrew, tell the people who you are, where you can be found on the internet if you want to be found.

AM I'm Andrew Mason. I'm the CEO of Descript. And we have a Discord community that I hang out in, you could find me there.

BP Was it really you in there?

AM Yeah!

BP Or was it somebody typing in your voice? [Cassidy & Andrew laugh]

AM It's me!

[outro music]