The Stack Overflow Podcast

The AI that writes music from text

Episode Summary

The home team discusses why it seems like everybody needs subtitles now, the AI that generates music from text, and a list of open-source data engineering projects for you to contribute to.

Episode Notes

It’s not just you: We all need subtitles now.

Google introduces MusicLM, a model that generates music from text. The examples are pretty mind-blowing and raise big questions about licensing and copyrights for non-AI creators.

Taking the uncanny valley to a new low? Nvidia’s streaming software now includes a feature that deepfakes eye contact.

Beware the potentially dangerous intersection of AI and stan Twitter.

Thanks to Siavash Kayal, a fan of the show and data engineer at Cleo, who sent along a great list of open-source data engineering projects folks can work on.

Today we’re shouting out Stellar Question badge winner Paragon for asking how to Open two instances of a file in a single Visual Studio session.

Episode Transcription

[intro music plays]

Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast. It is your home team crew: myself, Ben Popper, along with my wonderful collaborators, Cassidy Williams and Ceora Ford. How’re y'all doing? 

Ceora Ford Good, how are you?

Cassidy Williams Hello!

BP Good. Nice to see y'all. Right before we came on I was doing a bit of mumbling and I think it's in the air. We all read this great story this week– why everybody needs subtitles turned on all the time no matter what they're watching. Somebody explain to me what's going on here. 

CW Yeah, so this is a link that I shared in the actual Stack Overflow newsletter, and I think a lot of people watched it separate from that too. I don't know about y'all, but no matter what I watch I have subtitles on now. And for a while I was just like, “Man, I didn't ever really need this as a kid.” I use them now and I never miss anything. It's really nice. I like subtitles even if it's in my native language. And turns out it's a thing. It's not just me, everyone. In the past, people who were acting in movies were often very Broadway theater trained and stuff, and they enunciated every single thing and it was a declaration. And so on that end, on the actor's end, they just performed differently. And then on the actual sound engineering end, there used to be just one track, or there was a much more simplistic way of recording audio, but now we have much more advanced ways of being able to get every single mumble a person could possibly say and get the microphones closer than ever, get multiple tracks mastered and put together. And so as a result, because we have this technology of being able to listen to anything, unfortunately that means that actors don't have to declare every single loud statement they want to say. They can mumble just as they would in daily conversation. And so captions are really helpful if you have no idea what they're saying. That is what this article is all about. 

BP There's technology here at work. You would watch these old movies, and they showed some, and there's a boom mic right above them and everybody would have to speak. And then you notice now, once they show the boom mic, everybody's blocking on stage so their faces are towards the mic. And they even parody that in this movie where the woman keeps turning her head. I'm doing it and she's [audio cuts out] and then she's [audio cuts out] off talking, and it's like that. But now everyone's wearing a lav mic, plus you've got boom mics, plus you've got a million other things. And I was just watching All Quiet on the Western Front, and I noticed in multiple scenes after watching the Vox video, people saying a line with their head turned to the side completely away from the direction of the camera and it was muffled. And I was like, “Oh, there it is.” But now I see it. I was starting to notice it in the movies once I watched that video.

CF Yeah, I know the video also mentioned a couple actors who are notoriously super quiet anyway. I can't remember what actor they named specifically. 

CW I think Alec Baldwin was who they talked about. 

CF Yeah, who had a rep of being a whisperer. So all that on top of each other kind of makes it so that basically you're not crazy, people are mumbling and the audio quality or the sound engineering is just different now, so you really can't hear or differentiate the words people are saying. It's not you losing your hearing or something like that. 

CW Right, or your focus in this era of technology.

CF I was going to say, I do think it helps if you have ADHD specifically to watch with subtitles, because giving your brain something else to do in addition to watching the movie or the TV show, also reading the subtitles at the same time, I feel kind of helps keep your brain active so you have an easier time.

CW Yeah, you’re doing something while watching. 

CF Right, exactly. So it's easier to pay attention. I find that's why even if I'm watching an animated show or movie, I still turn on subtitles because it helps me to stay focused, although it has none of the issues with the microphone being wherever and people turning away and everything. I still like to have subtitles because it helps.

BP Part of this is a technology thing about the improvements in the microphones. Part of this is that they're like, “Oh, well we can fix it in post now because we have the technology.” And I've recently become obsessed with Serato, the DJ equipment. I don't know if I mentioned this already, it has automatic stems, so you can go in and just grab the acapella, just grab the bass, just grab the drums off the track, even if they weren't separated. The machine can do that and so that lets actors cheat a little. But to Ceora's point, the other thing is that everybody now is watching these Oscar-nominated two and a half hour films on their phone in 10 minute chunks while they're doing something else, and so subtitles are part of that multitasking of consuming media. 

CW I watched a bunch of movies with my family just over the holidays, like old Christmas movies and stuff like that. And it was interesting because they are all older movies where they are the theater-trained people, like A Christmas Story, for example, or White Christmas. Those kinds of movies where everybody's kind of shouting, everyone's kind of declaring what they're saying, everyone's kind of facing the camera just enough. And you do notice a difference where we didn't necessarily need captions. We didn't turn them on, mostly because my parents were just like, “Why would you throw captions on?” They don't understand why it's a thing. But then when we would turn on newer movies, for example, we watched the new Glass Onion movie, that Knives Out movie on Netflix. I was just like, “Could we please put captions on, because some of these people are a bit quieter than others,” and now I know it's by design.

CF And another thing which again, has nothing to do with the technical side of this, is watching stuff with people who have different English accents. So if I'm watching a British TV show, especially depending on where in England it's based, I absolutely one thousand percent have to have subtitles because I won't understand what's going on otherwise. So that's another huge part of it now. I don't know what it was like 50 years ago, but I feel like especially now, we have so much more access to media outside of the United States. I've watched so many British mysteries and TV shows and things where it's based out of Manchester or Birmingham, stuff like that where they have these super British accents that for me are hard to understand and I'm like, “Maybe that wasn't so much of a thing 60 years ago,” or whatever.

CW That's very true– the internationalization of it. Very similarly, I watched Derry Girls, which is an Irish show. If I didn't have captions I would have no idea what they're saying. And it's a very funny show, and it reminds me of a quote that I think the director of Parasite said. He was just like, “It's amazing how much of the world opens if you can get over the fact that there's a one-inch caption on the screen,” or something to that effect. If you're willing to watch captions then it opens up your opportunity for watching certain things from all around the world. And so I do think that's a very true point.

[music plays]

BP A new hub for the platform engineering community. The Upbound Marketplace houses everything you need to upscale your infrastructure with Crossplane without having to replace everything you already have. Start integrating with your stack today at marketplace.upbound.io. 

[music plays]

BP So I wanted to segue us a little bit into something I shared that I thought was interesting. Google has been releasing a lot of AI research. They released a huge sort of overview of everything that they've done, and then they've been doing a bunch more. They want to make sort of their presence in the field felt strongly. So it's called MusicLM: Generating Music from Text. And you can't use it, you can't just put in your own text and have it generate music yet, although hopefully they will. But the samples that they included are amazing. So you say, “The main soundtrack of an arcade game. It's fast-paced and upbeat with a catchy electric guitar riff.” And then you click it and it sounds like somebody made a Super Nintendo theme except an AI just generated it. So I've been thinking about how fun this is for creators. You can now get your own theme songs in there. You can get something that's your mood or whatever, or even just a start of a song and then you can work on it. But I also was thinking about how this is the end of stock music. I don't know if either of you have ever had to use that, but I've used a lot of stock music and stock photos in my life as a journalist, creator, video maker. You need a little something, you go and you get a track. That business is just gone now. If you need stock music or photos, or pretty soon video, it's just going to get generated by software.

CW I'm very curious about what the future of this will be because I also use services for music for the background of my livestream and for different videos and stuff, and so this is something up my interest alley. But also it will be really interesting how copyright goes because the main thing that protects a lot of creators in this space –I'm really interested in copyright for musicians and stuff– is having this legal entity that protects you and allows you to get your music licensed. Because if the AI generates this music for a video, let's just say you're making a short film of some kind and you end up using it as the soundtrack, does the copyright go to the bot? Who owns that copyright? Is it the person who prompt engineered the bot to make this soundtrack? I feel like that ownership level is something that is a mystery to me and I'll be curious to see how that changes, because I feel like the industry of stock music isn't dead until this part is solved, I think. 

BP Right. But I'm also so hype for the Cassidy livestream where you write the description of the music you want and then over time it learns and gets better and becomes more personalized to you or to what your fans respond to. 

CW It’s true. Man, I've worked so hard on curating this playlist and what's the point? The AI will just do it for me. 

CF I find this whole AI stepping into the world of creatives really interesting, because I feel like a lot of people have very polarizing opinions. I haven't seen any takes on AI generated music yet, but I'm interested to see what people think about it, because I've seen some very wildly different opinions on the whole AI generated art. Some people are like, “It's not really art because the whole point of art is for an actual human being to create it.” So I wonder what people are going to think about music as well. I think AI is opening up a lot of interesting discussions. I personally don't really know how I feel about it yet, but I think just hearing this at face value, it sounds really cool to be able to put in a description. I'm interested in knowing what was everything that went into creating this. How long did it take? What did you do to make this website that can take a description and turn it into a song? That to me is pretty interesting, technologically speaking. 

BP To the earlier argument of dubs versus subs, I saw another similar thing working with these AI techniques. I'll try to remember where it came from and put it in the show notes, but basically you show it a quick snippet of video and it renders the face and the way the lips are moving. And then you say, “Same scene, but in Spanish. Same scene, but without the curse word so we can get the PG-13 version.” And it just produces these awesome seamless edits of the same scene without you having to go in and re-record any audio. So in some ways, like you said, what would open up the world more for people to watch foreign films? In some ways, that kind of technology could be so incredible because a lot of people can't be bothered to watch subtitle films. But at the same time, like Ceora is saying, scary, a little bit unknowable still at this point.

CF Yeah, because I hear what you're saying and I'm like, “What if somebody is like, ‘Oh, say this really crazy criminal statement.’” 

CW Yeah, that's what’s scary. The insidious stuff.

CF Yeah, exactly. I think I said this in a previous episode a couple weeks ago; this is one of the times where we really need to have some sort of formal regulations for these things, because there's just too much potential for just absolute craziness. Because I have seen some deepfakes that are incredible how accurate they look. But at the same time, that's super scary. I don't think there's enough footage –well, I hope not– I don't think there's enough footage of me online to create a realistic deepfake, but if I was a celebrity and I saw a deepfake of me saying the ‘I Have a Dream’ speech or something like that that I've never said before, I would be like, “This is cool, but terrifying at the same time,” because there's just too much potential.

BP My head stays the same. You're always changing your hairstyle, so it's going to be way harder for you than it is for me. I'm just the same bald egg with a beard every time. But you're always coming up with new and creative looks.

CW Right now we're at a good point where, first of all, there's a lot of uncanny valley stuff where multiple tech people I've seen who've been experimenting with it have said that there's this one AI thing that makes it so your eyes are always looking at the camera no matter where your head is. And everyone in the audience was like, “This is really freaky. Could you stop it? Could you turn it off?” And so we're still in that uncanny valley part. But then what's also interesting is, I've been reading some papers recently and some announcements about how people are trying to figure out how do we ‘watermark’ audio and watermark art and text in a way where people can tell that it's been AI generated in some way for these purposes to avoid the insidious natures that could come out of it. But it's a blessing and a curse when anyone can do this. 

CF Yeah. I don't know how internet regulations work as far as the law is concerned, but if there was some rule decided by some entity that you have to denote that it's generated by AI before you publish it to the world, that would to me be okay. That's more reassuring than just, here's this deepfake, but I'm not telling you it's a deepfake, and everybody's going crazy because they're like, “Oh my gosh, look at Beyonce saying this thing. This is so crazy,” and it's not real.

CW Yeah. I don't know how this will work either because laws are made all the time but the internet was not designed with these kinds of regulations in mind. It was designed to be fully open for better or for worse. And so laws like GDPR, for example, have been made and it works really well for people who follow the rules, but not everybody does, and people will figure out their loopholes around the rules and stuff. And so there's going to be I think a lot of interesting changes and ideas coming out of it over time.

CF Yeah, agreed. So we'll see how this evolves and five years from now we'll be like, “Oh my gosh. They actually created a law because they listened to the Stack Overflow Podcast. Wow.” 

CW It was us. We did it. 

CF When I say the term ‘Stan Twitter’, do you guys know what that means, first of all? 

CW Yes. I know what it means, but you should probably explain it to the audience. 

CF Okay. So Stan Twitter is basically the side of Twitter where people who stan or are big fans of different musicians or actors hang out and they talk about whoever their favorite celebrity is and all that kind of stuff. So that's what Stan Twitter is in a nutshell. But Stan Twitter also can be a place where a lot of heated arguments and debates happen between who's a better singer, would've thought, things like that. So one of the things I have seen happen is on Twitter in the browser, you can change the text in a tweet if you open up dev tools. And I've seen people fabricate tweets from different celebrities or even fans of celebrities and be like, “Oh my gosh, look at this crazy thing they said.” And it'll start a whole nonsense. And so when I hear about AI generated art or AI generated this, that, and the third, changing the audio of this clip or whatever, that's automatically what I think of, because I'm pretty sure whoever created dev tools in Chrome never intended for somebody to do that. Or at the very least, I would edit a tweet from Zayn or something like that saying, “I love Ceora,” but I'm not going to. Do you know what I mean? 

CW Oh, yeah. I mean, I've done that too, where it was just like, “Wow, Beyonce said I was really cool. That's amazing. Look at that.” That's silly, but unfortunately people take that to the not silly extremes where they spread straight-up misinformation out in the world. 

CF That's kind of a smaller, slightly harmless, not harmless, but it's obvious enough for most people to catch it when it happens, but it still is done in bad faith. So that to me is like, “Man, if that can happen with tweets, I'm sure somebody's going to cook up something really crazy with this deepfake stuff.”

[music plays]

BP So I want to take us to the outro and I wanted to give a shout out to a listener of the show who has been writing in. He specializes in data science, is a data engineer over at a company called Cleo, and wanted to send us some links that we could share every week with the podcast. So this week, an awesome list of open source data engineering projects you can contribute to. That was thrown up on January 24th so it's pretty fresh, so I'll put it in the show notes and if you're a listener and you're interested in contributing or learning about data engineering, we will have a link for you. And I guess if you're a listener and you want to contribute to the podcast, this is what we're looking for– stuff we could talk about or things we can share with the audience that’ll make it easier for folks to learn and grow in their careers. Now that I've shouted out somebody from the audience who helped us with an idea, I'll shout out somebody who won a badge. There haven't been any lifeboat badges recently, but my favorite new type of badge that I've been shouting out is the Stellar Question: somebody who asked just a great question and so it was saved by at least a hundred other users. So, “Open two instances of a file in a single Visual Studio session.” Thanks to Paragon for asking such a great question. You've helped 175,000 people with the same question, and you've gotten the Stellar Question Badge, so we appreciate it. As always, I'm Ben Popper. You can find me on Twitter @BenPopper. You can email us with questions, suggestions, or your weekly contributions, podcast@stackoverflow.com. And if you like what you hear, leave us a rating and a review. It really helps. 

CF And my name is Ceora Ford. I'm a Developer Advocate at Auth0 by Okta. You can find me on Twitter. My username there is @Ceeoreo_.

CW And I'm Cassidy Williams. You can find me @Cassidoo on most things, and I'm CTO at Contenda. 

BP All right, everybody. Thanks for listening, and we will talk to you soon.

[outro music plays]