The home team chats with Gašper Beguš, director of the Berkeley Speech and Computation Lab, about his research into how LLMs—and humans—learn to speak. Plus: how AI is restoring a stroke survivor’s ability to talk, concern over models that pass the Turing test, and what’s going on with whale brains.
Gašper’s work combines machine learning, statistical modeling, neuroimaging, and behavioral experiments “to better understand how neural networks learn internal representations in speech and how humans learn to speak.”
One thing that surprised him about generative adversarial networks (GANs)? How innovative they are, capable of generating English words they’ve never heard before based on words they have.
Read about how AI is restoring a stroke survivor’s ability to speak.
Universal grammar proposes a hypothetical structure in the brain responsible for humans’ innate language abilities. The concept is credited to the famous linguist Noam Chomsky; read his take on GenAI.
AI expert Yoshua Bengio recently signed an open letter asking AI labs to pause the training of AI systems powerful enough to pass the Turing test. Read about his reasoning.
Find the Berkeley Speech and Computation Lab here.
Find Gašper on his website, Twitter, and LinkedIn. Or dive into his research.
Congratulations to Lifeboat badge winner and self-proclaimed data nerd John Rotenstein, who saved How can I delete files older than seven days in Amazon S3? from the ignominy of ignorance.
[intro music plays]
Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I'm your host, Ben Popper, Director of Content here at Stack Overflow, joined by my colleagues and teammates, Ryan Donovan, and Eira May. How's it going, y'all?
Ryan Donovan Pretty good. How're you doing?
BP Pretty good. Eira, you've been writing a bunch about AI for the blog and we've been writing a lot about LLMs, and the root of all of this stuff was neural networks and the idea that maybe to train artificial intelligence we should model it after the brain and these artificial neurons instead of going through the hard work of giving it a million hardcoded rules about how the universe works. And so today I'm pretty excited. We have a guest: Gašper Beguš, who is an Assistant Professor at UC Berkeley working on generative AI and language, and right at the intersection of AI, LLMs, and the actual brain– the neuroscience of all of it. So Gašper, welcome to the program.
Gašper Beguš Thanks so much for inviting me to the podcast.
BP So before we started, I know you were telling us that you were doing generative AI before it was hip. Give folks just a little bit of background. How did you find yourself at the lab you're at, and what are you focused on today? And then maybe tell folks a little bit about that sweet spot you're trying to work on, at the intersection of these technological and biological processes.
GB Yeah, so I kind of got into machine learning in an unusual way. I was primarily interested in how language works and how we can understand it. And we know language is this unique thing that we humans have, although animals have pretty sophisticated communication systems as well. But it's kind of complex in terms of brain activity. We don't know exactly how it happens, but we understand it on some levels pretty well. We know how children acquire it. We have a pretty good understanding of how it develops in history. So I was studying language and I was really interested in how we humans do language and what makes it unique, so what is it about our language that is so unique and special compared to other animal communication systems, which are also pretty interesting. And so I think for the first time in history, with deep learning we have ways to model language in a way that is very similar to how humans, human babies, learn to speak. Human babies learn language even before they're born. So we listen to sounds of language while we're in the womb, and we know that because babies cry differently based on the language of their parents, and stories that are read during pregnancy are remembered by babies. And so what fascinates me is that for the first time, we can basically combine the understanding of how human babies learn language and build models that are very close to that. So we're building unsupervised generative AI models that learn language from just being immersed in spoken audio, without any text. No baby learns from text. And so the focus of my lab is to use deep learning to better understand language, but also to use language to better understand deep learning, because language is this nice controllable system that is quite interpretable and we've been studying it for centuries.
And there's a lot we don't know about machine learning and deep learning yet, so we can use language to understand where we are different and similar from artificial neural networks and just inform each other about how our brain works and how artificial neural networks work.
RD So obviously a lot of the generative AI stuff is based on neural networks. I remember I took an AI course in college and they showed me neural networks, and it's this big sum function. How do you relate that to how the brain works?
GB I mean, the nice thing is that on some levels, the artificial neural networks are inspired by the brain, and the specific architecture that we are using for spoken language is convolutional neural nets, which are most inspired by the brain, actually primarily by vision. And so for cognitive modeling where we're trying to build computational models that learn like humans, we're actually not using transformers, which are now the big boom, the GPT-4 transformers and so on, we’re using convolutional neural networks. And what we've shown is that using a very similar introspection technique on these artificial neural networks and on the brain shows you that the computations can be quite similar in these two entities. So what we do is we record people's brain activity when they listen to language, and there are several ways to do that, but all of them are basically getting electric activity, a sum of electric activity on your skull– that's the primary idea behind brain imaging. And then what we do is we take artificial neural networks and take the sum of their artificial activity, not electric activity, but artificial activity when they're listening, or when you pass the exact same sound through the artificial neural network. And then we do a simple summation, which is the same thing as you do in neuroimaging, and we show that the responses are similar to the brain signal. So for example, we can catch sound. When you hear sound, first it hits your ears, then it travels through your brainstem onto the cortex. So the brainstem is the earliest you can catch your sound basically when you're doing neuroimaging and we're doing that. We're recording brainstem responses to a sound of language, and then we're passing that through a neural network, and for the first time we've shown that you don't need any other steps, just raw signals are similar. Now that doesn't mean that there are no aspects of artificial neural networks that are biologically implausible.
So the way we train them is called backpropagation, and that is biologically implausible. But at some levels they're similar enough to biological neural processing that you can do these really interesting comparisons. We know that the way we hear sounds depends on what your first language exposure was, and we show that something similar happens in artificial neural networks as well. They get wired slightly differently based on whether you trained them on Spanish or English. So I think there's a lot of potential to learn from each other, so in other words, to better understand the brain or language with neural networks and vice versa, to better understand neural networks by looking at our actual biological neural computations.
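Editor's note: the summation Gašper describes can be sketched in a few lines. This is only an illustration under stated assumptions, not the lab's actual pipeline: the kernels are random and untrained, the sound is a toy sine wave, the `brain` signal is a synthetic placeholder rather than recorded data, and the function names (`conv1d`, `layer_average`, `pearson`) are made up for this sketch. It passes one sound through a small convolutional layer, averages the activity over all channels at each time step, and correlates that average with the stand-in signal.

```python
import math
import random

def conv1d(signal, kernel):
    """Valid-mode 1D cross-correlation of a signal with one kernel."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(values):
    """Rectify: negative activations become zero."""
    return [max(0.0, v) for v in values]

def layer_average(signal, kernels):
    """Average the rectified activity of all channels at each time step."""
    channels = [relu(conv1d(signal, kernel)) for kernel in kernels]
    steps = len(channels[0])
    return [sum(ch[t] for ch in channels) / len(channels) for t in range(steps)]

def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

random.seed(0)
# Toy "sound": a 100 Hz sine wave sampled at 8 kHz (assumption, not speech).
sound = [math.sin(2 * math.pi * 100 * t / 8000) for t in range(400)]
# Eight untrained kernels of width 5 (hypothetical layer).
kernels = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(8)]
summed = layer_average(sound, kernels)
# Synthetic placeholder for a recorded response: the sound plus noise.
brain = [s + random.gauss(0, 0.1) for s in sound[:len(summed)]]
similarity = pearson(summed, brain)
print(f"correlation between averaged activity and placeholder: {similarity:.3f}")
```

With a trained network and real recordings, the same kind of comparison would be run on actual brainstem responses; the number printed here says nothing about real brains.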
BP Ryan is always very disappointed to know that we're just the sum of some random electrical firings, which really are just the sum of some random hormonal impulses, but he'll have to sit with that.
Eira May It's really interesting that you talk about learning about the way that neural networks work in humans and also the way that they work in AI. I feel like I'm a little bit in the lab right now because I have 11-month-old twins who are just learning to talk and you can sort of see the wheels turning as they connect the gestures and the signs. And they're identical twins, too, so we can do A/B testing which is really interesting. I guess I'd be curious to know what you've learned that's been surprising, whether that's something that you've learned about the AI side or our side.
GB That's a really cool question. I mean, there's so many cool things that are happening as humans are learning language. At the beginning, babies basically hear every sound of any possible language, but then at about 11 months, they just start focusing on only those that they hear in their surroundings. I already mentioned language acquisition. I mean, we start hearing language way before we start seeing complex stuff. So if you can imagine all the stuff we hear in the womb, it's pretty complex stuff we hear. And so intonation, a voice going up or down, that's something we definitely pay attention to. In terms of modeling language with artificial neural networks, one thing that was kind of surprising to me was that the stages in which they acquire language are similar to what kids go through. For example, we had this study where, in English, your P, T, and K have this puff of air. ‘Pit’ has this H-like puff of air and English kids have to learn that. In my language for example, we don’t have that. But if you add an ‘S’ before that ‘P’, that puff of air is gone. So ‘pit’ versus ‘spit.’ ‘Spit’ has no puff of air. And so that's a very simple algebraic rule that kids need to learn as they're faced with English. And they make mistakes, so English kids produce this puff of air even when there's an ‘S.’ So they'll say ‘spit,’ something like that. And we've observed that neural nets, as they were trained on this data, started doing the same. So they had a nice pronounced developmental stage. It was very, very similar to kids. One thing that is maybe most surprising to me is that they're super innovative. So we are training generative adversarial networks, or GANs, and we trained them on a few words of English and they start producing novel words of English that they didn't hear before. So because they're learning by imagination and imitation, give them eight words and they'll start producing new words of English, and that is really fascinating.
For example, we have a network saying ‘start,’ although it only heard ‘suit’ and ‘dark’ and ‘water’ and a few other words and it never heard ‘start’ before, and yet ‘start’ is a perfectly good English word. And so they're extremely innovative and so are we, we are very innovative as well. One nice thing about spoken language is that we basically have the generative aspect innate. So we talk a lot about generative AI, and if you think of vision, we do not have an innate generator for vision. I can ask you to imagine a red apple and you'll do that, but it'll be very difficult for me to access your imagination. But for speech, we can speak novel sentences, novel words, we can make up words, we can make up sounds. So our articulators are the generative AI principle. And so basically we're also modeling how we are moving mouths. So we're trying to build models that learn more and more like humans. So we're trying to get there, we're never going to get there but we’re trying to, so we're adding representations of mouths. Our models started moving mouths instead of just generating sound. And they were also very innovative there as well so they said words that don't exist, sounds that don't exist. So there's a couple of really impressive results that we got, but those are maybe the main ones.
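Editor's note: the ‘pit’ versus ‘spit’ rule mentioned above can be written down as a tiny function. This is a deliberately simplified sketch: it works on spelling rather than real sounds, it only looks at the start of a word, and the names (`transcribe`, `overgeneralize`) are hypothetical. The `overgeneralize` flag mimics the developmental stage described in the conversation, where the puff of air is produced even after an ‘S.’

```python
# Superscript h, used here to mark the puff of air (aspiration).
ASPIRATION = "\u02b0"
VOICELESS_STOPS = {"p", "t", "k"}

def transcribe(word, overgeneralize=False):
    """Mark aspiration on a word-initial p, t, or k.

    Simplified rule (assumption): a voiceless stop is aspirated at the
    start of a word, but not when it follows a word-initial 's'.
    With overgeneralize=True the stop after 's' is aspirated too,
    imitating the learner's mistake.
    """
    out = []
    for i, ch in enumerate(word):
        out.append(ch)
        if ch not in VOICELESS_STOPS:
            continue
        word_initial = i == 0
        after_initial_s = i == 1 and word[0] == "s"
        if word_initial or (after_initial_s and overgeneralize):
            out.append(ASPIRATION)
    return "".join(out)

adult_pit = transcribe("pit")
adult_spit = transcribe("spit")
learner_spit = transcribe("spit", overgeneralize=True)
print(adult_pit, adult_spit, learner_spit)
```

The adult forms keep the puff of air in ‘pit’ and drop it in ‘spit,’ while the learner form keeps it in both.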
BP I like how you mentioned this idea that if I'm thinking of something, it's pretty hard for you to access. I know you recently tweeted about some research that did come out of UC Berkeley as well, a really incredible story that was in The New York Times about a woman who had a severe stroke, lost the ability to communicate, and was paralyzed, and they were able to study her brain activity. I guess she had kind of a full implant, and then from there, they could learn which signals meant which phonemes or sounds or words, and give her back the ability to speak, which truly makes you feel like we're at this moment of having a mind-computer interface that seems so sci-fi. Can you tell us, what did you think of that work? Did that connect in any way to what you're doing, and where does that leave us in the future? Am I going to be able to imagine something and send it your way over a computer?
GB Yeah, so I'm just going to go to teach a class and two or three of the authors on that paper were in this class in previous years, and I think it just signals how important speech is, because we focus a lot on text and that's great because most of human knowledge is written in text. So the reason why GPT-4 is so great is because there's text. It's trained on text and there's a lot of human knowledge that is encoded there. But when you say ‘hello,’ there's so much information. Even if you call somebody and the person says ‘hello,’ I can identify many of their social properties like where they're from, how they feel that day. And if you transcribe that, it's just a single ‘hello.’ So speech has so much richness of information that text is losing, and I think in some ways speech is the new text because there's so much to be done there. This study, as I mentioned, we're teaching a class here on speech processing and audio processing, and that study is absolutely fascinating. It really shows how these new neural technologies allow us to basically generate spoken language of patients who lost the ability to speak. And I think this study was amazing and I think we're going to be seeing a lot of progress in the next years. And if you think about it, to lose the ability to speak is a really hard thing. So your thought is basically completely intact, it's undamaged, but you just cannot execute what you want to say. And giving people that ability back is going to be really important, and adding large language technology will probably even increase performance and I think this needs to scale up. So absolutely fabulous study, yes. And my lab is less focused on applied stuff and more on the basic stuff, but we're using some of the same data. 
So it's data when patients need to undergo surgery, and then there are amazing neurosurgeons like Eddie Chang at UCSF, where they record brain responses, as I mentioned, to spoken language, and that's where we get really rich information about how the brain processes speech and then why we're doing modeling. You might think modeling is just an exercise. It's not, because with modeling you can play. So if you build an artificial neural network, you can do experiments that you would never be able to do on humans because they would be ethically horrible. But if you have an artificial neural network, you can–
BP Torture it all you want.
GB Right. I mean, yes, in a sense. You can turn down some connections, mute some connections, and get understanding.
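Editor's note: "muting some connections" is usually called ablation. The sketch below is a toy illustration, not the lab's code: the weights, inputs, and the `forward` function are invented for the example. It runs one small layer twice, once intact and once with a single connection switched off, so the effect of that connection can be read off the difference.

```python
def forward(inputs, weights, mask=None):
    """One linear layer followed by ReLU.

    weights[i][j] connects input j to unit i. If a mask is given,
    mask[i][j] == 0.0 mutes that connection and 1.0 keeps it.
    """
    outputs = []
    for i, row in enumerate(weights):
        total = 0.0
        for j, w in enumerate(row):
            keep = 1.0 if mask is None else mask[i][j]
            total += keep * w * inputs[j]
        outputs.append(max(0.0, total))
    return outputs

# Made-up weights and inputs, chosen only so the arithmetic is easy to follow.
weights = [[0.5, -1.0, 2.0],
           [1.5, 0.25, -0.5]]
inputs = [1.0, 2.0, 3.0]

baseline = forward(inputs, weights)

# Mute the connection from the third input to the first unit.
mask = [[1.0, 1.0, 0.0],
        [1.0, 1.0, 1.0]]
ablated = forward(inputs, weights, mask)

print("baseline:", baseline)
print("ablated: ", ablated)
```

Here the first unit falls silent once its strongest connection is muted, while the second unit is unaffected.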
RD I'm pretty sure there's a science fiction story about this.
GB Yeah, we're getting into the AI ethics territory.
RD So you talk about modeling language and Noam Chomsky has this idea of universal grammar. Do you think that the language modeling that you're doing is able to pick that up, or do you think it's irrelevant to the modeling?
GB That's a really interesting question, actually. So yes, you're right. A lot of what people know about linguistics comes from Chomsky and the idea that somehow we had this pre-engineered brain that is capable of language, and this is a really, really deep and complex and interesting question. So you asked me at the beginning, how did I get into generative AI, and this is precisely what I was trying to answer. How many human-specific things do we need for our language to be possible? Does there need to be something specifically human in our brain, or is it just a scaling question? Are we just so smart that language emerged? And that's a really deep question and I think we have to approach it very carefully because it's easy to be part of camps and say, “Oh, Chomsky's completely wrong,” or, “Chomsky's completely right,” but I think the truth is somewhere in the middle. Now we're building models that learn like humans and we're showing that, increasingly, stuff emerges there. We don't need a lot of human-specific aspects. One nice thing about transformers: I was a little bit critical of them before, because from the perspective of how humans learn, they're not very realistic. No humans learn from massive amounts of text. But transformers are really great in showing us what is possible in a neural computation and what can emerge. And I think there's a lot that emerges automatically and we don't need any specific human aspect for language. For example, we have this study where we show that the latest LLMs were not only able to do language well, but they were able to think, to reason, to analyze language itself, so this metacognitive, reasoning-about-the-reasoning type of ability emerges. And so I think we're getting a lot of evidence that there's very little, if anything, that is a pre-made, pre-wired universal grammar type of thing, but I wouldn't completely rule it out. I mean, that's the big question.
And if you think of it, we're asking the same question in machine learning as well, which is, are the models that we're seeing now like GPT-4, or this neural computation that we know now, enough to get to superintelligence or AGI, or will we need some extra tweaks in architecture that will get us there? And exactly the same question is being asked in cognitive science and linguistics. Many animals show traits of language, but none of them have the exact same language as we do, not even our closest relatives, chimpanzees and bonobos. Are their brains just smaller and less powerful, or do we have some human-specific things that enable language? And of course Chomsky will think that that is the case and we have this special operation that allows syntactic structure and so on. I think a lot of neural networks are showing that there's a lot of structure, there's a lot of stuff you can get for free in other words, so just in artificial neural computation. But I think we should look very carefully at this question because it's a highly consequential one, and I wouldn't rule out that there is something human-specific, but I'm just not seeing a lot of evidence. And definitely artificial neural networks are helpful in that respect.
BP Ryan is holding out hope that we're somehow unique and different.
RD Oh, no, no. We're brains in a bone suit, covered in meat armor powered by electricity.
BP Yeah, that's all it is my friend. It's just the number of flops you can run on the language. That's pretty much it. So there have been a few papers recently exploring the idea of whether or not these new AIs have consciousness or sentience or their intelligence is the same or different from humans. One was Microsoft, like you said, proposed that if it has a theory of mind, if it has metacognitive abilities and it's starting to showcase things that once upon a time we would've said are unique to consciousness. And even some of the biggest players in the field like– I don’t know how you say his name.
GB Yoshua Bengio?
BP Yeah, Yoshua Bengio was recently on a paper that was just sort of saying that any test we can come up with for consciousness, you could design AI to pass today. That doesn't mean that we know what it is, but I could build you an AI to pass these tests if that's what you want. And I guess the one person on the other side is the guy from Facebook, Yann LeCun, who's sort of saying, “Look, don't fool yourself. This is just math.” And like you said, if we're going to get somewhere really big and important, we're going to have to come up with a different design. So where do you fall in that camp? Do you think we're on the right path, or do you think we're going to need new approaches, new designs to get the rest of the way?
GB The jury is still out on this one, I think, and nobody really knows. I was most impressed by the magnitude of improvement in performance between GPT-3.5 and GPT-4. The only real model that impresses me is GPT-4, based on the tests that I've been running and that other people have run as well. So the theory of mind and other concepts, I think we are looking at this question in a very human-centric way. Understanding how these models work is so important to me. So what we're seeing is that some models that we are building are learning like humans or they're getting closer and closer to human performance, and that's great for me because I'm a cognitive modeler. But I think we're also seeing that some architectures and models are diverging from how humans would do things, and I think that's really important to understand. How are they finding new paths, new ways, new insights? I mean, socially, of course they're biased. These models are biased because they're trained on human data, but in essence, they're not humans, and that's their advantage. They're not limited by biology in the way we are. And so I think rather than trying to build human-like models or trying to figure out whether they have our exact same consciousness or not, I think we need to understand how they're different from us, where they're different, and how they're becoming different. As I mentioned, transformers are not learning like human babies are, and they're seeing stuff that humans wouldn't necessarily observe. And so I think the more interesting question is, can we understand them and can we leverage that understanding so that we get insights that we wouldn't as humans?
BP Can we control them, I think, is what some people would say is the most important thing.
GB I mean, once they become so smart, it is going to be difficult to evaluate them. How am I able to evaluate them if at some point they become super smart? But also in terms of consciousness, that's a really loaded question. The brain of a fly is different from our brain for sure, but how different really, architecturally? So again, is this a scaling problem? I mean, we think that we have consciousness and flies don't, but who really knows. But other animals like whales or chimpanzees–
BP Yeah, whales have huge brains, what's going on there? I don't understand, it's not a scaling problem.
GB Exactly. Well, there's other things like brain-to-body-size ratio and we don't really know a lot about whales so maybe we'll uncover something. But the thing is that if consciousness emerges on the way, if we're able to scale these models to much, much higher dimensions, why not, and maybe the consciousness will be different or not. It's really hard to say. I think to be certain, to say I'm in one camp or the other is a little dangerous because it's really difficult to say. But I say let's try to understand them– how they're similar, how they're different.
BP I'm just giving you the chance. If you want to get a lot of news coverage, you have to have a hard opinion that could be completely wrong and you just stick with it. And if you're wrong, nobody will remember, and if you're right, everyone will give you the accolades.
GB No, I think they have a lot of potential, I'll say that. And I think it's not impossible. And if you need to put me in a camp, I'd be more in the sci-fi existential risk camp, but with one foot on the cautious side.
RD I mean, there are folks who believe that all matter has consciousness, the Panpsychism folks, so you never know.
BP So Gašper, from your perspective in the lab and as you're thinking about this stuff, when you go out to look for training data, like you mentioned, all the text on the internet is one thing, but you said speech, sound, is going to become the new text. I've read about maybe folks harvesting that from YouTube videos, but if you really wanted to learn, if you wanted to create universal language or learn different languages or gather all this stuff, how would you go about getting all the data that you need? Do you have people come into the lab and give you voice recordings? Do you use audio books? To teach an AI like GPT-4 unfortunately is a little bit easier maybe than Stack Overflow would like– just go out and crawl the whole internet and learn from that. But what do you do when you're looking for language and sound?
GB Yeah, so for language we have some really dated databases that work well for limited purposes. I mean, companies have various different ways to get data and there's a lot of spoken language data, but spoken language is very expensive. It contains so much more information, but that means that it's going to be computationally much more expensive. So text is a really good low-dimensional representation of our language, but it loses a lot of the fun part, it loses a lot of emotions, a lot of these kinds of things. I think one real next frontier is to pair language with brain data, and that's more difficult to get. So there are non-invasive techniques where the data is pretty easy to get. For example, for the paper we had in Scientific Reports, we just placed electrodes on people's heads, and they can come to your lab and just listen to a lot of sounds. There are more invasive techniques, and those are nice because, as in the study we saw, they allow you to generate spoken language for patients who lost the ability to speak. But I think that creating those databases where we pair spoken language, maybe even vision, with brain responses to those signals is going to be the next big thing, and it'll really allow us to do what I initially said, to better understand the brain with neural nets and to build maybe more realistic models of the brain in the artificial neural network world. And again, I think the most important thing is to really understand how they work. There's this hype about the black boxiness of neural networks. Yes, they're difficult. Yes, they're challenging to interpret, but it's not impossible. And part of it is that we were primarily focusing on vision. The visual world is super complex. There are millions of shapes and objects that you need to distinguish between, whereas language is a much more controllable space, and I think there are really promising results and I think we'll be able to understand them more and more if we're interested.
Industry has so far been primarily focused on performance, but now I think a lot of the hype will turn into understanding how these models work and understanding how they're similar to humans and where they're different and leveraging that difference. So I as a human can see things and get insights, but I'm limited. And so can I introspect AI and say, “What did you learn here? Tell me what you learned about whales,” or “Tell me what you learned about molecules or the universe”? That will not give me a definite answer, but it'll give me important points and clues to look for. And I think that's going to be the really good next thing with a lot of potential for discovery.
BP Sweet. Well, if you need someone to put something deep inside their brain, I guess I'll volunteer.
GB Okay. That happens only when it's clinically warranted, so don't worry about that.
BP All right, everybody. It is that time of the show. We're going to shout out a Lifeboat Badge winner: someone who came on Stack Overflow and helped some folks by spreading some knowledge. Thanks to John Rotenstein, awarded 55 minutes ago, “How to delete files older than seven days in Amazon S3.” John, we appreciate the answer. You're part of the AWS collective and you've helped over 40,000 people delete some old files, so good on you, John. As always, I am Ben Popper. I'm the Director of Content here at Stack Overflow. Find me on X @BenPopper. Email us with questions or suggestions for the show: firstname.lastname@example.org. And if you like this episode, leave us a rating and a review. It really helps.
RD I'm Ryan Donovan. I edit the blog here at Stack Overflow, you can find it at stackoverflow.blog. I am occasionally still on X, so you can DM me there @RThorDonovan.
EM And I am Eira May. I'm on the editorial team, write for the blog, and work on the show notes. And if you want to find me on text-based social media, I am @EiraMaybe.
GB I'm Gašper Beguš, Assistant Professor at UC Berkeley. I'm on X @BegusGasper and probably on pretty much every social media out there. I have a YouTube channel where you can listen to how language sounds in the brain and in artificial neural networks, or how the networks created new words that they'd never heard before. But that's roughly my presence.
BP Very cool. All right, everybody. Thanks for listening and we'll talk to you soon.
[outro music plays]