Ryan sits down with CTO Aruna Srivastava and CPO Ruslan Mukhamedvaleev from Koel Labs to talk about how they’re innovating speech technology with the help of AI and classic movies. They also tell Ryan about their time in the Mozilla Builders Accelerator and their experiences as student co-founders in an ever-changing economic and technological landscape.
Koel Labs uses classic movies to help learners master pronunciation. You can join the waitlist for their closed beta launch now. Check out their open-source community project for Koel Labs on GitHub.
Check out their project on the Mozilla Builders site.
Connect with Aruna on LinkedIn.
Connect with Ruslan on LinkedIn.
Shoutout to Populous badge winner Tomáš Záluský, who won the badge for answering the question How to chain multiple assertThat statements in AssertJ.
[intro music plays]
Ryan Donovan: Hello everyone and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I am your host, Ryan Donovan. And today we are gonna be talking about some early career folks getting into building a project, working with the Mozilla Builders Accelerator Cohort. I'd like to welcome my guests Ruslan Mukhamedvaleev and Aruna Srivastava -
Aruna Srivastava: Thank you -
Ruslan Mukhamedvaleev: Thank you so much.
RD: Top of the show, we like to get to know how our guests got into software and technology. What got you interested in starting a career in this?
AS: Yeah, we're still in the early stages of our careers, so our story isn't particularly long, but for me, I got into programming in high school and decided to major in computer science at the University of Washington. I've since become more interested in exploring the intersection of speech technology and multilingualism, so I focus on a lot of research surrounding audio language models and how to improve their capabilities to be more accessible and inclusive for multilingual speakers and accents.
RM: Yeah, I've been tinkering with code since middle school, like early middle school, and then I got really into web and all things design. I do a lot of research specifically with the Ukrainian language, so I really love the intersection of technology and language. That's the space I really love to work in. And I also do research that's more humanities-based, which is an awesome thing to do with technology.
RD: Yeah. You have a little company that is doing a pretty interesting language learning application. It's Koel Labs. Can you talk a little bit about what you're doing with it?
AS: Yeah, definitely. So way back in the summer, Alex and I were both involved in machine learning based research -
RD: Alex, being your third co-founder -
AS: Alex is our co-founder and CEO. At the time, Alex was working at a startup as a full-time engineer, working on audio models that you talk to over the phone, and noticed that the models weren't able to pick up on his accent because he's a native Danish speaker. And so his Danish accent was preventing him from actually interacting with this model correctly. And as we were speaking, we were talking about how this problem of pronunciation is really difficult to master. All of us have parents who are immigrants and are very familiar with this idea of struggling to master pronunciation and also to communicate effectively, given a lot of issues in pronunciation. And so as we were discussing, we were like, well, Alex has been at the startup for a while, I already had quite a bit of research experience, and Ruslan and I had previously met at a hackathon, and I knew Ruslan was a brilliant designer. So I talked to Ruslan, and at the time we saw this Mozilla grant that was being offered for open source-focused startups planning to build locally running AI. Alex proposed the idea of working on the startup via the accelerator program. And so when we applied, we got in, and we decided to go all-in on it and are continuing to work on it to this day.
RM: It's actually really funny though, me and Aruna met through a hackathon working on a language tool for kids with speech impediments, so it was really interesting like to go back to that kind of idea.
RD: Yeah, that's great. I love that. My mother is also an immigrant; she's from Iceland. I do have a fascination with language learning. But it's interesting, I just read an article about the difficulty of processing accents in English, especially for beginner and expert non-native speakers. Can you talk a little bit about the research and thinking you're doing, and what you're coming across, in terms of understanding accents across all speakers?
AS: Yeah, absolutely. Koel Labs is separated into two components. The first, on the product end, is that we're really hoping to model accents really well using the International Phonetic Alphabet, or IPA. This is the sound alphabet of speech. That's the first component, which is important for actually giving users accent feedback. The second component of Koel Labs is researching the greater implications of accent and dialect variation. We're in collaboration with Carnegie Mellon and UBC, and we're working on demonstrating how accent variation generally causes a lot of issues in the performance of speech technology. Take big models like Whisper and other ASR speech technology: the error rates that you see, say 95% accuracy and a 5% error rate, oftentimes come from evaluations on standard American English speech, so they're highly biased. We wanted to really frame that issue with a lot of current state-of-the-art speech technology. And so we're working with universities to not only demonstrate this, but also try to come up with a better alternative for training these models. And that's why we wanna build a platform for non-standard speech, because then we're able to get non-standard speech data and train more inclusive models.
RD: Sometimes even native US speakers have accents that make it hard to differentiate words. I know there's a video going around of a Baltimore speaker saying 'Aaron earned an iron urn,' and it all just comes out urn, urn, urn, urn. How do you go about disentangling the possible overlap in a speech accent?
RM: What our platform is trying to do is use a method of layering IPA transcriptions. To go back a little bit for context, we use movie content as a baseline, as a ground truth, so that you're able to choose whatever content you want to sound like. We use that as the ground truth and then compare what the user is saying to it, in order to best see the differences between the two and give actually helpful feedback.
AS: And so coming back to the Baltimore accent, which I think is so funny, I think I was watching that video just last week. Oftentimes those small things, like I think he was saying 'urn' instead of 'iron,' right? All of those things can actually be well represented using IPA. So unlike our standard English alphabet, which is not very reflective of the actual pronunciation of those -
RD: No, it's pretty terrible -
AS: Yeah. English especially is really tricky. IPA is one-to-one in terms of the sound you're supposed to produce. It's actually a huge alphabet of sounds, several hundred of them. We can narrow and condense it down to around 50 to 60 sounds that are appropriate to represent L2 non-native speech. And so what we do is we basically use a phonetic transcription model that we've trained to take speech and then represent each sound as an IPA token. And it's a little bit different from ASR. You can imagine that ASR, because it's a transcription task, is highly context based, right? Your model might not know the exact word that somebody says, but based off the context, it's predicting everything. But if a phonemic transcription model does that, it's gonna start assimilating its assumed phonemes toward some standard speech, which is not what you want. You wanna be able to model exactly what the speaker is saying. So it's different in that it truly can't rely on context.
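[Editor's note: to make the comparison Ruslan and Aruna describe a little more concrete, here is a rough sketch of how a learner's phoneme sequence could be aligned against a reference clip's phonemes. The phoneme strings, variable names, and alignment approach are illustrative assumptions, not Koel Labs' actual pipeline.]

```python
# Illustrative sketch: align a learner's IPA transcription against a reference
# clip's IPA transcription and surface the phonemes that differ.
# The phoneme sequences below are invented for the example.
from difflib import SequenceMatcher

reference_phonemes = ["aɪ", "ɚ", "n"]      # e.g. "iron" as pronounced in the reference clip
learner_phonemes = ["ɛ", "ɹ", "ə", "n"]    # what the learner actually produced

matcher = SequenceMatcher(a=reference_phonemes, b=learner_phonemes)
for op, r_start, r_end, l_start, l_end in matcher.get_opcodes():
    if op == "equal":
        continue  # only report the spans that differ
    print(f"{op}: expected {reference_phonemes[r_start:r_end]} "
          f"but heard {learner_phonemes[l_start:l_end]}")
```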
RD: Right. The model itself, is it a more traditional sort of machine learning or are you getting into the sort of transformer-based generative stuff?
AS: We follow the Wav2Vec2 architecture, and we use a pre-trained checkpoint from Facebook's model, called Facebook XLSR 60. This checkpoint is essentially pre-trained on several hundred hours of multilingual speech data, and we do training on top of that, using a lot of human-annotated phonemic transcription data, to get the model to accurately predict phonemes. The Wav2Vec2 architecture does use a transformer-style encoder model. It is relatively new. A lot of people assume that following the Whisper model and fine-tuning Whisper is much better. But a key focus of our product is, one, to run locally, so the models have to be relatively small and lightweight. And the second component is that, of the Wav2Vec2 models available on Hugging Face, a lot have already been fine-tuned on phonemic transcription data. There's no proper Whisper checkpoint that's well calibrated for phonemes. It's well calibrated for ASR tasks, so for it to completely relearn phonemic transcription is a little bit more challenging.
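[Editor's note: for readers who want to try phoneme-level transcription themselves, here is a minimal sketch using a Wav2Vec2 CTC model from Hugging Face. The checkpoint named below is a publicly available phoneme-fine-tuned model used purely as an example; it is not necessarily the checkpoint Koel Labs builds on, and the audio file name is a placeholder.]

```python
# Minimal sketch of phoneme-level transcription with a Wav2Vec2 CTC model.
import torch
import soundfile as sf
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

checkpoint = "facebook/wav2vec2-xlsr-53-espeak-cv-ft"  # example phoneme-recognition checkpoint
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)

# The model expects 16 kHz mono audio; "user_recording.wav" is a placeholder file.
speech, sample_rate = sf.read("user_recording.wav")
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, time, phoneme vocabulary)

predicted_ids = torch.argmax(logits, dim=-1)         # greedy CTC decoding
phoneme_string = processor.batch_decode(predicted_ids)
print(phoneme_string)  # IPA-style tokens rather than words
```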
RD: And you all are taking this on as a company. What's the difficulty of going from doing all the research to making something that somebody will use and possibly pay for at some point?
RM: It's a really large task to juggle all at the same time. But our main mission is to keep the entire company really research focused, because that's where all of our passions lie. And we wanna make sure that we're actually delivering a product that's helpful to people, and hopefully the company aspect of that doesn't get too much in the way of our actual research, which is our main goal.
AS: Yeah. I will say, deploying to real users is incredibly challenging compared to what works well in an incubated research setting. I think what we realized quickly is that there's a lot of research surrounding phonetics, linguistics, and also non-native speech, but oftentimes this research is done entirely in a vacuum, where it doesn't capture the full variation of speech. So, while some people may create a tool that's well calibrated for a certain speaker profile, say Japanese L2 speech, it might not deploy or generalize well for all other accents. Likewise, sometimes people have accents, but also have [inaudible] on top of that. So being really considerate of the demographic of users, and making sure that you're not making false assumptions about who your users are gonna be, is really important compared to the research setting.
RD: Does that research suffer from the college-student-only problem? I know in psychology they have an issue where most of the research is done on college students, because that's who will do it.
AS: Right, right. Yeah. We've definitely seen that in a lot of data sets, especially when we're running evaluations. A lot of the data sets are standard American speech from educated college students, so you can already see that there are going to be so many biases there. Luckily, we also have just a lot of data of non-native speech. We're working on curating a lot of our own data sets to make sure that they're more inclusive and diverse, so we're hoping to get that done in the next year.
RM: Yeah, that was a really, really big problem that we ran into initially: finding data sets that represented non-native English speech really well, like you were talking about.
RD: Yeah. I want to ask about the Mozilla Builders Accelerator Cohort, what is that and what was that like being a part of that?
RM: It was an awesome experience. It lasted, I think, 12 weeks, from September to December. We got a lot of support and mentoring from the Mozilla folks there. It was really helpful, as we didn't have a lot of experience with the business side of things, so they were able to give mentoring specifically for that, as well as marketing and things like that. We also got to show off a prototype of our application at the Mozilla demo day, which was a really amazing experience, to actually show off what we built over those 12 weeks. I think Aruna can talk about the model.
AS: Yeah. And working with Mozilla, they were just really gracious in supporting us as students. We were the only students in the program, and I think it was really special to be such a young part of the cohort. There were lots of experienced founders who were giving us a lot of really good feedback and advice, and we could really feel the general amount of support they gave us as young student founders, even in terms of developing the machine learning side of things. They gave us a lot of support in finding credits and making sure that we had enough money to fund that.
RD: That must have been both intimidating and exciting, just being the youngest ones there.
AS: Yeah [laughter], yeah, definitely. And in terms of the introductions that we were given, it was always like, here come the young students. We're like 23 [inaudible].
RD: So what was the most surprising thing that you learned as a cohort?
AS: I was surprised that so many of the well-established ex-industry founders who were presenting at Mozilla said to just go full send on a startup and whatever you were interested in, and not worry so much about having the perfect background in industry. Because Alex and I are both looking to graduate soon, we were making a lot of career decisions, and for me, I was worried that having a strong industry foundation was important before doing anything startup related. And a lot of these founders were ex-Google; they'd been at Google for many, many years. They explained the idea of golden handcuffs. They put so much faith and belief early on into us as founders that it made me realize how much potential we have even without years and years of industry experience.
RD: I've been at small and large companies, and it is interesting to see the process that goes on in a large company. But, you know, at a small company, especially if you founded it, you get to learn everything. You get to do everything. That also means you eat dirt when things go wrong [laughter]. But in trying to build this company, have you had any lessons in the field that you're gonna take with you, or anything that stung a little bit?
RM: Like I was saying, we didn't know a lot about business before, but now we do. And I think those skills will be applicable to anything in the future. But just having to juggle every single role, where everyone has to do kind of everything together, was an awesome experience that teaches you so many different things: in the code, in the marketing and design, every aspect, which is just awesome.
AS: I think one thing we found interesting was that how you frame the business, based on who you're speaking to, is really important. And like Ruslan said, we aren't business students. So when we were pitching to other companies and investors and talking to them, generally the reception of this idea of open source is not the best if those investors are not familiar with the software side of how open source generally works. So I think we learned quickly that you need to be careful about how you word those things. People oftentimes worry that you're giving away proprietary information; they don't understand how you're actually going to be profitable. So being really clear about what our mission is, and also demonstrating that we have a very viable business plan regardless of whether it's open source, was something we didn't realize initially that we had to be really careful about.
RD: Yeah, it's interesting. I think I've had that realization over the course of doing this podcast: we've talked to a lot of open source companies where what they sell isn't necessarily the software, it's the support, it's the hosting. What is the thing that the business will deliver if it isn't the software?
AS: I think our main competitive edge is that we're able to run our model inference at very low cost, because we focus on developing something that can run locally. What that allows us to do is scale quickly to a lot of users at a very low cost. So while some companies may have to charge users hundreds of dollars a month or a year to give them full access to those models, we don't. And what that means is that you can deploy to a lot of users who otherwise might not be able to afford a pronunciation language coach. What that also means is you're collecting more and more speech data from highly diverse populations, and so our hope is that once we get a huge user base, we're able to use that speech not just for our platform, but to help other speech technology companies improve their understanding of linguistic diversity and truly serve non-standard speakers.
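[Editor's note: one common way to make a transformer encoder small enough to run on a user's own machine is dynamic quantization. The snippet below is a generic PyTorch sketch under that assumption; it is not a description of what Koel Labs actually ships, and the checkpoint and file names are placeholders.]

```python
# Generic sketch: shrink a model's linear layers to int8 so inference can run
# cheaply on end-user hardware (standard PyTorch dynamic quantization).
import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-xlsr-53-espeak-cv-ft")
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# The quantized model keeps the same forward() interface but has a smaller footprint.
torch.save(quantized.state_dict(), "wav2vec2_phoneme_int8.pt")
```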
RM: And one of our goals going into it was to improve the already existing data sets. So by collecting this data, we'll be able to give back to the dataset community, especially for underrepresented dialects.
RD: We'll put a link in the show notes to that. I wanted to ask: the models are open source too, right? Is it more than just open weights? Is it also open data, the training data?
AS: Most of the data is open sourced. Some of it is collected under a specific agreement that we won't release it, because it's health-specific data. For instance, some speaker data contains post-stroke aphasia speech, which is medically sensitive data, so we can't release that. For the speech that is fully anonymized and not specific to a condition, we're able to share that data openly. And then everything from the training code to the hyperparameters is also shared.
RD: Yeah. Do you think there's an advantage for open source models?
AS: Absolutely. I think that, especially in the linguistics space, you really want the communities to be able to iterate on it. Especially on Hugging Face, people are all about the fine-tuning type of culture, and so especially when your main bottleneck is having well-annotated data, when people have that data, you want them to have the opportunity to fine-tune on it and see if performance improves. So that's a big reason why we wanna be open source. And I think throughout the project we've realized that the implications of our work go far beyond just language learning and pronunciation. What we realized is that there are so many ways these phonemic transcription models are used for medical diagnosis, when people have speech impediments and stroke disorders. It's really important to have that kind of granular-level transcription. And so there's been a lot of work on the medical side of things from people who have attempted the same problem space. And so we wanna see, you know, all these different sectors being able to use our work and benefit from it collectively.
RD: That's interesting, the other use cases. Are your models able to identify whether somebody has a particular stroke condition, or even, you know, just identify that they have a particular accent?
AS: Oh. Like is the model able to differentiate between an accent versus like a speech impediment?
RD: And to be like, oh, this is somebody from Chicago, or something. Like, can they sort of pinpoint them?
AS: For now, this model is just transcribing phonemes, but you can definitely have a second layer where you're able to categorize the general patterns of the phonemes. So we give users natural language feedback, not the actual phonemes. In addition to that, we're able to identify general trends in the phonemes and bucket the patterns that we see. So we can certainly have that kind of feedback.
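[Editor's note: as a rough illustration of what that second feedback layer could look like, here is a toy sketch that buckets recurring phoneme substitutions into natural-language tips. The rules, phonemes, and wording are invented for the example; a real system would curate or learn these patterns.]

```python
# Toy sketch: turn repeated phoneme substitutions into human-readable feedback.
FEEDBACK_RULES = {
    ("θ", "s"): "Your 'th' sound is coming out as 's' - try placing the tongue between the teeth.",
    ("ɹ", "l"): "The English 'r' is drifting toward 'l' - keep the tongue tip from touching the ridge.",
}

def summarize(substitutions):
    """substitutions: list of (expected_phoneme, produced_phoneme) pairs."""
    seen = set()
    for pair in substitutions:
        if pair in FEEDBACK_RULES and pair not in seen:
            seen.add(pair)
            yield FEEDBACK_RULES[pair]

for tip in summarize([("θ", "s"), ("θ", "s"), ("ɹ", "l")]):
    print(tip)
```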
RD: Yeah. Could be like a 23andMe, but for your accent -
AS: Yeah, yeah, yeah -
RD: You're all about to graduate, and we've heard a lot about the tightness of this job market. I wonder how you feel about the job market, even if you're going all in on the startup world. What does it look like for you and your peers?
AS: I don't think there's ever a time when everyone says, wow, the job market's amazing and everybody's getting jobs. I don't think there's any decade where anyone has ever said that, because that's just not how people think. So perhaps for us, within our time, because the software engineering world is changing so quickly, it is more difficult to get a software engineering position. But I think now, more than ever, it's easier to pick up new skills, and so people can pivot just as fast as the industry is changing. As long as you're continuously learning, I think there's always a place for you in every industry.
RD: Yeah, I think there have definitely been some software engineering gravy trains over the last 20 years. Do you feel like AI makes it harder as a starting developer?
AS: Makes it harder? Ah, I wouldn't -
RD: I mean, I keep reading these pieces about goodbye junior engineers and that [laughter]
AS: Ah, okay. Well, at least for us, when we're working on a startup that needs to deploy code quickly, having autocompletion from things like Copilot is certainly helpful. In terms of it causing junior developers to become obsolete, I would say I don't think that's the case for everybody. I don't think you can uniformly say that, especially for any company. I think the idea of having to code very slowly is no longer there. People can code much faster and also test just as fast. But the necessity of being able to think like a senior-level engineer, where you're thinking more about architectural elements, code complexity, and how to write code that's clean, readable, and makes sense for everybody, I think that level of senior development is necessary from everybody now.
RD: Yeah, so can people check out your language learning tool today?
AS: Yeah, they can sign up for the waitlist right now. Because we're using movies and TV shows, we're still working on some licensing agreements, so that's our main pain point at the moment. We're hoping to get a really good deal with a big production or movie company soon, and once we get full rights to the videos, we'll be able to deploy everything to users. So if they sign up for the waitlist, hopefully in a few months or so a licensing agreement comes through, and then we'll be able to deploy to everybody. But yeah, fingers crossed. And for now, the models will be continuously updated on Hugging Face for anyone else to try, if they wanna just see if it works for them. And we'll release a couple of other things where people can just test a small video. But yeah, lots more to come, so they can go to koellabs.com, hit the waitlist, enter their email, and then they should receive a link in a few months or so for the beta testing program.
RD: Alright, everyone, it's that time of the show again where we shout out somebody who came on to Stack Overflow, dropped a little knowledge, shared some curiosity, and earned a badge. Today we're shouting out the winner of a Populous badge. Congrats to Tomáš Záluský for answering How to chain multiple assertThat statements in AssertJ. Their answer was so good it outscored the accepted answer. I am Ryan Donovan. I edit the blog and host the podcast here at Stack Overflow. If you have questions, concerns, topics to cover, etc., you can reach out to us at podcast@stackoverflow.com, and if you wanna reach out to me directly, you can find me on LinkedIn.
AS: Awesome. And I am Aruna Srivastava, the CTO of Koel Labs.
RM: I'm Ruslan Mukhamedvaleev. And I'm the CPO of Koel Labs.
AS: And Alex is our CEO. If you wanna learn anything more about Koel Labs, head to https://koellabs.com. Also, give us a star on GitHub, same name, Koel Labs, and you can learn more about our work there. We hope to be in touch, and check out our website as well.
RD: All right. Thank you very much for listening everyone, and we'll talk to you next time.