Today's episode is sponsored by Rev. We explore the history of automatic speech recognition and computer systems that can understand human commands. From there, we explain the machine learning revolution that has powered recent advancements in speech-to-text systems like the one employed by Rev. Finally, we look to the future and imagine the features and services that the next generation of this AI could produce.
We chatted with three guests:
Miguel Jetté: Head of AI R&D
Josh Dong: AI Engineering Manager
Jenny Drexler: Senior Speech Scientist
When Jetté was studying mathematics in the early 2000s, his focus was on computational biology, more specifically phylogenetic trees and DNA sequences. He wanted to understand the evolution of certain traits and the forces that explain why our bones are a certain length or our brains a certain size. As it turned out, the algorithms and techniques he learned in this field mapped very well to the emerging discipline of automatic speech recognition, or ASR.
During this period, Montreal was emerging as a hotbed for artificial intelligence, and Jetté found himself working for Nuance, the company that provided the speech recognition behind the original implementation of Siri. That experience led him to several positions in the world of speech recognition, and he eventually landed at Rev, where he founded the company’s AI department.
Jetté describes Rev as an “Uber for transcription.” Anyone can sign up for the platform and earn money by listening to audio submitted by clients and transcribing the speech into text. This means the company has a tremendous dataset of raw audio that has been annotated by human beings and, in many cases, assessed a second time by the client. For someone looking to build an AI system that mastered the domain of speech-to-text, this was a goldmine.
Jetté built the earliest version of Rev’s AI, but it was up to our second guest, Josh Dong, to productize and scale that system. He helped the department transition from older technologies like Perl to more popular languages like Python. He also focused on practical concerns like modularity and reusable components. To combine machine learning and DevOps, Dong added Docker containers and a testing pipeline. If you’re interested in the nuts and bolts of keeping a system like Rev’s running at tremendous scale, you’ll want to check out this part of the show.
We also explore some of the fascinating future and promise this technology holds in our time with Jenny Drexler. She explains how Rev is moving from a hybrid model—one that combines Jetté’s older statistical techniques with Dong’s newer machine learning approach—to a new system that will be ML from end to end. This will open the door for powerful applications, like a single system that can convert speech to text across multiple languages in a single piece of audio.
“One of the things that's really cool about these end-to-end models is that basically, whatever data you have, it can learn to handle it. So a very similar architecture can do sequence-to-sequence learning with different kinds of sequences. The model architecture that you might use for speech recognition can actually look very similar to what you might use for translation. And you can use that same architecture to, say, feed in audio in lots of different languages and be able to do transcription for any of them within one model. It's much harder with the hybrid models to sort of put all the right pieces together to make that happen,” explains Drexler.
If you’re interested in learning more about the past, present, and future of artificial intelligence that can understand our spoken language and learn how to respond, check out the full episode. If you want to learn more about Rev or check out some of the positions they have open, you can find their careers page at Rev.com/careers.
Intro
Ben Popper I mean, this stuff is totally fascinating. Can you just try to break down for me quickly how these systems do this natural language processing?
Joshua Dong So the way speech recognition works is that you have the deep neural network, which takes the audio and converts that into some sort of phonetic sequence. And actually, it's not a sequence, it's an entire tree of possibilities. So when I say 'cat', it's like 'k-at', but sometimes you'll have uncertainty. So that's where the tree comes in. And then the hybrid part is where you have non-neural-network components, such as FSTs or statistical n-gram models, which then figure out which path makes sense, right? Oh, 'cat' is a word. So maybe I'm just gonna follow that and mark that as a word. With the end-to-end kinds of models, they can use similar approaches, actually. But the idea is like, well, can we just use sequence-to-sequence modeling directly to go from audio to text?
[intro music]
Main Episode
BP Hey, everybody, welcome to the Stack Overflow Podcast. Today, we have a very special episode. It's a sponsored podcast, brought to you by the fine folks at Rev. You may not know this, but I actually used Rev all the time as a journalist to have my interviews transcribed. And now I use a little product called Descript for the podcast, which can turn speech into text. It's super great, saves me tons of time, makes the editing process a lot easier. We have three great guests on today from Rev, and they're going to talk to us about AI and machine learning and building an NLP (natural language processing) system. Our guests today are Miguel Jetté, the Head of AI R&D at Rev; Joshua Dong, who is an AI Engineering Manager at Rev; and Jenny Drexler, who is a Senior Speech Scientist at Rev. Welcome, the three of you.
Jenny Drexler Thanks!
Miguel Jetté Thank you.
BP Miguel, tell me a little bit about your background. I know you told me once you were studying computational biology. I'm not going to ask what that is, because that's going to be a whole nother ball of wax. But how did you go from there into the field of AI, machine learning, and NLP?
MJ Yeah. So I guess you have to rewind for me back to 2003 to 2006. I was doing a master's in mathematics at McGill University in Montreal. And it was applied to what I call computational biology, but more specifically, it was the problem of studying phylogenetic trees and DNA sequences. And back then I wanted to work in evolutionary theory, study the evolution of specific traits within evolutionary trees, like size of brain or length of the femur, stuff like that. I did finish my master's in New Zealand. But when I came back to Canada, a friend of mine, Jean-Philippe Robichaud, was working at Nuance, and Nuance had a big lab in Montreal, and he was working in something I'd never heard of before called automatic speech recognition. But both domains, phylogenetics and ASR, have a pretty big overlap, actually, in terms of algorithms. And so in some way, it was a natural evolution of my career.
BP So my impression, Miguel, is that Montreal became kind of a hotbed for neural networks, deep learning, machine learning, things like that. Was that part of what set you out on this path?
MJ Yeah, I think you can track it mostly back to a company called Nortel. Nortel had a big R&D team, and they failed spectacularly. But everybody at Nortel went their own ways and started little companies here and there. And so Montreal just became this really cool place for AI.
BP And so, you know, for people who don't know, in relative layman's terms, what are the similarities that you noticed between trying to do phylogenetic sequencing and trying to process speech? Is it some sort of pattern recognition?
MJ Yeah, in both cases, you're analyzing a sequence of states, or a sequence of token characters, that are dependent on each other to some extent. And also, actually, a lot of the algorithms were very similar. In speech recognition back then we were using GMM-HMM, that's Gaussian mixture model, hidden Markov model. We used that in both domains, and algorithms like Baum-Welch and the Viterbi algorithm are used in both. And so it was, yeah, kind of surprising, actually. But it makes sense if you think about it.
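To make the overlap concrete, here is a minimal sketch of the Viterbi algorithm Jetté mentions, run over a toy two-state HMM. All of the states, symbols, and probabilities below are invented for illustration; this is not Rev's code.

```python
# A toy illustration of the Viterbi algorithm: given HMM transition and
# emission probabilities, find the most likely hidden state sequence
# for an observation sequence.
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    # best[t][s] = log-prob of the best path ending in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda x: x[1])
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Trace back from the best final state to recover the full path.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Two hidden "phones" emitting noisy acoustic symbols (all made up).
states = ["k", "ae"]
print(viterbi(["K1", "K2", "A1"], states,
              start_p={"k": 0.9, "ae": 0.1},
              trans_p={"k": {"k": 0.6, "ae": 0.4}, "ae": {"k": 0.1, "ae": 0.9}},
              emit_p={"k": {"K1": 0.5, "K2": 0.4, "A1": 0.1},
                      "ae": {"K1": 0.1, "K2": 0.1, "A1": 0.8}}))
```

The same dynamic program can score ancestral states on a phylogenetic tree or phones in an utterance; only the tables change.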
BP Nuance, obviously a big name in the space, recently acquired by Microsoft if I'm not mistaken, so I guess Cortana will become more fluent very soon. But how did you make your way from Nuance to Rev? What was the path there?
MJ So at Nuance I worked in speech recognition applied to IVR, and I also dabbled in mobile phone applications. So it's those phone systems that, you know, most people despise. But—[Miguel laughs]
BP If you're calling for the pharmacy, please say yes.
MJ Exactly, exactly. But it was a very popular use of ASR back then, and it was really interesting; a lot of great ASR development came from IVR applications. You know, mobile phones started to exist; like, I worked at Nuance before the iPhone was invented. So when it started, we built a thing called Nina, a mobile phone assistant using voice. Yeah. And then I moved on to a company called Voicebox. Voicebox was in Seattle, and they focused on in-car speech recognition. And it's really at Voicebox where I started to work on more open-domain, large-vocabulary systems. And I found that fascinating, because before, you're always building applications based on some pretty well-defined use case. But when you start thinking about open-domain speech, I find it's a much more interesting problem.
BP So yeah, just for people who don't know the history: you were mentioning SRI, and Wikipedia has Siri spinning off from an SRI project, with the original speech recognition provided by Nuance. And Siri is probably the most well-known computational agent, right? That is having natural language dialogue with human beings on a day-to-day basis.
MJ Yeah, totally. And I feel lucky to have worked on the first, you know, like incarnation of Siri.
BP So tell me a little bit about, yeah, when you got to Rev, what you were working on. I know you were one of the first folks there working on some of the AI-driven NLP stuff. What were you building? What tools and technologies were you building? And I know that'll be a good segue to go over to Josh, who came in shortly after you and helped to transition it to a little bit more of a scaled model.
MJ Yeah, totally. And maybe real quick, I can tell a very quick story about Rev. Rev was founded in 2010. And it's, at the outset, what's called a two-sided marketplace, and they focus on language tasks. And so there are Revvers that work from home, about 50,000 Revvers now, and we get customers that send us audio that they want transcribed or captioned or stuff like that, and the Revvers are free to pick which audio to work on. And so in some way, sometimes I call it the Uber of transcription. But after about six years of doing this with humans, they kind of realized that ASR, this new emerging technology, was going to be a disruptive technology for their business, and they started to invest a lot in speech recognition around 2016. And that's when I joined. Yeah, so when I joined, there was no speech recognition. So my main goal was to kind of prove to the founders and to the leadership team that the data they had accumulated was worth a lot of money and was going to be useful for speech recognition.
BP Right, the fact that people were giving you tons and tons of audio, they were sort of annotating it themselves; they might be saying we're speaking this language, or there's two speakers, or, you know, naming the two speakers. And then people were going through and doing transcription. And then often the person who paid for it would give feedback about what was right and wrong. That's kind of the dream, you know, when you're trying to build a dataset for a big ML model: to have tons and tons of data, and then to have that data cleaned and curated and cataloged by humans, right?
MJ Oh, yeah, definitely. And that's definitely what attracted me to the opportunity. And I think I have to give credit to the founders; they really wanted a domain where they could assess quality. And when you're building a data set, you know, that's something you have to worry about. And so yeah, it was kind of a perfect match for sure.
BP So before we jump to Josh, tell me, you know, you'd mentioned this when we chatted a little bit before this, that I think Perl and Bash were some of the things you were using; a little shoestring putting-together-a-treehouse is, I think, how you described it. But what were you building? What were you hacking together when it was just you early on to try, as you said, to prove out: hey, you've got this great data set, and if you let us build the right tooling around it, we can provide an actually really valuable product?
MJ Yeah, I mean, I have to say that I actually use Stack Overflow quite a lot. So thank you for you know, your great product. [Miguel laughs]
BP Yeah, of course, welcome to copy and pasting. [Miguel laughs]
MJ I never did that. So, yeah, you're right. So at first, you know, I came from, I guess, a maybe more old-school background. So I was using a lot of Perl and shell scripting. The first nine months or so on the job was actually a lot of exploration and digging into all the data, because, you know, they said we have this data. First of all, you have to find it; you have to figure out how it's organized, which data is useful and which is not. So I wrote a lot of one-off shell scripts, you know, to dig through the data and figure out what to use and how to use it. But yeah, it was a big mishmash of different scripts. And it worked to an extent, you know; it created great proof-of-concept models, and things were working okay for a team of one, eventually for a team of two, three. But yeah, as you said, I often describe it as a rickety treehouse: I built what looked like this treehouse with little posts, you know, barely hanging on. And then at some point Josh joined the team. And I would say that after Josh was there for a few months, it started to look more like a cottage, let's say, rather than a treehouse.
BP So Josh, yeah. Tell us a little bit about sort of your background, maybe a little bit about your education, and then how you found yourself at Rev.
JD Yeah, definitely. I actually got a major in statistics at university, but, kind of like I think a lot of us here, it was just an interest in language and strong computer science skills. That's what kind of landed me here at Rev. I was actually, like, employee number two on the AI team. And what I kind of brought on was: how do we set ourselves up for engineering success? And part of it's like, well, Perl, let me be clear, was my first scripting language. So it has a special place in my heart, right? Nothing against it.
BP Of course, it's a Swiss Army knife. So you know, you could use it for anything you need.
JD Exactly. But the thing is, you also have to be realistic, right? Like, not to compare the two, but if your stack was in, like, COBOL, you just can't hire for that, you know. And thinking about Python: every ML developer uses Python, and it was made quite a while ago. And so part of it was setting us up for success so that other people could work on it. But also practical concerns like modularity, reusable components, being able to test the actual pieces of data preparation.
BP Makes a lot of sense. We just had an episode air, like, maybe two, three weeks ago, where the founder was explaining the fatal mistake of allowing the original engineering team to pick Ruby, which they loved. And now they're trying to scale and just can't find lots of qualified Ruby developers. And, yeah, there's not like a great pipeline of young people who are interested in learning that.
JD Yeah, like, objectively, there's nothing wrong with it. It's just, you have to be pragmatic. And hiring is so hard nowadays; as a startup, you sometimes have to use kind of attractive and hot technology if you're moving that way.
BP So you had this rickety treehouse, and you wanted to modernize some of the language choices and the architecture. What else did you sort of focus on building? I know you had talked to me earlier about the idea of DevOps, but for ML: MLOps. How is that different than what you might expect from traditional DevOps?
JD Yeah, it's kind of interesting, this MLOps thing. Just to distinguish for our listeners: it's distinct from AIOps. AIOps is bringing AI into DevOps; MLOps is bringing DevOps into ML. So at its core, MLOps is still just DevOps, especially at the beginning levels; they're like tiers. When we joined the team, there was nothing; we didn't even have a product to sell, you know. So immediately I brought on: let's actually have reproducibility, let's have containers. We just put things in Docker so we can send it off to the platform team. Let's have a testing pipeline. Experimental code differs from production code in that you often have a lot of dead code paths. There's this really great paper by Google called 'Hidden Technical Debt in Machine Learning Systems', a really great read. But it definitely applies. Like, as you're experimenting, you don't actually want to test that code or even write tests for that code, because you might abandon it in a day. But you have to make a clear distinction between that code and your production code, because the production code, that's what customers see, you know. And so setting up the pipelines, we use Jenkins, right? That was kind of level one for DevOps. And getting to MLOps, that's kind of higher-level stuff in how you think about, well, how do we deploy models? How do we manage metrics? Because a lot of these things are soft, you know; just because the metric says it's better, it doesn't always mean it's really better. Sometimes it'll be better and the metric won't even say it's better. They're really small little edges, and especially as you deploy faster and faster, your improvements will become harder and harder to discern, and sometimes even a little bit conflicting. So MLOps, really building off of DevOps, becomes focused on: how do we organize the data into a state that can be continually live, right, so we don't have model drift? How do we deploy models quickly, so that we can do multi-pronged A/B testing? And I would say that we're kind of on the tier of automated training. And today we're looking into automated deployment: how do we just take away all of the deployment concerns from other developers, so we don't really have to think about evaluation as a thing to do, and it's just automatic.
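As a concrete example of the metric management Josh describes, here is a hedged sketch: word error rate (WER) computed by edit distance, plus a promotion check that refuses to ship a candidate model on metric noise alone. The function names and the threshold are hypothetical, not Rev's pipeline.

```python
# A sketch of a metric gate: compute WER via word-level edit distance,
# and only promote a candidate model when it beats production by a
# margin, since tiny deltas are often noise rather than real gains.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def should_promote(prod_wer: float, candidate_wer: float,
                   min_gain: float = 0.002) -> bool:
    # Require at least 0.2% absolute WER improvement (made-up threshold).
    return (prod_wer - candidate_wer) >= min_gain

print(wer("the cat sat", "the cap sat"))  # 0.333... (one substitution)
print(should_promote(0.120, 0.115))       # True: 0.5% absolute gain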
BP And so, you know, when we had chatted before, you were mentioning that right now it's kind of a hybrid model. So it has some of the older elements that Miguel had mentioned, some of the older algorithms and techniques that were applied across other fields, and then has some of the newer, you know, sort of ML and deep learning stuff in there. Can you talk to me a little bit about, yeah, in what ways it's a hybrid model? And then maybe, Jenny, we could chat a little bit about sort of where you want to go in moving, you know, to that kind of end-to-end solution?
JD Yeah, I'll definitely pass some of this on to Jenny. But, um, in terms of a hybrid model, what it means is, so the way speech recognition works is that you have the deep neural network, which takes the audio and converts that into some sort of phonetic sequence. And actually, it's not a sequence, it's an entire tree of possibilities. So when I say 'cat', it's like 'k-at', right, but sometimes you'll have uncertainty. So that's where the tree comes in. And then the hybrid part is where you have non-neural-network components, such as FSTs or statistical n-gram models, which then figure out which path makes sense, right? Oh, 'cat' is a word. So maybe I'm just gonna follow that and mark that as a word. With the end-to-end kinds of models, they can use similar approaches, actually. But the idea is like, well, can we just use sequence-to-sequence modeling directly to go from audio to text?
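Here is a toy illustration of that hybrid decoding step: an invented acoustic lattice proposes 'cat' versus 'cap', and a made-up bigram language model rescores the paths to decide which one "makes sense". Real systems compose weighted FSTs over enormous vocabularies; this sketch only shows the idea.

```python
# A toy sketch of hybrid decoding (all numbers and tables invented):
# the acoustic model proposes word alternatives with confidences, and
# a simple language model rescores every path to pick the best one.
import itertools, math

# Acoustic hypotheses: at each position, candidates with P(word|audio).
lattice = [
    [("the", 0.9), ("a", 0.1)],
    [("cat", 0.55), ("cap", 0.45)],  # acoustically ambiguous
    [("sat", 0.8), ("set", 0.2)],
]

# Bigram language model probabilities P(word | previous word), made up.
bigram = {
    ("<s>", "the"): 0.5, ("<s>", "a"): 0.3,
    ("the", "cat"): 0.2, ("the", "cap"): 0.01,
    ("a", "cat"): 0.1, ("a", "cap"): 0.05,
    ("cat", "sat"): 0.3, ("cat", "set"): 0.01,
    ("cap", "sat"): 0.05, ("cap", "set"): 0.05,
}

def score(path):
    # Combine acoustic and language model log-probabilities.
    s, prev = 0.0, "<s>"
    for word, acoustic_p in path:
        s += math.log(acoustic_p) + math.log(bigram.get((prev, word), 1e-6))
        prev = word
    return s

best = max(itertools.product(*lattice), key=score)
print([w for w, _ in best])  # ['the', 'cat', 'sat'] -- LM breaks the tie
```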
BP Very cool. Well, Josh, thank you for that. It's really interesting to hear about what you worked on and how you built up some of the infrastructure there. Yeah, this idea of transitioning from the older models and the hybrid models to the newer end-to-end models, with some of that really interesting AI on it. Jenny, let's transition over to you. Tell us a little bit about sort of where you came from on the academic side and how you got into the field. Your title is Senior Speech Scientist at Rev.
JDR Yep. So for me, sort of similar to Josh, I was really interested in the STEM field for a long time and studied computer science in college. I was also interested in language; I took a lot of psychology and linguistics in college, and I did a minor in neuroscience as well. But when I was in college, which was a while ago now, you know, I was in a relatively small program; there was no sort of NLP-specific coursework. So I did some machine learning, but didn't really have a chance to get into sort of the specifics of, you know, NLP or speech recognition or anything like that. So after college, I thought about just sort of going directly into software engineering, but that wasn't quite where my interest was. I was sort of on the fence about whether I wanted to go to graduate school or not. So I ended up working for a few years after college and sort of lucked into a job working on translation. And it was sort of the translation equivalent of the older technologies that Miguel was talking about, so these statistical models; you know, there was no deep learning anywhere in there. But I decided to go to graduate school and actually decided on a project that was more speech-focused. And that was how I got into speech recognition. The transition in the field to deep neural networks really happened right around when I started graduate school in 2013. And so that was sort of my first experience with that. So it was definitely a big transition going from, you know, my work with statistical models to understanding these deep neural network models and how they work.
BP Yeah, 2013, that was right around when they had that kind of revolutionary ImageNet competition, where, you know, before they'd been using older models, and then they brought in some of these ML and DL techniques. And suddenly there was kind of a step change, right, in the accuracy of some of it?
JDR Yeah, absolutely. That was also right around when Siri came out. I remember when I was first applying to grad school, or talking about applying to graduate school, you know, people outside the field didn't really understand what I was talking about, or what I wanted to study. But by the time I got in and actually started grad school, I could just say to people, oh, it's like Siri, and people would immediately know what I was talking about.
BP Right, right. Can you help me a little? Because when you say, you know, the older techniques were based more on a statistical model, that makes sense. But at the same time, I've often heard people say, well, you know, machine learning, deep learning, deep down it's just statistics anyway. What is the difference? Like, when we're talking about sort of that black-box model with different weights, you know, you're giving it these inputs, and then checking the outputs and trying to adjust the weights to get towards what you want, and that's reinforcement learning. How does that differ from other statistical models that you had previous experience with on the academic side?
JDR Yeah, so there definitely are a lot of similarities in some of the underlying math, sort of core statistics principles. With statistical models, typically, I would say they're a lot less of a black box than a neural network model. So for machine translation, for example, what a statistical model is trying to do is actually sort of count up every possible pair of words in the source language and the target language and build a probability table of how likely a particular word in the source language is to be translated into a word in the target language. Similarly, I think Josh mentioned statistical n-gram models for language modeling; that, again, is just a big probability table that tells you the likelihood of different n-grams. So statistical models in general are basically counting up statistics over your data. And they're a little bit more transparent in terms of being able to dig in and see what the model has learned. With these end-to-end deep learning models, all of that is sort of buried in the weights of your network. And it's a lot harder to go in and understand exactly what the model is really doing.
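Jenny's "big probability table" can be shown in a few lines: a bigram model really is just normalized counts over the training text, and you can print the table and inspect exactly what it learned. The corpus here is obviously a toy.

```python
# A sketch of a statistical bigram language model: count word pairs
# over the training text and normalize into conditional probabilities,
# producing a table you can inspect directly.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat slept".split()

bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus[:-1])  # counts of history words

prob = defaultdict(dict)
for (w1, w2), c in bigram_counts.items():
    # P(w2 | w1) = count(w1 w2) / count(w1)
    prob[w1][w2] = c / unigram_counts[w1]

print(prob["the"])  # {'cat': 0.666..., 'mat': 0.333...}
print(prob["cat"])  # {'sat': 0.5, 'slept': 0.5}
```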
BP So let me see if I can walk this back, and tell me if I get this right. So yeah, imagine the examples from before, where Josh was saying, you know, it's saying 'cat' but maybe you said 'cap'. If it had other words in the sentence, or some, you know, statistical model of where cat and cap might fit, then it might be able to figure out: okay, I think there's a better probability that this is what they said than that, so we can work backwards from there. Or, like you said, if it was in translation: well, if we're following a few words that we feel confident about in this sentence, probably we can infer what this last word would need to mean. Whereas with the deep learning model, again, you're adjusting the weights and stuff, but there's just sort of that magic happening in the black box. And you just keep tuning it so that, you know, the outcome gets better and better. You're not really sure why.
JDR Exactly. And one of the other features of these statistical models is that typically you'll have sort of a series of smaller models that solve smaller tasks that get put together to ultimately do speech recognition or translation. So like Josh was talking about with speech: with these hybrid models, you have one model that sort of helps you go from audio to this phonetic representation, then we have a separate lexicon piece that actually defines the pronunciations of every word we want to be able to recognize, and then this separate language model piece that deals with the order of words and sort of the meanings. With the end-to-end model, that's all learned together. One objective, you know, one big model that just does everything. [Ben laughs]
BP Right. One magical inscrutable AI system.
JDR Exactly, yep.
BP Okay, so tell me a little bit about, yeah, what you work on day to day, and then I would love to chat with you a little bit about what, you know, you see coming down the pipe. You know, what's Rev going to be focused on over the next couple of years? And what are you excited about, you know, with this sort of broader field of technology, over the next decade?
JDR Yeah, definitely. So my current work is really focused on our transition from hybrid models to end-to-end models. So right now in production, if you actually use either the auto transcription on rev.com or the rev.ai API, that's using a hybrid model. But we're doing a lot of research on end-to-end models, trying to find, you know, the best configurations, architectures, and toolkits to be using with our particular data, and trying to understand, you know, the trade-offs between accuracy and performance, and ultimately how we can get these end-to-end models into production. So that's really what my day-to-day focus is.
BP And I know you were telling me there are certain things that you're pretty excited about; one of them was bilingual translation. Can you tell me a little bit about sort of what the challenges are there, and, you know, what you think these sorts of systems are gonna be able to achieve in the near future?
JDR Absolutely. So one of the things that's really cool about these end-to-end models is that basically, whatever data you have, it can learn to handle. So a very similar architecture can do sequence-to-sequence learning with different kinds of sequences. So the model architecture that you might use for speech recognition can actually look very similar to what you might use for translation. And you can also use that same architecture to, say, feed in audio in lots of different languages and be able to do transcription for any of them within one model. It's much harder with the hybrid models to sort of put all the right pieces together to make that happen. But for end-to-end, it's pretty straightforward. So one of the things that we're really excited about is expansion into other languages. And in particular, when we do that, taking advantage of the English data that we already have to actually produce models that can do, say, English and Spanish at the same time. We know that a lot of languages around the world borrow words and phrases from English, and a lot of people might switch back and forth in their conversations between multiple languages. So we think moving forward with these end-to-end architectures that we'll be able to have some pretty cool results in terms of multilingual transcription.
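As a hedged sketch of that point (this is not Rev's architecture), here is a minimal PyTorch encoder-decoder where only the input front end changes between tasks: feed it 80-dimensional audio frames and it is a speech recognizer; feed the same core source-language token embeddings instead and it does translation.

```python
# A minimal sketch of one sequence-to-sequence core serving different
# sequence types: the Transformer encoder-decoder is shared, and only
# the front end that projects the input sequence changes.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, input_dim, vocab_size, d_model=256):
        super().__init__()
        # Front end: project whatever the input sequence is (audio
        # frames, or source-language embeddings) into the model width.
        self.front_end = nn.Linear(input_dim, d_model)
        self.core = nn.Transformer(d_model=d_model, batch_first=True)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt_tokens):
        memory_input = self.front_end(src)
        tgt = self.embed(tgt_tokens)
        dec = self.core(memory_input, tgt)
        return self.out(dec)  # logits over the output vocabulary

# ASR flavor: 80-dim log-mel filterbank frames in, text tokens out.
# (A translation flavor would just change input_dim to an embedding
# size and feed token embeddings as src.)
asr = Seq2Seq(input_dim=80, vocab_size=5000)
frames = torch.randn(1, 200, 80)          # ~2s of audio features
tokens = torch.randint(0, 5000, (1, 20))  # partial transcript so far
print(asr(frames, tokens).shape)          # torch.Size([1, 20, 5000])
```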
BP Yeah, that's awesome. So I guess, yeah, one of the things you mentioned, which I wanted to touch on, was the idea of sort of higher-level feature sets or higher-level offerings that can complement what goes on with the humans who are doing this. So obviously, whenever we talk about Uber and driverless cars, or, let's say, Rev and automatic transcription, you know, one concern is: okay, well, what happens to all the people that have been doing this work? Can you tell me a little about how that works, and your perspective on how AI and humans can work productively together?
JDR Yeah, definitely. So as Miguel mentioned, Rev has this mission to create great work-from-home jobs. I think Rev is still the biggest customer of our automatic transcription. So what we do is use our transcription as a first draft that our Revver freelancers can edit, instead of having to do all the work from scratch. So basically, the better we make our technology, the more productive they can be. So that's important to me. And that's an aspect of Rev that I really like.
BP I am a customer, because we do my transcription in Descript, and then I send it to a person who cleans it up. I'll do it with this podcast. Super meta.
JDR Yep. And so I think, as good as transcription gets, to me you're always gonna want that human touch. AI and humans actually make pretty different kinds of mistakes. So I think often the mistakes that AI makes are a little bit strange and almost off-putting. So even if we can get to the point where AI is making sort of the same number of mistakes as a human, I think you're always going to want that second pass of someone reading through and editing it, especially if the transcript is going to be used for, you know, a professional purpose. In terms of moving forward towards other services, things like summarization, I think one of the things we've seen at Rev is how amazing it is to have this marketplace and this sort of data collection pipeline. So what I'm excited and hopeful about is, you know, that Rev can move into offering other services, again using this marketplace of freelancers, and that we can build up this, you know, sort of flywheel that we have right now, where we can have humans perform a task, we can use that to produce AI, and that can make humans more productive, and sort of continue that cycle.
BP Right, you might use some of the technology you were just mentioning and building to do a translation service, and the AI does the first pass. But then, you know, somebody who speaks both languages goes in and really, you know, makes sure that it's on point, isn't missing the cultural idioms and things of that nature. And Josh, what about you? Any thoughts you want to share with people who might be working in a similar area or hoping to get into AI, ML, and natural language processing?
JD Maybe some advice for like r&d departments out there?
BP Okay.
JD There's just so much great stuff out there with the open source community. And I just want to say, yes, evaluating solutions is hard. It's just like a multi-armed bandit problem, evaluating all these solutions out there, especially in speech recognition. Now we have, what, ESPnet, WeNet, SpeechBrain, all these kinds of things, and the incumbent Kaldi. I think what carries on with your team is how you set up those common interfaces, how you think about the actual data structures instead of the code, and how you grow the team around those. Like, for example, as we have horizontally scaled our service, from our deliverable of the machine learning model to our platform team, having that shared interface has been everything, you know. It's like, audio in, text out. It's easy to say that, but then you have to define it: what does that actually mean? Text is not text. Does the text have any punctuation in it? How is it delimited? Yeah, just a thought for all those startups out there. You know, it is hard, but you have to think about it in terms of data structures.
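Here is a sketch of what pinning down "audio in, text out" might look like; every type and field is hypothetical, but it shows how the contract, rather than the model code, becomes the thing teams integrate against.

```python
# A hypothetical shared interface for "audio in, text out": the data
# structure makes the hard questions (punctuation? timing? confidence?)
# explicit, so any model can sit behind it interchangeably.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Word:
    text: str          # surface form of the recognized word
    start_sec: float   # when the word starts in the audio
    end_sec: float
    confidence: float  # model confidence in [0, 1]

@dataclass
class Transcript:
    words: list[Word]
    punctuated: bool   # makes the "does text have punctuation?" question explicit

    def plain_text(self) -> str:
        return " ".join(w.text for w in self.words)

class Transcriber(Protocol):
    def transcribe(self, audio: bytes) -> Transcript: ...

# Any model (hybrid, end-to-end, or a stub for tests) can implement
# Transcriber, as long as it returns the same Transcript structure.
```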
MJ Yeah, it's definitely getting hard to keep track of the state of the art. Like, I don't know how many papers in speech recognition are published every week, but right now it's definitely heating up. That's a good comment, for sure.
BP Yeah. I mean, one of the things that's interesting about this particular sector is, right, there's a lot of open source work. There are big companies that are, you know, obviously investing tons of resources in proprietary and product-driven ways, but then also acquiring, like, you know, an OpenAI and trying to merge that. And it does seem like so much of this, maybe because it came out of academia, has embraced, you know, the idea that even sort of cutting-edge discoveries should more often be shared, and pooled, or at least accessible to competitors in a certain way.
JDR I think I've been pleasantly surprised by how willing a lot of the big industry players are to publish their cutting edge techniques and even make some of that open source. So even though it's very hard to evaluate all of the things that are out there, it is really nice that all of that is out there to be evaluated. You know, if you want to build a system like this, you really don't have to start from scratch.
BP And I think that might be sort of a bottom-up thing. You know, I've heard that from engineers who work here and have gone on to do other stuff in the data science world. You know, data scientists and ML folks, they have no shortage of job opportunities. So they're gonna choose to work at a place where they, you know, sort of respect the ethics or the approach that the company has to things like, you know, academic publishing or open source.
MJ Yeah, and I agree that, you know, software has been pretty freely distributed, which is amazing. Kaldi, for example, has probably opened up way more speech recognition applications than we can count. So it's pretty amazing. But the dark side of that is that you can't do anything with it without data. And so, you know, we're very lucky to have the data here at Rev. Not everybody has that privilege. One thing that as a company I want to do even more is, like, share the data we can share, share it so that people can test their solutions on it and reproduce, you know, academic results more easily. Because as Jenny said earlier, it can be difficult sometimes to reproduce results. But yeah, it's a really exciting field to be in right now.
BP Before we sign off, if you were a young person either getting into this field, going through school, or just getting out and looking for a job, what resources would you recommend, whether that's a podcast, video series, or particular books, for people interested in learning more about, you know, NLP, and specifically ML? Do you have any recommendations of resources they should check out?
JD The field changes so fast. You have to get in it.
JDR Yeah. In terms of books, it's really hard to stay up to date.
JD Yeah, in terms of books, I think the only thing that's steadfast is mathematics. So be solid in your statistics; a lot of the fundamentals, like gradient descent, don't really change. But as far as domain knowledge of the field, you just have to get an internship, as hard as that is. [Joshua & Ben laugh]
MJ If I was to recommend maybe one book, it's called Automatic Speech Recognition: A Deep Learning Approach by Li Deng and Dong Yu. It covers the basics of speech rec all the way to end-to-end. I think it's a great book. But as Josh said, things change a lot, so I wouldn't get stuck on the details, rather just more the techniques in there and the stats.
JDR One other thing I would say is that, I think with these deep learning models, the field is moving more and more towards very similar models used across a lot of different types of tasks. So I would focus really on like the core underlying machine learning and understanding that as opposed to trying to specialize too quickly into a specific, you know, NLP or ASR or something like that.
[music]
BP Alright, I am Ben Popper, Director of Content here at Stack Overflow. You can always find me on Twitter @BenPopper, and you can always email us podcast@stackoverflow.com. So all the Rev folks, who are you, just a quick reminder for people who've been listening, and where can you be found on the internet if you want to be found?
MJ Alright, I'm Miguel Jetté, I'm the head of AI at Rev. And you can find me on LinkedIn or on Twitter. My handle is @bonwellphotog.
JD I'm Joshua Ian Dong, you can call me Josh and my GitHub handle is JDongIan. And I am always on LinkedIn. So ping me there.
JDR And I'm Jenny Drexler. Yeah, not too many places you can find me online, but I guess LinkedIn is the best one for me too.
BP And for folks who've been listening are interested and they want to learn more about Rev, they're interested in seeing what kind of careers you have open, where should they go?
MJ Rev.com/careers.
BP Easy enough. All right, well, thank you to all three of you for doing this. And when I post the transcription, I'll be sure to put in the show notes exactly how it was made, so people know. We're dogfooding the technology as we speak.