Ben talks with Dylan Fox, founder and CEO of rapid-growth startup AssemblyAI, about how he became interested in AI and machine learning, why he left a steady job at a tech giant to create something new, and what AI can offer creators like writers and visual artists. Plus: Why Ben threw out $20,000 worth of Magic cards in middle school.
AssemblyAI is an AI-as-a-service provider focused on speech-to-text and text analysis. Their mission is to make it easy for developers and product teams to incorporate state-of-the-art AI technology into the solutions they’re building. Their customers include Spotify, the Dow Jones, The Wall Street Journal, and the BBC. Need AI to run semantic analysis on your forum comments or automatically produce summaries of blog post submissions? Rent an ML model on-demand from the cloud instead of building a solution from scratch.
Just three months after its $28M Series A, AssemblyAI raised another $30M in a Series B round led by Insight Partners, Y Combinator, and Accel. In this economy?
When it comes to new and cutting-edge AI developments, what’s Dylan excited about right now? This open-source implementation of AlphaFold from GitHub user lucidrains.
Connect with Dylan on LinkedIn.
Today we’re shouting out the winner of an Inquisitive Badge: User Edson Horacio Junior asked a well-received question on 30 separate days and maintained a positive question record.
[intro music plays]
Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I'm your host Ben Popper, Director of Content here at Stack Overflow. I am flying solo today without any of my co-hosts, but I have a wonderful guest, Dylan Fox, who is the founder and CEO at a company called AssemblyAI. I'm hoping to chat with him about some of the things that I'm most interested in in this area at the moment like text to image generation, some of the trends we might be seeing in people going back on-prem with their models, what the future of machine learning is going to be as a service, and then what are the areas he's really excited about as well as which ones he thinks might be a challenge or that keep him up at night. So without further ado, Dylan, welcome to the show.
Dylan Fox Yeah, thank you so much for having me on.
BP So let's start at the beginning. What brought you into the world of software and, at a 10,000 foot level, quick synopsis, how did that journey bring you to where you are today?
DF I was always very into video games as a kid. I played a ton of Counter-Strike, RuneScape, World of Warcraft, a ton of MMORPGs. And as part of that, I would hang out on IRC all the time and my brother would build computers in our basement, and I was just always kind of around a computer. And I got into setting up private Counter-Strike servers for a couple friends of mine and me on a remote Windows desktop that we would rent. And through that I built a website for us and learned some basic HTML and CSS. And then I kind of put all of that on hold while I went through middle school and high school and college. And then after college I got back into software development. I actually went to college as an econ major, but then just got into programming and worked on a bunch of side projects.
BP Why did you put it on hold? You were like, “I can't be a nerd my whole life. I’ve got to put this to the side.” That's how I threw out all my Magic cards in middle school so I felt like I could get a girlfriend that way. Now I regret throwing out $20,000 worth of Magic cards.
DF Yeah, yeah, yeah. I had an older brother that went into finance and I was like, “Oh, I should do what my older brother's doing, so let me get a business degree.”
BP Following in his footsteps.
DF Right, exactly. And then pivoted when I got old enough to realize, “Hey, I can actually decide to do what I want to do.”
BP Turns out tech pays better than finance anyway, so joke’s on him.
DF Yeah, exactly.
BP So you got back into computer science and technology after school. What were some of your first gigs and how did you transition from those, and more specifically, into this world of AI and machine learning?
DF Yeah, sure. So I was working on a startup towards the tail end of college with a couple friends of mine. And through that experience I got really into Python and Django and was building out all the backend for the startup that we were working on. I started attending a lot of Python and Django meetups in Washington D.C. where I was living at the time, and I got pretty connected to the community there where I just kind of fell in love with programming. As I started to do more and more web development I realized that I was more interested in backend development, and then as I started to do that I realized I was actually more interested in algorithm problems and machine learning problems. And at the time it was still very classical machine learning focused where I was doing a lot with support vector machines and scikit-learn and NumPy, and deep learning wasn't super popular yet. This was maybe in 2013, 2014. But through that experience I decided that I wanted to get way more into the area of AI and machine learning and ended up taking a job at Cisco out in the Bay Area where I was on a machine learning team as a research engineer more or less, working on NLP and NLU problems for different Cisco collaboration products. And then fast forward a couple more years, I ended up starting AssemblyAI after I had spent a couple years working with neural networks and getting deeper into that tech. So that kind of landed me where I am here.
BP Yeah, I forget. What year was it that there was that breakthrough with the ImageNet competition? There was kind of this watershed moment where these old models that had been proposed in the '70s and '80s and had fallen out of fashion came back, and suddenly we realized, “Oh, we do have the compute now,” and at this level they do work better than some of the more classical models you had been trying.
DF Yeah, exactly. I don't remember the exact year offhand, but it's still happening, and I remember going to the first TensorFlow meetup down in Mountain View. I mean, it's still so new. A lot of these frameworks are still just a couple years old. It's still so new.
BP And so you mentioned you kind of fell in love with this as you were getting back into programming and going to meetups and then as the industry was evolving. So what was the impetus at a certain point to leave Cisco and decide to build your own thing? Talk a little bit about that genesis of making kind of a bold choice to leave a steady job and create something new.
DF I had kind of always known that I wanted to get back into startups and to start another company, but I didn't know what I wanted to do. I knew I wanted it to be a hard technical problem and I wanted it to be something that was very technically challenging, because I thought that was more interesting to work on, but I didn't really know what I wanted to do specifically. But I was in San Francisco for a reason, which was that I wanted to be around startups and work on another startup. I had moved there from Washington to join this team at Cisco as kind of a first step into figuring out what was next. And when we were at Cisco I remember we were exploring a lot of speech recognition solutions as well as a bunch of other NLP solutions, and this is like 2015, 2016. There were a couple papers that came out in the field of speech recognition, one was called ‘Deep Speech’. I don’t know if you're familiar with this, but it was a paper from Baidu and it kind of popularized the idea and potential of using an end to end neural network for automatic speech recognition, where classical approaches were built on a daisy chain of old school models that had kind of hit a ceiling in terms of accuracy they could support. And so we had the idea of, what if we could use the latest AI research to build super accurate automatic speech recognition models and expose those through a Twilio or Stripe style developer experience to start. But the goal was always, what if we can do that not just for speech recognition but for a number of other different AI tasks as well? So today we have AI models for summarization and content moderation and topic detection and a number of different AI tasks that we research, train, and deploy in house. But that was really the goal. It was, what if we can build an API for state of the art AI models that gives developers and product teams easy access to those models?
BP I just looked it up; 2012 was the year that a sort of convolutional neural net made a big breakthrough at the ImageNet Challenge. And like you said, 2013 was kind of the year you felt like things had turned. So it was right around that time that some of these systems came online and started to just outperform everything else by such a large percentage that people who had been dismissing this approach for many years had to sort of wake up and pay attention to what had turned out to be a pretty radical transformation in the space. So let's get to a little bit of how your business works. I like the comparison to Stripe or Twilio. So I run a large internet forum and I want to start to classify the forum posts by sentiment and by topic. You have an API that I can utilize, it ingests data for me, it runs through a system on your side, and then you share back a way for me to tag those forum posts with topic and sentiment. Something like that?
DF Correct, yeah. So you could use our REST API to send audio files too. So when we got started it was all audio based. If we just focus on audio for a second, you'd send an audio file in and with your API request you would configure the parameters to say, “Hey, I want to run this audio file through an automatic speech recognition model. Then I want to also run a content moderation model over the audio to detect if there's any hate speech being spoken or violence being spoken. And I also then want a summary of the transcript.” So you would configure your POST request to do that and then in the JSON response you get back a couple minutes later, all that information is in the JSON response. Now you can also send text documents like forum posts or chat messages directly to our API and have them run through the same AI models. So you can say, “Hey, give me the sentiment for these forum posts,” or, “give me a summary of these forum posts,” or the topics that are being detected or extract the keywords. And you can then get that data back and build a new feature or product on top of that data.
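The request-and-response shape Dylan describes could look roughly like the following Python sketch. The field names (`audio_url`, `content_moderation`, `summarization`, `text`, `labels`, `summary`) are assumptions made up for illustration and are not taken from AssemblyAI's documentation; the point is only that one JSON body selects the models to run and one JSON response carries all of their results.

```python
import json


def build_request(audio_url, *, moderate=False, summarize=False):
    """Return a JSON body asking for transcription plus optional extra models.

    All field names here are hypothetical placeholders.
    """
    body = {"audio_url": audio_url}
    if moderate:
        body["content_moderation"] = True  # hypothetical flag
    if summarize:
        body["summarization"] = True  # hypothetical flag
    return json.dumps(body)


def read_response(raw):
    """Pull the transcript, moderation labels, and summary out of a JSON response."""
    data = json.loads(raw)
    return {
        "transcript": data.get("text", ""),
        "flagged": data.get("content_moderation", {}).get("labels", []),
        "summary": data.get("summary"),
    }
```

A caller would send the string from `build_request(...)` as the body of the POST, then pass the eventual response text to `read_response(...)`.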
BP And so you mentioned you started out with speech to text. What universe of functionality do you now encompass?
DF Yeah. So we have about a dozen models right now and that number is growing, but we have models for embeddings, for summarization, content moderation, topic detection, PII redaction, really all around the task of NLP. So a number of AI models for NLP, and those can operate either on audio files, video files, or text documents and we have APIs that can ingest the different types of data. We have a lot of things in the works to launch our own large language models, potentially even text to image models which I know you're interested in too, that are kind of further off because those are so new. But for right now we have a number of AI models that are state of the art for those different NLP related tasks.
BP Did you rely on open source work or existing other models to create yours, or were these created in-house? Where do these come from? And I guess you just mentioned, you're thinking about perhaps releasing your own. Would you then allow other people to sort of use that and make their own permutations of it?
DF So all the models that we deploy are researched and trained in house. We leverage open source deep learning libraries like PyTorch and TensorFlow to train those models of course, but we're not just taking an open source model and then deploying it. We have a team of about 20 AI researchers in house from places like DeepMind, Google Brain, open source communities, who are every day researching and training new models that we are then deploying. Yeah, everything is trained in house.
BP So that brings me to a question. I was thinking maybe I would write a trend piece on this but I wasn't sure if it was a real trend. I saw one story on it and then it was mentioned by a guest on one of our podcasts, that people are beginning to bring some of these computationally intensive machine learning tasks in house. They have their own hardware, their own instances to run this stuff, especially as in your case, they've created the models themselves. And in the end, they have more control, they end up saving money over constantly going to a big cloud provider to run them. And if they're doing something that requires less latency like knowing when to restock shelves at a supermarket, they're able to perform that task with more accuracy and less latency in house. What's your perspective on this?
DF I think there's a lot of layers to that question. So definitely big cloud GPU compute is very expensive, whether it's for training or for inference. So we do most of our training on on-prem instances. For example, we recently just purchased a couple hundred more Nvidia A100 cards that we have on on-prem instances that we use to train.
BP The crypto winter has been kind to you. Those are on sale now, or they’re not as ridiculously expensive as they used to be.
DF Yeah, yeah. There's crazy supply chain issues for those too. I think it took a couple months to get some parts in before we could fully utilize everything. Anyway, we do a lot of training on-prem. We do overflow a lot of our training to the cloud, but I think that it really depends. If you can do inference of your model on a CPU or a consumer grade GPU that isn't super expensive, then for a lot of cases, I've seen in some airports now they have these stores that have all these cameras for cashierless checkout. So they're tracking you with computer vision models in the store and then detecting what you pick up and you can just walk out and they'll charge you.
BP Totally not creepy. Sounds like a lovely experience.
DF Well hopefully it actually is a local AI model that's running and doing that and not someone in a room in Silicon Valley watching you and then clicking some interface like, “They just picked up Doritos, they just picked up a Pepsi,” which could very well be the case with the amount of funding that startups have been getting the last couple of years. So I think it really depends on the model that you're training and running. If you're training state of the art large AI models today, a lot of times you kind of have to use a big cloud because you need access to the hundreds of GPUs if not thousands of GPUs to train those models. And then if you're talking about deploying those models for inference, for example, we're processing millions of audio files and text documents every day with our API and all of those requests get routed to models that are running on GPUs. We need high availability of GPUs, so a lot of times you have to still go to the cloud today to get that. But if we weren't running as compute intense models, maybe we could have some local servers that we could run if our models were CPU based. So I would say it really depends on the problem that you're going after. Not every problem needs to be solved with a billion parameter neural network, but training for sure is most commonly done on-prem.
BP I guess that brings me to my next question. Let's say you have these different flavors, you work at a company or you manage an organization that is not necessarily technology first, I'm sure it utilizes hardware and software and the cloud in some way, but you want to have machine learning for some function. You think it will be more efficient or productive or cost saving versus what you're doing now. And you have these choices between maybe making a model in house and then running it through these big clouds or making a model in house and running it on your own stuff. Or then more like what you're offering, which is kind of AI as a service. Is that how you refer to it?
DF AI as a service. That's the trendy term now I think.
BP Yeah, I decide not to stock up on my own hardware or train my own models but just turn to you. What are the pros and cons there, and where's the sweet spot that you're finding customers?
DF The way we think about it is, our customers are product teams that are trying to implement these AI models into their products to power these features, sometimes even new products or even new companies, and we give them the easiest way to get access to these state of the art, production-ready models that we maintain, constantly update, and give a really good customer experience around. I think that no one has really, to be honest, solved the go to market for an AI as a service company. There's a lot of different flavors. There's a lot of open source first companies. There's a couple companies like us that are API first. And then you have the pure research labs like DeepMind that are focusing on protein folding and are more services focused. So I think the opportunity for an AI as a service company is huge, because just like most companies need a database today, I think most companies are going to need some form of AI in their products in the future. There's so many different tasks that can benefit from state of the art AI models. Our opinion is that it's going to become fairly easy to get a toy model running through some type of open source library, but to have a production ready AI model that can scale, serve large scale requests, be state of the art, maintain state of the art, that's going to be very difficult for a lot of companies to do internally which is why we're trying to be the place that those companies can come to to get that technology and get that AI tech into their products so they don't need to take on the competency of trying to hire state of the art AI researchers, maintain compute infrastructure to train models, constantly research, deploy, host. There's a lot of work behind that to run state of the art models.
If you don't need state of the art tech it's definitely easier to do internally, but I think that at a macro level, most products are going to have to use AI in some capacity and are going to use AI in some capacity, whether it's for audio tasks or NLP tasks or vision tasks. And our opinion is that we're trying to give product teams and companies the easiest way to quickly get that state of the art AI tech into their products so they can focus on the vertical applications they're building or whatever it is that makes their product unique and accelerate their development.
BP Yeah, let me ask another question. So you made good comparisons earlier to like a Twilio or a Stripe, and you keep referring to a company that has an engineering or product team but doesn't have a deep competency or want to invest the resources in staying state of the art all along the way. What about sort of more like the Shopify model, which is more to the consumer? Can you envision a world in the future where you might be offering useful AI tooling that is easy enough for somebody with a WordPress blog or an Instagram account that is their small business, that is their creator economy, to hook up to and utilize? Or do they need to have a couple programmers in house to figure all this out?
DF That's a great question. I think a company like that will exist. Where I sit today I don't think that will be us. I think we're really focused on developer tools and developers and product teams. And I think there might be some product team or developer that builds that on top of our AI models, but while I think that could exist and will probably, we are really focused on just empowering developers today and product teams, not the consumers yet.
BP So if a developer comes to you and they want to start to use your stuff, what do they need to know? Do they need to know certain languages and frameworks? They just need to know how to make a REST API call? What does a developer need to know to work with you?
DF They just need to know how to make a REST call. So it's actually a really simple API, you just make a POST request with your data and basically the models that you want to run against your data, and then you make a GET request to get the results of the model a couple minutes later. So it's a really simple API and if you know how to make an HTTP POST and GET, you can use the API. So it's a very simple learning curve for most developers.
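As a rough illustration of that POST-then-GET flow, here is a minimal Python sketch using only the standard library. The base URL, the `/jobs` path, the `Authorization` header format, and the `id`, `status`, and `error` fields are all placeholders assumed for the example, not AssemblyAI's actual API. The `send` parameter exists only so the network call can be swapped out.

```python
import json
import time
import urllib.request

API_BASE = "https://api.example.com/v1"  # hypothetical base URL


def http_json(method, url, api_key, body=None):
    """Send a JSON request and return the decoded JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Authorization": api_key, "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def submit_job(payload, api_key, send=http_json):
    """POST the data and model options; return the job id (hypothetical field)."""
    return send("POST", f"{API_BASE}/jobs", api_key, payload)["id"]


def wait_for_result(job_id, api_key, send=http_json, interval=5.0, max_attempts=60):
    """GET the job repeatedly until it reports completion or an error."""
    for _ in range(max_attempts):
        job = send("GET", f"{API_BASE}/jobs/{job_id}", api_key)
        if job["status"] == "completed":
            return job
        if job["status"] == "error":
            raise RuntimeError(job.get("error", "job failed"))
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_attempts} polls")
```

Polling on a fixed interval is the simplest choice here; a real client might back off between attempts or rely on a callback instead.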
BP Do folks get to choose then? Can they say, “I need this back in one minute,” or, “Put me in relaxed mode, it'll cost less. Get it back to me in an hour.” Or, “I need this much noise,” or, “I need this much clarity.” Do they have choices like that?
DF We do have a lot of parameters that you can set in your request to customize how the model performs to your specific use case or what you're trying to do. For example, our summarization model can return many different length summaries. You can get a few word summary, a single sentence summary, multi-sentence summary, you can control that. With our transcription model you can control the vocabulary, the language, different dialects, you can do all that with your API request. And then depending on the API you use, the speed is different. So we do have different APIs for speed, like we have real time APIs over WebSocket protocols. We have synchronous APIs so you actually get the response back immediately in the request response loop. And then we have asynchronous APIs where you’re basically submitting a job and then getting a webhook a couple minutes later when it's done. So the pricing is different for each and it really depends on what you're building, but you have a lot of options basically, depending on what you're trying to do.
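For the asynchronous option, where a job is submitted and a webhook arrives when it is done, the receiving side might be sketched like this in Python. The notification fields `job_id` and `status` are assumptions for illustration, as is the choice to answer with 204; the actual payload and expected reply would come from the provider's documentation.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_notification(body):
    """Extract the job id and status from a webhook body (hypothetical fields)."""
    data = json.loads(body)
    return data["job_id"], data["status"]


class WebhookHandler(BaseHTTPRequestHandler):
    """Accept POSTed notifications that say a submitted job has finished."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            job_id, status = parse_notification(self.rfile.read(length))
        except (ValueError, KeyError, TypeError):
            # Malformed or unexpected payload: reject it.
            self.send_response(400)
            self.end_headers()
            return
        print(f"job {job_id} finished with status {status}")
        self.send_response(204)  # acknowledge with no response body
        self.end_headers()


def serve(port=8000):
    """Run the webhook receiver until interrupted."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the listener; the URL it is reachable at would be the one supplied when submitting the job. Compared with polling, this trades repeated GET requests for the need to expose an endpoint.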
BP So what's some of the stuff you're most excited about? I know you mentioned AlphaFold. That one is super exciting in the abstract. It's beyond me how it's really being applied. Other stuff I love that you mentioned is DALL-E and Midjourney. I signed up recently for the paid version of Midjourney just because it's super fun. I use it to make imagery for playing Dungeons & Dragons with my sons. It's like, “Oh, I need a new character. Oh, I need a new setting.” Just dream it up, if it doesn't look perfect it doesn't matter. You get the idea. And I guess it occurs to me that for the enterprise version which is a couple hundred bucks a year, if I can get two or three blog post illustrations or YouTube thumbnails out of that, now the tool has paid for itself. So what are you most excited about in some of that sort of cutting edge AI stuff and what are you most worried about– I'll ask that next. But as I use Midjourney, I worry about the future career of illustrators that we have employed as this stuff gets better at machine learning speed.
DF Right. Actually, I'll send you a link later, but I did see that there's an open source implementation of AlphaFold that's starting to be built out. I'll send you a link, it's pretty cool. There's this guy out on GitHub, I don’t know if you've come across his GitHub repo or his GitHub username, it's lucidrains. But he does all these open source implementations of these new neural networks that come out. He did one for DALL-E 2 and for Imagen. You should check it out. At today's point in time the text to image models are so cool. I'm also a customer. I signed up for a paid plan of Midjourney and I think their models are so fun and creative. And one of our investors is Nat Friedman and I remember seeing him post some of these images on Twitter a while ago and I was blown away. And then I did some research because Midjourney was kind of on the DL for a while, and got into the beta and it's so cool. So the text to image models are definitely very trendy right now. There's open source versions coming out almost every week. I don't know if you've seen Stable Diffusion, if you've heard of that, but there's an open source diffusion model, a text to image model, that you should check out called Stable Diffusion, which can run on consumer grade hardware locally in like a second, so you should check that out. But that's the stuff that I'm most excited about, and I think that the rate at which we're seeing progress in the AI field is just crazy. And the consolidation of different domains into single bodies of research is also kind of crazy. Even a decade ago, different model architectures were used for different tasks, whether it was vision or speech or NLP. But now transformers are used for most tasks and you don't need to have that much domain specific competency to work in different domains if you just know how to build these AI models.
So that is really exciting and I think is accelerating the rate of research that's happening because you don't need to have all these disparate threads of research. It's just kind of all in one direction and so it's seeing a lot of acceleration. And I think that that is really exciting, but also from a fear perspective, we are going to see some pretty crazy changes, you know? I do think within a year, you can imagine being in some video editor tool and just typing in, “I need this clip art image or whatever to drop into the image.” Or you can probably change the entire background with the diffusion model or text to video models. I think we're going to continue to be really blown away by what comes out of the AI field over the next 2, 5, 10 years.
BP Yeah, you make a great point which I've kind of noticed. AI sits at the intersection of academia, open source, and the largest most well funded corporations, as well as lots of aggressive and well funded startups, and the freedom of that data and latest models and latest developments to be shared through research and then open sourced really seems to be driving an innovation cycle at a pretty incredible speed. To your other point, I myself am not a, “AGI is coming and Skynet is coming” kind of guy. It's more of what will the impact be on the economy or on a certain subset of workers, or what will the impact be on our ability to distinguish real information from misinformation. The real time video AI generator that cooks up something right after a big incident and publishes it on Twitter is a scary thought.
DF Yeah. I mean, someone was telling me this, you could start a whole line of children's books with all the illustrations being from a diffusion model and the story being from a large language model, and you could just farm those out and automatically publish those to Amazon. And I agree, I'm not sold. I think the current technology that we have in the field of AI, I don't think AGI is just going to be birthed out of that or emerge out of that primordial soup of AI tech that we have today. I think there's going to have to be some new technology invented. But I do think there are going to be these new tools that are exposed to everyday people that are powered by AI. And you already see this with devices like Alexa and Google Home and Siri. I know people give those products a hard time, but they are very new and they're pretty amazing. I remember when I first got the Alexa and I could talk to it from across the room and start a song, that was mind blowing. And I think that similarly there's going to be tools for creators, for writers, that are going to be powered by AI that are going to really change so many industries.
[music plays]
BP So normally I shout out the winner of a Lifeboat Badge, but we have done so many podcasts this week that we're out. So when that happens I like to shout out the winner of an Inquisitive Badge: somebody who came on Stack Overflow and asked a well received question on 30 separate days, maintaining a positive question record. So thank you to Edson Horacio Junior, you've been very curious and you've helped to spread some knowledge around the Stack Overflow community. We appreciate you. As always, I am Ben Popper, Director of Content here at Stack Overflow. You can find me on Twitter @BenPopper. You can email us with questions or suggestions, podcast@stackoverflow.com. And if you like the show, why don't you leave us a rating and a review. It really helps.
DF So I’m Dylan Fox, founder of a company called AssemblyAI. The best place to check us out is our website AssemblyAI.com or our Twitter handle @AssemblyAI. We post a lot of fun developer content, so that's the best place to check us out.
BP Wonderful. All right, everybody. Thanks for listening and we will talk to you soon.
[outro music plays]