The Stack Overflow Podcast

Balancing a PhD program with a startup career

Episode Summary

Cameron Wolfe, Director of AI at Rebuy and deep learning researcher, joins Ben for a conversation about generative AI, autonomous agents, and balancing a PhD program with a tech career. 

Episode Notes

Rebuy is an AI-powered personalization platform. Check out their developer hub, explore case studies, or keep up with their blog.

Cameron is a PhD student in computer science and member of the OptimaLab at Rice University. 

Autonomous agents are AI-powered programs that can create tasks for themselves in response to a given objective. They “can create tasks for themselves, complete tasks, create new tasks, reprioritize their task list, complete the new top task, and loop until their objective is reached,” according to one beginner’s guide to autonomous agents.
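
The loop quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_agent`, `toy_execute`, and the "shortest description first" reprioritization rule are hypothetical names and choices made for this example, and the executor is a stub standing in for what would normally be a call to a language model.

```python
from collections import deque
from typing import Callable, List, Tuple

# Hypothetical executor signature: (objective, task) -> (result, follow-up tasks).
Executor = Callable[[str, str], Tuple[str, List[str]]]


def run_agent(objective: str, first_task: str, execute: Executor,
              is_done: Callable[[List[str]], bool], max_steps: int = 20) -> List[str]:
    """Complete tasks, collect new ones, reprioritize, and loop until done."""
    tasks = deque([first_task])
    results: List[str] = []
    for _ in range(max_steps):
        if not tasks or is_done(results):
            break
        task = tasks.popleft()                      # complete the current top task
        result, new_tasks = execute(objective, task)
        results.append(result)
        tasks.extend(new_tasks)                     # create new tasks
        tasks = deque(sorted(tasks, key=len))       # toy reprioritization (assumption)
    return results


def toy_execute(objective: str, task: str) -> Tuple[str, List[str]]:
    # Stub in place of a model call: "plan" spawns two subtasks, the rest are leaves.
    if task == "plan":
        return "planned", ["write draft", "review"]
    return f"did {task}", []
```

Calling `run_agent("write a blog post", "plan", toy_execute, lambda r: len(r) >= 3)` runs the plan step, then the two subtasks it created, and stops once three results exist.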

Follow Cameron’s work on Twitter or Substack, or his website. Read his publications here.

This week’s Lifeboat badge honoree is Mark Setchell for sharing their knowledge with the world: I need to convert a fixed-width file to 'comma-delimited' in Unix.
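
For readers who want the general idea of the Lifeboat question without leaving the page, here is a rough Python sketch of converting fixed-width records to comma-delimited output. It is not the honored answer, and the column widths are made-up values for illustration; a real file would need its own widths.

```python
import csv
import io
from typing import Iterable, List


def fixed_width_to_csv(lines: Iterable[str], widths: List[int]) -> str:
    """Slice each line into fields of the given widths and emit CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        line = line.rstrip("\n")
        fields = []
        start = 0
        for width in widths:
            # Take the next `width` characters and trim the padding.
            fields.append(line[start:start + width].strip())
            start += width
        writer.writerow(fields)
    return out.getvalue()


# Hypothetical layout: 6-character name, 8-character surname, 2-character age.
sample = ["John".ljust(6) + "Smith".ljust(8) + "42\n",
          "Ann".ljust(6) + "Lee".ljust(8) + "7 \n"]
converted = fixed_width_to_csv(sample, [6, 8, 2])
```

Using the `csv` module rather than joining with commas by hand means a field that itself contains a comma gets quoted correctly.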

Episode Transcription

[intro music plays]

Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I am your host Ben Popper, the Director of Content here at Stack Overflow. Today I am joined by somebody who I started following on Twitter, found very interesting, and wanted to bring on to discuss the topics that are kind of front of mind for Stack Overflow the company and for a lot of software developers. People cannot say ‘Generative AI’ enough. Cameron Wolfe is a PhD student who is working on things in this area: deep learning, neural nets, AI, over at Rice University in Houston, Texas. He's also the Director of AI at Rebuy Engine, which is creating intelligent shopping experiences. And he has a terrific Twitter account and a newsletter that has helped me to stay on top of this stuff as best as one can when research breakthroughs and new products are rolling out daily. So Cameron, welcome to Stack Overflow Podcast. 

Cameron Wolfe Thanks for having me. Really excited to be here. 

BP So give us a little bit of background. How did you get into the world of software and technology, and when did you start to hone in on the world of AI?

CW Yeah, so actually the first time I ever wrote a program was after my senior year of high school. I basically graduated high school and decided not to work over the summer, so I wanted to take a break before I got into undergrad. So it took me about two weeks to get bored and I had never known anything about programming, I was terrible with computers and stuff like that, and I decided to go onto Codecademy and try and learn how to program a little bit. And I remember immediately having a thought where I was like, “Huh, wouldn't this be funny? This is pretty fun. This would be funny if I completely changed my major to this,” or something like that. But at the time it was too early so I was like, “No, I'm not going to do that.” But that was the first time I had ever written a program or anything. I went to undergrad and majored in mechanical engineering initially. I didn't like it very much and I ended up joining, at UT Austin which is where I got my undergrad, something called a freshman research initiative stream. It's a program that takes undergrads that are very early in their upper level education experience and introduces them to academic research. And UT Austin has one of the top programs for this in the world, which was awesome. And because I had previously done a little bit of programming over the summer, I thought it would be cool to get into a computer science based research stream. The one that I ended up joining out of pure luck, because the guy who ran it also was a mechanical engineer by training, just happened to be focused on AI. It was run by the neural networks research group at UT Austin who I did research for for three years after that. They looked into a small subset of AI called genetic algorithms, which looks into instead of training neural networks with gradient based methods which is the standard, they train them with these evolution techniques that are inspired by evolutionary processes in biology. So I did that for three years. 
Somewhere during that process I decided that I wanted to change my major to computer science. So it took me up until the end of my junior year to do that and then I had to take the entire computer science coursework during my senior year of college, which was pretty interesting but I made it work. And then from there just from doing research, I knew that I liked AI, I liked computer science, so I applied to PhD programs. I got rejected by 17 of them and then accepted by one program, which was Rice University in Houston, which is a great school. So then I just went there and continued doing research. Now my research is more related to gradient based techniques, just more standard mainstream deep learning research. So I got a little bit away from the research that I did previously, but it's all related to just making deep learning more practical and easy to use. And I'm actually defending my dissertation in two weeks, two and a half weeks. 

BP All right, well I won't take up too much of your time. I know you need to study and prepare for that; I didn't realize you were on the clock. That's fascinating. I think most people, including myself, are sort of passingly familiar with this idea of stochastic gradient descent and backpropagation and that there are these mathematical techniques you can use basically to minimize the loss function and therefore get closer and closer to getting it right, which in the case of LLMs is like, “Guess the next word in the sentence.” From that tiny idea, we have now built this sprawling brain in a box that can guess not just the next word, but give you whole paragraphs. I'm curious before we get back into the mainstream, on the genetic side, how does it work? You're also trying to minimize the loss function or increase the reward, but through some sort of survival of the fittest genetic kind of adaptation instead?

CW Exactly. So when you have a gradient based technique, you define a loss function. So that's basically you make some predictions over data and the loss function tells you how good those predictions were. And then using calculus and some simple rules, which is actually only one rule called the chain rule, which is pretty crazy, that kind of underlies pretty much everything in modern AI, pretty cool. But using that one rule, you can basically say, “I need to tweak the parameters of my neural network in this direction to decrease the loss function,” and the lowest possible loss function kind of corresponds to getting all of your training data classified correctly– or not necessarily classified, maybe it's like object detection or something. But genetic algorithms use the same type of loss function. The only difference is that we're not using any calculus. And to understand at a high level how it works we can look at something called hill climbing, which is a very simple version of genetic algorithms. And basically all you do is you have the current parameters for your neural network and you generate a random change to those. So you can just generate some random numbers and add or subtract them from your current parameters, and you check, “This neural network, does it achieve a loss that's lower than the previous one? If so, replace the weights with the current weights that we just updated. If not, then just retain the other weights.” Genetic algorithms get a bit more complicated. 
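
A minimal sketch of the hill climbing loop described here: perturb the parameters at random and keep the change only if the loss goes down. The Gaussian noise, step count, and `toy_loss` (a two-parameter line fit standing in for a neural network) are assumptions chosen for illustration, not details from the conversation.

```python
import random
from typing import Callable, List


def hill_climb(loss: Callable[[List[float]], float], params: List[float],
               steps: int = 2000, scale: float = 0.1, seed: int = 0) -> List[float]:
    """Random-perturbation search: no gradients, no calculus."""
    rng = random.Random(seed)
    best = list(params)
    best_loss = loss(best)
    for _ in range(steps):
        # Generate a random change to the current parameters.
        candidate = [p + rng.gauss(0.0, scale) for p in best]
        candidate_loss = loss(candidate)
        # Keep the new parameters only if they achieve a lower loss.
        if candidate_loss < best_loss:
            best, best_loss = candidate, candidate_loss
    return best


def toy_loss(params: List[float]) -> float:
    # Squared error of y = w * x + b against points taken from y = 2x + 1.
    w, b = params
    data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
    return sum((w * x + b - y) ** 2 for x, y in data)


fitted = hill_climb(toy_loss, [0.0, 0.0])
```

Because a candidate is accepted only when it improves the loss, the loss of `fitted` can never be worse than the loss of the starting point.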

BP Does hill climbing imply some kind of fitness of the algorithm and you want to, as you said, get it closer and closer? 

CW Yep, exactly. For genetic algorithms, they call it the fitness function, because it's more of an analogy to biology. But if you just take that random update rule, see if it's better, and then parallelize it across tons of CPUs and apply some more complex rules that people propose in research papers, that's kind of what that area looks like.

BP You are now doing a PhD and you are simultaneously working in the field. Maybe let's start with academia since we were there. What's your PhD about and how does it relate to maybe some of what people have been hearing in the news, if at all? 

CW Yeah, so as I said, my PhD is kind of related to just making deep learning more practical and usable. Practically, when you look at the specific topics that I've looked into, there are a couple of different things. One of them is pruning neural networks, so taking really large neural networks, potentially a large language model, and removing weights or making them smaller while maintaining their performance, and you could think that that has massive practical implications. A lot of people hypothesized that when the ChatGPT API was released, it had a 10x cost reduction. People thought it was going to be way more expensive than it actually was. This is referring to the GPT-3.5 Turbo model in particular. It was super cheap and one of the hypotheses for why it might be the case is that they could have potentially pruned the model to get a smaller model that's easier to host at comparable performance, or done model distillation where you train the small model using the larger model.
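
To make "removing weights" concrete, here is a sketch of magnitude pruning, one common pruning heuristic: zero out the fraction of weights closest to zero. It is shown as a generic illustration on a flat list of numbers, not as the method used in Cameron's research or in any particular production model.

```python
from typing import List


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    k = int(len(weights) * sparsity)
    # Indices of the k weights with the smallest absolute value.
    drop = set(sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


pruned = magnitude_prune([0.5, -0.01, 2.0, 0.1], sparsity=0.5)
```

In practice pruning is usually followed by further training so the remaining weights can compensate for the ones removed.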

BP That's been some of the most interesting stuff that I've seen percolating up recently– folks working with some of the open source stuff or some of the LLaMA and Alpaca stuff that came out of Meta and then was sort of given to the open source community or they ran with it, is that people have been able, like you said, to very quickly say, “All right, well this used to run on these specialized processors in parallel with the big compute cluster. Now I've got it working on my laptop. Well now I've got it working on my Pixel phone.” And obviously at I/O they talked about having built four different models of different sizes for different use cases, depending on what kind of device you're using it on or what kind of application you want. And then to your second point, not only have we seen this ability to sort of compress the size, complexity and cost, but also as you said –and maybe this gets us back to almost the idea of evolution again– if you make something great by training it on a really big data set and using feedback from human reinforcement learning, then that model itself is a set of instructions, like a parent, like a blueprint. You can train a smaller model on less data with less time and less parameters but also get reasonably high accuracy, what a year ago would've been considered bleeding edge and best in class. And so then you're able to do things like have a hundred million people hit your API every day and not blow through your 10 billion too quickly.

CW Yeah. And I think one thing that we've seen with all of these open source models, if you look at how they're trained, a lot of the data that they're trained on is specifically instructions that are generated with ChatGPT. So people create datasets by scraping a bunch of data from the ChatGPT API or whatever, but they're taking all of these generations or dialogues from larger models and using them for training. And I think the big takeaway from there is that we see that these models that are trained over that data, even if they're pretty small, perform incredibly well, and it speaks to the power of knowledge distillation, which is the idea of taking this larger neural network and using its output as a training signal for a smaller neural network. Typically, if we train the small neural network from scratch, not even necessarily for LLMs but just for neural networks in general, it oftentimes won't match the performance of the larger network trained from scratch. But if we train the larger network and then use it to provide a training signal or training data to the smaller model, the smaller model can often close a lot of the gap between its performance and the larger model. So we see for LLMs in particular, knowledge distillation is seemingly super effective. 
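
One way to picture "using the larger network's output as a training signal" is the classic soft-target distillation loss, sketched below: the student is scored by cross-entropy against the teacher's temperature-softened probabilities. The temperature value and the three-class logits are arbitrary assumptions, and the open source LLMs discussed here are more often trained directly on text generated by the larger model than on its raw probabilities.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax; a higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float], teacher_logits: List[float],
                      temperature: float = 2.0) -> float:
    """Cross-entropy of the student's softened outputs against the teacher's."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))


teacher_logits = [2.0, 1.0, 0.0]
agreeing = distillation_loss([2.0, 1.0, 0.0], teacher_logits)
disagreeing = distillation_loss([0.0, 1.0, 2.0], teacher_logits)
```

The loss is lowest when the student reproduces the teacher's distribution, so minimizing it over a dataset pulls the small model toward the large model's behavior.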

BP Yeah, it's really interesting. I know for a while they were sort of like, “Oh, we've discovered this scaling law and we just have to increase the parameters and make sure we're getting more of this and we'll see this increase in performance,” but now there's been a lot of work going in the opposite direction and showcasing that something with –I forget what the biggest one ended up at. What was it, like 504 billion parameters or something like that? But there are models that are far, far smaller than that with sort of the pre-training almost, with the benefit of the reinforcement learning and the pre-training and almost the domain expertise of, “You are a chatbot. You will get prompts and you will give responses,” as opposed to, “Learn everything through the corpus of the internet text.” Then that gives them a big advantage and they can come out with a lot more efficacy and utility right out of the gate. 

CW Yep, definitely. 

BP So tell us a little bit about the corporate side of things, the work that you're doing. How are you managing to balance your PhD and your work outside, and what are you focused on at the company?

CW Yeah, so one of the unique aspects of Rice in particular is that they really emphasize minimal coursework on top of doing research. And because of that, I've basically throughout my entire PhD been able to take just one class a semester which allows me to focus time on either research or whatever job that I have at the time. So throughout the entire PhD I've had a job. I think I was unemployed for a total of three days throughout my whole PhD. So I originally worked for Salesforce Commerce Cloud. Technically I was an intern, but I worked as an intern year round for two straight years, so kind of a weird classification there. But I worked for them basically writing recommendation systems for a while. Eventually it kind of made sense to not be an intern anymore, so I went and worked for a startup that did data labeling called Alegion. There's a lot of startups in that space, but Alegion specializes in video data, so that was pretty interesting. And then now I recently made a switch to a company called Rebuy, which is basically a SaaS platform for e-commerce recommendations and search. So we provide a bunch of AI tooling that you would typically see on websites like Nike or whatever other big website that has a ton of engineers working for it building AI powered products, and we basically package all of that in a SaaS platform where you can just check a box and, any website whether it be like Magic Spoon, which sells cereal, Liquid Death, which sells sparkling water, we have 7,000 different brands that use our product and they can just check the box and then they get all of the AI-powered e-commerce tech that Nike would have.

BP And so how is that fed to them? Is that through API endpoints? Is that something that you go in-house and help them add to their code base? How do they leverage your AI expertise? 

CW It kind of depends. So when you look at a lot of different e-commerce websites, typically they're hosted through some major platform like Salesforce Commerce or Shopify. There's also BigCommerce, Commercetools. There are a bunch of different people who just provide platforms for building D2C e-commerce websites, so commerce websites where you're selling directly to customers in other words. And basically the approach that we take is going to depend on what platform it's kind of operating on. So our largest platform is Shopify, and the cool thing about Shopify is that they provide this awesome platform for building e-commerce websites, but they also provide a lot of flexibility for plugins or whatever, developer support, where people can come in and write apps that will run on people's Shopify websites.

BP I gotcha. So you're like a plugin to the Shopify CMS and they can click a button and sort of get that AI magic on top of their carousel of items or whatever it may be. 

CW Yep, exactly. So Shopify exposes tons of APIs and webhooks where you can kind of integrate into someone's shop really easily and pull data that you would need to make recommendations and so forth. Other providers are not quite as nice, so it's way harder to integrate with them. Shopify is pretty incredible in the amount of extensibility that they provide. So for people that are off of Shopify, typically it's more of an API-based framework, so we expose all of our products via APIs, people will kind of send us relevant data that we would need, and then we would respond with recommendations or conversion likelihoods, how likely is this user to buy this product or whatever. 

BP Right. So let's move over a little bit to the research field because that's how I came across you. I just felt like it was one of the nicest signal to noise ratios on Twitter and you don't have the thread boy approach of, “10 amazing things happened in AI today and they're all going to blow your mind, so buckle up, here we go,” and then it's just the news that happened that day. But how do you pick and choose what to feature? How do you try to keep up? Obviously you can't keep up with all the academic papers and the corporate stuff coming out, but how do you pick and choose what to read and focus on? Let's start there and then from that I'd love to hear about what you're most excited about given the speed with which everything is moving and changing. 

CW So just going back a little bit, I was originally a person that hated social media, so I actually completely deleted all my LinkedIn and Instagram or whatever three years ago or something like that. But I completely purged myself off of social media for a while. But basically what happened is, when I was in my PhD, I would keep these Google Docs and papers that I was reading with summaries to keep track of the different things that I would learn, because typically if I read a paper, I'll forget about it immediately. But if I take 15 minutes to write a summary of the most important parts of the paper or whatever blog post that I'm reading, it'll stick in my head way better. So at some point during my PhD, I was making all of these overviews of papers and I had different Google Docs for pruning or quantized training, just different types of stuff that I was looking into, and I realized that it would be pretty easy to just clean these up and turn them into kind of survey blog posts where you can see an overview of all the research in certain fields. So back in undergrad I became a writer for Towards Data Science because I had written some articles before but hadn't written anything in a long time. So I basically started out by converting some of these Google Docs of paper summaries into giant survey articles about different topics, and I posted them on Medium for a little while and then decided that I was going to launch my own Substack. So that's basically how that came about. 

BP We'll be sure to link both your Twitter and your newsletter in the show notes, and you've got over 4,000 folks tuning in to the newsletter every week, so that's pretty awesome. All right, I will tell you the thing that tickles my brain the most that's happening, and then you can tell me, and then maybe we'll switch it off and we'll talk about the things we hate the most in the AI hype cycle before we head for the hill. So the thing that I love the most, and you saw this in the Sparks of AGI paper from Microsoft, and I got the chance to talk to Paige Bailey the other day who's the PM for Generative Models at Google DeepMind, talking about how LLMs just trained on text have had some pretty amazing breakthroughs with emergent abilities recently, where we never trained it to do math and at GPT-3 it could barely do it, and then all of a sudden it's doing basic arithmetic, and then at GPT-4 it's doing competitive mathematics exams and doing really well. More picking up new languages, figuring out how to play chess and illustrate that in ASCII, and things that you have a hard time understanding exactly how it would just be parroting other things that it had seen out there. And the more modalities you add to that, like when you have a multimodal model that has both image and text, it seems to gain a certain depth of abilities for context and reasoning and world theory of mind. That to me is the most interesting because it feels like, “Okay, well we've built this brain in a box with just text and it's pretty useful now.” It's like this giant thought calculator you can use, but once you start to add senses, it gets all these new abilities and that feels very human in a way. The more senses you layer on, the more these neurons can sort of fire in interesting directions. So that's what I'm most excited about. You could reflect on that or just skip it if it doesn't interest you and tell me what you're most excited about. 

CW Yeah, definitely. I mean, I'm excited about that as well. I haven't looked a ton into multimodal LLMs. I hinted on it a bit on my newsletter the last couple of weeks, which has been focused on prompting. But kind of extending on your point, what I've been looking into in the last month is kind of the world of prompt engineering and how you can tweak the language model's input to get a lot of different emergent capabilities. And it kind of relates to what you were saying, because one of the interesting things about these models is that they have emergent capabilities, and we can see these with simple prompting approaches where these much larger models will be able to do all kinds of crazy stuff. But they still fall short on a variety of different tasks, whether it be complex reasoning tasks or certain arithmetic problems or something like that. But the interesting part of that is that the language models actually seem to have the ability to solve those problems as long as you construct the prompt properly. So you see with more advanced prompting techniques, originally it was zero-shot learning, which is just like, “Describe the task to the language model. Give it the input and let it generate the output.” From there, there is a simple extension of few-shot learning, which is zero-shot learning except it also shows a few examples like input output pairs. So it could be if we're classifying positive or negative sentences, your prompt also just includes two sentences or five sentences with associated labels. Even with simple prompting techniques like that we see emergent abilities. We also see the language models falling short in a lot of areas. But then if we use more complex techniques, we see that the language models are way more capable than we even realized, we just need to know how to interact with them properly, which is pretty cool. 
So you see things like chain of thought prompting where maybe the language model can't solve this multi-step reasoning problem, but if you encourage it to just generate a rationale, so a step by step description of how it arrived at its final answer, it actually gets way more accurate, which is weird. And then from the multi-modal perspective, there's also a paper that extended and showed that chain of thought prompting with image modality added is more effective than just with text. So it’s pretty interesting how we're seeing some of these methods for interacting with language models evolve, but also it's questionable whether prompt engineering is going to die once models get really good. We'll see. 
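
The three prompting styles mentioned here differ only in what text is placed in front of the model, which a few string-building helpers can show. The `Input:`/`Output:` and `Q:`/`A:` layouts and the "The answer is" phrasing are illustrative conventions assumed for this sketch; no model is called.

```python
from typing import List, Tuple


def zero_shot(task: str, text: str) -> str:
    """Describe the task, give the input, and let the model produce the output."""
    return f"{task}\n\nInput: {text}\nOutput:"


def few_shot(task: str, examples: List[Tuple[str, str]], text: str) -> str:
    """Zero-shot plus a handful of labelled input/output pairs."""
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{task}\n\n{shots}\n\nInput: {text}\nOutput:"


def chain_of_thought(examples: List[Tuple[str, str, str]], question: str) -> str:
    """Each example carries a step-by-step rationale before its final answer."""
    shots = "\n\n".join(
        f"Q: {q}\nA: {rationale} The answer is {answer}."
        for q, rationale, answer in examples
    )
    return f"{shots}\n\nQ: {question}\nA:"


task = "Classify the sentiment as positive or negative."
labelled = [("Great movie", "positive"), ("Terrible plot", "negative")]
sentiment_prompt = few_shot(task, labelled, "I loved it")
math_prompt = chain_of_thought(
    [("What is 2 + 3 * 4?", "3 * 4 is 12, and 2 + 12 is 14.", "14")],
    "What is 5 + 2 * 3?",
)
```

In all three cases the model's parameters are untouched; only the context it conditions on changes.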

BP I saw that. I don't know if it was you who tweeted it and I grabbed it or just where it came from, but that Andrej Karpathy tweet from the other day. It was showing task accuracy versus how much effort you're putting in. Zero-shot prompting he was comparing to just throwing out a random question to a random person. Well, we'll see what happens. Give examples of solving the task with few-shot prompting. This might be somebody who's practiced a little. Give them, like you said, a more complex prompt, then on a machine that's been fine tuned with RLHF, and suddenly it's like you're talking to an expert. And to your point before, if you use that last, as you scale up the level of effort and complexity that goes into fine tuning the model and the prompt, a small base model can now start to perform maybe comparably to a big model with a zero-shot prompt, right? 

CW Yeah, and this is something I made a note of in a previous newsletter, but it's something that I guess people who are just getting into AI might get wrong sometimes. But prompting approaches are completely different from fine tuning. And the reason for that is that when you're prompting a language model, you don't ever update any of its parameters. You're just adding extra context to the prompt and generating output. Fine tuning in particular refers to training the parameters of the model with gradient descent, for example, over some dataset. And for sure since GPT or GPT-2, we always see that fine tuning, if we're allowed to perform fine tuning, the performance is always really good. The problem is that the model is no longer generic in a lot of cases. So if you fine tune it to perform sentence classification, it does sentence classification really well, but it can't solve thousands of other tasks that the generic model could with just prompting.
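
The distinction can be made concrete with a toy model that has a single parameter `w` and predicts `w * x`. Fine tuning, sketched below, changes `w` with gradient descent over a dataset, whereas prompting would leave `w` alone and change only the input. The model, learning rate, and data are assumptions for illustration and are far simpler than a real language model.

```python
from typing import List, Tuple


def fine_tune(w: float, data: List[Tuple[float, float]],
              lr: float = 0.05, epochs: int = 200) -> float:
    """Gradient descent on squared error for the one-parameter model y = w * x."""
    for _ in range(epochs):
        for x, y in data:
            error = w * x - y
            # Chain rule: d/dw (w * x - y)^2 = 2 * (w * x - y) * x
            grad = 2.0 * error * x
            w -= lr * grad          # the parameter itself is updated
    return w


pretrained_w = 1.0                   # the "generic" model before fine tuning
dataset = [(1.0, 3.0), (2.0, 6.0)]   # examples drawn from y = 3x
tuned_w = fine_tune(pretrained_w, dataset)
```

After training, `tuned_w` sits very close to 3.0: the model now fits this dataset well, but it is no longer the parameter value it started with, which mirrors the loss of generality described above.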

BP Yeah, I guess I mixed a few things together there. I mean, on this chart that he shared there was zero-shot, few-shot, retrieval augmented few-shot, but then fine tuning and RLHF was the last one. So those are not the same thing, but they are two ways of thinking about how we can get models that, like you pointed out, sometimes make sort of what seem like simple mistakes or struggle with basic problems in certain areas of logic. You can get around those blocks with either better prompting or more fine tuning. 

CW Yeah. And a lot of times it doesn't make sense to have a super generic model. There are tons of different applications where it makes sense to fine tune your model towards a particular domain, whether it be with trying to create a medical chatbot that's very specialized in that type of data. We deal with this at Rebuy, specializing LLMs towards consumer products or something like that, or even towards an individual merchant so that it can talk about product catalogs sold by a certain merchant. But a lot of times fine tuning is both super effective and important because you need your LLM to be specialized in some particular problem. 

BP Yes, it does seem like there's two things. There's folks who are specializing in e-commerce, medical, law, finance, and then there are the players who have the money to burn and they're saying, “We're going to go big, big, big, wide, wide, wide with the next generation of just very GPT-oriented that's going to be able to do it all,” and so those two things are kind of coexisting. All right, last thing before I let you go. What is the thing that as you look out and try to share what you want to do on Twitter and put stuff in the newsletter and listen to all the stuff, what is the stuff that, if anything, gets you a little bit frustrated? 

CW Yeah, I mean, I don't get frustrated about a ton of stuff. I think there's a lot of people who are frustrated with the number of people that are in AI right now, people who claim to be experts but have just gotten into the field. They don't know anything about how these models work or whatever. For me personally, although it can be frustrating at times, my opinion is that the more people, the better, because if you're working in AI, you would be ridiculous to say that you're not excited about this many people caring about what you're doing. So as AI becomes more important, we have more job opportunities, it's more emphasized to the companies so the AI product becomes people's core products. It's getting tons of funding or whatever, so I think that's a good thing. The only bad side of that is that sometimes one thing that I have noticed is that it's comparable to a long time ago when ML and deep learning just became popular. There was a common meme where statisticians hated ML people. So ML was the new hot topic, and all of these mathematicians or statisticians were mad at ML people because, “There's no theory behind it. Bayesian models are better. Why does everybody care about this? This is so stupid. You guys don't know anything about statistics.” Now the entire ML research community has been solidified for quite a while, so all the ML researchers are looking at the people who care about LLMs and giving the same type of hate. So it'll be these people where they're upset about something related to LLMs or calling these people stupid, and this is an ML technique that's existed for years or whatever. So occasionally, people will just assume that you're someone who doesn't know anything and give you some flack on Twitter or whatever for no reason where they're like, “This has existed forever.” They're just upset because they think you don't know anything. 
So definitely lots of hate comments with people who I guess don't read my bio and don't see that I've also been in AI for 10 years, I guess. But I don't know. You can't really get too upset about random people on Twitter. 

BP No press is bad press. No press is bad press. 

CW Exactly, yeah.

[music plays]

BP All right, everybody. It is that time of the show. I want to shout out somebody who came on Stack Overflow and helped to spread a little knowledge and save a question from the dustbin of history. Awarded April 27th to Mark Setchell, “I need to convert a fixed-width file to a ‘comma de-limited’ one in Unix. How do I do it?” Well, Mark Setchell has the answer and earned himself a Lifeboat Badge for saving the question and has helped over 12,000 people. So if you've ever been curious or had this problem, we got the answer for you in the show notes. I am Ben Popper, Director of Content over here at Stack Overflow. Find me on Twitter @BenPopper. Email us with questions or suggestions. And if you like the show, leave us a rating and a review. It really helps. 

CW So I'm Cameron Wolfe. If you want to find me on Twitter, it's @CameronRWolfe because there's actually another super famous Cameron Wolfe. His middle name is also actually an R, but he doesn't use it for academic publications. He’s an awesome medical researcher that works at Duke. But I'm the Director of AI at Rebuy Engine, a commerce AI company that provides search and recommendations via SaaS product. And you can find me either at my deep learning focus newsletter, or my Twitter, which is just @CWolfeResearch. And Wolfe has an ‘E’ at the end of it. It's not like the animal, so, yep. 

BP All right, Cameron. Thanks for coming on, great to chat. And everybody else, thanks for listening. We'll talk to you soon.

[outro music plays]