The Stack Overflow Podcast

Compression is understanding

Episode Summary

The home team chats about machine learning and its applications beyond the hot topic of GenAI, what it means for models to unlearn data, the future of open source, and new frontiers in game development.

Episode Notes

Find out what’s new with ML in production.

Machine learning models must learn to unlearn.

Open-source game engine Godot now has a free Nintendo Switch port for game developers.

We’ve previously hosted Godot cofounder and lead developer Juan Linietsky on the podcast.

Stack Overflow user areller earned a Lifeboat badge with their answer to How to call a destructor.

Episode Transcription

[intro music plays]

Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, overwhelmed dads edition. The kids had a two-hour delay today to start school and had a snow day yesterday, so instead of learning, they're playing Fortnite. What am I going to do?

Ryan Donovan What are you going to do?

BP We’ve got podcasts to make. Ryan, I actually brought three links today that I thought were super interesting. They weren't just little small tidbits, they were big things. I'll start with my personal favorite, then maybe I'll let you choose the next one. Vicki Boykis wrote for us back in the day. She's a great writer, she's got her own blog and newsletter, and she wrote a piece: “What's new with machine learning in production?” So this is a topic that's near and dear to our hearts. We were just working on a big thing about how you would get Gen AI into your organization and basically put some kind of machine learning into production. She makes a good point, which is that all we're talking about these days is Gen AI. That is probably 10% of what's going on out there in terms of machine learning. Machine learning is doing a million things. It's recommending the next thing I see on YouTube, it's configuring how my application appears when I log in. Machine learning is doing a million things. We're focused on one because a lot of the other stuff has kind of faded into the background for us. We don't see it. Somebody mentioned that the thing that blurs your screen behind you on a video or takes out other people's voices and echoes is all machine learning. It's all the AI smarts that are good at that stuff. But the thing I wanted to get to in her piece that to me is one of the most interesting things to think about almost at a philosophical level about machine learning, is that machine learning, she says, is compression. You take some kind of data, you compress it, and in compressing it and getting a representation– well, first of all, you could just make it smaller. A zip file is compression, but that's not machine learning. Is it? What she's saying is, in compression, you get this amazing ability to sort of encode knowledge. And one of my favorite little talks ever was from Ilya Sutskever, the former Chief Scientist at OpenAI, with the head of NVIDIA. He said the same thing. He was like, “It turns out, when you do all this work to compress data, what you're doing is teaching the software system a world state and rules it needs to figure out how to predict how to respond to a query.” I'll just stop there because I'm kind of rambling, but this idea is super interesting and kind of at the heart of what's going on in machine learning.

RD Well, I think the comparison is interesting because gzip basically works by finding patterns in a piece of text. If you say the word ‘hello’ a dozen times, it'll encode that as a symbol– as 31 or something. So it figures out what's the statistically most beneficial way to compress this to make it smaller, and machine learning in a lot of ways does that. It figures out how to encode all this text as parameters, weights, biases. I think it's not entirely the same because it's not a deterministic retrieval. The decoding process isn't that you can just go through a decoding function and get all of the world's knowledge from a machine learning model.
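To make the point about patterns concrete, here is a minimal Python sketch. It assumes the standard-library zlib module as a stand-in for gzip (both use the DEFLATE algorithm), and the helper name compressed_size and the sample inputs are made up for illustration. It compares how well a dozen repetitions of one word compress against the same number of pseudo-random bytes, which have no repeated patterns to exploit.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Return the size in bytes of the data after DEFLATE compression."""
    return len(zlib.compress(data, 9))


# The same word a dozen times: lots of repeated patterns.
repetitive = b"hello " * 12

# Pseudo-random bytes of the same length (fixed seed so runs are repeatable).
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(len(repetitive)))

print("original length:      ", len(repetitive))
print("repetitive compressed:", compressed_size(repetitive))
print("noise compressed:     ", compressed_size(noise))
```

The repetitive input should shrink well below its original length, while the noise should not shrink at all, which is the sense in which a compressor has "found" the structure in the text.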

BP Yeah, you can't unzip a giant AI model, although people are working on that. That's a field– AI interpretability and to be able to say, “What was this trained on and can we pull out copyrighted data?” And one of those sort of holy grails would be, can we get it to unlearn something? Let's say in the future we want to fine-tune it and we decide that it has data that's now private or data that's now inaccurate. How would you remove that from the model’s training without having to train it all over again? That's not something we understand how to do now, but a very interesting idea of how you could undo it. And then just to your point, one of the really cool things that Vicki brought up was this paper which I guess came out, made a stir, and then got lost because things are just moving so fast. And the point of the paper was that gzip, which is not a cutting-edge recently developed algorithm, but something kind of old and standard, can, just using nearest neighbor kind of stuff and compressing data, beat a gold standard neural network like BERT when it comes to certain classification tasks.

RD So to be fair, that's what the paper claimed, but the community found that the offline evaluation metrics that the authors were using were not exactly standard.

BP A bit favorable to their results.

RD Yeah.

BP So let's say maybe it doesn't beat gold standard, but it hangs in there with a gold standard transformer architecture and we're talking about something that was for zip files.
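For readers wondering what "gzip plus nearest neighbor" could look like, here is a rough Python sketch of the general idea: measure how similar two texts are with normalized compression distance, then label a new text by a vote among its closest training examples. This is not the paper's code or evaluation setup. The function names, the tiny TRAIN set, and the labels are all invented for illustration, and a toy this small says nothing about real accuracy.

```python
import gzip
from collections import Counter


def clen(text: str) -> int:
    """Length in bytes of the gzip-compressed text."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: lower means the texts share more patterns."""
    cx, cy = clen(x), clen(y)
    cxy = clen(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(query: str, train: list[tuple[str, str]], k: int = 1) -> str:
    """Label the query by majority vote among its k nearest training texts."""
    nearest = sorted(train, key=lambda item: ncd(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training examples, made up for this sketch.
TRAIN = [
    ("the team scored a late goal to win the football match", "sports"),
    ("the striker missed a penalty in the cup final", "sports"),
    ("the central bank raised interest rates to curb inflation", "finance"),
    ("shares fell sharply after the company reported weak earnings", "finance"),
]

print(classify("the goalkeeper saved a penalty in the football final", TRAIN, k=1))
```

There are no trained weights here at all. The only "model" is the compressor, which is why the comparison to a transformer is striking even if the headline numbers were generous.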

RD And the other thing that she brings up as part of this is that you're getting these big balls of compressed data, but you don't know what's in there. Most people have lost control of the data that they're getting results from where they're using models that were trained on whatever packages. I don't think a lot of folks release their training data sets, and you're calling it as an API.

BP It’s dangerous, they say, to do that. At this point, the big model creators will claim that it would be dangerous to release that because then bad actors could take them and this is cutting-edge stuff.

RD I mean, maybe. We’ll see about that.

BP I'm just saying what they claim. I'm not advocating that argument, I'm just saying that's the [claim.]

RD In the piece we did with IBM, they tell you exactly what's in there as a way to let you know that they're not getting super copyrighted stuff.

BP That's a great approach to governance, and Facebook's LLaMa has driven a ton of the open source work. For some of the other big players, that's their edge and they don't want to release that. Maybe safety is a valid reason not to do it, but also obviously it's a competitive business edge, and they’re businesses so they don't have to do it.

RD Yeah. And there's been talk that you can't create a world-class LLM without using copyrighted material.

BP There's no way.

RD There's no way.

BP There's no way.

RD We'll see if they can do that without riding the backs of other people.

BP We’ll see, or maybe they'll come up with data licensing, data sharing agreements and we'll move on in an amicable way as a partnership. We'll see, we don't know what's going to happen. All right, so moving forward, there was a cool piece I sent along: “What is the next thing after open source?” Did you get a chance to look at that one?

RD I did, yeah.

BP Give me your thoughts.

RD So open source right now is primarily a way for businesses and organizations to use tools and solutions that other people have open sourced, and it's great for people who are developers and know what they're doing. He wants a “Post-Open” model that has those companies compensate, I think, the open source software developers, as well as making the open source software a little more consumer-friendly. If you think about it, do you personally use any open source software? I use GIMP and that's about it.

BP That's an interesting question. I'd have to think about that. This is not a new thing that we haven't talked about– how does the common person who's creating open source software get compensated and how do you ensure that stewards of big open source technology don't, when it then suits them, decide to make it closed source and sort of betray the trust of the community. Those are two things that I think we've discussed many times. This article is interviewing Bruce Perens, quoted as a founder of the open source movement. So these are not new issues that you and I have not heard of, and I guess I was kind of hoping for some more substantive ideas of what he was working on and could be the next step. This seemed mostly like pointing out the problems that are well-known. I think he said something like, “I'm putting something together. Obviously I need help from a lawyer and some grant money.”

RD And I think one of the things he says is that most of the open source things are licenses and not contracts. They are giving you rights. They are not enforceable as much. And of course this conflict comes down to Linux and Unix flavors. I'm not going to name names, but there are some Unix/Linux providers who have stopped giving out source code, which is required under the GPL.

BP It's interesting to me. I wonder how many people who are younger than us are more conversant in their ownership usage or sort of modification and customization of software and care to move towards open source technology because they feel that will provide them with the ability to do what they want with it and to take it their own way and fork it if they need to. I think the reality is that, as these things become larger and larger, they need a board and a sort of corporate backer because they're becoming huge and you have to support the community and figure this stuff out. And so then for an individual, you could always fork it and then run it on your own, but for an individual to do that, you'd kind of be stuck on whatever the last version was. Maybe you could take it forward yourself a little bit, but an individual or even a group of individuals can't really maintain an extremely popular OS project going forward. It's not that easy to just say, “See you later. We're forking it and we're going our own way.” Maybe within your own company you could, but again, then you're working on your own sort of private house version of Linux, and that could cause issues in the future when it becomes incompatible with other things.

RD I think that's why you see most of these big open source projects eventually fall under the wings of the Apache or Cloud Native Foundation or the Linux Foundation. But talking about people's willingness and interest in customizing, the iPhone is the most popular phone I think, and it is as closed a system as you can get. I don't think people on a consumer level want to configure as much as a developer does.

BP Yeah. We live in the world of developers and the reality is that most people don't want to run their own Mastodon server. They don't want to configure their own modular Android phone to their liking. Most people want to get the latest flavor of the iPhone every two years and just have it work the way it works, and such is life.

RD Just have it work. Don't make me do work about it.

BP All right, so speaking of open source, we have discussed on the podcast a few times Godot, which is an open source gaming engine, and they made some news this week. They have a free port of the Godot engine which will allow you to build Nintendo Switch games. This is so cool. So before it was for creating essentially PC and mobile games. Now it's something that you can use to build games for one of the most popular dedicated video game consoles. So I thought this was super cool.

RD Yeah, I think that's really cool. I'm not sure if they have ports for the other consoles, but the fact that it's on the Switch which has been another one that's been very protective of its platform. The fact that you can now port your games to it, I'm sure you still have to license some amount, but maybe you don't. Maybe the Nintendo Switch is jumping on the indie game train. Indie games are hugely popular.

BP Yeah, I think that's probably true. Let's say you're Nintendo, or let's just say you're a company and you have huge competitors and they have their own studios where they produce AAA games and you do that sometimes too. What's good for your ecosystem? Well, every year having a couple of indie games that become a hit and make press and are available on your platform and are a reason to pick up your hardware.

RD So Godot does have other console support. Most of it is through third parties.

BP What does that mean?

RD That means they don't officially support it because you have to be a licensed company for most of that.

BP I see. I gotcha.

RD I think they do for the Xbox, I assume because there's a Windows synergy there and so SDKs are protected by non-disclosure agreements.

BP Interesting. After this, I looked up just a little thing just to see what people like about it, and the first thing is what we just talked about. Godot is good for programmers like open source is good for programmers. Obviously that's one reason to like it. It doesn't mean much to the average person or to somebody who's like, “Well, what I want to do is make money off of my game and so it has to be good for the consumer.” Godot has its own language, which I didn't know. It's called GDScript and it takes off a bit on Python and Lua. If you know those, you'll probably be in good shape. And then it supports multiple languages– C++ and other things like that, VisualScript, which is an alternative to Unity, and something called language binding. So you could go in there and use your chosen language through an API. So those are some things I didn't know about Godot that I learned this time.

RD But it seems like the Nintendo Switch one doesn't support the C# or Native extensions, which if you know GDScript, you're set. Otherwise, get the docs, kids.

BP And it has its own IDE. You're going into your own little world, but I think that's true of gaming. If your other choice is Unity, you're going into your own little world of Unity which has its own sort of universe. So it's not quite as interoperable as picking up a language like Python and some of that.

RD And I think when we talked to the creator, he created it because he wanted his own way to make games. It turned out it was useful for other people. And I think those are the best open source projects where you're solving your own problems.

BP Scratching your own itch and then a community gathers around you, and that's a good thing.

RD Right. The downside of that is that they're not always the most user-friendly.

BP Yeah, they're sending you angry emails at midnight saying, “Why haven't you fixed this bug? I pointed it out last week.”

RD Because it's for me, it's not for you.

BP It's like, “Well, I have three kids, so go fix it yourself.”

[music plays]

BP All right, everybody. Let's go to that time of the show. Let's shout out a member of our community who's contributing to an open source project. Ryan, you and I obviously work on an open source project called Stack Overflow.

RD Is it?

BP People can take the knowledge. It's open source in the very light sense that we give away the data.

RD It’s Creative Commons.

BP Yeah, Creative Commons. We give away the data. You could use that to build a research app or something like that.

RD I don't want anybody mad at us for claiming it's open source.

BP Okay. It's not open source software. We like to distribute the knowledge, that's true. All right, this sounds like a Magic card or something. Awarded January 19th to Areller, “How to call a destructor.” If you want to know how to call a destructor, Areller has the answer for you. This is a .NET question. Manually destroy C# objects. Just destroy them.

RD Must destroy.

BP Must destroy. All right, everybody. Thanks for listening. As always, I am Ben Popper. Find me on X @BenPopper. If you have questions or suggestions for the show, if you want to come on, hit us up, podcast@stackoverflow.com. And if you like the show, leave us a rating and a review. It really helps.

RD I'm Ryan Donovan. I edit the blog here at Stack Overflow. You can find it at stackoverflow.blog. And if you want to reach out to me, you can find me on Twitter and/or X @RThorDonovan.

BP Thanks for listening, and we will talk to you soon.

[outro music plays]