The Stack Overflow Podcast

For those who just don't Git it

Episode Summary

Pierre-Étienne Meunier, creator and lead developer of open-source version control system Pijul, joins the home team to talk about version control, functional programming, and why OCaml is a source of French national pride.

Episode Notes

Pierre-Étienne’s interest in computing began with the functional programming language OCaml, created by Xavier Leroy. Before OCaml, Pierre-Étienne explains, “everyone thought functional programming was doomed to be extremely slow.”

Pijul is a free, open-source distributed version control system. You can get started here. Want a GitHub-like interface? Find it here.

Read the article that led to this conversation: Beyond Git: The other version control systems developers use.

Pierre-Étienne is currently working on a new project with the creators of the open-source game engine Godot. We hosted Godot cofounder and lead developer Juan Linietsky on the podcast a few months back; listen here.

Nix is a package management and system configuration tool. Learn how it works or explore the NixOS community.

Connect with Pierre-Étienne on LinkedIn.

Congrats to Lifeboat badge winner Rachit for answering Passing objects between fragments.

Episode Transcription

[intro music plays]

Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I'm your host, Ben Popper, world's worst coder, joined as I often am by my colleague and collaborator, Ryan Thor Donovan, Editor of our blog and our newsletter. Ryan, you helped to set up today's episode so refresh me here: who's our guest and what are we going to be chit-chatting about?

Ryan Donovan So it's Pierre-Étienne Meunier. He reached out because of this version control article I wrote. And he is the creator, I believe, of Pijul, which was briefly mentioned in the article. I didn't hear much about it, but somebody was interested in that it used patch algebra.

BP Cool. This was in the article about, “You're not using Git, but why?” Is this where we're getting back to?

RD Yeah, it's the version controls that people use other than Git.

BP Okay, gotcha. Well then without further ado, Pierre, welcome to the program.

Pierre-Étienne Meunier Well, thanks for having me. Yeah, I did read the article and I was super interested in all the options and alternatives. The fact that Pijul was just briefly mentioned was like, “Okay, they probably haven't heard much of it,” but this is based on a previous version control system called Darcs. Some of the ideas come from Darcs, but at some point me and a colleague, Florent Becker, were using Darcs to write a paper about something completely unrelated about tilings and geometry in computer science. After work, we went out for a beer and started discussing version control, and Florent told me that, as one of the last remaining maintainers of Darcs, he could tell me that the algorithm was not proper. It wasn't an actual algorithm because there was a bunch of holes in it, and this explained why it was so slow in dealing with conflicts. And so we started chatting about, “Oh look, we're computer scientists, and so our job is to design algorithms and study their complexity, so this is a job for us.” We were also working on a model of distributed computing, so this was like, “Okay, this is exactly the kind of stuff we should be interested in. This is one of our chances to have an impact on something.” And so there we started working on some bibliography first. We found some arguments about using category theory to solve the problem, and then we started working on it and writing code and coded more code and debugging, and it turned out to be a much bigger project than we first imagined. And so when I saw Ryan mention Darcs, I’m like, “Well there's this new thing coming out maybe someday, but it isn’t finished yet.” I reached out and was like, “Oh yeah, well, you are interested in version control and probably there's something we could chat about together.”

BP Yeah, yeah. You sit down for a beer, you start talking version control and things always go a bit farther than you expected, I think that sounds about right. Pierre, can you give people just a super short synopsis? We started out talking about you already as an engineer and stuff like that, but how did you get into this field? What got you started down this path to the point where you're sitting around a bar, coming up with your own version control system? How'd you get educated in this world and enter into software development?

PM Well, I’m not educated at all, like really not. I don't know anything about software engineering. I'm not an engineer myself. So I started coding when I was a bit young, I guess, on an old computer that my uncle left when he went out for some travel. I think I was 12 back then. And then I've been coding on and off, then started studying mathematics and physics. I got into logic theory and proving that kind of stuff. And where I'm from, there's one thing that is studied in France but I don't think anywhere else in the world, and that's the OCaml language. So when you grow up as a French man and go to university to study general science, like the first two years of basic mathematics, computer science, physics and all that, you get taught that, well, this OCaml thing is really something French and it's really something we should be proud of because there's this legend of computing, Xavier Leroy, he did everything there and he was the first to show the world that you can design a functional programming language that's at the same time really fast. So then I was interested in that and wanted to study computer science. I did a PhD in computer science, theoretical computer science, but then I ended up working on theoretical computer science, what people call here ‘fundamental computer science’. It doesn't mean it's particularly important or useful, it just means it's the basic things. Probably foundational is a better name for that. So yeah, that's how I got started.

RD So in researching the article, I started off way back in the day on Visual SourceSafe and it seemed like there were natural developments, but Pijul interested me because it uses patch algebra. Can you talk about what that is?

PM Yeah. So patch algebra is a completely different way of talking about version control, of thinking about version control. Why? Well, because instead of controlling versions, you are actually controlling changes, and that's completely different. Actually the dual of versions is changes. So instead of just saying, “Well, this version came after that version,” which is something that's, well, CVS, RCS, SVN, Git, Mercurial, Fossil, and whatnot, all this family of systems, they keep controlling versions, they keep insisting on the version and the snapshots that come one after the other. In contrast to that, most of the research about parallel computing, like distributed data structures, focuses on changes. So a change is, for example, “Well, I introduced a line here and deleted a line there.” “I renamed the file from X to Z.” “I deleted that file,” for example, or, “I introduced a new file.” “I solved the conflict.” So that's also super important. And in contrast to talking only about snapshots and versions, this gives you much higher flexibility because all systems that deal with versions actually show versions or commits or snapshots as changes. If you look at a commit on GitHub for example, you will never see the actual commit, as in, you will never see the actual full entire version. What GitHub will show you when you ask about a commit is what it changed compared to its parents. So actually there's a fundamental mismatch in the way people think about version control when they use Git. Everything they see is changes or differences and then everything they need to reason about when they actually use the tool is versions. So how can you reason about that? Well, we found ways around it. We have all these workflows and Git gurus that will tell you what you should and should not do and all that.
You have good practices and all these things, but fundamentally, what these good practices aim at is getting around this fundamental mismatch between having to think about something you've never seen. So what patches and change algebra gives you is that now you can reason about things. So you can say, “Well, these two patches are independent, so I can reorder them in history.” This sounds like a completely useless and unimportant operation but it's not. What that means is that you can actually, for example, take a bug fix from a remote and cherry pick it into your branch without any consequences. You will just cherry pick the bug fix and that's it and it will just work. You won't have to worry about having to merge that branch in the future. You won't have to worry about any of that. And if that bug fix turns out to be bad and turns out to be inefficient, for example, and you've continued working, well you can still go back and remove just that bug fix without touching any of your further work. So this gives you this flexibility that people actually want to reason about. So when you are using Git, you are constantly rebasing and merging and cherry picking and there's also all these commands to deal with conflicts which Git doesn't really model. There's no conflict in commits. Conflicts are just failures to merge, and they're never stored in commits, they're just stored in the working copy. And so when you fix a conflict, Git doesn't know about it. It just knows that, “Oh, here's the fixed version.” So this means that if you have to fix the same conflict again in the future, well, Git doesn't know about it. It just knows that there was this conflict. There was these two versions that the user tried to merge, and then there was this version with the conflicts fixed, but it doesn't know how you fixed the conflicts. 
So conflicts might reappear and you might have to solve them again, or you might even have conflicts that just appear out of the blue and then you don't know what these conflicts are about and you still have to solve them. And in contrast to that, having this ability to reorder your changes gives you a possibility to just remove one side of the conflict without touching the other, or model precisely what happens when you change things. It also forces you to look at all the cases. When you look at all the cases of a merge, you're like, “Okay, this is a conflict. What are all the cases of a conflict?” Well, for example, if two people introduce a file with the same name in parallel, this is a conflict. If I change a function's name, and if Alice changes a function's name, and Bob at the same time in parallel calls that function, what should happen? Is that a conflict? Well Pijul actually doesn't model that, but it does model a large number of cases of conflicts. And so this is much easier, it will probably save a lot of expensive engineers' time actually.
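
To make the idea of reordering independent patches concrete, here is a minimal Python sketch. It is a toy model and not Pijul's actual data structure or API: the `Patch` class, the `render` and `unrecord` functions, and the use of floating-point positions as stable line identifiers are all assumptions made for illustration. The point it tries to show is that when the file is computed from a set of changes, independent changes can be applied in either order, and one of them can be removed later without touching the others.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patch:
    name: str
    adds: tuple = ()     # (position, text) pairs; position is a stable, unique line id
    removes: tuple = ()  # positions of lines this patch deletes

    def depends_on(self, other: "Patch") -> bool:
        """A patch depends on another one if it deletes a line the other introduced."""
        introduced = {pos for pos, _ in other.adds}
        return any(pos in introduced for pos in self.removes)


def render(patches) -> list[str]:
    """Compute the file contents from a set of patches, ignoring their order."""
    lines, dead = {}, set()
    for patch in patches:
        lines.update(dict(patch.adds))
        dead.update(patch.removes)
    return [text for pos, text in sorted(lines.items()) if pos not in dead]


def unrecord(patches, name):
    """Drop one patch, refusing if another applied patch depends on it."""
    target = next(p for p in patches if p.name == name)
    blockers = [p.name for p in patches if p is not target and p.depends_on(target)]
    if blockers:
        raise ValueError(f"{name} is required by {blockers}")
    return [p for p in patches if p is not target]


base = Patch("base", adds=((1.0, "def f():"), (2.0, "    return 1"),
                           (3.0, "def g():"), (4.0, "    return 2")))
bugfix = Patch("bugfix", adds=((2.5, "    return 42"),), removes=(2.0,))
feature = Patch("feature", adds=((5.0, "def h():"), (6.0, "    return 3")))

# Independent patches commute: either order gives the same file.
assert render([base, bugfix, feature]) == render([base, feature, bugfix])

# Cherry-pick only the bug fix onto the base.
picked = [base, bugfix]

# Later, remove just the bug fix while keeping the feature work.
later = unrecord([base, bugfix, feature], "bugfix")
```

In this sketch `render(picked)` contains the fixed line and nothing from the feature, and `render(later)` contains the feature with the original, unfixed line restored. Trying to unrecord `base` raises an error, because the bug fix deletes a line that `base` introduced.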

[music plays]

BP Listen to Season 2 of Crossing the Enterprise Chasm, hosted by WorkOS founder, Michael Grinich. Learn how top startups move upmarket and start selling to enterprises, with features like Single Sign-On, Directory Sync, Audit Logs, and more. Visit WorkOS.com/podcast, make your app enterprise-ready today.

[music plays]

RD Yeah. In my experience, the merge conflicts are very manual so it takes a lot of time to actually resolve them. Does Pijul and the patch algebra help reduce the manual load?

PM Yeah, absolutely. So first of all, you have much less conflicts. Why? Well, because all these artificial conflicts that Git just invents out of nothing just because you didn't follow the good practices for example, or you have long-lived branches for some reason because your job requirements need that. So you won't have all these conflicts so there's a lot less manual work to do because there's less problems to fix. And then when you are in the process of solving the conflicts, what happens in Pijul is that, in the data structure used to merge the patches, we keep track of who introduced which bytes. It's down to the byte level. It's still super efficient, but we know exactly who introduced which byte in which patch. We can tell, “Okay, this byte comes from that patch.” And so this is a really useful tool if you want to solve conflicts, because while you're solving a conflict, you can know exactly what the sides of the conflict are and this helps you solve them. But in my experience at least, it helps you solve the conflicts much easier. So I think this is going to save a lot of time.
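
The byte-level tracking described here can be pictured with a small Python sketch. This is only an illustration under simplifying assumptions: it tags every byte individually, which would be far too slow for real use, and the function names (`apply_insert`, `contents`, `origin`, `runs`) and patch names are hypothetical, not part of Pijul.

```python
def apply_insert(doc, offset, data: bytes, patch: str):
    """Insert bytes at `offset`, tagging each one with the patch that introduced it."""
    tagged = [(b, patch) for b in data]
    return doc[:offset] + tagged + doc[offset:]


def contents(doc) -> bytes:
    """The plain file contents, without provenance."""
    return bytes(b for b, _ in doc)


def origin(doc, offset) -> str:
    """Which patch introduced the byte at this offset?"""
    return doc[offset][1]


def runs(doc):
    """Group consecutive bytes by the patch that introduced them."""
    out = []
    for b, patch in doc:
        if out and out[-1][0] == patch:
            out[-1] = (patch, out[-1][1] + bytes([b]))
        else:
            out.append((patch, bytes([b])))
    return out


doc = []
doc = apply_insert(doc, 0, b"hello world", "alice-1")
doc = apply_insert(doc, 5, b",", "bob-1")
```

Here `contents(doc)` is `b"hello, world"`, `origin(doc, 5)` is `"bob-1"`, and `runs(doc)` splits the file into three pieces attributed to Alice, Bob, and Alice again. When two sides of a conflict meet, this kind of attribution is what lets a tool say which patch each side came from.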

BP I was going back in our history of podcasts we recorded, and I remember now that we sat down with Arthur Breitman, who was also educated in France, to talk about Tezos and the blockchain and why he loves OCaml. So you're right, for every child who was educated in that, something interesting came out of it. A source of national pride and some interesting ideas about functional programming.

PM Well, the initial version of Rust was also written in OCaml.

BP See? Today I learned.

RD So one of the interesting splits I found in the version control article was between folks who deal with mostly code and their source control and places like video game companies that have large binaries. Does patch algebra apply to the binary files as well?

PM Absolutely, because when you're describing changes, when you're describing what happens in the change, you might say things like, “Oh, Alice today introduced that file. She added the file to the repository, and the file is two gigabytes in size.” And so there's the actual two gigabytes which Git might store, for example. Well, you better use LFS if you do that. But in a classic version control, you might just add the file to SVN, for example. You might just upload the file and that’s it. And when you're describing changes, you can try to do that in Darcs, but I don't recommend it for performance reasons. But in Pijul, you'll be like, “Okay, here's a change. Alice introduced two gigabytes.” What I just said is very short and it's just one file and the information is really tiny. It's just logarithmic in the actual two gigabytes. And then there's the two gigabytes themselves. And the thing is, using patches, you can separate the contents of the patches from what the patch did. So by modeling the actual operation, you can be like, “Okay, I can apply this patch without knowing what's in the file.” I can just say that I added two gigabytes without telling you what the two gigabytes are. So this sounds okay, but how can this be useful? Well, if Alice goes on and writes multiple versions of the two gigabyte file, she might just go on and do that, upload a few versions. And then when you want to know what the contents of the file are, you don't have to download everything. You just have to download, well, Alice added two gigabytes here, then she modified the file, added another gigabyte. Then she compressed stuff and did something and then there's another three gigabyte patch. But then you don't have to download any of that.
You just have to download the information that Alice did some stuff, and then after the fact, after you've applied all of Alice's changes, you can just say, “Okay, the remaining parts of the file that are still alive after all these patches are these bytes and now I just have to download these bytes.” So maybe I'll just end up downloading one full version of the file, or two gigabytes, but I won't download the entire history going through all the versions one by one. So I believe I've never tested that at scale on an actual video game project, but I believe that this has the potential to save a lot of bandwidth and make things a lot easier for video games to use. And actually, I have a project going on with the authors of Godot, the open source video game studio, like a video game editor. So we'll see what goes out of that, but we're totally aligned on what we want to do. We're both fully open source. So that's something exciting and new going on in the video gaming industry. I think Godot is really bringing in a lot of fresh air.
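
The separation between what a patch did and what it contains can be sketched in Python. This is a toy model under stated assumptions, not Pijul's format: the operation tuples, the `split`, `alive_segments`, and `fetch` helpers, and the in-memory `store` standing in for a remote server are all hypothetical, and the sizes are tens of bytes rather than gigabytes. The sketch replays only the small operation descriptions, works out which byte ranges are still alive, and then fetches just those ranges.

```python
def split(segments, offset):
    """Make sure a segment boundary falls at file offset `offset`.
    Returns the index of the first segment that starts at that offset."""
    pos = 0
    for i, (patch, start, end) in enumerate(segments):
        if pos == offset:
            return i
        size = end - start
        if pos < offset < pos + size:
            cut = start + (offset - pos)
            segments[i:i + 1] = [(patch, start, cut), (patch, cut, end)]
            return i + 1
        pos += size
    return len(segments)


def alive_segments(ops):
    """Replay patch metadata only; no file contents are needed for this step."""
    segments = []  # each entry is (patch, start, end): a byte range of that patch's blob
    for op in ops:
        if op[0] == "insert":  # ("insert", patch, at, length)
            _, patch, at, length = op
            i = split(segments, at)
            segments.insert(i, (patch, 0, length))
        else:  # ("delete", at, length)
            _, at, length = op
            i = split(segments, at)
            j = split(segments, at + length)
            del segments[i:j]
    return segments


def fetch(segments, store):
    """Download only the surviving byte ranges and count how much was transferred."""
    out, downloaded = bytearray(), 0
    for patch, start, end in segments:
        out += store[patch][start:end]
        downloaded += end - start
    return bytes(out), downloaded


# Contents of three patches, held "remotely" (60 bytes of history in total).
store = {"v1": b"A" * 20, "v2": b"B" * 10, "v3": b"C" * 30}

# The small descriptions of what each patch did.
ops = [
    ("insert", "v1", 0, 20),   # Alice adds the file
    ("insert", "v2", 20, 10),  # she appends more data
    ("delete", 0, 25),         # she throws most of it away
    ("insert", "v3", 0, 30),   # and adds a new chunk at the start
]

segments = alive_segments(ops)
data, downloaded = fetch(segments, store)
```

In this run only 35 of the 60 bytes in the history are fetched: all of `v3` and the last five bytes of `v2`. Nothing from `v1` is downloaded, because none of its bytes are still alive.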

RD Yeah, I mean the fully open source projects are very popular. If you have a bug you want to fix, you can just go fix it.

BP And it's been fascinating to see sort of the race going on these days between the closed corporate world that's developing cutting edge AI and all of the places like Hugging Face and Stable Diffusion and others that are trying to keep pace with them in an open source way and with kind of a community of contributors, so very cool.

RD Yeah. So when you were talking about how it doesn't create versions, it seems to me that the version system is sort of a legacy from when we actually burned disks or released binaries to download and install. With your background in distributed computing, do you think that this can be a better way to update and maintain all the distributed systems we have now?

PM Yeah, I hope so, at least. So one really cool example I can give is NixOS. So NixOS is not really a Linux distribution. It's actually a language with a massive standard library containing a lot of packages and you can use this language to build your own system. So that's the promise of NixOS. And so while doing so, for example, if you're maintaining machines in the cloud, you probably want to build an image and use one custom version of what's called Nixpkgs, the general library in NixOS. And so you want to customize this in one way or another, and then release some of your patches to the official central repository for Nix packages, but then keep some of the others for yourself. So you want these multiple links, you would do that by having lots of different branches, or feature branches, which you can push to the central repository. Then you would work on another branch, which would be the merge of all those patches, plus the changes that occur in the Nixpkgs central repository. Then this quickly becomes a nightmare to maintain because you have to keep rebasing your changes on top of each other and on top of what happens in Nixpkgs. Then when your changes get merged into Nixpkgs, you get conflicts and so you have to roll back to some old commits which might not even exist. So I believe that maintaining multiple versions or multiple fixes at the same time to one tool can be much, much, much easier using tools like Pijul. So there's one announcement I can make, this is something I've been working on for a while. So Pijul has its own sort of GitHub thing for Pijul, which is called the Nest. And so far it's been not super successful, neither commercially nor I should say industrially, because it doesn't scale very well. It has been through a data center fire. If you remember two years ago in Strasbourg the OVHcloud fire. And so it's using now a replicated architecture, but it's not very satisfactory.
It's written in Rust, it operates in three different data centers, but it's not easy to maintain. So I've been working on a new serverless infrastructure for that, using function as a service. So function-as-a-service providers don't give you an actual disk on which you could run Pijul, but I've been able to fake Pijul repositories using Cloudflare's KV, for example. And so this gives infinite scalability and excellent reliability. So I am working on that. My product is very close to being ready, so I hope I'll be able to release that in a few days or in the worst case, a few weeks.

[music plays]

BP All right, everybody. It is that time of the show. We want to shout out someone who came on and helped save a little knowledge from the dustbin of history and answered a question. Today, the Lifeboat was awarded to Rachit. “Passing objects between fragments.” That’s the question, they’re using the built-in navigation drawer. They’ve got fragment menus and they want to communicate over those fragments, passing data from one to another. This is about Android fragments, so if you ever wanted to pass objects between Android fragments and get that data moving around, we have an answer for you. And thanks to Rachit and congrats on your Lifeboat Badge. We appreciate you sharing some knowledge on Stack Overflow. All right, everybody. Thanks for listening. I am Ben Popper, Director of Content here at Stack Overflow. You can always find me on Twitter @BenPopper. You can always reach us with questions or suggestions, podcast@stackoverflow.com. If you like the show, leave us a rating and a review. It really helps. And thanks for listening.

RD I'm Ryan Donovan. I edit the blog here at Stack Overflow. It's located at stackoverflow.blog. And if you want to reach out to me, you can find me on Twitter @RThorDonovan.

PM Well, I'm Pierre-Étienne Meunier, and you can browse Pijul.com or Pijul.org if you want to know more about this project. And send a message to pe@pijul.org.

BP All right, everybody. Thanks for listening, and we will talk to you soon.

[outro music plays]