Ryan is joined by Tobiko Data co-founders Toby Mao and Iaroslav Zeigerman to talk about the crucial role of rigorous data practices and tooling, the innovations of Tobiko Data’s SQLMesh and SQLGlot, and their insights into the future of data engineering with the rise of AI.
Tobiko Data is creating a new standard in data transformation with their Cloud and SQL integrations. You can keep up with their work by joining their Slack community.
Connect with Toby on LinkedIn.
Connect with Iaroslav on LinkedIn.
Congrats to Stellar Answer badge winner Christian C. Salvadó, whose answer to What's a quick way to comment/uncomment lines in Vim? was saved by over 100 users.
[intro music plays]
Ryan Donovan: Hello everyone and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I am your host, Ryan Donovan, and today we are gonna be talking about the perils of the poor data engineer, the lack of tooling, and why they need better tooling. Today I have two great guests, two of the three co-founders of Tobiko Data: Toby Mao and Iaroslav Zeigerman. Welcome to the show.
Toby Mao: Thanks for having us.
RD: So, just quickly, we'd like to get a sense of who our guests are, how you got into software and technology. Can you give us a little rundown of your origin story as software engineers and founders?
Toby Mao: Sure. So I actually started off with a math degree 'cause I found computer science a little bit too hard, and I got my start doing pharmaceutical consulting at a boutique consulting firm. There I started working with SAS and Excel and found my foray into data programming. I decided I wanted to become a real software engineer, so I took some time off, learned iOS programming, and started doing that. After that, I somehow managed to get a job at Scribd doing recommendation systems. That's kind of where I learned about Spark and A/B testing. After that, I went to Netflix, where I led the development of their experimentation platform, and then finally Airbnb, where I led analytics infrastructure, including Minerva, the metrics platform, and experimentation as well. Then finally, I ended up here creating Tobiko, where we built SQLMesh, an open source data transformation framework, as well as SQLGlot, a SQL compiler. And over to Iaroslav -
Iaroslav Zeigerman: Yeah, thanks, Toby. I started programming as a child, when I was 10 years old. I'm from Ukraine originally, and I began my full-time career at about 18 years of age. Back then in Ukraine there were very few product companies; most of them were contractors for companies in the United States. That gave me the flexibility to do university and a full-time job at the same time, which also led me to get an offer from a United States company called DataRobot straight out of university. That's when I moved to the States, and it also coincided with me starting a career in data specifically. So I've been doing a lot of data-related work for various machine learning automation companies. After that, I joined Netflix, where I met Toby and worked on experimentation and data engineering, and then went on to work at Apple before co-founding Tobiko.
RD: So when I was talking with Toby before this episode, he said that your company is basically catching data tooling up for data engineers, the whole ecosystem of how they work on data. Why do data engineers need additional tooling?
IZ: Yeah, so basically, when both of us worked at Netflix... Netflix is known for hiring top talent, and that was really apparent; essentially everyone is extremely capable and extremely talented. And data and experimentation were approached with rigor and all seriousness; data was treated as a first-class asset and not as an afterthought. Data is Netflix's lifeblood. We picked up a lot of best practices while there, and data was treated very much as a software product. The code was version controlled and deployments were happening, but at the same time, the tooling was still trailing behind what software engineers had available. So, no automated deployments: we would essentially alter production tables manually after merging our code. There were a lot of practices that are okay when you have a bunch of very talented, highly paid people, but they are not particularly scalable, especially outside Netflix. And after spending time at Apple, and Toby at Airbnb, we asked: what would the tool look like that we'd be excited to work with? What should the processes look like in order to establish this rigorous and scalable process for developing data pipelines?
RD: Yeah, that's interesting. You know, everybody today is very concerned about their data, in large part because of Gen AI. And I talked to a lot of companies trying to figure out how to make the data stack less complicated. Do you find that naturally the data stack gets a little less complicated by having good processes, good protections, good pipelines in place?
IZ: The right level of abstraction is important. Back at Netflix, what allowed Netflix to create these rigorous pipelines and processes is that a lot of them were built in Python and Spark, which allowed them to be unit tested and built and so on and so forth. Now, in order to do this, companies are obviously required to hire a lot more professionals with deeper expertise in programming languages and data structures and so on. But at the same time, there is a simpler interface out there for interacting with data, which is SQL. It's been around for decades, and everyone involved in data knows SQL to some degree. So basically what we decided to do is not try to kill SQL, but embrace it and meet the industry in the middle. Lower the barrier of entry into the stack and the tooling by using SQL, but at the same time bring all the best practices, all the rigor that we picked up at Netflix, into this technology.
RD: Is it hard to write unit tests for SQL code?
TM: Absolutely. This is a very difficult thing, and a big reason for that is with software, you just write a function, right? And you can mock the inputs and have the expected output. With SQL, you've got a massive data set and you're probably working with a proprietary engine like Snowflake or BigQuery, which you don't really wanna access in your CI/CD. You don't want to scan a billion rows of data every time you make a commit. And so the challenge becomes: how can I test a SQL query that's written for this proprietary warehouse and run it very quickly, either locally or in CI/CD, where I might not have access? So I would say that before our tooling, SQLMesh, unit testing has not been a very common practice in data for SQL because of how difficult it is. And so the innovation that we've done is we take the SQL that's written for BigQuery or Snowflake, and then we can transpile that into another dialect like DuckDB, which can be run as an embedded process on your computer. So even though you have code written for Snowflake or BigQuery, we can then execute that against a CSV or JSON locally and have your business logic unit tested.
RD: So in a way you are kind of mocking the database itself, right?
TM: Yeah -
RD: Yeah, that's interesting. Do you have recommendations/thoughts on what that mock CSV database should be? Should it be like production sampled or is it something else?
TM: So that's another thing that we've learned: people don't want to mock the data, 'cause if you've got a large data set, it's a lot of work. And so we actually built tooling around that too. You can automatically generate a unit test by just selecting data from production or whatever, and that will create your test for you. And then you can have a CI/CD process that will let you know whenever your expectations change. So sure, it's not as robust as going through every single edge case and making sure everything is handled, but at least you'll know: hey, I made this change, my unit test changed. Is that expected? So it's better than nothing.
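The snapshot idea (run the query once, record what it returned, and flag any future drift) can be sketched in a few lines. This is a generic illustration using SQLite as a stand-in warehouse, not SQLMesh's actual implementation, and all the table and column names are invented:

```python
import sqlite3

def snapshot_query(conn, sql):
    """Run a query once and capture its rows as a fixture (hypothetical helper)."""
    cur = conn.execute(sql)
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, row)) for row in cur.fetchall()]

# Stand-in "production" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, minutes INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [("a", 30), ("a", 12), ("b", 7)])

sql = "SELECT user_id, SUM(minutes) AS total FROM events GROUP BY user_id ORDER BY user_id"
expected = snapshot_query(conn, sql)  # captured once, e.g. committed as a fixture file

# Later, in CI: rerun the query and fail loudly if expectations drifted.
assert snapshot_query(conn, sql) == expected
```

The auto-generated fixture is only as good as the sample it was built from, which is exactly the trade-off Toby notes: it won't cover every edge case, but it catches unexpected changes cheaply.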
RD: Right. And some of those pieces of code are gonna change the database, and I assume with every build, you don't want to go around inserting and then deleting data from your big database.
TM: Right. And that's slow and expensive and you want your test to be really fast.
RD: Yeah. You know, you talked about the transpiler. When I've seen, like, professional SQL queries, there's a ton of joins in them, inner and outer joins. Do you end up simplifying those in the transpilation, or is that sort of how it has to work?
TM: For the most part, things like joins are commonplace across every engine, and they have very specific definitions, so for the most part they're all the same. No transpilation required there. Transpilation is mostly around things like functions, specialized functions, data types, casing, normalization, all that kind of stuff. There are some specific join types that are only supported in one engine or another. For example, CROSS JOIN UNNEST versus LATERAL VIEW EXPLODE, right? That's a difference between Presto and Spark. But I would say that for a lot of the basics, they're pretty much universal.
RD: You know, with so many databases having their own sort of add-ons, it makes me wonder: is SQL a sort of calcified language? Is it growing at all? Is there a committee out there, like with Python or Java, that's adding new things? Even C++ has new things added.
TM: It's so interesting. Yeah. So there is an ANSI SQL standard, but I don't know a single database that actually adheres to it. [Laughter]. And the thing is, for a lot of these businesses, their main product is their warehouse, and they really have no incentive to stick to the standard, because it makes things, you know, more sticky -
IZ: Admittedly, these products innovate faster than the standard is able to keep up. And if anything, the standard often hinders the pace. For example, in Trino's case, they are doing their best to adhere to the ANSI standard, but at the same time they say no to many features and quality-of-life improvements that they would've otherwise introduced.
RD: Right.
TM: And the interesting thing is, now these vendors like BigQuery or Databricks, they're adding things like pipe syntax, right? They're reinventing the SQL language in a certain direction, but not necessarily always in tandem with each other.
RD: Right. And the pipes are sort of like Unix pipes, that sort of thing?
TM: Yeah. With SQL, it's kind of weird because you do something like SELECT and then FROM, right? It's kind of unintuitive, where usually you think, okay, I have my data, and then I wanna select from that, right? And so changing the order of things, being able to pass one result to another, right, these are things that are intuitive to most people, but are not how a declarative language was born, right -
RD: Right, right. So with being able to run these unit tests, and with probably not every custom SQL extension transpiling to every other, are you able to sort of explain exactly what's happening in every call, every function call?
TM: Yes, that's a good question. Transpilation is a monumental task, and I would be lying to say that we do it perfectly. Of course we don't, but what we do have is an amazing community, right? SQLMesh and SQLGlot are built on open source communities, where we have the benefit of not only receiving issues and feedback from thousands and thousands of users, but also contributions, right? And so I believe that because of the open source nature of our product, and how we're relying on the community to come together to build something that is incredibly difficult, that's why we're able to have so much coverage and success.
RD: Yeah, and having a community out there, I'm sure makes it easier to catch those edge cases. Right?
TM: Exactly.
RD: Like no matter how good your unit tests are. You're not gonna catch everything.
TM: Yeah. And if you were to close source something like this and have a closed source parser or transpiler, well then you're really limited to just your specific customers. And then you're not gonna get contributions from others. And then the adoption of your particular tool will be not as big because it's no longer open.
RD: I wonder how, you know, you say there's an ANSI SQL group. But nobody follows it. Are the individual SQL implementations documented very well, or did you have to do, you know, a little bit of hacking to figure out what the AST was for everything?
IZ: It actually varies. Some engines are documented really well, others aren't. Some engines actually have variability in the rules that they apply, depending on which context they're used in. So for example, BigQuery, correct me if I'm wrong, Toby, is a big offender of that. I think it's the rules around quoting, or temporary table naming, something along those lines -
TM: Yeah, yeah. Some things are uppercase and some things are lowercase -
IZ: Exactly. They're inconsistent about their own kind of standards.
RD: Oh, that's brutal. So are you still trying to sort out surprises like that within the languages themselves, or is it just implementation work at this point?
TM: I mean, it's mostly implementation. Of course there will always be edge cases here or there, but one of the things we pride ourselves on is our response time. As soon as anyone files a bug, we'll close it within like 30 minutes.
RD: Nice. Tell me a little bit about your two products, SQLMesh and SQLGlot. What do each of those do, and what's the separation of concerns there?
TM: So SQLGlot is a Python-based parser and transpiler. Its job is to understand SQL, give us an AST, and then let us either explore the metadata of it or render SQL in a different dialect.
IZ: I think it's worth explaining what an AST is, probably.
TM: Oh, an AST is an abstract syntax tree. So basically SQL is text, right? It's code, and that's very hard to work with. And so an abstract syntax tree is what parsers, right, language tools, use to represent that text as something that's more easily workable with code.
RD: Yeah, it's sort of the middle ground for compilers between the text code and the machine code, right?
TM: Right.
RD: And SQLMesh, what's that one do?
IZ: So SQLMesh is a data transformation and orchestration framework that is based on SQLGlot. It utilizes SQLGlot for, well, SQL parsing, in order to get a semantic understanding of users' business logic, as well as for transpilation. First of all, that powers our unit testing framework, but it also assists in warehouse migrations, for example.
TM: So at a higher level, right, what is a data transformation tool? You have a lot of data, and people wanna know how that data powers your most important metrics. How can you make decisions based on the data? You can't just ask a database for all these numbers. You have to take all the raw data, join it with the right things, aggregate it to the right levels, all those steps to process raw data into usable data. That's called transformation. And so in the olden days, the way this used to work is you'd have a bunch of these custom scripts. You gotta figure it out, orchestrate it, big mess, right? SQLMesh allows you to just write SQL. And then from that SQL, because we can understand it, because we parsed it, we know: oh, these are the dependencies, right? This select statement references these other select statements, so there's a natural order to things. And then we put more abstractions on it. Okay, this one we want to run every day, this one we want to run every hour, right? And so we take all these common patterns that you need to efficiently and correctly create these pipelines, build that into one tool, and that's SQLMesh.
RD: Yeah, I guess that's a good point. You know, I've talked about these ELT pipelines, but I guess an ELT pipeline is just a bunch of SQL code and maybe some Python on top of it, right? Is that generally what it is?
TM: Exactly -
RD: Yeah. In your experience, these pipelines, how much babysitting do they require?
IZ: Generally, a lot. And mostly because the complexity arises not in creating those pipelines. Creating pipelines is easy. Changing and evolving those pipelines is really, really hard. Again, referring back to my experience at Netflix: I was working on a playback dataset, which is essentially the core dataset that contains all the information about users' playback sessions. It's very foundational and core to the rest of the organization. And iterating on that table and that dataset was a nightmare. For example, you want to modify a column. You have no idea who is even using that column in the organization. You modify it, it breaks, and then you have angry stakeholders knocking on your door in Slack, telling you that you messed up. But that's only half of the problem. The other half is, imagine someone upstream, basically the data producers that feed your playback data, comes to you and says, 'Well, you see, we messed up, and it turns out the data we published two weeks ago was not actually valid.' So what you need to do now is reprocess the data. But a dataset like playback produces terabytes of data daily. How would you even reprocess something like that? So you need to design carefully. You need to babysit the dataset every step of the way. And it's just one table -
RD: right?
IZ: And it caused so much, so much trouble.
RD: Right, yeah. My traditional understanding of the ELT pipeline is that you're taking production data and transforming it into a sort of analytics view, right? So with the playback data, it's one person watches, you know, 37 minutes of a movie, and then what's the average number of minutes watched of any given movie, right?
IZ: Pretty much. Yep.
RD: And that's, that's a significant amount of work to actually get that useful state, right?
IZ: Especially at scale.
TM: Yeah. Another big problem is, like, Netflix allows you to watch movies offline, right? You can download a movie. And if you're on an airplane, you're not sending those events to the servers every minute. It could be days. And so there's this whole problem of how you handle late-arriving data.
RD: Yeah. I've talked to other folks who say things like, you know, the ELT pipelines are no longer necessary, right? You could just store everything in one area, no extract. Just everything lives in the same place. Do you agree with that? Is that possible?
TM: Well, isn't that more that ETL isn't necessary, right? So to me, they're saying you can store all the raw data in JSON, but then if you wanna show it somewhere, you're gonna have to process it somewhere. But you can have the raw data, you know, in your source table. Whereas ETL is like, okay, as the events are streaming in, we're extracting the data and then creating these flat tables. So if that's what you mean, maybe. But if you're saying you don't need any transformation, that it can work off of the raw data, I'm not sure I agree with that.
RD: Yeah, I think they're saying that analytics and production can be the same database.
IZ: I mean, I'm also a little skeptical, but I wouldn't dismiss it as unusable. I think there is no silver bullet, and different circumstances require different approaches and solutions. Some people might just have a replica of their production database and query that for their analytical needs, and it would be just fine for them. But at a certain point it stops scaling -
TM: Yeah. If your Postgres database is like a hundred gigabytes, that's totally fine. Once you start getting into the terabyte range, or ranges where you have to think about sharding Postgres, that's probably where you wanna do real warehousing -
RD: right. Yeah. I was wondering if this is a question of scale, because at some point you're just putting extra load on your production database and depending on how important latency is, like you could be screwing up, you know, customer transactions, right -
IZ: And many organizations start like that, but sooner or later graduate from that process.
RD: Right.
TM: Another thing too, beyond the technology perspective, is the user perspective, right? The business department is a little bit different from the engineering department, and an analyst who only knows SQL has very different skills from someone who writes software. The software engineer knows about indexes, they know about transactions, they know about performance. An analyst is really focusing on the business value. And so they might write a SQL query that's gonna do an explosion and lock the database, but to them, that's not what's important, right? And so I think having an analytical warehouse and separating that from your production database also has organizational value as well.
RD: Yeah. And I could imagine there's, you know, a PII concern at some point with the analytics, right?
TM: Right. That's another can of worms. [Laughter]
RD: So, I mentioned it briefly at the beginning: Gen AI and the data demands it creates. Are there new things that you have to consider in terms of an ETL pipeline or managing the data?
TM: Yeah, I think that's a really interesting question. As AI gets more prevalent, I think that over time, as it becomes a more mature technology, the consumers of data are less and less going to be humans or pipelines, and more and more LLMs. And the challenge here that still needs to be solved is: how can you constrain the field such that the LLM is trustworthy and can create reproducible results? So, I think I might be outdated here, 'cause things move so fast, right? But LLMs are not great at math. So if you give it a bunch of numbers, or like a database, and you say, gimme the exact number that this should produce, I don't think it's very good at that. It's great at summarization, right? And so the trick here is creating an interface such that the LLM can succeed. Can you have the LLM generate the right queries to produce results that always give the right answer?
IZ: But I would say fundamentally, the approach doesn't change. Just like with BI tools before, garbage in, garbage out equally applies here. If anything, it'll be amplified by the unpredictability of the LLM itself. So if it's garbage in, it's gonna be 10x garbage out versus before.
RD: Right. We talked a lot about unit testing with SQLMesh. What else does it help out with in the data pipeline?
IZ: So there are a couple more things that SQLMesh brings to the table compared to other data transformation and orchestration tools out there. For one, SQLMesh automatically manages your development environments. It creates development environments for you, initializes them, and essentially has the ability to apply the changes that you made only to a specific environment, in a very isolated fashion, introducing in this way a true data deployment process. And the way it does that is by intelligently versioning different versions of your transformation logic and associating data artifacts, data tables, with each specific version. So it enables traveling back in time, but it also enables SQLMesh to intelligently reuse data between environments when it's appropriate and safe to do so. Which means that initializing a new environment becomes a very simple and cheap operation. It's almost like a shallow clone in engines like Snowflake and BigQuery, but without actual shallow cloning, because the versions already exist behind the scenes; it's just a matter of pointing at the right table. And it works across engines. So first of all, this enables quick iteration and a quick development process, because you don't spend time spinning up the environment and populating it with data. How do you even populate it with data? Do you copy, do you sample, do you generate it? You don't need to worry about that anymore. But another step in that process, thanks to SQLGlot, is that SQLMesh can analyze your changes. It can compare the AST, which we talked about before, of a new version of the transformation logic with the previous version, and tell you whether the changes are breaking or non-breaking in relation to their downstream dependencies.
Depending on that, it can either inform you that a change is breaking, so you can adjust your logic to make it less breaking or non-breaking, or otherwise it will only compute or recompute what is needed, what is breaking, in order to test and preview the changes without touching anything else. Again, drastically reducing the compute, but also further increasing the speed, or velocity, of development.
RD: Interesting. Yeah, it sounds like a diff, almost, at a very low level. Very cool. Alright, ladies and gentlemen, we are coming to the end of the show, where we shout out a question or answer on Stack Overflow, somebody who came on, dropped some knowledge, shared some curiosity, and earned a badge. Today we're shouting out a Stellar Answer badge winner, Christian C. Salvadó, who gave an answer that was saved by over a hundred users to the question: what's a quick way to comment/uncomment lines in Vim? Vim is always a favorite to ask about here, so if you're curious, we'll put it in the show notes. I am Ryan Donovan. I edit the blog and host the podcast here at Stack Overflow. If you liked what you heard or want to suggest topics, email us at podcast@stackoverflow.com. And if you wanna reach out to me directly, you can find me on LinkedIn.
TM: And I'm Toby, the CTO of Tobiko. And if you want to join our growing community, you can join us at tobikodata.com/slack
IZ: And I'm Iaroslav, Chief Architect at Tobiko. Ryan, thank you so much for having us today.
RD: Thank you for listening and we'll talk to you next time.