The Stack Overflow Podcast

Column by your name: The analytics database that skips the rows

Episode Summary

On this sponsored episode of the podcast, we chat with Rohit (Ro) Amarnath, the CTO at Vertica, to find out how your analytics engine can speed up your workflow. After a humble beginning with a ZX Spectrum 128, he’s now in charge of Vertica Accelerator, a SaaS version of the Vertica database.

Episode Notes

These days, every company looking to analyze its data for insights has a data pipeline set up. Many companies have a fast production database, often a NoSQL or key-value store, whose data flows through a pipeline: an extract-transform-load (ETL) process reshapes it, then routes it to a larger data store that the analytics tools can access. But what if you could skip some steps and speed up the process with a database purpose-built for analytics?
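A minimal sketch of the classic pipeline described above, using Python's built-in sqlite3 as a stand-in analytics store (the key-value records, table, and column names are invented for illustration):

```python
# Hypothetical ETL pipeline sketch: extract from a fast key-value
# production store, transform in application code, load into a
# separate analytics store that reporting tools can query.
import sqlite3

production_kv = {                      # stand-in for a NoSQL/key-value store
    "order:1": {"state": "NY", "amount": 120},
    "order:2": {"state": "CA", "amount": 95},
}

# Extract + Transform: flatten the records outside the database.
records = [(v["state"], v["amount"]) for v in production_kv.values()]

# Load into the analytics store.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (state TEXT, amount INT)")
warehouse.executemany("INSERT INTO orders VALUES (?, ?)", records)

print(warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0])  # → 215
```

Each hop in this sketch is a place where data gets copied; the episode's argument is that a purpose-built analytics database lets you collapse some of these hops.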

On this sponsored episode of the podcast, we chat with Rohit (Ro) Amarnath, the CTO at Vertica, to find out how your analytics engine can speed up your workflow. After a humble beginning with a ZX Spectrum 128, he’s now in charge of Vertica Accelerator, a SaaS version of the Vertica database. 

Vertica was founded by database researcher Dr. Michael Stonebraker and Andrew Palmer. Dr. Stonebraker helped develop several databases, including Postgres, StreamBase, and VoltDB. Vertica was born out of research into purpose-built databases: Stonebraker's research found that columnar storage was faster for data warehouses because it requires fewer reads and writes per request.

Here’s a quick example that shows how columnar databases work. Suppose that you want all the records from a specific US state or territory. There are 52 possible values here (depending on how you count territories). To find all instances of a single state in a row-based DB, the search must check every row for the value of the state column. However, searching by column is faster by an order of magnitude: it just runs down the column to find matching values, then retrieves row data for the matches. 
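That difference can be sketched in a few lines of plain Python — a toy model only, not how Vertica is implemented (real columnar engines add compression, projections, and on-disk layout on top of this idea):

```python
# Toy illustration of row-oriented vs. column-oriented storage, and what
# a "WHERE state = 'NY'" scan has to read in each layout.

# Row-oriented storage: one tuple per record; a scan walks whole rows.
rows = [
    ("Ada", "NY", 120),
    ("Bob", "CA", 95),
    ("Cyd", "NY", 87),
]

# Column-oriented storage: one list per column; a scan reads one list.
columns = {
    "name":  ["Ada", "Bob", "Cyd"],
    "state": ["NY", "CA", "NY"],
    "sales": [120, 95, 87],
}

def match_row_store(state):
    # Must touch every field of every row to test one column.
    return [r for r in rows if r[1] == state]

def match_column_store(state):
    # Scan only the "state" column, then fetch the other columns
    # just for the matching positions.
    hits = [i for i, s in enumerate(columns["state"]) if s == state]
    return [(columns["name"][i], columns["state"][i], columns["sales"][i])
            for i in hits]

assert match_row_store("NY") == match_column_store("NY")
```

Both functions return the same rows; the win is in how much data the scan touches, which is why the speedup grows with table width and row count.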

The Vertica database was designed specifically for analytics, as opposed to transactional workloads. Ro spent some time at a Wall Street firm building reports—P&L, performance, profitability, etc. Transactions were important to day-to-day operations, but the real value of the data came from analyses that showed where to cut costs or increase investment in a particular business. Analytics help with overall strategy, which tends to be more far-reaching and effective.

For most of its life, Vertica has been an on-premises database managing a data warehouse. But with the ease of cloud storage, Vertica Accelerator is looking to give you a data lake as a service. If you’re unfamiliar, data lakes take the data warehouse concept—central storage for all your data—and remove limits. You can have “rivers” of data flowing into your stores; if you go from a terabyte to a petabyte overnight, your cloud provider will handle it for you. 

Vertica has worked with plenty of industries that push massive amounts of data: healthcare, aviation, online games. They’ve built a lot of functionality into the database itself to speed up all manner of applications. One of their prospective customers had a machine learning model with thousands of lines of code that was reduced to about ten lines because so much was being done in the database itself. 

In the future, Vertica plans to offer more powerful management of data warehouses and lakes, including handling the metadata that comes with them. To learn more about Vertica’s analytics databases, check out our conversation or visit their website.

Episode Transcription

Ro Amarnath So, most databases at the time were built as row-based databases. So on disk you store the data row by row, right, so every column is together, and if you want to find a particular transaction you can go right to that transaction. And that's what most row-based databases—Oracle, Sybase, SQL Server—were all built on. What Dr. Stonebraker decided was, well, for reporting and analytics, it's better to look at it from a column basis, right? So if you look at cardinality—if you're looking at a column that's got states, there are only going to be 52 states if you're just looking in the US. So he decided that storing the data as columns would give you better performance when you're looking at reports that say, give me everything you know about a particular state. Because now you don't have to go through every row to find every instance of the state. You go to the column, find that one instance of, say, New York, and then get all the related data from the other columns, because it's stored by column. So it improved the performance of queries that were very analytics-based by a magnitude or multiple magnitudes. And that was really the start of the revolution in terms of columnar databases.

[intro music plays]

Ben Popper Alright. Hello, everybody, and welcome to the Stack Overflow podcast. I am Ben Popper, director of content here at Stack Overflow. Joined, as I often am, by my colleague and collaborator, Ryan Donovan. How goes Ryan?

Ryan Donovan Oh, it goes, it goes alright.

BP It goes, it goes. You're among the walking wounded, but thank you for coming on today. I appreciate it. So today we are going to be talking with some fine folks at Vertica. And we're gonna be talking about databases, machine learning, analytics, and, yeah, how to navigate and make use of the enormous rivers and lakes of data that many companies are finding themselves with these days. Ryan, is this something that we've seen on the blog? I feel like it is, over the last year and a half.

RD I mean, a little bit, we've seen bits and pieces of it. I don't know if we've seen as much of the intersection between databases and machine learning.

BP Yeah, yeah. I guess in my former life, when I was a technology journalist, this came up a lot. You know, it was sort of the big thing that interested people in the world I was covering, like drones and computer vision—how to make use of the enormous amount of data that people were pulling in different ways, consolidate it, and train systems on it. Right when I left The Verge was right when DeepMind became the best Go player in the world, and we were all sort of waiting for the other shoe to drop on the singularity. Still waiting.

RD Yeah, I mean, obviously, machine learning processes a huge amount of data. So that data has to be stored somewhere, right?

BP And so today we're gonna be talking with Ro Amarnath, who is the CTO at Vertica and the head of Vertica Accelerator. Ro, welcome to the show.

RA Thank you, folks. Thank you very much for having me.

BP So Ro, tell us a little about yourself. How did you get into the world of software and what brought you to your current role?

RA Long story there. I think I knew I was going into software as soon as I picked up my first ZX Spectrum 128 back in the day and couldn't put it down, and went with it for a good while. Came to the States when I was 17, went to school at RIT and did systems analysis, which very few schools had as a program at the time. Worked on Wall Street as a contractor—I started my own business as a consultant right out of the gate and worked with a bunch of Wall Street firms. And then in 2007, I saw AWS—Amazon came out with EC2, which was compute, you know, renting computers on demand. And I was toying with the idea of doing databases and data warehousing as a service, because I'd supported a bunch of them in my Wall Street jobs, and essentially said, hey, I could do all of that without having to spend the capital to build that infrastructure out. Right, and got involved with AWS very, very early because of that.

BP So you got off of Wall Street in 2007. We know you have good timing.

RA Yeah. Well, I still had to work with them for a while, because nobody wanted to buy the cloud at the time. But it was still an early start. So, you know, I worked with Chase and Deutsche Bank and a whole bunch of others. And in fact, just before kicking that part of my career off, I actually met with Michael Stonebraker and Andy Palmer, who were the founders of Vertica. And I was really, really excited about what they were doing with columnar data warehouses at the time—I think it was 2006 when I first met them, and then 2008 when they formally launched with the product. So I've been with both AWS and Vertica from the very, very beginning.

BP Right, and Ryan and I have been obsessed with the name ever since we heard it. Dr. Stonebraker—tell us a little bit about what his pedigree is. I know he's created some pretty interesting, sort of foundational technologies.

RA Yes, well, I'll actually put in one caveat: I didn't actually join Vertica till this year. My company was partners with Vertica and AWS—we built out managed services around those two. And this year Micro Focus, which is the owner of the Vertica product, acquired my company and made me the CTO of Vertica. But it started out with Michael Stonebraker's awesome work with Ingres and Postgres initially, and then he decided that, you know, Postgres didn't really fit the bill for different types of workloads. And he went down the path of really building out a purpose-built data warehousing technology that ended up being C-Store at MIT, which once launched as a product became Vertica. But then he went on to do other purpose-built databases. He did StreamBase, VoltDB for transactions, and I think he's now working in machine learning in the data space with his latest firm with Andy, at Tamr.

RD Yeah, the purpose-built DB sounds really interesting to me. Can you talk about the specific purpose that Vertica was built for?

RA Yep. So most databases at the time were built as row-based databases. So on disk, you store the data row by row, right, so every column is together. And if you want to find a particular transaction, you can go right to that transaction. And that's what most row-based databases—Oracle, Sybase, SQL Server—were all built on. What Dr. Stonebraker decided was, well, for reporting and analytics, it's better to look at it from a column basis, right? So if you look at cardinality—if you're looking at a column that's got states, there are only going to be 52 states if you're just looking in the US. So he decided that storing the data as columns would give you better performance when you're looking at reports that say, give me everything you know about a particular state. Because now you don't have to go through every row to find every instance of the state—you go to the column, find that one instance of, say, New York, and then get all the related data from the other columns, because it's stored by column. So it improved the performance of queries that were very analytics-based by a magnitude or multiple magnitudes. And that was really the start of the revolution in terms of columnar databases back, you know, in the mid-2000s.

BP And so you said you had come from the world of finance. And now, you know, this was kind of purpose-built, you said, for more analytical approaches. Has there been an overall shift in the way people see value in data—transactional versus analytical?

RA I think so. I mean, you know, my career even on Wall Street was around analytics. It was building reports—balance sheets, P&L, profitability, etc. And so while we cared about the transactions, the real value was figuring out how they could cut costs, or invest in a particular business, or look at data from an aggregate perspective, right? And those reports help with strategy and the future, whereas transactional is just very operational, right? What do I need to do today to solve a problem? Whereas analytic reports tend to be far more far-reaching in terms of what the impact is on a company. So I think that was true then, and now it's exploded, right? People realize that there's so much value in the data that's been collected—and people are collecting so much because of the cloud—that to really get insight into your business, you need to be able to load it into a lake, into a warehouse, to be able to get that value or find those nuggets of information that let you, say, buy a business or invest in a particular product.

RD My experience with these sorts of analytics databases has been that there's a big NoSQL database that handles transactions, which goes through an ETL pipeline and ends up in a big SQL analytics data lake. How does Vertica's database fit in there?

RA I'm a SQL bigot. So I love SQL, always loved SQL. I saw the whole trend towards NoSQL, and I pooh-poohed it at the time. To some extent, my pooh-poohing was justified—people are coming back to SQL now. Look at efforts even by Google, etc., that went down the NoSQL route: they're bringing back the SQL-based databases, because at the end of the day, it's one of those core technologies and ways to query data that is never going to go away. Right? And it's so simple to pick up, so intuitive—at least to me—in terms of looking at how to find data, that even the NoSQL guys are building SQL engines on top of their NoSQL databases. Vertica, Michael Stonebraker, was along that same thought process, and everything he's done is very SQL-oriented. In fact, he's probably the biggest SQL bigot out there in terms of where SQL fits. So Vertica sits very squarely in that space in terms of analytics, and how you query it is using SQL, primarily based on ANSI SQL, but with extensions for a number of different things that they've added to it, like machine learning and time series.

BP And so can you give us some examples? Like if a customer was coming to you today? What kind of problems might they be looking to solve? Or, you know, what kind of value might you be able to deliver to them, as they're thinking about database as a service, or, as you said, a warehouse in a lake that they can sort of peer into to extract those really useful nuggets?

RA There's so much data being collected in every industry, right? So there isn't a particular industry that we haven't worked with, or that we don't find value in working with. I work with gaming companies—they're the ones who collect massive amounts of data in terms of behavioral analytics; they collect significant amounts of data on how a particular player is doing something. So one of the early big projects we did with Vertica and AWS was really helping Zynga bring Vertica into, you know, sort of massive petabytes of data. And there's a case study on it on the AWS website. That's one big area. Healthcare, aviation—all of these industries are collecting more and more data on a day-to-day basis that they need to put somewhere. That usually ends up in S3-based object stores, or stores that are S3-compatible, with S3 as the big daddy of them all. And once the data is there, it needs to be loaded, transformed, and queried in some form or other. There are ways to do it directly, but most of our customers end up bringing the data into Vertica so they can apply the SQL and make it easier to load and transform within the database itself.

RD I know another one of the things—because it's cloud-based, you've got this database as a service, right? Do customers even have to set up the database at all?

RA So, Vertica's been around for a while, so obviously their big installed base is on-prem, but it can run and exist on any of the big clouds. You can install Vertica on-premises, you can install it on AWS, GCP, Azure, and so on. But one of the things that we ended up building out and going GA with this year—the product I'm managing, which is called Vertica Accelerator—is the as-a-service offering for Vertica. It essentially lets you run Vertica without having to manage it, right? The expectation today is you only have to click a few buttons and a database magically appears, and that's what we've built out. One of the things with that as-a-service is that Vertica's been very good about leveraging what the cloud has to offer, and it's separated out this compute and storage concept. So data is stored in object storage with S3, so you don't really have a limit on how much you can store there. And then it's actually brought to the end customer by SQL, but they bring out what's needed to the cluster that's actually doing the compute side. So if you're doing an aggregation, you bring the data down to do the group-by and sum, and show, you know, how many widgets there are in a particular state, for example. In fact, there was an interesting recent example last year: Domo doing a coronavirus dashboard with Vertica on the back end to show people where the cases were and, day to day, how things were changing.

BP And how does that work from a security perspective? Is that something customers like—to be able to separate compute and storage and, as you said, run it after bringing it down?

RA Yeah, I mean, actually, one of the things that we do differently from some of our competition is we actually run the environment and the database in the customer's AWS account. So rather than it being in our environment, where they connect to it, moving data into our environment and taking it out, we're actually doing it in their account itself, which gives them a fair bit of flexibility. Because now you don't have to move the data in and out of their environment—it basically sits as close as possible to some of these large S3 sources of data, where they've been collecting it for a while. Now they have a way to use it without actually having to move the data much. To me, you know, my experience has been that a lot of ETL and ELT is just moving data around. And if you can reduce the number of times you have to move it, or the number of places you have to move it to, it saves money, is more secure, and so on.

BP Makes sense.

RD You mentioned both ELT and ETL. Do you have an opinion on how to perform those—which is the better?

RA It goes right along with my SQL bigot bias, right? I like doing things in SQL, I like doing things in the database. So my preference has always been to load the data into the database, and then work with the data in the database, because SQL does really well with sets, right? If you're changing data across a number of columns and a number of rows, those are sets, and those sets do really well in an analytic data warehouse when you're transforming them. So I've always liked the ELT concept: extract—bring it from wherever it is—load it into the warehouse, and then transform it in the warehouse. And what I've seen in the past few years is people actually going back to that, with a number of ELT-specific tools for managing the data. And the reason is primarily that there's so much compute available today that the databases have gotten powerful enough to do operations on these massive sets of data, so they don't have to do it on a row-by-row or transaction-by-transaction basis, which was the traditional ETL model. And like I just said, you don't need to move the data multiple times. In ETL, you're actually taking it out, going into an ETL tool, and bringing it back to the database or data warehouse. Whereas with ELT it stays there—you're doing all the processing in the same place, and you're reducing the number of moving parts that you need to deal with when you're moving and transforming the data.
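The ELT shape described here—load raw rows first, then transform with one set-based SQL statement inside the database—can be sketched with Python's built-in sqlite3 as a stand-in warehouse (table and column names are invented for the example; Vertica is not involved):

```python
# Minimal ELT sketch: Extract, Load as-is, then Transform in-database
# with set-based SQL instead of row-by-row code in an external tool.
import sqlite3

raw = [("NY", "2023-01-05", 120),     # Extract: raw rows from a source
       ("CA", "2023-01-05", 95),
       ("NY", "2023-01-06", 87)]

db = sqlite3.connect(":memory:")      # stand-in for the warehouse
db.execute("CREATE TABLE raw_sales (state TEXT, day TEXT, amount INT)")
db.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw)  # Load

# Transform in-database: one set-based statement, no data movement.
db.execute("""
    CREATE TABLE sales_by_state AS
    SELECT state, SUM(amount) AS total
    FROM raw_sales
    GROUP BY state
""")
print(db.execute("SELECT * FROM sales_by_state ORDER BY state").fetchall())
# → [('CA', 95), ('NY', 207)]
```

The transform never leaves the database: the raw rows go in untouched, and the `GROUP BY` operates on the whole set at once, which is the contrast with the row-at-a-time ETL model described above.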

BP And so how do you sort of engage clients when you're managing their infrastructure? Do you talk about this in terms of data ops? And what they can save? Do you talk about time and costs? Like when you're on the phone, you know, with a big client who's considering this, what are some of the value propositions you put to them?

RA Well, the primary thing is really looking at it as a data pipeline, right? Looking at it from, okay, from the point that you collect the data to the point that you deliver the data to the customer or their end customer—whether it be reports or downstream systems or APIs—what do you need to do to make that journey as quick as possible, as efficient as possible, as cheap as possible? And it depends on what their end goal is. Some of them might be looking to just keep it cheap—make sure they can get all the data in and have it available for regulatory purposes. Or they may be looking at, okay, how quickly can I get an answer to a question that's being asked, and reduce that time to near real time or real time from the point that we collect the data? So depending on what they're trying to do, a lot of our discussions around how Vertica fits are around those concepts. What do you need to do? How do you need to get there? What sort of transformations do you want to achieve? And how quickly do you want them done? That really helps with the data operations pipelines that help customers make the decision, I think. And we fit right in the middle—we make it easy to load the data, transform it, and deliver it.

BP That's interesting, the way you put that. You know, is it something you want later, like you said, for governance, or something you want in real time for a finance transaction or a gaming tournament? Is that something where they're making a choice between, yeah, sort of speed, quality, and cost? Are you managing those things for them? Or is it possible to move everything in the right direction?

RA Right. And that's where they've got to make a decision. Do they spend more on compute, you know, to get things done quicker? Or do they spend more on storage in S3? From our perspective, the technology can support it either way. The pricing is based on compute—at least on the Accelerator, as-a-service side, we don't care how much data they want to deal with; it's how much compute they want to throw at the problem. And they have a pretty good idea of what compute they want to throw at it, because it's running in their account.

RD Right, gotcha. Talking about analytics—you know, everybody wants insights from their data. What sort of analytics, what sort of transforms and processing work really well with Vertica?

RA So Vertica has done a great job of building the things that are needed to work with the data into the database itself. So machine learning, time series—all of these things that you would normally have to use extra tools for, outside the database, are now available as SQL functions within the database itself. So if you wanted to run a machine learning model within the database—again, on the theme of let's not move the data in and out, and not add complexity to the infrastructure because you have to build out, you know, a machine learning pipeline, etc.—you could do that all inside the database. Being able to do that, again, helps reduce that friction: keep it in one place, make it easy to work with, because all the data tends to be there. And that translates immediately into efficiency, because the data is right there. Because these functions have been created with a great deal of thought and expertise, they perform extremely well. I have an anecdote about one of our prospects who had thousands of lines of Python code that they had written to manage a machine learning model, which went down to like a ten-line Python model, because most of the work was being done by the database itself. So that's the sort of thing that really gets me excited and happy about where I am—helping customers with those sorts of efficiencies.
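As a rough illustration of the "push the ML into the database" idea: the sketch below fits a one-variable least-squares line entirely with set-based SQL aggregates, using sqlite3 as a stand-in. This is not Vertica's actual ML API—Vertica ships dedicated ML functions—but it shows how a loop-heavy application-side model fit can collapse into one query:

```python
# In-database model fitting, sketched: compute the closed-form
# least-squares slope and intercept with SQL aggregates alone,
# so no rows ever leave the database.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE points (x REAL, y REAL)")
db.executemany("INSERT INTO points VALUES (?, ?)",
               [(1, 3), (2, 5), (3, 7), (4, 9)])   # exactly y = 2x + 1

# slope = (n*Σxy - Σx*Σy) / (n*Σx² - (Σx)²); intercept = ȳ - slope*x̄
slope, intercept = db.execute("""
    SELECT (COUNT(*) * SUM(x*y) - SUM(x) * SUM(y)) /
           (COUNT(*) * SUM(x*x) - SUM(x) * SUM(x)),
           AVG(y) - ((COUNT(*) * SUM(x*y) - SUM(x) * SUM(y)) /
                     (COUNT(*) * SUM(x*x) - SUM(x) * SUM(x))) * AVG(x)
    FROM points
""").fetchone()
print(slope, intercept)   # → 2.0 1.0
```

The whole fit is one aggregate query over the set of points, which is the shape of the reduction the anecdote describes: the application code shrinks because the database does the heavy lifting.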

BP Right. And so you had mentioned before two things that I hear a lot these days, which is data warehouses and data lakes. I looked up data lakes once and it started talking about blobs, and at that point I thought, how deep can this metaphor go? No pun intended. So when you say warehouses and lakes, can you—for a layperson—what do those terms mean, and why are we using those metaphors? What are you trying to convey with them?

RA Sure. So the data warehouse term has been around for a while—I think I've heard it since the '90s and 2000s—and it was really: here's the database, and we're going to just take the data that we've been collecting and put it there, and we'll look at it when we need it. Right? And I think essentially that concept of the data warehouse came up purely around the analytics space, because it wasn't operational data that they wanted to look at on a day-to-day basis—they wanted to keep it, look at it when they needed it, and look at it historically. So that's where the warehouse concept came about. When the cloud caught fire, you know, there was this whole concept of really, really cheap storage that S3 and other object stores provided, where you could throw significant amounts of data without having to pay through the nose for it, and with practically no limit, right? So you don't have to plan out—okay, I'm going to have to buy this rack of space and put it in the data center when this next set of data comes in. I could just say, okay, if it goes from one terabyte to one petabyte tomorrow, Amazon will handle it for me; I don't have to worry about it. So that's where the whole data lake concept came about. People were like, okay, I could just have these rivers of data flowing into these object stores. And that's where the lake sort of concept came about. Like, I don't have to worry about how deep it is. I don't have to worry about how it gets there. I can just make sure it's pointed there, and it'll fill up.

BP Gotcha. 

RA Right. It's like if I had a bathtub and filled up the bathtub, right, with massive amounts of data.

BP Lakes flow into the oceans got it. 

RA Yeah. And now there's that whole intersection—the data lakehouse—which would kind of make sense.

BP Sounds more like a vacation thing. A second home?

RA Exactly. So it's these warehouses that are sitting on the edge of the lake, I guess—to take that metaphor a little further.


BP But yeah, within that world, I think I understand. So all of this sort of raw data is flowing in there—you're not even keeping track of it. But then later on, you might want to do a lot with it. And as you said, that's a lot easier to do if you have these high-level functions and analytics built right into the database itself.

RA Correct. And I think that's where some of the future technologies are headed: tracking where all that data is, metadata around what's happening with the data in that lake. When you're getting these massive amounts of data, being able to find the needle in the haystack, and then pull everything that's attached to it into a warehouse so you can actually do something with it—that's where a lot of the technologies are headed, right? And I think from a future perspective, you'll see a lot more on data warehouse and data lake management and metadata management. That, I think, is the next phase for that space.

BP Right, metadata Metaverse, I'm ready.


RA I don't know if I am. I don't know about the Metaverse thing.

BP It's okay, you're old school. We'll come back to real life eventually—what's old is new again. Great. Well, we have just a few minutes left. So for the developers who are listening, or the folks who are listening who manage teams or work at organizations that have a lot of data—how would you recommend that they get acquainted with your tools? What are some good options for people who want to check out what Vertica and Vertica Accelerator have to offer and see if it makes sense for what they're doing?

RA Well, it's very simple. There's training materials on there—we have a whole Academy that gives you an introduction to the technology. If you want to get started quickly, the product group that I'm in charge of, which is the as-a-service offering, gets you going, you know, with a few clicks, which makes it really easy. If you're ready to go, there's a 14-day trial available as well if somebody wants to kick the tires. And there's some really, really smart people on my side of the fence who can help if somebody wants to explore further.

BP Okay. All right, cool.

[exit music plays]

BP All right, everybody, it is that time of the show. I'm going to shout out the winner of a lifeboat badge—someone who came on Stack Overflow and helped save a question from the dustbin of history. Thank you to Bar Mar, awarded on December 10th for answering the question, "How to query rows where time is between two times"—and that's MySQL, not SQL or NoSQL, but MySQL. I am Ben Popper, the director of content here at Stack Overflow. You can always find me on Twitter at Ben Popper, email us at podcast at Stack Overflow. And if you like the show, leave us a rating and a review. It really helps.

RD I'm Ryan Donovan, I edit the blog here at Stack Overflow. I'm on Twitter at Arthur Donovan. And if you have a great idea for a blog post, email me at

BP And Ro, tell us who you are, what you do, and if you want to be found online, where people should go.

RA Sure. Rohit Amarnath—everybody calls me Ro. I'm the CTO at Vertica. You can find me on Twitter at Ramarnat, and on LinkedIn, of course. Awesome. Thank you, everyone.

BP Alright everybody. Thanks for listening. We'll talk to you soon.