The Stack Overflow Podcast

A database built for a firehose

Episode Summary

In this episode, we chat with Stephen Goldberg and Kyle Bernhardy from HarperDB. Their startup was born from their experience dealing with the firehose of Twitter data during sporting events, and now they have a database designed to scale well for real-time data.

Episode Notes

HarperDB is a startup that focuses on highly scalable databases that handle real-time data.

HarperDB is built on Node.js. It started out on Express and now uses Fastify.

They know where they excel and where they don't. High data throughput like gaming and vision, great! High data resolution and transactional software like financial applications, not so great. It's speed over accuracy.

Instead of a Lifeboat badge today, we shared a relevant question: Q: How to create HarperDB table with lambda.

Episode Transcription

Stephen Goldberg We were managing anywhere from 250,000 tweets a second that we were doing natural language processing entity extraction on. And it was really complicated. It was really expensive, and we were frustrated. But we figured there had to be a better way to do this. And that we are, you know, we do not think of ourselves as particularly smart, and that there must be smarter people out there who had figured out a better way to do this. And the short story is, after a long sort of search, we did not find a better way to do it. And we launched HarperDB with the idea that we wanted to build a database that was highly scalable, but that was developer friendly and simple to use, and that had a great developer experience. And so that was sort of why we built HarperDB.

[intro music]

Ben Popper Couchbase is a modern multi-cloud to edge SQL friendly JSON document database for building applications with agility, performance, and scale. For tutorials, videos and documentation, as well as best practice tips, quickstart guides and community resources, visit the Couchbase Developer Portal at couchbase.com/StackOverflow.

BP Hello, everybody! Welcome to the Stack Overflow Podcast. I am Ben Popper, the director of content here at Stack Overflow and I am joined today as I often am by my colleague, Ryan Donovan. Hey, Ryan.

Ryan Donovan Hey, Ben. What's up?

BP So for folks who don't know, Ryan edits our blog and helps to run our newsletter, and he and I have done a fair number of interviews and pieces, I would say, over the two years we've been working together, about databases. We were together interviewing the CTO of MongoDB not too long ago. And we have two great guests today, Stephen and Kyle from HarperDB, who are going to tell us a little bit about their organization, what they work on, and their approach to this stuff. So Stephen, Kyle, welcome to the show.

Kyle Bernhardy Thanks for having us.

BP So Kyle, why don't we start with you. Maybe tell people a little bit about who you are, how you got into computer science, and yeah, what it is you work on over at Harper?

KB Sure. Yeah. So I'm Kyle Bernhardy, one of the cofounders of HarperDB and CTO. So my role at HarperDB, you know, like at any startup, is wearing a lot of hats: lead architect, lead engineer, assisting with, you know, sales engineering.

BP Alright, let me stop. You have a total of less than 50 employees?

KB Yes. So we are under—how many employees do we have Stephen?

SG As of this week, we have 15.

KB 15 employees. Yes.

BP Okay. So you're still very early stage. When was the company created?

KB We were founded in March of 2017.

BP Stephen, quickly tell folks, yeah, who you are and what it is you work on.

SG I'm Stephen Goldberg. I'm one of the founders of HarperDB with Kyle. I like to joke that Kyle and Fred decided to make me CEO because I'm the most socially awkward, and they thought that would be funny.

BP That's a good troll.

RD What a prank.

BP So are you mostly from a business background, or are you also from a technical background?

SG No, my background is technical. So I was actually the CTO of the last company where Kyle and I worked together. I've been programming since I was about 13. Huge fan of Stack Overflow, but I'm nowhere near the programmer Kyle is. So for this venture, we thought it made more sense for Kyle to be CTO.

BP Gotcha. And let's sort of use that. Where did you two meet? What were you working on before this venture?

SG So this is the third, or fourth, company where Kyle and I have worked together, actually. We met when I was a consultant at a company called Cetera, and Kyle was my customer at a company called Coresight. And we got stuck together in a room programming on salesforce.com for about a month in a very tiny conference room.

BP And I mean, are you licensed? Or are we going off the grid there?

SG No, no. At the time I was, but no longer. That was over a decade ago.

BP Alright. So you've known each other in a few places. Is this the first company you've sort of cofounded, or started out in as cofounders, or no?

SG Yeah, I founded a small consulting company, and Kyle was my first hire, two companies ago. And then this is the first company that we cofounded together.

BP Alright, well, this is the Stack Overflow Podcast. So I would be remiss if I didn't ask Stephen, why is Kyle so much of a better coder than you? Is it intellectual, emotional, psychological? [Stephen laughs] What are the tips and tricks here?

SG All of the above. I'm very good at fast POCs. But I highly doubt that any of my code is still in production on HarperDB. They've probably ripped it out at this point. You know, my background is consulting. I was sort of a quick-and-dirty, get-it-done type. Kyle is disciplined. He does the in-depth research. He is focused on performance. He's, you know, creative, great at problem solving. And I think I would be best described as pretty sloppy and fast. Yeah, if that makes sense.

RD Kyle, I assume you've had to sit with a code base longer than Stephen.

KB Yeah, I mean, I've been sitting with it, I stare at it every day for four years. It's very much my child at this point. And it's growing up so fast. It's four and a half years old.

BP So you'd worked together before, it sounds like, in a couple of different capacities, and you wanted to start a business together. And so it was focused on databases. What were your sort of foundational architectural choices? What kind of languages and frameworks are you using to build this? Like, how is it architected? And then maybe we can get a little bit into sort of the thesis, which I read on the website, you know, what it's about. And also, you know, you paid for an independent study to compare it to some of the other major folks out there. So tell us a little bit about the choices you made at the beginning about how to build it. And then actually, maybe we should start with the problem you were trying to solve. And then we can talk about the tools you chose.

SG I can talk to the problem. And Kyle obviously can better speak to the architectural choices we've made. So at our last company, we were working in a big data sports entertainment analytics company. We were managing big data from the Twitter firehose, Facebook, etc., for live sporting events. So getting real-time streaming data, you know, for events like the World Cup, the World Series, Super Bowl, Beyonce concerts. Kyle and I built this sort of Frankenstein architecture across AWS and, you know, many other places, using lots of different technologies. It was extremely complex to maintain. And the real problem was sort of trying to do real-time streaming analytics and operational capacity at scale at a small company. We were managing anywhere from 250,000 tweets a second that we were doing natural language processing entity extraction on. And it was really complicated, it was really expensive. And we were frustrated. And we figured there had to be a better way to do this. And that we are, you know, we do not think of ourselves as particularly smart, and that there must be smarter people out there who had figured out a better way to do this. And the short story is, after a long sort of search, we did not find a better way to do it. And we launched HarperDB with the idea that we wanted to build a database that was highly scalable, but that was developer friendly and simple to use. And that had a great developer experience. And so that was sort of why we built HarperDB.

BP So Kyle, let's put it over to you. And then Ryan, let me step back for a second and let you ask some questions. But we understand now what the pain points were based on previous experience, sort of where the opportunity was that you saw in the market. You know, you had, I guess, the chance to sort of step back and say, alright, we're going to build this from scratch. You know, what kind of choices did you make? And how many of those choices do you regret deeply today? [Ben laughs]

KB Yeah, I can list a few that I regret. So the framework that we use is Node.js. From day one, we chose Node.js. The reason for Node was, first of all, when you start a new project, start a new company, it's a huge lift. And there are so many decisions that you have to make. Learning a new language, learning a new framework, that's extra stress on top of everything else that you're stressed about. We had a lot of experience at our previous company scaling out Node.js applications. So it's what our team knew; we've loved and still love Node.js. Also, with NPM, there's a massive community behind it. And so, you know, getting something off the ground, being able to create an MVP quickly, leveraging libraries, like, we can evaluate best-of-breed libraries to solve problems that are important, but not core to what we ourselves have to solve, but that complement our problems. Leveraging NPM really helped get our product off the ground really quickly. And you know, we're still iterating around that. We started out with Express.js, because HarperDB is an API-first database. That's how you access and interact with our database. So we were using Express.js. Express is what everyone uses and knows, but Fastify is what we use now. We were doing some high-end scale benchmarks around this time last year, and we were hitting some real choke points on scale with the HTTP part. And then, through doing an evaluation, we found Fastify. It's amazing to us. And so just, you know, the iterations that you make, just inside all that with the community, is awesome, and things keep evolving. But anyway, I could go on and on about this stuff for a really long time.

RD When I saw the spiel on the website, it talked about the scalability and, I assume, replicability of the data. At my last job, we had worked with Cassandra as the production database, and just kind of replicating it across two data centers. And I remember that being, you know, at least one, two, three people's jobs to maintain that. How do you guys do that kind of scalability and data integrity on such a large scale?

KB Yeah, so our data replication model: first of all, at a low level, we're using the WebSocket communication protocol to maintain steady-state communication between nodes. And replication is configured, you know, by whoever's managing HarperDB, your DBA or engineer. And it's pub/sub at the table level, so you get to choose how these replications occur. And our intention with HarperDB is that outages happen, so we kind of come from an offline-first idea. We are ACID compliant at the node level and eventually consistent across the cluster. But what I was saying, with the understanding that offline happens: you can have network cuts, you can have a server go down, things like that. So we also have built-in catch-up routines. So, you know, the notion of, hey, we haven't talked to each other in, you know, N milliseconds, what did we miss from each other in this period of time? Just making sure that data is maintained and caught up, things like that. So we have these understandings built into HarperDB. Also, from a conflict resolution perspective, how we approach things is last writer wins, because HarperDB, you can think about it as more like a peer-to-peer database. And so, you know, you're configuring your cluster however you need the data to flow. So think about it as more peer to peer. And so you could be working on that same record on multiple nodes. Currently, what we're doing, like I said, is last writer wins. We're looking into conflict-free replicated data types, which are a new data type that is built for peer to peer. And so that's something we're doing active R&D around right now, just to have better resolution around truing up data when you've got multiple sources of truth.
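The catch-up routine Kyle describes can be pictured with a small sketch. This is not HarperDB's code or API: the Node class, the write and catch_up methods, the table names, and the integer timestamps are all hypothetical, written in Python for brevity. It only shows the idea of replaying, per subscribed table, whatever a peer logged while the two nodes were not talking.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node that keeps a per-table log of (timestamp, key, value) writes."""
    name: str
    tables: dict = field(default_factory=dict)     # table -> {key: value}
    log: dict = field(default_factory=dict)        # table -> [(ts, key, value), ...]
    last_seen: dict = field(default_factory=dict)  # (peer name, table) -> last synced ts

    def write(self, table, key, value, ts):
        """Apply a local write and record it in this table's log."""
        self.tables.setdefault(table, {})[key] = value
        self.log.setdefault(table, []).append((ts, key, value))

    def catch_up(self, peer, table):
        """Replay every write to `table` that `peer` logged after our last sync."""
        since = self.last_seen.get((peer.name, table), 0)
        missed = [entry for entry in peer.log.get(table, []) if entry[0] > since]
        for ts, key, value in missed:
            self.tables.setdefault(table, {})[key] = value
            since = max(since, ts)
        self.last_seen[(peer.name, table)] = since
        return len(missed)


edge, cloud = Node("edge"), Node("cloud")
edge.write("sensors", "s1", 20.5, ts=1)
cloud.catch_up(edge, "sensors")            # in sync up to ts=1

# Network cut: edge keeps writing while cloud hears nothing.
edge.write("sensors", "s1", 21.0, ts=2)
edge.write("sensors", "s2", 19.0, ts=3)
edge.write("alerts", "a1", "hot", ts=4)    # cloud never subscribes to this table

replayed = cloud.catch_up(edge, "sensors")  # link restored
print(replayed, cloud.tables)
```

Because the sync is scoped to one table at a time, the cloud node only ever receives the tables it asks for, which mirrors the table-level pub/sub choice described above.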

SG From a management perspective, I'd just say, you're absolutely right. Typically, managing clusters at this scale is complicated. Philosophically, when we built the product, we really wanted it to be easy to maintain by a developer: you didn't need a DBA, you didn't need a DevOps person. And so all of it can be managed via microservices. We have a studio that visualizes how to set up the cluster pretty easily. And Kyle and myself, in our background, we're really developers first. We're not DBAs, we're not data scientists, we're not database engineers. And so we built a product that we would want to use. And it shows, and it makes it a lot easier to manage a cluster and to set up an environment. It takes about, you know, five minutes to install HarperDB. Configuring two nodes to talk to each other is a single API call. And a lot of the intelligence of the cluster management, while extremely complicated, is really obfuscated from the end user, so they're not worrying about that. And it's kind of on rails, which is a trade-off in the sense that there are things we can't do and will never be able to do because we made it on rails. But we put it on those rails so that you're not having 15 people managing a 20-node cluster, and one person can easily do it. That's sort of the trade-off we made there.

RD Do you guys use any sort of data locking, any field locking or anything like that?

KB So, you know, there are products in the market that, as you're transacting, lock out the row across the cluster. That's what you're talking about? So we do not lock out the row across the cluster. Like what Stephen was saying, we're not good for some use cases, like FinTech. We would be a terrible product to use in FinTech, where you have to have extremely high data resolution. Where we are a great fit is areas like gaming and, you know, use cases around sensor data collection, other areas like telecom. You know, there are use cases in entertainment, things like that. But with FinTech, we're always really upfront when we're talking to prospects about where our gaps are, just so that, you know, if that's your use case, there are other products that you should be looking at. We don't want to send people down a path where, you know, they lose trust in us just because we weren't upfront and transparent to start.

BP That's really interesting. So for somebody like myself, who's not as conversant: why is yours a little bit squishier, in the sense that maybe financial institutions wouldn't be interested in it, but appealing for something like gaming, which is large, loud?

SG So I can sort of address that. If you think of a financial transaction, CockroachDB is a really great product for that. And sort of the way that works is, if you make an update to a row in Tokyo, like Ryan was sort of suggesting, it'll globally lock out that row all over the world. And that ensures that when that update happens, it's consistent everywhere in the world, and you get the same response from a node in London and a node in New York, etc. That's really important financially, because if you don't do that, you may end up in a state where two people bought the same item or two people traded the same share. So that's important, but what you're trading for that is speed: making that update takes significantly longer than in HarperDB. Whereas in HarperDB, let's say I'm logging on to a gaming console in Tokyo, and I want that to be reflected all over the world. It's okay that that status might be wrong for 100 milliseconds. But I want that performance, because I want that to happen as fast as humanly possible. And so when I'm looking for the fastest speed possible, and it's okay to have 100 milliseconds of mistakes, HarperDB is an excellent choice. If I'm looking at use cases where I cannot make a mistake, HarperDB is not as good of a choice, if that makes sense.

RD Right. It's for that real time data, like if somebody is moving around in a game, it's okay to kind of lose that little 100 millisecond position. It's almost like it's automatically error correcting right?

BP How can we both win at the Call of Duty World Championship? [Ben & Ryan laugh] If HarperDB says we both had the headshot at the same time.

KB We're not giving you access.

SG For a second, it'll say that you both have that shot. But ultimately, it'll give you the correct answer. And so that's the important thing: because it's ACID compliant at a node level, and because of the way our replication is built, ultimately you always end up with the right answer. But for however long it takes to globally replicate, let's say it's 200 milliseconds, there could be slightly wrong answers.
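The headshot scenario is a compact illustration of last writer wins. The sketch below is an assumption-laden toy, not HarperDB's implementation: the LWWRecord class, the (timestamp, node_id) stamp, and the tie-break on node name are all made up for illustration. It shows two nodes briefly giving different answers and then agreeing once each has seen the other's write.

```python
from dataclasses import dataclass


@dataclass
class LWWRecord:
    """One record on one node: a value plus the (timestamp, node_id) of its last write."""
    value: str = ""
    stamp: tuple = (0, "")

    def write(self, value, ts, node_id):
        """Local write: store the value with its stamp."""
        self.value, self.stamp = value, (ts, node_id)

    def merge(self, other):
        """Keep whichever write carries the later stamp; node_id breaks exact ties."""
        if other.stamp > self.stamp:
            self.value, self.stamp = other.value, other.stamp


tokyo, london = LWWRecord(), LWWRecord()
tokyo.write("Ben got the headshot", ts=100, node_id="tokyo")
london.write("Ryan got the headshot", ts=101, node_id="london")

# Before replication, each node answers with its own local write.
disagree_before = tokyo.value != london.value

# Replication: each node receives the other's write and merges it.
tokyo_snapshot = LWWRecord(tokyo.value, tokyo.stamp)
tokyo.merge(london)
london.merge(tokyo_snapshot)

print(tokyo.value, "|", london.value)
```

Merging is order-independent here, so it does not matter which node hears from the other first. The cost is that the earlier write is silently discarded, which is the gap that conflict-free replicated data types aim to close.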

BP And it'll be a dramatic moment.

RD I had a question about Node.js. Does using Node.js make it faster? Or was it just that was the thing that you're most conversant in?

KB You know, our team, we have a lot of experience in a lot of languages. But across the entire team, that was the consistent language and JavaScript framework. I will say, you know, going back to underlying libraries, while it's JavaScript, underneath the hood a lot of times it's C++, so you get a lot of performance there. You know, compared to raw C or C++, you will lose a little bit of performance in that transport as it goes through the V8 engine. But even still, our underlying data store is LMDB. It's written in C, and we're using an amazing binding for LMDB. So our underlying data store is this high-performance key-value store, and again, leveraging the NPM community, we were able to implement that into our product. Now, again, it's not all just LMDB. It's also how we're indexing and how we're utilizing that binding and that key-value store that we're getting performance out of. But, you know, for example, that's written in C++, and we're implementing it at the JavaScript layer.

RD What's LMDB, for folks that don't know?

KB Yeah, so it stands for Lightning Memory-Mapped Database. It's written by Howard Chu. He's the CTO at Symas Corp, which maintains OpenLDAP. He's given a number of great talks, and if you ever want to do a real deep dive into it, you know, you can look him up. He wrote that for OpenLDAP, but made it open source. It's using memory maps. What that means is the records are in byte-addressable space in memory, and so once you've read it once or written it, it's essentially an in-memory access. And so the I/O gets a lot faster. And you can have a bigger map on disk than you have memory, and then the OS is just managing the cycling of memory space and things like that. So he's made a lot of really intelligent decisions. One of the reasons why we use it: it's been around for a decade, and it's super performant. And there were needs that we had for the way HarperDB works, and it fell into all the needs that we had, because we did a bake-off on a bunch of products around that. But, you know, we can leverage the power of Node being C and C++ under the hood, and the community, which gives us access to some great power and speed.
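To make "byte-addressable space in memory" concrete, here is a minimal sketch using Python's standard mmap module. It is not LMDB and says nothing about LMDB's on-disk layout; the file name and the 16-byte fixed-width record format are assumptions chosen only to show what it looks like to read and write a file through a memory map instead of explicit read and write calls.

```python
import mmap
import os
import tempfile

RECORD_SIZE = 16  # hypothetical fixed-width record, padded with NUL bytes

# Build a small data file holding three fixed-width records.
path = os.path.join(tempfile.mkdtemp(), "records.bin")
with open(path, "wb") as f:
    for name in (b"alpha", b"bravo", b"charlie"):
        f.write(name.ljust(RECORD_SIZE, b"\0"))

with open(path, "r+b") as f:
    # Map the whole file (length 0 means "entire file") into the address space.
    with mmap.mmap(f.fileno(), 0) as view:
        # Reading the second record is a slice at a byte offset, not a read() call.
        second = view[RECORD_SIZE:2 * RECORD_SIZE].rstrip(b"\0")
        # Writing is a slice assignment; the OS writes the dirty page back to disk.
        view[0:RECORD_SIZE] = b"delta".ljust(RECORD_SIZE, b"\0")
        view.flush()

# The change made through the map is visible through ordinary file I/O.
with open(path, "rb") as f:
    first_on_disk = f.read(RECORD_SIZE).rstrip(b"\0")

print(second, first_on_disk)
```

The operating system decides which pages of the mapped file stay resident, which is why the map can be larger than physical memory, as Kyle notes.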

BP It's interesting, you know, you mentioned twice sort of the idea of community and NPM and being able to leverage that. Is there a sense that that's something you're trying to do? Like to build community in some way, or to make this open to contributors in some way, so that, you know, people who are in the ecosystem, but maybe not directly working for your clients, can contribute?

KB So HarperDB currently is closed source and Stephen can speak to this, like Stephen worked at Red Hat and—and well Stephen I'll kick it to you for like the open source decision. You always speak to that well.

SG Yeah, so I did work at Red Hat. I'm a huge fan of open source software. We are doing our best to open source as much of HarperDB as possible: all of the tooling around it, the studio, drivers, connectors. Currently, as Kyle said, we use a freemium/premium model for the core of HarperDB. We do have a free forever tier that we intend to keep free forever. And our goal ultimately is to find a way to hopefully open source even more of HarperDB. We have built a pretty awesome community around HarperDB. Slack has been an amazing tool for that. We've hosted a number of events where we're starting to see a lot of people get involved, and social media has been great for that. We do try to foster community as much as possible. Philosophically, our intent with HarperDB is to build the easiest database in the world to use. And so what we found is that a lot of folks who are new to their careers in development, especially folks who don't come from a traditional background, you know, maybe they don't have a CS degree, they're from a code school, or they're self-taught developers, really find Harper to be really interesting. And we've gotten a lot of enjoyment out of sort of fostering that community, helping those folks with their journeys. And Kyle and I both personally find that to be an incredibly rewarding part of HarperDB. One of my favorite things is when I wake up to a tweet about someone finding it and building something new, and I spend a lot of my time working on sort of building that community. And we've really seen that take off in the last several months. It's been a lot of fun.

KB Yeah, and there's one thing I'd like to tag onto that too: we just released a new feature in HarperDB called custom functions. You can think of it like Lambda, where you're writing JavaScript code that you can run inline in HarperDB. And our user base can create, you know, plugins and modules, and there are open source recipes that we're working on that we are just going to share with the community, like an extra layer, new ways of doing authentication inside functions, and doing ML and AI, leveraging, you know, HarperDB with custom functions. So while the core of our product is not currently open source, we're working really diligently on making aspects of HarperDB open to the community itself. And you know, the other key thing with functions, too, is you can pull in extra libraries. It's not just locked down to HarperDB and then that's it. So it's making it as open as we possibly can and extending that simplicity.

[music]

BP For our outro here, I used to, at the end of every episode, read out a lifeboat badge to thank a member of the community. But it's been a little bit more fun recently to see if I can keep it topical. So I searched HarperDB on Stack Overflow, and as you were just mentioning about recipes, this is from three months ago: "How to create HarperDB table with Lambda." So in AWS Lambda. So if you're interested in that, there's a question here tagged with JavaScript, I believe. Yeah, you can go check it out. I'll put it in the show notes. JavaScript, Node.js, AWS, Lambda, and NoSQL. Okay. It's a question about a lot of things, really. Alright, everybody. Thank you so much for listening. I am Ben Popper, the Director of Content here at Stack Overflow. You can always find me on Twitter @BenPopper. You can always email us with suggestions and questions at podcast@stackoverflow.com. And if you like the show, please do leave a rating and review. You'd be surprised that it actually helps.

RD I'm Ryan Donovan, content marketer here at Stack Overflow. I edit the blog. I am occasionally on Twitter @RThorDonovan. And if you have a great idea for a blog post, please email me at pitches@stackoverflow.com.

SG I'm Stephen Goldberg, founder of HarperDB and CEO. You can find me on Twitter @SGoldberg. And you can find us at harperdb.io and find our Slack in our resources there.

KB I'm Kyle Bernhardy, cofounder and CTO at HarperDB. You can follow me occasionally on Twitter @KyleBernhardy and check us out at harperdb.io.

BP Alright, well, thanks to both of you for coming on, and yeah, appreciate you reaching out to the show. Glad we could have you as guests.

KB Thank you so much, everybody.

[outro music]