The Stack Overflow Podcast

Shorten the distance between production data and insight

Episode Summary

On this sponsored episode of the podcast, we talk with Stanimira Vlaeva, Developer Advocate at MongoDB, and Frederic Favelin, Technical Director, Partner Presales at MongoDB, about how a serverless database can minimize the distance between producing data and understanding it.

Episode Notes

Modern networked applications generate a lot of data, and every business wants to make the most of that data. Most of the time, that means moving production data through some transformation process to get it ready for the analytics process. But what if you could have in-app analytics? What if you could generate insights directly from production data?


Stanimira talked a lot about using BigQuery with MongoDB Atlas on Google Cloud Run. If you need to skill up on these three tools, check out this tutorial.

Once you’ve got the hang of it, get your data connected with Confluent Connectors.

With Atlas, you can transform your data in JavaScript.

Connect with Stanimira on LinkedIn and Twitter.

Connect with Frederic on LinkedIn.

Congrats to Stellar Question winner SubniC for Get name of current script in Python.

Episode Transcription

[intro music plays]

Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk about all things software and technology. I'm your host, Ben Popper, Director of Content here at Stack Overflow, joined as I often am by my colleague and collaborator, editor of our blog and newsletter, Ryan Donovan. Hi, Ryan.

Ryan Donovan Hey, Ben. How are you doing today?

BP I'm good. We have a great sponsored episode today brought to us by the fine folks at MongoDB. I feel like we've talked with them quite a bit. We've had their CTO on, we've done other things with them. What are we going to be chatting about today, Ryan?

RD Today I think we're going to be talking about analytics and using all the data you create as a company to kind of gain some insight into your customers.

BP Nice. The lakes and oceans of data we're all generating, how to use them, how to funnel them productively. All right, wonderful. Well, I'd like to welcome to the show our guests, Frederic and Stanimira. Hello, welcome to the Stack Overflow Podcast.

Frederic Favelin Hi, Ryan. Hi, Benjamin. Hi, Stanimira. Nice to see all of you. So I'm globally leading the cloud partner solution team at MongoDB. I've been at MongoDB for three years.

BP Very cool. Nice to meet you, Frederic. And Stanimira, what is it you do?

Stanimira Vlaeva Hey, everyone. My name is Stanimira. I am part of the developer relations team here at MongoDB. I'm based in Sofia, Bulgaria. I was actually listening to one of the recent episodes from the Stack Overflow Podcast and I was surprised that there was a fellow Bulgarian as a guest, so I guess there's now two of us here.

BP The Bulgarian software industry is rising for sure. So every company wants insights from their data, and the traditional way has been to sort of process production data with some sort of ETL pipeline, then use the resulting data lake or warehouse to generate insights. But that I guess introduces a delay or a latency that maybe some folks have accepted. In your mind– unacceptable. So tell me a little bit about what the state-of-the-art was and what you think MongoDB is doing differently that can improve the experience for developers and for organizations.

FF Yes, Benjamin. In fact, when we look at the current state in many organizations, there is a split of domains. The domains are built by different teams, serving different audiences, with data stored in different systems. That is how things typically work. To be clear, it's not going away any time soon. But today it's not enough. The digital economy is really demanding that our applications become smarter, driving better customer experience, surfacing insights and taking intelligent actions within the application on live operational data in real time. So the objective is to out-innovate our competitors. Along with smarter applications, users want insights faster so that they know what is happening in the moment, right now. So the objective here is really improving business visibility and efficiency, and with those demands for working with fresh data, we can no longer rely only on moving data out of our operational systems and storing it in analytics systems. That has, as you may imagine, lots of latency, but also data consistency problems because of the separation between the application and the insight that has been created. So this doesn't mean that this workflow goes away and we can decommission our data warehouses and data lakes, but it's not really well suited for application-driven analytics.

RD I think you talked about how the data is in a bunch of different places. It needs to be kind of transformed to get analytics out of it. But how can companies get insights off of their production data without this whole transformation movement?

SV Yeah, so Fred touched upon a very interesting distinction between application-driven analytics, or in-app analytics, and on the other hand we've got long-running analytics. And MongoDB, with its flexible data model and also the extensive aggregation framework, is really well-suited for application-driven analytics. This means that you can analyze live data right away instead of sending your data to a third-party system where you analyze it and then get some insights out of it. But as you also mentioned, app-driven or in-app analytics aren't really meant to be a replacement for long-running analytics. With long-running analytics, we typically perform them over really large amounts of data that are usually stored in some data lake or a data warehouse, some sort of cheaper storage solution for cold data. And the goal of long-running analytics is to get a more complex, more in-depth analysis. However, there are some smart ways to combine the two so it's never just one or the other. Businesses can definitely use both to get more benefits for their applications. And one smart way to do that is, for example, you can build a machine learning model using something like BigQuery ML. And let's say that we are building a model for predictive maintenance. Predictive maintenance is a technique that manufacturers use to track the state of their equipment and see if some machine is likely to fail or needs maintenance sooner than the other ones. So to build such a machine learning model, we need data that is coming from the sensors of this equipment. And to make it accurate, we need a very large amount of data that has been collected over a prolonged period of time. So we might use BigQuery ML and data that is stored in a data warehouse, and then once we train the model, which can take hours or even days, we can use it. So by all means, this process of creating the model is long-running analytics as a task. But once the model is ready, we can use it in real time.
We can feed it with data from our sensors and get a real-time, almost immediate response about the performance of the machine. And we can do that inside of our applications. So you can see that here we've got this sort of symbiosis between long-running analytics and in-app analytics. Additionally, we can then feed data from our operational database, from our application data, back to the model and retrain it to make it even more accurate. So it's not one or the other. I think technologies that perform the different types of analytics can be complementary and by all means we can use both of them.
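To make the in-app analytics idea concrete, here is a minimal sketch of an aggregation pipeline that summarizes recent sensor readings per machine directly on operational data. The collection layout and the field names (`machineId`, `temperature`, `vibration`, `timestamp`) are assumptions made for this example and were not discussed in the episode; the stages themselves (`$match`, `$group`, `$sort`) are standard MongoDB aggregation stages.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List


def recent_machine_stats_pipeline(since: datetime) -> List[Dict[str, Any]]:
    """Build an aggregation pipeline over a hypothetical sensor-readings collection.

    Assumed document shape (illustrative only):
    {"machineId": ..., "temperature": ..., "vibration": ..., "timestamp": ...}
    """
    return [
        # Keep only readings newer than the cutoff.
        {"$match": {"timestamp": {"$gte": since}}},
        # Summarize the remaining readings per machine.
        {
            "$group": {
                "_id": "$machineId",
                "avgTemperature": {"$avg": "$temperature"},
                "maxVibration": {"$max": "$vibration"},
                "readings": {"$sum": 1},
            }
        },
        # Hottest machines first.
        {"$sort": {"avgTemperature": -1}},
    ]


if __name__ == "__main__":
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=15)
    for stage in recent_machine_stats_pipeline(cutoff):
        print(stage)
```

With a driver such as PyMongo, a list like this would typically be passed to `collection.aggregate(...)`, so the summary is computed by the database on live data rather than in a separate analytics system.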

BP Yeah. So you mentioned BigQuery. I know that one of the things we had discussed before hopping on was a product on your side– Atlas, that frequently integrates well with that. So I would love to learn a little bit about what Atlas is and then maybe discuss some of the implementations and use cases you've seen with clients that you think developers who are listening might find useful within their organizations.

SV Sure. Atlas, for all of our listeners who haven't heard about it before, is MongoDB’s developer data platform. It's built on top of the core MongoDB database, but it allows you to deploy your database and manage it in the cloud. You can choose from one of the three major cloud providers, and you can even choose regions in different clouds and create a multi-cloud deployment. For example, you might have your data distributed in one region in AWS in Europe, one region in Google Cloud that is in the US, and a different region in Azure that is in, let's say, Australia or India. So you can have a multi-cloud deployment. But Atlas is a lot more than the core database. It provides various data services that you can use while building your applications. For example, you have Atlas Search, which provides you with a full-text search capability that is built on top of the database, and its benefit is that you don't need to have a separate search system that you need to maintain along with your database. You can just send queries to your MongoDB Atlas database and you can take advantage of full-text search on it. Another service is, for example, Atlas Charts, from an analytics point of view. So with Atlas Charts, as you can guess from the name, you can create visualizations on top of your Atlas data or the data that is in your database, and then share these visualizations with your teams, embed them in your web apps if you want to, and the best part is that they will automatically update as data changes in the underlying database.
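As a rough illustration of sending a full-text query to the database itself, the sketch below builds an aggregation pipeline that starts with the Atlas Search `$search` stage. The index name `default`, the searched fields (`name`, `description`), and the product-catalog framing are assumptions for the example; it also assumes a search index has already been created on the collection.

```python
from typing import Any, Dict, List


def product_search_pipeline(
    query: str, limit: int = 10, index: str = "default"
) -> List[Dict[str, Any]]:
    """Build an Atlas Search pipeline over a hypothetical product collection."""
    return [
        # Full-text match of the query against the assumed text fields.
        {
            "$search": {
                "index": index,
                "text": {"query": query, "path": ["name", "description"]},
            }
        },
        # Cap the number of results returned.
        {"$limit": limit},
        # Return selected fields plus the relevance score computed by the search.
        {
            "$project": {
                "name": 1,
                "description": 1,
                "score": {"$meta": "searchScore"},
            }
        },
    ]


if __name__ == "__main__":
    for stage in product_search_pipeline("wireless headphones", limit=5):
        print(stage)
```

Because the search runs as an ordinary aggregation stage, the same query path serves both regular reads and full-text search, which is the "no separate search system" benefit mentioned above.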

RD Sounds interesting. Can you give us some examples of how this works in real life use cases?

FF Yeah, sure. In fact, we have a couple of use cases there, so let me pick two or three different ones. The first one I would really like to cover is about retail, with what we can call contextual recommendations. In fact, these recommendations are usually based on other products that have been purchased by similar customers. For example, let's take a famous e-commerce platform, Amazon, which is generating 35% of its revenue through recommendations by providing suggestions not only based on similar customers, but also by adding the customer’s own purchase history, the current seasonal trends, and product combinations that are not really that intuitive and that can be found with data mining and AI/ML to build these product recommendations. So when we tie together, for example, MongoDB Atlas and BigQuery within the real-time data platform by correlating all of this information, the orders, a summary of historical data, buying histories, personal data, as well as seasonal trends, we can get, in fact, a data model which is going to be passed into a machine learning model to really provide real-time options that are going to drive a much higher conversion rate. Another example could be in manufacturing and logistics, which is something really well known to us now since COVID, with Track and Trace. So having real-time locations is something that can really be utilized, first, to reduce expense and risk for the business. For example, GPS tracking devices are attached to vehicles in the fleet, which can emit the current longitude and latitude so that when you're waiting for a parcel, you really know where it is.
And that really generates large amounts of high-value data, so that when you want at some point to optimize this whole delivery mechanism, combining MongoDB and BigQuery will really help to take all of this actionable data that you gathered during your delivery process and translate it into a significant reduction in business risk, and then also to improve operational efficiency. The last example I would like to give you is within financial services. Within financial services, we really want to get fewer fraudulent payments, for example. So this fraud detection mechanism is really a use case within the financial industry, which also needs real-time anomaly detection. When you are paying for something, you need to know if you are going to be allowed to pay or not, because the question is whether it is going to be fraud or not. So a bank or lender can really easily convert its domain knowledge of fraudulent behavior into real-time rules and apply machine learning to detect unknown anomalies. Then getting the scoring function results back into your transactional database, which will be MongoDB Atlas, will help to reduce the number of false alarms that are generated and raised. So the combination of the two, on one side building the scoring algorithm and on the other getting the result back into MongoDB, helps to get a solution which is going to be really efficient in real time, able to handle thousands of simultaneous connections.

BP Ryan and I were actually just chatting with some folks at Intuit about building out their ML stuff and trying to make it easier for developers to employ ML as features, and one of the key things they talked about was fraud detection, so definitely something that makes sense. I was also listening to a podcast recently, just talking as we have on this whole episode about the rise of data and how within a company, if you can start to build these models and craft a sort of reinforcement learning loop where you're learning from your own customers and your own data or from these interactions, that becomes a kind of powerful moat for your company. It's not just that you've built great software, that you've hired great talent, but that you are able to have this evolution of your sort of services based on learning and advancing and optimizing. So I think what you're talking about is a pretty powerful trend throughout the industry and it's interesting to think about it from the database-first perspective. So I know earlier we mentioned MongoDB and BigQuery, Stanimira. We just talked through some examples of Atlas, but can you tell us a little bit more about that integration and maybe some use cases that developers who are listening would find interesting?

SV We talked about what we can use this integration for, but not exactly how to implement it, so let's talk a little bit more about the tech behind it. There are several ways to integrate BigQuery with Atlas. Essentially you want to build a data pipeline that streams data from your database to the data warehouse or the other way around. And first, you need to find out if you want to be doing that in batches, as a batch job, or if you want to be streaming data. So once you define that, you can have different solutions. Let's start with the simplest ones. So you can use Google Cloud Dataflow, which is a service provided by Google Cloud. And you can use one of the MongoDB templates that we have developed together with Google Cloud and that are provided right in the Google Cloud UI. So when you select the MongoDB to BigQuery template, which is the simplest one, you need to provide the source, so the connection string to your database, some transformation, which is essentially a function that you might want to execute to transform the JSON document that is coming from MongoDB to a table or to more structured data. And then finally, you'll need to provide the BigQuery table as a destination. And that's it. You just run the data pipeline and this is going to stream your data from the MongoDB collection to the BigQuery table. You can also set up a change stream listener that you host somewhere on your own and you can feed that to Dataflow, and that way you can have a streaming process. So you can have a stream of events that are coming to your Dataflow pipeline and are then fed to the BigQuery table. Finally, there's one other integration that is actually quite recent that I want to mention. You can use Confluent Cloud. So if you're not familiar with Confluent Cloud, it's basically Kafka as a service. So if you start your Confluent Cloud account, you can create a pipeline that uses a connector.
So you can have a MongoDB Atlas connector that listens for changes in your database automatically so you don't have to actually host your own change stream listener anywhere. The platform will do that for you. And then again, you can transform the data and then you can also provide a sink connector to BigQuery where you will output the data that is coming from your MongoDB Atlas database. So yeah, it's actually pretty simple to set up a data pipeline and you have multiple different options to do it.
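The transformation step described above, turning a nested document into a flat, table-friendly row, can be sketched as follows. This is an illustration of the logic in Python, not the function the Dataflow template or the Confluent connectors actually use; the flattening rules (underscore-joined column names, arrays stored as JSON strings) and the helper names `flatten_document` and `forward_changes` are assumptions made for the example. The event fields `operationType` and `fullDocument` follow the shape of MongoDB change stream events.

```python
import json
from typing import Any, Callable, Dict, Iterable


def flatten_document(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten a nested document into a single-level row (hypothetical rules)."""
    row: Dict[str, Any] = {}
    for key, value in doc.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested objects become prefixed columns, e.g. machine.id -> machine_id.
            row.update(flatten_document(value, prefix=f"{column}_"))
        elif isinstance(value, list):
            # Arrays are kept as JSON strings to keep the row flat.
            row[column] = json.dumps(value, default=str)
        elif value is None or isinstance(value, (str, int, float, bool)):
            row[column] = value
        else:
            # Other types (for example ObjectId or datetime) are stringified.
            row[column] = str(value)
    return row


def forward_changes(
    change_events: Iterable[Dict[str, Any]],
    write_row: Callable[[Dict[str, Any]], None],
) -> int:
    """Flatten inserted, updated, or replaced documents and hand them to a sink."""
    forwarded = 0
    for event in change_events:
        if event.get("operationType") not in {"insert", "update", "replace"}:
            continue
        document = event.get("fullDocument")
        if document is None:
            continue
        write_row(flatten_document(document))
        forwarded += 1
    return forwarded


if __name__ == "__main__":
    sample = {"_id": "a1", "machine": {"id": 7, "site": "Sofia"}, "readings": [1, 2]}
    print(flatten_document(sample))
```

In a self-hosted listener, the `change_events` iterable could come from a driver's change stream (for example PyMongo's `collection.watch(full_document="updateLookup")`), and `write_row` would be whatever hands rows to the downstream pipeline; both are left abstract here on purpose.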

RD Yeah, it sounds super cool. Are there places that folks can check these out and play with them if they want to see what they're about?

SV Yeah, certainly. So we've got a tech blog that is called the MongoDB Developer Center that has tons of engineering articles published by our solution architects and several people from engineering. And we have several blog posts about the integration between BigQuery and MongoDB Atlas in there as well. We also have recently published interactive hands-on tutorials on Google Cloud Skills Boost. If you're not familiar with this platform, it is super useful. If you're just getting started with Google Cloud, you can register, and when you start the tutorial, which is actually called a lab there, you will get a temporary Google Cloud account that you can use to sign in to Google Cloud and execute all the exercises step by step. You're actually doing that in a real-world environment. So yeah, we have published a few MongoDB-specific labs there, including one BigQuery lab that is closely connected to what we are talking about today.

BP That's so cool. We've talked a bunch of times on this show about the way in which cloud is one of the most powerful on-ramps for people who are just getting into the field, and I think the idea of being able to go in and try out some of the work that you would do within an organization under the context of a lab but from a free account is so different from the way things would've been 10 or 15 years ago, and is certainly very cool for developers who are listening and considering whether or not this would work within their organization.

[music plays]

BP All right, everybody. Thank you so much for listening to this episode. I hope you enjoyed it. As we do at the end of every show, I'm going to shout out a user who came on Stack Overflow and helped spread some knowledge with their curiosity or their answers. Awarded seven hours ago: a stellar question badge to SubniC. This question has been saved by 100 users, meaning it's provided a ton of value to different folks. “Get the name of a current script in Python.” 500,000 people have come to check this question out and learn from the answers, so we appreciate it, SubniC. I am Ben Popper. I'm the Director of Content here at Stack Overflow. You can always find me on Twitter @BenPopper. If you have questions or suggestions about the show, email us, podcast@stackoverflow.com. And if you like what you hear, why don't you leave us a rating and a review. It really helps.

RD I'm Ryan Donovan. I edit the blog here at Stack Overflow, it's at stackoverflow.blog. And if you want to reach out to me, you can find me on Twitter @RThorDonovan.

SV Thank you everyone for listening. My name is Stanimira Vlaeva. You can find me on Twitter with my full name, which isn't that easy to spell, but you can do it.

BP We'll put it in the show notes.

FF Thanks for listening to us. My name is Frederic Favelin and you can reach out to me on LinkedIn on my profile, Frederic Favelin.

BP All right, everybody. Thanks for listening. We'll be sure to put those social handles for Twitter and LinkedIn in the show notes, as well as some other links where you can check out some of the stuff we discussed in the episode. Thanks for listening, and we'll talk to you soon.

[outro music plays]