The Stack Overflow Podcast

Why is it so hard for companies to protect your privacy?

Episode Summary

Minh Nguyen, VP of Engineering at Transcend, joins Ryan for a conversation about the complexities of privacy and consent in tech, from the challenges organizations face in managing data privacy to the importance of consent management tools to the evolving landscape of privacy regulations.

Episode Notes

Transcend is a data privacy and governance platform. See what they’re up to on their blog or dive into their docs.

Find Minh on LinkedIn.

Stack Overflow user ivanavitdev earned a Populist badge with their exceptionally thoughtful answer to How to use toSorted() method in TypeScript.

Episode Transcription

[intro music plays]

Ryan Donovan Hey there, listeners of the Stack Overflow Podcast. I'm going to be at the HumanX Conference in Las Vegas from March 10th through March 13th. We'll be recording episodes with some of the speakers, as well as asking questions of folks on the floor for special compilation episodes. If you're attending and want to meet up, email me at podcast@stackoverflow.com. Hope to hear from you.

RD Hello everyone, and welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I am Ryan Donovan, the host of the podcast, and today we're going to be talking about privacy and consent and why it's so hard in large organizations. My guest today is Minh Nguyen, VP of Engineering at Transcend. Welcome to the program, Minh.

Minh Nguyen Thank you so much for having me.

RD Of course, my pleasure. So at the beginning of the show, we like to find out about our guests’ journey into software and technology. How did you get started and how did you get to where you are today?

MN Sure. I got started from a nontraditional background. I studied philosophy in college. As part of a requirement for graduation, I had to take a bunch of math and logic classes and so I learned about computers from a pretty abstract standpoint. I wrote proofs about the halting problem, Turing completeness, and what it meant to compute what could be computed before I ever wrote code. But through that process, I learned I really enjoy that sort of thinking and so I signed up for my first CS course senior year of college. Turns out I love it, and then afterwards I joined a bootcamp, and here we are.

RD Yeah, that's interesting. I hear a lot of nontraditional entries. The philosophy logic one is an interesting path. How do you think that that focus on the sort of proofs of computability has changed or affected how you write code?

MN I think it hasn't changed how I write code per se, but I think just my philosophy background has given me a lot of different, maybe useful methods of thinking, methods of inquiry into looking at technical and nontechnical problems. And so I think it's helped me sort of become a more well-rounded engineer.

RD Okay. So today, obviously we're talking about privacy, data privacy. Why is that a problem? I know I've talked about it with other folks on the podcast and it does seem to be a problem that is not easily solved.

MN I think privacy is just one of those things that most people are for. I think all these regulations get passed because there's general support in the public for it. People think they should have a say over whether their data gets collected and how it gets used. And I think companies want to honor that as well, both for the user and for regulation reasons, but it's just a tough problem because most companies are in the business of building good software to serve their end users, they're not necessarily thinking about the privacy considerations. And even if they are, it can become a hard problem just as the scale of your data grows. So to maybe give more specific examples, if we think about the lifecycle of data, at the collection point, you have to make sure that you have the user consent to collect their data, which is complicated by the fact that different users have different regulations apply to them and so you maybe need to give them a different consent experience. After collection, you need to make sure that at the point of processing that you still have their consent to process that data. Maybe they have opted out, and if they have, then you need to make sure that you're doing your best effort to honor that. And then if they submit a deletion request, then propagating that request across all of your different data systems. And the problem only gets worse in scale as a company gains more and more data systems, works with more and more vendors, and then observability becomes a big concern as well. If someone decides to copy over a database for further analysis, are you going to be aware of that copy? Because you need to also remove the user data from that new copy. So a lot of different challenges that just sort of get added as the company grows.

RD Yeah, that is interesting that it is a lot of different moving pieces here. And I think it does seem like privacy is one of those things that, like you said, everybody wants, but it's not the first thing that they think about. It's something that's added to the software or the software development lifecycle a little later. Do you think it's one of those things that they add in once it starts scaling, once you get to the point where it could bite you and hurt you?

MN I think so. I think when you first start out your company, maybe you're a small company, manual work streams could work for a while. You could stand up a privacy email inbox or something like that, have your one lawyer or one operation person be responsible for it, and you get maybe a handful of requests ever and so you can just manually fetch the data or delete the data to honor that request. And that could get you pretty far, you are doing your best. But as your company grows and as your need for data becomes more urgent, maybe you need to understand user behavior to improve your product or to sell to the right user. As that becomes more of a need, then the manual workstream is not only an expensive way to solve the problem, but becomes not feasible. If you have hundreds of vendors or hundreds of thousands of databases, then you can't throw a human in the loop and solve that problem. So definitely it is a problem that comes with scale, and scale of the data that you're processing rather than necessarily the size or revenue of the company.

RD Talking to data folks, there's so many different databases and data gets propagated through them in different ways. How do you automate that sort of consent deletion request through what could be hundreds of databases?

MN Yeah. So at Transcend we've built lots of direct integrations into all of the different major database technologies and even not so popular SaaS vendors. Basically anyone with an API that we can programmatically talk to, we’ll talk to you. And then we have an orchestration engine essentially to help propagate the signal across all of your stacks and it's pretty flexible. You can sort of encode logic to take into account the type of user, the type of request, what regulation applies, and even incorporate manual work streams as well if the vendor or the team that owns certain data silos don't have an automated way to talk to it.
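
As a rough illustration of the orchestration idea described above, here is a minimal TypeScript sketch that fans a deletion request out to every data silo and falls back to a manual work stream when a silo has no programmatic integration. The names `DataSilo`, `deleteUser`, and `propagateDeletion` are hypothetical and are not Transcend's API. A real engine would also branch on the type of user, the type of request, and the applicable regulation, which this sketch leaves out.

```typescript
// Hypothetical model of a data silo; not taken from any real product.
interface DataSilo {
  name: string;
  // Present only when the silo exposes an API that can be called programmatically.
  deleteUser?: (userId: string) => Promise<void>;
}

type Outcome = "deleted" | "queued_for_manual_review" | "failed";

// Fan the deletion request out to every silo and record what happened in each.
async function propagateDeletion(
  userId: string,
  silos: DataSilo[],
): Promise<Map<string, Outcome>> {
  const results = new Map<string, Outcome>();
  await Promise.all(
    silos.map(async (silo) => {
      if (!silo.deleteUser) {
        // No automated integration: hand the request to a human work stream.
        results.set(silo.name, "queued_for_manual_review");
        return;
      }
      try {
        await silo.deleteUser(userId);
        results.set(silo.name, "deleted");
      } catch {
        // One failing silo should not stop the request reaching the others.
        results.set(silo.name, "failed");
      }
    }),
  );
  return results;
}
```

Recording a per-silo outcome, rather than a single success flag, is one way to keep failed or manual deletions visible so they can be retried or followed up.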

RD And you mentioned the observability piece, is there almost a forensic part of this to sort of track where data goes through a system before you can even honor that deletion request?

MN Yeah. So you definitely need to start with a solid data map, a data catalog. And I mean, you could have a data map or data catalog for lots of different reasons, not just privacy reasons, but we care about it for privacy reasons. And so yes, you would want to build that out, and again, there are many ways to build it out. One way is fairly manual– you can just send a bunch of surveys or just grab all the different department heads, sit down and ask them, “What is this data silo? What do you use it for?” Do it once a year and then you have a data map that's potentially one year out of date, and maybe that's good enough. But we don't like that at Transcend. We have, again, a way to connect to all of these cloud providers or data silos and scan the contents to help you understand what data is stored and where.

RD My experience just finding that map, who owns what, I had to do it in a microservices context in a previous job, and just finding where all the code lives, who owned what, was something that people didn't do, even at a company with hundreds of engineers. Is there a first step to sort of understanding the data and the privacy consent applications of that data that any company should do?

MN I mean, I think it depends on the sort of data that you're processing and where you operate your business. If you are handling sensitive data, then I think it's already a concern and you should start with the manual process maybe to start just to assess the risk that presents to your business and then decide whether you need to automate that discovery process so that you can have a more robust, up-to-date data map. But I think by and large, if you are serving especially a B2C product, if you have a marketing team, you probably have privacy concerns that you need to think about.

RD Yeah, that's right. The marketing team loves to use all the information they have.

MN Sorry to throw the marketing team under the bus. I mean, the product team too. The first time the product manager comes in and says, “We need to understand where our user is spending time on our app,” you probably need to think about privacy.

RD Right. Are there pieces of data that are affected by privacy and consent that people don't think about?

MN I mean, I think maybe one common misunderstanding is that you need to only consider personally identifiable information like PII, but a lot of the laws go beyond that. It's any personal data, even if it's not identifiable, it doesn't uniquely identify an individual. So I would say that that's like a common misconception that we hear, and we can help. With our discovery and classification tools, we try to classify all personal data, not just identifiers.

RD Yeah, because I have heard of people being able to take three sort of general things, might be browser type, zip code, and something else and be like, “That's you.”

MN Yes, yes, yes. This sort of ties into our approach to consent management. I mean, the more popular name or the older name is cookie management or cookie banner. And so we really popularized this terminology in the public consciousness and it makes it seem like cookies are the only way to track a user, to identify a user, but when in fact there are so many ways, like you said. Just basically any three pieces of information– your browser, your user agent, could triangulate and figure out who you are. So we need to go beyond cookies to think about a robust consent management solution.

RD Yeah. So on that topic, you all built consent management tools. How did you go about putting that together?

MN So we actually were a little late to the party. I think by the time we arrived on the scene, there were already many different cookie banner solutions out there, but they all pretty much work by basically either blocking scripts or blocking cookies, and there are a couple of problems with that. If you're blocking a whole script, a script could be loaded for multiple reasons. So let's use the example of a chat widget. You arrive on a website, you need help, you need to talk to someone, but maybe you use a privacy-conscious browser and it opts you out of tracking, but the chat widget loads the chat functionality and also collects analytics data, and so what do you do– you just block the whole thing? And I think this happens a lot. When I use a Brave browser or something, a lot of site functionality just doesn't work. So it's not very flexible and you're kind of penalizing the privacy-conscious user in those scenarios. The second problem is sort of what I mentioned earlier– cookies are not the only way to track the user, it's just a storage mechanism. There are many, many storage mechanisms on the web– local storage, IndexedDB, even just the path or the URL query parameter could contain personal information. So we took a step back and realized it's not the storage mechanism that's the problem, it's when it leaves your browser that's the problem. So our consent tool works very differently. We’re basically a firewall that runs on your browser and inspects every network request and decides to block it or allow it based on your consent.
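
To make the "firewall" idea concrete, here is a minimal TypeScript sketch of a per-request decision based on purpose and consent, instead of blocking a whole script. Everything in it is an assumption for illustration: the purpose names, the `hostPurposes` table, the example hostnames, and the choice to block unknown third-party hosts by default. It is not how Transcend's product is implemented.

```typescript
type Purpose = "essential" | "functional" | "analytics" | "advertising";
type Consent = Record<Purpose, boolean>;

// Hypothetical classification of third-party hosts by purpose.
const hostPurposes = new Map<string, Purpose>([
  ["chat.example.com", "functional"],
  ["metrics.example.com", "analytics"],
]);

// Decide whether a single outgoing request may leave the browser.
function isRequestAllowed(rawUrl: string, pageOrigin: string, consent: Consent): boolean {
  const target = new URL(rawUrl, pageOrigin); // resolves relative URLs against the page
  if (target.origin === new URL(pageOrigin).origin) {
    return true; // first-party requests are treated as essential in this sketch
  }
  const purpose = hostPurposes.get(target.hostname);
  if (purpose === undefined) {
    return false; // assumption: unclassified third-party hosts are blocked
  }
  return consent[purpose];
}
```

With consent granted for `functional` but not `analytics`, the chat widget's own traffic to `chat.example.com` goes through while its analytics pings to `metrics.example.com` are blocked, which is the chat-widget case described above.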

RD That's interesting. So when you say ‘when it leaves your browser,’ how would that information leave your browser?

MN So there's all these cookies or whatever, you're storing it. The script that's loaded –let's say the chat widget– there's probably also a little analytics collection part of the code that's just pinging, sending back data on some regular cadence or when the user closes the chat or something like that. So if you visit a modern website today and you open your network tab, even if you're not taking any action whatsoever, you'll see new network requests go out. So there's all these code, like SDKs, whatever, that are sending out data whenever, basically. And so we basically patch all of the DOM APIs that could potentially make a network request, and that way we're able to regulate them.
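
One way to picture regulating an API that can make network requests is to wrap it so every call passes through a consent check first. The sketch below does this for `fetch` only. `guardFetch` and the `isAllowed` callback are hypothetical names, and a real implementation would need to cover the other request paths as well, such as `XMLHttpRequest`, `navigator.sendBeacon`, `WebSocket`, and elements that load URLs.

```typescript
// Wrap a fetch implementation so each call is checked before it leaves the page.
function guardFetch(
  original: typeof fetch,
  isAllowed: (url: string) => boolean,
): typeof fetch {
  return (input, init) => {
    // fetch accepts a string, a URL object, or a Request; normalize to a URL string.
    const url =
      typeof input === "string" ? input : input instanceof URL ? input.href : input.url;
    if (!isAllowed(url)) {
      // Reject instead of sending, so the caller sees a failed request.
      return Promise.reject(new TypeError(`Request blocked by consent guard: ${url}`));
    }
    return original(input, init);
  };
}

// Installing it on a page would look like this (left commented out in the sketch):
// globalThis.fetch = guardFetch(globalThis.fetch, (url) => !url.includes("metrics"));
```

Passing the original function in as a parameter, rather than reading the global inside the wrapper, keeps the guard easy to exercise with a stub.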

RD To sort of block anybody who's sending more information than they need to.

MN Yeah, exactly.

RD That's an interesting idea. I mean, it seems like everything on the web is sending analytics at this point, multiple analytics on multiple pieces. And every time I go into my task manager, there's subtasks that are running ads and analytics, and I was like, “Why is this one here?” Is there too much analytics tracking in your opinion, or is this sort of par for the course?

MN Is there too much? I think there's probably too much, but Transcend isn't in the business of taking everyone's data offline. I don't think that's the vision, I think we just want people to have a say in it. And as a user, I'd be much happier to give my data to a company if I felt like it was clearly understood what they were going to use it for and what the value is for me as the consumer, and I would also be much more likely to give my data if there's a clear and easy way to export, delete it, bring it to someone else if I don't agree with how they're using my data. And so yes, there's probably too much data being collected, but also some of that's for good reason, but you're just not giving your user visibility and understanding into why you're doing that.

RD And I know we talked about the back end part of this. How does the consent management stuff translate into the sort of back end storage, like you said, removal requests and all that?

MN So in terms of consent, you may need to persist consent signals in the back end just so that if you're uploading the user data to some vendor, at that point of uploading, maybe you want to check again to see if you have permission to do so. And so the technical requirements for this consent store needs to be highly available. I guess it depends again on what you're using your data for, but for some companies, it's quite sensitive and has a lot of business impact if the processing is slow. Maybe you're a financial firm and you need to. So on the back end, the solution needs to be highly available and accessible from lots of different systems, so we built a highly-available, low latency API so that people can query for this data. And the other interesting challenge and perk of our consent solution is setting up this thing could be quite complicated. You're essentially asking sometimes not-so-technical folks to set up a firewall for their website, which is kind of bonkers, so we are also collecting some metadata to help make suggestions to the end user on how to configure that firewall. So we have to handle a lot of telemetry data, I guess, because there's lots and lots of network requests and we need to help them classify what is this vendor that's making this request, what's the potential purpose behind it? That's a very interesting challenge as well that if you were to build that in-house, that would be really hard to do.
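
The back-end check described above, looking up consent again at the point of uploading data to a vendor, might be sketched like this in TypeScript. The `ConsentStore` interface, the in-memory implementation, and `uploadIfConsented` are hypothetical stand-ins for a highly available consent API, and treating a missing record as "no consent" is an assumption of this sketch.

```typescript
type Purpose = "analytics" | "advertising";
type ConsentFlags = Partial<Record<Purpose, boolean>>;

interface ConsentStore {
  // Resolves to undefined when no consent signal has been recorded for the user.
  getConsent(userId: string): Promise<ConsentFlags | undefined>;
}

// In-memory stand-in for a low-latency consent service.
class InMemoryConsentStore implements ConsentStore {
  private readonly records = new Map<string, ConsentFlags>();

  setConsent(userId: string, purpose: Purpose, granted: boolean): void {
    const current: ConsentFlags = this.records.get(userId) ?? {};
    current[purpose] = granted;
    this.records.set(userId, current);
  }

  async getConsent(userId: string): Promise<ConsentFlags | undefined> {
    return this.records.get(userId);
  }
}

// Re-check consent immediately before data leaves for a vendor.
async function uploadIfConsented<T>(
  store: ConsentStore,
  userId: string,
  purpose: Purpose,
  payload: T,
  upload: (payload: T) => Promise<void>,
): Promise<"uploaded" | "skipped"> {
  const consent = await store.getConsent(userId);
  if (consent?.[purpose] !== true) {
    return "skipped"; // no record, or an explicit opt-out: do not send
  }
  await upload(payload);
  return "uploaded";
}
```

Because the lookup sits directly in the upload path, its latency and availability affect every processing job, which is why the store needs to be fast and reachable from many systems.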

RD I mean, you are collecting data also on this, but I think that that shows why everybody's doing it. You want to see if this works, you want to see if it's effective. There's so much information given in any sort of web networking transaction that if you're not looking at it, you're missing out on understanding what's going on.

MN For sure. But we try to do this in a pretty privacy-conscious way, so we don't actually send personal information back to our telemetry server, it's just purely, “How many times did you make a request to this domain?” So it's just the domain name. You can't really identify the user based on just the domain and a count. So we always take privacy considerations quite seriously and bake that into our product, and I think generally there is a way to collect data and use it in aggregate or something that can't then be misused for privacy reasons.
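
A minimal TypeScript sketch of that kind of aggregate telemetry: reduce each observed request to its hostname and keep only a count per domain, so paths, query parameters, and fragments never reach the telemetry payload. The function name and the sample URLs are illustrative assumptions.

```typescript
// Count requests per domain, discarding everything except the hostname.
function countRequestsByDomain(urls: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const raw of urls) {
    const host = new URL(raw).hostname; // drops path, query string, and fragment
    counts.set(host, (counts.get(host) ?? 0) + 1);
  }
  return counts;
}

const observed = [
  "https://metrics.example.com/collect?uid=42",
  "https://metrics.example.com/collect?uid=42&page=/account",
  "https://chat.example.com/socket",
];
const summary = countRequestsByDomain(observed);
console.log(Object.fromEntries(summary));
```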

RD Yeah. Top of the show you mentioned some of the regulations and the various different ones. Obviously there's GDPR, but California has one that I don't know the full landscape of.

MN I don't even know the full landscape at this point because I'm not a lawyer. Just because there's a patchwork of state-level laws being passed because we don't have a federal law. And then every country also has their own.

RD So are those regulations useful, and that patchwork, is that making it worse, making it harder to sort of guarantee privacy?

MN I don't think it's making it worse. I think if you implement it poorly, then obviously it will result in a worse user experience. But I think it shows that there's appetite, I think, for a federal-level regulation. I think the problem isn't that there are state-level laws, it's that we don't have a federal law. And there's some commonalities as well, and this makes it so that you should probably buy and not build this in-house, again, because a good tool will be fairly flexible and allow you to configure lots of different experiences and not be too opinionated about ‘this is what this means’ and only build for one regulation, because all of these things are changing and your tool needs to be flexible enough to account for any changes in the future.

RD And if you build to the most strict regulations, you probably catch some of the other ones in there too.

MN Yeah, that's also quite common. Companies will choose to just build for the strictest case.

RD So this is for programmers largely, can you tell us a little bit about the tech stack you use to build consent management tools?

MN Yeah. Our entire tech stack is TypeScript with a little bit of Python where you just have to do it for ML. If you're doing ML, then you have to do Python, but everything is TypeScript. We're also an AWS shop. And the consent though, in particular the SDK that runs on the browser, it all compiles to native JavaScript code. We can't use any third-party libraries, for privacy reasons again, and also it's quite performance-sensitive because it sits in the critical path of all DOM mutations so we have to be very careful to not slow things down.

RD So no frameworks, just raw JavaScript.

MN Yeah, and lots of reading not just MDN docs, but also the original specification for different protocols.

RD Wow. So your team must have one of the better understandings of some of those protocols.

MN We have some pretty incredible, I guess, browser engineers, which are kind of rare because generally you're building web applications or something like that but you don't need to dig so deep, but for our product, you kind of need that sort of deep expertise.

RD On a completely different topic, we had a public figure, Mark Zuckerberg talked about software engineering teams needing more masculine energy to be more aggressive. What do you think of a statement like that?

MN It's a funny statement, but I think ultimately I don't really know what he means by that. Does he want more men in tech? We already have a lot of men in tech. And does he want people to be more aggressive, more assertive, more competitive? What does he mean? We have these more specific words, so why don't you just use the more specific word to say what you mean? So personally I prefer more specific concrete words when it comes to my team culture and sort of the engineering culture that I want to build, so I think a lot more about collaboration and how to have open discussions and healthy disagreements and how to build a team that can be high-performing.

RD Yeah. And in my experience, the engineering teams aren't really where you get the aggression as much, even when it's all men. Most engineers are pretty conflict-averse, generally willing to work with whoever is there, unless they're one of those take charge 10x engineers.

MN Yeah, I think that tracks with my experience. I have had just a career where most people I work with are nothing but nice, kind, talented people. And there are challenges, it's not that they never come up, but that those are the exception, not the rule.

RD Right. So back to the consent and privacy, what are you most excited about in the future of privacy and consent management?

MN I think I'm excited by the challenge that AI brings to privacy. I think it's become more on people's minds now that AI is everywhere, and I think there's a lot of work for us to do but I think we've taken the right approach to it, which is this is a problem that was introduced through automation, through tech, so you need to build the right tech to catch up and solve this problem. So it's a big challenge, but I'm kind of excited by it.

RD Okay, that's a good attitude to have. You see a big problem and are like, “Oh, that's exciting.”

MN Yeah. I mean, at least I feel like I'm doing something directly about it, which is all that anyone can do really.

[music plays]

RD All right, everyone. It's that time of the show where we shout out somebody who came onto Stack Overflow and dropped a little knowledge, helped out some people. Today we're shouting out a Populist Badge winner– someone who came on and dropped an answer that was so good it outscored the accepted answer. So congrats today to ivanavitdev for answering: “How to use toSorted() method in TypeScript.” If you're curious and you have a TypeScript shop, go check it out. I'm Ryan Donovan, I host the podcast here at Stack Overflow. If you want to reach out to us with comments, questions, concerns, you can email us at podcast@stackoverflow.com. And if you want to find me, I'm on LinkedIn.

MN My name is Minh Nguyen. You can find me on LinkedIn. You can find more information about Transcend at transcend.io.

RD All right, everyone. Thank you very much, and we'll talk to you next time.

[outro music plays]