The Stack Overflow Podcast

Mobile Observability: monitoring performance through cracked screens, old batteries, and crappy Wi-Fi

Episode Summary

Today we chat with Austin Emmons, an iOS developer at Embrace, where he spent time rebuilding their SDK to work with OpenTelemetry. He discusses the challenge of tracking performance and watching for edge cases when your app is deployed across dozens of devices with enormous variability in their hardware, software, and network capabilities.

Episode Notes

You can learn more about Austin on LinkedIn and check out a blog post he wrote on building the SDK for OpenTelemetry here.

You can find Austin at the CNCF Slack community, in the OTel SIG channel, or the client-side SIG channels. The calendar is public on opentelemetry.io. Embrace has its own Slack community to talk all things Embrace or all things mobile observability. You can join that by going to embrace.io as well.

Congrats to Stack Overflow user Cottontail for earning an Illuminator badge, awarded when a user edits and answers 500 questions, both actions within 12 hours.

Episode Transcription

[intro music plays]

Ben Popper Managing agreements can be challenging with scattered data and manual work. Discover the DocuSign Developer Platform to streamline agreements with flexible APIs and tools, customize workflows, extend capabilities, and extract actionable AI-driven insights. Learn more and sign up for free at developers.docusign.com. 

Ryan Donovan And welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I'm Ryan Donovan, I edit the blog here at Stack Overflow, and today I'm joined by my colleague, Eira May. How are you, Eira? 

Eira May I am doing well, Ryan. How are you? 

RD Living the dream, dreaming the life. Today we have a great guest to talk about observability, but specifically mobile observability. We usually talk about observability in general terms, but today we're joined by Austin Emmons, an iOS developer at Embrace.io, to talk about how observability works on these tiny devices. So Austin, welcome to the show. 

Austin Emmons Hi. Thanks for having me. 

RD So top of the show, we like to get to know our guests. How did you get into software and technology? What's your origin story? 

AE Just like any normal or standard origin story. I grew up and was pretty good at math, really loved video games, and always wanted to be a video game developer. In high school I took my first programming course, and a local college sent one of their students to teach us alongside our teacher. The course was in C++, I believe, and then the college student came for the last month, taught us Java, and introduced us to the object-oriented programming paradigm. That kind of solidified a path to the University of Vermont up in Burlington for an undergraduate degree, a BS in Computer Science, where I learned the theory behind computing. And that was– oof, I think I graduated in 2012, so I've been a professional developer since then. I got into iOS development probably at the tail end of my college career. One summer, right after iOS 2 had come out with the App Store and they had started to open it up to developers, some textbooks came out, and there was this intro to iOS development textbook that was the first textbook I had ever read cover to cover. I did all of the examples in the book, built a calculator app and a to-do list, all of the basic stuff. But spending my time on a computer over summer break, doing that kind of thing, was new to me. I really felt like, "Oh, okay, this is interesting. It's something that I really want to do." And so I've been interested in iOS development ever since. My career has kind of taken me here and there, from Android development to back end server development in PHP and Ruby on Rails, and I finally was able to come back and become a full-time iOS developer. It's great just to be back where you want to be and doing the work you want to do. 

RD On the podcast and in the blog we've talked to a few folks about observability in general. So how does mobile observability differ from back end observability? 

AE It's similar in that you're still worried about whether the application is performing correctly when it's out in the wild. On the server, you deploy to your production environment and you still want to monitor that environment to make sure the logic you've written is correct and valid and the edge cases are minimized as much as possible. That's the same in the mobile environment; the difference is that on these devices, you have less control. You have no control over the disk space, or the network you're on: Wi-Fi, cell, how spotty that might be. These devices are most often running on battery power, which is completely different. I don't think there's any server in the world that would run off a battery tiny enough to fit into your pocket. And then the chaos continues from there. The main input, the display, could just be shattered into a million pieces, with parts of it not active because that part is simply gone. Not that as an application developer you would normally concern yourself with that, but this is the reality your users are dealing with, and that's very different in this mobile environment. And the devices are very different. On the two major operating systems, Android and iOS, even just for a phone, you have many different models. iOS is a little better, but in the best case you might have a phone that is five or six years old running an operating system that's been out for a year or two, and you might also be on a device that came out two days ago running pre-release software that will be released sometime next year. So the spectrum of that environment is very wide and very varied. At the end of the day, your users will expect the best from your application, so when these edge cases come up, you have to really narrow them down and understand: why is this happening, where is this happening, and what can I do to minimize it as the application developer? 

RD So how do you implement observability when you can't really know what sort of data is going to be sent to you because of all these devices, or because the interface area may be limited? How do you account for all this variability? 

AE You definitely try to find commonality where you can, start small, and build from there, just like any problem you're trying to solve. And so in the mobile space, what does that mean? It's: where is this application making an important interaction? Many of those interactions are the network requests it's making to the server. A lot of times the server is kind of the driver and the mobile application is just a passenger along for the ride, so a lot of the state is still handled on the server and the database is up in the cloud somewhere. But that network request, that interaction, is very important. You want to make sure it goes out successfully, and when the app launches, you want a quick response to come back so that the first time a user opens the app, they're not waiting for a long handshake. So we can observe those network requests and say, "Okay, one of these is going out and not 30," which is great. We're trying to minimize how much we use the antenna, because that'll just create a bottleneck. And then maybe this is the second or third time the app has started. Hopefully this request is cached, and hopefully our caching logic on device is defined well enough that we understand what that behavior is. If it's a hello request like that, it might have a very short cache lifetime, but at least there's a 15-minute window where the state is considered okay to stay the same. You try to limit those interactions, and you try to start at those bottlenecks and say, "Okay, here's where the problems could arise because state is being exchanged." From there we can start observing other things, like what the user is doing or what the system is telling us. Is the device in low power mode? Are we getting any memory warnings from the system? Those are also important signals that the system sends and pretty low-hanging fruit to capture.
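To make that concrete, here is a minimal Swift sketch of the kinds of signals Austin describes: timing one launch request and listening for low power mode changes and memory warnings. The class name and approach are illustrative only, not the Embrace SDK.

```swift
import Foundation
import UIKit

// Minimal sketch (not the Embrace SDK): time one "hello" request and listen
// for the low-hanging-fruit system signals mentioned above.
final class LaunchSignalsMonitor {
    private var observers: [NSObjectProtocol] = []

    // Record how long the app-launch request takes to come back.
    func timeHelloRequest(url: URL, completion: @escaping (TimeInterval) -> Void) {
        let start = Date()
        URLSession.shared.dataTask(with: url) { _, response, error in
            let duration = Date().timeIntervalSince(start)
            // A real SDK would record the duration, status code, and error as
            // telemetry instead of printing them.
            let status = (response as? HTTPURLResponse)?.statusCode ?? -1
            print("hello request: \(duration)s, status \(status), error: \(String(describing: error))")
            completion(duration)
        }.resume()
    }

    // Low power mode changes and memory warnings are cheap to observe.
    func observeSystemSignals() {
        observers.append(NotificationCenter.default.addObserver(
            forName: .NSProcessInfoPowerStateDidChange, object: nil, queue: .main,
            using: { _ in
                print("low power mode: \(ProcessInfo.processInfo.isLowPowerModeEnabled)")
            }))
        observers.append(NotificationCenter.default.addObserver(
            forName: UIApplication.didReceiveMemoryWarningNotification, object: nil, queue: .main,
            using: { _ in
                print("memory warning received")
            }))
    }
}
```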

RD With back end, like you said, you control everything. What are the variabilities in the telemetry that you get from mobile devices and what is usually the hardest to get? 

AE It depends on what access the system gives you. Network requests are a pretty common instrumentation, so that's probably the biggest one. There's also just the concern that as an observability vendor, you don't want to slow down the runtime of the application. You definitely don't want to cause any crashes or any slowness, so you try to stay out of the way. That can be a complication, but it's not something you retrieve. Crash reports themselves are a big piece of telemetry that we can grab from the device once the app is relaunched, and that's a really good one to have because it's an explicit failure. But even still, sometimes the failures there are not as explicit as you want. These operating systems do take control and say, "Okay, you've had enough and we're just going to kill the app." Apple has a very famous exception code, 0x8BADF00D in hex, for when the application is going to be put to rest. That can happen anywhere, so it's very hard to identify what might have caused it. It could be various different things that led to the system saying, "Okay, sorry, you're using too much memory," or, "We just have to put you to sleep because somewhere else in the system we need more, and your application is no longer being used, so we'll put you away." So anything can be difficult; in the mobile space it's hard to pick out one piece and say it's the difficult one. The hardest part is really getting the logic in place so that if something fails randomly, you can recover and capture the data you need to reproduce whatever that failure was.
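As a sketch of where that relaunch-time telemetry can come from on iOS, MetricKit will hand an app its crash and termination diagnostics on a later launch. This is just one platform-provided route, shown under the assumption of an iOS 14+ target; it is not necessarily how Embrace collects crash reports.

```swift
import MetricKit

// Sketch: receive crash diagnostics on a later launch via MetricKit.
final class DiagnosticsSubscriber: NSObject, MXMetricManagerSubscriber {
    func start() {
        MXMetricManager.shared.add(self)
    }

    // The system calls this, typically on the next launch after the failure.
    func didReceive(_ payloads: [MXDiagnosticPayload]) {
        for payload in payloads {
            for crash in payload.crashDiagnostics ?? [] {
                // terminationReason is where system-initiated kills
                // (watchdog, memory pressure) tend to show up.
                print("crash diagnostic:", crash.terminationReason ?? "unknown reason")
            }
        }
    }
}
```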

RD I know I've seen things about logging potentially leaking PII. Are there other privacy concerns with mobile observability? 

AE Absolutely. This is a user's device in their pocket, and there are so many things that a user may not expect to be PII but are. Apple has done a really good job, and Google has done a very good job, at starting to define and really limit what an application or observability vendor can access. Apple just last year at WWDC, I think '23, started this concept of a privacy manifest where even us as an observability vendor, as this SDK, we can declare, "Here's the information we're gathering. It may or may not be identifiable to the user, but just to be safe, let's declare it and make sure everybody understands what this is." An example of that that might catch you off guard is how much disk space is available on the device. And you're like, "Well, that's not identifiable at all," but it's a very precise integer. It's how many bytes are left. That piece of information, coupled with maybe an IP address and one more piece of loose information, can be used to fingerprint that user and track them, not just in that app but across apps, and that's the red line these operating systems are really drawing. We don't want you to have the ability to say, "Austin Emmons is using Uber, and then while they're in their ride, they moved over and now they're in whatever app it might be." If we're accessing that disk space, the rule as it's stated in Apple's documentation is that you cannot make any derivatives of this value and the value cannot leave the device. Starting with logic and language like that is really important to protect users' PII. And then for the other things that are very common, like location info, we try to tell people: just don't even capture it. Don't even observe this in our platform. We don't want to store this on our servers. It's really not that useful. If an issue occurs, give us a derivative, but don't give us that fine-grained location. So if you're tracking a vehicle driving and you're calculating speed, well, calculate speed on device. A lot of these interfaces allow you to just ask, "I have these two coordinates. Derive the speed from that," instead of sending the location up to the cloud and doing it in the back end. That's the type of stuff that, as a developer, you should be thinking about when it comes to privacy and where this logic should be held. And as a mobile observability vendor, we're definitely very cognizant of making sure the people who use our SDK are not trying to do anything that is less than desirable for me as a mobile user myself.
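As a small illustration of that "derive it on device" idea, here is a hedged Swift sketch using CoreLocation: speed is computed locally from two fixes, and only the derived number would ever be reported.

```swift
import CoreLocation

// Sketch: compute speed on device from two location fixes so the raw
// coordinates never need to leave the phone.
func derivedSpeed(from older: CLLocation, to newer: CLLocation) -> CLLocationSpeed? {
    let seconds = newer.timestamp.timeIntervalSince(older.timestamp)
    guard seconds > 0 else { return nil }
    let meters = newer.distance(from: older)  // distance between the two fixes, in meters
    return meters / seconds                   // meters per second
}

// CoreLocation also exposes a per-fix estimate directly: `newer.speed`
// (in m/s, negative when unavailable), which avoids even this calculation.
```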

EM I wonder if I could, just sort of piggybacking on that, ask you about the kind of maturity of mobile observability. In that space, is there a lot of awareness that this is a thing that folks need and that folks need to be working on? Have you seen that really change over the last couple of years?

AE So mobile observability in general has been around for quite some time, and there are some very famous vendors out there with solutions: Sentry, Datadog, New Relic. These are big names in the observability space in general, and they do offer some solutions in the mobile space now and have for the last couple of years. I would say it's kind of getting into its teenage phase, if we're talking maturity: it's been out there for a while, and now we're starting to really enter the era where open standards are taking shape. In the last year, Embrace has switched to OpenTelemetry as our base data model and really dived into it, and it's great. It fits our use case perfectly, and it's so useful that it's an open standard, that the data we emit can go to any back end. So if you have an OTLP collector already, we can just send it to you. That just makes things easier. And so I am part of some of the OpenTelemetry SIG groups. In the client-side SIG especially, I've seen really good growth over the last year: new people are joining, new proposals are being added, and questions are being asked like, "What's the best way to model a user session?" or, "If a crash occurs on the device, what are the semantic conventions around sending a crash report back to some OTLP collector?" That's been really exciting to be a part of, and I'm hoping that as we go through this teenage phase, the growing pains will happen, but soon we'll have a really good suite of semantic conventions around mobile-specific interactions. 

RD We've talked to Splunk about OpenTelemetry, and they seemed to feel that there was, especially for big players, some reluctance to get on these open standards, because once you instrument with them, you can move to anything. Was there ever any reluctance to move to the open standards, or were you all just like, "Let's get there"?

AE We see that as a benefit. It definitely came up that this is a great way to remove vendor lock-in, but from talking to our customers, that's what they want, so why stop that? It's great for the community, it's great for mobile observability as a whole, and we want to be leaders in this space, so we should lead that charge and say that vendor lock-in is not great for anybody. That's not a good experience, so let's jump onto the open standard, offer a good experience, and play in the space cooperatively. I don't think there's really any surprise there. There's definitely still innovation to do, but if we do that and somebody else tags along and wants to innovate with us or alongside us, then that's a good thing. I think existing players may have thought that vendor lock-in is a good way to protect their business. Hopefully that mindset is shifting. I'm not on the business side of the equation, though, but I do hope that mindset is shifting. 

EM It seems like it's shifting kind of in response to what customers are looking for. They don't want that kind of closed-ended commitment to a vendor. Is that what you've seen too?

AE Absolutely. And the big thing for us, in how we implement that on a technical level: some of our clients organize themselves in a way where they're writing what I call first-party frameworks. There's the application that they distribute to the app store, but beneath that they have a series of frameworks they build themselves, and they've organized their teams underneath it, with a tooling team or a UI component team that's building reusable components for them to use in their app. Or if they release two or three apps, they can use the same components across those two or three apps. And for us, if those first-party frameworks are linking to the Embrace SDK, that is really heavy vendor lock-in. It's such an aggressive setup where every single thing depends on Embrace. Whereas with an open standard, the goal is that each one of those frameworks can just depend on the OpenTelemetry API, use those shared concepts, and then the app can inject the vendor at that point, whatever that app is using. And then it elevates the entire community, because that company and those first-party SDK teams are much more willing to take on the project of getting observability into that part of their ecosystem. So it really helps break down that barrier just by introducing an open standard. 
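A rough Swift sketch of that pattern, assuming the opentelemetry-swift API package (OpenTelemetryApi); the framework and instrumentation names here are hypothetical.

```swift
import OpenTelemetryApi

// Inside a first-party UI-components framework: it depends only on the OTel
// API and never links a specific observability vendor.
enum ComponentTelemetry {  // hypothetical framework-internal helper
    static func trackRender(_ name: String, _ work: () -> Void) {
        let tracer = OpenTelemetry.instance.tracerProvider.get(
            instrumentationName: "com.example.ui-components",  // hypothetical name
            instrumentationVersion: "1.0.0")
        let span = tracer.spanBuilder(spanName: "render \(name)").startSpan()
        work()
        span.end()
    }
}

// In the app target, at launch, the app injects whichever vendor it actually
// uses by registering that vendor's TracerProvider with the shared API:
//
//     OpenTelemetry.registerTracerProvider(tracerProvider: vendorTracerProvider)
//
// `vendorTracerProvider` stands in for whatever provider the chosen SDK
// exposes; if nothing is registered, the API is designed to fall back to a no-op.
```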

RD Was there any difficulty in modeling all the telemetry from the mobile device to the standards? I know OTel has logs, traces, metrics, and now events. Are you able to model everything with those observability pieces?

AE Yeah. So we started out with traces and logs, and we'll start getting into metrics soon. Events are still very experimental. I'm part of the Swift SIG there, and I've tasked myself with getting the events implementation in the OTel Swift project up to the latest spec, because as that proposal goes on, certain things change and the developers need to keep up with that change. For us, modeling the telemetry that we were already collecting in a proprietary data model and shifting it onto the primitives provided by OpenTelemetry was a challenge, but the great thing about OpenTelemetry is that if I were observing a baseball game or a soccer game, I would be using these same concepts: a point in time when something happened, or a duration of time I want to monitor, expressed with traces and logs. Because it's so abstract and generic and simple, it's easy to map the concepts that we're using. A lot of them just fit, and then the complication became making sure that everybody in the organization understood that this was just new vocabulary for maybe the same concept. Then you get into the nitty gritty of, "Okay, here's a network request. How should we model that? What features are we providing as Embrace in our proprietary data model? Are all of those features congruent with this new shape and this new data model, or are there any features that may not fit?" And there were very few features where we were just like, "That doesn't fit," or, "We'll have to rethink that in this new paradigm and shift the entire implementation drastically." But because there are semantic conventions for a lot of these things from the server's perspective, and in some cases from the client side's perspective, you just kind of look: what is out there? What exists? Can we use it directly? Does it fit our need? If it doesn't, then let's take that design, get inspired by it, and maybe, say for low power mode, we'll use these keys and that'll be an Embrace-specific semantic convention that sits as a layer above those OTel semantic conventions. But the goal is that we participate, eventually push those down, propose them to these SIGs, get their feedback, see if it works for them, and then hopefully get them into the OTel semantic conventions. It's a process, but you just need to make sure that the back end consuming this data understands that it may change in the future, and we still allow for that. 
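To ground that, here is a sketch of mapping one concept, a network request, onto an OTel span, again assuming the opentelemetry-swift API. The HTTP attribute keys follow the current semantic conventions, which are still evolving, and "emb.low_power_mode" is a made-up example of a vendor-specific key layered above them.

```swift
import OpenTelemetryApi

// Sketch: model a completed network request as a span with semantic-convention
// style attributes plus one hypothetical vendor-layer attribute.
func recordRequest(tracer: Tracer, method: String, url: String,
                   statusCode: Int, lowPowerMode: Bool, work: () -> Void) {
    let span = tracer.spanBuilder(spanName: method).startSpan()
    work()  // the request itself happens here
    span.setAttribute(key: "http.request.method", value: method)
    span.setAttribute(key: "url.full", value: url)
    span.setAttribute(key: "http.response.status_code", value: statusCode)
    span.setAttribute(key: "emb.low_power_mode", value: lowPowerMode)  // hypothetical vendor key
    span.end()
}
```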

RD So what's the future of mobile observability? Will there be OTel instrumentation built into the operating systems at some point? 

EM Yeah, what are the college years going to look like, I guess? 

AE One thing we're struggling with right now, and one thing I'm very jealous of in the server space, is that most packages are observable. So especially with Ruby on Rails, or even if you use Django as your web server back end, you can find instrumentation that takes that incoming request all the way down through the database, and you can see the trace of how many queries were made during that request, whether there are N+1 queries causing issues and delays, and all the way back out, to really understand what your response time is at the server level. And that happens regardless of whether you're using Postgres or SQLite or any other type of database, ClickHouse or whatever it might be. On the mobile side of things, the instrumentation for those third-party vendor SDKs that many application developers use hasn't grown; it just doesn't exist yet. So the hope for the future of mobile observability is that these vendors start taking on the task of instrumenting their SDKs using that common OpenTelemetry API, so that when an application integrates with a vendor, maybe for app store purchases, that vendor allows them to inject in a tracer and a logger, and the application has an understanding of what's happening in that app store purchasing SDK or whatever it may be. If that's Apple exposing those and allowing that to happen, that would be amazing. If it's just developers with their own Swift packages who are into mobile observability and want to tag along and add in this ability, that would be amazing. So that's what I'm looking forward to, and that's the hope. The other benefit of OpenTelemetry I would point to is that this data can live side by side and mingle with the data you're collecting on your back end, and that is another massive benefit that we see. If you're using Grafana or some of these other OTLP collectors or vendors that support collection using the OpenTelemetry protocol, then this data can live alongside that. And that's immensely powerful, because then you can really start to see not just from when that request enters and hits your server, but from when that user taps the button on the device: how long it took and the path the logic followed, all the way through whatever microservice architecture you're using in the back end, however long that request goes unhandled, and then all the way back, hopefully, to that user. So that's something that is also really powerful that we're looking into, and we kind of get it for free just by using the OpenTelemetry standard. It's why we're really excited about this new data model.
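As a sketch of what that injected-tracer future could look like, here is a hypothetical third-party purchases SDK written against the OpenTelemetry API only; the class and method names are invented for illustration.

```swift
import OpenTelemetryApi

// Hypothetical third-party SDK: the host app hands it a tracer (from whatever
// vendor it uses), and the SDK's work shows up in the app's own telemetry.
public final class PurchasesClient {
    private let tracer: Tracer?

    // Passing nil simply disables instrumentation.
    public init(tracer: Tracer? = nil) {
        self.tracer = tracer
    }

    public func purchase(productID: String, completion: @escaping (Bool) -> Void) {
        let span = tracer?.spanBuilder(spanName: "purchase \(productID)").startSpan()
        // ... talk to the store APIs here ...
        span?.end()
        completion(true)
    }
}
```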

[music plays]

RD Well, it's that time of the show again where we shout out somebody who came on Stack Overflow and earned a badge for shedding the light of knowledge on fellow users. Today we're going to shout out a winner of an Illuminator Badge– someone who came on and edited and answered 500 questions, both actions within 12 hours.

EM Damn. 

RD So congrats to Cottontail for showing us the way, giving us the knowledge. I've been Ryan Donovan. I edit the blog here at Stack Overflow. If you want to reach out to me, you can find me on LinkedIn. If you liked what you heard today, please drop a rating and review. It really helps. 

EM I'm Eira May. I'm also on the Editorial Team at Stack Overflow. You can find me on LinkedIn and text-based social media @EiraMaybe. 

AE I'm Austin Emmons. I'm an iOS developer at Embrace. You can find Embrace at Embrace.io. I'm on LinkedIn– Austin Emmons should pop up. I'm based out of Pittsburgh if that helps. You can find me also in the CNCF Slack community, in the OTel SIG channel or the client-side SIG channels. Any of those SIG calls, feel free to join. They're public, the calendar is public on opentelemetry.io. And then Embrace has its own Slack community to talk all things Embrace or all things mobile observability. You can join that by going to embrace.io as well. 

RD All right. Thank you very much, everybody, and we'll talk to you next time.

[outro music plays]