The Stack Overflow Podcast

What launching rockets taught this CTO about hardware observability

Episode Summary

Austin Spiegel, CTO and co-founder of Sift, tells Ben and Ryan about his journey from studying film to working at SpaceX to founding Sift. Austin shares his perspective on software development in high-stakes environments, the challenges of hardware observability, and why paranoia is valuable in safety-critical engineering. Bonus story: Austin invited Elon Musk to speak at his student club…and he came!

Episode Notes

Sift is an end-to-end observability stack for safety-critical hardware development. See what they’re up to on their blog.

We talked to SpaceX about their testing processes way back in 2021. 

Connect with Austin on LinkedIn

Stack Overflow user TheScholar earned a Great Question badge by wondering How to create a new deep copy (clone) of a List?, a question that’s helped more than 200,000 people.
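The question is about cloning a .NET List, but the shallow-versus-deep distinction it hinges on applies in any language. A minimal Python sketch of the difference (illustrative only, not code from the question):

```python
import copy

nested = [[1, 2], [3, 4]]

shallow = list(nested)        # new outer list, but inner lists are shared
deep = copy.deepcopy(nested)  # inner lists are recursively cloned

nested[0][0] = 99
print(shallow[0][0])  # 99 — the shallow copy still sees the mutation
print(deep[0][0])     # 1  — the deep copy is fully independent
```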

Episode Transcription

[intro music plays]

Ben Popper Hello, everybody. Welcome back to the Stack Overflow Podcast, a place to talk all things software and technology. I'm your host, Ben Popper, Director of Content here at Stack Overflow, joined as I often am by my colleague and collaborator, Ryan Donovan, Editor of our blog. Hey, Ryan.

Ryan Donovan Hello.

BP You and I, once upon a time, did a nice four-part blog series with the folks from SpaceX, and a lot of that was about what it takes to make software that is robust and resilient enough that you can trust it in space. You’ve got this very expensive piece of hardware, you've got some people's lives on the line, how does it work? And so there were a couple of different pieces of that– there was telemetry and there was the hardware itself, testing, and we also got to talk to some of the folks at Starlink. Today, our guest is Austin Spiegel, who has some experience over at SpaceX and Starlink. We're going to be chatting a little bit about what he learned there, and then we're going to be chatting about Sift, which is a new company that he is helping to run, some of what they're doing taking the learnings from before and what they're seeing in the market with the product that they're putting out there today. So without further ado, Austin, welcome to the Stack Overflow Podcast.

Austin Spiegel Awesome. Thanks for having me, Ben and Ryan. I’m really excited to be here. And I do remember that four-part series. There were actually a couple of articles highlighting some tools that I worked on so it was really exciting to see that at Stack Overflow.

BP Oh, nice. Cool. Tell us a little bit about how you got into software and technology and how it is that you ended up at SpaceX. 

AS Totally. My background is I attended USC initially to study film, but was drawn into computer science and specifically computer science for games. So USC has a program that's focused on games, and during my time there, I was fortunate enough to– it's a bit of a crazy story– but I was in a small student organization and we emailed Elon Musk in 2014 and invited him to come to dinner with our 20-person student organization. And he showed up and spoke to us a bit about the Falcon 1 launch and how he was successful through multiple attempts in that, and that was very inspiring to me so I applied for an internship at SpaceX, abandoned my dreams to work in the game industry, and then joined SpaceX as an intern in early 2015. So that was my journey from film to software engineering. And then when I joined SpaceX, I was working mostly on internal software for manufacturing and test. I actually did eventually get to return to the game industry. I worked at Riot Games for a year after my five and a half years at SpaceX. 

BP Cool. Well, that's a legendary story. I'm sure many people have tried that invite since and I don't know how many have been successful. 

RD Now that you're at a safety-critical hardware observability company, the hardware and software in a rocketship is super mission critical. If that fails, people die, hardware explodes, you lose millions of dollars. What did you learn from creating software in that sort of high-pressure environment? 

AS Definitely. Thanks, Ryan. I could talk to a couple of things here. I think one is just the general software development process. So something we did at SpaceX was negotiate a software development process with NASA. NASA tried to impose a very stringent software development process for how they built software, and of course, SpaceX took a first principles approach and wanted to do things in its own way, so they negotiated a different process with NASA for developing software, and there were multiple tiers to that. So the tier that I used was what we called Class C software, which was specifically for ground support software, so anything that's not necessarily running on the vehicle but is in support of the mission. So that's a process that, while we are not necessarily required to use here at Sift, we've adopted ourselves in order to ensure that the product is of high quality and resilient. The second thing I'll talk to you a bit is about how I think the hardware architecture of Dragon and Falcon kind of trickled down into the entire culture of the company at SpaceX. So specifically Dragon and Falcon used a triplicated flight computer, which means there were basically three strings and each of those strings would vote on basically the next decision to make, to put it in layman's terms, and the majority vote would win. And the reason that you had this was, of course, in space you can have radiation that impacts your flight computer. You can have random flight computer restarts. So it was important to basically have a multi-fault tolerant system. So of course, then that trickles down into what we do here at Sift in that we collect all of our customers' data, whether that's from data review, so anything during the hardware R&D lifecycle. So you can imagine, even before you launch a rocket, I would actually say most of the data generated in the lifetime of a vehicle, whether that's a rocket or a spacecraft, is actually created before the vehicle was launched. 
Because you have thousands of simulations, or what we call ‘hardware out of the loop tests’ at SpaceX, and then you have ‘hardware in the loop tests’, which are basically all of the avionics or kind of the guts of the rocket splayed out on a table where test scenarios are run, and then you have ‘vehicle in the loop tests’ where it might be the fully assembled vehicle and you're kind of running a test on that. So there's thousands of these tests that occur in the lead up to a launch and that is when most of the data is generated so it's really critical to capture all of that data upstream of your launch or your deployment of your software, and then use that, review that data in order to proceed with the next step, whether that's launch or release or also use in the manufacturing process. 
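The "three strings voting" Austin describes is triple modular redundancy: any single computer corrupted by radiation is outvoted by the two healthy ones. As a rough illustration only — the function and values below are invented, not SpaceX flight code — a majority voter might look like:

```python
from collections import Counter

def majority_vote(readings):
    """Return the value reported by a majority of redundant strings.

    With three strings, one faulty computer (e.g. a radiation-induced
    bit flip or a random restart) is outvoted by the other two.
    """
    value, count = Counter(readings).most_common(1)[0]
    if count * 2 <= len(readings):
        # No strict majority: more than one string disagrees.
        raise RuntimeError("no majority — multiple faults suspected")
    return value

# One string returns a corrupted value; the majority still wins.
print(majority_vote([42, 42, 17]))  # 42
```

The same idea scales to five strings tolerating two faults, at the cost of more hardware, power, and mass — which is part of why these trade-offs ripple through vehicle design.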

BP I was talking to Ryan earlier today. Somebody had brought up what goes on during these massive training runs that companies are now doing for their frontier AI models, and there's obviously a huge investment in hardware. You build these giant GPU clusters and people have thrown around crazy numbers for what it costs to do one training run. So it's not quite as critical as actually putting human lives into a piece of hardware, but it may be just as expensive, who knows. And it was the only other time I'd ever heard somebody talk about cosmic rays. They were like, “We've got these three different data centers. We're trying to get this next frontier model with a quadrillion parameters out the door, and man, if a cosmic ray flips one bit on these GPUs, we've got to start the whole thing over again.” And I just thought, “Wow.” How do you get around that and why does this have to work that way? 

AS You need radiation hardening. And I think that's why at SpaceX too, a lot of the chips were custom-built for space. 

BP So tell us a little bit about the decision to leave. You said you were living your dream, you were learning a lot there and then you went on to games which is another place that you felt really passionate about. But you must've seen what you felt like was an opportunity. Often people with that engineering mindset think, “Okay, here's a pain point that has a big addressable market. Nobody's solving it the right way. This is my chance to be an entrepreneur and figure this out.” Was it you, was it you and a co-founder? What was the germ, sort of the genesis of getting to Sift? 

AS Of course. For me, I think I always in the back of my head knew that I wanted to start something, but I did not want to start something without really having a unique perspective on the world, or on a problem, I should say. And during my time at SpaceX, I was exposed to things that I would not have learned if I had gone to a big tech company, for example, specifically manufacturing. I don't think I realized how fascinating manufacturing was. I was always a big fan of watching How It's Made, but once you get onto a factory floor and you see rockets being assembled, at least for me, it was very moving. So I think in my five and a half years there, I got some unique perspectives on problems, one of which was collecting telemetry for rockets and satellites and then reviewing that data, and specifically how the existing observability tools on the market were very much geared towards software engineering use cases, whether that's application development, application deployment, or monitoring. So about two years ago I was actually looking for a new role and started to interview at some early stage hardware companies, and they were all building or looking for somebody to build this tool. And one of the observations was they were pasting together a lot of open source tools, so taking an open source time series database, taking an open source visualization tool, whether that's a more business intelligence focused tool like Metabase or a monitoring-focused tool like Grafana, pasting all these things together and then hiring a team of software engineers to go and maintain this as they scale the company. And I think one of the challenges for these hardware companies too is that ideally you're hiring software engineers that want to work on the vehicle, and people join these companies because they want to go and build a satellite or build an autonomous train. 
So it's a little difficult to find talent that is interested in building the support software, essentially, at these kind of early stage hardware companies. So when I had that realization, I reconnected with my co-founder, Karthik, who had joined my team at SpaceX and then went off to work on DragonFly software and lead a few Dragon missions to the space station, and he was a heavy user of this tool. So my team at SpaceX worked on this tool, he was a heavy user and it felt like why don't we take an understanding of how to build it and take an understanding of how it's used and go and build this end-to-end stack that will enable customers to stream us hardware sensor data, store it, analyze it, and then give them a rich user interface that's really geared towards hardware engineering use cases. So what we did is just a lot of user research, honestly, before founding the company. We were connected through the SpaceX network with roughly 30+ companies of all stages. So I spoke to people at companies that were just being founded all the way up to companies that launch rockets into space actively and compiled all of their feedback on what they would want in a tool and whether or not they would purchase such a tool and at what price point would they purchase that tool. And then from there we were kind of off to the races. We basically gained enough conviction that we should devote our time to going to solve this problem.

RD It's interesting. We've talked to a bunch of observability folks primarily doing software, some doing mobile, but very few doing hardware, and I wonder what the challenges are that are different, because like you said, hardware is different. You're dealing with physics. You can't set a breakpoint on a chip and say, “Dump at this point.” And the fact that you're also doing it on hardware sensors which are just gathering tons of data, and then you have to gather data on the thing that's gathering data. So what are all the challenges that you had to face there that are different than regular observability?

AS I think the number one challenge, and you kind of just alluded to it, is integration with the sensors and the vehicle themselves. A lot of customers in this space are very conservative about what software they run on their vehicle or on their device, and that's because these vehicles have power constraints. The battery is only so large, the amount of voltage they can get is only so high, and they also have bandwidth constraints. So a satellite only downlinks data for a certain period of time, and it will only downlink a certain amount of data, usually lower fidelity data. Maybe an autonomous vehicle out in the desert has a 3G cellular connection and it can't send all the data. So generally speaking, these customers don't want to just let anybody run software on their vehicle, and they do very clever things with the way that they construct their messages between their vehicle and their command and control software, specifically around how they pack data into a message to save space. It's actually interesting as a side note, there's actually a lot of overlap between the way that our embedded software engineers pack messages of telemetry and the way that networked video games or multiplayer games pack messages. There's a lot of clever ways to save space on message size. So that's basically the biggest challenge. Everyone comes with their own format, and then the challenge for us is how do we kind of build something that is adaptable to each of those formats. And the way that we've solved it is what I like to describe as a sliding scale. There's very simple ways to get data into the platform, whether that's uploading a file essentially, or using an existing protocol –influx line protocol is very popular– and then there's more complicated ways to get data into the platform. 
We have a lot of customers that use protocol buffers and we have a protobuf registry so they can register their protobufs and then forward or stream us data in an envelope and then we can deconstruct that. So that's pretty much how we've gone about solving that problem. 
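The space-saving message packing Austin mentions — common to both embedded telemetry and networked games — typically means fixed binary layouts instead of text formats. A hypothetical example using Python's struct module (this field layout is invented for illustration, not Sift's or any customer's actual protocol):

```python
import struct

# Hypothetical telemetry frame: little-endian u32 timestamp,
# u16 channel id, float32 sensor reading — 10 bytes total,
# versus roughly 40+ bytes for the same data as JSON.
FRAME = struct.Struct("<IHf")

packed = FRAME.pack(1_700_000_000, 7, 3.14)
print(len(packed))  # 10

timestamp, channel, value = FRAME.unpack(packed)
print(timestamp, channel)  # 1700000000 7
```

Every vendor choosing its own layout like this is exactly why ingestion has to be a sliding scale: simple paths (file upload, an existing line protocol) for some customers, and schema registries like the protobuf registry for those with richer custom formats.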

BP So one thing that occurred to me as Ryan was talking was that we've met a million companies that do this on the software side. It’s interesting that you brought up gaming. We were just talking on the podcast recently that people will pay a lot to make sure the streaming of the Olympics or the Oscars or the biggest gaming tournament goes down without a hiccup and there's no latency and no player has to complain that they missed a couple of frames. You saw obviously that lots of folks weren't doing that on the hardware side. Is that because maybe in part there's a certain universality to the software where it's easier to build something that applies to all different kinds of companies, whether they be TV streaming or SpaceX or gaming, whereas in hardware maybe these sensors could be very different– one's built for a submarine, one's for a weather satellite, one's for a tractor, I don't know. Do you feel like the product you're creating ideally would work for any company, no matter what kind of product or sensor suite they're using?

AS I’ve got two answers because I think you have a really interesting question about why hasn't someone necessarily solved this problem for the hardware space. So if I could answer that first and then we could answer the second one. 

BP Yeah, please.

AS There's a couple things. One, as a software engineer, people solve the problems that they know, and I think generally speaking as a software engineer, you know what problems you're encountering in your day-to-day job. So there's a plethora of observability platforms because we all need observability platforms on the software side and we have the skillset and tools to go and build them, and there's just kind of less awareness of this problem of hardware observability, so I think that's one thing. The other thing is, generally speaking, I think innovation on the hardware development and hardware development processes has not been as fast as innovation in software. So we see companies like SpaceX that have really pushed the boundary of how hardware is developed, and then there's a lot of companies that come either from the SpaceX tree or the Tesla tree, but beyond that, if you look at a lot of our legacy defense programs, for example, they're building hardware the same way they were building hardware decades ago. So I think that's part of why there hasn't been as much innovation because people aren't aware of the problems and hardware development practices haven't advanced to the degree that software development practices have. The way that I like to describe it to people is, perhaps you've read The Phoenix Project before and you're familiar with how software was deployed in the mid-2000’s. Basically you cut a release branch, you have acceptance testing, quality assurance, and then you fix all of your bugs, and then six months later you have this massive release and you cross your fingers and you hope that everything's going to work. And that's basically how hardware is developed today. So while in software we've moved on to Agile and continuous integration and continuous deployment and now we're deploying software multiple times per day, hardware is still deploying software once every two months. 
And part of it is because this vehicle is in the real world and the cost of failure is so high. If you have a satellite in orbit, it cost tens of millions of dollars to build and deploy that satellite and if you deploy software to it that causes that satellite to no longer work, there's no way to possibly fix it so you inherently want to be very cautious, but also the tools haven't enabled you to move any faster. So what we've seen is manual validation and verification of test data which of course is prone to human error and takes weeks to review, and that's kind of what causes a slowdown there. 

BP Makes sense. 

RD You have a blog post on your site that I like– “Only the Paranoid Survive.” What's the value of paranoia to hardware developers? 

AS Good question. So that's a quote from Elon Musk, and that's something that he kind of drilled into everybody's head at SpaceX. And I think it actually generally applies to not only hardware engineers or hardware developers, but also software engineers and founders. Basically you should be triple-checking everything. You should be very paranoid. Something that we do here at the company is kind of like a pre-mortem– so what are all the possible things that could fail. If this failed, what would the postmortem look like and what are all of the possible things, and then what are the things that we’re going to do to prevent them from happening. And that was a process that was used especially on the launch side at SpaceX. So for the Dragon 2 human missions, Demo-1, Demo-2, they used that process to basically ensure that everything was triple-checked. So I think that's really where the ‘only the paranoid survive’ saying comes into play, which is that the stakes of failure are so high so we should basically be thinking constantly about how we could possibly fail and then how we can remedy that.

BP We could talk a little bit more about Sift. You just got to a Series A. Lots of people who are listening are both software developers and some of them have been entrepreneurs or want to be. How do you think about taking a company forward in today's environment? I know you have thoughts about what it means to be a full stack engineer, and you're still at a point where your company is kind of small. Talk to us a little bit about how you stay lean, how you plan for a runway, and what are some of the tools and tricks that you're applying inside of your own company to try to keep a startup leveling up and moving forward? 

AS I think, and everyone's probably aware of this, at an early stage company, you're kind of expected to do a lot more than what would be in your job description at a later stage company. And that's generally how we've gone about constructing the team, and that goes to this concept of what is a full stack engineer? So bringing on people, especially on the engineering side, who not only can build the back end services, maybe dabble in front end, have some experience with infrastructure and DevOps, but also are customer-oriented or deeply care about ensuring that the problem they're solving or the solution they're building addresses the customer's needs. That has allowed us to stay lean on the product side especially, because it just would be very challenging to scale if we had a large product team that was writing product requirements documents for the engineering team and the engineering team was kind of walled off from the customers. The other thing is that we have really built a very senior team initially. My philosophy is to hire people who want to grow and learn quickly, give them challenging problems, and give them a lot of autonomy to go and solve those problems, and I think that keeps people motivated. Part of why we stay lean is that if we had a large team, I think for some people there would be fewer interesting problems to solve. And then the one thing I'll add on there is just exposing them directly to the customer. So when you see your user use your tool and see all the challenges that they have with that, it's very motivating for a certain type of engineer. And that's not to say that there's any problem if that's not motivating to you, it's just how we built our company here. That comes from my time at SpaceX.
Something that SpaceX was very intent on doing was co-locating all of the engineers with the manufacturing technicians so we could build software and we could walk a couple hundred feet from our desk and watch people use that software to build rockets and quickly realize all of the reasons why it wasn't actually helping them do their jobs better and faster. So that's how we've managed to stay lean thus far. 

RD I think that's really interesting to extend the stack all the way to the user at the end. Software developers aren't always the most socially adept people– no offense meant, listeners. How do you ensure that they're able to communicate with customers? 

AS So I think there's a couple of things. One is being super obvious that this is a core part of our culture so that when we recruit people, we are recruiting people that want to do that. I would never want to bring somebody in and then kind of change the expectation of what their role is, so we are very out loud that we want engineers who are customer-oriented. It's one of our core values, and then it's also something that we look for in the interview process as well. So a couple of the things we have, one specifically, is a presentation about a project that an engineer has worked on. So it's looking at technical communication ability, some orientation towards solving a business problem. So it starts really with the interview process and what our values are, and then within the company itself, it's connecting engineers with customers. So all of our customers– of course, I'm sure this is common practice– but we have shared Slack channels so we're talking directly to our customers on a daily basis. We have regular syncs. We're still an early stage company, we have roughly 10 customers, so of course we want to make sure that we are keeping our pulse on where everybody's at, so weekly meetings, and then always bringing engineers into those meetings, whether that's to gather feedback on something that they're about to build like a technical design doc or a product requirements document or demoing an upcoming feature and getting feedback there. And then finally, even extending that to an onsite visit. So we bring our engineers to our customer sites to basically shadow our customers as they use the tool, gather feedback, and then take another pass at the software. 

BP That's interesting. There's definitely a lot of value to being co-located and it's interesting to sort of be against the tide there. I don't know what the trends are, but a lot of companies went remote and continue to hire remote, especially if they're just on the software side. You can't really do better than having that instantaneous feedback loop. And we work with a couple of vendors, as clients and partners, where we have that shared Slack room, and that's really an amazing addition to have in your workflow– not this asynchronous email. It's not going to scale to 10,000 customers, but at the stage you're at, you can just check on the customer every morning if they have some quick thing and they want to hit you up, and it's almost like suddenly you have that rapport of being at the same company. They’re a client, but you're all in Slack together. That's a very cool thing that has been added with that integration.

[music plays]

BP All right, everybody. It is that time of the show. Let's shout out someone who came on Stack Overflow, and with their knowledge or their curiosity, helped everyone to learn a little bit more. “How do I create a new deep copy (clone) of a List?” TheScholar asked this question and was awarded a Great Question Badge. 200,000 people have benefited from your curiosity. So thanks to TheScholar, and congrats on the badge. I am Ben Popper, as always, Director of Content here at Stack Overflow. Find me on X if you want to chat, send me a DM. If you have questions or suggestions for the show, if you want to come on as a guest or listen to us talk about something, email us, podcast@stackoverflow.com. And if you enjoyed today's episode, the nicest thing you could do is subscribe and come back another time. 

RD My name is Ryan Donovan. I edit the blog here at Stack Overflow. You can find it at stackoverflow.blog. If you have something you want us to cover on the blog, please reach out. You can find me on LinkedIn.

AS And I'm Austin Spiegel, CTO and co-founder of Sift. We're hiring, so look us up at siftstack.com. And thanks, Ben and Ryan, for having me on the show.

[outro music plays]