The Stack Overflow Podcast

“The power of the humble embedding”

Episode Summary

Ryan speaks with Edo Liberty, Founder and CEO of Pinecone, about building vector databases, the power of embeddings, the evolution of RAG, and fine-tuning AI models.

Episode Notes

Pinecone is a purpose-built vector database. Get started with their docs here.

Connect with Edo on LinkedIn

Episode Transcription

[intro music plays]

Ryan Donovan Welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I am your humble host, Ryan Donovan, and I'm once again recording from HumanX. And today we have a great guest, Edo Liberty, founder and CEO of Pinecone, and we're going to be talking about the vector database, hopefully going deep into how it's used, its place in the AI stack, and where it's going in the future, and maybe even get some announcements. So welcome to the show, Edo. 

Edo Liberty Hi. 

RD Top of the show, we like to get to know our guests. How did you get into software and technology? 

EL I'll go all the way to the beginning. I tried to be a physicist. I started my undergraduate degree in physics and I knew absolutely nothing about software or computers and I figured I’d be a pretty lousy physicist if I didn't know how to code, so I took a minor in Computer Science. Long story short, I absolutely fell in love with the discipline, the math behind it, the building of imaginary structures. There's sort of an omnipotence that goes along with coding that there's something in your head that–

RD The power, right? 

EL Exactly. Through your fingers it sort of becomes a thing. I always found that to be intoxicating and I fell in love with it. I obviously followed it into my PhD and postdoc in Computer Science and Applied Math, and started my first company, became a director at Yahoo and then AWS and now at Pinecone. 

RD Don't often get a mad scientist route into computer science, so I love that. Now, Pinecone is known largely as a vector database company with hosting and everything that goes around with it. I see a lot of companies adding vector to their offerings. What's the benefit of having a vector-first database?

EL So I'll start by saying that vector is a data type. Everybody and every database needs to be able to support vectors as a data type, because AI natively generates embeddings for representing semantic data for recommendation, for search, for agents, and every platform that saves data needs to be able to deal with this data type. Every database out there, if it hasn't already, is shortly adding support for vectors. That doesn't make it a vector database, in the same way that the fact that I can put JSON in a Databricks Parquet file doesn't make it a document database. A vector database is natively optimized for vector search, for keyword search along with it, for filtering. It's really super optimized in terms of cost, performance, and scale for the workloads for which vector databases are needed. Those are now predominantly agents of various kinds, and I can dive into what that means exactly. You want to power them with knowledge and you want to make them actionable on the information you have in your company and your data. It's semantic search and search in general at large scale, and recommendation engines are still a pretty big use case. And for all of these, if you try to do them on something like OpenSearch or Mongo or Databricks or anything else that is sort of also a vector database, the performance-cost trade-offs you're going to get are either completely impossible or just not appealing. 
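
To make the distinction concrete, here is a rough sketch of what a filtered vector query looks like: brute-force cosine similarity over an in-memory list, with a metadata filter applied before ranking. The record IDs, fields, and vectors are invented for illustration; a real vector database does the same thing over approximate indexes at far larger scale.

```python
import numpy as np

# Toy "index": each record has an embedding and some metadata.
records = [
    {"id": "doc-1", "vec": np.array([0.1, 0.9, 0.0]), "meta": {"team": "search"}},
    {"id": "doc-2", "vec": np.array([0.8, 0.1, 0.1]), "meta": {"team": "billing"}},
    {"id": "doc-3", "vec": np.array([0.2, 0.8, 0.1]), "meta": {"team": "search"}},
]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def query(query_vec, top_k=2, metadata_filter=None):
    # Apply the metadata filter first, then rank the survivors by similarity.
    candidates = [
        r for r in records
        if metadata_filter is None
        or all(r["meta"].get(k) == v for k, v in metadata_filter.items())
    ]
    ranked = sorted(candidates, key=lambda r: cosine(query_vec, r["vec"]), reverse=True)
    return [(r["id"], round(cosine(query_vec, r["vec"]), 3)) for r in ranked[:top_k]]

print(query(np.array([0.15, 0.85, 0.05]), metadata_filter={"team": "search"}))
```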

RD And so we've talked to folks using vector databases for AI, obviously, but the power of the humble embedding, even without the large language models, has been pretty incredible to see. What do you think the power of that vector is? What makes it such a sort of magic thing that enables so many things?

EL So the embedding itself is a semantically processed, say, text. Let's just talk about text, for example. You can do the same thing for images and multimodal and many things, but for the sake of this conversation, let's talk about text and extend it in your head to everything else. The text itself is sort of a human-legible format, but it's actually a very poor format in the sense that even a typo makes a sentence maybe grammatically incorrect or makes a token-based search fail, and so on. So as a computer format, it's actually really bad. More than that, it's very semantically poor. I mean, the existence of a word in a sentence is oftentimes a very poor indication of what the sentence is about. These embedding models are actually very, very sophisticated machine learning. They're not quite as powerful as the larger LLMs because they don't have to be, but they're very powerful. So for example, if you do machine translation, you would encode a sentence in Arabic and translate it into Korean. If you slice that model in the middle and you say, "Okay, what are the weights, what are the activations that actually go from one level to the next in the middle between the encoding and the decoding," that's just a vector. The output of that vector, once it goes through the decoding, is a sentence in Korean. Clearly that vector contains the information in the sentence that was originally in Arabic. It's not even the same letter set, let alone the same words. So clearly the tokens themselves mean nothing, and somehow that vector representation in the middle is a lot more valuable, because you can then translate it into Japanese or English instead of Korean, or whatever. And so clearly that representation is significantly more deep, powerful, semantic, and actionable for computers. The main problem is it's really useless for humans, because you just stare at a thousand floating point numbers, and for me as a human–
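
One rough way to see the point about text being a poor computer format: a paraphrase can share almost no tokens with the original sentence and still land close to it in embedding space, while token overlap sees nothing. This sketch assumes the sentence-transformers package and a small off-the-shelf model; the sentences are made up.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Assumed model; any similar sentence encoder would do.
model = SentenceTransformer("all-MiniLM-L6-v2")

a = "The shipment was delayed by the storm."
b = "Bad weather held up the delivery."  # same meaning, almost no shared tokens

def token_overlap(x, y):
    sx, sy = set(x.lower().split()), set(y.lower().split())
    return len(sx & sy) / len(sx | sy)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

va, vb = model.encode([a, b])
print("token overlap:", round(token_overlap(a, b), 2))    # near zero
print("embedding similarity:", round(cosine(va, vb), 2))  # substantially higher
```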

RD Those numbers don’t individually mean something.

EL But it doesn't matter. I mean, the people who do search, sort of the entities who now search and need access to knowledge are agents and models, not humans. 

RD I think that translation to the vector location on a 760-dimension map was one of the hardest things for me to sort of grasp, because those individual numbers in the array, like you said, they're not meaningful, not to us. Part of how I understand the embedding models work is they sort of adjust based on locations in sentences, for some of them. They adjust the parameters and the vectors, they move them on the map to make sure they're closer to ones that are more meaningful, is that correct? 

EL So you're asking me about how embedding models are trained? 

RD Yeah. 

EL Sure. Actually there are God knows how many PhDs on this topic. I'm going to do an injustice to pretty much all of them and maybe summarize the whole thing in five sentences, which is, yes, you want to have an objective function for your model that says that sentences that mean the same thing map to points close together in the output space. Clearly if you have exactly the same input you should have the same output because it's a deterministic machine, but if you switch a word with its own synonym, you shouldn't see a big movement in the output layer. And so there are a myriad of techniques for how to generate good training data– what counts as similar sentences, how you gauge distances, how much data you should use– and of course there's a whole industry of architectures for how you create the attention models and the token embeddings and move weights from one layer to the next and so on. And I'm not even talking about all the mechanics of how to optimize networks in general, or the optimizations for backpropagation, all its multiple generations, batching, and pooling and all that. So it's a pretty deep stack, and luckily there are a lot of people working on these topics already. We actually just shipped, along with our database and in partnership with Meta and Nvidia, their Llama-based embedding model on GPU with accelerated compute from Nvidia, right next to our DB. So embeddings are becoming really, really good, and a lot of people are putting a lot of effort into making them better, and we try to not have a dog in the fight. People should have the best embedding model they can have, and we choose a subset that we host next to the database to make it simpler, faster, more secure, and easier to manage. But we encourage all of our customers to choose whatever works best for them.
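
To make the 'objective function' idea concrete, here is a toy in-batch contrastive (InfoNCE-style) loss of the kind commonly used to train sentence encoders: paraphrase pairs are pulled together and everything else in the batch is pushed apart. This is a generic PyTorch sketch, not the training recipe of any particular model discussed here.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor_vecs, positive_vecs, temperature=0.05):
    # anchor_vecs[i] and positive_vecs[i] embed two sentences that mean the same thing.
    a = F.normalize(anchor_vecs, dim=-1)
    p = F.normalize(positive_vecs, dim=-1)
    logits = a @ p.T / temperature          # cosine similarity of every anchor with every positive
    labels = torch.arange(len(a))           # the true pair sits on the diagonal
    return F.cross_entropy(logits, labels)  # reward the diagonal, penalize the rest

# In real training these would come from the encoder being optimized;
# random tensors just show the shapes and the backward pass.
anchors = torch.randn(8, 384, requires_grad=True)
positives = torch.randn(8, 384)
loss = contrastive_loss(anchors, positives)
loss.backward()
print(float(loss))
```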

RD Yeah, bring your own embedding model if you need it.

EL 100%.

RD I think the first sort of baby embedding model I saw was Word2vec, and that was the one where I sort of understood how this works. 

EL It's funny. Word2vec is sort of naive and primordial compared to what we have today, but it does a lot of really surprisingly good things. So this is, again, one of those things that is exciting– you can sometimes get a lot for a little. 
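
For a sense of how little machinery is involved, training Word2vec takes a few lines with an off-the-shelf library. This sketch assumes gensim 4.x; on a toy corpus like this one the nearest neighbors are noisy, since real runs use billions of tokens.

```python
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["a", "cat", "chased", "a", "dog"],
    ["the", "dog", "chased", "the", "cat"],
]

# Skip-gram model with tiny vectors; the parameters are arbitrary for a toy corpus.
model = Word2Vec(sentences=corpus, vector_size=16, window=2, min_count=1, sg=1, epochs=200, seed=1)
print(model.wv.most_similar("cat", topn=3))  # nearest tokens in the learned vector space
```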

RD Now we have all these really good embedding models that not only just apply to text but to images, to gestures, whatever. What are the sort of most surprising things you've seen embedding models and vectors used for? 

EL We've seen all sorts of weird use cases. I think the thing that I'll really point out is not what they're used for. For me, the most surprising thing is how effective they've become. The fact that we can today do semantic search for RAG with very little fine-tuning and very little playing around with it, and it does pretty well, is shocking to me as somebody who's spent God knows how many months of his life optimizing search paths and tweaking this and tweaking that and cleaning the data and bumping weights up and down just to get the machine to even look like it's working. The fact that you can today take a modern embedding model, put your text chunks into it in some semi-sloppy way, put them in a vector database, and it works pretty well– for me, that's shocking, because it was never possible before. 
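
Here is roughly what that 'semi-sloppy' pipeline looks like end to end: chunk, embed, store, retrieve by cosine similarity. The file name, chunk sizes, and model are arbitrary assumptions, and a real deployment would upsert the vectors into a vector database rather than keep them in a Python list.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def chunk(text, size=200, overlap=40):
    # Naive fixed-size word chunks with a little overlap.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words) - overlap, 1), step)]

document = open("handbook.txt").read()  # hypothetical source document
chunks = chunk(document)
vectors = model.encode(chunks)          # one embedding per chunk

def retrieve(question, top_k=3):
    q = model.encode([question])[0]
    scores = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-scores)[:top_k]]

context = "\n\n".join(retrieve("What is our refund policy?"))
```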

RD So is fine-tuning sort of not a thing you need as much anymore? Is it one of those things where you try the basics first and if you need a little extra then fine-tune, but otherwise just go for the basics?

EL So I don't recommend fine-tuning to almost anyone. We've tried experimenting with it, mainly with our customers, to see what benefits they get. I'll give you the bottom line. If you use a vector database with a good embedding and use context to enrich your prompt, you of course get a very big increase in the quality of your results. I'm assuming you're talking about fine-tuning a model on your own data?

RD Yeah, fine-tuning in the LLM.

EL Yeah. So you can do that, or you can fine-tune– literally retrain or post-train the model on your data– and you can do both. Obviously, doing both well is the best thing to do. Doing each one of them separately helps, but doing only RAG gets you 90% of the way there, and then if you augment it with fine-tuning, you get a little bit of a bump. Fine-tuning alone gets you very little lift. I forget the numbers exactly, but it doesn't get you even close to what you get with search. That's number one. Number two, fine-tuning is actually incredibly difficult. People don't know how to do it. They don't know how to organize the data. They find it very hard to retrain models and so on. Very few people have the capacity, the time, or even the desire to try, and even those who are committed to doing this unfortunately oftentimes make the models worse. It helps if you do it well, but there is a very large fraction of people who actually make their models worse. The models seem to behave better on their data, but what they don't see is that they've actually degraded the performance of the LLM on a thousand other tasks that they're not testing for, because they didn't train the model. They don't even know what data it was trained on. And so there are a thousand other tasks that this model was optimized for that you have now completely obliterated with this post-training. You trained your model to juggle, but you also gave it a frontal lobotomy at the same time. And it's hard to know, because you didn't train the model and you can't even experiment to see how it does. So some people– very few that we've met– managed to get both of those to work in conjunction and actually work. Very few choose to do it, and those who do are usually unsuccessful with fine-tuning. With RAG and search, we have found that, A, it's very easy to do, and B, the adverse effect of not doing it well is almost zero. Worst comes to worst, you add some context that wasn't great and the LLM needs to ignore more tokens in the prompt or the context, and it ignores them to some extent. So the adverse effect is almost zero, except that your cost goes up because you're now ingesting more tokens, but that's usually not catastrophic. 
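
For what 'use context to enrich your prompt' amounts to in practice: the retrieved chunks are simply placed ahead of the question and the base model is left untouched, which is also why the worst case is just a few wasted tokens. The prompt layout below is one reasonable convention, not something prescribed in the episode.

```python
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    # Number the chunks so the model can cite them.
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved_chunks))
    return (
        "Answer the question using only the sources below. "
        "Cite sources by number; if the answer is not in the sources, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

chunks = [
    "Refunds are available within 30 days of purchase with a receipt.",
    "Gift cards are non-refundable.",
]
print(build_rag_prompt("Can I return a gift card?", chunks))
```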

RD I think when we first started talking about this, everybody started talking about RAG immediately, and I think that paradigm of kind of naive RAG has become widespread. Everybody wants sources, everybody wants to get around hallucinations. What are the evolutions of RAG that you've seen? 

EL Pinecone has become somewhat synonymous with RAG because we were there when RAG sort of became a thing. And frankly, because people used RAG on our system all the time, we took it upon ourselves to educate people on how to do RAG well. And so we would post a lot of material that sometimes you could hardly even tell was from Pinecone– we would just post on how to embed and how to filter and what models are coming out and how they're behaving and so on. So just 'RAG' already does, like I said, a good enough job in most cases. So pick a good model, kind of scrape your data well, chunk it in some reasonable way, and you should already get 80% of the way there. What we've found is that getting from that 80 to 95 is actually very difficult for most, and it's sometimes very data-specific. One of the things we spend a lot of time on with our scientists and engineers is a product called Assistant, which does the ingestion itself of documents and PDFs and so on– all the parsing, chunking, embedding, organizing, and preprocessing of the data– but the much more complex part of it is the query. And so when you query– we have metrics on this that we should probably publish soon. Take a note to publish those soon. Assistant is already used by many thousands of organizations and it's growing very rapidly. Getting the right context to the agent in real time is not just a search problem– you have to search multiple times, you have to search multiple places, you have to combine the data in the right ways. So for example, if your agent is trying to answer a question about how my architecture changed from this version to that version, there isn't a document that has that information. The agent or the query executor needs to figure out, "Wait a second. They're not looking for one piece of information. They're looking for multiple. What was this version's architecture? What was that version's architecture? How do you even align those? How do you measure? How do you compare architectures?" And when you bring this data back, now you have to filter out irrelevant information. You have to organize it, at the very least. And at the end, you actually have to quite literally organize the text object that you push back to the LLM so that it's optimized for that LLM. So we see that OpenAI's models and Anthropic's and Llama just prefer the context organized slightly differently. They just do better if the context is organized the way that particular model prefers. 
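
A sketch of that multi-step retrieval: the comparison question is split into sub-queries, each is searched separately, and the results are merged and labeled before going back to the model. The corpus and the hand-written decomposition are invented for illustration; a real query planner would typically have an LLM produce the sub-queries.

```python
# Toy corpus and toy lexical scoring (count shared lowercase tokens).
corpus = {
    "arch-v1": "Version 1 ran a monolith on EC2 with a single Postgres instance.",
    "arch-v2": "Version 2 split the monolith into services behind an API gateway.",
    "billing": "Billing is handled by a third-party provider.",
}

def search(query: str, top_k: int = 1):
    q = set(query.lower().split())
    ranked = sorted(corpus.items(),
                    key=lambda kv: len(q & set(kv[1].lower().split())),
                    reverse=True)
    return ranked[:top_k]

question = "How did my architecture change from version 1 to version 2?"
sub_queries = ["architecture of version 1", "architecture of version 2"]  # hand-written decomposition

context_sections = []
for sq in sub_queries:
    for doc_id, text in search(sq):
        context_sections.append(f"### {sq}\n[{doc_id}] {text}")

# The assembled, labeled context is what finally goes to the LLM.
print("\n\n".join(context_sections))
```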

RD Is that in terms of chunking tokens or is that in terms of structuring the data as a whole? 

EL Yeah. So we opened up a context API for the Assistant product, so now agents talk directly to their own data in Pinecone in text and get back a JSON-structured object with the references and the chunks and everything. They don't even have to know about vectors at all. They're just, "I'm trying to complete this task." They send the chat history or the prompt and get the right context to complete the task. What you get back is a JSON object with a bunch of things: the snippets, the extensions, the references, and so on. Just that object– again, different models tend to use it differently– so it really helps when we know that you're calling Pinecone from Anthropic through the Model Context Protocol, because we can make sure that the output is optimized for Anthropic. 

RD I wonder how you found that. Did you find that in the wild? Was that something you were like, “Oh geez, what's going on over here?” or was that something that you heard from the model providers? 

EL No, you cannot hear that. They themselves don't know it, because how would they? We know this because, as a vector database, we are the knowledge hub for God knows how many.

RD It’s almost like you can’t hear your own accent.

EL Can’t what?

RD You can’t hear your own accent.

EL No, I can’t. But yes, the LLM companies are focused on their own model, as they should. We experimented very heavily and measured the accuracy of reducing hallucinations based on these models and we figured out that when you change the context structure, OpenAI does better and Anthropic does worse. When we change it this way, suddenly Anthropic does better and OpenAI does worse. And we're like, “Oh. Why don't we get better on both sides and just give everyone what they prefer?” We've just worked on it for so long that we've just figured out this is what needs to happen. 

RD So what's coming next for vector databases? What’s the one year, five years? What’s going to change with things?

EL So the fun part about building a database is that you always do the same things. You care about the same problems every day. You care about more scale, cost, performance. You care about more accuracy and more flexibility and the ability to provide knowledge to different software and parts of the stack, and you care about production readiness, control, security, compliance, and so on, and we make progress on all three fronts all the time. On scale and performance, by the time this airs, we will have already announced that we have now completely rearchitected the data layout to be very, very optimized for agents. People have used Pinecone for recommendation engines with thousands of queries per second. People have been using Pinecone for a long time for search on indices with hundreds of millions or billions of vectors. And suddenly we have this very new kind of workload where agents could have information on a folder of a hundred documents, but you might have a million of those folders, so now suddenly you have a million tiny partitions of a database. Some of them might not be accessed for a week, and then somebody comes in and runs 15 queries and then goes away, and you have to do this incredibly efficiently, and you have to stay up, and you have to keep all the data on blob storage, otherwise costs blow up, and so on.

RD So there's no possibility of caching with that rare of access. 

EL Yeah, exactly– you have to really create a situation where you cache a very, very small part of the index just so you can execute the beginning of the query while you're fetching what you need from blob storage, and you have to hydrate things quickly. You have to have a managed service where you can load a bunch of objects from S3 at the same time so you can execute that query. For those 200 milliseconds, you want 20 machines, because otherwise it would take way longer. So there's a ton of optimization that has to go into making those massive multi-agent use cases efficient and fast and so on, and so this is rolling out and it's a big change that we're very excited about. The interesting thing for engineers is how do you now support that at the same time that you support a thousand-QPS workload, or a million or 10 million or 100 million? How do you now support, with the same serverless system, a company that writes a thousand times a second and almost never reads, and a company that almost never writes and always reads? How do you make both of those efficient? So we'll put up some blog posts on how we created this adaptive indexing, so that the top levels of the LSM are really optimized for writes in those small slabs, and the larger and larger slabs all the way down invest more and more time in indexing and so on. So again, that's an intricate play on how you progress data through the system and mature it over time so that you can be both really high throughput for reads and high throughput for writes in a serverless system that seamlessly does all of those things. By the way, on scale and performance, we also just added keyword search. We already had sparse search, sparse indices, but again, with this architecture and our own implementations of sort of modern, but maybe not–
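
A toy illustration of that pattern: keep a small hot cache and, when a query lands on a cold partition, fetch the remaining index slabs from blob storage with wide parallelism so the whole thing still finishes quickly. The slab names and timings are simulated, and none of this reflects Pinecone's actual internals.

```python
import time
from concurrent.futures import ThreadPoolExecutor

HOT_CACHE = {"slab-0": b"..."}                  # tiny resident portion of the index
COLD_SLABS = [f"slab-{i}" for i in range(1, 21)]

def fetch_from_blob_storage(slab_id: str) -> bytes:
    time.sleep(0.05)                            # stand-in for an object-store GET
    return b"..."

def hydrate(slab_ids, parallelism=20):
    # Fan out: 20 sequential GETs would take ~1s; 20 parallel ones take ~50ms.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return dict(zip(slab_ids, pool.map(fetch_from_blob_storage, slab_ids)))

start = time.time()
index = {**HOT_CACHE, **hydrate(COLD_SLABS)}
print(f"hydrated {len(index)} slabs in {time.time() - start:.2f}s")
```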

RD Keyword search will never die, huh? 

EL Keyword search will never die, and it's actually really good. So on performance, we just posted some metrics on being more than 10x faster than OpenSearch, but the interesting thing is that we paired it with our own lexical model. We talked in the beginning about simple models sometimes doing really amazing things. One of the best models that we have– best, I'll say in a second, in what sense– is literally just, given a document, giving weights to words based on the context. So you do process the document with a proper model, but all this model does is assign weights to words.

RD Was that like TF-IDF? 

EL Exactly. TF-IDF and BM25 and all these things are heuristics: given a document, let's just guess how prominent or how important a word is for this document. Those are literally 50-year-old ideas. How common is this word in general, and how common is this word here, and let's just guess. It's no shock that models are way better at this. One example we give is the word 'does'– usually it's not important, but if it's an article about hunting and does, that's suddenly an important word. The token being the same doesn't mean it's not an important word in this document. So even just that, just saying, "Hey, just scrap BM25 and all that nonsense. I just want to put the right weighting for words in my search index," does extremely well. And the beauty of it is that at search time you don't have to apply a model at all, because you're just searching for the tokens, so it's extremely fast. I do want to say something about control, security, and compliance: AI touches all the data in your company. It almost needs to be more secure than even your analytical store, because this now touches HR documents and God knows what, so we have to hold a very, very high bar. First of all, for the enterprise tier, we already launched backups, admin APIs, audit logs, and a bunch of other stuff on top of all of our security measures– all kinds of encryption, access control best practices, and so on. But one thing we've started doing now is deploying into customers' VPCs. It's a dedicated tier that deploys BYOC– bring your own cloud, or bring your own account. We're only now doing this with a handful of design partners. We maybe have a place for one or two more, but that's going to be more publicly available soon. 
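
A sketch of the 'weights on words' idea: at indexing time something assigns each token in a document a context-aware weight (a lexical model in Pinecone's case; hand-picked numbers here), and query time is just a cheap lookup over those tokens with no model in the hot path. All the weights below are invented for illustration.

```python
from collections import defaultdict

# doc_id -> {token: weight}. Note "does" gets a high weight only in the hunting article.
sparse_docs = {
    "hunting-guide": {"hunting": 1.8, "does": 1.6, "season": 1.1, "rifle": 0.9},
    "grammar-faq":   {"verb": 1.4, "does": 0.1, "tense": 1.2, "question": 0.8},
}

# Build an inverted index: token -> [(doc_id, weight), ...]
inverted = defaultdict(list)
for doc_id, weights in sparse_docs.items():
    for token, w in weights.items():
        inverted[token].append((doc_id, w))

def keyword_search(query: str):
    scores = defaultdict(float)
    for token in query.lower().split():
        for doc_id, w in inverted.get(token, []):
            scores[doc_id] += w
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(keyword_search("hunting does"))  # the hunting guide wins, despite the shared token
```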

RD Okay.

[music plays]

RD All right, ladies and gentlemen. Thank you very much for listening. I've been Ryan Donovan. If you liked what you heard, disliked what you heard, have comments, ratings, reviews, email us at podcast@stackoverflow.com. And if you want to reach out to me directly, you can find me on LinkedIn. 

EL Thank you. Edo Liberty, I'm the founder and CEO of Pinecone. You can find us on LinkedIn, on X– write to us. Go to pinecone.io, and we'd be glad to see you.

RD All right. Thanks very much, and we'll talk to you next time.

[outro music plays]