Olga Beregovaya, VP of AI at Smartling, joins Ryan and Ben to explore the evolution and specialization of language models in AI. They discuss the shift from rule-based systems to transformer models, the importance of fine-tuning for translation tasks, and the role of human translators in ensuring reliable, high-quality output. They also touch on the implications of AI in language education and the challenges faced in implementing LLMs in enterprise workflows.
Smartling is an enterprise translation platform that includes AI-powered translation solutions.
Connect with Olga on LinkedIn.
Kudos to Stack Overflow user Suleka_28, who earned a Populist badge by explaining how to convert logits to probability in binary classification in TensorFlow.
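For readers curious about the badge-winning answer's topic: converting a logit to a probability in binary classification is a sigmoid transform. Here is a minimal sketch in plain Python (TensorFlow's `tf.sigmoid` applies the same function elementwise to a tensor of logits):

```python
import math

def logit_to_probability(logit):
    # Sigmoid squashes a raw logit (any real number) into (0, 1),
    # yielding the probability of the positive class.
    return 1.0 / (1.0 + math.exp(-logit))

print(logit_to_probability(0.0))  # a logit of 0 maps to probability 0.5
```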
[intro music plays]
Ben Popper Can a blockchain do that? Algorand has answers. Developers are using the open source Algorand blockchain to build solutions disrupting finance, supply chain tracking, climate tech, and more. Hear from devs, learn about the tech, and start building on-chain. Blockchain solutions aren’t hypothetical, they’re here. Check out canablockchaindothat.com. Can a blockchain do that? Algorand can.
Ryan Donovan Hello everyone, and welcome to the Stack Overflow Podcast, a place to talk all things software and technology. I'm Ryan Donovan, your humble host, and I'm joined today by the once and former king, Ben Popper.
BP Hello, everybody. I miss the Stack Overflow Podcast so much, but I'm not going away. Can't get rid of me, I'm like a fungus.
RD You're a fun guy, that’s right.
BP I'm a fun guy.
RD So today we have a great guest, Olga Beregovaya, VP of AI at Smartling, and we're going to be talking about their specialist LLM models and how creating a targeted specialist model with a vertical approach can apply to other industries as well. So Olga, welcome to the program.
Olga Beregovaya Thanks for having me.
BP So Olga, for our audience, just give them a little bit of background. How did you get your start in the world of AI and what led you to the position you're at today?
OB I started in the world of structural linguistics, and since natural language processing was moving more and more toward machine learning, we started in the rule-based universe: rule-based machine translation, rule-based sentiment analysis. Then we moved on to statistical methods, and then increasingly modern, state-of-the-art machine learning techniques took over the natural language processing space. Then came AI, then came transformer models, and there we go. It was just a natural progression of how human language is handled. So I pretty much had no choice, just like the rest of the world.
BP You had to ride the wave.
RD So how long ago did you all get into LLMs?
OB Well, I mean, maybe first things first, maybe we define LLMs as smaller LLMs.
RD The SLMs.
OB Well, I mean, SLMs actually are probably the next iteration, but we can say smaller comparative adjectives, smaller language models, and we've been in that space ever since we started, ever since we were able to use the likes of Axelera or BERT. Basically as soon as language models, transformer models, became available, I would say that's very much where Smartling started taking advantage of those models, because it made sense. So I don't know, I guess 2017 is when the famous paper was published– ‘Attention Is All You Need,’ so ever since we've been involved in the different sizes of language model space. Now the actual maybe LLM work –LLM as we know them now– I'd probably say maybe two and a half years, two and a half, three years would probably be fair because that's where we really see LLMs booming.
RD I think most of the time when people talk about LLMs, they usually mean a transformer-based model, so whatever size.
OB Right. And then as you add parameters, then there you go into the world of GPTs. And the more parameters there are, eventually you're like, “Okay, we're really using Gen AI,” or LLMs as we know them now.
BP So there's one sort of example of a focused AI that I always bring up, which is AlphaFold, trained to figure out this tricky puzzle of how a protein is going to be shaped, and in doing so, hopefully open up some really interesting paths for scientific discovery and medicine and things of that nature. But I know one of the things we wanted to focus on today was how you're taking advantage of some of the cutting-edge work being done at these big AI labs that are putting a lot of stuff out, the fact that there's a lot of really powerful open source work being done here, but then also fine-tuning and specializing your model for translation to see performance gains over what the cutting-edge models can do just in this particular domain. So can you tell us a little bit about that?
OB Let me give it a try. Smartling is a language AI, actually we trademarked the term 'language AI,' an AI-powered translation management and global content transformation platform. The translation and global content management piece immediately puts us into the specialized, fine-tuned space, where we need to fine-tune models and design our prompts specifically for translation tasks. There is the other side, obviously, we also talk about using AI in our day to day, but I think we're better off focusing on the translation piece. So first things first, you're absolutely right. As the world got excited and OpenAI APIs became available, we obviously did a lot of work with general purpose foundational models, and they do serve certain purposes. To some degree you can run quality estimation, and to some degree you can use them for translation. But then again, that 'to some degree' comes with a catch. What we learned as we deployed foundational models is that it's all good, but since they can do everything, they can calculate, they can summarize, they can draft a legal brief for you, it became very, very apparent that we needed to do something different. And as the models took to hallucinating in foreign languages and failed to reflect the cultural phenomena of the local languages at all, it became even more apparent that we had to do something different. That's where we went into multiple areas of fine-tuning and long-tail language coverage, because the problem of language coverage has not been solved by general purpose foundation models.
RD I'm definitely interested in the language translation. I've dabbled in a couple of languages. I love reading literature and translation and I wonder how much that fine-tuning, some people may think of it as you just create similar embeddings for similar sentences across languages. How do you reflect the nuances?
OB Since we are a language management, translation management platform, we obviously sit on extremely rich and, to a great extent, labeled data. It can be labeled for quality, for edit distance, for domain. So you have all this data at your disposal, and that helps you optimize translation, language assessment, or what we would refer to as 'smoothing' tasks. The first step for us is that you absolutely need a bilingual corpus; there is only so much we can extract from a monolingual corpus. And then, as you said, for certain tasks you do need a multilingual vector space representation. For simpler fine-tuning tasks, you can actually get by with just tokenizing, then deduping and preparing your possibly noisy data for the actual fine-tuning task. That's where you deduplicate, disambiguate, and remove longer sentences that tend to confuse the model, and that's where you start putting your parallel corpus into play. Then you can use it for fine-tuning, or you can even use your data examples for few-shot translation tasks. Maybe one thing I should say is that immediate translation is just the tip of the iceberg. There is also smoothing, quality estimation, and edit effort estimation. There are multiple ways of managing terminology and injecting locale phenomena, so translation is just a small part of it.
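The data-preparation steps Olga describes, deduplicating and filtering overly long sentence pairs, can be sketched as a small filter over a parallel corpus. The function name and token limit here are illustrative assumptions, not Smartling's actual pipeline:

```python
def clean_parallel_corpus(pairs, max_tokens=100):
    """Deduplicate and length-filter (source, target) sentence pairs
    before using them for fine-tuning or as few-shot examples."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        key = (src.strip().lower(), tgt.strip().lower())
        if key in seen:
            continue  # drop near-exact duplicates
        # Very long sentences tend to confuse the model, so drop them.
        if len(src.split()) > max_tokens or len(tgt.split()) > max_tokens:
            continue
        seen.add(key)
        cleaned.append((src, tgt))
    return cleaned
```

A real pipeline would add proper tokenization and noise/disambiguation checks; this shows only the shape of the dedupe-and-filter pass.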
BP So you mentioned that there's an advantage for your organization in specializing in translation and languages over a general purpose cutting-edge LLM. I remember hearing a discussion from one of the big AI labs that if they tried to extend a model's capabilities to be multilingual, so that it wouldn't just communicate in English but also French, German, and Japanese, it would become a little bit less accurate and lose some of its expertise in each language. As it tried to stretch to encompass more topics, it would fade a little in its clarity on them. You mentioned, though, that your organization is obviously multilingual; monolingual is not interesting to you. So where does that balance come in, if it does at all for you, between specializing in one language versus having them all within the same model?
OB Well, as of right now, and that's where we started at the beginning when we said that small language models may be the future, we definitely experiment with single language models using open source, the likes of Llama; Llama was probably the most accessible. But we still see that if you prompt in English and expect output in multiple target languages, that's one of the ways you avoid the degradation. You're absolutely right, though: general purpose foundational models can be a jack of all trades and, at the end of the day, master of none as they start spreading across a variety of languages. We've met with a lot of AI labs who intentionally stay within a certain set of languages to avoid this quality erosion. But again, there are so many ways you could handle it: you can fine-tune, you can set model parameters, you can do model distillation for specific cultural phenomena and specific languages. So there is quite a bit you can get out of a foundational model even without fine-tuning, but you need to be very, very specific with your prompt engineering. And there are a lot of techniques, like self-checking loops, where you can still use a multilingual model but make sure it delivers accurate results for local cultural phenomena. Also, Smartling supports 250 languages. How many people would I need in my R&D lab to build a single language, single translation direction model for each? So it's still beneficial for us to go multiple directions within a single model.
RD You said that Smartling is a translation management platform and that these other LLMs do effort estimation, quality estimation. Does that mean there's still human translators employed?
OB I might not be making many friends here with my next statement. We use a lot of both textual and semantic, both referential and non-referential quality estimation methodologies as part of our workflow and as part of what our research lab does, and we see that for certain languages and certain complexity levels, because languages differ in complexity, for simpler Romance languages we are almost at human parity. Our set of metrics clearly shows that semantic similarity is extremely high and referential metrics show very low edit distance. So we're almost at human parity, and we at Smartling, and the translation industry in general, are pretty confident that for certain languages and content types, we're only this far away from human parity. Now for the next couple of years, the human in the loop is absolutely a factor, because we need to compensate for model deficiencies. We need our linguists to validate prompts. There is no way under the sun I can validate a prompt in Swahili; I cannot validate an input. So you need linguists to write and validate prompts, you need linguists to review for factual accuracy and compensate for grammatical deficiencies, and you also need linguists for direct assessment, because model predictions are only as good as their correlation with direct assessment. So there are at least three niche areas where we do see human translators, linguists, and prompt engineers playing a role. Once we've reached human parity, and once models learn to self-label and self-annotate and unsupervised learning starts taking over, then obviously the role of the human in the loop is inherently going to shrink. But we're not there yet.
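The "edit distance" referenced here is typically Levenshtein distance between the machine output and its human-edited version; a low distance means the translation needed little correction. A standard dynamic-programming sketch:

```python
def edit_distance(a, b):
    # Levenshtein distance: minimum number of single-character
    # insertions, deletions, and substitutions turning a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

In translation workflows this is usually computed over tokens rather than characters and normalized by length, but the recurrence is the same.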
BP These models are so incredible at language, do you really feel that in three to four years I'll be able to have AI on the edge with my phone in my pocket and my AirPods in my ears that is capable of carrying out a conversation between me and a stranger who are linked up on the same service so that we're having near real time translation, the way someone at the UN always has an earpiece in their ear and someone's giving a speech and they're hearing it in their native language. Am I going to be able to travel the world and talk to anybody in English and they're going to speak back to me in their native tongue but we're going to understand each other?
OB I mean, first things first. Again, we operate in the translation area, but automated interpretation in conference or other settings is already a reality. I'm not going to mention any names, but there is a handful of really solid automated interpretation platforms out there that used to do ASR to text, then text translation, and then text to speech, and now with multimodal language models you can actually do it all within a single model. Now my question to you would be: you could do it just fine with Google Translate right now. You can actually speak into your phone. That's what I do, for instance, when I'm in a taxicab in Tokyo, because that's my only way of communicating. So the models already exist, and the models local to your phone applications already exist. I think what's going to happen is they're just going to continue evolving; the level of comprehension and the quality of synthesized output are going to go up. But what you are describing as the future is already a reality.
RD My parents got some headphones that do that now for something cheap. I don't know how good they are, but they have that.
BP I had this experience when I traveled to China in 2018-19. I used Google Translate, but it was an awkward experience: I'd speak into the phone, it would type out what I wanted, I'd show it to them, then they'd say something and it would show it back to me in English. It made a lot of mistakes, it didn't feel real time, and it certainly didn't feel like we were connected on one service. So I think what you're saying is, at this point in time, in the cab, it's pretty close to being able to understand both languages and quickly produce an output. What we need is the UI/UX. We need to productize that experience.
OB We need to productize, and we definitely need to improve the UX. There is one component, though. If you're saying 2018-19, you're talking about using neural machine translation as your translation engine. Right now, the tendency is moving more and more toward large language models fine-tuned specifically for translation tasks as your translation vehicle. We're doing a lot of benchmarking, and because neural machine translation operates at the single sentence level and is not contextual, with certain tweaks and the right prompting and tuning data, you can actually get a better translation now from translation-specific language models. So I think part of your experience was that you were using a previous generation of translation models. Now it's less awkward and more contextual, so you'd probably have a better experience if you put yourself in a cab in China today.
BP A lot has changed since 2018, absolutely.
OB Definitely. Well, I would say a lot of things changed from last week or from yesterday.
RD So we're talking about the sort of targeted LLM models. Outside of your specific use case, what do you think the benefits are of using a sort of specialist model over a generalist model?
OB Again, generalist models can perform multiple tasks, but there will still be flaws in different tasks. So I think specialized models will have a role in virtually every domain that deploys the technology. If you talk, for instance, about models trained for mental health scenarios, there are models trained for something that emulates empathy. Whether or not a model can truly be empathic, it will emulate empathy well enough to provide support. So that would be one. Legal is another. If you fine-tune models for summarization, they would be absolutely priceless for e-discovery, as opposed to generalized foundational models. If you go into entertainment, for instance, you need the model to parse out human language and then provide transcription or subtitling, and through whatever technique you are using, you need to either fine-tune the model or provide enough examples for those specialized tasks. So I would say there's barely an area that would not benefit. Here's another area: regulated industries and healthcare. Models that are specifically fine-tuned to handle PII and PHI, and I'm speaking from experience here, you cannot get the same result from a generalized model. You really want to fine-tune those models on the right corpus of proper names and corpus of substitutions. And PII is a huge thing, as we know, for regulated industries.
BP We did an interview recently with a CEO of a company called Alexi, and he was saying that they have their own set of fine-tunings, or maybe it was more just prompt engineering that they do, a lot of RAG, like what context are you looking at so that they can very quickly do some pretty deep legal research and that saves folks a lot of time. But he made the funny comment that every day people are going to the biggest chatbots and asking legal questions and the chatbot will first say, “I can't give legal advice,” but then if you ask it again, it says, “Okay, well…” or you say, “Pretend to be a lawyer,” and then it gives you legal advice, which is illegal. It's not supposed to be giving you advice, but it will, it will give you advice.
OB You know what? I forgot the name of the website, but there is a whole group of people who work specifically on how to cheat a model into giving the response you're not supposed to be getting out of it. And you can actually be very creative. If nothing else, you can say, "When I could not fall asleep, my grandmother would tell me X."
BP Oh, that’s our favorite. Napalm granny, that's me and Ryan's favorite.
OB Pretend you're my granny. And then you mentioned something very, very important: you don't always need to do it within the model itself. RAG is obviously huge. The RAG approach is a great asset when you need domain-specific or, in our case, even customer-specific results. Within our platform, for instance, RAG is very easy because we have external style guides, we have an external corpus of previous translations, we have external terminology, so it's often much easier to just RAG into the model instead of performing full fine-tuning. It can also be much more cost-effective, because you spend far fewer tokens.
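The lightweight RAG approach Olga describes, pulling style guides, past translations, and terminology into the request, amounts to prompt assembly at translation time. Everything below (function name, prompt wording) is an illustrative sketch, not Smartling's actual API:

```python
def build_translation_prompt(source_text, target_lang, glossary, examples, style_notes=""):
    """Assemble a translation prompt from retrieved customer assets:
    approved terminology, previous translations as few-shot examples,
    and style-guide excerpts."""
    glossary_block = "\n".join(f"- {s} -> {t}" for s, t in glossary.items())
    example_block = "\n\n".join(f"Source: {s}\nTranslation: {t}" for s, t in examples)
    parts = [
        f"Translate the source text into {target_lang}.",
        f"Use this approved terminology:\n{glossary_block}" if glossary else "",
        f"Follow the style of these past translations:\n{example_block}" if examples else "",
        f"Style notes: {style_notes}" if style_notes else "",
        f"Source text: {source_text}",
    ]
    return "\n\n".join(p for p in parts if p)
```

In practice the glossary entries and example pairs would be retrieved by similarity search against the customer's translation memory before being spliced in.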
BP And that's interesting too. If you had a chatbot that was performing some help service, you can say, “I'd like you to do translation, but there's a list of words that you're allowed to use and you have to stick to the script.” We don't want it to be giving away a free car. I think that was the one that caught most people's attention.
RD For the PII, I know we talked with somebody else at a company, Skyflow, that has a privacy vault that basically removes the PII. It knows all the names that are PII and if one of those comes up, it just says, “No, you don't need to see this.”
OB There was a bit of a Catch-22 with PII, though, because often, due to legal and regulatory requirements, you cannot surface personal information to the model, but sadly, the model is the best thing to handle PII. So go figure: the model is best at doing the very thing you're not allowed to give it. This is where open source and self-hosted, locally hosted models can be a great way of handling it. That's probably the buzz, or again, the news of the last three weeks: self-hosted, smaller open source models that are much more fit for individual tasks than generalized models.
RD I have a question for you as a linguist. I keep reading all these articles that are about bemoaning the loss of foreign language education. Do you think that we lose something by not being able to study multiple languages?
OB I mean, the whole field of linguistic anthropology and socio-anthropology, with the emergence of Gen AI, is a huge area of study, because there are multiple things at play. Not only might you lose the ability to learn a foreign language, but unless the model is really trained on the anthropological phenomena of, I don't know, I'll take a remote African region, you might get access to the language, but you're not going to get access to the cultural phenomena. So the impact is not just linguistic; the impact is sociopolitical and cultural, and again I will repeat the word 'anthropological.' Yes, it definitely is a problem, but it's a double-edged sword, because on the other hand it actually opens access to foreign languages. There are a lot of AI-driven, personalized language learning applications that actually drive study, so I think it's all a question of implementation and deployment. Just as language models are used for translation, you can equally use them for education. There is a term I heard at a conference that really made me wonder how accurate it is: 'linguistic colonization.' You bring in the English language and everybody has access to English, so what does that do to local languages, to long-tail languages, when at the end of the day the information is accessible in English and you can easily translate it into your local language?
RD I mean, I've heard it said that the majority of the world speaks an imperial language one way or the other. All of the major languages are there because of empires.
OB Absolutely so. I mean, if you look at it from a historical perspective, that's exactly how it is. So I would agree with you that it poses a certain threat, and I think that also justifies the existence of a lot of regulatory committees and governing bodies and international governing bodies, because preservation of local languages is one of the areas that's definitely of concern.
BP All right. Well, I propose that we train one final AI language model that speaks Esperanto. Was that the universal language? You can find a way to communicate with everybody in this same neutral language and we'll all have to learn that.
OB If we remember correctly, Esperanto was not necessarily a shiny success. I mean, the intent was definitely there, the adoption was not there, and then we had the phenomenon of interlingua, where the interlingua was supposed to be a representation language between two different languages. So maybe we'll get there again. I think right now it's extremely difficult to predict the future, but where we do want to find ourselves is equal coverage for dominant languages and for long-tail languages, and I really love a lot of initiatives that are taking place now. A language may not even have a script, a language may not even have an alphabet, and people are going out and recording and then using transcription and using ASR to actually give written representation to long-tail languages. So again, there could be some drawbacks, but there is a lot of good things coming out of implementing AI for language translation and language acquisition and learning.
RD So we've talked about the ideas behind it. You've implemented these translation workflows in practice, in business situations. What has that experience been like? What are the pitfalls you run into?
OB We actually developed, I personally developed, a 'CTO talk track,' which helps our localization partners at different organizations, people from translation and localization departments, explain why just plugging in ChatGPT, or now the GPT API, and assuming all your translation needs have been solved is not the silver bullet it appears to be. And this is what we learned. First things first: inference time and latency. Unless you've figured out some kind of multithreading or hitting various API endpoints, you can actually just stall your translation process, and that's one huge area of concern. At Smartling, we do a lot of work on making sure we don't stall, especially when we have to provide instant translation. So that would be one. Language coverage is another. You have companies or nonprofit organizations that operate in 150 regions, and we need to find the corpus, fine-tune an existing model, usually an open source model, for that language, and then pray that you can actually vectorize that language. Another one is model hallucinations. As we said, the model can just have a mind of its own and produce completely irrelevant responses, and that amplifies as you go into multiple languages. So you use things like token log probabilities or semantic entropy, things that can actually help you mitigate model hallucinations; otherwise we'd be delivering complete gibberish to our customers. Governance is a huge issue. There are things you can do in certain countries and things you cannot, so you have to adopt a multi-model approach. Here's your AWS shop, here's your Azure shop, here's where you can only use a native API, and here's where you can only host Llama. So basically there are challenges and wins at every turn.
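One of the hallucination signals mentioned here, token log probabilities, can be sketched as an average-confidence check over the logprobs many LLM APIs return per generated token. The threshold below is an illustrative assumption; in practice it would be tuned per language and deployment:

```python
import math

def flag_low_confidence(token_logprobs, threshold=-1.5):
    """Return (avg_logprob, suspicious) for one generated translation.
    A low average token log probability is a cheap hallucination signal:
    the model was 'unsure' of many of the tokens it emitted."""
    avg = sum(token_logprobs) / len(token_logprobs)
    return avg, avg < threshold
```

A flagged output wouldn't be discarded automatically; it would be routed to a stronger check (semantic similarity, a reviewer) before delivery.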
BP I have to say, as a podcast host, one of the interesting things now is services like Riverside and Descript say, “Hey, you want this versioned in 30 languages? We'll do it for you. Just tell us what language you want this podcast to come out in and it will come out in that language.” And the only thing stopping me is I can't do quality control. I'm not going to turn this in and then go find people who really speak that language and say, “Does this sound good, not good?” But this offer, like you said, is now on the table of like, “Oh hey, this is just turnkey. You want us to just take your stuff and translate it? We'll do it, no problem.”
OB And for certain content types, for general domain content types, it can actually be accurate. But if you operate in a certain space, and actually you can have models self-police themselves and check their own quality, and the whole area of quality estimation of generative AI output using generative AI, that's a huge field of both academic study and in the enterprise world.
BP That's right. You need this chain of agents, one of whom is the language police at the end who's like, “I'm really here to make sure this grammar and syntax and idioms are correct, okay?”
OB Correct. But then this agent will actually send it back and say, “Hey, something is off, so go back and look.” So there are a lot of fun things you can actually do with Gen AI now in the enterprise translation space.
RD The quality control agents you have, are they monolingual by necessity or can you do multilingual quality assurance in a single model?
OB You can actually do multilingual, because at the end of the day, your prompt can be language-neutral, so you can flag what kinds of mistakes to look for. You can introduce an error taxonomy, and that error taxonomy is going to be language-neutral. And since it's language-neutral, you can keep adding more and more languages. Sometimes a language won't be supported by the model, but then you can just use pure semantic similarity, and language coverage becomes irrelevant.
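The "pure semantic similarity" fallback amounts to embedding the source and the translation in a shared multilingual vector space and comparing them. Given two embedding vectors (from any multilingual encoder), the comparison itself is just cosine similarity:

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors:
    # values near 1.0 mean the texts are semantically very close.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Because the encoder, not this function, handles the languages, the check works even for language pairs the translation model itself covers poorly.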
BP It should flow the other way. Ryan and I should publish a podcast and then the person with their app in the end should say, “Translate this into whatever language I want.” And then they can consume any content they want in their native language, and it's on them if it doesn't sound great.
RD I mean, then we're saying possibly nonsense in other languages. Again, the quality control.
BP I guess so.
OB There are two steps at it. There is quality estimation and quality assurance. Quality estimation is a part of your agentic workflow. Quality assurance is the final seal of approval, and these are two dramatically different things.
[music plays]
RD All right, everyone. It's that time of the show where we shout out somebody who came on to Stack Overflow, shared a little curiosity, dropped a little knowledge. Today, we're shouting out a Populist badge, awarded to somebody whose answer on a question outscored the accepted answer. Today's badge goes to Suleka_28 for answering: "How to convert logits to probability in binary classification in TensorFlow?" If you're curious about that as well, go check out the answer; it's in the show notes. I am Ryan Donovan. I host the podcast and edit the blog here at Stack Overflow. If you have questions or concerns, or want to suggest a guest or possibly be a guest yourself, email us at podcast@stackoverflow.com. And if you want to reach out to me personally, you can find me on LinkedIn.
BP Hey, everybody. I am Ben Popper, one of the hosts here of the Stack Overflow Podcast. You can find me on LinkedIn, you can find me on X. If you are interested in suggesting guests or topics for the Stack Overflow Podcast, we'd love to hear it. If you work at an organization that has anything to do with turning Figma designs into code, come talk to me, I got some ideas for you. And other than that, I'll pass it along to you, Olga.
OB So I’m Olga Beregovaya, Vice President of AI, formerly Vice President of AI and Machine Translation, but now it's all one thing at Smartling. And the best way to find me would be my LinkedIn profile, or if you're interested in learning more about AI at Smartling and language AI in general, just follow Smartling on LinkedIn.
RD All right. Well, thank you very much, everyone, and we'll talk to you next time.
[outro music plays]