
9 June, 2023

ChatGPT understands Danish too!

[Image: an illustration of a vector space representing words and their relationships]

Large language models are more multilingual than you think. Search through Danish documents alongside English to streamline your organisation’s data needs.

Article By

Anders Stendevad, Data Scientist / Data Engineer, Flowtale

When you search Google, you find pages in the language you searched in. But what if what you seek is in another language? A few years ago you would have been out of luck. Today this problem is solved by large language models (LLMs). Not only are LLMs multilingual, they also generalise knowledge across languages. This blog post explains how your business can benefit from connecting knowledge across language barriers. Now is the time to link internal documents, policies and strategic messages to LLMs to make your business more successful.

Data management is a task that every organisation has to deal with. Ideally, documents are organised and written in a language that everyone can understand. However, this is almost never the case. Language, structure, and policies are ever changing. Especially in multilingual organisations, information is not only scattered across locations; it may also be scattered across languages.

LLMs like ChatGPT can not only catalogue this vast sea of data, they can also keep that data alive and available. At Flowtale we enable the use of ChatGPT for many use cases, which we cover in our blog post Harnessing the Power of AI: Implementing ChatGPT in Your Azure Environment. We consider it natural that LLMs will become integral to keeping knowledge organised and available.

Long gone are the hours of searching SharePoint, Confluence, and file servers for out-of-date documents. Instead, simply ask your own private ChatGPT on your own private tenancy. Find your documents and get a summary of their contents. When documents are in multiple languages, you might expect this functionality to break down. This is not the case: LLMs work in vector space, so they can freely understand and catalogue data from any language.
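
To make this concrete, here is a minimal sketch of cross-language document search with embeddings. It assumes the OpenAI Python client (v1+) with an API key in the environment, plus numpy; the example documents, the model name and the helper function are illustrative only, not a description of any particular setup.

```python
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts):
    """Return one embedding vector per input text."""
    response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in response.data])

# A tiny "document store" mixing Danish and English internal documents.
documents = [
    "Rejsepolitik: medarbejdere skal booke fly gennem indkøbsafdelingen.",
    "IT security policy: passwords must be rotated every 90 days.",
    "Personalehåndbog: ferie optjenes med 2,08 dage pr. måned.",
]
doc_vectors = embed(documents)

# An English question whose answer happens to live in a Danish document.
query_vector = embed(["How do employees book flights?"])[0]

# Cosine similarity between the query and every document, regardless of language.
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
best = int(np.argmax(scores))
print(f"Best match (score {scores[best]:.2f}): {documents[best]}")
```

In a real setup the matching documents would then be passed to ChatGPT as context so it can answer or summarise in the language you ask in.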

LLMs are trained on a lot of multilingual data

But how is this possible? To get to the conclusion first: the answer is that LLMs are trained on a lot of multilingual data. They have simply seen many languages during training and have been trained to understand the meaning of words, sentences, and even document types. This works through something called an embedding in a vector space. But before we explain this, you can try the following experiment:

If you go to ChatGPT and ask it to respond in a specific language, it will handle it without issue. You can even ask it to answer in a new language each time. You can use the following prompt: “I want to showcase the multilingual capabilities of ChatGPT. Let us have a conversation where you respond in a new language each time! You can start with Spanish”. By repeatedly responding with “Again in a new language”, you will quickly see that it can keep this up in just about every language. If you end your conversation with “Can you translate every message in this chat to Danish?”, you will see how well it works.
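
If you prefer to run the same experiment programmatically, here is a small sketch using the OpenAI Python client. The package, the model name and the three-turn loop are assumptions made for illustration; the prompt is the one from the paragraph above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [{"role": "user", "content": (
    "I want to showcase the multilingual capabilities of ChatGPT. "
    "Let us have a conversation where you respond in a new language each time! "
    "You can start with Spanish."
)}]

for _ in range(3):
    reply = client.chat.completions.create(model="gpt-4", messages=messages)
    answer = reply.choices[0].message.content
    print(answer, "\n")
    # Feed the answer back and ask for yet another language.
    messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": "Again in a new language"})
```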

Let’s get back to embeddings in vector space. An embedding represents words and their relationships in numerical form. Each word is pushed into a space where similar words are clustered close to each other. This also works for words across different languages. Think of it as if words were stars in a huge galaxy. By mapping words from various languages onto this shared vector space, LLMs learn to generalise linguistic patterns across languages and generate coherent text in multiple languages.

Vector space is very different from our world. Our world has 3 dimensions, but the vector space of LLMs has 1000+ dimensions. So in our vector galaxy, consider that each star might be connected with many other stars through wormholes. Wormholes are a useful metaphor from our world; in an LLM, words are close simply because the model has learned that they are related. For each word in one language, you can assume that there exists a wormhole to every similar word in all other languages. Impressive to say the least!
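
The “wormhole” intuition can be checked with a few lines of code: translations of the same word should sit closer together in embedding space than unrelated words. The sketch below again assumes the OpenAI Python client and numpy; the word list and model name are just examples.

```python
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# English, Danish and Spanish: translations of "dog" plus two of "bicycle".
words = ["dog", "hund", "perro", "bicycle", "cykel"]
response = client.embeddings.create(model="text-embedding-ada-002", input=words)
vectors = np.array([item.embedding for item in response.data])

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Translations of the same word should score higher than unrelated words.
print("dog vs hund    :", round(cosine(vectors[0], vectors[1]), 3))
print("dog vs perro   :", round(cosine(vectors[0], vectors[2]), 3))
print("dog vs bicycle :", round(cosine(vectors[0], vectors[3]), 3))
print("bicycle vs cykel:", round(cosine(vectors[3], vectors[4]), 3))
```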

A galaxy of opportunity for your business

Wow, that is a bit technical to say the least. So let’s provide a down-to-earth example. Say there exist many great Japanese recipes online. If you search for them in English, you will likely find a few, but most likely they will originate from sources located outside Japan. This is not a problem for an LLM. Instead of having these recipes separated by language, they exist in a space where recipes in all languages cluster together. The original language is just one parameter of many used when interacting with the LLM.

Furthermore, LLMs incorporate techniques such as transfer learning, where knowledge acquired from one language can be leveraged to improve performance in another. By transferring the learnt knowledge across languages, these models can efficiently handle low-resource languages or tasks that require multilingual understanding. It is a galaxy of opportunity which can be utilised in your work! Just imagine how the collective ideas of your organisation can contribute to tackling the problems of the future.

As a final reminder of the multilingual capabilities of LLMs, look no further than Duolingo. They are integrating GPT-4 into their mobile applications as a conversational tutor for language learning. Not only can they utilise the linguistic analytical skills of GPT-4 to help users improve their grammar, they also provide a new and unique experience for learning new languages. Let this be a reminder that your business has similar applications of LLMs that are too good to pass up.

Finally, we hope that this blog post has sparked your interest in multilingual LLMs and their applications. We at Flowtale are experts in this area, and we hope to help you on your AI journey, regardless of which languages your data is in!