BERT models in Danish, Swedish and Norwegian have been released by the Danish company BotXO. We spoke to Jens Dahl Møllerhøj, Lead Data Scientist at BotXO, to find out more. See how these open source models differ from Google’s multilingual BERT model, what makes creating NLP models for Nordic languages difficult, and where these models can be used.
JAXenter: Your company has released open source versions of Danish and Norwegian BERT models. What is the difference between your models and the multilingual BERT model released by Google, which includes Danish and Norwegian?
Jens Dahl Møllerhøj: The multilingual BERT model released by Google is trained on more than a hundred different languages. It performs poorly on Nordic languages such as Danish and Norwegian because they are underrepresented in its training data. Danish, for example, makes up only about 1% of the total training data. To illustrate: the multilingual BERT model has a vocabulary of 120,000 words*, which leaves room for only about 1,200 Danish words. BotXO’s model, by contrast, has a vocabulary of 32,000 Danish words.
(* Actually, “words” is a bit imprecise. In practice, the model divides rare words into subword pieces; for example, the word “inconsequential” becomes “in-”, “-con-” and “-sequential”. Since these kinds of word divisions are shared across languages, there is room for somewhat more than 1,200 Danish “words” in Google’s multilingual model.)
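To make the vocabulary point concrete, here is a minimal sketch of how the difference shows up in tokenization, assuming the Hugging Face transformers library. The Danish model identifier and the token splits shown in the comments are illustrative assumptions, not verified output.

```python
from transformers import AutoTokenizer

# Google's multilingual BERT shares one ~120,000-token vocabulary
# across 100+ languages, so few whole Danish words fit into it.
multilingual = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# A longer Danish word tends to be split into many subword pieces
# (the split shown below is illustrative, not verified output).
print(multilingual.tokenize("uhensigtsmæssigt"))
# e.g. ['u', '##hen', '##sig', '##ts', '##mæs', '##sig', '##t']

# A dedicated 32,000-word Danish vocabulary can keep common words
# whole or nearly whole. The model id below is an assumption; check
# the Hugging Face hub for the current name of BotXO's Danish model.
danish = AutoTokenizer.from_pretrained("Maltehb/danish-bert-botxo")
print(danish.tokenize("uhensigtsmæssigt"))
# expected: noticeably fewer pieces than the multilingual tokenizer
```

Fewer, more meaningful tokens per word means the model sees Danish much closer to the way a monolingual model would, which is the core of the vocabulary argument above.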
JAXenter: What makes a language particularly well suited for NLP tasks—and what makes the process especially complicated?
Jens Dahl Møllerhøj: The performance of general-purpose language models such as BERT is dependent on the amount and quality of training data available. Since languages with fewer speakers are underrepresented on the internet, it can be challenging to gather enough data to train big language models.
JAXenter: What were the greatest challenges in creating the Danish and Norwegian BERT models?
Jens Dahl Møllerhøj: Getting the training process to run fast enough required running it on Google’s custom hardware, called TPUs (Tensor Processing Units). As these chips are experimental, they are not particularly well documented, and getting the algorithm running on them required quite a bit of experimentation. Moreover, renting Google’s TPUs is expensive, so it is important to make the algorithms run as fast as possible to keep the cost down.
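For readers who want a sense of what TPU training involves, here is a minimal TensorFlow 2.x sketch of connecting to a TPU and building a model under a TPU distribution strategy. This is not BotXO’s training code: the resolver argument depends on the environment, the Keras model is a stand-in for BERT pretraining, and the original BERT codebase used the older TPUEstimator API.

```python
import tensorflow as tf

# Locate and initialize the TPU. The argument is environment-dependent:
# on a Cloud TPU VM it is typically "local"; on Colab the resolver can
# often discover the TPU with no argument at all.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="local")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# TPUStrategy replicates each training step across the TPU cores.
# Keeping the input pipeline fast enough to feed all cores is what
# keeps the (expensive) chips busy and the total rental cost down.
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Stand-in model; the real workload would be BERT pretraining.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(2),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
```

Because TPU time is billed by the hour, every input bottleneck removed translates directly into lower training cost.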
Watch this video: https://www.youtube.com/watch?v=Z5vxRC8dMvs