5.5. Multilingual Knowledge Distillation#

This section is still under construction.

Multilingual Knowledge Distillation is a technique developed by Nils Reimers, as stated in Section 3.1.2.B. It allows a model to extend its knowledge of one language, e.g. English, to other languages such as Portuguese. The technique is especially attractive when we intend to create a model for a language in which only a few datasets are available.

In this work, we developed a language model using this technique. The goal, however, was not to create a multilingual model, but rather to improve the knowledge a model already has of the Portuguese language.

There are two models: a student and a teacher. Assume that we intend for our student model to learn Portuguese and that our teacher model already knows English. Both models receive pairs of sentences with the same meaning, one written in Portuguese and the other in English.

For each pair, the teacher model encodes only the English sentence, while the student model encodes both sentences. Since both sentences have the same meaning, their embeddings should be similar, if not equal. Consequently, both student embeddings are compared to the teacher embedding, and the mean squared error between them is back-propagated. Over time, the student embeddings move closer to the teacher embedding.
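The objective can be summarised with the short sketch below. It is purely illustrative: `teacher` and `student` stand for the two encoders (any callables mapping a sentence to an embedding tensor), not for a specific library API.

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher, student, english_sentence, portuguese_sentence):
    """MSE objective of multilingual knowledge distillation for one parallel pair.

    `teacher` and `student` are illustrative placeholders: callables that map a
    sentence (str) to an embedding tensor of the same dimensionality.
    """
    with torch.no_grad():                      # the teacher is kept frozen
        target = teacher(english_sentence)     # teacher embedding of the English sentence

    # The student encodes BOTH sentences of the parallel pair ...
    student_en = student(english_sentence)
    student_pt = student(portuguese_sentence)

    # ... and both of its embeddings are pulled towards the teacher embedding.
    return F.mse_loss(student_en, target) + F.mse_loss(student_pt, target)
```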

The dataset used was the TED 2020 Parallel Sentences Corpus. TED 2020 contains around 4000 TED (https://www.ted.com/) and TEDx transcripts from July 2020. These transcripts were translated by volunteers into more than 100 languages, adding up to a total of 10 544 174 sentences. All the sentences were aligned to form a parallel corpus for training tasks such as this one.
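For training, the corpus can be exported as aligned English-Portuguese sentence pairs. A minimal loading sketch, assuming a tab-separated, gzip-compressed export (the file name is a hypothetical placeholder):

```python
import gzip

# Hypothetical export of the TED 2020 EN-PT pairs: one pair per line,
# English and Portuguese separated by a tab.
parallel_file = "ted2020-en-pt.tsv.gz"

pairs = []
with gzip.open(parallel_file, "rt", encoding="utf8") as f:
    for line in f:
        english, portuguese = line.rstrip("\n").split("\t")
        pairs.append((english, portuguese))

print(f"Loaded {len(pairs)} parallel sentence pairs")
```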

With this end goal in mind, the technique was applied to Legal-BERTimbau-base, which was designated as the student model and which we intended to learn Portuguese. The chosen teacher model, which already supports the English language, was sentence-transformers/paraphrase-xlm-r-multilingual-v1 (https://huggingface.co/sentence-transformers/paraphrase-xlm-r-multilingual-v1).

The number of warm-up steps was set to 10000. The training was performed for five epochs with a 1e-5 learning rate using the Adam optimization algorithm.

Due to the size of its output layer (1024), Legal-BERTimbau-large required a different teacher model, sentence-transformers/stsb-roberta-large. The same training settings were used: 10000 warm-up steps, a 1e-5 learning rate, the Adam optimization algorithm, and five epochs. A code sketch of this training setup is given in Section 5.5.1.

Furthermore, after this extra training step, we fine-tuned the models for the STS regression task as described in Section 5.4. With this different application of the technique, it was possible to further train a model for the Portuguese language by mimicking a teacher model that knew how to encode English sentences properly.
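A minimal sketch of that fine-tuning step, assuming the STS data is available as sentence pairs with similarity scores normalised to [0, 1]; the model path and the hyperparameters shown here are illustrative placeholders (the actual settings are those of Section 5.4):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# The distilled model produced by the previous step (path is a placeholder).
model = SentenceTransformer("output/legal-bertimbau-base-ted2020-distilled")

# STS examples: two sentences plus a gold similarity score scaled to [0, 1].
train_examples = [
    InputExample(texts=["Uma frase de exemplo.", "Outra frase de exemplo."], label=0.8),
    InputExample(texts=["O tribunal decidiu.", "Está a chover."], label=0.1),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model=model)  # regression on the cosine similarity

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=4, warmup_steps=100)
```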

The application of the multilingual knowledge distillation process followed by the classical STS fine-tuning produced two different SBERT variants, one based on Legal-BERTimbau-base and the other on Legal-BERTimbau-large.

5.5.1. Code Example#

Sentence Transformers Documentation: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/multilingual/README.md
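Below is a minimal end-to-end sketch of the distillation step, following the linked multilingual training example. The student checkpoint path and the parallel-corpus file name are illustrative assumptions, and the exact API may differ between sentence-transformers versions.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: already produces good English sentence embeddings.
teacher_model = SentenceTransformer("sentence-transformers/paraphrase-xlm-r-multilingual-v1")

# Student: Legal-BERTimbau-base wrapped with a mean-pooling layer
# (the checkpoint path below is an assumed placeholder; use the actual one).
word_embedding_model = models.Transformer("path/to/Legal-BERTimbau-base", max_seq_length=128)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
student_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# TED 2020 EN-PT parallel pairs: one "English<TAB>Portuguese" pair per line.
train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model)
train_data.load_data("ted2020-en-pt.tsv.gz")

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=32)
train_loss = losses.MSELoss(model=student_model)  # student embeddings vs. teacher embeddings

# Hyperparameters from this section: five epochs, 10000 warm-up steps, 1e-5 learning rate
# (sentence-transformers uses an Adam-style optimizer, AdamW, by default).
student_model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=5,
    warmup_steps=10000,
    optimizer_params={"lr": 1e-5},
    output_path="output/legal-bertimbau-base-ted2020-distilled",
)
```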