6.1. Language Model Evaluation


This thesis involved the development of multiple Legal-BERTimbau versions. From fine-tuning both the base and large variants of BERTimbau to applying the Multilingual Knowledge Distillation technique, a total of 4 different variants were implemented, in addition to BERTimbau's original variants.

Regarding BERTimbau's variants, Legal-BERTimbau-base and Legal-BERTimbau-large, we can assess whether the domain adaptation stage was successful based on the average value of the loss function used during MLM training.

The loss function used was the negative log-likelihood loss. We used the test split generated in subsection 4.1.2 and compared the average loss produced by each model. The models subjected to the TSDAE technique were also included in this evaluation.
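As an illustration, the minimal sketch below shows how such an average masked-language-modelling loss can be computed with the HuggingFace transformers library. The checkpoint name and the test-split file path are placeholders, not the exact artefacts used in this thesis.

```python
# Minimal sketch: average MLM (negative log-likelihood) loss on a held-out split.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, DataCollatorForLanguageModeling

model_name = "neuralmind/bert-base-portuguese-cased"  # BERTimbau base; swap in a Legal-BERTimbau checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()

# 15% random masking, the standard BERT MLM setting
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

def avg_mlm_loss(sentences, batch_size=16):
    """Average MLM loss over a list of sentences."""
    losses = []
    for i in range(0, len(sentences), batch_size):
        enc = tokenizer(sentences[i:i + batch_size], truncation=True, max_length=128)
        batch = collator([{"input_ids": ids} for ids in enc["input_ids"]])
        with torch.no_grad():
            out = model(**batch)  # loss = cross-entropy over the masked positions
        losses.append(out.loss.item())
    return sum(losses) / len(losses)

# hypothetical path to the test split from subsection 4.1.2 (one sentence per line)
test_sentences = open("legal_test_split.txt", encoding="utf-8").read().splitlines()
print(f"average MLM loss: {avg_mlm_loss(test_sentences):.5f}")
```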

| Model | Avg. loss |
|---|---|
| BERTimbau base | 18.15683 |
| BERTimbau large | 16.40979 |
| Legal-BERTimbau-base | 0.027202 |
| Legal-BERTimbau-large | 0.0285061 |
| Legal-BERTimbau-base-TSDAE | 10.897830 |
| Legal-BERTimbau-large-TSDAE | 11.093482 |

On our test split, both Legal-BERTimbau-base and Legal-BERTimbau-large perform better on the MLM task than BERTimbau, which indicates that our models were successfully adapted to the legal domain. The models subjected to the TSDAE domain adaptation technique also perform slightly better than the original BERTimbau models on the MLM task.

Since the main task of these models is determining STS, the corresponding evaluation metric for the language model is the Pearson correlation [VGO+20] between the expected similarity score and the predicted score for each sentence pair.

This evaluation was performed on the available Portuguese datasets that were also used in the fine-tuning stage. For comparison, we also report the performance of state-of-the-art multilingual models on the same tasks with the same data.
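To make the procedure concrete, the sketch below computes this correlation for one of the multilingual baselines listed in the table, using sentence-transformers and SciPy's pearsonr [VGO+20]. The sentence pairs shown are illustrative placeholders; in practice the test splits of assin, assin2 and stsb_multi_mt are iterated over in the same way.

```python
# Minimal sketch of the STS evaluation: Pearson correlation between gold scores
# and cosine similarities predicted by a Sentence-BERT model.
from scipy.stats import pearsonr                      # SciPy [VGO+20]
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

# illustrative (sentence_1, sentence_2, gold similarity) triples
pairs = [
    ("O réu foi absolvido.", "O arguido foi absolvido.", 4.5),
    ("O tribunal negou provimento ao recurso.", "O recurso foi aceite.", 1.5),
    ("A sentença foi proferida ontem.", "A decisão foi anunciada ontem.", 3.5),
]

emb1 = model.encode([s1 for s1, _, _ in pairs], convert_to_tensor=True)
emb2 = model.encode([s2 for _, s2, _ in pairs], convert_to_tensor=True)
predicted = util.cos_sim(emb1, emb2).diagonal().cpu().tolist()
gold = [score for _, _, score in pairs]

correlation, _ = pearsonr(gold, predicted)
print(f"Pearson correlation: {correlation:.5f}")
```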

| Model | Assin | Assin2 | stsb_multi_mt (pt) | Avg. |
|---|---|---|---|---|
| Legal-BERTimbau-sts-base | 0.71457 | 0.73545 | 0.72383 | 0.72462 |
| Legal-BERTimbau-sts-base-ma | 0.74874 | 0.79532 | 0.82254 | 0.78886 |
| Legal-BERTimbau-sts-base-ma-v2 | 0.75481 | 0.80262 | 0.82178 | 0.79307 |
| Legal-BERTimbau-base-TSDAE-sts | 0.78814 | 0.81380 | 0.75777 | 0.78657 |
| Legal-BERTimbau-sts-large | 0.76629 | 0.82357 | 0.79120 | 0.79369 |
| Legal-BERTimbau-sts-large-v2 | 0.76299 | 0.81121 | 0.81726 | 0.79715 |
| Legal-BERTimbau-sts-large-ma | 0.76195 | 0.81622 | 0.82608 | 0.80142 |
| Legal-BERTimbau-sts-large-ma-v2 | 0.7836 | 0.8462 | 0.8261 | 0.81863 |
| Legal-BERTimbau-sts-large-ma-v3 | 0.7749 | 0.8470 | 0.8364 | 0.81943 |
| Legal-BERTimbau-large-v2-sts | 0.71665 | 0.80106 | 0.73724 | 0.75165 |
| Legal-BERTimbau-large-TSDAE-sts | 0.72376 | 0.79261 | 0.73635 | 0.75090 |
| Legal-BERTimbau-large-TSDAE-sts-v2 | 0.81326 | 0.83130 | 0.786314 | 0.81029 |
| Legal-BERTimbau-large-TSDAE-sts-v3 | 0.80703 | 0.82270 | 0.77638 | 0.80204 |
| Legal-BERTimbau-large-TSDAE-sts-v4 | 0.805102 | 0.81467 | 0.76010 | 0.79329 |
| BERTimbau base Fine-tuned for STS | 0.78455 | 0.80626 | 0.82841 | 0.80640 |
| BERTimbau large Fine-tuned for STS | 0.78193 | 0.81758 | 0.83784 | 0.81245 |
| paraphrase-multilingual-mpnet-base-v2 | 0.71457 | 0.79831 | 0.83999 | 0.78429 |
| paraphrase-multilingual-mpnet-base-v2 Fine-tuned with assin(s) | 0.77641 | 0.79831 | 0.84575 | 0.80682 |

It is possible to verify that our SBERT variants clearly outperform state-of-the-art multilingual models on the STS task for both the assin and assin2 datasets. On the stsb_multi_mt dataset, this advantage is less clear. stsb_multi_mt is composed of multilingual translations of the original STSbenchmark dataset; consequently, the multilingual models were exposed to multiple translations of the same sentences during their training process.

Nevertheless, the scores are similar and, more importantly, our SBERT variants are adapted to our legal domain, as shown previously.