6.1. Language Model Evaluation
This thesis involved the development of multiple Legal-BERTimbau versions. From fine-tuning both the base and large variants of BERTimbau to applying the Multilingual Knowledge Distillation technique, a total of 4 different variants were implemented, in addition to BERTimbau’s variants.
Regarding BERTimbau’s variants, Legal-BERTimbau-base and Legal-BERTimbau-large, we can assess whether the domain adaptation stage was successful based on the average value of the loss function used during MLM training.
The loss function used was the negative log-likelihood loss. We utilised the test split generated in subsection 4.1.2 and compared the average loss produced by each model. The models subjected to the TSDAE technique were also included in this evaluation.
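A minimal sketch of this MLM loss computation is shown below, assuming the Hugging Face `transformers` library; the checkpoint name and the test-split file path are illustrative placeholders rather than the exact artefacts used in this thesis.

```python
# Illustrative sketch: average MLM loss of a model over a held-out test split.
# The checkpoint name and the test-split file path are placeholders.
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

checkpoint = "rufimelo/Legal-BERTimbau-base"  # placeholder checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint).eval()

# Mask 15% of tokens; the model's loss is the negative log-likelihood
# (cross-entropy) computed over the masked positions only.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

with open("test_split.txt", encoding="utf-8") as f:  # placeholder test split file
    sentences = [line.strip() for line in f if line.strip()]

losses = []
with torch.no_grad():
    for sentence in sentences:
        encoded = tokenizer(sentence, truncation=True, max_length=512)
        batch = collator([encoded])          # applies random masking, builds labels
        losses.append(model(**batch).loss.item())

print("Average MLM loss:", sum(losses) / len(losses))
```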
| Model | Average loss |
|---|---|
| BERTimbau base | 18.15683 |
| BERTimbau large | 16.40979 |
| Legal-BERTimbau-base | 0.027202 |
| Legal-BERTimbau-large | 0.0285061 |
| Legal-BERTimbau-base-TSDAE | 10.897830 |
| Legal-BERTimbau-large-TSDAE | 11.093482 |
On our test split, both Legal-BERTimbau-base and Legal-BERTimbau-large perform better on the MLM task than BERTimbau, which indicates that our models were successfully adapted to the legal domain. The models subjected to the TSDAE domain adaptation technique also perform slightly better than the original BERTimbau models on the MLM task.
Since the main task of these models is determining STS, the corresponding evaluation metric is the Pearson correlation [VGO+20] between the expected and the predicted similarity scores for the different sentence pairs.
This evaluation was performed on the available Portuguese datasets that were also used in the fine-tuning stage. For comparison, the table also reports the performance of state-of-the-art multilingual models on the same task with the same data.
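As an illustration of this metric, the following sketch computes the Pearson correlation between the gold similarity scores and the cosine similarities of the sentence embeddings, assuming the `sentence-transformers`, `datasets` and `scipy` libraries and the Hugging Face version of the assin2 test split; the checkpoint name is a placeholder.

```python
# Illustrative sketch: Pearson correlation between gold similarity scores and
# the cosine similarities predicted by a sentence-embedding model.
from datasets import load_dataset
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("rufimelo/Legal-BERTimbau-sts-base")  # placeholder checkpoint

# assin2 test split: sentence pairs annotated with relatedness scores in [1, 5]
dataset = load_dataset("assin2", split="test")
gold = [example["relatedness_score"] for example in dataset]

emb1 = model.encode([example["premise"] for example in dataset], convert_to_tensor=True)
emb2 = model.encode([example["hypothesis"] for example in dataset], convert_to_tensor=True)

# Cosine similarity of each aligned pair (diagonal of the full similarity matrix)
predicted = util.cos_sim(emb1, emb2).diagonal().tolist()

correlation, _ = pearsonr(gold, predicted)
print("Pearson correlation on assin2:", correlation)
```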
| Model | Assin | Assin2 | stsb_multi_mt pt | Average |
|---|---|---|---|---|
| Legal-BERTimbau-sts-base | 0.71457 | 0.73545 | 0.72383 | 0.72462 |
| Legal-BERTimbau-sts-base-ma | 0.74874 | 0.79532 | 0.82254 | 0.78886 |
| Legal-BERTimbau-sts-base-ma-v2 | 0.75481 | 0.80262 | 0.82178 | 0.79307 |
| Legal-BERTimbau-base-TSDAE-sts | 0.78814 | 0.81380 | 0.75777 | 0.78657 |
| Legal-BERTimbau-sts-large | 0.76629 | 0.82357 | 0.79120 | 0.79369 |
| Legal-BERTimbau-sts-large-v2 | 0.76299 | 0.81121 | 0.81726 | 0.79715 |
| Legal-BERTimbau-sts-large-ma | 0.76195 | 0.81622 | 0.82608 | 0.80142 |
| Legal-BERTimbau-sts-large-ma-v2 | 0.7836 | 0.8462 | 0.8261 | 0.81863 |
| Legal-BERTimbau-sts-large-ma-v3 | 0.7749 | 0.8470 | 0.8364 | 0.81943 |
| Legal-BERTimbau-large-v2-sts | 0.71665 | 0.80106 | 0.73724 | 0.75165 |
| Legal-BERTimbau-large-TSDAE-sts | 0.72376 | 0.79261 | 0.73635 | 0.75090 |
| Legal-BERTimbau-large-TSDAE-sts-v2 | 0.81326 | 0.83130 | 0.786314 | 0.81029 |
| Legal-BERTimbau-large-TSDAE-sts-v3 | 0.80703 | 0.82270 | 0.77638 | 0.80204 |
| Legal-BERTimbau-large-TSDAE-sts-v4 | 0.805102 | 0.81467 | 0.76010 | 0.79329 |
| BERTimbau base Fine-tuned for STS | 0.78455 | 0.80626 | 0.82841 | 0.80640 |
| BERTimbau large Fine-tuned for STS | 0.78193 | 0.81758 | 0.83784 | 0.81245 |
| paraphrase-multilingual-mpnet-base-v2 | 0.71457 | 0.79831 | 0.83999 | 0.78429 |
| paraphrase-multilingual-mpnet-base-v2 Fine-tuned with assin(s) | 0.77641 | 0.79831 | 0.84575 | 0.80682 |
It is possible to verify that our SBERT variants clearly perform better than state-of-the-art multilingual models on the STS task for both the assin and assin2 datasets. For the stsb_multi_mt dataset, this advantage is less clear. stsb_multi_mt is a dataset composed of different multilingual translations of the original STSbenchmark dataset; consequently, the multilingual models had already been exposed to multiple translations of the same sentences during their training.
Nevertheless, the scores are similar and, more importantly, the SBERTimbau variants are adapted to our domain, as shown previously.