5.3. Semantic Textual Similarity#

All of these models are available at https://huggingface.co/rufimelo

This section is still under construction.

The task our language model needs to perform is semantic textual similarity (STS). STS is a regression task that determines how similar two text segments are on a numeric scale, typically ranging from 1 to 5.
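As an illustration (not the authors' exact code), a fine-tuned sentence-embedding model can produce such a score by embedding both segments and rescaling their cosine similarity onto the 1-to-5 scale; the checkpoint name and the rescaling below are placeholders.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('some-fine-tuned-sts-checkpoint')  # placeholder name
sentence_a = 'O tribunal julgou o recurso improcedente.'
sentence_b = 'O recurso foi julgado improcedente pelo tribunal.'

# Encode both sentences and compute the cosine similarity of their embeddings
embeddings = model.encode([sentence_a, sentence_b], convert_to_tensor=True)
cosine = util.cos_sim(embeddings[0], embeddings[1]).item()

# Map a non-negative cosine similarity onto the 1-to-5 STS scale (illustrative)
sts_score = 1 + 4 * max(cosine, 0.0)
print(round(sts_score, 2))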

To adapt the generated variants to this task, we created SBERT versions of them and trained these with four distinct datasets. We attach an independent linear layer to each Legal-BERTimbau variant and fine-tune the model using a mean squared error loss. The SBERT version of Legal-BERTimbau large, built with the SentenceTransformer library, is defined as follows:

from sentence_transformers import SentenceTransformer, models
import torch.nn as nn  # only needed if the optional Dense layer below is used

bert_model_name = 'rufimelo/Legal-BERTimbau-large'

# Wrap the BERT checkpoint so it produces token embeddings (truncated at 256 tokens)
word_embedding_model = models.Transformer(bert_model_name, max_seq_length=256)

# Mean pooling turns the token embeddings into a fixed-size sentence embedding,
# keeping the hidden size of the large model (1024 dimensions)
pooling_model = models.Pooling(
                word_embedding_model.get_word_embedding_dimension())

# A Dense bottleneck could be appended to shrink the embeddings, e.g.
# models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(),
#              out_features=256, activation_function=nn.Tanh()),
# but we leave it out so the sentence embeddings keep their 1024 dimensions
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
Some weights of the model checkpoint at rufimelo/Legal-BERTimbau-large were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at rufimelo/Legal-BERTimbau-large and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

This code snippet shows how we can create an SBERT model from scratch, using a BERT model as the foundation. Since we did not add a Dense bottleneck layer on top of the pooling module, which would reduce the embedding dimensionality and could lower the accuracy of the embeddings for the STS task, this SBERT model variant generates 1024-dimension embeddings. To train the models for the STS task, we used the assin [37] and assin2 [38] datasets and the Portuguese sub-dataset of stsb multi mt [39]. Each dataset contains pairs of sentences and a label value representing the similarity between the two sentences.

The assin dataset contains 10 000 pairs of sentences, 5 000 of which were used for training. Similarly, the assin2 dataset contains 9 448 pairs of sentences, of which 6 500 were used for training. Finally, the stsb multi mt Portuguese sub-dataset contains 8 628 pairs of sentences, of which we used 5 749 to fine-tune the model. In a nutshell, for the STS task our models were trained on 20 197 Portuguese sentence pairs, making them more familiar with the Portuguese language.
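A minimal sketch of how these three public datasets can be loaded and converted into training pairs, assuming the Hugging Face versions of assin (default configuration), assin2 and stsb_multi_mt and their usual column names; rescaling the gold scores to [0, 1] is one possible choice, not necessarily the authors' exact preprocessing.

from datasets import load_dataset
from sentence_transformers import InputExample

train_examples = []

# assin and assin2 expose 'premise', 'hypothesis' and a 1-to-5 'relatedness_score'
for dataset_name in ('assin', 'assin2'):
    for row in load_dataset(dataset_name, split='train'):
        train_examples.append(InputExample(
            texts=[row['premise'], row['hypothesis']],
            label=(row['relatedness_score'] - 1) / 4.0))

# stsb_multi_mt (Portuguese subset) exposes 'sentence1', 'sentence2' and a 0-to-5 'similarity_score'
for row in load_dataset('stsb_multi_mt', 'pt', split='train'):
    train_examples.append(InputExample(
        texts=[row['sentence1'], row['sentence2']],
        label=row['similarity_score'] / 5.0))

print(len(train_examples), 'training pairs')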

Following the STS fine-tuning procedure performed on BERTimbau, we trained the base version with a learning rate of 4e−5, using the Adam optimization algorithm [40] and a batch size of 32, for ten epochs. The large version was trained with a learning rate of 1e−5, also using the Adam optimization algorithm, but with a batch size of 8 for five epochs.
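Continuing from the snippet and sketch above (reusing model and train_examples), the following sketch shows how such a fine-tuning run could be expressed with the SentenceTransformer library. CosineSimilarityLoss, which applies a mean squared error between the cosine similarity of the two embeddings and the gold score, stands in here for the MSE objective described earlier; the warmup value and output path are illustrative, not taken from the paper.

import torch
from torch.utils.data import DataLoader
from sentence_transformers import losses

# Hyperparameters quoted above for the large variant: lr 1e-5, batch size 8, 5 epochs
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=5,
    warmup_steps=100,                     # illustrative value
    optimizer_class=torch.optim.Adam,     # Adam, as described above
    optimizer_params={'lr': 1e-5},
)
model.save('legal-bertimbau-sts-large')   # hypothetical output path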

This type of fine-tuning generated two different SBERT variants, one derived from the base version of Legal-BERTimbau and another from the large version.

5.3.1. Semantic Textual Similarity Custom Dataset#

Our solution requires training on the STS task, but the resources available for this purpose are scarce: there are only three Portuguese STS datasets, including the stsb multi mt Portuguese sub-dataset, which is a translation of its English counterpart. On top of that, the Portuguese legal domain is unique in its own right. To improve performance on the STS task, we developed a dedicated dataset for further training our models. The dataset is publicly available on HuggingFace:

  • stjiris/IRIS sts legal dataset

The dataset creation process was automated. Sentence pairs selected randomly across our document collection were given values from 0 to 1. Values from 1 to 4 were attributed to sentence pairs selected from the same summary; since the summaries are short, such pairs may imply some degree of entailment. Finally, we selected sentences from our collection and generated paraphrase pairs for them using OpenAI's GPT-3 text-davinci-003 model API, publicly available since November 29th, 2022 (a minimal sketch of this step follows the prompt below). The GPT-3 model received the following request:

  • “Escreve por outras palavras: Entrada: sentence Saída:”

which translates to:

  • “Write, in other words: Input: sentence Output:”
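A minimal sketch of this paraphrase-generation step, using the legacy openai-python Completion endpoint through which text-davinci-003 was served (both have since been deprecated); the API key, decoding parameters and helper name are placeholders, not the authors' settings.

import openai

openai.api_key = 'YOUR_API_KEY'  # placeholder

def paraphrase(sentence):
    # Build the prompt shown above and request a rewording from GPT-3
    prompt = f'Escreve por outras palavras: Entrada: {sentence} Saída:'
    response = openai.Completion.create(
        model='text-davinci-003',
        prompt=prompt,
        max_tokens=256,
        temperature=0.7,
    )
    return response['choices'][0]['text'].strip()

# Each generated rewording forms a high-similarity pair with its source sentence.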

Similarly to the STS fine-tuning stage described in BERTimbau’s paper, the models were further trained on this dataset with a learning rate of 1e−5, also using the Adam optimization algorithm, with a batch size of 8 for five epochs.