5. Legal Language Model#
Creating the embeddings required a language model adapted to the Portuguese language and, more specifically, to the Portuguese legal domain. This chapter explains the approaches we explored and the final implementations of Legal-BERTimbau. Advances in language representation with neural networks have made it viable to transfer the “learned internal states of large pretrained language models” [SNL20] [SNL19]. Transfer learning is a technique that allows a model trained on a general task to later be fine-tuned for specific tasks.
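The transfer-learning idea can be illustrated with a minimal sketch: reuse a pretrained encoder and attach a new task-specific head. The toy `encoder` below is a hypothetical stand-in for a large pretrained model such as BERTimbau; here the pretrained weights are frozen for clarity, although in practice fine-tuning often updates all weights at a small learning rate.

```python
import torch.nn as nn

# Hypothetical stand-in for a pretrained encoder; in practice this would be
# a large pretrained model (e.g. BERTimbau) loaded with its learned weights.
encoder = nn.Sequential(nn.Embedding(100, 16), nn.Linear(16, 16))

# Freeze the pretrained parameters so the general-purpose knowledge is kept.
for p in encoder.parameters():
    p.requires_grad = False

# New task-specific head, trained from scratch on the downstream task.
head = nn.Linear(16, 2)

model = nn.Sequential(encoder, head)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the new head's parameters remain trainable
```

Only the head's weight and bias appear in `trainable`, which is what makes fine-tuning on a small domain corpus feasible.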
BERT contains numerous parameters, reaching over 300 M in its large version. Training a BERT model from scratch on a relatively small dataset would cause overfitting. In our case, with nearly 30 000 documents, the correct approach is to reuse a model able to fulfil our needs that has already been pretrained on a corpus much larger than the one available in our domain.
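As a rough sanity check on the size claim, BERT-large's parameter count can be derived from its published hyper-parameters (24 layers, hidden size 1024, FFN size 4096, ~30k-token WordPiece vocabulary). The breakdown below is a back-of-the-envelope calculation of the encoder and embeddings, ignoring implementation details such as the masked-language-modelling head.

```python
def bert_param_count(vocab=30522, hidden=1024, layers=24,
                     inter=4096, max_pos=512):
    """Approximate parameter count of a BERT encoder."""
    # Token, position and segment embeddings, plus the embedding LayerNorm.
    emb = vocab * hidden + max_pos * hidden + 2 * hidden + 2 * hidden
    # Self-attention: Q, K, V and output projections (weights + biases).
    attn = 4 * (hidden * hidden + hidden)
    # Feed-forward network: two linear layers (weights + biases).
    ffn = hidden * inter + inter + inter * hidden + hidden
    # Two LayerNorms per encoder layer (scale + shift each).
    ln = 2 * (2 * hidden)
    # Final pooler layer on the [CLS] representation.
    pooler = hidden * hidden + hidden
    return emb + layers * (attn + ffn + ln) + pooler

total = bert_param_count()
print(f"~{total / 1e6:.0f}M parameters")  # ~335M
```

The result, roughly 335 M parameters, confirms that the large variant sits well above the 300 M mark, far too many weights to estimate from ~30 000 documents alone.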