7.2. Future Work
This section is still under construction.
7.2.1. Dataset Annotation
When developing a language model, one of the main concerns is the available data. A Portuguese legal-domain dataset could be produced and revised. Although a dataset containing multiple sentences was generated to train the models, it was used only in the Domain Adaptation training step. Adapting the model for STS required 3 different Portuguese datasets. Although this process brought good results, the ideal solution would be a Portuguese legal dataset with sentence pairs and similarity values (1 to 5).
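As a rough illustration of what one record in such a dataset could look like, the sketch below models a sentence pair with a similarity value restricted to the 1 to 5 range. The names `STSPair` and `from_annotations`, the choice to average several annotators' scores, and the example sentences are all assumptions made for illustration; they are not part of the work described here.

```python
from dataclasses import dataclass
from statistics import mean

# Similarity range taken from the text (1 to 5).
MIN_SCORE = 1.0
MAX_SCORE = 5.0


@dataclass(frozen=True)
class STSPair:
    """One annotated sentence pair (hypothetical schema)."""

    sentence_a: str
    sentence_b: str
    similarity: float

    def __post_init__(self):
        # Reject empty sentences and out-of-range similarity values.
        if not self.sentence_a.strip() or not self.sentence_b.strip():
            raise ValueError("both sentences must be non-empty")
        if not MIN_SCORE <= self.similarity <= MAX_SCORE:
            raise ValueError(
                f"similarity must be in [{MIN_SCORE}, {MAX_SCORE}], "
                f"got {self.similarity}"
            )


def from_annotations(sentence_a, sentence_b, scores):
    """Build a pair from several annotators' scores.

    Averaging the scores is an assumption of this sketch, not a
    documented procedure of any existing dataset.
    """
    if not scores:
        raise ValueError("at least one annotator score is required")
    for score in scores:
        if not MIN_SCORE <= score <= MAX_SCORE:
            raise ValueError(f"annotator score out of range: {score}")
    return STSPair(sentence_a, sentence_b, float(mean(scores)))


# Invented example sentences, for illustration only.
pair = from_annotations(
    "O réu foi absolvido.",
    "O arguido foi considerado inocente.",
    [4, 5, 4],
)
print(pair)
```

Validating the range when the record is built means a malformed annotation fails early, before it reaches the STS training step.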
The downside of this approach is that it requires manual work that is nearly impossible to automate. Both assin and assin2 were annotated manually by different groups of researchers/volunteers, guided only by simple guidelines.
Nevertheless, a properly labelled and cleaned dataset from the Portuguese legal domain would be helpful in future applications. A similar dataset was developed and published on 1st July 2022 alongside a paper entitled “Pile of Law: Learning Responsible Data Filtering from the Law and a 256 GB Open-Source Legal Dataset” by Peter Henderson, et al. [BWG+20] [BSHVD20] [HBCB21] [K+05] [LPC+18] [RLLT21] [HLT+21]; it encompasses a large corpus of legal and administrative data from multiple U.S. entities. The Pile of Law paper also exposed the ethical challenges it faced, as well as how it handled “biased, obscene, copyrighted, and private information”.