3.3. NLP Applied To Portuguese Consumer Law

In 2022, as part of his master’s thesis, Nuno Cordeiro created the Legal Semantic Search Engine (LeSSE) [Nun], a system that merges common document retrieval techniques with semantic search properties. The system was created specifically to answer questions about Portuguese consumer law. The overall goal and context of his thesis are similar to those of this research, and although his work focuses on Portuguese consumer law, several of the state-of-the-art techniques it uses, such as BERT and the Inverse Cloze Task, are relevant to our context.

The search system combines the 20 highest-scoring retrievals under BM25 with the 50 highest-scoring retrievals under cosine similarity, and then passes the combined candidates through a reordering model to produce the final results. The pre-processing of legal documents differs from the one needed in our context. His system required tokenization, together with the removal of punctuation and stop-words, to construct the bag-of-words representation needed by the BM25 algorithm; none of this is necessary for SBERT. SBERT performs well on sentences with proper context, so removing punctuation and stop-words is not advised. We do not intend to use the BM25 algorithm, even though combining it with a reordering model showed potential in his research.

In his work, the language model had to be trained on a corpus that included legislative jargon to produce the desired results. This was an important step because it helps the model form proper relationships involving words not seen in the pre-training stage. We intend to adopt this approach in this thesis, since our documents will also contain legal terms and other jargon that the model has not seen during pre-training.
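The hybrid retrieval scheme described above for LeSSE (top 20 by BM25, top 50 by cosine similarity, then a reordering step) can be illustrated with a minimal, dependency-free sketch. This is not Cordeiro's implementation: the function names, the BM25 parameters `k1` and `b`, the bag-of-words `embed` stand-in (where LeSSE uses a BERT-based encoder), and the Jaccard-overlap `rerank` stand-in (where LeSSE uses a trained reordering model) are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep word characters only; this drops punctuation,
    # which is the kind of pre-processing a bag-of-words model needs.
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for the query (k1, b are assumed defaults)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def top_k(scores, k):
    # Indices of the k highest scores, best first.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def hybrid_search(query, docs, embed, rerank, k_bm25=20, k_cos=50):
    """Merge the lexical and semantic candidates, then reorder them."""
    lexical = top_k(bm25_scores(query, docs), k_bm25)
    q_vec = embed(query)
    semantic = top_k([cosine(q_vec, embed(d)) for d in docs], k_cos)
    candidates = list(dict.fromkeys(lexical + semantic))  # union without duplicates
    return sorted(candidates, key=lambda i: rerank(query, docs[i]), reverse=True)


def make_bow_embedder(docs):
    # Hypothetical stand-in for a sentence encoder: term counts over a fixed vocabulary.
    vocab = sorted({t for d in docs for t in tokenize(d)})

    def embed(text):
        counts = Counter(tokenize(text))
        return [counts[t] for t in vocab]

    return embed


def jaccard_rerank(query, doc):
    # Hypothetical stand-in for a trained reordering model: token-set overlap.
    q, d = set(tokenize(query)), set(tokenize(doc))
    return len(q & d) / len(q | d) if q | d else 0.0


docs = [
    "The consumer may return goods within fourteen days.",
    "The seller must repair or replace defective goods.",
    "Parking rules apply to all municipal roads.",
]
embed = make_bow_embedder(docs)
ranking = hybrid_search("Can the consumer return goods?", docs, embed, jaccard_rerank)
print([docs[i] for i in ranking])
```

Note that `embed` and `rerank` receive the raw text, so a sentence encoder such as SBERT could in principle be substituted without stripping punctuation or stop-words; only the BM25 branch relies on `tokenize`.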