A Semantic Search System for Supremo Tribunal de Justiça

Research for the thesis “A Semantic Search System for Supremo Tribunal de Justiça”, submitted to obtain the Bologna Master's Degree in Computer Science and Engineering (Taguspark) at Instituto Superior Técnico, Universidade de Lisboa.

This work was carried out as part of Project IRIS, developed by INESC-ID in collaboration with the Supremo Tribunal de Justiça (STJ).

Various assets developed during the project are available on Hugging Face and GitHub (thesis repository and IRIS GitHub).

Abstract

Information retrieval systems frequently rely on lexical approaches, which have limitations. These limitations are exacerbated in a specific domain, such as the legal one. Large language models, such as BERT, have a deep understanding of language and may overcome the limitations of older methods, such as BM25.

This work seeks to investigate the development of a Semantic Search System to assist the Supremo Tribunal de Justiça (Portuguese Supreme Court of Justice) in its decision-making process. It also provides information on the most relevant techniques for training a Large Language Model adapted to Portuguese jurisprudence.

Some of the difficulties faced by this study were due to a need for annotated data. Nonetheless, cutting-edge methodologies have yielded encouraging results when training Large Language Models using both unsupervised and supervised techniques.

We built a Semantic Search System that uses a specially trained BERT model (Legal-BERTimbau), as well as a hybrid Search System that combines lexical and semantic techniques by pairing BM25 with Legal-BERTimbau.
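To illustrate the general idea of a hybrid lexical and semantic ranking, here is a minimal sketch in Python. It is not the thesis implementation: the weighted sum of min-max normalized scores, the `alpha` parameter, the function names, and the toy 3-dimensional vectors (which stand in for embeddings that a model such as Legal-BERTimbau would produce) are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def min_max(xs):
    """Rescale scores to [0, 1] so lexical and semantic scores are comparable."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]


def hybrid_rank(query_tokens, query_vec, docs_tokens, doc_vecs, alpha=0.5):
    """Return document indices sorted by a weighted sum of both scores.

    alpha is a hypothetical mixing weight: 1.0 is BM25 only, 0.0 is semantic only.
    """
    lexical = min_max(bm25_scores(query_tokens, docs_tokens))
    semantic = min_max([cosine(query_vec, v) for v in doc_vecs])
    combined = [alpha * l + (1 - alpha) * s for l, s in zip(lexical, semantic)]
    return sorted(range(len(docs_tokens)), key=lambda i: combined[i], reverse=True)


# Toy corpus; the vectors are hand-made placeholders, not real model embeddings.
docs_tokens = [
    ["recurso", "de", "revista", "contrato"],
    ["responsabilidade", "civil", "acidente"],
    ["contrato", "de", "arrendamento", "despejo"],
]
doc_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.7, 0.0, 0.6]]
query_tokens = ["contrato", "arrendamento"]
query_vec = [0.6, 0.0, 0.8]

ranking = hybrid_rank(query_tokens, query_vec, docs_tokens, doc_vecs)
print(ranking)  # [2, 0, 1]
```

In this sketch, the third document ranks first because it matches both query terms lexically and is closest to the query vector. Other fusion strategies (for example, re-ranking BM25 candidates with the semantic model) are equally plausible and are not ruled out by this example.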

Acknowledgments

I want to express my sincerest gratitude and appreciation to my supervisors, Professor Pedro Alexandre Santos and Professor João Dias, who guided me throughout the entirety of this work and tirelessly offered much-needed advice, sound counsel and honest feedback. They were unfailingly reliable and kept in continuous touch with the progress of this thesis. They always provided input on the many updates, advancements, and setbacks that occurred, and I am genuinely grateful.

On the same note, I would like to thank all the Project IRIS members I worked with over the past months. The experience of taking part in a larger project with brilliant minds will shape my future ventures.

Lastly, there is no way I could wholly express in words the unwavering support and unconditional love I received from my family. To my mother and father, from the bottom of my heart… thank you.
