
A Semantic Search System For Supremo Tribunal de Justiça (Portuguese Supreme Court of Justice)

5. Legal Language Model

Under Construction

Creating the embeddings required a language model adapted to Portuguese and, more specifically, to the Portuguese legal domain. This chapter explains the approaches explored and the final implementations of Legal-BERTimbau. Advances in neural language representation have made it viable to transfer the “learned internal states of large pretrained language models” [SNL20] [SNL19]. Transfer learning is a technique in which a model trained on a general task is later fine-tuned on specific tasks.

BERT has a very large number of parameters, exceeding 300 M in the large version. Training a BERT model from scratch on a relatively small dataset would likely cause overfitting. In our case, with nearly 30 000 documents, the appropriate approach is to start from a model that already meets our needs and has been pretrained on a larger corpus than the one available in our domain.
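To make the scale argument concrete, the following minimal sketch counts the trainable parameters of a BERT encoder from its configuration and relates that count to the size of the corpus. The configuration values (vocabulary of 30 522 tokens, 512 positions, 2 segment types) are those of the original English BERT checkpoints and are assumed here for illustration only; BERTimbau uses its own Portuguese vocabulary, so its exact count differs slightly. The names `EncoderConfig` and `bert_param_count` are hypothetical helpers, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Hypothetical container for the sizes that determine the parameter count."""
    vocab_size: int
    hidden: int
    layers: int
    intermediate: int
    max_positions: int = 512
    type_vocab: int = 2


def bert_param_count(cfg: EncoderConfig) -> int:
    """Count weights and biases of a BERT encoder (embeddings, layers, pooler)."""
    h = cfg.hidden
    layer_norm = 2 * h  # scale and bias vectors

    # Token, position and segment embeddings, followed by one LayerNorm.
    embeddings = (cfg.vocab_size + cfg.max_positions + cfg.type_vocab) * h + layer_norm

    # Self-attention: query, key, value and output projections, plus LayerNorm.
    attention = 4 * (h * h + h) + layer_norm

    # Feed-forward block: expand to the intermediate size and project back, plus LayerNorm.
    feed_forward = (
        (h * cfg.intermediate + cfg.intermediate)
        + (cfg.intermediate * h + h)
        + layer_norm
    )

    # Pooler: one dense layer applied to the first token's representation.
    pooler = h * h + h

    return embeddings + cfg.layers * (attention + feed_forward) + pooler


# Assumed sizes, taken from the original English BERT releases.
BERT_BASE = EncoderConfig(vocab_size=30522, hidden=768, layers=12, intermediate=3072)
BERT_LARGE = EncoderConfig(vocab_size=30522, hidden=1024, layers=24, intermediate=4096)

NUM_DOCUMENTS = 30_000  # approximate corpus size stated in the text

for name, cfg in (("base", BERT_BASE), ("large", BERT_LARGE)):
    n = bert_param_count(cfg)
    print(f"BERT-{name}: {n:,} parameters, about {n // NUM_DOCUMENTS:,} per document")
```

Under these assumptions the large configuration comes to roughly 335 M parameters, which is several thousand parameters for every document in the corpus. That imbalance is why starting from a pretrained checkpoint and fine-tuning it is preferred over training from scratch.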

By Rui Melo
