4.1. Requirements#
As mentioned in Chapter 1, this work is a segment of Project IRIS. Consequently, there were some pre-defined aspects, such as which technologies the search system should be implemented in and what data was available to work with.
4.1.1. ElasticSearch#
Elasticsearch, released in 2010, is a distributed, open-source search and analytics engine built on Apache Lucene. It works as a NoSQL, JSON-document-based datastore. A user interacts with Elasticsearch much as with a REST API: every request, whether a POST, GET, or PUT, is sent in JSON format to a given index. An index stores documents in dedicated data structures and allows the user to partition the data within a specific namespace.
Elasticsearch uses a complex architecture to ensure the scalability and resilience of the system. It is composed of clusters of nodes, where each node is a single instance of Elasticsearch. It also makes use of shards, which are subsets of an index's documents. Shards allow the data of an index to be split to maintain good performance, and replicated to handle failures.
Elasticsearch was a pre-defined requirement, since the Project IRIS solution is based on the Elasticsearch engine. To apply it to our use case, it was necessary to understand how to use the provided engine with embeddings.
By default, Elasticsearch uses BM25 to search through documents. However, it also supports other scoring functions, such as cosine similarity. To use cosine similarity, Elasticsearch requires an initial mapping of the index that pre-defines its document structure and declares certain fields as dense vectors; these dense vector fields can store embeddings. This capability makes Elasticsearch an ideal foundation for a semantic search system, allowing huge volumes of data to be searched and analysed in near real-time.
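As a sketch, the mapping and query bodies for such an index could look as follows. The index structure, field names, and embedding dimension here are illustrative assumptions, not taken from the project:

```python
# Hypothetical mapping: one text field plus a dense_vector field for embeddings.
index_mapping = {
    "mappings": {
        "properties": {
            "sentence": {"type": "text"},
            # "dims" must match the output size of the embedding model.
            "embedding": {"type": "dense_vector", "dims": 768},
        }
    }
}

def cosine_query(query_vector):
    """Build a script_score query that ranks documents by cosine similarity
    between the query embedding and the stored 'embedding' field.
    The '+ 1.0' keeps scores non-negative, as Elasticsearch requires."""
    return {
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'embedding') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        }
    }

q = cosine_query([0.1] * 768)
```

Both bodies would be sent to Elasticsearch as JSON, the first when creating the index and the second on each search request.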
4.1.2. Data#
Legal documents contain specialised language not easily found in conventional websites or books. To create a semantic search system properly adapted to the Portuguese legal domain, a vast number of records had to be collected. Project IRIS members carried out the data collection through ecli-indexer (https://github.com/diogoalmiro/ecli-indexer).
ecli-indexer is a repository with multiple tools to extract documents from DGSI, a public database, and index them into Elasticsearch.
In a nutshell, the retrieval process recovered the HTML content of multiple web pages containing legal documents, amounting to 31690 documents.
The structure of each indexed document is as follows:
"""
{'_index': 'jurisprudencia.1.0',
'_id': '-B5mRoABpM44h1Fg-6QX',
'_score': 1.0,
'_source': {'ECLI': 'ECLI:PT:STJ:2022:251.18.1T8CSC.L2.S1',
'Tribunal': 'Supremo Tribunal de Justica',
'Processo': '251/18.1T8CSC.L2.S1',
'Relator': Relator 1',
'Data': '17/03/2022',
'Descritores': ['CONTRATO DE TRABALHO',
'CONTRATO DE PRESTACAO DE SERVICO'],
'Sumario': '\n<p>I- Subjacente ao contrato de trabalho existe uma relacao de dependencia necessaria ... \n</p><p>',
'Texto': '...<p><i>d) Deve a Re ser condenada a pagar ao Autor a diferenca entre os vencimentos pagos desde julho de 2011 e o vencimento que venha a ser determinado nos termos dos pedidos formulados em b) ou c) ... ',
'Tipo': 'Acordao',
'Original URL': 'http://www.dgsi.pt/jstj.nsf/12345',
'Votacao': 'UNANIMIDADE',
'Meio Processual': 'REVISTA',
'Seccao': '4a SECCAO',
'Especie': None,
'Decisao': '<b>NEGADA A REVISTA.</b>',
'Aditamento': None,
'Jurisprudencia': 'unknown',
'Origem': 'dgsi-indexer-STJ',
'Data do Acordao': '17/03/2022'}
"""
The data used came mainly from the "Texto" (Text) and "Sumário" (Summary) fields, which contain the HTML content of the legal document corpus. This data needed further processing to create a reliable semantic search system.
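Selecting these two fields from an indexed document can be sketched as follows; the helper name is illustrative, and the hit structure follows the example shown above:

```python
def extract_fields(hit):
    """Pull the two fields used to build the corpus from an indexed hit.
    Both fields still contain raw HTML at this stage."""
    source = hit["_source"]
    return {"texto": source.get("Texto", ""), "sumario": source.get("Sumario", "")}

# Minimal hit mimicking the indexed document structure.
hit = {"_source": {"Texto": "<p>...</p>", "Sumario": "<p>I- ...</p>", "Tipo": "Acordao"}}
fields = extract_fields(hit)
```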
Three dataset splits were generated in order to properly train, test, and validate the produced models: 80% for the training dataset, 10% for the testing dataset, and 10% for the validation dataset. The dataset was published on the HuggingFace platform \footnotemark[14] to facilitate model reproducibility and future project use. The splits were as follows:
Training dataset – 25352 documents
Testing dataset – 3169 documents
Validation dataset – 3169 documents
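A minimal sketch of such an 80/10/10 split, assuming a simple shuffled partition at the document level (the seed and function name are illustrative):

```python
import random

def split_dataset(docs, seed=42):
    """Shuffle the documents and split them 80% / 10% / 10%
    into training, testing, and validation sets."""
    docs = list(docs)
    random.Random(seed).shuffle(docs)  # fixed seed for reproducibility
    n = len(docs)
    n_test = n // 10
    n_val = n // 10
    train = docs[: n - n_test - n_val]
    test = docs[n - n_test - n_val : n - n_val]
    val = docs[n - n_val :]
    return train, test, val

# 10% of the 31690 documents gives 3169 for each of test and validation.
train, test, val = split_dataset(range(31690))
```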
4.1.2.1. Data Processing#
With all the documents properly indexed, the available text had to be cleaned and split into multiple sentences. Our semantic search system operates on individual sentences, as explored further in Chapters 4 and 5.
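The sentence-splitting step can be sketched with a naive punctuation-based splitter (the actual project tooling is not specified here, so this is only an assumed approach):

```python
import re

def split_sentences(text):
    """Naively split cleaned text into sentences at sentence-final
    punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```

A production system would typically rely on a language-aware sentence tokenizer, since legal Portuguese contains abbreviations and citations that break this simple rule.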
Firstly, HTML tags had to be removed, as well as some unexpected characters, such as "&". Roman numerals used for enumeration had to be identified and removed from the text and, more importantly, section and subsection titles had to be excluded.
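A minimal sketch of this cleaning step, assuming tag stripping with a regular expression, HTML-entity decoding, and removal of Roman-numeral enumeration markers (the exact patterns used by the project are not specified):

```python
import html
import re

def clean_text(raw_html):
    """Strip HTML tags, decode entities such as '&amp;', and drop
    Roman-numeral enumeration markers (e.g. 'I- ') at line starts."""
    text = re.sub(r"<[^>]+>", " ", raw_html)  # remove HTML tags
    text = html.unescape(text)                # decode '&amp;' -> '&', etc.
    # Remove Roman numerals followed by a dash, dot, or parenthesis.
    text = re.sub(r"(?m)^\s*[IVXLCDM]+\s*[-.)]\s*", "", text)
    return re.sub(r"\s+", " ", text).strip()  # normalise whitespace
```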
Texts also contained references to other sections, such as "como referido em a) e b)" ("as mentioned in a) and b)"). This example shows a problem a semantic search system can face: it is difficult to signal to the system that relevant information exists outside that specific section. Even if the system identifies that information from another section is being referenced, it is not clear how it should handle it. A solution to this problem could start by incorporating a summarisation technique to bring the information together in one place.
The implemented solution acknowledges this situation but takes a more straightforward approach: in the data pre-processing step, these occurrences are removed from the text. Even though not all the initial information is kept, the focus was on further cleaning the text to improve the search system.
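This removal step can be sketched as follows; the regular expression covers only the quoted example pattern and is an assumption, since the full set of cross-reference forms handled by the project is not listed here:

```python
import re

# Matches cross-references like "como referido em a)" or "como referido em a) e b)".
CROSS_REF = re.compile(
    r"como referid[oa] em\s+[a-z]\)(\s+e\s+[a-z]\))?", flags=re.IGNORECASE
)

def drop_cross_references(sentence):
    """Remove intra-document cross-references the search system cannot resolve."""
    return CROSS_REF.sub("", sentence).strip()
```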