3.2. Deeper Text Understanding

Related work on information retrieval by Zhuyun Dai and Jamie Callan [DC19] (2019) suggests that BERT, and more specifically its SBERT modification, captures the meaning of sentences or document portions best when applied to short spans of text, since it would be “less effective on long text”. This implies that documents are embedded more effectively as small passages: paragraphs, portions of paragraphs, or individual phrases.
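A minimal sketch of passage-level splitting, assuming whitespace tokenization and an overlapping sliding window. The function name `split_into_passages` and the default `window` and `stride` values are illustrative assumptions, not settings confirmed by this text, and should be tuned to the documents at hand.

```python
from typing import List


def split_into_passages(text: str, window: int = 150, stride: int = 75) -> List[str]:
    """Split text into overlapping passages of at most `window` words.

    Consecutive passages start `stride` words apart, so they overlap
    whenever `stride` is smaller than `window`.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")

    words = text.split()  # naive whitespace tokenization (assumption)
    passages = []
    for start in range(0, len(words), stride):
        passages.append(" ".join(words[start:start + window]))
        # Stop once the current window already reaches the end of the text.
        if start + window >= len(words):
            break
    return passages
```

Each returned passage can then be embedded on its own instead of embedding the whole document at once.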

In [DC19], Dai and Callan evaluated retrieval performance on two datasets (Robust04 and ClueWeb09-B) using several ways of turning passage scores into a document score: the score of the first passage (BERT-FirstP), the score of the best passage (BERT-MaxP), or the sum of all passage scores (BERT-SumP). The paper showed that simply taking the best-scoring passage gives better results when the dataset consists of well-written text (Robust04), since the model can then capture the context and proper meaning.
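The three ways of combining passage scores into a document score can be sketched as follows. This is an illustration only: the passage scores are assumed to come from some relevance model that is not shown here, and the helper `rank_documents` and the example scores are hypothetical.

```python
from typing import Callable, Dict, List

# Each aggregator maps the list of passage scores of one document
# to a single document-level score.
AGGREGATORS: Dict[str, Callable[[List[float]], float]] = {
    "BERT-FirstP": lambda scores: scores[0],   # score of the first passage
    "BERT-MaxP": lambda scores: max(scores),   # score of the best passage
    "BERT-SumP": lambda scores: sum(scores),   # sum of all passage scores
}


def rank_documents(passage_scores: Dict[str, List[float]], method: str) -> List[str]:
    """Return document ids sorted by aggregated score, best first."""
    aggregate = AGGREGATORS[method]
    doc_scores = {
        doc_id: aggregate(scores)
        for doc_id, scores in passage_scores.items()
        if scores  # skip documents without any scored passage
    }
    return sorted(doc_scores, key=lambda doc_id: doc_scores[doc_id], reverse=True)


# Made-up scores for two documents with three passages each.
example_scores = {
    "doc_a": [0.2, 0.9, 0.1],
    "doc_b": [0.5, 0.4, 0.6],
}

for name in AGGREGATORS:
    print(name, rank_documents(example_scores, name))
```

With these made-up scores, the best-passage strategy ranks `doc_a` first because of its single strong passage, whereas the other two strategies prefer `doc_b`.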

Another interesting approach studied in this research was prepending the title to every passage to provide context. The results of this technique could have been better. Nonetheless, it is worth exploring further, either to provide more context or to insert, in a way, metadata into the embeddings.
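Prepending the title to each passage could look like the sketch below. The function name `add_title_context` and the separator are assumptions made for illustration; the same pattern could carry other metadata in place of the title.

```python
from typing import List


def add_title_context(title: str, passages: List[str], separator: str = ". ") -> List[str]:
    """Prefix every passage with the document title before embedding it."""
    title = title.strip()
    if not title:
        # Nothing to prepend: return the passages unchanged (as a copy).
        return list(passages)
    return [f"{title}{separator}{passage}" for passage in passages]


passages_with_title = add_title_context(
    "Contract Law",
    ["First passage.", "Second passage."],
)
```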

A Semantic Search System For Supremo Tribunal de Justiça (Portuguese Supreme Court of Justice)
