2.1. Lexical approaches for Information Retrieval
Traditionally, when a system searches for document content related to a query, it uses techniques that look for documents or entries containing those exact query words. This process is called lexical search. The drawback of this approach is that important passages or documents may not be retrieved, because the exact query words must be present. In a conversation, we might say: “I walked through the public garden. I enjoyed it.” We easily understand that “it” refers to the “public garden”; however, it is complex to pass that information to the computer.
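As a minimal illustration of this limitation (the tokenizer and toy sentences below are ours, not from any particular search library), an exact-match lexical search misses the second sentence even though it is about the garden:

import re

def tokenize(text):
    # lowercase and keep only word characters
    return set(re.findall(r"\w+", text.lower()))

def lexical_search(query, documents):
    # return the documents containing every exact query word
    return [doc for doc in documents if tokenize(query) <= tokenize(doc)]

documents = ["I walked through the public garden.", "I enjoyed it."]
print(lexical_search("public garden", documents))
# ['I walked through the public garden.'] -- "I enjoyed it." is missed,
# even though "it" refers to the public garden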
2.1.1. Term Frequency Algorithm
Term Frequency-Inverse Document Frequency (TF-IDF) [tfi] is a ranking function for document search and information retrieval. It evaluates how relevant a term \(t\) is to a document \(d\) within a collection of \(N\) documents. TF-IDF is defined as follows:

\[\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)\]

TF-IDF is composed of two parts:
Term Frequency (TF)
Inverse Document Frequency (IDF)
TF represents how often a term appears in a specific document, normalized by the document's length. It is calculated as follows:

\[\text{TF}(t, d) = \frac{\text{freq}(t, d)}{\sum_{t' \in d} \text{freq}(t', d)}\]

where \(\text{freq}(t,d)\) represents the frequency of the term \(t\) within the document \(d\).
IDF represents the rarity of a term in the entire group of documents: values near 0 indicate terms that appear in almost every document, while higher values indicate rarer terms. IDF is defined as follows:

\[\text{IDF}(t) = \log\left(\frac{N}{\text{df}(t)}\right)\]

where \(\text{df}(t)\) is the number of documents in which the term \(t\) occurs.
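As a minimal sketch of these two formulas (the toy corpus and helper names below are illustrative, not part of the notebook's dataset), TF-IDF can be computed directly:

import math

def tf(term, doc_tokens):
    # term frequency: raw count normalized by document length
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    # inverse document frequency: log of N over the number of documents containing the term
    n_docs_with_term = sum(term in doc for doc in corpus_tokens)
    return math.log(len(corpus_tokens) / n_docs_with_term)

corpus = ["the public garden", "the public court hearing", "the garden party"]
tokens = [doc.split() for doc in corpus]
print(tf("garden", tokens[0]) * idf("garden", tokens))  # (1/3) * log(3/2) ≈ 0.135

In practice, the notebook below relies on scikit-learn's TfidfVectorizer, which implements a smoothed variant of the same idea, applied here to a dataset of query paragraphs: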
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
df_queries_examples = pd.read_csv('../Data/queries.csv')
df_queries_examples.head()
| | Paragraph |
|---|---|
| 0 | Dá-se a preterição, quando o cabeça de casal d... |
| 1 | Para interpretarmos correctamente a parte deci... |
| 2 | A documentação só será deficiente quando não p... |
| 3 | Interrompida a prescrição com a citação do exe... |
| 4 | Não estando o laudo de junta médica elaborado ... |
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('portuguese'))
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\Rui\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!
def get_tf_idf(text_files, stop_words, redundant_words, Original_df):
    # text_files: the text column (Series) on which TF-IDF is computed
    # stop_words: the list of stop words to be removed
    # redundant_words: the list of redundant words to be removed (currently unused)
    # Original_df: the original dataframe to which the keywords are added (the one returned)
    top_n = 5
    tfidf_vectorizer = TfidfVectorizer(stop_words=list(stop_words))
    tfidf_vector = tfidf_vectorizer.fit_transform(text_files)
    tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=Original_df.index,
                            columns=tfidf_vectorizer.get_feature_names_out())
    # reshape to one (document, term, tfidf) row per term
    tfidf_df = tfidf_df.stack().reset_index()
    tfidf_df = tfidf_df.rename(columns={0: 'tfidf', 'level_0': 'document', 'level_1': 'term'})
    # keep the top_n highest-scoring terms per document
    top_tfidf = tfidf_df.sort_values(by=['document', 'tfidf'], ascending=[True, False]).groupby(['document']).head(top_n)
    # group the top_n results by document and concatenate the terms into a string
    top_tfidf2 = top_tfidf.groupby(['document'])['term'].apply(' '.join).reset_index()
    top_tfidf2.drop(columns=['document'], inplace=True)
    # join the top_n terms with the original document
    Original_df['top_' + str(top_n) + '_keywords'] = top_tfidf2['term'].reset_index(drop=True)
    Original_df['Index'] = Original_df.index
    return Original_df
#df_queries_examples = get_tf_idf(df_all_docs['Paragraph'], stop_words, redundant_words, df_queries_examples)
#df_queries_examples
df_queries_examples = get_tf_idf(df_queries_examples['Paragraph'], stop_words, [], df_queries_examples)
df_queries_examples
| | Paragraph | top_5_keywords | Index |
|---|---|---|---|
| 0 | Dá-se a preterição, quando o cabeça de casal d... | alguém cabeça casal herdeiro qualidade | 0 |
| 1 | Para interpretarmos correctamente a parte deci... | parte analisar antecedentes atenta decisória | 1 |
| 2 | A documentação só será deficiente quando não p... | sentido algumas captação comprometa daí | 2 |
| 3 | Interrompida a prescrição com a citação do exe... | executado prescrição efeitos interrompida irre... | 3 |
| 4 | Não estando o laudo de junta médica elaborado ... | junta médica perícia estando realização | 4 |
| ... | ... | ... | ... |
| 95 | Existe o erro notório na apreciação da prova, ... | apreciação arbitrários baseada conjugada contr... | 95 |
| 96 | Por outro lado, tratando-se no caso concreto, ... | jurisdicional conselho duplo lado levantava | 96 |
| 97 | Nos termos do disposto no art. 4, n 4, alínea ... | etaf acidente resultantes alínea visando | 97 |
| 98 | Para que exista "dupla conformidade relevante ... | decisório exista segmento autónomo dupla | 98 |
| 99 | Não se justifica admitir revista que se centra... | centra exclusivamente inconstitucionalidades j... | 99 |

100 rows × 3 columns
2.1.1.1. Visualize keywords
Adapted from https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/03-TF-IDF-Scikit-Learn.html.
# Only legible with a small sample (around 100 documents at most)
import numpy as np
import altair as alt

top_n = 5
tfidf_vectorizer = TfidfVectorizer(stop_words=list(stop_words))
tfidf_vector = tfidf_vectorizer.fit_transform(df_queries_examples['Paragraph'])
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=df_queries_examples.index,
                        columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df = tfidf_df.stack().reset_index()
tfidf_df = tfidf_df.rename(columns={0: 'tfidf', 'level_0': 'document', 'level_1': 'term'})
top_tfidf = tfidf_df.sort_values(by=['document', 'tfidf'], ascending=[True, False]).groupby(['document']).head(top_n)
top_tfidf
| | document | term | tfidf |
|---|---|---|---|
| 122 | 0 | alguém | 0.329389 |
| 229 | 0 | cabeça | 0.329389 |
| 241 | 0 | casal | 0.329389 |
| 706 | 0 | herdeiro | 0.329389 |
| 1198 | 0 | qualidade | 0.302254 |
| ... | ... | ... | ... |
| 149052 | 99 | centra | 0.473740 |
| 149417 | 99 | exclusivamente | 0.473740 |
| 149551 | 99 | inconstitucionalidades | 0.473740 |
| 149655 | 99 | justifica | 0.407025 |
| 148893 | 99 | admitir | 0.293579 |

500 rows × 3 columns
# Terms in this list will get a red dot in the visualization
term_list = []

# add a little randomness to break ties in term ranking
top_tfidf_plusRand = top_tfidf.copy()
top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf.shape[0]) * 0.0001
top_tfidf_plusRand
| | document | term | tfidf |
|---|---|---|---|
| 122 | 0 | alguém | 0.329428 |
| 229 | 0 | cabeça | 0.329453 |
| 241 | 0 | casal | 0.329425 |
| 706 | 0 | herdeiro | 0.329459 |
| 1198 | 0 | qualidade | 0.302344 |
| ... | ... | ... | ... |
| 149052 | 99 | centra | 0.473770 |
| 149417 | 99 | exclusivamente | 0.473794 |
| 149551 | 99 | inconstitucionalidades | 0.473803 |
| 149655 | 99 | justifica | 0.407049 |
| 148893 | 99 | admitir | 0.293615 |

500 rows × 3 columns
# base for all visualizations, with rank calculation
base = alt.Chart(top_tfidf_plusRand).encode(
x = 'rank:O',
y = 'document:N'
).transform_window(
rank = "rank()",
sort = [alt.SortField("tfidf", order="descending")],
groupby = ["document"],
)
# heatmap specification
heatmap = base.mark_rect().encode(
color = 'tfidf:Q'
)
# red circle over terms in above list
circle = base.mark_circle(size=300).encode(
color = alt.condition(
alt.FieldOneOfPredicate(field='term', oneOf=term_list),
alt.value('red'),
alt.value('#FFFFFF00')
)
)
# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
text = 'term:N',
color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)
# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 1000)
2.1.2. Best Matching Algorithm
Okapi BM25 (BM25) [RZ09], in which BM stands for Best Matching, is a ranking function commonly used in search engines such as Elasticsearch (explored later in the document) to estimate the relevance of documents given a query. BM25 follows a bag-of-words logic: it ranks a collection of documents based on which query terms appear in each document, independently of their position within it. The equation is as follows:

\[\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}\]

where \(Q\) represents the given query (with \(q_1, ..., q_n\) being its keywords) and \(D\) represents the document. \(f(q_i, D)\) represents the frequency of \(q_i\) in the document \(D\), which has length \(|D|\), and \(\text{avgdl}\) is the average document length in the collection. Both \(k_1\) and \(b\) are free parameters used to tune the BM25 function.
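A minimal sketch of this scoring function (the toy corpus, the values \(k_1 = 1.5\) and \(b = 0.75\), and the smoothed IDF variant in the code are common choices, not prescribed by the text):

import math

def bm25_score(query_terms, doc_tokens, corpus_tokens, k1=1.5, b=0.75):
    # corpus_tokens is a list of tokenized documents; doc_tokens is the document D being scored
    N = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / N
    score = 0.0
    for q in query_terms:
        df = sum(q in d for d in corpus_tokens)          # number of documents containing q
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # BM25's smoothed IDF
        f = doc_tokens.count(q)                          # f(q_i, D)
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score

corpus = [doc.split() for doc in ["public garden walk", "court hearing record", "a party in the public garden"]]
print(bm25_score("public garden".split(), corpus[0], corpus))

Search engines such as Elasticsearch implement this scoring internally, so it rarely needs to be written by hand.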
2.1.3. Distance metrics for lexical approaches
A traditional way to search for similar documents is through the Jaccard similarity between two sets. It evaluates the amount of shared information or content between both sets: in practice, the size of their intersection divided by the size of their union. It is defined as follows:

\[J(C_1, C_2) = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}\]

where \(C_1\) and \(C_2\) represent two sets.
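A minimal sketch, assuming each document is represented as a set of tokens (the example sentences are illustrative):

def jaccard_similarity(c1, c2):
    # |C1 ∩ C2| / |C1 ∪ C2|
    return len(c1 & c2) / len(c1 | c2)

c1 = set("I walked through the public garden".lower().split())
c2 = set("the public garden was crowded".lower().split())
print(jaccard_similarity(c1, c2))  # 3 shared tokens out of 8 distinct ones = 0.375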