2.1. Lexical approaches for Information Retrieval
Traditionally, when a system searches for document content related to a query, it uses techniques that look for documents or entries containing those exact query words. This process is called lexical search. The drawback of this approach is that important passages or documents may not be retrieved, because the exact query words must be present. In a conversation, we might say: “I walked through the public garden. I enjoyed it.” We easily understand that “it” refers to the “public garden”; however, it is complex to pass that information to the computer.
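As a minimal illustration of this limitation (the tokenizer and toy sentences below are ours, not from any particular search library), an exact-match lexical search misses the second sentence even though it is about the garden:

import re

def tokenize(text):
    # lowercase and keep only word characters
    return set(re.findall(r"\w+", text.lower()))

def lexical_search(query, documents):
    # return the documents containing every exact query word
    return [doc for doc in documents if tokenize(query) <= tokenize(doc)]

documents = ["I walked through the public garden.", "I enjoyed it."]
print(lexical_search("public garden", documents))
# ['I walked through the public garden.'] -- "I enjoyed it." is missed,
# even though "it" refers to the public garden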
2.1.1. Term Frequency Algorithm
Term Frequency-Inverse Document Frequency (TF-IDF) [tfi] is a ranking function for document search and information retrieval. It evaluates how relevant a term \(t\) is to a document \(d\) within a collection of \(N\) documents. TF-IDF is defined as follows:

\[\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)\]

TF-IDF is composed of two parts:
Term Frequency (TF)
Inverse Document Frequency (IDF)
TF represents how often a term appears in a specific document, normalized by the document's length. It is calculated as follows:

\[\text{TF}(t, d) = \frac{\text{freq}(t, d)}{\sum_{t' \in d} \text{freq}(t', d)}\]

where \(\text{freq}(t,d)\) represents the frequency of the term \(t\) within the document \(d\).
IDF represents the rarity of a term in the entire group of documents: values near 0 indicate terms that appear in almost every document, while higher values indicate rarer terms. IDF is defined as follows:

\[\text{IDF}(t) = \log\left(\frac{N}{\text{df}(t)}\right)\]

where \(\text{df}(t)\) is the number of documents in which the term \(t\) occurs.
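As a minimal sketch of these two formulas (the toy corpus and helper names below are illustrative, not part of the notebook's dataset), TF-IDF can be computed directly:

import math

def tf(term, doc_tokens):
    # term frequency: raw count normalized by document length
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    # inverse document frequency: log of N over the number of documents containing the term
    n_docs_with_term = sum(term in doc for doc in corpus_tokens)
    return math.log(len(corpus_tokens) / n_docs_with_term)

corpus = ["the public garden", "the public court hearing", "the garden party"]
tokens = [doc.split() for doc in corpus]
print(tf("garden", tokens[0]) * idf("garden", tokens))  # (1/3) * log(3/2) ≈ 0.135

In practice, the notebook below relies on scikit-learn's TfidfVectorizer, which implements a smoothed variant of the same idea, applied here to a dataset of query paragraphs: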
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
df_queries_examples = pd.read_csv('../Data/queries.csv')
df_queries_examples.head()
| | Paragraph |
|---|---|
| 0 | Dá-se a preterição, quando o cabeça de casal d... |
| 1 | Para interpretarmos correctamente a parte deci... |
| 2 | A documentação só será deficiente quando não p... |
| 3 | Interrompida a prescrição com a citação do exe... |
| 4 | Não estando o laudo de junta médica elaborado ... |
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('portuguese'))
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\Rui\AppData\Roaming\nltk_data...
[nltk_data] Package stopwords is already up-to-date!
def get_tf_idf(text_files, stop_words, redundant_words, Original_df):
    # text_files: the text column (Series) on which TF-IDF is computed
    # stop_words: the list of stop words to be removed
    # redundant_words: the list of redundant words to be removed (currently unused)
    # Original_df: the original dataframe to which the keywords are added (the one returned)
    top_n = 5
    tfidf_vectorizer = TfidfVectorizer(stop_words=list(stop_words))
    tfidf_vector = tfidf_vectorizer.fit_transform(text_files)
    tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=Original_df.index,
                            columns=tfidf_vectorizer.get_feature_names_out())
    # reshape to one (document, term, tfidf) row per term
    tfidf_df = tfidf_df.stack().reset_index()
    tfidf_df = tfidf_df.rename(columns={0: 'tfidf', 'level_0': 'document', 'level_1': 'term'})
    # keep the top_n highest-scoring terms per document
    top_tfidf = tfidf_df.sort_values(by=['document', 'tfidf'], ascending=[True, False]).groupby(['document']).head(top_n)
    # group the top_n results by document and concatenate the terms into a string
    top_tfidf2 = top_tfidf.groupby(['document'])['term'].apply(' '.join).reset_index()
    top_tfidf2.drop(columns=['document'], inplace=True)
    # join the top_n terms with the original document
    Original_df['top_' + str(top_n) + '_keywords'] = top_tfidf2['term'].reset_index(drop=True)
    Original_df['Index'] = Original_df.index
    return Original_df
#df_queries_examples = get_tf_idf(df_all_docs['Paragraph'], stop_words, redundant_words, df_queries_examples)
#df_queries_examples
df_queries_examples = get_tf_idf(df_queries_examples['Paragraph'], stop_words, [], df_queries_examples)
df_queries_examples
| | Paragraph | top_5_keywords | Index |
|---|---|---|---|
| 0 | Dá-se a preterição, quando o cabeça de casal d... | alguém cabeça casal herdeiro qualidade | 0 |
| 1 | Para interpretarmos correctamente a parte deci... | parte analisar antecedentes atenta decisória | 1 |
| 2 | A documentação só será deficiente quando não p... | sentido algumas captação comprometa daí | 2 |
| 3 | Interrompida a prescrição com a citação do exe... | executado prescrição efeitos interrompida irre... | 3 |
| 4 | Não estando o laudo de junta médica elaborado ... | junta médica perícia estando realização | 4 |
| ... | ... | ... | ... |
| 95 | Existe o erro notório na apreciação da prova, ... | apreciação arbitrários baseada conjugada contr... | 95 |
| 96 | Por outro lado, tratando-se no caso concreto, ... | jurisdicional conselho duplo lado levantava | 96 |
| 97 | Nos termos do disposto no art. 4, n 4, alínea ... | etaf acidente resultantes alínea visando | 97 |
| 98 | Para que exista "dupla conformidade relevante ... | decisório exista segmento autónomo dupla | 98 |
| 99 | Não se justifica admitir revista que se centra... | centra exclusivamente inconstitucionalidades j... | 99 |

100 rows × 3 columns
2.1.1.1. Visualize keywords
Adapted from https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/03-TF-IDF-Scikit-Learn.html.
# Only legible with a small sample (around 100 documents at most)
import numpy as np
import altair as alt

top_n = 5
tfidf_vectorizer = TfidfVectorizer(stop_words=list(stop_words))
tfidf_vector = tfidf_vectorizer.fit_transform(df_queries_examples['Paragraph'])
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=df_queries_examples.index,
                        columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df = tfidf_df.stack().reset_index()
tfidf_df = tfidf_df.rename(columns={0: 'tfidf', 'level_0': 'document', 'level_1': 'term'})
top_tfidf = tfidf_df.sort_values(by=['document', 'tfidf'], ascending=[True, False]).groupby(['document']).head(top_n)
top_tfidf
| | document | term | tfidf |
|---|---|---|---|
| 122 | 0 | alguém | 0.329389 |
| 229 | 0 | cabeça | 0.329389 |
| 241 | 0 | casal | 0.329389 |
| 706 | 0 | herdeiro | 0.329389 |
| 1198 | 0 | qualidade | 0.302254 |
| ... | ... | ... | ... |
| 149052 | 99 | centra | 0.473740 |
| 149417 | 99 | exclusivamente | 0.473740 |
| 149551 | 99 | inconstitucionalidades | 0.473740 |
| 149655 | 99 | justifica | 0.407025 |
| 148893 | 99 | admitir | 0.293579 |

500 rows × 3 columns
# Terms in this list will get a red dot in the visualization
term_list = []

# add a little randomness to break ties in term ranking
top_tfidf_plusRand = top_tfidf.copy()
top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf.shape[0]) * 0.0001
top_tfidf_plusRand
| | document | term | tfidf |
|---|---|---|---|
| 122 | 0 | alguém | 0.329428 |
| 229 | 0 | cabeça | 0.329453 |
| 241 | 0 | casal | 0.329425 |
| 706 | 0 | herdeiro | 0.329459 |
| 1198 | 0 | qualidade | 0.302344 |
| ... | ... | ... | ... |
| 149052 | 99 | centra | 0.473770 |
| 149417 | 99 | exclusivamente | 0.473794 |
| 149551 | 99 | inconstitucionalidades | 0.473803 |
| 149655 | 99 | justifica | 0.407049 |
| 148893 | 99 | admitir | 0.293615 |

500 rows × 3 columns
# base for all visualizations, with rank calculation
base = alt.Chart(top_tfidf_plusRand).encode(
x = 'rank:O',
y = 'document:N'
).transform_window(
rank = "rank()",
sort = [alt.SortField("tfidf", order="descending")],
groupby = ["document"],
)
# heatmap specification
heatmap = base.mark_rect().encode(
color = 'tfidf:Q'
)
# red circle over terms in above list
circle = base.mark_circle(size=300).encode(
color = alt.condition(
alt.FieldOneOfPredicate(field='term', oneOf=term_list),
alt.value('red'),
alt.value('#FFFFFF00')
)
)
# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
text = 'term:N',
color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)
# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 1000)
2.1.2. Best Matching Algorithm
Okapi BM25 (BM25) [RZ09], in which BM stands for Best Matching, is a ranking function commonly used in search engines such as Elasticsearch (explored later in the document) to estimate the relevance of documents given a query. BM25 follows a bag-of-words logic: it ranks a collection of documents based on which query terms appear in each document, independently of their position within it. The equation is as follows:

\[\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}\]

where \(Q\) represents the given query (with \(q_1, ..., q_n\) being its keywords) and \(D\) represents the document. \(f(q_i, D)\) represents the frequency of \(q_i\) in the document \(D\), which has length \(|D|\), and \(\text{avgdl}\) is the average document length in the collection. Both \(k_1\) and \(b\) are free parameters used to tune the BM25 function.
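A minimal sketch of this scoring function (the toy corpus, the values \(k_1 = 1.5\) and \(b = 0.75\), and the smoothed IDF variant in the code are common choices, not prescribed by the text):

import math

def bm25_score(query_terms, doc_tokens, corpus_tokens, k1=1.5, b=0.75):
    # corpus_tokens is a list of tokenized documents; doc_tokens is the document D being scored
    N = len(corpus_tokens)
    avgdl = sum(len(d) for d in corpus_tokens) / N
    score = 0.0
    for q in query_terms:
        df = sum(q in d for d in corpus_tokens)          # number of documents containing q
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # BM25's smoothed IDF
        f = doc_tokens.count(q)                          # f(q_i, D)
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc_tokens) / avgdl))
    return score

corpus = [doc.split() for doc in ["public garden walk", "court hearing record", "a party in the public garden"]]
print(bm25_score("public garden".split(), corpus[0], corpus))

Search engines such as Elasticsearch implement this scoring internally, so it rarely needs to be written by hand.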
2.1.3. Distance metrics for lexical approaches
A traditional way to search for similar documents is through the Jaccard similarity between two sets. It evaluates the amount of shared information or content between both sets: in practice, the size of their intersection divided by the size of their union. It is defined as follows:

\[J(C_1, C_2) = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}\]

where \(C_1\) and \(C_2\) represent two sets.
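A minimal sketch, assuming each document is represented as a set of tokens (the example sentences are illustrative):

def jaccard_similarity(c1, c2):
    # |C1 ∩ C2| / |C1 ∪ C2|
    return len(c1 & c2) / len(c1 | c2)

c1 = set("I walked through the public garden".lower().split())
c2 = set("the public garden was crowded".lower().split())
print(jaccard_similarity(c1, c2))  # 3 shared tokens out of 8 distinct ones = 0.375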