{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lexical approaches for Information Retrieval\n", "\n", "Traditionally, when a system searches for document content related to a query, it uses techniques that look for documents or entries containing those exact query words. This process is called lexical search. The drawback of this approach is that important passages or documents may not be retrieved simply because they do not contain the exact query words. In a conversation, we might say: “I walked through the public garden. I enjoyed it.” We easily understand that “it” refers to the “public garden”; however, conveying that information to a computer is complex.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Term Frequency-Inverse Document Frequency Algorithm\n", "\n", "Term Frequency-Inverse Document Frequency (TF-IDF){cite}`tfidf` is a ranking function for document search and information retrieval. It evaluates how relevant a term ``t`` is to a document ``d`` within a collection ``D`` of ``N`` documents. TF-IDF is defined as follows:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", " \\text{TF-IDF}(t,d, D) = \\text{TF}(t,d) \\times \\text{IDF}(t,D)\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TF-IDF is composed of two parts:\n", "- Term Frequency (TF)\n", "- Inverse Document Frequency (IDF)\n", "\n", "TF represents how often a term appears in a specific document. 
It is calculated as follows:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "\\text{TF}(t,d) = \\log(1+\\text{freq}(t,d))\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where $\\text{freq}(t,d)$ represents the frequency of the term $t$ within the document $d$.\n", "\n", "IDF represents how rare a term is across the entire collection of documents: values near 0 indicate very common terms, while larger values indicate rarer terms.\n", "IDF is defined as follows:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "$$\n", "\\text{IDF}(t,D) = \\log\\left(\\frac{N}{\\text{count}(d \\in D : t \\in d)}\\right)\n", "$$" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "from sklearn.feature_extraction.text import TfidfVectorizer" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
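To make the two components concrete, here is a minimal sketch of the formulas above applied to a tiny toy corpus (the corpus and helper functions are illustrative only, not part of the notebook's dataset; note that scikit-learn's `TfidfVectorizer`, used later in this notebook, applies a smoothed variant of these formulas):

```python
import math

# A toy corpus (illustrative only) to walk through the TF and IDF formulas above.
docs = [
    "the public garden was quiet",
    "the court issued a decision",
    "the decision about the garden",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf(term, doc_tokens):
    # TF(t, d) = log(1 + freq(t, d))
    return math.log(1 + doc_tokens.count(term))

def idf(term):
    # IDF(t, D) = log(N / count(d in D : t in d))
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(N / df)

def tf_idf(term, doc_tokens):
    return tf(term, doc_tokens) * idf(term)

# "garden" appears in 2 of the 3 documents, while "court" appears in only 1,
# so "court" receives the larger IDF weight: log(3/2) vs. log(3/1).
print(tf_idf("garden", tokenized[0]), tf_idf("court", tokenized[1]))
```

Because "court" is rarer in the collection, it gets a higher TF-IDF score than the more common "garden", even though both occur once in their respective documents.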
" ], "text/plain": [ " Paragraph\n", "0 Dá-se a preterição, quando o cabeça de casal d...\n", "1 Para interpretarmos correctamente a parte deci...\n", "2 A documentação só será deficiente quando não p...\n", "3 Interrompida a prescrição com a citação do exe...\n", "4 Não estando o laudo de junta médica elaborado ..." ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_queries_examples = pd.read_csv('../Data/queries.csv')\n", "df_queries_examples.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to\n", "[nltk_data] C:\\Users\\Rui\\AppData\\Roaming\\nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] } ], "source": [ "import nltk\n", "nltk.download('stopwords')\n", "from nltk.corpus import stopwords\n", "stop_words = set(stopwords.words('portuguese'))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def get_tf_idf(text_files, stop_words, redundant_words, Original_df):\n", " # text_files: the series of texts to compute TF-IDF over\n", " # stop_words: the list of stop words to be removed\n", " # redundant_words: the list of redundant words to be removed (currently unused)\n", " # Original_df: the original dataframe to which the keywords are added (the one that is returned)\n", "\n", " top_n = 5\n", "\n", " tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words)\n", " tfidf_vector = tfidf_vectorizer.fit_transform(text_files)\n", "\n", " tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=Original_df.index, columns=tfidf_vectorizer.get_feature_names_out())\n", " tfidf_df = tfidf_df.stack().reset_index()\n", " tfidf_df = tfidf_df.rename(columns={0: 'tfidf', 'level_0': 'document', 'level_1': 'term'})\n", " top_tfidf = tfidf_df.sort_values(by=['document','tfidf'], 
ascending=[True,False]).groupby(['document']).head(top_n)\n", "\n", " # group the top top_n terms by document and concatenate them into a single string\n", " top_tfidf2 = top_tfidf.groupby(['document'])['term'].apply(lambda x: ' '.join(x)).reset_index()\n", " top_tfidf2.drop(columns=['document'], inplace=True)\n", "\n", " # join the top top_n terms back onto the original dataframe\n", " Original_df['top_'+str(top_n)+'_keywords'] = top_tfidf2['term'].reset_index(drop=True)\n", " Original_df['Index'] = Original_df.index\n", "\n", " return Original_df\n", "\n", "#df_queries_examples = get_tf_idf(df_all_docs['Paragraph'], stop_words, redundant_words, df_queries_examples)\n", "#df_queries_examples" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Paragraph \\n", "0 Dá-se a preterição, quando o cabeça de casal d... \n", "1 Para interpretarmos correctamente a parte deci... \n", "2 A documentação só será deficiente quando não p... \n", "3 Interrompida a prescrição com a citação do exe... \n", "4 Não estando o laudo de junta médica elaborado ... \n", ".. ... \n", "95 Existe o erro notório na apreciação da prova, ... \n", "96 Por outro lado, tratando-se no caso concreto, ... \n", "97 Nos termos do disposto no art. 4, n 4, alínea ... \n", "98 Para que exista \"dupla conformidade relevante ... \n", "99 Não se justifica admitir revista que se centra... \n", "\n", " top_5_keywords Index \n", "0 alguém cabeça casal herdeiro qualidade 0 \n", "1 parte analisar antecedentes atenta decisória 1 \n", "2 sentido algumas captação comprometa daí 2 \n", "3 executado prescrição efeitos interrompida irre... 3 \n", "4 junta médica perícia estando realização 4 \n", ".. ... ... \n", "95 apreciação arbitrários baseada conjugada contr... 95 \n", "96 jurisdicional conselho duplo lado levantava 96 \n", "97 etaf acidente resultantes alínea visando 97 \n", "98 decisório exista segmento autónomo dupla 98 \n", "99 centra exclusivamente inconstitucionalidades j... 99 \n", "\n", "[100 rows x 3 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_queries_examples = get_tf_idf(df_queries_examples['Paragraph'], stop_words, [], df_queries_examples)\n", "df_queries_examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualize keywords" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Adapted from https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/03-TF-IDF-Scikit-Learn.html" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " document term tfidf\n", "122 0 alguém 0.329389\n", "229 0 cabeça 0.329389\n", "241 0 casal 0.329389\n", "706 0 herdeiro 0.329389\n", "1198 0 qualidade 0.302254\n", "... ... ... ...\n", "149052 99 centra 0.473740\n", "149417 99 exclusivamente 0.473740\n", "149551 99 inconstitucionalidades 0.473740\n", "149655 99 justifica 0.407025\n", "148893 99 admitir 0.293579\n", "\n", "[500 rows x 3 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Only practical with a small sample (around 100 documents at most)\n", "import numpy as np\n", "import altair as alt\n", "\n", "top_n = 5\n", "\n", "tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words)\n", "tfidf_vector = tfidf_vectorizer.fit_transform(df_queries_examples['Paragraph'])\n", "\n", "tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=df_queries_examples.index, columns=tfidf_vectorizer.get_feature_names_out())\n", "tfidf_df = tfidf_df.stack().reset_index()\n", "tfidf_df = tfidf_df.rename(columns={0: 'tfidf', 'level_0': 'document', 'level_1': 'term'})\n", "top_tfidf = tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(top_n)\n", "top_tfidf" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " document term tfidf\n", "122 0 alguém 0.329479\n", "229 0 cabeça 0.329485\n", "241 0 casal 0.329454\n", "706 0 herdeiro 0.329393\n", "1198 0 qualidade 0.302293\n", "... ... ... ...\n", "149052 99 centra 0.473791\n", "149417 99 exclusivamente 0.473743\n", "149551 99 inconstitucionalidades 0.473785\n", "149655 99 justifica 0.407034\n", "148893 99 admitir 0.293587\n", "\n", "[500 rows x 3 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Terms in this list will get a red dot in the visualization\n", "term_list = []\n", "\n", "#adding a little randomness to break ties in term ranking\n", "top_tfidf_plusRand = top_tfidf.copy()\n", "top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf.shape[0])*0.0001\n", "top_tfidf_plusRand\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.LayerChart(...)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# base for all visualizations, with rank calculation\n", "base = alt.Chart(top_tfidf_plusRand).encode(\n", " x = 'rank:O',\n", " y = 'document:N'\n", ").transform_window(\n", " rank = \"rank()\",\n", " sort = [alt.SortField(\"tfidf\", order=\"descending\")],\n", " groupby = [\"document\"],\n", ")\n", "\n", "# heatmap specification\n", "heatmap = base.mark_rect().encode(\n", " color = 'tfidf:Q'\n", ")\n", "\n", "# red circle over terms in the term_list above\n", "circle = base.mark_circle(size=300).encode(\n", " color = alt.condition(\n", " alt.FieldOneOfPredicate(field='term', oneOf=term_list),\n", " alt.value('red'),\n", " alt.value('#FFFFFF00')\n", " )\n", ")\n", "\n", "# text labels, white for darker heatmap colors\n", "text = base.mark_text(baseline='middle').encode(\n", " text = 'term:N',\n", " color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))\n", ")\n", "\n", "# display the three superimposed visualizations\n", "(heatmap + circle + text).properties(width = 1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Best Matching Algorithm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okapi BM25 (BM25){cite}`bm25_1`, in which BM stands for Best Matching, is a ranking function commonly used in search engines such as Elasticsearch (explored later in this document) to estimate the relevance of documents given a query. BM25 relies on bag-of-words logic: it ranks a collection of documents based on which query terms appear in each document, regardless of their position within it. 
The equation is as follows:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "\\text{score}(D,Q) = \\sum_{i=1}^{n} \\text{IDF}(q_i) \\cdot \\frac{f(q_i, D) \\cdot (k_1 + 1)}{f(q_i, D) + k_1 \\cdot \\left(1 - b + b \\cdot \\frac{|D|}{\\text{avgdl}}\\right)}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where $Q$ represents the given query (with $q_1, ..., q_n$ being its keywords) and $D$ represents the document. $f(q_i, D)$ is the frequency of $q_i$ in the document $D$, $|D|$ is the length of $D$ in words, and $\\text{avgdl}$ is the average document length in the collection.\n", "Both $k_1$ and $b$ are free parameters used to tune the BM25 function (common choices are $k_1 \\in [1.2, 2.0]$ and $b = 0.75$)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Distance metrics for lexical approaches" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A traditional way to search for similar documents is the Jaccard similarity between two sets. It evaluates how much content the two sets share: in practice, it is the size of their intersection divided by the size of their union. It is defined as follows:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "\\text{Sim}(C_1, C_2)=\\frac{|C_1 \\cap C_2|}{|C_1 \\cup C_2|}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where $C_1$ and $C_2$ represent two sets." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.13 64-bit (microsoft store)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "6222f30a04324eb0e1388dd9d92114e7e65302edf73019a41583b5b5a3ccc776" } } }, "nbformat": 4, "nbformat_minor": 2 }
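As a rough illustration of the BM25 and Jaccard formulas defined above, the following sketch scores a toy corpus (illustrative only, not the notebook's dataset). Here `k1 = 1.5` and `b = 0.75` are common default values, and the IDF reuses the simple `log(N/df)` definition given earlier rather than the smoothed variant used by engines such as Elasticsearch:

```python
import math

# A toy corpus (illustrative only) for BM25 scoring and Jaccard similarity.
docs = [
    "the public garden was quiet".split(),
    "the court issued a decision".split(),
    "the decision about the garden".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N  # average document length

def idf(term):
    # Simple IDF(t, D) = log(N / df), as defined earlier in this section.
    df = sum(1 for d in docs if term in d)
    return math.log(N / df) if df else 0.0

def bm25(query, doc, k1=1.5, b=0.75):
    # score(D, Q) = sum over query terms of
    #   IDF(q_i) * f(q_i, D) * (k1 + 1) / (f(q_i, D) + k1 * (1 - b + b * |D| / avgdl))
    score = 0.0
    for q in query:
        f = doc.count(q)  # f(q_i, D)
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf(q) * f * (k1 + 1) / denom
    return score

def jaccard(c1, c2):
    # |C1 ∩ C2| / |C1 ∪ C2| over the sets of tokens
    c1, c2 = set(c1), set(c2)
    return len(c1 & c2) / len(c1 | c2)

query = "garden decision".split()
scores = [bm25(query, d) for d in docs]
# The third document mentions both "garden" and "decision", so it scores highest.
```

Note how the length normalization `(1 - b + b * |D| / avgdl)` penalizes documents longer than the collection average, while the saturation through `k1` limits how much repeated occurrences of a term can increase the score.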