{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lexical approaches for Information Retrieval\n", "\n", "Traditionally, a retrieval system answers a query by looking for documents or entries that contain the exact query words. This process is called lexical search. Its main drawback is that relevant passages or documents are missed when they do not contain those exact words. Consider the sentences: “I walked through the public garden. I enjoyed it.” A human reader easily understands that “it” refers to “public garden”, but a lexical search for “public garden” will not match the second sentence, because that link is hard to convey to a computer.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Term Frequency Algorithm\n", "\n", "Term Frequency-Inverse Document Frequency (TF-IDF){cite}`tfidf` is a ranking function for document search and information retrieval. It scores how relevant a term ``t`` is to a document ``d`` within a collection ``D`` of ``N`` documents. TF-IDF is defined as follows:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", " \\text{TF-IDF}(t,d, D) = \\text{TF}(t,d) \\times \\text{IDF}(t,D)\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TF-IDF is composed of two parts:\n", "- Term Frequency (TF)\n", "- Inverse Document Frequency (IDF)\n", "\n", "TF represents how often a term appears in a specific document. 
It is calculated as follows:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "\\text{TF}(t,d) = \\log(1+\\text{freq}(t,d))\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where $\\text{freq}(t,d)$ is the number of times the term $t$ occurs in the document $d$.\n", "\n", "IDF measures how rare a term is across the whole collection: it is 0 when the term appears in every document and grows as the term appears in fewer documents, up to $\\log N$ for a term found in a single document.\n", "IDF is defined as follows:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "$$\n", "\\text{IDF}(t,D) = \\log\\left(\\frac{N}{|\\{d \\in D : t \\in d\\}|}\\right)\n", "$$" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.feature_extraction.text import CountVectorizer" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>Paragraph</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>Dá-se a preterição, quando o cabeça de casal d...</td></tr>\n",
"<tr><th>1</th><td>Para interpretarmos correctamente a parte deci...</td></tr>\n",
"<tr><th>2</th><td>A documentação só será deficiente quando não p...</td></tr>\n",
"<tr><th>3</th><td>Interrompida a prescrição com a citação do exe...</td></tr>\n",
"<tr><th>4</th><td>Não estando o laudo de junta médica elaborado ...</td></tr>\n",
"</tbody>\n",
"</table>\n",
"
\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>Paragraph</th><th>top_5_keywords</th><th>Index</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>Dá-se a preterição, quando o cabeça de casal d...</td><td>alguém cabeça casal herdeiro qualidade</td><td>0</td></tr>\n",
"<tr><th>1</th><td>Para interpretarmos correctamente a parte deci...</td><td>parte analisar antecedentes atenta decisória</td><td>1</td></tr>\n",
"<tr><th>2</th><td>A documentação só será deficiente quando não p...</td><td>sentido algumas captação comprometa daí</td><td>2</td></tr>\n",
"<tr><th>3</th><td>Interrompida a prescrição com a citação do exe...</td><td>executado prescrição efeitos interrompida irre...</td><td>3</td></tr>\n",
"<tr><th>4</th><td>Não estando o laudo de junta médica elaborado ...</td><td>junta médica perícia estando realização</td><td>4</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>95</th><td>Existe o erro notório na apreciação da prova, ...</td><td>apreciação arbitrários baseada conjugada contr...</td><td>95</td></tr>\n",
"<tr><th>96</th><td>Por outro lado, tratando-se no caso concreto, ...</td><td>jurisdicional conselho duplo lado levantava</td><td>96</td></tr>\n",
"<tr><th>97</th><td>Nos termos do disposto no art. 4, n 4, alínea ...</td><td>etaf acidente resultantes alínea visando</td><td>97</td></tr>\n",
"<tr><th>98</th><td>Para que exista \"dupla conformidade relevante ...</td><td>decisório exista segmento autónomo dupla</td><td>98</td></tr>\n",
"<tr><th>99</th><td>Não se justifica admitir revista que se centra...</td><td>centra exclusivamente inconstitucionalidades j...</td><td>99</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>100 rows × 3 columns</p>
\n",
"\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>document</th><th>term</th><th>tfidf</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>122</th><td>0</td><td>alguém</td><td>0.329389</td></tr>\n",
"<tr><th>229</th><td>0</td><td>cabeça</td><td>0.329389</td></tr>\n",
"<tr><th>241</th><td>0</td><td>casal</td><td>0.329389</td></tr>\n",
"<tr><th>706</th><td>0</td><td>herdeiro</td><td>0.329389</td></tr>\n",
"<tr><th>1198</th><td>0</td><td>qualidade</td><td>0.302254</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>149052</th><td>99</td><td>centra</td><td>0.473740</td></tr>\n",
"<tr><th>149417</th><td>99</td><td>exclusivamente</td><td>0.473740</td></tr>\n",
"<tr><th>149551</th><td>99</td><td>inconstitucionalidades</td><td>0.473740</td></tr>\n",
"<tr><th>149655</th><td>99</td><td>justifica</td><td>0.407025</td></tr>\n",
"<tr><th>148893</th><td>99</td><td>admitir</td><td>0.293579</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>500 rows × 3 columns</p>
\n",
"\n",
"<table>\n",
"<thead>\n",
"<tr><th></th><th>document</th><th>term</th><th>tfidf</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>122</th><td>0</td><td>alguém</td><td>0.329479</td></tr>\n",
"<tr><th>229</th><td>0</td><td>cabeça</td><td>0.329485</td></tr>\n",
"<tr><th>241</th><td>0</td><td>casal</td><td>0.329454</td></tr>\n",
"<tr><th>706</th><td>0</td><td>herdeiro</td><td>0.329393</td></tr>\n",
"<tr><th>1198</th><td>0</td><td>qualidade</td><td>0.302293</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>149052</th><td>99</td><td>centra</td><td>0.473791</td></tr>\n",
"<tr><th>149417</th><td>99</td><td>exclusivamente</td><td>0.473743</td></tr>\n",
"<tr><th>149551</th><td>99</td><td>inconstitucionalidades</td><td>0.473785</td></tr>\n",
"<tr><th>149655</th><td>99</td><td>justifica</td><td>0.407034</td></tr>\n",
"<tr><th>148893</th><td>99</td><td>admitir</td><td>0.293587</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>500 rows × 3 columns</p>
\n", "