{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lexical approaches for Information Retrieval\n", "\n", "Traditionally, when a system searches for document content related to a query, it uses techniques that look for documents or entries containing those exact query words. This process is called lexical search. The drawback of this approach is that important passages or documents may not be retrieved simply because they do not contain the exact query words. In a conversation, we might say: “I walked through the public garden. I enjoyed it.” We easily understand that “it” refers to the “public garden”; however, conveying that information to a computer is complex.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Term Frequency-Inverse Document Frequency Algorithm\n", "\n", "Term Frequency-Inverse Document Frequency (TF-IDF){cite}`tfidf` is a ranking function for document search and information retrieval. It evaluates how relevant a term ``t`` is to a document ``d`` within a collection ``D`` of ``N`` documents. TF-IDF is defined as follows:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", " \\text{TF-IDF}(t,d, D) = \\text{TF}(t,d) \\times \\text{IDF}(t,D)\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TF-IDF is composed of two parts:\n", "- Term Frequency (TF)\n", "- Inverse Document Frequency (IDF)\n", "\n", "TF represents how often a term appears in a specific document. 
It is calculated as follows:\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "\\text{TF}(t,d) = \\log(1+\\text{freq}(t,d))\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where $\\text{freq}(t,d)$ represents the frequency of the term $t$ within the document $d$.\n", "\n", "IDF represents how rare a term is across the entire collection of documents: values near 0 indicate very common terms, while larger values indicate rarer terms.\n", "IDF is defined as follows:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "$$\n", "\\text{IDF}(t,D) = \\log\\left(\\frac{N}{\\text{count}(d \\in D : t \\in d)}\\right)\n", "$$" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "from sklearn.feature_extraction.text import TfidfVectorizer" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
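To make the two components concrete, here is a minimal sketch of the formulas above applied to a tiny toy corpus (the corpus and helper functions are illustrative only, not part of the notebook's dataset; note that scikit-learn's `TfidfVectorizer`, used later in this notebook, applies a smoothed variant of these formulas):

```python
import math

# A toy corpus (illustrative only) to walk through the TF and IDF formulas above.
docs = [
    "the public garden was quiet",
    "the court issued a decision",
    "the decision about the garden",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

def tf(term, doc_tokens):
    # TF(t, d) = log(1 + freq(t, d))
    return math.log(1 + doc_tokens.count(term))

def idf(term):
    # IDF(t, D) = log(N / count(d in D : t in d))
    df = sum(1 for doc in tokenized if term in doc)
    return math.log(N / df)

def tf_idf(term, doc_tokens):
    return tf(term, doc_tokens) * idf(term)

# "garden" appears in 2 of the 3 documents, while "court" appears in only 1,
# so "court" receives the larger IDF weight: log(3/2) vs. log(3/1).
print(tf_idf("garden", tokenized[0]), tf_idf("court", tokenized[1]))
```

Because "court" is rarer in the collection, it gets a higher TF-IDF score than the more common "garden", even though both occur once in their respective documents.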
" ], "text/plain": [ " Paragraph\n", "0 Dá-se a preterição, quando o cabeça de casal d...\n", "1 Para interpretarmos correctamente a parte deci...\n", "2 A documentação só será deficiente quando não p...\n", "3 Interrompida a prescrição com a citação do exe...\n", "4 Não estando o laudo de junta médica elaborado ..." ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_queries_examples = pd.read_csv('../Data/queries.csv')\n", "df_queries_examples.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to\n", "[nltk_data] C:\\Users\\Rui\\AppData\\Roaming\\nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] } ], "source": [ "import nltk\n", "nltk.download('stopwords')\n", "from nltk.corpus import stopwords\n", "stop_words = set(stopwords.words('portuguese'))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def get_tf_idf(text_files, stop_words, redundant_words, Original_df):\n", " # text_files: the series of texts to compute TF-IDF over\n", " # stop_words: the list of stop words to be removed\n", " # redundant_words: the list of redundant words to be removed (currently unused)\n", " # Original_df: the original dataframe to which the keywords are added (the one that is returned)\n", "\n", " top_n = 5\n", "\n", " tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words)\n", " tfidf_vector = tfidf_vectorizer.fit_transform(text_files)\n", "\n", " tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=Original_df.index, columns=tfidf_vectorizer.get_feature_names_out())\n", " tfidf_df = tfidf_df.stack().reset_index()\n", " tfidf_df = tfidf_df.rename(columns={0: 'tfidf', 'level_0': 'document', 'level_1': 'term'})\n", " top_tfidf = tfidf_df.sort_values(by=['document','tfidf'], 
ascending=[True,False]).groupby(['document']).head(top_n)\n", "\n", " # group the top top_n terms by document and concatenate them into a single string\n", " top_tfidf2 = top_tfidf.groupby(['document'])['term'].apply(lambda x: ' '.join(x)).reset_index()\n", " top_tfidf2.drop(columns=['document'], inplace=True)\n", "\n", " # join the top top_n terms back onto the original dataframe\n", " Original_df['top_'+str(top_n)+'_keywords'] = top_tfidf2['term'].reset_index(drop=True)\n", " Original_df['Index'] = Original_df.index\n", "\n", " return Original_df\n", "\n", "#df_queries_examples = get_tf_idf(df_all_docs['Paragraph'], stop_words, redundant_words, df_queries_examples)\n", "#df_queries_examples" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Paragraph \\n", "0 Dá-se a preterição, quando o cabeça de casal d... \n", "1 Para interpretarmos correctamente a parte deci... \n", "2 A documentação só será deficiente quando não p... \n", "3 Interrompida a prescrição com a citação do exe... \n", "4 Não estando o laudo de junta médica elaborado ... \n", ".. ... \n", "95 Existe o erro notório na apreciação da prova, ... \n", "96 Por outro lado, tratando-se no caso concreto, ... \n", "97 Nos termos do disposto no art. 4, n 4, alínea ... \n", "98 Para que exista \"dupla conformidade relevante ... \n", "99 Não se justifica admitir revista que se centra... \n", "\n", " top_5_keywords Index \n", "0 alguém cabeça casal herdeiro qualidade 0 \n", "1 parte analisar antecedentes atenta decisória 1 \n", "2 sentido algumas captação comprometa daí 2 \n", "3 executado prescrição efeitos interrompida irre... 3 \n", "4 junta médica perícia estando realização 4 \n", ".. ... ... \n", "95 apreciação arbitrários baseada conjugada contr... 95 \n", "96 jurisdicional conselho duplo lado levantava 96 \n", "97 etaf acidente resultantes alínea visando 97 \n", "98 decisório exista segmento autónomo dupla 98 \n", "99 centra exclusivamente inconstitucionalidades j... 99 \n", "\n", "[100 rows x 3 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_queries_examples = get_tf_idf(df_queries_examples['Paragraph'], stop_words, [], df_queries_examples)\n", "df_queries_examples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Visualize keywords" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Adapted from https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/03-TF-IDF-Scikit-Learn.html" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " document term tfidf\n", "122 0 alguém 0.329389\n", "229 0 cabeça 0.329389\n", "241 0 casal 0.329389\n", "706 0 herdeiro 0.329389\n", "1198 0 qualidade 0.302254\n", "... ... ... ...\n", "149052 99 centra 0.473740\n", "149417 99 exclusivamente 0.473740\n", "149551 99 inconstitucionalidades 0.473740\n", "149655 99 justifica 0.407025\n", "148893 99 admitir 0.293579\n", "\n", "[500 rows x 3 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Only practical with a small sample (around 100 documents at most)\n", "import numpy as np\n", "import altair as alt\n", "\n", "top_n = 5\n", "\n", "tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words)\n", "tfidf_vector = tfidf_vectorizer.fit_transform(df_queries_examples['Paragraph'])\n", "\n", "tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=df_queries_examples.index, columns=tfidf_vectorizer.get_feature_names_out())\n", "tfidf_df = tfidf_df.stack().reset_index()\n", "tfidf_df = tfidf_df.rename(columns={0: 'tfidf', 'level_0': 'document', 'level_1': 'term'})\n", "top_tfidf = tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(top_n)\n", "top_tfidf" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " document term tfidf\n", "122 0 alguém 0.329479\n", "229 0 cabeça 0.329485\n", "241 0 casal 0.329454\n", "706 0 herdeiro 0.329393\n", "1198 0 qualidade 0.302293\n", "... ... ... ...\n", "149052 99 centra 0.473791\n", "149417 99 exclusivamente 0.473743\n", "149551 99 inconstitucionalidades 0.473785\n", "149655 99 justifica 0.407034\n", "148893 99 admitir 0.293587\n", "\n", "[500 rows x 3 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Terms in this list will get a red dot in the visualization\n", "term_list = []\n", "\n", "#adding a little randomness to break ties in term ranking\n", "top_tfidf_plusRand = top_tfidf.copy()\n", "top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf.shape[0])*0.0001\n", "top_tfidf_plusRand\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.LayerChart(...)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# base for all visualizations, with rank calculation\n", "base = alt.Chart(top_tfidf_plusRand).encode(\n", " x = 'rank:O',\n", " y = 'document:N'\n", ").transform_window(\n", " rank = \"rank()\",\n", " sort = [alt.SortField(\"tfidf\", order=\"descending\")],\n", " groupby = [\"document\"],\n", ")\n", "\n", "# heatmap specification\n", "heatmap = base.mark_rect().encode(\n", " color = 'tfidf:Q'\n", ")\n", "\n", "# red circle over terms in the term_list above\n", "circle = base.mark_circle(size=300).encode(\n", " color = alt.condition(\n", " alt.FieldOneOfPredicate(field='term', oneOf=term_list),\n", " alt.value('red'),\n", " alt.value('#FFFFFF00')\n", " )\n", ")\n", "\n", "# text labels, white for darker heatmap colors\n", "text = base.mark_text(baseline='middle').encode(\n", " text = 'term:N',\n", " color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))\n", ")\n", "\n", "# display the three superimposed visualizations\n", "(heatmap + circle + text).properties(width = 1000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Best Matching Algorithm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okapi BM25 (BM25){cite}`bm25_1`, in which BM stands for Best Matching, is a ranking function commonly used in search engines such as Elasticsearch (explored later in this document) to estimate the relevance of documents given a query. BM25 relies on bag-of-words logic: it ranks a collection of documents based on which query terms appear in each document, regardless of their position within it. 
The equation is as follows:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "\\text{score}(D,Q) = \\sum_{i=1}^{n} \\text{IDF}(q_i) \\cdot \\frac{f(q_i, D) \\cdot (k_1 + 1)}{f(q_i, D) + k_1 \\cdot \\left(1 - b + b \\cdot \\frac{|D|}{\\text{avgdl}}\\right)}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where $Q$ represents the given query (with $q_1, ..., q_n$ being its keywords) and $D$ represents the document. $f(q_i, D)$ is the frequency of $q_i$ in the document $D$, $|D|$ is the length of $D$ in words, and $\\text{avgdl}$ is the average document length in the collection.\n", "Both $k_1$ and $b$ are free parameters used to tune the BM25 function (common choices are $k_1 \\in [1.2, 2.0]$ and $b = 0.75$)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Distance metrics for lexical approaches" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A traditional way to search for similar documents is the Jaccard similarity between two sets. It evaluates how much content the two sets share: in practice, it is the size of their intersection divided by the size of their union. It is defined as follows:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$\n", "\\text{Sim}(C_1, C_2)=\\frac{|C_1 \\cap C_2|}{|C_1 \\cup C_2|}\n", "$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where $C_1$ and $C_2$ represent two sets." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.9.13 64-bit (microsoft store)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "6222f30a04324eb0e1388dd9d92114e7e65302edf73019a41583b5b5a3ccc776" } } }, "nbformat": 4, "nbformat_minor": 2 }
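As a rough illustration of the BM25 and Jaccard formulas defined above, the following sketch scores a toy corpus (illustrative only, not the notebook's dataset). Here `k1 = 1.5` and `b = 0.75` are common default values, and the IDF reuses the simple `log(N/df)` definition given earlier rather than the smoothed variant used by engines such as Elasticsearch:

```python
import math

# A toy corpus (illustrative only) for BM25 scoring and Jaccard similarity.
docs = [
    "the public garden was quiet".split(),
    "the court issued a decision".split(),
    "the decision about the garden".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N  # average document length

def idf(term):
    # Simple IDF(t, D) = log(N / df), as defined earlier in this section.
    df = sum(1 for d in docs if term in d)
    return math.log(N / df) if df else 0.0

def bm25(query, doc, k1=1.5, b=0.75):
    # score(D, Q) = sum over query terms of
    #   IDF(q_i) * f(q_i, D) * (k1 + 1) / (f(q_i, D) + k1 * (1 - b + b * |D| / avgdl))
    score = 0.0
    for q in query:
        f = doc.count(q)  # f(q_i, D)
        denom = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf(q) * f * (k1 + 1) / denom
    return score

def jaccard(c1, c2):
    # |C1 ∩ C2| / |C1 ∪ C2| over the sets of tokens
    c1, c2 = set(c1), set(c2)
    return len(c1 & c2) / len(c1 | c2)

query = "garden decision".split()
scores = [bm25(query, d) for d in docs]
# The third document mentions both "garden" and "decision", so it scores highest.
```

Note how the length normalization `(1 - b + b * |D| / avgdl)` penalizes documents longer than the collection average, while the saturation through `k1` limits how much repeated occurrences of a term can increase the score.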