2.3. Techniques for efficient retrieval process#

2.3.1. MinHashing#

MinHashing is a technique that aims to create a matrix signature to show similar neighbours, based on a binary matrix and permutations. This technique is of substantial importance, since it allows searching similar documents, efficiently, and solely based on a similarity metric. It starts by having an input matrix with size \(R \times C\) filled with \(0\) s and \(1\) s, where \(R\) represents the number of rows (shingles or sections) and \(C\) represents the number of columns (documents or items). In a context where each column represents a different document and each row represents a different shingle, \(1\) and \(0\) show if that shingle is contained within the document or not, respectively.

After that, it is applied a random permutation to swap the order of the rows. When the permutation is defined, a signature matrix with size \(P \times C\) will start to be filled, where P represents the number of permutations executed.

MinHashing

Figure based on https://towardsdatascience.com/understanding-locality-sensitive-hashing-49f6d1f6134

The first element of the initial row in the signature matrix is filled with a value that represents the number of rows that were checked until a 1 is found in the first column according to a searching order defined by the first permutation.

In the Figure, the first permutation is \([6; 3; 1; 7; 2; 5; 4]\). We start with row number 3 and analyse if the first element is a 1 or 0. Since it is 1, we insert into the Signature Matrix the number 1, since we found a 1 in the first row that we checked.

Following that, we move to the second column and start following the permutation until we find a 1 in that column. Since the first occurrence is on the fifth verification, we input in the Signature Matrix the number 5. This process is repeated with the rest of the input matrix and the different permutations.

By analysing the signature matrix and the Jaccard Similarity, it is possible to infer which documents are similar to one another, based on the signatures of each document.

2.3.2. Locality Sensitive Hashing#

LSH [AINguytitleenR] is a technique that joins similar hash data points into buckets. In practice, similar items would be located in the same bucket with high probability [AR15] [AIL+15]. This technique is used in multiple applications such as recommendation systems to compare customers.

There are multiple approaches to utilizing LSH. The traditional approach that is going to be explored in this paper utilizes Shingling, MinHashing, and a banded LSH function.

The first step is K-shingling, which is the process of creating a set of shingles of size \(k\). This step can be visualized as moving a window of size \(k\) along the text and recording the unique shingles (or substrings) of size \(k\). This process is done throughout the different text entries.

After the first step is completed, we have a set of shingles from all the different inputs. With this complete set, one-hot encodings for each text entry are created that represent if a shingle is present within that entry or not.

Consequently, we apply the MinHashing process to create a signature for every entry. Based on those signatures, Jaccard Similarity can compare how similar two documents or entries are.

Finally, we use the banding technique, which splits each signature in \(b\) bands into \(b\) sub-vectors. Each sub-vector is processed through a hash function and mapped to a hash bucket. If sub-vectors from different document signatures are mapped into the same hash bucket, the original items have a high probability of being similar.