Tags: Data Analysis, Deep Hashing, Entity Resolution, Similarity Search and Word Embeddings
Abstract:
Data is accumulating at an exponential rate, which makes it increasingly challenging to retrieve relevant information. In this setting, high-dimensional similarity search is a popular method for extracting relevant information from large data volumes (Big Data), and it drives Machine Learning (ML) tasks such as near-duplicate detection and location recognition. However, the characteristics of Big Data pose a variety of challenges to ML applications, such as high class imbalance, the need for feature engineering to support heterogeneous data, and the need for efficient solutions for queries over array data. Consequently, in this thesis we aim to optimize the data analytics pipeline for the effective use and management of feature engineering data (embedding data), as one solution in the context of high-dimensional similarity search. In doing so, we evaluate how similarity-preserving hashing helps with data blocking and data skipping in the ML applications of supervised entity resolution and top-k similarity search.
An Evaluation of Deep Hashing for High-Dimensional Similarity Search on Embedded Data