George-RG/Nearest-Neighbor-Search

Nearest-Neighbor-Search

Description

This repository contains a K-Nearest Neighbor Search engine developed as part of the Project 2024-2025 course of DIT. The implementation uses a recently proposed algorithm called Vamana and its filtered counterparts, Filtered Vamana and Stitched Vamana.

The project is structured into three phases. The initial, simplest phase is optimized to handle approximately 10,000 data points efficiently. By the final phase, the project was optimized to manage datasets with a significantly larger volume of data points.

Requirements

  • C++17
  • g++ 9.4.0 or later
  • Make 4.2.1 or later
  • STL library (Standard Template Library) for C++

Compilation Instructions

Fast Setup

The fastest way to test the project is to run the following commands:

git clone
cd Nearest-Neighbor-Search
make run_test
./other/setup.sh
make run

Makefile

This project comes with a Makefile that provides easy compilation and execution of the program. The following commands are available:

make                    : Compile the program
make run                : Run the program (read below for more details)
make test               : Compile tests
make test_run           : Run tests
make run_valgrind_test  : Run valgrind on the test executable
make DEBUG=1            : Compile the project with debug flags
make clean              : Clean the build files

The make run command is preconfigured to run the program on the smallest of the provided datasets (siftsmall). The files of that dataset must be placed under the data/siftsmall directory.

This repository also provides a script, other/setup.sh, that downloads and sets up all the provided datasets.

More configuration options can be found at the top of the Makefile.

Manual configuration

The program has 3 main modes of operation:

  • Unfiltered Mode: This is the most basic operational mode for the program. The user provides the path to the unfiltered dataset, queries, and the ground truth file. The program then calculates the nearest neighbors for each query point using the simple vamana algorithm and compares them to the ground truth.

  • Filtered Mode: This mode is similar to the unfiltered mode, but the datapoints also contain a filter. This changes the way the nearest neighbors are calculated. In this mode, the user can choose between the use of the filteredVamana algorithm or the stitchedVamana algorithm.

  • Ground Truth Mode: This mode is used to calculate the ground truth for a dataset. The user provides the path to the dataset and the queries. The program then calculates the nearest neighbors for each query point and stores them in a file.

The program can be run using the following command:

./vamana <mode> <dataset> <queries> <ground_truth> 

Where:

  • <mode> is one of the following:
    • unfiltered
    • filtered
    • ground_truth
  • <dataset> is the path to the dataset file
  • <queries> is the path to the queries file
  • <ground_truth> is the path to the ground truth file

Note: For more information, run ./vamana --help to see the available options, or ./vamana <mode> --help to see the options for a specific mode.

Multithreading

The project can use multiple threads to speed up the computation. The speedup is not linear and depends on the dataset and on the other parameters of the program. To avoid the slowdown that full synchronization would cause, the implementation is not fully thread safe, so any changes to the code should be made with caution.

Implementation Details

The implementation of this project was centered around the idea that the engine should be able to handle datasets with a large number of data points.

To achieve this we made sure that all the decisions we made were based on the efficiency of the code both in terms of time and space complexity.

To make things easier to analyze let's split the implementation into data structures and algorithms.

Data Structures

Starting with the Graph class, we implemented a directed graph that is used to store both the data points and the edges between them. The graph is implemented using an adjacency list design, where each node has a list of edges starting from it.

To improve the efficiency of the graph, we used an unordered_map data structure to map the data points to their corresponding unordered_set of edges.

All the above data structures can be represented as follows:

std::unordered_map<int, std::unordered_set<int>> graph;

This design has the following complexity:

Operation   Complexity (average case)
Insertion   O(1)
Deletion    O(1)
Search      O(1)
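
As an illustration of the layout above, here is a minimal sketch of an adjacency list built on unordered_map and unordered_set. The AdjacencyList class and its method names are hypothetical and are not taken from the project's Graph class; they only show why insertion, deletion, and lookup of an edge are constant time on average.

```cpp
#include <cstddef>
#include <unordered_map>
#include <unordered_set>

// Hypothetical wrapper, for illustration only (not the project's Graph class).
class AdjacencyList {
public:
    // Add the directed edge from -> to; both endpoints become known nodes.
    void addEdge(int from, int to) {
        edges_[from].insert(to);
        edges_[to];  // operator[] creates an empty edge set for a new node
    }

    // Remove the directed edge from -> to; returns true if it existed.
    bool removeEdge(int from, int to) {
        const auto it = edges_.find(from);
        return it != edges_.end() && it->second.erase(to) > 0;
    }

    // Check whether the directed edge from -> to exists.
    bool hasEdge(int from, int to) const {
        const auto it = edges_.find(from);
        return it != edges_.end() && it->second.count(to) > 0;
    }

    // Number of edges starting at `node` (0 for an unknown node).
    std::size_t outDegree(int node) const {
        const auto it = edges_.find(node);
        return it == edges_.end() ? 0 : it->second.size();
    }

private:
    std::unordered_map<int, std::unordered_set<int>> edges_;
};
```

Each operation is one hash lookup for the node followed by one hash operation on its edge set. The worst case is linear when many keys collide, which is why the bounds are average-case.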

For storing the data points within the graph, we used another class called VectorData. This class is used to store the coordinates of each data point. To do this efficiently, we used a single-dimensional array (or a single chunk of memory) to store all the data points in sequence.

This design ensures fast and efficient access to all the data and great cache locality.

Note: If the data are provided in a file, the VectorData class uses the array returned by the mmap() function, which maps the file directly into memory. This keeps the overhead of initializing the data points to a minimum.
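
The flat storage idea can be sketched as follows. FlatVectors is a hypothetical stand-in, not the project's VectorData class: it owns its buffer through std::vector, whereas the note above describes pointing at memory obtained from mmap(). The indexing arithmetic is the part being illustrated.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for flat vector storage (not the project's VectorData).
template <typename T>
class FlatVectors {
public:
    // Allocate room for `count` vectors of `dim` coordinates each, zero-filled.
    FlatVectors(std::size_t count, std::size_t dim)
        : dim_(dim), data_(count * dim) {}

    // Pointer to the first coordinate of vector i; the next dim() values
    // are its coordinates, stored contiguously.
    T* at(std::size_t i) { return data_.data() + i * dim_; }
    const T* at(std::size_t i) const { return data_.data() + i * dim_; }

    std::size_t dim() const { return dim_; }
    std::size_t size() const { return dim_ == 0 ? 0 : data_.size() / dim_; }

private:
    std::size_t dim_;
    std::vector<T> data_;  // row-major: vector i occupies [i * dim, (i + 1) * dim)
};
```

Because consecutive vectors sit next to each other in memory, scanning many points touches memory sequentially, which is the source of the cache locality mentioned above.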

So the Graph class is represented as follows:

template <typename T>
class Graph {
    // The graph represented as an adjacency list
    std::unordered_map<int, std::unordered_set<int>> graph;
    // The number of nodes in the graph
    size_t numVectors;
    // Number of dimensions of the vectors
    size_t dim;
    // The data points
    VectorData<T> *vectorData;
};

Algorithms

Let's now delve into the primary algorithms implemented in this project.

  • Euclidean Distance:

    The Euclidean Distance algorithm computes the distance between two points in a dataset using the following formula:

    $$ d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} $$

    Two important considerations for this algorithm are:

    • We refrained from calculating the square root of the distance sums, as our goal is solely to compare distances between points, and computing the square root is a computationally expensive operation.
    • To prevent overflow, when a distance exceeds the maximum value for the data type, we simply return that maximum value, as it is sufficiently large that the exact distance becomes inconsequential.
  • Medoid Calculation:

    The Medoid Calculation algorithm identifies the medoid of a dataset by summing the distances between each point and all other points and selecting the point with the smallest total.

    To reduce the running time, we implemented an optimization using a distance matrix. This matrix stores the total distance of each point to all others, so the distance between any two points is calculated only once and reused for both of them.

  • Filtered Medoid Calculation:

    The Filtered Medoid Calculation algorithm has two variations. In one variation, the medoid is calculated using the same method as the simple medoid calculation but only considering the points that contain the specific filter. In the other variation, the medoid is selected randomly from the points that contain the filter.

  • Greedy Search:

    The Greedy Search algorithm identifies approximate nearest neighbors for a query point within a dataset, and we implemented optimizations to enhance its performance.

    We used an ordered set to store the k-nearest neighbors by their distance from the query point and an unordered set to track visited nodes. Additionally, we introduced a set for unvisited elements in the nearest neighbors set, reducing recalculations in each iteration.

  • Filtered Greedy Search:

    The Filtered Greedy Search algorithm is a variation of the Greedy Search algorithm that can work with a graph that contains filters. When searching for the nearest neighbors of a filtered query, its operation is similar to the Greedy Search algorithm, but it only considers points that contain the filter. On the other hand, when searching for the nearest neighbors of an unfiltered query, the algorithm performs the Greedy Search algorithm on each subgraph that contains the filter and then stitches the results together.

  • Robust Pruning:

    The Robust Pruning algorithm updates a point's neighbors with the best candidates. Although neighbors and visited nodes are stored as unordered sets, their union is stored in a sorted set ordered by distance from the query point.

  • Vamana:

    The Vamana algorithm does not have a standalone implementation; rather, it integrates the Greedy Search and Robust Pruning algorithms to construct the final directed graph.

  • Filtered Vamana:

    The Filtered Vamana algorithm is a variation of the Vamana algorithm that only considers points that contain the filter. This is achieved by using a filter to check if a point should be considered as a neighbor or not.

  • Stitched Vamana:

    In this variation of the Vamana algorithm, we treat the graph as if it were split into smaller subgraphs, each containing only the points that have a specific filter. The algorithm runs Vamana on each subgraph and then stitches the results together.
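
To make the descriptions above more concrete, here is a minimal sketch of three of the building blocks: the squared distance with saturation, the medoid calculation, and the unfiltered Greedy Search. It is not the project's code. The names Point, Adjacency, squaredDistance, medoid, greedySearch, and listSize are assumptions made for illustration, node ids are assumed to be indices into the points vector, and the medoid here sums squared distances, which may not match the project's choice.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <limits>
#include <set>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

using Point = std::vector<float>;  // hypothetical point type
using Adjacency = std::unordered_map<int, std::unordered_set<int>>;

// Squared Euclidean distance (no sqrt: the ordering of distances is the same).
// Assumes equal dimensions. Saturates at the largest finite float on overflow.
float squaredDistance(const Point& a, const Point& b) {
    const float maxValue = std::numeric_limits<float>::max();
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float diff = a[i] - b[i];
        sum += diff * diff;
        if (!(sum < maxValue)) return maxValue;  // also catches infinity and NaN
    }
    return sum;
}

// Index of the point with the smallest total (squared) distance to the others.
// Each pair is evaluated once and credited to both endpoints.
int medoid(const std::vector<Point>& points) {
    std::vector<double> total(points.size(), 0.0);
    for (std::size_t i = 0; i < points.size(); ++i) {
        for (std::size_t j = i + 1; j < points.size(); ++j) {
            const double d = squaredDistance(points[i], points[j]);
            total[i] += d;
            total[j] += d;
        }
    }
    return static_cast<int>(std::min_element(total.begin(), total.end()) - total.begin());
}

// Greedy search from `start`: repeatedly expand the closest unvisited
// candidate while keeping at most `listSize` candidates ordered by distance.
std::vector<int> greedySearch(const Adjacency& graph, const std::vector<Point>& points,
                              int start, const Point& query,
                              std::size_t k, std::size_t listSize) {
    using Entry = std::pair<float, int>;  // (distance to query, node id)
    std::set<Entry> candidates;           // current best nodes, sorted by distance
    std::set<Entry> unvisited;            // candidates that are not expanded yet
    std::unordered_set<int> visited;      // nodes already expanded

    const Entry first{squaredDistance(points[start], query), start};
    candidates.insert(first);
    unvisited.insert(first);

    while (!unvisited.empty()) {
        const Entry closest = *unvisited.begin();
        unvisited.erase(unvisited.begin());
        visited.insert(closest.second);

        const auto it = graph.find(closest.second);
        if (it != graph.end()) {
            for (int neighbor : it->second) {
                if (visited.count(neighbor)) continue;
                const Entry e{squaredDistance(points[neighbor], query), neighbor};
                if (candidates.insert(e).second) unvisited.insert(e);
            }
        }
        // Drop the farthest candidates once the list is over capacity.
        while (candidates.size() > listSize) {
            const auto worst = std::prev(candidates.end());
            unvisited.erase(*worst);
            candidates.erase(worst);
        }
    }

    std::vector<int> result;  // the k closest candidates, nearest first
    for (const Entry& e : candidates) {
        if (result.size() == k) break;
        result.push_back(e.second);
    }
    return result;
}
```

Keeping the unvisited candidates in their own ordered set means the next node to expand is always at the front, so the candidate list does not have to be rescanned on every iteration.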

Testing

To ensure the correctness of our implementation, we developed a comprehensive set of tests covering all key functionalities of the project, utilizing the Acutest library.

Contributors

Section                       Contributor
Graph Class                   Georgios Nikolaidis
VectorData Class              Georgios Nikolaidis
Euclidean Distance Function   Ioanna Poulou
Medoid Function               Georgios Nikolaidis
Filtered Medoid Function      Ioanna Poulou
Greedy Search Function        Ioanna Poulou
Filtered Greedy Search        Georgios Nikolaidis
Robust Pruning Function       Ioanna Poulou
Vamana Function               Ioanna Poulou
Filtered Vamana Function      Ioanna Poulou
Stitched Vamana Function      Georgios Nikolaidis
