This repository contains a K-Nearest Neighbor Search engine developed as part of the Project 2024-2025 course of DIT. This implementation uses a recently proposed algorithm called Vamana, and its filtered counterpart, Filtered Vamana.
The project is structured into three phases. The initial, simplest phase is optimized to handle approximately 10,000 data points efficiently. By the final phase, the project was optimized to manage datasets with a significantly larger volume of data points.
- C++17
- g++ 9.4.0 or later
- Make 4.2.1 or later
- C++ Standard Template Library (STL)
- Georgios Nikolaidis ([email protected]) (1115202100118)
- Ioanna Poulou ([email protected]) (1115202100161)
The fastest way to test the project is to run the following commands:
git clone
cd Nearest-Neighbor-Search
make run_test
./other/setup.sh
make run

This project comes with a Makefile that provides easy compilation and execution of the program. The following commands are available:
- `make`: Compile the program
- `make run`: Run the program (read below for more details)
- `make test`: Compile tests
- `make test_run`: Run tests
- `make run_valgrind_test`: Run valgrind on the test executable
- `make DEBUG=1`: Compile the project with debug flags
- `make clean`: Clean the build files
The `make run` command is preconfigured to run the program with the smallest dataset from the ones provided here (the `siftsmall` dataset). Furthermore, the files of that dataset must be placed under the `data/siftsmall` directory.
This repository also provides a script in `other/setup.sh` that can download and set up all the provided datasets.
More configuration options can be found at the top of the Makefile.
The program has 3 main modes of operation:
- Unfiltered Mode: This is the most basic operational mode of the program. The user provides the paths to the unfiltered dataset, the queries, and the ground truth file. The program then calculates the nearest neighbors of each query point using the basic Vamana algorithm and compares them to the ground truth.
- Filtered Mode: This mode is similar to the unfiltered mode, but the data points also carry a filter, which changes the way the nearest neighbors are calculated. In this mode, the user can choose between the filteredVamana algorithm and the stitchedVamana algorithm.
- Ground Truth Mode: This mode is used to calculate the ground truth for a dataset. The user provides the paths to the dataset and the queries. The program then calculates the nearest neighbors of each query point and stores them in a file.
The program can be run using the following command:
./vamana <mode> <dataset> <queries> <ground_truth>

Where:
- `<mode>` is one of the following: `unfiltered`, `filtered`, `ground_truth`
- `<dataset>` is the path to the dataset file
- `<queries>` is the path to the queries file
- `<ground_truth>` is the path to the ground truth file
Note: For more information, you can run `./vamana --help` to see the available options, or `./vamana <mode> --help` to see the options for a specific mode.
The project can use multiple threads to speed up the computation. The speedup is not linear, however, and depends on the dataset and on the other parameters of the program. Furthermore, to avoid the slowdown that full synchronization would introduce, the implementation is not fully thread-safe, so any changes to the code should be made with caution.
The implementation of this project was centered around the idea that the engine should be able to handle datasets with a large number of data points.
To achieve this, we based every design decision on the efficiency of the code, in terms of both time and space complexity.
To make things easier to analyze, let's split the implementation into data structures and algorithms.
Starting with the Graph class, we implemented a directed graph that is used to store both the data points and the edges between them. The graph is implemented using an adjacency list design, where each node has a list of edges starting from it.
To improve the efficiency of the graph, we used an unordered_map data structure to map the data points to their corresponding unordered_set of edges.
All the above data structures can be represented as follows:
std::unordered_map<int, std::unordered_set<int>> graph;

This design has the following average-case complexity:
| Operation | Complexity |
|---|---|
| Insertion | O(1) |
| Deletion | O(1) |
| Search | O(1) |
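
As a rough illustration of the adjacency-list design described above, here is a minimal sketch of a directed graph built on the same container type. The class name `AdjacencyList` and its methods are hypothetical and are not taken from the project's `Graph` class. The O(1) costs noted in the comments are the average-case costs of hash-based containers.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <unordered_set>

// Hypothetical adjacency-list wrapper (not the project's Graph class).
class AdjacencyList {
public:
    // Adds the directed edge from -> to; average O(1).
    void addEdge(int from, int to) { edges_[from].insert(to); }

    // Removes the edge if present and reports whether it existed; average O(1).
    bool removeEdge(int from, int to) {
        auto it = edges_.find(from);
        return it != edges_.end() && it->second.erase(to) > 0;
    }

    // Checks whether the edge from -> to exists; average O(1).
    bool hasEdge(int from, int to) const {
        auto it = edges_.find(from);
        return it != edges_.end() && it->second.count(to) > 0;
    }

    // Number of outgoing edges of a node (0 for unknown nodes).
    std::size_t outDegree(int from) const {
        auto it = edges_.find(from);
        return it == edges_.end() ? 0 : it->second.size();
    }

private:
    std::unordered_map<int, std::unordered_set<int>> edges_;
};

// Usage example: edges are directed, so 0 -> 1 does not imply 1 -> 0.
inline void adjacencyListExample() {
    AdjacencyList g;
    g.addEdge(0, 1);
    g.addEdge(0, 2);
    assert(g.hasEdge(0, 1));
    assert(!g.hasEdge(1, 0));
    assert(g.removeEdge(0, 1));
    assert(g.outDegree(0) == 1);
}
```

Using a set rather than a list for the out-edges also means that inserting the same edge twice has no effect, which is convenient when edges are added repeatedly during graph construction.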
For storing the data points within the graph, we used another class called VectorData. This class is used to store the coordinates of each data point. To do this efficiently, we used a single-dimensional array (or a single chunk of memory) to store all the data points in sequence.
This design ensures fast and efficient access to all the data and great cache locality.
Note: If the data are provided in a file, the VectorData class uses the array returned directly by the `mmap()` function, which maps the file into memory. This ensures minimal overhead when initializing the data points.
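
To illustrate the single-array layout, the sketch below stores all vectors back to back in one contiguous buffer and locates vector `i` at offset `i * dim`. The class name `FlatVectors` and its interface are assumptions made for this example and do not reflect the actual `VectorData` API. A memory-mapped file could be viewed in the same way by pointing at the mapped region instead of owning a `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat storage: `count` vectors of `dim` coordinates each,
// laid out back to back in one contiguous buffer.
template <typename T>
class FlatVectors {
public:
    FlatVectors(std::size_t count, std::size_t dim) : dim_(dim), data_(count * dim) {}

    // Pointer to the first coordinate of vector i.
    T *row(std::size_t i) {
        assert(i < size());
        return data_.data() + i * dim_;
    }
    const T *row(std::size_t i) const {
        assert(i < size());
        return data_.data() + i * dim_;
    }

    // Number of stored vectors and number of coordinates per vector.
    std::size_t size() const { return dim_ == 0 ? 0 : data_.size() / dim_; }
    std::size_t dim() const { return dim_; }

private:
    std::size_t dim_;
    std::vector<T> data_;
};

// Usage example: consecutive vectors are exactly `dim` elements apart.
inline void flatVectorsExample() {
    FlatVectors<float> vectors(3, 4);  // 3 points, 4 dimensions
    vectors.row(2)[1] = 7.5f;
    assert(vectors.row(2)[1] == 7.5f);
    assert(vectors.row(1) - vectors.row(0) == 4);
}
```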
So the Graph class is represented as follows:
template <typename T>
class Graph {
    // The graph represented as an adjacency list
    std::unordered_map<int, std::unordered_set<int>> graph;
    // The number of nodes in the graph
    size_t numVectors;
    // Number of dimensions of the vectors
    size_t dim;
    // The data points
    VectorData<T> *vectorData;
};

Let's now look at the primary algorithms implemented in this project.
- Euclidean Distance:
  The Euclidean Distance algorithm computes the distance between two points in a dataset using the following formula:

  $$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

  where $p$ and $q$ are two $n$-dimensional points.

  Two important considerations for this algorithm are:
  - We do not compute the square root of the sum of squared differences: our goal is solely to compare distances between points, and the square root is a computationally expensive operation that does not change their ordering.
  - To prevent overflow, when a distance exceeds the maximum value of the data type, we simply return that maximum value, as it is large enough that the exact distance becomes inconsequential.
- Medoid Calculation:
  The Medoid Calculation algorithm identifies the medoid of a dataset by summing the distances between each point and all other points and selecting the point with the smallest total.
  To improve the algorithm's efficiency, we implemented an optimization using a distance matrix. This matrix stores the total distance of each point to all others, which lets us calculate the distance between any two points only once and avoid redundant calculations.
- Filtered Medoid Calculation:
  The Filtered Medoid Calculation algorithm has two variations. In one variation, the medoid is calculated using the same method as the simple medoid calculation but only considering the points that contain the specific filter. In the other variation, the medoid is selected randomly from the points that contain the filter.
- Greedy Search:
  The Greedy Search algorithm identifies approximate nearest neighbors for a query point within a dataset, and we implemented optimizations to enhance its performance.
  We used an ordered set to store the k-nearest neighbors by their distance from the query point and an unordered set to track visited nodes. Additionally, we introduced a separate set holding the elements of the nearest-neighbors set that have not been visited yet, which avoids recalculating them in each iteration.
- Filtered Greedy Search:
  The Filtered Greedy Search algorithm is a variation of the Greedy Search algorithm that can work with a graph that contains filters. When searching for the nearest neighbors of a filtered query, it operates like the Greedy Search algorithm but only considers points that contain the filter. When searching for the nearest neighbors of an unfiltered query, the algorithm runs Greedy Search on each subgraph that contains a filter and then stitches the results together.
- Robust Pruning:
  The Robust Pruning algorithm updates a point’s neighbors with the best candidates. Although neighbors and visited nodes are stored as unordered sets, their union is stored in a sorted set to maintain order by distance from the query point.
- Vamana:
  The Vamana algorithm does not have a standalone implementation; rather, it integrates the Greedy Search and Robust Pruning algorithms to construct the final directed graph.
- Filtered Vamana:
  The Filtered Vamana algorithm is a variation of the Vamana algorithm that only considers points that contain the filter. This is achieved by using a filter to check whether a point should be considered as a neighbor.
- Stitched Vamana:
  In this variation of the Vamana algorithm, we treat the graph as if it were split into smaller subgraphs, each containing only the points that have a specific filter. The algorithm then runs Vamana on each subgraph and stitches the results together.
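
The core building blocks from the list above (squared Euclidean distance with clamping, Greedy Search, and Robust Pruning) might look roughly like the sketch below. It follows the usual description of Vamana rather than the project's actual code. All names (`Point`, `Adjacency`, `Entry`, `squaredDistance`, `greedySearch`, `robustPrune`) and parameters (`k`, `L`, `alpha`, `R`) are assumptions made for this example. Because the distances are squared, `alpha` is squared in the pruning test so that the comparison matches the one on true distances.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <limits>
#include <set>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

using Point = std::vector<float>;
using Adjacency = std::unordered_map<int, std::unordered_set<int>>;
using Entry = std::pair<float, int>;  // (squared distance, node id)

// Squared Euclidean distance, clamped to the largest finite float.
inline float squaredDistance(const Point &a, const Point &b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float diff = a[i] - b[i];
        sum += diff * diff;
    }
    return std::min(sum, std::numeric_limits<float>::max());
}

// Greedy search from `start`: returns the k closest nodes found (nearest
// first) and records every expanded node in `visited`. L is the list size.
inline std::vector<int> greedySearch(const Adjacency &graph, const std::vector<Point> &points,
                                     int start, const Point &query, std::size_t k,
                                     std::size_t L, std::unordered_set<int> &visited) {
    std::set<Entry> candidates;  // current search list, ordered by distance
    std::set<Entry> unvisited;   // candidates that have not been expanded yet
    const Entry first{squaredDistance(points[start], query), start};
    candidates.insert(first);
    unvisited.insert(first);
    while (!unvisited.empty()) {
        const Entry current = *unvisited.begin();  // closest unexpanded node
        unvisited.erase(unvisited.begin());
        visited.insert(current.second);
        const auto it = graph.find(current.second);
        if (it != graph.end()) {
            for (int neighbor : it->second) {
                if (visited.count(neighbor) > 0) continue;
                const Entry e{squaredDistance(points[neighbor], query), neighbor};
                if (candidates.insert(e).second) unvisited.insert(e);
            }
        }
        while (candidates.size() > L) {  // keep only the L closest candidates
            const auto last = std::prev(candidates.end());
            unvisited.erase(*last);
            candidates.erase(last);
        }
    }
    std::vector<int> result;
    for (const Entry &e : candidates) {
        if (result.size() == k) break;
        result.push_back(e.second);
    }
    return result;
}

// Robust pruning: rebuilds the out-neighbors of p from `pool` plus p's
// current neighbors, keeping at most R of them. A candidate c is dropped
// once a chosen neighbor b satisfies alpha * d(b, c) <= d(p, c); alpha is
// squared below because the distances are squared.
inline void robustPrune(Adjacency &graph, const std::vector<Point> &points, int p,
                        std::unordered_set<int> pool, float alpha, std::size_t R) {
    std::unordered_set<int> &out = graph[p];
    pool.insert(out.begin(), out.end());
    pool.erase(p);
    std::set<Entry> ordered;  // candidates sorted by distance to p
    for (int c : pool) ordered.insert({squaredDistance(points[p], points[c]), c});
    out.clear();
    while (!ordered.empty() && out.size() < R) {
        const int best = ordered.begin()->second;
        out.insert(best);
        for (auto it = ordered.begin(); it != ordered.end();) {
            const float d = squaredDistance(points[best], points[it->second]);
            if (alpha * alpha * d <= it->first) it = ordered.erase(it);
            else ++it;
        }
    }
}
```

In a Vamana-style build loop, each point would typically be searched for with `greedySearch` starting from the medoid, and the resulting `visited` set would then be passed to `robustPrune` as the candidate pool for that point.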
To ensure the correctness of our implementation, we developed a comprehensive set of tests covering all key functionalities of the project, utilizing the Acutest library.
| Section | Contributor |
|---|---|
| Graph Class | Georgios Nikolaidis |
| VectorData Class | Georgios Nikolaidis |
| Euclidean Distance Function | Ioanna Poulou |
| Medoid Function | Georgios Nikolaidis |
| Filtered Medoid Function | Ioanna Poulou |
| Greedy Search Function | Ioanna Poulou |
| Filtered Greedy Search | Georgios Nikolaidis |
| Robust Pruning Function | Ioanna Poulou |
| Vamana Function | Ioanna Poulou |
| Filtered Vamana Function | Ioanna Poulou |
| Stitched Vamana Function | Georgios Nikolaidis |