RaBitQ Fast Scan #4595
@alibeklfc has exported this pull request. If you are a Meta employee, you can view the originating diff in D81787307.
Summary:

**Introduction**

This diff adds a new index, IndexRaBitQFastScan. It is based on the existing IndexRaBitQ but achieves higher speed by processing data vectors in batches of 32. It builds on the established IndexFastScan architecture to enable efficient batch processing and parallelism.

**Implementation**

* **New Source and Header Files**: Added the implementation of IndexRaBitQFastScan, which follows an interface similar to that of IndexRaBitQ.
* **Batched Processing**: The search operation processes 32 data vectors in a single batch, taking advantage of low-level parallelism to improve throughput.
* **Specialized Post-processing Handler**: A dedicated handler was added for IndexRaBitQFastScan to perform the necessary post-processing during search, because the LUT accumulates only partial distances. Unlike AQ Fast Scan's simple scalar post-processing, RaBitQ requires distance adjustments that depend on both query factors and database vector factors.
* **LUT**: IndexRaBitQFastScan produces slightly different results than IndexRaBitQ because of an extra quantization step in the IndexFastScan architecture. Specifically:
  * The LUT computes a float value as c1 * inner_product + c2 * popcount, which is then quantized. This quantization can cause the results to differ slightly from those of IndexRaBitQ.
  * This can be avoided by storing only the inner_product in the LUT, but doing so would require calculating the popcounts of all data vectors during search, which introduces a tradeoff between speed and accuracy.
  * With the idea proposed in diff D80904214, the algorithm can be modified in the future to eliminate the popcount calculation step, potentially improving both efficiency and accuracy.
* **Query Offset Parameter**: RaBitQ uses query factors in its distance calculations. These factors should be computed in the `compute_float_LUT` method (the most efficient place, since `rotated_qq` is already being calculated there) and then used for the final distance calculations in the handlers. However, the previous version of `compute_quantized_LUT`, which calls `compute_float_LUT`, did not know the `query_offset`, so query factors could not be stored at their global indices. To solve this, I added an extra `query_offset` parameter to both the `compute_quantized_LUT` and `compute_float_LUT` methods. After this change, the computed query factors can be accessed by the correct global query index during distance calculations, which avoids expensive recalculation.

**Testing**

* Ran tests in the test_rabitq suite covering accuracy comparisons with IndexRaBitQ for the L2 and Inner Product metrics, encoding/decoding consistency, query quantization bit settings, small dataset functionality, performance against PQFastScan, serialization, memory management, error handling, and thread safety.
* All tests passed, validating the correctness and robustness of IndexRaBitQFastScan.

**Results**

(Figure: results_rabitq)

* **Performance Dependency**: Performance measurements confirm that IndexRaBitQFastScan is notably faster than IndexRaBitQ when the qb value is high. While the runtime of the original IndexRaBitQ increases with higher qb values, the fast scan variant maintains a consistent runtime regardless of qb.
* **Parallelized Training Loop**: The training loop is parallelized, greatly reducing training time. This parallelism should also be added to the original IndexRaBitQ.
* **Consistency Across Metrics**: The performance advantages of IndexRaBitQFastScan hold for both the L2 and Inner Product metrics.
* One of the next steps is to benchmark IndexRaBitQFastScan against other algorithms to evaluate its performance in a broader context.
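To make the LUT point above concrete, here is a minimal sketch in plain Python of how quantizing a float table built as c1 * inner_product + c2 * popcount introduces a small, bounded error. It is not the Faiss implementation: the coefficient values, the sample inputs, the uniform 8-bit quantizer, and the helper names `quantize_lut` and `dequantize` are all assumptions chosen for illustration.

```python
def quantize_lut(values, n_levels=256):
    """Uniformly quantize floats to integer codes in [0, n_levels - 1].

    Hypothetical quantizer for illustration only; the real fast scan
    code may choose its scale and offset differently.
    """
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (n_levels - 1) if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(code, lo, scale):
    """Map an integer code back to an approximate float value."""
    return lo + code * scale


# Assumed coefficients and per-entry inputs (not taken from the PR).
c1, c2 = 0.37, -0.12
inner_products = [0.0, 1.5, -2.25, 3.1, 0.8]
popcounts = [0, 1, 2, 3, 4]

# Float LUT entries as described in the summary.
float_lut = [c1 * ip + c2 * pc for ip, pc in zip(inner_products, popcounts)]

# The extra quantization step, followed by reconstruction.
codes, lo, scale = quantize_lut(float_lut)
recovered = [dequantize(c, lo, scale) for c in codes]

# The rounding error per entry is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(float_lut, recovered))
print(f"step={scale:.6f} max_error={max_error:.6f}")
```

Under these assumptions each entry is off by at most half a quantization step, so the final distances can differ slightly from those computed in full float precision.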
Differential Revision: D81787307
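The role of the `query_offset` parameter described in the summary can be sketched as follows. This is a simplified Python illustration, not the actual C++ code: the function bodies, the placeholder LUT contents, the use of a squared norm as the "query factor", and the `search_in_batches` driver are all hypothetical. Only the idea of passing the batch's global starting index into the LUT computation comes from the description.

```python
def compute_float_lut(batch, query_offset, query_factors):
    """Hypothetical stand-in for a LUT builder.

    Builds one LUT per query in the batch and records each query's factor
    at its global index (query_offset + local index).
    """
    luts = []
    for i, q in enumerate(batch):
        luts.append([2.0 * x for x in q])  # placeholder LUT contents
        # Placeholder factor: squared norm of the query.
        query_factors[query_offset + i] = sum(x * x for x in q)
    return luts


def search_in_batches(queries, batch_size):
    """Process queries in batches while keeping factors globally indexed."""
    query_factors = [None] * len(queries)
    for start in range(0, len(queries), batch_size):
        batch = queries[start:start + batch_size]
        # Without the offset, every batch would overwrite indices 0..batch_size-1.
        compute_float_lut(batch, start, query_factors)
    return query_factors


queries = [[1.0, 2.0], [0.0, 3.0], [4.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
factors = search_in_batches(queries, batch_size=2)
print(factors)
```

Because each factor is stored at its global index, a later stage (the handler, in the PR's terms) can look it up by query index instead of recomputing it.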
9d56cd5 to dc16ad4
dc16ad4 to 1e35f32
1e35f32 to b29c350
b29c350 to ceb48fa
@alibeklfc has exported this pull request. If you are a Meta employee, you can view the originating Diff in D81787307. |
Summary: **Introduction** This diff adds a new index called the IndexRaBitQFastScan algorithm. The algorithm is based on the existing IndexRaBitQ but achieves higher speed as it processes batches of 32 data vectors concurrently. It leverages the established IndexFastScan architecture to enable efficient batch processing and parallelism. **Implementation** * **New Source and Header Files**: Added implementations for IndexRaBitQFastScan, following a similar interface to IndexRaBitQ. * **Batched Processing**: The search operation processes multiple (32) data vectors in a single batch, taking advantage of low-level parallelism to improve throughput. * **Specialized Post-processing Handler**: A dedicated handler was added for IndexRaBitQFastScan to perform necessary post-processing during search because the LUT accumulates only partial distances. Unlike AQ Fast Scan's simple scalar post-processing, RaBitQ requires complex distance adjustments depending on both query and database vector factors. * **LUT**: IndexRaBitQFastScan produces slightly different results than IndexRaBitQ due to an extra quantization step in the IndexFastScan architecture. Specifically: * The LUT computes a float value as c1 * inner_product + c2 * popcount, which is then quantized. This quantization can cause the results to differ slightly from those of IndexRaBitQ. * It is possible to avoid this by storing only the inner_product in the LUT, but doing so would require calculating all data vector popcounts during search, introducing a tradeoff between speed and accuracy. * With the idea proposed in diff D80904214, the algorithm can be modified in the future to eliminate the popcount calculation step, potentially improving both efficiency and accuracy. 
* **Query Offset Parameter**: RaBitQ uses query factors in distance calculations that should be computed in `compute_float_LUT` method (the most efficient place since we are calculating `rotated_qq` anyways) and used for final distance calculations in handlers. However, the previous version of `compute_quantized_LUT` that calls `compute_float_LUT` did not know the query_offset, preventing proper storage of query factors at their global indices. To solve this, I added the extra parameter `query_offset` to both `compute_quantized_LUT` and `compute_float_LUT` methods. After this change, computed query factors can be accessed by the correct global query index during distance calculations, avoiding expensive recalculation. **Testing** * Conducted comprehensive tests in the test_rabitq suite covering accuracy comparisons with IndexRaBitQ for L2 and Inner Product metrics, encoding/decoding consistency, query quantization bit settings, small dataset functionality, performance against PQFastScan, serialization, memory management, error handling, and thread safety. * All tests passed successfully, validating the correctness and robustness of IndexRaBitQFastScan. **Results** results_rabitq * **Performance Dependency**: Performance measurements confirm that IndexRaBitQFastScan is notably faster than IndexRaBitQ when the qb value is high. While the original IndexRaBitQ experiences increased runtime with higher qb values, the fast scan variant maintains consistent runtime regardless of qb. * **Parallelized Training Loop**: The training loop is parallelized, greatly reducing training time. This parallelism should also be added to the original IndexRaBitQ. * **Consistency Across Metrics**: The performance advantages of IndexRaBitQFastScan hold true for both L2 and Inner Product metrics, demonstrating robustness across different distance measures. * One of the next steps is to benchmark IndexRaBitQFastScan against other algorithms to evaluate its performance in a broader context. 
Differential Revision: D81787307