
Conversation

yanghua
Collaborator

@yanghua yanghua commented Sep 20, 2025

closes #4585

Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@yanghua
Collaborator Author

yanghua commented Sep 20, 2025

This PR is only for CI. Still WIP, not ready for review.

@yanghua yanghua force-pushed the primary-key-conflict-detection branch 2 times, most recently from a007dc9 to 5de6d10 Compare September 20, 2025 07:17
yanghua and others added 4 commits September 30, 2025 16:31
…insert operations

This commit introduces a comprehensive Bloom Filter-based conflict detection mechanism
for concurrent merge insert operations in Lance, addressing primary key conflicts
during transaction commits.

Key Features:
- Two-tier storage strategy: exact mapping for small datasets (<200KB), Bloom Filter for large datasets
- Support for multiple primary key types: String, Int64, UInt64, Binary, and composite keys
- Probabilistic conflict detection with configurable false positive rates
- Seamless integration with existing transaction and merge insert workflows
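
To illustrate the two-tier storage strategy listed above, here is a minimal sketch using only the Rust standard library. It is not the PR's code: `TieredKeySet`, `Tier`, `hash_pair`, and the constructor parameters are hypothetical names, and a plain double-hashing Bloom filter stands in for the SBBF that the PR mentions. Keys are kept exactly until their total size exceeds a byte budget, then folded into a bit array.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Storage tier currently in use (hypothetical; not the PR's actual types).
enum Tier {
    /// Exact membership, kept while the keys fit in the byte budget.
    Exact { keys: HashSet<Vec<u8>>, bytes: usize },
    /// Plain Bloom filter bit array (a stand-in for the SBBF the PR mentions).
    Bloom { bits: Vec<u64> },
}

pub struct TieredKeySet {
    max_exact_bytes: usize,
    bloom_words: usize,
    num_hashes: u32,
    tier: Tier,
}

/// Derive two hashes per key for double hashing: index_i = a + i * b.
fn hash_pair(key: &[u8]) -> (u64, u64) {
    let mut h1 = DefaultHasher::new();
    key.hash(&mut h1);
    let a = h1.finish();
    let mut h2 = DefaultHasher::new();
    (a, key).hash(&mut h2);
    (a, h2.finish() | 1) // force a non-zero step
}

/// Set the `num_hashes` bits that correspond to `key`.
fn set_bits(bits: &mut [u64], key: &[u8], num_hashes: u32) {
    let num_bits = bits.len() as u64 * 64;
    let (a, b) = hash_pair(key);
    for i in 0..u64::from(num_hashes) {
        let idx = a.wrapping_add(i.wrapping_mul(b)) % num_bits;
        bits[(idx / 64) as usize] |= 1u64 << (idx % 64);
    }
}

/// Return true only if every bit that corresponds to `key` is set.
fn test_bits(bits: &[u64], key: &[u8], num_hashes: u32) -> bool {
    let num_bits = bits.len() as u64 * 64;
    let (a, b) = hash_pair(key);
    (0..u64::from(num_hashes)).all(|i| {
        let idx = a.wrapping_add(i.wrapping_mul(b)) % num_bits;
        bits[(idx / 64) as usize] & (1u64 << (idx % 64)) != 0
    })
}

impl TieredKeySet {
    pub fn new(max_exact_bytes: usize, bloom_words: usize, num_hashes: u32) -> Self {
        assert!(bloom_words > 0 && num_hashes > 0);
        let tier = Tier::Exact { keys: HashSet::new(), bytes: 0 };
        Self { max_exact_bytes, bloom_words, num_hashes, tier }
    }

    pub fn insert(&mut self, key: &[u8]) {
        let mut promoted = None;
        match &mut self.tier {
            Tier::Exact { keys, bytes } => {
                if keys.insert(key.to_vec()) {
                    *bytes += key.len();
                }
                // Over budget: fold every exact key into a fresh bit array.
                if *bytes > self.max_exact_bytes {
                    let mut bits = vec![0u64; self.bloom_words];
                    for k in keys.iter() {
                        set_bits(&mut bits, k, self.num_hashes);
                    }
                    promoted = Some(bits);
                }
            }
            Tier::Bloom { bits } => set_bits(bits, key, self.num_hashes),
        }
        if let Some(bits) = promoted {
            self.tier = Tier::Bloom { bits };
        }
    }

    /// Never returns false for an inserted key; may return true for an
    /// absent key once the Bloom tier is active.
    pub fn might_contain(&self, key: &[u8]) -> bool {
        match &self.tier {
            Tier::Exact { keys, .. } => keys.contains(key),
            Tier::Bloom { bits } => test_bits(bits, key, self.num_hashes),
        }
    }

    pub fn is_exact(&self) -> bool {
        matches!(self.tier, Tier::Exact { .. })
    }
}
```

In this sketch the promotion is one-way: once the set has switched to the Bloom tier, a positive answer can no longer be told apart from a false positive, so callers need `is_exact()` to decide how much to trust a hit.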

Core Components:
1. Transaction Structure Extension:
   - Extended Transaction protobuf with primary_key_bloom_filter field
   - Added serialization/deserialization support for Bloom Filter data

2. Primary Key Bloom Filter Module:
   - PrimaryKeyBloomFilter wrapper with intelligent storage selection
   - Support for exact mapping (HashMap) and probabilistic filtering (SBBF)
   - Comprehensive primary key type handling and composite key support

3. Conflict Detection Logic:
   - ConflictDetector interface with DefaultConflictDetector implementation
   - Intersection-based conflict detection algorithm
   - False positive identification and graceful handling

4. System Integration:
   - Enhanced TransactionRebase with check_merge_primary_key_conflicts
   - Modified MergeInsertBuilder to collect primary keys during execution
   - Updated Merger struct to track collected primary keys
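
The intersection-based check described under component 3 might look roughly like the sketch below. The names `KeyFilter`, `ConflictResult`, and `detect_conflict` are illustrative assumptions and are not the PR's `ConflictDetector` or `DefaultConflictDetector` API. The idea shown is to probe each key written by the incoming transaction against the key filter of an already committed one, and to report whether a hit is certain (exact storage) or only possible (probabilistic storage).

```rust
use std::collections::HashSet;

/// Hypothetical view of a committed transaction's primary key filter.
pub trait KeyFilter {
    /// May return true for an absent key if the filter is probabilistic.
    fn might_contain(&self, key: &[u8]) -> bool;
    /// True if positive answers are guaranteed to be real matches.
    fn is_exact(&self) -> bool;
}

/// An exact set is the simplest possible filter.
impl KeyFilter for HashSet<Vec<u8>> {
    fn might_contain(&self, key: &[u8]) -> bool {
        self.contains(key)
    }
    fn is_exact(&self) -> bool {
        true
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum ConflictResult {
    /// No incoming key matched the committed filter.
    NoConflict,
    /// These keys were definitely written by the committed transaction.
    Definite(Vec<Vec<u8>>),
    /// These keys matched a probabilistic filter and may be false positives.
    Possible(Vec<Vec<u8>>),
}

/// Intersect the incoming keys with the committed filter.
pub fn detect_conflict<F: KeyFilter>(committed: &F, incoming: &[Vec<u8>]) -> ConflictResult {
    let hits: Vec<Vec<u8>> = incoming
        .iter()
        .filter(|k| committed.might_contain(k.as_slice()))
        .cloned()
        .collect();
    if hits.is_empty() {
        ConflictResult::NoConflict
    } else if committed.is_exact() {
        ConflictResult::Definite(hits)
    } else {
        ConflictResult::Possible(hits)
    }
}
```

How a `Possible` result is handled is a policy decision that this sketch leaves open: a caller could treat it conservatively as a conflict and retry, or re-check the matched keys against the committed data.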

Technical Specifications:
- Time Complexity: O(k) per insertion or query, where k is the number of hash functions
- Space Complexity: O(m), where m is the number of bits in the Bloom Filter bit array
- Configurable false positive rate (default: 1%)
- Configurable storage threshold (default: 200KB)
- Full backward compatibility with existing Lance functionality
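
To make the relationship between the false positive rate and the filter size concrete, the sketch below applies the textbook sizing formulas for a classic Bloom filter: m = ceil(-n ln p / (ln 2)^2) bits and k = round((m / n) ln 2) hash functions, for n keys and target rate p. The function name `bloom_params` is hypothetical, and the SBBF variant mentioned in this PR may size itself differently, so treat the numbers as an approximation rather than as what the implementation computes.

```rust
/// Textbook sizing for a classic Bloom filter (hypothetical helper).
/// Returns (number of bits m, number of hash functions k) for `n`
/// expected keys and a target false positive rate `p`.
pub fn bloom_params(n: u64, p: f64) -> (u64, u32) {
    assert!(n > 0 && p > 0.0 && p < 1.0);
    let ln2 = std::f64::consts::LN_2;
    // m = -n * ln(p) / (ln 2)^2, rounded up to a whole bit.
    let m = (-(n as f64) * p.ln() / (ln2 * ln2)).ceil();
    // k = (m / n) * ln 2, rounded to the nearest integer, at least 1.
    let k = ((m / n as f64) * ln2).round().max(1.0);
    (m as u64, k as u32)
}
```

At the default 1% rate these formulas give roughly 9.6 bits per key and 7 hash functions, so one million keys would need about 1.2 MB of bit array.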

Files Modified:
- protos/transaction.proto: Added primary_key_bloom_filter field
- rust/lance/src/dataset/transaction.rs: Extended Transaction struct
- rust/lance/src/dataset/write/merge_insert.rs: Integrated primary key collection
- rust/lance/src/io/commit/conflict_resolver.rs: Enhanced conflict detection

Files Added:
- rust/lance/src/dataset/conflict_detection/: Complete conflict detection module
  - primary_key_filter.rs: Core Bloom Filter implementation
  - conflict_detector.rs: Conflict detection interface and logic
  - examples.rs: Comprehensive usage examples
  - integration_tests.rs: End-to-end test suite
  - README.md: Module documentation

This implementation provides reliable, efficient concurrent conflict detection
while maintaining Lance's performance characteristics and architectural principles.
@yanghua yanghua force-pushed the primary-key-conflict-detection branch from c45c79c to 8d89628 Compare September 30, 2025 08:32

Successfully merging this pull request may close these issues.

MergeInsert produces duplicated rows