Add Semantic Similarity chunker #6994
@dotnet-policy-service agree
Pull Request Overview
This PR introduces new chunking implementations for the DataIngestion library, adding support for splitting documents into chunks based on different strategies: semantic similarity, sections, markdown structure, and token-based chunking. The key changes include:
- New chunker implementations: `SemanticSimilarityChunker`, `SectionChunker`, `MarkdownChunker`, and `DocumentTokenChunker`
- Comprehensive test coverage for all new chunkers
- Addition of `System.Numerics.Tensors` package dependency for cosine similarity calculations
- Addition of `Microsoft.ML.Tokenizers.Data.O200kBase` package for tokenization support
- Test infrastructure with shared `TestEmbeddingGenerator` for testing AI-based chunking
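The overview mentions cosine similarity as the basis for semantic chunking. The general idea can be sketched as follows. This is a language-agnostic illustration in Python, not the PR's C# code: the function names, the adjacent-sentence comparison, and the fixed `threshold` are all assumptions made for the example, and the actual `SemanticSimilarityChunker` may choose split points differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def semantic_chunks(sentences, embeddings, threshold=0.5):
    """Group consecutive sentences; start a new chunk when the similarity
    between neighbouring embeddings drops below `threshold` (hypothetical rule)."""
    if len(sentences) != len(embeddings):
        raise ValueError("need exactly one embedding per sentence")
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append([])  # topic shift: open a new chunk
        chunks[-1].append(sentences[i])
    return [" ".join(chunk) for chunk in chunks]
```

In a real pipeline the embeddings would come from an embedding model; in tests, a deterministic stand-in (the role `TestEmbeddingGenerator` appears to play here) keeps the split points predictable.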
Reviewed Changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/Libraries/Microsoft.Extensions.DataIngestion/Microsoft.Extensions.DataIngestion.csproj | Adds System.Numerics.Tensors package dependency and removes trailing whitespace |
| src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/SemanticSimilarityChunker.cs | Implements semantic similarity-based chunking using embeddings and cosine similarity |
| src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/SectionChunker.cs | Implements section-based chunking that treats document sections as separate entities |
| src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/MarkdownChunker.cs | Implements markdown header-based chunking with configurable header levels |
| src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs | Implements token-based chunking with configurable overlap |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Microsoft.Extensions.DataIngestion.Tests.csproj | Adds test dependencies and links to shared test helpers |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/SemanticSimilarityChunkerTests.cs | Comprehensive tests for semantic similarity chunker |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/SectionChunkerTests.cs | Tests for section-based chunking including nested sections |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/MarkdownChunkerTests.cs | Tests for markdown chunker with various header configurations |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/DocumentTokenChunkerTests.cs | Base tests for token-based chunking |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/OverlapDocumentTokenChunkerTests.cs | Tests for token chunking with overlap |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/NoOverlapDocumentTokenChunkerTests.cs | Tests for token chunking without overlap |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/DocumentChunkerTests.cs | Base test class with common test scenarios |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/ChunkAssertions.cs | Helper class for chunk assertions |
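The table describes `DocumentTokenChunker` as token-based chunking with configurable overlap. A minimal sketch of that sliding-window idea is shown below, in Python for illustration only. The function name and the `max_tokens` and `overlap` parameters are hypothetical, and the token list stands in for the output of a real tokenizer (the PR references the O200kBase tokenizer data package); the actual C# class may handle boundaries differently.

```python
def chunk_tokens(tokens, max_tokens, overlap=0):
    """Split `tokens` into windows of at most `max_tokens` items, where each
    window repeats the last `overlap` tokens of the previous one."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks
```

With `overlap=0` the windows are disjoint; a non-zero overlap trades some duplicated tokens for context that carries across chunk boundaries, which is presumably why the test suite covers both configurations separately.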
adamsitnik left a comment
Big thanks for all your hard work and contribution to our product @KrystofS !
I've left some comments, PTAL.
adamsitnik left a comment
Overall looks good to me. PTAL at my comments (mostly nits). Thank you again @KrystofS !
adamsitnik left a comment
LGTM, thank you for your contribution @KrystofS !
Co-authored-by: Copilot <[email protected]>
…nticSimilarityChunker.cs Co-authored-by: Adam Sitnik <[email protected]>
Co-authored-by: Adam Sitnik <[email protected]>
Co-authored-by: Adam Sitnik <[email protected]>
20c8bd7 to 78afabb