Conversation

@KrystofS KrystofS commented Oct 31, 2025

Microsoft Reviewers: Open in CodeFlow

Copilot AI review requested due to automatic review settings October 31, 2025 19:25

@KrystofS please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.

@dotnet-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
    @dotnet-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
    @dotnet-policy-service agree company="Microsoft"

Contributor License Agreement

@dotnet-policy-service agree

Copilot AI left a comment

Pull Request Overview

This PR introduces new chunking implementations for the DataIngestion library, adding support for splitting documents into chunks based on different strategies: semantic similarity, sections, markdown structure, and token-based chunking. The key changes include:

  • New chunker implementations: SemanticSimilarityChunker, SectionChunker, MarkdownChunker, and DocumentTokenChunker
  • Comprehensive test coverage for all new chunkers
  • Addition of System.Numerics.Tensors package dependency for cosine similarity calculations
  • Addition of Microsoft.ML.Tokenizers.Data.O200kBase package for tokenization support
  • Test infrastructure with shared TestEmbeddingGenerator for testing AI-based chunking
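The idea behind similarity-based chunking can be illustrated with a minimal sketch. This is not the code from this PR (which is C# and relies on System.Numerics.Tensors); it is a plain-Python illustration that assumes one embedding vector per sentence is already available and starts a new chunk whenever the cosine similarity between neighbouring sentences drops below a threshold. The names `cosine_similarity`, `split_by_similarity`, and the `threshold` parameter are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_by_similarity(sentences, embeddings, threshold=0.5):
    """Group consecutive sentences; start a new chunk when similarity
    to the previous sentence falls below `threshold` (an assumed rule)."""
    if len(sentences) != len(embeddings):
        raise ValueError("one embedding is required per sentence")
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine_similarity(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append([])  # similarity dropped: open a new chunk
        chunks[-1].append(sentences[i])
    return [" ".join(chunk) for chunk in chunks]
```

A fixed threshold is the simplest possible split rule; an actual implementation may choose split points differently, for example from the distribution of similarity scores.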

Reviewed Changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/Libraries/Microsoft.Extensions.DataIngestion/Microsoft.Extensions.DataIngestion.csproj | Adds System.Numerics.Tensors package dependency and removes trailing whitespace |
| src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/SemanticSimilarityChunker.cs | Implements semantic similarity-based chunking using embeddings and cosine similarity |
| src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/SectionChunker.cs | Implements section-based chunking that treats document sections as separate entities |
| src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/MarkdownChunker.cs | Implements markdown header-based chunking with configurable header levels |
| src/Libraries/Microsoft.Extensions.DataIngestion/Chunkers/DocumentTokenChunker.cs | Implements token-based chunking with configurable overlap |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Microsoft.Extensions.DataIngestion.Tests.csproj | Adds test dependencies and links to shared test helpers |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/SemanticSimilarityChunkerTests.cs | Comprehensive tests for semantic similarity chunker |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/SectionChunkerTests.cs | Tests for section-based chunking including nested sections |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/MarkdownChunkerTests.cs | Tests for markdown chunker with various header configurations |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/DocumentTokenChunkerTests.cs | Base tests for token-based chunking |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/OverlapDocumentTokenChunkerTests.cs | Tests for token chunking with overlap |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/NoOverlapDocumentTokenChunkerTests.cs | Tests for token chunking without overlap |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/DocumentChunkerTests.cs | Base test class with common test scenarios |
| test/Libraries/Microsoft.Extensions.DataIngestion.Tests/Chunkers/ChunkAssertions.cs | Helper class for chunk assertions |
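
Token-based chunking with configurable overlap, as listed for DocumentTokenChunker, can be pictured as a sliding window over a token sequence. The sketch below is a rough plain-Python illustration of that general technique under stated assumptions, not the PR's implementation: tokens are taken as an already-tokenized list, and the function name `chunk_tokens` and its parameters `max_tokens` and `overlap` are hypothetical.

```python
def chunk_tokens(tokens, max_tokens, overlap=0):
    """Split `tokens` into windows of at most `max_tokens` items, where each
    window repeats the last `overlap` tokens of the previous one."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    step = max_tokens - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the window already reached the end of the input
        start += step
    return chunks
```

With `overlap=0` the windows are disjoint; a non-zero overlap trades some duplicated tokens for context that carries across chunk boundaries.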

@adamsitnik adamsitnik left a comment

Big thanks for all your hard work and contribution to our product @KrystofS !

I've left some comments, PTAL.

@adamsitnik adamsitnik left a comment

Overall looks good to me. PTAL at my comments (mostly nits). Thank you again @KrystofS !

@adamsitnik adamsitnik left a comment

LGTM, thank you for your contribution @KrystofS !

@adamsitnik adamsitnik changed the title from "Add new document chunkers" to "Add Semantic Similarity chunker" Nov 5, 2025
@adamsitnik adamsitnik enabled auto-merge (squash) November 5, 2025 20:19
@adamsitnik adamsitnik merged commit 192782e into dotnet:main Nov 5, 2025
6 checks passed

3 participants