Proofread: 07-TextSplitter/04-SemanticChunker #109

Hello @BaBetterB,

This is a review of the translation in https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/pull/68 (revised in https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/pull/107).

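The notebook is grammatically sound, but I would like to make three main suggestions.
1. In the overview paragraph, I think **Semantic Chunker** would be better written in backticks as `SemanticChunker`, consistent with the notation used for `functions`. I would also like to revise words and sentences, including this notation change, toward more natural English phrasing.
2. In keeping with the purpose of a tutorial, I would like to switch some parts to a conversational style that walks the reader through the order of execution.
3. I changed the subheadings to `Percentile-Based Splitting`, `Standard Deviation Splitting`, and `Interquartile Range Splitting`, and I would like to express some of the mathematical notation a bit more clearly.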

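So that you can see the overall flow with all of these changes applied, I have added a [link](https://github.com/chaeyoonyunakim/LangChain-OpenTutorial/commit/60a427d3d64628f78b9a9aed2eaf8503ed283574) to the translated and proofread commit. Please note that this is a *version without notebook execution output*.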

## Overview

This tutorial dives into a Text Splitter that uses semantic similarity to split text.

LangChain's `SemanticChunker` is a powerful tool that takes document chunking to a whole new level. Unlike traditional methods that split text at fixed intervals, the `SemanticChunker` analyzes the meaning of the content to create more logical divisions.

This approach relies on **OpenAI's embedding model**, calculating how similar different pieces of text are by converting them into numerical representations. The tool offers various splitting options to suit your needs. You can choose from methods based on percentiles, standard deviation, or interquartile range.

What sets the `SemanticChunker` apart is its ability to preserve context by identifying natural breaks. This ultimately leads to better performance when working with large language models.

Since the `SemanticChunker` understands the actual content, it generates chunks that are more useful and maintain the flow and context of the original document.

See [Greg Kamradt's notebook](https://github.com/FullStackRetrieval-com/RetrievalTutorials/blob/main/tutorials/LevelsOfTextSplitting/5_Levels_Of_Text_Splitting.ipynb)

The method breaks down the text into individual sentences first. Then, it groups semantically similar sentences into chunks (e.g., 3 sentences), and finally merges similar sentences in the embedding space.
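The first two steps can be sketched in plain Python. This is only one reading of the grouping step: it assumes each sentence is joined with its immediate neighbors to form a window of up to three sentences. The helper names `split_sentences` and `window_sentences` and the `buffer_size` parameter are hypothetical, chosen for illustration, and are not taken from the notebook.

```python
import re


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '?' or '!' followed by whitespace.
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def window_sentences(sentences, buffer_size=1):
    # Join each sentence with up to `buffer_size` neighbors on each side,
    # so buffer_size=1 gives windows of at most three sentences.
    windows = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        windows.append(" ".join(sentences[start:end]))
    return windows


text = "Cats purr. Dogs bark. Stocks fell today. Bonds rallied."
sentences = split_sentences(text)
windows = window_sentences(sentences)
```

Each window, rather than each bare sentence, would then be embedded, which may smooth out noise from very short sentences.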

## Breakpoints
This chunking process works by identifying natural breaks between sentences.

Here's how it decides where to split the text:
1. It embeds each sentence and calculates the difference between the embeddings of each pair of adjacent sentences.
2. When the difference between two sentences exceeds a certain threshold (breakpoint), the `text_splitter` identifies this as a natural break and splits the text at that point.
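
The two steps above can be sketched as follows. This is a minimal illustration, not the `SemanticChunker` implementation: it assumes cosine distance as the "difference", uses hand-made two-dimensional vectors in place of real embeddings, and the function names and the threshold of `0.5` are hypothetical.

```python
import math


def cosine_distance(a, b):
    # 1 - cosine similarity; larger values mean less similar vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def split_at_breakpoints(sentences, embeddings, threshold):
    if not sentences:
        return []
    # Step 1: distance between each pair of adjacent sentences.
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    # Step 2: start a new chunk wherever the distance exceeds the threshold.
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = ["Cats purr.", "Dogs bark.", "Stocks fell today.", "Bonds rallied."]
# Toy vectors (assumption): the first two point one way, the last two another.
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
chunks = split_at_breakpoints(sentences, toy_embeddings, threshold=0.5)
```

With these toy vectors the only large distance falls between the second and third sentences, so the text is split into two chunks at that point.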

Check out [Greg Kamradt's video](https://youtu.be/8OJC21T2SL4?si=PzUtNGYJ_KULq3-w&t=2580) for more details.

### Percentile-Based Splitting
This method sorts all embedding differences between sentences and takes the value at a specific percentile (e.g., the 70th percentile) as the threshold. The text is then split wherever the difference exceeds that value.
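
As a rough sketch of the percentile rule, the snippet below computes a threshold from a list of made-up distances and reports where the text would be split. It assumes a linear-interpolation percentile, and the `percentile` and `breakpoints_by_percentile` helpers are hypothetical names, not part of LangChain.

```python
def percentile(values, q):
    # Linear-interpolation percentile over the sorted values (assumption).
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def breakpoints_by_percentile(distances, q):
    # Index i marks a break between sentence i and sentence i + 1.
    threshold = percentile(distances, q)
    return [i for i, d in enumerate(distances) if d > threshold]


# Made-up distances between adjacent sentences, for illustration only.
distances = [0.10, 0.15, 0.80, 0.12, 0.20, 0.75, 0.11]
threshold = percentile(distances, 70)
breakpoints = breakpoints_by_percentile(distances, 70)
```

Raising the percentile raises the threshold, which tends to produce fewer and larger chunks, while lowering it tends to produce more and smaller ones.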


Thank you.

