Smart Text Processing

"Finally, semantic chunking that understands meaning!" — Text processing, done intelligently.

Text scanners for Go that go beyond simple line-by-line processing. The library is built around semantic understanding, which makes text chunking intelligent and context-aware.

Traditional text processing treats all chunks equally, but meaning isn't uniform across text. The scanner library uses embedding-based semantic analysis to group related content together, making it perfect for RAG systems, document analysis, and intelligent text processing pipelines.

Quick Start: Semantic Chunking

Here's how to chunk text by semantic similarity instead of arbitrary boundaries:

package main

import (
  "fmt"
  "strings"

  "github.com/fogfish/scanner"
)

func main() {
  var api scanner.Embedder // assign an instance of your embedding vector provider here
  text := `
  The quick brown fox jumps over the lazy dog. 
  This is a classic pangram used in typography.
  
  Machine learning has revolutionized AI.
  Neural networks can now understand language context.
  
  Climate change affects global weather patterns.
  Rising temperatures impact ecosystems worldwide.
  `

  // Break text into sentences first
  sentences := scanner.NewSentencer(
    scanner.EndOfSentence, 
    strings.NewReader(text),
  )

  // Group sentences by semantic similarity
  semantic := scanner.NewSemantic(api, sentences)
  semantic.Window(10)                         // Look at 10 sentences at a time
  semantic.Similarity(scanner.HighSimilarity) // Group highly similar content

  // Get semantically coherent chunks
  for semantic.Scan() {
    chunk := semantic.Text()
    fmt.Printf("Semantic chunk: %v\n", chunk)
    // Output will group related sentences together:
    // - Typography sentences together
    // - AI/ML sentences together
    // - Climate sentences together
  }
}

This approach produces chunks where sentences actually relate to each other, rather than arbitrary splits that might separate related concepts.

Why Semantic Chunking Matters

Traditional chunking problems:

Splits related content across chunks
Breaks context mid-conversation
Fixed boundaries ignore meaning
Poor retrieval in RAG systems

Semantic chunking benefits:

Keeps related content together
Maintains semantic coherence
Context-aware boundaries
Better embedding similarity for retrieval

Perfect for:

RAG Systems: Better retrieval through coherent chunks
Document Analysis: Group related paragraphs and concepts
Content Summarization: Preserve topic boundaries
Text Classification: Maintain semantic integrity

The Scanner Toolkit

Beyond semantic chunking, the library provides a complete text processing toolkit:

Scanner	Purpose	Use Case
Semantic	Groups by meaning similarity	RAG, document analysis
Sentencer	Splits by punctuation	Natural sentence boundaries
Slicer	Fixed delimiter splitting	CSV, structured data
Chunker	Fixed-size chunks	Token limits, simple splitting
Tagger	Tag bounded chunks	Markup data
Sorter	Semantic sorting of data	Organizing similar items
Identity	Entire input as one chunk	Small documents

All scanners follow the familiar bufio.Scanner pattern, exposing Scan and Text methods:

for scanner.Scan() {
  text := scanner.Text()
  // Process chunk
}
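
For readers less familiar with the Scan/Text loop, here is a minimal sketch of the same pattern using only the standard library's bufio.Scanner with a custom split function. The fixedSize and chunks helpers are hypothetical names made up for this illustration; they are not part of the scanner library and do not show how its Chunker is implemented.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// fixedSize returns a bufio.SplitFunc that emits tokens of at most n bytes.
// Hypothetical helper for illustration only. It assumes n > 0 and counts
// bytes, so it may cut a multi-byte UTF-8 character in half.
func fixedSize(n int) bufio.SplitFunc {
	return func(data []byte, atEOF bool) (advance int, token []byte, err error) {
		if len(data) >= n {
			return n, data[:n], nil
		}
		if atEOF && len(data) > 0 {
			// Flush the final, shorter chunk.
			return len(data), data, nil
		}
		// Ask for more data, or stop once the input is exhausted.
		return 0, nil, nil
	}
}

// chunks collects every token produced by the Scan/Text loop.
func chunks(text string, n int) []string {
	s := bufio.NewScanner(strings.NewReader(text))
	s.Split(fixedSize(n))

	var out []string
	for s.Scan() {
		out = append(out, s.Text())
	}
	return out
}

func main() {
	for _, c := range chunks("abcdefghij", 4) {
		fmt.Printf("%q\n", c) // prints "abcd", "efgh", "ij" on separate lines
	}
}
```

The loop body never needs to know how the boundaries were chosen, which is why the same for/Scan/Text shape works for every scanner in the table above.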

Similarity Control

Fine-tune semantic grouping with built-in similarity functions:

semantic.Similarity(scanner.HighSimilarity)   // Very similar content (0.0-0.2)
semantic.Similarity(scanner.MediumSimilarity) // Related content (0.2-0.5)
semantic.Similarity(scanner.WeakSimilarity)   // Loosely related (0.5-0.8)

// Custom similarity threshold
semantic.Similarity(scanner.RangeSimilarity(0.1, 0.3))

// Custom similarity logic
semantic.Similarity(scanner.CosineSimilarity(func(d float32) bool {
    return d < 0.25 // Custom threshold
}))
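
To make the thresholds concrete, the sketch below computes a cosine distance between two vectors and tests it against a range. It assumes the numbers in the comments above are cosine distances (0 means the same direction, larger means less similar); the README does not state this explicitly. The cosineDistance and within helpers are hypothetical and standalone, not the library's own functions.

```go
package main

import (
	"fmt"
	"math"
)

// cosineDistance returns 1 - cos(a, b): 0 for vectors pointing the same way,
// 1 for orthogonal ones. It assumes a and b have equal length and treats a
// zero vector as maximally dissimilar (distance 1).
func cosineDistance(a, b []float32) float32 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 1
	}
	d := 1 - dot/(math.Sqrt(na)*math.Sqrt(nb))
	if d < 0 {
		d = 0 // guard against tiny negative values from rounding
	}
	return float32(d)
}

// within builds a predicate that accepts distances in the closed range [lo, hi].
func within(lo, hi float32) func(float32) bool {
	return func(d float32) bool { return d >= lo && d <= hi }
}

func main() {
	high := within(0.0, 0.2) // the range quoted for "very similar" content

	a := []float32{1, 0}
	b := []float32{0.9, 0.1}
	c := []float32{0, 1}

	fmt.Println(high(cosineDistance(a, b))) // true
	fmt.Println(high(cosineDistance(a, c))) // false
}
```

A predicate of this shape, func(float32) bool, matches the callback passed to scanner.CosineSimilarity in the snippet above.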

Algorithm Behavior

Control how chunks grow:

// Compare new sentences to the first sentence in chunk (stable reference)
semantic.SimilarityWith(scanner.SIMILARITY_WITH_HEAD)

// Compare new sentences to the last added sentence (evolving reference)  
semantic.SimilarityWith(scanner.SIMILARITY_WITH_TAIL)
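
The difference between the two modes is easiest to see on toy data. The sketch below is a simplified, standalone model and not the library's implementation: it assumes sentences are visited in order, a new chunk starts whenever the cosine distance to the reference exceeds a limit, and 2-D unit vectors stand in for real embeddings. All names (group, unit, withHead, withTail) are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// reference selects which sentence of the current chunk a candidate is compared to.
type reference int

const (
	withHead reference = iota // first sentence of the chunk (stable)
	withTail                  // most recently added sentence (evolving)
)

// unit returns a 2-D unit vector at the given angle, standing in for an embedding.
func unit(degrees float64) []float64 {
	rad := degrees * math.Pi / 180
	return []float64{math.Cos(rad), math.Sin(rad)}
}

// distance is the cosine distance between two non-zero vectors of equal length.
func distance(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return 1 - dot/math.Sqrt(na*nb)
}

// group walks the vectors in order and starts a new chunk whenever the
// distance to the reference vector exceeds limit. It returns chunks of indices.
func group(vecs [][]float64, limit float64, ref reference) [][]int {
	var out [][]int
	var cur []int
	for i, v := range vecs {
		if len(cur) > 0 {
			anchor := cur[0]
			if ref == withTail {
				anchor = cur[len(cur)-1]
			}
			if distance(vecs[anchor], v) > limit {
				out = append(out, cur)
				cur = nil
			}
		}
		cur = append(cur, i)
	}
	if len(cur) > 0 {
		out = append(out, cur)
	}
	return out
}

func main() {
	// Each vector drifts 20 degrees further away from the first one.
	vecs := [][]float64{unit(0), unit(20), unit(40), unit(60)}

	fmt.Println(group(vecs, 0.1, withHead)) // [[0 1] [2 3]]
	fmt.Println(group(vecs, 0.1, withTail)) // [[0 1 2 3]]
}
```

With a head reference the chunk closes once content drifts too far from where it started. With a tail reference each step is small, so a slowly shifting topic can stay in one chunk.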

Getting Started

The library requires Go 1.24 or later.

go get -u github.com/fogfish/scanner

The library is compatible with any embedding provider: OpenAI, Cohere, local models, or custom implementations. Just implement the Embedder interface:

type Embedder interface {
    Embedding(ctx context.Context, text string) ([]float32, int, error)
}

How To Contribute

The library is MIT licensed and accepts contributions via GitHub pull requests:

1. Fork it
2. Create your feature branch (git checkout -b my-new-feature)
3. Commit your changes (git commit -am 'Added some feature')
4. Push to the branch (git push origin my-new-feature)
5. Create a new Pull Request

git clone https://github.com/fogfish/scanner
cd scanner
go test ./...

License

See LICENSE
