Description
When I try to index binary files (like PDFs), they are skipped with the following reason: binary content at byte offset x. It seems Zoekt refuses to index any file in which it detects binary content.
Diagnosing with Ask:
You're absolutely correct! The primary reason PDFs and other binary files are skipped is because they contain null bytes (0x00), not just because of size limits or trigram counts.
The Null Byte Check
In IndexBuilder.Add() (indexbuilder.go:428-431):
if idx := bytes.IndexByte(doc.Content, 0); idx >= 0 {
	doc.SkipReason = fmt.Sprintf("binary content at byte offset %d", idx)
	doc.Language = "binary"
}
This checks whether there is a null byte anywhere in the file content. If one is found, it:
- Sets the SkipReason to indicate where the binary content was detected
- Marks the language as "binary"
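The check can be reproduced outside Zoekt to see which offset it would report for a given file. The following is a minimal sketch, not Zoekt code: `firstNullByte` is a hypothetical helper, and the sample byte slices are made up for illustration.

```go
package main

import (
	"bytes"
	"fmt"
)

// firstNullByte returns the offset of the first 0x00 byte in content,
// or -1 if there is none. Hypothetical helper, not part of Zoekt.
func firstNullByte(content []byte) int {
	return bytes.IndexByte(content, 0)
}

func main() {
	// Made-up sample contents: one plain text, one PDF-like with a null byte.
	samples := []struct {
		name    string
		content []byte
	}{
		{"notes.txt", []byte("hello world\n")},
		{"paper.pdf", []byte("%PDF-1.7\n\x00\x01binary")},
	}

	for _, s := range samples {
		if idx := firstNullByte(s.content); idx >= 0 {
			fmt.Printf("%s: binary content at byte offset %d\n", s.name, idx)
		} else {
			fmt.Printf("%s: no null byte found\n", s.name)
		}
	}
}
```

Reading a real file with `os.ReadFile` and passing the result to the same helper would show the offset that ends up in the skip reason.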
Then the content is replaced (indexbuilder.go:433-440):
if doc.SkipReason != "" {
	doc.Content = []byte(notIndexedMarker + doc.SkipReason)
	doc.Symbols = nil
	doc.SymbolsMetaData = nil
	if doc.Language == "" {
		doc.Language = "skipped"
	}
}
The file content is replaced with a marker string like "NOT-INDEXED: binary content at byte offset 5" (indexbuilder.go:386).
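To make the effect of the replacement concrete, here is a small self-contained sketch. It assumes the marker prefix is "NOT-INDEXED: " (inferred from the example string above), and the `document` type and `applySkip` function are simplified, hypothetical stand-ins rather than Zoekt's actual types.

```go
package main

import "fmt"

// notIndexedMarker is assumed from the example marker string quoted above.
const notIndexedMarker = "NOT-INDEXED: "

// document is a simplified, hypothetical stand-in for the indexed document.
type document struct {
	Content    []byte
	Language   string
	SkipReason string
}

// applySkip mirrors the quoted logic: when a skip reason is set, the original
// content is discarded and replaced by the marker plus the reason.
func applySkip(doc *document) {
	if doc.SkipReason == "" {
		return
	}
	doc.Content = []byte(notIndexedMarker + doc.SkipReason)
	if doc.Language == "" {
		doc.Language = "skipped"
	}
}

func main() {
	doc := &document{
		Content:    []byte("%PDF\x00..."),
		Language:   "binary",
		SkipReason: "binary content at byte offset 4",
	}
	applySkip(doc)
	fmt.Printf("content=%q language=%q\n", doc.Content, doc.Language)
}
```

In this sketch the language stays "binary" because it was already set by the null byte check; "skipped" is only the fallback for documents skipped without a language.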
Similar Check in DocChecker
The DocChecker.Check() method has a similar check (indexbuilder.go:544-546):
if index := bytes.IndexByte(content, 0); index > 0 {
	return fmt.Errorf("binary data at byte offset %d", index)
}
As quoted, this condition uses index > 0 rather than >= 0, so a null byte at offset 0 would not trigger this particular check, although IndexBuilder.Add() would still catch it.
Bottom line: PDFs cannot be indexed in Zoekt because they contain null bytes, which are common in binary file formats. The presence of even a single null byte causes the file to be marked as binary and excluded from text-based indexing. There's no configuration option to override this behavior since Zoekt is designed specifically for text-based source code search.
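The two conditions quoted above differ in one character (`>= 0` in IndexBuilder.Add() versus `> 0` in DocChecker.Check()). The sketch below only illustrates what that difference means for a null byte at the very start of the content; the function names are hypothetical and the code is not taken from Zoekt.

```go
package main

import (
	"bytes"
	"fmt"
)

// flaggedByAdd mirrors the condition quoted from IndexBuilder.Add():
// any null byte, including one at offset 0, counts as binary.
func flaggedByAdd(content []byte) bool {
	return bytes.IndexByte(content, 0) >= 0
}

// flaggedByCheck mirrors the condition quoted from DocChecker.Check():
// a null byte at offset 0 does not satisfy index > 0.
func flaggedByCheck(content []byte) bool {
	return bytes.IndexByte(content, 0) > 0
}

func main() {
	inputs := [][]byte{
		[]byte("plain text"),
		[]byte("a\x00b"),
		{0, 'a', 'b'},
	}
	for _, in := range inputs {
		fmt.Printf("%q: add=%v check=%v\n", in, flaggedByAdd(in), flaggedByCheck(in))
	}
}
```

For the PDF case in this issue the difference does not matter, since the reported offsets are well past the start of the file.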