[FR] Support zoekt indexing of binary files #575

@brendan-kellam

Description

When trying to index binary files (like PDFs), zoekt skips them with the following reason: "binary content at byte offset x". Zoekt appears to refuse any file in which it detects binary content, with x being the offset where that content was first found.


Diagnosing with Ask:

You're absolutely correct! The primary reason PDFs and other binary files are skipped is that they contain null bytes (0x00), not size limits or trigram counts.

The Null Byte Check

In IndexBuilder.Add() indexbuilder.go:428-431:

if idx := bytes.IndexByte(doc.Content, 0); idx >= 0 {
    doc.SkipReason = fmt.Sprintf("binary content at byte offset %d", idx)
    doc.Language = "binary"
}

This checks if there's a null byte anywhere in the file content. If found, it:

  • Sets the SkipReason to indicate where the binary content was detected
  • Marks the language as "binary"
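
As a rough illustration of the detection step only, the sketch below runs the same bytes.IndexByte call on a plain-text input and on some PDF-like bytes. The helper name firstNullByte and the sample bytes are made up for this example and are not part of zoekt.

```go
package main

import (
	"bytes"
	"fmt"
)

// firstNullByte returns the offset of the first 0x00 byte in content,
// or -1 if there is none. It wraps the same bytes.IndexByte call that
// the quoted snippet uses. (Hypothetical helper, not a zoekt function.)
func firstNullByte(content []byte) int {
	return bytes.IndexByte(content, 0)
}

func main() {
	text := []byte("package main\n\nfunc main() {}\n")
	// Made-up PDF-like bytes: a text header followed by raw binary data.
	pdfLike := []byte("%PDF-1.7\n\x00\x01\x02")

	fmt.Println(firstNullByte(text))    // -1: no null byte, treated as text
	fmt.Println(firstNullByte(pdfLike)) // 9: first null byte after the 9-byte header
}
```

Any non-negative result is what triggers the "binary content at byte offset %d" skip reason in the quoted code.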

Then the content is replaced indexbuilder.go:433-440:

if doc.SkipReason != "" {
    doc.Content = []byte(notIndexedMarker + doc.SkipReason)
    doc.Symbols = nil
    doc.SymbolsMetaData = nil
    if doc.Language == "" {
        doc.Language = "skipped"
    }
}

The file content is replaced with a marker string like "NOT-INDEXED: binary content at byte offset 5" indexbuilder.go:386.

Similar Check in DocChecker

The DocChecker.Check() method has a similar, though not identical, check (it tests index > 0 and words the error as "binary data") indexbuilder.go:544-546:

if index := bytes.IndexByte(content, 0); index > 0 {
    return fmt.Errorf("binary data at byte offset %d", index)
}
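
To see what the > 0 versus >= 0 comparison changes in practice, here is a minimal sketch that evaluates both quoted conditions on the same inputs. The helper names addStyle and checkStyle are invented for this comparison and do not exist in zoekt.

```go
package main

import (
	"bytes"
	"fmt"
)

// addStyle mirrors the condition quoted from IndexBuilder.Add (idx >= 0).
func addStyle(content []byte) bool {
	return bytes.IndexByte(content, 0) >= 0
}

// checkStyle mirrors the condition quoted from DocChecker.Check (index > 0).
func checkStyle(content []byte) bool {
	return bytes.IndexByte(content, 0) > 0
}

func main() {
	leading := []byte("\x00abc") // null byte at offset 0
	later := []byte("abc\x00")   // null byte at offset 3

	fmt.Println(addStyle(leading), checkStyle(leading)) // true false
	fmt.Println(addStyle(later), checkStyle(later))     // true true
}
```

Under these assumptions, the only input the two conditions disagree on is content whose very first byte is null.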

Bottom line: PDFs cannot be indexed in Zoekt because they typically contain null bytes, which are common in binary file formats. A single null byte is enough for the file to be marked as binary and excluded from text-based indexing. The code paths shown above expose no configuration option to override this behavior, since Zoekt is designed for text-based source code search.

Metadata

    Labels

    enhancement: New feature or request
    zoekt: Issue related to the Zoekt code search engine. https://github.com/sourcebot-dev/zoekt
