[FR] Support zoekt indexing of binary files

When trying to index binary files (like PDFs), zoekt skips them with the following reason: `binary content at byte offset x`. It seems zoekt refuses to index a file once it detects binary content at some byte offset.

---

Diagnosing with Ask:

You're absolutely correct! The primary reason PDFs and other binary files are skipped is that they contain **null bytes (0x00)**, not size limits or trigram counts.

## The Null Byte Check

**In `IndexBuilder.Add()`** [indexbuilder.go:428-431](indexbuilder.go):

```go
if idx := bytes.IndexByte(doc.Content, 0); idx >= 0 {
    doc.SkipReason = fmt.Sprintf("binary content at byte offset %d", idx)
    doc.Language = "binary"
}
```

This checks if there's a null byte anywhere in the file content. If found, it:
- Sets the `SkipReason` to indicate where the binary content was detected
- Marks the language as "binary"

**Then the content is replaced** [indexbuilder.go:433-440](indexbuilder.go):

```go
if doc.SkipReason != "" {
    doc.Content = []byte(notIndexedMarker + doc.SkipReason)
    doc.Symbols = nil
    doc.SymbolsMetaData = nil
    if doc.Language == "" {
        doc.Language = "skipped"
    }
}
```

The file content is replaced with a marker string like `"NOT-INDEXED: binary content at byte offset 5"` [indexbuilder.go:386](indexbuilder.go).
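To make the two steps above concrete, here is a minimal standalone sketch. It is not zoekt's actual code: the `document` struct and the `markBinary` helper are hypothetical stand-ins, and the value of `notIndexedMarker` is assumed from the marker string quoted above.

```go
package main

import (
	"bytes"
	"fmt"
)

// notIndexedMarker is assumed from the marker string quoted above.
const notIndexedMarker = "NOT-INDEXED: "

// document is a hypothetical, trimmed-down stand-in for zoekt's document type.
type document struct {
	Name       string
	Content    []byte
	Language   string
	SkipReason string
}

// markBinary mirrors the two quoted steps: detect a null byte,
// then replace the content with the marker plus the skip reason.
func markBinary(doc *document) {
	if idx := bytes.IndexByte(doc.Content, 0); idx >= 0 {
		doc.SkipReason = fmt.Sprintf("binary content at byte offset %d", idx)
		doc.Language = "binary"
	}
	if doc.SkipReason != "" {
		doc.Content = []byte(notIndexedMarker + doc.SkipReason)
		if doc.Language == "" {
			doc.Language = "skipped"
		}
	}
}

func main() {
	docs := []*document{
		{Name: "main.go", Content: []byte("package main\n")},
		// Fake PDF-like content: five text bytes, then a null byte at offset 5.
		{Name: "report.pdf", Content: []byte("%PDF-\x00\x01\x02")},
	}
	for _, d := range docs {
		markBinary(d)
		fmt.Printf("%s: language=%q content=%q\n", d.Name, d.Language, d.Content)
	}
}
```

With these inputs, `main.go` keeps its content untouched, while the content of `report.pdf` becomes `NOT-INDEXED: binary content at byte offset 5`, so only the marker text would ever reach the index.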

## Similar Check in DocChecker

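The `DocChecker.Check()` method has a similar, though not identical, check [indexbuilder.go:544-546](indexbuilder.go). As quoted, it differs in two details: the message says "binary data" rather than "binary content", and the condition is `index > 0` rather than `idx >= 0`, so a null byte at offset 0 would not trip this particular check: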

```go
if index := bytes.IndexByte(content, 0); index > 0 {
    return fmt.Errorf("binary data at byte offset %d", index)
}
```
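The difference between the two comparisons is easy to see in isolation. The sketch below is an illustration only: `checkStrict` and `checkLoose` are hypothetical names, and they reproduce just the null-byte condition from each excerpt, not any other checks the real methods may perform.

```go
package main

import (
	"bytes"
	"fmt"
)

// checkStrict uses the ">= 0" comparison from the IndexBuilder.Add excerpt.
func checkStrict(content []byte) error {
	if idx := bytes.IndexByte(content, 0); idx >= 0 {
		return fmt.Errorf("binary content at byte offset %d", idx)
	}
	return nil
}

// checkLoose uses the "> 0" comparison from the DocChecker.Check excerpt.
func checkLoose(content []byte) error {
	if index := bytes.IndexByte(content, 0); index > 0 {
		return fmt.Errorf("binary data at byte offset %d", index)
	}
	return nil
}

func main() {
	samples := [][]byte{
		[]byte("plain text"),
		[]byte("abc\x00def"),
		[]byte("\x00leading null"),
	}
	for _, s := range samples {
		fmt.Printf("%q: >=0 -> %v, >0 -> %v\n", s, checkStrict(s), checkLoose(s))
	}
}
```

Both variants accept plain text and reject a null byte in the middle of the content. Only the `>= 0` variant rejects content whose very first byte is null.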

**Bottom line**: PDFs cannot be indexed in Zoekt because they contain null bytes, which are common in binary file formats. Even a single null byte causes the file to be marked as binary and excluded from text-based indexing. There's no configuration option to override this behavior, since Zoekt is designed specifically for text-based source code search.
