Merged
Changes from 1 commit
47 commits
c4a6601
upgrade docusaurus version
badmonster0 Aug 21, 2025
369ac26
initial checkin
badmonster0 Aug 21, 2025
3b609d9
example documentation for custom targets
badmonster0 Aug 21, 2025
f44135f
Update custom_targets.md
badmonster0 Aug 21, 2025
7b045be
paper indexing
badmonster0 Aug 21, 2025
fda59b1
Update academic_papers_index.md
badmonster0 Aug 21, 2025
98eaa05
add example for knowledge graphs
badmonster0 Aug 21, 2025
7707e41
add examples for photo search / knowledge graph
badmonster0 Aug 21, 2025
b74a1ed
Create multi_format_index.md
badmonster0 Aug 21, 2025
2ddb232
Update multi_format_index.md
badmonster0 Aug 21, 2025
f18b84d
product recommendation example
badmonster0 Aug 21, 2025
145f488
Merge branch 'main' into examples
badmonster0 Aug 21, 2025
84a553a
Create manual_extraction.md
badmonster0 Aug 21, 2025
0ceda04
Create simple_text_embedding.md
badmonster0 Aug 21, 2025
57a61e2
Delete code_index.md
badmonster0 Aug 21, 2025
70e74a2
patient intake form
badmonster0 Aug 21, 2025
ed847f4
Create image_search.md
badmonster0 Aug 21, 2025
8ccf086
visual & images for examples
badmonster0 Aug 22, 2025
b72a49d
Merge branch 'main' into examples
badmonster0 Aug 22, 2025
e483a71
update example for semantic search 101
badmonster0 Aug 22, 2025
9eefa87
compress image
badmonster0 Aug 22, 2025
8966c05
Merge branch 'main' into examples
badmonster0 Aug 22, 2025
c6542bb
tags & images
badmonster0 Aug 22, 2025
b689d9e
Merge branch 'main' into examples
badmonster0 Aug 26, 2025
23b8130
polish codebase example docs
badmonster0 Aug 26, 2025
83a58b7
add flow overview to codebase example
badmonster0 Aug 26, 2025
2600706
add image to illustrate chunks
badmonster0 Aug 26, 2025
6c99025
Merge branch 'main' into examples
badmonster0 Aug 26, 2025
2d76b05
docs: custom target example
badmonster0 Aug 26, 2025
2c9a3ab
Merge branch 'main' into examples
badmonster0 Aug 26, 2025
d687b5d
docs: docs to knowledge graph, add image illustrations, reorganize ex…
badmonster0 Aug 26, 2025
bc33999
Merge branch 'main' into examples
badmonster0 Aug 26, 2025
1de530f
Merge branch 'main' into examples
badmonster0 Aug 27, 2025
78081ea
Merge branch 'main' into examples
badmonster0 Aug 27, 2025
2bb8792
docs: paper metadata extraction example
badmonster0 Aug 27, 2025
c4f23fb
docs: patient form extraction
badmonster0 Aug 27, 2025
0e0e641
Merge branch 'main' into examples
badmonster0 Aug 27, 2025
c13f000
docs: product recommendation example
badmonster0 Aug 27, 2025
9df543d
Merge branch 'main' into examples
badmonster0 Aug 27, 2025
e39fe0f
docs: ollama example
badmonster0 Aug 27, 2025
b5e4ade
Merge branch 'main' into examples
badmonster0 Aug 27, 2025
25748a7
docs: photo search example
badmonster0 Aug 28, 2025
833718f
Merge branch 'main' into examples
badmonster0 Aug 28, 2025
1627cd4
docs: vector index 101 example
badmonster0 Aug 28, 2025
a4349fd
Merge branch 'main' into examples
badmonster0 Aug 28, 2025
b0c002b
docs: image search example
badmonster0 Aug 28, 2025
12eb6ce
Merge branch 'main' into examples
badmonster0 Aug 28, 2025
docs: paper metadata extraction example
badmonster0 committed Aug 27, 2025
commit 2bb87922589933df88aa2044723fe329f51603ae
114 changes: 60 additions & 54 deletions docs/docs/examples/examples/academic_papers_index.md
@@ -10,11 +10,10 @@ sidebar_custom_props:
tags: [vector-index, metadata]
---

import { GitHubButton, YouTubeButton } from '../../../src/components/GitHubButton';
import { GitHubButton, YouTubeButton, DocumentationButton } from '../../../src/components/GitHubButton';

<GitHubButton url="https://github.com/cocoindex-io/cocoindex/tree/main/examples/paper_metadata"/>


## What we will achieve

1. Extract the paper metadata, including file name, title, author information, abstract, and number of pages.
@@ -27,18 +26,8 @@ to answer questions like "Give me all the papers by Jeff Dean."

4. If you want to perform full PDF embedding for the paper, you can extend the flow.

## Setup

- [Install PostgreSQL](https://cocoindex.io/docs/getting_started/installation#-install-postgres).
CocoIndex uses PostgreSQL internally for incremental processing.
- [Configure your OpenAI API key](https://cocoindex.io/docs/ai/llm#openai).
Alternatively, we have native support for Gemini, Ollama, LiteLLM. Check out the [guide](https://cocoindex.io/docs/ai/llm#ollama).
You can choose your favorite LLM provider and work completely on-premises.

## Define Indexing Flow

To better help you navigate what we will walk through, here is a flow diagram:

## Flow Overview
![Flow Overview](/img/examples/academic_papers_index/flow.png)
1. Import a list of papers in PDF.
2. For each file:
- Extract the first page of the paper.
@@ -50,9 +39,15 @@ To better help you navigate what we will walk through, here is a flow diagram:
- Author-to-paper mapping, for author-based query.
- Embeddings for titles and abstract chunks, for semantic search.

Let’s zoom in on the steps.
## Setup

- [Install PostgreSQL](https://cocoindex.io/docs/getting_started/installation#-install-postgres).
CocoIndex uses PostgreSQL internally for incremental processing.
- [Configure your OpenAI API key](https://cocoindex.io/docs/ai/llm#openai). Alternatively, CocoIndex natively supports Gemini, Ollama, and LiteLLM, so you can choose your preferred LLM provider and, with a local provider such as Ollama, work completely on-premises.

### Import the Papers
<DocumentationButton href="https://cocoindex.io/docs/ai/llm" text="LLM" margin="0 0 16px 0" />

## Import the Papers

```python
@cocoindex.flow_def(name="PaperMetadata")
@@ -65,12 +60,12 @@ def paper_metadata_flow(
)
```

`flow_builder.add_source` will create a table with sub fields (`filename`, `content`),
we can refer to the [documentation](https://cocoindex.io/docs/ops/sources) for more details.
`flow_builder.add_source` creates a table with sub-fields (`filename`, `content`).
<DocumentationButton href="https://cocoindex.io/docs/ops/sources" text="Sources" margin="0 0 16px 0" />
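As a rough mental model, the resulting table can be pictured as one row per file. The sketch below is plain Python, not CocoIndex's implementation; the helper name `load_rows` and the `.pdf` suffix filter are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def load_rows(directory: str, suffix: str = ".pdf") -> list[dict]:
    """Hypothetical helper: build one row per matching file.

    Each row carries the two sub-fields named above: `filename` and `content`.
    """
    root = Path(directory)
    return [
        {"filename": str(path.relative_to(root)), "content": path.read_bytes()}
        for path in sorted(root.rglob(f"*{suffix}"))
    ]


if __name__ == "__main__":
    # Build a throwaway directory so the sketch runs without real papers.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "a.pdf").write_bytes(b"%PDF-1.4 demo")
        (Path(tmp) / "notes.txt").write_text("ignored")
        for row in load_rows(tmp):
            print(row["filename"], len(row["content"]))
```

Files that do not match the suffix are skipped, which mirrors the idea of importing only the PDFs.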

### Extract and collect metadata
## Extract and collect metadata

#### Extract first page for basic info
### Extract first page for basic info

Define a custom function to extract the first page and number of pages of the PDF.

@@ -96,20 +91,19 @@ def extract_basic_info(content: bytes) -> PaperBasicInfo:

```

Now, plug this into your flow.
We extract metadata from the first page to minimize processing cost, since the entire PDF can be very large.
Now plug this into the flow. We extract metadata from the first page to minimize processing cost, since the entire PDF can be very large.

```python
with data_scope["documents"].row() as doc:
    doc["basic_info"] = doc["content"].transform(extract_basic_info)
```
![Extract basic info](/img/examples/academic_papers_index/basic_info.png)

After this step, you should have the basic info of each paper.
After this step, we should have the basic info of each paper.

### Parse basic info

We will convert the first page to Markdown using Marker.
Alternatively, you can easily plug in your favorite PDF parser, such as Docling.
We will convert the first page to Markdown using Marker. Alternatively, you can plug in any PDF parser, such as Docling, using CocoIndex's [custom function](https://cocoindex.io/docs/custom_ops/custom_functions).

Define a marker converter function and cache it, since its initialization is resource-intensive.
This ensures that the same converter instance is reused for different input files.
@@ -140,18 +134,20 @@ def pdf_to_markdown(content: bytes) -> str:
Pass it to your transform:

```python
with data_scope["documents"].row() as doc:
with data_scope["documents"].row() as doc:
# ... process
doc["first_page_md"] = doc["basic_info"]["first_page"].transform(
pdf_to_markdown
)
```
![First page in Markdown](/img/examples/academic_papers_index/first_page.png)

After this step, you should have the first page of each paper in Markdown format.

#### Extract basic info with LLM
### Extract basic info with LLM

Define a schema for LLM extraction. CocoIndex natively supports LLM-structured extraction with complex and nested schemas.
If you are interested in learning more about nested schemas, refer to [this article](https://cocoindex.io/blogs/patient-intake-form-extraction-with-llm).
If you are interested in learning more about nested schemas, refer to [this example](https://cocoindex.io/docs/examples/patient_form_extraction).

```python
@dataclasses.dataclass
@@ -163,7 +159,6 @@ class PaperMetadata:
title: str
authors: list[Author]
abstract: str

```

Plug it into the `ExtractByLlm` function. With a dataclass defined, CocoIndex will automatically parse the LLM response into the dataclass.
@@ -181,26 +176,27 @@ doc["metadata"] = doc["first_page_md"].transform(
```

After this step, you should have the metadata of each paper.
![Metadata](/img/examples/academic_papers_index/metadata.png)
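
To illustrate what parsing an LLM response into a dataclass involves, here is a minimal hand-written sketch. CocoIndex does this for us, so the `parse_metadata` helper, the sample JSON, and the single `name` field on `Author` are assumptions for illustration only.

```python
import dataclasses
import json


@dataclasses.dataclass
class Author:
    name: str  # assumed field; the real Author definition may carry more


@dataclasses.dataclass
class PaperMetadata:
    title: str
    authors: list[Author]
    abstract: str


def parse_metadata(raw_json: str) -> PaperMetadata:
    """Hypothetical helper: turn a JSON string into the nested dataclass."""
    data = json.loads(raw_json)
    return PaperMetadata(
        title=data["title"],
        authors=[Author(**author) for author in data["authors"]],
        abstract=data["abstract"],
    )


# Made-up response text standing in for what an LLM might return.
sample = (
    '{"title": "A Sample Paper", '
    '"authors": [{"name": "Ada Example"}], '
    '"abstract": "A short abstract."}'
)
metadata = parse_metadata(sample)
print(metadata.title, [author.name for author in metadata.authors])
```

The nested `authors` list is the part that benefits most from a declared schema, since each element has to be converted into its own `Author` instance.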

#### Collect paper metadata
### Collect paper metadata

```python
paper_metadata = data_scope.add_collector()
with data_scope["documents"].row() as doc:
# ... process
# Collect metadata
paper_metadata.collect(
filename=doc["filename"],
title=doc["metadata"]["title"],
authors=doc["metadata"]["authors"],
abstract=doc["metadata"]["abstract"],
num_pages=doc["basic_info"]["num_pages"],
)
paper_metadata = data_scope.add_collector()
with data_scope["documents"].row() as doc:
# ... process
# Collect metadata
paper_metadata.collect(
filename=doc["filename"],
title=doc["metadata"]["title"],
authors=doc["metadata"]["authors"],
abstract=doc["metadata"]["abstract"],
num_pages=doc["basic_info"]["num_pages"],
)
```

Just collect anything you need :)

#### Collect `author` to `filename` information
### Collect `author` to `filename` information
We’ve already extracted the author list. Here we want to collect Author → Papers in a separate table to support lookups by author.
To do that, simply collect by author.
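
Conceptually, collecting by author flattens each paper's author list into one row per (author, paper) pair. The following plain-Python sketch shows that shape only; the `author_name` key, the `name` field, and the sample rows are hypothetical and are not taken from the flow.

```python
# Made-up input: one entry per paper, each with a list of authors.
papers = [
    {
        "filename": "paper_a.pdf",
        "authors": [{"name": "Ada Example"}, {"name": "Bo Sample"}],
    },
    {"filename": "paper_b.pdf", "authors": [{"name": "Ada Example"}]},
]


def author_paper_rows(papers: list[dict]) -> list[dict]:
    """Emit one row per (author, paper) pair, ready for an author lookup table."""
    return [
        {"author_name": author["name"], "filename": paper["filename"]}
        for paper in papers
        for author in paper["authors"]
    ]


rows = author_paper_rows(papers)
for row in rows:
    print(row)
```

An author who appears on several papers ends up with several rows, which is what makes a per-author query straightforward.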

@@ -216,9 +212,9 @@ with data_scope["documents"].row() as doc:
```


### Compute and collect embeddings
## Compute and collect embeddings

#### Title
### Title

```python
doc["title_embedding"] = doc["metadata"]["title"].transform(
@@ -228,7 +224,7 @@ doc["title_embedding"] = doc["metadata"]["title"].transform(
)
```

#### Abstract
### Abstract

Split the abstract into chunks, embed each chunk, and collect the embeddings.
Chunking matters because an abstract can sometimes be very long.
@@ -252,6 +248,8 @@ doc["abstract_chunks"] = doc["metadata"]["abstract"].transform(

After this step, you should have the abstract chunks of each paper.

![Abstract chunks](/img/examples/academic_papers_index/abstract_chunks.png)

Embed each chunk and collect their embeddings.

```python
@@ -265,7 +263,9 @@ with doc["abstract_chunks"].row() as chunk:

After this step, you should have the embeddings of the abstract chunks of each paper.

#### Collect embeddings
![Abstract chunks embeddings](/img/examples/academic_papers_index/chunk_embedding.png)
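
The chunk-then-embed pattern can be sketched without any dependencies. This is an illustration only: the word-window splitter and the hash-based `toy_embedding` are stand-ins we made up, not the splitter or the embedding model the flow uses.

```python
import hashlib
import math


def split_into_chunks(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into windows of `chunk_size` words that overlap by `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embedding(text: str, dim: int = 8) -> list[float]:
    """Stand-in for a real model: hash words into buckets, then L2-normalize."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.sha256(word.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


# A made-up 100-word "abstract" so the sketch is self-contained.
abstract = " ".join(f"word{i}" for i in range(100))
chunk_rows = [
    {"text": chunk, "embedding": toy_embedding(chunk)}
    for chunk in split_into_chunks(abstract)
]
print(len(chunk_rows), len(chunk_rows[0]["embedding"]))
```

Overlapping windows keep a sentence that straddles a boundary visible in both neighboring chunks, at the cost of some duplicated text.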

### Collect embeddings

```python
metadata_embeddings = data_scope.add_collector()
@@ -292,7 +292,7 @@ with data_scope["documents"].row() as doc:
)
```

### Export
## Export
Finally, we export the data to Postgres.

```python
@@ -319,14 +319,9 @@ metadata_embeddings.export(
)
```

In this example we use PGVector as embedding stores/
With CocoIndex, you can do one line switch on other supported Vector databases like Qdrant, see this [guide](https://cocoindex.io/docs/ops/targets#entry-oriented-targets) for more details.
We aim to standardize interfaces and make it like assembling building blocks.
In this example we use PGVector as the embedding store. With CocoIndex, you can switch to another supported vector database with a one-line change.

## View in CocoInsight step by step

You can walk through the project step by step in [CocoInsight](https://www.youtube.com/watch?v=MMrpUfUcZPk) to see
exactly how each field is constructed and what happens behind the scenes.
<DocumentationButton href="https://cocoindex.io/docs/ops/targets#entry-oriented-targets" text="Entry Oriented Targets" margin="0 0 16px 0" />

## Query the index

@@ -338,3 +333,14 @@ For now CocoIndex doesn't provide additional query interface. We can write SQL o
- The query space has excellent solutions for querying, reranking, and other search-related functionality.

If you need assistance writing the query, feel free to reach out to us on [Discord](https://discord.com/invite/zpA9S2DR7s).
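
As one example of such a query, an author lookup is a plain `SELECT` against the author-to-paper table. The sketch below uses Python's built-in `sqlite3` as a stand-in for Postgres so that it runs anywhere; the table name `author_papers`, its columns, and the sample rows are hypothetical, so substitute the names your export actually creates.

```python
import sqlite3

# In-memory SQLite database standing in for the exported Postgres table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE author_papers (author_name TEXT, filename TEXT)")
conn.executemany(
    "INSERT INTO author_papers VALUES (?, ?)",
    [
        # Made-up rows for illustration only.
        ("Jeff Dean", "paper_a.pdf"),
        ("Jeff Dean", "paper_b.pdf"),
        ("Ada Example", "paper_c.pdf"),
    ],
)


def papers_by_author(connection: sqlite3.Connection, author_name: str) -> list[str]:
    """Return the filenames of all papers collected for one author."""
    cursor = connection.execute(
        "SELECT filename FROM author_papers WHERE author_name = ? ORDER BY filename",
        (author_name,),
    )
    return [row[0] for row in cursor]


print(papers_by_author(conn, "Jeff Dean"))
```

Passing the author name as a bound parameter rather than formatting it into the SQL string avoids injection issues; with a Postgres driver the placeholder syntax will typically differ.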

## CocoInsight

You can walk through the project step by step in [CocoInsight](https://www.youtube.com/watch?v=MMrpUfUcZPk) to see exactly how each field is constructed and what happens behind the scenes.


```sh
cocoindex server -ci main.py
```

Open `https://cocoindex.io/cocoinsight` in your browser. It connects to your local CocoIndex server, with zero pipeline data retention.
6 changes: 4 additions & 2 deletions docs/docs/examples/examples/docs_to_knowledge_graph.md
@@ -35,8 +35,10 @@ and then build a knowledge graph.
## Setup
* [Install PostgreSQL](https://cocoindex.io/docs/getting_started/installation#-install-postgres). CocoIndex uses PostgreSQL internally for incremental processing.
* [Install Neo4j](https://cocoindex.io/docs/ops/targets#neo4j-dev-instance), a graph database.
* [Configure your OpenAI API key](https://cocoindex.io/docs/ai/llm#openai). Alternatively, you can switch to Ollama, which runs LLM models locally.
<DocumentationButton href="https://cocoindex.io/docs/ai/llm#ollama" text="Ollama" margin="0 0 16px 0" />
* [Configure your OpenAI API key](https://cocoindex.io/docs/ai/llm#openai). Alternatively, CocoIndex natively supports Gemini, Ollama, and LiteLLM, so you can choose your preferred LLM provider and, with a local provider such as Ollama, work completely on-premises.

<DocumentationButton href="https://cocoindex.io/docs/ai/llm" text="LLM" margin="0 0 16px 0" />


## Documentation
<DocumentationButton href="https://cocoindex.io/docs/ops/targets#property-graph-targets" text="Property Graph Targets" margin="0 0 16px 0" />
Binary file modified docs/static/img/examples/academic_papers_index/cover.png
2 changes: 1 addition & 1 deletion examples/paper_metadata/pyproject.toml
@@ -4,7 +4,7 @@ version = "0.1.0"
description = "Build index for papers with both metadata and content embeddings"
requires-python = ">=3.11"
dependencies = [
"cocoindex[embeddings]>=0.1.79",
"cocoindex[embeddings]>=0.1.83",
"pypdf>=5.7.0",
"marker-pdf>=1.5.2",
]