Thanks to visit codestin.com
Credit goes to Github.com

Skip to content

A simple 3D asset retrieval system based on objaverse. Query any 3D asset using text(CN/EN) or images, inter-modal or cross-modal. Equipped with backend API service and front-end gradio demo.

License

Notifications You must be signed in to change notification settings

3DSceneAgent/AssetRetrieval3D

Repository files navigation

3D Asset Retrieval Demo

Python License Gradio 中文

A simple multi-modal 3D asset retrieval system. This repo uses objaverse as a simple demonstration, while it can generalize and scale to any 3D dataset.

Demo

1. Features

Demo built upon objaverse, using Cap3D 650k+ english captions, 650k+ translated captions ,and 260k+ gobjaverse asset renderings

  • Text Search: Retrieve 3D assets in English or Chinese
  • Image Search: Retrieve 3D assets using a single RGB image
  • Cross-Modal Retrieval: Retrieve 3D assets using text2image or image2text similarities
  • Dual Algorithms:
    • SigLip: Fast, English-only, per-image embeddings. Medium Retrieval Quality(WIP).
    • Qwen3-VL-Embedding: Bilingual, multi-image embeddings. High Retrieval Quality(Recommended).
  • Vector Database: PostgreSQL with pgvector for efficient similarity search
  • Web Interface: Beautiful Gradio UI with 3D model viewer
  • REST API: FastAPI backend for programmatic access

2. Architecture

┌─────────────────┐
│ English Captions│
└────────┬────────┘
         │
         ├──► Translation ──► Chinese Captions
         │
         ├──► SigLip Embeddings ──┐
         │    - Text (EN)          │
         │    - Images (per-view)  │
         │                         │
         └──► Qwen Embeddings ─────┤
              - Text (EN + CN)     │
              - Images (multi)     │
                                   │
                                   ▼
                         ┌──────────────────┐
                         │ PostgreSQL       │
                         │ + pgvector       │
                         └────────┬─────────┘
                                  │
                         ┌────────▼─────────┐
                         │  FastAPI Backend │
                         └────────┬─────────┘
                                  │
                         ┌────────▼─────────┐
                         │  Gradio Frontend │
                         └──────────────────┘

3. Get Started

Prerequisites

  • Python 3.8+
  • PostgreSQL 12+ with pgvector extension
  • NVIDIA GPU (recommended for SigLip)
  • DashScope API key (for Qwen)

Installation

  1. Clone the repository
git clone https://github.com/3DSceneAgent/AssetRetrieval3D
  1. Install dependencies:

    # optionally create new env 
    # conda create -n asset_retrieval python=3.10
    # conda activate asset_retrieval
    pip install -r requirements.txt
  2. Set up environment variables:

    cp .env.example .env 
    # setup variables
    export DASHSCOPE_API_KEY="your-api-key"
    export DB_HOST="localhost"
    export DB_PORT="5432"
    export DB_USER="postgres"
    export DB_PASSWORD="your-password"
  3. Configure PostgreSQL:

    • Install PostgreSQL
    • Install pgvector extension:
      CREATE EXTENSION vector;

4. Usage

Step 0: Configure Settings

Edit config.py to adjust settings:

  • MAX_ASSETS: Set to a small number (e.g., 100) for testing
  • Database credentials
  • API keys
  • Paths

Step 1: Translate Captions (Optional but Recommended)

Translate English captions to Chinese:

python scripts/01_translate_captions.py

This creates data/text_captions_cap3d_cn.json.

Note: This uses the Qwen batch API and may take hours for the full dataset.

Step 2: Generate Embeddings

(Optional) SigLip Embeddings

python scripts/02_embed_siglip.py

This generates:

  • Text embeddings (English only)
  • Image embeddings (one per viewpoint)

(Optional) Qwen Embeddings

python scripts/03_embed_qwen.py

This generates:

  • Text embeddings (English and Chinese)
  • Multi-image embeddings (8 images per asset)

Note: Can run both in parallel if you have resources.

Step 3: Populate Database

python scripts/04_populate_database.py

This:

  • Creates two databases: siglip_embeddings and qwen_embeddings
  • Creates tables with pgvector columns
  • Inserts all embeddings
  • Creates vector indexes

Step 4: Start Backend API

python backend/app.py

Or with uvicorn:

uvicorn backend.app:app --host 0.0.0.0 --port 8000

The API will be available at http://localhost:8000

Step 5: Start Frontend

In a new terminal:

python frontend/gradio_app.py

The UI will be available at http://localhost:7860

Step 6: Test Backend with Client Script

A comprehensive test client is provided to verify the backend service:

# Run all tests with default settings
python test_client.py

# Test only SigLip algorithm
python test_client.py --algorithm siglip

# Test only image search
python test_client.py --test-type image

# Run with verbose output (shows all results)
python test_client.py --verbose --num-tests 1

# Test against remote backend
python test_client.py --backend-url http://remote-server:8000

The test client will:

  • Check backend health
  • Test text search (English and Chinese for Qwen)
  • Test image search using real assets from the database
  • Test both uni-modal and cross-modal search
  • Display results with similarity scores and captions /search/text

5. Configuration Options

config.py Key Settings

  • MAX_ASSETS: Limit number of assets to process (for debugging)
  • TRANSLATION_BATCH_SIZE: Captions per translation batch (default: 1000)
  • EMBEDDING_BATCH_SIZE: Batch size for embedding generation (default: 100)
  • QWEN_NUM_IMAGES: Number of images per asset for Qwen (default: 8)
  • DEFAULT_TOP_K: Default number of search results (default: 10)

6. Algorithms

Embedding Algoirthms

Feature SigLip Qwen
Text Languages English only English + Chinese
Image Embeddings One per viewpoint Multi-image (8 views)
Speed Fast Slower (API calls)
Requires GPU Yes (local) No (API)
Cross-modal Yes Yes

Search Modes

Inner-Modal Search

  • Text → Text: Find assets with similar descriptions
  • Image → Image: Find visually similar assets

Cross-Modal Search

  • Text → Image: Find images matching text description
  • Image → Text: Find text descriptions matching image

7. File Structure

objaverse_retrieval/
├── config.py                 # Configuration
├── requirements.txt          # Dependencies
├── README.md                # This file
├── test_client.py           # Backend test client
│
├── data/                    # Data files
│   ├── text_captions_cap3d.json
│   ├── text_captions_cap3d_cn.json  (generated)
│   ├── gobjaverse/          # 3D asset images
│   └── gobjaverse_280k_index_to_objaverse.json
│
├── outputs/                 # Generated outputs
│   ├── embeddings/          # Saved embeddings
│   ├── translations/        # Translation results
│   └── batch_jsonl/         # API batch files
│
├── utils/                   # Utility modules
│   ├── data_loader.py
│   ├── image_utils.py
│   └── db_utils.py
│
├── scripts/                 # Processing scripts
│   ├── 01_translate_captions.py
│   ├── 02_embed_siglip.py
│   ├── 03_embed_qwen.py
│   └── 04_populate_database.py
│
├── backend/                 # FastAPI backend
│   ├── app.py
│   ├── embedding_service.py
│   └── vector_search.py
│
└── frontend/               # Gradio frontend
    └── gradio_app.py

License

The code and application is licensed under Apache2.0 License.

Acknowledgments

About

A simple 3D asset retrieval system based on objaverse. Query any 3D asset using text(CN/EN) or images, inter-modal or cross-modal. Equipped with backend API service and front-end gradio demo.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published