A simple multi-modal 3D asset retrieval system. This repo uses objaverse as a simple demonstration, while it can generalize and scale to any 3D dataset.
Demo built upon objaverse, using Cap3D 650k+ english captions, 650k+ translated captions ,and 260k+ gobjaverse asset renderings
- Text Search: Retrieve 3D assets in English or Chinese
- Image Search: Retrieve 3D assets using a single RGB image
- Cross-Modal Retrieval: Retrieve 3D assets using text2image or image2text similarities
- Dual Algorithms:
- SigLip: Fast, English-only, per-image embeddings. Medium Retrieval Quality(WIP).
- Qwen3-VL-Embedding: Bilingual, multi-image embeddings. High Retrieval Quality(Recommended).
- Vector Database: PostgreSQL with pgvector for efficient similarity search
- Web Interface: Beautiful Gradio UI with 3D model viewer
- REST API: FastAPI backend for programmatic access
┌─────────────────┐
│ English Captions│
└────────┬────────┘
│
├──► Translation ──► Chinese Captions
│
├──► SigLip Embeddings ──┐
│ - Text (EN) │
│ - Images (per-view) │
│ │
└──► Qwen Embeddings ─────┤
- Text (EN + CN) │
- Images (multi) │
│
▼
┌──────────────────┐
│ PostgreSQL │
│ + pgvector │
└────────┬─────────┘
│
┌────────▼─────────┐
│ FastAPI Backend │
└────────┬─────────┘
│
┌────────▼─────────┐
│ Gradio Frontend │
└──────────────────┘
- Python 3.8+
- PostgreSQL 12+ with pgvector extension
- NVIDIA GPU (recommended for SigLip)
- DashScope API key (for Qwen)
- Clone the repository
git clone https://github.com/3DSceneAgent/AssetRetrieval3D-
Install dependencies:
# optionally create new env # conda create -n asset_retrieval python=3.10 # conda activate asset_retrieval pip install -r requirements.txt
-
Set up environment variables:
cp .env.example .env # setup variables export DASHSCOPE_API_KEY="your-api-key" export DB_HOST="localhost" export DB_PORT="5432" export DB_USER="postgres" export DB_PASSWORD="your-password"
-
Configure PostgreSQL:
- Install PostgreSQL
- Install pgvector extension:
CREATE EXTENSION vector;
Edit config.py to adjust settings:
MAX_ASSETS: Set to a small number (e.g., 100) for testing- Database credentials
- API keys
- Paths
Translate English captions to Chinese:
python scripts/01_translate_captions.pyThis creates data/text_captions_cap3d_cn.json.
Note: This uses the Qwen batch API and may take hours for the full dataset.
python scripts/02_embed_siglip.pyThis generates:
- Text embeddings (English only)
- Image embeddings (one per viewpoint)
python scripts/03_embed_qwen.pyThis generates:
- Text embeddings (English and Chinese)
- Multi-image embeddings (8 images per asset)
Note: Can run both in parallel if you have resources.
python scripts/04_populate_database.pyThis:
- Creates two databases:
siglip_embeddingsandqwen_embeddings - Creates tables with pgvector columns
- Inserts all embeddings
- Creates vector indexes
python backend/app.pyOr with uvicorn:
uvicorn backend.app:app --host 0.0.0.0 --port 8000The API will be available at http://localhost:8000
In a new terminal:
python frontend/gradio_app.pyThe UI will be available at http://localhost:7860
A comprehensive test client is provided to verify the backend service:
# Run all tests with default settings
python test_client.py
# Test only SigLip algorithm
python test_client.py --algorithm siglip
# Test only image search
python test_client.py --test-type image
# Run with verbose output (shows all results)
python test_client.py --verbose --num-tests 1
# Test against remote backend
python test_client.py --backend-url http://remote-server:8000The test client will:
- Check backend health
- Test text search (English and Chinese for Qwen)
- Test image search using real assets from the database
- Test both uni-modal and cross-modal search
- Display results with similarity scores and captions /search/text
- MAX_ASSETS: Limit number of assets to process (for debugging)
- TRANSLATION_BATCH_SIZE: Captions per translation batch (default: 1000)
- EMBEDDING_BATCH_SIZE: Batch size for embedding generation (default: 100)
- QWEN_NUM_IMAGES: Number of images per asset for Qwen (default: 8)
- DEFAULT_TOP_K: Default number of search results (default: 10)
| Feature | SigLip | Qwen |
|---|---|---|
| Text Languages | English only | English + Chinese |
| Image Embeddings | One per viewpoint | Multi-image (8 views) |
| Speed | Fast | Slower (API calls) |
| Requires GPU | Yes (local) | No (API) |
| Cross-modal | Yes | Yes |
- Text → Text: Find assets with similar descriptions
- Image → Image: Find visually similar assets
- Text → Image: Find images matching text description
- Image → Text: Find text descriptions matching image
objaverse_retrieval/
├── config.py # Configuration
├── requirements.txt # Dependencies
├── README.md # This file
├── test_client.py # Backend test client
│
├── data/ # Data files
│ ├── text_captions_cap3d.json
│ ├── text_captions_cap3d_cn.json (generated)
│ ├── gobjaverse/ # 3D asset images
│ └── gobjaverse_280k_index_to_objaverse.json
│
├── outputs/ # Generated outputs
│ ├── embeddings/ # Saved embeddings
│ ├── translations/ # Translation results
│ └── batch_jsonl/ # API batch files
│
├── utils/ # Utility modules
│ ├── data_loader.py
│ ├── image_utils.py
│ └── db_utils.py
│
├── scripts/ # Processing scripts
│ ├── 01_translate_captions.py
│ ├── 02_embed_siglip.py
│ ├── 03_embed_qwen.py
│ └── 04_populate_database.py
│
├── backend/ # FastAPI backend
│ ├── app.py
│ ├── embedding_service.py
│ └── vector_search.py
│
└── frontend/ # Gradio frontend
└── gradio_app.py
The code and application is licensed under Apache2.0 License.