Ingestion Cookbook

This repository contains a collection of scripts for ingesting data into various vector databases using open-source embeddings. It serves as a cookbook with recipes (scripts) for advanced data ingestion techniques and similarity search.

About Vector Database Cloud

Vector Database Cloud is a platform that provides one-click deployment of popular vector databases, including Qdrant, Milvus, ChromaDB, and pgvector, in the cloud. Our platform provides a secure API, a comprehensive customer dashboard, efficient vector search, and real-time monitoring.

Introduction

Vector Database Cloud is designed to seamlessly integrate with your existing data workflows. Whether you're working with structured data, unstructured data, or high-dimensional vectors, you can leverage popular ETL (Extract, Transform, Load) tools to streamline the process of moving data into and out of Vector Database Cloud.

Supported Vector Databases

pgvector: PostgreSQL extension for vector similarity search
Milvus: Cloud-native vector database for similarity search
ChromaDB: Open-source embedding database
Qdrant: Vector database for next-gen AI applications

Prerequisites

Python 3.7+
Access to Vector Database Cloud with API URL and API key for each database
Basic understanding of vector databases and their applications

Installation

Clone this repository:

git clone https://github.com/VectorDBCloud/Ingestion-Cookbooks.git
cd Ingestion-Cookbooks

Install the required dependencies:
```
pip install -r requirements.txt
```
Set up your environment variables for the Vector Database Cloud API:
```
export VECTORDBCLOUD_API_URL=your_api_url
export VECTORDBCLOUD_API_KEY=your_api_key
```
Note: Specific environment variable names may vary for each database. Check the individual scripts for details.
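
Inside a script, it can help to read these variables once and fail early when one is missing. The following is a minimal sketch, assuming the two variable names shown above; the `CloudConfig` class and `load_config` function are hypothetical helpers for illustration and are not part of the repository's scripts.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CloudConfig:
    """Connection settings for one database (hypothetical helper)."""
    api_url: str
    api_key: str


def load_config(environ: Mapping[str, str] = os.environ) -> CloudConfig:
    """Read the API URL and key from the environment, failing early if unset."""
    required = ("VECTORDBCLOUD_API_URL", "VECTORDBCLOUD_API_KEY")
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return CloudConfig(
        # Strip a trailing slash so paths can be appended consistently.
        api_url=environ["VECTORDBCLOUD_API_URL"].rstrip("/"),
        api_key=environ["VECTORDBCLOUD_API_KEY"],
    )
```

Calling `load_config()` at the top of a script turns a missing variable into a single clear error instead of a failed request later on. If a script uses database-specific variable names, the `required` tuple would need to change accordingly.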

Usage

Each database directory contains a Python script demonstrating data ingestion and basic similarity search using open-source embeddings. The scripts use a Sentence Transformers model to generate embeddings.

To run a script:

python <database_name>/<script_name>.py

For example:

python pgvector/pgvector_ingestion.py

Make sure to update the connection details and customize the data according to your needs. You can modify the script parameters, input data, and embedding model to suit your specific use case.
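
To illustrate what swapping the embedding model involves, here is a minimal sketch of similarity search with a pluggable embedding function. It uses only the standard library, and `toy_embed` is a deliberately crude stand-in (a hashed bag of words), not a real embedding model and not the method used by the repository's scripts. In practice you would pass a function that wraps a Sentence Transformers model instead.

```python
import hashlib
import math
from typing import Callable, List, Sequence, Tuple


def toy_embed(text: str, dim: int = 64) -> List[float]:
    """Stand-in embedder: hashed bag of words. NOT a real embedding model."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    return vec


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(
    query: str,
    documents: Sequence[str],
    embed: Callable[[str], Sequence[float]] = toy_embed,
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Return the top_k (score, document) pairs, most similar first."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Because `search` only depends on the `embed` callable, changing the model does not touch the ranking logic. A vector database performs the same comparison on the server side, typically with an approximate index rather than this exhaustive scan.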

Vector DB Cookbook

The vector_db_cookbook.py script in the root directory demonstrates how to prepare data for multiple vector databases using a unified approach. It's a useful starting point for working with different vector database formats.

This script showcases:

Data preprocessing techniques
Embedding generation using various models
Formatting data for different vector databases
Basic similarity search implementation
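
As an illustration of the formatting step listed above, the sketch below converts one generic record shape into the input shapes the four databases generally expect. The record layout (`id`, `text`, `vector`), the function names, and the Milvus field names are assumptions made for this example; they are not taken from vector_db_cookbook.py, and your collection schema determines the real field names.

```python
from typing import Any, Dict, List, Sequence

# Hypothetical record shape: {"id": int, "text": str, "vector": [float, ...]}
Record = Dict[str, Any]


def to_qdrant_points(records: Sequence[Record]) -> List[Dict[str, Any]]:
    """Point dicts in the id/vector/payload shape used by Qdrant upserts."""
    return [
        {"id": r["id"], "vector": r["vector"], "payload": {"text": r["text"]}}
        for r in records
    ]


def to_chroma_batch(records: Sequence[Record]) -> Dict[str, List[Any]]:
    """Parallel lists matching the keyword arguments of a ChromaDB collection.add() call."""
    return {
        "ids": [str(r["id"]) for r in records],  # ChromaDB ids are strings
        "embeddings": [r["vector"] for r in records],
        "documents": [r["text"] for r in records],
    }


def to_milvus_rows(records: Sequence[Record]) -> List[Dict[str, Any]]:
    """Row dicts; the field names here are assumed and must match your schema."""
    return [{"id": r["id"], "vector": r["vector"], "text": r["text"]} for r in records]


def to_pgvector_literal(vector: Sequence[float]) -> str:
    """Text form of a pgvector value, such as '[0.5,1.0]', for use as a bound parameter."""
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"
```

Keeping the conversion in small pure functions like these makes it easy to test the formatting without a live database connection.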

To use the cookbook:

python vector_db_cookbook.py

Best Practices

Data Preparation: Ensure your data is clean and properly formatted before ingestion.
Embedding Selection: Choose appropriate embedding models for your data type and use case.
Batch Processing: For large datasets, implement batch processing to avoid memory issues.
Error Handling: Implement robust error handling and logging in your ingestion scripts.
Performance Optimization: Use bulk inserts and optimize your queries for better performance.
Regular Updates: Keep your vector database and embeddings up-to-date with your latest data.
Security: Always use secure connections and API keys when working with cloud-based vector databases.
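
The batch processing and error handling practices above can be sketched together as follows. This is a minimal illustration rather than the repository's implementation: `upload` is a hypothetical callable that sends one batch to your database, and the batch size, attempt count, and delays are arbitrary defaults.

```python
import logging
import time
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
logger = logging.getLogger("ingestion")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of at most `size` items without loading everything into memory."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def ingest(
    items: Iterable[T],
    upload: Callable[[List[T]], None],
    batch_size: int = 100,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Upload items in batches, retrying each failed batch with exponential backoff."""
    total = 0
    for number, batch in enumerate(batched(items, batch_size), start=1):
        for attempt in range(1, max_attempts + 1):
            try:
                upload(batch)
                break
            except Exception as exc:
                logger.warning("batch %d attempt %d failed: %s", number, attempt, exc)
                if attempt == max_attempts:
                    raise  # give up and surface the last error to the caller
                sleep(base_delay * 2 ** (attempt - 1))
        total += len(batch)
    return total
```

The `sleep` parameter exists so the backoff can be replaced in tests. In a real script you would probably catch only the client library's transient error types rather than every `Exception`.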

Troubleshooting

If you encounter issues:

Ensure all environment variables are correctly set.
Check your internet connection and API endpoint accessibility.
Verify that you have the correct permissions for the Vector Database Cloud services.
For specific error messages, refer to the documentation of the respective vector database or create an issue in this repository.
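
To check endpoint accessibility from the machine running the scripts, a quick TCP connection test can separate network problems from authentication or permission problems. The sketch below uses only the standard library; the function names are hypothetical, and a successful connection shows only that the host and port are reachable, not that the API key is valid.

```python
import socket
from typing import Tuple
from urllib.parse import urlsplit


def endpoint_host_port(api_url: str) -> Tuple[str, int]:
    """Extract the host and port from an http(s) URL, applying the default port."""
    parts = urlsplit(api_url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not an http(s) URL: {api_url!r}")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.hostname, port


def is_reachable(api_url: str, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the endpoint can be opened."""
    host, port = endpoint_host_port(api_url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refused connections, and timeouts
        return False
```

For example, `is_reachable(os.environ["VECTORDBCLOUD_API_URL"])` returning False points to a DNS, firewall, or connectivity issue, while a `ValueError` suggests the URL variable still holds a placeholder value.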

Contributing

We welcome contributions to improve and expand the Ingestion Cookbook! Here's how you can contribute:

Fork the repository: Create your own fork of the code.
Create a new branch: Make your changes in a new git branch.
Make your changes: Enhance existing cookbooks or add new ones.
Follow the style guidelines: Ensure your code follows our coding standards.
Write clear commit messages: Your commit messages should clearly describe the changes you've made.
Submit a pull request: Open a new pull request with your changes.
Respond to feedback: Be open to feedback and make necessary adjustments to your pull request.

For more detailed information on contributing, please refer to our Contribution Guidelines.

We also encourage you to:

Report bugs and issues through our Issue Tracker.
Suggest new features or improvements.
Help improve documentation.
Share your experiences and use cases with the community.

Remember, all contributors are expected to adhere to our Code of Conduct. We appreciate your efforts to make this project better for everyone!

Related Resources

License

This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

You are free to:

Share — copy and redistribute the material in any medium or format
Adapt — remix, transform, and build upon the material for any purpose, even commercially

Under the following terms:

Attribution — You must give appropriate credit to Vector Database Cloud, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests Vector Database Cloud endorses you or your use.

Additionally, we require that any use of this guide includes visible attribution to Vector Database Cloud. This attribution should be in the form of "Ingestion Cookbooks by Vector Database Cloud" or "Based on Vector Database Cloud Ingestion Cookbooks", along with a link to https://vectordbcloud.com, in any public-facing applications, documentation, or redistributions of this guide.

No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

For the full license text, visit: https://creativecommons.org/licenses/by/4.0/legalcode

Disclaimer

The information and resources provided in this community repository are for general informational purposes only. While we strive to keep the information up-to-date and correct, we make no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the information, products, services, or related graphics contained in this repository for any purpose. Any reliance you place on such information is therefore strictly at your own risk.

Vector Database Cloud configurations may vary, and it's essential to consult the official documentation before implementing any solutions or suggestions found in this community repository. Always follow best practices for security and performance when working with databases and cloud services.

The content in this repository may change without notice. Users are responsible for ensuring they are using the most current version of any information or code provided.

This disclaimer applies to Vector Database Cloud, its contributors, and any third parties involved in creating, producing, or delivering the content in this repository.

The use of any information or code in this repository may carry inherent risks, including but not limited to data loss, system failures, or security vulnerabilities. Users should thoroughly test and validate any implementations in a safe environment before deploying to production systems.

For complex implementations or critical systems, we strongly recommend seeking advice from qualified professionals or consulting services.

By using this repository, you acknowledge and agree to this disclaimer. If you do not agree with any part of this disclaimer, please do not use the information or resources provided in this repository.
