Initial commit of vector database example with new embeddings #56
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))\n", |
nit, but I find it a bit weird to have the decorator separated from the function def like this... up to you if you find this looks ok (:
Fair point, updated to precede the function that it is referring to
agree with Ted's comment that it seems to make more sense decorating get_embeddings(), but I might be missing something
Fair point. I had put it on the other function because that one calls get_embeddings, but you are right, it makes more sense this way. Updated to decorate get_embeddings()
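A minimal sketch of the arrangement settled on in this thread, with the retry decorator sitting directly on the function that makes the request. The notebook uses tenacity's `@retry`; the `retry_with_backoff` helper below is a hypothetical standard-library stand-in for it, and `get_embeddings` here is a stub that fails twice rather than calling a real API.

```python
import functools
import random
import time


def retry_with_backoff(max_attempts=6, min_wait=1.0, max_wait=20.0):
    """Hypothetical stand-in for tenacity's @retry: randomized exponential backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: let the last error propagate
                    # Wait a random time below an exponentially growing cap.
                    cap = min(max_wait, min_wait * 2 ** (attempt - 1))
                    time.sleep(random.uniform(0, cap))
        return wrapper
    return decorator


calls = {"count": 0}


# The decorator directly precedes the function that performs the flaky call.
@retry_with_backoff(max_attempts=6, min_wait=0.001, max_wait=0.01)
def get_embeddings(texts):
    """Stub: fails twice, then returns one fake vector per input text."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return [[float(len(t))] for t in texts]


vectors = get_embeddings(["hello", "hi"])
print(vectors, "after", calls["count"], "calls")
```

Decorating the function that makes the request means every caller gets the retry behavior, including calls submitted to a thread pool.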
" # Embed the corpus\n", | ||
" with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:\n", | ||
" futures = [\n", | ||
" executor.submit(get_embeddings, text_batch)\n", |
what happens if get_embeddings fails after 6 retries?
I've wrapped it in a try/except block so the function will return the exception if get_embeddings fails
I mostly wasn't sure if the Exception would be raised, or if the thread would just fail silently. It seems it would be raised when result() is called. If that's the case you can just let the Exception propagate naturally if you think that makes the code cleaner (without the need for the try/except)
Agree, I've removed the try/except block to make it cleaner
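A small sketch of the behavior discussed in this thread: an exception raised inside a worker thread is stored on the future and re-raised when `result()` is called. The `get_embeddings` stub and the batch contents are made up for illustration. The `try`/`except` is there only to make the re-raise visible; without it the exception would propagate to the caller, which is the version the thread settled on.

```python
import concurrent.futures


def get_embeddings(text_batch):
    """Stub for the embedding call: raises for one batch to simulate exhausted retries."""
    if "bad" in text_batch:
        raise RuntimeError("embedding request failed after retries")
    return [len(text) for text in text_batch]


batches = [["a", "bb"], ["bad"], ["ccc"]]
results, errors = [], []

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(get_embeddings, batch) for batch in batches]
    for future in futures:
        try:
            # result() re-raises any exception raised inside the worker thread.
            results.append(future.result())
        except RuntimeError as exc:
            errors.append(str(exc))

print(results)
print(errors)
```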
" \n", | ||
" def splits_num(self, elements: int) -> int:\n", | ||
" \"\"\" Determines how many chunks DataFrame contians. \"\"\"\n", | ||
" return round(elements / self.batch_size)\n", |
should this be ceil()? might end up with a batch larger than batch_size with np.array_split(). (which could be fine, just want to make sure this was considered)
This still works if there is a batch larger than batch size, but happy to put ceil in there as well for a cleaner solution if needed
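To make the `round()` versus `ceil()` point concrete, here is a short sketch. The `split_sizes` helper is a hypothetical standard-library stand-in that mirrors the chunk sizes `np.array_split` documents for a 1-D input (the first `n % sections` chunks get one extra element); the row count and batch size are illustrative values, not taken from the notebook.

```python
import math


def split_sizes(n_rows, sections):
    """Chunk sizes in the style of np.array_split: n_rows % sections chunks get one extra row."""
    base, extra = divmod(n_rows, sections)
    return [base + 1] * extra + [base] * (sections - extra)


batch_size = 100
n_rows = 130

rounded = round(n_rows / batch_size)      # 1 section
ceiled = math.ceil(n_rows / batch_size)   # 2 sections

print(split_sizes(n_rows, rounded))  # a single chunk larger than batch_size
print(split_sizes(n_rows, ceiled))   # every chunk fits within batch_size
```

With `round()` the section count can also come out as 0 when there are fewer than half a batch of rows, which leaves nothing to split into; `ceil()` avoids that case as well.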
"# Upsert content vectors in content namespace\n", | ||
"print(\"Uploading vectors to content namespace..\")\n", | ||
"for batch_df in df_batcher(article_df):\n", | ||
" index.upsert(vectors=zip(batch_df.vector_id, batch_df.content_vector), namespace='content')" |
using a thread pool here can also speed up the upserting, if you're dealing with a large number of vectors in one go. But it also makes sense to leave this out if you don't want to overcomplicate the demo
I've left it out to keep it fairly simple, but I've added notes in three places to let readers know it is a good approach to improve upsert performance
yep, there is some demo code in pinecone docs as well, if you want to just refer to that (maybe you already did)
I've added a direct reference to their docs for parallel insertion in the Pinecone intro cell
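For readers curious what the thread-pool suggestion looks like in outline, a rough sketch follows. It does not use the Pinecone client: `upsert_batch` is a hypothetical stand-in that writes into an in-memory dict, and the vectors and batch size are made up. The Pinecone documentation referenced above remains the place to look for the real parallel upsert pattern.

```python
import concurrent.futures
import threading

store = {}
store_lock = threading.Lock()


def upsert_batch(batch):
    """Hypothetical stand-in for an index upsert call; returns how many vectors it wrote."""
    with store_lock:
        store.update(batch)
    return len(batch)


# Fake (vector_id, vector) pairs, grouped into batches of 4.
vectors = [(f"id-{i}", [float(i)]) for i in range(10)]
batches = [vectors[i:i + 4] for i in range(0, len(vectors), 4)]

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # map() runs the batches concurrently and yields results in submission order.
    upserted = sum(executor.map(upsert_batch, batches))

print(upserted, len(store))
```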
Looks good! I left a bunch of style comments. Don't feel obligated to change them all if you disagree or have more pressing things to do.
"metadata": {}, | ||
"source": [ | ||
"# Vector Database Introduction\n", | ||
"\n", |
Suggestion: Start off with an explanation of why someone might want a vector database and what a vector database is.
Added now, used a few sources to write this up but feedback would be welcome
"id": "cb1537e6", | ||
"metadata": {}, | ||
"source": [ | ||
"# Vector Database Introduction\n", |
Suggestion: rename to be more specific. E.g., "How to use vector databases for embeddings search". I think "introduction" is more ambiguous.
Renamed to be more descriptive and also changed the name of the notebook itself
" - *Index Data*: We'll create an index with __title__ search vectors in it\n", | ||
" - *Search Data*: We'll run a few searches to confirm it works\n", | ||
"\n", | ||
"Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings" |
missing period
Added
"source": [ | ||
"## Setup\n", | ||
"\n", | ||
"Here we import the required libraries and set the embedding model that we'd like to use" |
I'd add periods to all these sentences.
Added
"source": [ | ||
"## Setup\n", | ||
"\n", | ||
"Here we import the required libraries and set the embedding model that we'd like to use" |
Can potentially rephrase some of these to be more concise. E.g., "Import the required libraries and select the embedding model." (Words "here" and "we" feel potentially superfluous. Imperative tone is usually concise.)
I've gone through all of these and tried to simplify - further feedback welcome if still too wordy
"source": [ | ||
"### Create Index\n", | ||
"\n", | ||
"First we need to create an index, which we'll call `wikipedia-articles`. Once we have an index, we can create multiple namespaces, which can make a single index searchable for various use cases. For more details, consult [this article](https://docs.pinecone.io/docs/namespaces#:~:text=Pinecone%20allows%20you%20to%20partition,different%20subsets%20of%20your%20index.)." |
Suggestion: rather than 'this article' name it 'Pinecone documentation' so that people have a clue as to what it is before they click it
Updated and done the same for Weaviate
" top_k=top_k)\n", | ||
"\n", | ||
" # Print query results \n", | ||
" print(f'\\nMost similar results querying {query} in \"{namespace}\" namespace:\\n')\n", |
would wrap the {query} in quotation marks too
maybe rephrase to 'Most similar results to "{query}" in "{namespace}" namespace:\n'. Current phrasing feels a bit harder to parse with the querying verb thrown in.
Updated
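As a quick illustration of the rephrased header suggested in this thread, with both the query and the namespace wrapped in quotation marks. The `results_header` function name and the sample query are hypothetical; only the wording of the string comes from the review comment.

```python
def results_header(query: str, namespace: str) -> str:
    """Build the header line printed before the query results."""
    return f'\nMost similar results to "{query}" in "{namespace}" namespace:\n'


print(results_header("modern art in Europe", "title"))
```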
" counter = 0\n", | ||
" for k,v in df.iterrows():\n", | ||
" counter += 1\n", | ||
" print(f'Result {counter} with a score of {v.score} is {v.title}')\n", |
Updated and made consistent across the two vector DBs
"source": [ | ||
"class_obj = {\n", | ||
" \"class\": \"Article\",\n", | ||
" \"vectorizer\": \"none\", # explicitly tell Weaviate not to vectorize anything, we are providing the vectors ourselves through our BERT model\n", |
We're not using BERT! :D
Copy/paste error! Removed and reworded
"id": "ad74202e", | ||
"metadata": {}, | ||
"source": [ | ||
"Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo" |
missing period
Added
Hi, would you mind adding Qdrant as another option? I can provide a working example similar to the existing ones. Qdrant (https://github.com/qdrant/qdrant) is a high-performance vector search database written in Rust, and the fastest open-source solution available at the moment according to the benchmarks at https://qdrant.tech/benchmarks/
Add Qdrant as another example of vector database
Couple more suggestions:
On the error I'm hitting: I've emailed Pinecone support and will wait to hear back.
" - *Index Data*: We'll create an index with namespaces for __titles__ and __content__\n", | ||
" - *Search Data*: We'll test out both namespaces with search queries to confirm it works\n", | ||
"- **Weaviate**\n", | ||
" - *Setup*: Here we setup the Python client for Weaviate. For more details go [here](https://weaviate.io/developers/weaviate/current/client-libraries/python.html)\n", |
"setup" is the noun, "set up" is the verb. These should say "set up".
Updated
"source": [ | ||
"## Weaviate\n", | ||
"\n", | ||
"The other vector database option we'll explore here is **Weaviate**, which offers both a managed, SaaS option like Pinecone, as well as a self-hosted option. As we've already looked at a cloud vector database, we'll try the self-hosted option here.\n", |
One thing I noticed is that some of the notebook is in future tense (e.g., "we will do X"), some is in present tense (e.g., "we set up Y"), and some is in past tense ("we used Docker"). Could be worth another pass to make them all consistent.
Updated to align with future tense
One last suggestion: I think it would be helpful if you precomputed the embeddings, stored them on our CDN, and let people download them, so that they don't have to pay $5 each time they run the example. This is what we've done in some of the other examples. Feel free to pick any file format you like. DM me to discuss how to upload to our CDN and what URL we'll want to give it. Then, if everything runs on your end, we can merge. Thanks again for all the work on this!
@ted-at-openai these are now resolved
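A minimal sketch of the idea behind the precomputed-embeddings suggestion above: compute once, save to a file, and reuse the file on later runs. The download-from-CDN step is left out, a local JSON file stands in for the hosted one, and `load_or_compute_embeddings` and `fake_embed` are hypothetical names; the file format is an arbitrary choice.

```python
import json
import os
import tempfile


def load_or_compute_embeddings(path, texts, embed_fn):
    """Return cached embeddings from `path` if present; otherwise compute and save them."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            return json.load(f)
    embeddings = {text: embed_fn(text) for text in texts}
    with open(path, "w", encoding="utf-8") as f:
        json.dump(embeddings, f)
    return embeddings


embed_calls = {"count": 0}


def fake_embed(text):
    """Stub for a paid embedding call; counts how often it is invoked."""
    embed_calls["count"] += 1
    return [float(len(text))]


cache_path = os.path.join(tempfile.mkdtemp(), "embeddings.json")
first = load_or_compute_embeddings(cache_path, ["cat", "horse"], fake_embed)
second = load_or_compute_embeddings(cache_path, ["cat", "horse"], fake_embed)
print(first == second, embed_calls["count"])
```

The second call reads the cached file, so the stubbed embedding function is only invoked during the first call.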
Initial commit of vector database example with new embeddings
This PR contains a notebook running through an example of using our embeddings to embed Simple Wikipedia and then indexing and searching it in both Weaviate and Pinecone.