
Initial commit of vector database example with new embeddings #56


Merged · 15 commits · Feb 6, 2023

Conversation

colin-openai (Collaborator)

This PR contains a notebook running through an example of using our embeddings to embed Simple Wikipedia and then indexing and searching it in both Weaviate and Pinecone.

"metadata": {},
"outputs": [],
"source": [
"@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))\n",
Contributor

nit, but I find it a bit weird to have the decorator separated from the function def like this... up to you if you find this looks ok (:

Collaborator Author

Fair point, updated so the decorator precedes the function it refers to

Contributor

agree with Ted's comment that it seems to make more sense decorating get_embeddings(), but I might be missing something

Collaborator Author

Fair point, I put it on the other function since it calls get_embeddings, but you're right, it makes more sense this way. Updated to decorate get_embeddings()
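For reference, tenacity's `@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))` retries up to six times with capped random exponential backoff. A minimal stdlib sketch of the same behavior (the `retry_with_backoff` helper and `flaky` function are hypothetical, for illustration only):

```python
import random
import time
from functools import wraps

def retry_with_backoff(min_wait=1.0, max_wait=20.0, max_attempts=6):
    """Retry with random exponential backoff, roughly mirroring
    tenacity's wait_random_exponential + stop_after_attempt."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: propagate the error
                    # sleep a random time in [0, min(max_wait, min_wait * 2**attempt)]
                    time.sleep(random.uniform(0, min(max_wait, min_wait * 2 ** attempt)))
        return wrapper
    return decorator

@retry_with_backoff(min_wait=0.01, max_wait=0.05, max_attempts=3)
def flaky(counter={"n": 0}):
    # Fails twice, then succeeds, to exercise the retry loop
    counter["n"] += 1
    if counter["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"
```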

" # Embed the corpus\n",
" with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:\n",
" futures = [\n",
" executor.submit(get_embeddings, text_batch)\n",
Contributor

what happens if get_embeddings fails after 6 retries?

Collaborator Author

I've wrapped it in a try/except block so the function will return the exception if get_embeddings fails

Contributor

I mostly wasn't sure if the Exception would be raised, or if the thread would just fail silently. It seems it would be raised when result() is called. If that's the case you can just let the Exception propagate naturally if you think that makes the code cleaner (without the need for the try/except)

Collaborator Author

Agree, I've removed the try/except block to make it cleaner
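The behavior settled on here, that an exception raised inside a worker is stored on the Future and only re-raised when `result()` is called, can be sketched as follows (`get_embeddings_stub` is a hypothetical stand-in for the notebook's `get_embeddings`):

```python
import concurrent.futures

def get_embeddings_stub(batch):
    # Stand-in for the notebook's get_embeddings; fails on an empty batch
    if not batch:
        raise ValueError("empty batch")
    return [len(text) for text in batch]

batches = [["hello", "world"], []]
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(get_embeddings_stub, b) for b in batches]

results, errors = [], []
for f in futures:
    try:
        results.append(f.result())  # re-raises any exception from the worker
    except ValueError as exc:
        errors.append(exc)
```

The thread does not fail silently: without the try/except, the second `result()` call would propagate the `ValueError` to the caller.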

" \n",
" def splits_num(self, elements: int) -> int:\n",
" \"\"\" Determines how many chunks DataFrame contains. \"\"\"\n",
" return round(elements / self.batch_size)\n",
Contributor

should this be ceil()? might end up with a batch larger than batch_size with np.array_split(). (which could be fine, just want to make sure this was considered)

Collaborator Author

This still works if there is a batch larger than batch size, but happy to put ceil in there as well for a cleaner solution if needed
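The difference under discussion: `round()` can undercount chunks, so `np.array_split` returns a batch larger than `batch_size`, while `math.ceil()` guarantees every chunk stays within it. A small sketch:

```python
import math
import numpy as np

elements, batch_size = 1040, 1000
data = np.arange(elements)

# round(1.04) == 1, so one chunk of 1040 rows, larger than batch_size
round_chunks = np.array_split(data, round(elements / batch_size))

# ceil(1.04) == 2, so two chunks of 520 rows, both within batch_size
ceil_chunks = np.array_split(data, math.ceil(elements / batch_size))
```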

"# Upsert content vectors in content namespace\n",
"print(\"Uploading vectors to content namespace..\")\n",
"for batch_df in df_batcher(article_df):\n",
" index.upsert(vectors=zip(batch_df.vector_id, batch_df.content_vector), namespace='content')"
Contributor

using a thread pool here can also speed up the upserting, if you're dealing with a large number of vectors in one go. but it also makes sense to leave this out if you don't want to overcomplicate the demo

Collaborator Author

I've left it out to keep things fairly simple, but I've added notes in three places flagging that it's a good approach for improving upsert performance

Contributor

yep, there is some demo code in pinecone docs as well, if you want to just refer to that (maybe you already did)

Collaborator Author

I've added a direct reference to their docs for parallel insertion in the Pinecone intro cell
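The parallel-upsert approach referenced above can be sketched with a thread pool. The `parallel_upsert` and `_FakeIndex` names are hypothetical, for illustration, not the notebook's or Pinecone's API; a real run would pass a Pinecone index object instead of the in-memory stand-in:

```python
import concurrent.futures

def parallel_upsert(index, batches, namespace, max_workers=4):
    """Submit each batch's upsert to a thread pool and wait on all results."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(index.upsert, vectors=batch, namespace=namespace)
            for batch in batches
        ]
        # result() re-raises any upsert error raised in a worker thread
        return [f.result() for f in futures]

# Tiny in-memory stand-in so the sketch runs without a Pinecone account
class _FakeIndex:
    def __init__(self):
        self.upserted = []
    def upsert(self, vectors, namespace):
        self.upserted.extend((vid, vec, namespace) for vid, vec in vectors)
        return len(vectors)

index = _FakeIndex()
counts = parallel_upsert(
    index,
    batches=[[("id-1", [0.1, 0.2])], [("id-2", [0.3, 0.4])]],
    namespace="content",
    max_workers=2,
)
```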

@ted-at-openai (Collaborator) left a comment

Looks good! I left a bunch of style comments. Don't feel obligated to change them all if you disagree or have more pressing things to do.

"metadata": {},
"source": [
"# Vector Database Introduction\n",
"\n",
Collaborator

Suggestion: Start off with an explanation of why someone might want a vector database and what a vector database is.

Collaborator Author

Added now, used a few sources to write this up but feedback would be welcome

"id": "cb1537e6",
"metadata": {},
"source": [
"# Vector Database Introduction\n",
Collaborator

Suggestion: rename to be more specific. E.g., "How to use vector databases for embeddings search". I think "introduction" is more ambiguous.

Collaborator Author

Renamed to be more descriptive and also changed the name of the notebook itself

" - *Index Data*: We'll create an index with __title__ search vectors in it\n",
" - *Search Data*: We'll run a few searches to confirm it works\n",
"\n",
"Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings"
Collaborator

missing period

Collaborator Author

Added

"source": [
"## Setup\n",
"\n",
"Here we import the required libraries and set the embedding model that we'd like to use"
Collaborator

I'd add periods to all these sentences.

Collaborator Author

Added

"source": [
"## Setup\n",
"\n",
"Here we import the required libraries and set the embedding model that we'd like to use"
Collaborator

Can potentially rephrase some of these to be more concise. E.g., "Import the required libraries and select the embedding model." (The words "here" and "we" feel potentially superfluous. Imperative tone is usually concise.)

Collaborator Author

I've gone through all of these and tried to simplify - further feedback welcome if still too wordy

"source": [
"### Create Index\n",
"\n",
"First we need to create an index, which we'll call `wikipedia-articles`. Once we have an index, we can create multiple namespaces, which can make a single index searchable for various use cases. For more details, consult [this article](https://docs.pinecone.io/docs/namespaces#:~:text=Pinecone%20allows%20you%20to%20partition,different%20subsets%20of%20your%20index.)."
Collaborator

Suggestion: rather than 'this article' name it 'Pinecone documentation' so that people have a clue as to what it is before they click it

Collaborator Author

Updated and done the same for Weaviate

" top_k=top_k)\n",
"\n",
" # Print query results \n",
" print(f'\\nMost similar results querying {query} in \"{namespace}\" namespace:\\n')\n",
Collaborator

would wrap the {query} in quotation marks too

Collaborator

maybe rephrase to 'Most similar results to "{query}" in "{namespace}" namespace:\n'. Current phrasing feels a bit harder to parse with the querying verb thrown in.

Collaborator Author

Updated

" counter = 0\n",
" for k,v in df.iterrows():\n",
" counter += 1\n",
" print(f'Result {counter} with a score of {v.score} is {v.title}')\n",
Collaborator

Starting each line with Result #1 #2 etc feels a bit low-information density. Could also do something like "Battle of Bannockburn (score = 0.869)" so that the results are immediately readable without needing to read all the way across the line. Your call.

Collaborator Author

Updated and made consistent across the two vector DBs
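The formatting Ted suggested, title first with the score in parentheses, might look like this (plain `(title, score)` pairs stand in for the notebook's DataFrame rows):

```python
# Hypothetical query results as (title, score) pairs
results = [
    ("Battle of Bannockburn", 0.869),
    ("First War of Scottish Independence", 0.861),
]

# Put the title first so results are readable at a glance
lines = [f"{title} (score = {score:.3f})" for title, score in results]
for line in lines:
    print(line)
```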

"source": [
"class_obj = {\n",
" \"class\": \"Article\",\n",
" \"vectorizer\": \"none\", # explicitly tell Weaviate not to vectorize anything, we are providing the vectors ourselves through our BERT model\n",
Collaborator

We're not using BERT! :D

Collaborator Author

Copy/paste error! Removed and reworded

"id": "ad74202e",
"metadata": {},
"source": [
"Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo"
Collaborator

missing period

Collaborator Author

Added

@kacperlukawski (Contributor) commented Jan 17, 2023

Hi, would you mind adding Qdrant as another option? I can provide a working example similar to the created ones.

Qdrant (https://github.com/qdrant/qdrant) is a high-performance vector search database written in Rust, and the fastest open-source solution available at the moment according to benchmarks: https://qdrant.tech/benchmarks/

@ted-at-openai (Collaborator)

Couple more suggestions:

  • Add installation instructions for pinecone, weaviate, and qdrant_client, as most people won't have them. I think it's especially valuable here since the package names and import names aren't the same. Looks like pinecone-client and weaviate-client and qdrant-client, vs pinecone, weaviate, and qdrant_client.
  • Add qdrant to the table of contents/outline at the top
  • After running pip install --upgrade pinecone-client I immediately run into an error when importing it. Not sure why, but I want to figure it out before we merge

@ted-at-openai (Collaborator)

^on the error I'm hitting, I've emailed pinecone support and will wait to hear back.

" - *Index Data*: We'll create an index with namespaces for __titles__ and __content__\n",
" - *Search Data*: We'll test out both namespaces with search queries to confirm it works\n",
"- **Weaviate**\n",
" - *Setup*: Here we setup the Python client for Weaviate. For more details go [here](https://weaviate.io/developers/weaviate/current/client-libraries/python.html)\n",
Collaborator

"setup" is the noun, "set up" is the verb. These should say "set up".

Collaborator Author

Updated

"source": [
"## Weaviate\n",
"\n",
"The other vector database option we'll explore here is **Weaviate**, which offers both a managed, SaaS option like Pinecone, as well as a self-hosted option. As we've already looked at a cloud vector database, we'll try the self-hosted option here.\n",
@ted-at-openai (Collaborator), Jan 26, 2023

One thing I noticed is that some of the notebook is in future tense (e.g., "we will do X"), some is in present tense (e.g., "we set up Y"), and some is in past tense ("we used Docker"). Could be worth another pass to make them all consistent.

Collaborator Author

Updated to align with future tense

@ted-at-openai (Collaborator) commented Jan 26, 2023

One last suggestion: I think it would be helpful if you precomputed the embeddings, stored them on our CDN, and let people download them, so that they don't have to pay $5 each time they run the example. This is what we've done in some of the other examples. Feel free to pick any file format you like. DM me to discuss how to upload to our CDN and what URL we'll want to give it.

Then, if everything runs on your end, we can merge. Thanks again for all the work on this!

@colin-openai (Collaborator, Author)

@ted-at-openai these are now resolved

@ted-at-openai ted-at-openai merged commit 5ef1523 into main Feb 6, 2023
syusuke9999 pushed a commit to syusuke9999/openai-cookbook that referenced this pull request May 12, 2023
Initial commit of vector database example with new embeddings
katia-openai pushed a commit that referenced this pull request Feb 29, 2024
Initial commit of vector database example with new embeddings