
Initial commit of vector database example with new embeddings #56


Merged · 15 commits · Feb 6, 2023

Conversation

colin-openai (Collaborator)

This PR contains a notebook running through an example of using our embeddings to embed Simple Wikipedia and then indexing and searching it in both Weaviate and Pinecone.

"metadata": {},
"outputs": [],
"source": [
"@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))\n",
Contributor

nit, but I find it a bit weird to have the decorator separated from the function def like this... up to you if you find this looks ok (:

Collaborator Author

Fair point, updated so the decorator precedes the function it refers to

Contributor

agree with Ted's comment that it seems to make more sense decorating get_embeddings(), but I might be missing something

Collaborator Author

Fair point, I put it on the other function since it calls get_embeddings, but you're right, it makes more sense this way. Updated to decorate get_embeddings()
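For reference, tenacity's `@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))` retries up to six times with capped random exponential backoff. A minimal stdlib sketch of the same behavior (the `retry_with_backoff` helper and `flaky` function are hypothetical, for illustration only):

```python
import random
import time
from functools import wraps

def retry_with_backoff(min_wait=1.0, max_wait=20.0, max_attempts=6):
    """Retry with random exponential backoff, roughly mirroring
    tenacity's wait_random_exponential + stop_after_attempt."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: propagate the error
                    # sleep a random time in [0, min(max_wait, min_wait * 2**attempt)]
                    time.sleep(random.uniform(0, min(max_wait, min_wait * 2 ** attempt)))
        return wrapper
    return decorator

@retry_with_backoff(min_wait=0.01, max_wait=0.05, max_attempts=3)
def flaky(counter={"n": 0}):
    # Fails twice, then succeeds, to exercise the retry loop
    counter["n"] += 1
    if counter["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"
```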

" # Embed the corpus\n",
" with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:\n",
" futures = [\n",
" executor.submit(get_embeddings, text_batch)\n",
Contributor

what happens if get_embeddings fails after 6 retries?

Collaborator Author

I've wrapped it in a try/except block so the function will return the exception if get_embeddings fails

Contributor

I mostly wasn't sure if the Exception would be raised, or if the thread would just fail silently. It seems it would be raised when result() is called. If that's the case you can just let the Exception propagate naturally if you think that makes the code cleaner (without the need for the try/except)

Collaborator Author

Agree, I've removed the try/except block to make it cleaner
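The behavior settled on here, that an exception raised inside a worker is stored on the Future and only re-raised when `result()` is called, can be sketched as follows (`get_embeddings_stub` is a hypothetical stand-in for the notebook's `get_embeddings`):

```python
import concurrent.futures

def get_embeddings_stub(batch):
    # Stand-in for the notebook's get_embeddings; fails on an empty batch
    if not batch:
        raise ValueError("empty batch")
    return [len(text) for text in batch]

batches = [["hello", "world"], []]
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(get_embeddings_stub, b) for b in batches]

results, errors = [], []
for f in futures:
    try:
        results.append(f.result())  # re-raises any exception from the worker
    except ValueError as exc:
        errors.append(exc)
```

The thread does not fail silently: without the try/except, the second `result()` call would propagate the `ValueError` to the caller.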

" \n",
" def splits_num(self, elements: int) -> int:\n",
" \"\"\" Determines how many chunks DataFrame contains. \"\"\"\n",
" return round(elements / self.batch_size)\n",
Contributor

should this be ceil()? might end up with a batch larger than batch_size with np.array_split(). (which could be fine, just want to make sure this was considered)

Collaborator Author

This still works if there is a batch larger than batch size, but happy to put ceil in there as well for a cleaner solution if needed
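The difference under discussion: `round()` can undercount chunks, so `np.array_split` returns a batch larger than `batch_size`, while `math.ceil()` guarantees every chunk stays within it. A small sketch:

```python
import math
import numpy as np

elements, batch_size = 1040, 1000
data = np.arange(elements)

# round(1.04) == 1, so one chunk of 1040 rows, larger than batch_size
round_chunks = np.array_split(data, round(elements / batch_size))

# ceil(1.04) == 2, so two chunks of 520 rows, both within batch_size
ceil_chunks = np.array_split(data, math.ceil(elements / batch_size))
```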

"# Upsert content vectors in content namespace\n",
"print(\"Uploading vectors to content namespace..\")\n",
"for batch_df in df_batcher(article_df):\n",
" index.upsert(vectors=zip(batch_df.vector_id, batch_df.content_vector), namespace='content')"
Contributor

using a thread pool here can also speed up the upserting, if you're dealing with a large number of vectors in one go. but it also makes sense to leave this out if you don't want to overcomplicate the demo

Collaborator Author

I've left it out to keep things fairly simple, but I've added notes in three places flagging that it's a good approach for improving upsert performance

Contributor

yep, there is some demo code in pinecone docs as well, if you want to just refer to that (maybe you already did)

Collaborator Author

I've added a direct reference to their docs for parallel insertion in the Pinecone intro cell
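The parallel-upsert approach referenced above can be sketched with a thread pool. The `parallel_upsert` and `_FakeIndex` names are hypothetical, for illustration, not the notebook's or Pinecone's API; a real run would pass a Pinecone index object instead of the in-memory stand-in:

```python
import concurrent.futures

def parallel_upsert(index, batches, namespace, max_workers=4):
    """Submit each batch's upsert to a thread pool and wait on all results."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(index.upsert, vectors=batch, namespace=namespace)
            for batch in batches
        ]
        # result() re-raises any upsert error raised in a worker thread
        return [f.result() for f in futures]

# Tiny in-memory stand-in so the sketch runs without a Pinecone account
class _FakeIndex:
    def __init__(self):
        self.upserted = []
    def upsert(self, vectors, namespace):
        self.upserted.extend((vid, vec, namespace) for vid, vec in vectors)
        return len(vectors)

index = _FakeIndex()
counts = parallel_upsert(
    index,
    batches=[[("id-1", [0.1, 0.2])], [("id-2", [0.3, 0.4])]],
    namespace="content",
    max_workers=2,
)
```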

@ted-at-openai (Collaborator) left a comment

Looks good! I left a bunch of style comments. Don't feel obligated to change them all if you disagree or have more pressing things to do.

"metadata": {},
"source": [
"# Vector Database Introduction\n",
"\n",
Collaborator

Suggestion: Start off with an explanation of why someone might want a vector database and what a vector database is.

Collaborator Author

Added now, used a few sources to write this up but feedback would be welcome

"id": "cb1537e6",
"metadata": {},
"source": [
"# Vector Database Introduction\n",
Collaborator

Suggestion: rename to be more specific. E.g., "How to use vector databases for embeddings search". I think "introduction" is more ambiguous.

Collaborator Author

Renamed to be more descriptive and also changed the name of the notebook itself

" - *Index Data*: We'll create an index with __title__ search vectors in it\n",
" - *Search Data*: We'll run a few searches to confirm it works\n",
"\n",
"Once you've run through this notebook you should have a basic understanding of how to setup and use vector databases, and can move on to more complex use cases making use of our embeddings"
Collaborator

missing period

Collaborator Author

Added

"source": [
"## Setup\n",
"\n",
"Here we import the required libraries and set the embedding model that we'd like to use"
Collaborator

I'd add periods to all these sentences.

Collaborator Author

Added

"source": [
"## Setup\n",
"\n",
"Here we import the required libraries and set the embedding model that we'd like to use"
Collaborator

Can potentially rephrase some of these to be more concise. E.g., "Import the required libraries and select the embedding model." (The words "here" and "we" feel potentially superfluous. Imperative tone is usually concise.)

Collaborator Author

I've gone through all of these and tried to simplify - further feedback welcome if still too wordy

"source": [
"### Create Index\n",
"\n",
"First we need to create an index, which we'll call `wikipedia-articles`. Once we have an index, we can create multiple namespaces, which can make a single index searchable for various use cases. For more details, consult [this article](https://docs.pinecone.io/docs/namespaces#:~:text=Pinecone%20allows%20you%20to%20partition,different%20subsets%20of%20your%20index.)."
Collaborator

Suggestion: rather than 'this article' name it 'Pinecone documentation' so that people have a clue as to what it is before they click it

Collaborator Author

Updated and done the same for Weaviate

" top_k=top_k)\n",
"\n",
" # Print query results \n",
" print(f'\\nMost similar results querying {query} in \"{namespace}\" namespace:\\n')\n",
Collaborator

would wrap the {query} in quotation marks too

Collaborator

maybe rephrase to 'Most similar results to "{query}" in "{namespace}" namespace:\n'. Current phrasing feels a bit harder to parse with the querying verb thrown in.

Collaborator Author

Updated

" counter = 0\n",
" for k,v in df.iterrows():\n",
" counter += 1\n",
" print(f'Result {counter} with a score of {v.score} is {v.title}')\n",
Collaborator

Starting each line with Result #1 #2 etc feels a bit low-information density. Could also do something like "Battle of Bannockburn (score = 0.869)" so that the results are immediately readable without needing to read all the way across the line. Your call.

Collaborator Author

Updated and made consistent across the two vector DBs
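The formatting Ted suggested, title first with the score in parentheses, might look like this (plain `(title, score)` pairs stand in for the notebook's DataFrame rows):

```python
# Hypothetical query results as (title, score) pairs
results = [
    ("Battle of Bannockburn", 0.869),
    ("First War of Scottish Independence", 0.861),
]

# Put the title first so results are readable at a glance
lines = [f"{title} (score = {score:.3f})" for title, score in results]
for line in lines:
    print(line)
```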

"source": [
"class_obj = {\n",
" \"class\": \"Article\",\n",
" \"vectorizer\": \"none\", # explicitly tell Weaviate not to vectorize anything, we are providing the vectors ourselves through our BERT model\n",
Collaborator

We're not using BERT! :D

Collaborator Author

Copy/paste error! Removed and reworded

"id": "ad74202e",
"metadata": {},
"source": [
"Thanks for following along, you're now equipped to set up your own vector databases and use embeddings to do all kinds of cool things - enjoy! For more complex use cases please continue to work through other cookbook examples in this repo"
Collaborator

missing period

Collaborator Author

Added

@kacperlukawski (Contributor) commented Jan 17, 2023

Hi, would you mind adding Qdrant as another option? I can provide a working example similar to the created ones.

Qdrant (https://github.com/qdrant/qdrant) is a high-performance vector search database written in Rust, and the fastest open-source solution available at the moment according to benchmarks: https://qdrant.tech/benchmarks/

@ted-at-openai (Collaborator)

Couple more suggestions:

  • Add installation instructions for pinecone, weaviate, and qdrant_client, as most people won't have them. I think it's especially valuable here since the package names and import names aren't the same. Looks like pinecone-client and weaviate-client and qdrant-client, vs pinecone, weaviate, and qdrant_client.
  • Add qdrant to the table of contents/outline at the top
  • After running pip install --upgrade pinecone-client I immediately run into an error when importing it. Not sure why, but I want to figure it out before we merge

@ted-at-openai (Collaborator)

^on the error I'm hitting, I've emailed pinecone support and will wait to hear back.

" - *Index Data*: We'll create an index with namespaces for __titles__ and __content__\n",
" - *Search Data*: We'll test out both namespaces with search queries to confirm it works\n",
"- **Weaviate**\n",
" - *Setup*: Here we setup the Python client for Weaviate. For more details go [here](https://weaviate.io/developers/weaviate/current/client-libraries/python.html)\n",
Collaborator

"setup" is the noun, "set up" is the verb. These should say "set up".

Collaborator Author

Updated

"source": [
"## Weaviate\n",
"\n",
"The other vector database option we'll explore here is **Weaviate**, which offers both a managed, SaaS option like Pinecone, as well as a self-hosted option. As we've already looked at a cloud vector database, we'll try the self-hosted option here.\n",
@ted-at-openai (Collaborator), Jan 26, 2023

One thing I noticed is that some of the notebook is in future tense (e.g., "we will do X"), some is in present tense (e.g., "we set up Y"), and some is in past tense ("we used Docker"). Could be worth another pass to make them all consistent.

Collaborator Author

Updated to align with future tense

@ted-at-openai (Collaborator) commented Jan 26, 2023

One last suggestion: I think it would be helpful if you precomputed the embeddings, stored them on our CDN, and let people download them, so that they don't have to pay $5 each time they run the example. This is what we've done in some of the other examples. Feel free to pick any file format you like. DM me to discuss how to upload to our CDN and what URL we'll want to give it.

Then, if everything runs on your end, we can merge. Thanks again for all the work on this!

@colin-openai (Collaborator, Author)

@ted-at-openai these are now resolved

@ted-at-openai ted-at-openai merged commit 5ef1523 into main Feb 6, 2023
syusuke9999 pushed a commit to syusuke9999/openai-cookbook that referenced this pull request May 12, 2023
Initial commit of vector database example with new embeddings
katia-openai pushed a commit that referenced this pull request Feb 29, 2024
Initial commit of vector database example with new embeddings