Commit 838f000

Merge pull request openai#42 from openai/ted/update-embedding-model
updates embedding examples with new embedding model
2 parents 7de3d50 + fd181ec commit 838f000

12 files changed: +12,317 −12,320 lines

README.md

Lines changed: 4 additions & 4 deletions
@@ -446,11 +446,11 @@ Embeddings can be used for search either by themselves or as a feature in a larg
 The simplest way to use embeddings for search is as follows:
 
 * Before the search (precompute):
-    * Split your text corpus into chunks smaller than the token limit (e.g., ~2,000 tokens)
-    * Embed each chunk using a 'doc' model (e.g., `text-search-curie-doc-001`)
+    * Split your text corpus into chunks smaller than the token limit (e.g., <8,000 tokens)
+    * Embed each chunk
     * Store those embeddings in your own database or in a vector search provider like [Pinecone](https://www.pinecone.io) or [Weaviate](https://weaviate.io)
 * At the time of the search (live compute):
-    * Embed the search query using the corresponding 'query' model (e.g. `text-search-curie-query-001`)
+    * Embed the search query
     * Find the closest embeddings in your database
     * Return the top results, ranked by cosine similarity
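The updated steps above drop the separate doc/query models in favor of the single `text-embedding-ada-002` model. As a companion, here is a minimal sketch of that search flow; it is not part of this commit, and the `embed` helper, the toy corpus, and the query string are illustrative assumptions.

```python
# Minimal embedding-search sketch using the single text-embedding-ada-002 model
# (no separate 'doc' and 'query' models). Helper names and data are illustrative.
import numpy as np
import openai  # assumes OPENAI_API_KEY is set in the environment


def embed(text: str) -> np.ndarray:
    # Replace newlines, which can negatively affect performance
    text = text.replace("\n", " ")
    resp = openai.Embedding.create(input=[text], engine="text-embedding-ada-002")
    return np.array(resp["data"][0]["embedding"])


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Before the search (precompute): embed each chunk (< 8,000 tokens each)
corpus = ["First chunk of your corpus...", "Second chunk of your corpus..."]
corpus_embeddings = [embed(chunk) for chunk in corpus]

# At search time (live compute): embed the query, rank chunks by cosine similarity
query_embedding = embed("your search query")
ranked = sorted(
    zip(corpus, corpus_embeddings),
    key=lambda pair: cosine_similarity(query_embedding, pair[1]),
    reverse=True,
)
print(ranked[0][0])  # top result
```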

@@ -460,7 +460,7 @@ In more advanced search systems, the cosine similarity of embeddings can be
 
 #### Recommendations
 
-Recommendations are quite similar to search, except that instead of a free-form text query, the inputs are items in a set. And instead of using pairs of doc-query models, you can use a single symmetric similarity model (e.g., `text-similarity-curie-001`).
+Recommendations are quite similar to search, except that instead of a free-form text query, the inputs are items in a set.
 
 An example of how to use embeddings for recommendations is shown in [Recommendation_using_embeddings.ipynb](examples/Recommendation_using_embeddings.ipynb).
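With a single embedding model, recommendations reduce to embedding every item once and ranking each item's neighbors by cosine similarity. The sketch below is not from the commit; the item texts and the `recommend` helper are invented for illustration.

```python
# Recommendations with one embedding model: embed each item, then recommend
# its nearest neighbors by cosine similarity. Items and helper are illustrative.
import numpy as np
import openai


def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(
        input=[text.replace("\n", " ")], engine="text-embedding-ada-002"
    )
    return np.array(resp["data"][0]["embedding"])


items = [
    "A sci-fi novel about first contact",
    "A cookbook of weeknight pasta recipes",
    "A space-opera trilogy boxed set",
]
vectors = [embed(item) for item in items]


def recommend(index: int, k: int = 2) -> list[str]:
    """Return the k items most similar to items[index], excluding itself."""
    target = vectors[index]
    scores = [
        float(np.dot(target, v) / (np.linalg.norm(target) * np.linalg.norm(v)))
        for v in vectors
    ]
    order = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    return [items[i] for i in order if i != index][:k]


print(recommend(0))  # the space-opera set should rank above the cookbook
```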

examples/Classification_using_embeddings.ipynb

Lines changed: 21 additions & 21 deletions
Large diffs are not rendered by default.

examples/Clustering.ipynb

Lines changed: 33 additions & 33 deletions
Large diffs are not rendered by default.

examples/Code_search.ipynb

Lines changed: 100 additions & 98 deletions
Large diffs are not rendered by default.

examples/Get_embeddings.ipynb

Lines changed: 6 additions & 24 deletions
@@ -17,7 +17,7 @@
    {
     "data": {
      "text/plain": [
-      "12288"
+      "1536"
      ]
     },
     "execution_count": 1,
@@ -29,8 +29,8 @@
    "import openai\n",
    "\n",
    "embedding = openai.Embedding.create(\n",
-    "    input=\"Sample document text goes here\",\n",
-    "    engine=\"text-similarity-davinci-001\"\n",
+    "    input=\"Your text goes here\",\n",
+    "    engine=\"text-embedding-ada-002\"\n",
    ")[\"data\"][0][\"embedding\"]\n",
    "len(embedding)\n"
   ]
@@ -44,7 +44,7 @@
     "name": "stdout",
     "output_type": "stream",
     "text": [
-     "1024\n"
+     "1536\n"
     ]
    }
   ],
@@ -54,33 +54,15 @@
    "\n",
    "\n",
    "@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))\n",
-    "def get_embedding(text: str, engine=\"text-similarity-davinci-001\") -> list[float]:\n",
+    "def get_embedding(text: str, engine=\"text-embedding-ada-002\") -> list[float]:\n",
    "\n",
    "    # replace newlines, which can negatively affect performance.\n",
    "    text = text.replace(\"\\n\", \" \")\n",
    "\n",
    "    return openai.Embedding.create(input=[text], engine=engine)[\"data\"][0][\"embedding\"]\n",
    "\n",
    "\n",
-    "embedding = get_embedding(\"Sample query text goes here\", engine=\"text-search-ada-query-001\")\n",
-    "print(len(embedding))\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 3,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "1024\n"
-     ]
-    }
-   ],
-   "source": [
-    "embedding = get_embedding(\"Sample document text goes here\", engine=\"text-search-ada-doc-001\")\n",
+    "embedding = get_embedding(\"Your text goes here\", engine=\"text-embedding-ada-002\")\n",
    "print(len(embedding))\n"
   ]
  }
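The hunk above shows the updated helper but not its imports; a self-contained version, assuming the `tenacity` retry utilities that the `@retry` decorator implies, might look like this:

```python
# Standalone version of the retry-wrapped helper from the notebook diff above;
# the tenacity imports are implied by the @retry decorator but not shown in the hunk.
import openai
from tenacity import retry, stop_after_attempt, wait_random_exponential


@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def get_embedding(text: str, engine="text-embedding-ada-002") -> list[float]:
    # Replace newlines, which can negatively affect performance
    text = text.replace("\n", " ")
    return openai.Embedding.create(input=[text], engine=engine)["data"][0]["embedding"]


embedding = get_embedding("Your text goes here")
print(len(embedding))  # text-embedding-ada-002 returns 1536-dimensional vectors
```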

examples/Obtain_dataset.ipynb

Lines changed: 18 additions & 8 deletions
@@ -11,6 +11,14 @@
    "We will combine the review summary and review text into a single combined text. The model will encode this combined text and it will output a single vector embedding."
   ]
  },
+ {
+  "attachments": {},
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+   "To run this notebook, you will need to install: pandas, openai, transformers, plotly, matplotlib, scikit-learn, torch (transformer dep), torchvision, and scipy."
+  ]
+ },
  {
   "cell_type": "code",
   "execution_count": 1,
@@ -131,7 +139,7 @@
    "\n",
    "# remove reviews that are too long\n",
    "df['n_tokens'] = df.combined.apply(lambda x: len(tokenizer.encode(x)))\n",
-    "df = df[df.n_tokens<2000].tail(1_000)\n",
+    "df = df[df.n_tokens<8000].tail(1_000)\n",
    "len(df)"
   ]
  },
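The `tokenizer` and `df` in this hunk are defined earlier in the notebook. A standalone sketch of the token-count filter, assuming the GPT-2 tokenizer from `transformers` implied by the notebook's dependency list and a tiny stand-in DataFrame, could be:

```python
# Standalone sketch of the token-count filter; the GPT-2 tokenizer and the toy
# DataFrame are assumptions, not taken from this commit.
import pandas as pd
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

df = pd.DataFrame({"combined": ["Title: Good; Content: Tasty and quick to ship."]})
df["n_tokens"] = df.combined.apply(lambda x: len(tokenizer.encode(x)))

# keep only reviews under the new ~8,000-token limit, most recent 1,000 rows
df = df[df.n_tokens < 8000].tail(1_000)
print(len(df))
```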
@@ -148,20 +156,22 @@
   "metadata": {},
   "outputs": [],
   "source": [
+   "import openai\n",
    "from openai.embeddings_utils import get_embedding\n",
+   "# Ensure you have your API key set in your environment per the README: https://github.com/openai/openai-python#usage\n",
    "\n",
-   "# This will take just under 10 minutes\n",
-   "df['babbage_similarity'] = df.combined.apply(lambda x: get_embedding(x, engine='text-similarity-babbage-001'))\n",
-   "df['babbage_search'] = df.combined.apply(lambda x: get_embedding(x, engine='text-search-babbage-doc-001'))\n",
+   "# This will take between 5 and 10 minutes\n",
+   "df['ada_similarity'] = df.combined.apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))\n",
+   "df['ada_search'] = df.combined.apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))\n",
    "df.to_csv('data/fine_food_reviews_with_embeddings_1k.csv')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
-  "display_name": "Python 3.9.9 ('openai')",
+  "display_name": "openai-cookbook",
   "language": "python",
-  "name": "python3"
+  "name": "openai-cookbook"
  },
  "language_info": {
   "codemirror_mode": {
@@ -173,12 +183,12 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-  "version": "3.9.9"
+  "version": "3.9.6"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
-   "hash": "365536dcbde60510dc9073d6b991cd35db2d9bac356a11f5b64279a5e6708b97"
+   "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
   }
  }
 },
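Not part of this commit, but a common follow-up when consuming the CSV written above is to parse the stringified embedding column back into arrays; a brief sketch, assuming the file produced by the cell in the previous hunk:

```python
# Read the saved embeddings back; the CSV stores each embedding as a string,
# so convert it back to a numpy array before computing similarities.
import ast

import numpy as np
import pandas as pd

df = pd.read_csv("data/fine_food_reviews_with_embeddings_1k.csv", index_col=0)
df["ada_similarity"] = df.ada_similarity.apply(ast.literal_eval).apply(np.array)

print(df.ada_similarity.iloc[0].shape)  # (1536,) for text-embedding-ada-002
```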
