diff --git a/examples/third_party_examples/Visualizing_embeddings_in_Kangas.ipynb b/examples/third_party_examples/Visualizing_embeddings_in_Kangas.ipynb new file mode 100644 index 0000000000..bc0c377be2 --- /dev/null +++ b/examples/third_party_examples/Visualizing_embeddings_in_Kangas.ipynb @@ -0,0 +1,439 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "0wjP9mrldJsd" + }, + "source": [ + "## Visualizing the embeddings in Kangas\n", + "\n", + "In this Jupyter Notebook, we construct a Kangas DataGrid containing the data and projections of the embeddings into 2 dimensions." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4tPKQqqldJsj" + }, + "source": [ + "## What is Kangas?\n", + "\n", + "[Kangas](https://github.com/comet-ml/kangas/) is an open source, mixed-media, dataframe-like tool for data scientists. It was developed by [Comet](https://comet.com/), a company designed to help reduce the friction of moving models into production. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6sNsB2iFdJsk" + }, + "source": [ + "### 1. Setup\n", + "\n", + "To get started, we pip install kangas and import it." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "N8gi529adL-f", + "outputId": "c12e9973-a179-41e3-c5a8-f241804d99ad" + }, + "outputs": [], + "source": [ + "%pip install kangas --quiet" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "id": "htxjXThodRxD" + }, + "outputs": [], + "source": [ + "import kangas as kg" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 2. Constructing a Kangas DataGrid\n", + "\n", + "We create a Kangas DataGrid with the original data and the embeddings. The data is composed of rows of reviews, and each embedding is composed of 1536 floating-point values. 
In this example, we get the data directly from github, in case you aren't running this notebook inside OpenAI's repo.\n", + "\n", + "We use Kangas to read the CSV file into a DataGrid for further processing." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "0SxWlRTrdVJq", + "outputId": "d36c3a14-2e80-4315-e285-f39f6b008976" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loading CSV file 'fine_food_reviews_with_embeddings_1k.csv'...\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "1001it [00:00, 2412.90it/s]\n", + "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 2899.16it/s]\n" + ] + } + ], + "source": [ + "data = kg.read_csv(\"https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/data/fine_food_reviews_with_embeddings_1k.csv\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can review the fields of the CSV file:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "bzhQgoRGeMCp", + "outputId": "791c4e40-fb28-409e-d1e9-20b753fb1215" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "DataGrid (in memory)\n", + " Name : fine_food_reviews_with_embeddings_1k\n", + " Rows : 1,000\n", + " Columns: 9\n", + "# Column Non-Null Count DataGrid Type \n", + "--- -------------------- --------------- --------------------\n", + "1 Column 1 1,000 INTEGER \n", + "2 ProductId 1,000 TEXT \n", + "3 UserId 1,000 TEXT \n", + "4 Score 1,000 INTEGER \n", + "5 Summary 1,000 TEXT \n", + "6 Text 1,000 TEXT \n", + "7 combined 1,000 TEXT \n", + "8 n_tokens 1,000 INTEGER \n", + "9 embedding 1,000 TEXT \n" + ] + } + ], 
+ "source": [ + "data.info()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And get a glimpse of the first and last rows:" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 349 + }, + "id": "Q95N832aeaBr", + "outputId": "aaea2816-e5a1-4e52-f228-c3e6aca6fa3e" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
row-id | Column 1 | ProductId | UserId | Score | Summary | Text | combined | n_tokens | embedding |
---|---|---|---|---|---|---|---|---|---|
1 | 0 | B003XPF9BO | A3R7JR3FMEBXQB | 5 | where does one | Wanted to save | Title: where do | 52 | [0.007018072064 |
2 | 297 | B003VXHGPK | A21VWSCGW7UUAR | 4 | Good, but not W | Honestly, I hav | Title: Good, bu | 178 | [-0.00314055196 |
3 | 296 | B008JKTTUA | A34XBAIFT02B60 | 1 | Should advertis | First, these sh | Title: Should a | 78 | [-0.01757248118 |
4 | 295 | B000LKTTTW | A14MQ40CCU8B13 | 5 | Best tomato sou | I have a hard t | Title: Best tom | 111 | [-0.00139322795 |
5 | 294 | B001D09KAM | A34XBAIFT02B60 | 1 | Should advertis | First, these sh | Title: Should a | 78 | [-0.01757248118 |
... |
996 | 623 | B0000CFXYA | A3GS4GWPIBV0NT | 1 | Strange inflamm | Truthfully wasn | Title: Strange | 110 | [0.000110913533 |
997 | 624 | B0001BH5YM | A1BZ3HMAKK0NC | 5 | My favorite and | You've just got | Title: My favor | 80 | [-0.02086931467 |
998 | 625 | B0009ET7TC | A2FSDQY5AI6TNX | 5 | My furbabies LO | Shake the conta | Title: My furba | 47 | [-0.00974910240 |
999 | 619 | B007PA32L2 | A15FF2P7RPKH6G | 5 | got this for th | all i have hear | Title: got this | 50 | [-0.00521062919 |
1000 | 999 | B001EQ5GEO | A3VYU0VO6DYV6I | 5 | I love Maui Cof | My first experi | Title: I love M | 118 | [-0.00605782261 |
[1000 rows x 9 columns]
* Use DataGrid.save() to save to disk
** Use DataGrid.show() to start user interface