diff --git a/notebooks/getting_started/bq_dataframes_llm_code_generation.ipynb b/notebooks/generative_ai/bq_dataframes_llm_code_generation.ipynb
similarity index 95%
rename from notebooks/getting_started/bq_dataframes_llm_code_generation.ipynb
rename to notebooks/generative_ai/bq_dataframes_llm_code_generation.ipynb
index 39e2ef535c..2e4ce3e510 100644
--- a/notebooks/getting_started/bq_dataframes_llm_code_generation.ipynb
+++ b/notebooks/generative_ai/bq_dataframes_llm_code_generation.ipynb
@@ -34,18 +34,18 @@
 [Notebook header badge table: the "Run in Colab", "View on GitHub", and "Open in Vertex AI Workbench" links are updated to point at notebooks/generative_ai/bq_dataframes_llm_code_generation.ipynb instead of notebooks/getting_started/.]
@@ -162,6 +162,9 @@
     },
     {
      "cell_type": "markdown",
+      "metadata": {
+        "id": "Wbr2aVtFQBcg"
+      },
      "source": [
        "### Set up your Google Cloud project\n",
        "\n",
@@ -183,10 +186,7 @@
        " * Vertex AI API\n",
        "\n",
        "4. If you are running this notebook locally, install the [Cloud SDK](https://cloud.google.com/sdk)."
-      ],
-      "metadata": {
-        "id": "Wbr2aVtFQBcg"
-      }
+      ]
     },
     {
      "cell_type": "markdown",
@@ -350,39 +350,44 @@
     },
     {
      "cell_type": "markdown",
-      "source": [
-        "If you want to reset the location of the created DataFrame or Series objects, reset the session by executing `bf.reset_session()`. After that, you can reuse `bf.options.bigquery.location` to specify another location."
-      ],
      "metadata": {
        "id": "DTVtFlqeFbrU"
-      }
+      },
+      "source": [
+        "If you want to reset the location of the created DataFrame or Series objects, reset the session by executing `bf.reset_session()`. After that, you can reuse `bf.options.bigquery.location` to specify another location."
+      ]
     },
     {
      "cell_type": "markdown",
+      "metadata": {
+        "id": "6eytf4xQHzcF"
+      },
      "source": [
        "# Define the LLM model\n",
        "\n",
        "BigQuery DataFrames provides integration with the [`text-bison` model of the PaLM API](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text) via Vertex AI.\n",
        "\n",
        "This section walks through the few steps required to use the model in your notebook."
-      ],
-      "metadata": {
-        "id": "6eytf4xQHzcF"
-      }
+      ]
     },
     {
      "cell_type": "markdown",
+      "metadata": {
+        "id": "rS4VO1TGiO4G"
+      },
      "source": [
        "## Create a BigQuery Cloud resource connection\n",
        "\n",
        "You need to create a [Cloud resource connection](https://cloud.google.com/bigquery/docs/create-cloud-resource-connection) to enable BigQuery DataFrames to interact with Vertex AI services."
- ], - "metadata": { - "id": "rS4VO1TGiO4G" - } + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "KFPjDM4LVh96" + }, + "outputs": [], "source": [ "CONN_NAME = \"bqdf-llm\"\n", "\n", @@ -412,15 +417,13 @@ " f\"serviceAccount:{response.cloud_resource.service_account_id}\"\n", " )\n", "print(CONN_SERVICE_ACCOUNT)" - ], - "metadata": { - "id": "KFPjDM4LVh96" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", + "metadata": { + "id": "W6l6Ol2biU9h" + }, "source": [ "## Set permissions for the service account\n", "\n", @@ -429,52 +432,52 @@ " - `roles/run.invoker`: This role is required for the connection to have read-only access to Cloud Run services that back custom/remote functions ([documentation](https://cloud.google.com/bigquery/docs/remote-functions#grant_permission_on_function)).\n", "\n", "Set these permissions by running the following `gcloud` commands:" - ], - "metadata": { - "id": "W6l6Ol2biU9h" - } + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "d8wja24SVq6s" + }, + "outputs": [], "source": [ "!gcloud projects add-iam-policy-binding {PROJECT_ID} --condition=None --no-user-output-enabled --member={CONN_SERVICE_ACCOUNT} --role='roles/bigquery.connectionUser'\n", "!gcloud projects add-iam-policy-binding {PROJECT_ID} --condition=None --no-user-output-enabled --member={CONN_SERVICE_ACCOUNT} --role='roles/aiplatform.user'\n", "!gcloud projects add-iam-policy-binding {PROJECT_ID} --condition=None --no-user-output-enabled --member={CONN_SERVICE_ACCOUNT} --role='roles/run.invoker'" - ], - "metadata": { - "id": "d8wja24SVq6s" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", + "metadata": { + "id": "qUjT8nw-jIXp" + }, "source": [ "## Define the model\n", "\n", "Use `bigframes.ml.llm` to define the model:" - ], - "metadata": { - "id": "qUjT8nw-jIXp" - } + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "sdjeXFwcHfl7" + }, + "outputs": [], "source": [ "from bigframes.ml.llm import PaLM2TextGenerator\n", "\n", "session = bf.get_global_session()\n", "connection = f\"{PROJECT_ID}.{REGION}.{CONN_NAME}\"\n", "model = PaLM2TextGenerator(session=session, connection_name=connection)" - ], - "metadata": { - "id": "sdjeXFwcHfl7" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", + "metadata": { + "id": "GbW0oCnU1s1N" + }, "source": [ "# Read data from Cloud Storage into BigQuery DataFrames\n", "\n", @@ -486,80 +489,82 @@ "* An in-memory pandas DataFrame\n", "\n", "In this tutorial, you create BigQuery DataFrames DataFrames by reading two CSV files stored in Cloud Storage, one containing a list of DataFrame API names and one containing a list of Series API names." 
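For comparison with `read_csv`, the other input sources mentioned above have analogous readers. A minimal sketch — the public table name and the toy pandas DataFrame are purely illustrative, not part of this tutorial's dataset:

```python
import pandas as pd

# From an existing BigQuery table (illustrative public sample table):
penguins = bf.read_gbq("bigquery-public-data.ml_datasets.penguins")

# From an in-memory pandas DataFrame:
local_df = bf.read_pandas(pd.DataFrame({"API": ["head", "tail"]}))
```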
-      ],
-      "metadata": {
-        "id": "GbW0oCnU1s1N"
-      }
+      ]
     },
     {
      "cell_type": "code",
-      "source": [
-        "df_api = bf.read_csv(\"gs://cloud-samples-data/vertex-ai/bigframe/df.csv\")\n",
-        "series_api = bf.read_csv(\"gs://cloud-samples-data/vertex-ai/bigframe/series.csv\")"
-      ],
+      "execution_count": null,
      "metadata": {
        "id": "SchiTkQGIJog"
      },
-      "execution_count": null,
-      "outputs": []
+      "outputs": [],
+      "source": [
+        "df_api = bf.read_csv(\"gs://cloud-samples-data/vertex-ai/bigframe/df.csv\")\n",
+        "series_api = bf.read_csv(\"gs://cloud-samples-data/vertex-ai/bigframe/series.csv\")"
+      ]
     },
     {
      "cell_type": "markdown",
-      "source": [
-        "Take a peek at a few rows of data for each file:"
-      ],
      "metadata": {
        "id": "7OBjw2nmQY3-"
-      }
+      },
+      "source": [
+        "Take a peek at a few rows of data for each file:"
+      ]
     },
     {
      "cell_type": "code",
-      "source": [
-        "df_api.head(2)"
-      ],
+      "execution_count": null,
      "metadata": {
        "id": "QCqgVCIsGGuv"
      },
-      "execution_count": null,
-      "outputs": []
+      "outputs": [],
+      "source": [
+        "df_api.head(2)"
+      ]
     },
     {
      "cell_type": "code",
-      "source": [
-        "series_api.head(2)"
-      ],
+      "execution_count": null,
      "metadata": {
        "id": "BGJnZbgEGS5-"
      },
-      "execution_count": null,
-      "outputs": []
+      "outputs": [],
+      "source": [
+        "series_api.head(2)"
+      ]
     },
     {
      "cell_type": "markdown",
+      "metadata": {
+        "id": "m3ZJEsi7SUKV"
+      },
      "source": [
        "# Generate code using the LLM model\n",
        "\n",
        "Prepare the prompts and send them to the LLM model for prediction."
-      ],
-      "metadata": {
-        "id": "m3ZJEsi7SUKV"
-      }
+      ]
     },
     {
      "cell_type": "markdown",
+      "metadata": {
+        "id": "9EMAqR37AfLS"
+      },
      "source": [
        "## Prompt design in BigQuery DataFrames\n",
        "\n",
        "Designing prompts for LLMs is a fast-growing area; you can read more in [this documentation](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/introduction-prompt-design).\n",
        "\n",
        "For this tutorial, you use a simple prompt to ask the LLM model for sample code for each of the API methods (or rows) from the previous step's DataFrames. The output is the new DataFrames `df_prompt` and `series_prompt`, which contain the full prompt text."
-      ],
-      "metadata": {
-        "id": "9EMAqR37AfLS"
-      }
+      ]
     },
     {
      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "EDAaIwHpQCDZ"
+      },
+      "outputs": [],
      "source": [
        "df_prompt_prefix = \"Generate Pandas sample code for DataFrame.\"\n",
        "series_prompt_prefix = \"Generate Pandas sample code for Series.\"\n",
@@ -568,83 +573,83 @@
        "series_prompt = (series_prompt_prefix + series_api['API'])\n",
        "\n",
        "df_prompt.head(2)"
-      ],
-      "metadata": {
-        "id": "EDAaIwHpQCDZ"
-      },
-      "execution_count": null,
-      "outputs": []
+      ]
     },
     {
      "cell_type": "markdown",
+      "metadata": {
+        "id": "rwPLjqW2Ajzh"
+      },
      "source": [
        "## Make predictions using the LLM model\n",
        "\n",
        "Use the BigQuery DataFrames DataFrame containing the full prompt text as the input to the `predict` method. The `predict` method calls the LLM model and returns its generated text output back to two new BigQuery DataFrames DataFrames, `df_pred` and `series_pred`.\n",
        "\n",
        "Note: The predictions might take a few minutes to run."
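The calls in the next cell set only `max_output_tokens`; `predict` also exposes sampling controls you can tune. A hedged sketch — the parameter names are assumed from the `PaLM2TextGenerator.predict` API and the values are illustrative, not recommendations:

```python
# Sketch: lower temperature tends to give more deterministic code completions.
# Assumes predict() accepts these sampling parameters; values are illustrative.
df_pred_tuned = model.predict(
    df_prompt.to_frame(),
    temperature=0.2,
    max_output_tokens=1024,
)
```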
- ], - "metadata": { - "id": "rwPLjqW2Ajzh" - } + ] }, { "cell_type": "code", - "source": [ - "df_pred = model.predict(df_prompt.to_frame(), max_output_tokens=1024)\n", - "series_pred = model.predict(series_prompt.to_frame(), max_output_tokens=1024)" - ], + "execution_count": null, "metadata": { "id": "6i6HkFJZa8na" }, - "execution_count": null, - "outputs": [] + "outputs": [], + "source": [ + "df_pred = model.predict(df_prompt.to_frame(), max_output_tokens=1024)\n", + "series_pred = model.predict(series_prompt.to_frame(), max_output_tokens=1024)" + ] }, { "cell_type": "markdown", - "source": [ - "Once the predictions are processed, take a look at the sample output from the LLM, which provides code samples for the API names listed in the DataFrames dataset." - ], "metadata": { "id": "89cB8MW4UIdV" - } + }, + "source": [ + "Once the predictions are processed, take a look at the sample output from the LLM, which provides code samples for the API names listed in the DataFrames dataset." + ] }, { "cell_type": "code", - "source": [ - "print(df_pred['ml_generate_text_llm_result'].iloc[0])" - ], + "execution_count": null, "metadata": { "id": "9A2gw6hP_2nX" }, - "execution_count": null, - "outputs": [] + "outputs": [], + "source": [ + "print(df_pred['ml_generate_text_llm_result'].iloc[0])" + ] }, { "cell_type": "markdown", + "metadata": { + "id": "Fx4lsNqMorJ-" + }, "source": [ "# Manipulate LLM output using a remote function\n", "\n", "The output that the LLM provides often contains additional text beyond the code sample itself. Using BigQuery DataFrames, you can deploy custom Python functions that process and transform this output.\n", "\n" - ], - "metadata": { - "id": "Fx4lsNqMorJ-" - } + ] }, { "cell_type": "markdown", + "metadata": { + "id": "d8L7SN03VByG" + }, "source": [ "Running the cell below creates a custom function that you can use to process the LLM output data in two ways:\n", "1. Strip the LLM text output to include only the code block.\n", "2. Substitute `import pandas as pd` with `import bigframes.pandas as bf` so that the resulting code block works with BigQuery DataFrames." - ], - "metadata": { - "id": "d8L7SN03VByG" - } + ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "GskyyUQPowBT" + }, + "outputs": [], "source": [ "@bf.remote_function([str], str, bigquery_connection=CONN_NAME)\n", "def extract_code(text: str):\n", @@ -656,166 +661,161 @@ " return res\n", " except:\n", " return \"\"" - ], - "metadata": { - "id": "GskyyUQPowBT" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", - "source": [ - "The custom function is deployed as a Cloud Function, and then integrated with BigQuery as a [remote function](https://cloud.google.com/bigquery/docs/remote-functions). Save both of the function names so that you can clean them up at the end of this notebook." - ], "metadata": { "id": "hVQAoqBUOJQf" - } + }, + "source": [ + "The custom function is deployed as a Cloud Function, and then integrated with BigQuery as a [remote function](https://cloud.google.com/bigquery/docs/remote-functions). Save both of the function names so that you can clean them up at the end of this notebook." 
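If you want to preview what this transformation does without deploying anything, here is a purely local sketch of the same idea. The fence-stripping details are an illustration, not the deployed remote function's exact body:

```python
FENCE = "`" * 3  # built programmatically to avoid nesting a literal fence here

def extract_code_local(text: str) -> str:
    """Illustrative local version: keep only the fenced Python block and
    swap the pandas import for its BigQuery DataFrames equivalent."""
    try:
        start = text.index(FENCE + "python") + len(FENCE + "python")
        end = text.index(FENCE, start)
        code = text[start:end].strip()
        return code.replace("import pandas as pd", "import bigframes.pandas as bf")
    except ValueError:  # no fenced code block in the model output
        return ""

sample = f"Some prose.\n{FENCE}python\nimport pandas as pd\ndf = pd.DataFrame()\n{FENCE}\nMore prose."
print(extract_code_local(sample))  # prints the stripped block with the import swapped
```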
+ ] }, { "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PBlp-C-DOHRO" + }, + "outputs": [], "source": [ "CLOUD_FUNCTION_NAME = format(extract_code.bigframes_cloud_function)\n", "print(\"Cloud Function Name \" + CLOUD_FUNCTION_NAME)\n", "REMOTE_FUNCTION_NAME = format(extract_code.bigframes_remote_function)\n", "print(\"Remote Function Name \" + REMOTE_FUNCTION_NAME)" - ], - "metadata": { - "id": "PBlp-C-DOHRO" - }, - "execution_count": null, - "outputs": [] + ] }, { "cell_type": "markdown", - "source": [ - "Apply the custom function to each LLM output DataFrame to get the processed results:" - ], "metadata": { "id": "4FEucaiqVs3H" - } + }, + "source": [ + "Apply the custom function to each LLM output DataFrame to get the processed results:" + ] }, { "cell_type": "code", - "source": [ - "df_code = df_pred.assign(code=df_pred['ml_generate_text_llm_result'].apply(extract_code))\n", - "series_code = series_pred.assign(code=series_pred['ml_generate_text_llm_result'].apply(extract_code))" - ], + "execution_count": null, "metadata": { "id": "bsQ9cmoWo0Ps" }, - "execution_count": null, - "outputs": [] + "outputs": [], + "source": [ + "df_code = df_pred.assign(code=df_pred['ml_generate_text_llm_result'].apply(extract_code))\n", + "series_code = series_pred.assign(code=series_pred['ml_generate_text_llm_result'].apply(extract_code))" + ] }, { "cell_type": "markdown", - "source": [ - "You can see the differences by inspecting the first row of data:" - ], "metadata": { "id": "ujQVVuhfWA3y" - } + }, + "source": [ + "You can see the differences by inspecting the first row of data:" + ] }, { "cell_type": "code", - "source": [ - "print(df_code['code'].iloc[0])" - ], + "execution_count": null, "metadata": { "id": "7yWzjhGy_zcy" }, - "execution_count": null, - "outputs": [] + "outputs": [], + "source": [ + "print(df_code['code'].iloc[0])" + ] }, { "cell_type": "markdown", + "metadata": { + "id": "GTRdUw-Ro5R1" + }, "source": [ "# Save the results to Cloud Storage\n", "\n", "BigQuery DataFrames lets you save a BigQuery DataFrames DataFrame as a CSV file in Cloud Storage for further use. Try that now with your processed LLM output data." 
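Cloud Storage is only one destination: if you would rather keep the processed results queryable in BigQuery itself, `DataFrame.to_gbq` writes them to a table. A minimal sketch — the `llm_outputs` dataset is hypothetical and must already exist in your project:

```python
# Sketch: write results to a (hypothetical) BigQuery dataset instead of GCS.
df_code[["code"]].to_gbq(f"{PROJECT_ID}.llm_outputs.df_code", if_exists="replace")
```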
-      ],
-      "metadata": {
-        "id": "GTRdUw-Ro5R1"
-      }
+      ]
     },
     {
      "cell_type": "markdown",
-      "source": [
-        "Create a new Cloud Storage bucket with a unique name:"
-      ],
      "metadata": {
        "id": "9DQ7eiQxPTi3"
-      }
+      },
+      "source": [
+        "Create a new Cloud Storage bucket with a unique name:"
+      ]
     },
     {
      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "-J5LHgS6LLZ0"
+      },
+      "outputs": [],
      "source": [
        "import uuid\n",
        "BUCKET_ID = \"code-samples-\" + str(uuid.uuid1())\n",
        "\n",
        "!gsutil mb gs://{BUCKET_ID}"
-      ],
-      "metadata": {
-        "id": "-J5LHgS6LLZ0"
-      },
-      "execution_count": null,
-      "outputs": []
+      ]
     },
     {
      "cell_type": "markdown",
-      "source": [
-        "Use `to_csv` to write each BigQuery DataFrames DataFrame as a CSV file in the Cloud Storage bucket:"
-      ],
      "metadata": {
        "id": "tyxZXj0UPYUv"
-      }
+      },
+      "source": [
+        "Use `to_csv` to write each BigQuery DataFrames DataFrame as a CSV file in the Cloud Storage bucket:"
+      ]
     },
     {
      "cell_type": "code",
-      "source": [
-        "df_code[[\"code\"]].to_csv(f\"gs://{BUCKET_ID}/df_code*.csv\")\n",
-        "series_code[[\"code\"]].to_csv(f\"gs://{BUCKET_ID}/series_code*.csv\")"
-      ],
+      "execution_count": null,
      "metadata": {
        "id": "Zs_b5L-4IvER"
      },
-      "execution_count": null,
-      "outputs": []
+      "outputs": [],
+      "source": [
+        "df_code[[\"code\"]].to_csv(f\"gs://{BUCKET_ID}/df_code*.csv\")\n",
+        "series_code[[\"code\"]].to_csv(f\"gs://{BUCKET_ID}/series_code*.csv\")"
+      ]
     },
     {
      "cell_type": "markdown",
+      "metadata": {
+        "id": "UDBtDlrTuuh8"
+      },
      "source": [
        "You can navigate to the Cloud Storage bucket browser to download the two files and view them.\n",
        "\n",
        "Run the following cell, and then follow the link to your Cloud Storage bucket browser:"
-      ],
-      "metadata": {
-        "id": "UDBtDlrTuuh8"
-      }
+      ]
     },
     {
      "cell_type": "code",
-      "source": [
-        "print(f'https://console.developers.google.com/storage/browser/{BUCKET_ID}/')"
-      ],
+      "execution_count": null,
      "metadata": {
        "id": "PspCXu-qu_ND"
      },
-      "execution_count": null,
-      "outputs": []
+      "outputs": [],
+      "source": [
+        "print(f'https://console.developers.google.com/storage/browser/{BUCKET_ID}/')"
+      ]
     },
     {
      "cell_type": "markdown",
+      "metadata": {
+        "id": "RGSvUk48RK20"
+      },
      "source": [
        "# Summary and next steps\n",
        "\n",
        "You've used BigQuery DataFrames' integration with LLM models (`bigframes.ml.llm`) to generate code samples, and have transformed LLM output by creating and using a custom function in BigQuery DataFrames.\n",
        "\n",
        "Learn more about BigQuery DataFrames in the [documentation](https://cloud.google.com/python/docs/reference/bigframes/latest) and find more sample notebooks in the [GitHub repo](https://github.com/googleapis/python-bigquery-dataframes/tree/main/notebooks)."
-      ],
-      "metadata": {
-        "id": "RGSvUk48RK20"
-      }
+      ]
     },
     {
      "cell_type": "markdown",
@@ -833,6 +833,11 @@
     },
     {
      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "yw7A461XLjvW"
+      },
+      "outputs": [],
      "source": [
        "# # Delete the BigQuery Connection\n",
        "# from google.cloud import bigquery_connection_v1 as bq_connection\n",
@@ -840,12 +845,7 @@
        "# CONNECTION_ID = f\"projects/{PROJECT_ID}/locations/{REGION}/connections/{CONN_NAME}\"\n",
        "# client.delete_connection(name=CONNECTION_ID)\n",
        "# print(f\"Deleted connection '{CONNECTION_ID}'.\")"
-      ],
-      "metadata": {
-        "id": "yw7A461XLjvW"
-      },
-      "execution_count": null,
-      "outputs": []
+      ]
     },
     {
      "cell_type": "code",
@@ -864,22 +864,22 @@
     },
     {
      "cell_type": "code",
+      "execution_count": null,
+      "metadata": {
+        "id": "iQFo6OUBLmi3"
+      },
+      "outputs": [],
      "source": [
        "# # Delete the Google Cloud Storage bucket and files\n",
        "# ! gsutil rm -r gs://{BUCKET_ID}\n",
        "# print(f\"Deleted bucket '{BUCKET_ID}'.\")"
-      ],
-      "metadata": {
-        "id": "iQFo6OUBLmi3"
-      },
-      "execution_count": null,
-      "outputs": []
+      ]
     }
   ],
   "metadata": {
    "colab": {
-      "toc_visible": true,
-      "provenance": []
+      "provenance": [],
+      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
diff --git a/notebooks/getting_started/bq_dataframes_ml_linear_regression.ipynb b/notebooks/regression/bq_dataframes_ml_linear_regression.ipynb
similarity index 98%
rename from notebooks/getting_started/bq_dataframes_ml_linear_regression.ipynb
rename to notebooks/regression/bq_dataframes_ml_linear_regression.ipynb
index d317217810..338d6edf4f 100644
--- a/notebooks/getting_started/bq_dataframes_ml_linear_regression.ipynb
+++ b/notebooks/regression/bq_dataframes_ml_linear_regression.ipynb
@@ -35,18 +35,18 @@
 [Notebook header badge table: the "Run in Colab", "View on GitHub", and "Open in Vertex AI Workbench" links are updated to point at notebooks/regression/bq_dataframes_ml_linear_regression.ipynb instead of notebooks/getting_started/.]
diff --git a/noxfile.py b/noxfile.py
index a113e1fcde..84e5ab11bb 100644
--- a/noxfile.py
+++ b/noxfile.py
@@ -607,8 +607,8 @@ def notebook(session):
        # appropriate values and omitting cleanup logic that may break
        # our test infrastructure.
        "notebooks/getting_started/getting_started_bq_dataframes.ipynb",
-        "notebooks/getting_started/bq_dataframes_llm_code_generation.ipynb",
-        "notebooks/getting_started/bq_dataframes_ml_linear_regression.ipynb",
+        "notebooks/generative_ai/bq_dataframes_llm_code_generation.ipynb",
+        "notebooks/regression/bq_dataframes_ml_linear_regression.ipynb",
        "notebooks/generative_ai/bq_dataframes_ml_drug_name_generation.ipynb",
        "notebooks/vertex_sdk/sdk2_bigframes_pytorch.ipynb",
        "notebooks/vertex_sdk/sdk2_bigframes_sklearn.ipynb",