From 306e4c0d59672597c39d51eae8251bf3862649c1 Mon Sep 17 00:00:00 2001 From: Ashley Xu Date: Tue, 9 Jan 2024 19:48:16 +0000 Subject: [PATCH] chore: add polished ml fundamental notebooks and retire the old one --- .../getting_started/ml_fundamentals.ipynb | 3908 ----------------- .../ml_fundamentals_bq_dataframes.ipynb | 970 ++++ 2 files changed, 970 insertions(+), 3908 deletions(-) delete mode 100644 notebooks/getting_started/ml_fundamentals.ipynb create mode 100644 notebooks/getting_started/ml_fundamentals_bq_dataframes.ipynb diff --git a/notebooks/getting_started/ml_fundamentals.ipynb b/notebooks/getting_started/ml_fundamentals.ipynb deleted file mode 100644 index 165bd90f31..0000000000 --- a/notebooks/getting_started/ml_fundamentals.ipynb +++ /dev/null @@ -1,3908 +0,0 @@ -{ - "cells": [ - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Using ML - ML fundamentals\n", - "\n", - "The `bigframes.ml` module implements Scikit-Learn's machine learning API in\n", - "BigQuery DataFrames. It exposes BigQuery's ML capabilities in a simple, popular\n", - "API that works seamlessly with the rest of the BigQuery DataFrames API." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "Query job 7ddb1bda-402a-4e8e-8476-7904010fb4ef is DONE. 28.9 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job e8aba858-7660-4274-8d90-8d2b0382f8f6 is DONE. 28.9 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " 
\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
penguin_id | species | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | body_mass_g | sex
0 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 40.1 | 18.9 | 188.0 | 4300.0 | MALE
1 | Adelie Penguin (Pygoscelis adeliae) | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | MALE
2 | Gentoo penguin (Pygoscelis papua) | Biscoe | 47.4 | 14.6 | 212.0 | 4725.0 | FEMALE
3 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 42.5 | 16.7 | 187.0 | 3350.0 | FEMALE
4 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 43.2 | 19.0 | 197.0 | 4775.0 | MALE
5 | Gentoo penguin (Pygoscelis papua) | Biscoe | 46.7 | 15.3 | 219.0 | 5200.0 | MALE
6 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 41.3 | 21.1 | 195.0 | 4400.0 | MALE
7 | Gentoo penguin (Pygoscelis papua) | Biscoe | 45.2 | 13.8 | 215.0 | 4750.0 | FEMALE
8 | Gentoo penguin (Pygoscelis papua) | Biscoe | 46.5 | 13.5 | 210.0 | 4550.0 | FEMALE
9 | Gentoo penguin (Pygoscelis papua) | Biscoe | 50.5 | 15.2 | 216.0 | 5000.0 | FEMALE
10 | Gentoo penguin (Pygoscelis papua) | Biscoe | 48.2 | 15.6 | 221.0 | 5100.0 | MALE
11 | Adelie Penguin (Pygoscelis adeliae) | Dream | 38.1 | 18.6 | 190.0 | 3700.0 | FEMALE
12 | Gentoo penguin (Pygoscelis papua) | Biscoe | 50.7 | 15.0 | 223.0 | 5550.0 | MALE
13 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 37.8 | 20.0 | 190.0 | 4250.0 | MALE
14 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 35.0 | 17.9 | 190.0 | 3450.0 | FEMALE
15 | Gentoo penguin (Pygoscelis papua) | Biscoe | 48.7 | 15.7 | 208.0 | 5350.0 | MALE
16 | Adelie Penguin (Pygoscelis adeliae) | Torgersen | 34.6 | 21.1 | 198.0 | 4400.0 | MALE
17 | Gentoo penguin (Pygoscelis papua) | Biscoe | 46.8 | 15.4 | 215.0 | 5150.0 | MALE
18 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 50.3 | 20.0 | 197.0 | 3300.0 | MALE
19 | Adelie Penguin (Pygoscelis adeliae) | Dream | 37.2 | 18.1 | 178.0 | 3900.0 | MALE
20 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 51.0 | 18.8 | 203.0 | 4100.0 | MALE
21 | Adelie Penguin (Pygoscelis adeliae) | Biscoe | 40.5 | 17.9 | 187.0 | 3200.0 | FEMALE
22 | Gentoo penguin (Pygoscelis papua) | Biscoe | 45.5 | 13.9 | 210.0 | 4200.0 | FEMALE
23 | Adelie Penguin (Pygoscelis adeliae) | Dream | 42.2 | 18.5 | 180.0 | 3550.0 | FEMALE
24 | Chinstrap penguin (Pygoscelis antarctica) | Dream | 51.7 | 20.3 | 194.0 | 3775.0 | MALE
\n", - "

25 rows × 7 columns

\n", - "
[334 rows x 7 columns in total]" - ], - "text/plain": [ - " species island \\\n", - "penguin_id \n", - "0 Adelie Penguin (Pygoscelis adeliae) Biscoe \n", - "1 Adelie Penguin (Pygoscelis adeliae) Torgersen \n", - "2 Gentoo penguin (Pygoscelis papua) Biscoe \n", - "3 Chinstrap penguin (Pygoscelis antarctica) Dream \n", - "4 Adelie Penguin (Pygoscelis adeliae) Biscoe \n", - "5 Gentoo penguin (Pygoscelis papua) Biscoe \n", - "6 Adelie Penguin (Pygoscelis adeliae) Biscoe \n", - "7 Gentoo penguin (Pygoscelis papua) Biscoe \n", - "8 Gentoo penguin (Pygoscelis papua) Biscoe \n", - "9 Gentoo penguin (Pygoscelis papua) Biscoe \n", - "10 Gentoo penguin (Pygoscelis papua) Biscoe \n", - "11 Adelie Penguin (Pygoscelis adeliae) Dream \n", - "12 Gentoo penguin (Pygoscelis papua) Biscoe \n", - "13 Adelie Penguin (Pygoscelis adeliae) Biscoe \n", - "14 Adelie Penguin (Pygoscelis adeliae) Biscoe \n", - "15 Gentoo penguin (Pygoscelis papua) Biscoe \n", - "16 Adelie Penguin (Pygoscelis adeliae) Torgersen \n", - "17 Gentoo penguin (Pygoscelis papua) Biscoe \n", - "18 Chinstrap penguin (Pygoscelis antarctica) Dream \n", - "19 Adelie Penguin (Pygoscelis adeliae) Dream \n", - "20 Chinstrap penguin (Pygoscelis antarctica) Dream \n", - "21 Adelie Penguin (Pygoscelis adeliae) Biscoe \n", - "22 Gentoo penguin (Pygoscelis papua) Biscoe \n", - "23 Adelie Penguin (Pygoscelis adeliae) Dream \n", - "24 Chinstrap penguin (Pygoscelis antarctica) Dream \n", - "\n", - " culmen_length_mm culmen_depth_mm flipper_length_mm body_mass_g \\\n", - "penguin_id \n", - "0 40.1 18.9 188.0 4300.0 \n", - "1 39.1 18.7 181.0 3750.0 \n", - "2 47.4 14.6 212.0 4725.0 \n", - "3 42.5 16.7 187.0 3350.0 \n", - "4 43.2 19.0 197.0 4775.0 \n", - "5 46.7 15.3 219.0 5200.0 \n", - "6 41.3 21.1 195.0 4400.0 \n", - "7 45.2 13.8 215.0 4750.0 \n", - "8 46.5 13.5 210.0 4550.0 \n", - "9 50.5 15.2 216.0 5000.0 \n", - "10 48.2 15.6 221.0 5100.0 \n", - "11 38.1 18.6 190.0 3700.0 \n", - "12 50.7 15.0 223.0 5550.0 \n", - "13 37.8 20.0 190.0 
4250.0 \n", - "14 35.0 17.9 190.0 3450.0 \n", - "15 48.7 15.7 208.0 5350.0 \n", - "16 34.6 21.1 198.0 4400.0 \n", - "17 46.8 15.4 215.0 5150.0 \n", - "18 50.3 20.0 197.0 3300.0 \n", - "19 37.2 18.1 178.0 3900.0 \n", - "20 51.0 18.8 203.0 4100.0 \n", - "21 40.5 17.9 187.0 3200.0 \n", - "22 45.5 13.9 210.0 4200.0 \n", - "23 42.2 18.5 180.0 3550.0 \n", - "24 51.7 20.3 194.0 3775.0 \n", - "\n", - " sex \n", - "penguin_id \n", - "0 MALE \n", - "1 MALE \n", - "2 FEMALE \n", - "3 FEMALE \n", - "4 MALE \n", - "5 MALE \n", - "6 MALE \n", - "7 FEMALE \n", - "8 FEMALE \n", - "9 FEMALE \n", - "10 MALE \n", - "11 FEMALE \n", - "12 MALE \n", - "13 MALE \n", - "14 FEMALE \n", - "15 MALE \n", - "16 MALE \n", - "17 MALE \n", - "18 MALE \n", - "19 MALE \n", - "20 MALE \n", - "21 FEMALE \n", - "22 FEMALE \n", - "23 FEMALE \n", - "24 MALE \n", - "...\n", - "\n", - "[334 rows x 7 columns]" - ] - }, - "execution_count": 1, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Let's load some test data to use in this tutorial\n", - "import bigframes.pandas\n", - "\n", - "df = bigframes.pandas.read_gbq(\"bigquery-public-data.ml_datasets.penguins\")\n", - "df = df.dropna()\n", - "\n", - "# Temporary workaround: let's name our index so it isn't lost. BigQuery DataFrames\n", - "# currently drops unnamed indexes when round-tripping through pandas, which\n", - "# some ML APIs do to route around missing functionality\n", - "df.index.name = \"penguin_id\"\n", - "\n", - "df" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Data split\n", - "\n", - "Part of preparing data for a machine learning task is splitting it into subsets for training and testing, to ensure that the solution is not overfitting. 
Most commonly this is done with `bigframes.ml.model_selection.train_test_split` like so:" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "Query job deda90a8-6ec7-419c-8067-e85777bd916f is DONE. 28.9 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job efe8fa0a-d450-475a-99d5-36beeb985247 is DONE. 28.9 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 5022c56d-e605-4cab-be1b-1ecf189588a1 is DONE. 28.9 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 175bd293-d448-4510-b926-1d8cfb4eb5e7 is DONE. 28.9 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job a3a2e68c-f5f3-4237-99ad-44974f29d090 is DONE. 28.9 kB processed. 
Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "X_train shape: (267, 6)\n", - "X_test shape: (67, 6)\n", - "y_train shape: (267, 1)\n", - "y_test shape: (67, 1)\n" - ] - } - ], - "source": [ - "# In this example, we're doing supervised learning, where we will learn to predict\n", - "# output variable `y` from input features `X`\n", - "X = df[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex', 'species']]\n", - "y = df[['body_mass_g']] \n", - "\n", - "from bigframes.ml.model_selection import train_test_split\n", - "\n", - "# This will split X and y into test and training sets, with 20% of the rows in the test set,\n", - "# and the rest in the training set\n", - "X_train, X_test, y_train, y_test = train_test_split(\n", - " X, y, test_size=0.2)\n", - "\n", - "# Show the shape of the data after the split\n", - "print(f\"\"\"X_train shape: {X_train.shape}\n", - "X_test shape: {X_test.shape}\n", - "y_train shape: {y_train.shape}\n", - "y_test shape: {y_test.shape}\"\"\")" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "Query job db3365fb-67ca-44cc-a117-88a80dc63cca is DONE. 28.9 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job ab78f7ab-a115-448b-92d0-19c091a831ca is DONE. 28.9 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
penguin_id | island | culmen_length_mm | culmen_depth_mm | flipper_length_mm | sex | species
249 | Torgersen | 41.1 | 18.6 | 189.0 | MALE | Adelie Penguin (Pygoscelis adeliae)
36 | Biscoe | 43.4 | 14.4 | 218.0 | FEMALE | Gentoo penguin (Pygoscelis papua)
74 | Biscoe | 42.8 | 14.2 | 209.0 | FEMALE | Gentoo penguin (Pygoscelis papua)
235 | Dream | 34.0 | 17.1 | 185.0 | FEMALE | Adelie Penguin (Pygoscelis adeliae)
117 | Dream | 37.8 | 18.1 | 193.0 | MALE | Adelie Penguin (Pygoscelis adeliae)
\n", - "

5 rows × 6 columns

\n", - "
[5 rows x 6 columns in total]" - ], - "text/plain": [ - " island culmen_length_mm culmen_depth_mm flipper_length_mm \\\n", - "penguin_id \n", - "249 Torgersen 41.1 18.6 189.0 \n", - "36 Biscoe 43.4 14.4 218.0 \n", - "74 Biscoe 42.8 14.2 209.0 \n", - "235 Dream 34.0 17.1 185.0 \n", - "117 Dream 37.8 18.1 193.0 \n", - "\n", - " sex species \n", - "penguin_id \n", - "249 MALE Adelie Penguin (Pygoscelis adeliae) \n", - "36 FEMALE Gentoo penguin (Pygoscelis papua) \n", - "74 FEMALE Gentoo penguin (Pygoscelis papua) \n", - "235 FEMALE Adelie Penguin (Pygoscelis adeliae) \n", - "117 MALE Adelie Penguin (Pygoscelis adeliae) \n", - "\n", - "[5 rows x 6 columns]" - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# If we look at the data, we can see that random rows were selected for\n", - "# each side of the split\n", - "X_test.head(5)" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "Query job 22a72cad-11a6-4f8e-b16d-f92853b8112e is DONE. 28.9 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job bc952727-8806-4fe2-abf2-c3a8a2bd9b6d is DONE. 28.9 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
penguin_id | body_mass_g
249 | 3325.0
36 | 4600.0
74 | 4700.0
235 | 3400.0
117 | 3750.0
\n", - "

5 rows × 1 columns

\n", - "
[5 rows x 1 columns in total]" - ], - "text/plain": [ - " body_mass_g\n", - "penguin_id \n", - "249 3325.0\n", - "36 4600.0\n", - "74 4700.0\n", - "235 3400.0\n", - "117 3750.0\n", - "\n", - "[5 rows x 1 columns]" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Note that this matches the rows in X_test\n", - "y_test.head(5)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Estimators\n", - "\n", - "Following Scikit-Learn, all learning components are \"estimators\": objects that can learn from training data and then apply themselves to new data. Estimators share the following patterns:\n", - "\n", - "- a constructor that takes a list of parameters\n", - "- a standard string representation that shows the class name and all non-default parameters, e.g. `LinearRegression(fit_intercept=False)`\n", - "- a `.fit(...)` method to fit the estimator to training data\n", - "\n", - "These estimators can be further broken down into two main subtypes:\n", - "\n", - "### Transformers\n", - "\n", - "Transformers are estimators that are used to prepare data for consumption by other estimators ('preprocessing'). In addition to `.fit(...)`, the transformer implements a `.transform(...)` method, which will apply a transformation based on what was computed during `.fit(...)`. With this pattern, dynamic preprocessing steps can be applied to both training and test/production data consistently.\n", - "\n", - "An example of a transformer is `bigframes.ml.preprocessing.StandardScaler`, which rescales a dataset to have a mean of zero and a standard deviation of one:" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "Query job f239341e-785f-43e1-bfe0-683132d6f15f is DONE. 28.9 kB processed. 
Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 2d5bbbb9-efc4-4f4e-a8dc-2c7b66b0e5e0 is DONE. 0 Bytes processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 66120e1c-2471-4a0c-8b82-aeb189c8866a is DONE. 28.9 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 62825fc4-5b77-43e5-a3e4-525ebfd1285b is DONE. 2.1 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 656d1d69-b4ff-4db6-9f2d-28dcf91e2fd7 is DONE. 0 Bytes processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 466507c8-1474-4725-93e5-baf8ee292e39 is DONE. 8.5 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
penguin_id | standard_scaled_culmen_length_mm | standard_scaled_culmen_depth_mm | standard_scaled_flipper_length_mm
0 | -0.750505 | 0.84903 | -0.937262
2 | 0.622496 | -1.322402 | 0.804051
3 | -0.299107 | -0.261935 | -1.009817
5 | 0.490839 | -0.968913 | 1.311935
6 | -0.524806 | 1.959995 | -0.429379
7 | 0.208715 | -1.726389 | 1.021716
9 | 1.205551 | -1.019412 | 1.09427
10 | 0.772962 | -0.817418 | 1.457044
12 | 1.243168 | -1.120408 | 1.602153
14 | -1.709725 | 0.344046 | -0.792152
17 | 0.509647 | -0.918415 | 1.021716
18 | 1.167935 | 1.404513 | -0.284269
19 | -1.295944 | 0.445043 | -1.662809
20 | 1.299593 | 0.798532 | 0.151059
21 | -0.675272 | 0.344046 | -1.009817
22 | 0.26514 | -1.675891 | 0.658942
24 | 1.43125 | 1.556008 | -0.501934
25 | 0.302756 | 0.041055 | -0.574488
26 | 0.302756 | -1.675891 | 0.949161
27 | 0.227523 | -1.776888 | 0.658942
28 | 1.318401 | -0.362932 | 1.747263
29 | 2.202388 | 1.303516 | 0.441278
30 | -0.919779 | 1.959995 | -0.356824
31 | 1.036277 | -0.615424 | 1.747263
32 | -0.223874 | 0.19255 | -0.356824
\n", - "

25 rows × 3 columns

\n", - "
[267 rows x 3 columns in total]" - ], - "text/plain": [ - " standard_scaled_culmen_length_mm standard_scaled_culmen_depth_mm \\\n", - "penguin_id \n", - "0 -0.750505 0.84903 \n", - "2 0.622496 -1.322402 \n", - "3 -0.299107 -0.261935 \n", - "5 0.490839 -0.968913 \n", - "6 -0.524806 1.959995 \n", - "7 0.208715 -1.726389 \n", - "9 1.205551 -1.019412 \n", - "10 0.772962 -0.817418 \n", - "12 1.243168 -1.120408 \n", - "14 -1.709725 0.344046 \n", - "17 0.509647 -0.918415 \n", - "18 1.167935 1.404513 \n", - "19 -1.295944 0.445043 \n", - "20 1.299593 0.798532 \n", - "21 -0.675272 0.344046 \n", - "22 0.26514 -1.675891 \n", - "24 1.43125 1.556008 \n", - "25 0.302756 0.041055 \n", - "26 0.302756 -1.675891 \n", - "27 0.227523 -1.776888 \n", - "28 1.318401 -0.362932 \n", - "29 2.202388 1.303516 \n", - "30 -0.919779 1.959995 \n", - "31 1.036277 -0.615424 \n", - "32 -0.223874 0.19255 \n", - "\n", - " standard_scaled_flipper_length_mm \n", - "penguin_id \n", - "0 -0.937262 \n", - "2 0.804051 \n", - "3 -1.009817 \n", - "5 1.311935 \n", - "6 -0.429379 \n", - "7 1.021716 \n", - "9 1.09427 \n", - "10 1.457044 \n", - "12 1.602153 \n", - "14 -0.792152 \n", - "17 1.021716 \n", - "18 -0.284269 \n", - "19 -1.662809 \n", - "20 0.151059 \n", - "21 -1.009817 \n", - "22 0.658942 \n", - "24 -0.501934 \n", - "25 -0.574488 \n", - "26 0.949161 \n", - "27 0.658942 \n", - "28 1.747263 \n", - "29 0.441278 \n", - "30 -0.356824 \n", - "31 1.747263 \n", - "32 -0.356824 \n", - "...\n", - "\n", - "[267 rows x 3 columns]" - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from bigframes.ml.preprocessing import StandardScaler\n", - "\n", - "# StandardScaler will only work on numeric columns\n", - "numeric_columns = [\"culmen_length_mm\", \"culmen_depth_mm\", \"flipper_length_mm\"]\n", - "\n", - "scaler = StandardScaler()\n", - "scaler.fit(X_train[numeric_columns])\n", - "\n", - "# Now, standardscaler should transform the numbers to have mean of 
zero\n", - "# and standard deviation of one:\n", - "scaler.transform(X_train[numeric_columns])" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "Query job 845c6cff-ac6c-46c1-8e9b-061519f1fa1a is DONE. 28.9 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 1e17f5f7-2956-4bdd-baa9-c07591481341 is DONE. 536 Bytes processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job e2fde7a6-67b4-45a4-91d4-1cb9eff66ae5 is DONE. 0 Bytes processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job e0683619-23c5-44fd-8930-9d3c9d02729a is DONE. 2.1 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
penguin_id | standard_scaled_culmen_length_mm | standard_scaled_culmen_depth_mm | standard_scaled_flipper_length_mm
1 | -0.938587 | 0.748033 | -1.445145
4 | -0.16745 | 0.899528 | -0.284269
8 | 0.453222 | -1.877885 | 0.658942
11 | -1.12667 | 0.697535 | -0.792152
13 | -1.183094 | 1.404513 | -0.792152
15 | 0.867003 | -0.766919 | 0.513833
16 | -1.784958 | 1.959995 | -0.211715
23 | -0.355532 | 0.647036 | -1.5177
34 | -0.600039 | -1.776888 | 0.949161
36 | -0.129833 | -1.423399 | 1.23938
42 | -1.615684 | -0.514427 | -0.429379
48 | 0.415606 | -0.716421 | 1.021716
61 | 0.396797 | -1.170907 | 1.457044
64 | 0.434414 | -1.120408 | 1.09427
65 | -1.220711 | 1.051024 | -1.445145
68 | -1.484026 | -0.009443 | -1.009817
70 | 1.638141 | 1.404513 | 0.296168
72 | 0.829387 | 0.142052 | -0.719598
74 | -0.242683 | -1.524396 | 0.586387
77 | -1.277136 | -0.211437 | -0.647043
81 | 0.208715 | -1.221405 | 0.804051
91 | 1.261976 | 0.647036 | 0.005949
96 | 0.246331 | -1.322402 | 0.731497
105 | -1.803766 | 0.445043 | -1.009817
111 | -1.164286 | 0.697535 | -2.098138
\n", - "

25 rows × 3 columns

\n", - "
[67 rows x 3 columns in total]" - ], - "text/plain": [ - " standard_scaled_culmen_length_mm standard_scaled_culmen_depth_mm \\\n", - "penguin_id \n", - "1 -0.938587 0.748033 \n", - "4 -0.16745 0.899528 \n", - "8 0.453222 -1.877885 \n", - "11 -1.12667 0.697535 \n", - "13 -1.183094 1.404513 \n", - "15 0.867003 -0.766919 \n", - "16 -1.784958 1.959995 \n", - "23 -0.355532 0.647036 \n", - "34 -0.600039 -1.776888 \n", - "36 -0.129833 -1.423399 \n", - "42 -1.615684 -0.514427 \n", - "48 0.415606 -0.716421 \n", - "61 0.396797 -1.170907 \n", - "64 0.434414 -1.120408 \n", - "65 -1.220711 1.051024 \n", - "68 -1.484026 -0.009443 \n", - "70 1.638141 1.404513 \n", - "72 0.829387 0.142052 \n", - "74 -0.242683 -1.524396 \n", - "77 -1.277136 -0.211437 \n", - "81 0.208715 -1.221405 \n", - "91 1.261976 0.647036 \n", - "96 0.246331 -1.322402 \n", - "105 -1.803766 0.445043 \n", - "111 -1.164286 0.697535 \n", - "\n", - " standard_scaled_flipper_length_mm \n", - "penguin_id \n", - "1 -1.445145 \n", - "4 -0.284269 \n", - "8 0.658942 \n", - "11 -0.792152 \n", - "13 -0.792152 \n", - "15 0.513833 \n", - "16 -0.211715 \n", - "23 -1.5177 \n", - "34 0.949161 \n", - "36 1.23938 \n", - "42 -0.429379 \n", - "48 1.021716 \n", - "61 1.457044 \n", - "64 1.09427 \n", - "65 -1.445145 \n", - "68 -1.009817 \n", - "70 0.296168 \n", - "72 -0.719598 \n", - "74 0.586387 \n", - "77 -0.647043 \n", - "81 0.804051 \n", - "91 0.005949 \n", - "96 0.731497 \n", - "105 -1.009817 \n", - "111 -2.098138 \n", - "...\n", - "\n", - "[67 rows x 3 columns]" - ] - }, - "execution_count": 6, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# We can then repeat this transformation on new data\n", - "scaler.transform(X_test[numeric_columns])" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Composing transformers\n", - "\n", - "To process data where different columns need different preprocessors, `bigframes.ml.compose.ColumnTransformer` can be 
employed:" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "Query job 75c1ce67-e5d7-4f4c-947e-381fc5298236 is DONE. 28.9 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 41962e2e-4d14-4053-9297-3ce61699551a is DONE. 0 Bytes processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 5d3c22c9-c972-4213-8557-726c9e0aca37 is DONE. 22.9 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 9cb7b33f-ea05-4cf4-9f92-bb3aa4ea8d10 is DONE. 2.1 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job fe1f35d6-d82c-4aab-a284-637b72554f5b is DONE. 29.2 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 37bc90ff-59cb-4b0c-8f9d-73bcda43524a is DONE. 536 Bytes processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job e23f4724-fdd8-45a9-8c87-defd8d471035 is DONE. 0 Bytes processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 257378db-0569-42d7-965a-7757154c710b is DONE. 21.4 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " 
\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
onehotencoded_islandstandard_scaled_culmen_length_mmstandard_scaled_culmen_depth_mmstandard_scaled_flipper_length_mmonehotencoded_sexonehotencoded_species
penguin_id
0[{'index': 1, 'value': 1.0}]-0.7505050.84903-0.937262[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
2[{'index': 1, 'value': 1.0}]0.622496-1.3224020.804051[{'index': 1, 'value': 1.0}][{'index': 3, 'value': 1.0}]
3[{'index': 2, 'value': 1.0}]-0.299107-0.261935-1.009817[{'index': 1, 'value': 1.0}][{'index': 2, 'value': 1.0}]
5[{'index': 1, 'value': 1.0}]0.490839-0.9689131.311935[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
6[{'index': 1, 'value': 1.0}]-0.5248061.959995-0.429379[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
7[{'index': 1, 'value': 1.0}]0.208715-1.7263891.021716[{'index': 1, 'value': 1.0}][{'index': 3, 'value': 1.0}]
9[{'index': 1, 'value': 1.0}]1.205551-1.0194121.09427[{'index': 1, 'value': 1.0}][{'index': 3, 'value': 1.0}]
10[{'index': 1, 'value': 1.0}]0.772962-0.8174181.457044[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
12[{'index': 1, 'value': 1.0}]1.243168-1.1204081.602153[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
14[{'index': 1, 'value': 1.0}]-1.7097250.344046-0.792152[{'index': 1, 'value': 1.0}][{'index': 1, 'value': 1.0}]
17[{'index': 1, 'value': 1.0}]0.509647-0.9184151.021716[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
18[{'index': 2, 'value': 1.0}]1.1679351.404513-0.284269[{'index': 2, 'value': 1.0}][{'index': 2, 'value': 1.0}]
19[{'index': 2, 'value': 1.0}]-1.2959440.445043-1.662809[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
20[{'index': 2, 'value': 1.0}]1.2995930.7985320.151059[{'index': 2, 'value': 1.0}][{'index': 2, 'value': 1.0}]
21[{'index': 1, 'value': 1.0}]-0.6752720.344046-1.009817[{'index': 1, 'value': 1.0}][{'index': 1, 'value': 1.0}]
22[{'index': 1, 'value': 1.0}]0.26514-1.6758910.658942[{'index': 1, 'value': 1.0}][{'index': 3, 'value': 1.0}]
24[{'index': 2, 'value': 1.0}]1.431251.556008-0.501934[{'index': 2, 'value': 1.0}][{'index': 2, 'value': 1.0}]
25[{'index': 2, 'value': 1.0}]0.3027560.041055-0.574488[{'index': 1, 'value': 1.0}][{'index': 2, 'value': 1.0}]
26[{'index': 1, 'value': 1.0}]0.302756-1.6758910.949161[{'index': 1, 'value': 1.0}][{'index': 3, 'value': 1.0}]
27[{'index': 1, 'value': 1.0}]0.227523-1.7768880.658942[{'index': 1, 'value': 1.0}][{'index': 3, 'value': 1.0}]
28[{'index': 1, 'value': 1.0}]1.318401-0.3629321.747263[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
29[{'index': 2, 'value': 1.0}]2.2023881.3035160.441278[{'index': 2, 'value': 1.0}][{'index': 2, 'value': 1.0}]
30[{'index': 2, 'value': 1.0}]-0.9197791.959995-0.356824[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
31[{'index': 1, 'value': 1.0}]1.036277-0.6154241.747263[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
32[{'index': 3, 'value': 1.0}]-0.2238740.19255-0.356824[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
\n", - "

25 rows × 6 columns

\n", - "
[267 rows x 6 columns in total]" - ], - "text/plain": [ - " onehotencoded_island standard_scaled_culmen_length_mm \\\n", - "penguin_id \n", - "0 [{'index': 1, 'value': 1.0}] -0.750505 \n", - "2 [{'index': 1, 'value': 1.0}] 0.622496 \n", - "3 [{'index': 2, 'value': 1.0}] -0.299107 \n", - "5 [{'index': 1, 'value': 1.0}] 0.490839 \n", - "6 [{'index': 1, 'value': 1.0}] -0.524806 \n", - "7 [{'index': 1, 'value': 1.0}] 0.208715 \n", - "9 [{'index': 1, 'value': 1.0}] 1.205551 \n", - "10 [{'index': 1, 'value': 1.0}] 0.772962 \n", - "12 [{'index': 1, 'value': 1.0}] 1.243168 \n", - "14 [{'index': 1, 'value': 1.0}] -1.709725 \n", - "17 [{'index': 1, 'value': 1.0}] 0.509647 \n", - "18 [{'index': 2, 'value': 1.0}] 1.167935 \n", - "19 [{'index': 2, 'value': 1.0}] -1.295944 \n", - "20 [{'index': 2, 'value': 1.0}] 1.299593 \n", - "21 [{'index': 1, 'value': 1.0}] -0.675272 \n", - "22 [{'index': 1, 'value': 1.0}] 0.26514 \n", - "24 [{'index': 2, 'value': 1.0}] 1.43125 \n", - "25 [{'index': 2, 'value': 1.0}] 0.302756 \n", - "26 [{'index': 1, 'value': 1.0}] 0.302756 \n", - "27 [{'index': 1, 'value': 1.0}] 0.227523 \n", - "28 [{'index': 1, 'value': 1.0}] 1.318401 \n", - "29 [{'index': 2, 'value': 1.0}] 2.202388 \n", - "30 [{'index': 2, 'value': 1.0}] -0.919779 \n", - "31 [{'index': 1, 'value': 1.0}] 1.036277 \n", - "32 [{'index': 3, 'value': 1.0}] -0.223874 \n", - "\n", - " standard_scaled_culmen_depth_mm \\\n", - "penguin_id \n", - "0 0.84903 \n", - "2 -1.322402 \n", - "3 -0.261935 \n", - "5 -0.968913 \n", - "6 1.959995 \n", - "7 -1.726389 \n", - "9 -1.019412 \n", - "10 -0.817418 \n", - "12 -1.120408 \n", - "14 0.344046 \n", - "17 -0.918415 \n", - "18 1.404513 \n", - "19 0.445043 \n", - "20 0.798532 \n", - "21 0.344046 \n", - "22 -1.675891 \n", - "24 1.556008 \n", - "25 0.041055 \n", - "26 -1.675891 \n", - "27 -1.776888 \n", - "28 -0.362932 \n", - "29 1.303516 \n", - "30 1.959995 \n", - "31 -0.615424 \n", - "32 0.19255 \n", - "\n", - " standard_scaled_flipper_length_mm 
onehotencoded_sex \\\n", - "penguin_id \n", - "0 -0.937262 [{'index': 2, 'value': 1.0}] \n", - "2 0.804051 [{'index': 1, 'value': 1.0}] \n", - "3 -1.009817 [{'index': 1, 'value': 1.0}] \n", - "5 1.311935 [{'index': 2, 'value': 1.0}] \n", - "6 -0.429379 [{'index': 2, 'value': 1.0}] \n", - "7 1.021716 [{'index': 1, 'value': 1.0}] \n", - "9 1.09427 [{'index': 1, 'value': 1.0}] \n", - "10 1.457044 [{'index': 2, 'value': 1.0}] \n", - "12 1.602153 [{'index': 2, 'value': 1.0}] \n", - "14 -0.792152 [{'index': 1, 'value': 1.0}] \n", - "17 1.021716 [{'index': 2, 'value': 1.0}] \n", - "18 -0.284269 [{'index': 2, 'value': 1.0}] \n", - "19 -1.662809 [{'index': 2, 'value': 1.0}] \n", - "20 0.151059 [{'index': 2, 'value': 1.0}] \n", - "21 -1.009817 [{'index': 1, 'value': 1.0}] \n", - "22 0.658942 [{'index': 1, 'value': 1.0}] \n", - "24 -0.501934 [{'index': 2, 'value': 1.0}] \n", - "25 -0.574488 [{'index': 1, 'value': 1.0}] \n", - "26 0.949161 [{'index': 1, 'value': 1.0}] \n", - "27 0.658942 [{'index': 1, 'value': 1.0}] \n", - "28 1.747263 [{'index': 2, 'value': 1.0}] \n", - "29 0.441278 [{'index': 2, 'value': 1.0}] \n", - "30 -0.356824 [{'index': 2, 'value': 1.0}] \n", - "31 1.747263 [{'index': 2, 'value': 1.0}] \n", - "32 -0.356824 [{'index': 2, 'value': 1.0}] \n", - "\n", - " onehotencoded_species \n", - "penguin_id \n", - "0 [{'index': 1, 'value': 1.0}] \n", - "2 [{'index': 3, 'value': 1.0}] \n", - "3 [{'index': 2, 'value': 1.0}] \n", - "5 [{'index': 3, 'value': 1.0}] \n", - "6 [{'index': 1, 'value': 1.0}] \n", - "7 [{'index': 3, 'value': 1.0}] \n", - "9 [{'index': 3, 'value': 1.0}] \n", - "10 [{'index': 3, 'value': 1.0}] \n", - "12 [{'index': 3, 'value': 1.0}] \n", - "14 [{'index': 1, 'value': 1.0}] \n", - "17 [{'index': 3, 'value': 1.0}] \n", - "18 [{'index': 2, 'value': 1.0}] \n", - "19 [{'index': 1, 'value': 1.0}] \n", - "20 [{'index': 2, 'value': 1.0}] \n", - "21 [{'index': 1, 'value': 1.0}] \n", - "22 [{'index': 3, 'value': 1.0}] \n", - "24 [{'index': 2, 'value': 1.0}] 
\n", - "25 [{'index': 2, 'value': 1.0}] \n", - "26 [{'index': 3, 'value': 1.0}] \n", - "27 [{'index': 3, 'value': 1.0}] \n", - "28 [{'index': 3, 'value': 1.0}] \n", - "29 [{'index': 2, 'value': 1.0}] \n", - "30 [{'index': 1, 'value': 1.0}] \n", - "31 [{'index': 3, 'value': 1.0}] \n", - "32 [{'index': 1, 'value': 1.0}] \n", - "...\n", - "\n", - "[267 rows x 6 columns]" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from bigframes.ml.compose import ColumnTransformer\n", - "from bigframes.ml.preprocessing import OneHotEncoder\n", - "\n", - "# Create an aggregate transform that applies StandardScaler to the numeric columns,\n", - "# and OneHotEncoder to the string columns\n", - "preproc = ColumnTransformer([\n", - " (\"scale\", StandardScaler(), [\"culmen_length_mm\", \"culmen_depth_mm\", \"flipper_length_mm\"]),\n", - " (\"encode\", OneHotEncoder(), [\"species\", \"sex\", \"island\"])])\n", - "\n", - "# Now we can fit all columns of the training data\n", - "preproc.fit(X_train)\n", - "\n", - "processed_X_train = preproc.transform(X_train)\n", - "processed_X_test = preproc.transform(X_test)\n", - "\n", - "processed_X_train" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Predictors\n", - "\n", - "Predictors are estimators that learn and make predictions. In addition to `.fit(...)`, the predictor implements a `.predict(...)` method, which will use what was learned during `.fit(...)` to predict some output.\n", - "\n", - "Predictors can be further broken down into two categories:\n", - "\n", - "#### Supervised predictors\n", - "\n", - "Supervised learning is when we train a model on input-output pairs, and then ask it to predict the output for new inputs. An example of such a predictor is `bigframes.ml.linear_models.LinearRegression`." 
- ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "Query job 7d9c9f8b-6b4c-451f-ae3d-06fb7090d148 is DONE. 21.4 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job be87ccfa-72ab-4858-9d4a-b2f5f8b2a5e6 is DONE. 28.9 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 2d651fac-11bf-42da-8c18-bd33207379ca is DONE. 0 Bytes processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 58836ccc-242b-4574-bc48-4c269e74dbf1 is DONE. 5.7 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 1bf531f0-0fde-489b-ab36-6040a2a12377 is DONE. 536 Bytes processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 4245f4e6-4d5b-404f-81d7-50f0553e2456 is DONE. 0 Bytes processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job ed951699-c005-450e-a8b6-0916ec234e7f is DONE. 5.9 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " 
\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
predicted_body_mass_gonehotencoded_islandstandard_scaled_culmen_length_mmstandard_scaled_culmen_depth_mmstandard_scaled_flipper_length_mmonehotencoded_sexonehotencoded_species
penguin_id
13781.402407[{'index': 3, 'value': 1.0}]-0.9385870.748033-1.445145[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
44124.107944[{'index': 1, 'value': 1.0}]-0.167450.899528-0.284269[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
84670.344196[{'index': 1, 'value': 1.0}]0.453222-1.8778850.658942[{'index': 1, 'value': 1.0}][{'index': 3, 'value': 1.0}]
113529.417214[{'index': 2, 'value': 1.0}]-1.126670.697535-0.792152[{'index': 1, 'value': 1.0}][{'index': 1, 'value': 1.0}]
134014.101714[{'index': 1, 'value': 1.0}]-1.1830941.404513-0.792152[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
155212.41288[{'index': 1, 'value': 1.0}]0.867003-0.7669190.513833[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
164163.595615[{'index': 3, 'value': 1.0}]-1.7849581.959995-0.211715[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
233392.453069[{'index': 2, 'value': 1.0}]-0.3555320.647036-1.5177[{'index': 1, 'value': 1.0}][{'index': 1, 'value': 1.0}]
344698.305397[{'index': 1, 'value': 1.0}]-0.600039-1.7768880.949161[{'index': 1, 'value': 1.0}][{'index': 3, 'value': 1.0}]
364828.226949[{'index': 1, 'value': 1.0}]-0.129833-1.4233991.23938[{'index': 1, 'value': 1.0}][{'index': 3, 'value': 1.0}]
423430.58866[{'index': 1, 'value': 1.0}]-1.615684-0.514427-0.429379[{'index': 1, 'value': 1.0}][{'index': 1, 'value': 1.0}]
485314.260221[{'index': 1, 'value': 1.0}]0.415606-0.7164211.021716[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
615363.205372[{'index': 1, 'value': 1.0}]0.396797-1.1709071.457044[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
644855.908314[{'index': 1, 'value': 1.0}]0.434414-1.1204081.09427[{'index': 1, 'value': 1.0}][{'index': 3, 'value': 1.0}]
653413.100524[{'index': 2, 'value': 1.0}]-1.2207111.051024-1.445145[{'index': 1, 'value': 1.0}][{'index': 1, 'value': 1.0}]
683340.219002[{'index': 3, 'value': 1.0}]-1.484026-0.009443-1.009817[{'index': 1, 'value': 1.0}][{'index': 1, 'value': 1.0}]
704228.73157[{'index': 2, 'value': 1.0}]1.6381411.4045130.296168[{'index': 2, 'value': 1.0}][{'index': 2, 'value': 1.0}]
723811.538478[{'index': 2, 'value': 1.0}]0.8293870.142052-0.719598[{'index': 2, 'value': 1.0}][{'index': 2, 'value': 1.0}]
744659.770763[{'index': 1, 'value': 1.0}]-0.242683-1.5243960.586387[{'index': 1, 'value': 1.0}][{'index': 3, 'value': 1.0}]
773453.388804[{'index': 2, 'value': 1.0}]-1.277136-0.211437-0.647043[{'index': 1, 'value': 1.0}][{'index': 1, 'value': 1.0}]
814766.245033[{'index': 1, 'value': 1.0}]0.208715-1.2214050.804051[{'index': 1, 'value': 1.0}][{'index': 3, 'value': 1.0}]
914057.807281[{'index': 2, 'value': 1.0}]1.2619760.6470360.005949[{'index': 2, 'value': 1.0}][{'index': 2, 'value': 1.0}]
964739.827445[{'index': 1, 'value': 1.0}]0.246331-1.3224020.731497[{'index': 1, 'value': 1.0}][{'index': 3, 'value': 1.0}]
1053394.891976[{'index': 1, 'value': 1.0}]-1.8037660.445043-1.009817[{'index': 1, 'value': 1.0}][{'index': 1, 'value': 1.0}]
1113201.493683[{'index': 1, 'value': 1.0}]-1.1642860.697535-2.098138[{'index': 1, 'value': 1.0}][{'index': 1, 'value': 1.0}]
\n", - "

25 rows × 7 columns

\n", - "
[67 rows x 7 columns in total]" - ], - "text/plain": [ - " predicted_body_mass_g onehotencoded_island \\\n", - "penguin_id \n", - "1 3781.402407 [{'index': 3, 'value': 1.0}] \n", - "4 4124.107944 [{'index': 1, 'value': 1.0}] \n", - "8 4670.344196 [{'index': 1, 'value': 1.0}] \n", - "11 3529.417214 [{'index': 2, 'value': 1.0}] \n", - "13 4014.101714 [{'index': 1, 'value': 1.0}] \n", - "15 5212.41288 [{'index': 1, 'value': 1.0}] \n", - "16 4163.595615 [{'index': 3, 'value': 1.0}] \n", - "23 3392.453069 [{'index': 2, 'value': 1.0}] \n", - "34 4698.305397 [{'index': 1, 'value': 1.0}] \n", - "36 4828.226949 [{'index': 1, 'value': 1.0}] \n", - "42 3430.58866 [{'index': 1, 'value': 1.0}] \n", - "48 5314.260221 [{'index': 1, 'value': 1.0}] \n", - "61 5363.205372 [{'index': 1, 'value': 1.0}] \n", - "64 4855.908314 [{'index': 1, 'value': 1.0}] \n", - "65 3413.100524 [{'index': 2, 'value': 1.0}] \n", - "68 3340.219002 [{'index': 3, 'value': 1.0}] \n", - "70 4228.73157 [{'index': 2, 'value': 1.0}] \n", - "72 3811.538478 [{'index': 2, 'value': 1.0}] \n", - "74 4659.770763 [{'index': 1, 'value': 1.0}] \n", - "77 3453.388804 [{'index': 2, 'value': 1.0}] \n", - "81 4766.245033 [{'index': 1, 'value': 1.0}] \n", - "91 4057.807281 [{'index': 2, 'value': 1.0}] \n", - "96 4739.827445 [{'index': 1, 'value': 1.0}] \n", - "105 3394.891976 [{'index': 1, 'value': 1.0}] \n", - "111 3201.493683 [{'index': 1, 'value': 1.0}] \n", - "\n", - " standard_scaled_culmen_length_mm standard_scaled_culmen_depth_mm \\\n", - "penguin_id \n", - "1 -0.938587 0.748033 \n", - "4 -0.16745 0.899528 \n", - "8 0.453222 -1.877885 \n", - "11 -1.12667 0.697535 \n", - "13 -1.183094 1.404513 \n", - "15 0.867003 -0.766919 \n", - "16 -1.784958 1.959995 \n", - "23 -0.355532 0.647036 \n", - "34 -0.600039 -1.776888 \n", - "36 -0.129833 -1.423399 \n", - "42 -1.615684 -0.514427 \n", - "48 0.415606 -0.716421 \n", - "61 0.396797 -1.170907 \n", - "64 0.434414 -1.120408 \n", - "65 -1.220711 1.051024 \n", - "68 -1.484026 
-0.009443 \n", - "70 1.638141 1.404513 \n", - "72 0.829387 0.142052 \n", - "74 -0.242683 -1.524396 \n", - "77 -1.277136 -0.211437 \n", - "81 0.208715 -1.221405 \n", - "91 1.261976 0.647036 \n", - "96 0.246331 -1.322402 \n", - "105 -1.803766 0.445043 \n", - "111 -1.164286 0.697535 \n", - "\n", - " standard_scaled_flipper_length_mm onehotencoded_sex \\\n", - "penguin_id \n", - "1 -1.445145 [{'index': 2, 'value': 1.0}] \n", - "4 -0.284269 [{'index': 2, 'value': 1.0}] \n", - "8 0.658942 [{'index': 1, 'value': 1.0}] \n", - "11 -0.792152 [{'index': 1, 'value': 1.0}] \n", - "13 -0.792152 [{'index': 2, 'value': 1.0}] \n", - "15 0.513833 [{'index': 2, 'value': 1.0}] \n", - "16 -0.211715 [{'index': 2, 'value': 1.0}] \n", - "23 -1.5177 [{'index': 1, 'value': 1.0}] \n", - "34 0.949161 [{'index': 1, 'value': 1.0}] \n", - "36 1.23938 [{'index': 1, 'value': 1.0}] \n", - "42 -0.429379 [{'index': 1, 'value': 1.0}] \n", - "48 1.021716 [{'index': 2, 'value': 1.0}] \n", - "61 1.457044 [{'index': 2, 'value': 1.0}] \n", - "64 1.09427 [{'index': 1, 'value': 1.0}] \n", - "65 -1.445145 [{'index': 1, 'value': 1.0}] \n", - "68 -1.009817 [{'index': 1, 'value': 1.0}] \n", - "70 0.296168 [{'index': 2, 'value': 1.0}] \n", - "72 -0.719598 [{'index': 2, 'value': 1.0}] \n", - "74 0.586387 [{'index': 1, 'value': 1.0}] \n", - "77 -0.647043 [{'index': 1, 'value': 1.0}] \n", - "81 0.804051 [{'index': 1, 'value': 1.0}] \n", - "91 0.005949 [{'index': 2, 'value': 1.0}] \n", - "96 0.731497 [{'index': 1, 'value': 1.0}] \n", - "105 -1.009817 [{'index': 1, 'value': 1.0}] \n", - "111 -2.098138 [{'index': 1, 'value': 1.0}] \n", - "\n", - " onehotencoded_species \n", - "penguin_id \n", - "1 [{'index': 1, 'value': 1.0}] \n", - "4 [{'index': 1, 'value': 1.0}] \n", - "8 [{'index': 3, 'value': 1.0}] \n", - "11 [{'index': 1, 'value': 1.0}] \n", - "13 [{'index': 1, 'value': 1.0}] \n", - "15 [{'index': 3, 'value': 1.0}] \n", - "16 [{'index': 1, 'value': 1.0}] \n", - "23 [{'index': 1, 'value': 1.0}] \n", - "34 
[{'index': 3, 'value': 1.0}] \n", - "36 [{'index': 3, 'value': 1.0}] \n", - "42 [{'index': 1, 'value': 1.0}] \n", - "48 [{'index': 3, 'value': 1.0}] \n", - "61 [{'index': 3, 'value': 1.0}] \n", - "64 [{'index': 3, 'value': 1.0}] \n", - "65 [{'index': 1, 'value': 1.0}] \n", - "68 [{'index': 1, 'value': 1.0}] \n", - "70 [{'index': 2, 'value': 1.0}] \n", - "72 [{'index': 2, 'value': 1.0}] \n", - "74 [{'index': 3, 'value': 1.0}] \n", - "77 [{'index': 1, 'value': 1.0}] \n", - "81 [{'index': 3, 'value': 1.0}] \n", - "91 [{'index': 2, 'value': 1.0}] \n", - "96 [{'index': 3, 'value': 1.0}] \n", - "105 [{'index': 1, 'value': 1.0}] \n", - "111 [{'index': 1, 'value': 1.0}] \n", - "\n", - "[67 rows x 7 columns]" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from bigframes.ml.linear_model import LinearRegression\n", - "\n", - "linreg = LinearRegression()\n", - "\n", - "# Learn from the training data how to predict output y\n", - "linreg.fit(processed_X_train, y_train)\n", - "\n", - "# Predict y for the test data\n", - "predicted_y_test = linreg.predict(processed_X_test)\n", - "\n", - "predicted_y_test" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Unsupervised predictors\n", - "\n", - "In unsupervised learning, there are no known outputs in the training data, instead the model learns on input data alone and predicts something else. An example of an unsupervised predictor is `bigframes.ml.cluster.KMeans`, which learns how to fit input data to a target number of clusters." - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "Query job 027042f1-9a18-43d8-a378-ab9410e395b1 is DONE. 23.5 kB processed. 
Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 6c8484a0-a504-4e50-93d6-3d247c9ff558 is DONE. 0 Bytes processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job e81ca2de-df2e-41ec-af86-14f8dcec1b44 is DONE. 6.2 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 3e6d413c-f8c4-4390-95eb-3a1f5bc59aed is DONE. 536 Bytes processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job e448220d-0c50-45b7-bcbe-d1159b3d18ce is DONE. 0 Bytes processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job e167a234-828d-4f05-8654-63cf97e50ba3 is DONE. 10.2 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " 
\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
CENTROID_IDNEAREST_CENTROIDS_DISTANCEonehotencoded_islandstandard_scaled_culmen_length_mmstandard_scaled_culmen_depth_mmstandard_scaled_flipper_length_mmonehotencoded_sexonehotencoded_species
penguin_id
13[{'CENTROID_ID': 3, 'DISTANCE': 1.236380597035...[{'index': 3, 'value': 1.0}]-0.9385870.748033-1.445145[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
43[{'CENTROID_ID': 3, 'DISTANCE': 1.039497631856...[{'index': 1, 'value': 1.0}]-0.167450.899528-0.284269[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
81[{'CENTROID_ID': 1, 'DISTANCE': 1.171040485975...[{'index': 1, 'value': 1.0}]0.453222-1.8778850.658942[{'index': 1, 'value': 1.0}][{'index': 3, 'value': 1.0}]
112[{'CENTROID_ID': 2, 'DISTANCE': 0.969102754012...[{'index': 2, 'value': 1.0}]-1.126670.697535-0.792152[{'index': 1, 'value': 1.0}][{'index': 1, 'value': 1.0}]
133[{'CENTROID_ID': 3, 'DISTANCE': 1.113138945949...[{'index': 1, 'value': 1.0}]-1.1830941.404513-0.792152[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
151[{'CENTROID_ID': 1, 'DISTANCE': 1.070996026772...[{'index': 1, 'value': 1.0}]0.867003-0.7669190.513833[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
163[{'CENTROID_ID': 3, 'DISTANCE': 1.780136190720...[{'index': 3, 'value': 1.0}]-1.7849581.959995-0.211715[{'index': 2, 'value': 1.0}][{'index': 1, 'value': 1.0}]
232[{'CENTROID_ID': 2, 'DISTANCE': 1.382540667483...[{'index': 2, 'value': 1.0}]-0.3555320.647036-1.5177[{'index': 1, 'value': 1.0}][{'index': 1, 'value': 1.0}]
341[{'CENTROID_ID': 1, 'DISTANCE': 1.598627908302...[{'index': 1, 'value': 1.0}]-0.600039-1.7768880.949161[{'index': 1, 'value': 1.0}][{'index': 3, 'value': 1.0}]
361[{'CENTROID_ID': 1, 'DISTANCE': 1.095162305190...[{'index': 1, 'value': 1.0}]-0.129833-1.4233991.23938[{'index': 1, 'value': 1.0}][{'index': 3, 'value': 1.0}]
422[{'CENTROID_ID': 2, 'DISTANCE': 1.275841743930...[{'index': 1, 'value': 1.0}]-1.615684-0.514427-0.429379[{'index': 1, 'value': 1.0}][{'index': 1, 'value': 1.0}]
481[{'CENTROID_ID': 1, 'DISTANCE': 0.882209023196...[{'index': 1, 'value': 1.0}]0.415606-0.7164211.021716[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
611[{'CENTROID_ID': 1, 'DISTANCE': 0.816202832282...[{'index': 1, 'value': 1.0}]0.396797-1.1709071.457044[{'index': 2, 'value': 1.0}][{'index': 3, 'value': 1.0}]
641[{'CENTROID_ID': 1, 'DISTANCE': 0.735435721625...[{'index': 1, 'value': 1.0}]0.434414-1.1204081.09427[{'index': 1, 'value': 1.0}][{'index': 3, 'value': 1.0}]
652[{'CENTROID_ID': 2, 'DISTANCE': 1.292559869148...[{'index': 2, 'value': 1.0}]-1.2207111.051024-1.445145[{'index': 1, 'value': 1.0}][{'index': 1, 'value': 1.0}]
682[{'CENTROID_ID': 2, 'DISTANCE': 0.876430138449...[{'index': 3, 'value': 1.0}]-1.484026-0.009443-1.009817[{'index': 1, 'value': 1.0}][{'index': 1, 'value': 1.0}]
704[{'CENTROID_ID': 4, 'DISTANCE': 1.314229913955...[{'index': 2, 'value': 1.0}]1.6381411.4045130.296168[{'index': 2, 'value': 1.0}][{'index': 2, 'value': 1.0}]
724[{'CENTROID_ID': 4, 'DISTANCE': 0.938569518009...[{'index': 2, 'value': 1.0}]0.8293870.142052-0.719598[{'index': 2, 'value': 1.0}][{'index': 2, 'value': 1.0}]
741[{'CENTROID_ID': 1, 'DISTANCE': 1.350320088546...[{'index': 1, 'value': 1.0}]-0.242683-1.5243960.586387[{'index': 1, 'value': 1.0}][{'index': 3, 'value': 1.0}]
772[{'CENTROID_ID': 2, 'DISTANCE': 0.904806634663...[{'index': 2, 'value': 1.0}]-1.277136-0.211437-0.647043[{'index': 1, 'value': 1.0}][{'index': 1, 'value': 1.0}]
811[{'CENTROID_ID': 1, 'DISTANCE': 0.919082578073...[{'index': 1, 'value': 1.0}]0.208715-1.2214050.804051[{'index': 1, 'value': 1.0}][{'index': 3, 'value': 1.0}]
914[{'CENTROID_ID': 4, 'DISTANCE': 0.760360038086...[{'index': 2, 'value': 1.0}]1.2619760.6470360.005949[{'index': 2, 'value': 1.0}][{'index': 2, 'value': 1.0}]
961[{'CENTROID_ID': 1, 'DISTANCE': 0.950188657227...[{'index': 1, 'value': 1.0}]0.246331-1.3224020.731497[{'index': 1, 'value': 1.0}][{'index': 3, 'value': 1.0}]
1052[{'CENTROID_ID': 2, 'DISTANCE': 1.101316467029...[{'index': 1, 'value': 1.0}]-1.8037660.445043-1.009817[{'index': 1, 'value': 1.0}][{'index': 1, 'value': 1.0}]
1112[{'CENTROID_ID': 2, 'DISTANCE': 1.549061068385...[{'index': 1, 'value': 1.0}]-1.1642860.697535-2.098138[{'index': 1, 'value': 1.0}][{'index': 1, 'value': 1.0}]
\n", - "

25 rows × 8 columns

\n", - "
[67 rows x 8 columns in total]" - ], - "text/plain": [ - " CENTROID_ID NEAREST_CENTROIDS_DISTANCE \\\n", - "penguin_id \n", - "1 3 [{'CENTROID_ID': 3, 'DISTANCE': 1.236380597035... \n", - "4 3 [{'CENTROID_ID': 3, 'DISTANCE': 1.039497631856... \n", - "8 1 [{'CENTROID_ID': 1, 'DISTANCE': 1.171040485975... \n", - "11 2 [{'CENTROID_ID': 2, 'DISTANCE': 0.969102754012... \n", - "13 3 [{'CENTROID_ID': 3, 'DISTANCE': 1.113138945949... \n", - "15 1 [{'CENTROID_ID': 1, 'DISTANCE': 1.070996026772... \n", - "16 3 [{'CENTROID_ID': 3, 'DISTANCE': 1.780136190720... \n", - "23 2 [{'CENTROID_ID': 2, 'DISTANCE': 1.382540667483... \n", - "34 1 [{'CENTROID_ID': 1, 'DISTANCE': 1.598627908302... \n", - "36 1 [{'CENTROID_ID': 1, 'DISTANCE': 1.095162305190... \n", - "42 2 [{'CENTROID_ID': 2, 'DISTANCE': 1.275841743930... \n", - "48 1 [{'CENTROID_ID': 1, 'DISTANCE': 0.882209023196... \n", - "61 1 [{'CENTROID_ID': 1, 'DISTANCE': 0.816202832282... \n", - "64 1 [{'CENTROID_ID': 1, 'DISTANCE': 0.735435721625... \n", - "65 2 [{'CENTROID_ID': 2, 'DISTANCE': 1.292559869148... \n", - "68 2 [{'CENTROID_ID': 2, 'DISTANCE': 0.876430138449... \n", - "70 4 [{'CENTROID_ID': 4, 'DISTANCE': 1.314229913955... \n", - "72 4 [{'CENTROID_ID': 4, 'DISTANCE': 0.938569518009... \n", - "74 1 [{'CENTROID_ID': 1, 'DISTANCE': 1.350320088546... \n", - "77 2 [{'CENTROID_ID': 2, 'DISTANCE': 0.904806634663... \n", - "81 1 [{'CENTROID_ID': 1, 'DISTANCE': 0.919082578073... \n", - "91 4 [{'CENTROID_ID': 4, 'DISTANCE': 0.760360038086... \n", - "96 1 [{'CENTROID_ID': 1, 'DISTANCE': 0.950188657227... \n", - "105 2 [{'CENTROID_ID': 2, 'DISTANCE': 1.101316467029... \n", - "111 2 [{'CENTROID_ID': 2, 'DISTANCE': 1.549061068385... 
\n", - "\n", - " onehotencoded_island standard_scaled_culmen_length_mm \\\n", - "penguin_id \n", - "1 [{'index': 3, 'value': 1.0}] -0.938587 \n", - "4 [{'index': 1, 'value': 1.0}] -0.16745 \n", - "8 [{'index': 1, 'value': 1.0}] 0.453222 \n", - "11 [{'index': 2, 'value': 1.0}] -1.12667 \n", - "13 [{'index': 1, 'value': 1.0}] -1.183094 \n", - "15 [{'index': 1, 'value': 1.0}] 0.867003 \n", - "16 [{'index': 3, 'value': 1.0}] -1.784958 \n", - "23 [{'index': 2, 'value': 1.0}] -0.355532 \n", - "34 [{'index': 1, 'value': 1.0}] -0.600039 \n", - "36 [{'index': 1, 'value': 1.0}] -0.129833 \n", - "42 [{'index': 1, 'value': 1.0}] -1.615684 \n", - "48 [{'index': 1, 'value': 1.0}] 0.415606 \n", - "61 [{'index': 1, 'value': 1.0}] 0.396797 \n", - "64 [{'index': 1, 'value': 1.0}] 0.434414 \n", - "65 [{'index': 2, 'value': 1.0}] -1.220711 \n", - "68 [{'index': 3, 'value': 1.0}] -1.484026 \n", - "70 [{'index': 2, 'value': 1.0}] 1.638141 \n", - "72 [{'index': 2, 'value': 1.0}] 0.829387 \n", - "74 [{'index': 1, 'value': 1.0}] -0.242683 \n", - "77 [{'index': 2, 'value': 1.0}] -1.277136 \n", - "81 [{'index': 1, 'value': 1.0}] 0.208715 \n", - "91 [{'index': 2, 'value': 1.0}] 1.261976 \n", - "96 [{'index': 1, 'value': 1.0}] 0.246331 \n", - "105 [{'index': 1, 'value': 1.0}] -1.803766 \n", - "111 [{'index': 1, 'value': 1.0}] -1.164286 \n", - "\n", - " standard_scaled_culmen_depth_mm \\\n", - "penguin_id \n", - "1 0.748033 \n", - "4 0.899528 \n", - "8 -1.877885 \n", - "11 0.697535 \n", - "13 1.404513 \n", - "15 -0.766919 \n", - "16 1.959995 \n", - "23 0.647036 \n", - "34 -1.776888 \n", - "36 -1.423399 \n", - "42 -0.514427 \n", - "48 -0.716421 \n", - "61 -1.170907 \n", - "64 -1.120408 \n", - "65 1.051024 \n", - "68 -0.009443 \n", - "70 1.404513 \n", - "72 0.142052 \n", - "74 -1.524396 \n", - "77 -0.211437 \n", - "81 -1.221405 \n", - "91 0.647036 \n", - "96 -1.322402 \n", - "105 0.445043 \n", - "111 0.697535 \n", - "\n", - " standard_scaled_flipper_length_mm onehotencoded_sex \\\n", - 
"penguin_id \n", - "1 -1.445145 [{'index': 2, 'value': 1.0}] \n", - "4 -0.284269 [{'index': 2, 'value': 1.0}] \n", - "8 0.658942 [{'index': 1, 'value': 1.0}] \n", - "11 -0.792152 [{'index': 1, 'value': 1.0}] \n", - "13 -0.792152 [{'index': 2, 'value': 1.0}] \n", - "15 0.513833 [{'index': 2, 'value': 1.0}] \n", - "16 -0.211715 [{'index': 2, 'value': 1.0}] \n", - "23 -1.5177 [{'index': 1, 'value': 1.0}] \n", - "34 0.949161 [{'index': 1, 'value': 1.0}] \n", - "36 1.23938 [{'index': 1, 'value': 1.0}] \n", - "42 -0.429379 [{'index': 1, 'value': 1.0}] \n", - "48 1.021716 [{'index': 2, 'value': 1.0}] \n", - "61 1.457044 [{'index': 2, 'value': 1.0}] \n", - "64 1.09427 [{'index': 1, 'value': 1.0}] \n", - "65 -1.445145 [{'index': 1, 'value': 1.0}] \n", - "68 -1.009817 [{'index': 1, 'value': 1.0}] \n", - "70 0.296168 [{'index': 2, 'value': 1.0}] \n", - "72 -0.719598 [{'index': 2, 'value': 1.0}] \n", - "74 0.586387 [{'index': 1, 'value': 1.0}] \n", - "77 -0.647043 [{'index': 1, 'value': 1.0}] \n", - "81 0.804051 [{'index': 1, 'value': 1.0}] \n", - "91 0.005949 [{'index': 2, 'value': 1.0}] \n", - "96 0.731497 [{'index': 1, 'value': 1.0}] \n", - "105 -1.009817 [{'index': 1, 'value': 1.0}] \n", - "111 -2.098138 [{'index': 1, 'value': 1.0}] \n", - "\n", - " onehotencoded_species \n", - "penguin_id \n", - "1 [{'index': 1, 'value': 1.0}] \n", - "4 [{'index': 1, 'value': 1.0}] \n", - "8 [{'index': 3, 'value': 1.0}] \n", - "11 [{'index': 1, 'value': 1.0}] \n", - "13 [{'index': 1, 'value': 1.0}] \n", - "15 [{'index': 3, 'value': 1.0}] \n", - "16 [{'index': 1, 'value': 1.0}] \n", - "23 [{'index': 1, 'value': 1.0}] \n", - "34 [{'index': 3, 'value': 1.0}] \n", - "36 [{'index': 3, 'value': 1.0}] \n", - "42 [{'index': 1, 'value': 1.0}] \n", - "48 [{'index': 3, 'value': 1.0}] \n", - "61 [{'index': 3, 'value': 1.0}] \n", - "64 [{'index': 3, 'value': 1.0}] \n", - "65 [{'index': 1, 'value': 1.0}] \n", - "68 [{'index': 1, 'value': 1.0}] \n", - "70 [{'index': 2, 'value': 1.0}] \n", - "72 
[{'index': 2, 'value': 1.0}] \n", - "74 [{'index': 3, 'value': 1.0}] \n", - "77 [{'index': 1, 'value': 1.0}] \n", - "81 [{'index': 3, 'value': 1.0}] \n", - "91 [{'index': 2, 'value': 1.0}] \n", - "96 [{'index': 3, 'value': 1.0}] \n", - "105 [{'index': 1, 'value': 1.0}] \n", - "111 [{'index': 1, 'value': 1.0}] \n", - "\n", - "[67 rows x 8 columns]" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from bigframes.ml.cluster import KMeans\n", - "\n", - "kmeans = KMeans(n_clusters=4)\n", - "\n", - "kmeans.fit(processed_X_train)\n", - "\n", - "kmeans.predict(processed_X_test)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Pipelines\n", - "\n", - "Transfomers and predictors can be chained into a single estimator component using `bigframes.ml.pipeline.Pipeline`:" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "Pipeline(steps=[('preproc',\n", - " ColumnTransformer(transformers=[('scale', StandardScaler(),\n", - " ['culmen_length_mm',\n", - " 'culmen_depth_mm',\n", - " 'flipper_length_mm']),\n", - " ('encode', OneHotEncoder(),\n", - " ['species', 'sex',\n", - " 'island'])])),\n", - " ('linreg', LinearRegression())])" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from bigframes.ml.pipeline import Pipeline\n", - "\n", - "pipeline = Pipeline([\n", - " ('preproc', preproc),\n", - " ('linreg', linreg)\n", - "])\n", - "\n", - "# Print our pipeline\n", - "pipeline" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The pipeline simplifies the workflow by applying each of its component steps automatically:" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "Query job 
b11be0d8-e6f1-41cb-8cb2-25a38e7ef311 is DONE. 24.7 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job f32ea25c-be39-4726-a8f5-604ae83849a6 is DONE. 8.5 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 86e29b78-76f5-4937-8bde-407b99af04a2 is DONE. 0 Bytes processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job ca819734-0d41-4d9e-b743-09edae8c7fee is DONE. 29.6 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 49bb5bed-cc84-47e0-9a90-08ab01e00548 is DONE. 536 Bytes processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 1e40a085-2289-47dd-afd8-820413186b9f is DONE. 0 Bytes processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 60319296-a480-4f51-b7ad-190ac6de963a is DONE. 6.2 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
[67 rows x 7 columns in total]" - ], - "text/plain": [ - " predicted_body_mass_g island culmen_length_mm \\\n", - "penguin_id \n", - "1 3781.396682 Torgersen 39.1 \n", - "4 4124.102574 Biscoe 43.2 \n", - "8 4670.338389 Biscoe 46.5 \n", - "11 3529.411644 Dream 38.1 \n", - "13 4014.09632 Biscoe 37.8 \n", - "15 5212.407319 Biscoe 48.7 \n", - "16 4163.590502 Torgersen 34.6 \n", - "23 3392.44731 Dream 42.2 \n", - "34 4698.299674 Biscoe 40.9 \n", - "36 4828.221398 Biscoe 43.4 \n", - "42 3430.582874 Biscoe 35.5 \n", - "48 5314.254798 Biscoe 46.3 \n", - "61 5363.19995 Biscoe 46.2 \n", - "64 4855.90281 Biscoe 46.4 \n", - "65 3413.094869 Dream 37.6 \n", - "68 3340.213193 Torgersen 36.2 \n", - "70 4228.726508 Dream 52.8 \n", - "72 3811.532821 Dream 48.5 \n", - "74 4659.765013 Biscoe 42.8 \n", - "77 3453.383042 Dream 37.3 \n", - "81 4766.239424 Biscoe 45.2 \n", - "91 4057.801947 Dream 50.8 \n", - "96 4739.821792 Biscoe 45.4 \n", - "105 3394.886275 Biscoe 34.5 \n", - "111 3201.48777 Biscoe 37.9 \n", - "\n", - " culmen_depth_mm flipper_length_mm sex \\\n", - "penguin_id \n", - "1 18.7 181.0 MALE \n", - "4 19.0 197.0 MALE \n", - "8 13.5 210.0 FEMALE \n", - "11 18.6 190.0 FEMALE \n", - "13 20.0 190.0 MALE \n", - "15 15.7 208.0 MALE \n", - "16 21.1 198.0 MALE \n", - "23 18.5 180.0 FEMALE \n", - "34 13.7 214.0 FEMALE \n", - "36 14.4 218.0 FEMALE \n", - "42 16.2 195.0 FEMALE \n", - "48 15.8 215.0 MALE \n", - "61 14.9 221.0 MALE \n", - "64 15.0 216.0 FEMALE \n", - "65 19.3 181.0 FEMALE \n", - "68 17.2 187.0 FEMALE \n", - "70 20.0 205.0 MALE \n", - "72 17.5 191.0 MALE \n", - "74 14.2 209.0 FEMALE \n", - "77 16.8 192.0 FEMALE \n", - "81 14.8 212.0 FEMALE \n", - "91 18.5 201.0 MALE \n", - "96 14.6 211.0 FEMALE \n", - "105 18.1 187.0 FEMALE \n", - "111 18.6 172.0 FEMALE \n", - "\n", - " species \n", - "penguin_id \n", - "1 Adelie Penguin (Pygoscelis adeliae) \n", - "4 Adelie Penguin (Pygoscelis adeliae) \n", - "8 Gentoo penguin (Pygoscelis papua) \n", - "11 Adelie Penguin (Pygoscelis 
adeliae) \n", - "13 Adelie Penguin (Pygoscelis adeliae) \n", - "15 Gentoo penguin (Pygoscelis papua) \n", - "16 Adelie Penguin (Pygoscelis adeliae) \n", - "23 Adelie Penguin (Pygoscelis adeliae) \n", - "34 Gentoo penguin (Pygoscelis papua) \n", - "36 Gentoo penguin (Pygoscelis papua) \n", - "42 Adelie Penguin (Pygoscelis adeliae) \n", - "48 Gentoo penguin (Pygoscelis papua) \n", - "61 Gentoo penguin (Pygoscelis papua) \n", - "64 Gentoo penguin (Pygoscelis papua) \n", - "65 Adelie Penguin (Pygoscelis adeliae) \n", - "68 Adelie Penguin (Pygoscelis adeliae) \n", - "70 Chinstrap penguin (Pygoscelis antarctica) \n", - "72 Chinstrap penguin (Pygoscelis antarctica) \n", - "74 Gentoo penguin (Pygoscelis papua) \n", - "77 Adelie Penguin (Pygoscelis adeliae) \n", - "81 Gentoo penguin (Pygoscelis papua) \n", - "91 Chinstrap penguin (Pygoscelis antarctica) \n", - "96 Gentoo penguin (Pygoscelis papua) \n", - "105 Adelie Penguin (Pygoscelis adeliae) \n", - "111 Adelie Penguin (Pygoscelis adeliae) \n", - "\n", - "[67 rows x 7 columns]" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "pipeline.fit(X_train, y_train)\n", - "\n", - "predicted_y_test = pipeline.predict(X_test)\n", - "predicted_y_test" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In the backend, a pipeline will actually be compiled into a single model with an embedded TRANSFORM step." - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Evaluating results\n", - "\n", - "Some models include a convenient `.score(X, y)` method for evaulation with a preset accuracy metric:" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "Query job c02fb597-8d5a-42ca-9185-03b59c5ef2f9 is DONE. 29.6 kB processed. 
Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 7f1f565b-0f73-4a4e-b33f-8484fa260838 is DONE. 0 Bytes processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job d4b9d4a6-d75e-46e1-b092-ab58e8aef890 is DONE. 48 Bytes processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
[1 rows x 6 columns in total]" - ], - "text/plain": [ - " mean_absolute_error mean_squared_error mean_squared_log_error \\\n", - "0 216.444357 72639.698707 0.00463 \n", - "\n", - " median_absolute_error r2_score explained_variance \n", - "0 170.588356 0.896396 0.900547 \n", - "\n", - "[1 rows x 6 columns]" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# In the case of a pipeline, this will be equivalent to calling .score on the contained LinearRegression\n", - "pipeline.score(X_test, y_test)" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For a more general approach, the library `bigframes.ml.metrics` is provided:" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "Query job 73448ee8-698b-435f-b11e-6fe2de3bcd8d is DONE. 28.9 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job e002f59d-a03c-4ec9-a85a-93adbfd7bd17 is DONE. 28.9 kB processed. Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "Query job 4ab1febc-fb55-473a-b295-69e4329cc5f0 is DONE. 30.0 kB processed. 
Open Job" - ], - "text/plain": [ - "" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/plain": [ - "0.8963962044533755" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from bigframes.ml.metrics import r2_score\n", - "\n", - "r2_score(y_test, predicted_y_test[\"predicted_body_mass_g\"])" - ] - }, - { - "attachments": {}, - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Save/Load to BigQuery\n", - "\n", - "Estimators can be saved to BigQuery as BQML models, and loaded again in future.\n", - "\n", - "Saving requires `bigquery.tables.create` permission, and loading requires `bigquery.models.getMetadata` permission.\n", - "These permissions can be at project level or the dataset level.\n", - "\n", - "If you have those permissions, please go ahead and uncomment the code in the following cells and run." - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [], - "source": [ - "# # Replace with a path where you have permission to save a model\n", - "# model_name = \"bigframes-dev.bqml_tutorial.penguins_model\"\n", - "\n", - "# linreg.to_gbq(model_name, replace=True)" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [], - "source": [ - "# # WARNING - until b/281709360 is fixed & pipeline is updated, pipelines will load as models,\n", - "# # and details of their transform steps will be lost (the loaded model will behave the same)\n", - "# bigframes.pandas.read_gbq_model(model_name)" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.9" - }, - "orig_nbformat": 4, - 
"vscode": { - "interpreter": { - "hash": "a850322d07d9bdc9ec5f301d307e048bcab2390ae395e1cbce9335f4e081e5e2" - } - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/notebooks/getting_started/ml_fundamentals_bq_dataframes.ipynb b/notebooks/getting_started/ml_fundamentals_bq_dataframes.ipynb new file mode 100644 index 0000000000..089c167d39 --- /dev/null +++ b/notebooks/getting_started/ml_fundamentals_bq_dataframes.ipynb @@ -0,0 +1,970 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ur8xi4C7S06n" + }, + "outputs": [], + "source": [ + "# Copyright 2024 Google LLC\n", + "#\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JAPoU8Sm5E6e" + }, + "source": [ + "# Machine Learning Fundamentals with BigQuery DataFrames\n", + "\n", + "\n", + "\n", + " \n", + " \n", + " \n", + "
\n", + " Run in Colab\n", + " View on GitHub\n", + " Open in Vertex AI Workbench\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "24743cf4a1e1" + }, + "source": [ + "**_NOTE_**: This notebook has been tested in the following environment:\n", + "\n", + "* Python version = 3.10" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tvgnzT1CKxrO" + }, + "source": [ + "## Overview\n", + "\n", + "The `bigframes.ml` module implements Scikit-Learn's machine learning API in\n", + "BigQuery DataFrames. It exposes BigQuery's ML capabilities in a simple, popular\n", + "API that works seamlessly with the rest of the BigQuery DataFrames API.\n", + "\n", + "Learn more about [BigQuery DataFrames](https://cloud.google.com/python/docs/reference/bigframes/latest)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "d975e698c9a4" + }, + "source": [ + "### Objective\n", + "\n", + "In this tutorial, you will walk through an end-to-end machine learning workflow using BigQuery DataFrames. You will load data, manipulate and prepare it for model training, build supervised and unsupervised models, and evaluate and save a model for future use; all using built-in BigQuery DataFrames functionality." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "08d289fa873f" + }, + "source": [ + "### Dataset\n", + "\n", + "This tutorial uses the [```penguins``` table](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=ml_datasets&t=penguins) (a BigQuery public dataset), which contains data on a set of penguins including species, island of residence, weight, culmen length and depth, flipper length, and sex." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aed92deeb4a0" + }, + "source": [ + "### Costs\n", + "\n", + "This tutorial uses billable components of Google Cloud:\n", + "\n", + "* BigQuery (storage and compute)\n", + "* BigQuery ML\n", + "\n", + "Learn about [BigQuery storage pricing](https://cloud.google.com/bigquery/pricing#storage),\n", + "[BigQuery compute pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing_models),\n", + "and [BigQuery ML pricing](https://cloud.google.com/bigquery/pricing#bqml),\n", + "and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)\n", + "to generate a cost estimate based on your projected usage." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i7EUnXsZhAGF" + }, + "source": [ + "## Installation\n", + "\n", + "Depending on your Jupyter environment, you might have to install packages." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NRTcBQPZpKWd" + }, + "source": [ + "**Vertex AI Workbench or Colab**\n", + "\n", + "Do nothing, BigQuery DataFrames package is already installed." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bdOJtFo1pRnc" + }, + "source": [ + "**Local JupyterLab instance**\n", + "\n", + "Uncomment and run the following cell:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "mfPoOwPLGpSr" + }, + "outputs": [], + "source": [ + "# !pip install bigframes" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BF1j6f9HApxa" + }, + "source": [ + "## Before you begin\n", + "\n", + "Complete the tasks in this section to set up your environment." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Yq7zKYWelRQP" + }, + "source": [ + "### Set up your Google Cloud project\n", + "\n", + "**The following steps are required, regardless of your notebook environment.**\n", + "\n", + "1. 
[Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 credit towards your compute/storage costs.\n", + "\n", + "2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n", + "\n", + "3. [Click here](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com) to enable the BigQuery API.\n", + "\n", + "4. If you are running this notebook locally, install the [Cloud SDK](https://cloud.google.com/sdk)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WReHDGG5g0XY" + }, + "source": [ + "#### Set your project ID\n", + "\n", + "If you don't know your project ID, try the following:\n", + "* Run `gcloud config list`.\n", + "* Run `gcloud projects list`.\n", + "* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oM1iC_MfAts1" + }, + "outputs": [], + "source": [ + "PROJECT_ID = \"\" # @param {type:\"string\"}\n", + "\n", + "# Set the project id\n", + "! gcloud config set project {PROJECT_ID}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "region" + }, + "source": [ + "#### Set the region\n", + "\n", + "You can also change the `REGION` variable used by BigQuery. Learn more about [BigQuery regions](https://cloud.google.com/bigquery/docs/locations#supported_locations)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "eF-Twtc4XGem" + }, + "outputs": [], + "source": [ + "REGION = \"US\" # @param {type: \"string\"}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XcW9adriUQRc" + }, + "source": [ + "#### Set the dataset ID\n", + "\n", + "As part of this notebook, you will save BigQuery ML models to your Google Cloud project, which requires a dataset. 
Create the dataset, if needed, and provide the ID here as the `DATASET` variable used by BigQuery. Learn how to create a [BigQuery dataset](https://cloud.google.com/bigquery/docs/datasets)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "BbMh9JHvUHAn" + }, + "outputs": [], + "source": [ + "DATASET = \"\" # @param {type: \"string\"}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NwxfWoR5UGwO" + }, + "source": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sBCra4QMA2wR" + }, + "source": [ + "### Authenticate your Google Cloud account\n", + "\n", + "Depending on your Jupyter environment, you might have to manually authenticate. Follow the relevant instructions below." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "74ccc9e52986" + }, + "source": [ + "**Vertex AI Workbench**\n", + "\n", + "Do nothing, you are already authenticated." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "de775a3773ba" + }, + "source": [ + "**Local JupyterLab instance**\n", + "\n", + "Uncomment and run the following cell:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "254614fa0c46" + }, + "outputs": [], + "source": [ + "# ! 
gcloud auth login" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ef21552ccea8" + }, + "source": [ + "**Colab**\n", + "\n", + "Uncomment and run the following cell:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "603adbbf0532" + }, + "outputs": [], + "source": [ + "# from google.colab import auth\n", + "# auth.authenticate_user()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "960505627ddf" + }, + "source": [ + "### Import libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PyQmSRbKA8r-" + }, + "outputs": [], + "source": [ + "import bigframes.pandas as bf" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "init_aip:mbsdk,all" + }, + "source": [ + "\n", + "### Set BigQuery DataFrames options" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NPPMuw2PXGeo" + }, + "outputs": [], + "source": [ + "bf.options.bigquery.project = PROJECT_ID\n", + "bf.options.bigquery.location = REGION" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pDfrKwMKE_dK" + }, + "source": [ + "If you want to reset the location of the created DataFrame or Series objects, reset the session by executing `bf.reset_session()`. After that, you can reuse `bf.options.bigquery.location` to specify another location." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LjfRpSruzg5j" + }, + "source": [ + "## Import data into BigQuery DataFrames\n", + "\n", + "You can create a DataFrame by reading data from a BigQuery table." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "d86W4hNqzZJb" + }, + "outputs": [], + "source": [ + "df = bf.read_gbq(\"bigquery-public-data.ml_datasets.penguins\")\n", + "df = df.dropna()\n", + "\n", + "# BigQuery DataFrames creates a default numbered index, which we can give a name\n", + "df.index.name = \"penguin_id\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pDfCJ6-LkRB1" + }, + "source": [ + "Take a look at a few rows of the DataFrame:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "arGaUZVWkSwT" + }, + "outputs": [], + "source": [ + "df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WkUIcMXPkahu" + }, + "source": [ + "## Clean and prepare data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DScncEoDkiTG" + }, + "source": [ + "We're going to start with supervised learning, where a Linear Regression model will learn to predict the body mass (output variable `y`) using input features such as flipper length, sex, species, and more (features `X`)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "B9mW93o9z_-L" + }, + "outputs": [], + "source": [ + "# Isolate input features and output variable into DataFrames\n", + "X = df[['island', 'culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'sex', 'species']]\n", + "y = df[['body_mass_g']]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wkw0Cs62k_cl" + }, + "source": [ + "Part of preparing data for a machine learning task is splitting it into subsets for training and testing to ensure that the solution is not overfitting. By default, BQML will automatically manage splitting the data for you. 
However, BQML also supports manually splitting out your training data.\n", + "\n", + "Performing a manual data split can be done with `bigframes.ml.model_selection.train_test_split` like so:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NysWAWmvlAxB" + }, + "outputs": [], + "source": [ + "from bigframes.ml.model_selection import train_test_split\n", + "\n", + "# This will split X and y into test and training sets, with 20% of the rows in the test set,\n", + "# and the rest in the training set\n", + "X_train, X_test, y_train, y_test = train_test_split(\n", + " X, y, test_size=0.2)\n", + "\n", + "# Show the shape of the data after the split\n", + "print(f\"\"\"X_train shape: {X_train.shape}\n", + "X_test shape: {X_test.shape}\n", + "y_train shape: {y_train.shape}\n", + "y_test shape: {y_test.shape}\"\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "faFnVnNolydu" + }, + "source": [ + "If we look at the data, we can see that random rows were selected for\n", + "each side of the split:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "f8bz1HwLlyLP" + }, + "outputs": [], + "source": [ + "X_test.head(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v4ic7GQEl67Y" + }, + "source": [ + "Note that `y_test` contains the same rows as `X_test`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "PflbhKGkl8v2" + }, + "outputs": [], + "source": [ + "y_test.head(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Dkf52IdvmSaj" + }, + "source": [ + "## Estimators\n", + "\n", + "Following Scikit-Learn, all learning components are \"estimators\": objects that can learn from training data and then apply themselves to new data. 
Estimators share the following patterns:\n", + "\n", + "- a constructor that takes a list of parameters\n", + "- a standard string representation that shows the class name and all non-default parameters, e.g. `LinearRegression(fit_intercept=False)`\n", + "- a `.fit(...)` method to fit the estimator to training data\n", + "\n", + "These estimators can be further broken down into two main subtypes:\n", + " 1. Transformers\n", + " 2. Predictors\n", + "\n", + "Let's walk through each of these with our example model." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "55oNSWQ2Q5te" + }, + "source": [ + "### Transformers\n", + "\n", + "Transformers are estimators that are used to prepare data for consumption by other estimators ('preprocessing'). In addition to `.fit(...)`, the transformer implements a `.transform(...)` method, which will apply a transformation based on what was computed during `.fit(...)`. With this pattern, dynamic preprocessing steps can be applied to both training and test/production data consistently.\n", + "\n", + "An example of a transformer is `bigframes.ml.preprocessing.StandardScaler`, which rescales a dataset to have a mean of zero and a standard deviation of one:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "yhATDMR-mkdF" + }, + "outputs": [], + "source": [ + "from bigframes.ml.preprocessing import StandardScaler\n", + "\n", + "# StandardScaler will only work on numeric columns\n", + "numeric_columns = [\"culmen_length_mm\", \"culmen_depth_mm\", \"flipper_length_mm\"]\n", + "\n", + "scaler = StandardScaler()\n", + "scaler.fit(X_train[numeric_columns])\n", + "\n", + "# Now, StandardScaler should transform the numbers to have a mean of zero\n", + "# and a standard deviation of one:\n", + "scaler.transform(X_train[numeric_columns])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vhywHzH-ml-W" + }, + "source": [ + "We can then repeat this transformation on the test data:" + ] + 
}, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "TfwSLOTXmspI" + }, + "outputs": [], + "source": [ + "scaler.transform(X_test[numeric_columns])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9enAdjzPmwmv" + }, + "source": [ + "#### Composing transformers\n", + "\n", + "To process data where different columns need different preprocessors, use `bigframes.ml.compose.ColumnTransformer`.\n", + "\n", + "Let's create an aggregate transform that applies `StandardScaler` to the numeric columns and `OneHotEncoder` to the string columns." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "I8Wwx3emmz2J" + }, + "outputs": [], + "source": [ + "from bigframes.ml.compose import ColumnTransformer\n", + "from bigframes.ml.preprocessing import OneHotEncoder\n", + "\n", + "# Create an aggregate transform that applies StandardScaler to the numeric columns,\n", + "# and OneHotEncoder to the string columns\n", + "preproc = ColumnTransformer([\n", + " (\"scale\", StandardScaler(), [\"culmen_length_mm\", \"culmen_depth_mm\", \"flipper_length_mm\"]),\n", + " (\"encode\", OneHotEncoder(), [\"species\", \"sex\", \"island\"])])\n", + "\n", + "# Now we can fit all columns of the training data\n", + "preproc.fit(X_train)\n", + "\n", + "processed_X_train = preproc.transform(X_train)\n", + "processed_X_test = preproc.transform(X_test)\n", + "\n", + "# View the processed training data\n", + "processed_X_train" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JhoO4fctm4Q5" + }, + "source": [ + "### Predictors\n", + "\n", + "Predictors are estimators that learn and make predictions. 
In addition to `.fit(...)`, the predictor implements a `.predict(...)` method, which will use what was learned during `.fit(...)` to predict some output.\n", + "\n", + "Predictors can be further broken down into two categories:\n", + "* Supervised predictors\n", + "* Unsupervised predictors" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TqLItVyjslP8" + }, + "source": [ + "#### Supervised predictors\n", + "\n", + "Supervised learning is when we train a model on input-output pairs, and then ask it to predict the output for new inputs. An example of such a predictor is `bigframes.ml.linear_model.LinearRegression`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZeloMmopm8KI" + }, + "outputs": [], + "source": [ + "from bigframes.ml.linear_model import LinearRegression\n", + "\n", + "linreg = LinearRegression()\n", + "\n", + "# Learn from the training data how to predict output y\n", + "linreg.fit(processed_X_train, y_train)\n", + "\n", + "# Predict y for the test data\n", + "predicted_y_test = linreg.predict(processed_X_test)\n", + "\n", + "# View predictions\n", + "predicted_y_test" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z42qesW_nAIf" + }, + "source": [ + "#### Unsupervised predictors\n", + "\n", + "In unsupervised learning, there are no known outputs in the training data; instead, the model learns from the input data alone and predicts something else. An example of an unsupervised predictor is `bigframes.ml.cluster.KMeans`, which learns how to fit input data to a target number of clusters." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "M13zd02znCIg" + }, + "outputs": [], + "source": [ + "from bigframes.ml.cluster import KMeans\n", + "\n", + "# Specify KMeans with four clusters\n", + "kmeans = KMeans(n_clusters=4)\n", + "\n", + "# Fit data\n", + "kmeans.fit(processed_X_train)\n", + "\n", + "# View predictions\n", + "kmeans.predict(processed_X_test)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DFwsIbscnEvh" + }, + "source": [ + "## Pipelines\n", + "\n", + "Transformers and predictors can be chained into a single estimator component using `bigframes.ml.pipeline.Pipeline`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Ku2OXqgJnEeR" + }, + "outputs": [], + "source": [ + "from bigframes.ml.pipeline import Pipeline\n", + "\n", + "pipeline = Pipeline([\n", + " ('preproc', preproc),\n", + " ('linreg', linreg)\n", + "])\n", + "\n", + "# Print our pipeline\n", + "pipeline" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cCQCY_6wnKz_" + }, + "source": [ + "The pipeline simplifies the workflow by applying each of its component steps automatically:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "hsF7FYagnMko" + }, + "outputs": [], + "source": [ + "pipeline.fit(X_train, y_train)\n", + "\n", + "predicted_y_test = pipeline.predict(X_test)\n", + "predicted_y_test" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SiLzpsg8nRXn" + }, + "source": [ + "In the backend, a pipeline will actually be compiled into a single model with an embedded TRANSFORM step." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sTzAxTv1nUKZ" + }, + "source": [ + "## Evaluating results\n", + "\n", + "Some models include a convenient `.score(X, y)` method for evaluation with a preset accuracy metric:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Q8nR1ZqznU-B" + }, + "outputs": [], + "source": [ + "# In the case of a pipeline, this will be equivalent to calling .score on the contained LinearRegression\n", + "pipeline.score(X_test, y_test)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UHM7jls6nY8A" + }, + "source": [ + "For a more general approach, the library `bigframes.ml.metrics` is provided:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "vdEN4Ob9nan4" + }, + "outputs": [], + "source": [ + "from bigframes.ml.metrics import r2_score\n", + "\n", + "r2_score(y_test, predicted_y_test[\"predicted_body_mass_g\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "opn4ycPyneVh" + }, + "source": [ + "## Save to BigQuery\n", + "\n", + "Estimators can be saved to BigQuery as BQML models, and loaded again in the future.\n", + "\n", + "Saving requires `bigquery.tables.create` permission, and loading requires `bigquery.models.getMetadata` permission.\n", + "These permissions can be granted at the project level or the dataset level.\n", + "\n", + "If you have those permissions, uncomment the code in the following cells and run it." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "fb0HpkdpnigJ" + }, + "outputs": [], + "source": [ + "linreg.to_gbq(f\"{DATASET}.penguins_model\", replace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_zNOBlHdnkII" + }, + "outputs": [], + "source": [ + "bf.read_gbq_model(f\"{DATASET}.penguins_model\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RfV-du5uTcBB" + }, + "source": [ + "We can also save the pipeline to BigQuery. BigQuery will save this as a single model, with the pre-processing steps embedded in the TRANSFORM property:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "P76_TQ3IR6nB" + }, + "outputs": [], + "source": [ + "pipeline.to_gbq(f\"{DATASET}.penguins_pipeline\", replace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "GKvlKFjAbToJ" + }, + "outputs": [], + "source": [ + "bf.read_gbq_model(f\"{DATASET}.penguins_pipeline\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wCsmt0IwFkDy" + }, + "source": [ + "## Summary and next steps\n", + "\n", + "You've completed an end-to-end machine learning workflow using the built-in capabilities of BigQuery DataFrames.\n", + "\n", + "Learn more about BigQuery DataFrames in the [documentation](https://cloud.google.com/python/docs/reference/bigframes/latest) and find more sample notebooks in the [GitHub repo](https://github.com/googleapis/python-bigquery-dataframes/tree/main/notebooks)." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TpV-iwP9qw9c" + }, + "source": [ + "### Cleaning up\n", + "\n", + "To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud\n", + "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n", + "\n", + "Otherwise, you can uncomment the remaining cells and run them to delete the individual resources you created in this tutorial:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "QwumLUKmVpuH" + }, + "outputs": [], + "source": [ + "# # Delete the BQML models\n", + "# MODEL_NAME = f\"{PROJECT_ID}:{DATASET}.penguins_model\"\n", + "# ! bq rm -f --model {MODEL_NAME}\n", + "# PIPELINE_NAME = f\"{PROJECT_ID}:{DATASET}.penguins_pipeline\"\n", + "# ! bq rm -f --model {PIPELINE_NAME}" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}