Add notebook fixing "wide" tutorial. #4749

MarkDaoust · 2018-07-11T22:46:14Z

Colab staging link:

https://colab.sandbox.google.com/github/lamberta/models/blob/mark-estimator-wide-2/samples/core/tutorials/estimators/linear.ipynb

fixes: tensorflow/tensorflow#18929

from https://github.com/tensorflow/tensorflow/blob/8a6ef2cb4f98bacc1f821f60c21914b4bd5faaef/tensorflow/docs_src/tutorials/representation/wide.md

notedown wide.md --to notebook --output wide.ipynb

I can't find values of regularization strength that matter for this problem.

karmel

This is great, thanks

karmel · 2018-07-12T18:23:52Z

samples/core/tutorials/estimators/wide.ipynb

+        "    models_path = os.path.join(os.getcwd(), 'models')\n",
+        "    sys.path.append(models_path)   \n",
+        "    os.environ['PYTHONPATH'] += os.pathsep+models_path\n",
+        "    os.chdir(\"models/official/wide_deep\")"


nit: This `cd` isn't strictly necessary and, since we generally take the PYTHONPATH route in the official models, it might lead to some confusion or future errors. The code below would just have to be updated to `from official.wide_deep import ...` instead of just `import ...`.

Fixed.

It's much cleaner like that. Thanks.
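
A minimal sketch of the PYTHONPATH route discussed above, assuming the repository was cloned into a `models` directory under the current working directory (as in the diff). The helper name `add_to_python_path` is hypothetical. Note that `os.environ['PYTHONPATH'] += ...` raises `KeyError` when the variable is unset, so the sketch reads it with `os.environ.get` instead.

```python
import os
import sys


def add_to_python_path(path):
    """Make `path` importable here and in subprocesses (hypothetical helper)."""
    path = os.path.abspath(path)
    # Current interpreter: extend sys.path once.
    if path not in sys.path:
        sys.path.append(path)
    # Subprocesses (e.g. `!python -m ...` in a notebook) read PYTHONPATH.
    existing = os.environ.get("PYTHONPATH", "")
    parts = [p for p in existing.split(os.pathsep) if p]
    if path not in parts:
        parts.append(path)
    os.environ["PYTHONPATH"] = os.pathsep.join(parts)
    return path


models_path = add_to_python_path(os.path.join(os.getcwd(), "models"))
# With the path set, no os.chdir is needed; imports are package-qualified:
# from official.wide_deep import census_dataset
```

Calling the helper twice with the same path leaves both `sys.path` and `PYTHONPATH` unchanged the second time.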

karmel · 2018-07-12T18:28:19Z

samples/core/tutorials/estimators/wide.ipynb

+      "source": [
+        "def easy_input_function(df, label_key, num_epochs, shuffle, batch_size):\n",
+        "  df = df.copy()\n",
+        "  label = df.pop(label_key)\n",


This may be my mis-memory of Pandas, which I admit I haven't used in a while, but do we have to create copies of the dataframe? Can't we just say, label = df[label_key] or something to that effect?

label = df[label_key]

That works. Fixed.
The only downside is that the label is still in the features dict.
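
To illustrate the trade-off, here is a small sketch that uses a plain dict of columns as a stand-in for the DataFrame (an assumption made to keep the example dependency-free; `split_features_and_label` is a hypothetical name and the rows are made up). It builds a new features dict without the label, so the caller's data is left untouched and the label does not end up among the features.

```python
def split_features_and_label(columns, label_key):
    """Return (features, label) without mutating `columns`."""
    label = columns[label_key]
    # Shallow copy that skips the label column; column values are shared.
    features = {k: v for k, v in columns.items() if k != label_key}
    return features, label


columns = {
    "age": [39, 50, 38],
    "education": ["Bachelors", "Bachelors", "HS-grad"],
    "income_bracket": ["<=50K", "<=50K", ">50K"],
}
features, label = split_features_and_label(columns, "income_bracket")
```

With pandas, `df.drop(columns=[label_key])` plays a similar role: it returns a new frame without the label column and leaves `df` as it was.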

karmel · 2018-07-12T18:30:58Z

samples/core/tutorials/estimators/wide.ipynb

+      "cell_type": "markdown",
+      "source": [
+        "But this approach has severly-limited scalability. For larger data it should be streamed off disk.\n",
+        "the `census_dataset.input_fn` provides an example of how to do this using `tf.decode_csv` and `tf.data.TextLineDataset`: \n",


nit: capitalize The
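
The streaming idea in this cell can be sketched without TensorFlow. The following uses the standard `csv` module in place of `tf.decode_csv` and `tf.data.TextLineDataset` (a swapped-in technique, not what `census_dataset.input_fn` does), and the column names and the helper name `stream_batches` are assumptions for illustration. Rows are read lazily, so only one batch is held in memory at a time.

```python
import csv
import io
import itertools


def stream_batches(lines, column_names, label_key, batch_size):
    """Yield (features, labels) batches from an iterable of CSV lines."""
    reader = csv.DictReader(lines, fieldnames=column_names, skipinitialspace=True)
    while True:
        # Pull at most `batch_size` rows; nothing beyond that is read yet.
        rows = list(itertools.islice(reader, batch_size))
        if not rows:
            return
        labels = [row.pop(label_key) for row in rows]
        features = {name: [row[name] for row in rows]
                    for name in column_names if name != label_key}
        yield features, labels


# An in-memory file stands in for a large file on disk; open(path) works the same way.
fake_file = io.StringIO("39, Bachelors, <=50K\n50, Bachelors, <=50K\n38, HS-grad, >50K\n")
batches = list(stream_batches(fake_file, ["age", "education", "income_bracket"],
                              "income_bracket", batch_size=2))
```

All values come back as strings here; parsing them into numeric types is left out of the sketch.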

karmel · 2018-07-12T18:33:35Z

samples/core/tutorials/estimators/wide.ipynb

+        "\n",
+        "Estimators use a system called `feature_columns` to describe how the model\n",
+        "should interpret each of the raw input features. An Estimator exepcts a vector\n",
+        "of numeric inputs, and feature columns describe how the model shoukld convert\n",


nit: shoukld --> should

karmel · 2018-07-12T18:35:19Z

samples/core/tutorials/estimators/wide.ipynb

+      },
+      "cell_type": "markdown",
+      "source": [
+        "if we run `input_layer` with the hashed column we see that the output shape is `(batch_size, hash_bucket_size)`"


nit: capitalize If
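
As a rough illustration of why the output shape is `(batch_size, hash_bucket_size)`, the sketch below hashes each string into one of `hash_bucket_size` buckets and one-hot encodes the bucket. It uses `hashlib.sha256` rather than TensorFlow's own hash function, so the bucket assignments will differ from what `input_layer` produces; only the shape logic carries over. `hashed_one_hot` is a hypothetical name and the input strings are made up.

```python
import hashlib


def hashed_one_hot(values, hash_bucket_size):
    """One-hot encode strings by hashed bucket index (illustrative stand-in)."""
    batch = []
    for value in values:
        digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % hash_bucket_size
        row = [0.0] * hash_bucket_size
        row[bucket] = 1.0
        batch.append(row)
    return batch


encoded = hashed_one_hot(["Sales", "Tech-support", "Sales"], hash_bucket_size=8)
shape = (len(encoded), len(encoded[0]))  # (batch_size, hash_bucket_size)
```

Equal strings always land in the same bucket, while distinct strings may collide when `hash_bucket_size` is small.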

MarkDaoust · 2018-07-12T18:58:24Z

samples/core/tutorials/estimators/wide.ipynb

+        "ds = easy_input_function(train_df, label_key='income_bracket', num_epochs=5, shuffle=True, batch_size=10)\n",
+        "\n",
+        "for feature_batch, label_batch in ds:\n",
+        "    break\n",


From @ispirmustafa:

It's better to put the print statements within the body of the `for` loop.
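
A small sketch of this suggestion, with a plain generator standing in for the dataset (an assumption; `fake_batches` is hypothetical). Inspecting the batch inside the loop body means nothing refers to `feature_batch` after the loop, where it would be undefined if the dataset yielded no batches.

```python
def fake_batches():
    """Stand-in for the dataset: yields (feature_batch, label_batch) pairs."""
    yield {"age": [39, 50]}, [0, 0]
    yield {"age": [38, 53]}, [1, 0]


seen = []
for feature_batch, label_batch in fake_batches():
    # Inspect the first batch here, inside the loop body, then stop.
    print("Feature keys:", list(feature_batch.keys()))
    print("Labels:", label_batch)
    seen.append(label_batch)
    break
```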

MarkDaoust · 2018-07-12T18:59:28Z

samples/core/tutorials/estimators/wide.ipynb

+      },
+      "cell_type": "code",
+      "source": [
+        "classifier = tf.estimator.LinearClassifier(feature_columns=[age], n_classes=2)\n",


From: @ispirmustafa

`n_classes=2` is the default, so let's not set it.

lamberta · 2018-07-12T19:11:20Z

samples/core/tutorials/estimators/wide.ipynb

+      "cell_type": "markdown",
+      "source": [
+        "In this tutorial, we will use the `tf.estimator` API in TensorFlow to solve a\n",
+        "binary classification problem: Given census data about a person such as age,\n",


I think we can use this opportunity to promote ML fairness. Want to include the "Key Point" callout we used in the housing dataset colab? https://www.tensorflow.org/tutorials/keras/basic_regression

MarkDaoust · 2018-07-12T21:39:24Z

samples/core/tutorials/estimators/wide.ipynb

+      },
+      "cell_type": "code",
+      "source": [
+        "!python -m official.wide_deep.census_main --model_type=wide --train_epochs=2"


@karmel

In colab this just hangs for about a minute before running.
It doesn't do that on my local machine.

Any ideas?

If I interrupt it the traceback is (cloud_lib.on_gcp()):

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/models/official/wide_deep/census_main.py", line 112, in <module>
    absl_app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 274, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 238, in _run_main
    sys.exit(main(argv))
  File "/content/models/official/wide_deep/census_main.py", line 106, in main
    run_census(flags.FLAGS)
  File "/content/models/official/wide_deep/census_main.py", line 101, in run_census
    early_stop=True)
  File "/content/models/official/wide_deep/wide_deep_run_loop.py", line 91, in run_loop
    test_id=flags_obj.benchmark_test_id)
  File "/content/models/official/utils/logs/logger.py", line 151, in log_run_info
    test_id))
  File "/content/models/official/utils/logs/logger.py", line 317, in _gather_run_info
    _collect_test_environment(run_info)
  File "/content/models/official/utils/logs/logger.py", line 421, in _collect_test_environment
    if cloud_lib.on_gcp():
  File "/content/models/official/utils/logs/cloud_lib.py", line 28, in on_gcp
    response = requests.get(GCP_METADATA_URL, headers=GCP_METADATA_HEADER)
  File "/usr/local/lib/python3.6/dist-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 440, in send
    timeout=timeout
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 357, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1239, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1285, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1234, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 964, in send
    self.connect()
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connection.py", line 166, in connect
    conn = self._new_conn()
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connection.py", line 141, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)

Hmm, I'm not that familiar with colab-- could it be that the test is timing out? Or not finding the data or something?

It has the data. It runs fine. It just waits at this line for a minute first like it's waiting for this connection to time out.

This is in benchmark_logger.log_run_info (what is that?), down inside cloud_lib.on_gcp().
(It doesn't happen in the rest of the notebook because I don't use benchmarks there.)

I'll send this to the Colab people and see what they say.

Oh, interesting. Also + @qlzh727 for insight.

From the stack trace, it seems that the logger was trying to detect the local running environment by querying the URL "http://metadata/computeMetadata/v1/instance/hostname". Maybe the URL is not reachable on the codelab box, so the query waits for a while before failing.

I probably should add a short timeout for the connection.

Oh, nice. Thanks!
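
A sketch of what a short timeout could look like, using only the standard library rather than `requests` as in the traceback. The URL is the one quoted above; the `Metadata-Flavor: Google` header value and the one-second default are assumptions, and this is not the change that was made to `cloud_lib.on_gcp()`.

```python
import urllib.request

GCP_METADATA_URL = "http://metadata/computeMetadata/v1/instance/hostname"
# Assumed header; the real GCP_METADATA_HEADER is not shown in this thread.
GCP_METADATA_HEADER = {"Metadata-Flavor": "Google"}


def on_gcp(url=GCP_METADATA_URL, timeout=1.0):
    """Return True if the metadata endpoint answers within `timeout` seconds."""
    request = urllib.request.Request(url, headers=GCP_METADATA_HEADER)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers URLError (unreachable host, refused connection) and timeouts.
        return False
```

Off GCP the call then fails fast and returns `False` instead of blocking until the operating system gives up on the connection. The timeout bounds the socket operations; a slow DNS lookup may still take longer.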

lamberta · 2018-07-12T21:43:36Z

samples/core/tutorials/estimators/wide.ipynb

+      },
+      "cell_type": "markdown",
+      "source": [
+        "# TensorFlow Linear Model Tutorial\n",


New title? "Build a linear model using Estimators" (or something)
Can you change the file name while you're here? Will probably land: en/tutorials/estimators/linear_model ??

title: Done.

move: TODO

lamberta · 2018-07-12T21:43:52Z

samples/core/tutorials/estimators/wide.ipynb

+        "\n",
+        "To try the code for this tutorial:\n",
+        "\n",
+        "[Install TensorFlow](tensorlfow.org/install) if you haven't already.\n",


Remove install instructions

lamberta · 2018-07-12T21:44:52Z

samples/core/tutorials/estimators/wide.ipynb

+      },
+      "cell_type": "markdown",
+      "source": [
+        "Download the [tutorial code from github](https://github.com/tensorflow/models/tree/master/official/wide_deep/),\n",


I know this section is needed but it just feels so ... un-notebooky. Can it be collapsed?

Collapse? Sure. But that will only apply in Colab (not GitHub or tf.org).

lamberta · 2018-07-12T21:51:20Z

samples/core/tutorials/estimators/wide.ipynb

+        "50,000 dollars.\n",
+        "\n",
+        "Key Point: As a modeler and developer, think about how this data is used and the potential benefits and harm a model's predictions can cause. A model like this could reinforce societal biases and disparities. Is each  feature relevant to the problem you want to solve or will it introduce bias? For more information, read about [ML fairness](https://developers.google.com/machine-learning/fairness-overview/).\n",
+        "\n",


Maybe add a little blurb at the end (or somewhere in the first paragraph):
"For more information, see the Estimator guide."

lamberta · 2018-07-12T21:51:52Z

samples/core/tutorials/estimators/wide.ipynb

+        "colab": {}
+      },
+      "cell_type": "code",
+      "source": [


Remove empty code block at end

lamberta · 2018-07-12T21:52:43Z

samples/core/tutorials/estimators/wide.ipynb

+      },
+      "cell_type": "markdown",
+      "source": [
+        "## What Next\n",


Next steps

Include link to tfo.org/guide/estimators

MarkDaoust · 2018-07-13T18:11:19Z

samples/core/tutorials/estimators/wide.ipynb

+        "and set the `model_type` flag to `wide`."
+      ]
+    },
+    {


From @ispirmustafa:

L1 and L2 regularization are important tools for linear models.

Without them it would not really be a complete linear example.

Have you seen the MLCC content: https://developers.google.com/machine-learning/crash-course/regularization-for-sparsity/video-lecture

lamberta · 2018-07-13T19:46:06Z

File rename and some copy edits in this PR: MarkDaoust#9
Stage: https://colab.sandbox.google.com/github/lamberta/models/blob/mark-estimator-wide-2/samples/core/tutorials/estimators/linear.ipynb

Mark estimator wide 2

MarkDaoust added 9 commits July 11, 2018 15:15

add wide.md from tensorflow/tensorflow

b0cfb2a

from https://github.com/tensorflow/tensorflow/blob/8a6ef2cb4f98bacc1f821f60c21914b4bd5faaef/tensorflow/docs_src/tutorials/representation/wide.md

Converted wide.md to notebook using notedown

c8eda49

notedown wide.md --to notebook --output wide.ipynb

Fix wide.ipynb

2c92997

Convert to colab format

9ba5b31

clear output

5b44f6e

Add ML-Fairness link, License and Buttons

5f43cce

close form?

b178da2

Close License?

48ddc77

fix

7a21070

MarkDaoust requested a review from lamberta as a code owner July 11, 2018 22:46

googlebot added the cla: yes label Jul 11, 2018

MarkDaoust requested a review from karmel July 12, 2018 00:11

Remove Regularization section.

66df453

I can't find values of regularization strength that matter for this problem.

karmel reviewed Jul 12, 2018

Resolve review comments - part 1

957ee97

MarkDaoust commented Jul 12, 2018

lamberta reviewed Jul 12, 2018

Fix review comments - part 2

18fc407

MarkDaoust commented Jul 12, 2018

Move Fairness note to second paragraph.

02a30e2

lamberta reviewed Jul 12, 2018

qlzh727 mentioned this pull request Jul 13, 2018

Add shorter timeout for GCP util. #4762

Merged

Resolve review comments.

0f5c10c

MarkDaoust commented Jul 13, 2018

lamberta added 2 commits July 13, 2018 12:39

Rename wide to linear

095ed06

Copy edits and formatting. Removed information sections at end to shorten the notebook.

37b0423

MarkDaoust added 3 commits July 13, 2018 13:58

reworded "data-frame" --> features dict.

1882842

typo

be5fbcf

Merge pull request #9 from lamberta/mark-estimator-wide-2

5de1185

Mark estimator wide 2

lamberta approved these changes Jul 13, 2018

MarkDaoust merged commit c5b8f2f into tensorflow:master Jul 13, 2018
