Add notebook fixing "wide" tutorial. #4749
notedown wide.md --to notebook --output wide.ipynb
I can't find values of regularization strength that matter for this problem.
This is great, thanks
" models_path = os.path.join(os.getcwd(), 'models')\n",
" sys.path.append(models_path) \n",
" os.environ['PYTHONPATH'] += os.pathsep+models_path\n",
" os.chdir(\"models/official/wide_deep\")"
nit: This `chdir` isn't strictly necessary and, since we take the PYTHONPATH route in the official models in general, it might lead to some confusion or future errors. The code below would just have to be updated to use `from official.wide_deep import ...` instead of just `import ...`.
Fixed.
It's much cleaner like that. Thanks.
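For reference, the PYTHONPATH route discussed above might look like the following minimal sketch. The helper name `add_models_to_path` is hypothetical (it is not part of the notebook), and the sketch assumes the `models` repository has already been cloned. Unlike the `+=` in the reviewed cell, it reads the variable with `os.environ.get`, so an unset `PYTHONPATH` does not raise `KeyError`.

```python
import os
import sys


def add_models_to_path(models_path):
    """Make `models_path` importable here and in child processes.

    Hypothetical helper for illustration; not part of the notebook.
    """
    models_path = os.path.abspath(models_path)

    # Current interpreter: extend sys.path once.
    if models_path not in sys.path:
        sys.path.append(models_path)

    # Child processes (e.g. `!python -m ...`): extend PYTHONPATH once,
    # tolerating the case where the variable is not set at all.
    existing = os.environ.get("PYTHONPATH", "")
    parts = [p for p in existing.split(os.pathsep) if p]
    if models_path not in parts:
        parts.append(models_path)
    os.environ["PYTHONPATH"] = os.pathsep.join(parts)
    return models_path
```

With the path set up this way, no `os.chdir` is needed and the later cells can use `from official.wide_deep import ...`.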
"source": [ | ||
"def easy_input_function(df, label_key, num_epochs, shuffle, batch_size):\n", | ||
" df = df.copy()\n", | ||
" label = df.pop(label_key)\n", |
This may be my mis-memory of Pandas, which I admit I haven't used in a while, but do we have to create copies of the dataframe? Can't we just say `label = df[label_key]`, or something to that effect?
> label = df[label_key]
That works. Fixed.
The only downside is that the label is still in the features dict.
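The trade-off in this thread can be sketched as follows. `split_features_and_label` is a hypothetical helper, not code from the notebook. It assumes pandas is installed and uses `DataFrame.drop`, which returns a new frame, so the caller's data is untouched and the label does not leak into the features dict.

```python
import pandas as pd


def split_features_and_label(df, label_key):
    """Return (features_dict, label) without mutating `df`.

    Hypothetical helper for illustration; not part of the notebook.
    """
    # Plain indexing does not modify the DataFrame (unlike df.pop).
    label = df[label_key]
    # drop() returns a new DataFrame, so no explicit df.copy() is needed.
    features = df.drop(columns=[label_key])
    # dict(DataFrame) maps each column name to its Series.
    return dict(features), label
```

By contrast, `df.pop(label_key)` removes the column in place, which is why the original cell copied the frame first.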
"cell_type": "markdown", | ||
"source": [ | ||
"But this approach has severly-limited scalability. For larger data it should be streamed off disk.\n", | ||
"the `census_dataset.input_fn` provides an example of how to do this using `tf.decode_csv` and `tf.data.TextLineDataset`: \n", |
nit: capitalize The
Done.
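To illustrate what "streamed off disk" means in the quoted cell, here is a minimal standard-library sketch. It is a stand-in for the idea only: it uses Python's `csv` module rather than `tf.data.TextLineDataset` and `tf.decode_csv`, and it is not the implementation of `census_dataset.input_fn`. The function name `stream_csv_batches` and its parameters are hypothetical.

```python
import csv
import itertools


def stream_csv_batches(path, column_names, label_key, batch_size):
    """Yield (features, labels) batches, reading only one batch at a time.

    Hypothetical stand-in for a tf.data pipeline; values stay as strings.
    """
    with open(path, newline="") as f:
        # The file is assumed to have no header row, so names are supplied.
        reader = csv.DictReader(f, fieldnames=column_names)
        while True:
            # islice pulls at most batch_size rows from the open file.
            rows = list(itertools.islice(reader, batch_size))
            if not rows:
                return
            labels = [row[label_key] for row in rows]
            features = {
                name: [row[name] for row in rows]
                for name in column_names
                if name != label_key
            }
            yield features, labels
```

Because the generator holds only one batch in memory, the file size no longer limits the input pipeline, which is the scalability point the cell makes.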
"\n", | ||
"Estimators use a system called `feature_columns` to describe how the model\n", | ||
"should interpret each of the raw input features. An Estimator exepcts a vector\n", | ||
"of numeric inputs, and feature columns describe how the model shoukld convert\n", |
nit: shoukld --> should
Done
},
"cell_type": "markdown",
"source": [
"if we run `input_layer` with the hashed column we see that the output shape is `(batch_size, hash_bucket_size)`"
nit: capitalize If
Done
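The shape claim in the quoted cell can be illustrated without TensorFlow. The sketch below assumes the hashed column is fed to the model as a one-hot (indicator) vector, and it uses MD5 purely as a deterministic stand-in; TensorFlow uses its own hash function, so bucket indices will differ. Both function names are hypothetical.

```python
import hashlib


def hash_bucket(value, hash_bucket_size):
    """Map a string to a bucket index in [0, hash_bucket_size)."""
    # MD5 is only a deterministic stand-in for TensorFlow's hashing.
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


def hashed_one_hot(batch, hash_bucket_size):
    """Return a batch_size x hash_bucket_size list of one-hot rows."""
    rows = []
    for value in batch:
        row = [0.0] * hash_bucket_size
        row[hash_bucket(value, hash_bucket_size)] = 1.0
        rows.append(row)
    return rows
```

Each input string becomes one row with a single 1.0, so a batch of `batch_size` strings yields an output of shape `(batch_size, hash_bucket_size)`. Distinct strings can collide in the same bucket, which is the usual cost of hashing.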
"ds = easy_input_function(train_df, label_key='income_bracket', num_epochs=5, shuffle=True, batch_size=10)\n", | ||
"\n", | ||
"for feature_batch, label_batch in ds:\n", | ||
" break\n", |
From @ispirmustafa:
It's better to put the print statements within the body of the `for` loop.
Fixed
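A minimal sketch of the suggested pattern, using a hypothetical `show_first_batch` helper and any iterable of `(feature_batch, label_batch)` pairs in place of the notebook's dataset. One reason to prefer it: if the iterable is empty, loop variables referenced after a bare `for ...: break` would be undefined and raise `NameError`, whereas prints inside the body are simply skipped.

```python
def show_first_batch(batches):
    """Print a summary of the first batch; return how many were shown.

    Hypothetical helper for illustration; not part of the notebook.
    """
    shown = 0
    for feature_batch, label_batch in batches:
        # The prints live inside the loop body, so they only run when a
        # batch actually exists.
        print("Some feature keys:", sorted(feature_batch)[:5])
        print("A batch of labels:", label_batch)
        shown += 1
        break
    return shown
```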
},
"cell_type": "code",
"source": [
"classifier = tf.estimator.LinearClassifier(feature_columns=[age], n_classes=2)\n",
From @ispirmustafa:
`n_classes=2` is the default, so let's not set it.
Done
"cell_type": "markdown",
"source": [
"In this tutorial, we will use the `tf.estimator` API in TensorFlow to solve a\n",
"binary classification problem: Given census data about a person such as age,\n",
I think we can use this opportunity to promote ML fairness. Want to include the "Key Point" callout we used in the housing dataset colab? https://www.tensorflow.org/tutorials/keras/basic_regression
Done.
},
"cell_type": "code",
"source": [
"!python -m official.wide_deep.census_main --model_type=wide --train_epochs=2"
In colab this just hangs for about a minute before running.
It doesn't do that on my local machine.
Any ideas?
If I interrupt it, the traceback (ending in `cloud_lib.on_gcp()`) is:
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/content/models/official/wide_deep/census_main.py", line 112, in <module>
absl_app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 274, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 238, in _run_main
sys.exit(main(argv))
File "/content/models/official/wide_deep/census_main.py", line 106, in main
run_census(flags.FLAGS)
File "/content/models/official/wide_deep/census_main.py", line 101, in run_census
early_stop=True)
File "/content/models/official/wide_deep/wide_deep_run_loop.py", line 91, in run_loop
test_id=flags_obj.benchmark_test_id)
File "/content/models/official/utils/logs/logger.py", line 151, in log_run_info
test_id))
File "/content/models/official/utils/logs/logger.py", line 317, in _gather_run_info
_collect_test_environment(run_info)
File "/content/models/official/utils/logs/logger.py", line 421, in _collect_test_environment
if cloud_lib.on_gcp():
File "/content/models/official/utils/logs/cloud_lib.py", line 28, in on_gcp
response = requests.get(GCP_METADATA_URL, headers=GCP_METADATA_HEADER)
File "/usr/local/lib/python3.6/dist-packages/requests/api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/requests/api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 508, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 618, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 440, in send
timeout=timeout
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 601, in urlopen
chunked=chunked)
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 357, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/lib/python3.6/http/client.py", line 1239, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1285, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1234, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1026, in _send_output
self.send(msg)
File "/usr/lib/python3.6/http/client.py", line 964, in send
self.connect()
File "/usr/local/lib/python3.6/dist-packages/urllib3/connection.py", line 166, in connect
conn = self._new_conn()
File "/usr/local/lib/python3.6/dist-packages/urllib3/connection.py", line 141, in _new_conn
(self.host, self.port), self.timeout, **extra_kw)
File "/usr/local/lib/python3.6/dist-packages/urllib3/util/connection.py", line 73, in create_connection
sock.connect(sa)
Hmm, I'm not that familiar with Colab. Could it be that the test is timing out? Or not finding the data or something?
It has the data. It runs fine. It just waits at this line for a minute first, as if it's waiting for this connection to time out.
This is in `benchmark_logger.log_run_info` (what is that?), down inside `cloud_lib.on_gcp()`.
(It's not doing it in the rest of the notebook because I don't use benchmarks.)
I'll send this to the Colab people and see what they say.
Oh, interesting. Also + @qlzh727 for insight.
From the stack trace, it seems that the logger was trying to detect the local running environment by querying the URL "http://metadata/computeMetadata/v1/instance/hostname". Maybe the URL is not reachable on the codelab box, so the query suffers some delay.
I should probably add a short timeout for the connection.
Oh, nice. Thanks!
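A short-timeout check along the lines proposed above could look like the sketch below. It is not the actual change made to `cloud_lib.py`. It swaps in the standard library's `urllib.request` for the `requests.get` call seen in the traceback. The 1-second default, the `url` parameter, and the header value `{"Metadata-Flavor": "Google"}` (the header the GCP metadata server expects) are assumptions about what the original constants contain.

```python
import http.client
import urllib.request

# URL taken from the comment above; the header value is an assumption.
GCP_METADATA_URL = "http://metadata/computeMetadata/v1/instance/hostname"
GCP_METADATA_HEADER = {"Metadata-Flavor": "Google"}


def on_gcp(url=GCP_METADATA_URL, timeout=1.0):
    """Return True only if the metadata server answers 200 within `timeout`.

    Sketch only: any connection error, timeout, or HTTP error counts as
    "not on GCP" instead of blocking the run.
    """
    request = urllib.request.Request(url, headers=GCP_METADATA_HEADER)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

With `requests`, the equivalent idea would be passing `timeout=` to `requests.get` and catching `requests.exceptions.RequestException`.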
},
"cell_type": "markdown",
"source": [
"# TensorFlow Linear Model Tutorial\n",
New title? "Build a linear model using Estimators" (or something)
Can you change the file name while you're here? It will probably land at: en/tutorials/estimators/linear_model (??)
title: Done.
move: TODO
"\n",
"To try the code for this tutorial:\n",
"\n",
"[Install TensorFlow](tensorlfow.org/install) if you haven't already.\n",
Remove install instructions
},
"cell_type": "markdown",
"source": [
"Download the [tutorial code from github](https://github.com/tensorflow/models/tree/master/official/wide_deep/),\n",
I know this section is needed, but it just feels so ... un-notebooky. Can it be collapsed?
Collapse? Sure. But that will only apply in Colab (not GitHub or tf.org).
done
"50,000 dollars.\n",
"\n",
"Key Point: As a modeler and developer, think about how this data is used and the potential benefits and harm a model's predictions can cause. A model like this could reinforce societal biases and disparities. Is each feature relevant to the problem you want to solve or will it introduce bias? For more information, read about [ML fairness](https://developers.google.com/machine-learning/fairness-overview/).\n",
"\n",
Maybe add a little blurb at the end (or somewhere in the first paragraph):
"For more information, see the Estimator guide."
Done
"colab": {}
},
"cell_type": "code",
"source": [
Remove empty code block at end
Done.
},
"cell_type": "markdown",
"source": [
"## What Next\n",
Next steps
Include link to tfo.org/guide/estimators
Done
"and set the `model_type` flag to `wide`."
]
},
{
From @ispirmustafa:
L1 and L2 regularization are important tools for linear models.
Without them, this would not really be a complete linear example.
Have you seen the MLCC content? https://developers.google.com/machine-learning/crash-course/regularization-for-sparsity/video-lecture
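As a rough illustration of why the two penalties behave differently, here is a framework-free sketch, not code from the notebook or from TensorFlow's optimizers. It applies a single proximal shrinkage step to a weight vector, assuming penalties of `strength * |w|` for L1 and `0.5 * strength * w**2` for L2. The function names are hypothetical.

```python
def l1_shrink(weights, strength):
    """Soft-thresholding: the proximal step for the penalty strength * |w|."""
    shrunk = []
    for w in weights:
        magnitude = max(abs(w) - strength, 0.0)
        if magnitude == 0.0:
            # Small weights are set exactly to zero (sparsity).
            shrunk.append(0.0)
        elif w > 0:
            shrunk.append(magnitude)
        else:
            shrunk.append(-magnitude)
    return shrunk


def l2_shrink(weights, strength):
    """Proximal step for 0.5 * strength * w**2: uniform scaling toward zero."""
    return [w / (1.0 + strength) for w in weights]
```

L1 zeroes out any weight whose magnitude is at most `strength`, which is what makes it useful for sparse, wide feature sets. L2 only scales weights down and never makes a nonzero weight exactly zero.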
File rename and some copy edits in this PR: MarkDaoust#9
Colab staging link:
https://colab.sandbox.google.com/github/lamberta/models/blob/mark-estimator-wide-2/samples/core/tutorials/estimators/linear.ipynb
fixes: tensorflow/tensorflow#18929