Documentation and Deployment
The buzz data is structured as shown below:
Using knitr to produce milestone documentation (knitr is an engine for dynamic report
generation with R)
For self/peer documentation, you want to concentrate on facts: what the stated goals were, where
the data came from, and what techniques were tried. You assume as long as you use standard
terminology or references that the reader can figure out anything else they need to know. You
want to emphasize any surprises or exceptional issues, as they’re exactly what’s expensive to
relearn. You can’t expect to share this sort of documentation with clients.
The first sort of documentation we recommend is project milestone or checkpoint
documentation. At major steps of the project you should take some time out to repeat your work
in a clean environment. We’ll use the knitr R package to document starting work with the buzz
data.
What is knitr?
knitr is an R package that allows the inclusion of R code and results inside documents. knitr’s
operation is similar in concept to Knuth’s literate programming and to the R Sweave package. In
practice you maintain a master file that contains both user readable documentation and chunks of
program source code. The document types supported by knitr include LaTeX, Markdown, and
HTML. LaTeX format is a good choice for detailed typeset technical documents. Markdown
format is a good choice for online documentation and wikis. Direct HTML format may be
appropriate for some web applications.
knitr process schematic
A simple knitr Markdown example
Markdown (http://daringfireball.net/projects/markdown/) is a simple web-ready format that’s
used in many wikis. The following listing shows a simple Markdown document with knitr
annotation blocks denoted with ```.
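As a minimal sketch of what such a file can look like, the shell script below writes a tiny knitr Markdown document and knits it only if R and the knitr package happen to be installed. The file name example.Rmd and the chunk label addition are hypothetical, chosen for illustration. The chunk delimiter is built from its octal character code only so that the script can itself be quoted inside a fenced block.

```shell
set -e
cd "$(mktemp -d)"

# Three backticks, built from the octal code for a backtick (140).
fence=$(printf '\140\140\140')

# Write a small knitr Markdown file: prose plus one R chunk.
{
  printf '%s\n' '# Simple knitr Markdown example'
  printf '%s\n' ''
  printf '%s\n' 'Text outside the chunk is ordinary Markdown.'
  printf '%s\n' ''
  printf '%s{r addition, echo=TRUE}\n' "$fence"
  printf '%s\n' '1 + 2'
  printf '%s\n' "$fence"
} > example.Rmd

# Knit only when both R and knitr are available on this machine.
if command -v Rscript >/dev/null 2>&1 &&
   Rscript -e 'quit(status = as.integer(!requireNamespace("knitr", quietly = TRUE)))'; then
  Rscript -e 'knitr::knit("example.Rmd")'
else
  echo "R or knitr not found; example.Rmd was written but not knitted"
fi
```

When knitting succeeds, knitr writes a Markdown file next to the input in which the chunk is followed by its evaluated result.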
knitr LaTeX example
LaTeX to create the final add.pdf file:
Simple knitr LaTeX result
Purpose of knitr
The purpose of knitr is to produce reproducible work. When you distribute your work in knitr
format, anyone can download your work and, without great effort, rerun it to confirm they get
the same results you did. This is the ideal standard of scientific research, but it is rarely met, because
scientists often fall short in sharing all of their code, data, and actual procedures. knitr
collects and automates all the steps, so it becomes obvious when something is missing.
knitr chunk options: A sampling of useful option assignments is given in table
Using comments and version control for running documentation
Another essential record of your work is what we call running documentation. Running
documentation is more informal than milestone/checkpoint documentation and is most easily
maintained in the form of code comments and version control records.
R’s comment style is simple: everything following a # (that isn’t itself quoted) until the end of a
line is a comment and ignored by the R interpreter. The following listing is an example of a
well-commented block of R code.
Example code comment
# Return the pseudo logarithm of x, which is close to
# sign(x)*log10(abs(x)) for x such that abs(x) is large
# and doesn't "blow up" near zero. Useful
# for transforming wide-range variables that may be negative
# (like profit/loss).
# See: http://www.win-vector.com/blog
Good comments include what the function does, what types the arguments are expected to be, limits
of domain, why you should care about the function, and where it’s from. Of critical importance
are any NB or TODO notes. It’s vastly more important to document any unexpected features or
limitations in your code than to try to explain the obvious. Because R variables don’t have
declared types, it is worth stating the expected argument types in comments.
Using version control to record history
Version control can both maintain critical snapshots of your work in earlier states and produce
running documentation of what was done by whom and when in your project.
Version control saving the day
To use Git as a version control system, you should be familiar with a few basic commands:
git init .
git add -A .
git commit
git status
git log
git diff
git checkout
A possible project directory structure
Starting a Git project using the command line
When you’ve decided on your directory structure and want to start a version-controlled project,
do the following:
1. Start the project in a new directory. Place any work either in this directory or in subdirectories.
2. Move your interactive shell into this directory and type git init . (the trailing dot names the
current directory). It’s okay if you’ve already started working and there are already files present.
3. Exclude any subdirectories you don’t want under source control with .gitignore control files.
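The three steps above can be sketched as a short shell session. The project name buzzproject, the data and scripts subdirectories, and the ignore patterns are assumptions made for illustration, not part of the original text.

```shell
set -e
# Step 1: a new directory for the project (hypothetical name and layout).
project=$(mktemp -d)/buzzproject
mkdir -p "$project/data" "$project/scripts"
cd "$project"

# Step 2: put the directory under version control.
git init .

# Step 3: keep bulky or derived files out of source control.
printf '%s\n' 'data/' '*.RData' > .gitignore

# Show what Git currently sees as untracked.
git status --short
```

Each line of a .gitignore file is a pattern; a trailing slash, as in data/, restricts the pattern to directories.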
Using add/commit pairs to checkpoint work
As often as practical, enter these two commands into an interactive shell in your project
directory: first git add -A . to stage every change, then git commit to record the staged changes.
A good rule of thumb for Git: you should be as nervous about having uncommitted changes as you
should be about not having clicked Save. You don’t need to push/pull often, but you do need to make
local commits often (even if you later squash them with a Git technique called rebasing). Any time you
want to know about your work progress, type either git status to see if there are any edits you can put
through the add/commit cycle, or git log to see the history of your work.
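One add/commit checkpoint might look like the following sketch, run here in a throwaway directory. The file name, the commit message, and the identity values are placeholders; the -m flag supplies the commit message on the command line so that no editor opens.

```shell
set -e
cd "$(mktemp -d)"
git init -q .
# Local identity so the commit works on a fresh machine (placeholder values).
git config user.name "Example Analyst"
git config user.email "analyst@example.com"

echo 'print(1/5)' > analysis.R

git status --short          # shows analysis.R as untracked
git add -A .                # stage every change in the project
git commit -q -m "Add first analysis script"
git status --short          # prints nothing: the work tree is clean
git log --oneline           # one line per commit
```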
Using Git through RStudio
The RStudio IDE supplies a graphical user interface to Git that you should try. The add/commit
cycle can be performed as follows in RStudio:
Start a new project. From the RStudio command menu, select Project > Create Project, and
choose New Project. Then select the name of the project and the directory in which to create the
new project directory, leave the type as (Default), and make sure Create a Git Repository for this
Project is checked. When the new project pane looks something like the figure, click Create
Project, and you have a new project.
Do some work in your project. Create new files by selecting File > New > R Script. Type some
R code (like 1/5) into the editor pane and then click the Save icon to save the file. When saving
the file, be sure to choose your project directory or a subdirectory of your project.
Commit your changes to version control.
Using version control to explore your project
Git is ready to help you with any of the following tasks:
Tracking your work over time
Recovering a deleted file
Comparing two past versions of a file
Finding when you added a specific bit of text
Recovering a whole file or a bit of text from the past (undo an edit)
Sharing files with collaborators
Publicly sharing your project (à la GitHub at https://github.com/, or Bitbucket at
https://bitbucket.org)
Maintaining different versions (branches) of your work
And that’s why you want to add and commit often.
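A few of these tasks can be sketched with standard Git commands. The session below builds a two-commit history in a throwaway directory and then compares versions, searches for text, and recovers a deleted file. The file name notes.txt and its contents are hypothetical.

```shell
set -e
cd "$(mktemp -d)"
git init -q .
git config user.name "Example Analyst"
git config user.email "analyst@example.com"

# Two commits to a hypothetical notes file.
echo 'load data' > notes.txt
git add -A . && git commit -q -m "Start notes"
echo 'fit model' >> notes.txt
git add -A . && git commit -q -m "Add modeling note"

# Compare two past versions of a file.
git diff HEAD~1 HEAD -- notes.txt

# Find the commit that introduced a specific bit of text.
git log --oneline -S'fit model' -- notes.txt

# Recover a deleted file from the last committed state.
rm notes.txt
git checkout -- notes.txt
```

None of this works for edits that were never committed, which is the practical argument for frequent add/commit pairs.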
Getting help on Git
For any Git command, you can type git help [command] to get usage information. For example,
to learn about git log, type git help log. The main ways to view the detailed history of your
project are command-line tools like git log --graph --name-status and GUI tools such as RStudio
and gitk. A Git commit represents the complete state of a directory tree at a given time. A Git branch
represents a sequence of commits and changes as you move through time. Commits are immutable;
branches record progress.
The usual shared workflow is like this:
Continuously: work, work, work.
Frequently: commit results to the local repository using a git add/git commit pair.
Every once in a while: pull a copy of the remote repository into our view with some variation of
git pull and then use git push to push work upstream.
The main rule of Git is this: don’t try anything clever (push/pull, and so on) unless you’re in a
“clean” state (everything committed, confirmed with git status).
The new Git commands you need to learn are these:
git push (usually used in the git push -u origin master variation)
git pull (usually used in the git fetch; git merge -m pull master origin/master or
git pull --rebase origin master variations)
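The push and pull commands can be tried without a server. In the sketch below, a bare repository in a local directory stands in for the remote; the directory names shared.git and work are assumptions for illustration, and the branch is explicitly named master to match the commands in the text.

```shell
set -e
base=$(mktemp -d)
cd "$base"

# A bare repository in a local directory stands in for the remote server.
git init -q --bare shared.git
git --git-dir=shared.git symbolic-ref HEAD refs/heads/master

git init -q work
cd work
git symbolic-ref HEAD refs/heads/master   # use the branch name from the text
git config user.name "Example Analyst"
git config user.email "analyst@example.com"
git remote add origin "$base/shared.git"

echo 'print(1/5)' > analysis.R
git add -A . && git commit -q -m "Add analysis script"

# Confirm the clean state first, then share.
git status --short
git push -q -u origin master

# Later: bring in collaborators' work before pushing again.
git pull -q --rebase origin master
```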
git pull: rebase versus merge
Merging is what’s really happening, but the history a rebase produces is much simpler to read. The
general rule is that you should only rebase work you haven’t yet shared (in our example, Worker B
should feel free to rebase their edits to appear to be after Worker A’s edits, as Worker B hasn’t yet
successfully pushed their work anywhere). You should avoid rebasing records people have seen, as
you’re essentially hiding the edit steps they may be basing their work on.
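The Worker A and Worker B situation can be reproduced locally as a rough sketch. The directory names, file names, and the identify helper are hypothetical; a local bare repository again plays the part of the shared remote.

```shell
set -e
base=$(mktemp -d)
cd "$base"
git init -q --bare shared.git
git --git-dir=shared.git symbolic-ref HEAD refs/heads/master

# Hypothetical helper: set a placeholder identity inside a worker's repository.
identify() {
  git -C "$1" config user.name "$1"
  git -C "$1" config user.email "$1@example.com"
}

# Worker A starts the project and shares the first commit.
git init -q workerA
git -C workerA symbolic-ref HEAD refs/heads/master
identify workerA
cd "$base/workerA"
git remote add origin "$base/shared.git"
echo 'shared start' > README.txt
git add -A . && git commit -q -m "Initial commit"
git push -q -u origin master

# Worker B clones, then both workers commit independently.
cd "$base"
git clone -q shared.git workerB
identify workerB

cd "$base/workerA"
echo 'A' > a.txt
git add -A . && git commit -q -m "Worker A edit"
git push -q origin master

cd "$base/workerB"
echo 'B' > b.txt
git add -A . && git commit -q -m "Worker B edit"

# B's commit is unshared, so B may replay it on top of A's pushed work.
git pull -q --rebase origin master
git push -q origin master

git log --oneline --graph   # a straight line of three commits
```

Had Worker B used a plain merge-style pull instead, the same content would be present, but the log would show an extra merge commit joining the two lines of work.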
Deploying models
A successful data science project should include at least a demonstration deployment of any
techniques and models developed. Good documentation and presentation are vital, but at some
point people have to see things working and be able to try their own tests. We strongly encourage
partnering with a development group to produce the actual production-hardened version of your
model, but a good demonstration helps recruit these collaborators.
Deploying models as R HTTP services
One easy way to demonstrate an R model in operation
is to expose it as an HTTP service.
Listing shows how to call the HTTP service
Deploying models by export
It often makes sense to export a copy of the finished model from R, instead of attempting to
reproduce all of the details of model construction. When exporting a model, you’re depending on
development partners to handle the hard parts of hardening a model for production. Software
engineers tend to be good at project management and risk control, so export projects are also a
good opportunity to learn.
The structure of our random forest model is large but simple: a big collection of decision trees.
But the construction is time-consuming and technical. The idea is this: it can be easier to fax a
friend a solved Sudoku puzzle than to teach them your entire solution strategy.
Exporting the random forest model
A decision tree is a series of tests traditionally visualized as a diagram of decision nodes, as shown in the
top portion of the figure. The content of a decision tree is easy to store in a table where each table row
represents the facts about one decision node.
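To make the table idea concrete, here is a small sketch (not the export format used for the random forest model) in which each row of a text file describes one node and a short awk program walks the rows to score an example. The column layout, the variable names x1 and x2, the thresholds, and the score function are all hypothetical.

```shell
set -e
tree_file=$(mktemp)

# Hypothetical node table, one row per node:
# id, split variable, threshold, left child, right child, prediction.
# Leaves have child ids of 0; inner nodes have no prediction (NA).
cat > "$tree_file" <<'EOF'
1 x1 0.5 2 3 NA
2 NA NA 0 0 no
3 x2 2.0 4 5 NA
4 NA NA 0 0 no
5 NA NA 0 0 yes
EOF

# Score one example by walking from the root (node 1) to a leaf.
score() {
  awk -v x1="$1" -v x2="$2" '
    { var[$1] = $2; thr[$1] = $3; left[$1] = $4; right[$1] = $5; pred[$1] = $6 }
    END {
      val["x1"] = x1; val["x2"] = x2
      node = 1
      while (left[node] + 0 != 0) {
        # Go left when the tested variable is at or below the threshold.
        node = (val[var[node]] + 0 <= thr[node] + 0) ? left[node] : right[node]
      }
      print pred[node]
    }' "$tree_file"
}

score 0.2 9   # prints no
score 0.9 3   # prints yes
```

Because the table is plain data, a development partner can load it into whatever language the production system uses and reimplement only the short walking loop.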
Key takeaways
Use knitr to produce significant reproducible milestone/checkpoint documentation.
Write effective comments.
Use version control to save your work history.
Use version control to collaborate with others.
Make your models available to your partners for experimentation and testing.
Producing effective presentations
Table summarizes the relevant entities in our scenario, including products that are sold by our
company and by competitors.
Presenting your results to the project sponsor
The project sponsor is the person who wants the data science result—generally for the business
need that it will fill. Though project sponsors may have technical or quantitative backgrounds
and may enjoy hearing about technical details and nuances, their primary interest is business-
oriented, so you should discuss your results in terms of the business problem, with a minimum of
technical detail.
We recommend a structure similar to the following:
1. Summarize the motivation behind the project, and its goals.
2. State the project’s results.
3. Back up the results with details, as needed.
4. Discuss recommendations, outstanding issues, and possible future work.
Some people also recommend an “Executive Summary” slide: a one-slide synopsis of steps 1 and
2.
We’ll concentrate on the content of the presentations, rather than the visual format of the slides.
In an actual presentation, you’d likely prefer more visuals and less text than the slides that we
provide here.
Summarizing the project’s goals
Let’s put together the goal slides for the WVCorp buzz model example. In our example, eRead is
WVCorp’s e-book reader, which led the market until our competitor released a new version of
their e-book reader, BookBits. The new version of BookBits has a shared-bookshelves feature
that eRead doesn’t provide—though many eRead users expressed the desire for such
functionality on the forums. Unfortunately, forum traffic is so high that product managers have a
hard time keeping up, and somehow missed detecting this expression of users’ needs. Hence,
WVCorp lost market share by not anticipating the demand for the shared-bookshelf feature.
Motivation for project
Stating the project goal
Stating the project’s results
The presentation briefly describes what you did, and what the results were, in the context of the
business need.
Filling in the details
Once your audience knows what you’ve done, why, and how well you’ve succeeded (from a
business point of view), you can fill in details to help them understand more. As before, try to
keep the discussion relatively nontechnical and grounded in the business process. Good candidates
are a description of where the model fits in the business process or workflow and some examples
of interesting findings.
The “How it Works” slide shows where the buzz model fits into a product manager’s workflow.
The bottom slide of the figure presents an interesting finding from the project.
Optional slide on the modeling method
Making recommendations and discussing future work
No project ever produces a perfect outcome, and you should be up-front (but optimistic) about the
limitations of your results. In the buzz model example, we end the presentation by listing some
improvements and follow-ups that we’d like to make.
Discussing future work
The project sponsor presentation focuses on the big picture and how your results help to better
address a business need.
Project sponsor presentation takeaways
Points to remember about the project sponsor presentation:
Keep it short.
Keep it focused on the business issues, not the technical ones.
Your project sponsor might use your presentation to help sell the project or its results to the rest of the
organization. Keep that in mind when presenting background and motivation.
Introduce your results early in the presentation, rather than building up to them.