Introduction to Data Science

Dr T Prathima, Dept of IT, CBIT(A), Hyderabad


Unit-1
Introduction to data science: The Data Science Process: Roles in a data science project, Stages of a data science
project, Setting expectations.
Starting with R and data: Starting with R, working with data from files, Working with relational databases.
Exploring data: Using Summary Statistics to spot problems, Spotting problems with graphics and visualization.

Zumel, N., Mount, J., & Porzak, J., "Practical Data Science with R", 2nd edition. Shelter Island, NY: Manning, 2019.



Data Science Process
• The statistician William S. Cleveland defined data science as an interdisciplinary field larger than statistics
itself.
• We define data science as managing the process that can transform hypotheses and data into actionable
predictions.
• Typical predictive analytic goals include predicting who will win an election, what products will sell well
together, which loans will default, and which advertisements will be clicked on.
• The data scientist is responsible for acquiring and managing the data, choosing the modeling technique,
writing the code, and verifying the results.
• Because data science draws on so many disciplines, many of the best data scientists we meet started as
programmers, statisticians, business intelligence analysts, or scientists.
• By adding a few more techniques to their repertoire, they became excellent data scientists.



Examples
• Much of the theoretical basis of data science comes from statistics.
• But data science as we know it is strongly influenced by technology and
software engineering methodologies, and has largely evolved in heavily
computer science– and information technology– driven groups.
• Engineering flavor of data science - some famous examples:
• Amazon’s product recommendation systems
• Google’s advertisement valuation systems
• LinkedIn’s contact recommendation system
• Twitter’s trending topics
• Walmart’s consumer demand projection systems



Features
• All of these systems are built off large datasets. That’s not to say they’re all in the realm of big data. But
none of them could’ve been successful if they’d only used small datasets. To manage the data, these systems
require concepts from computer science: database theory, parallel programming theory, streaming data
techniques, and data warehousing.
• Most of these systems are online or live. Rather than producing a single report or analysis, the data science
team deploys a decision procedure or scoring procedure to either directly make decisions or directly show
results to a large number of end users. The production deployment is the last chance to get things right, as
the data scientist can’t always be around to explain defects.
• All of these systems are allowed to make mistakes at some non-negotiable rate.
• None of these systems are concerned with cause. They’re successful when they find useful correlations and
are not held to correctly sorting cause from effect.



Supplementary material
• https://github.com/WinVector/PDSwR2
• Beyond Spreadsheets with R by Jonathan Carroll (Manning, 2018)
• R in Action by Robert Kabacoff (now available in a second edition, http://www.manning.com/kabacoff2/), along with the text's associated website, Quick-R (http://www.statmethods.net)
• For statistics, we recommend Statistics, Fourth Edition, by David Freedman, Robert Pisani, and Roger Purves
(W. W. Norton & Company, 2007).
• zip file (https://github.com/WinVector/PDSwR2/raw/master/CodeExamples.zip)
• https://forums.manning.com/forums/practical-data-science-with-rsecond-edition



Defining Data Science
• Data science is a cross-disciplinary practice that draws on methods from data engineering, descriptive
statistics, data mining, machine learning, and predictive analytics.
• Much like operations research, data science focuses on implementing data-driven decisions and managing
their consequences.
• The data scientist is responsible for guiding a data science project from start to finish.
• Success in a data science project comes not from access to any one exotic tool, but from having quantifiable
goals, good methodology, cross-discipline interactions, and a repeatable workflow.



Roles in a Data Science Project



Roles – Project Sponsor
Suppose you’re working for a German bank. The bank feels that it’s losing too much money to bad
loans and wants to reduce its losses. To do so, they want a tool to help loan officers more
accurately detect risky loans.
• The most important role in a data science project is the project sponsor. The sponsor is the person who wants the
data science result; generally, they represent the business interests.
• The sponsor is responsible for deciding whether the project is a success or failure. The data scientist may fill the
sponsor role for their own project if they feel they know and can represent the business needs, but that’s not the
optimal arrangement.
• The ideal sponsor meets the following condition: if they’re satisfied with the project outcome, then the project is
by definition a success.
• Getting sponsor sign-off becomes the central organizing goal of a data science project.
• In the loan application example, the sponsor might be the bank’s head of Consumer Lending.

• To ensure sponsor sign-off, you must get clear goals from them through directed interviews. You attempt to
capture the sponsor’s expressed goals as quantitative statements. An example goal might be “Identify 90% of
accounts that will go into default at least two months before the first missed payment with a false positive rate
of no more than 25%.”



Roles – Client
• While the sponsor is the role that represents the business interests, the client is the role that represents the
model’s end users’ interests.
• Sometimes, the sponsor and client roles may be filled by the same person. Again, the data scientist may fill
the client role if they can weigh business trade-offs, but this isn't ideal.
• The client is more hands-on than the sponsor; they’re the interface between the technical details of building
a good model and the day-to-day work process into which the model will be deployed.
• They aren’t necessarily mathematically or statistically sophisticated, but are familiar with the relevant
business processes and serve as the domain expert on the team.
• In the loan application example, the client may be a loan officer or someone who represents the interests of
loan officers.
• As with the sponsor, you should keep the client informed and involved. Ideally, you’d like to have regular
meetings with them to keep your efforts aligned with the needs of the end users.
• Generally, the client belongs to a different group in the organization and has other responsibilities beyond
your project. Keep meetings focused, present results and progress in terms they can understand, and take
their critiques to heart.
• If the end users can’t or won’t use your model, then the project isn’t a success, in the long run.



Data Scientist
• The next role in a data science project is the data scientist, who’s responsible for taking all necessary steps to
make the project succeed, including setting the project strategy and keeping the client informed.
• They design the project steps, pick the data sources, and pick the tools to be used. Since they pick the
techniques that will be tried, they have to be well informed about statistics and machine learning.
• They’re also responsible for project planning and tracking, though they may do this with a project
management partner.
• At a more technical level, the data scientist also looks at the data, performs statistical tests and procedures,
applies machine learning models, and evaluates results—the science portion of data science.



Data Architect
• The data architect is responsible for all the data and its storage.
• Often this role is filled by someone outside of the data science group, such as a database administrator or
architect.
• Data architects often manage data warehouses for many different projects, and they may only be available
for quick consultation.



Operations
• The operations role is critical both in acquiring data and delivering the final results.
• The person filling this role usually has operational responsibilities outside of the data science group.
• For example, if you’re deploying a data science result that affects how products are sorted on an online
shopping site, then the person responsible for running the site will have a lot to say about how such a thing
can be deployed.
• This person will likely have constraints on response time, programming language, or data size that you need
to respect in deployment.
• The person in the operations role may already be supporting your sponsor or your client, so they’re often
easy to find (though their time may be already very much in demand).



Stages of a Data Science Project

• The ideal data science environment is one that encourages feedback and iteration between the data scientist and all other stakeholders.
• This is reflected in the lifecycle of a data science project.



Defining the Goal
• The first task in a data science project is to define a measurable and quantifiable goal. At this
stage, learn all that you can about the context of your project:
• Why do the sponsors want the project in the first place? What do they lack, and what
do they need?
• What are they doing to solve the problem now, and why isn’t that good enough?
• What resources will you need: what kind of data and how much staff? Will you have
domain experts to collaborate with, and what are the computational resources?
• How do the project sponsors plan to deploy your results? What are the constraints
that have to be met for successful deployment?
• Once you and the project sponsor and other stakeholders have established preliminary answers to these questions,
you and they can start defining the precise goal of the project.
• The goal should be specific and measurable; not “We want to get better at finding bad loans,” but instead “We want to
reduce our rate of loan charge-offs by at least 10%, using a model that predicts which loan applicants are likely to
default.”



Data Collection and Management
• This step encompasses identifying the data you
need, exploring it, and conditioning it to be
suitable for analysis. This stage is often the most
time-consuming step in the process. It’s also one
of the most important:
• What data is available to me?
• Will it help me solve the problem?
• Is it enough?
• Is the data quality good enough?



Modelling

• The most common data science modeling tasks are these:


• Classifying—Deciding if something belongs to one category or another
• Scoring—Predicting or estimating a numeric value, such as a price or probability
• Ranking—Learning to order items by preferences
• Clustering—Grouping items into most-similar groups
• Finding relations—Finding correlations or potential causes of effects seen in the data
• Characterizing—Very general plotting and report generation from data

The loan application problem is a classification problem: you want to identify loan applicants who are likely to
default.



Model Evaluation and Critique
• Once you have a model, you need to determine if it meets your goals:
• Is it accurate enough for your needs? Does it generalize well?
• Does it perform better than “the obvious guess”? Better than whatever estimate you currently use?
• Do the results of the model make sense in the context of the problem domain?

• Confusion matrix: a table of counts comparing the model's predicted classes against the actual classes; it is a standard way to summarize classifier performance.
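A minimal sketch of a confusion matrix in base R; the vectors truth and pred below are hypothetical values chosen only for illustration:

truth <- c("GoodLoan", "BadLoan", "GoodLoan", "GoodLoan", "BadLoan")
pred  <- c("GoodLoan", "GoodLoan", "GoodLoan", "BadLoan", "BadLoan")
conf_mat <- table(truth = truth, prediction = pred)  # rows: actual class, columns: predicted class
print(conf_mat)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)      # fraction of predictions that were correct
print(accuracy)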



Presentation and Documentation
A presentation for the model’s end users (the loan officers) should emphasize how the model will help
them do their job better:
• How should they interpret the model?
• What does the model output look like?
• If the model provides a trace of which rules in the decision tree executed, how do they read that?
• If the model provides a confidence score in addition to a classification, how should they use the confidence
score?
• When might they potentially overrule the model?



Model Deployment and Maintenance
• Finally, the model is put into operation.
• In many organizations, this means the data scientist no longer has primary responsibility for the day-to-day
operation of the model.
• But you still should ensure that the model will run smoothly and won’t make disastrous unsupervised
decisions.
• You also want to make sure that the model can be updated as its environment changes.
• And in many situations, the model will initially be deployed in a small pilot program.
• The test might bring out issues that you didn’t anticipate, and you may have to adjust the model accordingly.



Setting Expectations
• Setting expectations is a crucial part of defining the project goals and success criteria.
• The business-facing members of your team (in particular, the project sponsor) probably already have an idea
of the performance required to meet business goals:
• For example, the bank wants to reduce their losses from bad loans by at least 10%. Before you get too deep
into a project, you should make sure that the resources you have are enough for you to meet the business
goals.
• You get to know the data better during the exploration and cleaning phase; once you have a sense of the data, you can judge whether it is good enough to meet the desired performance thresholds.
• If it’s not, then you’ll have to revisit the project design and goal-setting stage.



Determining lower bounds on model performance
• Understanding how well a model should do for acceptable performance is important when defining
acceptance criteria.
• The null model represents the lower bound on model performance that you should strive for.
• You can think of the null model as being “the obvious guess” that your model must do better than.
• In situations where there’s no existing model or solution, the null model is the simplest possible model, such as always guessing the most common outcome.
• In the loan example, suppose 70% of the historical loans were good: a model that labels all loans as GoodLoan (in effect, using only the existing process to classify loans) would be correct 70% of the time.
• So you know that any actual model that you fit to the data should be better than 70% accurate to be
useful—if accuracy were your only metric.
• Since this is the simplest possible model, its error rate is called the base error rate.
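A minimal sketch of computing the null model's accuracy and the base error rate; loan_status here is a hypothetical vector standing in for the training data's outcome column:

loan_status <- c(rep("GoodLoan", 7), rep("BadLoan", 3))         # 70% good loans in this toy example
null_accuracy <- max(table(loan_status)) / length(loan_status)  # always guess the most common class
print(null_accuracy)       # 0.7: any useful model must beat this
print(1 - null_accuracy)   # 0.3: the base error rate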



Determining lower bounds on model performance
• How much better than 70% should you be? In statistics there’s a procedure called hypothesis testing, or
significance testing, that tests whether your model is equivalent to a null model (in this case, whether a new
model is basically only as accurate as guessing GoodLoan all the time).
• You want your model accuracy to be “significantly better”—in statistical terms—than 70%.
• If there is an existing model or process in place, you’d like to have an idea of its precision, recall, and false
positive rates;
• improving one of these metrics is almost always more important than considering accuracy alone. If the
purpose of your project is to improve the existing process, then the current model must be unsatisfactory for
at least one of these metrics.
• Knowing the limitations of the existing process helps you determine useful lower bounds on desired
performance.



Starting with R and data



Few resources
• R in Action, Second Edition, Robert Kabacoff, Manning, 2015
• Beyond Spreadsheets with R, Jonathan Carroll, Manning, 2018
• The Art of R Programming, Norman Matloff, No Starch Press, 2011
• R for Everyone, Second Edition, Jared P. Lander, Addison-Wesley, 2017



R coding style guides
• The Google R Style Guide
(https://google.github.io/styleguide/Rguide.html)
• Hadley Wickham's style guide from Advanced R (http://adv-r.had.co.nz/Style.html)



EXAMPLES AND THE COMMENT CHARACTER (#)
• R commands are shown as free text, with the results prefixed by the hash mark, #, which is R’s comment character.
• print(seq_len(25))
• # [1] 1 2 3 4 5 6 7 8 9 10 11 12
• # [13] 13 14 15 16 17 18 19 20 21 22 23 24
• # [25] 25
• Notice the numbers were wrapped to three lines, and each line starts with the index of the first cell reported
on the line inside square brackets



VECTORS AND LISTS
• Vectors (sequential arrays of values) are fundamental R data structures.
• Lists can hold different types in each slot;
• vectors can only hold the same primitive or atomic type in each slot.
• In addition to numeric indexing, both vectors and lists support name-keys.
• Retrieving items from a list or vector can be done with the [ ], [[ ]], and $ operators, as in the sketch below.
• Vector indexing: R vectors and lists are indexed from 1, and not from 0 as in many other programming languages.
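A small sketch of these indexing operators (values chosen only for illustration):

v <- c(10, 20, 30)               # a numeric vector
v[2]                             # 20: numeric indexing starts at 1
names(v) <- c("a", "b", "c")     # attach name-keys
v["b"]                           # 20: retrieve by name
lst <- list(x = 1, y = "two")    # a list can hold different types in each slot
lst[["x"]]                       # 1: [[ ]] returns the element itself
lst$y                            # "two": $ also retrieves by name
lst["x"]                         # [ ] returns a one-element sub-list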



Vectorised

• Another aspect of vectors in R is that most R operations are vectorized.


• A function or operator is called vectorized when applying it to a vector is shorthand for applying a function
to each entry of the vector independently.
• For example, the function nchar() counts how many characters are in a string. In R this function can be
used on a single string, or on a vector of strings
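A minimal sketch of that behavior:

nchar("data")                      # 4: applied to a single string
nchar(c("data", "science", "R"))   # 4 7 1: applied to each entry of the vector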



LOGICAL OPERATIONS
• R’s logical operators come in two flavors.
• R has standard infix scalar-valued operators that expect only one
value and have the same behavior and same names as you would see
in C or Java: && and ||.
• R also has vectorized infix operators that work on vectors of logical
values: & and |.
• Be sure to always use the scalar versions (&& and ||) in situations
such as if statements, and the vectorized versions (& and |) when
processing logical vectors.
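A small sketch contrasting the two flavors (values chosen only for illustration):

x <- 7
if (x > 0 && x < 10) {       # scalar && is appropriate inside an if statement
  print("x is between 0 and 10")
}
v <- c(1, -2, 3)
v > 0 & v < 10               # TRUE FALSE TRUE: vectorized & works elementwise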



NULL AND NA (NOT AVAILABLE) VALUES
• In R NULL is just a synonym for the empty or length-zero vector formed by using the concatenate operator
c() with no arguments.
• NULL is simply a length-zero vector.
• For example, c(c(), 1, NULL) is perfectly valid and returns the value 1.
• R uses the special value NA to mark a missing or unknown entry. For example, the vector c("a", NA, "c") is a vector of three character strings where we do not know the value of the second entry.
• Having NA is a great convenience as it allows us to annotate missing or unavailable values in place, which
can be critical in data processing.
• Also, NA means “not available,” not invalid (as NaN denotes), so NA has some convenient rules such as
the logical expression FALSE & NA simplifying to FALSE.
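A small sketch of these behaviors:

length(NULL)          # 0: NULL is the length-zero vector
c(c(), 1, NULL)       # 1: empty pieces simply disappear
x <- c("a", NA, "c")  # the second value is missing, not invalid
is.na(x)              # FALSE TRUE FALSE
FALSE & NA            # FALSE: the answer is known regardless of the missing value
TRUE & NA             # NA: here the missing value matters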



IDENTIFIERS
• Identifiers or symbol names are how R refers to variables and functions.
• The Google R Style Guide insists on writing symbol names in what is called “CamelCase” (word boundaries in
names are denoted by uppercase letters, as in “CamelCase” itself).
• The Advanced R guide recommends an underscore style where names inside identifiers are broken up with
underscores (such as “day_one” instead of “DayOne”).
• Also, many R users use a dot to break up names with identifiers (such as “day.one”).
• In particular, important built-in R types such as data.frame and packages such as data.table use
the dot notation convention.



LINE BREAKS
• It is generally recommended to keep R source code lines at 80 columns or fewer.
• R accepts multiple-line statements as long as where the statement ends is unambiguous.
• For example, to break the single statement "1 + 2" into multiple lines, write the code as follows:
• 1 +
• 2
• Do not write code like the following, as the first line is itself a valid statement, creating ambiguity:
• 1
• + 2



SEMICOLONS
• R allows semicolons as end-of-statement markers, but does not require them.
• Most style guides recommend not using semicolons in R code and certainly not using them at ends of lines.



ASSIGNMENT
• R has many assignment operators; the preferred one is <-.
• = can be used for assignment in R, but is also used to bind argument
values to argument names by name during function calls (so there is
some potential ambiguity in using =).
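For example (a minimal sketch):

x <- 5                         # preferred assignment operator
y = 6                          # also assigns, but = is better reserved for arguments
print(seq(from = 1, to = x))   # here = binds values to argument names, not assignment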



LEFT-HAND SIDES OF ASSIGNMENTS
• R allows slice expressions on the left-hand sides of assignments, and both numeric and logical array
indexing.
• This allows for very powerful array-slicing commands and coding styles.
• For example, we can replace all the missing values (denoted by NA) in a vector with zero, as shown in the following example:
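A minimal sketch of that idea (the vector d is a hypothetical example):

d <- c(1, NA, 3, NA, 5)
d[is.na(d)] <- 0   # the logical index on the left selects the missing slots
print(d)           # 1 0 3 0 5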



Factors
• R can handle many kinds of data: numeric, logical, integer, strings (called character types), and factors.
• Factors are an R type that encodes a fixed set of strings as integers.
• Factors can save a lot on storage while appearing to behave as strings
• Factors also encode the entire set of allowed values, which is useful—but can make combining data from
different sources (that saw different sets of values) a bit of a chore.
• To avoid issues, we suggest delaying conversion of strings to factors until late in an analysis.
• This is usually accomplished by adding the argument stringsAsFactors = FALSE to functions such
as data.frame() or read.table().
• We, however, do encourage using factors when you have a reason, such as wanting to use summary() or
preparing to produce dummy indicators
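A small sketch (values chosen only for illustration):

colors <- factor(c("red", "blue", "red"))
levels(colors)       # "blue" "red": the fixed set of allowed values
as.integer(colors)   # 2 1 2: factors are stored as integers referring to the levels
d <- data.frame(x = c("a", "b"), stringsAsFactors = FALSE)
class(d$x)           # "character": strings kept as strings until we choose to convert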



NAMED ARGUMENTS
• R is centered around applying functions to data.
• Functions that take a large number of arguments rapidly become confusing and illegible.
• This is why R includes a named argument feature.
• As an example, if we wanted to set our working directory to “/tmp” we would usually use the setwd()
command like so: setwd("/tmp").
• However, help(setwd) shows us the first argument to setwd() has the name dir, so we could also
write this as setwd(dir = "/tmp").
• This becomes useful for functions that have a large number of arguments, and for setting optional function
arguments.
• Note: named arguments must be set by =, and not by an assignment operator such as <-.



PACKAGE NOTATION
• In R there are two primary ways to use a function from a package.
• The first is to attach the package with the library() command and then use the function name.
• The second is to use the package name and then :: to name the function.
• An example of this second method is stats::sd(1:5).
• The :: notation is good to avoid ambiguity or to leave a reminder of which package the function came
from for when you read your own code later.



VALUE SEMANTICS
• R is unusual in that it efficiently simulates "copy by value" semantics.
• Any time a user has two references to data, each evolves independently: changes to one do not affect
the other.
• This is very desirable for part-time programmers and eliminates a large class of possible aliasing bugs
when writing code.
• In the sketch after this list, notice that d2 keeps the old value of 1 for x. This feature allows for very convenient and safe coding.
• Many programming languages protect references or pointers in function calls in this manner; however, R
protects complex values and does so in all situations (not just function calls).
• Some care has to be taken when you want to share back changes, such as invoking a final assignment
such as d2 <- d after all desired changes have been made.
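A minimal sketch of this copy-by-value behavior:

d <- data.frame(x = 1, y = 2)
d2 <- d        # d2 now has its own independent copy of the values
d$x <- 5       # changing d ...
print(d2$x)    # ... 1: d2 keeps the old value of x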



ORGANIZING INTERMEDIATE VALUES
• Long sequences of calculations can
become difficult to read, debug, and
maintain.
• To avoid this, we suggest reserving the
variable named “.” to store intermediate
values.
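A minimal sketch of the dot convention; the hypothetical data frame data with columns sort_key and revenue matches the piped example on the next slide:

data <- data.frame(sort_key = c(2, 1, 3), revenue = c(20, 10, 30))
. <- data
. <- .[order(.$sort_key), , drop = FALSE]                 # sort rows by sort_key
.$ordered_sum_revenue <- cumsum(.$revenue)                # running total of revenue
.$fraction_revenue_seen <- .$ordered_sum_revenue / sum(.$revenue)
result <- .                                               # give the final value a real name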



Piped Notation
• The R package dplyr replaces the dot notation with what is called piped notation
• library("dplyr")
result <- data %>%
arrange(., sort_key) %>%
mutate(., ordered_sum_revenue = cumsum(revenue)) %>%
mutate(., fraction_revenue_seen = ordered_sum_revenue/sum(revenue))
• Each step of this example has been replaced by the corresponding dplyr equivalent.
• arrange() is dplyr’s replacement for order(), and mutate() is dplyr’s replacement for
assignment.
• The code translation is line by line, with the minor exception that assignment is written first (even though it
happens after all other steps).
• The calculation steps are sequenced by the magrittr pipe symbol %>%.
• The magrittr pipe allows you to write any of x %>% f, x %>% f(), or x %>% f(.); all of these are equivalent to f(x).



data.frame
• The R data.frame class is designed to store data in a very good "ready for analysis" format.
• data.frames are two-dimensional arrays where each column represents a variable, measurement, or fact, and each row represents an individual or instance.
• In this format, an individual cell represents what is known about a single fact or variable for a single instance.
• data.frames are implemented as a named list of column vectors (list columns are possible, but they are
more of the exception than the rule for data.frames).
• In a data.frame, all columns have the same length, and this means we can think of the kth entry of all
columns as forming a row.
• Operations on data.frame columns tend to be efficient and vectorized.
• Adding, looking up, and removing columns is fast. Operations per row on a data.frame can be
expensive, so you should prefer vectorized column notations for large data.frame processing.
• R’s data.frame is much like a database table in that it has schema-like information: an explicit list of
column names and column types.
• Most analyses are best expressed in terms of transformations over data.frame columns.
• d <- data.frame(col1 = c(1, 2, 3), col2 = c(-1, 0, 1))
• d$col3 <- d$col1 + d$col2
• print(d)



Working with files
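A minimal sketch of reading a comma-separated file into a data.frame; the file name and the columns it implies are assumptions for illustration:

d <- read.csv("loans.csv",               # hypothetical file path
              stringsAsFactors = FALSE)  # keep strings as strings
head(d)                                  # inspect the first few rows
str(d)                                   # column names and types
summary(d)                               # quick summary of each column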



Packages for reading other data formats and sources include:
• XML—https://CRAN.R-project.org/package=XML
• MongoDB—https://CRAN.R-project.org/package=mongolite
Working with relational databases
• Relational databases scale easily to hundreds of millions of records and supply important production
features such as parallelism, consistency, transactions, logging, and audits.
• Relational databases are designed to support online transaction processing (OLTP).
• Typical tasks when working with a database include loading the data, querying, summarizing, and so on.

• Example dataset: PUMS American Community Survey data
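A minimal sketch of loading and querying data through the DBI and RSQLite packages; the table and column names are assumptions for illustration:

library("DBI")
library("RSQLite")
con <- dbConnect(RSQLite::SQLite(), ":memory:")   # in-memory database for the example
d <- data.frame(id = 1:3, income = c(30000, 52000, 41000))
dbWriteTable(con, "pums", d)                      # load the data into a table
res <- dbGetQuery(con,
  "SELECT COUNT(1) AS n, AVG(income) AS mean_income FROM pums")  # summarize with SQL
print(res)
dbDisconnect(con)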



Exploring Data
• Summary statistics
• Typical problems revealed by data summaries
• Common issues (illustrated in the sketch below):
• Missing values
• Invalid values and outliers
• Data ranges that are too wide or too narrow
• The units of the data
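A minimal sketch of spotting such issues with summary(); the data frame d below is hypothetical:

d <- data.frame(age = c(25, 41, 0, 37, 141, NA),   # 0 and 141 look like sentinel or invalid values
                income = c(32000, 58000, 41000, -5000, 61000, 27000))
summary(d)          # min, quartiles, mean, max, and NA counts for each column
colSums(is.na(d))   # count of missing values per column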



Invalid values and outliers



Exploring Data
• Spotting problems using graphics and visualization
• Avoid too many superimposed elements, such as too many curves in the same graphing space.
• Find the right aspect ratio and scaling to properly bring out the details of the data.
• Avoid having the data all skewed to one side or the other of your graph.
• Visualization is an iterative process. Its purpose is to answer questions about the data.
• Visually checking distributions for a single variable (a plotting sketch follows this list):
• Histograms
• Density plots
• Bar charts
• Dot plots
• The visualizations in this section help you answer questions like these:
• What is the peak value of the distribution?
• How many peaks are there in the distribution (unimodality versus bimodality)?
• How much does the data vary? Is it concentrated in a certain interval or in a certain category?
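A minimal sketch using the ggplot2 package; the data frame and column are hypothetical:

library("ggplot2")
d <- data.frame(age = c(22, 25, 31, 35, 38, 41, 45, 52, 58, 63))
ggplot(d, aes(x = age)) +
  geom_histogram(binwidth = 10)   # histogram: counts of observations per bin
ggplot(d, aes(x = age)) +
  geom_density()                  # density plot: smoothed view of the distribution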



Visually checking relationships between two variables
• Is there a relationship between the two inputs age and income in my data?
• If so, what kind of relationship, and how strong?
• Is there a relationship between the input marital status and the output health insurance? How strong?
• Line plots and scatter plots for comparing two continuous variables (see the sketch after this list)
• Smoothing curves and hexbin plots for comparing two continuous variables at high volume
• Different types of bar charts for comparing two discrete variables
• Variations on histograms and density plots for comparing a continuous and discrete variable
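A minimal sketch for two continuous variables with ggplot2 (hypothetical data; a linear trend line is used here just to keep the toy example simple):

library("ggplot2")
d <- data.frame(age    = c(25, 32, 41, 48, 55, 60),
                income = c(28000, 45000, 52000, 61000, 58000, 40000))
ggplot(d, aes(x = age, y = income)) +
  geom_point() +                   # scatter plot of the raw points
  geom_smooth(method = "lm")       # trend line summarizing the relationship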



Facet plots, also known as trellis plots or small multiples, are figures made up of multiple subplots which have the same set of axes, where each subplot shows a subset of the data.
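A minimal sketch of a facet plot with ggplot2 (hypothetical data; the grouping variable sex is an assumption for illustration):

library("ggplot2")
d <- data.frame(age    = c(25, 32, 41, 48, 55, 29, 37, 44, 51, 60),
                income = c(28000, 45000, 52000, 61000, 58000,
                           30000, 47000, 50000, 56000, 43000),
                sex    = rep(c("Female", "Male"), each = 5))
ggplot(d, aes(x = age, y = income)) +
  geom_point() +
  facet_wrap(~ sex)   # one subplot per value of sex, all sharing the same axes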


