Introduction to Data Science

Dr T Prathima, Dept of IT, CBIT(A), Hyderabad


Unit-1
Introduction to data science: The Data Science Process: Roles in a data science project, Stages of a data science
project, Setting expectations.
Starting with R and data: Starting with R, working with data from files, Working with relational databases.
Exploring data: Using Summary Statistics to spot problems, Spotting problems with graphics and visualization.

Zumel, N., Mount, J., & Porzak, J., "Practical Data Science with R", 2nd edition. Shelter Island, NY: Manning, 2019.



Data Science Process
• The statistician William S. Cleveland defined data science as an interdisciplinary field larger than statistics
itself.
• We define data science as managing the process that can transform hypotheses and data into actionable
predictions.
• Typical predictive analytic goals include predicting who will win an election, what products will sell well
together, which loans will default, and which advertisements will be clicked on.
• The data scientist is responsible for acquiring and managing the data, choosing the modeling technique,
writing the code, and verifying the results.
• Because data science draws on so many disciplines, many of the best data scientists we meet started as
programmers, statisticians, business intelligence analysts, or scientists.
• By adding a few more techniques to their repertoire, they became excellent data scientists.



Examples
• Much of the theoretical basis of data science comes from statistics.
• But data science as we know it is strongly influenced by technology and
software engineering methodologies, and has largely evolved in heavily
computer science– and information technology– driven groups.
• Engineering flavor of data science - some famous examples:
• Amazon’s product recommendation systems
• Google’s advertisement valuation systems
• LinkedIn’s contact recommendation system
• Twitter’s trending topics
• Walmart’s consumer demand projection systems



Features
• All of these systems are built off large datasets. That’s not to say they’re all in the realm of big data. But
none of them could’ve been successful if they’d only used small datasets. To manage the data, these systems
require concepts from computer science: database theory, parallel programming theory, streaming data
techniques, and data warehousing.
• Most of these systems are online or live. Rather than producing a single report or analysis, the data science
team deploys a decision procedure or scoring procedure to either directly make decisions or directly show
results to a large number of end users. The production deployment is the last chance to get things right, as
the data scientist can’t always be around to explain defects.
• All of these systems are allowed to make mistakes at some non-negotiable rate.
• None of these systems are concerned with cause. They’re successful when they find useful correlations and
are not held to correctly sorting cause from effect.



Supplementary material
• https://github.com/WinVector/PDSwR2
• Beyond Spreadsheets with R by Jonathan Carroll (Manning, 2018)
• R in Action by Robert Kabacoff (now available in a second edition, http://www.manning.com/kabacoff2/), along with the text's associated website, Quick-R (http://www.statmethods.net)
• For statistics, we recommend Statistics, Fourth Edition, by David Freedman, Robert Pisani, and Roger Purves
(W. W. Norton & Company, 2007).
• zip file (https://github.com/WinVector/PDSwR2/raw/master/CodeExamples.zip)
• https://forums.manning.com/forums/practical-data-science-with-rsecond-edition



Defining Data Science
• Data science is a cross-disciplinary practice that draws on methods from data engineering, descriptive
statistics, data mining, machine learning, and predictive analytics.
• Much like operations research, data science focuses on implementing data-driven decisions and managing
their consequences.
• The data scientist is responsible for guiding a data science project from start to finish.
• Success in a data science project comes not from access to any one exotic tool, but from having quantifiable
goals, good methodology, cross-discipline interactions, and a repeatable workflow.



Roles in a Data Science Project



Roles – Project Sponsor
Suppose you’re working for a German bank. The bank feels that it’s losing too much money to bad
loans and wants to reduce its losses. To do so, they want a tool to help loan officers more
accurately detect risky loans.
• The most important role in a data science project is the project sponsor. The sponsor is the person who wants the
data science result; generally, they represent the business interests.
• The sponsor is responsible for deciding whether the project is a success or failure. The data scientist may fill the
sponsor role for their own project if they feel they know and can represent the business needs, but that’s not the
optimal arrangement.
• The ideal sponsor meets the following condition: if they’re satisfied with the project outcome, then the project is
by definition a success.
• Getting sponsor sign-off becomes the central organizing goal of a data science project.
• In the loan application example, the sponsor might be the bank’s head of Consumer Lending.

• To ensure sponsor sign-off, you must get clear goals from them through directed interviews. You attempt to
capture the sponsor’s expressed goals as quantitative statements. An example goal might be “Identify 90% of
accounts that will go into default at least two months before the first missed payment with a false positive rate
of no more than 25%.”



Roles – Client
• While the sponsor is the role that represents the business interests, the client is the role that represents the
model’s end users’ interests.
• Sometimes, the sponsor and client roles may be filled by the same person. Again, the data scientist may fill
the client role if they can weigh business trade-offs, but this isn't ideal.
• The client is more hands-on than the sponsor; they’re the interface between the technical details of building
a good model and the day-to-day work process into which the model will be deployed.
• They aren’t necessarily mathematically or statistically sophisticated, but are familiar with the relevant
business processes and serve as the domain expert on the team.
• In the loan application example, the client may be a loan officer or someone who represents the interests of
loan officers.
• As with the sponsor, you should keep the client informed and involved. Ideally, you’d like to have regular
meetings with them to keep your efforts aligned with the needs of the end users.
• Generally, the client belongs to a different group in the organization and has other responsibilities beyond
your project. Keep meetings focused, present results and progress in terms they can understand, and take
their critiques to heart.
• If the end users can’t or won’t use your model, then the project isn’t a success, in the long run.



Data Scientist
• The next role in a data science project is the data scientist, who’s responsible for taking all necessary steps to
make the project succeed, including setting the project strategy and keeping the client informed.
• They design the project steps, pick the data sources, and pick the tools to be used. Since they pick the
techniques that will be tried, they have to be well informed about statistics and machine learning.
• They’re also responsible for project planning and tracking, though they may do this with a project
management partner.
• At a more technical level, the data scientist also looks at the data, performs statistical tests and procedures,
applies machine learning models, and evaluates results—the science portion of data science.



Data Architect
• The data architect is responsible for all the data and its storage.
• Often this role is filled by someone outside of the data science group, such as a database administrator or
architect.
• Data architects often manage data warehouses for many different projects, and they may only be available
for quick consultation.



Operations
• The operations role is critical both in acquiring data and delivering the final results.
• The person filling this role usually has operational responsibilities outside of the data science group.
• For example, if you’re deploying a data science result that affects how products are sorted on an online
shopping site, then the person responsible for running the site will have a lot to say about how such a thing
can be deployed.
• This person will likely have constraints on response time, programming language, or data size that you need
to respect in deployment.
• The person in the operations role may already be supporting your sponsor or your client, so they’re often
easy to find (though their time may be already very much in demand).



Stages of a Data Science Project

• The ideal data science environment is one that encourages feedback and iteration between the data scientist and all other stakeholders.
• This is reflected in the lifecycle of a data science project.



Defining the Goal
• The first task in a data science project is to define a measurable and quantifiable goal. At this
stage, learn all that you can about the context of your project:
• Why do the sponsors want the project in the first place? What do they lack, and what
do they need?
• What are they doing to solve the problem now, and why isn’t that good enough?
• What resources will you need: what kind of data and how much staff? Will you have
domain experts to collaborate with, and what are the computational resources?
• How do the project sponsors plan to deploy your results? What are the constraints
that have to be met for successful deployment?
• Once you and the project sponsor and other stakeholders have established preliminary answers to these questions,
you and they can start defining the precise goal of the project.
• The goal should be specific and measurable; not “We want to get better at finding bad loans,” but instead “We want to
reduce our rate of loan charge-offs by at least 10%, using a model that predicts which loan applicants are likely to
default.”



Data Collection and Management
• This step encompasses identifying the data you
need, exploring it, and conditioning it to be
suitable for analysis. This stage is often the most
time-consuming step in the process. It’s also one
of the most important:
• What data is available to me?
• Will it help me solve the problem?
• Is it enough?
• Is the data quality good enough?



Modelling

• The most common data science modeling tasks are these:


• Classifying—Deciding if something belongs to one category or another
• Scoring—Predicting or estimating a numeric value, such as a price or probability
• Ranking—Learning to order items by preferences
• Clustering—Grouping items into most-similar groups
• Finding relations—Finding correlations or potential causes of effects seen in the data
• Characterizing—Very general plotting and report generation from data

The loan application problem is a classification problem: you want to identify loan applicants who are likely to
default.



Model Evaluation and Critique
• Once you have a model, you need to determine if it meets your goals:
• Is it accurate enough for your needs? Does it generalize well?
• Does it perform better than “the obvious guess”? Better than whatever estimate you currently use?
• Do the results of the model make sense in the context of the problem domain?

• Confusion matrix: a table of counts comparing the model's predicted classes against the actual classes; it is a standard way to summarize classifier performance.
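A minimal sketch of a confusion matrix in base R; the vectors truth and pred below are hypothetical values chosen only for illustration:

truth <- c("GoodLoan", "BadLoan", "GoodLoan", "GoodLoan", "BadLoan")
pred  <- c("GoodLoan", "GoodLoan", "GoodLoan", "BadLoan", "BadLoan")
conf_mat <- table(truth = truth, prediction = pred)  # rows: actual class, columns: predicted class
print(conf_mat)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)      # fraction of predictions that were correct
print(accuracy)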



Presentation and Documentation
A presentation for the model’s end users (the loan officers) should emphasize how the model will help
them do their job better:
• How should they interpret the model?
• What does the model output look like?
• If the model provides a trace of which rules in the decision tree executed, how do they read that?
• If the model provides a confidence score in addition to a classification, how should they use the confidence
score?
• When might they potentially overrule the model?



Model Deployment and Maintenance
• Finally, the model is put into operation.
• In many organizations, this means the data scientist no longer has primary responsibility for the day-to-day
operation of the model.
• But you still should ensure that the model will run smoothly and won’t make disastrous unsupervised
decisions.
• You also want to make sure that the model can be updated as its environment changes.
• And in many situations, the model will initially be deployed in a small pilot program.
• The test might bring out issues that you didn’t anticipate, and you may have to adjust the model accordingly.



Setting Expectations
• Setting expectations is a crucial part of defining the project goals and success criteria.
• The business-facing members of your team (in particular, the project sponsor) probably already have an idea
of the performance required to meet business goals:
• For example, the bank wants to reduce their losses from bad loans by at least 10%. Before you get too deep
into a project, you should make sure that the resources you have are enough for you to meet the business
goals.
• You get to know the data better during the exploration and cleaning phase; once you have a sense of the data, you can judge whether it is good enough to meet the desired performance thresholds.
• If it’s not, then you’ll have to revisit the project design and goal-setting stage.



Determining lower bounds on model performance
• Understanding how well a model should do for acceptable performance is important when defining
acceptance criteria.
• The null model represents the lower bound on model performance that you should strive for.
• You can think of the null model as being “the obvious guess” that your model must do better than.
• In situations where there’s no existing model or solution, the null model is the simplest possible model, such as always guessing the most common outcome.
• In the loan example, suppose 70% of the historical loans were good: a model that labels all loans as GoodLoan (in effect, using only the existing process to classify loans) would be correct 70% of the time.
• So you know that any actual model that you fit to the data should be better than 70% accurate to be
useful—if accuracy were your only metric.
• Since this is the simplest possible model, its error rate is called the base error rate.
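A minimal sketch of computing the null model's accuracy and the base error rate; loan_status here is a hypothetical vector standing in for the training data's outcome column:

loan_status <- c(rep("GoodLoan", 7), rep("BadLoan", 3))         # 70% good loans in this toy example
null_accuracy <- max(table(loan_status)) / length(loan_status)  # always guess the most common class
print(null_accuracy)       # 0.7: any useful model must beat this
print(1 - null_accuracy)   # 0.3: the base error rate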



Determining lower bounds on model performance
• How much better than 70% should you be? In statistics there’s a procedure called hypothesis testing, or
significance testing, that tests whether your model is equivalent to a null model (in this case, whether a new
model is basically only as accurate as guessing GoodLoan all the time).
• You want your model accuracy to be “significantly better”—in statistical terms—than 70%.
• If there is an existing model or process in place, you’d like to have an idea of its precision, recall, and false
positive rates;
• improving one of these metrics is almost always more important than considering accuracy alone. If the
purpose of your project is to improve the existing process, then the current model must be unsatisfactory for
at least one of these metrics.
• Knowing the limitations of the existing process helps you determine useful lower bounds on desired
performance.



Starting with R and data



Few resources
• R in Action, Second Edition, Robert Kabacoff, Manning, 2015
• Beyond Spreadsheets with R, Jonathan Carroll, Manning, 2018
• The Art of R Programming, Norman Matloff, No Starch Press, 2011
• R for Everyone, Second Edition, Jared P. Lander, Addison-Wesley, 2017



R coding style guides
• The Google R Style Guide
(https://google.github.io/styleguide/Rguide.html)
• Hadley Wickham's style guide from Advanced R (http://adv-r.had.co.nz/Style.html)



EXAMPLES AND THE COMMENT CHARACTER (#)
• R commands are shown as free text, with the results prefixed by the hash mark, #, which is R’s comment character.
• print(seq_len(25))
• # [1] 1 2 3 4 5 6 7 8 9 10 11 12
• # [13] 13 14 15 16 17 18 19 20 21 22 23 24
• # [25] 25
• Notice the numbers were wrapped to three lines, and each line starts with the index of the first cell reported
on the line inside square brackets



VECTORS AND LISTS
• Vectors (sequential arrays of values) are fundamental R data structures.
• Lists can hold different types in each slot;
• vectors can only hold the same primitive or atomic type in each slot.
• In addition to numeric indexing, both vectors and lists support name-keys.
• Retrieving items from a list or vector can be done with the [ ], [[ ]], and $ operators, as in the sketch below.
• Vector indexing: R vectors and lists are indexed from 1, and not from 0 as in many other programming languages.
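A small sketch of these indexing operators (values chosen only for illustration):

v <- c(10, 20, 30)               # a numeric vector
v[2]                             # 20: numeric indexing starts at 1
names(v) <- c("a", "b", "c")     # attach name-keys
v["b"]                           # 20: retrieve by name
lst <- list(x = 1, y = "two")    # a list can hold different types in each slot
lst[["x"]]                       # 1: [[ ]] returns the element itself
lst$y                            # "two": $ also retrieves by name
lst["x"]                         # [ ] returns a one-element sub-list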



Vectorised

• Another aspect of vectors in R is that most R operations are vectorized.


• A function or operator is called vectorized when applying it to a vector is shorthand for applying a function
to each entry of the vector independently.
• For example, the function nchar() counts how many characters are in a string. In R this function can be
used on a single string, or on a vector of strings
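A minimal sketch of that behavior:

nchar("data")                      # 4: applied to a single string
nchar(c("data", "science", "R"))   # 4 7 1: applied to each entry of the vector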



LOGICAL OPERATIONS
• R’s logical operators come in two flavors.
• R has standard infix scalar-valued operators that expect only one
value and have the same behavior and same names as you would see
in C or Java: && and ||.
• R also has vectorized infix operators that work on vectors of logical
values: & and |.
• Be sure to always use the scalar versions (&& and ||) in situations
such as if statements, and the vectorized versions (& and |) when
processing logical vectors.
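A small sketch contrasting the two flavors (values chosen only for illustration):

x <- 7
if (x > 0 && x < 10) {       # scalar && is appropriate inside an if statement
  print("x is between 0 and 10")
}
v <- c(1, -2, 3)
v > 0 & v < 10               # TRUE FALSE TRUE: vectorized & works elementwise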



NULL AND NA (NOT AVAILABLE) VALUES
• In R NULL is just a synonym for the empty or length-zero vector formed by using the concatenate operator
c() with no arguments.
• NULL is simply a length-zero vector.
• For example, c(c(), 1, NULL) is perfectly valid and returns the value 1.
• R uses the special value NA to mark a missing or unknown entry. For example, the vector c("a", NA, "c") is a vector of three character strings where we do not know the value of the second entry.
• Having NA is a great convenience as it allows us to annotate missing or unavailable values in place, which
can be critical in data processing.
• Also, NA means “not available,” not invalid (as NaN denotes), so NA has some convenient rules such as
the logical expression FALSE & NA simplifying to FALSE.
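A small sketch of these behaviors:

length(NULL)          # 0: NULL is the length-zero vector
c(c(), 1, NULL)       # 1: empty pieces simply disappear
x <- c("a", NA, "c")  # the second value is missing, not invalid
is.na(x)              # FALSE TRUE FALSE
FALSE & NA            # FALSE: the answer is known regardless of the missing value
TRUE & NA             # NA: here the missing value matters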



IDENTIFIERS
• Identifiers or symbol names are how R refers to variables and functions.
• The Google R Style Guide insists on writing symbol names in what is called “CamelCase” (word boundaries in
names are denoted by uppercase letters, as in “CamelCase” itself).
• The Advanced R guide recommends an underscore style where names inside identifiers are broken up with
underscores (such as “day_one” instead of “DayOne”).
• Also, many R users use a dot to break up names with identifiers (such as “day.one”).
• In particular, important built-in R types such as data.frame and packages such as data.table use
the dot notation convention.



LINE BREAKS
• It is generally recommended to keep R source code lines at 80 columns or fewer.
• R accepts multiple-line statements as long as where the statement ends is unambiguous.
• For example, to break the single statement "1 + 2" into multiple lines, write the code as follows:
• 1 +
• 2
• Do not write code like the following, as the first line is itself a valid statement, creating ambiguity:
• 1
• + 2



SEMICOLONS
• R allows semicolons as end-of-statement markers, but does not require them.
• Most style guides recommend not using semicolons in R code and certainly not using them at ends of lines.



ASSIGNMENT
• R has many assignment operators; the preferred one is <-.
• = can be used for assignment in R, but is also used to bind argument
values to argument names by name during function calls (so there is
some potential ambiguity in using =).
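For example (a minimal sketch):

x <- 5                         # preferred assignment operator
y = 6                          # also assigns, but = is better reserved for arguments
print(seq(from = 1, to = x))   # here = binds values to argument names, not assignment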



LEFT-HAND SIDES OF ASSIGNMENTS
• R allows slice expressions on the left-hand sides of assignments, and both numeric and logical array
indexing.
• This allows for very powerful array-slicing commands and coding styles.
• For example, we can replace all the missing values (denoted by NA) in a vector with zero, as shown in the following example:
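A minimal sketch of that idea (the vector d is a hypothetical example):

d <- c(1, NA, 3, NA, 5)
d[is.na(d)] <- 0   # the logical index on the left selects the missing slots
print(d)           # 1 0 3 0 5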



Factors
• R can handle many kinds of data: numeric, logical, integer, strings (called character types), and factors.
• Factors are an R type that encodes a fixed set of strings as integers.
• Factors can save a lot on storage while appearing to behave as strings
• Factors also encode the entire set of allowed values, which is useful—but can make combining data from
different sources (that saw different sets of values) a bit of a chore.
• To avoid issues, we suggest delaying conversion of strings to factors until late in an analysis.
• This is usually accomplished by adding the argument stringsAsFactors = FALSE to functions such
as data.frame() or read.table().
• We, however, do encourage using factors when you have a reason, such as wanting to use summary() or
preparing to produce dummy indicators
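A small sketch (values chosen only for illustration):

colors <- factor(c("red", "blue", "red"))
levels(colors)       # "blue" "red": the fixed set of allowed values
as.integer(colors)   # 2 1 2: factors are stored as integers referring to the levels
d <- data.frame(x = c("a", "b"), stringsAsFactors = FALSE)
class(d$x)           # "character": strings kept as strings until we choose to convert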



NAMED ARGUMENTS
• R is centered around applying functions to data.
• Functions that take a large number of arguments rapidly become confusing and illegible.
• This is why R includes a named argument feature.
• As an example, if we wanted to set our working directory to “/tmp” we would usually use the setwd()
command like so: setwd("/tmp").
• However, help(setwd) shows us the first argument to setwd() has the name dir, so we could also
write this as setwd(dir = "/tmp").
• This becomes useful for functions that have a large number of arguments, and for setting optional function
arguments.
• Note: named arguments must be set by =, and not by an assignment operator such as <-.



PACKAGE NOTATION
• In R there are two primary ways to use a function from a package.
• The first is to attach the package with the library() command and then use the function name.
• The second is to use the package name and then :: to name the function.
• An example of this second method is stats::sd(1:5).
• The :: notation is good to avoid ambiguity or to leave a reminder of which package the function came
from for when you read your own code later.



VALUE SEMANTICS
• R is unusual in that it efficiently simulates "copy by value" semantics.
• Any time a user has two references to data, each evolves independently: changes to one do not affect
the other.
• This is very desirable for part-time programmers and eliminates a large class of possible aliasing bugs
when writing code.
• In the sketch after this list, notice that d2 keeps the old value of 1 for x. This feature allows for very convenient and safe coding.
• Many programming languages protect references or pointers in function calls in this manner; however, R
protects complex values and does so in all situations (not just function calls).
• Some care has to be taken when you want to share back changes, such as invoking a final assignment
such as d2 <- d after all desired changes have been made.
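A minimal sketch of this copy-by-value behavior:

d <- data.frame(x = 1, y = 2)
d2 <- d        # d2 now has its own independent copy of the values
d$x <- 5       # changing d ...
print(d2$x)    # ... 1: d2 keeps the old value of x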



ORGANIZING INTERMEDIATE VALUES
• Long sequences of calculations can
become difficult to read, debug, and
maintain.
• To avoid this, we suggest reserving the
variable named “.” to store intermediate
values.
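A minimal sketch of the dot convention; the hypothetical data frame data with columns sort_key and revenue matches the piped example on the next slide:

data <- data.frame(sort_key = c(2, 1, 3), revenue = c(20, 10, 30))
. <- data
. <- .[order(.$sort_key), , drop = FALSE]                 # sort rows by sort_key
.$ordered_sum_revenue <- cumsum(.$revenue)                # running total of revenue
.$fraction_revenue_seen <- .$ordered_sum_revenue / sum(.$revenue)
result <- .                                               # give the final value a real name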



Piped Notation
• The R package dplyr replaces the dot notation with what is called piped notation
• library("dplyr")
result <- data %>%
arrange(., sort_key) %>%
mutate(., ordered_sum_revenue = cumsum(revenue)) %>%
mutate(., fraction_revenue_seen = ordered_sum_revenue/sum(revenue))
• Each step of this example has been replaced by the corresponding dplyr equivalent.
• arrange() is dplyr’s replacement for order(), and mutate() is dplyr’s replacement for
assignment.
• The code translation is line by line, with the minor exception that assignment is written first (even though it
happens after all other steps).
• The calculation steps are sequenced by the magrittr pipe symbol %>%.
• The magrittr pipe allows you to write any of x %>% f, x %>% f(), or x %>% f(.); all of these are equivalent to f(x).



data.frame
• The R data.frame class is designed to store data in a very good "ready for analysis" format.
• data.frames are two-dimensional arrays where each column represents a variable, measurement, or fact, and each row represents an individual or instance.
• In this format, an individual cell represents what is known about a single fact or variable for a single instance.
• data.frames are implemented as a named list of column vectors (list columns are possible, but they are
more of the exception than the rule for data.frames).
• In a data.frame, all columns have the same length, and this means we can think of the kth entry of all
columns as forming a row.
• Operations on data.frame columns tend to be efficient and vectorized.
• Adding, looking up, and removing columns is fast. Operations per row on a data.frame can be
expensive, so you should prefer vectorized column notations for large data.frame processing.
• R’s data.frame is much like a database table in that it has schema-like information: an explicit list of
column names and column types.
• Most analyses are best expressed in terms of transformations over data.frame columns.
• d <- data.frame(col1 = c(1, 2, 3), col2 = c(-1, 0, 1))
• d$col3 <- d$col1 + d$col2
• print(d)



Working with files
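A minimal sketch of reading a comma-separated file into a data.frame; the file name and the columns it implies are assumptions for illustration:

d <- read.csv("loans.csv",               # hypothetical file path
              stringsAsFactors = FALSE)  # keep strings as strings
head(d)                                  # inspect the first few rows
str(d)                                   # column names and types
summary(d)                               # quick summary of each column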



Packages for reading other data formats and sources include:
• XML—https://CRAN.R-project.org/package=XML
• MongoDB—https://CRAN.R-project.org/package=mongolite
Working with relational databases
• Relational databases scale easily to hundreds of millions of records and supply important production
features such as parallelism, consistency, transactions, logging, and audits.
• Relational databases are designed to support online transaction processing (OLTP).
• Typical tasks when working with a database include loading the data, querying, summarizing, and so on.

• Example dataset: PUMS American Community Survey data
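A minimal sketch of loading and querying data through the DBI and RSQLite packages; the table and column names are assumptions for illustration:

library("DBI")
library("RSQLite")
con <- dbConnect(RSQLite::SQLite(), ":memory:")   # in-memory database for the example
d <- data.frame(id = 1:3, income = c(30000, 52000, 41000))
dbWriteTable(con, "pums", d)                      # load the data into a table
res <- dbGetQuery(con,
  "SELECT COUNT(1) AS n, AVG(income) AS mean_income FROM pums")  # summarize with SQL
print(res)
dbDisconnect(con)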



Exploring Data
• Summary statistics
• Typical problems revealed by data summaries
• Common issues (illustrated in the sketch below):
• Missing values
• Invalid values and outliers
• Data ranges that are too wide or too narrow
• The units of the data
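A minimal sketch of spotting such issues with summary(); the data frame d below is hypothetical:

d <- data.frame(age = c(25, 41, 0, 37, 141, NA),   # 0 and 141 look like sentinel or invalid values
                income = c(32000, 58000, 41000, -5000, 61000, 27000))
summary(d)          # min, quartiles, mean, max, and NA counts for each column
colSums(is.na(d))   # count of missing values per column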



Invalid values and outliers



Exploring Data
• Spotting problems using graphics and visualization
• Avoid too many superimposed elements, such as too many curves in the same graphing space.
• Find the right aspect ratio and scaling to properly bring out the details of the data.
• Avoid having the data all skewed to one side or the other of your graph.
• Visualization is an iterative process. Its purpose is to answer questions about the data.
• Visually checking distributions for a single variable (a plotting sketch follows this list):
• Histograms
• Density plots
• Bar charts
• Dot plots
• The visualizations in this section help you answer questions like these:
• What is the peak value of the distribution?
• How many peaks are there in the distribution (unimodality versus bimodality)?
• How much does the data vary? Is it concentrated in a certain interval or in a certain category?
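A minimal sketch using the ggplot2 package; the data frame and column are hypothetical:

library("ggplot2")
d <- data.frame(age = c(22, 25, 31, 35, 38, 41, 45, 52, 58, 63))
ggplot(d, aes(x = age)) +
  geom_histogram(binwidth = 10)   # histogram: counts of observations per bin
ggplot(d, aes(x = age)) +
  geom_density()                  # density plot: smoothed view of the distribution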



Visually checking relationships between two variables
• Is there a relationship between the two inputs age and income in my data?
• If so, what kind of relationship, and how strong?
• Is there a relationship between the input marital status and the output health insurance? How strong?
• Line plots and scatter plots for comparing two continuous variables (see the sketch after this list)
• Smoothing curves and hexbin plots for comparing two continuous variables at high volume
• Different types of bar charts for comparing two discrete variables
• Variations on histograms and density plots for comparing a continuous and discrete variable
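A minimal sketch for two continuous variables with ggplot2 (hypothetical data; a linear trend line is used here just to keep the toy example simple):

library("ggplot2")
d <- data.frame(age    = c(25, 32, 41, 48, 55, 60),
                income = c(28000, 45000, 52000, 61000, 58000, 40000))
ggplot(d, aes(x = age, y = income)) +
  geom_point() +                   # scatter plot of the raw points
  geom_smooth(method = "lm")       # trend line summarizing the relationship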



Facet plots, also known as trellis plots or small multiples, are figures made up of multiple subplots which have the same set of axes, where each subplot shows a subset of the data.
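A minimal sketch of a facet plot with ggplot2 (hypothetical data; the grouping variable sex is an assumption for illustration):

library("ggplot2")
d <- data.frame(age    = c(25, 32, 41, 48, 55, 29, 37, 44, 51, 60),
                income = c(28000, 45000, 52000, 61000, 58000,
                           30000, 47000, 50000, 56000, 43000),
                sex    = rep(c("Female", "Male"), each = 5))
ggplot(d, aes(x = age, y = income)) +
  geom_point() +
  facet_wrap(~ sex)   # one subplot per value of sex, all sharing the same axes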


