
Introduction to Data Science and Big Data

Shital R. Bedse
Content
● Basics and need of Data Science and Big Data,
● Applications of Data Science,
● Data explosion,
● 5 V’s of Big Data,
● Relationship between Data Science and Information Science,
● Business intelligence versus Data Science,
● Data Science Life Cycle,
● Data: Data Types, Data Collection.
● Need of Data wrangling, Methods: Data Cleaning, Data Integration, Data
Reduction, Data Transformation, Data Discretization.

2
Data Science

3
Data Science
Data science is the study of data to extract meaningful
insights for business. It is a multidisciplinary approach that
combines principles and practices from the fields of
mathematics, statistics, artificial intelligence, and computer
engineering to analyze large amounts of data.

4
History of Data Science
The term first appeared in the ’60s as an alternative name
for statistics.
In the late ’90s, computer science professionals formalized
the term.
A proposed definition for data science saw it as a separate
field with three aspects: data design, collection, and
analysis. It still took another decade for the term to be used
outside of academia.
5
Why Is Data Science Important?
Data science is important because it combines tools, methods, and
technology to generate meaning from data. Modern organizations
are inundated with data; there is a proliferation of devices that can
automatically collect and store information. Online systems and
payment portals capture more data in the fields of e-commerce,
medicine, finance, and every other aspect of human life. We have
text, audio, video, and image data available in vast quantities.

6
Why Data Science?
• Data is the oil of today’s world. With the right tools, technologies, and
algorithms, we can use data and convert it into a distinctive business
advantage
• Data Science can help you detect fraud using advanced machine
learning algorithms
• It helps you prevent significant monetary losses
• Allows you to build intelligent capabilities into machines
• You can perform sentiment analysis to gauge customer brand
loyalty
7
Why Data Science?
• It enables you to make better and faster decisions
• Helps you recommend the right product to the right customer to
enhance your business

8
Application of Data Science

9
1. In Search Engines:

Search engines are among the most visible applications of Data Science. When
we want to search for something on the internet, we mostly use search engines
like Google, Bing, Yahoo, etc. Data Science is used to return relevant results
faster.
For example, when we search for “Data Structure and algorithm courses”, the
first link shown is often for GeeksforGeeks courses. This happens because the
GeeksforGeeks website is visited most often for information about Data
Structures and related computer science subjects. This ranking analysis is done
using Data Science, which surfaces the most visited and most relevant web links.

10
2. In Transport

Data Science has also entered real-time domains such as transport, for
example driverless cars. With the help of driverless cars, it becomes easier
to reduce the number of accidents.
For example, in driverless cars the training data is fed into the algorithm,
and with the help of Data Science techniques the data is analyzed: what the
speed limit is on highways, busy streets, and narrow roads, how to handle
different situations while driving, and so on.
11
3. In Finance

Data Science plays a key role in the financial industry, which constantly
faces problems of fraud and risk of losses. Financial firms therefore need to
automate risk-of-loss analysis in order to make strategic decisions for the
company. They also use Data Science analytics tools to make predictions,
for example of customer lifetime value and stock market movements.
For example, in the stock market Data Science is used to examine past
behavior with historical data, with the goal of estimating future outcomes.
Data is analyzed in such a way that it becomes possible to predict future
stock prices over a set time horizon.

12
4. In E-Commerce

E-commerce websites like Amazon, Flipkart, etc. use Data Science to
create a better user experience with personalized recommendations.
For example, when we search for something on an e-commerce website,
we get suggestions similar to our choices based on our past data, and we
also get recommendations based on the most-bought, most-rated, and
most-searched products. This is all done with the help of Data Science.
13
5. In Health Care

In the healthcare industry, Data Science acts as a boon. Data Science is used for:
● Tumor detection.
● Drug discovery.
● Medical image analysis.
● Virtual medical bots.
● Genetics and genomics.
● Predictive modeling for diagnosis, etc.

14
6. Image Recognition

Currently, Data Science is also used in image recognition.
For example, when we upload a photo with a friend on Facebook, Facebook
suggests tagging the people in the picture. This is done with the help of
machine learning and Data Science: when an image is recognized, analysis is
run against one’s Facebook friends, and if a face in the picture matches
someone’s profile, Facebook suggests auto-tagging.

15
7. Airline Route Planning

With the help of Data Science, the airline sector is also improving: for example, it
becomes easier to predict flight delays. Data Science also helps decide whether to
fly directly to the destination or take a halt in between; a flight can take a direct
route from Delhi to the U.S.A. or halt in between before reaching the destination.

16
8. Targeted Recommendation

Targeted recommendation is one of the most important applications of Data Science.
Whatever a user searches for on the internet, he or she will then see related posts
everywhere. This can be explained with an example: suppose I want a mobile phone,
so I search for it on Google, but then change my mind and decide to buy it offline.
Data Science helps the companies that are paying for advertisements for that mobile:
everywhere on the internet, in social media, on websites, and in apps, I will see
recommendations for the phone I searched for, which nudges me to buy it online.

17
9. Data Science in Gaming

In most games where a user plays against a computer opponent, data science
concepts are used together with machine learning: with the help of past data,
the computer improves its performance. Many games, such as chess and EA
Sports titles, use Data Science concepts.

18
10. Medicine and Drug Development

The process of creating a medicine is very difficult and time-consuming and has
to be done with full discipline because it is a matter of someone’s life.
Without Data Science, it takes a lot of time, resources, and money to develop a
new medicine or drug; with the help of Data Science it becomes easier, because
the probability of success can be estimated from biological data and other
factors. Algorithms based on data science can forecast how a compound will
react in the human body before lab experiments are run.

19
11. In Delivery Logistics

Various Logistics companies like DHL, FedEx, etc. make use of Data Science.
Data Science helps these companies to find the best route for the Shipment of
their Products, the best time suited for delivery, the best mode of transport to
reach the destination, etc.

20
Data Science Challenges
● A wide variety of information and data is needed for accurate analysis
● An adequate data science talent pool is not available
● Management does not provide financial support for a data science team
● Unavailability of, or difficult access to, data
● Data Science results are not effectively used by business decision-makers
● Explaining data science to others is difficult
● Privacy issues
● Lack of domain experts
● If an organization is very small, it cannot have a dedicated data science team

21
Big Data

22
Big Data
Big Data refers to massive amounts of data produced by different
sources like social media platforms, web logs, sensors, IoT devices,
and many more. It can be structured (like tables in a DBMS),
semi-structured (like XML files), or unstructured (like audio,
video, and images).

23
Big Data
● Big Data literally means large amounts of data. Big data is the
pillar behind the idea that one can make useful inferences with a
large body of data that wasn’t possible before with smaller
datasets. So extremely large data sets may be analyzed
computationally to reveal patterns, trends, and associations that
are not transparent or easy to identify.

24
How much data is Big Data?
● Google processes 20 Petabytes(PB) per day (2008)
● Facebook has 2.5 PB of user data + 15 TB per day (2009)
● eBay has 6.5 PB of user data + 50 TB per day (2009)
● CERN’s Large Hadron Collider(LHC) generates 15 PB a year

25
Why Big Data?
Leveraging a Big Data analytics solution helps organizations unlock strategic
value and take full advantage of their data assets.
It helps organizations:
● Understand where, when and why their customers buy
● Protect the company’s client base with improved loyalty programs
● Seize cross-selling and upselling opportunities
● Provide targeted promotional information
● Optimize workforce planning and operations
● Fix inefficiencies in the company’s supply chain
● Predict market trends
● Predict future needs
● Become more innovative and competitive
● Discover new sources of revenue
26
Why Big Data?
● Companies are using Big Data to learn what their customers want, who their
best customers are, and why people choose different products. The more
a company knows about its customers, the more competitive it becomes.
●We can use it with Machine Learning for creating market strategies based
on predictions about customers. Leveraging big data makes companies
customer-centric.
●Companies can use Historical and real-time data to assess evolving
consumers’ preferences. This consequently enables businesses to
improve and update their marketing strategies which make companies
more responsive to customer needs.

27
Data Explosion
The world is now accustomed to saving everything, without exception, in
electronic form. Processing power, RAM speeds and hard-disk sizes have
grown to a level that has changed our outlook towards data and its storage.
Could you imagine having only 256 or 512 MB of RAM in your PC now?
If we understand the idea of a byte, we can imagine how data has grown over
time and how storage systems handle it. We know that 1 byte is equivalent to
8 bits, and these 8 bits can represent a character or symbol. A document with
a huge number of bytes will contain a huge number of characters, words,
spaces, etc. Similarly, a megabyte (MB) is a million bytes of information, a
gigabyte (GB) is a billion bytes, and a terabyte (TB) is a trillion bytes. We use
these terms when dealing with data and storage in our everyday activities.
28
Data Explosion
But it doesn’t end here. Next comes the petabyte, which is a quadrillion
bytes, or a million gigabytes. After that come the exabyte, zettabyte, and
yottabyte; a yottabyte is basically a trillion terabytes of information. There
are considerably larger units still, but we will stop here.
According to a report by the market research firm IDC, in 2009 the total
amount of data in the world was about 800 EB. It was expected to rise to
44 ZB (44 trillion gigabytes) by the end of 2020, with about 11 ZB of that
data stored in the cloud.
29
Data Explosion
The rapid, or exponential, increase in the amount of data generated and
stored in computing systems, reaching a level where data management
becomes difficult, is called “Data Explosion”.
The key drivers of data growth are the following:
● Increase in storage capacities.
● Cheaper storage.
● Increase in data processing capabilities by modern computing
devices.
● Data generated and made available by different sectors.
30
Data Science Components

31
32
Statistics
Statistics is the most critical unit of Data Science basics. It
is the method or science of collecting and analyzing
numerical data in large quantities to get useful insights.

33
Data Engineering
Data Engineering is a branch of Data Science that is primarily
concerned with the practical applications of data acquisition and
analysis.
• It focuses on designing and building data pipelines
that collect, prepare, and transform data (both
structured and unstructured) into usable formats for
Data Scientists’ perusal.

34
Visualization
Visualization techniques help you present huge
amounts of data as easy-to-understand,
digestible visuals.

35
Machine Learning
Machine Learning explores the building and study
of algorithms that learn to make predictions about
unforeseen/future data.

36
Deep Learning
Deep Learning is a newer area of machine learning
research in which the algorithm itself learns which
representation and analysis model to follow.

37
Data Science Job Roles

38
Data Science Job Roles
The most prominent Data Science job titles are:
• Data Scientist
• Data Engineer
• Data Analyst
• Statistician
• Data Architect
• Data Admin
• Business Analyst
• Data/Analytics Manager

39
Data Scientist
● Role: A Data Scientist is a professional who works with
enormous amounts of data to come up with compelling
business insights, using various tools, techniques,
methodologies, algorithms, etc.
● Languages: R, SAS, Python, SQL, Hive, Matlab, Pig, Spark

40
Data Engineer
● Role: A data engineer works with large amounts of data.
He or she develops, constructs, tests, and maintains
architectures such as large-scale processing systems
and databases.
● Languages: SQL, Hive, R, SAS, Matlab, Python, Java,
Ruby, C++, and Perl

41
Data Analyst
● Role: A data analyst is responsible for mining vast amounts
of data. He or she looks for relationships, patterns, and
trends in the data, and then delivers compelling reporting
and visualization so that the analysis can drive the most
viable business decisions.
● Languages: R, Python, HTML, JS, C, C++, SQL

42
Statistician
● Role: The statistician collects, analyzes, and interprets
qualitative and quantitative data using statistical
theories and methods.
● Languages: SQL, R, Matlab, Tableau, Python, Perl, Spark,
and Hive

43
Data Administrator
● Role: The data admin ensures that the database is
accessible to all relevant users. He or she also makes sure
that it performs correctly and is kept safe from
hacking.
● Languages: Ruby on Rails, SQL, Java, C#, and Python

44
Business Analyst
● Role: This professional improves business processes and
acts as an intermediary between the business executive
team and the IT department.
● Languages: SQL, Tableau, Power BI and, Python

45
Data Science VS BI

46
47
BI VS DATA SCIENCE

1. Concept: Data Science is a field that uses mathematics, statistics and various
other tools to discover hidden patterns in data. Business Intelligence is basically
a set of technologies, applications and processes used by enterprises for
business data analysis.
2. Focus: Data Science focuses on the future. Business Intelligence focuses on
the past and present.
3. Data: Data Science deals with both structured and unstructured data.
Business Intelligence mainly deals only with structured data.
4. Flexibility: Data Science is much more flexible, as data sources can be added
as per requirement. Business Intelligence is less flexible, as data sources need
to be pre-planned.
5. Method: Data Science makes use of the scientific method. Business
Intelligence makes use of the analytic method.
6. Complexity: Data Science has a higher complexity in comparison to Business
Intelligence. Business Intelligence is much simpler.
7. Expertise: Data Science expertise lies with the data scientist. Business
Intelligence expertise lies with the business user.
8. Questions: Data Science deals with the questions of what will happen and
what if. Business Intelligence deals with the question of what happened.
9. Storage: In Data Science, the data to be used is disseminated in real-time
clusters. In Business Intelligence, a data warehouse is utilized to hold data.
10. Integration of data: The ELT (Extract-Load-Transform) process is generally
used to integrate data for Data Science applications. The ETL
(Extract-Transform-Load) process is generally used to integrate data for
Business Intelligence applications.
11. Tools: Data Science tools include SAS, BigML, MATLAB, Excel, etc. Business
Intelligence tools include InsightSquared Sales Analytics, Klipfolio, ThoughtSpot,
Cyfe, TIBCO Spotfire, etc.
49
Characteristics/5 V’s of Big data
Big data is a collection of data from many different
sources and is often described by five characteristics:
1. Volume
2. Value
3. Variety
4. Velocity
5. Veracity

51
52
1. Volume: the size and amount of big data that companies manage and analyze.
Example: in 2016, estimated global mobile traffic was 6.2 Exabytes
(6.2 billion GB) per month, and by 2020 the world was expected to hold almost
40,000 Exabytes of data.
2. Value: the most important “V” from the perspective of the business. The value of big
data usually comes from insight discovery and pattern recognition that lead to more
effective operations, stronger customer relationships and other clear and quantifiable
business benefits.
● Bulk data with no value is of no good to the company unless it is turned
into something useful.
● Data in itself is of no use or importance; it needs to be converted into
something valuable to extract information. Hence, Value is often considered
the most important of the 5 V’s.
53
3. Variety: It refers to the nature of data, i.e. structured, semi-structured and
unstructured data. It also refers to heterogeneous sources.
Variety is basically the arrival of data from new sources, both inside and
outside an enterprise. It can be structured, semi-structured or unstructured.
● Structured data: This is organized data. It generally refers to data with a
defined length and format.
● Semi-structured data: This is semi-organized data. It is generally data that
does not conform to the formal structure of structured data. Log files are an
example of this type of data.
● Unstructured data: This refers to unorganized data that does not fit neatly
into the traditional row-and-column structure of a relational database. Text,
pictures, videos, etc. are examples of unstructured data, which cannot be
stored in the form of rows and columns. 54
55
4. Velocity: the speed at which companies receive, store and manage data –
e.g., the specific number of social media posts or search queries received within
a day, hour or other unit of time.
● Example: More than 3.5 billion searches are made on Google per day.
Also, Facebook users are increasing by approximately 22% year over year.
5. Veracity: the “truth” or accuracy of data and information assets, which often
determines executive-level confidence
The additional characteristic of variability can also be considered:
● Variability: the changing nature of the data companies seek to capture,
manage and analyze – e.g., in sentiment or text analytics, changes in the
meaning of key words or phrases

56
What is a Data Science Project Lifecycle?
● In simple terms, a data science life cycle is a repeatable set of
steps that you need to take to complete and deliver a project/product to
your client.
● Since the projects and the teams involved in developing and deploying
the model differ, the data science life cycle will be slightly different in
every company.
● However, most data science projects happen to follow a somewhat
similar process.

57
Who Are Involved in The Projects:
● Business Analyst
● Data Analyst
● Data Scientists
● Data Engineer
● Data Architect
● Machine Learning Engineer

58
Data Science Lifecycle
● The Data Science lifecycle revolves around the use of machine learning and
different analytical strategies to produce insights and predictions from
data in order to achieve a business objective. The complete process
includes a number of steps such as data cleaning, preparation,
modeling, and model evaluation.
● The lifecycle below outlines the major stages that a data science project
typically goes through.

59
60
1. Business Understanding
● In order to build a successful model, it is very important to first
understand the business problem that the client is facing.
● First understand the client’s business, requirements, and what they
actually want to achieve from the prediction.
● A Business Analyst is generally responsible for gathering the required
details from the client and forwarding them to the data scientist team
for further analysis.

61
Business Understanding
● Even a minute error in defining the problem or understanding the
requirement can be very costly for the project, so this must be done with
maximum precision.
● After asking the required questions of the company stakeholders or
clients, we move to the next process, which is data collection.

62
2. Data Understanding
● This stage takes stock of all the available data. Here you need to work
closely with the business team, as they know what data is present, which
data could be used for this business problem, and other relevant details.
● It includes describing the data, its structure, its relevance, and its record
types, and exploring the data using graphical plots.
● Basically, it means extracting whatever information you can about the data
by simply exploring it.

63
3. Preparation of Data:
● After gathering the data from relevant sources we need to move
forward to data preparation. This stage helps us gain a better
understanding of the data and prepares it for further evaluation.

● It entails steps such as selecting relevant data, combining it by mixing


data sets, cleaning it, dealing with missing values by either removing
them or imputing them with relevant data, dealing with incorrect data
by removing it, and also checking for and dealing with outliers.

64
3. Preparation of Data:
● Format the data according to the desired structure and delete any
unnecessary columns or functions.

● Data preparation is the most time-consuming process, accounting for up


to 90% of the total project duration, and this is the most crucial step
throughout the entire life cycle.

65
4. Exploratory Data Analysis:
● Exploratory Data Analysis (EDA) is critical at this point because
summarising clean data enables the identification of the data’s structure,
outliers, anomalies, and trends.

● These insights can aid in identifying the optimal set of features, an


algorithm to use for model creation, and model construction.

66
5. Data Modeling
● Data modeling is the heart of data analysis. A model takes the
prepared data as input and produces the desired output.
● This step consists of selecting the suitable kind of model, depending on
whether the problem is a classification problem, a regression problem or a
clustering problem.

67
5. Data Modeling
● Depending on the type of data received, we choose the machine
learning algorithm that is best suited for the model.
Once this is done, we tune the hyperparameters of the chosen
model to obtain the desired performance.

68
6. Model Evaluation
● Before the model is deployed, we need to ensure that we have picked the
right solution after a rigorous evaluation has been performed. The model is
then deployed in the desired channel and format.
● Take extra caution when executing each step in the life cycle to
avoid unwanted errors.

69
6. Model Evaluation
● For example, if you choose the wrong machine learning algorithm for data
modeling then you will not achieve the desired accuracy and it will be
difficult in getting approval for the project from the stakeholders.

● If your data is not cleaned properly, you will have to handle missing
values or the noise present in the dataset later on. Hence, in order to
make sure that the model is deployed properly and accepted in the real
world as an optimal use case, you will have to do rigorous testing in every
step.

70
7. Model Deployment:
● This is the last step in the data science life cycle. Each step in the data
science life cycle defined above must be worked on carefully.
● If any step is performed improperly, it will affect the
subsequent steps and the complete effort may go to waste.
● For example, if data is not collected properly, you will lose records
and you will not be able to build an ideal model.

71
7. Model Deployment:
● Similarly, if the data is not cleaned properly, the model will not
work, and if the model is not evaluated properly, it will fail in the real
world. Right from business understanding to model deployment, every step
has to be given appropriate attention, time, and effort.

72
Data Types
● Whether you are a businessman, marketer, data scientist, or another
professional who works with some kinds of data, you should be familiar
with the key list of data types.

● Why? Because the various data classifications allow you to correctly use
measurements and thus to correctly make decisions.

73
74
75
76
Discrete vs Continuous Data

77
Dataset
A Dataset is a set or collection of data. This set is normally
presented in a tabular pattern. Every column describes a
particular variable. And each row corresponds to a given
member of the data set.
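As a minimal illustration (the values are made up), a dataset can be held in a pandas DataFrame, where each column describes a variable and each row is one member of the data set:

import pandas as pd

# Hypothetical dataset: columns are variables, rows are members (records)
dataset = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "age": [34, 29, 45],
    "annual_income": [52000, 61000, 58000],
})

print(dataset.shape)   # (3, 3): 3 rows (members) and 3 columns (variables)
print(dataset.head())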

78
Data Munging /Data Wrangling
● Data wrangling is the process of cleaning, structuring
and enriching raw data into a desired format for better
decision making in less time.
● It enables businesses to tackle more complex data in
less time, produce more accurate results, and make
better decisions.

79
Importance of Data Wrangling
● Making raw data usable. Accurately wrangled data
guarantees that quality data is entered into the
downstream analysis.
● Getting all data from various sources into a centralized
location
● Piecing together raw data according to the required
format and understanding the business context of data

80
Importance of Data Wrangling
● Automated data integration tools are used as data wrangling
techniques that clean and convert source data into a standard
format that can be used repeatedly according to end
requirements. Businesses use this standardized data to perform
crucial, cross-data set analytics.
● Cleansing the data from the noise or flawed, missing elements
● Data wrangling acts as a preparation stage for the data mining
process, which involves gathering data and making sense of it.
● Helping business users make concrete, timely decisions

81
Benefits of Data Wrangling
● Data wrangling helps to improve data usability as it converts data into a
compatible format for the end system.
● It helps to quickly build data flows within an intuitive user interface and
easily schedule and automate the data-flow process.
● Integrates various types of information and their sources (like databases,
web services, files, etc.)

82
Data Wrangling Tools
● Spreadsheets / Excel Power Query - It is the most basic manual data
wrangling tool
● OpenRefine - An automated data cleaning tool that requires programming
skills
● Tabula – It is a tool suited for all data types
● Google DataPrep – It is a data service that explores, cleans, and prepares
data
● Data wrangler – It is a data cleaning and transforming tool

83
Data Wrangling Examples
● Merging several data sources into one data-set for analysis
● Identifying gaps or empty cells in data and either filling or removing them
● Deleting irrelevant or unnecessary data
● Identifying severe outliers in data and either explaining the inconsistencies
or deleting them to facilitate analysis
● Businesses also use data wrangling tools to
● Detect corporate fraud
● Support data security
● Ensure accurate and recurring data modeling results
● Ensure business compliance with industry standards
● Perform Customer Behavior Analysis
● Reduce time spent on preparing data for analysis
● Promptly recognize the business value of your data
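A minimal pandas sketch of a few of the wrangling steps listed above (merging sources, filling or removing gaps, deleting irrelevant columns, flagging outliers); the file and column names are illustrative assumptions:

import pandas as pd

# Merge two hypothetical data sources into one data set for analysis
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")
df = orders.merge(customers, on="customer_id", how="left")

# Identify gaps (empty cells) and either fill or remove them
df["amount"] = df["amount"].fillna(df["amount"].median())
df = df.dropna(subset=["customer_id"])

# Delete irrelevant or unnecessary columns
df = df.drop(columns=["internal_notes"], errors="ignore")

# Identify severe outliers (more than 3 standard deviations from the mean)
z = (df["amount"] - df["amount"].mean()) / df["amount"].std()
outliers = df[z.abs() > 3]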
84
Data Preprocessing
● Data Preprocessing: An Overview
○ Data Quality
○ Major Tasks in Data Preprocessing

● Data Cleaning
● Data Integration
● Data Reduction
● Data Transformation and Data Discretization
● Summary

85
Data preprocessing
● Data preprocessing is the process of transforming raw data into an
understandable format. It is also an important step in data mining as we
cannot work with raw data. The quality of the data should be checked
before applying machine learning or data mining algorithms.

86
Why is Data preprocessing important?
● Accuracy: whether the data entered is correct.
● Completeness: whether all required data is available
and recorded.
● Consistency: whether the same data stored in
different places matches.
● Timeliness: whether the data is updated on time.
● Believability: whether the data can be trusted.
● Interpretability: how easily the data can be understood.
87
Major Tasks in Data Preprocessing
● Data cleaning
○ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
● Data integration
○ Integration of multiple databases, data cubes, or files
● Data reduction
○ Dimensionality reduction
○ Numerosity reduction
● Data transformation and data discretization
○ Normalization
○ Concept hierarchy generation

88
Data Preprocessing
● Data Preprocessing: An Overview
○ Data Quality
○ Major Tasks in Data Preprocessing
● Data Cleaning
● Data Integration
● Data Reduction
● Data Transformation and Data Discretization
● Summary
89
Data Cleaning
● Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
○ incomplete: lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data
■ e.g., Occupation=“ ” (missing data)
○ noisy: containing noise, errors, or outliers
■ e.g., Salary=“−10” (an error)
○ inconsistent: containing discrepancies in codes or names, e.g.,
■ Age=“42”, Birthday=“03/07/2010”
■ Was rating “1, 2, 3”, now rating “A, B, C”
○ Intentional (e.g., disguised missing data)
■ Jan. 1 as everyone’s birthday?

90
Incomplete (Missing) Data
● Data is not always available
○ E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
● Missing data may be due to
○ equipment malfunction
○ inconsistent with other recorded data and thus
deleted
○ data not entered due to misunderstanding
○ certain data may not be considered important at the
time of entry
○ not register history or changes of the data
● Missing data may need to be inferred
91
How to Handle Missing Data?
● Ignore the tuple: usually done when class label is missing
(when doing classification)—not effective when the % of
missing values per attribute varies considerably
● Fill in the missing value manually: tedious + infeasible?
● Fill in it automatically with
○ a global constant : e.g., “unknown”, a new class?!
○ the attribute mean
○ the attribute mean for all samples belonging to the
same class: smarter
○ the most probable value: inference-based such as
Bayesian formula or decision tree
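A short pandas sketch of the automatic fill-in strategies above (global constant, attribute mean, and class-wise attribute mean); the columns and values are made up for illustration:

import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [50000, None, 42000, None, 46000],
    "city":   ["Pune", None, "Nashik", "Mumbai", None],
})

# Fill in with a global constant, e.g. "unknown"
df["city"] = df["city"].fillna("unknown")

# Fill in with the attribute mean
df["income_filled"] = df["income"].fillna(df["income"].mean())

# Fill in with the attribute mean of samples belonging to the same class (smarter)
df["income_by_class"] = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)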
92
Noisy Data
● Noise: random error or variance in a measured variable
● Incorrect attribute values may be due to
○ faulty data collection instruments
○ data entry problems
○ data transmission problems
○ technology limitation
○ inconsistency in naming convention
● Other data problems which require data cleaning
○ duplicate records
○ incomplete data
○ inconsistent data
93
How to Handle Noisy Data?
● Binning
○ first sort data and partition into (equal-frequency) bins
○ then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
● Regression
○ smooth by fitting the data into regression functions
● Clustering
○ detect and remove outliers
● Combined computer and human inspection
○ detect suspicious values and check by human (e.g., deal
with possible outliers)
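A small sketch of smoothing by bin means with equal-frequency binning (the price values are illustrative):

import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Partition the sorted values into 3 equal-frequency (equal-depth) bins
bins = pd.qcut(prices, q=3, labels=False)

# Smooth by bin means: replace each value by the mean of its bin
smoothed = prices.groupby(bins).transform("mean")
print(smoothed.tolist())  # each original value replaced by its bin mean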
94
Data Cleaning as a Process
● Data discrepancy detection
○ Use metadata (e.g., domain, range, dependency, distribution)
○ Check field overloading
○ Check uniqueness rule, consecutive rule and null rule
○ Use commercial tools
■ Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
■ Data auditing: by analyzing data to discover rules and
relationship to detect violators (e.g., correlation and clustering
to find outliers)
● Data migration and integration
○ Data migration tools: allow transformations to be specified
○ ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
● Integration of the two processes
○ Iterative and interactive (e.g., Potter’s Wheel)

95
Data Preprocessing
● Data Preprocessing: An Overview
○ Data Quality
○ Major Tasks in Data Preprocessing
● Data Cleaning
● Data Integration
● Data Reduction
● Data Transformation and Data Discretization
● Summary
96
Data Integration
● Data integration:
○ Combines data from multiple sources into a coherent store
○ Improves the accuracy and speed of the data mining process
● Entity identification problem:
○ Identify real-world entities from multiple data sources
○ E.g., Cust_id and Cust_No refer to the same attribute
● Detecting and resolving data value conflicts:
○ For the same real-world entity, attribute values from different
sources are different
○ Possible reasons: different representations, different scales
97
Handling Redundancy in Data Integration
● Redundant data occur often when integration of multiple
databases
○ Object identification: The same attribute or object may
have different names in different databases
○ Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
● Redundant attributes may be able to be detected by
correlation analysis and covariance analysis
● Careful integration of the data from multiple sources may
help reduce/avoid redundancies and inconsistencies and
improve mining speed and quality
Correlation Analysis (Nominal Data)
● Χ2 (chi-square) test

● The larger the Χ2 value, the more likely the variables are related
● The cells that contribute the most to the Χ2 value are those
whose actual count is very different from the expected count
● Correlation does not imply causality
○ # of hospitals and # of car-theft in a city are correlated
○ Both are causally linked to the third variable: population
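For reference, the χ² statistic compares the observed count o_ij in each cell of the contingency table with the expected count e_ij (standard Pearson χ² form):

\chi^2 = \sum_{i}\sum_{j}\frac{(o_{ij} - e_{ij})^2}{e_{ij}}, \qquad
e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{n}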

99
Chi-Square Calculation: An Example
                           Play chess   Not play chess   Sum (row)
Like science fiction        250 (90)     200 (360)          450
Not like science fiction     50 (210)   1000 (840)         1050
Sum (col.)                  300         1200               1500

● Χ2 (chi-square) calculation (the numbers in parentheses are the expected
counts, calculated from the data distribution in the two categories)
● The result shows that like_science_fiction and play_chess are
correlated in this group
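Plugging the observed and expected counts from the table into the standard χ² formula:

\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840}
       \approx 284.44 + 121.90 + 71.11 + 30.48 \approx 507.93

Since 507.93 is far above the critical value of 10.828 for 1 degree of freedom at the 0.001 significance level, the hypothesis that the two attributes are independent is rejected: they are strongly correlated.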
100
Correlation Analysis (Numeric Data)

● Correlation coefficient (also called Pearson’s product moment coefficient):

  r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A \sigma_B}
          = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{A}\bar{B}}{n\,\sigma_A \sigma_B}

  where n is the number of tuples, \bar{A} and \bar{B} are the respective means
  of A and B, σA and σB are the respective standard deviations of A and B,
  and Σ aᵢbᵢ is the sum of the AB cross-product.
● If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do).
  The higher the value, the stronger the correlation.
● rA,B = 0: independent; rA,B < 0: negatively correlated
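A quick numerical check of the correlation coefficient using NumPy (the two small arrays are illustrative):

import numpy as np

a = np.array([6, 8, 10, 14, 18], dtype=float)
b = np.array([2, 3, 5, 6, 8], dtype=float)

# Pearson product-moment correlation coefficient r_{A,B}
r = np.corrcoef(a, b)[0, 1]
print(round(r, 3))  # ~0.98 -> A and B are strongly positively correlated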
101
Visually Evaluating Correlation

[Figure: scatter plots showing pairwise correlation values ranging from –1 to 1.]

102
Correlation (viewed as linear relationship)

● Correlation measures the linear relationship between objects
● To compute correlation, we standardize the data objects A and B and then
take their dot product

103
Covariance (Numeric Data)
Covariance is an indicator of the degree to which two random variables change with respect to
each other
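In formula form (reusing the notation from the correlation slide, where \bar{A}, \bar{B} are the means and σA, σB the standard deviations):

\mathrm{Cov}(A,B) = E[(A - \bar{A})(B - \bar{B})]
                  = \frac{1}{n}\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})
                  = E[A \cdot B] - \bar{A}\,\bar{B}

and correlation can be expressed as r_{A,B} = \mathrm{Cov}(A,B) / (\sigma_A \sigma_B).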

104
● Positive covariance: If CovA,B > 0, then A and B both tend to be
larger than their expected values.
● Negative covariance: If CovA,B < 0 then if A is larger than its
expected value, B is likely to be smaller than its expected value.
● Independence: CovA,B = 0 but the converse is not true:
○ Some pairs of random variables may have a covariance of 0 but
are not independent. Only under some additional assumptions
(e.g., the data follow multivariate normal distributions) does a
covariance of 0 imply independence

105
106
Data Preprocessing
● Data Preprocessing: An Overview
○ Data Quality
○ Major Tasks in Data Preprocessing
● Data Cleaning
● Data Integration
● Data Reduction
● Data Transformation and Data Discretization
● Summary
107
Data Reduction Strategies
● Data reduction: Obtain a reduced representation of the data set that is
much smaller in volume but yet produces the same (or almost the same)
analytical results
● Why data reduction? — A database/data warehouse may store terabytes
of data. Complex data analysis may take a very long time to run on the
complete data set.
● Data reduction strategies
○ Dimensionality reduction, e.g., remove unimportant attributes
■ Wavelet transforms
■ Principal Components Analysis (PCA)
■ Feature subset selection, feature creation
○ Numerosity reduction (some simply call it: Data Reduction)
■ Regression and Log-Linear Models
■ Histograms, clustering, sampling
■ Data cube aggregation
○ Data compression
108
Data Reduction 1: Dimensionality Reduction

● Dimensionality reduction is the process of reducing the number of
random variables or attributes under consideration.
● Dimensionality reduction techniques
○ Wavelet transforms
○ Principal Component Analysis
○ Supervised and nonlinear techniques (e.g., feature selection)

109
Mapping Data to a New Space
■ Fourier transform
■ Wavelet transform

[Figure: two sine waves, the two sine waves plus noise, and their frequency-domain representation.]

110
What Is Wavelet Transform?
● Decomposes a signal into different frequency
subbands
○ Applicable to n-dimensional signals

● Data are transformed to preserve relative distance


between objects at different levels of resolution
● Allow natural clusters to become more
distinguishable
● Used for image compression

111
Wavelet Transformation
[Figure: Haar-2 and Daubechies-4 wavelet basis functions.]
● Discrete wavelet transform (DWT) for linear signal processing,
multi-resolution analysis
● Compressed approximation: store only a small fraction of the
strongest of the wavelet coefficients
● Similar to discrete Fourier transform (DFT), but better lossy
compression, localized in space
● Method:
○ Length, L, must be an integer power of 2 (padding with 0’s, when
necessary)
○ Each transform has 2 functions: data smoothing, weighted difference
○ Applies to pairs of data, resulting in two set of data of length L/2
○ Applies two functions recursively, until reaches the desired length
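A minimal sketch of one level of the Haar DWT, assuming the PyWavelets (pywt) package is installed; the length-8 signal (an integer power of 2) is illustrative:

import pywt

# Signal whose length is an integer power of 2, as the method requires
signal = [2, 2, 0, 2, 3, 5, 4, 4]

# One level of the discrete wavelet transform with the Haar wavelet:
# an approximation (smoothed) part and a detail (weighted difference) part,
# each of length L/2
approx, detail = pywt.dwt(signal, "haar")
print(approx)
print(detail)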
112
Principal Component Analysis (PCA)(Karhunen-Loeve)
● Find a projection that captures the largest amount of variation in data
● The original data are projected onto a much smaller space, resulting in
dimensionality reduction. We find the eigenvectors of the covariance matrix,
and these eigenvectors define the new space

[Figure: data points in the (x1, x2) plane with the principal component axes overlaid.]
113
Principal Component Analysis (Steps)
● Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
○ Normalize input data: Each attribute falls within the same range
○ Compute k orthonormal (unit) vectors, i.e., principal components
○ Each input data (vector) is a linear combination of the k principal
component vectors
○ The principal components are sorted in order of decreasing “significance”
or strength
○ Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e., using
the strongest principal components, it is possible to reconstruct a good
approximation of the original data)
● Works for numeric data only
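A short scikit-learn sketch of the steps above, applied to randomly generated numeric data (the dataset and the choice k = 2 are illustrative):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy numeric data: 100 samples with n = 5 attributes
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Normalize the input data so each attribute falls within a comparable range
X_scaled = StandardScaler().fit_transform(X)

# Compute orthonormal principal components, sorted by decreasing variance,
# and keep only the k = 2 strongest ones
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                 # (100, 2): reduced dimensionality
print(pca.explained_variance_ratio_)   # variance captured by each component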
114
Attribute Subset Selection
● Another way to reduce dimensionality of data
● Redundant attributes
○ Duplicate much or all of the information contained in one or
more other attributes
○ E.g., purchase price of a product and the amount of sales tax
paid
● Irrelevant attributes
○ Contain no information that is useful for the data mining task
at hand
○ E.g., students' ID is often irrelevant to the task of predicting
students' GPA

115
Heuristic Search in Attribute Selection
● There are 2^d possible attribute combinations of d attributes
● Typical heuristic attribute selection methods:
○ Best single attribute under the attribute independence
assumption: choose by significance tests

○ Best step-wise feature selection:


■ The best single-attribute is picked first
■ Then the next best attribute conditioned on the first, ...
○ Step-wise attribute elimination:
■ Repeatedly eliminate the worst attribute
○ Decision tree induction
■ Flowchart-like structure (internal node: test on an attribute; leaf
node: class prediction)
116
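A possible scikit-learn sketch of best step-wise (forward) feature selection, assuming a scikit-learn version that provides SequentialFeatureSelector; the estimator and the number of attributes to keep are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Greedy forward selection: pick the best single attribute first, then
# repeatedly add the attribute that most improves cross-validated accuracy
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction="forward",
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask of the selected attributes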
Attribute Creation (Feature Generation)
● Create new attributes (features) that can capture the important
information in a data set more effectively than the original ones
● Three general methodologies
○ Attribute extraction
■ Domain-specific
○ Mapping data to new space (see: data reduction)
■ E.g., Fourier transformation, wavelet transformation, manifold approaches (not covered)
○ Attribute construction
■ Feature Construction – new attributes are constructed and added from the given set of attributes
to help mining process
■ Data discretization-raw values replaced by interval labels/conceptual labels

117
Data Reduction 2: Numerosity Reduction
● Reduce data volume by choosing alternative, smaller forms of
data representation
● Parametric methods (e.g., regression)
○ Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
○ Ex.: Log-linear models—obtain value at a point in m-D
space as the product on appropriate marginal subspaces
● Non-parametric methods
○ Do not assume models
○ Major families: histograms, clustering, sampling, …
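A tiny NumPy illustration of the parametric idea: fit a simple regression model and keep only its parameters instead of the raw points (the values are made up):

import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1, 11.8])

# Fit a linear model y ≈ slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

# Store only the two parameters; the original points can be discarded
print(slope, intercept)        # roughly 1.96 and 0.17 for this toy data
print(slope * 7 + intercept)   # estimate a value at a new point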
118
Data Preprocessing
● Data Preprocessing: An Overview
○ Data Quality
○ Major Tasks in Data Preprocessing
● Data Cleaning
● Data Integration
● Data Reduction
● Data Transformation and Data Discretization
● Summary
119
Data Transformation
● A function that maps the entire set of values of a given attribute to a new set of replacement values s.t. each old value
can be identified with one of the new values

● Methods

○ Smoothing: remove noise from data

○ Attribute/feature construction

■ New attributes constructed from the given ones

○ Aggregation: Summarization, data cube construction

○ Normalization: Scaled to fall within a smaller, specified range

■ min-max normalization

■ z-score normalization

■ normalization by decimal scaling
○ Discretization: Concept hierarchy climbing

120
Normalization
● Min-max normalization: maps a value v of attribute A to the range [new_minA, new_maxA]:

  v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A

  ○ Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0].
    Then $73,000 is mapped to (73,000 − 12,000)/(98,000 − 12,000) × (1.0 − 0.0) + 0.0 ≈ 0.709.

● Z-score normalization (μ: mean, σ: standard deviation):

  v' = \frac{v - \mu_A}{\sigma_A}

  ○ Ex. Let μ = 54,000 and σ = 16,000. Then $73,000 is mapped to
    (73,000 − 54,000)/16,000 ≈ 1.19.

● Normalization by decimal scaling:

  v' = \frac{v}{10^j}

  where j is the smallest integer such that Max(|v'|) < 1
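The three normalization methods above, sketched with plain NumPy (the income values are illustrative):

import numpy as np

income = np.array([12000, 54000, 73000, 98000], dtype=float)

# Min-max normalization to [0.0, 1.0]
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score normalization using the mean and standard deviation
z_score = (income - income.mean()) / income.std()

# Decimal scaling: divide by 10^j so every absolute value falls below 1
j = int(np.ceil(np.log10(np.abs(income).max() + 1)))
decimal_scaled = income / (10 ** j)

print(min_max.round(3))   # 73000 maps to ~0.709
print(decimal_scaled)     # all values now lie strictly between -1 and 1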
121
Discretization
● Three types of attributes
○ Nominal—values from an unordered set, e.g., color, profession
○ Ordinal—values from an ordered set, e.g., military or academic rank
○ Numeric—real numbers, e.g., integer or real numbers
● Discretization: Divide the range of a continuous attribute into intervals
○ Interval labels can then be used to replace actual data values
○ Reduce data size by Discretization
○ Split (top-down) vs. merge (bottom-up)
○ Discretization can be performed recursively on an attribute
○ Prepare for further analysis, e.g., classification

122
Data Discretization Methods
● Typical methods: All the methods can be applied recursively
○ Binning
■ Top-down split, unsupervised

○ Histogram analysis
■ Top-down split, unsupervised

○ Clustering analysis (unsupervised, top-down split or


bottom-up merge)
○ Decision-tree analysis (supervised, top-down split)
○ Correlation (e.g., χ2) analysis (unsupervised, bottom-up
merge)
123
Simple Discretization: Binning
● Equal-width (distance) partitioning
○ Divides the range into N intervals of equal size: uniform grid
○ if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B –A)/N.
○ The most straightforward, but outliers may dominate presentation
○ Skewed data is not handled well
● Equal-depth (frequency) partitioning
○ Divides the range into N intervals, each containing approximately same number
of samples
○ Good data scaling
○ Managing categorical attributes can be tricky
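A compact pandas sketch contrasting the two partitioning schemes (the price values are illustrative):

import pandas as pd

prices = pd.Series([5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215])

# Equal-width partitioning: N = 3 intervals of equal width W = (max - min) / N
equal_width = pd.cut(prices, bins=3)

# Equal-depth (frequency) partitioning: N = 3 intervals with roughly the same
# number of samples in each
equal_depth = pd.qcut(prices, q=3)

print(equal_width.value_counts().sort_index())  # skewed: most values fall in the first bin
print(equal_depth.value_counts().sort_index())  # about 4 samples per bin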

124
