3 Pillars of Big Data
• Exponentially growing massive data
• New tech ecosystems that provide the capacity to store and process
varied structured and unstructured data
• Advanced Analytics: AI, Machine Learning, Deep Learning
Business Issue Understanding
•What decisions need to be made?
•What information is needed to inform those decisions?
•What type of analysis can provide the information needed to
inform those decisions?
Business Issue Understanding
ABC is a retailer with hundreds of shops all over the country.
According to its annual business plan, it has two main objectives,
which focus on Sales Performance and Stock Management.
• The Marketing team believes that sales performance might increase
if cross-sell products are offered at discounted rates.
• At the same time, the Operations Office aims to decrease the cost
of expired goods.
Business Issue Understanding
•What decisions need to be made?
Which products should we offer to our customers?
What should the discount rates be?
How can the company sell goods that are close to their expiration
dates more effectively?
Business Issue Understanding
•What information is needed to inform those decisions?
Which products are sold together
The effect of discount rates on customer behaviour
Business Issue Understanding
•What type of analysis can provide the information needed to inform
those decisions?
Analysis of past sales of products to capture possible sales
patterns.
Linking sales patterns with product data
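A minimal sketch of how "which products are sold together" could be read from past sales, assuming each transaction is available as a list of product names. The baskets and the function name `pair_counts` are hypothetical and only illustrate the idea; this is simple pair counting, not a full association-rule analysis.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: each inner list is one sales transaction.
transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and gives every pair a fixed order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(transactions)
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

Pairs that appear together often are candidates for cross-sell offers; linking them with product data (for example expiration dates) is a separate step.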
Data Understanding
•What data is needed?
Product data
Sales history by products
Stock data
Expiration dates of products
•What data is available?
•What are the important characteristics of the data?
Data Preparation
•Data Integration
•Data Manipulation
•Missing Values Handling
•Feature Selection
•Feature Generation
•Dimensionality Reduction
•Outlier Removal
•Normalization
Data Integration
•Integrating necessary data from various sources
Example:
Product data >> ERP Database Definitions
Sales history by products >> Sales Transaction Data
Stock data >> Stock Management Data
Expiration dates of products >> ERP Database
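The integration step can be sketched as a join on a shared product key. The records, field names (`product_id`, `expires`, `quantity`) and the function `integrate` below are hypothetical stand-ins for the ERP, sales transaction and stock management sources; real sources would be queried rather than typed in.

```python
# Hypothetical extracts from each source system, keyed by product_id.
products = {  # ERP database: product definitions and expiration dates
    "P1": {"name": "Milk 1L", "expires": "2018-03-20"},
    "P2": {"name": "Pasta 500g", "expires": "2019-01-31"},
}
sales = [  # sales transaction data
    {"product_id": "P1", "quantity": 3},
    {"product_id": "P2", "quantity": 1},
    {"product_id": "P1", "quantity": 2},
]
stock = {"P1": 40, "P2": 120}  # stock management data: units on hand

def integrate(products, sales, stock):
    """Build one record per product that combines all three sources."""
    sold = {}
    for row in sales:
        pid = row["product_id"]
        sold[pid] = sold.get(pid, 0) + row["quantity"]
    return [
        {
            "product_id": pid,
            **info,
            "units_sold": sold.get(pid, 0),
            "units_in_stock": stock.get(pid),  # None if no stock row exists
        }
        for pid, info in products.items()
    ]

integrated = integrate(products, sales, stock)
for record in integrated:
    print(record)
```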
• Data Manipulation
Restructuring data to be ready for model building
Ex:
True / False values >> 1 / 0
Time / Date handling
Product types >> Factor
Tools:
R packages: dplyr, reshape2, tidyr, data.table, …
Python: numpy, pandas
Other: SQL, Spark, Alteryx, Knime, Azure
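The three manipulation examples can be sketched in plain Python. The rows and field names (`on_discount`, `sold_at`, `product_type`) are hypothetical, and the integer codes stand in for what R calls a factor.

```python
from datetime import datetime

# Hypothetical raw rows as they might arrive from a source system.
raw_rows = [
    {"on_discount": "True", "sold_at": "2018-03-14 09:30", "product_type": "dairy"},
    {"on_discount": "False", "sold_at": "2018-03-15 17:05", "product_type": "bakery"},
    {"on_discount": "True", "sold_at": "2018-03-17 11:45", "product_type": "dairy"},
]

# Factor encoding: map each distinct product type to an integer code.
levels = sorted({row["product_type"] for row in raw_rows})
type_code = {level: code for code, level in enumerate(levels)}

prepared = []
for row in raw_rows:
    sold_at = datetime.strptime(row["sold_at"], "%Y-%m-%d %H:%M")
    prepared.append({
        "on_discount": 1 if row["on_discount"] == "True" else 0,  # True/False >> 1/0
        "weekday": sold_at.weekday(),  # Monday = 0 ... Sunday = 6
        "hour": sold_at.hour,
        "product_type": type_code[row["product_type"]],
    })

print(type_code)
print(prepared)
```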
• Missing Value Handling
• Finding the missing values
• Deciding how to treat them
• Delete the record
• Fill manually
• Fill with the mean, the median, the previous value, or the next value
• Fill with a model (regression, decision tree, etc.)
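The simpler treatments can be sketched as follows, assuming missing values are represented as `None`. The function names are hypothetical; the sample series is the Optimum row of the district sales example used later in this deck. Model-based filling is left out of this sketch.

```python
from statistics import mean, median

def find_missing(values):
    """Return the positions of the missing (None) entries."""
    return [i for i, v in enumerate(values) if v is None]

def drop_missing(values):
    """Delete the records that have no value."""
    return [v for v in values if v is not None]

def fill_with(values, statistic):
    """Replace None with a statistic (mean, median, ...) of the observed values."""
    fill = statistic(drop_missing(values))
    return [fill if v is None else v for v in values]

def fill_forward(values):
    """Replace None with the last observed value before it, if there is one."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def fill_backward(values):
    """Replace None with the first observed value after it, if there is one."""
    return fill_forward(values[::-1])[::-1]

optimum = [None, 200, 400, 340, 500, 590, 899, 560]
print(find_missing(optimum))
print(fill_with(optimum, median))
print(fill_backward(optimum))
```

Note that forward filling cannot repair a gap at the start of a series, and backward filling cannot repair one at the end.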
• Feature Selection
• Which features should be included in the model?
• Eliminating features with a high ratio of missing values
• If more than 40% of the values are missing, the feature could be excluded.
• Deselecting features that are highly correlated or represent the same phenomenon
• Ex: Temperature: one column in degrees Celsius, another column in Fahrenheit
• Ex: Date of birth; Age
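Both selection rules can be sketched in a few lines. The 40% limit comes from the rule of thumb above; the 0.95 correlation limit, the sample columns and the function names are assumptions made for illustration only.

```python
from math import sqrt

def missing_ratio(column):
    """Share of entries in the column that are missing (None)."""
    return sum(v is None for v in column) / len(column)

def pearson(xs, ys):
    """Pearson correlation of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column gives no correlation signal
    return cov / (sx * sy)

def select_features(table, max_missing=0.40, max_corr=0.95):
    """Return the names of the features that pass both rules."""
    # Rule 1: exclude features whose missing ratio exceeds the limit.
    kept = [name for name, col in table.items()
            if missing_ratio(col) <= max_missing]
    # Rule 2: skip a feature that duplicates one that is already selected.
    selected = []
    for name in kept:
        redundant = False
        for other in selected:
            # Compare only the rows where both features are present.
            rows = [(a, b) for a, b in zip(table[name], table[other])
                    if a is not None and b is not None]
            if len(rows) < 2:
                continue
            xs, ys = zip(*rows)
            if abs(pearson(xs, ys)) > max_corr:
                redundant = True
                break
        if not redundant:
            selected.append(name)
    return selected

table = {  # hypothetical columns
    "temp_celsius": [14, 14, 15, 17, 14, 12, 13],
    "temp_fahrenheit": [57.2, 57.2, 59.0, 62.6, 57.2, 53.6, 55.4],
    "complaints": [3, 0, 12, 0, 0, 18, 33],
    "mostly_missing": [None, None, 5, None, None, 2, None],
}
selected = select_features(table)
print(selected)
```

Which of two correlated features is kept depends on column order here; in practice that choice is usually made with domain knowledge.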
• Feature Generation
• Deriving features from other features
• Ex: Total Sales Per District
District    2011   2012   2013   2014   2015   2016   2017   2018
Kadıköy     540    650    800    900    910    1105   1200   1400
Optimum     NA     200    400    340    500    590    899    560
Natilus     440    420    450    465    502    520    510    560
Beşiktaş    820    890    905    910    902    900    920    880
Üsküdar     50     200    400    600    800    421    430    500
Kartal      30     50     90     150    250    320    220    150
…
Which district has the highest sales on average?
Which districts' sales have changed the most / least over the years?
Which districts have common characteristics according to their sales volume?
Total Sales Per District   Mean    Std Dev   Min   Max    Range
Kadıköy                    938.1   267.53    540   1400   860
Optimum                    498.4   205.80    200   899    699
Natilus                    483.4   44.18     420   560    140
Beşiktaş                   890.9   29.08     820   920    100
Üsküdar                    425.1   214.71    50    800    750
Kartal                     157.5   94.44     30    320    290
…
Which district has the highest sales on average? Kadıköy
Which districts' sales have changed the most / least over the years? Most: Kadıköy / Least: Beşiktaş
Which districts have common characteristics according to their sales volume? >> Cluster analysis
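A sketch of how such summary features could be derived from the yearly figures, using only the Python standard library. It assumes the NA value is skipped and that spread is measured as the population standard deviation over all eight years; the function name `summarize` is hypothetical.

```python
from statistics import mean, pstdev

# Yearly sales per district, 2011 to 2018; None marks the NA value.
sales = {
    "Kadıköy":  [540, 650, 800, 900, 910, 1105, 1200, 1400],
    "Optimum":  [None, 200, 400, 340, 500, 590, 899, 560],
    "Natilus":  [440, 420, 450, 465, 502, 520, 510, 560],
    "Beşiktaş": [820, 890, 905, 910, 902, 900, 920, 880],
    "Üsküdar":  [50, 200, 400, 600, 800, 421, 430, 500],
    "Kartal":   [30, 50, 90, 150, 250, 320, 220, 150],
}

def summarize(values):
    """Derive summary features from one district's yearly sales."""
    observed = [v for v in values if v is not None]  # skip NA
    return {
        "mean": round(mean(observed), 1),
        "std_dev": round(pstdev(observed), 2),  # population standard deviation
        "min": min(observed),
        "max": max(observed),
        "range": max(observed) - min(observed),
    }

features = {district: summarize(values) for district, values in sales.items()}

highest_mean = max(features, key=lambda d: features[d]["mean"])
most_variable = max(features, key=lambda d: features[d]["std_dev"])
least_variable = min(features, key=lambda d: features[d]["std_dev"])
print(highest_mean, most_variable, least_variable)
```

The derived columns can then serve as inputs to a later step such as cluster analysis.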
• Dimensionality Reduction
• Reducing the number of features in a data model by grouping them or eliminating them
• Missing Value Ratio
• Low Variance Filter
• High Correlation Filter
• Decision Tree / Random Forest Importance Matrix
• Principal Component Analysis (PCA)
• Backward Feature Elimination
• Forward Feature Construction
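As one example from the list above, a low variance filter can be sketched as follows. The columns, the function name and the threshold of 0 are hypothetical; because variance depends on the scale of a feature, thresholds above 0 are more meaningful after the features have been brought to a common scale.

```python
from statistics import pvariance

def low_variance_filter(table, threshold=0.0):
    """Keep only the columns whose population variance exceeds the threshold."""
    return {name: col for name, col in table.items()
            if pvariance(col) > threshold}

table = {  # hypothetical columns
    "store_open": [1, 1, 1, 1, 1, 1, 1],  # constant, carries no information
    "temperature": [14, 14, 15, 17, 14, 12, 13],
    "complaints": [3, 0, 12, 0, 0, 18, 33],
}
reduced = low_variance_filter(table)
print(list(reduced))
```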
• Outlier Removal
• Removing the observations
that lie far from the rest of the observations
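The slide does not name a rule for deciding what counts as distant. One common choice, used here as an assumption, is the interquartile range (IQR) rule: drop values more than 1.5 × IQR below the first quartile or above the third. The readings and the function name are hypothetical.

```python
from statistics import quantiles

def remove_outliers_iqr(values, k=1.5):
    """Keep the values inside [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Temperature readings; 95 is a hypothetical faulty measurement.
temperatures = [14, 14, 15, 17, 14, 12, 13, 95]
cleaned = remove_outliers_iqr(temperatures)
print(cleaned)
```

Whether an extreme value is an error or a real event is a business question; the rule only flags candidates.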
• Normalization
• Normalization means adjusting values measured on different scales to
a notionally common scale, so that they can be compared with each other and
affect the model on a similar scale.
Day   Feature1_People   Number_of_Complaints   Feature_2_Temperature
1     2200              3                      14
2     800               0                      14
3     1200              12                     15
4     4100              0                      17
5     5200              0                      14
6     220               18                     12
7     20                33                     13
…
Common Types of Normalization:
- Min-Max Normalization
- Decimal Scaling: Multiplying or dividing by pow(10, k)
- Standard Deviation:
[x - mean(x)] / sd(x)
…
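The three types can be sketched on the Feature1_People column. Min-max normalization is taken here as [x - min(x)] / [max(x) - min(x)], decimal scaling divides by the smallest power of 10 that brings every absolute value below 1, and the standard deviation variant uses the population standard deviation; these specifics and the function names are assumptions of the sketch.

```python
from math import floor, log10
from statistics import mean, pstdev

def min_max(values):
    """Rescale to [0, 1]; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def decimal_scaling(values):
    """Divide by 10**k, with k the smallest integer that makes all |v| < 1."""
    largest = max(abs(v) for v in values)  # assumes largest > 0
    k = floor(log10(largest)) + 1
    return [v / 10 ** k for v in values]

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    m, s = mean(values), pstdev(values)  # assumes s > 0
    return [(v - m) / s for v in values]

people = [2200, 800, 1200, 4100, 5200, 220, 20]  # Feature1_People, days 1 to 7
print(min_max(people))
print(decimal_scaling(people))
print(standardize(people))
```

After any of these steps, Feature1_People and Feature_2_Temperature sit on comparable scales, so neither dominates a model simply because its raw numbers are larger.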
Modeling
•Determine the methodology
•Determine the important factors or variables
•Build a model
•Run the model
"…various modeling techniques are selected and applied, and
their parameters are calibrated to optimal values. Typically,
there are several techniques for the same data mining problem
type. Some techniques have specific requirements on the form
of data. Therefore, stepping back to the data preparation phase
is often needed." - Wikipedia